* `tech.datatype` now supports persistent vectors made via `clojure.core.vector-of`.
  `vector-of` is a nice middle ground between raw persistent vectors and java arrays
  and may be a simple path for many users into typed storage and datasets.
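A minimal sketch of what this enables; the reported datatype and the map-of-columns constructor call are assumptions here:

```clojure
(require '[tech.v2.datatype :as dtype]
         '[tech.ml.dataset :as ds])

;; A primitive-typed persistent vector of doubles.
(def xs (vector-of :double 1.0 2.0 3.0))

;; Assumption: tech.datatype now reports the element type directly.
(dtype/get-datatype xs) ;; => :float64

;; vector-of data can then back a dataset column like any typed container.
(ds/->dataset {:x xs})
```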
Breaking changes
* There is now an efficient conversion to/from smile dataframes.
  - `->dataset` - conversion of a smile dataframe to a dataset.
  - `dataset->smile-dataframe` - conversion of a dataset to a smile dataframe.
    Columns that are reader based will be copied into java arrays. To enable
    predictable behavior a new function was added.
  - `ensure-array-backed` - ensure each column in the dataset has a zero-copy
    conversion to a java array enabled by `tech.v2.datatype/->array`.
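A sketch of the round trip, assuming all three functions live in the main `tech.ml.dataset` namespace:

```clojure
(require '[tech.ml.dataset :as ds])

;; Copy reader-backed columns into java arrays first so the conversion
;; behaves predictably, then hand the dataset to smile.
(def df (-> (ds/->dataset "test/data/stocks.csv")
            (ds/ensure-array-backed)
            (ds/dataset->smile-dataframe)))

;; And back again -- ->dataset understands smile dataframes directly.
(def d (ds/->dataset df))
```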
* `invert-string->number` - The pipeline function `string->number` stores a string
  table in the column metadata. Using this metadata, invert the string->number
  operation, returning the column back to its original state. This metadata is
  `:label-map`, which is a map from column data to number.
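A hypothetical sketch of that round trip; the namespace and argument order used below are assumptions, not the documented API:

```clojure
(require '[tech.ml.dataset.pipeline :as ds-pipe])

;; Given some dataset `d` with a string column :symbol...
;; ...encode the strings as numbers (the string table is stored in the
;; column metadata under :label-map)...
(def encoded (ds-pipe/string->number d :symbol))

;; ...then use that metadata to restore the original string column.
(def decoded (ds-pipe/invert-string->number encoded :symbol))
```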
* `dtype/clone` works correctly for arrays.
* `tech.ml.dataset.column/scan-data-for-missing` fix.
* `dataset->str`.
* `tech.ml.dataset.column/stats` was wrong for columns with missing values.
* Fixed a bug in `tech.v2.datatype` that was causing double-read on boolean readers.
* Increased `max-num-columns` because csv and tsv files with more than 512
  columns were failing to parse. The new default is 8192.
* `n-initial-skip-rows` works with xlsx spreadsheets.
* `assoc` and `dissoc` are implemented in the main dataset namespace.
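For example (a quick sketch):

```clojure
(require '[tech.ml.dataset :as ds])

(def d (ds/->dataset [{:a 1 :b 2} {:a 3 :b 4}]))

;; assoc adds (or replaces) a column; dissoc removes one.
(assoc d :c [10 20]) ;; => dataset with columns :a :b :c
(dissoc d :b)        ;; => dataset with column :a
```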
* `tech.v2.datatype.functional` functions are updated to be more permissive about
  their inputs and cast the result to the appropriate datatype.
* `tech.v2.datatype.functional` will now change the datatype appropriately on a
  lot of unary math operations. So, for instance, calling sin, cos, log, or log1p
  on an integer reader will now return a floating point reader. These methods used
  to throw.
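For instance (a sketch; the printed form of the reader is illustrative):

```clojure
(require '[tech.v2.datatype.functional :as dfn])

;; Integer input, floating-point reader out -- this used to throw.
(dfn/sin (int-array [0 1 2]))
;; => [0.0 0.8414709848078965 0.9092974268256817]
```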
* `tech.ml.dataset/column-cast` - Changes the column datatype via an optionally
  provided cast function. This function is powerful - it will correctly convert
  packed types to their string representation, it will use the parsing system on
  string columns, and it uses the same complex datatype argument as
  `tech.ml.dataset.column/parse-column`:
```clojure
user> (doc ds/column-cast)
-------------------------
tech.ml.dataset/column-cast
([dataset colname datatype])
  Cast a column to a new datatype.  This is never a lazy operation.  If the old
  and new datatypes match and no cast-fn is provided then dtype/clone is called
  on the column.
  colname may be a scalar or a tuple of [src-col dst-col].
  datatype may be a datatype enumeration or a tuple of
  [datatype cast-fn] where cast-fn may return either a new value,
  the :tech.ml.dataset.parse/missing, or :tech.ml.dataset.parse/parse-failure.
  Exceptions are propagated to the caller.  The new column has at least the
  existing missing set if no attempt returns :missing or :cast-failure.
  :cast-failure means the value gets added to metadata key :unparsed-data
  and the index gets added to :unparsed-indexes.
  If the existing datatype is string, then tech.ml.dataset.column/parse-column
  is called.
  Casts between numeric datatypes need no cast-fn but one may be provided.
  Casts to string need no cast-fn but one may be provided.
  Casts from string to anything will call tech.ml.dataset.column/parse-column.
user> (def stocks (ds/->dataset "test/data/stocks.csv" {:key-fn keyword}))
#'user/stocks
user> (ds/head stocks)
test/data/stocks.csv [5 3]:
| :symbol |      :date | :price |
|---------+------------+--------|
|    MSFT | 2000-01-01 |  39.81 |
|    MSFT | 2000-02-01 |  36.35 |
|    MSFT | 2000-03-01 |  43.22 |
|    MSFT | 2000-04-01 |  28.37 |
|    MSFT | 2000-05-01 |  25.45 |
user> (take 5 (stocks :price))
(39.81 36.35 43.22 28.37 25.45)
user> (take 5 ((ds/column-cast stocks :price :string) :price))
("39.81" "36.35" "43.22" "28.37" "25.45")
user> (take 5 ((ds/column-cast stocks :price [:int32 #(Math/round (double %))]) :price))
(40 36 43 28 25)
```
* `tech.ml.dataset/column-map` - map a function over one or more columns; the
  missing set of the result column is the union of the input columns' missing sets:

```clojure
user> (-> (ds/->dataset [{:a 1} {:b 2.0} {:a 2 :b 3.0}])
          (ds/column-map
           :summed
           (fn ^double [^double lhs ^double rhs]
             (+ lhs rhs))
           :a :b))
_unnamed [3 3]:
| :a |    :b | :summed |
|----+-------+---------|
|  1 |       |         |
|    | 2.000 |         |
|  2 | 3.000 |   5.000 |
user> (tech.ml.dataset.column/missing
       (*1 :summed))
#{0,1}
```
* `tech.v2.datatype/typed-reader-map` - the result datatype is derived
  from the input datatypes of the input readers. The result of map-fn is
  unceremoniously coerced to this datatype:

```clojure
user> (-> (ds/->dataset [{:a 1.0} {:a 2.0}])
          (ds/update-column
           :a
           #(dtype/typed-reader-map (fn ^double [^double in]
                                      (if (< in 2.0) (- in) in))
                                    %)))
_unnamed [2 1]:
|     :a |
|--------|
| -1.000 |
|  2.000 |
```
* `unroll-column` takes an optional argument `:indexes?` that will record the
  source index within the entry that the unrolled data came from.
* `.addAll`.
* `tech.datatype` - all readers are marked as sequential.
* `unroll-column` - Given a column that may contain either iterable or scalar data,
  unroll it so it only contains scalar data, duplicating rows.
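A sketch of the behavior; the options-map form of `:indexes?` is an assumption:

```clojure
(require '[tech.ml.dataset :as ds])

;; :b mixes iterable and scalar entries; unrolling duplicates :a as needed.
(def d (ds/->dataset [{:a 1 :b [10 20]} {:a 2 :b 30}]))

(ds/unroll-column d :b)                  ;; => 3 rows of scalar :b data
(ds/unroll-column d :b {:indexes? true}) ;; also records each value's source index
```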
* Fixed an issue with `tech.ml.dataset.column/unique` and especially
  `tech.ml.dataset.pipeline/string->number`.
* The `tech.v2.datatype` namespace has a new function - `make-reader` - that reifies
  a reader of the appropriate type. This allows you to make new columns that have
  nontrivial translations and datatypes much more easily than before.
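A sketch, assuming a `(make-reader datatype n-elems read-expr)` shape in which `idx` is bound to the requested index:

```clojure
(require '[tech.v2.datatype :as dtype])

;; Reify a strongly-typed reader whose elements are computed on read.
(dtype/make-reader :float64 5 (* 2.0 idx))
;; => behaves like [0.0 2.0 4.0 6.0 8.0]
```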
* The `tech.v2.datatype` namespace has a new function - `->typed-reader` - that
  typecasts the incoming object into a reader of the appropriate datatype. This
  means that `.read` calls will be strongly typed, which is useful for building up
  a set of typed variables before using `make-reader`, above.
* `tech.datatype` added a method to transform a reader into a persistent-vector-like
  object that derives from `clojure.lang.APersistentVector` and thus benefits from
  the excellent equality and hash semantics of persistent vectors.
* Added `columnwise-concat`, a far simpler version of tidyr's pivot_longer
  (https://tidyr.tidyverse.org/reference/pivot_longer.html). This is implemented
  efficiently in terms of indexed reader concatenation and as such should work
  on tables of any size.
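A sketch of the pivot-longer-style result; the output column names here are assumptions:

```clojure
(require '[tech.ml.dataset :as ds])

(def d (ds/->dataset [{:a 1 :b 2 :d 1} {:a 4 :b 5 :d 2}]))

;; Concatenate :a and :b columnwise; untouched columns (:d) repeat per source row.
(ds/columnwise-concat d [:a :b])
;; => 4 rows, e.g. :column [:a :a :b :b], :value [1 4 2 5], :d [1 2 1 2]
```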
* For functions that take the dataset in the last position (thread-last pathways
  via `->>`), any options must be passed before the dataset. The same is true for
  the set of functions that are dataset-first. We will be more strict about this
  from now on.
* Added `tech.v2.datatype.bitmap/bitmap-value->bitmap-map`. This is used for
  replace-missing type operations.
* `brief` now does not return missing values. Double or float NaN or INF values
  from a mapseq result in maps with fewer keys.
* `brief` overrides this to provide defaults to get more information.
* `unique-by` returns indexes in order.
* `->>` operators.
* Upgraded `tech.datatype`, with upgraded and fewer dependencies.
* Added `:missing-nil? false` as an option.
* Added a `brief` function to the main namespace so you can get a nice brief
  description of your dataset when working from the REPL. This prints out better
  than `descriptive-stats`.
* `->` versions of sort added so you can sort in `->` pathways.
* `column->dataset` - map a transform function over a column and return a new
  dataset from the result. It is expected that the transform function returns a map.
* `drop-rows`, `select-rows`, `drop-columns` - more granular select calls.
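For example:

```clojure
(require '[tech.ml.dataset :as ds])

(def d (ds/->dataset [{:a 1 :b 2} {:a 3 :b 4} {:a 5 :b 6}]))

(ds/select-rows d [0 2]) ;; keep the first and third rows
(ds/drop-rows d [1])     ;; the same result, expressed as a drop
(ds/drop-columns d [:b]) ;; keep only :a
```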
* `append-columns` - append a list of columns to a dataset. Used with `column->dataset`.
* `column-labeled-mapseq` - Create a sequence of maps with :value and :label members.
  This flattens the dataset by producing Y maps per row instead of 1 map per row,
  where the maps themselves are labeled with the value in their :value member. This
  is useful for building vega charts.
* `->distinct-by-column` - take the first row where a given key is present. The arrow
  form of this indicates the dataset is the first argument.
* `->sort-by`, `->sort-by-column` - Forms of these functions for use in `(->)`
  dataflows.
* `interpolate-loess` - Produce a new column from a given pair of columns using loess
  interpolation to create the column. The interpolator is saved as metadata on the
  new column.
* `tech.ml.dataset.column/parse-column` - given a string column that failed to parse
  for some reason, you can force the system to attempt to parse it using, for
  instance, relaxed parsing semantics where failures simply record the failure in
  metadata.
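A hypothetical invocation; the argument order and the `[datatype parse-fn]` tuple form are assumptions carried over from the `column-cast` docstring above:

```clojure
(require '[tech.ml.dataset.column :as ds-col])

;; Given `str-col`, an existing string column: retry the parse as :float64,
;; recording failures in the column metadata (:unparsed-data and
;; :unparsed-indexes) instead of throwing.
(ds-col/parse-column [:float64 #(try (Double/parseDouble %)
                                     (catch Throwable _
                                       :tech.ml.dataset.parse/parse-failure))]
                     str-col)
```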