Functions are linked to their source, but if no namespace is specified they are also accessible via the tech.ml.dataset namespace.
This is not an exhaustive listing of all functionality, just a quick way to find the functions that we find most useful.
(ds :colname) returns the column named :colname. Datasets implement IPersistentMap, so (map (comp meta second) ds) or (map meta (vals ds)) returns a sequence of column metadata. keys, vals, contains? and map-style destructuring all work on datasets.
Columns work with map, count and nth, and overload IFn so that they are functions of their indexes, similar to persistent vectors.
Random access to a column goes through the (tech.v2.datatype/->reader col) transformation. This is guaranteed to return an implementation of java.util.List and also overloads IFn so that, like a persistent vector, passing in the index returns the value, e.g. (col 0) returns the value at index 0. Direct access to packed datetime columns may be surprising; call tech.v2.datatype.datetime/unpack on the column before calling tech.v2.datatype/->reader to get the unpacked datatype.
mapseq-reader returns the rows as a java.util.List of persistent-map-like maps, implemented as a flyweight implementation of clojure.lang.APersistentMap. This keeps the data in the backing store and reads it lazily, so reading the data is relatively more expensive but your memory working-set size does not increase.
Call vec on these rows to get real persistent vectors if you intend to call equals or hashCode on them.
Datasets and columns implement clojure.lang.IObj, so you can get and set metadata on them freely. :name has meaning in the system, and setting it directly on a column is not recommended. Metadata is generally carried forward through most of the operations below.
mapseq-reader and value-reader are lazy, so (rand-nth (mapseq-reader ds)) is a relatively efficient pathway (and fun). (ds/mapseq-reader (ds/sample ds)) is also pretty good for quick scans.
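As a rough illustration of the access patterns described above, here is a minimal sketch. It assumes the aliases ds for tech.ml.dataset and dtype for tech.v2.datatype, and it builds a small hypothetical dataset (example-ds) from a sequence of maps; the names example-ds, col-a, reader-a and rows are illustrative only.

```clojure
(require '[tech.ml.dataset :as ds]
         '[tech.v2.datatype :as dtype])

;; Hypothetical two-row dataset, built from a sequence of maps.
(def example-ds (ds/->dataset [{:a 1 :b "x"} {:a 2 :b "y"}]))

;; Datasets behave like maps from column name to column.
(def col-a (example-ds :a))
(keys example-ds)               ; column names
(contains? example-ds :b)       ; map-style membership test
(map meta (vals example-ds))    ; one metadata map per column

;; Columns are functions of their indexes, like persistent vectors.
(col-a 0)
(nth col-a 1)
(count col-a)

;; ->reader gives a java.util.List view of the column.
(def reader-a (dtype/->reader col-a))
(.get ^java.util.List reader-a 0)

;; Rows as lazily read, map-like objects.
(def rows (ds/mapseq-reader example-ds))
(first rows)
(rand-nth rows)
```

Nothing here copies the column data; each call reads from the backing store on demand.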
We use these options frequently during exploration to get more or less printing output. They are used like (vary-meta ds assoc :print-column-max-width 10).
Often it is useful to print the entire table: (vary-meta ds assoc :print-index-range :all).
Several of the functions below come in ->column variants, and some additionally come in ->indexes variants. The ->column variants are faster than the base versions, and the ->indexes variants simply return indexes and thus skip creating sub-datasets, so these are faster yet.
Columns can be created with tech.ml.dataset.column/new-column, or they can be maps containing at least the keys {:name :data} but also potentially {:metadata :missing} in order to create a column with a specific set of missing values and metadata. :force-datatype true disables the system's attempt to scan the data for missing values and, e.g., create a float column from a vector of Float objects.
(apply ds/concat-copying x-seq) is far more efficient than (reduce ds/concat-copying x-seq); this is also true for concat-inplace.
keep-fn allows you to choose either first, last, or some other criterion for rows that have the same values.
Functions in tech.v2.datatype.functional apply various elementwise arithmetic operations to a column, lazily returning a new column.
;; Assumes aliases ds -> tech.ml.dataset, dtype -> tech.v2.datatype and
;; dfn -> tech.v2.datatype.functional; filing, investor and form-type are bound elsewhere.
(ds/assoc ds
          :value (dtype/set-datatype (ds :value) :int64)
          :shrs-or-prn-amt (dtype/set-datatype (ds :shrs-or-prn-amt) :int64)
          :cik (dtype/const-reader (:cik filing) (ds/row-count ds))
          :investor (dtype/const-reader investor (ds/row-count ds))
          :form-type (dtype/const-reader form-type (ds/row-count ds))
          :edgar-id (dtype/const-reader (:edgar-id filing) (ds/row-count ds))
          ;; each value divided by the column total
          :weight (dfn// (ds :value)
                         (double (dfn/reduce-+ (ds :value)))))
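The weight calculation above can be tried in isolation. The following is a minimal sketch on a hypothetical three-row dataset (holdings), assuming the aliases ds for tech.ml.dataset and dfn for tech.v2.datatype.functional; only dfn// and dfn/reduce-+, both used in the snippet above, are exercised.

```clojure
(require '[tech.ml.dataset :as ds]
         '[tech.v2.datatype.functional :as dfn])

;; Hypothetical data: three positions with integer values.
(def holdings (ds/->dataset [{:value 10} {:value 30} {:value 60}]))

;; Sum of the :value column, coerced to a double so the division is floating point.
(def total (double (dfn/reduce-+ (holdings :value))))

;; Elementwise division of the column by the scalar total.
;; The result is computed lazily as elements are read.
(def weights (dfn// (holdings :value) total))

;; Realize the lazy result into a persistent vector for inspection.
(vec weights)
```

Because the division is lazy, nothing is computed until the values are read, for example by vec here.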
The dataset system relies on index indirection and laziness quite often. This allows you to aggregate up operations and pay relatively little for them; however, it sometimes increases the cost of accessing the data by an undesirable amount. Because of this we use clone quite often to force calculations to complete before beginning a new stage of data processing. Clone is multithreaded and very efficient, often boiling down to either parallelized iteration over the data or System/arraycopy calls.
Additionally calling 'clone' after loading will reduce the in-memory size of the dataset by a bit - sometimes 20%. This is because lists that have allocated extra capacity are copied into arrays that have no extra capacity.
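A minimal sketch of forcing realization with clone follows. It assumes that clone is reachable as tech.v2.datatype/clone and that it accepts a whole dataset, as the text suggests; the text itself does not name the namespace, so treat that as an assumption. The names raw-ds and realized-ds are illustrative.

```clojure
(require '[tech.ml.dataset :as ds]
         '[tech.v2.datatype :as dtype])

;; Hypothetical small dataset standing in for a freshly loaded one.
(def raw-ds (ds/->dataset [{:a 1 :b 2.0} {:a 2 :b 4.0}]))

;; Assumption: clone is tech.v2.datatype/clone and works on datasets.
;; Cloning copies the data into new storage so that later stages
;; read from it directly instead of through lazy indirection.
(def realized-ds (dtype/clone raw-ds))

(ds/row-count realized-ds)
```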