
Quick Reference - Core API

Functions are linked to their source; where no namespace is specified they are also accessible via the tech.ml.dataset namespace.

This is not an exhaustive listing of all functionality; it is a brief guide to the functions we find most useful.

Loading/Saving

  • ->dataset, ->>dataset - Load csv, tsv, sequence-of-maps, map-of-arrays, xlsx, xls, parquet, and arrow (a short sketch follows this list).
  • write-csv! - Writes csv or tsv, with optional gzipping; output options are determined by scanning the file path string.
  • nippy freeze/thaw support.
  • dataset->data - Useful if you want the entire dataset represented as (mostly) pure Clojure/JVM datastructures. Missing sets are roaring bitmaps, data is generally in primitive arrays, and string tables receive special treatment.
  • data->dataset - Inverse of dataset->data.
  • tech.ml.dataset.parse/csv->rows - Lazily parse a csv or tsv, returning a sequence of string[] rows. Uses a subset of the ->dataset options.
  • tech.ml.dataset.parse/rows->dataset - Given a sequence of string[] rows, parse the data into a dataset. Uses a subset of the ->dataset options.
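
A minimal loading/saving sketch - the file paths are hypothetical, and the gzipped-tsv output is our reading of the path-scanning rule above:

    (require '[tech.ml.dataset :as ds])

    ;; hypothetical paths; the parser is chosen from the file suffix
    (def stocks (ds/->dataset "data/stocks.csv"))

    ;; sequences of maps parse directly into datasets
    (def tiny (ds/->dataset [{:a 1 :b 2.0} {:a 3 :b 4.0}]))

    ;; write-csv! scans the path string; here, gzipped tsv output
    (ds/write-csv! stocks "out/stocks.tsv.gz")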

Accessing Values

  • Datasets overload IFn and so are functions of their column names: (ds :colname) returns the column named :colname. Datasets also implement IPersistentMap, so (map (comp meta second) ds) or (map meta (vals ds)) returns a sequence of column metadata, and keys, vals, contains?, and map-style destructuring all work on datasets. A short sketch follows at the end of this section.
  • Columns are iterable and implement Indexed, so they work with map, count, and nth. They also overload IFn such that, like persistent vectors, they are functions of their indexes.
  • Typed random access is supported via the (tech.v2.datatype/->reader col) transformation. The result is guaranteed to implement java.util.List and also overloads IFn so that, as with a persistent vector, passing in an index returns the value - e.g. (col 0) returns the value at index 0. Direct access to packed datetime columns may be surprising; call tech.v2.datatype.datetime/unpack on the column prior to calling tech.v2.datatype/->reader to get the unpacked datatype.
  • row-count.
  • column-count.
  • mapseq-reader - Get the rows of the dataset as a java.util.List of persistent-map-like maps, implemented as a flyweight extension of clojure.lang.APersistentMap. This keeps the data in the backing store and reads it lazily, so reading values is relatively more expensive but your memory working-set size does not increase.
  • value-reader - Get the rows of the dataset as a java.util.List of rows. These rows behave like persistent vectors but are not safe to use as keys in maps - call vec to get real persistent vectors if you intend to call equals or hashCode on them.
  • tech.ml.dataset.column/missing - Return a RoaringBitmap of the missing indexes.
  • columns-with-missing-seq
  • missing - Return the union of all missing indexes. Useful in combination with drop-rows to quickly eliminate missing values from the dataset.
  • meta, with-meta, vary-meta - Datasets and columns implement clojure.lang.IObj so you can get/set metadata on them freely. :name has meaning in the system and setting it directly on a column is not recommended. Metadata is generally carried forward through most of the operations below.

mapseq-reader and value-reader are lazy and thus (rand-nth (mapseq-reader ds)) is a relatively efficient pathway (and fun). (ds/mapseq-reader (ds/sample ds)) is also pretty good for quick scans.
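
A minimal access sketch; demo is a throwaway dataset built for illustration:

    (require '[tech.ml.dataset :as ds]
             '[tech.v2.datatype :as dtype])

    (def demo (ds/->dataset [{:a 1 :b "x"} {:a 2 :b "y"}]))

    (demo :a)                        ;; the column named :a
    ((demo :a) 0)                    ;; => 1 - columns are functions of their indexes
    (map meta (vals demo))           ;; sequence of column metadata
    (ds/row-count demo)              ;; => 2

    ;; typed random access - behaves like a persistent vector
    (def rdr (dtype/->reader (demo :a)))
    (rdr 1)                          ;; => 2

    (first (ds/mapseq-reader demo))  ;; => {:a 1, :b "x"}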

Print Options

We use these options frequently during exploration to get more/less printing output. These are used like (vary-meta ds assoc :print-column-max-width 10). Often it is useful to print the entire table: (vary-meta ds assoc :print-index-range :all).
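
A minimal sketch, reusing the demo dataset from the access sketch above:

    ;; print every row, truncating wide columns to 10 characters
    (println (vary-meta demo assoc
                        :print-index-range :all
                        :print-column-max-width 10))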

Dataset Exploration
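
A minimal exploration sketch, reusing demo from above; head, tail, and descriptive-stats are assumed here from the tech.ml.dataset namespace, and sample appears earlier in this document:

    (ds/head demo)                ;; the first rows of the dataset
    (ds/tail demo)                ;; the last rows
    (ds/sample demo)              ;; a random sample of rows
    (ds/descriptive-stats demo)   ;; per-column summary statistics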

Subrect Selection
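
A minimal subrect-selection sketch; select-columns and select-rows are assumed here from the tech.ml.dataset namespace, alongside the drop-columns and drop-rows functions mentioned elsewhere in this document:

    (ds/select-columns demo [:a])   ;; keep only column :a
    (ds/select-rows demo [0])       ;; keep only row 0
    (ds/drop-columns demo [:b])     ;; remove column :b
    (ds/drop-rows demo [1])         ;; remove row 1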

Dataset Manipulation

Several of the functions below come in ->column variants, and some additionally come in ->indexes variants. The ->column variants are faster than the base versions; the ->indexes variants simply return indexes and skip creating sub-datasets, so they are faster still. A short sketch follows the list.

  • assoc - add or replace columns.
  • dissoc - remove columns, similar to drop-columns.
  • new-dataset - Create a new dataset from a sequence of columns. Columns may be actual columns created via tech.ml.dataset.column/new-column, or they may be maps containing at least the keys {:name :data} and potentially {:metadata :missing} in order to create a column with a specific set of missing values and metadata. :force-datatype true prevents the system from scanning the data for missing values and will, e.g., create a float column from a vector of Float objects.
  • group-by-column, group-by - Create a persistent map of value->dataset. Sub-datasets are created via indexing into the original dataset so data is not copied.
  • sort-by-column, sort-by - Return a sorted dataset.
  • filter-column, filter - Return a new dataset with only rows that pass the predicate.
  • column-cast - Change the datatype of a column.
  • concat-copying, concat-inplace - Given N datasets, produce a single new dataset. Whether to concatenate in place or by copying depends roughly on the number of datasets: in-place works best with fewer than 5 datasets that all have identical row counts, while copying can increase your RAM usage but returns a dataset that is more efficient to iterate over later. (apply ds/concat-copying x-seq) is far more efficient than (reduce ds/concat-copying x-seq); the same is true for concat-inplace.
  • unique-by-column, unique-by - Remove duplicate rows. Passing in keep-fn allows you to choose first, last, or some other criterion for which of the rows with identical values to keep.
  • columnwise-concat - Concatenate columns into longer dataset repeating values in other columns.
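
A minimal manipulation sketch; the argument order shown is dataset-first, which may vary across library versions:

    (def trades (ds/->dataset [{:sym "A" :price 10.0}
                               {:sym "B" :price 20.0}
                               {:sym "A" :price 30.0}]))

    (assoc trades :volume [100 200 300])          ;; add a column
    (dissoc trades :price)                        ;; remove a column
    (ds/sort-by-column trades :price)             ;; sorted dataset
    (ds/filter-column trades :price #(> % 15.0))  ;; rows passing the predicate
    (keys (ds/group-by-column trades :sym))       ;; => ("A" "B")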

Elementwise Arithmetic

Functions in tech.v2.datatype.functional apply elementwise arithmetic operations to columns, lazily returning a new column. For example:

       ;; assumes (require '[tech.ml.dataset :as ds]
       ;;                  '[tech.v2.datatype :as dtype]
       ;;                  '[tech.v2.datatype.functional :as dfn])
       ;; ds, filing, investor, and form-type are bound elsewhere
       (ds/assoc ds
                 :value (dtype/set-datatype (ds :value) :int64)
                 :shrs-or-prn-amt (dtype/set-datatype (ds :shrs-or-prn-amt) :int64)
                 :cik (dtype/const-reader (:cik filing) (ds/row-count ds))
                 :investor (dtype/const-reader investor (ds/row-count ds))
                 :form-type (dtype/const-reader form-type (ds/row-count ds))
                 :edgar-id (dtype/const-reader (:edgar-id filing) (ds/row-count ds))
                 ;; :weight is each :value divided by the total of :value
                 :weight (dfn// (ds :value)
                                (double (dfn/reduce-+ (ds :value)))))

Forcing Lazy Evaluation

The dataset system relies on index indirection and laziness quite often. This allows you to aggregate operations and pay relatively little for them; however, it sometimes increases the cost of accessing the data by an undesirable amount. Because of this we use clone quite often to force calculations to complete before beginning a new stage of data processing. Clone is multithreaded and very efficient, often boiling down to either parallelized iteration over the data or System/arraycopy calls.

Additionally, calling clone after loading will reduce the in-memory size of the dataset a bit - sometimes by 20%. This is because lists that have allocated extra capacity are copied into arrays with no extra capacity.

  • tech.v2.datatype/clone - Clones the dataset, realizing lazy operations and copying the data into java arrays. Works on datasets or columns; a short sketch follows.
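
A minimal sketch; lazy-ds stands in for any dataset built up from lazy operations:

    (require '[tech.v2.datatype :as dtype])

    ;; realize lazy readers and index indirection into concrete java arrays
    (def realized (dtype/clone lazy-ds))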

Further Examples
