This namespace contains functions that operate on a metamorph context. They all return the context as well, so every function in this namespace is metamorph compliant and can be placed in a metamorph pipeline.
Most functions here only manipulate the dataset, which is in the ctx map under the key `:metamorph/data`, and they behave the same in pipeline modes `:fit` and `:transform`.
A few functions manipulate other keys inside the ctx map and/or behave differently in `:fit` and `:transform`.
This is documented per function in this form:
metamorph | . |
---|---|
Behaviour in mode :fit | . |
Behaviour in mode :transform | . |
Reads keys from ctx | . |
Writes keys to ctx | . |
The namespaces `scicloj.ml.metamorph` and `scicloj.ml.dataset` contain functions with the same names, but they operate on either a context map (ns metamorph) or on a dataset (ns dataset).
The functions in this namespace are re-exported from:
* tablecloth.pipeline
* tech.v3.libs.smile.metamorph
* scicloj.metamorph.ml
* tech.v3.dataset.metamorph
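As a rough illustration of the contract described above, the sketch below models a context as a plain map and the dataset as a vector of row maps. `lift` and `run-pipeline` are hypothetical helpers written only for this example; they are not functions of this namespace, and a real pipeline would use the library's own pipeline runner and real datasets.

```clojure
;; Sketch only: a metamorph-compliant step is a function ctx -> ctx.
;; The "dataset" here is a plain vector of row maps (an assumption for brevity).

(defn lift
  "Hypothetical helper: turns a dataset function into a ctx -> ctx step
  that only touches the value under :metamorph/data."
  [f & args]
  (fn [ctx]
    (apply update ctx :metamorph/data f args)))

(defn run-pipeline
  "Hypothetical stand-in for a pipeline runner: threads ctx through all steps."
  [ctx steps]
  (reduce (fn [c step] (step c)) ctx steps))

(def ctx
  {:metamorph/data [{:a 1 :b 2} {:a 3 :b 4}]
   :metamorph/mode :fit})

(def result
  (run-pipeline ctx
                [(lift (partial mapv #(dissoc % :b)))     ; drop column :b
                 (lift (partial mapv #(update % :a inc)))])) ; increment column :a

;; result
;; => {:metamorph/data [{:a 2} {:a 4}], :metamorph/mode :fit}
```

Because each step only reads and writes `:metamorph/data`, running the same steps with `:metamorph/mode` set to `:transform` gives the same dataset result, which is the "normal" behaviour referred to in the per-function tables.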
(->array colname)
(->array colname datatype)
Convert numerical column(s) to java array
(add-column column-name column)
(add-column column-name column size-strategy)
Add or update (modify) column under `column-name`. `column` can be a sequence of values or a generator function (which gets `ds` as input).
* `ds` - a dataset
* `column-name` - if it's an existing column name, the column will be replaced
* `column` - can be a column (from another dataset), sequence, single value or function (taking a dataset). Columns that are too big are always trimmed. Columns that are too small are cycled or extended with missing values (according to the `size-strategy` argument)
* `size-strategy` (optional) - when the new column is shorter than the dataset row count, the following strategies are applied:
  - `:cycle` - repeat data
  - `:na` - append missing values
  - `:strict` - (default) throws an exception when sizes mismatch
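The size strategies can be pictured with a small plain-Clojure sketch. `fit-to-row-count` is a hypothetical helper that works on ordinary sequences and uses `nil` to stand for a missing value; it only mirrors the rules listed above and is not the library's implementation.

```clojure
(defn fit-to-row-count
  "Hypothetical helper: fits `values` to `row-count` rows.
  Too long: trimmed. Too short: handled according to `size-strategy`."
  [values row-count size-strategy]
  (let [n (count values)]
    (if (>= n row-count)
      ;; too big (or exact): keep only the first row-count values
      (vec (take row-count values))
      (case size-strategy
        ;; repeat the data until the row count is reached
        :cycle  (vec (take row-count (cycle values)))
        ;; pad with missing values (nil in this sketch)
        :na     (vec (concat values (repeat (- row-count n) nil)))
        ;; refuse to guess
        :strict (throw (ex-info "Column size and row count mismatch"
                                {:column-size n :row-count row-count}))))))

(fit-to-row-count [1 2 3 4 5] 3 :strict) ;; => [1 2 3]
(fit-to-row-count [1 2] 5 :cycle)        ;; => [1 2 1 2 1]
(fit-to-row-count [1 2] 5 :na)           ;; => [1 2 nil nil nil]
```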
(add-columns columns-map)
(add-columns columns-map size-strategy)
Add or update (modify) columns defined in `columns-map` (mapping: name -> column).
(add-or-replace-column column-name column)
(add-or-replace-column column-name column size-strategy)
(add-or-replace-columns columns-map)
(add-or-replace-columns columns-map size-strategy)
(add-or-update-column column)
(add-or-update-column colname column)
If column exists, replace. Else append new column.
(aggregate aggregator)
(aggregate aggregator options)
Aggregate dataset by providing:
- aggregation function
- map with column names and functions
- sequence of aggregation functions
Aggregation functions can return:
- single value
- seq of values
- map of values with column names
(aggregate-columns columns-aggregators)
(aggregate-columns columns-selector column-aggregators)
(aggregate-columns columns-selector column-aggregators options)
Aggregates each column separately
(anti-join ds-right columns-selector)
(anti-join ds-right columns-selector options)
(append & args)
Concats columns of several datasets
(array-column->columns src-column)
(array-column->columns src-column opts)
Converts a column of type java array into several columns, one for each element of the array of all rows. The source column is dropped afterwards. The function assumes that arrays in all rows have the same type and length and are numeric.
`ds` Dataset to operate on.
`src-column` The (array) column to convert.
`opts` can contain:
`prefix` newly created columns will get this prefix before the column number
(as-regular-dataset)
Remove grouping tag
(asof-join ds-right columns-selector)
(asof-join ds-right columns-selector options)
(assoc-ds cname cdata & args)
If dataset is not nil, calls `clojure.core/assoc`. Else creates a new empty dataset and then calls `clojure.core/assoc`. Guaranteed to return a dataset (unlike assoc).
(assoc-metadata filter-fn-or-ds k v & args)
Set metadata across a set of columns.
(bow->something-sparse bow-col indices-col bow->sparse-fn options)
Converts a bag-of-words column `bow-col` to a sparse data column `indices-col`.
The exact transformation to the sparse representation is given by `bow->sparse-fn`.
metamorph | . |
---|---|
Behaviour in mode :fit | normal |
Behaviour in mode :transform | normal |
Reads keys from ctx | none |
Writes keys to ctx | :scicloj.ml.smile.metamorph/bow->sparse-vocabulary |
(bow->sparse-array bow-col indices-col)
(bow->sparse-array bow-col indices-col options)
Converts a bag-of-words column `bow-col` to a sparse indices column `indices-col`, as needed by the Maxent model.
`options` can be any of:
* `create-vocab-fn` A function which converts the bow map to a list of tokens. Defaults to `scicloj.ml.smile.nlp/create-vocab-all`
The sparse data is represented as primitive int arrays, whose entries are the indices of the present tokens in the vocabulary.
metamorph | . |
---|---|
Behaviour in mode :fit | normal |
Behaviour in mode :transform | normal |
Reads keys from ctx | none |
Writes keys to ctx | :scicloj.ml.smile.metamorph/bow->sparse-vocabulary |
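To make the "indices against the vocabulary" idea concrete, here is a minimal plain-Clojure sketch. `bow->indices` is a hypothetical helper, the vocabulary is assumed to be an ordered list of tokens, and sorting the indices is a choice made for this example; none of this is confirmed to match the library's internals.

```clojure
(defn bow->indices
  "Hypothetical helper: maps the tokens present in `bow` (a token -> frequency map)
  to their positions in `vocabulary`, returned as a primitive int array.
  Tokens that are not part of the vocabulary are ignored."
  [vocabulary bow]
  (let [token->index (zipmap vocabulary (range))]
    (->> (keys bow)
         (keep token->index) ; look up each token, dropping unknown ones
         sort                ; assumption: indices are kept in ascending order
         int-array)))

(vec (bow->indices ["cat" "dog" "fish"]
                   {"fish" 1 "cat" 2 "bird" 1}))
;; => [0 2]
```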
(bow->SparseArray bow-col indices-col)
(bow->SparseArray bow-col indices-col options)
Converts a bag-of-words column `bow-col` to a sparse indices column `indices-col`, as needed by the discrete naive Bayes model.
`options` can be any of:
* `create-vocab-fn` A function which converts the bow map to a list of tokens. Defaults to `scicloj.ml.smile.nlp/create-vocab-all`
The sparse data is represented as `smile.util.SparseArray`.
metamorph | . |
---|---|
Behaviour in mode :fit | normal |
Behaviour in mode :transform | normal |
Reads keys from ctx | none |
Writes keys to ctx | :scicloj.ml.smile.metamorph/bow->sparse-vocabulary |
(bow->tfidf bow-column tfidf-column options)
Calculates the tfidf score from bag-of-words (as token frequency maps) in column `bow-column` and stores them in a new column `tfidf-column` as maps of token->tfidf-score.
It calculates a global term-frequency map in `:fit` and reuses it in `:transform`.
metamorph | . |
---|---|
Behaviour in mode :fit | normal |
Behaviour in mode :transform | normal |
Reads keys from ctx | none |
Writes keys to ctx | none |
(brief)
(brief options)
Get a brief description, in mapseq form of a dataset. A brief description is the mapseq form of descriptive stats.
(by-rank columns-selector rank-predicate)
(by-rank columns-selector rank-predicate options)
Select rows using `rank` on a column; ties are resolved using the `:dense` method.
See [R docs](https://www.rdocumentation.org/packages/base/versions/3.6.1/topics/rank). Rank uses 0-based indexing.
Possible `:ties` strategies: `:average`, `:first`, `:last`, `:random`, `:min`, `:max`, `:dense`. `:dense` is the same as in `data.table::frank` from R.
`:desc?` set to true (default) orders descending before calculating rank.
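A small plain-Clojure sketch may help to picture dense, 0-based ranking with the descending default. `dense-rank` is a hypothetical helper operating on a plain sequence of values; it illustrates the ranking rule only and is not how the library computes ranks.

```clojure
(defn dense-rank
  "Hypothetical helper: 0-based dense ranks for `values`.
  Equal values share a rank and ranks have no gaps.
  With `desc?` true the largest value gets rank 0."
  [values desc?]
  (let [ordered     (sort (if desc? #(compare %2 %1) compare)
                          (distinct values))
        value->rank (zipmap ordered (range))]
    (mapv value->rank values)))

(dense-rank [10 30 20 30] true)  ;; => [2 0 1 0]
(dense-rank [10 30 20 30] false) ;; => [0 2 1 2]

;; A rank predicate such as zero? would then keep the rows holding the top value:
(keep-indexed (fn [idx r] (when (zero? r) idx))
              (dense-rank [10 30 20 30] true))
;; => (1 3)
```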
(categorical->number filter-fn-or-ds)
(categorical->number filter-fn-or-ds table-args)
(categorical->number filter-fn-or-ds table-args result-datatype)
Convert columns into a discrete, numeric representation. See `tech.v3.dataset.categorical/fit-categorical-map`.
(categorical->one-hot filter-fn-or-ds)
(categorical->one-hot filter-fn-or-ds table-args)
(categorical->one-hot filter-fn-or-ds table-args result-datatype)
Convert string columns to numeric columns. See tech.v3.dataset.categorical/fit-one-hot
(clone)
Clone an object. Can clone anything convertible to a reader.
(cluster clustering-method clustering-method-args target-column)
Metamorph transformer, which clusters the data and creates a new column with the cluster id.
`clustering-method` can be any of:
* :spectral
* :dbscan
* :k-means
* :mec
* :clarans
* :g-means
* :lloyd
* :x-means
* :deterministic-annealing
* :denclue
`clustering-method-args` is a vector with the positional arguments for each cluster function, as documented here: https://cljdoc.org/d/generateme/fastmath/2.1.5/api/fastmath.clustering (but minus the `data` argument, which will be passed in automatically).
The cluster id of each row gets written to the column in `target-column`.
metamorph | . |
---|---|
Behaviour in mode :fit | Calculates cluster centers of the rows of the dataset at key `:metamorph/data` and stores them in ctx under the key in `:metamorph/id`. Also adds the column in `target-column` with cluster centers to the dataset. |
Behaviour in mode :transform | Reads cluster centers from ctx and applies them to the data in `:metamorph/data` |
Reads keys from ctx | In mode `:transform`: reads cluster centers to use from ctx at the key in `:metamorph/id`. |
Writes keys to ctx | In mode `:fit`: stores cluster centers in ctx under the key in `:metamorph/id`. |
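The fit/transform split in the table follows a general pattern: learn something in `:fit`, store it in the ctx under the step's id, and read it back in `:transform`. The sketch below shows that pattern with a deliberately simpler piece of state (a column mean instead of cluster centers). `center-column` is a hypothetical step, the dataset is modelled as a vector of row maps, and the id `:step-1` is set by hand here, whereas a real pipeline is expected to provide `:metamorph/id` for each step.

```clojure
(defn center-column
  "Hypothetical stateful step: subtracts from column `col` the mean learned in :fit."
  [col]
  (fn [{:metamorph/keys [data mode id] :as ctx}]
    (let [mean (case mode
                 ;; :fit learns the state from the data
                 :fit       (/ (reduce + (map col data)) (double (count data)))
                 ;; :transform reuses the state stored under the step id
                 :transform (get ctx id))]
      (cond-> (assoc ctx :metamorph/data (mapv #(update % col - mean) data))
        ;; only :fit writes the learned state into the ctx
        (= mode :fit) (assoc id mean)))))

(def step (center-column :x))

(def fitted
  (step {:metamorph/data [{:x 1.0} {:x 3.0}]
         :metamorph/mode :fit
         :metamorph/id   :step-1}))
;; (:metamorph/data fitted) => [{:x -1.0} {:x 1.0}]
;; (:step-1 fitted)         => 2.0

(def transformed
  (step (assoc fitted
               :metamorph/data [{:x 5.0}]
               :metamorph/mode :transform)))
;; (:metamorph/data transformed) => [{:x 3.0}]
```

The new data in `:transform` is centered with the mean of the training data, not with its own mean, in the same way that new rows are assigned to cluster centers found during `:fit`.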
(column->dataset colname transform-fn)
(column->dataset colname transform-fn options)
Transform a column into a sequence of maps using transform-fn. Return dataset created out of the sequence of maps.
(column-cast colname datatype)
Cast a column to a new datatype. This is never a lazy operation. If the old and new datatypes match and no cast-fn is provided then dtype/clone is called on the column.
colname may be a scalar or a tuple of [src-col dst-col].
datatype may be a datatype enumeration or a tuple of [datatype cast-fn] where cast-fn may return either a new value, :tech.v3.dataset/missing, or :tech.v3.dataset/parse-failure. Exceptions are propagated to the caller. The new column has at least the existing missing set (if no attempt returns :missing or :cast-failure). :cast-failure means the value gets added to metadata key :unparsed-data and the index gets added to :unparsed-indexes.
If the existing datatype is string, then tech.v3.dataset.column/parse-column is called.
Casts between numeric datatypes need no cast-fn but one may be provided. Casts to string need no cast-fn but one may be provided. Casts from string to anything will call tech.v3.dataset.column/parse-column.
(column-labeled-mapseq value-colname-seq)
Given a dataset, return a sequence of maps where several columns are all stored in a :value key and a :label key contains a column name. Used for quickly creating timeseries or scatterplot labeled graphs. Returns a lazy sequence, not a reader!
See also `columnwise-concat`.
Return a sequence of maps with
```clojure
{... - columns not in colname-seq
 :value - value from one of the value columns
 :label - name of the column the value came from
}
```
(column-map result-colname map-fn)
(column-map result-colname map-fn filter-fn-or-ds)
(column-map result-colname map-fn res-dtype-or-opts filter-fn-or-ds)
Produce a new (or updated) column as the result of mapping a fn over columns. This function is never lazy - all results are immediately calculated.
* `dataset` - dataset.
* `result-colname` - Name of new (or existing) column.
* `map-fn` - function to map over columns. Same rules as `tech.v3.datatype/emap`.
* `res-dtype-or-opts` - If not given, the result is scanned to infer missing and datatype. If using an option map, options are described below.
* `filter-fn-or-ds` - A dataset, a sequence of columns, or a `tech.v3.dataset.column-filters` column filter function. Defaults to all the columns of the existing dataset.
Returns a new dataset with a new or updated column.
Options:
* `:datatype` - Set the datatype of the result column. If not given, the result is scanned to infer result datatype and missing set.
* `:missing-fn` - if given, columns are first passed to missing-fn as a sequence and this dictates the missing set. Else the missing set is found by scanning the results during the inference process. See `tech.v3.dataset.column/union-missing-sets` and `tech.v3.dataset.column/intersect-missing-sets` for example functions to pass in here.
Examples:
```clojure
;;From the tests --
(let [testds (ds/->dataset [{:a 1.0 :b 2.0} {:a 3.0 :b 5.0} {:a 4.0 :b nil}])]
  ;;result scanned for both datatype and missing set
  (is (= (vec [3.0 6.0 nil])
         (:b2 (ds/column-map testds :b2 #(when % (inc %)) [:b]))))
  ;;result scanned for missing set only. Result used in-place.
  (is (= (vec [3.0 6.0 nil])
         (:b2 (ds/column-map testds :b2 #(when % (inc %))
                             {:datatype :float64} [:b]))))
  ;;Nothing scanned at all.
  (is (= (vec [3.0 6.0 nil])
         (:b2 (ds/column-map testds :b2 #(inc %)
                             {:datatype :float64
                              :missing-fn ds-col/union-missing-sets} [:b]))))
  ;;Missing set scanning causes NPE at inc.
  (is (thrown? Throwable
               (ds/column-map testds :b2 #(inc %)
                              {:datatype :float64}
                              [:b]))))

;;Ad-hoc repl --
user> (require '[tech.v3.dataset :as ds])
nil
user> (def ds (ds/->dataset "test/data/stocks.csv"))
#'user/ds
user> (ds/head ds)
test/data/stocks.csv [5 3]:
| symbol |       date | price |
|--------|------------|-------|
|   MSFT | 2000-01-01 | 39.81 |
|   MSFT | 2000-02-01 | 36.35 |
|   MSFT | 2000-03-01 | 43.22 |
|   MSFT | 2000-04-01 | 28.37 |
|   MSFT | 2000-05-01 | 25.45 |
user> (-> (ds/column-map ds "price^2" #(* % %) ["price"])
          (ds/head))
test/data/stocks.csv [5 4]:
| symbol |       date | price |   price^2 |
|--------|------------|-------|-----------|
|   MSFT | 2000-01-01 | 39.81 | 1584.8361 |
|   MSFT | 2000-02-01 | 36.35 | 1321.3225 |
|   MSFT | 2000-03-01 | 43.22 | 1867.9684 |
|   MSFT | 2000-04-01 | 28.37 |  804.8569 |
|   MSFT | 2000-05-01 | 25.45 |  647.7025 |
user> (def ds1 (ds/->dataset [{:a 1} {:b 2.0} {:a 2 :b 3.0}]))
#'user/ds1
user> ds1
_unnamed [3 2]:
|  :b | :a |
|----:|---:|
|     |  1 |
| 2.0 |    |
| 3.0 |  2 |
user> (ds/column-map ds1 :c (fn [a b]
                              (when (and a b)
                                (+ (double a) (double b))))
                     [:a :b])
_unnamed [3 3]:
|  :b | :a |  :c |
|----:|---:|----:|
|     |  1 |     |
| 2.0 |    |     |
| 3.0 |  2 | 5.0 |
user> (ds/missing (*1 :c))
{0,1}
```
(column-names)
(column-names columns-selector)
(column-names columns-selector meta-field)
Returns column names, given a selector. `columns-selector` can be one of the following:
* `:all` keyword - selects all columns
* column name - for single column
* sequence of column names - for collection of columns
* regex - to apply pattern on column names or datatype
* filter predicate - to filter column names or datatype
* `type` namespaced keyword for specific datatype or group of datatypes
Column name can be anything.
The `column-names` function returns names according to `columns-selector` and optional `meta-field`. `meta-field` is one of the following:
* `:name` (default) - to operate on column names
* `:datatype` - to operate on column types
* `:all` - if you want to process all metadata
Datatype groups are:
* `:type/numerical` - any numerical type
* `:type/float` - floating point number (`:float32` and `:float64`)
* `:type/integer` - any integer
* `:type/datetime` - any datetime type
If a qualified keyword starts with `:!type`, the complement set is used.
(column-values->categorical src-column)
Given a column encoded via either string->number or one-hot, reverse map to a sequence of the original string column values. In the case of one-hot mappings, src-column must be the original column name before the one-hot map.
(columns)
(columns result-type)
Returns columns of dataset. Result type can be any of:
* `:as-map`
* `:as-double-arrays`
* `:as-seqs`
(columns->array-column column-selector new-column)
Converts several columns to a single column of type array. The src columns are dropped afterwards.
`ds` Dataset to operate on.
`column-selector` anything supported by `select-columns`
`new-column` new column to create
(columns-with-missing-seq)
Return a sequence of:
```clojure
{:column-name column-name
 :missing-count missing-count
}
```
or nil if no columns are missing data.
(columnwise-concat colnames)
(columnwise-concat colnames options)
Given a dataset and a list of columns, produce a new dataset with the columns concatenated to a new column with a :column column indicating which column the original value came from. Any columns not mentioned in the list of columns are duplicated.
Example:
```clojure
user> (-> [{:a 1 :b 2 :c 3 :d 1} {:a 4 :b 5 :c 6 :d 2}]
          (ds/->dataset)
          (ds/columnwise-concat [:c :a :b]))
null [6 3]:
| :column | :value | :d |
|---------+--------+----|
|      :c |      3 |  1 |
|      :c |      6 |  2 |
|      :a |      1 |  1 |
|      :a |      4 |  2 |
|      :b |      2 |  1 |
|      :b |      5 |  2 |
```
Options:
* `value-column-name` - defaults to `:value`
* `colname-column-name` - defaults to `:column`
(complete columns-selector & args)
TidyR complete.
Fills a dataset with all possible combinations of selected columns. When a given combination doesn't exist, missing values are created.
(concat & args)
Joins rows from other datasets
(concat-copying & args)
Joins rows from other datasets via a copy of data
(concat-inplace)
(concat-inplace & args)
Concatenate datasets in place. Respects missing values. Datasets must all have the same columns. Result column datatypes will be a widening cast of the datatypes.
(convert-types coltype-map-or-columns-selector)
(convert-types columns-selector new-types)
Convert type of the column to the other type.
(count-vectorize text-col bow-col)
(count-vectorize text-col bow-col options)
Transforms the text column `text-col` into a map of token frequencies in column `bow-col`.
`options` can be any of:
* `text->bow-fn` A function which takes as input a text as string and options. The default is `nlp/default-text->bow`
metamorph | . |
---|---|
Behaviour in mode :fit | normal |
Behaviour in mode :transform | normal |
Reads keys from ctx | none |
Writes keys to ctx | none |
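For orientation, a custom `text->bow-fn` could look roughly like the sketch below. `simple-text->bow` is a hypothetical function; its tokenisation (lower-casing and splitting on anything that is not an ASCII letter) is an assumption made for this example and is likely much simpler than what `nlp/default-text->bow` does.

```clojure
(require '[clojure.string :as str])

(defn simple-text->bow
  "Hypothetical text->bow function: takes a text and an options map
  (ignored here) and returns a map of token -> frequency."
  [text _options]
  (->> (str/split (str/lower-case text) #"[^a-z]+")
       (remove str/blank?) ; a leading separator would produce an empty token
       frequencies))

(simple-text->bow "The cat saw the other cat." {})
;; => {"the" 2, "cat" 2, "saw" 1, "other" 1}
```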
(cross-join ds-right)
(cross-join ds-right columns-selector)
(cross-join ds-right columns-selector options)
Cross product from selected columns
(crosstab row-selector col-selector)
(crosstab row-selector col-selector options)
Cross tabulation of two sets of columns.
Creates grouped dataset by [row-selector, col-selector] pairs and calls aggregation on each group.
Options:
* pivot? - create pivot table or just flat structure (default: true)
* replace-missing? - replace missing values? (default: true)
* missing-value - a missing value (default: 0)
* aggregator - aggregating function (default: row-count)
* marginal-rows, marginal-cols - adds row and/or cols, it's a sum if true. Can be a custom fn.
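The idea behind the defaults (count rows per pair, pivot, fill gaps with 0) can be sketched in plain Clojure on a vector of row maps. `crosstab-counts` and `pivot` are hypothetical helpers for this illustration and do not reflect how the library builds its result dataset.

```clojure
(defn crosstab-counts
  "Hypothetical helper: flat structure, counts rows per [row-value col-value] pair."
  [rows row-key col-key]
  (frequencies (map (juxt row-key col-key) rows)))

(defn pivot
  "Hypothetical helper: turns flat pair counts into {row-value {col-value count}},
  filling combinations that never occur with `missing-value`."
  [pair-counts missing-value]
  (let [row-vals (distinct (map first (keys pair-counts)))
        col-vals (distinct (map second (keys pair-counts)))]
    (into {}
          (for [r row-vals]
            [r (into {}
                     (for [c col-vals]
                       [c (get pair-counts [r c] missing-value)]))]))))

(def rows
  [{:sex :f :smoker true}
   {:sex :f :smoker false}
   {:sex :m :smoker false}
   {:sex :f :smoker true}])

(crosstab-counts rows :sex :smoker)
;; => {[:f true] 2, [:f false] 1, [:m false] 1}

(pivot (crosstab-counts rows :sex :smoker) 0)
;; => {:f {true 2, false 1}, :m {true 0, false 1}}
```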
(data->dataset)
Convert a data-ized dataset created via dataset->data back into a full dataset
(dataset->categorical-xforms)
Given a dataset, return a map of column-name->xform information.
(dataset->data)
Convert a dataset to a pure clojure datastructure. Returns a map with two keys: {:metadata :columns}. :columns is a vector of column definitions appropriate for passing directly back into new-dataset. A column definition in this case is a map of {:name :missing :data :metadata}.
(dataset->str)
(dataset->str options)
Convert a dataset to a string. Prints a single line header and then calls dataset-data->str.
For options documentation see dataset-data->str.
(descriptive-stats)
(descriptive-stats options)
Get descriptive statistics across the columns of the dataset. In addition to the standard stats.
Options:
* :stat-names - defaults to (remove #{:values :num-distinct-values} (all-descriptive-stats-names))
* :n-categorical-values - Number of categorical values to report in the 'values' field. Defaults to 21.
(drop columns-selector rows-selector)
Drop columns and rows.
(drop-columns)
(drop-columns columns-selector)
(drop-columns columns-selector meta-field)
Drop columns by (returns dataset):
- name
- sequence of names
- map of names with new names (rename)
- function which filter names (via column metadata)
(drop-missing)
(drop-missing columns-selector)
Drop rows with missing values.
`columns-selector` selects columns to look at missing values.
(drop-rows)
(drop-rows rows-selector)
(drop-rows rows-selector options)
Drop rows using:
- row id
- seq of row ids
- seq of true/false
- fn with predicate
(ensure-array-backed)
(ensure-array-backed options)
Ensure the column data in the dataset is stored in pure java arrays. This is sometimes necessary for interop with other libraries and this operation will force any lazy computations to complete. This also clears the missing set for each column and writes the missing values to the new arrays.
Columns that are already array backed and that have no missing values are not changed and are returned as is.
The postcondition is that dtype/->array will return a java array in the appropriate datatype for each column.
Options:
* `:unpack?` - unpack packed datetime types. Defaults to true
(expand columns-selector & args)
TidyR expand.
Creates all possible combinations of selected columns.
(feature-ecount)
Number of feature columns. Feature columns are columns that are not inference targets.
(fill-range-replace colname max-span)
(fill-range-replace colname max-span missing-strategy)
(fill-range-replace colname max-span missing-strategy missing-value)
Fill missing up with lacking values. Accepts:
* dataset
* column name
* expected step (max-span, milliseconds in case of datetime column)
* (optional) missing-strategy - how to replace missing, default :down (set to nil if none)
* (optional) missing-value - optional value for replace missing
(filter predicate)
dataset->dataset transformation. Predicate is passed a map of colname->column-value.
(filter-column colname)
(filter-column colname predicate)
Filter a given column by a predicate. Predicate is passed column values. If predicate is *not* an instance of IFn it is treated as a value and will be used as if the predicate is #(= value %).
The 2-arity form of this function reads the column as a boolean reader so for instance numeric 0 values are false in that case as are Double/NaN, Float/NaN. Objects are only false if nil?.
Returns a dataset.
(filter-dataset filter-fn-or-ds)
Filter the columns of the dataset returning a new dataset. This pathway is designed to work with the tech.v3.dataset.column-filters namespace.
* If filter-fn-or-ds is a dataset, it is returned.
* If filter-fn-or-ds is sequential, then select-columns is called.
* If filter-fn-or-ds is :all, all columns are returned
* If filter-fn-or-ds is an instance of IFn, the dataset is passed into it.
(fold-by columns-selector)
(fold-by columns-selector folding-function)
Group-by and pack columns into vector - the output data set has a row for each unique combination of the provided columns while each remaining column has its value(s) collected into a vector, similar to how clojure.core/group-by works. See https://scicloj.github.io/tablecloth/index.html#Fold-by
(full-join ds-right columns-selector)
(full-join ds-right columns-selector options)
Join keeping all rows
(get-entry column row)
Returns a single value from given column and row
(group-by grouping-selector)
(group-by grouping-selector options)
Group dataset by:
- column name
- list of columns
- map of keys and row indexes
- function getting map of values

Options are:
- select-keys - when grouping is done by function, you can limit fields to a `select-keys` seq.
- result-type - return results as dataset (`:as-dataset`, default) or as map of datasets (`:as-map`) or as map of row indexes (`:as-indexes`) or as sequence of (sub)datasets
- other parameters which are passed to `dataset` fn

When dataset is returned, meta contains `:grouped?` set to true. Columns in dataset:
- name - group name
- group-id - id of the group (int)
- data - group as dataset
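The shape of a grouped result, with the three columns named above, can be sketched on plain row maps. The helper `group-rows` is hypothetical, and modelling the grouped dataset as a vector of maps is an assumption made only for illustration.

```clojure
;; Illustrative data: one map per row.
(def rows [{:sym "A" :price 1}
           {:sym "B" :price 2}
           {:sym "A" :price 3}])

(defn group-rows
  "Groups `rows` by `key-fn` and returns one map per group with the
  keys :name (group name), :group-id (int) and :data (rows of the group)."
  [rows key-fn]
  (->> (group-by key-fn rows)
       (map-indexed (fn [id [group-name members]]
                      {:name group-name :group-id id :data members}))
       vec))

(->> (group-rows rows :sym)
     (filter #(= "A" (:name %)))
     first
     :data)
;; => [{:sym "A", :price 1} {:sym "A", :price 3}]
```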
(group-by->indexes key-fn)
(Non-lazy) - Group a dataset and return a map of key-fn-value->indexes where indexes is an in-order contiguous group of indexes.
(group-by-column colname)
Return a map of column-value->dataset.
(group-by-column->indexes colname)
(Non-lazy) - Group a dataset by a column and return a map of column-val->indexes where indexes is an in-order contiguous group of indexes.
(grouped?)
Does `dataset` represent a grouped dataset (result of `group-by`)?
(groups->map)
Convert grouped dataset to the map of groups
(groups->seq)
Convert grouped dataset to seq of the groups
(induction induct-fn & args)
Given a dataset and a function from dataset->row produce a new dataset. The produced row will be merged with the current row and then added to the dataset.
Options are same as the options used for [[->dataset]] in order for the user to control the parsing of the return values of `induct-fn`.
A new dataset is returned.
Example:
```clojure
user> (def ds (ds/->dataset {:a [0 1 2 3] :b [1 2 3 4]}))
#'user/ds
user> ds
_unnamed [4 2]:

| :a | :b |
|---:|---:|
|  0 |  1 |
|  1 |  2 |
|  2 |  3 |
|  3 |  4 |
user> (ds/induction ds (fn [ds]
                         {:sum-of-previous-row (dfn/sum (ds/rowvec-at ds -1))
                          :sum-a (dfn/sum (ds :a))
                          :sum-b (dfn/sum (ds :b))}))
_unnamed [4 5]:

| :a | :b | :sum-b | :sum-a | :sum-of-previous-row |
|---:|---:|-------:|-------:|---------------------:|
|  0 |  1 |    0.0 |    0.0 |                  0.0 |
|  1 |  2 |    1.0 |    0.0 |                  1.0 |
|  2 |  3 |    3.0 |    1.0 |                  5.0 |
|  3 |  4 |    6.0 |    3.0 |                 14.0 |
```
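The induction idea, where each new row is computed from the rows produced so far, can be sketched on a plain vector of row maps. `induct-rows` is a hypothetical name, and the sketch makes no attempt at the column parsing the real function performs.

```clojure
(defn induct-rows
  "Walks `rows` in order. `induct-fn` receives the rows produced so far
  and returns a map that is merged into the current row."
  [rows induct-fn]
  (reduce (fn [acc row]
            (conj acc (merge row (induct-fn acc))))
          []
          rows))

(induct-rows [{:a 0} {:a 1} {:a 2} {:a 3}]
             (fn [so-far] {:sum-a (reduce + (map :a so-far))}))
;; => [{:a 0, :sum-a 0} {:a 1, :sum-a 0} {:a 2, :sum-a 1} {:a 3, :sum-a 3}]
```

The `:sum-a` values follow the same progression (0, 0, 1, 3) as the `:sum-a` column in the example output, because the function only ever sees the rows that precede the current one.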
(inference-target-column-names)
Return the names of the columns that are inference targets.
(inference-target-ds)
Given a dataset return reverse-mapped inference target columns or nil in the case where there are no inference targets.
(inference-target-label-inverse-map & args)
Given options generated during ETL operations and annotated with a :label-columns sequence containing 1 label column, generate a reverse map that maps from a dataset value back to the label that generated that value.
(info)
(info result-type)
Returns statistical information about the columns of a dataset.
`result-type` can be :descriptive or :columns
(inner-join ds-right columns-selector)
(inner-join ds-right columns-selector options)
(join-columns target-column columns-selector)
(join-columns target-column columns-selector conf)
Join columns of dataset. Accepts:
* dataset
* column selector (as in select-columns)
* options
  * `:separator` (default -)
  * `:drop-columns?` - whether to drop source columns or not (default true)
  * `:result-type`
    * `:map` - packs data into map
    * `:seq` - packs data into sequence
    * `:string` - join strings with separator (default)
    * or custom function which gets row as a vector
  * `:missing-subst` - substitution for missing value
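A minimal sketch of the `:string` result type on plain row maps is shown below. `join-cols` is a hypothetical helper; it only mimics the `-` separator and the default of dropping the source columns, and ignores the other result types and `:missing-subst`.

```clojure
(require '[clojure.string :as str])

(defn join-cols
  "Joins the values of `cols` into a string stored under `target`.
  Source columns are removed unless :drop-columns? is false."
  [rows target cols {:keys [separator drop-columns?]
                     :or   {separator "-" drop-columns? true}}]
  (mapv (fn [row]
          (let [joined (str/join separator (map row cols))
                base   (if drop-columns? (apply dissoc row cols) row)]
            (assoc base target joined)))
        rows))

(join-cols [{:y 2020 :m 5 :v 1.0}] :ym [:y :m] {})
;; => [{:v 1.0, :ym "2020-5"}]
```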
(labels)
Return the labels. The labels sequence is the reverse mapped inference column. This returns a single column of data or errors out.
(left-join ds-right columns-selector)
(left-join ds-right columns-selector options)
(map-columns column-name map-fn)
(map-columns column-name columns-selector map-fn)
(map-columns column-name new-type columns-selector map-fn)
Map over rows using a map function. The arity should match the columns selected.
(mapseq-reader)
(mapseq-reader options)
Return a reader that produces a map of column-name->column-value upon read.
(min-max-scale columns-selector options)
Metamorph transformer, which scales the column data into a given range.

* `columns-selector` - tablecloth columns-selector to choose columns to work on
* `meta-field` - tablecloth meta-field working with `columns-selector`
* `options` - Options for scaler, can take:
  * `min` - Minimal value to scale to (default -0.5)
  * `max` - Maximum value to scale to (default 0.5)

metamorph | . |
---|---|
Behaviour in mode :fit | Scales the dataset at key `:metamorph/data` and stores the trained model in ctx under the key given in `:metamorph/id` |
Behaviour in mode :transform | Reads trained min-max-scale model from ctx and applies it to data in `:metamorph/data` |
Reads keys from ctx | In mode `:transform`: Reads the trained model from the key given in `:metamorph/id`. |
Writes keys to ctx | In mode `:fit`: Stores trained model in key $id |
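The fit/transform behaviour in the table can be sketched as a plain function over a context map. This is not the library's implementation: `min-max-step` is a hypothetical name, the data is a plain vector of doubles rather than a dataset, and the sketch assumes the fitted range is non-empty (largest value greater than smallest).

```clojure
(defn min-max-step
  "Returns a ctx -> ctx function. In :fit it learns the data range and
  stores it under the step id; in :transform it reuses the stored range."
  [{lower :min upper :max :or {lower -0.5 upper 0.5}}]
  (fn [{:metamorph/keys [mode id data] :as ctx}]
    (let [model (case mode
                  :fit       {:lo (apply min data) :hi (apply max data)}
                  :transform (get ctx id))
          {:keys [lo hi]} model
          scale (fn [x]
                  (+ lower (* (- upper lower) (/ (- x lo) (- hi lo)))))]
      (cond-> (assoc ctx :metamorph/data (mapv scale data))
        (= mode :fit) (assoc id model)))))

(def step (min-max-step {}))

(def fitted
  (step {:metamorph/mode :fit
         :metamorph/id   :scale-1
         :metamorph/data [0.0 5.0 10.0]}))

(:metamorph/data fitted)
;; => [-0.5 0.0 0.5]

(def transformed
  (step (assoc fitted
               :metamorph/mode :transform
               :metamorph/data [2.5 20.0])))

(:metamorph/data transformed)
;; => [-0.25 1.5]
```

Note that in `:transform` the value 20.0 falls outside the fitted range and is therefore mapped outside the target range, since the stored minimum and maximum are reused rather than recomputed.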
(min-n-by-column cname N)
(min-n-by-column cname N comparator)
(min-n-by-column cname N comparator options)
Find the minimum N entries (unsorted) by column. Resulting data will be indexed in original order. If you want a sorted order then sort the result.
See options to [[sort-by-column]].
Example:
```clojure
user> (ds/min-n-by-column ds "price" 10 nil nil)
test/data/stocks.csv [10 3]:

| symbol |       date | price |
|--------|------------|------:|
|   AMZN | 2001-09-01 |  5.97 |
|   AMZN | 2001-10-01 |  6.98 |
|   AAPL | 2000-12-01 |  7.44 |
|   AAPL | 2002-08-01 |  7.38 |
|   AAPL | 2002-09-01 |  7.25 |
|   AAPL | 2002-12-01 |  7.16 |
|   AAPL | 2003-01-01 |  7.18 |
|   AAPL | 2003-02-01 |  7.51 |
|   AAPL | 2003-03-01 |  7.07 |
|   AAPL | 2003-04-01 |  7.11 |
user> (ds/min-n-by-column ds "price" 10 > nil)
test/data/stocks.csv [10 3]:

| symbol |       date |  price |
|--------|------------|-------:|
|   GOOG | 2007-09-01 | 567.27 |
|   GOOG | 2007-10-01 | 707.00 |
|   GOOG | 2007-11-01 | 693.00 |
|   GOOG | 2007-12-01 | 691.48 |
|   GOOG | 2008-01-01 | 564.30 |
|   GOOG | 2008-04-01 | 574.29 |
|   GOOG | 2008-05-01 | 585.80 |
|   GOOG | 2009-11-01 | 583.00 |
|   GOOG | 2009-12-01 | 619.98 |
|   GOOG | 2010-03-01 | 560.19 |
```
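The "minimum N, but in original order" behaviour can be sketched on a plain vector of row maps. `min-n-rows` is a hypothetical name; the sketch sorts fully for simplicity, which is an assumption about clarity rather than about how the library selects the entries.

```clojure
(defn min-n-rows
  "Returns the n rows with the smallest `k` values (per `cmp`),
  kept in their original order rather than sorted by `k`."
  ([rows k n] (min-n-rows rows k n compare))
  ([rows k n cmp]
   (->> (map-indexed vector rows)      ; remember original positions
        (sort-by (comp k second) cmp)  ; order by the column value
        (take n)
        (sort-by first)                ; restore original order
        (mapv second))))

(def prices [{:p 7.4} {:p 5.9} {:p 9.1} {:p 6.9}])

(min-n-rows prices :p 2)
;; => [{:p 5.9} {:p 6.9}]

(min-n-rows prices :p 2 >)
;; => [{:p 7.4} {:p 9.1}]
```

Passing `>` as the comparator yields the largest entries instead, mirroring the second call in the example above.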
(missing)
Given a dataset or a column, return the missing set as a roaring bitmap
(model options)
Executes a machine learning model in train/predict (depending on :mode) from the `metamorph.ml` model registry.
The model is passed between both invocations via the shared context ctx in a key (a step identifier) which is passed in key `:metamorph/id` and guaranteed to be unique for each pipeline step.
The function writes and reads into this common context key.
Options:
- `:model-type` - Keyword for the model to use

Further options get passed to `train` functions and are model specific.
See here for an overview of the models built into scicloj.ml:
https://scicloj.github.io/scicloj.ml-tutorials/userguide-models.html
Other libraries might contribute other models, which are documented as part of the library.

metamorph | . |
---|---|
Behaviour in mode :fit | Calls `scicloj.metamorph.ml/train` using data in `:metamorph/data` and `options` and stores trained model in ctx under the key given in `:metamorph/id` |
Behaviour in mode :transform | Reads trained model from ctx and calls `scicloj.metamorph.ml/predict` with the model in $id and data in `:metamorph/data` |
Reads keys from ctx | In mode `:transform`: Reads the trained model to use for prediction from the key given in `:metamorph/id`. |
Writes keys to ctx | In mode `:fit`: Stores trained model in key $id and writes feature-ds and target-ds before prediction into ctx at `:scicloj.metamorph.ml/feature-ds` / `:scicloj.metamorph.ml/target-ds` |

See as well:
* `scicloj.metamorph.ml/train`
* `scicloj.metamorph.ml/predict`
(model-type & args)
Check the label column after dataset processing. Return either :regression :classification
(new-column)
(new-column data)
(new-column data metadata)
(new-column data metadata missing)
Create a new column. Data will be scanned for missing values unless the full 4-argument pathway is used.
(new-dataset)
(new-dataset column-seq)
(new-dataset ds-metadata column-seq)
Create a new dataset from a sequence of columns. Data will be converted into columns using ds-col-proto/ensure-column-seq. If the column seq is simply a collection of vectors, for instance, columns will be named ordinally.
options map:
* :dataset-name - Name of the dataset. Defaults to "_unnamed".
* :key-fn - Key function used on all column names before insertion into dataset.

The return value fulfills the dataset protocols.
(num-inference-classes)
Given a dataset and correctly built options from pipeline operations, return the number of classes used for the label. Error if not classification dataset.
(order-by columns-or-fn)
(order-by columns-or-fn comparators)
(order-by columns-or-fn comparators options)
Order dataset by:
- column name
- columns (as sequence of names)
- key-fn
- sequence of columns / key-fn

Additionally you can ask the order by:
- :asc
- :desc
- custom comparator function
(order-column-names colname-seq)
Order a sequence of column names so they match the order in the original dataset. Missing columns are placed last.
(pivot->longer)
(pivot->longer columns-selector)
(pivot->longer columns-selector options)
`tidyr` pivot_longer api
(pivot->wider columns-selector value-columns)
(pivot->wider columns-selector value-columns options)
Converts columns to rows. Arguments:
* dataset
* columns selector
* options:
  * `:target-columns` - names of the columns created or columns pattern (see below) (default: :$column)
  * `:value-column-name` - name of the column for values (default: :$value)
  * `:splitter` - string, regular expression or function which splits source column names into data
  * `:drop-missing?` - remove rows with missing? (default: true)
  * `:datatypes` - map of target columns data types
  * `:coerce-to-number` - try to convert extracted values to numbers if possible (default: true)
* target-columns - can be:
  * column name - source column names are put there as data
  * column names as sequence - source column names after split are put separately into :target-columns as data
  * pattern - is a sequence of names, where some of the names are nil. nil is replaced by a name taken from splitter and such column is used for values.
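The "columns to rows" reshaping described above can be sketched with plain row maps. `columns->rows` is a hypothetical helper; it uses the default names `:$column` and `:$value`, treats `nil` as a missing value and drops it as the `:drop-missing?` default suggests, and ignores `:splitter`, `:datatypes` and patterns entirely.

```clojure
(defn columns->rows
  "For every row and every selected column emits one output row that
  keeps the remaining columns and adds :$column and :$value."
  [rows cols]
  (vec (for [row rows
             c   cols
             :let  [v (get row c)]
             :when (some? v)]          ; drop missing values
         (-> (apply dissoc row cols)
             (assoc :$column c :$value v)))))

(columns->rows [{:id 1 :x 10 :y nil}
                {:id 2 :x 30 :y 40}]
               [:x :y])
;; => [{:id 1, :$column :x, :$value 10}
;;     {:id 2, :$column :x, :$value 30}
;;     {:id 2, :$column :y, :$value 40}]
```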
(pmap-ds ds-map-fn)
(pmap-ds ds-map-fn options)
Parallelize mapping a function from dataset->dataset across a single dataset. Results are coalesced back into a single dataset. The original dataset is simply sliced into n-core results and map-fn is called n-core times. ds-map-fn must be a function from dataset->dataset although it may return nil.
Options:
* `:max-batch-size` - this is a default for tech.v3.parallel.for/indexed-map-reduce. You can control how many rows are processed in a given batch - the default is 64000. If your mapping pathway produces a large expansion in the size of the dataset then it may be good to reduce the max batch size and use :as-seq to produce a sequence of datasets.
* `:result-type`
  - `:as-seq` - Return a sequence of datasets, one for each batch.
  - `:as-ds` - Return a single dataset with all results in memory (default option).
(print-all)
Helper function equivalent to (tech.v3.dataset.print/print-range ... :all)
(print-dataset)
(print-dataset options)
Prints dataset into console. For options see tech.v3.dataset.print/dataset-data->str
(probability-distributions->label-column dst-colname)
Given a dataset that has columns in which the column names describe labels and the rows describe a probability distribution, create a label column by taking the max value in each row and assigning the name of that column as the row's value.
(process-group-data f)
(process-group-data f parallel?)
Internal: The passed-in function is applied on all groups
(rand-nth)
(rand-nth options)
Returns single random row
(random)
(random n)
(random n options)
Returns (n) random rows with repetition
(reduce-dimensions algorithm target-dims cnames opts)
Metamorph transformer, which reduces the dimensions of a given dataset.

`algorithm` can be any of:
* :pca-cov
* :pca-cor
* :pca-prob
* :kpca
* :gha
* :random

`target-dims` is the number of dimensions to reduce to.
`cnames` is a sequence of column names on which the reduction gets performed.
`opts` are the options of the algorithm.

metamorph | . |
---|---|
Behaviour in mode :fit | Reduces dimensions of the dataset at key `:metamorph/data` and stores the trained model in ctx under the key given in `:metamorph/id` |
Behaviour in mode :transform | Reads trained reduction model from ctx and applies it to data in `:metamorph/data` |
Reads keys from ctx | In mode `:transform`: Reads the trained model to use from ctx at the key given in `:metamorph/id`. |
Writes keys to ctx | In mode `:fit`: Stores trained model in ctx under the key given in `:metamorph/id`. |
(remove-column col-name)
Same as:
```clojure
(dissoc dataset col-name)
```
(remove-columns colname-seq-or-fn)
Remove columns indexed by column name seq or column filter function. For example:
```clojure
(remove-columns DS [:A :B])
(remove-columns DS cf/categorical)
```
(rename-columns columns-mapping)
(rename-columns columns-selector columns-map-fn)
Rename columns with provided old -> new name map
(reorder-columns columns-selector & args)
Reorder columns using column selector(s). When column names are incomplete, the missing will be attached at the end.
(replace-missing)
(replace-missing strategy)
(replace-missing columns-selector strategy)
(replace-missing columns-selector strategy value)
Replaces missing values. Accepts:
* dataset
* column selector, default: :all
* strategy, default: :nearest
* value (optional)
  * single value
  * sequence of values (cycled)
  * function, applied on column(s) with stripped missings

Strategies are:
* `:value` - replace with given value
* `:up` - copy values up
* `:down` - copy values down
* `:updown` - copy values up and then down for missing values at the end
* `:downup` - copy values down and then up for missing values at the beginning
* `:mid` or `:nearest` - copy values around known values
* `:midpoint` - use average value from previous and next non-missing
* `:lerp` - tries to linearly approximate values, works for numbers and datetime, otherwise applies :nearest. For numbers always results in float datatype.
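The "copy values down" and "copy values up" strategies can be sketched on a plain vector in which `nil` marks a missing value. The helper names `fill-down` and `fill-up` are hypothetical, and the edge-case behaviour (leading or trailing gaps stay missing) is an assumption inferred from the descriptions of `:downup` and `:updown`, not a statement about the library.

```clojure
(defn fill-down
  "Copies the last seen value into the following missing slots."
  [xs]
  (vec (rest (reductions (fn [prev x] (if (nil? x) prev x)) nil xs))))

(defn fill-up
  "Copies the next known value into the preceding missing slots."
  [xs]
  (vec (reverse (fill-down (reverse xs)))))

(def xs [nil 1 nil nil 4 nil])

(fill-down xs)           ;; => [nil 1 1 1 4 4]
(fill-up xs)             ;; => [1 1 4 4 4 nil]
(fill-up (fill-down xs)) ;; => [1 1 1 1 4 4]
```

The last call composes both directions, which is the idea behind `:downup`: values are copied down first and the remaining gap at the beginning is then filled from below.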
(replace-missing-value scalar-value)
(replace-missing-value filter-fn-or-ds scalar-value)
(reverse-rows)
Reverse the rows in the dataset or column.
(right-join ds-right columns-selector)
(right-join ds-right columns-selector options)
(row-at idx)
Get the row at an individual index. If indexes are negative then the dataset is indexed from the end.
```clojure
user> (ds/row-at stocks 1)
{"date" #object[java.time.LocalDate 0x534cb03b "2000-02-01"],
 "symbol" "MSFT",
 "price" 36.35}
user> (ds/row-at stocks -1)
{"date" #object[java.time.LocalDate 0x6bf60ed5 "2010-03-01"],
 "symbol" "AAPL",
 "price" 223.02}
```
(row-map map-fn)
(row-map map-fn options)
Map a function across the rows of the dataset producing a new dataset that is merged back into the original potentially replacing existing columns. Options are passed into the [[->dataset]] function so you can control the resulting column types by the usual dataset parsing options described there.
Options: See options for [[pmap-ds]]. In particular, note that you can produce a sequence of datasets as opposed to a single large dataset.
Examples:
user> (def stocks (ds/->dataset "test/data/stocks.csv"))
#'user/stocks
user> (ds/head stocks)
test/data/stocks.csv [5 3]:
| symbol | date | price |
|--------|------------|------:|
| MSFT | 2000-01-01 | 39.81 |
| MSFT | 2000-02-01 | 36.35 |
| MSFT | 2000-03-01 | 43.22 |
| MSFT | 2000-04-01 | 28.37 |
| MSFT | 2000-05-01 | 25.45 |
user> (ds/head (ds/row-map stocks (fn [row]
{"symbol" (keyword (row "symbol"))
:price2 (* (row "price")(row "price"))})))
test/data/stocks.csv [5 4]:
| symbol | date | price | :price2 |
|--------|------------|------:|----------:|
| :MSFT | 2000-01-01 | 39.81 | 1584.8361 |
| :MSFT | 2000-02-01 | 36.35 | 1321.3225 |
| :MSFT | 2000-03-01 | 43.22 | 1867.9684 |
| :MSFT | 2000-04-01 | 28.37 | 804.8569 |
| :MSFT | 2000-05-01 | 25.45 | 647.7025 |
(row-mapcat mapcat-fn)
(row-mapcat mapcat-fn options)
Map a function across the rows of the dataset. The function must produce a sequence of maps and the original dataset rows will be duplicated and then merged into the result of calling (->> (apply concat) (->>dataset options) on the result of `mapcat-fn`. Options are the same as [[->dataset]].
The smaller the maps returned from mapcat-fn the better, perhaps consider using records. In the case that a mapcat-fn result map has a key that overlaps a column name the column will be replaced with the output of mapcat-fn. The returned map will have the key `:_row-id` assoc'd onto it so for absolutely minimal gc usage include this as a member variable in your map.
Options:
* See options for [[pmap-ds]]. Especially note `:max-batch-size` and `:result-type`. In order to conserve memory it may be much more efficient to return a sequence of datasets rather than one large dataset. If returning sequences of datasets perhaps consider a transducing pathway across them or the [[tech.v3.dataset.reductions]] namespace.
Example:
user> (def ds (ds/->dataset {:rid (range 10)
:data (repeatedly 10 #(rand-int 3))}))
#'user/ds
user> (ds/head ds)
_unnamed [5 2]:
| :rid | :data |
|-----:|------:|
| 0 | 0 |
| 1 | 2 |
| 2 | 0 |
| 3 | 1 |
| 4 | 2 |
user> (def mapcat-fn (fn [row]
(for [idx (range (row :data))]
{:idx idx})))
#'user/mapcat-fn
user> (mapcat mapcat-fn (ds/rows ds))
({:idx 0} {:idx 1} {:idx 0} {:idx 0} {:idx 1} {:idx 0} {:idx 1} {:idx 0} {:idx 1})
user> (ds/row-mapcat ds mapcat-fn)
_unnamed [9 3]:
| :rid | :data | :idx |
|-----:|------:|-----:|
| 1 | 2 | 0 |
| 1 | 2 | 1 |
| 3 | 1 | 0 |
| 4 | 2 | 0 |
| 4 | 2 | 1 |
| 6 | 2 | 0 |
| 6 | 2 | 1 |
| 8 | 2 | 0 |
| 8 | 2 | 1 |
user>
(rows)
(rows result-type)
Returns rows of dataset. Result type can be any of:
:as-maps
:as-double-arrays
:as-seqs
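As a minimal sketch of the three result types, the example below calls the dataset-level counterpart `tc/rows` from `tablecloth.api` on a small in-memory dataset. The dataset contents and the var names (`row-maps`, `row-seqs`, `row-arrays`) are made up for illustration; the metamorph variant is assumed to take the same `result-type` argument and to act on the dataset stored in the context.

```clojure
(require '[tablecloth.api :as tc])

;; Small in-memory dataset; column names and values are illustrative only.
(def ds (tc/dataset {:a [1 2] :b [3.0 4.0]}))

;; One map per row, keyed by column name (copied into plain Clojure maps).
(def row-maps (mapv #(into {} %) (tc/rows ds :as-maps)))

;; One sequence per row, with values in column order.
(def row-seqs (mapv vec (tc/rows ds :as-seqs)))

;; One double array per row; only sensible when every column is numeric.
(def row-arrays (tc/rows ds :as-double-arrays))
```

Copying each row into a plain map or vector, as done here, is only for easy inspection at the REPL; for large datasets it is usually better to consume the returned rows directly.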
(rowvec-at idx)
Return a persistent-vector-like row at a given index. Negative indexes index from the end.
user> (ds/rowvec-at stocks 1)
["MSFT" #object[java.time.LocalDate 0x5848b8b3 "2000-02-01"] 36.35]
user> (ds/rowvec-at stocks -1)
["AAPL" #object[java.time.LocalDate 0x4b70b0d5 "2010-03-01"] 223.02]
(rowvecs)
(rowvecs options)
Return a randomly addressable list of rows in persistent vector-like form.
Options:
* `:copying?` - When true the data is copied out of the dataset row by row upon read of that row. When false the data is only referenced upon each read of a particular key. Copying is appropriate if you want to use the row values as keys in a map and it is inappropriate if you are only going to read a given key for a given row once.
user> (take 5 (ds/rowvecs stocks))
(["MSFT" #object[java.time.LocalDate 0x5be9e4c8 "2000-01-01"] 39.81]
["MSFT" #object[java.time.LocalDate 0xf758e5 "2000-02-01"] 36.35]
["MSFT" #object[java.time.LocalDate 0x752cc84d "2000-03-01"] 43.22]
["MSFT" #object[java.time.LocalDate 0x7bad4827 "2000-04-01"] 28.37]
["MSFT" #object[java.time.LocalDate 0x3a62c34a "2000-05-01"] 25.45])
(sample)
(sample n)
(sample n options)
Sample n-rows from a dataset. Defaults to sampling without replacement.
For the definition of seed, see the argshuffle documentation (https://cnuernber.github.io/dtype-next/tech.v3.datatype.argops.html#var-argshuffle).
The returned dataset's metadata is altered by merging `{:print-index-range (range n)}` in, so you will always see the entire returned dataset. If this isn't desired, `vary-meta` is a good pathway.
Options:
* `:replacement?` - Do sampling with replacement. Defaults to false.
* `:seed` - Provide a seed as a number or provide a Random implementation.
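A short sketch of the two options, using the dataset-level `ds/sample` from `tech.v3.dataset`. The dataset, the seed value and the var names are arbitrary choices for illustration; the metamorph variant is assumed to accept the same `n` and `options` while reading the dataset from the context.

```clojure
(require '[tech.v3.dataset :as ds])

;; 100 rows holding the numbers 0..99 in a single column.
(def numbers (ds/->dataset {:n (range 100)}))

;; Ten rows without replacement: no source row can appear twice.
(def picked (ds/sample numbers 10 {:seed 42}))

;; Ten rows with replacement: the same source row may appear more than once.
(def with-repeats (ds/sample numbers 10 {:replacement? true :seed 42}))
```

Passing a `:seed` is what makes the sample reproducible between runs; without it each call draws a different set of rows.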
(select columns-selector rows-selector)
Select columns and rows.
(select-by-index col-index row-index)
Trim dataset according to this sequence of indexes. Returns a new dataset.
col-index and row-index - one of:
- :all - all the columns
- list of indexes. May contain duplicates. Negative values will be counted from the end of the sequence.
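The index rules above can be sketched with the dataset-level `ds/select-by-index` from `tech.v3.dataset`. The 3x3 dataset and the var names `corner` and `doubled` are invented for this example.

```clojure
(require '[tech.v3.dataset :as ds])

;; Three columns (:a :b :c) and three rows; values are illustrative only.
(def small (ds/->dataset [{:a 1 :b 2 :c 3}
                          {:a 4 :b 5 :c 6}
                          {:a 7 :b 8 :c 9}]))

;; First and last column, last row only. Negative values count from the end.
(def corner (ds/select-by-index small [0 -1] [-1]))

;; All columns, with row 0 selected twice (duplicates are allowed).
(def doubled (ds/select-by-index small :all [0 0]))
```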
(select-columns)
(select-columns columns-selector)
(select-columns columns-selector meta-field)
Select columns by (returns dataset):
- name
- sequence of names
- map of names with new names (rename)
- function which filters names (via column metadata)
(select-columns-by-index col-index)
Select columns from the dataset by a seq of indexes (negative values allowed) or :all.
See documentation for `select-by-index`.
(select-missing)
(select-missing columns-selector)
Select rows with missing values.
`columns-selector` selects the columns to check for missing values.
(select-rows)
(select-rows rows-selector)
(select-rows rows-selector options)
Select rows using:
- row id
- seq of row ids
- seq of true/false
- fn with predicate
(select-rows-by-index row-index)
Select rows from the dataset or column by a seq of indexes (negative values allowed) or :all.
See documentation for `select-by-index`.
(semi-join ds-right columns-selector)
(semi-join ds-right columns-selector options)
(separate-column column)
(separate-column column separator)
(separate-column column target-columns separator)
(separate-column column target-columns separator conf)
(set-inference-target target-name-or-target-name-seq)
Set the inference target on the column. This sets the :column-type member of the column metadata to :inference-target?.
(shape)
Returns shape of the dataset [rows, cols]
(shuffle)
(shuffle options)
Shuffle dataset (with seed)
(sort-by key-fn)
(sort-by key-fn compare-fn & args)
Sort a dataset by a key-fn and compare-fn.
* `key-fn` - function from map to sort value.
* `compare-fn` may be one of:
  - a clojure operator like clojure.core/<
  - `:tech.numerics/<`, `:tech.numerics/>` for unboxing comparisons of primitive values.
  - clojure.core/compare
  - A custom java.util.Comparator instantiation.
Options:
* `:nan-strategy` - General missing strategy. Options are `:first`, `:last`, and `:exception`.
* `:parallel?` - Uses parallel quicksort when true and regular quicksort when false.
(sort-by-column colname)
(sort-by-column colname compare-fn & args)
Sort a dataset by a given column using the given compare fn.
* `compare-fn` may be one of:
  - a clojure operator like clojure.core/<
  - `:tech.numerics/<`, `:tech.numerics/>` for unboxing comparisons of primitive values.
  - clojure.core/compare
  - A custom java.util.Comparator instantiation.
Options:
* `:nan-strategy` - General missing strategy. Options are `:first`, `:last`, and `:exception`.
* `:parallel?` - Uses parallel quicksort when true and regular quicksort when false.
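A minimal sketch of the default ordering and of a Clojure operator used as `compare-fn`, written against the dataset-level `ds/sort-by-column` from `tech.v3.dataset`. The three-row dataset and the var names are invented for illustration.

```clojure
(require '[tech.v3.dataset :as ds])

;; Illustrative data only.
(def prices (ds/->dataset [{:symbol "A" :price 3.0}
                           {:symbol "B" :price 1.0}
                           {:symbol "C" :price 2.0}]))

;; Without a compare-fn the column is sorted in ascending order.
(def ascending (ds/sort-by-column prices :price))

;; A Clojure operator such as > as compare-fn yields descending order.
(def descending (ds/sort-by-column prices :price >))

(vec (ascending :symbol))  ; symbols ordered by rising price
(vec (descending :symbol)) ; symbols ordered by falling price
```

For large primitive columns the keyword forms `:tech.numerics/<` and `:tech.numerics/>` listed above are likely the better choice, since they are described as avoiding boxing.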
(std-scale columns-selector options)
(std-scale columns-selector meta-field options)
Metamorph transformer which centers and scales the dataset per column.
* `columns-selector` - tablecloth columns-selector to choose columns to work on
* `meta-field` - tablecloth meta-field working with `columns-selector`
* `options` - the options for the scaler, which can take:
  - `mean?` - If true (default), the data gets shifted by the column means, so it is centered at 0
  - `stddev?` - If true (default), the data gets scaled by the standard deviation of the column

metamorph | . |
---|---|
Behaviour in mode :fit | Centers and scales the dataset at key `:metamorph/data` and stores the trained model in ctx under the key given by `:metamorph/id` |
Behaviour in mode :transform | Reads the trained std-scale model from ctx and applies it to the data in `:metamorph/data` |
Reads keys from ctx | In mode `:transform`: reads the trained model from the key given by `:metamorph/id` |
Writes keys to ctx | In mode `:fit`: stores the trained model in key $id |
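To make the fit/transform contract in the table concrete, here is a plain-Clojure stand-in that needs no library. It is not the implementation of `std-scale`: the function `toy-std-scale` is hypothetical, the data is a plain vector of numbers instead of a dataset, the sample standard deviation is an assumption, and `:metamorph/id` is set by hand here (in a real pipeline it is assumed to be supplied per step).

```clojure
(defn mean [xs]
  (/ (reduce + xs) (double (count xs))))

(defn stddev
  "Sample standard deviation (an assumption of this sketch)."
  [xs]
  (let [m (mean xs)]
    (Math/sqrt (/ (reduce + (map #(let [d (- % m)] (* d d)) xs))
                  (dec (count xs))))))

(defn toy-std-scale
  "Hypothetical step: returns a ctx -> ctx function."
  []
  (fn [{:metamorph/keys [data mode id] :as ctx}]
    (let [;; :fit learns the model, :transform reads it back from the ctx.
          model  (case mode
                   :fit       {:mean (mean data) :stddev (stddev data)}
                   :transform (get ctx id))
          scaled (mapv #(/ (- % (:mean model)) (:stddev model)) data)]
      (cond-> (assoc ctx :metamorph/data scaled)
        ;; Only :fit writes the model, under the key named by :metamorph/id.
        (= mode :fit) (assoc id model)))))

(def step (toy-std-scale))

(def fitted
  (step {:metamorph/id :scaler
         :metamorph/mode :fit
         :metamorph/data [1.0 2.0 3.0]}))

;; Reuse the fitted ctx, switching mode and data as a pipeline run would.
(def transformed
  (step (assoc fitted
               :metamorph/mode :transform
               :metamorph/data [2.0 4.0])))
```

The point of the sketch is that the new data in `:transform` is scaled with the mean and standard deviation learned during `:fit`, not with its own statistics.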
(transform-one-hot column-selector strategy)
(transform-one-hot column-selector strategy options)
Transformer which maps categorical variables to numbers. Each value of the column gets its own column in one-hot-encoding.
To handle different levels of a variable between train and test data, three strategies are available:
* `:full` - The levels are retrieved from a dataset at key :metamorph.ml/full-ds in the context
* `:independent` - One-hot columns are fitted and transformed independently for train and test data
* `:fit` - The mapping fitted in mode :fit is used in :transform, and it is assumed that all levels are present in the data during :fit
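The `:fit` strategy can be illustrated with a plain-Clojure toy that needs no library. This is not the library's code: `toy-one-hot` is a hypothetical step, the data is a plain vector of category values, and the all-zero encoding of an unseen level is a choice of this sketch rather than documented behaviour of `transform-one-hot`.

```clojure
(defn toy-one-hot
  "Hypothetical step: returns a ctx -> ctx function using the :fit idea."
  []
  (fn [{:metamorph/keys [data mode id] :as ctx}]
    (let [;; Levels are learned in :fit and reused unchanged in :transform.
          levels  (if (= mode :fit)
                    (vec (sort (distinct data)))
                    (get ctx id))
          ;; One 0/1 slot per known level for every value.
          encoded (mapv (fn [v] (mapv #(if (= v %) 1 0) levels)) data)]
      (cond-> (assoc ctx :metamorph/data encoded)
        (= mode :fit) (assoc id levels)))))

(def one-hot-step (toy-one-hot))

(def fitted-ctx
  (one-hot-step {:metamorph/id :one-hot
                 :metamorph/mode :fit
                 :metamorph/data ["a" "b" "a"]}))

;; "c" was never seen during :fit, so it has no slot of its own.
(def transformed-ctx
  (one-hot-step (assoc fitted-ctx
                       :metamorph/mode :transform
                       :metamorph/data ["b" "c"])))
```

The unseen level "c" shows why the strategy assumes that all levels are present during `:fit`: a level that first appears in `:transform` cannot be represented by the fitted mapping.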
(ungroup)
(ungroup options)
Concat groups into dataset.
When `add-group-as-column` or `add-group-id-as-column` is set to `true` or name(s), columns with the group name(s) or group id are added to the result.
Before joining, the groups can be sorted by group name.
(unique-by)
(unique-by columns-selector)
(unique-by columns-selector options)
Remove rows which contain the same data.
* `column-selector` - Select columns for uniqueness
* `strategy` - There are 4 strategies defined to handle duplicates:
  - `:first` - select first row (default)
  - `:last` - select last row
  - `:random` - select random row
  - any function - apply the function to the columns which are subject to uniqueness
(unique-by-column colname)
(unique-by-column options colname)
Map-fn function gets passed map for each row, rows are grouped by the return value. Keep-fn is used to decide the index to keep.
:keep-fn - Function from key, idx-seq->idx. Defaults to #(first %2).
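A small sketch of the default behaviour and of a custom `:keep-fn`, using the dataset-level `ds/unique-by-column` from `tech.v3.dataset` with the argument order shown in the signatures above (options before the column name). The data and var names are invented, and the sketch assumes the index sequence passed to `:keep-fn` is in ascending row order.

```clojure
(require '[tech.v3.dataset :as ds])

;; "ann" appears twice; values are illustrative only.
(def events (ds/->dataset [{:user "ann" :visit 1}
                           {:user "bob" :visit 2}
                           {:user "ann" :visit 3}]))

;; Default :keep-fn keeps the first index of every group.
(def first-seen (ds/unique-by-column events :user))

;; A custom :keep-fn receives the group key and its row indexes.
(def last-seen
  (ds/unique-by-column events
                       {:keep-fn (fn [_key idx-seq] (last idx-seq))}
                       :user))
```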
(unordered-select colname-seq index-seq)
Perform a selection but use the order of the columns in the existing table; do not reorder the columns based on colname-seq. Useful when doing selection based on sets or persistent hash maps.
(unroll columns-selector)
(unroll columns-selector options)
Unfolds sequences stored inside a column(s), turning them into multiple rows. Opposite of [[fold-by]].
Add each of the provided columns to the set that defines the "unique key" of each row. Thus there will be a new row for each value inside the target column(s)' value sequence.
If you want instead to split the content of the columns into a set of new columns, look at [[separate-column]].
See https://scicloj.github.io/tablecloth/index.html#Unroll
(unroll-column column-name)
(unroll-column column-name options)
Unroll a column that has some (or all) sequential data as entries. Returns a new dataset with same columns but with other columns duplicated where the unroll happened. Column now contains only scalar data.
Any missing indexes are dropped.
user> (-> (ds/->dataset [{:a 1 :b [2 3]}
{:a 2 :b [4 5]}
{:a 3 :b :a}])
(ds/unroll-column :b {:indexes? true}))
_unnamed [5 3]:
| :a | :b | :indexes |
|----+----+----------|
| 1 | 2 | 0 |
| 1 | 3 | 1 |
| 2 | 4 | 0 |
| 2 | 5 | 1 |
| 3 | :a | 0 |
Options:
* `:datatype` - datatype of the resulting column if one aside from :object is desired.
* `:indexes?` - If true, create a new column that records the indexes of the values from the original column. Can also be a truthy value (like a keyword) and the column will be named this.
(update filter-fn-or-ds update-fn & args)
Update this dataset. Filters this dataset into a new dataset, applies update-fn, then merges the result into the original dataset.
This pathway is designed to work with the tech.v3.dataset.column-filters namespace.
* `filter-fn-or-ds` is a generalized parameter. May be a function, a dataset or a sequence of column names.
* `update-fn` must take the dataset as the first argument and must return a dataset.
(ds/bind-> (ds/->dataset dataset) ds
  (ds/remove-column "Id")
  (ds/update cf/string ds/replace-missing-value "NA")
  (ds/update-elemwise cf/string #(get {"" "NA"} % %))
  (ds/update cf/numeric ds/replace-missing-value 0)
  (ds/update cf/boolean ds/replace-missing-value false)
  (ds/update-columnwise (cf/union (cf/numeric ds) (cf/boolean ds))
                        #(dtype/elemwise-cast % :float64)))
(update-column col-name update-fn)
Update a column returning a new dataset. update-fn is a column->column transformation. Error if column does not exist.
(update-columns columns-map)
(update-columns columns-selector update-functions)
(update-columnwise filter-fn-or-ds cwise-update-fn & args)
Call update-fn on each column of the dataset. Returns the dataset. See arguments to update
(update-elemwise map-fn)
(update-elemwise filter-fn-or-ds map-fn)
Replace all elements in selected columns by calling selected function on each element. column-name-seq must be a sequence of column names if provided. filter-fn-or-ds has same rules as update. Implicitly clears the missing set so function must deal with type-specific missing values correctly. Returns new dataset
(value-reader)
(value-reader options)
Return a reader that produces a reader of column values per index. Options: :copying? - Default to false - When true row values are copied on read.
(write! output-path)
(write! output-path options)
Write a dataset out to a file. Supported forms are:
(ds/write! test-ds "test.csv")
(ds/write! test-ds "test.tsv")
(ds/write! test-ds "test.tsv.gz")
(ds/write! test-ds "test.nippy")
(ds/write! test-ds out-stream)
Options:
* `:max-chars-per-column` - csv,tsv specific, defaults to 65536 - values longer than this will cause an exception during serialization.
* `:max-num-columns` - csv,tsv specific, defaults to 8192 - If the dataset has more than this number of columns an exception will be thrown during serialization.
* `:quoted-columns` - csv specific - sequence of columns names that you would like to always have quoted.
* `:file-type` - Manually specify the file type. This is usually inferred from the filename but if you pass in an output stream then you will need to specify the file type.
* `:headers?` - if csv headers are written, defaults to true.
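The following sketch writes a small dataset to a temporary CSV file and reads it back, then writes to an output stream where `:file-type` has to be given explicitly. It uses the dataset-level `ds/write!` and `ds/->dataset` from `tech.v3.dataset`; the temp-file prefix, the dataset and the var names are arbitrary choices for illustration.

```clojure
(require '[tech.v3.dataset :as ds])
(import '[java.io File ByteArrayOutputStream])

;; Illustrative data only.
(def test-ds (ds/->dataset [{:a 1 :b "x"} {:a 2 :b "y"}]))

;; The ".csv" suffix lets the file type be inferred from the filename.
(def path (.getAbsolutePath (File/createTempFile "write-demo" ".csv")))
(ds/write! test-ds path)

;; Read the file back to check what was written.
(def round-trip (ds/->dataset path))

;; An output stream has no filename, so :file-type must be specified.
(def out (ByteArrayOutputStream.))
(ds/write! test-ds out {:file-type :csv})
```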