Clojure only.

scicloj.ml.metamorph

This namespace contains functions that operate on a metamorph context. Each of them also returns the context.

All functions in this namespace are therefore metamorph compliant and can be placed in a metamorph pipeline.

Most functions here only manipulate the dataset, which is stored in the ctx map under the key :metamorph/data, and behave the same in the pipeline modes :fit and :transform.

A few functions manipulate other keys inside the ctx map and/or behave differently in :fit and :transform.

This is documented per function in this form:

metamorph                            | .
-------------------------------------|------------------------------
Behaviour in mode :fit               | .
Behaviour in mode :transform         | .
Reads keys from ctx                  | .
Writes keys to ctx                   | .

The namespaces scicloj.ml.metamorph and scicloj.ml.dataset contain functions with the same names. They differ in what they operate on: a context map (ns metamorph) or a dataset (ns dataset).

The functions in this namespace are re-exported from:

  • tablecloth.pipeline
  • tech.v3.libs.smile.metamorph
  • scicloj.metamorph.ml
  • tech.v3.dataset.metamorph
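
The relation between a dataset function and its context-based counterpart can be sketched in plain Clojure. This is only an illustration under stated assumptions: `lift` and `select-a` are hypothetical names, and a vector of row maps stands in for a real dataset. It is not how the library itself wraps its functions.

```clojure
;; Hypothetical helper, not part of scicloj.ml: turns a function on a
;; dataset into a function on the metamorph context.
(defn lift
  [ds-fn & args]
  (fn [ctx]
    ;; Only the value under :metamorph/data changes; all other keys pass through.
    (apply update ctx :metamorph/data ds-fn args)))

;; A vector of row maps stands in for a real dataset here.
(def ctx {:metamorph/data [{:a 1 :b 2} {:a 3 :b 4}]
          :metamorph/mode :fit})

;; A "dataset" function lifted to a context function.
(def select-a
  (lift (fn [rows] (mapv #(select-keys % [:a]) rows))))

(select-a ctx)
;; => {:metamorph/data [{:a 1} {:a 3}], :metamorph/mode :fit}
```

Because the lifted function takes a context and returns a context, it can be composed with other such functions, which is what makes a step usable in a pipeline.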

->array

(->array colname)
(->array colname datatype)

Convert numerical column(s) to a Java array.

add-column

(add-column column-name column)
(add-column column-name column size-strategy)

Add or update (modify) the column under column-name.

column can be a sequence of values or a generator function (which gets ds as input).

add-columns

(add-columns columns-map)
(add-columns columns-map size-strategy)

Add or update (modify) columns defined in columns-map (mapping: name -> column).

add-or-replace-column

(add-or-replace-column column-name column)
(add-or-replace-column column-name column size-strategy)

add-or-replace-columns

(add-or-replace-columns columns-map)
(add-or-replace-columns columns-map size-strategy)

add-or-update-column

(add-or-update-column column)
(add-or-update-column colname column)

If column exists, replace. Else append new column.

aggregate

(aggregate aggregator)
(aggregate aggregator options)

Aggregate dataset by providing:

  • aggregation function
  • map with column names and functions
  • sequence of aggregation functions

Aggregation functions can return:

  • single value
  • seq of values
  • map of values with column names

aggregate-columns

(aggregate-columns columns-selector column-aggregators)
(aggregate-columns columns-selector column-aggregators options)

Aggregates each column separately

anti-join

(anti-join ds-right columns-selector)
(anti-join ds-right columns-selector options)

append

(append & datasets)

append-columns

(append-columns column-seq)

as-regular-dataset

(as-regular-dataset)

Remove grouping tag

asof-join

(asof-join ds-right colname)
(asof-join ds-right colname options)

assoc-ds

(assoc-ds cname cdata & args)

If dataset is not nil, calls clojure.core/assoc. Else creates a new empty dataset and then calls clojure.core/assoc. Guaranteed to return a dataset (unlike assoc).

assoc-metadata

(assoc-metadata filter-fn-or-ds k v & args)

Set metadata across a set of columns.

bind

(bind & datasets)

bow->something-sparse

(bow->something-sparse bow-col indices-col bow->sparse-fn options)

Converts a bag-of-words column bow-col to a sparse data column indices-col. The exact transformation to the sparse representation is given by bow->sparse-fn.

metamorph                            | .
-------------------------------------|---------
Behaviour in mode :fit               | normal
Behaviour in mode :transform         | normal
Reads keys from ctx                  | none
Writes keys to ctx                   | :scicloj.ml.smile.metamorph/bow->sparse-vocabulary

bow->sparse-array

(bow->sparse-array bow-col indices-col)
(bow->sparse-array bow-col indices-col options)

Converts a bag-of-words column bow-col to a sparse indices column indices-col, as needed by the Maxent model.

options can contain:

create-vocab-fn - A function which converts the bow map to a list of tokens. Defaults to scicloj.ml.smile.nlp/create-vocab-all.

The sparse data is represented as primitive int arrays, whose entries are the indices of the present tokens in the vocabulary.

metamorph                            | .
-------------------------------------|---------
Behaviour in mode :fit               | normal
Behaviour in mode :transform         | normal
Reads keys from ctx                  | none
Writes keys to ctx                   | :scicloj.ml.smile.metamorph/bow->sparse-vocabulary
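
The idea of "indices against a vocabulary" can be sketched in plain Clojure as follows. The helper names `vocab->index` and `bow->indices` are hypothetical, and the alphabetical index order and the sorting of indices are assumptions made for the example; the library may build and order its vocabulary differently.

```clojure
;; Plain-Clojure illustration; both helpers are hypothetical names,
;; not functions of scicloj.ml.smile.
(defn vocab->index
  "Maps every token seen in any bag-of-words map to an integer index.
  Alphabetical order is an assumption made for this sketch."
  [bows]
  (let [tokens (sort (distinct (mapcat keys bows)))]
    (zipmap tokens (range))))

(defn bow->indices
  "Returns a primitive int array holding the vocabulary indices of the tokens
  present in one bag-of-words map. Tokens outside the vocabulary are skipped."
  [vocab bow]
  (int-array (sort (keep vocab (keys bow)))))

(def bows [{"fish" 2 "cat" 1} {"dog" 1 "cat" 3}])
(def vocab (vocab->index bows))
;; vocab => {"cat" 0, "dog" 1, "fish" 2}

(vec (bow->indices vocab (first bows)))
;; => [0 2]
```

Note that the token counts are dropped in this representation: only the presence of a token is encoded.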

bow->SparseArray

(bow->SparseArray bow-col indices-col)
(bow->SparseArray bow-col indices-col options)

Converts a bag-of-words column bow-col to a sparse indices column indices-col, as needed by the discrete naive Bayes model.

options can contain:

create-vocab-fn - A function which converts the bow map to a list of tokens. Defaults to scicloj.ml.smile.nlp/create-vocab-all.

The sparse data is represented as smile.util.SparseArray.

metamorph                            | .
-------------------------------------|---------
Behaviour in mode :fit               | normal
Behaviour in mode :transform         | normal
Reads keys from ctx                  | none
Writes keys to ctx                   | :scicloj.ml.smile.metamorph/bow->sparse-vocabulary

bow->tfidf

(bow->tfidf bow-column tfidf-column)

Calculates the tf-idf score from bags of words (as token frequency maps) in column bow-column and stores them in a new column tfidf-column as maps of token->tfidf-score.

metamorph                            | .
-------------------------------------|---------
Behaviour in mode :fit               | normal
Behaviour in mode :transform         | normal
Reads keys from ctx                  | none
Writes keys to ctx                   | none
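
To make the shape of the input and output concrete, here is a plain-Clojure sketch of one common tf-idf variant (relative term frequency times log(N / document frequency)). The weighting actually used by the library is not stated above and may differ; `idf` and `bow->tfidf-map` are hypothetical names.

```clojure
(defn idf
  "Inverse document frequency per token: log(N / number of documents containing it)."
  [bows]
  (let [n        (count bows)
        doc-freq (frequencies (mapcat keys bows))]
    (into {}
          (map (fn [[token df]] [token (Math/log (/ (double n) df))]))
          doc-freq)))

(defn bow->tfidf-map
  "Turns one token->count map into a token->tfidf-score map."
  [idf-map bow]
  (let [total (reduce + (vals bow))]
    (into {}
          (map (fn [[token cnt]]
                 [token (* (/ (double cnt) total) (idf-map token))]))
          bow)))

(def bows [{"cat" 1 "fish" 1} {"cat" 2}])
(def idfs (idf bows))
(def scores (mapv #(bow->tfidf-map idfs %) bows))
;; "cat" occurs in every document, so its score is 0.0 in both maps;
;; "fish" occurs in only one document and gets a positive score.
```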

brief

(brief)
(brief options)

Get a brief description of a dataset, in mapseq form. A brief description is the mapseq form of descriptive stats.

by-rank

(by-rank columns-selector rank-predicate)
(by-rank columns-selector rank-predicate options)

Select rows using rank on a column. Ties are resolved using the :dense method.

See the R docs (https://www.rdocumentation.org/packages/base/versions/3.6.1/topics/rank). Rank uses 0-based indexing.

Possible :ties strategies: :average, :first, :last, :random, :min, :max, :dense. :dense is the same as in data.table::frank from R.

:desc? - when set to true (the default), order descending before calculating rank.
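
What a 0-based dense rank looks like, and how a rank predicate then selects rows, can be sketched in plain Clojure. `dense-rank` and `select-by-rank` are hypothetical helpers written for this illustration; they operate on a plain vector of values rather than on a dataset.

```clojure
(defn dense-rank
  "0-based dense rank of each value: equal values share a rank and no rank
  is skipped. With desc? true the largest value gets rank 0."
  [values desc?]
  (let [ordered (sort (if desc? #(compare %2 %1) compare) (distinct values))
        rank-of (zipmap ordered (range))]
    (mapv rank-of values)))

(defn select-by-rank
  "Keeps the values whose rank satisfies rank-predicate."
  [values rank-predicate desc?]
  (let [ranks (dense-rank values desc?)]
    (vec (keep (fn [[v r]] (when (rank-predicate r) v))
               (map vector values ranks)))))

(dense-rank [10 30 20 30] true)
;; => [2 0 1 0]

(select-by-rank [10 30 20 30] zero? true)
;; => [30 30]
```

With descending order and the predicate `zero?`, the selection keeps every row that holds the maximum value, including ties.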

categorical->number

(categorical->number filter-fn-or-ds)
(categorical->number filter-fn-or-ds table-args)
(categorical->number filter-fn-or-ds table-args result-datatype)

Convert columns into a discrete, numeric representation. See tech.v3.dataset.categorical/fit-categorical-map.

categorical->one-hot

(categorical->one-hot filter-fn-or-ds)
(categorical->one-hot filter-fn-or-ds table-args)
(categorical->one-hot filter-fn-or-ds table-args result-datatype)

Convert string columns to numeric columns. See tech.v3.dataset.categorical/fit-one-hot

clone

(clone)

Clone an object. Can clone anything convertible to a reader.

cluster

(cluster clustering-method clustering-method-args target-column)

Metamorph transformer which clusters the data and creates a new column with the cluster id.

clustering-method can be any of:

  • :spectral
  • :dbscan
  • :k-means
  • :mec
  • :clarans
  • :g-means
  • :lloyd
  • :x-means
  • :deterministic-annealing
  • :denclue

clustering-method-args is a vector with the positional arguments for the chosen clustering function, as documented here: https://cljdoc.org/d/generateme/fastmath/2.1.5/api/fastmath.clustering

The cluster id of each row gets written to the column named by target-column.

metamorph                            | .
-------------------------------------|---------
Behaviour in mode :fit               | Calculates the cluster centers of the rows of the dataset at key :metamorph/data and stores them in ctx under the key held in :metamorph/id. Also adds the column target-column, with the cluster id of each row, to the dataset.
Behaviour in mode :transform         | Reads the cluster centers from ctx and applies them to the data in :metamorph/data.
Reads keys from ctx                  | In mode :transform: reads the cluster centers to use from ctx at the key held in :metamorph/id.
Writes keys to ctx                   | In mode :fit: stores the cluster centers in ctx under the key held in :metamorph/id.
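
The fit/transform contract in the table can be illustrated with a much simpler stateful step. The sketch below is plain Clojure and deliberately swaps clustering for column centering: it learns a mean in :fit, stores it under the key held in :metamorph/id, and reuses it in :transform. `center-column` is a hypothetical function, a vector of row maps stands in for a dataset, and :metamorph/id is set by hand here, whereas a real pipeline would normally assign it per step.

```clojure
(defn center-column
  "Hypothetical metamorph-style step: subtracts the mean of column k.
  The mean is learned in :fit and read back from ctx in :transform."
  [k]
  (fn [{:metamorph/keys [data mode id] :as ctx}]
    (let [mean (case mode
                 :fit       (/ (reduce + (map k data)) (double (count data)))
                 :transform (get ctx id))]
      (cond-> (update ctx :metamorph/data
                      (fn [rows] (mapv #(update % k - mean) rows)))
        ;; State is written only in :fit.
        (= mode :fit) (assoc id mean)))))

(def step (center-column :x))

(def fitted
  (step {:metamorph/data [{:x 1.0} {:x 3.0}]
         :metamorph/mode :fit
         :metamorph/id   :step-1}))
;; (:step-1 fitted)         => 2.0
;; (:metamorph/data fitted) => [{:x -1.0} {:x 1.0}]

(def transformed
  (step (assoc fitted
               :metamorph/data [{:x 5.0}]
               :metamorph/mode :transform)))
;; (:metamorph/data transformed) => [{:x 3.0}]
```

New data in :transform is shifted by the mean learned during :fit, not by its own mean, which mirrors how cluster centers learned in :fit are applied to new rows.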

column

(column colname)

column->dataset

(column->dataset colname transform-fn)
(column->dataset colname transform-fn options)

Transform a column into a sequence of maps using transform-fn. Return dataset created out of the sequence of maps.

column-cast

(column-cast colname datatype)

Cast a column to a new datatype. This is never a lazy operation. If the old and new datatypes match and no cast-fn is provided then dtype/clone is called on the column.

colname may be a scalar or a tuple of [src-col dst-col].

datatype may be a datatype enumeration or a tuple of [datatype cast-fn] where cast-fn may return either a new value, :tech.v3.dataset/missing, or :tech.v3.dataset/parse-failure. Exceptions are propagated to the caller. The new column has at least the existing missing set (if no attempt returns :missing or :cast-failure). :cast-failure means the value gets added to metadata key :unparsed-data and the index gets added to :unparsed-indexes.

If the existing datatype is string, then tech.v3.dataset.column/parse-column is called.

Casts between numeric datatypes need no cast-fn but one may be provided. Casts to string need no cast-fn but one may be provided. Casts from string to anything will call tech.v3.dataset.column/parse-column.

column-count

(column-count)

column-labeled-mapseq

(column-labeled-mapseq value-colname-seq)

Given a dataset, return a sequence of maps where several columns are all stored
  in a :value key and a :label key contains a column name.  Used for quickly creating
  timeseries or scatterplot labeled graphs.  Returns a lazy sequence, not a reader!

  See also `columnwise-concat`

  Return a sequence of maps with
```clojure
  {... - columns not in colname-seq
   :value - value from one of the value columns
   :label - name of the column the value came from
  }
```

column-map

(column-map result-colname map-fn)
(column-map result-colname map-fn filter-fn-or-ds)
(column-map result-colname map-fn res-dtype-or-opts filter-fn-or-ds)

Produce a new (or updated) column as the result of mapping a fn over columns.

  * `dataset` - dataset.
  * `result-colname` - Name of new (or existing) column.
  * `map-fn` - function to map over columns.  Same rules as `tech.v3.datatype/emap`.
  * `res-dtype-or-opts` - If not given, the result is scanned to infer missing and datatype.
  If using an option map, options are described below.
  * `filter-fn-or-ds` - A dataset, a sequence of columns, or a `tech.v3.datasets/column-filters`
     column filter function.  Defaults to all the columns of the existing dataset.

  Returns a new dataset with a new or updated column.

  Options:

  * `:datatype` - Set the datatype of the result column.  If not given, the result is scanned
  to infer result datatype and missing set.
  * `:missing-fn` - if given, columns are first passed to missing-fn as a sequence and
  this dictates the missing set.  Otherwise the missing set is found by scanning the results
  during the inference process. See `tech.v3.dataset.column/union-missing-sets` and
  `tech.v3.dataset.column/intersect-missing-sets` for example functions to pass in
  here.

  Examples:


```clojure

  ;;From the tests --

  (let [testds (ds/->dataset [{:a 1.0 :b 2.0} {:a 3.0 :b 5.0} {:a 4.0 :b nil}])]
    ;;result scanned for both datatype and missing set
    (is (= (vec [3.0 6.0 nil])
           (:b2 (ds/column-map testds :b2 #(when % (inc %)) [:b]))))
    ;;result scanned for missing set only.  Result used in-place.
    (is (= (vec [3.0 6.0 nil])
           (:b2 (ds/column-map testds :b2 #(when % (inc %))
                               {:datatype :float64} [:b]))))
    ;;Nothing scanned at all.
    (is (= (vec [3.0 6.0 nil])
           (:b2 (ds/column-map testds :b2 #(inc %)
                               {:datatype :float64
                                :missing-fn ds-col/union-missing-sets} [:b]))))
    ;;Missing set scanning causes NPE at inc.
    (is (thrown? Throwable
                 (ds/column-map testds :b2 #(inc %)
                                {:datatype :float64}
                                [:b]))))

  ;;Ad-hoc repl --

user> (require '[tech.v3.dataset :as ds])
nil
user> (def ds (ds/->dataset "test/data/stocks.csv"))
#'user/ds
user> (ds/head ds)
test/data/stocks.csv [5 3]:

| symbol |       date | price |
|--------|------------|-------|
|   MSFT | 2000-01-01 | 39.81 |
|   MSFT | 2000-02-01 | 36.35 |
|   MSFT | 2000-03-01 | 43.22 |
|   MSFT | 2000-04-01 | 28.37 |
|   MSFT | 2000-05-01 | 25.45 |
user> (-> (ds/column-map ds "price^2" #(* % %) ["price"])
          (ds/head))
test/data/stocks.csv [5 4]:

| symbol |       date | price |   price^2 |
|--------|------------|-------|-----------|
|   MSFT | 2000-01-01 | 39.81 | 1584.8361 |
|   MSFT | 2000-02-01 | 36.35 | 1321.3225 |
|   MSFT | 2000-03-01 | 43.22 | 1867.9684 |
|   MSFT | 2000-04-01 | 28.37 |  804.8569 |
|   MSFT | 2000-05-01 | 25.45 |  647.7025 |



user> (def ds1 (ds/->dataset [{:a 1} {:b 2.0} {:a 2 :b 3.0}]))
#'user/ds1
user> ds1
_unnamed [3 2]:

|  :b | :a |
|----:|---:|
|     |  1 |
| 2.0 |    |
| 3.0 |  2 |
user> (ds/column-map ds1 :c (fn [a b]
                              (when (and a b)
                                (+ (double a) (double b))))
                     [:a :b])
_unnamed [3 3]:

|  :b | :a |  :c |
|----:|---:|----:|
|     |  1 |     |
| 2.0 |    |     |
| 3.0 |  2 | 5.0 |
user> (ds/missing (*1 :c))
{0,1}
```

column-names

(column-names)
(column-names columns-selector)
(column-names columns-selector meta-field)

column-values->categorical

(column-values->categorical src-column)

Given a column encoded via either string->number or one-hot, reverse map to a sequence of the original string column values. In the case of one-hot mappings, src-column must be the original column name before the one-hot map.

columns

(columns)
(columns result-type)

columns-with-missing-seq

(columns-with-missing-seq)

Return a sequence of:
```clojure
  {:column-name column-name
   :missing-count missing-count
  }
```
  or nil if no columns are missing data.

columnwise-concat

(columnwise-concat colnames)
(columnwise-concat colnames options)

Given a dataset and a list of columns, produce a new dataset with
  the columns concatenated to a new column with a :column column indicating
  which column the original value came from.  Any columns not mentioned in the
  list of columns are duplicated.

  Example:
```clojure
user> (-> [{:a 1 :b 2 :c 3 :d 1} {:a 4 :b 5 :c 6 :d 2}]
          (ds/->dataset)
          (ds/columnwise-concat [:c :a :b]))
null [6 3]:

| :column | :value | :d |
|---------+--------+----|
|      :c |      3 |  1 |
|      :c |      6 |  2 |
|      :a |      1 |  1 |
|      :a |      4 |  2 |
|      :b |      2 |  1 |
|      :b |      5 |  2 |
```

  Options:

  value-column-name - defaults to :value
  colname-column-name - defaults to :column
  

concat

(concat & datasets)

concat-copying

(concat-copying & datasets)

concat-inplace

(concat-inplace & datasets)

Concatenate datasets in place. Respects missing values. Datasets must all have the same columns. Result column datatypes will be a widening cast of the datatypes.

convert-types

(convert-types coltype-map-or-columns-selector)
(convert-types columns-selector new-types)

Convert type of the column to the other type.

count-vectorize

(count-vectorize text-col bow-col)
(count-vectorize text-col bow-col options)

Transforms the text column text-col into a map of token frequencies in column bow-col.

options can be any of:

text->bow-fn A function which takes as input a

metamorph                            | .
-------------------------------------|---------
Behaviour in mode :fit               | normal
Behaviour in mode :transform         | normal
Reads keys from ctx                  | none
Writes keys to ctx                   | none
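
A "map of token frequencies" for a single text can be produced in plain Clojure as sketched below. The tokenization rule (lower-casing and splitting on anything that is not a letter a-z) is an assumption chosen for the example, and `text->bow` and `count-vectorize-rows` are hypothetical names; the library's default tokenizer is not described above and may behave differently.

```clojure
(require '[clojure.string :as str])

(defn text->bow
  "Hypothetical tokenizer: lower-cases the text, splits on non-letters
  and counts how often each token occurs."
  [text]
  (->> (str/split (str/lower-case text) #"[^a-z]+")
       (remove str/blank?)
       frequencies))

(defn count-vectorize-rows
  "Adds the bag-of-words map of text-col to every row under bow-col.
  A vector of row maps stands in for a dataset."
  [rows text-col bow-col]
  (mapv #(assoc % bow-col (text->bow (get % text-col))) rows))

(text->bow "The cat saw the other cat.")
;; => {"the" 2, "cat" 2, "saw" 1, "other" 1}

(count-vectorize-rows [{:text "a b a"}] :text :bow)
;; => [{:text "a b a", :bow {"a" 2, "b" 1}}]
```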

data->dataset

(data->dataset)

Convert a data-ized dataset created via dataset->data back into a full dataset

dataset->categorical-xforms

(dataset->categorical-xforms)

Given a dataset, return a map of column-name->xform information.

dataset->data

(dataset->data)

Convert a dataset to a pure clojure datastructure. Returns a map with two keys: {:metadata :columns}. :columns is a vector of column definitions appropriate for passing directly back into new-dataset. A column definition in this case is a map of {:name :missing :data :metadata}.

dataset->str

(dataset->str)
(dataset->str options)

Convert a dataset to a string. Prints a single line header and then calls dataset-data->str.

For options documentation see dataset-data->str.

dataset-name

(dataset-name)

dataset?

(dataset?)

Is ds a dataset type?

descriptive-stats

(descriptive-stats)
(descriptive-stats options)

Get descriptive statistics across the columns of the dataset. In addition to the standard stats.

Options:

  • :stat-names - defaults to (remove #{:values :num-distinct-values} (all-descriptive-stats-names))
  • :n-categorical-values - Number of categorical values to report in the 'values' field. Defaults to 21.

difference

(difference ds-right)
(difference ds-right options)

drop

(drop columns-selector rows-selector)

Drop columns and rows.

drop-columns

(drop-columns)
(drop-columns columns-selector)
(drop-columns columns-selector meta-field)

Drop columns by (returns dataset):

  • name
  • sequence of names
  • map of names with new names (rename)
  • function which filters names (via column metadata)

drop-missing

(drop-missing)
(drop-missing columns-selector)

Drop rows with missing values.

columns-selector selects the columns in which to look for missing values.

drop-rows

(drop-rows)
(drop-rows rows-selector)
(drop-rows rows-selector options)

Drop rows using:

  • row id
  • seq of row ids
  • seq of true/false
  • fn with predicate

empty-ds?

(empty-ds?)

ensure-array-backed

(ensure-array-backed)
(ensure-array-backed options)

Ensure the column data in the dataset is stored in pure java arrays. This is sometimes necessary for interop with other libraries and this operation will force any lazy computations to complete. This also clears the missing set for each column and writes the missing values to the new arrays.

Columns that are already array backed and that have no missing values are not changed and are returned as-is.

The postcondition is that dtype/->array will return a java array in the appropriate datatype for each column.

Options:

  • :unpack? - unpack packed datetime types. Defaults to true.

feature-ecountclj

(feature-ecount)

Number of feature columns. Feature columns are columns that are not inference targets.


fill-range-replaceclj

(fill-range-replace colname max-span)
(fill-range-replace colname max-span missing-strategy)
(fill-range-replace colname max-span missing-strategy missing-value)

filterclj

(filter predicate)

dataset->dataset transformation. Predicate is passed a map of colname->column-value.


filter-columnclj

(filter-column colname predicate)

Filter a given column by a predicate. Predicate is passed column values. If predicate is not an instance of IFn, it is treated as a value and used as if the predicate were #(= value %). Returns a dataset.
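The predicate-or-value rule described for filter-column can be sketched in plain Clojure. This is only an illustration under stated assumptions: `->predicate` and `column-values` are hypothetical names, the column is a plain vector rather than a dataset column, and the library's own implementation may differ.

```clojure
;; Hypothetical helper, not part of scicloj.ml: normalise the argument the
;; way the docstring describes. Anything that is not an IFn becomes an
;; equality check against that value.
(defn ->predicate
  [predicate-or-value]
  (if (ifn? predicate-or-value)
    predicate-or-value
    #(= predicate-or-value %)))

;; A plain vector standing in for the values of one column.
(def column-values [1 2 3 2 1])

;; A function is used as-is.
(filterv (->predicate odd?) column-values) ;; => [1 3 1]

;; A number is not an IFn, so it is compared by equality.
(filterv (->predicate 2) column-values)    ;; => [2 2]
```

Note that keywords, maps, sets and vectors implement IFn in Clojure, so under this rule they would be called as predicates rather than compared as values.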

filter-datasetclj

(filter-dataset filter-fn-or-ds)

Filter the columns of the dataset returning a new dataset. This pathway is designed to work with the tech.v3.dataset.column-filters namespace.

  • If filter-fn-or-ds is a dataset, it is returned.
  • If filter-fn-or-ds is sequential, then select-columns is called.
  • If filter-fn-or-ds is :all, all columns are returned
  • If filter-fn-or-ds is an instance of IFn, the dataset is passed into it.

firstclj

(first)

fold-byclj

(fold-by columns-selector)
(fold-by columns-selector folding-function)

full-joinclj

(full-join ds-right columns-selector)
(full-join ds-right columns-selector options)

group-byclj

(group-by grouping-selector)
(group-by grouping-selector options)

Group dataset by:

  • column name
  • list of columns
  • map of keys and row indexes
  • function getting map of values

Options are:

  • select-keys - when grouping is done by function, you can limit fields to a select-keys seq.
  • result-type - return results as dataset (:as-dataset, default) or as map of datasets (:as-map) or as map of row indexes (:as-indexes) or as sequence of (sub)datasets
  • other parameters which are passed to dataset fn

When dataset is returned, meta contains :grouped? set to true. Columns in dataset:

  • name - group name
  • group-id - id of the group (int)
  • data - group as dataset

group-by->indexesclj

(group-by->indexes key-fn)

(Non-lazy) - Group a dataset and return a map of key-fn-value->indexes where indexes is an in-order contiguous group of indexes.

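The shape of the result, a map from key-fn value to row indexes, can be illustrated with plain Clojure data. This is a sketch under assumptions: `rows` and `group-rows->indexes` are hypothetical, rows are plain maps in a vector, and the real function returns the library's own index containers rather than vectors.

```clojure
;; Rows as plain maps; a stand-in for a real dataset.
(def rows [{:city "Oslo" :temp 3}
           {:city "Rome" :temp 18}
           {:city "Oslo" :temp 5}])

;; Hypothetical helper: eagerly collect, per key, the indexes of the rows
;; that produced that key, in ascending order.
(defn group-rows->indexes
  [key-fn rows]
  (reduce-kv (fn [acc idx row]
               (update acc (key-fn row) (fnil conj []) idx))
             {}
             rows))

(group-rows->indexes :city rows)
;; => {"Oslo" [0 2], "Rome" [1]}
```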

group-by-columnclj

(group-by-column colname)

Return a map of column-value->dataset.


group-by-column->indexesclj

(group-by-column->indexes colname)

(Non-lazy) - Group a dataset by a column and return a map of column-val->indexes where indexes is an in-order contiguous group of indexes.

grouped?clj

(grouped?)

Does the dataset represent a grouped dataset (result of group-by)?

groups->mapclj

(groups->map)

Convert grouped dataset to the map of groups


groups->seqclj

(groups->seq)

has-column?clj

(has-column? column-name)

headclj

(head)
(head n)

inference-column?clj

(inference-column?)

inference-target-column-namesclj

(inference-target-column-names)

Return the names of the columns that are inference targets.


inference-target-dsclj

(inference-target-ds)

Given a dataset return reverse-mapped inference target columns or nil in the case where there are no inference targets.


inference-target-label-inverse-mapclj

(inference-target-label-inverse-map & [label-columns])

Given options generated during ETL operations and annotated with a :label-columns sequence containing 1 label column, generate a reverse map that maps from a dataset value back to the label that generated that value.

inference-target-label-mapclj

(inference-target-label-map & [label-columns])

infoclj

(info)
(info result-type)

inner-joinclj

(inner-join ds-right columns-selector)
(inner-join ds-right columns-selector options)

intersectclj

(intersect ds-right)
(intersect ds-right options)

join-columnsclj

(join-columns target-column columns-selector)
(join-columns target-column columns-selector options)

labelsclj

(labels)

Return the labels. The labels sequence is the reverse mapped inference column. This returns a single column of data or errors out.


lastclj

(last)

left-joinclj

(left-join ds-right columns-selector)
(left-join ds-right columns-selector options)

map-columnsclj

(map-columns column-name map-fn)
(map-columns column-name columns-selector map-fn)
(map-columns column-name new-type columns-selector map-fn)

mapseq-readerclj

(mapseq-reader)

Return a reader that produces a map of column-name->column-value.

Options:

  • :missing-nil? - Defaults to true. Substitute nil for missing values so that downstream missing-value detection is independent of the column datatype.

mark-as-groupclj

(mark-as-group)

Add grouping tag


min-max-scaleclj

(min-max-scale col-seq {:keys [min max] :or {min -0.5 max 0.5} :as options})

Metamorph transformer, which scales the column data into a given range.

`col-seq` is a sequence of column names to work on

`options` Options for the scaler, can take:

`min` Minimal value to scale to (default -0.5)

`max` Maximum value to scale to (default 0.5)

metamorph                            | .
-------------------------------------|----------------------------------------------------------------------------
Behaviour in mode :fit               | Scales the dataset at key `:metamorph/data` and stores the trained model in ctx under the key given in `:metamorph/id`
Behaviour in mode :transform         | Reads the trained min-max-scale model from ctx and applies it to the data in `:metamorph/data`
Reads keys from ctx                  | In mode `:transform`: reads the trained model from the key given in `:metamorph/id`.
Writes keys to ctx                   | In mode `:fit`: stores the trained model under the key given in `:metamorph/id`.
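The fit/transform behaviour in the table can be sketched with plain Clojure maps. This is not the library's implementation: `fit-min-max`, `apply-min-max` and `min-max-step` are hypothetical names, the dataset is modelled as a map of column name to vector, and the usual linear min-max formula is assumed.

```clojure
(defn fit-min-max
  "Learn the minimum and maximum of each column in col-seq.
  `data` is a plain map of column name -> vector of numbers."
  [data col-seq]
  (into {}
        (map (fn [c] [c {:col-min (apply min (data c))
                         :col-max (apply max (data c))}]))
        col-seq))

(defn apply-min-max
  "Rescale the fitted columns into [lo, hi]. Assumes col-max > col-min."
  [data model {lo :min hi :max :or {lo -0.5 hi 0.5}}]
  (reduce-kv (fn [d c {:keys [col-min col-max]}]
               (let [span (- col-max col-min)]
                 (assoc d c (mapv #(+ lo (* (- hi lo) (/ (- % col-min) span)))
                                  (d c)))))
             data
             model))

(defn min-max-step
  "Returns a function of ctx. In :fit it learns and stores the model under
  the step id; in :transform it reuses the stored model."
  [col-seq options]
  (fn [{:metamorph/keys [data id mode] :as ctx}]
    (case mode
      :fit       (let [model (fit-min-max data col-seq)]
                   (assoc ctx
                          id model
                          :metamorph/data (apply-min-max data model options)))
      :transform (assoc ctx :metamorph/data
                        (apply-min-max data (get ctx id) options)))))

(def step (min-max-step [:x] {}))

(def fitted
  (step {:metamorph/id :scale-1
         :metamorph/mode :fit
         :metamorph/data {:x [0.0 5.0 10.0]}}))

(:metamorph/data fitted) ;; => {:x [-0.5 0.0 0.5]}

;; New data is scaled with the minimum and maximum learned during :fit,
;; so values outside the training range land outside [-0.5, 0.5].
(def transformed
  (step (assoc fitted
               :metamorph/mode :transform
               :metamorph/data {:x [2.5 20.0]})))

(:metamorph/data transformed) ;; => {:x [-0.25 1.5]}
```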

missingclj

(missing)

Given a dataset or a column, return the missing set as a roaring bitmap


modelclj

(model options)

Executes a machine learning model in train/predict (depending on :mode)
from the `metamorph.ml` model registry.

The model is passed between both invocations via the shared context ctx, in a
key (a step identifier) which is passed in key `:metamorph/id` and guaranteed
to be unique for each pipeline step.

The function writes to and reads from this common context key.

Options:
- `:model-type` - Keyword for the model to use

Further options get passed to the `train` function and are model specific.

See here for an overview of the models built into scicloj.ml:

https://scicloj.github.io/scicloj.ml/userguide-models.html

Other libraries might contribute other models,
which are documented as part of the library.

metamorph                            | .
-------------------------------------|----------------------------------------------------------------------------
Behaviour in mode :fit               | Calls `scicloj.metamorph.ml/train` using data in `:metamorph/data` and `options`, and stores the trained model in ctx under the key given in `:metamorph/id`
Behaviour in mode :transform         | Reads the trained model from ctx and calls `scicloj.metamorph.ml/predict` with that model and the data in `:metamorph/data`
Reads keys from ctx                  | In mode `:transform`: reads the trained model to use for prediction from the key given in `:metamorph/id`.
Writes keys to ctx                   | In mode `:fit`: stores the trained model under the key given in `:metamorph/id` and writes feature-ds and target-ds before prediction into ctx at `:scicloj.metamorph.ml/feature-ds` / `:scicloj.metamorph.ml/target-ds`

See as well:

* `scicloj.metamorph.ml/train`
* `scicloj.metamorph.ml/predict`
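How a trained model travels from :fit to :transform through the context can be sketched as follows. Everything here is a hypothetical stand-in: `model-step`, `train-mean` and `predict-mean` are not part of scicloj.ml, the train and predict functions are passed in directly instead of being looked up by `:model-type`, rows are plain maps, and placing the predictions in `:metamorph/data` is an assumption of the sketch.

```clojure
(defn model-step
  "Hypothetical stand-in for a model step. train-fn and predict-fn are
  supplied by the caller."
  [train-fn predict-fn]
  (fn [{:metamorph/keys [data id mode] :as ctx}]
    (case mode
      ;; :fit stores the trained model under the step id.
      :fit       (assoc ctx id (train-fn data))
      ;; :transform reads the model back from the same key.
      :transform (assoc ctx :metamorph/data
                        (predict-fn (get ctx id) data)))))

;; Toy model: remember the mean of :y and predict it for every row.
(defn train-mean [rows]
  {:mean (/ (reduce + (map :y rows)) (double (count rows)))})

(defn predict-mean [model rows]
  (mapv #(assoc % :prediction (:mean model)) rows))

(def step (model-step train-mean predict-mean))

(def fitted
  (step {:metamorph/id :model-1
         :metamorph/mode :fit
         :metamorph/data [{:x 1 :y 2.0} {:x 2 :y 4.0}]}))

(def predicted
  (step (assoc fitted
               :metamorph/mode :transform
               :metamorph/data [{:x 3}])))

(:model-1 fitted)           ;; => {:mean 3.0}
(:metamorph/data predicted) ;; => [{:x 3, :prediction 3.0}]
```

Because the model lives under the step's own id, two model steps in one pipeline would not overwrite each other's state.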

model-typeclj

(model-type & [column-name-seq])

Check the label column after dataset processing. Returns either :regression or :classification.

new-columnclj

(new-column)
(new-column data)
(new-column data metadata)
(new-column data metadata missing)

Create a new column. Data will be scanned for missing values unless the full 4-argument pathway is used.

new-datasetclj

(new-dataset)
(new-dataset column-seq)
(new-dataset ds-metadata column-seq)

Create a new dataset from a sequence of columns. Data will be converted into columns using ds-col-proto/ensure-column-seq. If the column seq is simply a collection of vectors, for instance, columns will be named ordinally.

Options map:

  • :dataset-name - Name of the dataset. Defaults to "_unnamed".
  • :key-fn - Key function used on all column names before insertion into dataset.

The return value fulfills the dataset protocols.

num-inference-classesclj

(num-inference-classes)

Given a dataset and correctly built options from pipeline operations, return the number of classes used for the label. Error if not classification dataset.


order-byclj

(order-by columns-or-fn)
(order-by columns-or-fn comparators)
(order-by columns-or-fn comparators options)

Order dataset by:

  • column name
  • columns (as sequence of names)
  • key-fn
  • sequence of columns / key-fn

Additionally you can specify the order with:

  • :asc
  • :desc
  • custom comparator function

order-column-namesclj

(order-column-names colname-seq)

Order a sequence of column names so they match the order in the original dataset. Missing columns are placed last.

pivot->longerclj

(pivot->longer)
(pivot->longer columns-selector)
(pivot->longer columns-selector options)

tidyr pivot_longer api


pivot->widerclj

(pivot->wider columns-selector value-columns)
(pivot->wider columns-selector value-columns options)

print-datasetclj

(print-dataset)
(print-dataset options)

probability-distributions->label-columnclj

(probability-distributions->label-column dst-colname)

Given a dataset that has columns in which the column names describe labels and the rows describe a probability distribution, create a label column by finding the maximum value in each row and assigning that row the name of the column that holds it.
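The row-wise arg-max idea can be sketched with plain maps. This is an illustration only: `prob-rows` and `row->label` are hypothetical, each row is a map from label to probability, and the result is read here as the name of the column holding the row's largest value.

```clojure
;; Each row maps a label (the column name) to its probability.
(def prob-rows [{:cat 0.7 :dog 0.2 :bird 0.1}
                {:cat 0.1 :dog 0.3 :bird 0.6}])

;; Hypothetical helper: the label is the key of the largest value.
(defn row->label [row]
  (key (apply max-key val row)))

(mapv row->label prob-rows)
;; => [:cat :bird]
```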

process-group-dataclj

(process-group-data f)
(process-group-data f parallel?)

rand-nthclj

(rand-nth)
(rand-nth options)

randomclj

(random)
(random n)
(random n options)

read-nippyclj

(read-nippy)

reduce-dimensionsclj

(reduce-dimensions algorithm target-dims cnames opts)

Metamorph transformer, which reduces the dimensions of a given dataset.

`algorithm` can be any of:
  * :pca-cov
  * :pca-cor
  * :pca-prob
  * :kpca
  * :gha
  * :random

`target-dims` is the number of dimensions to reduce to.

`cnames` is a sequence of column names on which the reduction is performed

`opts` are the options of the algorithm

metamorph                            | .
-------------------------------------|----------------------------------------------------------------------------
Behaviour in mode :fit               | Reduces dimensions of the dataset at key `:metamorph/data` and stores the trained model in ctx under the key given in `:metamorph/id`
Behaviour in mode :transform         | Reads the trained reduction model from ctx and applies it to the data in `:metamorph/data`
Reads keys from ctx                  | In mode `:transform`: reads the trained model from ctx at the key given in `:metamorph/id`.
Writes keys to ctx                   | In mode `:fit`: stores the trained model in ctx under the key given in `:metamorph/id`.

remove-columnclj

(remove-column col-name)

Same as:

```clojure
(dissoc dataset col-name)
```

remove-columnsclj

(remove-columns colname-seq-or-fn)

Remove columns indexed by column name seq or column filter function. For example:

```clojure
(remove-columns DS [:A :B])
(remove-columns DS cf/categorical)
```

remove-rowsclj

(remove-rows row-indexes)

Same as drop-rows.


rename-columnsclj

(rename-columns columns-mapping)
(rename-columns columns-selector columns-map-fn)

Rename columns with provided old -> new name map


reorder-columnsclj

(reorder-columns columns-selector & columns-selectors)

Reorder columns using column selector(s). When the column names are incomplete, the missing columns are appended at the end.

replace-missingclj

(replace-missing)
(replace-missing strategy)
(replace-missing columns-selector strategy)
(replace-missing columns-selector strategy value)

replace-missing-valueclj

(replace-missing-value scalar-value)
(replace-missing-value filter-fn-or-ds scalar-value)

right-joinclj

(right-join ds-right columns-selector)
(right-join ds-right columns-selector options)

row-atclj

(row-at idx)

Get the row at an individual index. If indexes are negative then the dataset is indexed from the end.


row-countclj

(row-count)

rowsclj

(rows)
(rows result-type)

sampleclj

(sample)
(sample n)
(sample n options)

Sample n-rows from a dataset. Defaults to sampling without replacement.

For the definition of seed, see the argshuffle documentation: https://cnuernber.github.io/dtype-next/tech.v3.datatype.argops.html#var-argshuffle

The returned dataset's metadata is altered by merging {:print-index-range (range n)} in, so you will always see the entire returned dataset. If this isn't desired, vary-meta is a good pathway.

selectclj

(select columns-selector rows-selector)

Select columns and rows.


select-by-indexclj

(select-by-index col-index row-index)

Trim dataset according to this sequence of indexes. Returns a new dataset.

col-index and row-index - one of:

  • :all - all the columns
  • list of indexes. May contain duplicates. Negative values will be counted from the end of the sequence.
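The index rules (duplicates allowed, negative values counted from the end, :all for everything) can be illustrated on a plain vector. This is a sketch under assumptions: `resolve-index` and `select-by-indexes` are hypothetical helpers, and it is assumed that -1 refers to the last element.

```clojure
;; Hypothetical helper: map a possibly negative index onto 0..n-1,
;; assuming -1 means the last element.
(defn resolve-index [n idx]
  (if (neg? idx) (+ n idx) idx))

;; Hypothetical helper: pick elements by index, or everything for :all.
(defn select-by-indexes [coll indexes]
  (let [v (vec coll)
        n (count v)]
    (if (= :all indexes)
      v
      (mapv #(nth v (resolve-index n %)) indexes))))

(select-by-indexes [:a :b :c :d] [0 -1 -1]) ;; => [:a :d :d]
(select-by-indexes [:a :b :c :d] :all)      ;; => [:a :b :c :d]
```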

select-columnsclj

(select-columns)
(select-columns columns-selector)
(select-columns columns-selector meta-field)

Select columns by (returns dataset):

  • name
  • sequence of names
  • map of names with new names (rename)
  • function which filters names (via column metadata)

select-columns-by-indexclj

(select-columns-by-index col-index)

Select columns from the dataset by a seq of indexes (negative values allowed) or :all.

See documentation for select-by-index.

select-missingclj

(select-missing)
(select-missing columns-selector)

Select rows with missing values.

columns-selector selects the columns to check for missing values.

select-rowsclj

(select-rows)
(select-rows rows-selector)
(select-rows rows-selector options)

Select rows using:

  • row id
  • seq of row ids
  • seq of true/false
  • fn with predicate

select-rows-by-indexclj

(select-rows-by-index row-index)

Select rows from the dataset or column by a seq of indexes (negative values allowed) or :all.

See documentation for select-by-index.

semi-joinclj

(semi-join ds-right columns-selector)
(semi-join ds-right columns-selector options)

separate-columnclj

(separate-column column separator)
(separate-column column target-columns separator)
(separate-column column target-columns separator options)

set-dataset-nameclj

(set-dataset-name ds-name)

set-inference-targetclj

(set-inference-target target-name-or-target-name-seq)

Set the inference target on the column. This sets the :column-type member of the column metadata to :inference-target?.


shapeclj

(shape)

Returns shape of the dataset [rows, cols]


shuffleclj

(shuffle)
(shuffle options)

sort-byclj

(sort-by key-fn)
(sort-by key-fn compare-fn)

Sort a dataset by a key-fn and compare-fn.


sort-by-columnclj

(sort-by-column colname)
(sort-by-column colname compare-fn)

Sort a dataset by a given column using the given compare fn.


std-scaleclj

(std-scale col-seq
           {:keys [mean? stddev?] :or {mean? true stddev? true} :as options})

Metamorph transformer, which centers and scales the dataset per column.

`col-seq` is a sequence of column names to work on

`options` are the options for the scaler and can take:

`mean?` If true (default), the data gets shifted by the column means, so it is centered on 0

`stddev?` If true (default), the data gets scaled by the standard deviation of the column

metamorph                            | .
-------------------------------------|----------------------------------------------------------------------------
Behaviour in mode :fit               | Centers and scales the dataset at key `:metamorph/data` and stores the trained model in ctx under the key given in `:metamorph/id`
Behaviour in mode :transform         | Reads the trained std-scale model from ctx and applies it to the data in `:metamorph/data`
Reads keys from ctx                  | In mode `:transform`: reads the trained model from the key given in `:metamorph/id`.
Writes keys to ctx                   | In mode `:fit`: stores the trained model under the key given in `:metamorph/id`.
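What the `mean?` and `stddev?` options control can be sketched on a single column of numbers. This is an illustration under assumptions, not the library's code: `fit-std` and `apply-std` are hypothetical, the column is a plain vector, and the sample standard deviation (dividing by n - 1) is assumed, which may not match what the scaler uses.

```clojure
;; Learn the statistics once, as the :fit step would.
;; Assumes at least two values; uses the sample standard deviation.
(defn fit-std [xs]
  (let [n        (count xs)
        mean     (/ (reduce + xs) (double n))
        variance (/ (reduce + (map #(let [d (- % mean)] (* d d)) xs))
                    (dec n))]
    {:mean mean :stddev (Math/sqrt variance)}))

;; Apply previously learned statistics, as the :transform step would.
(defn apply-std
  [{:keys [mean stddev]}
   {:keys [mean? stddev?] :or {mean? true stddev? true}}
   xs]
  (mapv (fn [x]
          (cond-> x
            mean?   (- mean)
            stddev? (/ stddev)))
        xs))

(def model (fit-std [2.0 4.0 6.0]))
;; => {:mean 4.0, :stddev 2.0}

(apply-std model {} [2.0 4.0 6.0])       ;; => [-1.0 0.0 1.0]
(apply-std model {:stddev? false} [8.0]) ;; => [4.0]
```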

tailclj

(tail)
(tail n)

take-nthclj

(take-nth n-val)

ungroupclj

(ungroup)
(ungroup options)

Concat groups into dataset.

When add-group-as-column or add-group-id-as-column is set to true or name(s), columns with the group name(s) or group id are added to the result.

Before joining, the groups can be sorted by group name.

unionclj

(union & datasets)

unique-byclj

(unique-by)
(unique-by columns-selector)
(unique-by columns-selector options)

unique-by-columnclj

(unique-by-column colname)
(unique-by-column options colname)

Map-fn function gets passed map for each row, rows are grouped by the return value. Keep-fn is used to decide the index to keep.

:keep-fn - Function from key, idx-seq->idx. Defaults to #(first %2).

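The role of :keep-fn can be sketched with plain Clojure data. This is only an illustration: `unique-row-indexes` is a hypothetical helper, rows are plain maps in a vector, and the result is the sorted row indexes that would be kept rather than a dataset.

```clojure
;; Hypothetical helper: group row indexes by (map-fn row), then let
;; keep-fn pick one index per group. The default mirrors #(first %2).
(defn unique-row-indexes
  ([map-fn rows]
   (unique-row-indexes map-fn rows (fn [_key idxs] (first idxs))))
  ([map-fn rows keep-fn]
   (let [groups (reduce-kv (fn [acc idx row]
                             (update acc (map-fn row) (fnil conj []) idx))
                           {}
                           rows)]
     (sort (map (fn [[k idxs]] (keep-fn k idxs)) groups)))))

(def rows [{:id 1 :v "a"} {:id 2 :v "b"} {:id 1 :v "c"}])

;; Default: keep the first row of each group.
(unique-row-indexes :id rows)                           ;; => (0 1)

;; Custom keep-fn: keep the last row of each group.
(unique-row-indexes :id rows (fn [_ idxs] (last idxs))) ;; => (1 2)
```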

unmark-groupclj

(unmark-group)

Remove grouping tag


unordered-selectclj

(unordered-select colname-seq index-seq)

Perform a selection but use the order of the columns in the existing table; do not reorder the columns based on colname-seq. Useful when doing selection based on sets or persistent hash maps.


unrollclj

(unroll columns-selector)
(unroll columns-selector options)

unroll-columnclj

(unroll-column column-name)
(unroll-column column-name options)

Unroll a column that has some (or all) sequential data as entries.
Returns a new dataset with the same columns, but with the other columns duplicated
where the unroll happened. The column now contains only scalar data.

Any missing indexes are dropped.

```clojure
user> (-> (ds/->dataset [{:a 1 :b [2 3]}
                         {:a 2 :b [4 5]}
                         {:a 3 :b :a}])
          (ds/unroll-column :b {:indexes? true}))
_unnamed [5 3]:

| :a | :b | :indexes |
|----+----+----------|
|  1 |  2 |        0 |
|  1 |  3 |        1 |
|  2 |  4 |        0 |
|  2 |  5 |        1 |
|  3 | :a |        0 |
```

Options:

- :datatype - datatype of the resulting column if one aside from :object is desired.
- :indexes? - If true, create a new column that records the indexes of the values from the original column. Can also be a truthy value (like a keyword), in which case the column will be named this.

updateclj

(update filter-fn-or-ds update-fn & args)

Update this dataset. Filters this dataset into a new dataset, applies update-fn, then merges the result into the original dataset.

This pathway is designed to work with the tech.v3.dataset.column-filters namespace.

  • filter-fn-or-ds is a generalized parameter. It may be a function, a dataset, or a sequence of column names.
  • update-fn must take the dataset as its first argument and must return a dataset.

```clojure
(ds/bind-> (ds/->dataset dataset) ds
           (ds/remove-column "Id")
           (ds/update cf/string ds/replace-missing-value "NA")
           (ds/update-elemwise cf/string #(get {"" "NA"} % %))
           (ds/update cf/numeric ds/replace-missing-value 0)
           (ds/update cf/boolean ds/replace-missing-value false)
           (ds/update-columnwise (cf/union (cf/numeric ds) (cf/boolean ds))
                                 #(dtype/elemwise-cast % :float64)))
```

update-column

(update-column col-name update-fn)

Update a column, returning a new dataset. update-fn is a column->column transformation. It is an error if the column does not exist.
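The sketch below shows a column->column update-fn in both forms. It assumes the `tech.v3.dataset` alias `ds` used in the examples on this page, a made-up `example-ds`, and that the metamorph variant returns a function from context map to context map, as the namespace description suggests. Treat the metamorph half as an illustration of that contract rather than a definitive usage pattern.

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.datatype.functional :as dfn]
         '[scicloj.ml.metamorph :as mm])

;; Hypothetical example data, not taken from the library docs.
(def example-ds
  (ds/->dataset [{:a 1 :b 10} {:a 2 :b 20} {:a 3 :b 30}]))

;; Dataset form: the dataset is passed explicitly as the first argument.
(def doubled
  (ds/update-column example-ds :a #(dfn/* % 2)))

;; Metamorph form (assumed contract): no dataset argument; the call
;; returns a function that reads and writes :metamorph/data in the context.
(def double-a (mm/update-column :a #(dfn/* % 2)))

(def ctx-out
  (double-a {:metamorph/data example-ds
             :metamorph/mode :fit}))

(vec (ds/column doubled :a))
(vec (ds/column (:metamorph/data ctx-out) :a))
```

Both forms should yield the same `:a` column, which is why the signatures in this namespace omit the dataset argument.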


update-columns

(update-columns columns-map)
(update-columns columns-selector update-functions)

update-columnwise

(update-columnwise filter-fn-or-ds cwise-update-fn & args)

Call cwise-update-fn on each column of the filtered dataset. Returns the dataset. For filter-fn-or-ds, see the arguments to update.


update-elemwise

(update-elemwise map-fn)
(update-elemwise filter-fn-or-ds map-fn)

Replace all elements in the selected columns by calling map-fn on each element. filter-fn-or-ds follows the same rules as in update; if it is given as a sequence, it must be a sequence of column names. Implicitly clears the missing set, so map-fn must deal with type-specific missing values correctly. Returns a new dataset.


value-reader

(value-reader)

Return a reader that produces a reader of column values per index.

Options:

  • :missing-nil? - Defaults to true - Substitute nil for missing values so that downstream missing value detection is independent of the column datatype.
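To make "a reader of column values per index" concrete, the sketch below reads a small dataset row by row. It uses the dataset-level function `ds/value-reader` (with the dataset as its first argument) and a made-up `example-ds` whose second row has no `:b` entry; both are assumptions made for illustration.

```clojure
(require '[tech.v3.dataset :as ds])

;; Hypothetical example data; the second row is missing :b.
(def example-ds
  (ds/->dataset [{:a 1 :b "x"}
                 {:a 2}]))

;; Each element of the outer reader is itself a reader holding one row's
;; values in column order. With the default :missing-nil? of true, the
;; missing :b value is read back as nil.
(def rows
  (mapv vec (ds/value-reader example-ds)))

rows
```

Because missing values come back as nil, downstream code can test for them with `nil?` regardless of the column datatype.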


write!

(write! output-path)
(write! output-path options)

Write a dataset out to a file. Supported forms are:

```clojure
(ds/write! test-ds "test.csv")
(ds/write! test-ds "test.tsv")
(ds/write! test-ds "test.tsv.gz")
(ds/write! test-ds "test.nippy")
(ds/write! test-ds out-stream)
```

Options:

  • :max-chars-per-column - csv,tsv specific, defaults to 65536 - values longer than this will cause an exception during serialization.
  • :max-num-columns - csv,tsv specific, defaults to 8192 - if the dataset has more than this number of columns, an exception will be thrown during serialization.
  • :quoted-columns - csv specific - sequence of column names that you would like to always have quoted.
  • :file-type - Manually specify the file type. This is usually inferred from the filename, but if you pass in an output stream you will need to specify the file type.
  • :headers? - whether csv headers are written, defaults to true.
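A minimal round-trip sketch of the path-based form with an options map, assuming the `ds` alias for `tech.v3.dataset`, a made-up `example-ds`, and a temporary file created only for this illustration. The file type is inferred from the `.csv` suffix of the temporary file name.

```clojure
(require '[tech.v3.dataset :as ds])

;; Hypothetical example data.
(def example-ds
  (ds/->dataset [{:a 1 :b "x"}
                 {:a 2 :b "y"}]))

;; Temporary output file; the ".csv" suffix lets write! infer the file type.
(def out-file
  (java.io.File/createTempFile "write-example" ".csv"))

;; :headers? is spelled out here although true is already the default.
(ds/write! example-ds (.getPath out-file) {:headers? true})

;; Read the file back to check that the shape survived the round trip.
(def round-trip
  (ds/->dataset (.getPath out-file)))

[(ds/row-count round-trip) (ds/column-count round-trip)]
```

When writing to an output stream instead of a path, there is no filename to inspect, which is why `:file-type` has to be supplied in that case.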

write-nippy!

(write-nippy! filename)
