
tech.ml.dataset

Column-major dataset abstraction for efficiently manipulating in-memory datasets.

->>dataset

(->>dataset dataset)
(->>dataset options dataset)

Please see documentation of ->dataset. Options are the same.

->dataset

(->dataset dataset)
(->dataset dataset {:keys [table-name dataset-name] :as options})

Create a dataset from either csv/tsv or a sequence of maps.

  • A String or InputStream will be interpreted as a file (or gzipped file if it ends with .gz) of tsv or csv data. The system will attempt to autodetect whether the data is csv or tsv and to detect the column datatypes, all of which can be overridden.
  • A sequence of maps may be passed in, in which case the first N maps are scanned in order to derive the column datatypes before the actual columns are created.

Options:

  • :dataset-name - set the name of the dataset.
  • :file-type - Override filetype discovery mechanism for strings or force a particular parser for an input stream. Note that arrow and parquet must have paths on disk and cannot currently load from input stream. Acceptable file types are: #{:csv :tsv :xlsx :xls :arrow :parquet}.
  • :gzipped? - for file formats that support it, override autodetection and force creation of a gzipped input stream as opposed to a normal input stream.
  • :column-whitelist - either sequence of string column names or sequence of column indices of columns to whitelist.
  • :column-blacklist - either sequence of string column names or sequence of column indices of columns to blacklist.
  • :num-rows - Number of rows to read
  • :header-row? - Defaults to true, indicates the first row is a header.
  • :key-fn - function to be applied to column names. Typical use is: :key-fn keyword.
  • :separator - Add a character separator to the list of separators to auto-detect.
  • :csv-parser - Implementation of univocity's AbstractParser to use. If not provided, a default permissive parser is used, so you can parse anything that univocity supports (flat files and such).
  • :bad-row-policy - One of three options: :skip, :error, :carry-on. Defaults to :carry-on. Some csv data has ragged rows and in this case we have several options. If the option is :carry-on then we either create a new column or add missing values for columns that had no data for that row.
  • :skip-bad-rows? - Legacy option. Use :bad-row-policy.
  • :max-chars-per-column - Defaults to 4096. Columns with more characters than this will result in an exception.
  • :max-num-columns - Defaults to 8192. CSV,TSV files with more columns than this will fail to parse. For more information on this option, please visit: https://github.com/uniVocity/univocity-parsers/issues/301
  • :n-initial-skip-rows - Skip N rows initially. This currently may include the header row. Works across both csv and spreadsheet datasets.
  • :parser-fn -
    • keyword? - all columns parsed to this datatype
    • ifn? - called with two arguments: (parser-fn column-name-or-idx column-data) - The return value must implement tech.ml.dataset.parser.PColumnParser, in which case it is used, or may be nil, in which case the default column parser is used.
    • tuple - pair of [datatype parse-data] in which case a container of type [datatype] will be created. parse-data can be one of:
      • :relaxed? - data will be parsed such that parse failures of the standard parse functions do not stop the parsing process. :unparsed-values and :unparsed-indexes are available in the metadata of the column that tell you the values that failed to parse and their respective indexes.
      • fn? - function from str-> one of :tech.ml.dataset.parser/missing, :tech.ml.dataset.parser/parse-failure, or the parsed value. Exceptions here always kill the parse process. :missing will get marked in the missing indexes, and :parse-failure will result in the index being added to missing and the column's :unparsed-values and :unparsed-indexes being updated.
      • string? - for datetime types, this will be turned into a DateTimeFormatter via DateTimeFormatter/ofPattern. For encoded-text, this has to be a valid argument to Charset/forName.
      • DateTimeFormatter - use with the appropriate temporal parse static function to parse the value.
      • :encoded-text datatype - parse-data can be a string, a java.nio.charset.Charset, or an implementation of tech.ml.dataset.text/PEncodingToFn. If you want to serialize this format to nippy your encoding had better be nippy serializable (defrecords always are).
  • map? - the header-name-or-idx is used to look up the value. If not nil, then the value can be any of the above options. Else the default column parser is used.
  • :parser-scan-len - Length of initial column data used for parser-fn's datatype detection routine. Defaults to 100.

Returns a new dataset
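
A few hedged usage sketches (the file path and column names below are hypothetical, not part of the docstring):

```clojure
(require '[tech.ml.dataset :as ds])

;; From a csv file, keywordizing column names (path is illustrative):
(def flights (ds/->dataset "data/flights.csv" {:key-fn keyword}))

;; From a sequence of maps; the first :parser-scan-len maps are
;; scanned to infer column datatypes:
(def tiny (ds/->dataset [{:a 1 :b "x"} {:a 2 :b "y"}]))

;; Forcing a datatype for one column via the map form of :parser-fn:
(def typed (ds/->dataset "data/flights.csv"
                         {:parser-fn {"arrival-time" :float32}}))
```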

->flyweight

(->flyweight dataset
             &
             {:keys [column-name-seq error-on-missing-values? number->string?]
              :or {error-on-missing-values? true}})

Convert dataset to seq-of-maps dataset. Flag indicates if errors should be thrown on missing values or if nil should be inserted in the map. If the dataset has a label and number->string? is true then columns that have been converted from categorical to numeric will be reverse-mapped back to string columns.

->k-fold-datasets

(->k-fold-datasets dataset k)
(->k-fold-datasets
  dataset
  k
  {:keys [randomize-dataset?] :or {randomize-dataset? true} :as options})

Given 1 dataset, prepare K datasets using the k-fold algorithm. Randomize dataset defaults to true, which will realize the entire dataset, so use with care if you have large datasets.
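
A minimal sketch, assuming `ds` is an existing dataset:

```clojure
;; Split into 5 train/test folds:
(def folds (ds/->k-fold-datasets ds 5))

;; Keep row order (avoids realizing the whole dataset):
(def ordered-folds (ds/->k-fold-datasets ds 5 {:randomize-dataset? false}))
```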

->row-major

(->row-major dataset)
(->row-major dataset options)
(->row-major dataset
             key-colname-seq-map
             {:keys [datatype] :or {datatype :float64}})

Given a dataset and a map of desired key names to sequences of columns, produce a sequence of maps where each key name points to a contiguous vector composed of the column values concatenated. If colname-seq-map is not provided then each row defaults to {:features [feature-columns] :label [label-columns]}
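
A hedged sketch using invented column names:

```clojure
;; Each returned map holds a contiguous :float64 vector per key.
(ds/->row-major ds
                {:features [:sepal-length :sepal-width]
                 :label    [:species]}
                {:datatype :float64})
```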

->sort-by

(->sort-by dataset key-fn)
(->sort-by dataset key-fn compare-fn)
(->sort-by dataset key-fn compare-fn column-name-seq)

Version of sort-by used in -> statements common in dataflows

->sort-by-column

(->sort-by-column dataset colname)
(->sort-by-column dataset colname compare-fn)

sort-by-column used in -> dataflows

->train-test-split

(->train-test-split dataset)
(->train-test-split dataset
                    {:keys [randomize-dataset? train-fraction]
                     :or {randomize-dataset? true train-fraction 0.7}
                     :as options})

add-column

(add-column dataset column)

Add a new column. Error if name collision

add-or-update-column

(add-or-update-column dataset column)
(add-or-update-column dataset colname column)

If column exists, replace. Else append new column.

aggregate-by

(aggregate-by map-fn
              dataset
              &
              {:keys [column-name-seq numeric-aggregate-fn boolean-aggregate-fn
                      default-aggregate-fn count-column-name]
               :or {numeric-aggregate-fn dfn/reduce-+
                    boolean-aggregate-fn count-true
                    default-aggregate-fn first}})

Group the dataset by map-fn, then aggregate by the aggregate fn. Returns the aggregated dataset. :aggregate-fn - passed a sequence of columns and must return a new column with the same number of entries as the count of the column sequences.
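
A hedged sketch, assuming map-fn receives each row as a map and returns a grouping key (column name :a is invented):

```clojure
;; Group rows by the parity of :a; numeric columns are then
;; summed via the default dfn/reduce-+ aggregate.
(ds/aggregate-by (fn [row] (even? (:a row)))
                 ds
                 :column-name-seq [:a :b])
```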

aggregate-by-column

(aggregate-by-column colname
                     dataset
                     &
                     {:keys [numeric-aggregate-fn boolean-aggregate-fn
                             default-aggregate-fn count-column-name]
                      :or {numeric-aggregate-fn dfn/reduce-+
                           boolean-aggregate-fn count-true
                           default-aggregate-fn first}})

Group the dataset by colname, then aggregate by the aggregate fn. Returns the aggregated dataset. :aggregate-fn - passed a sequence of columns and must return a new column with the same number of entries as the count of the column sequences.

all-descriptive-stats-names

(all-descriptive-stats-names)

Returns the names of all descriptive stats in the order they will be returned in the resulting dataset of descriptive stats. This allows easy filtering in the form of:

  (descriptive-stats ds {:stat-names (->> (all-descriptive-stats-names)
                                          (remove #{:values :num-distinct-values}))})

append-columns

(append-columns dataset column-seq)

assoc

(assoc dataset colname coldata)
(assoc dataset colname coldata & more)

If column exists, replace. Else append new column. The datatype of the new column will be the datatype of the coldata.

coldata may be a sequence in which case 'vec' will be called and the datatype will be :object.

coldata may be a reader, in which case the datatype will be the datatype of the reader. One way to make a reader is tech.v2.datatype/make-reader or anything deriving from java.util.List and java.util.RandomAccess will do.

coldata may also be a new column (tech.ml.dataset.column/new-column) in which case the missing set and the column metadata can be provided.
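
A hedged sketch (column names invented; make-reader usage is an assumption about tech.v2.datatype):

```clojure
;; Works like clojure.core/assoc: replace-or-append a column.
(def ds2 (ds/assoc ds :c [10 20 30]))   ; vector -> :object column

;; A reader keeps its datatype:
(require '[tech.v2.datatype :as dtype])
(def ds3 (ds/assoc ds :d (dtype/make-reader :float64 3 (* idx 2.0))))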

brief

(brief ds)
(brief ds options)

Get a brief description, in mapseq form of a dataset. A brief description is the mapseq form of descriptive stats.

column

(column dataset column-name)

Return the column or throw if it doesn't exist.

column->dataset

(column->dataset dataset colname transform-fn)
(column->dataset dataset colname transform-fn options)

Transform a column into a sequence of maps using transform-fn. Return dataset created out of the sequence of maps.

column-cast

(column-cast dataset colname datatype)

Cast a column to a new datatype. This is never a lazy operation. If the old and new datatypes match and no cast-fn is provided then dtype/clone is called on the column.

colname may be a scalar or a tuple of [src-col dst-col].

datatype may be a datatype enumeration or a tuple of [datatype cast-fn] where cast-fn may return either a new value, :tech.ml.dataset.parse/missing, or :tech.ml.dataset.parse/parse-failure. Exceptions are propagated to the caller. The new column has at least the existing missing set (if no attempt returns :missing or :cast-failure). :cast-failure means the value gets added to metadata key :unparsed-data and the index gets added to :unparsed-indexes.

If the existing datatype is string, then tech.ml.datatype.column/parse-column is called.

Casts between numeric datatypes need no cast-fn but one may be provided. Casts to string need no cast-fn but one may be provided. Casts from string to anything will call tech.ml.dataset.column/parse-column.
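
For illustration, two hedged sketches (column names are invented):

```clojure
;; Cast :a to float64, writing into a new column :a-f64:
(ds/column-cast ds [:a :a-f64] :float64)

;; Custom cast-fn that flags non-positive values as parse failures:
(ds/column-cast ds :a
                [:float64 (fn [v]
                            (if (pos? v)
                              (double v)
                              :tech.ml.dataset.parse/parse-failure))])
```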

column-count

(column-count dataset)

column-label-map

(column-label-map dataset column-name)

column-labeled-mapseq

(column-labeled-mapseq dataset value-colname-seq)

Given a dataset, return a sequence of maps where several columns are all stored in a :value key and a :label key contains a column name. Used for quickly creating timeseries or scatterplot labeled graphs. Returns a lazy sequence, not a reader!

See also columnwise-concat

Return a sequence of maps with

  {... - columns not in colname-seq
   :value - value from one of the value columns
   :label - name of the column the value came from
  }

column-map

(column-map dataset result-colname map-fn colname & colnames)

Produce a new column as the result of mapping a fn over other columns. The result column will have a datatype of the widened combination of all the input column datatypes. The result column's missing indexes is the union of all input columns.
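
A minimal hedged sketch (column names invented):

```clojure
;; New :c column summing :a and :b; the result datatype is the
;; widened combination and the missing set is the union.
(ds/column-map ds :c + :a :b)
```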

column-name->column-map

(column-name->column-map datatypes)

clojure map of column-name->column

column-names

(column-names dataset)

In-order sequence of column names

column-values->categorical

(column-values->categorical dataset src-column)

Given a column encoded via either string->number or one-hot, reverse map to a sequence of the original string column values.

columns

(columns dataset)

Return sequence of all columns in dataset.

columns-with-missing-seq

(columns-with-missing-seq dataset)

Return a sequence of:

  {:column-name column-name
   :missing-count missing-count
  }

or nil if no columns are missing data.

columnwise-concat

(columnwise-concat dataset colnames)
(columnwise-concat dataset
                   colnames
                   {:keys [value-column-name colname-column-name]
                    :or {value-column-name :value colname-column-name :column}
                    :as _options})

Given a dataset and a list of columns, produce a new dataset with the columns concatenated to a new column with a :column column indicating which column the original value came from. Any columns not mentioned in the list of columns are duplicated.

Example:

user> (-> [{:a 1 :b 2 :c 3 :d 1} {:a 4 :b 5 :c 6 :d 2}]
          (ds/->dataset)
          (ds/columnwise-concat [:c :a :b]))
null [6 3]:

| :column | :value | :d |
|---------+--------+----|
|      :c |      3 |  1 |
|      :c |      6 |  2 |
|      :a |      1 |  1 |
|      :a |      4 |  2 |
|      :b |      2 |  1 |
|      :b |      5 |  2 |

Options:

value-column-name - defaults to :value
colname-column-name - defaults to :column

compute-centroid-and-global-means

(compute-centroid-and-global-means dataset row-major-centroids)

Return a map of:
centroid-means - centroid-index -> (double array) column means.
global-means - global means (double array) for the dataset.

concat

(concat dataset & datasets)

Concatenate datasets in place. Respects missing values. Datasets must all have the same columns. Result column datatypes will be a widening cast of the datatypes. Also see concat-copying as this may be faster in many situations.
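
A hedged sketch (the dataset names are illustrative):

```clojure
;; Lazy, in-place concatenation of datasets with identical columns:
(def all (ds/concat train-ds test-ds))

;; If the result will be scanned many times, a copying concat
;; may be faster:
(def all-copied (ds/concat-copying train-ds test-ds))
```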

concat-copying

(concat-copying dataset & datasets)

Concatenate datasets into a new dataset copying data. Respects missing values. Datasets must all have the same columns. Result column datatypes will be a widening cast of the datatypes.

concat-inplace

(concat-inplace dataset & datasets)

Concatenate datasets in place. Respects missing values. Datasets must all have the same columns. Result column datatypes will be a widening cast of the datatypes.

correlation-table

(correlation-table dataset & {:keys [correlation-type colname-seq]})

Return a map of colname->list of sorted tuple of [colname, coefficient]. Sort is: (sort-by (comp #(Math/abs (double %)) second) >)

Thus the first entry is: [colname, 1.0]

There are three possible correlation types: :pearson :spearman :kendall

:pearson is the default.

data->dataset

(data->dataset {:keys [metadata columns]})

Convert a data-ized dataset created via dataset->data back into a full dataset

dataset->data

(dataset->data ds)

Convert a dataset to a pure clojure datastructure. Returns a map with two keys: {:metadata :columns}.

:columns is a vector of column definitions appropriate for passing directly back into new-dataset.

A column definition in this case is a map of {:name :missing :data :metadata}.
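
A minimal round-trip sketch using data->dataset, documented above:

```clojure
;; Dataset -> plain Clojure data -> dataset again.
(-> ds
    ds/dataset->data     ; {:metadata ... :columns [...]}
    ds/data->dataset)    ; back to a full dataset
```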

dataset->smile-dataframe

(dataset->smile-dataframe ds)

Convert a dataset to a smile dataframe.

This operation may clone columns if they aren't backed by java heap arrays. See ensure-array-backed

It is important to note that smile supports a subset of the functionality in tech.ml.dataset. One difference is smile columns have string column names and have no missing set.

Returns a smile.data.DataFrame

dataset->str

(dataset->str ds)
(dataset->str ds options)

Convert a dataset to a string. Prints a single line header and then calls dataset-data->str.

For options documentation see dataset-data->str.

dataset->string

Deprecated method. See dataset->str

dataset-data->str

(dataset-data->str dataset)
(dataset-data->str dataset options)

Convert the dataset values to a string.

Options may be provided in the dataset metadata or may be provided as an options map. The options map overrides the dataset metadata.

:print-index-range - The set of indexes to print. Defaults to:
  (range *default-table-row-print-length*)
:print-line-policy - defaults to :repl - one of:
  - :repl - multiline table - default nice printing for repl
  - :markdown - lines delimited by <br>
  - :single - Only print first line
:print-column-max-width - set the max width of a column when printing.

Example for conservative printing:

  tech.ml.dataset.github-test> (def ds (with-meta ds
                                         (assoc (meta ds)
                                                :print-column-max-width 25
                                                :print-line-policy :single)))

dataset-label-map

(dataset-label-map dataset)

dataset-name

(dataset-name dataset)

descriptive-stats

(descriptive-stats dataset)
(descriptive-stats dataset options)

Get descriptive statistics across the columns of the dataset, in addition to the standard stats.
Options:
:stat-names - defaults to (remove #{:values :num-distinct-values}
                                  (all-descriptive-stats-names))
:n-categorical-values - Number of categorical values to report in the 'values' field. Defaults to 21.

dissoc

(dissoc dataset colname)
(dissoc dataset colname & more)

Remove one or more columns from the dataset.

drop-columns

(drop-columns dataset col-name-seq)

Same as remove-columns

drop-missing

(drop-missing ds)

Drop rows with missing entries

drop-rows

(drop-rows dataset row-indexes)

Same as remove-rows.

source

ds-concatclj

(ds-concat dataset & other-datasets)

Legacy method. Please see concat

source

ds-take-nthclj

(ds-take-nth n-val dataset)

Legacy method. Please see take-nth

source

ensure-array-backedclj

(ensure-array-backed ds)
(ensure-array-backed ds {:keys [unpack?] :or {unpack? true}})

Ensure the column data in the dataset is stored in pure java arrays. This is sometimes necessary for interop with other libraries and this operation will force any lazy computations to complete. This also clears the missing set for each column and writes the missing values to the new arrays.

Columns that are already array backed and that have no missing values are not changed and returned.

The postcondition is that dtype/->array will return a java array in the appropriate datatype for each column.

options - :unpack? - unpack packed datetime types. Defaults to true.
source

feature-ecountclj

(feature-ecount dataset)

When columns aren't scalars then this will change. For now, just the number of feature columns.

source

fill-range-replaceclj

(fill-range-replace ds colname max-span)
(fill-range-replace ds colname max-span missing-strategy)
(fill-range-replace ds colname max-span missing-strategy missing-value)

Given an in-order column of a numeric or datetime type, fill in spans that are larger than the given max-span. The source column must not have missing values. For more documentation on fill-range, see tech.v2.datatype.function.fill-range.

If the column is a datetime type the operation happens in millisecond space and max-span may be a datetime type convertible to milliseconds.

The result column has the same datatype as the input column.

After the operation, if missing strategy is not nil the newly produced missing values along with the existing missing values will be replaced using the given missing strategy for all other columns. See tech.ml.dataset.missing/replace-missing for documentation on missing strategies. The missing strategy defaults to :down unless explicitly set.

Returns a new dataset.
source

filterclj

(filter predicate dataset)
(filter predicate column-name-seq dataset)

dataset->dataset transformation. Predicate is passed a map of colname->column-value.

source
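
An illustrative sketch (the `prices` data is made up). The predicate receives a map of colname->column-value, and the dataset comes last so ->> threading works:

```clojure
(require '[tech.ml.dataset :as ds])

(def prices (ds/->dataset [{:symbol "MSFT" :price 23.2}
                           {:symbol "AAPL" :price 130.0}]))

;; Keep rows whose :price is above 100.  Passing [:price] as the
;; column-name-seq limits which columns are realized into the map.
(ds/filter #(> (:price %) 100.0) [:price] prices)
```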

filter-columnclj

(filter-column predicate colname dataset)

Filter a given column by a predicate. The predicate is passed column values. If the predicate is not an instance of IFn it is treated as a value and used as if the predicate were #(= value %). Returns a dataset.
source

from-prototypeclj

(from-prototype dataset table-name column-seq)

Create a new dataset that is the same type as this one but with a potentially different table name and column sequence. Take care that the columns are all of the correct type.

source

g-meansclj

(g-means dataset & [max-k error-on-missing?])

g-means. Not NaN aware; missing is an error. Returns an array of centroids in row-major array-of-array-of-doubles format.
source

group-byclj

(group-by key-fn dataset)
(group-by key-fn column-name-seq dataset)

Produce a map of key-fn-value->dataset. key-fn is a function taking a map of colname->column-value. Selecting which columns are used in the key-fn using column-name-seq is optional but will greatly improve performance.

source
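
A hedged sketch of grouping by one column; the data is invented. A keyword works as the key-fn because it is applied to the row map:

```clojure
(require '[tech.ml.dataset :as ds])

(def prices (ds/->dataset [{:symbol "MSFT" :price 23.2}
                           {:symbol "MSFT" :price 25.1}
                           {:symbol "AAPL" :price 130.0}]))

;; Map of symbol -> sub-dataset.  Naming the key columns up front
;; avoids realizing every column into the key-fn's row map.
(ds/group-by :symbol [:symbol] prices)
```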

group-by->indexesclj

(group-by->indexes key-fn dataset)
(group-by->indexes key-fn column-name-seq dataset)
source

group-by-columnclj

(group-by-column colname dataset)

Return a map of column-value->dataset.

source

group-by-column->indexesclj

(group-by-column->indexes colname dataset)
source

has-column-label-map?clj

(has-column-label-map? dataset column-name)
source

has-column?clj

(has-column? dataset column-name)
source

hash-joinclj

(hash-join colname lhs rhs)
(hash-join colname
           lhs
           rhs
           {:keys [operation-space] :or {operation-space :int32} :as options})

Join by column. For efficiency, lhs should be smaller than rhs.

colname - may be a single item or a tuple which destructures as: (let [[lhs-colname rhs-colname] colname] ...)

An options map can be passed in with optional arguments:

  • :lhs-missing? - Calculate the missing lhs indexes and left outer join table.
  • :rhs-missing? - Calculate the missing rhs indexes and right outer join table.
  • :operation-space - either :int32 or :int64. Defaults to :int32.

Returns
{:join-table - joined-table
 :lhs-indexes - matched lhs indexes
 :rhs-indexes - matched rhs indexes
 ;; -- when rhs-missing? is true --
 :rhs-missing - missing indexes of rhs.
 :rhs-outer-join - rhs outer join table.
 ;; -- when lhs-missing? is true --
 :lhs-missing - missing indexes of lhs.
 :lhs-outer-join - lhs outer join table.}
source
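
A hedged sketch of both calling forms; the `lhs`/`rhs` data and the renamed `:ident` column are illustrative:

```clojure
(require '[tech.ml.dataset :as ds])

(def lhs (ds/->dataset [{:id 1 :name "a"} {:id 2 :name "b"}]))
(def rhs (ds/->dataset [{:id 2 :score 10} {:id 3 :score 20}]))

;; Request the outer-join tables alongside the matched indexes.
(ds/hash-join :id lhs rhs {:lhs-missing? true :rhs-missing? true})

;; Tuple form when the key columns are named differently.
(ds/hash-join [:id :ident] lhs (ds/rename-columns rhs {:id :ident}))
```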

headclj

(head dataset)
(head n dataset)

Get the first n rows of a dataset. Equivalent to `(select-rows ds (range n))`. Arguments are reversed, however, so this can be used in ->> operators.
source
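
A small sketch of the threading style the argument order enables (the generated data is illustrative):

```clojure
(require '[tech.ml.dataset :as ds])

;; Because the dataset is the last argument, head threads cleanly:
(->> (ds/->dataset (map (fn [i] {:x i}) (range 100)))
     (ds/head 5))
```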

impute-missing-by-centroid-averagesclj

(impute-missing-by-centroid-averages dataset
                                     row-major-centroids
                                     {:keys [centroid-means global-means]})

Impute missing columns by first grouping by nearest centroids and then computing the mean. In the case where the grouping for a given centroid contains all NaN's, use the global dataset mean. In the case where this is NaN, this algorithm will fail to replace the missing values with meaningful values. Return a new dataset.

source

inference-target-column-namesclj

(inference-target-column-names ds)
source

inference-target-label-inverse-mapclj

(inference-target-label-inverse-map dataset & [label-columns])

Given options generated during ETL operations and annotated with a :label-columns sequence containing 1 label column, generate a reverse map that maps from a dataset value back to the label that generated that value.
source

inference-target-label-mapclj

(inference-target-label-map dataset & [label-columns])
source

inner-joinclj

(inner-join colname lhs rhs)
(inner-join colname lhs rhs options)

Inner join by column. For efficiency, lhs should be smaller than rhs.

colname - may be a single item or a tuple which destructures as: (let [[lhs-colname rhs-colname] colname] ...)

An options map can be passed in with optional arguments:

  • :operation-space - either :int32 or :int64. Defaults to :int32.

Returns the joined table.
source

interpolate-loessclj

(interpolate-loess ds x-colname y-colname)
(interpolate-loess ds
                   x-colname
                   y-colname
                   {:keys [bandwidth iterations accuracy result-name]
                    :or {bandwidth 0.75
                         iterations 4
                         accuracy LoessInterpolator/DEFAULT_ACCURACY}})

Interpolate using the LOESS regression engine. Useful for smoothing out graphs.

source

invert-string->numberclj

(invert-string->number ds)

When ds-pipe/string->number is called it creates label maps. This reverts the dataset back to those labels. Currently results in object columns so a cast operation may be needed to convert to desired datatype.

source

k-meansclj

(k-means dataset & [k max-iterations num-runs error-on-missing? tolerance])

NaN-aware k-means. Returns an array of centroids in row-major array-of-array-of-doubles format.
source

labelsclj

(labels dataset)

Given a dataset and an options map, generate a sequence of label values. If the label count is 1 and there is a label map associated with the column, generate the sequence of labels by reverse mapping the column(s) back to the original dataset values. If there are multiple label columns, the results are presented as a dataset.

Returns a reader of labels.
source

left-joinclj

(left-join colname lhs rhs)
(left-join colname lhs rhs options)

Left join by column. For efficiency, lhs should be smaller than rhs.

colname - may be a single item or a tuple which destructures as: (let [[lhs-colname rhs-colname] colname] ...)

An options map can be passed in with optional arguments:

  • :operation-space - either :int32 or :int64. Defaults to :int32.

Returns the joined table.
source

left-join-asofclj

(left-join-asof colname lhs rhs)
(left-join-asof colname lhs rhs {:keys [asof-op] :or {asof-op :<=} :as options})

Perform a left join asof. Similar to a left join except this will join on the nearest value. lhs and rhs must be sorted by the join column. Join columns must be either datetime columns, in which case the join happens in millisecond space, or they must be numeric - integer or floating point datatypes.

options:

  • :asof-op - may be one of [:< :<= :nearest :>= :>] - type of join operation. Defaults to :<=.
source
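
A hedged sketch of the classic quotes/trades use: for each lhs row, take the rhs row at-or-before it. The data and column names are invented:

```clojure
(require '[tech.ml.dataset :as ds])

;; Both sides must already be sorted by the join column :t.
(def trades (ds/->dataset [{:t 2 :qty 10} {:t 8 :qty 20}]))
(def quotes (ds/->dataset [{:t 1 :bid 100} {:t 5 :bid 101} {:t 9 :bid 102}]))

;; Default :asof-op is :<= - the nearest quote at or before each trade.
(ds/left-join-asof :t trades quotes)
```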

mapseq-readerclj

(mapseq-reader dataset)
(mapseq-reader dataset options)

Return a reader that produces a map of column-name->column-value

Options:
:missing-nil? - Defaults to true - substitute nil for missing values so that downstream missing-value detection is column-datatype independent.
source

maybe-columnclj

(maybe-column dataset column-name)

Return the column if it exists, else nil.
source

metadataclj

(metadata dataset)
source

missingclj

(missing dataset)
source

model-typeclj

(model-type dataset & [column-name-seq])

Check the label column after dataset processing. Return either :regression or :classification.
source

n-feature-permutationsclj

(n-feature-permutations n dataset)

Given a dataset with at least one inference target column, produce all datasets with n feature columns and the label columns.

source

n-permutationsclj

(n-permutations n dataset)

Return n datasets with all possible permutations of n of the columns. n must be less than (column-count dataset).
source

name-values-seq->datasetclj

(name-values-seq->dataset name-values-seq & {:as options})

Given a sequence of [name data-seq], produce columns. If data-seq is of unknown (:object) datatype, the first item is checked: if it is a number, doubles are used; if it is a string, strings are used for the column datatype. All sequences must be the same length.

Returns a new dataset.
source
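
A hedged sketch of the [name data-seq] form; the column names and the :dataset-name value are illustrative:

```clojure
(require '[tech.ml.dataset :as ds])

;; Column names paired with equal-length value sequences.
(ds/name-values-seq->dataset
 [["x" (range 5)]
  ["y" ["a" "b" "c" "d" "e"]]]
 :dataset-name "xy")
```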

new-columnclj

(new-column dataset column-name values)

Create a new column from some values

source

new-datasetclj

(new-dataset column-seq)
(new-dataset options column-seq)
(new-dataset options ds-metadata column-seq)

Create a new dataset from a sequence of columns. Data will be converted into columns using ds-col-proto/ensure-column-seq. If the column seq is simply a collection of vectors, for instance, columns will be named ordinally.

options map -

  • :dataset-name - Name of the dataset. Defaults to "_unnamed".
  • :key-fn - Key function used on all column names before insertion into dataset.

The return value fulfills the dataset protocols.
source

num-inference-classesclj

(num-inference-classes dataset)

Given a dataset and correctly built options from pipeline operations, return the number of classes used for the label. Error if not classification dataset.

source

order-column-namesclj

(order-column-names dataset colname-seq)

Order a sequence of column names so they match the order in the original dataset. Missing columns are placed last.
source

parallelized-load-csvclj

(parallelized-load-csv input)
(parallelized-load-csv input options)

Load a csv, distributing rows between N different datasets. Concat them at the end and return the final dataset. Loads data into an out-of-order dataset.

Type-hinting your columns and providing specific parsers for datetime types like: (ds/->dataset input {:parser-fn {"date" [:packed-local-date "yyyy-MM-dd"]}}) may have a larger effect than parallelization in most cases.

Loading multiple files in parallel will also have a larger effect than single-file parallelization in most cases.
source

rand-nthclj

(rand-nth dataset)

Return a random row from the dataset in map format

source

reduce-column-namesclj

(reduce-column-names dataset colname-seq)

Reverse map from the one-hot encoded columns to the original source column.

source

remove-columnclj

(remove-column dataset col-name)

Remove a single column. Fails quietly if the column does not exist.
source

remove-columnsclj

(remove-columns dataset colname-seq)

Same as drop-columns

source

remove-rowsclj

(remove-rows dataset row-indexes)

Same as drop-rows.

source

rename-columnsclj

(rename-columns dataset colname-map)

Rename columns using a map. Does not reorder columns.

source

replace-missingclj

(replace-missing ds)
(replace-missing ds strategy)
(replace-missing ds columns-selector strategy)
(replace-missing ds columns-selector strategy value)

Replace missing values in some columns with a given strategy. The columns selector may be any legal argument to select-columns. Strategies may be:

  • :down - take value from previous non-missing row if possible else use next non-missing row.
  • :up - take value from next non-missing row if possible else use previous non-missing row.
  • :mid - Use midpoint of averaged values between previous and next nonmissing rows.
  • :lerp - Linearly interpolate values between previous and next nonmissing rows.
  • :value - use the provided value (see the value argument). The value may be a function, in which case it is called on the column with missing values elided and the return value is used as the filler.
source
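
A hedged sketch of two strategies side by side; the `sparse` data is invented:

```clojure
(require '[tech.ml.dataset :as ds])

(def sparse (ds/->dataset [{:a 1 :b nil} {:a nil :b 2} {:a 3 :b 4}]))

;; Fill :a downward - previous non-missing value where possible.
(ds/replace-missing sparse [:a] :down)

;; Replace missing :b entries with a fixed value.
(ds/replace-missing sparse [:b] :value 0)
```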

reverse-map-categorical-columnsclj

(reverse-map-categorical-columns dataset {:keys [column-name-seq]})

Given a dataset where we have converted columns from a categorical representation to either a numeric representation or a one-hot representation, reverse map back to the original dataset given the reverse mapping of label->number in the column's metadata.
source

right-joinclj

(right-join colname lhs rhs)
(right-join colname lhs rhs options)

Right join by column. For efficiency, lhs should be smaller than rhs.

colname - may be a single item or a tuple which destructures as: (let [[lhs-colname rhs-colname] colname] ...)

An options map can be passed in with optional arguments:

  • :operation-space - either :int32 or :int64. Defaults to :int32.

Returns the joined table.
source

row-countclj

(row-count dataset)
source

sampleclj

(sample dataset)
(sample n dataset)
(sample n replacement? dataset)

Sample n-rows from a dataset. Defaults to sampling without replacement.

source

selectclj

(select dataset colname-seq index-seq)

Reorder/trim dataset according to this sequence of indexes. Returns a new dataset. colname-seq - one of:

  • :all - all the columns.
  • sequence of column names - those columns in that order.
  • implementation of java.util.Map - column order is dictated by map iteration order; selected columns are subsequently named after the corresponding value in the map. Similar to rename-columns except this trims the result to be only the columns in the map.

index-seq - either the keyword :all or a list of indexes. May contain duplicates.
source

select-columnsclj

(select-columns dataset col-name-seq)
source

select-columns-by-indexclj

(select-columns-by-index ds idx-seq)
source

select-missingclj

(select-missing ds)

Select only rows with missing values

source

select-rowsclj

(select-rows dataset row-indexes)
source

set-dataset-nameclj

(set-dataset-name dataset ds-name)
source

set-inference-targetclj

(set-inference-target dataset target-name-or-target-name-seq)
source

set-metadataclj

(set-metadata dataset meta-map)
source

shapeclj

(shape dataset)

Returns shape in row-major format of [n-columns n-rows].

source

shuffleclj

(shuffle dataset)
source

sort-byclj

(sort-by key-fn dataset)
(sort-by key-fn compare-fn dataset)
(sort-by key-fn compare-fn column-name-seq dataset)

Sort a dataset by a key-fn and compare-fn.

source

sort-by-columnclj

(sort-by-column colname dataset)
(sort-by-column colname compare-fn dataset)

Sort a dataset by a given column using the given compare fn.

source

tailclj

(tail dataset)
(tail n dataset)

Get the last n rows of a dataset. Equivalent to `(select-rows ds (range ...))`. Argument order is dataset-last, however, so this can be used in ->> operators.
source

take-nthclj

(take-nth n-val dataset)
source

unique-byclj

(unique-by map-fn dataset)
(unique-by map-fn
           {:keys [column-name-seq keep-fn]
            :or {keep-fn #(first %2)}
            :as _options}
           dataset)

Map-fn function gets passed map for each row, rows are grouped by the return value. Keep-fn is used to decide the index to keep.

:keep-fn - Function from key,idx-seq->idx. Defaults to #(first %2).

source
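
A hedged sketch of overriding :keep-fn to keep the last row per key instead of the first; the `dupes` data is invented:

```clojure
(require '[tech.ml.dataset :as ds])

(def dupes (ds/->dataset [{:id 1 :v 10} {:id 1 :v 20} {:id 2 :v 30}]))

;; One row per :id, keeping the last matching row index.
(ds/unique-by :id {:keep-fn (fn [_key idx-seq] (last idx-seq))} dupes)
```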

unique-by-columnclj

(unique-by-column colname dataset)
(unique-by-column colname
                  {:keys [keep-fn]
                   :or {keep-fn #(first %2)}
                   :as _options}
                  dataset)

Rows are grouped by the value of the given column. Keep-fn is used to decide the index to keep.

:keep-fn - Function from key, idx-seq->idx. Defaults to #(first %2).

source

unordered-selectclj

(unordered-select dataset colname-seq index-seq)

Perform a selection but use the order of the columns in the existing table; do not reorder the columns based on colname-seq. Useful when doing selection based on sets or persistent hash maps.

source

unroll-columnclj

(unroll-column dataset column-name)
(unroll-column dataset column-name options)

Unroll a column that has some (or all) sequential data as entries. Returns a new dataset with same columns but with other columns duplicated where the unroll happened. Column now contains only scalar data.

Any missing indexes are dropped.

user> (-> (ds/->dataset [{:a 1 :b [2 3]}
                         {:a 2 :b [4 5]}
                         {:a 3 :b :a}])
          (ds/unroll-column :b {:indexes? true}))
_unnamed [5 3]:

| :a | :b | :indexes |
|----+----+----------|
|  1 |  2 |        0 |
|  1 |  3 |        1 |
|  2 |  4 |        0 |
|  2 |  5 |        1 |
|  3 | :a |        0 |

Options:

  • :datatype - datatype of the resulting column if one aside from :object is desired.
  • :indexes? - If true, create a new column that records the indexes of the values from the original column. Can also be a truthy value (like a keyword) and the column will be named this.

source

update-columnclj

(update-column dataset col-name update-fn)

Update a column returning a new dataset. update-fn is a column->column transformation. Error if column does not exist.

source

update-columnsclj

(update-columns dataset column-name-seq update-fn)

Update a sequence of columns.

source

value-readerclj

(value-reader dataset)
(value-reader dataset options)

Return a reader that produces a reader of column values per index.

Options:
:missing-nil? - Defaults to true - substitute nil for missing values so that downstream missing-value detection is column-datatype independent.
source

write-csv!clj

(write-csv! ds output)
(write-csv! ds output options)

Write a dataset to a tsv or csv output stream. Closes output if a stream is passed in. File output format will be inferred if output is a string -

  • .csv, .tsv - switches between csv and tsv. Tsv is the default.
  • *.gz - write to a gzipped stream.

At this time writing to json is not supported.

options:

  • :separator - in case output isn't a string, you can use either \, or \tab to switch between csv or tsv output respectively.
source
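
A minimal sketch; the dataset contents and output filename are illustrative:

```clojure
(require '[tech.ml.dataset :as ds])

(def prices (ds/->dataset [{:symbol "MSFT" :price 23.2}]))

;; Format is inferred from the filename; a .gz suffix would gzip the output.
(ds/write-csv! prices "prices.tsv")
```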

x-meansclj

(x-means dataset & [max-k error-on-missing?])

x-means. Not NaN aware; missing is an error. Returns an array of centroids in row-major array-of-array-of-doubles format.
source
