Column-major dataset abstraction for efficiently manipulating in-memory datasets.
(->>dataset dataset)
(->>dataset options dataset)
Please see documentation of ->dataset. Options are the same.
(->dataset dataset)
(->dataset dataset {:keys [table-name dataset-name] :as options})
Create a dataset from either csv/tsv or a sequence of maps.

- A `String` or `InputStream` will be interpreted as a file (or gzipped file if it ends with .gz) of tsv or csv data. The system will attempt to autodetect whether the data is csv or tsv, and will likewise attempt to detect column datatypes; all of this can be overridden.
- A sequence of maps may be passed in, in which case the first N maps are scanned in order to derive the column datatypes before the actual columns are created.

Options:

- `:dataset-name` - set the name of the dataset.
- `:file-type` - Override the filetype discovery mechanism for strings, or force a particular parser for an input stream. Note that arrow and parquet must have paths on disk and cannot currently load from an input stream. Acceptable file types are: #{:csv :tsv :xlsx :xls :arrow :parquet}.
- `:gzipped?` - for file formats that support it, override autodetection and force creation of a gzipped input stream as opposed to a normal input stream.
- `:column-whitelist` - either a sequence of string column names or a sequence of column indices of columns to whitelist.
- `:column-blacklist` - either a sequence of string column names or a sequence of column indices of columns to blacklist.
- `:num-rows` - Number of rows to read.
- `:header-row?` - Defaults to true; indicates the first row is a header.
- `:key-fn` - function to be applied to column names. Typical use is `:key-fn keyword`.
- `:separator` - Add a character separator to the list of separators to auto-detect.
- `:csv-parser` - Implementation of univocity's AbstractParser to use. If not provided, a default permissive parser is used. This way you can parse anything univocity supports (so flat files and such).
- `:bad-row-policy` - One of three options: :skip, :error, :carry-on. Defaults to :carry-on. Some csv data has ragged rows, in which case we have several options. If the option is :carry-on then we either create a new column or add missing values for columns that had no data for that row.
- `:skip-bad-rows?` - Legacy option. Use :bad-row-policy.
- `:max-chars-per-column` - Defaults to 4096. Columns with more characters than this will result in an exception.
- `:max-num-columns` - Defaults to 8192. CSV/TSV files with more columns than this will fail to parse. For more information on this option, please visit: https://github.com/uniVocity/univocity-parsers/issues/301
- `:n-initial-skip-rows` - Skip N rows initially. This currently may include the header row. Works across both csv and spreadsheet datasets.
- `:parser-fn` -
  - `keyword?` - all columns parsed to this datatype.
  - `ifn?` - called with two arguments: (parser-fn column-name-or-idx column-data). The return value must implement tech.ml.dataset.parser.PColumnParser, in which case it is used, or may be nil, in which case the default column parser is used.
  - tuple - a pair of [datatype `parse-data`], in which case a container of type [datatype] will be created. `parse-data` can be one of:
    - `:relaxed?` - data will be parsed such that parse failures of the standard parse functions do not stop the parsing process. :unparsed-values and :unparsed-indexes are available in the metadata of the column and tell you the values that failed to parse and their respective indexes.
    - `fn?` - a function from str to one of `:tech.ml.dataset.parser/missing`, `:tech.ml.dataset.parser/parse-failure`, or the parsed value. Exceptions here always kill the parse process. :missing will get marked in the missing indexes, and :parse-failure will result in the index being added to missing and the column's :unparsed-values and :unparsed-indexes being updated.
    - `string?` - for datetime types, this will be turned into a DateTimeFormatter via DateTimeFormatter/ofPattern. For encoded-text, this has to be a valid argument to Charset/forName.
    - `DateTimeFormatter` - used with the appropriate temporal parse static function to parse the value.
    - :encoded-text datatype - `parse-data` can be a string, a java.nio.charset.Charset, or an implementation of tech.ml.dataset.text/PEncodingToFn. If you want to serialize this format to nippy your encoding had better be nippy serializable (defrecords always are).
  - `map?` - the header-name-or-idx is used to look up a value. If the value is not nil, it can be any of the above options. Else the default column parser is used.
- `:parser-scan-len` - Length of the initial column data used for parser-fn's datatype detection routine. Defaults to 100.

Returns a new dataset.
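A brief usage sketch, assuming the namespace is aliased as `ds`; the file name is hypothetical, and the `:parser-fn` form mirrors the example given under `parallelized-load-csv` below:

```clojure
(require '[tech.ml.dataset :as ds])

;; Load a csv, parsing the "date" column as a packed local date.
(def my-ds
  (ds/->dataset "data.csv"
                {:parser-fn {"date" [:packed-local-date "yyyy-MM-dd"]}}))

;; A dataset may also be built directly from a sequence of maps.
(def small-ds (ds/->dataset [{:a 1 :b "x"} {:a 2 :b "y"}]))
```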
(->flyweight dataset
&
{:keys [column-name-seq error-on-missing-values? number->string?]
:or {error-on-missing-values? true}})
Convert a dataset to a seq-of-maps dataset. The error-on-missing-values? flag indicates whether errors should be thrown on missing values or whether nil should be inserted in the map. If the dataset has a label and number->string? is true, then columns that have been converted from categorical to numeric will be reverse-mapped back to string columns.
(->k-fold-datasets dataset k)
(->k-fold-datasets
dataset
k
{:keys [randomize-dataset?] :or {randomize-dataset? true} :as options})
Given one dataset, prepare K datasets using the k-fold algorithm. randomize-dataset? defaults to true, which will realize the entire dataset, so use with care if you have large datasets.
(->row-major dataset)
(->row-major dataset options)
(->row-major dataset
key-colname-seq-map
{:keys [datatype] :or {datatype :float64}})
Given a dataset and a map of desired key names to sequences of columns, produce a sequence of maps where each key name points to a contiguous vector composed of the concatenated column values. If key-colname-seq-map is not provided, each row defaults to {:features [feature-columns] :label [label-columns]}.
(->sort-by dataset key-fn)
(->sort-by dataset key-fn compare-fn)
(->sort-by dataset key-fn compare-fn column-name-seq)
Version of sort-by used in -> statements common in dataflows
(->sort-by-column dataset colname)
(->sort-by-column dataset colname compare-fn)
sort-by-column used in -> dataflows
(->train-test-split dataset)
(->train-test-split dataset
{:keys [randomize-dataset? train-fraction]
:or {randomize-dataset? true train-fraction 0.7}
:as options})
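No docstring accompanies this signature; below is a minimal hedged sketch based only on the signature above (my-ds is the placeholder dataset from the ->dataset sketch):

```clojure
;; Split 80/20; randomize-dataset? defaults to true per the :or map.
(def split (ds/->train-test-split my-ds {:train-fraction 0.8}))
```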
(add-column dataset column)
Add a new column. Errors if there is a name collision.
(add-or-update-column dataset column)
(add-or-update-column dataset colname column)
If column exists, replace. Else append new column.
(aggregate-by map-fn
dataset
&
{:keys [column-name-seq numeric-aggregate-fn boolean-aggregate-fn
default-aggregate-fn count-column-name]
:or {numeric-aggregate-fn dfn/reduce-+
boolean-aggregate-fn count-true
default-aggregate-fn first}})
Group the dataset by map-fn, then aggregate by the aggregate fn. Returns the aggregated dataset. :aggregate-fn - passed a sequence of columns; must return a new column with the same number of entries as the count of the column sequences.
(aggregate-by-column colname
dataset
&
{:keys [numeric-aggregate-fn boolean-aggregate-fn
default-aggregate-fn count-column-name]
:or {numeric-aggregate-fn dfn/reduce-+
boolean-aggregate-fn count-true
default-aggregate-fn first}})
Group the dataset by the given column, then aggregate by the aggregate fn. Returns the aggregated dataset. :aggregate-fn - passed a sequence of columns; must return a new column with the same number of entries as the count of the column sequences.
(all-descriptive-stats-names)
Returns the names of all descriptive stats in the order they will be returned in the resulting dataset of descriptive stats. This allows easy filtering of the form: (descriptive-stats ds {:stat-names (->> (all-descriptive-stats-names) (remove #{:values :num-distinct-values}))})
(assoc dataset colname coldata)
(assoc dataset colname coldata & more)
If the column exists, replace it. Else append a new column. The datatype of the new column will be the datatype of the coldata.
coldata may be a sequence, in which case 'vec' will be called and the datatype will be :object.
coldata may be a reader, in which case the datatype will be the datatype of the reader. One way to make a reader is tech.v2.datatype/make-reader; anything deriving from java.util.List and java.util.RandomAccess will also do.
coldata may also be a new column (tech.ml.dataset.column/new-column), in which case the missing set and the column metadata can be provided.
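A minimal sketch of the first two coldata forms (the values are made up):

```clojure
(def ds1 (ds/->dataset [{:a 1} {:a 2} {:a 3}]))

;; A (non-random-access) sequence: 'vec' is called and the column
;; datatype becomes :object.
(def ds2 (ds/assoc ds1 :b (map str [:x :y :z])))

;; Anything deriving from java.util.List and java.util.RandomAccess acts
;; as a reader, so the column takes the reader's datatype.
(def ds3 (ds/assoc ds2 :c (java.util.ArrayList. [1.0 2.0 3.0])))
```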
(brief ds)
(brief ds options)
Get a brief description, in mapseq form, of a dataset. A brief description is the mapseq form of descriptive stats.
(column dataset column-name)
Return the column or throw if it doesn't exist.
(column->dataset dataset colname transform-fn)
(column->dataset dataset colname transform-fn options)
Transform a column into a sequence of maps using transform-fn. Return dataset created out of the sequence of maps.
(column-cast dataset colname datatype)
Cast a column to a new datatype. This is never a lazy operation. If the old and new datatypes match and no cast-fn is provided then dtype/clone is called on the column.
colname may be a scalar or a tuple of [src-col dst-col].
datatype may be a datatype enumeration or a tuple of [datatype cast-fn] where cast-fn may return either a new value, :tech.ml.dataset.parse/missing, or :tech.ml.dataset.parse/parse-failure. Exceptions are propagated to the caller. The new column has at least the existing missing set (if no attempt returns :missing or :cast-failure). :cast-failure means the value gets added to the metadata key :unparsed-data and the index gets added to :unparsed-indexes.
If the existing datatype is string, then tech.ml.dataset.column/parse-column is called.
Casts between numeric datatypes need no cast-fn, but one may be provided. Casts to string need no cast-fn, but one may be provided. Casts from string to anything will call tech.ml.dataset.column/parse-column.
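A short sketch; the column names here are hypothetical:

```clojure
;; String source columns go through tech.ml.dataset.column/parse-column.
(ds/column-cast my-ds "age" :int16)

;; A [src-col dst-col] tuple plus a [datatype cast-fn] tuple.
(ds/column-cast my-ds ["price" "price-cents"]
                [:int64 #(Math/round (* 100.0 (double %)))])
```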
(column-labeled-mapseq dataset value-colname-seq)
Given a dataset, return a sequence of maps where several columns are all stored in a :value key and a :label key contains a column name. Used for quickly creating timeseries or scatterplot labeled graphs. Returns a lazy sequence, not a reader!

See also `columnwise-concat`.

Returns a sequence of maps of the form:

```clojure
{... - columns not in colname-seq
 :value - value from one of the value columns
 :label - name of the column the value came from}
```
(column-map dataset result-colname map-fn colname & colnames)
Produce a new column as the result of mapping a fn over other columns. The result column will have a datatype that is the widened combination of all the input column datatypes. The result column's missing-index set is the union of those of all input columns.
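For instance, a minimal sketch (the column names are assumptions):

```clojure
;; New :total column from two numeric columns; the result datatype
;; widens across the inputs and the missing sets are unioned.
(ds/column-map my-ds :total + :price :quantity)
```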
(column-name->column-map datatypes)
Return a clojure map of column-name->column.
(column-names dataset)
In-order sequence of column names
(column-values->categorical dataset src-column)
Given a column encoded via either string->number or one-hot, reverse map to a sequence of the original string column values.
(columns dataset)
Return sequence of all columns in dataset.
(columns-with-missing-seq dataset)
Return a sequence of:

```clojure
{:column-name column-name
 :missing-count missing-count}
```

or nil if no columns are missing data.
(columnwise-concat dataset colnames)
(columnwise-concat dataset
colnames
{:keys [value-column-name colname-column-name]
:or {value-column-name :value colname-column-name :column}
:as _options})
Given a dataset and a list of columns, produce a new dataset with those columns concatenated into a new :value column, with a :column column indicating which column each original value came from. Any columns not mentioned in the list of columns are duplicated.

Example:

```clojure
user> (-> [{:a 1 :b 2 :c 3 :d 1} {:a 4 :b 5 :c 6 :d 2}]
          (ds/->dataset)
          (ds/columnwise-concat [:c :a :b]))
null [6 3]:

| :column | :value | :d |
|---------+--------+----|
|      :c |      3 |  1 |
|      :c |      6 |  2 |
|      :a |      1 |  1 |
|      :a |      4 |  2 |
|      :b |      2 |  1 |
|      :b |      5 |  2 |
```

Options:

- :value-column-name - defaults to :value
- :colname-column-name - defaults to :column
(compute-centroid-and-global-means dataset row-major-centroids)
Return a map of:
- :centroid-means - centroid-index -> (double array) column means.
- :global-means - global means (double array) for the dataset.
(concat dataset & datasets)
Concatenate datasets in place. Respects missing values. Datasets must all have the same columns. Result column datatypes will be a widening cast of the datatypes. Also see concat-copying as this may be faster in many situations.
(concat-copying dataset & datasets)
Concatenate datasets into a new dataset copying data. Respects missing values. Datasets must all have the same columns. Result column datatypes will be a widening cast of the datatypes.
(concat-inplace dataset & datasets)
Concatenate datasets in place. Respects missing values. Datasets must all have the same columns. Result column datatypes will be a widening cast of the datatypes.
(correlation-table dataset & {:keys [correlation-type colname-seq]})
Return a map of colname -> sorted list of [colname, coefficient] tuples. The sort is: (sort-by (comp #(Math/abs (double %)) second) >)
Thus the first entry is: [colname, 1.0]
There are three possible correlation types: :pearson, :spearman, and :kendall.
:pearson is the default.
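A small hedged sketch (the column names are assumptions):

```clojure
;; Spearman correlations restricted to a subset of columns.
(ds/correlation-table my-ds
                      :correlation-type :spearman
                      :colname-seq ["a" "b" "c"])
```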
(data->dataset {:keys [metadata columns]})
Convert a data-ized dataset created via dataset->data back into a full dataset
(dataset->data ds)
Convert a dataset to a pure clojure datastructure. Returns a map with two keys: {:metadata :columns}.
:columns is a vector of column definitions appropriate for passing directly back into new-dataset.
A column definition in this case is a map of {:name :missing :data :metadata}.
(dataset->smile-dataframe ds)
Convert a dataset to a smile dataframe.
This operation may clone columns if they aren't backed by java heap arrays. See ensure-array-backed
It is important to note that smile supports a subset of the functionality in tech.ml.dataset. One difference is smile columns have string column names and have no missing set.
Returns a smile.data.DataFrame
(dataset->str ds)
(dataset->str ds options)
Convert a dataset to a string. Prints a single line header and then calls dataset-data->str.
For options documentation see dataset-data->str.
Deprecated method. See dataset->str
(dataset-data->str dataset)
(dataset-data->str dataset options)
Convert the dataset values to a string.

Options may be provided in the dataset metadata or may be provided as an options map. The options map overrides the dataset metadata.

- :print-index-range - The set of indexes to print. Defaults to: (range *default-table-row-print-length*)
- :print-line-policy - defaults to :repl - one of:
  - :repl - multiline table - default nice printing for the repl
  - :markdown - lines delimited by <br>
  - :single - only print the first line
- :print-column-max-width - set the max width of a column when printing.

Example of conservative printing:

tech.ml.dataset.github-test> (def ds (with-meta ds (assoc (meta ds) :print-column-max-width 25 :print-line-policy :single)))
(descriptive-stats dataset)
(descriptive-stats dataset options)
Get descriptive statistics across the columns of the dataset, in addition to the standard stats.

Options:
- :stat-names - defaults to (remove #{:values :num-distinct-values} (all-descriptive-stats-names))
- :n-categorical-values - Number of categorical values to report in the 'values' field. Defaults to 21.
(dissoc dataset colname)
(dissoc dataset colname & more)
Remove one or more columns from the dataset.
(drop-columns dataset col-name-seq)
Same as remove-columns
(drop-missing ds)
Drop rows with missing entries
(drop-rows dataset row-indexes)
Same as remove-rows.
(ds-concat dataset & other-datasets)
Legacy method. Please see concat
(ds-take-nth n-val dataset)
Legacy method. Please see take-nth
(ensure-array-backed ds)
(ensure-array-backed ds {:keys [unpack?] :or {unpack? true}})
Ensure the column data in the dataset is stored in pure java arrays. This is sometimes necessary for interop with other libraries, and this operation will force any lazy computations to complete. This also clears the missing set for each column and writes the missing values to the new arrays.
Columns that are already array backed and that have no missing values are not changed and are returned as-is.
The postcondition is that dtype/->array will return a java array in the appropriate datatype for each column.
options - :unpack? - unpack packed datetime types. Defaults to true.
(feature-ecount dataset)
When columns aren't scalars then this will change. For now, just the number of feature columns.
(fill-range-replace ds colname max-span)
(fill-range-replace ds colname max-span missing-strategy)
(fill-range-replace ds colname max-span missing-strategy missing-value)
Given an in-order column of a numeric or datetime type, fill in spans that are larger than the given max-span. The source column must not have missing values. For more documentation on fill-range, see tech.v2.datatype.function.fill-range.
If the column is a datetime type the operation happens in millisecond space and max-span may be a datetime type convertible to milliseconds.
The result column has the same datatype as the input column.
After the operation, if missing strategy is not nil the newly produced missing
values along with the existing missing values will be replaced using the given
missing strategy for all other columns. See
tech.ml.dataset.missing/replace-missing
for documentation on missing strategies.
The missing strategy defaults to :down unless explicitly set.
Returns a new dataset.
(filter predicate dataset)
(filter predicate column-name-seq dataset)
dataset->dataset transformation. Predicate is passed a map of colname->column-value.
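A quick sketch (the column name is an assumption):

```clojure
;; Keep rows where :age is greater than 30; passing the column subset
;; the predicate actually reads improves performance.
(ds/filter #(> (:age %) 30) [:age] my-ds)
```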
(filter-column predicate colname dataset)
Filter a given column by a predicate. The predicate is passed column values. If the predicate is *not* an instance of IFn it is treated as a value and used as if the predicate were #(= value %). Returns a dataset.
(from-prototype dataset table-name column-seq)
Create a new dataset that is the same type as this one but with a potentially different table name and column sequence. Take care that the columns are all of the correct type.
(g-means dataset & [max-k error-on-missing?])
g-means. Not NaN aware; missing is an error. Returns an array of centroids in row-major array-of-array-of-doubles format.
(group-by key-fn dataset)
(group-by key-fn column-name-seq dataset)
Produce a map of key-fn-value->dataset. key-fn is a function taking a map of colname->column-value. Selecting which columns are used in the key-fn using column-name-seq is optional but will greatly improve performance.
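A brief sketch (the column name is an assumption):

```clojure
;; Returns a map of key value -> sub-dataset; narrowing key-fn's input
;; to [:category] greatly improves performance.
(ds/group-by #(:category %) [:category] my-ds)
```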
(group-by->indexes key-fn dataset)
(group-by->indexes key-fn column-name-seq dataset)
(group-by-column colname dataset)
Return a map of column-value->dataset.
(hash-join colname lhs rhs)
(hash-join colname
lhs
rhs
{:keys [operation-space] :or {operation-space :int32} :as options})
Join by column. For efficiency, lhs should be smaller than rhs.

colname - may be a single item or a tuple, which destructures as: (let [[lhs-colname rhs-colname] colname] ...)

An options map can be passed in with optional arguments:
- :lhs-missing? - Calculate the missing lhs indexes and the left outer join table.
- :rhs-missing? - Calculate the missing rhs indexes and the right outer join table.
- :operation-space - either :int32 or :int64. Defaults to :int32.

Returns:
{:join-table - joined table
 :lhs-indexes - matched lhs indexes
 :rhs-indexes - matched rhs indexes
 ;; -- when rhs-missing? is true --
 :rhs-missing - missing indexes of rhs
 :rhs-outer-join - rhs outer join table
 ;; -- when lhs-missing? is true --
 :lhs-missing - missing indexes of lhs
 :lhs-outer-join - lhs outer join table}
(head dataset)
(head n dataset)
Get the first n rows of a dataset. Equivalent to (select-rows ds (range n)). Arguments are reversed, however, so this can be used in ->> operators.
(impute-missing-by-centroid-averages dataset
row-major-centroids
{:keys [centroid-means global-means]})
Impute missing columns by first grouping by nearest centroids and then computing the mean. In the case where the grouping for a given centroid contains all NaN's, use the global dataset mean. In the case where this is NaN, this algorithm will fail to replace the missing values with meaningful values. Return a new dataset.
(inference-target-label-inverse-map dataset & [label-columns])
Given options generated during ETL operations and annotated with a :label-columns sequence containing one label column, generate a reverse map that maps from a dataset value back to the label that generated that value.
(inner-join colname lhs rhs)
(inner-join colname lhs rhs options)
Inner join by column. For efficiency, lhs should be smaller than rhs. colname may be a single item or a tuple, which destructures as: (let [[lhs-colname rhs-colname] colname] ...). An options map can be passed in with optional arguments: :operation-space - either :int32 or :int64, defaults to :int32. Returns the joined table.
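A minimal sketch; lhs-ds, rhs-ds, and the column names are placeholders:

```clojure
;; Join on a shared column name...
(ds/inner-join "id" lhs-ds rhs-ds)

;; ...or on differently named columns via a [lhs-colname rhs-colname] tuple.
(ds/inner-join ["id" "user-id"] lhs-ds rhs-ds)
```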
(interpolate-loess ds x-colname y-colname)
(interpolate-loess ds
x-colname
y-colname
{:keys [bandwidth iterations accuracy result-name]
:or {bandwidth 0.75
iterations 4
accuracy LoessInterpolator/DEFAULT_ACCURACY}})
Interpolate using the LOESS regression engine. Useful for smoothing out graphs.
(invert-string->number ds)
When ds-pipe/string->number is called it creates label maps. This reverts the dataset back to those labels. Currently results in object columns so a cast operation may be needed to convert to desired datatype.
(k-means dataset & [k max-iterations num-runs error-on-missing? tolerance])
NaN-aware k-means. Returns an array of centroids in row-major array-of-array-of-doubles format.
(labels dataset)
Given a dataset and an options map, generate a sequence of label-values. If the label count is 1 and there is a label-map associated with the column, generate the sequence of labels by reverse mapping the column(s) back to the original dataset values. If there are multiple label columns, the results are presented in a dataset. Returns a reader of labels.
(left-join colname lhs rhs)
(left-join colname lhs rhs options)
Left join by column. For efficiency, lhs should be smaller than rhs. colname may be a single item or a tuple, which destructures as: (let [[lhs-colname rhs-colname] colname] ...). An options map can be passed in with optional arguments: :operation-space - either :int32 or :int64, defaults to :int32. Returns the joined table.
(left-join-asof colname lhs rhs)
(left-join-asof colname lhs rhs {:keys [asof-op] :or {asof-op :<=} :as options})
Perform a left join asof. Similar to left join except this will join on the nearest value. lhs and rhs must be sorted by the join column. Join columns must either be datetime columns, in which case the join happens in millisecond space, or they must be numeric - integer or floating point datatypes.

Options:

- `:asof-op` - may be one of [:< :<= :nearest :>= :>] - the type of join operation. Defaults to :<=.
(mapseq-reader dataset)
(mapseq-reader dataset options)
Return a reader that produces a map of column-name->column-value.

Options:
- :missing-nil? - Defaults to true - Substitute nil for missing values so that downstream missing-value detection is independent of the column datatype.
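For example (my-ds is a placeholder dataset):

```clojure
;; Realize the first two rows as maps; missing values come back as nil
;; by default.
(take 2 (ds/mapseq-reader my-ds))
```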
(maybe-column dataset column-name)
Return the column if it exists, else nil.
(model-type dataset & [column-name-seq])
Check the label column after dataset processing. Returns either :regression or :classification.
(n-feature-permutations n dataset)
Given a dataset with at least one inference target column, produce all datasets with n feature columns and the label columns.
(n-permutations n dataset)
Return datasets for all possible n-column permutations of the columns. n must be less than (column-count dataset).
(name-values-seq->dataset name-values-seq & {:as options})
Given a sequence of [name data-seq], produce columns. If data-seq is of unknown (:object) datatype, the first item is checked: if it is a number then doubles are used; if it is a string then strings are used for the column datatype. All sequences must be the same length. Returns a new dataset.
(new-column dataset column-name values)
Create a new column from some values
(new-dataset column-seq)
(new-dataset options column-seq)
(new-dataset options ds-metadata column-seq)
Create a new dataset from a sequence of columns. Data will be converted into columns using ds-col-proto/ensure-column-seq. If the column seq is simply a collection of vectors, for instance, columns will be named ordinally.

options map:
- :dataset-name - Name of the dataset. Defaults to "_unnamed".
- :key-fn - Key function used on all column names before insertion into the dataset.

The return value fulfills the dataset protocols.
(num-inference-classes dataset)
Given a dataset and correctly built options from pipeline operations, return the number of classes used for the label. Error if not classification dataset.
(order-column-names dataset colname-seq)
Order a sequence of column names so they match the order in the original dataset. Missing columns are placed last.
(parallelized-load-csv input)
(parallelized-load-csv input options)
Load a csv, distributing rows between N different datasets. Concat them at the end and return the final dataset. Loads data into an out-of-order dataset.
Type-hinting your columns and providing specific parsers for datetime types, e.g. (ds/->dataset input {:parser-fn {"date" [:packed-local-date "yyyy-MM-dd"]}}), may have a larger effect than parallelization in most cases.
Loading multiple files in parallel will also have a larger effect than single-file parallelization in most cases.
(rand-nth dataset)
Return a random row from the dataset in map format
(reduce-column-names dataset colname-seq)
Reverse map from the one-hot encoded columns to the original source column.
(remove-columns dataset colname-seq)
Same as drop-columns
(remove-rows dataset row-indexes)
Same as drop-rows.
(rename-columns dataset colname-map)
Rename columns using a map. Does not reorder columns.
(replace-missing ds)
(replace-missing ds strategy)
(replace-missing ds columns-selector strategy)
(replace-missing ds columns-selector strategy value)
Replace missing values in some columns with a given strategy. The columns selector may be any legal argument to select-columns. Strategies may be:

- `:down` - take the value from the previous non-missing row if possible, else use the next non-missing row.
- `:up` - take the value from the next non-missing row if possible, else use the previous non-missing row.
- `:mid` - Use the midpoint of averaged values between the previous and next non-missing rows.
- `:lerp` - Linearly interpolate values between the previous and next non-missing rows.
- `:value` - A value will be provided - see below.

A value may be provided, which will then be used. The value may be a function, in which case it will be called on the column with missing values elided and the return value will be used as the filler.
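A short sketch (the column names are assumptions):

```clojure
;; Carry the previous non-missing :price value down.
(ds/replace-missing my-ds [:price] :down)

;; Replace missing :qty values with a constant via the :value strategy.
(ds/replace-missing my-ds [:qty] :value 0)
```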
(reverse-map-categorical-columns dataset {:keys [column-name-seq]})
Given a dataset where we have converted columns from a categorical representation to either a numeric representation or a one-hot representation, reverse map back to the original dataset given the reverse mapping of label->number in the column's metadata.
(right-join colname lhs rhs)
(right-join colname lhs rhs options)
Right join by column. For efficiency, lhs should be smaller than rhs. colname may be a single item or a tuple, which destructures as: (let [[lhs-colname rhs-colname] colname] ...). An options map can be passed in with optional arguments: :operation-space - either :int32 or :int64, defaults to :int32. Returns the joined table.
(sample dataset)
(sample n dataset)
(sample n replacement? dataset)
Sample n rows from a dataset. Defaults to sampling *without* replacement.
(select dataset colname-seq index-seq)
Reorder/trim the dataset according to this sequence of indexes. Returns a new dataset.

colname-seq - one of:

- :all - all the columns
- a sequence of column names - those columns in that order.
- an implementation of java.util.Map - column order is dictated by map iteration order, and selected columns are subsequently named after the corresponding value in the map. Similar to `rename-columns` except this trims the result to only the columns in the map.

index-seq - either the keyword :all or a list of indexes. May contain duplicates.
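Two quick sketches (my-ds and its column names are placeholders):

```clojure
;; Take two columns in the given order and the first three rows.
(ds/select my-ds [:b :a] [0 1 2])

;; Map form: trims to the mapped columns and renames them.
(ds/select my-ds {:a "alpha" :b "beta"} :all)
```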
(select-missing ds)
Select only rows with missing values
(shape dataset)
Returns shape in row-major format of [n-columns n-rows].
(sort-by key-fn dataset)
(sort-by key-fn compare-fn dataset)
(sort-by key-fn compare-fn column-name-seq dataset)
Sort a dataset by a key-fn and compare-fn.
(sort-by-column colname dataset)
(sort-by-column colname compare-fn dataset)
Sort a dataset by a given column using the given compare fn.
(tail dataset)
(tail n dataset)
Get the last n rows of a dataset. Equivalent to (select-rows ds (range ...)). Argument order is dataset-last, however, so this can be used in ->> operators.
(unique-by map-fn dataset)
(unique-by map-fn
           {:keys [column-name-seq keep-fn]
            :or {keep-fn #(first %2)}
            :as _options}
           dataset)
The map-fn gets passed a map for each row; rows are grouped by the return value. keep-fn is used to decide which index to keep.
:keep-fn - Function from key,idx-seq->idx. Defaults to #(first %2).
(unique-by-column colname dataset)
(unique-by-column colname
                  {:keys [keep-fn]
                   :or {keep-fn #(first %2)}
                   :as _options}
                  dataset)
Rows are grouped by the value of colname. keep-fn is used to decide which index to keep.
:keep-fn - Function from key, idx-seq->idx. Defaults to #(first %2).
(unordered-select dataset colname-seq index-seq)
Perform a selection but use the order of the columns in the existing table; do *not* reorder the columns based on colname-seq. Useful when doing selection based on sets or persistent hash maps.
(unroll-column dataset column-name)
(unroll-column dataset column-name options)
Unroll a column that has some (or all) sequential data as entries. Returns a new dataset with the same columns, but with the other columns duplicated where the unroll happened. The column now contains only scalar data.

Any missing indexes are dropped.

```clojure
user> (-> (ds/->dataset [{:a 1 :b [2 3]}
                         {:a 2 :b [4 5]}
                         {:a 3 :b :a}])
          (ds/unroll-column :b {:indexes? true}))
_unnamed [5 3]:

| :a | :b | :indexes |
|----+----+----------|
|  1 |  2 |        0 |
|  1 |  3 |        1 |
|  2 |  4 |        0 |
|  2 |  5 |        1 |
|  3 | :a |        0 |
```

Options:
- :datatype - datatype of the resulting column if one aside from :object is desired.
- :indexes? - If true, create a new column that records the indexes of the values from the original column. Can also be a truthy value (like a keyword), in which case the column will be named this.
(update-column dataset col-name update-fn)
Update a column returning a new dataset. update-fn is a column->column transformation. Error if column does not exist.
(update-columns dataset column-name-seq update-fn)
Update a sequence of columns.
(value-reader dataset)
(value-reader dataset options)
Return a reader that produces a reader of column values per index.

Options:
- :missing-nil? - Defaults to true - Substitute nil for missing values so that downstream missing-value detection is independent of the column datatype.
(write-csv! ds output)
(write-csv! ds output options)
Write a dataset to a tsv or csv output stream. Closes the output if a stream is passed in. The file output format will be inferred if output is a string:

- .csv, .tsv - switches between tsv and csv. Tsv is the default.
- *.gz - write to a gzipped stream.

At this time writing to json is not supported.

options:
- :separator - in case output isn't a string, you can use either \, or \tab to switch between csv or tsv output respectively.
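For example (my-ds is a placeholder dataset):

```clojure
;; Format is inferred from the filename; a .gz suffix gzips the stream.
(ds/write-csv! my-ds "out.csv")
(ds/write-csv! my-ds "out.tsv.gz")
```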
(x-means dataset & [max-k error-on-missing?])
x-means. Not NaN aware; missing is an error. Returns an array of centroids in row-major array-of-array-of-doubles format.