scicloj.ml.dataset

Clojure only.

->array
add-column
add-columns
add-or-replace-column
add-or-replace-columns
aggregate
aggregate-columns
anti-join
append
as-regular-dataset
asof-join
bind
boolean
by-rank
categorical
categorical->number
categorical->one-hot
CategoricalMap
clone
column
column-count
column-filter
column-names
column-values->categorical
columns
concat
concat-copying
convert-types
create-categorical-map
dataset
dataset->categorical-maps
dataset->categorical-xforms
dataset->one-hot-maps
dataset->str
dataset-name
dataset?
datetime
difference
drop
drop-columns
drop-missing
drop-rows
empty-ds?
feature
feature-ecount
fill-range-replace
first
fit-categorical-map
fit-one-hot
fold-by
full-join
group-by
grouped?
groups->map
groups->seq
has-column?
head
inference-column?
inference-target-column-names
inference-target-ds
inference-target-label-inverse-map
inference-target-label-map
info
inner-join
intersect
intersection
invert-categorical-map
invert-one-hot-map
join-columns
k-fold-datasets
labels
last
left-join
map-columns
mark-as-group
metadata-filter
missing
model-type
no-missing
num-inference-classes
numeric
of-datatype
OneHotMap
order-by
pivot->longer
pivot->wider
prediction
print-dataset
probability-distribution
probability-distributions->label-column
process-group-data
rand-nth
random
read-nippy
rename-columns
reorder-columns
replace-missing
reverse-map-categorical-xforms
right-join
row-count
rows
select
select-columns
select-missing
select-rows
semi-join
separate-column
set-dataset-name
set-inference-target
shape
shuffle
split
split->seq
string
tail
target
train-test-split
transform-categorical-map
transform-one-hot
ungroup
union
unique-by
unmark-group
unroll
update-columns
write!
write-csv!
write-nippy!

This namespace contains functions which operate on a dataset and mostly return a dataset.

The namespaces scicloj.ml.metamorph and scicloj.ml.dataset contain functions with the same names, but they operate on different things: the functions in scicloj.ml.metamorph take a context map, while the functions in this namespace take a dataset.

The functions in this namespace are re-exported from:

* tablecloth.api
* tech.v3.dataset.modelling
* tech.v3.dataset.column-filters

->array^clj

(->array ds colname)

(->array ds colname datatype)

Convert numerical column(s) to a Java array.

add-column^clj

(add-column ds column-name column)

(add-column ds column-name column size-strategy)

Add or update (modify) a column under `column-name`.

`column` can be a sequence of values or a generator function (which gets `ds` as input).

* `ds` - a dataset
* `column-name` - if it is the name of an existing column, that column will be replaced
* `column` - can be a column (from another dataset), a sequence, a single value or a function. Columns that are too long are always trimmed. Columns that are too short are cycled or extended with missing values (according to the `size-strategy` argument)
* `size-strategy` (optional) - when the new column is shorter than the dataset row count, one of the following strategies is applied:
  - `:cycle` - repeat data
  - `:na` - append missing values
  - `:strict` - (default) throws an exception when sizes mismatch
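
A minimal sketch of the argument forms described above. The alias `ds`, the sample data and the column names are assumptions made for illustration; they do not come from the library documentation.

```clojure
(require '[scicloj.ml.dataset :as ds])

;; Illustrative data; the column names are made up for this sketch.
(def data (ds/dataset {:a [1 2 3 4]}))

;; A sequence with the same length as the dataset.
(def with-b (ds/add-column data :b [10 20 30 40]))

;; A generator function that receives the dataset itself.
(def with-double
  (ds/add-column data :double-a
                 (fn [d] (map #(* 2 %) (ds/column d :a)))))

;; A shorter sequence, repeated to fill the column via :cycle.
(def with-flag (ds/add-column data :flag [0 1] :cycle))
```

Passing the strategy explicitly, as in the last call, avoids relying on the default when the lengths differ.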

add-columns^clj

(add-columns ds columns-map)

(add-columns ds columns-map size-strategy)

Add or update (modify) columns defined in `columns-map` (mapping: name -> column).

add-or-replace-column^clj

(add-or-replace-column ds column-name column)

(add-or-replace-column ds column-name column size-strategy)

add-or-replace-columns^clj

(add-or-replace-columns ds columns-map)

(add-or-replace-columns ds columns-map size-strategy)

aggregate^clj

(aggregate ds aggregator)

(aggregate ds
           aggregator
           {:keys [default-column-name-prefix ungroup? parallel?]
            :or {default-column-name-prefix "summary" ungroup? true}
            :as options})

Aggregate dataset by providing:

- aggregation function
- map with column names and functions
- sequence of aggregation functions

Aggregation functions can return:

- single value
- seq of values
- map of values with column names
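
The following sketch shows two of the aggregator forms listed above: a single function and a map of column names to functions. The alias `ds`, the data and the result column names are illustrative assumptions.

```clojure
(require '[scicloj.ml.dataset :as ds])

(def data (ds/dataset {:a [1 2 3 4] :b [10 20 30 40]}))

;; A single aggregation function returning a single value.
;; The result column gets a default name derived from
;; `default-column-name-prefix`.
(def total (ds/aggregate data (fn [d] (reduce + (ds/column d :a)))))

;; A map of result column name -> aggregation function.
(def summary
  (ds/aggregate data
                {:sum-a (fn [d] (reduce + (ds/column d :a)))
                 :max-b (fn [d] (reduce max (ds/column d :b)))}))
```

Each aggregation function receives the whole dataset (or, for a grouped dataset, presumably each group), so it can combine several columns if needed.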

aggregate-columns^clj

(aggregate-columns ds columns-selector column-aggregators)

(aggregate-columns ds columns-selector column-aggregators options)

Aggregates each column separately

anti-join^clj

(anti-join ds-left ds-right columns-selector)

(anti-join ds-left ds-right columns-selector options)

append^clj

(append ds & datasets)

as-regular-dataset^clj

(as-regular-dataset ds)

Remove grouping tag

asof-join^clj

(asof-join ds-left ds-right colname)

(asof-join ds-left ds-right colname options)

bind^clj

(bind ds & datasets)

boolean^clj

(boolean dataset)

Return a dataset containing only the boolean columns.

by-rank^clj

(by-rank ds columns-selector rank-predicate)

(by-rank ds
         columns-selector
         rank-predicate
         {:keys [desc? ties] :or {desc? true ties :dense}})

Select rows using `rank` on a column. By default, ties are resolved using the `:dense` method.

See [R docs](https://www.rdocumentation.org/packages/base/versions/3.6.1/topics/rank). Rank uses 0-based indexing.

Possible `:ties` strategies: `:average`, `:first`, `:last`, `:random`, `:min`, `:max`, `:dense`. `:dense` is the same as in `data.table::frank` from R.

`:desc?` - when true (default), order descending before calculating rank.

categorical^clj

(categorical dataset)

Return a dataset containing only the categorical columns.

categorical->number^clj

(categorical->number dataset filter-fn-or-ds)

(categorical->number dataset filter-fn-or-ds table-args)

(categorical->number dataset filter-fn-or-ds table-args result-datatype)

Convert columns into a discrete, numeric representation. See tech.v3.dataset.categorical/fit-categorical-map.

categorical->one-hot^clj

(categorical->one-hot dataset filter-fn-or-ds)

(categorical->one-hot dataset filter-fn-or-ds table-args)

(categorical->one-hot dataset filter-fn-or-ds table-args result-datatype)

Convert string columns to numeric columns. See tech.v3.dataset.categorical/fit-one-hot

CategoricalMap^clj

clone^clj

(clone item)

Clone an object. Can clone anything convertible to a reader.

column^clj

(column dataset colname)

column-count^clj

(column-count dataset)

column-filter^clj

(column-filter dataset filter-fn)

Return a dataset with only the columns for which the filter function returns a truthy value.

column-names^clj

(column-names ds)

(column-names ds columns-selector)

(column-names ds columns-selector meta-field)

column-values->categorical^clj

(column-values->categorical dataset src-column)

Given a column encoded via either string->number or one-hot, reverse map to a sequence of the original string column values. In the case of one-hot mappings, src-column must be the original column name before the one-hot map.

columns^clj

(columns ds)

(columns ds result-type)

Returns columns of dataset. Result type can be any of:

* `:as-map`
* `:as-double-arrays`
* `:as-seqs`

concat^clj

(concat dataset & datasets)

concat-copying^clj

(concat-copying dataset & datasets)

convert-types^clj

(convert-types ds coltype-map-or-columns-selector)

(convert-types ds columns-selector new-types)

Convert the type of the selected column(s) to another type.

create-categorical-map^clj

(create-categorical-map lookup-table src-colname result-datatype)

dataset^clj

(dataset)

(dataset data)

(dataset data
         {:keys [single-value-column-name column-names layout dataset-name]
          :or {single-value-column-name :$value layout :as-rows}
          :as options})

Create `dataset`.

Dataset can be created from:

* single value
* map of values and/or sequences
* sequence of maps
* sequence of columns
* file or url
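
As a minimal sketch of some of the creation forms listed above, assuming the namespace is required under the alias `ds` (the alias and the sample data are illustrative, not from the original):

```clojure
(require '[scicloj.ml.dataset :as ds])

;; From a map of column name -> sequence of values.
(def from-map (ds/dataset {:a [1 2 3] :b ["x" "y" "z"]}))

;; From a sequence of maps, one map per row.
(def from-rows (ds/dataset [{:a 1 :b "x"} {:a 2 :b "y"}]))

;; From a single value; per the signature above, the column name
;; comes from `single-value-column-name` (default :$value).
(def from-value (ds/dataset 999))

;; Inspect the results.
(ds/shape from-map)          ; [rows, cols]
(ds/column-names from-rows)
```

A file path or URL can be passed in the same position as `data`; that form is not shown here because it depends on an external resource.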

dataset->categorical-maps^clj

(dataset->categorical-maps dataset)

Given a dataset, return a map of column names to categorical label maps. This aids in inverting all of the label maps in a dataset. The source column name is src-column.

dataset->categorical-xforms^clj

(dataset->categorical-xforms ds)

Given a dataset, return a map of column-name->xform information.

dataset->one-hot-maps^clj

(dataset->one-hot-maps dataset)

Given a dataset, return a sequence of applied one-hot transformations.

dataset->str^clj

(dataset->str ds)

(dataset->str ds options)

Convert a dataset to a string. Prints a single line header and then calls dataset-data->str.

For options documentation see dataset-data->str.

dataset-name^clj

(dataset-name dataset)

dataset?^clj

(dataset? ds)

Is ds a dataset type?

datetime^clj

(datetime dataset)

Return a dataset containing only the datetime columns.

difference^clj

(difference ds-left ds-right)

(difference ds-left ds-right options)

drop^clj

(drop ds columns-selector rows-selector)

Drop columns and rows.

drop-columns^clj

(drop-columns ds)

(drop-columns ds columns-selector)

(drop-columns ds columns-selector meta-field)

Drop columns by (returns dataset):

- name
- sequence of names
- map of names with new names (rename)
- function which filters names (via column metadata)

drop-missing^clj

(drop-missing ds)

(drop-missing ds columns-selector)

Drop rows with missing values.

`columns-selector` selects the columns in which to look for missing values.
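
A short sketch of both arities, assuming the alias `ds` and that `nil` entries in the input are treated as missing values (the sample data is illustrative):

```clojure
(require '[scicloj.ml.dataset :as ds])

;; Row 0 is missing :b, row 1 is missing :a, row 2 is complete.
(def data (ds/dataset {:a [1 nil 3] :b [nil "y" "z"]}))

;; Consider every column: only the complete row survives.
(def complete-rows (ds/drop-missing data))

;; Consider only column :a: rows 0 and 2 survive.
(def a-present (ds/drop-missing data :a))
```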

drop-rows^clj

(drop-rows ds)

(drop-rows ds rows-selector)

(drop-rows ds rows-selector {:keys [select-keys pre result-type parallel?]})

Drop rows using:

- row id
- seq of row ids
- seq of true/false
- fn with predicate

empty-ds?^clj

(empty-ds? ds)

feature^clj

(feature dataset)

Return a dataset containing only the columns which have not been marked as inference columns.

feature-ecount^clj

(feature-ecount dataset)

Number of feature columns. Feature columns are columns that are not inference targets.

fill-range-replace^clj

(fill-range-replace ds colname max-span)

(fill-range-replace ds colname max-span missing-strategy)

(fill-range-replace ds colname max-span missing-strategy missing-value)

first^clj

(first ds)

fit-categorical-map^clj

(fit-categorical-map dataset colname & [table-args res-dtype])

Given a column, map it into a numeric space via a discrete map of values to integers. This fits the categorical transformation onto the column and returns the transformation.

If `table-args` is not given, the distinct column values will be mapped into 0..x without any specific order.

`table-args` allows specifying the precise mapping as a sequence of pairs of [val idx] or as a sorted seq of values.

fit-one-hot^clj

(fit-one-hot dataset colname & [table-args res-dtype])

Fit a one hot transformation to a column. Returns a reusable transformation. Maps each unique value to a column with 1 every time the value appears in the original column and 0 otherwise.

fold-by^clj

(fold-by ds columns-selector)

(fold-by ds columns-selector folding-function)

full-join^clj

(full-join ds-left ds-right columns-selector)

(full-join ds-left ds-right columns-selector options)

group-by^clj

(group-by ds grouping-selector)

(group-by ds
          grouping-selector
          {:keys [select-keys result-type]
           :or {result-type :as-dataset select-keys :all}
           :as options})

Group dataset by:

- column name
- list of columns
- map of keys and row indexes
- function getting map of values

Options are:

- `select-keys` - when grouping is done by function, you can limit fields to a `select-keys` seq.
- `result-type` - return results as dataset (`:as-dataset`, default), as map of datasets (`:as-map`), as map of row indexes (`:as-indexes`) or as sequence of (sub)datasets
- other parameters which are passed to `dataset` fn

When dataset is returned, meta contains `:grouped?` set to true. Columns in dataset:

- name - group name
- group-id - id of the group (int)
- data - group as dataset
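
The sketch below groups by a single column name and shows two of the result types described above. The alias `ds` and the sample data are illustrative assumptions.

```clojure
(require '[scicloj.ml.dataset :as ds])

(def data (ds/dataset {:g [:x :y :x :y] :v [1 2 3 4]}))

;; Default result type: a grouped dataset with one row per group.
(def grouped (ds/group-by data :g))

;; The same grouping returned as a map of group name -> dataset.
(def as-map (ds/group-by data :g {:result-type :as-map}))

;; Turn the grouped dataset back into a regular one.
(def flat (ds/ungroup grouped))
```

The grouped form is the one to pass on to functions such as `aggregate`; the map form is convenient when each group is handled by ordinary Clojure code.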

grouped?^clj

(grouped? ds)

Does `ds` represent a grouped dataset (result of `group-by`)?

groups->map^clj

(groups->map ds)

Convert grouped dataset to the map of groups

groups->seq^clj

(groups->seq ds)

has-column?^clj

(has-column? dataset column-name)

head^clj

(head ds)

(head ds n)

inference-column?^clj

(inference-column? col)

inference-target-column-names^clj

(inference-target-column-names ds)

Return the names of the columns that are inference targets.

inference-target-ds^clj

(inference-target-ds dataset)

Given a dataset return reverse-mapped inference target columns or nil in the case where there are no inference targets.

inference-target-label-inverse-map^clj

(inference-target-label-inverse-map dataset & [label-columns])

Given options generated during ETL operations and annotated with a :label-columns sequence containing 1 label column, generate a reverse map that maps from a dataset value back to the label that generated that value.

inference-target-label-map^clj

(inference-target-label-map dataset & [label-columns])

info^clj

(info ds)

(info ds result-type)

inner-join^clj

(inner-join ds-left ds-right columns-selector)

(inner-join ds-left ds-right columns-selector options)

intersect^clj

(intersect ds-left ds-right)

(intersect ds-left ds-right options)

intersection^clj

(intersection lhs-ds rhs-ds)

Return only columns for rhs for which an equivalently named column exists in lhs.

invert-categorical-map^clj

(invert-categorical-map dataset {:keys [src-column lookup-table]})

Invert a categorical map returning the column to the original set of values.

invert-one-hot-map^clj

(invert-one-hot-map dataset {:keys [one-hot-table src-column]})

Invert a one-hot transformation removing the one-hot columns and adding back the original column.

join-columns^clj

(join-columns ds target-column columns-selector)

(join-columns ds
              target-column
              columns-selector
              {:keys [separator missing-subst drop-columns? result-type
                      parallel?]
               :or {separator "-" drop-columns? true result-type :string}})

k-fold-datasets^clj

(k-fold-datasets dataset k)

(k-fold-datasets dataset k options)

Given 1 dataset, prepare K datasets using the k-fold algorithm. `:randomize-dataset?` defaults to true, which will realize the entire dataset, so use with care if you have large datasets.

Returns a sequence of {:test-ds :train-ds}

Options:

* `:randomize-dataset?` - When true, shuffle the dataset. In that case `:seed` may be provided. Defaults to true.
* `:seed` - when `:randomize-dataset?` is true then this can either be an implementation of java.util.Random or an integer seed which will be used to construct java.util.Random.
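
A minimal sketch of a call with an integer seed, assuming the alias `ds`; the data and the seed value are arbitrary choices for illustration.

```clojure
(require '[scicloj.ml.dataset :as ds])

(def data (ds/dataset {:x (range 10) :y (range 10 20)}))

;; Five folds, shuffled with a fixed seed for repeatability.
(def folds (ds/k-fold-datasets data 5 {:seed 42}))

;; Each element is a map holding a training and a test dataset.
(def first-fold (first folds))
(ds/row-count (:train-ds first-fold))
(ds/row-count (:test-ds first-fold))
```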

labels^clj

(labels dataset)

Return the labels. The labels sequence is the reverse mapped inference column. This returns a single column of data or errors out.

last^clj

(last ds)

left-join^clj

(left-join ds-left ds-right columns-selector)

(left-join ds-left ds-right columns-selector options)

map-columns^clj

(map-columns ds column-name map-fn)

(map-columns ds column-name columns-selector map-fn)

(map-columns ds column-name new-type columns-selector map-fn)

mark-as-group^clj

(mark-as-group ds)

Add grouping tag

metadata-filter^clj

(metadata-filter dataset filter-fn)

Return a dataset with only the columns for which, given the column metadata, the filter function returns a truthy value.

missing^clj

(missing dataset)

Return a dataset with only the columns that have missing values.

model-type^clj

(model-type dataset & [column-name-seq])

Check the label column after dataset processing. Returns either `:regression` or `:classification`.

no-missing^clj

(no-missing dataset)

Return a dataset with only columns that have no missing values.

num-inference-classes^clj

(num-inference-classes dataset)

Given a dataset and correctly built options from pipeline operations, return the number of classes used for the label. Error if not classification dataset.

numeric^clj

(numeric dataset)

Return a dataset containing only the numeric columns.

of-datatype^clj

(of-datatype dataset datatype)

Return a dataset containing only the columns of a specific datatype.

OneHotMap^clj

order-by^clj

(order-by ds columns-or-fn)

(order-by ds columns-or-fn comparators)

(order-by ds columns-or-fn comparators {:keys [parallel?]})

Order dataset by:

- column name
- columns (as sequence of names)
- key-fn
- sequence of columns / key-fn

Additionally you can set the order with:

- `:asc`
- `:desc`
- custom comparator function
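
The ordering forms above can be sketched as follows. The alias `ds` and the sample data are illustrative assumptions.

```clojure
(require '[scicloj.ml.dataset :as ds])

(def data (ds/dataset {:a [3 1 2 1] :b ["c" "a" "b" "z"]}))

;; Order by a single column, descending.
(def by-a-desc (ds/order-by data :a :desc))

;; Order by two columns with one comparator per column:
;; :a ascending, then :b descending within equal :a values.
(def by-a-then-b (ds/order-by data [:a :b] [:asc :desc]))

;; Order by a key function computed from each row.
(def by-negated-a (ds/order-by data (fn [row] (- (:a row)))))
```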

pivot->longer^clj

(pivot->longer ds)

(pivot->longer ds columns-selector)

(pivot->longer
  ds
  columns-selector
  {:keys [target-columns value-column-name splitter drop-missing? datatypes]
   :or {target-columns :$column value-column-name :$value drop-missing? true}})

tidyr pivot_longer api

pivot->wider^clj

(pivot->wider ds columns-selector value-columns)

(pivot->wider
  ds
  columns-selector
  value-columns
  {:keys [fold-fn concat-columns-with concat-value-with drop-missing?]
   :or {concat-columns-with "_" concat-value-with "-" drop-missing? true}})

prediction^clj

(prediction dataset)

Return the columns of the dataset marked as predictions.

print-dataset^clj

(print-dataset ds)

(print-dataset ds options)

probability-distribution^clj

(probability-distribution dataset)

Return the columns of the dataset that comprise the probability distribution after classification.

probability-distributions->label-column^clj

(probability-distributions->label-column prob-ds dst-colname)

Given a dataset that has columns in which the column names describe labels and the rows describe a probability distribution, create a label column by taking the max value in each row and assigning to that row the name of the column holding it.

process-group-data^clj

(process-group-data ds f)

(process-group-data ds f parallel?)

rand-nth^clj

(rand-nth ds)

(rand-nth ds {:keys [seed]})

random^clj

(random ds)

(random ds n)

(random ds n {:keys [repeat? seed] :or {repeat? true}})

read-nippy^clj

(read-nippy filename)

rename-columns^clj

(rename-columns ds columns-mapping)

(rename-columns ds columns-selector columns-map-fn)

Rename columns with provided old -> new name map

reorder-columns^clj

(reorder-columns ds columns-selector & columns-selectors)

Reorder columns using column selector(s). When the selectors do not cover all columns, the remaining columns are attached at the end.

replace-missing^clj

(replace-missing ds)

(replace-missing ds strategy)

(replace-missing ds columns-selector strategy)

(replace-missing ds columns-selector strategy value)

reverse-map-categorical-xforms^clj

(reverse-map-categorical-xforms dataset)

Given a dataset where we have converted columns from a categorical representation to either a numeric representation or a one-hot representation, reverse map back to the original dataset given the reverse mapping of label->number in the column's metadata.

right-join^clj

(right-join ds-left ds-right columns-selector)

(right-join ds-left ds-right columns-selector options)

row-count^clj

(row-count dataset-or-col)

rows^clj

(rows ds)

(rows ds result-type)

Returns rows of dataset. Result type can be any of:

* `:as-maps`
* `:as-double-arrays`
* `:as-seqs`
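
A brief sketch of two of the result types, assuming the alias `ds` (the sample data is illustrative):

```clojure
(require '[scicloj.ml.dataset :as ds])

(def data (ds/dataset {:a [1 2] :b [3 4]}))

;; One map per row, keyed by column name.
(def row-maps (ds/rows data :as-maps))

;; One sequence of values per row.
(def row-seqs (ds/rows data :as-seqs))

;; Row maps support ordinary keyword lookup.
(def a-values (mapv :a row-maps))
```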

select^clj

(select ds columns-selector rows-selector)

Select columns and rows.

select-columns^clj

(select-columns ds)

(select-columns ds columns-selector)

(select-columns ds columns-selector meta-field)

Select columns by (returns dataset):

- name
- sequence of names
- map of names with new names (rename)
- function which filters names (via column metadata)
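
The selector forms above apply to both `select-columns` and `drop-columns`. A minimal sketch, assuming the alias `ds` and made-up column names:

```clojure
(require '[scicloj.ml.dataset :as ds])

(def data (ds/dataset {:a [1 2] :b ["x" "y"] :c [1.5 2.5]}))

;; By a single name.
(def only-a (ds/select-columns data :a))

;; By a sequence of names.
(def a-and-c (ds/select-columns data [:a :c]))

;; By a predicate, applied here to the column names.
(def not-b (ds/select-columns data (fn [col-name] (not= col-name :b))))

;; drop-columns takes the same kinds of selectors.
(def without-a (ds/drop-columns data :a))
```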

select-missing^clj

(select-missing ds)

(select-missing ds columns-selector)

Select rows with missing values.

`columns-selector` selects the columns in which to look for missing values.

select-rows^clj

(select-rows ds)

(select-rows ds rows-selector)

(select-rows ds rows-selector {:keys [select-keys pre result-type parallel?]})

Select rows using:

- row id
- seq of row ids
- seq of true/false
- fn with predicate
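
Each selector form listed above can be sketched as follows; `drop-rows` accepts the same forms and removes the matching rows instead. The alias `ds` and the sample data are illustrative assumptions.

```clojure
(require '[scicloj.ml.dataset :as ds])

(def data (ds/dataset {:a [1 2 3 4] :b [:w :x :y :z]}))

;; A single row id.
(def first-row (ds/select-rows data 0))

;; A sequence of row ids.
(def rows-0-2 (ds/select-rows data [0 2]))

;; A sequence of booleans, one per row.
(def by-mask (ds/select-rows data [true false true false]))

;; A predicate that receives each row as a map.
(def large-a (ds/select-rows data (fn [row] (> (:a row) 2))))

;; The same predicate with drop-rows keeps the complement.
(def small-a (ds/drop-rows data (fn [row] (> (:a row) 2))))
```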

semi-join^clj

(semi-join ds-left ds-right columns-selector)

(semi-join ds-left ds-right columns-selector options)

separate-column^clj

(separate-column ds column separator)

(separate-column ds column target-columns separator)

(separate-column ds
                 column
                 target-columns
                 separator
                 {:keys [missing-subst drop-column? parallel?]
                  :or {missing-subst ""}})

set-dataset-name^clj

(set-dataset-name dataset ds-name)

set-inference-target^clj

(set-inference-target dataset target-name-or-target-name-seq)

Set the inference target on the column. This sets the :column-type member of the column metadata to :inference-target?.

shape^clj

(shape ds)

Returns shape of the dataset [rows, cols]

shuffle^clj

(shuffle ds)

(shuffle ds {:keys [seed]})

split^clj

(split ds)

(split ds split-type)

(split ds
       split-type
       {:keys [seed parallel? shuffle?] :or {shuffle? true} :as opts})

Split the given dataset into two or more (holdout) splits.

As a result, two new columns are added:

* `:$split-name` - with subgroup name
* `:$split-id` - fold id/repetition id

`split-type` can be one of the following:

* `:kfold` - k-fold strategy, `:k` defines number of folds (defaults to `5`), produces `k` splits
* `:bootstrap` - `:ratio` defines ratio of observations put into result (defaults to `1.0`), produces `1` split
* `:holdout` - split into two parts with given ratio (defaults to `2/3`), produces `1` split
* `:loo` - leave one out, produces the same number of splits as number of observations

`:holdout` can also accept probabilities or ratios, and can split into more than 2 subdatasets.

Additionally you can provide:

* `:seed` - for random number generator
* `:repeats` - repeat procedure `:repeats` times
* `:partition-selector` - same as in `group-by` for stratified splitting to reflect dataset structure in splits.
* `:split-names` - names of subdatasets, if different from the defaults, e.g. `[:train :test :split-2 ...]`
* `:split-col-name` - the column where the name of the split is stored, by default holding `:train` or `:test` values (default: `:$split-name`)
* `:split-id-col-name` - a column where id of the train/test pair is stored (default: `:$split-id`)

Rows are shuffled before splitting (`:shuffle?` defaults to `true`).

For a grouped dataset, each group is processed separately.

See [more](https://www.mitpressjournals.org/doi/pdf/10.1162/EVCO_a_00069)
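
As a rough illustration of the options described above, the following sketch splits a small dataset with `:holdout` and with `:kfold`. The dataset `data` and its columns `:x` and `:y` are made up for this example, and the alias `ds` is an assumption; only the functions and option keys named in the docstrings on this page are used.

```clojure
(require '[scicloj.ml.dataset :as ds])

;; Hypothetical toy dataset with 10 rows (not from the library docs).
(def data (ds/dataset {:x (range 10)
                       :y (map #(* 2 %) (range 10))}))

;; Holdout split with the default ratio; :seed makes the shuffle repeatable.
;; The result is one dataset with the two extra split columns added.
(def holdout (ds/split data :holdout {:seed 42}))

(ds/column-names holdout) ; includes :$split-name and :$split-id

;; 5-fold split returned as a sequence of maps with :train and :test datasets.
(def folds (ds/split->seq data :kfold {:k 5 :seed 42}))

;; Row counts of the train and test parts of the first fold.
(let [{:keys [train test]} (first folds)]
  [(ds/row-count train) (ds/row-count test)])
```

`split` keeps everything in a single dataset tagged by the split columns, while `split->seq` is more convenient when each train/test pair is processed in turn.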

split->seq^clj

(split->seq ds)

(split->seq ds split-type)

(split->seq ds
            split-type
            {:keys [split-col-name split-id-col-name]
             :or {split-col-name :$split-name split-id-col-name :$split-id}
             :as opts})

Returns the split as a sequence of train/test datasets, or as a map of sequences for a grouped dataset.

string^clj

(string dataset)

Return a dataset containing only the string columns.

tail^clj

(tail ds)

(tail ds n)

target^clj

(target dataset)

Return a dataset containing only the columns that have been marked as inference targets.
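
A minimal sketch of how `target` might be combined with `set-inference-target`, documented earlier on this page. The dataset `data`, its column names, and the `ds` alias are hypothetical and chosen only for illustration.

```clojure
(require '[scicloj.ml.dataset :as ds])

;; Hypothetical dataset: :x is a feature, :y is the value to predict.
(def data (ds/dataset {:x [1 2 3] :y [10 20 30]}))

;; Mark :y as the inference target, then keep only the target columns.
(def target-ds
  (-> data
      (ds/set-inference-target :y)
      (ds/target)))

(ds/column-names target-ds) ; only :y remains
```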

train-test-split^clj

(train-test-split dataset)

(train-test-split dataset
                  {:keys [train-fraction] :or {train-fraction 0.7} :as options})

Probabilistically split the dataset returning a map of `{:train-ds :test-ds}`.

Options:

* `:randomize-dataset?` - When true, shuffle the dataset. In that case `:seed` may be
   provided. Defaults to true.
* `:seed` - When `:randomize-dataset?` is true, this can be either an
   implementation of java.util.Random or an integer seed, which will be used to
   construct a java.util.Random.
* `:train-fraction` - Fraction of the dataset to use as the training set. Defaults to
   0.7.
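
The options above could be used as in the sketch below. The dataset `data` and the `ds` alias are assumptions for the example; the option keys and the `:train-ds`/`:test-ds` result keys are the ones given in the docstring.

```clojure
(require '[scicloj.ml.dataset :as ds])

;; Hypothetical toy dataset with 10 rows.
(def data (ds/dataset {:x (range 10)
                       :y (map #(* 3 %) (range 10))}))

;; Shuffle with a fixed integer seed and keep roughly 80% for training.
(def parts (ds/train-test-split data {:train-fraction 0.8 :seed 42}))

;; The result is a map holding the two datasets.
(ds/row-count (:train-ds parts))
(ds/row-count (:test-ds parts))
```

Unlike `split`, which tags rows inside one dataset, this returns the two parts directly as separate datasets.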

transform-categorical-map^clj

(transform-categorical-map dataset fit-data)

Apply a categorical mapping transformation fit with fit-categorical-map.

transform-one-hot^clj

(transform-one-hot dataset one-hot-fit-data)

Apply a one-hot transformation to a dataset

ungroup^clj

(ungroup ds)

(ungroup ds
         {:keys [order? add-group-as-column add-group-id-as-column separate?
                 dataset-name parallel?]
          :or {separate? true}})

Concat groups into dataset.

When `add-group-as-column` or `add-group-id-as-column` is set to `true` or to name(s), columns with the group name(s) or group id are added to the result.

Before joining, the groups can be sorted by group name.
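
A small sketch of the round trip through `group-by` and `ungroup`, under the assumption that grouping by a single column yields that column's values as group names. The dataset `data`, the column names, and the target column name `:group-name` are hypothetical.

```clojure
(require '[scicloj.ml.dataset :as ds])

;; Hypothetical dataset with a grouping column :g.
(def data (ds/dataset {:g [:a :b :a :b]
                       :v [1 2 3 4]}))

;; Group by :g, then concatenate the groups back into one dataset,
;; sorting by group name and storing the group name in a new column.
(def regrouped
  (-> data
      (ds/group-by :g)
      (ds/ungroup {:order? true
                   :add-group-as-column :group-name})))

(ds/row-count regrouped)            ; same number of rows as data
(ds/has-column? regrouped :group-name)
```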

union^clj

(union ds & datasets)

unique-by^clj

(unique-by ds)

(unique-by ds columns-selector)

(unique-by
  ds
  columns-selector
  {:keys [strategy select-keys parallel?] :or {strategy :first} :as options})

unmark-group^clj

(unmark-group ds)

Remove grouping tag

unroll^clj

(unroll ds columns-selector)

(unroll ds columns-selector options)

update-columns^clj

(update-columns ds columns-map)

(update-columns ds columns-selector update-functions)

write!^clj

(write! dataset output-path)

(write! dataset output-path options)

Write a dataset out to a file.  Supported forms are:

```clojure
(ds/write! test-ds "test.csv")
(ds/write! test-ds "test.tsv")
(ds/write! test-ds "test.tsv.gz")
(ds/write! test-ds "test.nippy")
(ds/write! test-ds out-stream)
```

Options:

  * `:max-chars-per-column` - csv/tsv specific, defaults to 65536 - values longer than this will
     cause an exception during serialization.
  * `:max-num-columns` - csv/tsv specific, defaults to 8192 - if the dataset has more than this number of
     columns, an exception will be thrown during serialization.
  * `:quoted-columns` - csv specific - sequence of column names that should always be quoted.
  * `:file-type` - Manually specify the file type. This is usually inferred from the filename, but if you
     pass in an output stream you will need to specify the file type.
  * `:headers?` - whether csv headers are written, defaults to true.

write-csv!^clj

write-nippy!^clj

(write-nippy! ds filename)
