Linear pipeline operations.
(->array colname)
(->array colname datatype)
Convert numerical column(s) to a Java array.
(add-column column-name column)
(add-column column-name column size-strategy)
Add or update (modify) a column under `column-name`.

`column` can be a sequence of values or a generator function (which gets `ds` as input).

* `ds` - a dataset
* `column-name` - if it is an existing column name, that column will be replaced
* `column` - can be a column (from another dataset), a sequence, a single value or a function (taking a dataset). Columns that are too big are always trimmed. Columns that are too small are cycled or extended with missing values (according to the `size-strategy` argument)
* `size-strategy` (optional) - when the new column is shorter than the dataset row count, the following strategies are applied:
  - `:cycle` - repeat data
  - `:na` - append missing values
  - `:strict` - (default) throws an exception when sizes mismatch
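The size strategies can be modelled on a plain sequence. The sketch below is not tablecloth's implementation; `fit-column` is a hypothetical helper, and it assumes that oversized input is trimmed before the strategy is consulted (following the "always trimmed" sentence above), which the library may handle differently for `:strict`.

```clojure
(defn fit-column
  "Hypothetical helper: fit `values` to `row-count` rows using `size-strategy`."
  [values row-count size-strategy]
  (let [n (count values)]
    (cond
      ;; long enough (or too long): keep only the first row-count values
      (>= n row-count)         (vec (take row-count values))
      ;; too short, :cycle: repeat the data until the column is full
      (= size-strategy :cycle) (vec (take row-count (cycle values)))
      ;; too short, :na: pad with nil, standing in for a missing value
      (= size-strategy :na)    (vec (concat values (repeat (- row-count n) nil)))
      ;; anything else is treated as :strict
      :else (throw (ex-info "Column size mismatch"
                            {:expected row-count :actual n})))))

(fit-column [1 2] 5 :cycle) ;; => [1 2 1 2 1]
(fit-column [1 2] 5 :na)    ;; => [1 2 nil nil nil]
```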
(add-columns columns-map)
(add-columns columns-map size-strategy)
Add or update (modify) columns defined in `columns-map` (mapping: name -> column).
(add-or-replace-column column-name column)
(add-or-replace-column column-name column size-strategy)
(add-or-replace-columns columns-map)
(add-or-replace-columns columns-map size-strategy)
(aggregate aggregator)
(aggregate aggregator options)
Aggregate dataset by providing:

- aggregation function
- map with column names and functions
- sequence of aggregation functions

Aggregation functions can return:

- single value
- seq of values
- map of values with column names
(aggregate-columns columns-aggregators)
(aggregate-columns columns-selector column-aggregators)
(aggregate-columns columns-selector column-aggregators options)
Aggregates each column separately
(anti-join ds-right columns-selector)
(anti-join ds-right columns-selector options)
(append & args)
Concats columns of several datasets
(array-column->columns src-column)
(array-column->columns src-column opts)
Converts a column of type Java array into several columns, one for each element of the array of all rows. The source column is dropped afterwards. The function assumes that arrays in all rows have the same type and length and are numeric.

* `ds` - dataset to operate on
* `src-column` - the (array) column to convert
* `opts` can contain:
  - `prefix` - newly created columns will get this prefix before the column number
(as-regular-dataset)
Remove grouping tag
(asof-join ds-right columns-selector)
(asof-join ds-right columns-selector options)
(by-rank columns-selector rank-predicate)
(by-rank columns-selector rank-predicate options)
Select rows using `rank` on a column; ties are resolved using the `:dense` method.

See [R docs](https://www.rdocumentation.org/packages/base/versions/3.6.1/topics/rank). Rank uses 0-based indexing.

Possible `:ties` strategies: `:average`, `:first`, `:last`, `:random`, `:min`, `:max`, `:dense`. `:dense` is the same as in `data.table::frank` from R.

`:desc?` - when set to true (default), order descending before calculating rank.
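To make the 0-based dense ranking concrete, here is a small model on a plain vector. It is only a sketch of the idea, not the library's code; `dense-rank` and `rows-by-rank` are hypothetical names.

```clojure
(defn dense-rank
  "Hypothetical helper: 0-based dense rank of each value.
  Equal values share a rank and no rank numbers are skipped."
  [values desc?]
  (let [sorted  (sort (distinct values))
        ordered (if desc? (reverse sorted) sorted)
        rank-of (zipmap ordered (range))]
    (mapv rank-of values)))

(defn rows-by-rank
  "Hypothetical helper: indexes of rows whose rank satisfies `rank-predicate`."
  [values rank-predicate desc?]
  (vec (keep-indexed (fn [idx r] (when (rank-predicate r) idx))
                     (dense-rank values desc?))))

(dense-rank [10 30 20 30] true)         ;; => [2 0 1 0]
(rows-by-rank [10 30 20 30] zero? true) ;; => [1 3]
```

With descending order the largest value gets rank 0, so a predicate such as `zero?` picks the rows holding the maximum, including ties.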
(clone)
Clone an object. Can clone anything convertible to a reader.
(column-names)
(column-names columns-selector)
(column-names columns-selector meta-field)
Returns column names, given a selector. `columns-selector` can be one of the following:

* `:all` keyword - selects all columns
* column name - for a single column
* sequence of column names - for a collection of columns
* regex - to apply a pattern on column names or datatype
* filter predicate - to filter column names or datatype
* `type` namespaced keyword for a specific datatype or group of datatypes

Column name can be anything.

The `column-names` function returns names according to `columns-selector` and the optional `meta-field`. `meta-field` is one of the following:

* `:name` (default) - to operate on column names
* `:datatype` - to operate on column types
* `:all` - if you want to process all metadata

Datatype groups are:

* `:type/numerical` - any numerical type
* `:type/float` - floating point number (`:float32` and `:float64`)
* `:type/integer` - any integer
* `:type/datetime` - any datetime type

If the qualified keyword starts with `:!type`, the complement set is used.
(columns)
(columns result-type)
Returns columns of dataset. Result type can be any of:

* `:as-map`
* `:as-double-arrays`
* `:as-seqs`
(columns->array-column column-selector new-column)
Converts several columns to a single column of type array. The source columns are dropped afterwards.

* `ds` - dataset to operate on
* `column-selector` - anything supported by [[select-columns]]
* `new-column` - new column to create
(complete columns-selector & args)
tidyr `complete`.

Fills a dataset with all possible combinations of selected columns. When a given combination doesn't exist, missing values are created.
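The idea of completing combinations can be sketched on a vector of row maps. This is a model under assumptions, not the library's implementation: `complete-rows` is a hypothetical helper, `nil` stands in for a missing value, and when several rows share a combination only the last one is kept.

```clojure
(defn complete-rows
  "Hypothetical helper: one row per combination of the values seen in `ks`.
  Combinations absent from `rows` get nil in every other column."
  [rows ks]
  (let [by-key (into {} (map (juxt #(select-keys % ks) identity)) rows)
        blank  (zipmap (keys (first rows)) (repeat nil))
        levels (map (fn [k] (distinct (map #(get % k) rows))) ks)
        ;; cartesian product of the distinct values of each selected column
        combos (reduce (fn [acc vals] (for [a acc, v vals] (conj a v)))
                       [[]]
                       levels)]
    (mapv (fn [combo]
            (let [key-map (zipmap ks combo)]
              (get by-key key-map (merge blank key-map))))
          combos)))

(complete-rows [{:x 1 :y :a :v 10}
                {:x 2 :y :b :v 20}]
               [:x :y])
;; => [{:x 1, :y :a, :v 10} {:x 1, :y :b, :v nil}
;;     {:x 2, :y :a, :v nil} {:x 2, :y :b, :v 20}]
```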
(concat & args)
Joins rows from other datasets
(concat-copying & args)
Joins rows from other datasets via a copy of data
(convert-types coltype-map-or-columns-selector)
(convert-types columns-selector new-types)
Convert the type of the column(s) to another type.
(cross-join ds-right)
(cross-join ds-right columns-selector)
(cross-join ds-right columns-selector options)
Cross product from selected columns
(crosstab row-selector col-selector)
(crosstab row-selector col-selector options)
Cross tabulation of two sets of columns.

Creates grouped dataset by [row-selector, col-selector] pairs and calls aggregation on each group.

Options:

* `pivot?` - create pivot table or just flat structure (default: true)
* `replace-missing?` - replace missing values? (default: true)
* `missing-value` - a missing value (default: 0)
* `aggregator` - aggregating function (default: row-count)
* `marginal-rows`, `marginal-cols` - adds row and/or cols, it's a sum if true. Can be a custom fn.
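A rough model of the default behaviour (row-count aggregation, pivoted result, missing cells replaced by 0) can be written over row maps. `crosstab-counts` and the `:rows` label column are hypothetical names chosen for this sketch; the real function's output column names may differ.

```clojure
(defn crosstab-counts
  "Hypothetical helper: count rows for every [row-key col-key] pair and
  lay the counts out as one map per distinct row-key value."
  [rows row-key col-key]
  (let [counts   (frequencies (map (juxt row-key col-key) rows))
        row-vals (distinct (map row-key rows))
        col-vals (distinct (map col-key rows))]
    (mapv (fn [r]
            (into {:rows r}
                  ;; pairs that never occur get 0, mirroring missing-value 0
                  (map (fn [c] [c (get counts [r c] 0)]))
                  col-vals))
          row-vals)))

(crosstab-counts [{:g :a :h :x} {:g :a :h :y} {:g :b :h :x} {:g :a :h :x}]
                 :g :h)
;; => [{:rows :a, :x 2, :y 1} {:rows :b, :x 1, :y 0}]
```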
(dataset->str)
(dataset->str options)
Convert a dataset to a string. Prints a single line header and then calls `dataset-data->str`.

For options documentation see `dataset-data->str`.
(drop columns-selector rows-selector)
Drop columns and rows.
(drop-columns)
(drop-columns columns-selector)
(drop-columns columns-selector meta-field)
Drop columns by (returns dataset):

- name
- sequence of names
- map of names with new names (rename)
- function which filters names (via column metadata)
(drop-missing)
(drop-missing columns-selector)
Drop rows with missing values.

`columns-selector` selects the columns to check for missing values.
(drop-rows)
(drop-rows rows-selector)
(drop-rows rows-selector options)
Drop rows using:

- row id
- seq of row ids
- seq of true/false
- fn with predicate
(expand columns-selector & args)
tidyr `expand`.

Creates all possible combinations of selected columns.
(fill-range-replace colname max-span)
(fill-range-replace colname max-span missing-strategy)
(fill-range-replace colname max-span missing-strategy missing-value)
Fill gaps with the lacking values. Accepts:

* dataset
* column name
* expected step (`max-span`, milliseconds in case of a datetime column)
* (optional) `missing-strategy` - how to replace missing values, default `:down` (set to nil if none)
* (optional) `missing-value` - optional value used to replace missing values
(fold-by columns-selector)
(fold-by columns-selector folding-function)
Group-by and pack columns into vector - the output dataset has a row for each unique combination of the provided columns while each remaining column has its value(s) collected into a vector, similar to how `clojure.core/group-by` works.

See https://scicloj.github.io/tablecloth/index.html#Fold-by
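Since the description compares folding to `clojure.core/group-by`, a model on row maps may help. `fold-rows` is a hypothetical helper written for this sketch and is not how the library does it.

```clojure
(defn fold-rows
  "Hypothetical helper: one row per unique combination of `ks`; every
  other column holds a vector of the values found in that group."
  [rows ks]
  (let [key-fn   #(select-keys % ks)
        groups   (group-by key-fn rows)
        other-ks (remove (set ks) (keys (first rows)))]
    (mapv (fn [group-key]
            (into group-key
                  (map (fn [k] [k (mapv #(get % k) (groups group-key))]))
                  other-ks))
          ;; keep groups in order of first appearance
          (distinct (map key-fn rows)))))

(fold-rows [{:g 1 :v 10} {:g 2 :v 20} {:g 1 :v 30}] [:g])
;; => [{:g 1, :v [10 30]} {:g 2, :v [20]}]
```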
(full-join ds-right columns-selector)
(full-join ds-right columns-selector options)
Join keeping all rows
(get-entry column row)
Returns a single value from given column and row
(group-by grouping-selector)
(group-by grouping-selector options)
Group dataset by:

- column name
- list of columns
- map of keys and row indexes
- function getting map of values

Options are:

- `select-keys` - when grouping is done by function, you can limit fields to a `select-keys` seq.
- `result-type` - return results as dataset (`:as-dataset`, default), as map of datasets (`:as-map`), as map of row indexes (`:as-indexes`) or as sequence of (sub)datasets
- other parameters which are passed to the `dataset` fn

When a dataset is returned, its meta contains `:grouped?` set to true. Columns in the dataset:

- `name` - group name
- `group-id` - id of the group (int)
- `data` - group as dataset
(grouped?)
Does `dataset` represent a grouped dataset (result of `group-by`)?
(groups->map)
Convert grouped dataset to the map of groups
(groups->seq)
Convert grouped dataset to seq of the groups
(info)
(info result-type)
Returns statistical information about the columns of a dataset.

`result-type` can be `:descriptive` or `:columns`.
(inner-join ds-right columns-selector)
(inner-join ds-right columns-selector options)
(join-columns target-column columns-selector)
(join-columns target-column columns-selector conf)
Join columns of dataset. Accepts:

* dataset
* column selector (as in `select-columns`)
* options
  - `:separator` (default `-`)
  - `:drop-columns?` - whether to drop source columns or not (default true)
  - `:result-type`
    - `:map` - packs data into map
    - `:seq` - packs data into sequence
    - `:string` - join strings with separator (default)
    - or custom function which gets row as a vector
  - `:missing-subst` - substitution for missing value
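For the default `:string` result type, the effect on a single row can be approximated as below. `join-row` is a hypothetical helper, `nil` stands in for a missing value, and the way the library formats non-string values is an assumption here.

```clojure
(require '[clojure.string :as str])

(defn join-row
  "Hypothetical helper: join the values of `ks` in `row` with `separator`,
  replacing nil (missing) values with `missing-subst`."
  [row ks separator missing-subst]
  (str/join separator
            (map (fn [k]
                   (let [v (get row k)]
                     (if (nil? v) missing-subst v)))
                 ks)))

(join-row {:a 1 :b nil :c "x"} [:a :b :c] "-" "NA") ;; => "1-NA-x"
```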
(left-join ds-right columns-selector)
(left-join ds-right columns-selector options)
(map-columns column-name map-fn)
(map-columns column-name columns-selector map-fn)
(map-columns column-name new-type columns-selector map-fn)
Map over rows using a map function. The arity should match the columns selected.
(order-by columns-or-fn)
(order-by columns-or-fn comparators)
(order-by columns-or-fn comparators options)
Order dataset by:

- column name
- columns (as sequence of names)
- key-fn
- sequence of columns / key-fn

Additionally you can specify the order with:

- `:asc`
- `:desc`
- custom comparator function
(pivot->longer)
(pivot->longer columns-selector)
(pivot->longer columns-selector options)
`tidyr` pivot_longer api
(pivot->wider columns-selector value-columns)
(pivot->wider columns-selector value-columns options)
Converts columns to rows. Arguments:

* dataset
* columns selector
* options:
  - `:target-columns` - names of the columns created or columns pattern (see below) (default: `:$column`)
  - `:value-column-name` - name of the column for values (default: `:$value`)
  - `:splitter` - string, regular expression or function which splits source column names into data
  - `:drop-missing?` - remove rows with missing values? (default: true)
  - `:datatypes` - map of target columns data types
  - `:coerce-to-number` - try to convert extracted values to numbers if possible (default: true)
* `target-columns` - can be:
  - column name - source column names are put there as data
  - column names as sequence - source column names after split are put separately into `:target-columns` as data
  - pattern - a sequence of names, where some of the names are nil. nil is replaced by a name taken from the splitter and such a column is used for values.
(print-dataset)
(print-dataset options)
Prints dataset into console. For options see tech.v3.dataset.print/dataset-data->str
(process-group-data f)
(process-group-data f parallel?)
Internal: The passed-in function is applied on all groups
(rand-nth)
(rand-nth options)
Returns single random row
(random)
(random n)
(random n options)
Returns (n) random rows with repetition
(rename-columns columns-mapping)
(rename-columns columns-selector columns-map-fn)
Rename columns with provided old -> new name map
(reorder-columns columns-selector & args)
Reorder columns using column selector(s). When column names are incomplete, the missing will be attached at the end.
(replace-missing)
(replace-missing strategy)
(replace-missing columns-selector strategy)
(replace-missing columns-selector strategy value)
Replaces missing values. Accepts:

* dataset
* column selector, default: `:all`
* strategy, default: `:nearest`
* value (optional)
  - single value
  - sequence of values (cycled)
  - function, applied on column(s) with missing values stripped

Strategies are:

* `:value` - replace with given value
* `:up` - copy values up
* `:down` - copy values down
* `:updown` - copy values up and then down for missing values at the end
* `:downup` - copy values down and then up for missing values at the beginning
* `:mid` or `:nearest` - copy values around known values
* `:midpoint` - use average value from previous and next non-missing
* `:lerp` - tries to linearly approximate values, works for numbers and datetime, otherwise applies `:nearest`. For numbers always results in float datatype.
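The directional strategies can be illustrated on a plain vector where `nil` marks a missing value. This sketch assumes that "down" means towards later rows and that a single pass leaves gaps it cannot reach unfilled; the helper names are hypothetical and the library's edge-case handling may differ.

```clojure
(defn fill-down
  "Hypothetical helper: copy the last seen non-nil value into later gaps."
  [xs]
  (vec (rest (reductions (fn [prev x] (if (nil? x) prev x)) nil xs))))

(defn fill-up
  "Hypothetical helper: copy the next non-nil value into earlier gaps."
  [xs]
  (vec (reverse (fill-down (reverse xs)))))

(defn fill-downup
  "Hypothetical helper: fill down first, then up for gaps at the beginning."
  [xs]
  (fill-up (fill-down xs)))

(fill-down   [nil 1 nil nil 4 nil]) ;; => [nil 1 1 1 4 4]
(fill-up     [nil 1 nil nil 4 nil]) ;; => [1 1 4 4 4 nil]
(fill-downup [nil 1 nil nil 4 nil]) ;; => [1 1 1 1 4 4]
```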
(right-join ds-right columns-selector)
(right-join ds-right columns-selector options)
(rows)
(rows result-type)
Returns rows of dataset. Result type can be any of:

* `:as-maps`
* `:as-double-arrays`
* `:as-seqs`
(select columns-selector rows-selector)
Select columns and rows.
(select-columns)
(select-columns columns-selector)
(select-columns columns-selector meta-field)
Select columns by (returns dataset):

- name
- sequence of names
- map of names with new names (rename)
- function which filters names (via column metadata)
(select-missing)
(select-missing columns-selector)
Select rows with missing values.

`columns-selector` selects the columns to check for missing values.
(select-rows)
(select-rows rows-selector)
(select-rows rows-selector options)
Select rows using:

- row id
- seq of row ids
- seq of true/false
- fn with predicate
(semi-join ds-right columns-selector)
(semi-join ds-right columns-selector options)
(separate-column column)
(separate-column column separator)
(separate-column column target-columns separator)
(separate-column column target-columns separator conf)
(shape)
Returns shape of the dataset [rows, cols]
(shuffle)
(shuffle options)
Shuffle dataset (with seed)
(split)
(split split-type)
(split split-type options)
Split given dataset into 2 or more (holdout) splits.

As the result two new columns are added:

* `:$split-name` - with subgroup name
* `:$split-id` - fold id/repetition id

`split-type` can be one of the following:

* `:kfold` - k-fold strategy, `:k` defines number of folds (defaults to `5`), produces `k` splits
* `:bootstrap` - `:ratio` defines ratio of observations put into result (defaults to `1.0`), produces `1` split
* `:holdout` - split into two parts with given ratio (defaults to `2/3`), produces `1` split
* `:loo` - leave one out, produces the same number of splits as number of observations

`:holdout` can also accept probabilities or ratios and can split into more than 2 subdatasets.

Additionally you can provide:

* `:seed` - for random number generator
* `:repeats` - repeat procedure `:repeats` times
* `:partition-selector` - same as in `group-by` for stratified splitting to reflect dataset structure in splits.
* `:split-names` - names of subdatasets different than default, i.e. `[:train :test :split-2 ...]`
* `:split-col-name` - a column where name of split is stored, either `:train` or `:test` values (default: `:$split-name`)
* `:split-id-col-name` - a column where id of the train/test pair is stored (default: `:$split-id`)

Rows are shuffled before splitting.

In case of grouped dataset each group is processed separately.

See [more](https://www.mitpressjournals.org/doi/pdf/10.1162/EVCO_a_00069)
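To show what "`k` splits" means for `:kfold`, the sketch below builds train/test row indexes for each fold. It is a simplified model: `kfold-indexes` is a hypothetical helper, rows are not shuffled (the real function shuffles first), and assigning rows to folds by `index mod k` is an assumption made only to keep the example deterministic.

```clojure
(defn kfold-indexes
  "Hypothetical helper: for `n` rows and `k` folds, return one map per
  split holding the fold id plus test and train row indexes."
  [n k]
  (let [folds (map (fn [fold] (filterv #(= fold (mod % k)) (range n)))
                   (range k))]
    (mapv (fn [fold-id test-idx]
            {:split-id fold-id
             :test     test-idx
             ;; every row outside the test fold is used for training
             :train    (vec (remove (set test-idx) (range n)))})
          (range k)
          folds)))

(kfold-indexes 6 3)
;; => [{:split-id 0, :test [0 3], :train [1 2 4 5]}
;;     {:split-id 1, :test [1 4], :train [0 2 3 5]}
;;     {:split-id 2, :test [2 5], :train [0 1 3 4]}]
```

Each row lands in the test set of exactly one split, which is the property a k-fold scheme relies on.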
(split->seq)
(split->seq split-type)
(split->seq split-type options)
Returns split as a sequence of train/test datasets or map of sequences (grouped dataset)
(ungroup)
(ungroup options)
Concat groups into dataset.

When `add-group-as-column` or `add-group-id-as-column` is set to `true` or name(s), columns with group name(s) or group id are added to the result.

Before joining, the groups can be sorted by group name.
(unique-by)
(unique-by columns-selector)
(unique-by columns-selector options)
Remove rows which contain the same data.

* `column-selector` - select columns for uniqueness
* `strategy` - there are 4 strategies defined to handle duplicates:
  - `:first` - select first row (default)
  - `:last` - select last row
  - `:random` - select random row
  - any function - apply the function to the columns which are subject to uniqueness
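The `:first` and `:last` strategies can be modelled on row maps as follows. `unique-rows` is a hypothetical helper for illustration; `:random` and function strategies are left out of the sketch.

```clojure
(defn unique-rows
  "Hypothetical helper: keep one row per distinct combination of `ks`,
  choosing the first or last duplicate according to `strategy`."
  [rows ks strategy]
  (let [pick   (case strategy
                 :first first
                 :last  last)
        key-fn #(select-keys % ks)
        groups (group-by key-fn rows)]
    ;; walk the keys in order of first appearance
    (mapv (comp pick groups) (distinct (map key-fn rows)))))

(def example-rows [{:id 1 :v :a} {:id 2 :v :b} {:id 1 :v :c}])

(unique-rows example-rows [:id] :first) ;; => [{:id 1, :v :a} {:id 2, :v :b}]
(unique-rows example-rows [:id] :last)  ;; => [{:id 1, :v :c} {:id 2, :v :b}]
```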
(unroll columns-selector)
(unroll columns-selector options)
Unfolds sequences stored inside column(s), turning them into multiple rows. Opposite of [[fold-by]].

Adds each of the provided columns to the set that defines the "unique key" of each row. Thus there will be a new row for each value inside the target column(s)' value sequence.

If you want instead to split the content of the columns into a set of new _columns_, look at [[separate-column]].

See https://scicloj.github.io/tablecloth/index.html#Unroll
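The "new row for each value" behaviour can be sketched on row maps. `unroll-rows` is a hypothetical helper handling a single column; note that in this model a row whose sequence is empty disappears, which may not match the library.

```clojure
(defn unroll-rows
  "Hypothetical helper: emit one row per element of the sequence stored
  under `k`, copying the remaining columns unchanged."
  [rows k]
  (vec (mapcat (fn [row]
                 (map #(assoc row k %) (get row k)))
               rows)))

(unroll-rows [{:id 1 :tags [:a :b]} {:id 2 :tags [:c]}] :tags)
;; => [{:id 1, :tags :a} {:id 1, :tags :b} {:id 2, :tags :c}]
```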
(update-columns columns-map)
(update-columns columns-selector update-functions)
(write! output-path)
(write! output-path options)
Write a dataset out to a file. Supported forms are:

```clojure
(ds/write! test-ds "test.csv")
(ds/write! test-ds "test.tsv")
(ds/write! test-ds "test.tsv.gz")
(ds/write! test-ds "test.nippy")
(ds/write! test-ds out-stream)
```

Options:

* `:max-chars-per-column` - csv,tsv specific, defaults to 65536 - values longer than this will cause an exception during serialization.
* `:max-num-columns` - csv,tsv specific, defaults to 8192 - if the dataset has more than this number of columns an exception will be thrown during serialization.
* `:quoted-columns` - csv specific - sequence of column names that you would like to always have quoted.
* `:file-type` - manually specify the file type. This is usually inferred from the filename, but if you pass in an output stream then you will need to specify the file type.
* `:headers?` - whether csv headers are written, defaults to true.