tablecloth.pipeline

Clojure only.

->array
add-column
add-columns
add-or-replace-column
add-or-replace-columns
aggregate
aggregate-columns
anti-join
append
as-regular-dataset
asof-join
bind
build-pipelined-function
by-rank
clone
column
column-count
column-names
columns
complete
concat
concat-copying
convert-types
cross-join
dataset->str
dataset-name
dataset?
difference
drop
drop-columns
drop-missing
drop-rows
empty-ds?
expand
fill-range-replace
first
fold-by
full-join
group-by
grouped?
groups->map
groups->seq
has-column?
head
info
inner-join
intersect
join-columns
last
left-join
map-columns
mark-as-group
order-by
pivot->longer
pivot->wider
print-dataset
process-all-api-symbols
process-group-data
rand-nth
random
read-nippy
rename-columns
reorder-columns
replace-missing
right-join
row-count
rows
select
select-columns
select-missing
select-rows
semi-join
separate-column
set-dataset-name
shape
shuffle
split
split->seq
tail
ungroup
union
unique-by
unmark-group
unroll
update-columns
write!
write-nippy!

Linear pipeline operations.

->array^clj

(->array colname)

(->array colname datatype)

Convert numerical column(s) to a Java array.

add-column^clj

(add-column column-name column)

(add-column column-name column size-strategy)

Add or update (modify) a column under `column-name`.

`column` can be a sequence of values or a generator function (which gets `ds` as input).

* `ds` - a dataset
* `column-name` - if it is the name of an existing column, that column will be replaced
* `column` - can be a column (from another dataset), a sequence, a single value or a function. Columns that are too long are always trimmed. Columns that are too short are cycled or extended with missing values (according to the `size-strategy` argument)
* `size-strategy` (optional) - when the new column is shorter than the dataset row count, one of the following strategies is applied:
  - `:cycle` - repeat data
  - `:na` - append missing values
  - `:strict` - (default) throws an exception when sizes mismatch
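The size strategies can be modeled in plain Clojure. This is only a sketch of the behavior described above, not tablecloth's implementation: `fit-column` is a hypothetical helper, `nil` stands in for a missing value, and the sketch assumes that `:strict` rejects any difference in length while the other strategies trim input that is too long.

```clojure
(defn fit-column
  "Fit `values` to `row-count` entries (hypothetical helper, not tablecloth code)."
  ([values row-count] (fit-column values row-count :strict))
  ([values row-count size-strategy]
   (let [n (count values)]
     (cond
       ;; Sizes already agree: nothing to do.
       (= n row-count) (vec values)
       ;; Assumption: :strict treats any length difference as an error.
       (= size-strategy :strict)
       (throw (ex-info "Column size does not match row count"
                       {:row-count row-count :column-size n}))
       ;; Too long: trim to the row count.
       (> n row-count) (vec (take row-count values))
       ;; Too short: repeat the data from the start.
       (= size-strategy :cycle) (vec (take row-count (cycle values)))
       ;; Too short: pad with nil as a stand-in for missing values.
       (= size-strategy :na) (vec (concat values (repeat (- row-count n) nil)))
       :else (throw (ex-info "Unknown size strategy"
                             {:size-strategy size-strategy}))))))

(fit-column [1 2] 5 :cycle)     ; => [1 2 1 2 1]
(fit-column [1 2] 5 :na)        ; => [1 2 nil nil nil]
(fit-column [1 2 3 4] 3 :cycle) ; => [1 2 3]
```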

add-columns^clj

(add-columns columns-map)

(add-columns columns-map size-strategy)

Add or update (modify) columns defined in `columns-map` (mapping: name -> column).

add-or-replace-column^clj

(add-or-replace-column column-name column)

(add-or-replace-column column-name column size-strategy)

add-or-replace-columns^clj

(add-or-replace-columns columns-map)

(add-or-replace-columns columns-map size-strategy)

aggregate^clj

(aggregate aggregator)

(aggregate aggregator options)

Aggregate dataset by providing:

- aggregation function
- map with column names and functions
- sequence of aggregation functions

Aggregation functions can return:

- single value
- seq of values
- map of values with column names

aggregate-columns^clj

(aggregate-columns columns-selector column-aggregators)

(aggregate-columns columns-selector column-aggregators options)

Aggregates each column separately

anti-join^clj

(anti-join ds-right columns-selector)

(anti-join ds-right columns-selector options)

append^clj

(append & args)

as-regular-dataset^clj

(as-regular-dataset)

Remove grouping tag

asof-join^clj

(asof-join ds-right columns-selector)

(asof-join ds-right columns-selector options)

bind^clj

(bind & args)

build-pipelined-function^cljmacro

(build-pipelined-function f m)

by-rank^clj

(by-rank columns-selector rank-predicate)

(by-rank columns-selector rank-predicate options)

Select rows using `rank` on a column; ties are resolved using the `:dense` method.

See [R docs](https://www.rdocumentation.org/packages/base/versions/3.6.1/topics/rank).
Rank uses 0-based indexing.

Possible `:ties` strategies: `:average`, `:first`, `:last`, `:random`, `:min`, `:max`, `:dense`.
`:dense` is the same as in `data.table::frank` from R.

`:desc?` - when set to true (default), values are ordered descending before the rank is calculated.
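A plain-Clojure sketch of 0-based dense ranking and of selecting rows with a rank predicate. `dense-rank` and `rows-by-rank` are hypothetical helpers that work on a vector of row maps; they are not tablecloth functions, and only the `:dense` strategy with descending order is modeled.

```clojure
(defn dense-rank
  "0-based dense rank of each value: equal values share a rank and no rank is skipped."
  [values desc?]
  (let [ordered (sort (if desc? #(compare %2 %1) compare) (distinct values))
        rank-of (zipmap ordered (range))]
    (mapv rank-of values)))

(defn rows-by-rank
  "Keep the rows whose rank in column `k` satisfies `rank-predicate`."
  [rows k rank-predicate]
  (let [ranks (dense-rank (map k rows) true)] ; descending, as with :desc? true
    (vec (keep-indexed (fn [i row] (when (rank-predicate (ranks i)) row))
                       rows))))

(def rows [{:id :a :v 10} {:id :b :v 30} {:id :c :v 20} {:id :d :v 30}])

(dense-rank (map :v rows) true) ; => [2 0 1 0]
(rows-by-rank rows :v zero?)    ; => [{:id :b, :v 30} {:id :d, :v 30}]
```

With descending order, rank 0 belongs to the largest value, so `zero?` keeps the rows that hold the maximum.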

clone^clj

(clone)

Clone an object. Can clone anything convertible to a reader.

column^clj

(column colname)

column-count^clj

(column-count)

column-names^clj

(column-names)

(column-names columns-selector)

(column-names columns-selector meta-field)

columns^clj

(columns)

(columns result-type)

Returns columns of dataset. Result type can be any of:

* `:as-map`
* `:as-double-arrays`
* `:as-seqs`

complete^clj

(complete columns-selector & args)

tidyr `complete`.

Fills a dataset with all possible combinations of selected columns. When a given combination does not exist in the data, missing values are created.
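The idea of completing a dataset can be sketched on a vector of row maps. `all-combinations` and `complete-rows` are hypothetical helpers written for this illustration, not tablecloth functions; `nil` stands in for a missing value and column names are assumed to be keywords.

```clojure
(defn all-combinations
  "Every combination of the distinct values found in the columns `ks`."
  [rows ks]
  (reduce (fn [combos k]
            (for [combo combos
                  v     (distinct (map k rows))]
              (assoc combo k v)))
          [{}]
          ks))

(defn complete-rows
  "One or more rows for every combination; combinations that are absent
  from `rows` get nil in the remaining columns."
  [rows ks]
  (let [by-combo (group-by #(select-keys % ks) rows)
        other-ks (remove (set ks) (keys (first rows)))
        blank    (zipmap other-ks (repeat nil))]
    (vec (mapcat (fn [combo] (get by-combo combo [(merge blank combo)]))
                 (all-combinations rows ks)))))

(complete-rows [{:a 1 :b :x :v 10} {:a 2 :b :y :v 20}] [:a :b])
;; => [{:a 1, :b :x, :v 10} {:a 1, :b :y, :v nil}
;;     {:a 2, :b :x, :v nil} {:a 2, :b :y, :v 20}]
```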

concat^clj

(concat & args)

concat-copying^clj

(concat-copying & args)

convert-types^clj

(convert-types coltype-map-or-columns-selector)

(convert-types columns-selector new-types)

Convert the type of the selected column(s) to another type.

cross-join^clj

(cross-join ds-right)

(cross-join ds-right columns-selector)

(cross-join ds-right columns-selector options)

dataset->str^clj

(dataset->str)

(dataset->str options)

Convert a dataset to a string. Prints a single line header and then calls dataset-data->str.

For options documentation see dataset-data->str.

dataset-name^clj

(dataset-name)

dataset?^clj

(dataset?)

Is `ds` a `dataset` type?

difference^clj

(difference ds-right)

(difference ds-right options)

drop^clj

(drop columns-selector rows-selector)

Drop columns and rows.

drop-columns^clj

(drop-columns)

(drop-columns columns-selector)

(drop-columns columns-selector meta-field)

Drop columns by (returns dataset):

- name
- sequence of names
- map of names with new names (rename)
- function which filters names (via column metadata)

drop-missing^clj

(drop-missing)

(drop-missing columns-selector)

Drop rows with missing values.

`columns-selector` selects the columns to check for missing values.

drop-rows^clj

(drop-rows)

(drop-rows rows-selector)

(drop-rows rows-selector options)

Drop rows using:

- row id
- seq of row ids
- seq of true/false
- fn with predicate

empty-ds?^clj

(empty-ds?)

expand^clj

(expand columns-selector & args)

tidyr `expand`.

Creates all possible combinations of selected columns.

fill-range-replace^clj

(fill-range-replace colname max-span)

(fill-range-replace colname max-span missing-strategy)

(fill-range-replace colname max-span missing-strategy missing-value)

first^clj

(first)

fold-by^clj

(fold-by columns-selector)

(fold-by columns-selector folding-function)

Group-by and pack columns into vector - the output data set has a row for each unique combination
of the provided columns while each remaining column has its value(s) collected into a vector, similar
to how clojure.core/group-by works.
See https://scicloj.github.io/tablecloth/index.html#Fold-by
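The folding described above can be sketched on a vector of row maps. `fold-rows` is a hypothetical helper, not the library's implementation; it assumes keyword column names and emits groups in order of first appearance.

```clojure
(defn fold-rows
  "One output row per distinct combination of the columns `ks`;
  every remaining column is collected into a vector."
  [rows ks]
  (let [group-key #(select-keys % ks)
        groups    (group-by group-key rows)
        rest-keys (remove (set ks) (keys (first rows)))]
    (mapv (fn [gk]
            ;; Start from the grouping columns and add one vector per other column.
            (reduce (fn [row k] (assoc row k (mapv k (groups gk))))
                    gk
                    rest-keys))
          (distinct (map group-key rows)))))

(fold-rows [{:g :a :x 1} {:g :b :x 2} {:g :a :x 3}] [:g])
;; => [{:g :a, :x [1 3]} {:g :b, :x [2]}]
```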

full-join^clj

(full-join ds-right columns-selector)

(full-join ds-right columns-selector options)

group-by^clj

(group-by grouping-selector)

(group-by grouping-selector options)

Group dataset by:

- column name
- list of columns
- map of keys and row indexes
- function getting map of values

Options are:

- select-keys - when grouping is done by function, you can limit fields to a `select-keys` seq.
- result-type - return results as dataset (`:as-dataset`, default), as map of datasets (`:as-map`), as map of row indexes (`:as-indexes`) or as sequence of (sub)datasets
- other parameters which are passed to `dataset` fn

When a dataset is returned, its meta contains `:grouped?` set to true. Columns in the dataset:

- name - group name
- group-id - id of the group (int)
- data - group as dataset
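The shape of a grouped dataset can be pictured with plain maps. `group-rows` is a hypothetical helper for illustration only; the keys `:name`, `:group-id` and `:data` mirror the three columns listed above, and assigning ids in order of first appearance is an assumption of this sketch.

```clojure
(defn group-rows
  "Model of a grouped dataset: one entry per group with the three documented columns."
  [rows grouping-fn]
  (let [names  (distinct (map grouping-fn rows)) ; group names, first appearance first
        groups (group-by grouping-fn rows)]
    (vec (map-indexed (fn [id group-name]
                        {:name     group-name
                         :group-id id
                         :data     (groups group-name)})
                      names))))

(group-rows [{:k :a :v 1} {:k :b :v 2} {:k :a :v 3}] :k)
;; => [{:name :a, :group-id 0, :data [{:k :a, :v 1} {:k :a, :v 3}]}
;;     {:name :b, :group-id 1, :data [{:k :b, :v 2}]}]
```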

grouped?^clj

(grouped?)

Does `dataset` represent a grouped dataset (the result of `group-by`)?

groups->map^clj

(groups->map)

Convert grouped dataset to the map of groups

groups->seq^clj

(groups->seq)

has-column?^clj

(has-column? column-name)

head^clj

(head)

(head n)

info^clj

(info)

(info result-type)

inner-join^clj

(inner-join ds-right columns-selector)

(inner-join ds-right columns-selector options)

intersect^clj

(intersect ds-right)

(intersect ds-right options)

join-columns^clj

(join-columns target-column columns-selector)

(join-columns target-column columns-selector conf)

last^clj

(last)

left-join^clj

(left-join ds-right columns-selector)

(left-join ds-right columns-selector options)

map-columns^clj

(map-columns column-name map-fn)

(map-columns column-name columns-selector map-fn)

(map-columns column-name new-type columns-selector map-fn)

mark-as-group^clj

(mark-as-group)

Add grouping tag

order-by^clj

(order-by columns-or-fn)

(order-by columns-or-fn comparators)

(order-by columns-or-fn comparators options)

Order dataset by:

- column name
- columns (as sequence of names)
- key-fn
- sequence of columns / key-fn

Additionally you can specify the order as:

- `:asc`
- `:desc`
- custom comparator function
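Ordering by several columns with a direction per column can be sketched as follows. `order-rows` is a hypothetical helper over a vector of row maps, not the library's code; it models only `:asc` and `:desc` (no custom comparators) and assumes keyword column names with mutually comparable values.

```clojure
(defn order-rows
  "Sort row maps by the keys in `ks`; `orders` gives :asc or :desc for each key."
  [rows ks orders]
  (let [cmp (fn [x y]
              ;; Take the first non-zero comparison; later keys break ties.
              (or (first (remove zero?
                                 (map (fn [k order]
                                        (if (= order :desc)
                                          (compare (k y) (k x))
                                          (compare (k x) (k y))))
                                      ks orders)))
                  0))]
    (vec (sort cmp rows))))

(order-rows [{:a 1 :b 2} {:a 2 :b 1} {:a 1 :b 3}] [:a :b] [:asc :desc])
;; => [{:a 1, :b 3} {:a 1, :b 2} {:a 2, :b 1}]
```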

pivot->longer^clj

(pivot->longer)

(pivot->longer columns-selector)

(pivot->longer columns-selector options)

`tidyr` `pivot_longer` API.

pivot->wider^clj

(pivot->wider columns-selector value-columns)

(pivot->wider columns-selector value-columns options)

print-dataset^clj

(print-dataset)

(print-dataset options)

process-all-api-symbols^cljmacro

(process-all-api-symbols)

process-group-data^clj

(process-group-data f)

(process-group-data f parallel?)

rand-nth^clj

(rand-nth)

(rand-nth options)

random^clj

(random)

(random n)

(random n options)

read-nippy^clj

(read-nippy)

rename-columns^clj

(rename-columns columns-mapping)

(rename-columns columns-selector columns-map-fn)

Rename columns with provided old -> new name map

reorder-columns^clj

(reorder-columns columns-selector & args)

Reorder columns using column selector(s). When the selected column names are incomplete, the missing columns are appended at the end.

replace-missing^clj

(replace-missing)

(replace-missing strategy)

(replace-missing columns-selector strategy)

(replace-missing columns-selector strategy value)

right-join^clj

(right-join ds-right columns-selector)

(right-join ds-right columns-selector options)

row-count^clj

(row-count)

rows^clj

(rows)

(rows result-type)

Returns rows of dataset. Result type can be any of:

* `:as-maps`
* `:as-double-arrays`
* `:as-seqs`

select^clj

(select columns-selector rows-selector)

Select columns and rows.

select-columns^clj

(select-columns)

(select-columns columns-selector)

(select-columns columns-selector meta-field)

Select columns by (returns dataset):

- name
- sequence of names
- map of names with new names (rename)
- function which filters names (via column metadata)

select-missing^clj

(select-missing)

(select-missing columns-selector)

Select rows with missing values.

`columns-selector` selects the columns to check for missing values.

select-rows^clj

(select-rows)

(select-rows rows-selector)

(select-rows rows-selector options)

Select rows using:

- row id
- seq of row ids
- seq of true/false
- fn with predicate
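The four selector kinds can be modeled in plain Clojure on a vector of row maps. `selected-indexes`, `select-rows*` and `drop-rows*` are hypothetical helpers, not the library's implementation, and the sketch assumes that a predicate receives each row as a map.

```clojure
(defn selected-indexes
  "Resolve a rows selector against `rows` (a vector of row maps) to row indexes."
  [rows selector]
  (cond
    ;; Single row id.
    (integer? selector) [selector]
    ;; Predicate applied to each row.
    (fn? selector) (vec (keep-indexed (fn [i row] (when (selector row) i)) rows))
    ;; Seq of true/false, one flag per row.
    (every? boolean? selector)
    (vec (keep-indexed (fn [i keep?] (when keep? i)) selector))
    ;; Otherwise: seq of row ids.
    :else (vec selector)))

(defn select-rows* [rows selector]
  (mapv rows (selected-indexes rows selector)))

(defn drop-rows* [rows selector]
  (let [dropped (set (selected-indexes rows selector))]
    (vec (keep-indexed (fn [i row] (when-not (dropped i) row)) rows))))

(def data [{:v 1} {:v 2} {:v 3} {:v 4}])

(select-rows* data 2)                       ; => [{:v 3}]
(select-rows* data [0 3])                   ; => [{:v 1} {:v 4}]
(select-rows* data [true false true false]) ; => [{:v 1} {:v 3}]
(select-rows* data #(even? (:v %)))         ; => [{:v 2} {:v 4}]
(drop-rows* data [0 3])                     ; => [{:v 2} {:v 3}]
```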

semi-join^clj

(semi-join ds-right columns-selector)

(semi-join ds-right columns-selector options)

separate-column^clj

(separate-column column separator)

(separate-column column target-columns separator)

(separate-column column target-columns separator conf)

set-dataset-name^clj

(set-dataset-name ds-name)

shape^clj

(shape)

Returns shape of the dataset [rows, cols]

shuffle^clj

(shuffle)

(shuffle options)

split^clj

(split)

(split split-type)

(split split-type options)

Split given dataset into 2 or more (holdout) splits.

As the result two new columns are added:

* `:$split-name` - with subgroup name
* `:$split-id` - fold id/repetition id

`split-type` can be one of the following:

* `:kfold` - k-fold strategy, `:k` defines number of folds (defaults to `5`), produces `k` splits
* `:bootstrap` - `:ratio` defines ratio of observations put into result (defaults to `1.0`), produces `1` split
* `:holdout` - split into two parts with given ratio (defaults to `2/3`), produces `1` split
* `:loo` - leave one out, produces the same number of splits as number of observations

`:holdout` can also accept probabilities or ratios and can split into more than 2 subdatasets.

Additionally you can provide:

* `:seed` - for random number generator
* `:repeats` - repeat procedure `:repeats` times
* `:partition-selector` - same as in `group-by` for stratified splitting to reflect dataset structure in splits.
* `:split-names` - names of subdatasets different than default, i.e. `[:train :test :split-2 ...]`
* `:split-col-name` - a column where name of split is stored, either `:train` or `:test` values (default: `:$split-name`)
* `:split-id-col-name` - a column where id of the train/test pair is stored (default: `:$split-id`)

Rows are shuffled before splitting.

In case of grouped dataset each group is processed separately.

See [more](https://www.mitpressjournals.org/doi/pdf/10.1162/EVCO_a_00069)
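To make the two added columns concrete, here is a plain-Clojure sketch of a k-fold split over a vector of row maps. `kfold-split` is a hypothetical helper and not the library's algorithm: assigning shuffled row `idx` to fold `(mod idx k)` is an assumption of this sketch, and only the `:kfold` type with a seed is modeled.

```clojure
(defn kfold-split
  "Return k copies of the shuffled rows, each tagged with :$split-name
  (:train or :test) and :$split-id (the fold number)."
  [rows k seed]
  (let [shuffled (let [al (java.util.ArrayList. ^java.util.Collection rows)]
                   ;; Rows are shuffled before splitting; the seed makes it repeatable.
                   (java.util.Collections/shuffle al (java.util.Random. seed))
                   (vec al))]
    (vec (for [split-id  (range k)
               [idx row] (map-indexed vector shuffled)]
           (assoc row
                  ;; Assumption: row idx is the test row of fold (mod idx k).
                  :$split-name (if (= (mod idx k) split-id) :test :train)
                  :$split-id   split-id)))))

(def folds (kfold-split (mapv (fn [i] {:id i}) (range 10)) 5 42))

(count folds)                                     ; => 50
(count (filter #(= :test (:$split-name %)) folds)) ; => 10
```

Each of the 5 splits contains all 10 rows, 2 of them marked `:test`, and every row is a test row in exactly one split.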

split->seq^clj

(split->seq)

(split->seq split-type)

(split->seq split-type options)

Returns split as a sequence of train/test datasets or map of sequences (grouped dataset)

tail^clj

(tail)

(tail n)

ungroup^clj

(ungroup)

(ungroup options)

Concat groups into dataset.

When `add-group-as-column` or `add-group-id-as-column` is set to `true` or name(s), columns with group name(s) or group id are added to the result.

Before joining, the groups can be sorted by group name.

union^clj

(union & args)

unique-by^clj

(unique-by)

(unique-by columns-selector)

(unique-by columns-selector options)

unmark-group^clj

(unmark-group)

Remove grouping tag

unroll^clj

(unroll columns-selector)

(unroll columns-selector options)

Unfolds sequences stored inside column(s), turning them into multiple rows. Opposite of [[fold-by]].
Add each of the provided columns to the set that defines the "unique key" of each row.
Thus there will be a new row for each value inside the target column(s)' value sequence.
If you want instead to split the content of the columns into a set of new _columns_, look at [[separate-column]].
See https://scicloj.github.io/tablecloth/index.html#Unroll
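A plain-Clojure sketch of unrolling a single column on a vector of row maps. `unroll-rows` is a hypothetical helper, not the library's implementation, and it handles one column only.

```clojure
(defn unroll-rows
  "Emit one row per element of the sequence stored in column `k`;
  all other columns are repeated unchanged."
  [rows k]
  (vec (mapcat (fn [row] (map #(assoc row k %) (get row k)))
               rows)))

(unroll-rows [{:g :a :x [1 3]} {:g :b :x [2]}] :x)
;; => [{:g :a, :x 1} {:g :a, :x 3} {:g :b, :x 2}]
```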

update-columns^clj

(update-columns columns-map)

(update-columns columns-selector update-functions)

write!^clj

(write! output-path)

(write! output-path options)

Write a dataset out to a file. Supported forms are:

```clojure
(ds/write! test-ds "test.csv")
(ds/write! test-ds "test.tsv")
(ds/write! test-ds "test.tsv.gz")
(ds/write! test-ds "test.nippy")
(ds/write! test-ds out-stream)
```

Options:

  * `:max-chars-per-column` - csv,tsv specific, defaults to 65536 - values longer than this will
     cause an exception during serialization.
  * `:max-num-columns` - csv,tsv specific, defaults to 8192 - If the dataset has more than this number of
     columns an exception will be thrown during serialization.
  * `:quoted-columns` - csv specific - sequence of column names that you would like to always have quoted.
  * `:file-type` - Manually specify the file type. This is usually inferred from the filename but if you
     pass in an output stream then you will need to specify the file type.
  * `:headers?` - if csv headers are written, defaults to true.

write-nippy!^clj

(write-nippy! filename)
