`(split ds)`
`(split ds split-type)`
`(split ds split-type {:keys [seed parallel?] :as opts})`

Split the given dataset into 2 or more (holdout) splits.

As a result, two new columns are added:

* `:$split-name` - subgroup name
* `:$split-id` - fold id/repetition id

`split-type` can be one of the following:

* `:kfold` - k-fold strategy, `:k` defines the number of folds (defaults to `5`), produces `k` splits
* `:bootstrap` - `:ratio` defines the ratio of observations put into the result (defaults to `1.0`), produces `1` split
* `:holdout` - split into two parts with the given ratio (defaults to `2/3`), produces `1` split
* `:loo` - leave one out, produces as many splits as there are observations

`:holdout` can also accept probabilities or ratios and can split into more than 2 subdatasets.

Additionally you can provide:

* `:seed` - seed for the random number generator
* `:repeats` - repeat the procedure `:repeats` times
* `:partition-selector` - same as in `group-by`, for stratified splitting that reflects the dataset structure in the splits
* `:split-names` - names of subdatasets, if different from the default, i.e. `[:train :test :split-2 ...]`
* `:split-col-name` - the column where the name of the split is stored, either `:train` or `:test` values (default: `:$split-name`)
* `:split-id-col-name` - the column where the id of the train/test pair is stored (default: `:$split-id`)

Rows are shuffled before splitting.

In the case of a grouped dataset, each group is processed separately.

See [more](https://www.mitpressjournals.org/doi/pdf/10.1162/EVCO_a_00069)
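A minimal sketch of how `split` might be called, assuming `tablecloth.api` is on the classpath and aliased as `tc`. The toy dataset, the seed value, and the custom column names `:part` and `:round` are made up for illustration; only the option keys come from the docstring above.

```clojure
(require '[tablecloth.api :as tc])

;; Hypothetical toy dataset: 9 rows, two columns.
(def ds (tc/dataset {:x     (range 9)
                     :label (take 9 (cycle [:a :b :c]))}))

;; 3-fold split with a fixed seed so the shuffle is reproducible.
(def folds (tc/split ds :kfold {:k 3 :seed 42}))

;; The result is a single dataset; every row is tagged with the
;; split it belongs to and the id of the fold.
(tc/column-names folds)
(distinct (folds :$split-name))

;; Holdout with the default ratio (2/3 train, 1/3 test).
(def holdout (tc/split ds :holdout {:seed 42}))

;; The marker columns can be renamed through the options map.
(def custom (tc/split ds :holdout {:seed              42
                                   :split-col-name    :part
                                   :split-id-col-name :round}))
```

Because each k-fold repetition contains both a train and a test part, the resulting dataset has more rows than the input; filtering on the split id column selects one train/test pair.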
`(split->seq ds)`
`(split->seq ds split-type)`
`(split->seq ds split-type {:keys [split-col-name split-id-col-name] :or {split-col-name :$split-name split-id-col-name :$split-id} :as opts})`

Returns the split as a sequence of train/test datasets, or as a map of sequences for a grouped dataset.
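A short sketch of iterating over the result of `split->seq`, assuming `tablecloth.api` is aliased as `tc`. The dataset and seed are hypothetical, and the sketch assumes each element of the sequence is a map keyed by split name (`:train`, `:test`); check the actual return value in your version before relying on that shape.

```clojure
(require '[tablecloth.api :as tc])

;; Hypothetical toy dataset with 9 rows.
(def ds (tc/dataset {:x (range 9)}))

;; One element per fold (3 here, since :k is 3).
(def pairs (tc/split->seq ds :kfold {:k 3 :seed 42}))

;; Assumption: each element is a map with :train and :test datasets.
(doseq [{:keys [train test]} pairs]
  (println "train rows:" (tc/row-count train)
           "test rows:"  (tc/row-count test)))
```

This form is convenient for cross-validation loops, where each train/test pair is fitted and evaluated independently.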