Column-major dataset abstraction for efficiently manipulating in-memory datasets.
(->dataset dataset)
(->dataset dataset {:keys [table-name] :or {table-name "_unnamed"} :as options})
(->flyweight dataset
&
{:keys [column-name-seq error-on-missing-values? number->string?]
:or {error-on-missing-values? true}})
Convert a dataset to a seq-of-maps dataset. The error-on-missing-values? flag indicates whether errors should be thrown on missing values or whether nil should be inserted in the map. If a label map is passed in, then for the columns that are present in the label map a reverse mapping is done so that the flyweight maps contain the labels and not their encoded values.
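The conversion described above can be sketched in plain Clojure. This is only an illustration, not the library's implementation: the dataset is modeled as an ordinary map of column name to value vector, and `columns->maps` and its three arguments are hypothetical names.

```clojure
(require '[clojure.set :as set])

;; Hypothetical helper (not part of the library): turn a map of
;; column-name -> vector of values into a sequence of row maps.
;; label-map is assumed to be {column-name {label encoded-value}}.
(defn columns->maps
  [columns label-map error-on-missing?]
  (let [n-rows (count (val (first columns)))]
    (for [idx (range n-rows)]
      (into {}
            (for [colname (keys columns)
                  :let [v       (get-in columns [colname idx])
                        ;; Invert {label value} to {value label} when present.
                        inverse (some-> (get label-map colname) set/map-invert)]]
              (do
                (when (and error-on-missing? (nil? v))
                  (throw (ex-info "Missing value" {:column colname :row idx})))
                ;; Fall back to the raw value when there is no label for it.
                [colname (if inverse (get inverse v v) v)]))))))

(def flyweight-columns {:pet [0 1 nil] :age [3 5 7]})
(def flyweight-labels {:pet {"cat" 0 "dog" 1}})
(def flyweight-rows (columns->maps flyweight-columns flyweight-labels false))
```

With the flag set to false the missing entry comes through as nil; with it set to true, realizing the sequence throws at the first missing value.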
(->k-fold-datasets
dataset
k
{:keys [randomize-dataset?] :or {randomize-dataset? true} :as options})
Given 1 dataset, prepare K datasets using the k-fold algorithm. randomize-dataset? defaults to true, which will realize the entire dataset, so use with care if you have large datasets.
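As a rough illustration of the k-fold idea, here is a plain-Clojure sketch over a sequence of rows. It is not the library's code: `k-fold-splits`, its `randomize?` argument, and the `:train` / `:test` keys are assumptions made for the example. Note that `shuffle` has to realize the whole collection, which is the cost the docstring warns about.

```clojure
;; Hypothetical helper (not part of the library): row i goes to the test set
;; of fold (mod i k) and to the training set of every other fold.
(defn k-fold-splits
  [rows k randomize?]
  (let [rows    (if randomize? (shuffle rows) rows)
        indexed (map-indexed vector rows)]
    (for [fold (range k)]
      {:test  (vec (for [[i row] indexed :when (= fold (mod i k))] row))
       :train (vec (for [[i row] indexed :when (not= fold (mod i k))] row))})))

;; Randomization is switched off here so the result is deterministic.
(def k-fold-example (k-fold-splits (range 10) 5 false))
```

Each of the 5 resulting maps holds 2 test rows and the remaining 8 rows for training.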
(->row-major dataset options)
(->row-major dataset
key-colname-seq-map
{:keys [datatype] :or {datatype :float64}})
Given a dataset and a map of desired key names to sequences of columns, produce a sequence of maps where each key name points to a contiguous vector composed of the column values concatenated. If key-colname-seq-map is not provided then each row defaults to {:features [feature-columns] :label [label-columns]}
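The shape of the row-major output can be sketched as follows. This is an assumption-laden illustration rather than the library's implementation: the dataset is modeled as a plain map of column name to vector, `rows-from-columns` is a hypothetical name, and persistent vectors of doubles stand in for whatever contiguous container the library produces for :float64.

```clojure
;; Hypothetical helper (not part of the library): for each row index, build a
;; map whose values are the listed columns' entries concatenated in order.
(defn rows-from-columns
  [columns key->colnames]
  (let [n-rows (count (val (first columns)))]
    (for [row-idx (range n-rows)]
      (into {}
            (for [[k colnames] key->colnames]
              [k (mapv #(double (get-in columns [% row-idx])) colnames)])))))

(def row-major-example
  (rows-from-columns {:a [1 2] :b [3 4] :y [0 1]}
                     {:features [:a :b] :label [:y]}))
```

The first row is `{:features [1.0 3.0] :label [0.0]}`: columns :a and :b contribute one value each to :features, and :y supplies :label.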
(->train-test-split dataset
{:keys [randomize-dataset? train-fraction]
:or {randomize-dataset? true train-fraction 0.7}
:as options})
(add-column dataset column)
Add a new column. Error on name collision.
(add-or-update-column dataset column)
If column exists, replace. Else append new column.
(column dataset column-name)
Return the column or throw if it doesn't exist.
(column-map datatypes)
Clojure map of column-name->column
(column-names dataset)
In-order sequence of column names
(column-values->categorical dataset src-column)
Given a column encoded via either string->number or one-hot, reverse map to a sequence of the original string column values.
(columns dataset)
Return sequence of all columns in dataset.
(columns-with-missing-seq dataset)
Return a sequence of: {:column-name column-name :missing-count missing-count} or nil if no columns are missing data.
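A minimal sketch of that return shape, assuming a dataset modeled as a plain map of column name to vector with nil marking a missing value. `missing-summary` is a hypothetical name, not a library function.

```clojure
;; Hypothetical helper (not part of the library): report only the columns
;; that contain at least one nil; `seq` turns an empty result into nil.
(defn missing-summary
  [columns]
  (seq (for [[colname values] columns
             :let [missing (count (filter nil? values))]
             :when (pos? missing)]
         {:column-name colname :missing-count missing})))

(def missing-example (missing-summary {:a [1 nil 3] :b [1 2 3]}))
(def no-missing-example (missing-summary {:b [1 2 3]}))
```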
(compute-centroid-and-global-means dataset row-major-centroids)
Return a map of: centroid-means - centroid-index -> (double array) column means. global-means - global means (double array) for the dataset.
(correlation-table dataset & [correlation-type])
Return a map of colname->sorted list of [colname, coefficient] tuples. Sort is: (sort-by (comp #(Math/abs (double %)) second) >)
Thus the first entry is: [colname, 1.0]
There are three possible correlation types: :pearson :spearman :kendall
:pearson is the default.
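To see what that sort expression does, it can be applied to a hand-written list of tuples. The coefficient values below are made up for illustration only; they are not output from the library.

```clojure
;; Made-up [colname coefficient] tuples for a single column "a".
(def coefficients [["b" -0.9] ["a" 1.0] ["c" 0.2] ["d" -0.5]])

;; Descending order of absolute coefficient, so strong negative
;; correlations rank alongside strong positive ones.
(def sorted-coefficients
  (sort-by (comp #(Math/abs (double %)) second) > coefficients))
```

The result is `(["a" 1.0] ["b" -0.9] ["d" -0.5] ["c" 0.2])`: the column's correlation with itself comes first, and the sign is preserved in the output even though it is ignored for ordering.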
(ds-column-map map-fn first-ds & ds-seq)
Map a function columnwise across datasets and produce a new column sequence. Note this does not produce a new dataset, as that would preclude remove/filter on nil values.
(ds-filter predicate dataset & [column-name-seq])
dataset->dataset transformation
(ds-group-by key-fn dataset & [column-name-seq])
Produce a map of key-fn-value->dataset. key-fn is a function taking Y values where Y is the count of column-name-seq or :all.
(ds-map-values dataset map-fn & [column-name-seq])
Note this returns a sequence, not a dataset.
(ds-sort-by key-fn dataset)
(ds-sort-by key-fn compare-fn dataset)
(ds-sort-by key-fn compare-fn dataset column-name-seq)
(feature-ecount dataset)
When columns aren't scalars then this will change. For now, just the number of feature columns.
(from-prototype dataset table-name column-seq)
Create a new dataset that is the same type as this one but with a potentially different table name and column sequence. Take care that the columns are all of the correct type.
(g-means dataset & [max-k error-on-missing?])
g-means. Not NaN-aware; missing values are an error. Returns array of centroids in row-major array-of-array-of-doubles format.
(impute-missing-by-centroid-averages dataset
row-major-centroids
{:keys [centroid-means global-means]})
Impute missing columns by first grouping by nearest centroids and then computing the mean. In the case where the grouping for a given centroid contains all NaN's, use the global dataset mean. In the case where this is NaN, this algorithm will fail to replace the missing values with meaningful values. Return a new dataset.
(inference-target-label-inverse-map dataset & [label-columns])
Given options generated during ETL operations and annotated with a :label-columns sequence containing 1 label column, generate a reverse map that maps from a dataset value back to the label that generated that value.
(k-means dataset & [k max-iterations num-runs error-on-missing?])
NaN-aware k-means. Returns array of centroids in row-major array-of-array-of-doubles format.
(labels dataset)
Given a dataset and an options map, generate a sequence of label-values. If label count is 1, then if there is a label-map associated with column generate sequence of labels by reverse mapping the column(s) back to the original dataset values. If there are multiple label columns results are presented in flyweight (sequence of maps) format.
(maybe-column dataset column-name)
Return the column if it exists, otherwise nil.
(model-type dataset & [column-name-seq])
Check the label column after dataset processing. Return either :regression or :classification.
(new-column dataset column-name values)
(new-column dataset
column-name
values
{:keys [datatype container-type]
:or {container-type :tablesaw-column}
:as options})
Create a new column from some values.
(num-inference-classes dataset)
Given a dataset and correctly built options from pipeline operations, return the number of classes used for the label. Error if not classification dataset.
(order-column-names dataset colname-seq)
Order a sequence of column names so they match the order in the original dataset. Missing columns are placed last.
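The ordering rule can be sketched in a few lines of plain Clojure. `order-names` is a hypothetical stand-in, not the library's implementation, and it takes the dataset's column names directly instead of a dataset.

```clojure
;; Hypothetical helper (not part of the library): sort names by their
;; position in the dataset; unknown names get a position past the end.
(defn order-names
  [dataset-names colname-seq]
  (let [position (zipmap dataset-names (range))
        past-end (count dataset-names)]
    (sort-by #(get position % past-end) colname-seq)))

(def ordered-names (order-names [:a :b :c :d] [:z :c :a]))
```

Here `ordered-names` is `(:a :c :z)`. Because `sort-by` is stable, names that are absent from the dataset keep their relative order at the end.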
(reduce-column-names dataset colname-seq)
Reverse map from the one-hot encoded columns to the original source column.
(select dataset colname-seq index-seq)
Reorder/trim dataset according to this sequence of indexes. Returns a new dataset. colname-seq - either keyword :all or list of column names with no duplicates. index-seq - either keyword :all or list of indexes. May contain duplicates.
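The semantics of colname-seq and index-seq can be illustrated on a plain map of column name to vector. This is a sketch under that assumption; `select-rows-and-columns` is a hypothetical name and does not reflect how the library stores columns.

```clojure
;; Hypothetical helper (not part of the library): keep the requested columns
;; and, within each, pick values by index. Indexes may repeat.
(defn select-rows-and-columns
  [columns colname-seq index-seq]
  (let [colnames (if (= :all colname-seq) (keys columns) colname-seq)]
    (into {}
          (for [colname colnames
                :let [values (get columns colname)]]
            [colname (if (= :all index-seq)
                       values
                       ;; A vector is a function of its indexes.
                       (mapv values index-seq))]))))

(def select-example
  (select-rows-and-columns {:a [10 20 30] :b [1 2 3]} [:a] [2 0 0]))
```

`select-example` is `{:a [30 10 10]}`: column :b is dropped, and row 0 appears twice because index 0 is listed twice.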
(update-column dataset col-name update-fn)
Update a column returning a new dataset. update-fn is a column->column transformation. Error if column does not exist.
(update-columns dataset column-name-seq update-fn)
Update a sequence of columns.
(x-means dataset & [max-k error-on-missing?])
x-means. Not NaN-aware; missing values are an error. Returns array of centroids in row-major array-of-array-of-doubles format.