tech.ml.dataset

Column-major dataset abstraction for efficiently manipulating in-memory datasets.

->dataset

(->dataset dataset)
(->dataset dataset {:keys [table-name] :or {table-name "_unnamed"} :as options})

->flyweight

(->flyweight dataset
             &
             {:keys [column-name-seq error-on-missing-values? number->string?]
              :or {error-on-missing-values? true}})

Convert the dataset to a seq-of-maps dataset. The error-on-missing-values? flag indicates whether an error is thrown on missing values or nil is inserted into the map. If a label map is passed in, then for the columns present in the label map a reverse mapping is done so that the flyweight maps contain the labels and not their encoded values.
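
The column-major to seq-of-maps idea can be sketched on plain Clojure data. This is only an illustration, not the library's implementation: the `columns` map and the `columns->row-maps` helper are hypothetical stand-ins for a real dataset and for `->flyweight`.

```clojure
;; Hypothetical stand-in for a dataset: a plain map of column name -> vector.
(def columns {:a [1 2 3] :b [4.0 nil 6.0]})

(defn columns->row-maps
  "Turn column-major data into a lazy sequence of row maps.
  When error-on-missing-values? is true a nil value throws,
  otherwise the nil is kept in the row map."
  [columns error-on-missing-values?]
  (let [colnames (keys columns)]
    (apply map
           (fn [& row-values]
             (when (and error-on-missing-values? (some nil? row-values))
               (throw (ex-info "Missing value in row" {:row row-values})))
             (zipmap colnames row-values))
           (vals columns))))

(columns->row-maps columns false)
;; => ({:a 1, :b 4.0} {:a 2, :b nil} {:a 3, :b 6.0})
```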

->k-fold-datasets

(->k-fold-datasets
  dataset
  k
  {:keys [randomize-dataset?] :or {randomize-dataset? true} :as options})

Given 1 dataset, prepare K datasets using the k-fold algorithm. randomize-dataset? defaults to true, which realizes the entire dataset, so use with care on large datasets.
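
As a rough sketch of what a k-fold split means, the following works on row indexes only. `k-fold-indexes` is a hypothetical helper, not part of the library, and it assigns rows to folds round-robin without shuffling; a randomized variant would presumably shuffle the indexes first, which is why the whole dataset has to be realized.

```clojure
(defn k-fold-indexes
  "Split row indexes 0..n-1 into k folds (round-robin assignment).
  Each result map holds the :test-idx of one fold and the :train-idx
  of all remaining rows."
  [n k]
  (let [idxs (range n)
        fold-of (fn [idx] (mod idx k))]
    (for [fold (range k)]
      {:test-idx  (vec (filter #(= fold (fold-of %)) idxs))
       :train-idx (vec (remove #(= fold (fold-of %)) idxs))})))

(first (k-fold-indexes 6 3))
;; => {:test-idx [0 3], :train-idx [1 2 4 5]}
```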

->row-major

(->row-major dataset options)
(->row-major dataset
             key-colname-seq-map
             {:keys [datatype] :or {datatype :float64}})

Given a dataset and a map of desired key names to sequences of columns, produce a sequence of maps where each key name points to a contiguous vector composed of the column values concatenated. If key-colname-seq-map is not provided, each row defaults to:
{:features [feature-columns]
 :label [label-columns]}
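
The row-major layout described above can be illustrated with plain vectors. This is a sketch under assumptions: `columns` and `columns->row-major` are hypothetical names, values are coerced with `double` to mirror the `:float64` default, and the real function returns library containers rather than Clojure vectors.

```clojure
;; Hypothetical column-major data: column name -> vector of values.
(def columns {:x1 [1 2] :x2 [3 4] :y [0 1]})

(defn columns->row-major
  "For each row, build a map whose keys come from key-colname-seq-map and
  whose values are vectors of that row's values for the listed columns."
  [columns key-colname-seq-map]
  (let [n-rows (count (val (first columns)))]
    (for [row (range n-rows)]
      (into {}
            (for [[k colnames] key-colname-seq-map]
              [k (mapv #(double (get-in columns [% row])) colnames)])))))

(columns->row-major columns {:features [:x1 :x2] :label [:y]})
;; => ({:features [1.0 3.0], :label [0.0]} {:features [2.0 4.0], :label [1.0]})
```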

->train-test-split

(->train-test-split dataset
                    {:keys [randomize-dataset? train-fraction]
                     :or {randomize-dataset? true train-fraction 0.7}
                     :as options})

add-column

(add-column dataset column)

Add a new column. Error if name collision

add-or-update-column

(add-or-update-column dataset column)

If column exists, replace. Else append new column.

column

(column dataset column-name)

Return the column or throw if it doesn't exist.

column-label-map

(column-label-map dataset column-name)

column-map

(column-map datatypes)

Clojure map of column-name->column.

column-names

(column-names dataset)

In-order sequence of column names

column-values->categorical

(column-values->categorical dataset src-column)

Given a column encoded via either string->number or one-hot, reverse map to a sequence of the original string column values.

columns

(columns dataset)

Return sequence of all columns in dataset.

columns-with-missing-seq

(columns-with-missing-seq dataset)

Return a sequence of:
{:column-name column-name
 :missing-count missing-count}
or nil if no columns are missing data.

compute-centroid-and-global-means

(compute-centroid-and-global-means dataset row-major-centroids)

Return a map of:
centroid-means - centroid-index -> (double array) column means.
global-means - global means (double array) for the dataset.

correlation-table

(correlation-table dataset & [correlation-type])

Return a map of colname -> list of [colname, coefficient] tuples, sorted by:
(sort-by (comp #(Math/abs (double %)) second) >)

Thus the first entry is:
[colname, 1.0]

There are three possible correlation types:
:pearson
:spearman
:kendall

:pearson is the default.
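
To see what the quoted sort does, here it is applied to a hand-written list of tuples. The coefficient values are made up for illustration; only the `sort-by` expression comes from the docstring.

```clojure
;; Made-up [colname coefficient] tuples for one column "a".
(def coefficients [["c" 0.2] ["a" 1.0] ["d" 0.5] ["b" -0.9]])

;; Sort by absolute coefficient, largest first, so strong negative
;; correlations rank next to strong positive ones.
(def sorted-coefficients
  (sort-by (comp #(Math/abs (double %)) second) > coefficients))

sorted-coefficients
;; => (["a" 1.0] ["b" -0.9] ["d" 0.5] ["c" 0.2])
```

The column's correlation with itself is 1.0, which is why it always sorts first.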

dataset->string

(dataset->string ds)

dataset-label-map

(dataset-label-map dataset)

dataset-name

(dataset-name dataset)

ds-column-map

(ds-column-map map-fn first-ds & ds-seq)

Map a function columnwise across datasets and produce a new column sequence. Note this does not produce a new dataset, as that would preclude remove/filter on nil values.

ds-concat

(ds-concat dataset & other-datasets)

ds-filter

(ds-filter predicate dataset & [column-name-seq])

dataset->dataset transformation

ds-group-by

(ds-group-by key-fn dataset & [column-name-seq])

Produce a map of key-fn-value->dataset. key-fn is a function taking Y values, where Y is the count of column-name-seq or :all.
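
One possible reading of this contract, sketched on plain row maps: key-fn is called with one argument per selected column, and rows that share a key end up in the same group. That reading is an assumption, and `rows` and `group-rows` are hypothetical; the real function groups datasets, not vectors of maps.

```clojure
;; Hypothetical rows standing in for a dataset.
(def rows [{:species "cat" :weight 4.0}
           {:species "dog" :weight 20.0}
           {:species "cat" :weight 5.0}])

(defn group-rows
  "Group rows by (key-fn v1 v2 ...), where the v's are the row's values
  for the names in column-name-seq, in that order."
  [key-fn rows column-name-seq]
  (group-by (fn [row] (apply key-fn (map row column-name-seq))) rows))

(keys (group-rows identity rows [:species]))
;; => ("cat" "dog")
```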

ds-map-values

(ds-map-values dataset map-fn & [column-name-seq])

Note this returns a sequence, not a dataset.

ds-sort-by

(ds-sort-by key-fn dataset)
(ds-sort-by key-fn compare-fn dataset)
(ds-sort-by key-fn compare-fn dataset column-name-seq)

ds-take-nth

(ds-take-nth n-val dataset)

feature-ecount

(feature-ecount dataset)

When columns aren't scalars then this will change. For now, just the number of feature columns.

from-prototype

(from-prototype dataset table-name column-seq)

Create a new dataset that is the same type as this one but with a potentially different table name and column sequence. Take care that the columns are all of the correct type.

g-means

(g-means dataset & [max-k error-on-missing?])

g-means. Not NaN-aware; missing values are an error. Returns an array of centroids in row-major array-of-array-of-doubles format.

has-column-label-map?

(has-column-label-map? dataset column-name)

impute-missing-by-centroid-averages

(impute-missing-by-centroid-averages dataset
                                     row-major-centroids
                                     {:keys [centroid-means global-means]})

Impute missing values by first grouping rows by nearest centroid and then computing the per-group column mean. When the group for a given centroid contains only NaNs, use the global dataset mean. When the global mean is itself NaN, this algorithm will fail to replace the missing values with meaningful values. Return a new dataset.
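
The fallback order (group mean first, then global mean) can be sketched for a single column. This is a simplified illustration, not the library's code: the nearest-centroid assignment is assumed to be already computed and passed in as `groups`, and both helper names are hypothetical.

```clojure
(defn mean-ignoring-nan
  "Mean of the non-NaN values, or NaN when there are none."
  [values]
  (let [present (remove #(Double/isNaN %) values)]
    (if (seq present)
      (/ (reduce + present) (count present))
      Double/NaN)))

(defn impute-by-group-mean
  "values: vector of doubles for one column (NaN = missing).
  groups: vector giving each row's group (centroid) index.
  Replaces NaN with the group mean, or with the global mean when the
  whole group is NaN."
  [values groups]
  (let [global-mean (mean-ignoring-nan values)
        group-means (into {}
                          (for [[g idxs] (group-by groups (range (count values)))]
                            [g (mean-ignoring-nan (map values idxs))]))]
    (mapv (fn [v g]
            (if (Double/isNaN v)
              (let [m (group-means g)]
                (if (Double/isNaN m) global-mean m))
              v))
          values groups)))

(impute-by-group-mean [1.0 Double/NaN 3.0 Double/NaN] [0 0 1 2])
;; => [1.0 1.0 3.0 2.0]
```

Row 1 takes the mean of group 0, while row 3 is alone in group 2 and therefore falls back to the global mean.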

inference-target-label-inverse-map

(inference-target-label-inverse-map dataset & [label-columns])

Given options generated during ETL operations and annotated with a :label-columns sequence containing 1 label column, generate a reverse map that maps from a dataset value back to the label that generated that value.
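
The idea of a reverse label map can be shown with `clojure.set/map-invert`. The `label-map` below is a made-up example of a string->number encoding; the shape of the library's actual label maps may differ (for instance, encoded values may be doubles).

```clojure
(require '[clojure.set :as set])

;; Made-up forward encoding: original label -> encoded dataset value.
(def label-map {"cat" 0 "dog" 1})

;; Reverse map: encoded dataset value -> original label.
(def inverse-label-map (set/map-invert label-map))

;; Decode a column of encoded values back to labels.
(mapv inverse-label-map [1 0 0])
;; => ["dog" "cat" "cat"]
```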

inference-target-label-map

(inference-target-label-map dataset & [label-columns])

k-means

(k-means dataset & [k max-iterations num-runs error-on-missing?])

NaN-aware k-means. Returns an array of centroids in row-major array-of-array-of-doubles format.

labels

(labels dataset)

Given a dataset and an options map, generate a sequence of label values. If the label count is 1 and a label map is associated with that column, generate the sequence of labels by reverse mapping the column back to the original dataset values. If there are multiple label columns, results are presented in flyweight (sequence of maps) format.

maybe-column

(maybe-column dataset column-name)

Return either column if exists or nil.

metadata

(metadata dataset)

model-type

(model-type dataset & [column-name-seq])

Check the label column after dataset processing. Return either :regression or :classification.

new-column

(new-column dataset column-name values)
(new-column dataset
            column-name
            values
            {:keys [datatype container-type]
             :or {container-type :tablesaw-column}
             :as options})

Create a new column from some values.

num-inference-classes

(num-inference-classes dataset)

Given a dataset and correctly built options from pipeline operations, return the number of classes used for the label. Error if not classification dataset.

order-column-names

(order-column-names dataset colname-seq)

Order a sequence of column names so they match the order in the original dataset. Missing columns are placed last.
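
The ordering rule can be sketched with `sort-by` over each name's position in the dataset. `order-names` is a hypothetical helper that takes the dataset's column names directly instead of a dataset; names not found in the dataset get the largest possible sort key so they land last.

```clojure
(defn order-names
  "Sort colname-seq by position in dataset-colnames; unknown names last."
  [dataset-colnames colname-seq]
  (let [position (zipmap dataset-colnames (range))]
    (sort-by #(get position % Long/MAX_VALUE) colname-seq)))

(order-names [:a :b :c] [:c :zzz :a])
;; => (:a :c :zzz)
```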

reduce-column-names

(reduce-column-names dataset colname-seq)

Reverse map from the one-hot encoded columns to the original source column.

remove-column

(remove-column dataset col-name)

Fails quietly

remove-columns

(remove-columns dataset colname-seq)

select

(select dataset colname-seq index-seq)

Reorder/trim dataset according to this sequence of indexes. Returns a new dataset.
colname-seq - either keyword :all or list of column names with no duplicates.
index-seq - either keyword :all or list of indexes. May contain duplicates.
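
The selection semantics, including `:all` and duplicate indexes, can be illustrated on a plain map of column vectors. `select-rows-and-columns` is a hypothetical helper written for this sketch; it is not how the library implements `select`.

```clojure
(defn select-rows-and-columns
  "columns: map of column name -> vector of values.
  colname-seq and index-seq may each be :all or an explicit sequence.
  Indexes may repeat, which repeats the corresponding rows."
  [columns colname-seq index-seq]
  (let [colnames (if (= :all colname-seq) (keys columns) colname-seq)]
    (into {}
          (for [colname colnames
                :let [values  (get columns colname)
                      indexes (if (= :all index-seq)
                                (range (count values))
                                index-seq)]]
            [colname (mapv values indexes)]))))

(select-rows-and-columns {:a [10 20 30] :b [1 2 3]} [:a] [2 0 0])
;; => {:a [30 10 10]}
```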

select-columns

(select-columns dataset col-name-seq)

set-inference-target

(set-inference-target dataset target-name-or-target-name-seq)

set-metadata

(set-metadata dataset meta-map)

update-column

(update-column dataset col-name update-fn)

Update a column returning a new dataset. update-fn is a column->column transformation. Error if column does not exist.

update-columns

(update-columns dataset column-name-seq update-fn)

Update a sequence of columns.

x-means

(x-means dataset & [max-k error-on-missing?])

x-means. Not NaN-aware; missing values are an error. Returns an array of centroids in row-major array-of-array-of-doubles format.
