
tech.ml.dataset

The simplest dataset description we have found is a sequence of maps.

Using this definition, things like k-fold splitting have natural interpretations.

While this works well for Clojure generation/manipulation, ML interfaces uniformly require this sequence to be coalesced somehow into larger buffers; sometimes on a point-by-point basis and sometimes into batching buffers. This file is intended to provide direct, simple tools for producing either type of coalesced information.

Care has been taken to keep certain operations lazy so that datasets of unbounded length can be manipulated. Operations like auto-scaling, however, will read the dataset into memory.
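
For example, the following is a complete dataset in this representation; the keys and values are purely illustrative:

;; A dataset is nothing more than a sequence of maps; any Clojure code
;; that produces such a sequence produces a dataset.
(def example-dataset
  [{:x1 5.1 :x2 3.5 :species :setosa}
   {:x1 6.2 :x2 2.9 :species :versicolor}
   {:x1 5.9 :x2 3.0 :species :virginica}])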


->k-fold-datasets

(->k-fold-datasets k
                   {:keys [randomize-dataset?] :or {randomize-dataset? true}}
                   dataset)

Given 1 dataset, prepare K datasets using the k-fold algorithm. Randomizing the dataset defaults to true and will realize the entire dataset, so use with care if you have large datasets.
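
A minimal usage sketch, assuming the functions live in the tech.ml.dataset namespace named in the page title and reusing the illustrative example-dataset defined above; the alias and argument values are not prescribed by the library:

(require '[tech.ml.dataset :as ds])

;; Produce 5 folds from the example dataset.  Passing
;; {:randomize-dataset? false} avoids realizing the whole dataset, which
;; matters when the dataset is large.
(def folds
  (ds/->k-fold-datasets 5 {:randomize-dataset? false} example-dataset))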


->train-test-split

(->train-test-split {:keys [randomize-dataset? train-fraction]
                     :or {randomize-dataset? true train-fraction 0.7}}
                    dataset)
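
A usage sketch following the signature above, continuing with the ds alias and example-dataset from earlier; the 0.7 train fraction is simply the documented default made explicit:

;; 70/30 train/test split without shuffling.
(def split
  (ds/->train-test-split {:randomize-dataset? false :train-fraction 0.7}
                         example-dataset))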

apply-dataset-options

(apply-dataset-options feature-keys label-keys options dataset)

Apply dataset options to a dataset, producing a coalesced dataset and a new options map. A coalesced dataset is a dataset where all the feature keys are coalesced into a contiguous ::features member and all the labels are coalesced into a contiguous ::labels member.

Transformations:

If the dataset has nominal (not numeric) data then this data is converted into integer data and the original keys are mapped to the indexes. This is recorded in :label-map.

Some global information about the dataset is recorded:
::dataset-info {:value-ecount - ecount of the feature vector.
                :key-ecount-map - map of keys to ecounts for all keys.}

:feature-keys - normalized feature keys.
:label-keys - normalized label keys.

:range-map - if passed in, the coalesced ::features or ::label values are set to the ranges specified in the map. This means a min-max pass is performed and per-element scaling is done. See the tests for an example. The result of a range-map operation is a per-element scale map.

:scale-map - if passed in, this is a map of #{::features ::label} to a scaling operation:
     (-> (ct/clone v)
         (ops/- (:per-elem-subtract scale-entry))
         (ops// (:per-elem-div scale-entry))
         (ops/+ (:per-elem-bias scale-entry)))
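
A sketch of the calling convention, continuing with the ds alias and example-dataset from earlier; the feature/label keys are illustrative and the exact shape of the returned coalesced dataset and options map is whatever the library produces:

;; Coalesce :x1 and :x2 into the ::features member and :species into the
;; label member.  Because :species is nominal, its values are converted
;; to integers and the mapping is recorded in :label-map (per the
;; docstring above).
(def coalesced-and-options
  (ds/apply-dataset-options [:x1 :x2] :species {} example-dataset))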


augment-dataset-with-stats

(augment-dataset-with-stats stats-map
                            nominal-feature-keywords
                            label-keyword
                            dataset)

calculate-nominal-stats

(calculate-nominal-stats nominal-feature-keywords label-keyword dataset)

Calculate the mean and variance of each value of each nominal-type feature as it relates to the regressed value. This is useful to provide a small set of derived features that directly relate to the regressed values and that often provide better learning.
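
A sketch of the calling convention, assuming (from the argument name) that the first argument is a collection of nominal feature keywords; the data and keys are illustrative:

;; Mean and variance of the regressed :price value for each distinct
;; value of the nominal :color feature.
(def color-stats
  (ds/calculate-nominal-stats
   [:color]
   :price
   [{:color :red  :price 10.0}
    {:color :red  :price 12.0}
    {:color :blue :price 20.0}]))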


check-dataset-datatypes

(check-dataset-datatypes dataset)

Check that the datatypes of the rest of the dataset match the datatypes of the first entry.
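
A sketch of the calling convention; how a mismatch is reported (return value versus exception) is up to the library:

;; The second entry's :y is a keyword while the first entry's :y is a
;; double, so this dataset's datatypes do not match its first entry.
(ds/check-dataset-datatypes
 [{:x 1.0 :y 2.0}
  {:x 3.0 :y :oops}])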


coalesce-dataset

(coalesce-dataset feature-keys
                  label-keys
                  {:keys [datatype unchecked? container-fn queue-depth
                          batch-size keep-extra?]
                   :or {datatype :float64
                        unchecked? true
                        container-fn dtype/make-array-of-type
                        queue-depth 0
                        batch-size 1}
                   :as options}
                  dataset)

Take a dataset and produce a sequence of value/label maps where the entries are coalesced items of the dataset. Ecounts are always checked.

Options are:
  :datatype - datatype to use.
  :unchecked? - true for faster conversions to the container.
  :scalar-label? - true if the label should be a single scalar value.
  :container-fn - container constructor with prototype:
     (container-fn datatype elem-count {:keys [unchecked?] :as options})
  :queue-depth - parallelism used for coalescing; see tech.parallel/queued-pmap.
     This is useful when the data sequence involves cpu-intensive or blocking
     transformations (loading large images, scaling them, etc.) and the train/test
     method is relatively fast in comparison. Defaults to 0, in which case queued-pmap
     turns into just map.
  :batch-size - nil - point-by-point conversion.
              - number - items are coalesced into batches of the given size. The options
                map passed in is passed on to dataset->batched-dataset.
  :keep-extra? - keep extra data in the items. Defaults to true. Allows users to assoc
                 extra information into each data item for things like visualizations.

Returns a sequence of
  {::features - container of datatype
   :labels - container or scalar}
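
A usage sketch with purely numeric, illustrative data; the :datatype shown is just the documented default made explicit and the other options keep their defaults:

(def numeric-dataset
  [{:x1 0.5 :x2 1.5 :y 2.0}
   {:x1 0.7 :x2 1.2 :y 2.4}])

;; Coalesce :x1/:x2 into the ::features container and :y into the label
;; container for each item of the dataset.
(def coalesced-seq
  (ds/coalesce-dataset [:x1 :x2] [:y]
                       {:datatype :float64}
                       numeric-dataset))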


force-keyword

(force-keyword value
               &
               {:keys [missing-value-placeholder]
                :or {missing-value-placeholder -1}})

Force a value to a keyword. Often data is encoded backwards, with nominal values represented by numbers; this removes important information from a dataset. If a particular column is categorical, it should be represented by a keyword.
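
A sketch of the calling convention taken from the signature; the exact keyword produced for a non-keyword value is up to the library:

;; Force categorical values to keywords so a categorical column is not
;; silently treated as numeric.  The last call shows the
;; :missing-value-placeholder option from the signature.
(ds/force-keyword :red)
(ds/force-keyword 2)
(ds/force-keyword nil :missing-value-placeholder -1)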


get-dataset-item

(get-dataset-item dataset-entry item-key {:keys [label-map]})

min-max-map->scale-map

(min-max-map->scale-map min-max-map range-map)

normalize-keys

(normalize-keys kwd-or-seq)

per-parameter-dataset-min-max

(per-parameter-dataset-min-max batch-size coalesced-dataset)

Create a new (coalesced) dataset with parameters scaled. If label range is not provided then labels are left unscaled.


per-parameter-scale-coalesced-dataset

(per-parameter-scale-coalesced-dataset scale-map coalesced-dataset)

Scale a coalesced dataset in place.
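
Based on the function names and signatures above, the min-max, scale-map, and scaling functions might chain as follows. This is a sketch only: the nil batch-size, the {:tech.ml.dataset/features [-1 1]} range-map shape, and the reuse of coalesced-seq from the coalesce-dataset example are all assumptions, not documented behavior:

;; Compute per-element min/max over an already-coalesced dataset, turn
;; it into a per-element scale map targeting the range [-1 1], then
;; scale the coalesced dataset with it.
(def min-max
  (ds/per-parameter-dataset-min-max nil coalesced-seq))           ;; nil batch-size is an assumption

(def scale-map
  (ds/min-max-map->scale-map min-max
                             {:tech.ml.dataset/features [-1 1]})) ;; range-map shape is a guess

(def scaled
  (ds/per-parameter-scale-coalesced-dataset scale-map coalesced-seq))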


sequence->iterator

(sequence->iterator item-seq)

Java ML interfaces sometimes use iterators where they really should use sequences (iterators have state). In any case, we do what we can.
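
A usage sketch, assuming the result behaves like a java.util.Iterator as the docstring implies; the input sequence is illustrative:

;; Wrap a (possibly lazy) Clojure sequence so Java ML interfaces that
;; expect an iterator can consume it.
(def item-iter (ds/sequence->iterator (map inc (range 5))))
(.hasNext item-iter)   ;; consume via the standard Iterator methods
(.next item-iter)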

