tech.ml.dataset

Clojure only.

apply-dataset-options
coalesce-dataset
dataset->k-fold-datasets
get-dataset-item
min-max-map->scale-map
per-parameter-dataset-min-max
per-parameter-scale-coalesced-dataset!
sequence->iterator

The simplest dataset description we have found is a sequence of maps.

Using this definition, things like k-fold have natural interpretations.

While this works well for Clojure generation/manipulation, ML interfaces uniformly require this sequence to be coalesced into larger buffers, sometimes on a point-by-point basis and sometimes into batching buffers. This file is intended to provide direct, simple tools that produce either type of coalesced data.

Care has been taken to keep certain operations lazy so that datasets of unbounded length can be manipulated.
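
As a minimal illustration of the "sequence of maps" definition, the sketch below builds an unbounded lazy dataset in plain Clojure. The keys :x, :y and :label are hypothetical and chosen only for this example; nothing here calls into tech.ml.dataset.

```clojure
;; A dataset in the sense used here: a (possibly lazy) sequence of maps.
;; The keys :x, :y and :label are hypothetical, chosen for illustration.
(def dataset
  (map (fn [i] {:x (double i)
                :y (* 2.0 i)
                :label (if (even? i) 1.0 0.0)})
       (range)))   ; unbounded; nothing is realized yet

;; Ordinary sequence functions apply, and only the items asked for
;; are ever realized.
(def first-three (vec (take 3 dataset)))
;; first-three
;; => [{:x 0.0, :y 0.0, :label 1.0}
;;     {:x 1.0, :y 2.0, :label 0.0}
;;     {:x 2.0, :y 4.0, :label 1.0}]
```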

apply-dataset-options^clj

(apply-dataset-options feature-keys label-keys options dataset)

coalesce-dataset^clj

(coalesce-dataset feature-keys
                  label-keys
                  {:keys [datatype unchecked? container-fn queue-depth
                          batch-size keep-extra?]
                   :or {datatype :float64
                        unchecked? true
                        container-fn dtype/make-array-of-type
                        queue-depth 0
                        batch-size 1}
                   :as options}
                  dataset)

Take a dataset and produce a sequence of values/labels maps where the entries are coalesced items of the dataset. Ecounts (element counts) are always checked.

Options are:
  datatype - datatype to use.
  unchecked? - true for faster conversions to container.
  scalar-label? - true if the label should be a single scalar value.
  container-fn - container constructor with prototype:
     (container-fn datatype elem-count {:keys [unchecked?] :as options})
  queue-depth - parallelism used for coalescing; see tech.parallel/queued-pmap.
    This is useful when the data sequence involves CPU-intensive or blocking
    transformations (loading large images, scaling them, etc.) and the train/test
    method is relatively fast in comparison. Defaults to 0, in which case
    queued-pmap turns into just map.
  batch-size - nil - point-by-point conversion.
             - number - items are coalesced into batches of the given size. The
                 options map passed in is passed to dataset->batched-dataset.
  keep-extra? - keep extra data in the items. Defaults to true. Allows users to
                assoc extra information into each data item for things like
                visualizations.

Returns a sequence of
  {:values - container of datatype
   :labels - container or scalar}
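
The point-by-point case can be pictured with a small plain-Clojure sketch. It is not the library's implementation: coalesce-keys and coalesce-item are hypothetical helpers, the container is fixed to a double array, and batching, queue-depth and element-count checking are left out.

```clojure
;; Hypothetical helper, not part of tech.ml.dataset: flatten the values
;; under `ks` (scalars or sequences of numbers) into one double array.
(defn coalesce-keys
  [item ks]
  (double-array
   (mapcat (fn [k]
             (let [v (get item k)]
               (if (sequential? v) v [v])))
           ks)))

;; Hypothetical helper: turn one dataset item into a {:values :labels} map.
(defn coalesce-item
  [feature-keys label-keys item]
  {:values (coalesce-keys item feature-keys)
   :labels (coalesce-keys item label-keys)})

;; `map` keeps the result lazy, so an unbounded dataset would also work.
(def coalesced
  (map (partial coalesce-item [:x :pixels] [:label])
       [{:x 1 :pixels [0.5 0.25] :label 1}
        {:x 2 :pixels [0.0 1.0] :label 0}]))

(vec (:values (first coalesced)))  ;; => [1.0 0.5 0.25]
(vec (:labels (first coalesced)))  ;; => [1.0]
```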

dataset->k-fold-datasets^clj

(dataset->k-fold-datasets k
                          {:keys [randomize-dataset?]
                           :or {randomize-dataset? true}}
                          dataset)

Given one dataset, prepare K datasets using the k-fold algorithm. randomize-dataset? defaults to true, which realizes the entire dataset, so use with care if you have large datasets.
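
The k-fold idea on a sequence of maps can be sketched in plain Clojure as follows. k-fold-sketch and the result keys :train-ds and :test-ds are assumptions made for this example, not the documented return shape of dataset->k-fold-datasets, and no shuffling is done (calling shuffle on the dataset first would approximate the randomized case and would likewise realize it).

```clojure
;; Hypothetical helper: split `dataset` into folds; each result map holds
;; one fold as the test set and the remaining folds as the training set.
;; May yield fewer than k folds when the item count does not divide evenly.
(defn k-fold-sketch
  [k dataset]
  (let [items     (vec dataset)                   ; realizes the dataset
        fold-size (max 1 (long (Math/ceil (/ (count items) (double k)))))
        folds     (vec (partition-all fold-size items))
        idxs      (range (count folds))]
    (for [i idxs]
      {:test-ds  (nth folds i)
       :train-ds (mapcat folds (remove #{i} idxs))})))

(def folds (k-fold-sketch 5 (map (fn [i] {:x i}) (range 10))))

(count folds)                      ;; => 5
(count (:test-ds (first folds)))   ;; => 2
(count (:train-ds (first folds)))  ;; => 8
```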

get-dataset-item^clj

(get-dataset-item dataset-entry item-key)

min-max-map->scale-map^clj

(min-max-map->scale-map min-max-map range-map)

per-parameter-dataset-min-max^clj

(per-parameter-dataset-min-max coalesced-dataset)

Create a new (coalesced) dataset with parameters scaled. If label range is not provided then labels are left unscaled.

per-parameter-scale-coalesced-dataset!^clj

(per-parameter-scale-coalesced-dataset! scale-map coalesced-dataset)

Scale a coalesced dataset in place.
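
To make per-parameter, in-place scaling concrete, here is a plain-Clojure sketch over coalesced items whose :values are double arrays of equal length. The helpers column-min-max and scale-values!, the {:min :max} map shape, and the fixed target range [0, 1] are all assumptions for illustration; the library's own min-max and scale-map formats are not described on this page.

```clojure
;; Hypothetical helper: per-parameter (per array index) minimum and maximum
;; over a non-empty coalesced dataset.
(defn column-min-max
  [coalesced-dataset]
  (let [rows (map (comp vec :values) coalesced-dataset)]
    {:min (apply mapv min rows)
     :max (apply mapv max rows)}))

;; Hypothetical helper: rescale every :values array in place to [0, 1].
;; A parameter with zero range is mapped to 0.0.
(defn scale-values!
  [{mins :min maxs :max} coalesced-dataset]
  (doseq [item coalesced-dataset]
    (let [^doubles values (:values item)]
      (dotimes [idx (alength values)]
        (let [lo   (double (nth mins idx))
              span (- (double (nth maxs idx)) lo)]
          (aset values idx
                (double (if (zero? span)
                          0.0
                          (/ (- (aget values idx) lo) span))))))))
  coalesced-dataset)

(def data [{:values (double-array [0 10])}
           {:values (double-array [5 20])}
           {:values (double-array [10 30])}])

(def stats (column-min-max data))  ;; => {:min [0.0 10.0], :max [10.0 30.0]}
(scale-values! stats data)
(mapv (comp vec :values) data)     ;; => [[0.0 0.0] [0.5 0.5] [1.0 1.0]]
```

Because the arrays are mutated, the statistics must be computed before scaling, and the original values are not recoverable afterwards unless the min-max map is kept.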

sequence->iterator^clj

(sequence->iterator item-seq)

Java ML interfaces sometimes use iterators where they really should use sequences (iterators have state). In any case, we do what we can.
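
The statefulness mentioned above can be seen with plain Java interop. Clojure sequences implement java.lang.Iterable, so the sketch below obtains an iterator directly; it does not claim to show how sequence->iterator itself is implemented.

```clojure
;; Clojure sequences implement java.lang.Iterable, so an Iterator can be
;; obtained through interop. Each iterator carries its own cursor state.
(def items (map inc (range 3)))          ; lazy seq (1 2 3)
(def ^java.util.Iterator it (.iterator ^Iterable items))

(.next it)     ;; => 1
(.next it)     ;; => 2
(.hasNext it)  ;; => true
(vec items)    ;; => [1 2 3], the sequence itself is unaffected
```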
