tech.ml.dataset

The simplest dataset description we have found is a sequence of maps.

Under this definition, operations such as k-fold have natural interpretations.

While this works well for generating and manipulating data in Clojure, ML interfaces uniformly require the sequence to be coalesced into larger buffers, sometimes point by point and sometimes in batches. This file is intended to provide direct, simple tools that produce either kind of coalesced data.

Care has been taken to keep certain operations lazy so that datasets of unbounded length can be manipulated.
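As a minimal sketch of what "a sequence of maps" can look like, the following builds an unbounded, lazy dataset using only clojure.core. The function name `example-dataset` and the keys `:x` and `:y` are illustrative assumptions and are not part of this library.

```clojure
;; Hypothetical generator (not part of tech.ml.dataset): an unbounded,
;; lazy sequence of maps. Each map is one data point.
(defn example-dataset
  []
  (map (fn [i] {:x (double i) :y (* 2.0 i)})
       (range)))

;; Only the items that are actually consumed get realized, so taking a
;; finite prefix of the unbounded dataset terminates.
(def first-three (doall (take 3 (example-dataset))))
```

Because the dataset is an ordinary sequence, the usual sequence functions (map, filter, take, partition-all) apply to it directly.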

dataset->k-fold-datasets (clj)

(dataset->k-fold-datasets k
                          {:keys [randomize-dataset?]
                           :or {randomize-dataset? true}}
                          dataset)

Given one dataset, prepare K datasets using the k-fold algorithm. randomize-dataset? defaults to true, which realizes the entire dataset, so use it with care on large datasets.
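The k-fold idea can be sketched in plain Clojure as follows. This is an illustration under stated assumptions, not the library's implementation: the function name `k-fold-sketch`, its positional `randomize?` argument, and the result keys `:train-ds` and `:test-ds` are all hypothetical. Items are assigned to folds by index modulo k, and each fold in turn serves as the held-out set.

```clojure
;; Hypothetical sketch of k-fold splitting over a sequence of maps.
;; Not the implementation used by dataset->k-fold-datasets.
(defn k-fold-sketch
  [k randomize? dataset]
  (let [;; shuffle realizes the whole dataset, which is why randomizing
        ;; needs care with large (or unbounded) datasets.
        items   (if randomize? (shuffle dataset) dataset)
        indexed (map-indexed vector items)]
    (for [fold (range k)]
      {:test-ds  (for [[idx item] indexed
                       :when (= fold (mod idx k))]
                   item)
       :train-ds (for [[idx item] indexed
                       :when (not= fold (mod idx k))]
                   item)})))
```

With randomization turned off, the split is deterministic, which makes the sketch easy to check by hand on a small dataset.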

dataset->values-label-sequence (clj)

(dataset->values-label-sequence feature-keys
                                label-keys
                                {:keys [datatype unchecked? scalar-label?
                                        container-fn queue-depth batch-size
                                        keep-extra?]
                                 :or {datatype :float64
                                      unchecked? true
                                      scalar-label? false
                                      container-fn dtype/make-array-of-type
                                      queue-depth 0
                                      batch-size 1}
                                 :as options}
                                dataset)

Take a dataset and produce a sequence of values/label maps where the entries are coalesced items of the dataset. Ecounts are always checked.

Options are:
  datatype - datatype to use.
  unchecked? - true for faster conversions to the container.
  scalar-label? - true if the label should be a single scalar value.
  container-fn - container constructor with prototype:
     (container-fn datatype elem-count {:keys [unchecked?] :as options})
  queue-depth - parallelism used for coalescing; see tech.parallel/queued-pmap.
    This is useful when the data sequence involves CPU-intensive or blocking
    transformations (loading large images, scaling them, etc.) and the train/test
    method is relatively fast in comparison. Defaults to 0, in which case
    queued-pmap turns into plain map.
  batch-size - nil - point-by-point conversion.
             - number - items are coalesced into batches of the given size. The
                 options map passed in is passed on to dataset->batched-dataset.
  keep-extra? - keep extra data in the items. Defaults to true. This allows users
                to assoc extra information into each data item for things like
                visualizations.

Returns a sequence of
  {:values - container of datatype
   :labels - container or scalar}
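To make the coalescing step concrete, here is a minimal sketch in plain Clojure. It is not the library's implementation and does not use its container or datatype machinery: the name `values-label-seq-sketch` is hypothetical, every feature and label key is assumed to hold a single number, and Java double arrays stand in for the containers. A nil batch size converts point by point; a number groups that many items into one buffer.

```clojure
;; Hypothetical sketch (not dataset->values-label-sequence itself).
;; Assumes each feature/label key maps to one scalar number.
(defn values-label-seq-sketch
  [feature-keys label-keys batch-size dataset]
  (let [;; Flatten the chosen keys of every item in a group into one
        ;; primitive double array, item by item.
        coalesce (fn [ks items]
                   (double-array (for [item items
                                       k    ks]
                                   (get item k))))
        ;; nil batch-size: one item per group (point-by-point conversion).
        groups   (if batch-size
                   (partition-all batch-size dataset)
                   (map vector dataset))]
    ;; map and partition-all are lazy, so unbounded datasets are fine
    ;; as long as only a finite prefix is consumed.
    (map (fn [items]
           {:values (coalesce feature-keys items)
            :labels (coalesce label-keys items)})
         groups)))
```

Keeping the whole pipeline lazy mirrors the design note above about unbounded datasets: nothing is coalesced until a consumer asks for the next batch.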

get-dataset-item (clj)

(get-dataset-item dataset-entry item-key)

sequence->iterator (clj)

(sequence->iterator item-seq)

Java ML interfaces sometimes use iterators where they really should use sequences (iterators have state). In any case, we do what we can.
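One way to obtain a java.util.Iterator from a Clojure sequence is through java.lang.Iterable, which Clojure's sequence types implement. The sketch below shows that approach; the name `seq->iterator` is hypothetical and this is not necessarily how sequence->iterator is written.

```clojure
;; Hypothetical sketch: Clojure seqs implement java.lang.Iterable, so an
;; iterator can be requested directly. `sequence` turns nil or an empty
;; collection into (), which is also Iterable.
(defn seq->iterator
  ^java.util.Iterator [item-seq]
  (.iterator ^Iterable (sequence item-seq)))

;; The iterator is stateful: each .next call advances it, whereas calling
;; `first` on the original sequence always yields the same item.
(def example-iterator (seq->iterator [:a :b :c]))
```

The statefulness is the point the docstring makes: once an item has been pulled from the iterator it cannot be revisited, while the underlying sequence can be traversed again.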
