The simplest dataset description we have found is a sequence of maps.
Under this definition, operations such as k-fold have natural interpretations.
While this works well for generating and manipulating data in Clojure, ML interfaces uniformly require the sequence to be coalesced into larger buffers, sometimes point by point and sometimes into batches. This file is intended to provide direct, simple tools for producing either kind of coalesced data.
Care has been taken to keep certain operations lazy so that datasets of unbounded length can be manipulated.
(coalesce-dataset feature-keys
                  label-keys
                  {:keys [datatype unchecked? container-fn queue-depth
                          batch-size keep-extra?]
                   :or {datatype :float64
                        unchecked? true
                        container-fn dtype/make-array-of-type
                        queue-depth 0
                        batch-size 1}
                   :as options}
                  dataset)
Take a dataset and produce a sequence of values/labels maps whose entries are coalesced items of the dataset. Ecounts are always checked.
Options are:
datatype - datatype to use.
unchecked? - true for faster conversions to the container.
scalar-label? - true if the label should be a single scalar value.
container-fn - container constructor with prototype:
  (container-fn datatype elem-count {:keys [unchecked?] :as options})
queue-depth - parallelism used for coalescing; see tech.parallel/queued-pmap. This is useful when the data sequence involves CPU-intensive or blocking transformations (loading large images, scaling them, etc.) and the train/test method is relatively fast in comparison. Defaults to 0, in which case queued-pmap turns into just map.
batch-size - nil for point-by-point conversion; a number to coalesce items into batches of the given size. The options map passed in is passed on to dataset->batched-dataset.
keep-extra? - keep extra data in the items. Defaults to true. Allows users to assoc extra information into each data item for things like visualizations.
Returns a sequence of {:values - container of datatype, :labels - container or scalar}.
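As a rough illustration of the point-by-point case, the sketch below flattens the feature and label keys of each map into double arrays using only clojure.core. The names coalesce-item and coalesce-point-by-point are hypothetical and are not part of this library. The sketch ignores datatype, container-fn, queue-depth, batching and ecount checking, so it shows the idea, not the real implementation.

```clojure
;; Hypothetical helper, not a library function: flatten the values found
;; under `ks` in one dataset item (a map) into a single double array.
(defn coalesce-item
  [ks item]
  (double-array
   (mapcat (fn [k]
             (let [v (get item k)]
               ;; Accept either a scalar or a sequence of numbers.
               (if (sequential? v) v [v])))
           ks)))

;; Hypothetical stand-in for the point-by-point case.
(defn coalesce-point-by-point
  [feature-keys label-keys dataset]
  ;; `map` is lazy, so datasets of unbounded length are not realized here.
  (map (fn [item]
         {:values (coalesce-item feature-keys item)
          :labels (coalesce-item label-keys item)})
       dataset))

(def example-dataset
  [{:height 1.5 :weight [60.0 61.0] :label 0}
   {:height 1.8 :weight [80.0 82.0] :label 1}])

(def coalesced
  (coalesce-point-by-point [:height :weight] [:label] example-dataset))
```

Each element of coalesced is a map with :values and :labels arrays, which is the shape described by the return value above.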
(dataset->k-fold-datasets k
                          {:keys [randomize-dataset?]
                           :or {randomize-dataset? true}}
                          dataset)
Given one dataset, prepare K datasets using the k-fold algorithm. randomize-dataset? defaults to true, which realizes the entire dataset, so use with care if you have large datasets.
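To make the k-fold idea concrete on a sequence of maps, here is a minimal sketch using only clojure.core. The function name k-fold-sketch and the result keys :train-ds and :test-ds are assumptions made for illustration; the actual return shape of dataset->k-fold-datasets is not documented here. The fold assignment (every k-th item) is also just one possible choice.

```clojure
;; Hypothetical sketch, not the library implementation.
(defn k-fold-sketch
  [k randomize? dataset]
  (let [;; `shuffle` must realize the whole dataset, which is why
        ;; randomization needs care with large or unbounded inputs.
        items (if randomize? (shuffle dataset) dataset)
        ;; Fold i holds items i, i+k, i+2k, ...
        folds (mapv (fn [idx] (take-nth k (drop idx items)))
                    (range k))]
    (mapv (fn [idx]
            {:test-ds  (nth folds idx)
             ;; Train on every fold except the held-out one.
             :train-ds (mapcat folds (remove #{idx} (range k)))})
          (range k))))
```

For example, (k-fold-sketch 5 false (map (fn [i] {:x i}) (range 10))) yields five train/test pairs in which every item appears in exactly one test set.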
(per-parameter-dataset-min-max coalesced-dataset)
Create a new (coalesced) dataset with parameters scaled. If label range is not provided then labels are left unscaled.
(per-parameter-scale-coalesced-dataset! scale-map coalesced-dataset)
Scale a coalesced dataset in place.
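The two scaling functions above can be pictured with the following sketch, which computes a per-parameter minimum and maximum over the :values arrays of a coalesced dataset and then rescales those arrays in place to [0, 1]. Everything here is an assumption made for illustration: the function names, the {:min :max} shape of the range data, the choice of [0, 1] as the target range, and the restriction to :values held in double arrays. The library's actual scale-map format is not shown in this document.

```clojure
;; Hypothetical: one {:min :max} map per parameter (array index).
(defn per-parameter-min-max-sketch
  [coalesced-dataset]
  (let [arrays (map :values coalesced-dataset)
        n (alength ^doubles (first arrays))]
    (mapv (fn [idx]
            (let [column (map #(aget ^doubles % (int idx)) arrays)]
              {:min (apply min column)
               :max (apply max column)}))
          (range n))))

;; Hypothetical: mutate each :values array so every parameter lies in [0, 1].
(defn scale-in-place-sketch!
  [ranges coalesced-dataset]
  (doseq [{:keys [values]} coalesced-dataset
          [idx {lo :min hi :max}] (map-indexed vector ranges)]
    (let [span (- hi lo)
          v (aget ^doubles values (int idx))]
      (aset ^doubles values (int idx)
            ;; A constant parameter has zero span; map it to 0.0.
            (double (if (zero? span) 0.0 (/ (- v lo) span))))))
  coalesced-dataset)
```

Because the arrays are mutated, any other reference to the same coalesced dataset sees the scaled values as well.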
(sequence->iterator item-seq)
Java ML interfaces sometimes use iterators where they really should use sequences (iterators have state). In any case, we do what we can.
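One way such an adapter could look is sketched below: a java.util.Iterator that keeps the not-yet-consumed part of the sequence in an atom. The name seq->iterator-sketch is hypothetical and this is not necessarily how sequence->iterator is implemented. The atom makes the state mentioned above explicit.

```clojure
;; Hypothetical sketch of wrapping a (possibly lazy) sequence in an Iterator.
(defn seq->iterator-sketch
  ^java.util.Iterator [item-seq]
  (let [remaining (atom (seq item-seq))]
    (reify java.util.Iterator
      (hasNext [_]
        ;; `seq` returns nil for an empty sequence.
        (boolean @remaining))
      (next [_]
        (if-let [s @remaining]
          (do
            ;; Advance the stored position, then return the current item.
            (reset! remaining (clojure.core/next s))
            (first s))
          (throw (java.util.NoSuchElementException.)))))))
```

Unlike the sequence it wraps, the iterator can be traversed only once, and this sketch is not safe to share across threads.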