The simplest dataset description we have found is a sequence of maps.
Using this definition, things like k-fold have natural interpretations.
While this works well for generating and manipulating data in Clojure, ML interfaces uniformly require this sequence to be coalesced into larger buffers, sometimes point by point and sometimes in batches. This file is intended to provide direct, simple tools for producing either kind of coalesced data.
Care has been taken to keep certain operations lazy so that datasets of unbounded length can be manipulated. Operations like auto-scaling, however, will read the entire dataset into memory.
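As a minimal sketch of the sequence-of-maps idea, the snippet below uses only clojure.core. The keys :x, :y, :label and :sum are hypothetical and chosen for illustration; none of this is the library's own code.

```clojure
;; Hypothetical data: each observation is a plain map.
;; The dataset is unbounded and lazy, so nothing is computed until consumed.
(def observations
  (map (fn [idx] {:x idx :y (* 2 idx) :label (if (even? idx) :even :odd)})
       (range)))

;; A lazy per-item transformation is safe on an unbounded sequence.
(def with-sum
  (map (fn [item] (assoc item :sum (+ (:x item) (:y item)))) observations))

;; Only the items actually requested are realized.
(def first-three (vec (take 3 with-sum)))

;; A whole-dataset statistic such as (apply max (map :x observations))
;; would never return here, which is why scaling passes need a bounded dataset.
```

Because each item is an ordinary map, the usual sequence functions (map, filter, take, partition) apply directly.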
(->k-fold-datasets k
{:keys [randomize-dataset?] :or {randomize-dataset? true}}
dataset)
Given one dataset, prepare K datasets using the k-fold algorithm. randomize-dataset? defaults to true, which realizes the entire dataset, so use it with care on large datasets.
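The k-fold idea over a sequence of maps can be sketched in plain Clojure as follows. The helper name k-fold-sketch and the result keys :train-ds and :test-ds are assumptions for illustration, not the library's implementation.

```clojure
(defn k-fold-sketch
  "Hypothetical helper. Returns k maps; in fold i, every item whose index
  is congruent to i modulo k goes to :test-ds, the rest go to :train-ds."
  [k dataset]
  (let [indexed (map-indexed vector dataset)]
    (for [fold (range k)]
      {:test-ds  (keep (fn [[idx item]] (when (= fold (mod idx k)) item))
                       indexed)
       :train-ds (keep (fn [[idx item]] (when (not= fold (mod idx k)) item))
                       indexed)})))

;; Without randomization the folds stay lazy.
(def folds (k-fold-sketch 3 (map (fn [i] {:x i}) (range 6))))

;; Randomizing first, e.g. (k-fold-sketch 3 (shuffle data)), realizes the
;; whole dataset, because clojure.core/shuffle needs every element.
```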
(->train-test-split {:keys [randomize-dataset? train-fraction]
:or {randomize-dataset? true train-fraction 0.7}}
dataset)
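A train/test split with the defaults shown in the signature above might look like the following sketch. The helper name, the result keys and the rounding rule are assumptions; the library's actual behaviour is not documented here.

```clojure
(defn train-test-split-sketch
  "Hypothetical helper. Puts the first train-fraction of the items in
  :train-ds and the remainder in :test-ds. For simplicity this sketch
  realizes the dataset whether or not it is shuffled."
  [{:keys [randomize-dataset? train-fraction]
    :or {randomize-dataset? true train-fraction 0.7}}
   dataset]
  (let [items   (if randomize-dataset? (shuffle dataset) (vec dataset))
        n-train (long (Math/round (* (double train-fraction) (count items))))]
    {:train-ds (subvec items 0 n-train)
     :test-ds  (subvec items n-train)}))

(def split
  (train-test-split-sketch {:randomize-dataset? false}
                           (map (fn [i] {:x i}) (range 10))))
```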
(apply-dataset-options feature-keys label-keys options dataset)
Apply dataset options to dataset producing a coalesced dataset and a new options map. A coalesced dataset is a dataset where all the feature keys are coalesced into a contiguous ::features member and all the labels are coalesced into a contiguous ::labels member.
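To make "coalesced" concrete, here is a minimal sketch that flattens the chosen keys of each item into one contiguous vector of doubles. The helper names are hypothetical, and plain :features and :labels keys stand in for the namespaced ::features and ::labels members.

```clojure
(defn coalesce-item-sketch
  "Hypothetical helper: flattens the values under ks into one vector of
  doubles. Scalars contribute one element, sequential values all of theirs."
  [ks item]
  (into []
        (comp (map item)
              (mapcat (fn [v] (if (sequential? v) v [v])))
              (map double))
        ks))

(defn coalesce-sketch
  "Lazily turns each item into a map of contiguous feature and label vectors."
  [feature-keys label-keys dataset]
  (map (fn [item]
         {:features (coalesce-item-sketch feature-keys item)
          :labels   (coalesce-item-sketch label-keys item)})
       dataset))

(def coalesced
  (coalesce-sketch [:pixels :age] [:score]
                   [{:pixels [0 1 2] :age 30 :score 0.5}
                    {:pixels [3 4 5] :age 41 :score 0.9}]))
```

In this example the :pixels key contributes three elements and :age contributes one, so each feature vector has four elements.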
Transformations:
If the dataset has nominal (not numeric) data then this data is converted into integer data and the original keys are mapped to the indexes. This is recorded in :label-map.
Some global information about the dataset is recorded: ::dataset-info {:value-ecount - Ecount of the feature vector. :key-ecount-map - map of keys to ecounts for all keys.}
:feature-keys - normalized feature keys. :label-keys - normalized label keys.
:range-map - if passed in, the coalesced ::features or ::label values are scaled to the ranges specified in the map. This means a min-max pass is performed and per-element scaling is done. See the tests for an example. The result of a range-map operation is a per-element scale map.
:scale-map - if passed in, this is a map of #{::features ::label} to a scaling operation: (-> (ct/clone v) (ops/- (:per-elem-subtract scale-entry)) (ops// (:per-elem-div scale-entry)) (ops/+ (:per-elem-bias scale-entry)))
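The relationship between a target range and a per-element scale entry can be sketched with plain vectors. The derivation in min-max->scale-entry is an assumption about how a min-max pass could produce the three per-element fields; only the order of operations (subtract, divide, add bias) comes from the expression above.

```clojure
(defn min-max->scale-entry
  "Hypothetical helper: derives per-element parameters that map each column
  of rows from its observed [min max] onto the target range [lo hi]."
  [[lo hi] rows]
  (let [mins  (apply mapv min rows)
        maxs  (apply mapv max rows)
        ;; Guard constant columns against division by zero.
        spans (mapv (fn [mn mx] (if (== mn mx) 1.0 (double (- mx mn))))
                    mins maxs)
        width (double (- hi lo))]
    {:per-elem-subtract mins
     :per-elem-div      (mapv (fn [span] (/ span width)) spans)
     :per-elem-bias     (vec (repeat (count mins) lo))}))

(defn apply-scale-entry
  "Elementwise (v - subtract) / div + bias, in the same order as above."
  [{:keys [per-elem-subtract per-elem-div per-elem-bias]} v]
  (mapv (fn [x s d b] (+ (/ (- x s) d) b))
        v per-elem-subtract per-elem-div per-elem-bias))

(def rows [[0.0 10.0] [5.0 20.0] [10.0 30.0]])
(def scale-entry (min-max->scale-entry [-1.0 1.0] rows))
(def scaled (mapv (fn [row] (apply-scale-entry scale-entry row)) rows))
```

Note that computing the minimum and maximum requires a full pass over the data, which is the reason scaling reads the dataset into memory.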
(coalesce-dataset feature-keys
label-keys
{:keys [datatype unchecked? container-fn queue-depth
batch-size keep-extra?]
:or {datatype :float64
unchecked? true
container-fn dtype/make-array-of-type
queue-depth 0
batch-size 1}
:as options}
dataset)
Take a dataset and produce a sequence of value/label maps where the entries are coalesced items of the dataset. Ecounts are always checked.
Options:
datatype - datatype to use.
unchecked? - true for faster conversions to container.
scalar-label? - true if the label should be a single scalar value.
container-fn - container constructor with prototype: (container-fn datatype elem-count {:keys [unchecked?] :as options})
queue-depth - parallelism used for coalescing; see tech.parallel/queued-pmap. This is useful when the data sequence involves CPU-intensive or blocking transformations (loading large images, scaling them, etc.) and the train/test method is relatively fast in comparison. Defaults to 0, in which case queued-pmap turns into just map.
batch-size - nil: point-by-point conversion; number: items are coalesced into batches of the given size. The options map passed in is passed on to dataset->batched-dataset.
keep-extra? - keep extra data in the items. Defaults to true. Allows users to assoc extra information into each data item for things like visualizations.
Returns a sequence of {::features - container of datatype :labels - container or scalar}
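The difference between point-by-point and batched output can be sketched as below. The helper batch-sketch is hypothetical, uses plain vectors rather than typed containers, and is not dataset->batched-dataset.

```clojure
(defn batch-sketch
  "Hypothetical helper: groups coalesced items into batches of batch-size,
  concatenating each key's vectors. A trailing partial batch is dropped
  here; whether the library does the same is not stated in the docs."
  [batch-size coalesced]
  (->> (partition batch-size coalesced)
       (map (fn [batch]
              {:features (into [] (mapcat :features) batch)
               :labels   (into [] (mapcat :labels) batch)}))))

(def batches
  (batch-sketch 2 [{:features [1.0 2.0] :labels [0.0]}
                   {:features [3.0 4.0] :labels [1.0]}
                   {:features [5.0 6.0] :labels [0.0]}]))
```

Both partition and map are lazy, so batches are only built as the consumer asks for them.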
(per-parameter-dataset-min-max batch-size coalesced-dataset)
Create a new (coalesced) dataset with parameters scaled. If label range is not provided then labels are left unscaled.
(per-parameter-scale-coalesced-dataset scale-map coalesced-dataset)
Scale a coalesced dataset in place.
(sequence->iterator item-seq)
Java ML interfaces sometimes use iterators where they really should use sequences (iterators have state). In any case, we do what we can.
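The statefulness mentioned above is easy to see with a small sketch. The helper seq->iterator-sketch is hypothetical and relies only on the fact that Clojure sequences implement java.lang.Iterable; it is not the library's sequence->iterator.

```clojure
(defn seq->iterator-sketch
  "Hypothetical helper: exposes a (possibly lazy) sequence as a
  java.util.Iterator via the Iterable interface of Clojure sequences."
  [item-seq]
  (.iterator ^java.lang.Iterable (sequence item-seq)))

(def it (seq->iterator-sketch (map inc [1 2 3])))

;; Calling .next advances the iterator; the state lives inside `it`.
(def first-item (.next it))

;; The consumed element does not reappear when the rest is read back.
(def remaining (vec (iterator-seq it)))
```

A sequence, by contrast, can be traversed any number of times from the same reference.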