Preemptively compute a dataset (i.e. features from natural language utterances) and store them in Elasticsearch. This is useful for training, testing, validating, and developing machine learning models.

The unit of data is an instance. An instance set (or just *instances*) makes up the dataset.

The idea is to abstract out Elasticsearch, but that might be a future enhancement. At the moment functions don't carry Elasticsearch artifacts, but they are exposed.

There are three basic ways to use this data:

* Get all instances (i.e. an utterance or a feature set). In this case all data returned from [[ids]] is considered training data. This is the default nascent state.
* Split the data into a train and test set (see [[divide-by-set]]).
* Use the data as a cross fold validation and iterate folds (see [[divide-by-fold]]).

The information used to represent either fold or the test/train split is referred to as the *dataset split* state and is stored in Elasticsearch under a different mapping-type in the same index as the instances.

See [[ids]] for more information.
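The three modes can be sketched as follows. This is an illustrative sketch only: it assumes a default connection has already been established and the instances loaded.

```clojure
;; 1. Nascent state: everything returned by ids is training data.
(ids)

;; 2. Test/train split: 75% train, 25% test.
(divide-by-set 0.75)
(ids :set-type :train)
(ids :set-type :test)

;; 3. Cross fold validation: 10 folds, iterated with set-fold.
(divide-by-fold 10)
(set-fold 0)
(ids :set-type :train)
```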
(clear & {:keys [wipe-persistent?] :or {wipe-persistent? false}})
Clear the in-memory instance data. If key `:wipe-persistent?` is `true`, all fold and test/train split data is also cleared.
(dataset-file)
(distribution)
Return maps representing the dataset distribution by class label. Each element of the returned sequence has the following keys:

* **:class-label** the class label of the instances
* **:count** the number of instances for **:class-label**
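Given the keys above, the per-class counts can be printed with a simple destructuring loop (a sketch, assuming instances have already been loaded):

```clojure
;; Print the number of instances for each class label.
(doseq [{:keys [class-label count]} (distribution)]
  (println class-label "->" count))
```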
(divide-by-fold)
(divide-by-fold folds & {:keys [shuffle?] :or {shuffle? true}})
Divide the data into folds and initialize the current fold in the *dataset split* state. Using this kind of dataset split is useful for cross fold validation.

* **folds** number of folds to use, which defaults to 10

See [[set-fold]].
(divide-by-preset)
Divide the data into test and training *buckets*. The respective train/test buckets are dictated by the `:set-type` label given in the parameter passed to **:create-instances-fn** as documented in [[elasticsearch-connection]].
(divide-by-set)
(divide-by-set train-ratio
&
{:keys [dist-type shuffle? max-instances seed]
:as opts
:or {shuffle? true dist-type (quote uneven)}})
Divide the dataset into test and training *buckets*.

* **train-ratio** the percentage of data in the train bucket, which defaults to `0.5`

Keys
----

* **:dist-type** one of the following symbols:
    * *even* each test/training set has an even distribution by class label
    * *uneven* each test/training set has an uneven distribution by class label
* **:shuffle?** if `true`, then shuffle the set before partitioning, otherwise just update the *demarcation* boundary
* **:filter-fn** if given, a filter function that takes a key as input
* **:max-instances** the maximum number of instances per class
* **:seed** if given, seed the random number generator, otherwise don't return random documents
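Putting the parameter and keys together, a reproducible 80/20 split might look like this sketch (the particular values are illustrative):

```clojure
;; 80% train / 20% test, even per-class distribution,
;; at most 1000 instances per class, fixed seed for reproducibility.
(divide-by-set 0.8
               :dist-type 'even
               :max-instances 1000
               :seed 1)
```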
(elasticsearch-connection index-name
&
{:keys [create-instances-fn population-use set-type
url mapping-type-def cache-inst]
:or {create-instances-fn identity
population-use 1.0
set-type :train
mapping-type-def {instance-key {:type "nested"}
class-label-key
{:type "string"
:index "not_analyzed"}}
url "http://localhost:9200"}})
Create a connection to the dataset DB cache.

Parameters
----------

* **index-name** the name of the Elasticsearch index

Keys
----

* **:create-instances-fn** a function that computes the instance set (i.e. parses the utterance) and is invoked by [[instances-load]]; this function takes a single argument, which is also a function that is used to load utterances in the DB; this function takes the following forms:
    * (fn [instance class-label] ...
    * (fn [id instance class-label] ...
    * (fn [id instance class-label set-type] ...
        * **id** the unique identifier of the data point
        * **instance** the dataset instance (can be an `N`-deep map)
        * **class-label** the label of the class (can be nominal, double, integer)
        * **set-type** either `:test`, `:train`, or `:train-test` (all), used to presort the data with [[divide-by-preset]]; note that it isn't necessary to call [[divide-by-preset]] for the first invocation of [[instances-load]]
* **:url** the URL to the DB (defaults to `http://localhost:9200`)
* **:mapping-type** map type name (see the Elasticsearch docs)
* **:cache-inst** an atom used to cache instances by ID; if given, this retrieves instances from the in-memory map stored in the atom; otherwise it goes to Elasticsearch each time

Example
-------

Create a connection that produces a list of 20 instances:

```clojure
(defn- create-iter-connection []
  (letfn [(load-fn [add-fn]
            (doseq [i (range 20)]
              (add-fn (str i) (format "inst %d" i) (format "class %d" i))))]
    (elasticsearch-connection "tmp" :create-instances-fn load-fn)))
```
(freeze-dataset &
{:keys [output-file id-key set-type-key]
:or {set-type-key :set-type}})
Distill the current dataset (data and test/train splits) into **output-file**. See [[freeze-dataset-to-writer]].
(freeze-dataset-to-writer writer & {:keys [set-type-key]})
Distill the current dataset (data and test/train splits) into **writer** to be later restored with [[zensols.dataset.thaw/thaw-connection]].
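A sketch of persisting the dataset through this function; it assumes `io` is an alias for `clojure.java.io` and the file name is illustrative:

```clojure
;; Distill the dataset and its split state to a writer.
(with-open [w (io/writer "dataset-frozen.json")]
  (freeze-dataset-to-writer w))
```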
(freeze-file)
(ids & {:keys [set-type]})
Return all IDs based on the *dataset split* (see class docs).

Keys
----

* **:set-type** either `:train`, `:test`, or `:train-test` (all); defaults to [[set-default-set-type]] or `:train` if not set
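For example, one way to check the sizes of the buckets after a split (a sketch, assuming the instances are loaded):

```clojure
;; Count the IDs in each bucket after a 50/50 split.
(divide-by-set 0.5)
{:train (count (ids :set-type :train))
 :test  (count (ids :set-type :test))
 :all   (count (ids :set-type :train-test))}
```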
(instance-by-id id)
(instance-by-id conn id)
Get a specific instance by its ID.

This returns a map that has the following keys:

* **:instance** the instance data, which was set with **:create-instances-fn** in [[elasticsearch-connection]]
(instance-count)
Get the number of total instances in the database. This result is independent of the *dataset split* state.
(instances & {:keys [set-type include-ids? id-set]})
Return all instance data based on the *dataset split* (see class docs). See [[instance-by-id]] for the data in each map sequence returned.

Keys
----

* **:set-type** either `:train`, `:test`, or `:train-test` (all); defaults to [[set-default-set-type]] or `:train` if not set
* **:include-ids?** if non-`nil`, return keys in the map as well
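A small sketch of iterating the test bucket with IDs included (assuming data is loaded and a split has been made):

```clojure
;; Walk the test instances, keeping the IDs in each returned map.
(doseq [inst (instances :set-type :test :include-ids? true)]
  (println inst))
```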
(instances-by-class-label &
{:keys [max-instances type seed]
:or {max-instances Integer/MAX_VALUE}})
Return a map with class labels for keys and the corresponding instances for each class label.

Keys
----

* **:max-instances** the maximum number of instances per class
* **:seed** if given, seed the random number generator, otherwise don't return random documents
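For instance, to draw a reproducible, capped sample per class (a sketch; the values are illustrative):

```clojure
;; At most 100 instances per class label, with a fixed seed
;; so the same sample is returned each time.
(instances-by-class-label :max-instances 100 :seed 1)
```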
(instances-count)
Return the number of datasets in the DB.
(instances-load & {:keys [recreate-index?] :or {recreate-index? true}})
Parse and load the dataset in the DB.
(set-default-connection)
(set-default-connection conn)
Set the default connection.

Parameter **conn** is used in place of what is set with [[with-connection]]. This is very convenient and saves typing, but will get clobbered if a [[with-connection]] is used further down in the stack frame.

If the parameter is missing, the default connection is unset.
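A sketch of the convenience this buys, using the `create-iter-connection` helper from the [[elasticsearch-connection]] example:

```clojure
;; Make one connection the default for the rest of the session,
;; then call the API without wrapping everything in with-connection.
(set-default-connection (create-iter-connection))
(instances-load)
(instance-count)
```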
(set-default-set-type set-type)
Set the default bucket (training or testing) from which to get data.

* **set-type** either `:train` (default) or `:test`; see [[elasticsearch-connection]]

See [[ids]].
(set-fold fold)
Set the current fold in the *dataset split* state.

You must call [[divide-by-fold]] before calling this.

See the namespace docs for more information.
(set-population-use ratio)
Set how much of the data from the DB to use. This is useful when your dataset or corpus is huge and you only want to start with a small chunk until you get your models debugged.

Parameters
----------

* **ratio** a number in the interval (0, 1]; defaults to 1

**Note**: This removes any stored *dataset split* state.
(stats)
Get training vs. testing *dataset split* statistics.
(with-connection connection & body)
Execute a body with the form `(with-connection connection ...)`.

* **connection** a connection created with [[elasticsearch-connection]]
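A sketch of scoping a connection to a body rather than setting a default (index name and body are illustrative):

```clojure
;; The connection is only in effect inside the body.
(with-connection (elasticsearch-connection "tmp")
  (instances-load)
  (stats))
```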
(write-dataset &
{:keys [output-file single? instance-fn columns-fn]
:or {instance-fn identity
columns-fn (constantly ["Instance"])}})
Write the dataset to a spreadsheet. If the file name ends with `.csv`, a CSV file is written; otherwise an Excel file is written.

Keys
----

* **:output-file** where to write the file; defaults to [[res/resource-path]] `:analysis-report`
* **:single?** if `true`, create a single sheet; otherwise the training and testing *buckets* are split between sheets
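A sketch of a typical call; it assumes `io` is an alias for `clojure.java.io`, and the file name, column title, and renderer are illustrative:

```clojure
;; Write train and test buckets to separate sheets of an Excel file,
;; rendering each instance with str under a single "Utterance" column.
(write-dataset :output-file (io/file "dataset.xlsx")
               :columns-fn (constantly ["Utterance"])
               :instance-fn str)
```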