zensols.dataset.db


Preemptively compute a dataset (i.e. features from natural language
utterances) and store them in Elasticsearch.  This is useful for training,
testing, validating, and developing machine learning models.

The unit of data is an instance.  An instance set (or just *instances*) makes
up the dataset.

The idea is to abstract out Elasticsearch, but that might be a future
enhancement.  At the moment functions don't carry Elasticsearch artifacts but
they are exposed.

There are three basic ways to use this data:

* Get all instances (i.e. an utterance or a feature set).  In this case all
  data returned from [[ids]] is considered training data.  This is the default
  nascent state.
* Split the data into a train and test set (see [[divide-by-set]]).
* Use the data as a cross fold validation and iterate
  folds (see [[divide-by-fold]]).

The information used to represent either the fold or the test/train split is
referred to as the *dataset split* state and is stored in Elasticsearch under
a different mapping-type in the same index as the instances.

See [[ids]] for more information.
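
A minimal sketch of the three modes (assuming the namespace is referred in;
the index name is hypothetical):

```clojure
(with-connection (elasticsearch-connection "example-index")
  ;; nascent state: everything returned is training data
  (instances)
  ;; test/train split: 75% train, 25% test
  (divide-by-set 0.75)
  (ids :set-type :test)
  ;; cross fold validation: 10 folds, iterated with set-fold
  (divide-by-fold)
  (set-fold 0)
  (instances :set-type :train))
```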

class-label-keyclj


clearclj

(clear & {:keys [wipe-persistent?] :or {wipe-persistent? false}})

Clear the in-memory instance data.  If key `:wipe-persistent?` is `true` all
fold and test/train split data is also cleared.

dataset-fileclj

(dataset-file)

default-connection-instclj


distributionclj

(distribution)

Return maps representing the dataset distribution by class label.  Each
element of the returned sequence has the following keys:

* **:class-label** the class label of the instances
* **:count** the number of instances for **:class-label**
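
For example, a connection loaded with one instance per class label might
return something like the following (a sketch; exact ordering may differ):

```clojure
(distribution)
;; => ({:class-label "class 0" :count 1}
;;     {:class-label "class 1" :count 1}
;;     ...)
```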

divide-by-foldclj

(divide-by-fold)
(divide-by-fold folds & {:keys [shuffle?] :or {shuffle? true}})

Divide the data into folds and initialize the current fold in the *dataset
split* state.  Using this kind of dataset split is useful for cross fold
validation.

* **folds** number of folds to use, which defaults to 10

See [[set-fold]]
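
A typical cross fold validation loop might look like the following sketch,
where `train-and-test` is a hypothetical evaluation function:

```clojure
(let [folds 10]
  (divide-by-fold folds)
  (doseq [fold (range folds)]
    (set-fold fold)
    ;; evaluate on the current fold's train/test buckets
    (train-and-test (instances :set-type :train)
                    (instances :set-type :test))))
```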

divide-by-presetclj

(divide-by-preset)

Divide the data into test and training *buckets*.  The respective train/test
buckets are dictated by the `:set-type` label passed as a parameter to the
**:create-instances-fn** function as documented in [[elasticsearch-connection]].
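
For this to work the loader must use the four argument form of the add
function so each instance carries its set type.  A sketch (IDs, utterances
and class labels are hypothetical):

```clojure
(defn- create-preset-connection []
  (letfn [(load-fn [add-fn]
            ;; the fourth argument presorts each instance into a bucket
            (add-fn "1" "a training utterance" "class-a" :train)
            (add-fn "2" "a testing utterance" "class-a" :test))]
    (elasticsearch-connection "tmp" :create-instances-fn load-fn)))

(with-connection (create-preset-connection)
  (instances-load)
  (divide-by-preset)
  (ids :set-type :test))
```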

divide-by-setclj

(divide-by-set)
(divide-by-set train-ratio
               &
               {:keys [dist-type shuffle? max-instances seed]
                :as opts
                :or {shuffle? true dist-type (quote uneven)}})

Divide the dataset into test and training *buckets*.

* **train-ratio** the proportion of data in the train bucket, which defaults
to `0.5`

Keys
----
* **:dist-type** one of the following symbols:
    * *even*: each test/training set has an even distribution by class label
    * *uneven*: each test/training set has an uneven distribution by class label
* **:shuffle?** if `true` then shuffle the set before partitioning, otherwise
just update the *demarcation* boundary
* **:filter-fn** if given, a filter function that takes a key as input
* **:max-instances** the maximum number of instances per class
* **:seed** if given, seed the random number generator, otherwise don't
return random documents
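
For example, a minimal sketch of an 80/20 split with an even class
distribution and a reproducible shuffle:

```clojure
(divide-by-set 0.8
               :dist-type 'even
               :max-instances 1000
               :seed 1)
```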

elasticsearch-connectionclj

(elasticsearch-connection index-name
                          &
                          {:keys [create-instances-fn population-use set-type
                                  url mapping-type-def cache-inst]
                           :or {create-instances-fn identity
                                population-use 1.0
                                set-type :train
                                mapping-type-def {instance-key {:type "nested"}
                                                  class-label-key
                                                    {:type "string"
                                                     :index "not_analyzed"}}
                                url "http://localhost:9200"}})

Create a connection to the dataset DB cache.

Parameters
----------
* **index-name** the name of the Elasticsearch index

Keys
----
* **:create-instances-fn** a function that computes the instance
set (i.e. parses the utterance) and is invoked by [[instances-load]]; this
function takes a single argument, which is itself a function used to load
utterances into the DB; that function takes one of the following forms:
    * (fn [instance class-label] ...
    * (fn [id instance class-label] ...
    * (fn [id instance class-label set-type] ...
        * **id** the unique identifier of the data point
        * **instance** the dataset instance (can be an `N`-deep map)
        * **class-label** the label of the class (can be nominal, double, integer)
        * **set-type** either `:test`, `:train`, `:train-test` (all) used to presort the data
        with [[divide-by-preset]]; note that it isn't necessary to
        call [[divide-by-preset]] for the first invocation of [[instances-load]]
* **:url** the URL to the DB (defaults to `http://localhost:9200`)
* **:mapping-type** map type name (see the Elasticsearch documentation)
* **:cache-inst** an atom used to cache instances by ID; if given, this
  retrieves instances from the in-memory map stored in the atom; otherwise it
  goes to Elasticsearch each time

Example
-------
Create a connection that produces a list of 20 instances:
```clojure
(defn- create-iter-connection []
  (letfn [(load-fn [add-fn]
            (doseq [i (range 20)]
              (add-fn (str i) (format "inst %d" i) (format "class %d" i))))]
    (elasticsearch-connection "tmp" :create-instances-fn load-fn)))
```
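
The connection can then be used with [[with-connection]] to load and query
the instances, as in this sketch:

```clojure
(with-connection (create-iter-connection)
  (instances-load)
  (instance-count))
;; => 20
```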

freeze-datasetclj

(freeze-dataset &
                {:keys [output-file id-key set-type-key]
                 :or {set-type-key :set-type}})

Distill the current dataset (data and test/train splits) to
**output-file**.  See [[freeze-dataset-to-writer]].

freeze-dataset-to-writerclj

(freeze-dataset-to-writer writer & {:keys [set-type-key]})

Distill the current dataset (data and test/train splits) to **writer** to be
later restored with [[zensols.dataset.thaw/thaw-connection]].
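
A minimal sketch, assuming a default connection is set (the file name is
hypothetical):

```clojure
(require '[clojure.java.io :as io])

(with-open [writer (io/writer "dataset.dat")]
  (freeze-dataset-to-writer writer))
```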

freeze-fileclj

(freeze-file)

id-keyclj


idsclj

(ids & {:keys [set-type]})

Return all IDs based on the *dataset split* (see the namespace docs).

Keys
----
* **:set-type** is either `:train`, `:test`, `:train-test` (all) and defaults
to [[set-default-set-type]] or `:train` if not set
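
For example (a sketch, assuming a split created with [[divide-by-set]]):

```clojure
;; IDs in the current training bucket (the default set type)
(ids)

;; IDs in the test bucket
(ids :set-type :test)
```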

instance-by-idclj

(instance-by-id id)
(instance-by-id conn id)

Get a specific instance by its ID.

This returns a map that has the following keys:

* **:instance** the instance data, which was set with
**:create-instances-fn** in [[elasticsearch-connection]]

instance-countclj

(instance-count)

Get the total number of instances in the database.  This result is
independent of the *dataset split* state.

instance-keyclj


instancesclj

(instances & {:keys [set-type include-ids? id-set]})

Return all instance data based on the *dataset split* (see the namespace
docs).

See [[instance-by-id]] for the data in each map of the returned sequence.

Keys
----
* **:set-type** is either `:train`, `:test`, `:train-test` (all) and defaults
to [[set-default-set-type]] or `:train` if not set
* **:include-ids?** if non-`nil` return keys in the map as well
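
For example, a sketch that keys each returned map by its instance ID:

```clojure
(instances :set-type :train :include-ids? true)
```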

instances-by-class-labelclj

(instances-by-class-label &
                          {:keys [max-instances type seed]
                           :or {max-instances Integer/MAX_VALUE}})

Return a map with class labels for keys and the corresponding instances for
each class label.

Keys
----
* **:max-instances** the maximum number of instances per class
* **:seed** if given, seed the random number generator, otherwise don't
return random documents
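
A sketch that samples reproducibly:

```clojure
;; at most 100 randomly selected instances per class label
(instances-by-class-label :max-instances 100 :seed 1)
```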

instances-countclj

(instances-count)

Return the number of datasets in the DB.

instances-loadclj

(instances-load & {:keys [recreate-index?] :or {recreate-index? true}})

Parse and load the dataset in the DB.

set-default-connectionclj

(set-default-connection)
(set-default-connection conn)

Set the default connection.

Parameter **conn** is used in place of what is set with [[with-connection]].
This is very convenient and saves typing, but will get clobbered if
a [[with-connection]] is used further down in the stack frame.

If the parameter is missing, it's unset.
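
A REPL sketch (the connection constructor is the example
from [[elasticsearch-connection]]):

```clojure
;; subsequent calls need no surrounding with-connection
(set-default-connection (create-iter-connection))
(instance-count)

;; unset it again
(set-default-connection)
```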

set-default-set-typeclj

(set-default-set-type set-type)

Set the default bucket (training or testing) to get data.

* **:set-type** is either `:train` (default) or `:test`;
see [[elasticsearch-connection]]

See [[ids]]

set-foldclj

(set-fold fold)

Set the current fold in the *dataset split* state.

You must call [[divide-by-fold]] before calling this.

See the namespace docs for more information.

set-population-useclj

(set-population-use ratio)

Set how much of the data from the DB to use.  This is useful for cases where
your dataset or corpus is huge and you only want to start with a small chunk
until you get your models debugged.

Parameters
----------
* **ratio** a number in the interval (0, 1]; by default this is 1

**Note** This removes any stored *dataset split* state
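
For example, a sketch that debugs models against a tenth of the corpus:

```clojure
(set-population-use 0.1)
```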

statsclj

(stats)

Get training vs testing *dataset split* statistics.

with-connectionclj/smacro

(with-connection connection & body)

Execute a body with the form `(with-connection connection ...)`.

* **connection** is created with [[elasticsearch-connection]]

write-datasetclj

(write-dataset &
               {:keys [output-file single? instance-fn columns-fn]
                :or {instance-fn identity
                     columns-fn (constantly ["Instance"])}})

Write the dataset to a spreadsheet.  If the file name ends with `.csv` a CSV
file is written, otherwise an Excel file is written.

Keys
----
* **:output-file** where to write the file; defaults to
[[res/resource-path]] `:analysis-report`
* **:single?** if `true` then create a single sheet, otherwise the training
and testing *buckets* are split between sheets
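
A minimal sketch, assuming **:output-file** accepts a `java.io.File` (the
file name is hypothetical):

```clojure
(require '[clojure.java.io :as io])

;; the .csv extension selects CSV output; anything else writes Excel
(write-dataset :output-file (io/file "dataset.csv")
               :single? true)
```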
