scicloj.ml.core

Clojure only.

Core functions for machine learning and pipeline execution.

Requiring this namespace also registers the models in:

* scicloj.ml.smile.classification
* scicloj.ml.smile.regression
* scicloj.ml.xgboost

Functions are re-exported from:

* scicloj.metamorph.ml.*
* scicloj.metamorph.core

->pipeline^clj

(->pipeline ops)

(->pipeline config ops)

Create pipeline from declarative description.

categorical^clj

(categorical value-vec)

Given a vector of categorical values, create a gridsearch definition.

classification-accuracy^clj

(classification-accuracy lhs rhs)

Returns correct/total. The model output is a sequence of probability distributions; label-seq is a sequence of values. An answer is considered correct if the key with the highest probability in the model output entry matches the corresponding label.

classification-loss^clj

(classification-loss lhs rhs)

1.0 - classification-accuracy.
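
The accuracy and loss definitions above can be illustrated with a minimal plain-Clojure sketch. It is not the library's implementation: the names `sketch-accuracy` and `sketch-loss` are hypothetical, and it assumes each probability distribution is a map from label to probability.

```clojure
;; Plain-Clojure sketch; names and input shape are assumptions.
(defn sketch-accuracy
  "Fraction of entries whose most probable key equals the label."
  [prob-dists labels]
  (let [predicted (map #(key (apply max-key val %)) prob-dists)
        correct   (count (filter true? (map = predicted labels)))]
    (/ (double correct) (count labels))))

(defn sketch-loss
  "1.0 minus the accuracy computed above."
  [prob-dists labels]
  (- 1.0 (sketch-accuracy prob-dists labels)))

(def prob-dists [{:cat 0.9 :dog 0.1}
                 {:cat 0.2 :dog 0.8}
                 {:cat 0.6 :dog 0.4}
                 {:cat 0.3 :dog 0.7}])
(def labels [:cat :dog :dog :dog])

(sketch-accuracy prob-dists labels) ;; => 0.75
(sketch-loss prob-dists labels)     ;; => 0.25
```

Three of the four entries have their most probable key equal to the label, so the accuracy is 0.75 and the loss is 0.25.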

confusion-map^clj

(confusion-map predicted-labels labels)

(confusion-map predicted-labels labels normalize)

confusion-map->ds^clj

(confusion-map->ds conf-matrix-map)

(confusion-map->ds conf-matrix-map normalize)

def-ctx^cljmacro

(def-ctx varname)

Convenience macro for defining pipelined operations that bind the current value of the context to a var, for simple debugging purposes.

default-loss-fn^clj

(default-loss-fn dataset)

Given a dataset, which must have exactly one inference target column, return a default loss fn. If the column is categorical, the loss is tech.v3.ml.loss/classification-loss; otherwise the loss is tech.v3.ml.loss/mae (mean absolute error).

default-result-dissoc-in-seq^clj

define-model!^clj

(define-model! model-kwd
               train-fn
               predict-fn
               {:keys [hyperparameters thaw-fn explain-fn options documentation
                       unsupervised?]})

Create a model definition. An ml model is a function that takes a dataset and an options map and returns a model. A model is something that, combined with a dataset, produces an inferred dataset.

do-ctx^clj

(do-ctx f)

Apply f:: ctx -> any, ignore the result, leaving pipeline unaffected. Akin to using doseq for side-effecting operations like printing, visualization, or binding to vars for debugging.
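
As a rough illustration of this behaviour, here is a plain-Clojure sketch of a side-effecting pipeline step. `sketch-do-ctx` is a hypothetical stand-in written for this example, not the library function, and the context map contents are made up.

```clojure
;; Hypothetical stand-in for do-ctx: call f for its side effect,
;; then hand the context on unchanged.
(defn sketch-do-ctx
  [f]
  (fn [ctx]
    (f ctx)
    ctx))

;; Capture the current data in an atom for later inspection.
(def seen (atom nil))
(def capture-step (sketch-do-ctx #(reset! seen (:metamorph/data %))))

(def ctx-in {:metamorph/data [1 2 3] :metamorph/mode :fit})
(def ctx-out (capture-step ctx-in))
;; ctx-out equals ctx-in, and @seen is now [1 2 3]
```

The return value of `f` is discarded, so such a step can be placed anywhere in a pipeline without changing what later steps receive.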

evaluate-pipelines^clj

(evaluate-pipelines pipe-fn-seq train-test-split-seq metric-fn loss-or-accuracy)

(evaluate-pipelines pipe-fn-or-decl-seq
                    train-test-split-seq
                    metric-fn
                    loss-or-accuracy
                    options)

Evaluates the performance of a seq of metamorph pipelines, which are supposed to have a model as the last step under key `:model`, and which behave correctly in modes `:fit` and `:transform`. The function `scicloj.metamorph.ml/model` is such a function.

This function calculates the accuracy or loss, as given by `metric-fn`, of each pipeline in `pipe-fn-seq`, using all the train-test splits given in `train-test-split-seq`.

It runs each pipeline in `pipe-fn-seq` in mode `:fit` and in mode `:transform` for each split in `train-test-split-seq`.

The function returns a seq of seqs of evaluation results per pipe-fn per train-test split. Each evaluation result is a context map, which is specified in the malli schema attached to this function.

* `pipe-fn-or-decl-seq` needs to be a sequence of pipeline functions or pipeline declarations which follow the metamorph approach. Functions of this type are typically produced by calling `scicloj.metamorph/pipeline`. Documentation is here:
* `train-test-split-seq` needs to be a sequence of maps containing the train and test datasets (each a tech.ml.dataset) at keys `:train` and `:test`. `tablecloth.api/split->seq` produces such splits. Supervised models require both keys (`:train` and `:test`), while unsupervised models only use `:train`.
* `metric-fn` is the metric function to use, typically coming from `tech.v3.ml.loss`. For supervised models, the metric-fn receives the truth and the predicted values as double arrays and should return a single double number. For unsupervised models, the function receives the fitted ctx and should return a single double number as well. This metric is used to sort and eventually filter the results, depending on the options (`:return-best-pipeline-only` and `:return-best-crossvalidation-only`). The notion of `best` comes from metric-fn combined with `loss-or-accuracy`.
* `loss-or-accuracy` states whether the metric-fn is a loss or an accuracy calculation. Can be `:loss` or `:accuracy`. It decides the notion of the `best` model: in case of `:loss`, pipelines with a lower metric are better; in case of `:accuracy`, pipelines with a higher metric are better.
* `options` is a map controlling some mainly performance-related parameters. This function can potentially produce a large amount of data, enough to bring the JVM out of memory. The defaults are quite aggressive in removing details; how much detail the function returns can be tuned via the following keys:
    * `:result-dissoc-in-seq` - Controls how much information is returned for each cross validation. `dissoc-in` is called with every path in this seq on the `fit-ctx` and `transform-ctx` before returning them, so every path in result-dissoc-in-seq is removed from the evaluation result. Default is `scicloj.metamorph.ml/default-result-dissoc-in-seq`, which is:

    ```
    [[:fit-ctx :metamorph/data]

     [:train-transform :ctx :metamorph/data]

     [:train-transform :ctx :model :scicloj.metamorph.ml/target-ds]
     [:train-transform :ctx :model :scicloj.metamorph.ml/feature-ds]

     [:test-transform :ctx :metamorph/data]

     [:test-transform :ctx :model :scicloj.metamorph.ml/target-ds]
     [:test-transform :ctx :model :scicloj.metamorph.ml/feature-ds]
     ;;  scicloj.ml.smile specific
     [:train-transform :ctx :model :model-data :model-as-bytes]
     [:test-transform :ctx :model :model-data :model-as-bytes]]
    ```

    This ns contains two other result-dissoc-in sequences:
        * `result-dissoc-in-seq--ctxs` : Removes all contexts from the result. This should remove all 'big data'.
        * `result-dissoc-in-seq--all` : Only keeps the metric value per pipeline(s).
    * `:return-best-pipeline-only` - Only return information on the best performing pipeline. Default is true.
    * `:return-best-crossvalidation-only` - Only return information on the best cross validation (per pipeline returned). Default is true.
    * `:map-fn` - Controls parallelism, that is, whether map (`:map`), pmap (`:pmap`) or mapv (`:mapv`) is used to map over the different pipelines. Default is `:pmap`.
    * `:evaluation-handler-fn` - Gets called once with the complete result of an individual evaluation step. Its result is ignored and its default is a no-op. It can be used for side effects, like experiment tracking on disk. The passed-in evaluation result is a map with all information on the current evaluation, including the datasets used.
    * `:other-metrices` - Specifies other metrics to be calculated during evaluation.

This function also expects the ground truth of the target variable in a specific place in the context, at key `:model :scicloj.metamorph.ml/target-ds`. See here for the simplest way to set this up: https://github.com/behrica/metamorph.ml/blob/main/README.md The function scicloj.ml.metamorph/model does this correctly.
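
Two ideas from this description, removing paths from a result map and picking the `best` result depending on `:loss` or `:accuracy`, can be sketched in plain Clojure. This is only an illustration under stated assumptions: `sketch-dissoc-in`, `sketch-strip-result` and `sketch-best` are hypothetical helpers, and the top-level `:metric` key is a simplification made for this example, not the documented result schema.

```clojure
;; Hypothetical helper: remove the value at nested path ks, if present.
(defn sketch-dissoc-in
  [m ks]
  (let [parent (butlast ks)
        k      (last ks)]
    (cond
      (empty? ks)              m
      (empty? parent)          (dissoc m k)
      (map? (get-in m parent)) (update-in m parent dissoc k)
      :else                    m)))

;; Remove every path in paths from a single evaluation result.
(defn sketch-strip-result
  [result paths]
  (reduce sketch-dissoc-in result paths))

;; Lower is better for :loss, higher is better for :accuracy.
;; The :metric key is an assumption made for this sketch.
(defn sketch-best
  [results loss-or-accuracy]
  (let [sorted (sort-by :metric results)]
    (case loss-or-accuracy
      :loss     (first sorted)
      :accuracy (last sorted))))

(def result {:fit-ctx {:metamorph/data :big-dataset
                       :model          {:id 1}}
             :metric  0.2})

(sketch-strip-result result [[:fit-ctx :metamorph/data]])
;; => {:fit-ctx {:model {:id 1}}, :metric 0.2}

(sketch-best [{:metric 0.2} {:metric 0.5}] :loss)     ;; => {:metric 0.2}
(sketch-best [{:metric 0.2} {:metric 0.5}] :accuracy) ;; => {:metric 0.5}
```

Stripping the dataset paths before returning is what keeps the memory footprint small when many pipelines and splits are evaluated.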

explain^clj

(explain model & [options])

Explain (if possible) an ml model. A model explanation is a model-specific map of data that usually indicates some level of mapping between features and importance.

fit^clj

(fit data & ops)

Helper function which executes pipeline op(s) in mode :fit on the given data and returns the fitted ctx.

Main use is for cases in which the pipeline gets executed once and no model is part of the pipeline.

fit-pipe^clj

(fit-pipe data pipe-fn)

Helper function which executes pipeline op(s) in mode :fit on the given data and returns the fitted ctx.

Main use is for cases in which the pipeline gets executed once and no model is part of the pipeline.

format-fn-sources^clj

(format-fn-sources fn-sources)

get-nice-source-info^clj

(get-nice-source-info pipeline-decl pipe-fns-ns pipe-fns-source-file)

hyperparameters^clj

(hyperparameters model-kwd)

Get the hyperparameters for this model definition

lift^clj

(lift op & params)

Create context aware version of the given op function. :metamorph/data will be used as a first parameter.

Result of the op function will be stored under :metamorph/data
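
A minimal plain-Clojure sketch of this lifting idea follows. `sketch-lift` is a hypothetical re-implementation written for illustration and is not claimed to match the library's code; the `add-n` operation is made up.

```clojure
;; Hypothetical re-implementation: wrap a data->data function so it
;; reads from and writes to :metamorph/data in the context map.
(defn sketch-lift
  [op & params]
  (fn [ctx]
    (assoc ctx :metamorph/data
           (apply op (:metamorph/data ctx) params))))

;; A pure function on data, with one extra parameter.
(defn add-n [xs n] (mapv #(+ % n) xs))

(def add-ten (sketch-lift add-n 10))

(add-ten {:metamorph/data [1 2 3] :metamorph/mode :fit})
;; => {:metamorph/data [11 12 13], :metamorph/mode :fit}
```

All other keys of the context pass through untouched, which is what lets ordinary data functions be used as pipeline steps.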

linear^clj

(linear start end)

(linear start end n-steps)

(linear start end n-steps res-dtype-or-space)

Create a gridsearch definition which does a linear search.

res-dtype-or-space may be either a datatype keyword or a vector of categorical values.

mae^clj

(mae predictions labels)

mean absolute error

model-definition-names^clj

(model-definition-names)

Return a list of all registered model definition names.

model-definitions*^clj

Map of model kwd to model definition

mse^clj

(mse predictions labels)

mean squared error

options->model-def^clj

(options->model-def options)

Return the model definition that corresponds to the :model-type option.

pipe-it^clj

(pipe-it data & ops)

Takes a data object, executes the pipeline op(s) with it in :metamorph/data in mode :fit, and returns the content of :metamorph/data. Useful for executing a pipeline of pure data->data functions on some data.

pipeline^clj

(pipeline & ops)

Create a metamorph pipeline function out of operators.

ops are metamorph-compliant functions (basically functions which take a ctx as first argument).

This function returns a function, which can be executed with a ctx as parameter.
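
The composition described here can be sketched in plain Clojure as threading a context map through each operator in turn. `sketch-pipeline` is a hypothetical simplification for illustration; it leaves out any bookkeeping the real function may do, and the two steps are made up.

```clojure
;; Hypothetical simplification: run each op on the ctx, left to right.
(defn sketch-pipeline
  [& ops]
  (fn [ctx]
    (reduce (fn [acc op] (op acc)) ctx ops)))

;; Two made-up steps, each a function of ctx returning a new ctx.
(defn inc-data [ctx] (update ctx :metamorph/data #(mapv inc %)))
(defn sum-data [ctx] (update ctx :metamorph/data #(reduce + %)))

(def pipe (sketch-pipeline inc-data sum-data))

(pipe {:metamorph/data [1 2 3] :metamorph/mode :fit})
;; => {:metamorph/data 9, :metamorph/mode :fit}
```

The data is first incremented to `[2 3 4]` and then summed to `9`, while `:metamorph/mode` is carried along unchanged.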

predict^clj

(predict dataset model)

Predict returns a dataset with only the predictions in it.

* For regression, a single column dataset is returned with the column named after the target.
* For classification, a dataset is returned with a float64 column for each target value and values that describe the probability distribution.

probability-distributions->labels^clj

(probability-distributions->labels prob-dists)

result-dissoc-in-seq--all^clj

result-dissoc-in-seq--ctxs^clj

rmse^clj

(rmse predictions labels)

root mean squared error
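
For reference, the three regression metrics in this namespace (mae, mse, rmse) follow the standard textbook formulas, sketched here in plain Clojure. The `sketch-` functions are hypothetical and operate on plain vectors of doubles; they are not the library implementations.

```clojure
;; Mean absolute error: average of |prediction - label|.
(defn sketch-mae
  [predictions labels]
  (/ (reduce + (map #(Math/abs (double (- %1 %2))) predictions labels))
     (count labels)))

;; Mean squared error: average of (prediction - label)^2.
(defn sketch-mse
  [predictions labels]
  (/ (reduce + (map #(let [d (double (- %1 %2))] (* d d)) predictions labels))
     (count labels)))

;; Root mean squared error: square root of the mse.
(defn sketch-rmse
  [predictions labels]
  (Math/sqrt (sketch-mse predictions labels)))

(def predictions [2.0 1.0 6.0 1.0])
(def labels      [1.0 2.0 3.0 4.0])

(sketch-mae predictions labels)  ;; => 2.0
(sketch-mse predictions labels)  ;; => 5.0
(sketch-rmse predictions labels) ;; square root of 5.0
```

The differences here are 1, -1, 3 and -3, so the larger errors weigh more heavily in mse and rmse than in mae.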

sobol-gridsearch^clj

(sobol-gridsearch opt-map)

(sobol-gridsearch opt-map start-idx)

Given a map of key->values where some of the values are gridsearch definitions, produce a sequence of fully defined maps.

```clojure
user> (require '[tech.v3.ml.gridsearch :as ml-gs])
nil
user> (def opt-map  {:a (ml-gs/categorical [:a :b :c])
                     :b (ml-gs/linear 0.01 1 10)
                     :c :not-searched})
user> opt-map
{:a
 {:tech.v3.ml.gridsearch/type :linear,
  :start 0.0,
  :end 2.0,
  :n-steps 3,
  :result-space [:a :b :c]}
  ...

user> (ml-gs/sobol-gridsearch opt-map)
({:a :b, :b 0.56, :c :not-searched}
 {:a :c, :b 0.22999999999999998, :c :not-searched}
 {:a :b, :b 0.78, :c :not-searched}
...
```

thaw-model^clj

(thaw-model model)

(thaw-model model {:keys [thaw-fn]})

Thaw a model. Models returned from train may be 'frozen', meaning a 'thaw' operation is needed in order to use the model. This happens for you during predict, but you may also cache the 'thawed' model on the model map under the ':thawed-model' keyword in order to do fast predictions on small datasets.

train^clj

(train dataset options)

Given a dataset and an options map, produce a model. The :model-type keyword in the options map selects which model definition to use to train the model. Returns a map containing at least:

* `:model-data` - the result of that definition's train-fn.
* `:options` - the options passed in.
* `:id` - new randomly generated UUID.
* `:feature-columns` - vector of column names.
* `:target-columns` - vector of column names.

transform-pipe^clj

(transform-pipe data pipe-fn ctx)

Helper function which executes the passed pipe-fn on the given data in mode :transform. It merges the data into the provided ctx while doing so.
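
The interplay of fitting and transforming can be sketched in plain Clojure as follows. Everything here is an assumption made for illustration: `sketch-fit-pipe` and `sketch-transform-pipe` are hypothetical stand-ins, `center-step` is a made-up step, and storing the learned value under a `::mean` key is a simplification of how a real step would keep its state.

```clojure
;; Hypothetical stand-ins for the fit and transform helpers.
(defn sketch-fit-pipe
  [data pipe-fn]
  (pipe-fn {:metamorph/data data :metamorph/mode :fit}))

(defn sketch-transform-pipe
  [data pipe-fn ctx]
  (pipe-fn (merge ctx {:metamorph/data data :metamorph/mode :transform})))

;; Made-up step: learns the mean in :fit and reuses it in :transform.
(defn center-step
  [ctx]
  (let [xs   (:metamorph/data ctx)
        mean (if (= :fit (:metamorph/mode ctx))
               (/ (reduce + xs) (double (count xs)))
               (::mean ctx))]
    (assoc ctx
           ::mean mean
           :metamorph/data (mapv #(- % mean) xs))))

(def fitted (sketch-fit-pipe [1.0 2.0 3.0] center-step))
(:metamorph/data fitted)
;; => [-1.0 0.0 1.0]

(:metamorph/data (sketch-transform-pipe [4.0 6.0] center-step fitted))
;; => [2.0 4.0]
```

Because the fitted ctx is merged with the new data, the mean learned from the training data (2.0) is applied to the new data instead of being recomputed.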
