scicloj.metamorph.ml

Core machine learning framework integrating metamorph pipelines with standardized model APIs.

This is the central namespace of metamorph.ml, providing infrastructure for:

  • Registering and using machine learning models
  • Training models and making predictions
  • Evaluating pipelines via cross-validation
  • Standardized model diagnostics (glance, tidy, augment)
  • Optional caching of computationally expensive operations

Key Concepts:

Model Registration: Models are registered using define-model! and can be referenced by keyword (e.g., :fastmath/ols, :metamorph.ml/dummy-classifier). Models define a train-fn, predict-fn, and optional diagnostic functions.

Training and Prediction:

  • train: Train a model on a dataset given options including :model-type
  • predict: Make predictions using a trained model
  • train-predict-cache: Optional cache to avoid redundant computations

Pipeline Evaluation:

  • evaluate-pipelines: Evaluate multiple pipelines across train/test splits
  • evaluate-one-pipeline: Evaluate a single pipeline with cross-validation
  • Returns results sorted by metric performance with optional filtering
  • Supports parallel evaluation (:map/:pmap/:ppmap)

Model Diagnostics (following tidymodels conventions):

  • glance: One-row model summary (goodness-of-fit)
  • tidy: One-row-per-component output (coefficients with statistics)
  • augment: One-row-per-observation output (predictions, residuals)

Main API Functions:

  • define-model!: Register a new model type with train/predict/diagnostic functions
  • train: Train a model with a specified model-type
  • predict: Generate predictions from a trained model
  • evaluate-pipelines: Evaluate pipelines with cross-validation
  • glance: Get model summary statistics
  • tidy: Extract coefficient-level results
  • augment: Add predictions and residuals to data

Pipeline Integration:

Models integrate with metamorph pipelines via the model step, which:

  • Trains in :fit mode using training data
  • Predicts in :transform mode on new data
  • Stores model output column metadata for later evaluation
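
The fit/transform behaviour listed above can be sketched in plain Clojure. This is only an illustration of the pattern, not the library's `model` step: the context keys `:metamorph/data`, `:metamorph/mode` and `:metamorph/id` follow the metamorph convention, while `mean-model-step` and the vector-of-numbers "dataset" are hypothetical stand-ins.

```clojure
;; Hypothetical step: "trains" in :fit mode by storing the mean of the data
;; under the step id, and "predicts" that mean for every row in :transform mode.
(defn mean-model-step
  [{:metamorph/keys [data mode id] :as ctx}]
  (case mode
    :fit       (assoc ctx id {:mean (/ (reduce + data) (double (count data)))})
    :transform (assoc ctx :metamorph/data
                      (vec (repeat (count data) (:mean (get ctx id)))))))

;; :fit stores the learned state in the context.
(def fitted
  (mean-model-step {:metamorph/data [1 2 3 6]
                    :metamorph/mode :fit
                    :metamorph/id   :model-1}))

;; :transform reuses that state on new data.
(def predicted
  (mean-model-step (assoc fitted
                          :metamorph/data [10 20]
                          :metamorph/mode :transform)))

(:metamorph/data predicted) ; => [3.0 3.0]
```

The point of the sketch is that the fitted context carries the learned state, so the same step function can be replayed on unseen data.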

Example Usage:

;; Register a custom model (rarely needed - use existing models)
(define-model! :my/custom-model train-fn predict-fn {...})

;; Train a model
(let [model (train iris-data {:model-type :fastmath/ols
                              :target-columns [:Sepal.Width]
                              :feature-columns [:Sepal.Length]})]
  ;; Get diagnostics
  (glance model)
  (tidy model)
  ;; Make predictions
  (predict iris-data model))

;; Evaluate multiple pipelines in cross-validation
(evaluate-pipelines
  [pipeline1 pipeline2]
  train-test-splits
  metric-fn
  :accuracy
  {:map-fn :pmap})

Built-in Models:

Regression:

  • :metamorph.ml/ols: Apache Commons Math OLS
  • :fastmath/ols: FastMath OLS
  • :fastmath/glm: FastMath GLM
  • :metamorph.ml/dummy-regressor: Mean baseline

Classification:

  • :metamorph.ml/dummy-classifier: Majority class or random baseline

Preprocessing: See specific namespaces for transformers:

  • scicloj.metamorph.ml.preprocessing: Scaling and normalization
  • scicloj.metamorph.ml.categorical: One-hot encoding
  • scicloj.metamorph.ml.r-model-matrix: R formula features

See also: scicloj.metamorph.core for metamorph pipeline mechanics, scicloj.metamorph.ml.tidy-models for diagnostic validation

scicloj.metamorph.ml.cache

Caching infrastructure for metamorph.ml train/predict operations.

This namespace provides flexible caching backends to store and retrieve results of machine learning training and prediction operations. This is useful for avoiding redundant computations when working with the same models and data.

Supported cache backends:

  • Atom cache: In-memory caching using a Clojure atom (fast, ephemeral)
  • Disk cache: File-based caching using Nippy serialization (persistent)
  • Redis cache: Distributed caching via Redis (requires carmine library)

Usage:
(enable-atom-cache! (atom {}))  ; Enable in-memory caching
;; or
(enable-disk-cache! "/tmp/ml-cache")  ; Enable disk-based caching
;; or
(enable-redis-cache! {...})  ; Enable Redis caching

To disable caching:
(disable-cache!)

See individual function docs for more details on each backend.
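
To illustrate what an atom-backed cache buys you, here is a minimal plain-Clojure sketch. It does not use this namespace's functions; `expensive-train`, `cached-train` and the choice of cache key are hypothetical and only show the idea of skipping a repeated computation.

```clojure
;; Hypothetical in-memory cache, keyed by the call arguments themselves.
(def cache (atom {}))
(def train-calls (atom 0))

(defn expensive-train
  "Stand-in for a costly training function; counts how often it really runs."
  [data options]
  (swap! train-calls inc)
  {:model-type (:model-type options)
   :mean       (/ (reduce + data) (double (count data)))})

(defn cached-train
  "Returns the cached result for [data options] if present, else computes and stores it."
  [data options]
  (let [k [data options]]
    (if-let [hit (get @cache k)]
      hit
      (let [result (expensive-train data options)]
        (swap! cache assoc k result)
        result))))

(cached-train [1 2 3] {:model-type :demo/mean})
(cached-train [1 2 3] {:model-type :demo/mean}) ; served from the cache
@train-calls ; => 1
```

An atom cache like this lives only as long as the process, which matches the "fast, ephemeral" description of the atom backend.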

scicloj.metamorph.ml.categorical

Categorical feature encoding for machine learning pipelines.

This namespace provides metamorph transformers for handling categorical variables commonly used in supervised learning. Currently focuses on one-hot encoding, which converts categorical values into binary indicator columns.

One-hot encoding is essential for:

  • Preparing categorical features for algorithms that expect numeric inputs
  • Preventing ordinal assumptions on nominal categories
  • Creating interpretable model features

Main API:

  • transform-one-hot: The primary metamorph transformer for one-hot encoding

Encoding strategies:

  • :full: Uses a predefined level set from full dataset context
  • :fit: Levels discovered during :fit used in :transform
  • :independent: Each mode independently determines and encodes levels
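
The :fit strategy can be pictured with a small plain-Clojure sketch: levels are collected once from training values and then reused for new values. The helper names `fit-levels` and `one-hot` are hypothetical, and encoding an unseen value as all zeros is an assumption of this sketch, not documented behaviour of `transform-one-hot`.

```clojure
(defn fit-levels
  "Collects the distinct category levels seen in the training values."
  [values]
  (vec (sort (distinct values))))

(defn one-hot
  "Encodes each value as a 0/1 indicator vector over the given levels.
  A value outside the levels encodes as all zeros (sketch assumption)."
  [levels values]
  (mapv (fn [v] (mapv #(if (= % v) 1 0) levels)) values))

;; Levels are learned from the training data only.
(def levels (fit-levels ["red" "green" "red" "blue"]))
;; levels => ["blue" "green" "red"]

;; New data is encoded against the learned levels.
(one-hot levels ["green" "purple"])
;; => [[0 1 0] [0 0 0]]
```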

scicloj.metamorph.ml.classification

Classification models and evaluation metrics for metamorph.ml.

This namespace provides tools for classification tasks including:

  • Confusion matrix generation and analysis
  • Baseline classifier implementations
  • Classification evaluation utilities

Key features:

  • confusion-map: Creates confusion matrices from predictions and true labels
  • confusion-map->ds: Converts confusion matrices to tabular dataset format
  • :metamorph.ml/dummy-classifier: A baseline classifier for sanity checks

Dummy Classifier Strategies:

  • :majority-class (default): Always predicts the most frequent class
  • :fixed-class: Predicts a specified class
  • :random-class: Predicts randomly from the observed classes

Confusion Matrix Normalization:

  • :all (default): Row-wise normalization (recall perspective)
  • :none: Raw counts
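
The difference between raw counts and row-wise normalization can be sketched in plain Clojure. The functions below are hypothetical and the nested `{actual {predicted count}}` layout is an assumption made for illustration; the real `confusion-map` may arrange its result differently.

```clojure
(defn confusion-counts
  "Nested map {actual-label {predicted-label count}} of raw counts."
  [predicted actual]
  (reduce (fn [m [p a]] (update-in m [a p] (fnil inc 0)))
          {}
          (map vector predicted actual)))

(defn normalize-rows
  "Divides every count by the total of its row, i.e. of its actual label."
  [counts]
  (into {}
        (for [[a row] counts
              :let [total (reduce + (vals row))]]
          [a (into {} (for [[p n] row] [p (/ n (double total))]))])))

(def counts (confusion-counts [0 1 0 1 1] [0 0 1 1 1]))
;; counts => {0 {0 1, 1 1}, 1 {0 1, 1 2}}

(get-in (normalize-rows counts) [0 0]) ; => 0.5
```

With rows indexed by the actual label, the normalized diagonal entry of each row is that class's recall.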

Example usage:
(let [pred   [0 1 0 1 1]
      actual [0 0 1 1 1]
      conf-map (confusion-map pred actual :none)]
  (confusion-map->ds conf-map))

See also: scicloj.metamorph.ml/define-model!, scicloj.metamorph.ml.viz/confusion-matrix

scicloj.metamorph.ml.column-metric

Model evaluation metrics for classification and regression tasks.

This namespace provides functions to compute standard machine learning metrics on model predictions vs. ground truth labels, with support for both binary and multiclass classification as well as regression tasks.

Key Functions:

  • classification-metric: Evaluate classification model predictions
  • regression-metric: Evaluate regression model predictions

Classification Metrics (from fastmath.stats): Supports binary and multiclass metrics including accuracy, precision, recall, F1-score, and more. Multiclass metrics can be averaged using:

  • :macro - Unweighted mean of per-class metrics
  • :micro - Aggregated true/false positives globally

Also supports :roc-auc for multiclass AUC scoring.
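
The two averaging modes can give quite different numbers on imbalanced data. The sketch below computes macro and micro precision in plain Clojure. It is not the fastmath implementation; the helper names are hypothetical, and scoring a class with no predictions as 0.0 is an assumption of this sketch.

```clojure
(defn precision-parts
  "True-positive and false-positive counts for one class label."
  [label y-true y-pred]
  (let [pairs (map vector y-true y-pred)]
    {:tp (count (filter (fn [[t p]] (and (= p label) (= t label))) pairs))
     :fp (count (filter (fn [[t p]] (and (= p label) (not= t label))) pairs))}))

(defn macro-precision
  "Unweighted mean of the per-class precisions."
  [labels y-true y-pred]
  (let [per-class (for [l labels
                        :let [{:keys [tp fp]} (precision-parts l y-true y-pred)]]
                    (if (zero? (+ tp fp)) 0.0 (/ tp (double (+ tp fp)))))]
    (/ (reduce + per-class) (count labels))))

(defn micro-precision
  "Precision computed from true/false positives summed over all classes."
  [labels y-true y-pred]
  (let [parts (map #(precision-parts % y-true y-pred) labels)
        tp    (reduce + (map :tp parts))
        fp    (reduce + (map :fp parts))]
    (/ tp (double (+ tp fp)))))

(def y-true [:a :a :a :a :b :c])
(def y-pred [:a :a :a :a :c :c])

;; Per-class precision: :a 1.0, :b 0.0 (never predicted), :c 0.5
(macro-precision [:a :b :c] y-true y-pred) ; => 0.5
(micro-precision [:a :b :c] y-true y-pred) ; 5 correct out of 6 predictions
```

Macro averaging penalizes the missed minority class :b heavily, while micro averaging is dominated by the frequent class :a.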

Regression Metrics (from fastmath.stats): Distance and similarity metrics such as MAE, MSE, RMSE, R², etc.
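
For reference, the three most common of these metrics are easy to state in plain Clojure. This is a sketch over plain sequences of numbers, not the dataset-based `regression-metric` function; the names `mae`, `mse` and `rmse` are defined here for illustration only.

```clojure
(defn errors [y-true y-pred]
  (map - y-true y-pred))

(defn mae
  "Mean absolute error."
  [y-true y-pred]
  (/ (reduce + (map #(Math/abs (double %)) (errors y-true y-pred)))
     (count y-true)))

(defn mse
  "Mean squared error."
  [y-true y-pred]
  (/ (reduce + (map #(* % %) (errors y-true y-pred)))
     (double (count y-true))))

(defn rmse
  "Root mean squared error."
  [y-true y-pred]
  (Math/sqrt (mse y-true y-pred)))

(mae [3 5 2] [2 5 4]) ; => 1.0
(mse [3 5 2] [2 5 4]) ; 5/3 as a double
```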

Data Format:

  • Input datasets must be tech.ml.dataset (TMD) format
  • Must have appropriate column metadata (:prediction, :target, etc.)
  • Support categorical mappings via :categorical-map metadata
  • Missing values and NaNs are detected and rejected appropriately

Validation: The functions perform extensive validation including:

  • Column metadata correctness
  • Missing values and NaN detection
  • Type and datatype uniformity
  • Row count alignment between datasets
  • Single-label assumption (multi-label not yet supported)

Example:
(classification-metric y-true y-pred :f1 :macro {})
(regression-metric y-true y-pred :mse)

See also: fastmath.stats documentation for available metric names

scicloj.metamorph.ml.design-matrix

Design matrix construction for machine learning pipelines.

This namespace provides utilities to transform datasets into numeric design matrices suitable for machine learning models. It supports deriving new features, transforming existing columns, managing target variables, and expanding complex column types (arrays, maps).

Main Entry Point:

  • create-design-matrix: Transform a dataset into a design matrix with custom specs

Design Matrix Specification Syntax:

Column specifications use [column-name transformation] pairs where:

  • Transformations are Clojure expressions (quoted with ')
  • Expressions can reference column names directly as symbols
  • Expressions are evaluated in order and can chain
  • Non-listed columns are removed from the output

Shorthand Syntax:

  • :column-name: Keeps column unchanged (identity function)
  • [nil '(+ a b)]: Auto-generates column name for derived column
  • ['(+ a b)]: Same as above

Available Aliases (no qualification needed):

  • ds - tech.v3.dataset
  • tc - tablecloth.api
  • tcc - tablecloth.column.api
  • All of clojure.core

Features:

  • Derives new columns from existing data
  • Expands array and map columns into separate columns
  • Automatically converts categorical columns to numbers
  • Sets inference target(s) for supervised learning
  • Chains transformations in dependency order

Example:
(create-design-matrix
  iris-data
  [:species]                          ; target column
  [[:petal-length identity]           ; keep as-is
   [:sepal-ratio '(/ :sepal-length    ; derive new feature
                     :sepal-width)]])

Limitations:

  • Does not automatically expand categorical variables (specify manually)
  • For linear regression, fastmath/ols offers a :transformer option using R formulas
  • Design matrix approach is more flexible but less compact than R formula syntax

See also: fastmath.ml/lm for linear regression with formula-based transformations

scicloj.metamorph.ml.gridsearch

Grid search over option maps: create a map whose values are gridsearch definitions, then run the grid search to produce a sequence of fully defined maps.

The initial default implementation uses the Sobol sequence.
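
The "map in, sequence of fully defined maps out" idea can be sketched in plain Clojure. Note that this sketch swaps in a simple exhaustive cartesian expansion, not the Sobol-sequence sampling the namespace uses by default, and that marking candidates with plain vectors is a hypothetical convention chosen for illustration.

```clojure
(defn expand-grid
  "Expands a map whose values may be vectors of candidates into a
  sequence of fully defined maps (exhaustive cartesian product)."
  [grid]
  (reduce (fn [maps [k v]]
            (if (vector? v)
              (for [m maps, choice v] (assoc m k choice))
              (map #(assoc % k v) maps)))
          [{}]
          grid))

(def candidates
  (expand-grid {:model-type :demo/model
                :max-depth  [2 4]
                :alpha      [0.1 1.0]}))

(count candidates) ; => 4, one map per combination of :max-depth and :alpha
```

Exhaustive expansion grows multiplicatively with every added option, which is one reason a sampling scheme such as a Sobol sequence is attractive for larger search spaces.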

scicloj.metamorph.ml.preprocessing

Feature scaling and normalization transformers for metamorph pipelines.

This namespace provides metamorph-compatible transformers for standardizing and normalizing numeric features. These preprocessing steps are essential for many machine learning algorithms to perform well.

Available Transformers:

  • std-scale: Standardization (z-score normalization)
  • min-max-scale: Min-max scaling to a specified range

Standard Scaling (std-scale): Centers each numeric column (subtracts the mean) and/or scales it by its standard deviation. With both enabled, the result is zero-mean, unit-variance data. Useful for:

  • Algorithms sensitive to feature magnitude (SVMs, neural networks, KNN)
  • Distance-based models

Options:

  • :mean? (default true): Center by subtracting column mean
  • :stddev? (default true): Scale by standard deviation
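
A plain-Clojure sketch of what the two options control follows. It is not the `std-scale` implementation: the function names are hypothetical, and using the sample standard deviation (dividing by n - 1) is an assumption of this sketch.

```clojure
(defn fit-std-scale
  "Learns the mean and sample standard deviation of a numeric column."
  [xs]
  (let [n        (count xs)
        mean     (/ (reduce + xs) (double n))
        variance (/ (reduce + (map #(Math/pow (- % mean) 2) xs)) (dec n))]
    {:mean mean :stddev (Math/sqrt variance)}))

(defn apply-std-scale
  "Applies learned parameters; :mean? and :stddev? toggle centering and scaling."
  [{:keys [mean stddev]} xs {:keys [mean? stddev?] :or {mean? true stddev? true}}]
  (mapv (fn [x]
          (cond-> (double x)
            mean?   (- mean)
            stddev? (/ stddev)))
        xs))

(def std-params (fit-std-scale [2 4 6]))
;; std-params => {:mean 4.0, :stddev 2.0}

(apply-std-scale std-params [2 4 6 8] {})                ; => [-1.0 0.0 1.0 2.0]
(apply-std-scale std-params [2 4 6] {:stddev? false})    ; => [-2.0 0.0 2.0]
```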

Min-Max Scaling (min-max-scale): Rescales each numeric column to a specified range (default [-0.5, 0.5]).

Options:

  • :min (default -0.5): Target minimum value
  • :max (default 0.5): Target maximum value
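
The rescaling itself is a linear map from the observed column range onto the target range. A minimal plain-Clojure sketch, with hypothetical function names and no handling of constant columns (where the observed range is zero):

```clojure
(defn fit-min-max
  "Learns the observed minimum and maximum of a numeric column."
  [xs]
  {:lo (apply min xs) :hi (apply max xs)})

(defn apply-min-max
  "Maps [lo, hi] linearly onto [target-min, target-max]."
  [{:keys [lo hi]} target-min target-max xs]
  (mapv #(+ target-min (* (- target-max target-min)
                          (/ (- % lo) (double (- hi lo)))))
        xs))

(def mm-params (fit-min-max [10 20 30]))

(apply-min-max mm-params -0.5 0.5 [10 20 30]) ; => [-0.5 0.0 0.5]
```

Values outside the range seen during fitting map outside the target range, which is worth keeping in mind when transforming new data.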

Metamorph Integration: Both transformers follow the metamorph pipeline pattern:

  • :fit mode: Learn scaling parameters from training data
  • :transform mode: Apply learned parameters to new data
  • Stores transformation parameters in context under their assigned :metamorph/id

Example Usage (in metamorph pipeline):
(preprocessing/std-scale [:age :income] {:mean? true :stddev? true})

scicloj.metamorph.ml.r-model-matrix

R-style formula-based feature engineering and linear regression.

This namespace provides tools to leverage R's powerful formula syntax for feature engineering and linear modeling within Clojure. R formulas enable expressive specification of interactions, transformations, and categorical expansions without manual column manipulation.

Key Functions:

  • r-model-matrix: Convert dataset + R formula to design matrix
  • lm: Simplified linear regression using R formulas

Implementation Backends: The namespace supports multiple R execution backends:

  • :ocpu: Remote R via OpenCPU (cloud.opencpu.org) - no local R needed
  • :renjine: Java-based R implementation (https://renjin.org/)
  • :clojisr: Local R via clojisr (requires R installation)

Model Matrix Capabilities: R formulas handle:

  • Basic features: y ~ x1 + x2
  • Interactions: y ~ x1 * x2 (expands to x1 + x2 + x1:x2)
  • Polynomial terms: y ~ x + I(x^2)
  • Categorical encoding: Automatic dummy variable creation
  • Intercept control: y ~ x - 1 (remove intercept)
  • Exclusions: y ~ . - x3 (all columns except x3)
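
To make the interaction expansion concrete, the following plain-Clojure sketch builds the columns that the term `x1 * x2` stands for when both variables are numeric. It does not call R or `r-model-matrix`; `expand-interaction` and the string-keyed row maps are hypothetical, and the intercept column is left out.

```clojure
(defn expand-interaction
  "For numeric columns a and b, builds the columns behind the R term a * b:
  a, b and their product a:b."
  [rows a b]
  (mapv (fn [row]
          (assoc (select-keys row [a b])
                 (str a ":" b) (* (get row a) (get row b))))
        rows))

(expand-interaction [{"x1" 2 "x2" 3} {"x1" 4 "x2" 5}] "x1" "x2")
;; => [{"x1" 2, "x2" 3, "x1:x2" 6} {"x1" 4, "x2" 5, "x1:x2" 20}]
```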

Linear Regression (lm): Combines formula-based feature engineering with OLS regression training. Returns a ready-to-use trained model for predictions.

Example Usage:
(r-model-matrix iris-data "~ Sepal.Length + Sepal.Width" :renjine)
(lm iris-data "Sepal.Width ~ Sepal.Length * Petal.Length"
    :Sepal.Width :ocpu)

Notes:

  • OpenCPU backend is convenient but requires internet connectivity
  • Renjin is standalone but may have some R incompatibilities
  • clojisr requires a local R installation but offers full R compatibility
  • Returned model matrices exclude row names and intercept columns by default

See also: scicloj.metamorph.ml.design-matrix for Clojure-native feature engineering

scicloj.metamorph.ml.rdatasets

scicloj.metamorph.ml.regression

Regression models for continuous target prediction.

This namespace provides implementations of various regression algorithms with a consistent metamorph.ml training and prediction interface. Models support statistical output formats (tidy, glance, augment) for analysis and diagnostics.

Available Models:

OLS (Ordinary Least Squares)

  • :metamorph.ml/ols: Apache Commons Math implementation (Java-based)
  • :fastmath/ols: FastMath implementation (pure Clojure)

Solves for regression coefficients β in y = Xβ + ε. Assumes linear relationships and homoscedastic errors.
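
For a single feature, the least-squares solution has a simple closed form, which the following plain-Clojure sketch computes. It is an illustration of what OLS estimates, not the code behind `:metamorph.ml/ols` or `:fastmath/ols`; `simple-ols` is a hypothetical helper.

```clojure
(defn simple-ols
  "Least-squares intercept and slope for a single feature:
  slope = Sxy / Sxx, intercept = mean(y) - slope * mean(x)."
  [xs ys]
  (let [n      (count xs)
        mean-x (/ (reduce + xs) (double n))
        mean-y (/ (reduce + ys) (double n))
        sxy    (reduce + (map #(* (- %1 mean-x) (- %2 mean-y)) xs ys))
        sxx    (reduce + (map #(Math/pow (- % mean-x) 2) xs))
        slope  (/ sxy sxx)]
    {:intercept (- mean-y (* slope mean-x))
     :slope     slope}))

;; Data generated from y = 1 + 2x, so the fit recovers it exactly.
(simple-ols [1 2 3 4] [3 5 7 9])
;; => {:intercept 1.0, :slope 2.0}
```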

GLM (Generalized Linear Model)

  • :fastmath/glm: FastMath GLM implementation

Extends linear regression to non-normal distributions and non-linear relationships via link functions and variance models.

Baseline Model

  • :metamorph.ml/dummy-regressor: Predicts mean of training target

Useful sanity check - models should outperform this baseline.

Model Output Functions:

  • :tidy-fn: Extracts model coefficients with statistics. Returns dataset with :term, :estimate, :std.error, :statistic, :p.value
  • :glance-fn: Extracts model-level diagnostics. Returns dataset with :r.squared, :adj.r.squared, :rss, :aic, :bic, etc.
  • :augment-fn: Adds model predictions and residuals to data. Returns augmented dataset with :.fitted and :.resid columns

Example Usage (in metamorph pipeline):
(ml/train
  data
  {:model-type :fastmath/ols
   :target-columns [:price]
   :feature-columns [:sqft :bedrooms]})

Model Diagnostics:
(glance model)      ; Overall model metrics
(tidy model)        ; Coefficient table
(augment model ds)  ; Predicted values and residuals

See also: scicloj.metamorph.ml.r-model-matrix for formula-based feature engineering

scicloj.metamorph.ml.text

Large-scale text processing and TF-IDF feature engineering for NLP pipelines.

This namespace provides efficient tools for converting raw text documents into machine learning-ready features using TF-IDF (Term Frequency-Inverse Document Frequency) scoring. Designed to handle large text corpora with flexible memory management strategies.

Core Functions:

->tidy-text: Parses text files or datasets into tidy-text format (one token per row). Line-by-line processing enables handling of files larger than available RAM. Supports custom tokenization and metadata extraction.

Output format: tech.v3.dataset with columns:

  • :document (int): Document/line identifier
  • :token-idx (int): Token as indexed integer (maps to lookup table)
  • :token-pos (int): Position of token within document
  • :meta (optional): Arbitrary metadata from line-split-fn

->tfidf: Transforms tidy-text into TF-IDF vector representation for bag-of-words models. Calculates term frequency (TF) and inverse document frequency (IDF) for each token.

Output columns:

  • :document, :token-idx, :token-count, :tf, :tfidf
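
As an illustration of how such columns relate to each other, here is a plain-Clojure TF-IDF sketch over in-memory token vectors. It is not the `->tfidf` implementation: it keeps raw tokens instead of integer indices, and the formulas used (tf = count / document length, idf = ln(N / document frequency)) are one common variant assumed for this sketch; the library's exact weighting may differ.

```clojure
(defn tfidf
  "docs is a vector of token vectors. Returns one row per (document, token)."
  [docs]
  (let [n-docs   (count docs)
        ;; number of documents each token appears in
        doc-freq (frequencies (mapcat distinct docs))]
    (for [[doc-id tokens] (map-indexed vector docs)
          [token cnt]     (frequencies tokens)
          :let [tf  (/ cnt (double (count tokens)))
                idf (Math/log (/ n-docs (double (doc-freq token))))]]
      {:document    doc-id
       :token       token
       :token-count cnt
       :tf          tf
       :tfidf       (* tf idf)})))

(def tfidf-rows (tfidf [["a" "b" "a"] ["b" "c"]]))

;; "b" occurs in every document, so its idf (and tfidf) is 0.0;
;; "a" and "c" each occur in only one document and get a positive score.
(count tfidf-rows) ; => 4
```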

Memory Optimization:

The namespace provides flexible memory control for large texts via options:

Container Types:

  • :jvm-heap (default): Java heap storage (fast, limited by heap)
  • :native-heap: Off-heap native memory via tech.v3
  • :mmap: Memory-mapped files (disk-backed, bypasses heap limits)

Processing Options:

  • container-type: Storage for intermediate results during processing
  • column-container-type: Storage for final output dataset
  • combine-method: :coalesce-blocks! or :concat-buffers (tradeoffs)
  • compacting-document-interval: Batch size for consolidating data
  • datatype-document/token-pos/idx: Memory datatype selection (:int16 vs :int32)

Token Management:

  • token->index-map: Custom token lookup table (can reuse across runs)
  • new-token-behaviour: :store (default), :fail, or :as-unknown

Performance Characteristics:

  • Typical text requires ~1.5x the original file size in RAM
  • An 8 GB text file typically needs ≥12 GB total memory
  • Scaling strategy: Use off-heap or mmap for large corpora

Example Usage:

    ;; Load and tokenize text file
    (let [reader (io/reader "corpus.txt")
          result (->tidy-text
                  reader
                  #(line-seq %)
                  #(str/split % #"\t" 2) ; tab-separated: text, meta
                  #(str/split % #"\s+")  ; whitespace tokenization
                  :max-lines 100000)]
      ;; Extract tidy text datasets
      (doseq [ds (:datasets result)]
        ;; Convert to TF-IDF vectors
        (->tfidf ds
                 :container-type :native-heap
                 :column-container-type :jvm-heap)))

Typical Workflow:

  1. Use ->tidy-text to create tidy text format from raw documents
  2. Use ->tfidf to create TF-IDF feature vectors
  3. Pass vectors to classification/regression models

See also: scicloj.metamorph.ml.column-metric for evaluation, scicloj.metamorph.ml for model training


scicloj.metamorph.ml.tidy-models

Model output standardization and validation following tidymodels conventions.

This namespace implements the tidymodels philosophy (inspired by R's tidymodels/broom packages) for standardized, machine-readable model outputs. All model outputs conform to consistent schemas defined in canonical column specification files.

Three Core Output Functions:

glance: One-row model summary

  • High-level goodness-of-fit statistics
  • Examples: R², AIC, BIC, log-likelihood, F-statistic, p-value
  • Use case: Quick model performance overview

tidy: One-row-per-component output

  • Component-level details (e.g., one row per coefficient)
  • Examples: term, estimate, std.error, statistic, p.value
  • Use case: Detailed model inspection and reporting

augment: One-row-per-observation output

  • Adds model predictions/residuals to original data
  • Original columns plus: .fitted, .resid, .hat, .sigma, .cooksd
  • Use case: Diagnostics and visualization of predictions
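
The difference between the three shapes is easiest to see on a tiny model. The following sketch fits a one-variable least-squares line in plain Clojure and returns all three summaries as plain data. `simple-ols-summaries` is a hypothetical helper, and the keyword column names are illustrative renderings of the examples above rather than the library's validated schema.

```clojure
;; Hypothetical helper: fits y = a + b*x by least squares and returns the
;; three summaries as plain data. Column names follow the examples above,
;; written as keywords for illustration.
(defn simple-ols-summaries
  [xs ys]
  (let [n      (count xs)
        mean   (fn [vs] (/ (reduce + vs) (double n)))
        mx     (mean xs)
        my     (mean ys)
        sxy    (reduce + (map (fn [x y] (* (- x mx) (- y my))) xs ys))
        sxx    (reduce + (map (fn [x] (* (- x mx) (- x mx))) xs))
        slope  (/ sxy sxx)
        icpt   (- my (* slope mx))
        fitted (mapv (fn [x] (+ icpt (* slope x))) xs)
        resid  (mapv - ys fitted)
        sse    (reduce + (map * resid resid))
        sst    (reduce + (map (fn [y] (* (- y my) (- y my))) ys))]
    {;; one row for the whole model
     :glance  [{:r.squared (- 1.0 (/ sse sst))}]
     ;; one row per coefficient
     :tidy    [{:term "(Intercept)" :estimate icpt}
               {:term "x"           :estimate slope}]
     ;; one row per observation
     :augment (mapv (fn [x y f r] {:x x :y y :.fitted f :.resid r})
                    xs ys fitted resid)}))

;; Usage: a perfectly linear toy dataset
(simple-ols-summaries [1 2 3 4] [2 4 6 8])
```

The row counts differ by design: one row for glance, one per coefficient for tidy, and one per input observation for augment.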

Validation and Schema Management:

  • allowed-tidy-columns: Canonical list of valid tidy column names
  • allowed-glance-columns: Canonical list of valid glance column names
  • allowed-augment-columns: Canonical list of valid augment column names
  • validate-tidy-ds: Validates that a dataset conforms to the tidy standard
  • validate-glance-ds: Validates that a dataset conforms to the glance standard
  • validate-augment-ds: Validates that a dataset conforms to the augment standard

Schemas are maintained in the GitHub repository (resources/*.edn):

  • columms-tidy.edn
  • columms-glance.edn
  • columms-augment.edn

Control Validation: The *validate-tidy-fns* dynamic variable controls strict validation:

  • true (default): Raises an exception on invalid columns
  • false: Silently allows any columns
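
The toggle pattern described here (a dynamic var that switches strict checking on or off) can be sketched in plain Clojure. Everything in the block is a hypothetical stand-in: `*validate?*`, `allowed-columns`, and `validate-columns` are illustrative names, and the real allowed-column lists come from the library's schema files.

```clojure
;; Hypothetical stand-ins: the real allowed-column lists come from the
;; library's schema files; these names and values are illustrative only.
(def ^:dynamic *validate?* true)
(def allowed-columns #{:term :estimate :std.error :statistic :p.value})

(defn validate-columns
  "Returns `rows` unchanged, throwing on unexpected columns when
  validation is switched on."
  [rows]
  (let [invalid (remove allowed-columns (mapcat keys rows))]
    (when (and *validate?* (seq invalid))
      (throw (ex-info "Invalid columns" {:invalid (set invalid)})))
    rows))

;; Usage: strict by default, relaxed inside a binding
(validate-columns [{:term "x" :estimate 2.0}])
(binding [*validate?* false]
  (validate-columns [{:bogus 1}]))
```

Using `binding` keeps the relaxed mode scoped to one call site, so other code on the same thread returns to strict validation once the form exits.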

Integration: Model implementations use these validators in their tidy-fn/glance-fn/augment-fn to ensure outputs conform to standardized schemas for consistency across models.

See also: scicloj.metamorph.ml for training and prediction, scicloj.metamorph.ml.regression and scicloj.metamorph.ml.classification for specific model implementations


scicloj.metamorph.ml.toydata

Deprecated ns. Use scicloj.metamorph.ml.rdatasets instead


scicloj.metamorph.ml.toydata.ggplot

Deprecated ns. Use scicloj.metamorph.ml.rdatasets instead

