scicloj.metamorph — org.scicloj/metamorph.ml 1.7.0

scicloj.metamorph.ml

Core machine learning framework integrating metamorph pipelines with standardized model APIs.

This is the central namespace of metamorph.ml, providing infrastructure for:

- Registering and using machine learning models
- Training models and making predictions
- Evaluating pipelines via cross-validation
- Standardized model diagnostics (glance, tidy, augment)
- Optional caching of computationally expensive operations

Key Concepts:

**Model Registration**: Models are registered using `define-model!` and can be
referenced by keyword (e.g., `:fastmath/ols`, `:metamorph.ml/dummy-classifier`).
Models define a train-fn, predict-fn, and optional diagnostic functions.

**Training and Prediction**:

- `train`: Train a model on a dataset given options including :model-type
- `predict`: Make predictions using a trained model
- `train-predict-cache`: Optional cache to avoid redundant computations

**Pipeline Evaluation**:

- `evaluate-pipelines`: Evaluate multiple pipelines across train/test splits
- `evaluate-one-pipeline`: Evaluate a single pipeline with cross-validation
- Returns results sorted by metric performance with optional filtering
- Supports parallel evaluation (:map/:pmap/:ppmap)
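
The evaluation loop described above can be pictured with a small sketch in plain Clojure. It does not call `evaluate-pipelines`; the helpers `k-fold-splits`, `mae`, and `evaluate`, as well as the two toy "pipelines", are hypothetical names introduced only for illustration, and the data is a plain vector of numbers rather than a dataset.

```clojure
;; Illustrative only: plain Clojure, not the metamorph.ml API.

(defn k-fold-splits
  "Splits xs into k train/test pairs (hypothetical helper)."
  [k xs]
  (let [size  (long (Math/ceil (/ (count xs) (double k))))
        folds (vec (partition-all size xs))]
    (for [i (range (count folds))]
      {:train (vec (apply concat (concat (subvec folds 0 i)
                                         (subvec folds (inc i)))))
       :test  (vec (nth folds i))})))

(defn mae
  "Mean absolute error between predictions and true values."
  [preds truth]
  (/ (reduce + (map #(Math/abs (double (- %1 %2))) preds truth))
     (double (count truth))))

;; A toy "pipeline" maps training targets to a prediction function.
(def toy-pipelines
  {:mean-baseline (fn [train]
                    (let [m (/ (reduce + train) (double (count train)))]
                      (fn [_] m)))
   :always-zero   (fn [_train] (fn [_] 0.0))})

(defn evaluate
  "Fits every pipeline on each train split, scores it on the matching
   test split, and returns the results sorted by mean error (best first)."
  [pipelines splits]
  (->> (for [[id fit] pipelines]
         (let [errors (for [{:keys [train test]} splits
                            :let [predict (fit train)]]
                        (mae (map predict test) test))]
           {:pipeline id
            :mean-mae (/ (reduce + errors) (count errors))}))
       (sort-by :mean-mae)))

(def results
  (evaluate toy-pipelines (k-fold-splits 3 [10 12 11 13 12 14])))
```

Here `results` lists `:mean-baseline` first because its mean error across the three splits is lower than that of `:always-zero`.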


**Model Diagnostics** (following tidymodels conventions):

- `glance`: One-row model summary (goodness-of-fit)
- `tidy`: One-row-per-component output (coefficients with statistics)
- `augment`: One-row-per-observation output (predictions, residuals)

Main API Functions:

- `define-model!`: Register a new model type with train/predict/diagnostic functions
- `train`: Train a model with a specified model-type
- `predict`: Generate predictions from a trained model
- `evaluate-pipelines`: Evaluate pipelines with cross-validation
- `glance`: Get model summary statistics
- `tidy`: Extract coefficient-level results
- `augment`: Add predictions and residuals to data

Pipeline Integration:

Models integrate with metamorph pipelines via the `model` step, which:

- Trains in :fit mode using training data
- Predicts in :transform mode on new data
- Stores model output column metadata for later evaluation
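
A minimal sketch of the fit/transform behaviour listed above, written in plain Clojure. It only mimics the metamorph convention of a step function over a context map with `:metamorph/data`, `:metamorph/mode`, and `:metamorph/id`; the `mean-model-step` function and the row-map data are assumptions for illustration, not the library's `model` step.

```clojure
;; Illustrative only: mimics the metamorph step convention in plain Clojure.

(defn mean-model-step
  "Returns a step function. In :fit mode it 'trains' by storing the mean of
   the :y values under the step's :metamorph/id; in :transform mode it
   'predicts' by attaching that stored mean to every row."
  []
  (fn [{:metamorph/keys [data mode id] :as ctx}]
    (case mode
      :fit       (let [ys (map :y data)]
                   (assoc ctx id {:mean (/ (reduce + ys) (double (count ys)))}))
      :transform (let [m (get-in ctx [id :mean])]
                   (assoc ctx :metamorph/data
                          (mapv #(assoc % :prediction m) data))))))

(def step (mean-model-step))

;; Fit on training rows ...
(def fitted
  (step {:metamorph/id   :model-1
         :metamorph/mode :fit
         :metamorph/data [{:x 1 :y 2.0} {:x 2 :y 4.0}]}))

;; ... then reuse the fitted context to predict on new rows.
(def predicted
  (step (assoc fitted
               :metamorph/mode :transform
               :metamorph/data [{:x 3}])))

(:metamorph/data predicted)
;; => [{:x 3, :prediction 3.0}]
```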


Built-in Models:

**Regression**:
                              
- `:metamorph.ml/ols`: Apache Commons Math OLS
- `:fastmath/ols`: FastMath OLS
- `:fastmath/glm`: FastMath GLM
- `:metamorph.ml/dummy-regressor`: Mean baseline

**Classification**:

- `:metamorph.ml/dummy-classifier`: Majority class or random baseline
- `:metamorph.ml/random-forest`: Random forest classifier

**Preprocessing**:
  
See specific namespaces for transformers:
  
- `scicloj.metamorph.ml.preprocessing`: Scaling and normalization
- `scicloj.metamorph.ml.categorical`: One-hot encoding
- `scicloj.metamorph.ml.r-model-matrix`: R formula features

See also: `scicloj.metamorph.core` for metamorph pipeline mechanics,
`scicloj.metamorph.ml.tidy-models` for diagnostic validation

scicloj.metamorph.ml.cache

Caching infrastructure for metamorph.ml train/predict operations.

This namespace provides flexible caching backends to store and retrieve results
of machine learning training and prediction operations. This is useful for
avoiding redundant computations when working with the same models and data.

Supported cache backends:

- **Atom cache**: In-memory caching using a Clojure atom (fast, ephemeral)
- **Disk cache**: File-based caching using Nippy serialization (persistent)
- **Redis cache**: Distributed caching via Redis (requires carmine library)

Usage:

```
(enable-atom-cache! (atom {}))  ; Enable in-memory caching
;; or
(enable-disk-cache! "/tmp/ml-cache")  ; Enable disk-based caching
;; or
(enable-redis-cache! {...})  ; Enable Redis caching
```

To disable caching:

```
(disable-cache!)
```

See individual function docs for more details on each backend.

scicloj.metamorph.ml.categorical

Categorical feature encoding for machine learning pipelines.

This namespace provides metamorph transformers for handling categorical
variables commonly used in supervised learning. Currently focuses on
one-hot encoding, which converts categorical values into binary indicator columns.

One-hot encoding is essential for:

- Preparing categorical features for algorithms that expect numeric inputs
- Preventing ordinal assumptions on nominal categories
- Creating interpretable model features

Main API:

- `transform-one-hot`: The primary metamorph transformer for one-hot encoding

Encoding strategies:

- `:full`        Uses a predefined level set from full dataset context
- `:fit`         Levels discovered during :fit used in :transform
- `:independent` Each mode independently determines and encodes levels
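
The `:fit` strategy can be illustrated with a short sketch in plain Clojure: levels are collected once from the training rows and then reused when encoding new rows. This is not the `transform-one-hot` implementation; `fit-levels` and `one-hot` are hypothetical helpers, rows are plain maps, and the handling of unseen values (all zeros) is an assumption of this sketch.

```clojure
;; Illustrative only: plain Clojure, not the transform-one-hot implementation.

(defn fit-levels
  "Collects the distinct values of column k seen in the training rows."
  [rows k]
  (vec (sort (distinct (map k rows)))))

(defn one-hot
  "Replaces column k with one 0/1 indicator column per known level.
   Values outside `levels` get all zeros in this sketch."
  [rows k levels]
  (mapv (fn [row]
          (reduce (fn [r level]
                    (assoc r
                           (keyword (str (name k) "-" (name level)))
                           (if (= level (get row k)) 1 0)))
                  (dissoc row k)
                  levels))
        rows))

(def train-rows [{:color :red} {:color :blue} {:color :red}])

;; Levels are learned once, in the "fit" phase ...
(def levels (fit-levels train-rows :color))

;; ... and reused in the "transform" phase, even for unseen values.
(def encoded (one-hot [{:color :blue} {:color :green}] :color levels))
;; => [{:color-blue 1, :color-red 0} {:color-blue 0, :color-red 0}]
```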

transform-one-hot

scicloj.metamorph.ml.classification

Classification models and evaluation metrics for metamorph.ml.

This namespace provides tools for classification tasks including:

- Confusion matrix generation and analysis
- Baseline classifier implementations
- Classification evaluation utilities

Key features:

- `confusion-map`: Creates confusion matrices from predictions and true labels
- `confusion-map->ds`: Converts confusion matrices to tabular dataset format
- `:metamorph.ml/dummy-classifier`: A baseline classifier for sanity checks

Dummy Classifier Strategies:

- `:majority-class` (default): Always predicts the most frequent class
- `:fixed-class`: Predicts a specified class
- `:random-class`: Predicts randomly from the observed classes

Confusion Matrix Normalization:

- `:all` (default): Row-wise normalization (recall perspective)
- `:none`: Raw counts
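
To make the ideas above concrete, here is a small plain-Clojure sketch of a majority-class baseline together with a confusion matrix in raw and row-normalized form. It is not the library's `confusion-map`; the function names and the nested-map representation are assumptions chosen for illustration.

```clojure
;; Illustrative only: plain Clojure, not the confusion-map implementation.

(defn confusion-counts
  "Nested map {true-label {predicted-label count}} of raw counts."
  [truth preds]
  (reduce (fn [m [t p]] (update-in m [t p] (fnil inc 0)))
          {}
          (map vector truth preds)))

(defn normalize-rows
  "Divides each row by its total, so every true label's row sums to 1."
  [counts]
  (into {}
        (for [[t row] counts
              :let [total (reduce + (vals row))]]
          [t (into {} (for [[p n] row] [p (/ n (double total))]))])))

(defn majority-class
  "Most frequent label in the training targets (ties broken arbitrarily)."
  [labels]
  (key (apply max-key val (frequencies labels))))

(def truth [:cat :cat :cat :dog])

;; The baseline predicts the most frequent class for every observation.
(def preds (repeat (count truth) (majority-class truth)))

(def counts (confusion-counts truth preds))
;; => {:cat {:cat 3}, :dog {:cat 1}}

(def normalized (normalize-rows counts))
;; => {:cat {:cat 1.0}, :dog {:cat 1.0}}
```

The normalized view shows why a baseline matters: every `:dog` is misclassified as `:cat`, which the raw accuracy of 75% would hide.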



See also: [[scicloj.metamorph.ml.viz/confusion-matrix]]

scicloj.metamorph.ml.column-metric

Model evaluation metrics for classification and regression tasks.

This namespace provides functions to compute standard machine learning metrics
on model predictions vs. ground truth labels, with support for both binary and
multiclass classification as well as regression tasks.

Key Functions:

- `classification-metric`: Evaluate classification model predictions
- `regression-metric`: Evaluate regression model predictions

Classification Metrics (from fastmath.stats):

Supports binary and multiclass metrics including accuracy, precision, recall,
F1-score, and more. Multiclass metrics can be averaged using:

- `:macro` - Unweighted mean of per-class metrics
- `:micro` - Aggregated true/false positives globally

Also supports `:roc-auc` for multiclass AUC scoring.

Regression Metrics (from fastmath.stats):
Distance and similarity metrics such as MAE, MSE, RMSE, R², etc.

Data Format:

- Input datasets must be tech.ml.dataset (TMD) format
- Must have appropriate column metadata (:prediction, :target, etc.)
- Supports categorical mappings via :categorical-map metadata
- Missing values and NaNs are detected and rejected appropriately

Validation:
The functions perform extensive validation including:

- Column metadata correctness
- Missing values and NaN detection
- Type and datatype uniformity
- Row count alignment between datasets
- Single-label assumption (multi-label not yet supported)


See also: `fastmath.stats` documentation for available metric names
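
The difference between `:macro` and `:micro` averaging described above can be worked through by hand. The sketch below computes both for precision in plain Clojure; it does not call `classification-metric` or fastmath, the helper names are hypothetical, and returning 0.0 for a class with no predictions is an assumption of this sketch.

```clojure
;; Illustrative only: plain Clojure, not the classification-metric API.

(defn per-class-counts
  "For each class, counts true positives and false positives."
  [truth preds]
  (let [pairs (map vector truth preds)]
    (into {}
          (for [c (distinct (concat truth preds))]
            [c {:tp (count (filter (fn [[t p]] (and (= p c) (= t c))) pairs))
                :fp (count (filter (fn [[t p]] (and (= p c) (not= t c))) pairs))}]))))

(defn precision
  "tp / (tp + fp); 0.0 when the class was never predicted (assumption)."
  [{:keys [tp fp]}]
  (if (zero? (+ tp fp)) 0.0 (/ tp (double (+ tp fp)))))

(defn macro-precision
  "Unweighted mean of the per-class precisions."
  [counts]
  (/ (reduce + (map precision (vals counts))) (count counts)))

(defn micro-precision
  "Precision computed from the globally summed tp and fp counts."
  [counts]
  (precision (apply merge-with + (vals counts))))

(def truth [:a :a :a :a :b :b])
(def preds [:a :a :a :b :b :a])
(def counts (per-class-counts truth preds))

(macro-precision counts) ; (0.75 + 0.5) / 2 = 0.625
(micro-precision counts) ; 4 / 6, roughly 0.667
```

With imbalanced classes the two averages diverge: macro treats the small class `:b` as equally important, while micro is dominated by the larger class `:a`.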

scicloj.metamorph.ml.design-matrix

Design matrix construction for machine learning pipelines.

This namespace provides utilities to transform datasets into numeric design
matrices suitable for machine learning models. It supports deriving new features,
transforming existing columns, managing target variables, and expanding complex
column types (arrays, maps).

Main Entry Point:

- `create-design-matrix`: Transform a dataset into a design matrix with custom specs

Design Matrix Specification Syntax:

Column specifications use [column-name transformation] pairs where:

- Transformations are Clojure expressions (quoted with ')
- Expressions can reference column names directly as symbols
- Expressions are evaluated in order and can chain
- Non-listed columns are removed from the output
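
The rules above (quoted expressions, column names as symbols, in-order evaluation with chaining, unlisted columns dropped) can be mimicked by a toy interpreter. This sketch is not `create-design-matrix`: it works on a single row map instead of dataset columns, `apply-specs` is a hypothetical helper, and it relies on `eval` purely for brevity.

```clojure
;; Illustrative only: a row-wise toy, not the create-design-matrix implementation.

(defn apply-specs
  "Evaluates each [column-name expression] spec against one row, in order.
   Column names are bound as symbols, later specs can see earlier results,
   and only the listed columns are kept."
  [row specs]
  (let [step   (fn [acc [col-name expr]]
                 (let [bindings (vec (mapcat (fn [[k v]] [(symbol (name k)) v])
                                             acc))]
                   (assoc acc col-name (eval `(let ~bindings ~expr)))))
        result (reduce step row specs)]
    (select-keys result (map first specs))))

(def specs
  [[:a     'a]            ; keep :a unchanged
   [:sum   '(+ a b)]      ; derive a column from two input columns
   [:twice '(* 2 sum)]])  ; chain on the derived :sum column

(apply-specs {:a 1 :b 2 :c 99} specs)
;; => {:a 1, :sum 3, :twice 6}
```

Note that `:b` and `:c` do not appear in the result because no spec lists them.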

Shorthand Syntax:

- :column-name           Keeps column unchanged (identity function)
- [nil '(+ a b)]         Auto-generates column name for derived column
- ['(+ a b)]             Same as above

Available Aliases (no qualification needed):

- `ds`  - tech.v3.dataset
- `tc`  - tablecloth.api
- `tcc` - tablecloth.column.api
- All of clojure.core

Features:

- Derives new columns from existing data
- Expands array and map columns into separate columns
- Automatically converts categorical columns to numbers
- Sets inference target(s) for supervised learning
- Chains transformations in dependency order

Limitations:

- Does not automatically expand categorical variables (specify manually)
- Design matrix approach is more flexible but less compact than R formula syntax

See also: `fastmath.ml/lm` for linear regression with formula-based transformations

create-design-matrix

scicloj.metamorph.ml.ensemble

ensemble-pipe

scicloj.metamorph.ml.evaluation-handler

scicloj.metamorph.ml.explore

explore-all

scicloj.metamorph.ml.gridsearch

Grid search works in two steps: create a map whose values are gridsearch
definitions, then run the grid search over it, which produces a sequence of
fully defined maps.

The initial default implementation uses the Sobol sequence.
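
The idea of turning a map of search definitions into a sequence of fully defined maps can be sketched as follows. This is a deliberately simplified stand-in: it enumerates every combination exhaustively instead of sampling along a Sobol sequence, and `grid-def`, `expand-grid`, and the option names `:trees` and `:max-depth` are hypothetical, not part of this namespace.

```clojure
;; Illustrative only: exhaustive expansion, not the Sobol-based implementation.

(defn grid-def
  "Marks a set of candidate values as a search dimension (hypothetical helper)."
  [& values]
  {::candidates (vec values)})

(defn expand-grid
  "Expands a map whose values may be grid definitions into a sequence of
   fully defined maps, one per combination of candidate values."
  [m]
  (reduce (fn [maps [k v]]
            (if (and (map? v) (contains? v ::candidates))
              (for [partial maps, c (::candidates v)] (assoc partial k c))
              (map #(assoc % k v) maps)))
          [{}]
          m))

(def option-maps
  (expand-grid {:model-type :metamorph.ml/random-forest ; fixed value
                :trees      (grid-def 10 100)           ; hypothetical option
                :max-depth  (grid-def 3 5 7)}))         ; hypothetical option

(count option-maps)
;; => 6
```

Each resulting map is fully defined, so it could be used directly as a set of model options.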

scicloj.metamorph.ml.learning-curve

learning-curve

scicloj.metamorph.ml.loss

DEPRECATED: Simple loss functions.

scicloj.metamorph.ml.metrics

DEPRECATED: Excellent metrics tools from the cortex project.

scicloj.metamorph.ml.preprocessing

Feature scaling and normalization transformers for metamorph pipelines.

This namespace provides metamorph-compatible transformers for standardizing and
normalizing numeric features. These preprocessing steps are essential for many
machine learning algorithms to perform well.

Available Transformers:

- `std-scale`: Standardization (z-score normalization)
- `min-max-scale`: Min-max scaling to a specified range

Standard Scaling (`std-scale`):
Centers each numeric column (subtracts the mean) and/or scales it by its standard
deviation. With both options enabled, the result has zero mean and unit variance. Useful for:

- Algorithms sensitive to feature magnitude (SVMs, neural networks, KNN)
- Distance-based models

Options:

- `:mean?` (default true): Center by subtracting column mean
- `:stddev?` (default true): Scale by standard deviation

Min-Max Scaling (min-max-scale):

Rescales each numeric column to a specified range (default [-0.5, 0.5]).
Options:

- `:min` (default -0.5): Target minimum value
- `:max` (default 0.5): Target maximum value

Metamorph Integration:
Both transformers follow the metamorph pipeline pattern:

- `:fit` mode: Learn scaling parameters from training data
- `:transform` mode: Apply learned parameters to new data
- Stores transformation parameters in context under their assigned `:metamorph/id`

scicloj.metamorph.ml.pretty

pretty

scicloj.metamorph.ml.r

scicloj.metamorph.ml.r-model-matrix

R-style formula-based feature engineering and linear regression.

This namespace provides tools to leverage R's powerful formula syntax for
feature engineering and linear modeling within Clojure. R formulas enable
expressive specification of interactions, transformations, and categorical
expansions without manual column manipulation.

Key Functions:

- `r-model-matrix`: Convert dataset + R formula to design matrix
- `lm`: Simplified linear regression using R formulas

Implementation Backends:
The namespace supports multiple R execution backends:

- `:ocpu`    Remote R via OpenCPU (cloud.opencpu.org) - no local R needed
- `:renjin` Java-based R implementation (https://renjin.org/)
- `:clojisr` Local R via clojisr (requires R installation)

Model Matrix Capabilities:
R formulas handle:

- Basic features: `y ~ x1 + x2`
- Interactions: `y ~ x1 * x2` (expands to x1 + x2 + x1:x2)
- Polynomial terms: `y ~ x + I(x^2)`
- Categorical encoding: Automatic dummy variable creation
- Intercept control: `y ~ x - 1` (remove intercept)
- Exclusions: `y ~ . - x3` (all columns except x3)
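
As a rough picture of what the interaction expansion means for two numeric predictors, the sketch below builds the columns `x1`, `x2`, and `x1:x2` by hand in plain Clojure. It does not call R or `r-model-matrix`; `expand-interaction` is a hypothetical helper, rows are plain maps with string column names, and categorical predictors (which R expands into dummy variables) are not covered.

```clojure
;; Illustrative only: numeric predictors, no R backend involved.

(defn expand-interaction
  "Expands the right-hand side `x1 * x2` for numeric columns:
   keeps x1 and x2 and adds their product as the x1:x2 term."
  [rows x1 x2]
  (mapv (fn [row]
          (assoc (select-keys row [x1 x2])
                 (str x1 ":" x2)
                 (* (get row x1) (get row x2))))
        rows))

(def features
  (expand-interaction [{"y" 1.0 "x1" 2 "x2" 3}
                       {"y" 2.0 "x1" 4 "x2" 5}]
                      "x1" "x2"))
;; => [{"x1" 2, "x2" 3, "x1:x2" 6} {"x1" 4, "x2" 5, "x1:x2" 20}]
```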

Linear Regression (lm):
Combines formula-based feature engineering with OLS regression training.
Returns a ready-to-use trained model for predictions.


Notes:

- OpenCPU backend is convenient but requires internet connectivity
- Renjin is standalone but may have some R incompatibilities
- clojisr requires a local R installation but offers full R compatibility
- Returned model matrices exclude row names and intercept columns by default

See also: [[scicloj.metamorph.ml.design-matrix]] for Clojure-native feature engineering

scicloj.metamorph.ml.random-forest

Optimized pure-Clojure random forest implementation for classification and regression.
Use it by specifying

`:model-type :metamorph.ml/random-forest`

No vars found in this namespace.

scicloj.metamorph.ml.rdatasets

scicloj.metamorph.ml.regression

Regression models for continuous target prediction.

This namespace provides implementations of various regression algorithms with
a consistent metamorph.ml training and prediction interface. Models support
statistical output formats (tidy, glance, augment) for analysis and diagnostics.

Available Models:

**OLS (Ordinary Least Squares)**

- `:metamorph.ml/ols`: Apache Commons Math implementation (Java-based)
- `:fastmath/ols`: FastMath implementation (pure Clojure)

Solves for regression coefficients β in: y = Xβ + ε
Assumes linear relationships and homoscedastic errors.

**GLM (Generalized Linear Model)**

- `:fastmath/glm`: FastMath GLM implementation

Extends linear regression to non-normal distributions and non-linear relationships
via link functions and variance models.

**Baseline Model**

- `:metamorph.ml/dummy-regressor`: Predicts mean of training target

Useful sanity check - models should outperform this baseline.

Model Output Functions:

- **:tidy-fn**: Extracts model coefficients with statistics
  Returns dataset with :term, :estimate, :std.error, :statistic, :p.value
- **:glance-fn**: Extracts model-level diagnostics
  Returns dataset with :r.squared, :adj.r.squared, :rss, :aic, :bic, etc.
- **:augment-fn**: Adds model predictions and residuals to data
  Returns augmented dataset with :.fitted and :.resid columns

Example Usage (in metamorph pipeline):

```
(ml/train
  data
  {:model-type :fastmath/ols})
```

Model Diagnostics:

```
(ml/glance model)        ; Overall model metrics
(ml/tidy model)          ; Coefficient table
(ml/augment model data)  ; Predicted values and residuals
```

See also: [[scicloj.metamorph.ml.r-model-matrix]] for R-formula-based feature engineering
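
As a worked illustration of y = Xβ + ε and of the mean baseline, the sketch below fits a one-predictor least-squares line in closed form and compares its residual sum of squares with that of always predicting the mean. It is plain Clojure and does not call any of the registered models; `fit-simple-ols` and `rss` are hypothetical helpers.

```clojure
;; Illustrative only: closed-form OLS for a single predictor.

(defn mean [xs] (/ (reduce + xs) (double (count xs))))

(defn fit-simple-ols
  "Least-squares intercept and slope for one predictor:
   slope = Sxy / Sxx, intercept = mean(y) - slope * mean(x)."
  [xs ys]
  (let [mx    (mean xs)
        my    (mean ys)
        sxy   (reduce + (map #(* (- %1 mx) (- %2 my)) xs ys))
        sxx   (reduce + (map #(* (- % mx) (- % mx)) xs))
        slope (/ sxy sxx)]
    {:intercept (- my (* slope mx)) :slope slope}))

(defn rss
  "Residual sum of squares between predictions and observed values."
  [preds ys]
  (reduce + (map #(let [d (- %1 %2)] (* d d)) preds ys)))

(def xs [1 2 3 4])
(def ys [3 5 7 9])

(def model (fit-simple-ols xs ys))
;; => {:intercept 1.0, :slope 2.0}

(def ols-rss
  (rss (map #(+ (:intercept model) (* (:slope model) %)) xs) ys))
;; => 0.0

;; The baseline predicts the training mean for every observation.
(def baseline-rss (rss (repeat (mean ys)) ys))
;; => 20.0
```

A fitted model is only interesting if its error is clearly below the baseline's, which is the comparison a dummy regressor makes possible.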

scicloj.metamorph.ml.text


scicloj.metamorph.ml.tidy-models

Model output standardization and validation following tidymodels conventions.

This namespace implements the tidymodels philosophy (inspired by R's tidymodels/broom packages) for standardized, machine-readable model outputs. All model outputs conform to consistent schemas defined in canonical column specification files.

Three Core Output Functions:

glance: One-row model summary

High-level goodness-of-fit statistics
Examples: R², AIC, BIC, log-likelihood, F-statistic, p-value
Use case: Quick model performance overview

tidy: One-row-per-component output

Component-level details (e.g., one row per coefficient)
Examples: term, estimate, std.error, statistic, p.value
Use case: Detailed model inspection and reporting

augment: One-row-per-observation output

Adds model predictions/residuals to original data
Original columns plus: .fitted, .resid, .hat, .sigma, .cooksd
Use case: Diagnostics and visualization of predictions
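
To make the three output shapes concrete, here is a minimal sketch for a one-variable least-squares fit, using plain Clojure maps in place of datasets. Everything in it is an assumption for illustration: `fit-line`, `tidy-rows`, `glance-row` and `augment-rows` are hypothetical helpers, not the metamorph.ml functions, and the column keywords are only modeled on the broom-style names listed above.

```clojure
;; Sketch only: hypothetical helpers, plain maps stand in for datasets.
(defn fit-line
  "Ordinary least squares for y = intercept + slope * x."
  [xs ys]
  (let [n     (count xs)
        mx    (/ (reduce + xs) n)
        my    (/ (reduce + ys) n)
        sxy   (reduce + (map (fn [x y] (* (- x mx) (- y my))) xs ys))
        sxx   (reduce + (map (fn [x] (* (- x mx) (- x mx))) xs))
        slope (/ sxy sxx)]
    {:intercept (- my (* slope mx)) :slope slope}))

(defn tidy-rows
  "One row per model component (here: one per coefficient)."
  [{:keys [intercept slope]}]
  [{:term "(Intercept)" :estimate intercept}
   {:term "x" :estimate slope}])

(defn augment-rows
  "One row per observation: original values plus fitted value and residual."
  [{:keys [intercept slope]} xs ys]
  (mapv (fn [x y]
          (let [fitted (+ intercept (* slope x))]
            {:x x :y y :.fitted fitted :.resid (- y fitted)}))
        xs ys))

(defn glance-row
  "One row for the whole model: R squared and number of observations."
  [model xs ys]
  (let [n      (count ys)
        my     (/ (reduce + ys) n)
        ss-res (reduce + (map #(let [r (:.resid %)] (* r r))
                              (augment-rows model xs ys)))
        ss-tot (reduce + (map #(* (- % my) (- % my)) ys))]
    [{:r.squared (- 1 (/ ss-res ss-tot)) :nobs n}]))

;; Example: the same fitted model viewed at three levels of detail.
(let [xs [1 2 3 4]
      ys [3 5 7 9]
      m  (fit-line xs ys)]
  {:glance  (glance-row m xs ys)
   :tidy    (tidy-rows m)
   :augment (augment-rows m xs ys)})
```

The point of the split is that each function answers a different question about the same model: how good the fit is overall, what its components are, and how it behaves on each observation.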

Validation and Schema Management:

allowed-tidy-columns: Canonical list of valid tidy column names
allowed-glance-columns: Canonical list of valid glance column names
allowed-augment-columns: Canonical list of valid augment column names
validate-tidy-ds: Validates dataset conforms to tidy standard
validate-glance-ds: Validates dataset conforms to glance standard
validate-augment-ds: Validates dataset conforms to augment standard

Schemas are maintained in the project's GitHub repository (resources/*.edn):

columms-tidy.edn
columms-glance.edn
columms-augment.edn

Control Validation: The *validate-tidy-fns* dynamic variable controls strict validation:

true (default): Throws an exception on invalid columns
false: Silently allows any columns
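
The interplay between an allowed-column list and a dynamic validation switch can be sketched as follows. This is not the library's implementation: `*validate-columns*`, `allowed-tidy-columns-example` and `validate-columns` are hypothetical names, the allowed set is a made-up subset, and rows are plain maps instead of datasets.

```clojure
;; Sketch only: all names and the allowed set are hypothetical stand-ins.
(require '[clojure.set :as set])

;; Dynamic switch, analogous in spirit to a strict-validation flag.
(def ^:dynamic *validate-columns* true)

(def allowed-tidy-columns-example
  #{:term :estimate :std.error :statistic :p.value})

(defn validate-columns
  "Returns rows unchanged when every key is in `allowed`. Throws when
  unknown keys are present and validation is switched on."
  [allowed rows]
  (let [unknown (set/difference (set (mapcat keys rows)) allowed)]
    (when (and *validate-columns* (seq unknown))
      (throw (ex-info "Invalid columns" {:invalid unknown})))
    rows))

;; Strict by default: rows with only known columns pass through.
(validate-columns allowed-tidy-columns-example
                  [{:term "x" :estimate 2}])

;; Validation relaxed for one dynamic scope: unknown columns are accepted.
(binding [*validate-columns* false]
  (validate-columns allowed-tidy-columns-example
                    [{:term "x" :my-custom-column 1}]))
```

Using `binding` keeps the relaxed behaviour local to one call stack, so other callers still get strict validation.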

Integration: Model implementations use these validators in their tidy-fn/glance-fn/augment-fn to ensure outputs conform to standardized schemas for consistency across models.

See also: scicloj.metamorph.ml for training and prediction; scicloj.metamorph.ml.regression and scicloj.metamorph.ml.classification for specific model implementations.


scicloj.metamorph.ml.toydata.ggplot

Deprecated namespace. Use scicloj.metamorph.ml.rdatasets instead.
