Core machine learning framework integrating metamorph pipelines with standardized model APIs.
This is the central namespace of metamorph.ml, providing infrastructure for:
- Registering and using machine learning models
- Training models and making predictions
- Evaluating pipelines via cross-validation
- Standardized model diagnostics (glance, tidy, augment)
- Optional caching of computationally expensive operations
Key Concepts:
**Model Registration**: Models are registered using `define-model!` and can be
referenced by keyword (e.g., `:fastmath/ols`, `:metamorph.ml/dummy-classifier`).
Models define a train-fn, predict-fn, and optional diagnostic functions.
**Training and Prediction**:
- `train`: Train a model on a dataset given options including :model-type
- `predict`: Make predictions using a trained model
- `train-predict-cache`: Optional cache to avoid redundant computations
**Pipeline Evaluation**:
- `evaluate-pipelines`: Evaluate multiple pipelines across train/test splits
- `evaluate-one-pipeline`: Evaluate a single pipeline with cross-validation
- Returns results sorted by metric performance with optional filtering
- Supports parallel evaluation (:map/:pmap/:ppmap)
**Model Diagnostics** (following tidymodels conventions):
- `glance`: One-row model summary (goodness-of-fit)
- `tidy`: One-row-per-component output (coefficients with statistics)
- `augment`: One-row-per-observation output (predictions, residuals)
Main API Functions:
- `define-model!`: Register a new model type with train/predict/diagnostic functions
- `train`: Train a model with a specified model-type
- `predict`: Generate predictions from a trained model
- `evaluate-pipelines`: Evaluate pipelines with cross-validation
- `glance`: Get model summary statistics
- `tidy`: Extract coefficient-level results
- `augment`: Add predictions and residuals to data
Pipeline Integration:
Models integrate with metamorph pipelines via the `model` step, which:
- Trains in :fit mode using training data
- Predicts in :transform mode on new data
- Stores model output column metadata for later evaluation
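The fit/transform behaviour described above can be sketched in plain Clojure. This is only an illustration of the convention, not the library's `model` step: `mean-baseline-step` is a hypothetical function, the data is a vector of maps rather than a dataset, and the context keys `:metamorph/mode`, `:metamorph/data` and `:metamorph/id` are assumed to follow the usual metamorph context layout.

```clojure
;; Hypothetical step, not part of metamorph.ml: a mean-baseline "model"
;; that follows the fit/transform convention on a plain context map.
(defn mean-baseline-step
  "Returns a step function. `target` is the key holding the target values."
  [target]
  (fn [{:metamorph/keys [data mode id] :as ctx}]
    (case mode
      ;; :fit learns from the training data and stores the result under the step id
      :fit (let [ys (map target data)]
             (assoc ctx id {:mean (/ (reduce + ys) (double (count ys)))}))
      ;; :transform reuses what :fit stored and adds a prediction to each row
      :transform (let [m (get-in ctx [id :mean])]
                   (assoc ctx :metamorph/data
                          (mapv #(assoc % :prediction m) data))))))

(def step (mean-baseline-step :y))

;; Fit on training rows
(def fitted
  (step {:metamorph/mode :fit
         :metamorph/id   :baseline
         :metamorph/data [{:y 1.0} {:y 2.0} {:y 6.0}]}))

;; Transform new rows with the fitted context
(def predicted
  (step (assoc fitted
               :metamorph/mode :transform
               :metamorph/data [{:y 10.0}])))
```

The point of the design is that the same function is run twice: the fitted context carries the learned state, so the transform pass never looks at the training data again.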
Example Usage:
    ;; Register a custom model (rarely needed - use existing models)
    (define-model! :my/custom-model train-fn predict-fn {...})

    ;; Train a model
    (let [model (train iris-data {:model-type :fastmath/ols
                                  :target-columns [:Sepal.Width]
                                  :feature-columns [:Sepal.Length]})]
      ;; Get diagnostics
      (glance model)
      (tidy model)
      ;; Make predictions
      (predict iris-data model))

    ;; Evaluate multiple pipelines in cross-validation
    (evaluate-pipelines
     [pipeline1 pipeline2]
     train-test-splits
     metric-fn
     :accuracy
     {:map-fn :pmap})
Built-in Models:
**Regression**:
- `:metamorph.ml/ols`: Apache Commons Math OLS
- `:fastmath/ols`: FastMath OLS
- `:fastmath/glm`: FastMath GLM
- `:metamorph.ml/dummy-regressor`: Mean baseline
**Classification**:
- `:metamorph.ml/dummy-classifier`: Majority class or random baseline
**Preprocessing**:
See specific namespaces for transformers:
- `scicloj.metamorph.ml.preprocessing`: Scaling and normalization
- `scicloj.metamorph.ml.categorical`: One-hot encoding
- `scicloj.metamorph.ml.r-model-matrix`: R formula features
See also: `scicloj.metamorph.core` for metamorph pipeline mechanics,
`scicloj.metamorph.ml.tidy-models` for diagnostic validation
Caching infrastructure for metamorph.ml train/predict operations.
This namespace provides flexible caching backends to store and retrieve results
of machine learning training and prediction operations. This is useful for
avoiding redundant computations when working with the same models and data.
Supported cache backends:
- **Atom cache**: In-memory caching using a Clojure atom (fast, ephemeral)
- **Disk cache**: File-based caching using Nippy serialization (persistent)
- **Redis cache**: Distributed caching via Redis (requires carmine library)
Usage:
    (enable-atom-cache! (atom {}))        ; Enable in-memory caching
    ;; or
    (enable-disk-cache! "/tmp/ml-cache")  ; Enable disk-based caching
    ;; or
    (enable-redis-cache! {...})           ; Enable Redis caching
To disable caching:
    (disable-cache!)
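To make the idea behind the atom backend concrete, here is a minimal sketch of caching a function call in a Clojure atom. It is not the library's implementation: `cached-call` and `slow-train` are hypothetical names, and the cache key is simply the argument list.

```clojure
;; Hypothetical helper, not the library's implementation.
(defn cached-call
  "Calls (f & args) once per distinct args, keeping results in `cache`,
  an atom holding a map from argument lists to results."
  [cache f & args]
  (if-let [hit (find @cache args)]
    (val hit)
    (let [result (apply f args)]
      (swap! cache assoc args result)
      result)))

(def cache (atom {}))
(def calls (atom 0))

(defn slow-train
  "Stand-in for an expensive training function; counts its invocations."
  [data]
  (swap! calls inc)
  (reduce + data))

(cached-call cache slow-train [1 2 3]) ; computes and stores the result
(cached-call cache slow-train [1 2 3]) ; served from the cache
```

Because the cache lives in an ordinary atom, it disappears with the process, which matches the "fast, ephemeral" description of the atom backend.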
See individual function docs for more details on each backend.
Categorical feature encoding for machine learning pipelines.
This namespace provides metamorph transformers for handling categorical variables
commonly used in supervised learning. Currently focuses on one-hot encoding,
which converts categorical values into binary indicator columns.
One-hot encoding is essential for:
- Preparing categorical features for algorithms that expect numeric inputs
- Preventing ordinal assumptions on nominal categories
- Creating interpretable model features
Main API:
- `transform-one-hot`: The primary metamorph transformer for one-hot encoding
Encoding strategies:
- `:full`: Uses a predefined level set from full dataset context
- `:fit`: Levels discovered during :fit used in :transform
- `:independent`: Each mode independently determines and encodes levels
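As a rough illustration of the `:fit` strategy, the sketch below learns the levels from training values and reuses them when encoding new values. The functions `fit-levels` and `one-hot` are hypothetical and work on plain vectors, not datasets; how `transform-one-hot` treats values unseen during fit is not stated here, so the all-zero row is only this sketch's behaviour.

```clojure
;; Hypothetical helpers illustrating the idea, not the library's API.
(defn fit-levels
  "Collects the sorted distinct values seen in the training column."
  [values]
  (vec (sort (distinct values))))

(defn one-hot
  "Encodes each value as a map of level -> 0/1 indicator, using fixed `levels`."
  [levels values]
  (mapv (fn [v]
          (into {} (map (fn [level] [level (if (= level v) 1 0)]) levels)))
        values))

;; Levels are learned once from the training data ...
(def levels (fit-levels ["red" "green" "red" "blue"]))

;; ... and reused for new data; "yellow" was never seen, so every indicator is 0
(def encoded (one-hot levels ["green" "yellow"]))
```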
Classification models and evaluation metrics for metamorph.ml.
This namespace provides tools for classification tasks including:
- Confusion matrix generation and analysis
- Baseline classifier implementations
- Classification evaluation utilities
Key features:
- `confusion-map`: Creates confusion matrices from predictions and true labels
- `confusion-map->ds`: Converts confusion matrices to tabular dataset format
- `:metamorph.ml/dummy-classifier`: A baseline classifier for sanity checks
Dummy Classifier Strategies:
- `:majority-class` (default): Always predicts the most frequent class
- `:fixed-class`: Predicts a specified class
- `:random-class`: Predicts randomly from the observed classes
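The `:majority-class` strategy amounts to very little code. The following sketch uses hypothetical helper names and plain label vectors, and is not the registered model itself.

```clojure
;; Hypothetical helpers, not the library's implementation.
(defn majority-class
  "Most frequent label among the training labels."
  [labels]
  (key (apply max-key val (frequencies labels))))

(defn predict-majority
  "Predicts the same label for each of `n` rows."
  [train-labels n]
  (vec (repeat n (majority-class train-labels))))

(def predictions (predict-majority [:spam :ham :ham :ham :spam] 3))
```

Any real classifier should beat the accuracy this baseline achieves on the same split.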
Confusion Matrix Normalization:
- `:all` (default): Row-wise normalization (recall perspective)
- `:none`: Raw counts
Example usage:
    (let [predicted [0 1 0 1 1]
          actual    [0 0 1 1 1]
          conf-map  (confusion-map predicted actual :none)]
      (confusion-map->ds conf-map))
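For intuition about raw counts versus row-wise normalization, here is a small sketch in plain Clojure. The nested-map layout (true label, then predicted label) and the function names are assumptions made for illustration; the structure returned by `confusion-map` may differ.

```clojure
;; Hypothetical helpers; the library's data layout may differ.
(defn confusion-counts
  "Nested map {true-label {predicted-label count}}."
  [predicted actual]
  (reduce (fn [m [p a]] (update-in m [a p] (fnil inc 0)))
          {}
          (map vector predicted actual)))

(defn normalize-rows
  "Divides every count by the total of its row, i.e. of its true label."
  [counts]
  (into {}
        (map (fn [[label row]]
               (let [total (double (reduce + (vals row)))]
                 [label (into {} (map (fn [[p n]] [p (/ n total)]) row))])))
        counts))

(def counts (confusion-counts [0 1 0 1 1] [0 0 1 1 1]))
(def normalized (normalize-rows counts))
```

After row-wise normalization the diagonal entry of each row is the recall for that class, which is why this view is described as the recall perspective.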
See also: `scicloj.metamorph.ml/define-model!`, `scicloj.metamorph.ml.viz/confusion-matrix`
Model evaluation metrics for classification and regression tasks.
This namespace provides functions to compute standard machine learning metrics
on model predictions vs. ground truth labels, with support for both binary and
multiclass classification as well as regression tasks.
Key Functions:
- `classification-metric`: Evaluate classification model predictions
- `regression-metric`: Evaluate regression model predictions
Classification Metrics (from fastmath.stats):
Supports binary and multiclass metrics including accuracy, precision, recall,
F1-score, and more. Multiclass metrics can be averaged using:
- `:macro` - Unweighted mean of per-class metrics
- `:micro` - Aggregated true/false positives globally
Also supports `:roc-auc` for multiclass AUC scoring.
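The difference between the two averaging modes is easiest to see on numbers. The sketch below computes precision both ways from made-up per-class counts; the function names and the data are hypothetical, and the library itself delegates to fastmath.stats.

```clojure
(defn precision
  "tp / (tp + fp), or 0.0 when nothing was predicted for the class."
  [tp fp]
  (if (zero? (+ tp fp)) 0.0 (/ tp (double (+ tp fp)))))

;; Hypothetical per-class counts: true positives and false positives
(def per-class
  {:cat  {:tp 8 :fp 2}
   :dog  {:tp 1 :fp 1}
   :bird {:tp 1 :fp 3}})

(defn macro-precision
  "Unweighted mean of the per-class precisions."
  [counts]
  (let [ps (map (fn [{:keys [tp fp]}] (precision tp fp)) (vals counts))]
    (/ (reduce + ps) (count ps))))

(defn micro-precision
  "Precision computed from the globally summed counts."
  [counts]
  (precision (reduce + (map :tp (vals counts)))
             (reduce + (map :fp (vals counts)))))
```

With these counts the per-class precisions are 0.8, 0.5 and 0.25, so the macro average is about 0.517, while the micro average is 10/16 = 0.625 because the large `:cat` class dominates the pooled counts.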
Regression Metrics (from fastmath.stats):
Distance and similarity metrics such as MAE, MSE, RMSE, R², etc.
Data Format:
- Input datasets must be tech.ml.dataset (TMD) format
- Must have appropriate column metadata (:prediction, :target, etc.)
- Support categorical mappings via :categorical-map metadata
- Missing values and NaNs are detected and rejected appropriately
Validation:
The functions perform extensive validation including:
- Column metadata correctness
- Missing values and NaN detection
- Type and datatype uniformity
- Row count alignment between datasets
- Single-label assumption (multi-label not yet supported)
Example:
    (classification-metric y-true y-pred :f1 :macro {})
    (regression-metric y-true y-pred :mse)
See also: `fastmath.stats` documentation for available metric names
Design matrix construction for machine learning pipelines.
This namespace provides utilities to transform datasets into numeric design
matrices suitable for machine learning models. It supports deriving new features,
transforming existing columns, managing target variables, and expanding complex
column types (arrays, maps).
Main Entry Point:
- `create-design-matrix`: Transform a dataset into a design matrix with custom specs
Design Matrix Specification Syntax:
Column specifications use [column-name transformation] pairs where:
- Transformations are Clojure expressions (quoted with ')
- Expressions can reference column names directly as symbols
- Expressions are evaluated in order and can chain
- Non-listed columns are removed from the output
Shorthand Syntax:
- :column-name Keeps column unchanged (identity function)
- [nil '(+ a b)] Auto-generates column name for derived column
- ['(+ a b)] Same as above
Available Aliases (no qualification needed):
- `ds` - tech.v3.dataset
- `tc` - tablecloth.api
- `tcc` - tablecloth.column.api
- All of clojure.core
Features:
- Derives new columns from existing data
- Expands array and map columns into separate columns
- Automatically converts categorical columns to numbers
- Sets inference target(s) for supervised learning
- Chains transformations in dependency order
Example:
    (create-design-matrix
     iris-data
     [:species]                           ; target column
     [[:petal-length identity]            ; keep as-is
      [:sepal-ratio '(/ :sepal-length     ; derive new feature
                        :sepal-width)]])
Limitations:
- Does not automatically expand categorical variables (specify manually)
- For linear regression, fastmath/ols offers a :transformer option using R formulas
- Design matrix approach is more flexible but less compact than R formula syntax
See also: `fastmath.ml/lm` for linear regression with formula-based transformations
Gridsearching is defined here as creating a map whose values are gridsearch definitions and then gridsearching over it, which produces a sequence of fully defined maps.
The initial default implementation uses the Sobol sequence.
Simple loss functions.
Excellent metrics tools from the cortex project.
Feature scaling and normalization transformers for metamorph pipelines.
This namespace provides metamorph-compatible transformers for standardizing and
normalizing numeric features. These preprocessing steps are essential for many
machine learning algorithms to perform well.
Available Transformers:
- `std-scale`: Standardization (z-score normalization)
- `min-max-scale`: Min-max scaling to a specified range
StandardScaling (std-scale):
Centers each numeric column (subtract mean) and/or scales by standard deviation,
producing zero-mean unit-variance data. Useful for:
- Algorithms sensitive to feature magnitude (SVMs, neural networks, KNN)
- Distance-based models
Options:
- `:mean?` (default true): Center by subtracting column mean
- `:stddev?` (default true): Scale by standard deviation
Min-Max Scaling (min-max-scale):
Rescales each numeric column to a specified range (default [-0.5, 0.5]).
Options:
- `:min` (default -0.5): Target minimum value
- `:max` (default 0.5): Target maximum value
Metamorph Integration:
Both transformers follow the metamorph pipeline pattern:
- `:fit` mode: Learn scaling parameters from training data
- `:transform` mode: Apply learned parameters to new data
- Stores transformation parameters in context under their assigned `:metamorph/id`
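The fit-then-transform pattern for both scalers can be sketched on plain vectors. None of the functions below belong to the library. The sketch assumes the sample standard deviation (dividing by n - 1); whether `std-scale` uses the sample or population form is not stated here.

```clojure
;; Hypothetical helpers illustrating fit vs. transform for the two scalers.
(defn fit-std-scale
  "Learns mean and sample standard deviation from training values."
  [xs]
  (let [n        (count xs)
        mean     (/ (reduce + xs) (double n))
        variance (/ (reduce + (map #(let [d (- % mean)] (* d d)) xs))
                    (dec n))]
    {:mean mean :stddev (Math/sqrt variance)}))

(defn apply-std-scale
  "Applies previously learned parameters to (possibly new) values."
  [{:keys [mean stddev]} xs]
  (mapv #(/ (- % mean) stddev) xs))

(defn fit-min-max
  "Learns the observed minimum and maximum from training values."
  [xs]
  {:lo (apply min xs) :hi (apply max xs)})

(defn apply-min-max
  "Maps the range [lo, hi] learned at fit time onto [target-min, target-max]."
  [{:keys [lo hi]} target-min target-max xs]
  (mapv #(+ target-min
            (* (- target-max target-min)
               (/ (- % lo) (double (- hi lo)))))
        xs))

(def train [1.0 2.0 3.0])
(def std-params (fit-std-scale train))
(def mm-params  (fit-min-max train))
```

Note that new data can fall outside the target range: with the parameters above, `(apply-min-max mm-params -0.5 0.5 [4.0])` yields a value above 0.5, because the range was learned from the training data only.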
Example Usage (in metamorph pipeline):
    (preprocessing/std-scale [:age :income] {:mean? true :stddev? true})
R-style formula-based feature engineering and linear regression.
This namespace provides tools to leverage R's powerful formula syntax for
feature engineering and linear modeling within Clojure. R formulas enable
expressive specification of interactions, transformations, and categorical
expansions without manual column manipulation.
Key Functions:
- `r-model-matrix`: Convert dataset + R formula to design matrix
- `lm`: Simplified linear regression using R formulas
Implementation Backends:
The namespace supports multiple R execution backends:
- `:ocpu` Remote R via OpenCPU (cloud.opencpu.org) - no local R needed
- `:renjine` Java-based R implementation (https://renjin.org/)
- `:clojisr` Local R via clojisr (requires R installation)
Model Matrix Capabilities:
R formulas handle:
- Basic features: `y ~ x1 + x2`
- Interactions: `y ~ x1 * x2` (expands to x1 + x2 + x1:x2)
- Polynomial terms: `y ~ x + I(x^2)`
- Categorical encoding: Automatic dummy variable creation
- Intercept control: `y ~ x - 1` (remove intercept)
- Exclusions: `y ~ . - x3` (all columns except x3)
Linear Regression (lm):
Combines formula-based feature engineering with OLS regression training.
Returns a ready-to-use trained model for predictions.
Example Usage:
    (r-model-matrix iris-data "~ Sepal.Length + Sepal.Width" :renjine)
    (lm iris-data "Sepal.Width ~ Sepal.Length * Petal.Length"
        :Sepal.Width :ocpu)
Notes:
- OpenCPU backend is convenient but requires internet connectivity
- Renjin is standalone but may have some R incompatibilities
- clojisr requires a local R installation but offers full R compatibility
- Returned model matrices exclude row names and intercept columns by default
See also: `scicloj.metamorph.ml.design-matrix` for Clojure-native feature engineering
Regression models for continuous target prediction.
This namespace provides implementations of various regression algorithms with
a consistent metamorph.ml training and prediction interface. Models support
statistical output formats (tidy, glance, augment) for analysis and diagnostics.
Available Models:
**OLS (Ordinary Least Squares)**
- `:metamorph.ml/ols`: Apache Commons Math implementation (Java-based)
- `:fastmath/ols`: FastMath implementation (pure Clojure)
Solves for regression coefficients β in: y = Xβ + ε
Assumes linear relationships and homoscedastic errors.
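For a single feature, the least-squares solution has a simple closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means. The sketch below implements only that special case with a hypothetical `simple-ols` function; the registered models solve the general multi-feature problem.

```clojure
;; Hypothetical single-feature illustration, not one of the registered models.
(defn simple-ols
  "Fits y = intercept + slope * x by least squares."
  [xs ys]
  (let [n     (double (count xs))
        mx    (/ (reduce + xs) n)
        my    (/ (reduce + ys) n)
        sxy   (reduce + (map (fn [x y] (* (- x mx) (- y my))) xs ys))
        sxx   (reduce + (map (fn [x] (let [d (- x mx)] (* d d))) xs))
        slope (/ sxy sxx)]
    {:intercept (- my (* slope mx)) :slope slope}))

;; Points generated from y = 2x + 1 without noise
(def fit (simple-ols [1.0 2.0 3.0 4.0] [3.0 5.0 7.0 9.0]))
```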
**GLM (Generalized Linear Model)**
- `:fastmath/glm`: FastMath GLM implementation
Extends linear regression to non-normal distributions and non-linear relationships
via link functions and variance models.
**Baseline Model**
- `:metamorph.ml/dummy-regressor`: Predicts mean of training target
Useful sanity check - models should outperform this baseline.
Model Output Functions:
- **:tidy-fn**: Extracts model coefficients with statistics
Returns dataset with :term, :estimate, :std.error, :statistic, :p.value
- **:glance-fn**: Extracts model-level diagnostics
Returns dataset with :r.squared, :adj.r.squared, :rss, :aic, :bic, etc.
- **:augment-fn**: Adds model predictions and residuals to data
Returns augmented dataset with :.fitted and :.resid columns
Example Usage (in metamorph pipeline):
    (ml/train
     data
     {:model-type :fastmath/ols
      :target-columns [:price]
      :feature-columns [:sqft :bedrooms]})
Model Diagnostics:
    (glance model)      ; Overall model metrics
    (tidy model)        ; Coefficient table
    (augment model ds)  ; Predicted values and residuals
See also: `scicloj.metamorph.ml.r-model-matrix` for formula-based feature engineering
Large-scale text processing and TF-IDF feature engineering for NLP pipelines.
This namespace provides efficient tools for converting raw text documents into
machine learning-ready features using TF-IDF (Term Frequency-Inverse Document
Frequency) scoring. Designed to handle large text corpora with flexible memory
management strategies.
Core Functions:
**->tidy-text**
Parses text files or datasets into tidy-text format (one token per row).
Line-by-line processing enables handling of files larger than available RAM.
Supports custom tokenization and metadata extraction.
Output format: tech.v3.dataset with columns:
- :document (int): Document/line identifier
- :token-idx (int): Token as indexed integer (maps to lookup table)
- :token-pos (int): Position of token within document
- :meta (optional): Arbitrary metadata from line-split-fn
**->tfidf**
Transforms tidy-text into TF-IDF vector representation for bag-of-words models.
Calculates term frequency (TF) and inverse document frequency (IDF) for each token.
Output columns:
- :document, :token-idx, :token-count, :tf, :tfidf
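To show what the output columns mean, here is a small in-memory sketch of a TF-IDF computation. It assumes one common definition (tf = token count divided by document length, idf = ln(N / document frequency)); the weighting actually used by `->tfidf` may differ. It also keeps tokens as strings instead of the integer `:token-idx`, and the function name `tfidf` is hypothetical.

```clojure
;; Hypothetical in-memory illustration; ->tfidf works on tidy-text datasets.
(defn tfidf
  "`docs` is a vector of token vectors. Returns one row per (document, token)."
  [docs]
  (let [n  (count docs)
        ;; document frequency: in how many documents each token appears
        df (frequencies (mapcat distinct docs))]
    (for [[doc-id tokens] (map-indexed vector docs)
          [token cnt]     (frequencies tokens)
          :let [tf  (/ cnt (double (count tokens)))
                idf (Math/log (/ n (double (df token))))]]
      {:document doc-id :token token :token-count cnt :tf tf :tfidf (* tf idf)})))

(def rows (tfidf [["a" "b" "a"] ["b" "c"]]))
```

In this example the token "b" occurs in every document, so its idf is ln(1) = 0 and its TF-IDF score is 0.0 in both documents, while "a" and "c" receive positive scores.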
Memory Optimization:
The namespace provides flexible memory control for large texts via options:
Container Types:
- `:jvm-heap` (default): Java heap storage (fast, limited by heap)
- `:native-heap`: Off-heap native memory via tech.v3
- `:mmap`: Memory-mapped files (disk-backed, bypasses heap limits)
Processing Options:
- `container-type`: Storage for intermediate results during processing
- `column-container-type`: Storage for final output dataset
- `combine-method`: `:coalesce-blocks!` or `:concat-buffers` (tradeoffs)
- `compacting-document-interval`: Batch size for consolidating data
- `datatype-document/token-pos/idx`: Memory datatype selection (:int16 vs :int32)
Token Management:
- `token->index-map`: Custom token lookup table (can reuse across runs)
- `new-token-behaviour`: `:store` (default), `:fail`, or `:as-unknown`
Performance Characteristics:
- Typical text requires ~1.5x the original file size in RAM
- An 8 GB text file typically needs ≥12 GB total memory
- Scaling strategy: Use off-heap or mmap for large corpora
Example Usage:
    ;; Load and tokenize text file
    ;; (assumes clojure.java.io is aliased as io and clojure.string as str)
    (let [reader (io/reader "corpus.txt")
          result (->tidy-text
                  reader
                  #(line-seq %)
                  #(str/split % #"\t" 2)   ; tab-separated: text, meta
                  #(str/split % #"\s+")    ; whitespace tokenization
                  :max-lines 100000)]
      ;; Extract tidy text datasets
      (doseq [ds (:datasets result)]
        ;; Convert to TF-IDF vectors
        (->tfidf ds
                 :container-type :native-heap
                 :column-container-type :jvm-heap)))
Typical Workflow:
1. Use ->tidy-text to create tidy text format from raw documents
2. Use ->tfidf to create TF-IDF feature vectors
3. Pass vectors to classification/regression models
See also: `scicloj.metamorph.ml.column-metric` for evaluation,
`scicloj.metamorph.ml` for model training
Model output standardization and validation following tidymodels conventions.
This namespace implements the tidymodels philosophy (inspired by R's tidymodels/broom
packages) for standardized, machine-readable model outputs. All model outputs
conform to consistent schemas defined in canonical column specification files.
Three Core Output Functions:
**glance**: One-row model summary
- High-level goodness-of-fit statistics
- Examples: R², AIC, BIC, log-likelihood, F-statistic, p-value
- Use case: Quick model performance overview
**tidy**: One-row-per-component output
- Component-level details (e.g., one row per coefficient)
- Examples: term, estimate, std.error, statistic, p.value
- Use case: Detailed model inspection and reporting
**augment**: One-row-per-observation output
- Adds model predictions/residuals to original data
- Original columns plus: .fitted, .resid, .hat, .sigma, .cooksd
- Use case: Diagnostics and visualization of predictions
Validation and Schema Management:
- `allowed-tidy-columns`: Canonical list of valid tidy column names
- `allowed-glance-columns`: Canonical list of valid glance column names
- `allowed-augment-columns`: Canonical list of valid augment column names
- `validate-tidy-ds`: Validates dataset conforms to tidy standard
- `validate-glance-ds`: Validates dataset conforms to glance standard
- `validate-augment-ds`: Validates dataset conforms to augment standard
Schemas are maintained in the GitHub repository (resources/*.edn):
- columms-tidy.edn
- columms-glance.edn
- columms-augment.edn
Control Validation:
The `*validate-tidy-fns*` dynamic variable controls strict validation:
- `true` (default): Raises exception on invalid columns
- `false`: Silently allows any columns
Integration:
Model implementations use these validators in their tidy-fn/glance-fn/augment-fn
to ensure outputs conform to standardized schemas for consistency across models.
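The interaction between an allowed-column list and a dynamic validation switch can be sketched as follows. Everything here is hypothetical: `*validate-columns*` stands in for `*validate-tidy-fns*`, the allowed set is a small hand-picked subset, and rows are plain maps rather than datasets.

```clojure
;; Hypothetical stand-in for *validate-tidy-fns*
(def ^:dynamic *validate-columns* true)

;; Hypothetical subset of allowed tidy columns
(def allowed-columns #{:term :estimate :std.error :statistic :p.value})

(defn validate-columns
  "Returns `row-maps` unchanged, or throws when validation is enabled
  and a column outside `allowed` is present."
  [allowed row-maps]
  (let [invalid (remove allowed (distinct (mapcat keys row-maps)))]
    (when (and *validate-columns* (seq invalid))
      (throw (ex-info "Invalid columns" {:invalid (set invalid)})))
    row-maps))

;; Passes: every column is allowed
(validate-columns allowed-columns [{:term "x" :estimate 1.5}])

;; Validation switched off for one call: unknown columns pass through
(binding [*validate-columns* false]
  (validate-columns allowed-columns [{:term "x" :my-extra 1}]))
```

Using a dynamic var keeps strict checking as the default while letting a caller relax it for a single scope with `binding`.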
See also: `scicloj.metamorph.ml` for training and prediction, `scicloj.metamorph.ml.regression` and `scicloj.metamorph.ml.classification` for specific model implementations
Deprecated ns. Use scicloj.metamorph.ml.rdatasets instead