Stratum includes a built-in isolation forest implementation for out-of-distribution (OOD) detection on columnar data. The algorithm runs entirely in Java with SIMD-accelerated scoring, achieving sub-50ms latency on 6M rows.
The isolation forest (Liu et al. 2008) detects anomalies by measuring how easily a point can be isolated from the rest of the data. Anomalous points have shorter average path lengths in random binary trees, yielding higher anomaly scores.
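The intuition can be shown with a few lines of illustrative Java (a 1-D toy, not Stratum's implementation): repeatedly split the sample at a uniformly random cut point, keep the side containing the query point, and count the splits until it stands alone. Outliers isolate after far fewer splits.

```java
import java.util.Arrays;
import java.util.Random;

public class IsolationDemo {
    // Number of random splits needed to isolate x within vals.
    // 1-D sketch of the isolation principle.
    static int isolationDepth(double[] vals, double x, Random rnd) {
        double[] cur = vals;
        int depth = 0;
        while (cur.length > 1) {
            double min = Arrays.stream(cur).min().getAsDouble();
            double max = Arrays.stream(cur).max().getAsDouble();
            if (min == max) break; // identical values cannot be split
            double split = min + rnd.nextDouble() * (max - min);
            // keep only the side of the split that contains x
            cur = Arrays.stream(cur)
                        .filter(v -> (v < split) == (x < split))
                        .toArray();
            depth++;
        }
        return depth;
    }

    public static void main(String[] args) {
        double[] data = {10, 11, 12, 13, 14, 200};
        Random rnd = new Random(42);
        double outlier = 0, inlier = 0;
        for (int i = 0; i < 1000; i++) {
            outlier += isolationDepth(data, 200, rnd);
            inlier  += isolationDepth(data, 12, rnd);
        }
        // The outlier (200) isolates in far fewer splits on average.
        System.out.printf("outlier avg %.2f, inlier avg %.2f%n",
                          outlier / 1000, inlier / 1000);
    }
}
```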
Quick start:

```clojure
(require '[stratum.api :as st])

;; Prepare data
(def amounts (double-array [10 15 12 11 14 200 13 11 300 12]))
(def freqs   (double-array [ 5  6  4  5  7   1  5  4   1  6]))

;; Train with auto-threshold (5% contamination)
(def model (st/train-iforest {:from {:amount amounts :freq freqs}
                              :contamination 0.05}))

;; Score all rows: double[] in [0, 1]
(st/iforest-score model {:amount amounts :freq freqs})

;; Binary prediction: long[] with 1 = anomaly, 0 = normal
(st/iforest-predict model {:amount amounts :freq freqs})
```
### train-iforest

Train an isolation forest on columnar data.
```clojure
(st/train-iforest {:from {:col1 data1 :col2 data2} ;; required
                   :n-trees 100                    ;; default 100
                   :sample-size 256                ;; default 256
                   :seed 42                        ;; default 42
                   :contamination 0.05})           ;; optional (0, 0.5]
```
Parameters:
- `:from` - map of keyword to `double[]` or `long[]` columns (required)
- `:n-trees` - number of isolation trees (default 100)
- `:sample-size` - rows subsampled per tree (default 256). Controls tree depth: `ceil(log2(sample-size))`
- `:seed` - random seed for reproducibility
- `:contamination` - expected fraction of anomalies in the training data. When set, a score threshold is computed automatically from the training score distribution (the percentile at `1 - contamination`)

Returns a model map containing the flat forest array, metadata, and (if `:contamination` was set) the threshold and training score min/max.
### iforest-score

Raw anomaly scores.

```clojure
(st/iforest-score model data) ;; → double[]
```
Returns double[] of anomaly scores in [0, 1]. Higher values indicate more anomalous points. Typical anomaly threshold: 0.6-0.7.
### iforest-predict

Binary anomaly classification.

```clojure
(st/iforest-predict model data)                   ;; uses auto-threshold
(st/iforest-predict model data {:threshold 0.65}) ;; explicit threshold
```
Returns long[] with 1 for anomaly, 0 for normal. Requires either :contamination during training or an explicit :threshold.
### iforest-predict-proba

Calibrated anomaly probability.

```clojure
(st/iforest-predict-proba model data) ;; → double[]
```
Returns double[] in [0, 1]. If the model was trained with :contamination, normalizes scores using min-max scaling from the training score distribution. Otherwise returns raw scores.
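The min-max normalization is straightforward to reproduce. A minimal Java sketch, where the clamping into [0, 1] and the handling of a degenerate training range are assumptions rather than confirmed Stratum behavior:

```java
public class ProbaSketch {
    // Min-max calibration of a raw score against the training score range,
    // clamped into [0, 1]. Degenerate-range handling is an assumption.
    static double proba(double score, double trainMin, double trainMax) {
        if (trainMax <= trainMin) return 0.0;
        double p = (score - trainMin) / (trainMax - trainMin);
        return Math.max(0.0, Math.min(1.0, p));
    }

    public static void main(String[] args) {
        // A score halfway through the training range maps to 0.5.
        System.out.println(proba(0.75, 0.5, 1.0)); // 0.5
    }
}
```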
### iforest-predict-confidence

Prediction confidence based on tree agreement.

```clojure
(st/iforest-predict-confidence model data) ;; → double[]
```
Returns double[] in [0, 1] where 1.0 means all trees fully agree on the point's isolation depth. Uses the coefficient of variation (CV) of per-tree path lengths: confidence = 1 / (1 + CV).
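The formula translates directly into code. A small Java sketch, using the population standard deviation (which estimator Stratum uses is an assumption):

```java
public class ConfidenceSketch {
    // confidence = 1 / (1 + CV), where CV = stddev / mean of the
    // per-tree path lengths for one point (population stddev assumed).
    static double confidence(double[] pathLengths) {
        double mean = 0.0;
        for (double p : pathLengths) mean += p;
        mean /= pathLengths.length;
        double var = 0.0;
        for (double p : pathLengths) var += (p - mean) * (p - mean);
        var /= pathLengths.length;
        return 1.0 / (1.0 + Math.sqrt(var) / mean);
    }

    public static void main(String[] args) {
        // All trees agree on the depth -> CV = 0 -> confidence = 1.0
        System.out.println(confidence(new double[]{4, 4, 4, 4})); // 1.0
    }
}
```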
### iforest-rotate

Online model adaptation by replacing the oldest trees.

```clojure
(st/iforest-rotate model new-data)    ;; replace 10% of trees (default)
(st/iforest-rotate model new-data 20) ;; replace 20 trees
```
Returns a new model. The original model is unchanged (CoW semantics).
Best practices:

- Use a small k (5-10% of `:n-trees`) per rotation for gradual adaptation

### score-weighted

Scoring with exponential decay weighting for recent trees (useful after rotation).

```clojure
(require '[stratum.iforest :as iforest])

(iforest/score-weighted model data 0.98) ;; decay = 0.98, newer trees weighted higher
```
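The exact combination rule inside `score-weighted` is internal to Stratum, but the general idea of exponential decay weighting can be sketched in a few lines of Java. The per-tree score array and the oldest-first ordering are illustrative assumptions:

```java
public class DecaySketch {
    // Exponentially decayed weighted average of per-tree scores.
    // Tree i (0 = oldest, n-1 = newest) gets weight decay^(n-1-i),
    // so more recently added trees count more.
    static double weightedMean(double[] perTreeScores, double decay) {
        int n = perTreeScores.length;
        double num = 0.0, den = 0.0;
        for (int i = 0; i < n; i++) {
            double w = Math.pow(decay, n - 1 - i);
            num += w * perTreeScores[i];
            den += w;
        }
        return num / den;
    }

    public static void main(String[] args) {
        // decay = 1.0 degenerates to the plain mean
        System.out.println(weightedMean(new double[]{1, 2, 3}, 1.0)); // 2.0
    }
}
```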
Anomaly scores can be used as regular columns in queries:
```clojure
;; Score, then filter anomalies
(def scores (st/iforest-score model data))

(st/q {:from (assoc data :score scores)
       :where [[:> :score 0.7]]
       :agg [[:count]]})

;; Combine with other analytics
(st/q {:from (assoc data :score scores)
       :group [:region]
       :agg [[:avg :score] [:count]]
       :having [[:> :avg 0.5]]
       :order [[:avg :desc]]})
```
Models can be created, managed, and queried entirely from SQL — no Clojure needed.
```sql
-- Train an isolation forest on the training query's results
CREATE MODEL fraud_model
TYPE ISOLATION_FOREST
OPTIONS (n_trees = 200, sample_size = 256, contamination = 0.05)
AS SELECT amount, freq FROM transactions;
```
The AS SELECT ... query defines the training data. Any valid SELECT is supported (including WHERE filters, JOINs, expressions). Column names from the SELECT become the model's feature names.
OPTIONS (all optional):
| Option | Default | Description |
|--------|---------|-------------|
| n_trees | 100 | Number of isolation trees |
| sample_size | 256 | Rows subsampled per tree |
| seed | 42 | Random seed for reproducibility |
| contamination | not set | Expected anomaly fraction (0, 0.5]. Sets threshold automatically |
```sql
-- List all registered models
SHOW MODELS;

-- Show model details (features, hyperparameters, threshold)
DESCRIBE MODEL fraud_model;

-- Remove a model
DROP MODEL fraud_model;

-- Remove only if it exists (no error if missing)
DROP MODEL IF EXISTS fraud_model;
```
Two calling conventions are supported:
Short form — uses the model's feature names automatically:
```sql
-- Simplest: the model remembers its features from training
SELECT *, ANOMALY_SCORE('fraud_model') AS score
FROM transactions
WHERE ANOMALY_SCORE('fraud_model') > 0.7;
```
Long form — explicit column/expression arguments (mapped positionally to features):
```sql
-- Explicit columns (useful for remapping or expressions)
SELECT *, ANOMALY_SCORE('fraud_model', amount, freq) AS score
FROM transactions
WHERE ANOMALY_SCORE('fraud_model', amount, freq) > 0.7;

-- With expressions: score on transformed data
SELECT *, ANOMALY_SCORE('fraud_model', amount * 100, LOG(freq)) AS score
FROM transactions;

-- Works across JOINs
SELECT t.*, ANOMALY_SCORE('fraud_model', t.amount, r.rate) AS score
FROM transactions t JOIN rates r ON t.currency = r.code;
```
All anomaly functions support both forms:
```sql
-- Binary prediction (1 = anomaly, 0 = normal)
SELECT *, ANOMALY_PREDICT('fraud_model') AS is_anomaly
FROM transactions;

-- Calibrated probability [0, 1]
SELECT *, ANOMALY_PROBA('fraud_model') AS prob
FROM transactions;

-- Prediction confidence (tree agreement) [0, 1]
SELECT *, ANOMALY_CONFIDENCE('fraud_model') AS conf
FROM transactions;
```
In the long form, column arguments must match the feature count from training (in order).
Models can also be trained via the Clojure API and registered with the server:
```clojure
(def srv (st/start-server {:port 5432}))
(st/register-table! srv "transactions" tx-data)

(def model (st/train-iforest {:from tx-data :contamination 0.05}))
(st/register-model! srv "fraud_model" model)
```
This is useful for programmatic workflows, custom training pipelines, or model rotation.
Trees are packed into a single long[] array for cache-friendly traversal:
- Each node occupies one `long`; a leaf stores `Double.doubleToLongBits(c(leafSize))`
- Node `nodeIdx` of tree `t` lives at `forest[t * maxNodes + nodeIdx]`
- Memory: `n-trees * (2 * sample-size - 1) * 8 bytes` = ~2.5 MB for default settings

For each row, scoring traverses each tree from root to leaf. The anomaly score is:
```
score(x) = 2^(-E(h(x)) / c(psi))
```
where E(h(x)) is the mean path length across all trees and c(n) = 2*H(n-1) - 2*(n-1)/n is the expected path length of an unsuccessful BST search (with H(i) being the harmonic number).
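These two formulas translate directly into code. A minimal Java sketch (not Stratum's SIMD-accelerated implementation):

```java
public class IForestScore {
    // Expected path length of an unsuccessful BST search over n points:
    // c(n) = 2*H(n-1) - 2*(n-1)/n, with H(i) the i-th harmonic number.
    static double c(int n) {
        if (n <= 1) return 0.0;
        double h = 0.0;
        for (int i = 1; i <= n - 1; i++) h += 1.0 / i;
        return 2.0 * h - 2.0 * (n - 1) / (double) n;
    }

    // Anomaly score from the mean path length E(h(x)) across all trees,
    // normalized by c(sampleSize).
    static double score(double meanPathLength, int sampleSize) {
        return Math.pow(2.0, -meanPathLength / c(sampleSize));
    }

    public static void main(String[] args) {
        // A point whose mean path length equals c(psi) scores exactly 0.5;
        // shorter paths push the score toward 1 (more anomalous).
        System.out.println(score(c(256), 256)); // 0.5
    }
}
```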
Scoring is parallelized using morsel-driven execution (64K rows per morsel) across all available cores.
When `:contamination` is set, all training rows are scored and the threshold is placed at the `(1 - contamination)` percentile of the training score distribution. This means `predict` labels approximately `contamination * 100%` of the training data as anomalies.
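A nearest-rank sketch of that threshold computation in Java (Stratum's exact percentile interpolation rule is not specified here and may differ):

```java
import java.util.Arrays;

public class ThresholdSketch {
    // Threshold at the (1 - contamination) percentile of the training
    // scores, using the nearest-rank method on a sorted copy.
    static double threshold(double[] trainScores, double contamination) {
        double[] sorted = trainScores.clone();
        Arrays.sort(sorted);
        int idx = (int) Math.ceil((1.0 - contamination) * sorted.length) - 1;
        idx = Math.min(Math.max(idx, 0), sorted.length - 1);
        return sorted[idx];
    }

    public static void main(String[] args) {
        double[] scores = {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0};
        // contamination = 0.1 -> 90th percentile -> only one of ten
        // scores exceeds the threshold.
        System.out.println(threshold(scores, 0.1));
    }
}
```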
All inputs are validated against malli schemas (stratum.specification):
`STrainOpts` validates that `:from` is a column map, that `:contamination` is in (0, 0.5], etc.

```clojure
(st/train-iforest {:from "bad"})
;; => ExceptionInfo Invalid input: {:from ["must be a map of keyword → column data ..."]}
```