Stratum includes a built-in isolation forest implementation for out-of-distribution (OOD) detection on columnar data. The algorithm runs entirely in Java with SIMD-accelerated scoring, achieving sub-50ms latency on 6M rows.
The isolation forest (Liu et al. 2008) detects anomalies by measuring how easily a point can be isolated from the rest of the data. Anomalous points have shorter average path lengths in random binary trees, yielding higher anomaly scores.
Key properties:
(require '[stratum.api :as st])
;; Prepare data
(def amounts (double-array [10 15 12 11 14 200 13 11 300 12]))
(def freqs (double-array [ 5 6 4 5 7 1 5 4 1 6]))
;; Train with auto-threshold (5% contamination)
(def model (st/train-iforest {:from {:amount amounts :freq freqs}
                              :contamination 0.05}))
;; Score all rows: double[] in [0, 1]
(st/iforest-score model {:amount amounts :freq freqs})
;; Binary prediction: long[] with 1 = anomaly, 0 = normal
(st/iforest-predict model {:amount amounts :freq freqs})
train-iforest
Train an isolation forest on columnar data.
(st/train-iforest {:from {:col1 data1 :col2 data2} ;; required
                   :n-trees 100                    ;; default 100
                   :sample-size 256                ;; default 256
                   :seed 42                        ;; default 42
                   :contamination 0.05})           ;; optional, in (0, 0.5]
Parameters:
:from - map of keyword to double[] or long[] columns (required)
:n-trees - number of isolation trees (default 100)
:sample-size - rows subsampled per tree (default 256). Controls tree depth: ceil(log2(sample-size))
:seed - random seed for reproducibility
:contamination - expected fraction of anomalies in training data. When set, computes a score threshold automatically from the training score distribution (percentile at 1 - contamination)
Returns a model map containing the flat forest array, metadata, and (if contamination was set) the threshold, training score min/max.
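The depth limit quoted for :sample-size can be checked with a few lines of plain Clojure. This is a minimal sketch, not Stratum code: depth-limit is a hypothetical helper name, and it uses integer bit arithmetic to avoid floating-point rounding in the logarithm.

```clojure
(defn depth-limit
  "ceil(log2(sample-size)), computed with integer bit arithmetic.
  Hypothetical helper for illustration; not part of the Stratum API."
  [sample-size]
  (- 64 (Long/numberOfLeadingZeros (dec (long sample-size)))))

(depth-limit 256) ;; default :sample-size
(depth-limit 257) ;; one row past a power of two needs one more level
```

With the default :sample-size of 256 the limit is 8 levels, so each row visits only a handful of nodes per tree.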
iforest-score
Raw anomaly scores.
(st/iforest-score model data) ;; → double[]
Returns double[] of anomaly scores in [0, 1]. Higher values indicate more anomalous points. Typical anomaly threshold: 0.6-0.7.
iforest-predict
Binary anomaly classification.
(st/iforest-predict model data) ;; uses auto-threshold
(st/iforest-predict model data {:threshold 0.65}) ;; explicit threshold
Returns long[] with 1 for anomaly, 0 for normal. Requires either :contamination during training or an explicit :threshold.
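Conceptually, prediction is a threshold applied to the raw scores. The sketch below assumes a strict greater-than comparison, which the text does not specify; labels-from-scores is a hypothetical name used only for illustration.

```clojure
(defn labels-from-scores
  "Turn anomaly scores into a long[] of 1 (anomaly) / 0 (normal).
  Assumption: a score strictly above `threshold` counts as an anomaly."
  [scores threshold]
  (long-array (map #(if (> % threshold) 1 0) scores)))

(vec (labels-from-scores (double-array [0.40 0.80 0.65]) 0.65))
```

Whether a score exactly equal to the threshold is flagged depends on the implementation, so treat the boundary case as undefined here.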
iforest-predict-proba
Calibrated anomaly probability.
(st/iforest-predict-proba model data) ;; → double[]
Returns double[] in [0, 1]. If the model was trained with :contamination, normalizes scores using min-max scaling from the training score distribution. Otherwise returns raw scores.
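Min-max scaling against the training score range can be sketched as follows. The function name is hypothetical, and clamping to [0, 1] for scores outside the training range is an assumption; the text only says the output lies in [0, 1].

```clojure
(defn min-max-normalize
  "Rescale scores to [0, 1] using the training min and max.
  Assumption: values outside the training range are clamped."
  [scores train-min train-max]
  (let [span (- train-max train-min)]
    (double-array
      (map (fn [s]
             (if (pos? span)
               (-> (/ (- s train-min) span) (max 0.0) (min 1.0))
               s)) ;; degenerate range: fall back to the raw score
           scores))))

(vec (min-max-normalize [0.4 0.6 0.8 0.9] 0.4 0.8))
```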
iforest-predict-confidence
Prediction confidence based on tree agreement.
(st/iforest-predict-confidence model data) ;; → double[]
Returns double[] in [0, 1] where 1.0 means all trees fully agree on the point's isolation depth. Uses the coefficient of variation (CV) of per-tree path lengths: confidence = 1 / (1 + CV).
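The confidence formula can be worked through for a single row. This sketch assumes the population standard deviation; the text does not say whether the population or sample form is used, and confidence is a hypothetical name.

```clojure
(defn confidence
  "1 / (1 + CV), where CV = stddev / mean of the per-tree path lengths.
  Assumption: population standard deviation (divide by n)."
  [path-lengths]
  (let [n        (count path-lengths)
        mean     (/ (reduce + path-lengths) (double n))
        variance (/ (reduce + (map #(let [d (- % mean)] (* d d)) path-lengths)) n)
        cv       (if (pos? mean) (/ (Math/sqrt variance) mean) 0.0)]
    (/ 1.0 (+ 1.0 cv))))

(confidence [4.0 4.0 4.0]) ;; all trees agree
(confidence [2.0 6.0])     ;; mean 4, stddev 2, CV 0.5
```

Identical path lengths give a CV of 0 and a confidence of 1.0; the more the trees disagree, the closer the value moves toward 0.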
iforest-rotate
Online model adaptation by replacing oldest trees.
(st/iforest-rotate model new-data) ;; replace 10% of trees (default)
(st/iforest-rotate model new-data 20) ;; replace 20 trees
Returns a new model. The original model is unchanged (copy-on-write semantics).
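The replace-oldest, leave-the-original-untouched behavior can be illustrated with a persistent vector standing in for the forest. This is only a conceptual sketch: the real model stores trees in a packed array, and rotate-trees, the oldest-first ordering, and the keyword placeholders are all assumptions made for the example.

```clojure
(defn rotate-trees
  "Drop as many of the oldest trees as there are new ones, then append the new trees.
  Assumptions: `trees` is a vector ordered oldest-first, and
  (count new-trees) <= (count trees)."
  [trees new-trees]
  (into (subvec trees (count new-trees)) new-trees))

(def old-forest [:t0 :t1 :t2 :t3])
(def new-forest (rotate-trees old-forest [:t4]))

new-forest ;; oldest tree replaced
old-forest ;; unchanged
```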
Best practices:
k (5-10% of n-trees) for gradual adaptation
score-weighted
Scoring with exponential decay weighting for recent trees (useful after rotation).
(require '[stratum.iforest :as iforest])
(iforest/score-weighted model data 0.98) ;; decay=0.98, newer trees weighted higher
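One way to picture exponential decay weighting is a per-tree weight of decay^age, where the newest tree has age 0. The sketch below is an assumption about the general technique, not a description of how score-weighted is implemented; decay-weights and weighted-mean are hypothetical names, and whether the weights apply to path lengths or to per-tree scores is not stated in the text.

```clojure
(defn decay-weights
  "Weights for n trees ordered oldest-first: tree i gets decay^(n-1-i),
  normalized so the weights sum to 1."
  [n decay]
  (let [raw   (mapv #(Math/pow decay (- (dec n) %)) (range n))
        total (reduce + raw)]
    (mapv #(/ % total) raw)))

(defn weighted-mean
  "Weighted average of per-tree values (for example, path lengths)."
  [values weights]
  (reduce + (map * values weights)))

(decay-weights 3 0.5)
(weighted-mean [8.0 8.0 2.0] (decay-weights 3 0.5))
```

With a decay close to 1, such as the 0.98 used above, the weights fall off slowly, so older trees still contribute substantially.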
Anomaly scores can be used as regular columns in queries:
;; Score, then filter anomalies
(def scores (st/iforest-score model data))
(st/q {:from  (assoc data :score scores)
       :where [[:> :score 0.7]]
       :agg   [[:count]]})
;; Combine with other analytics
(st/q {:from   (assoc data :score scores)
       :group  [:region]
       :agg    [[:avg :score] [:count]]
       :having [[:> :avg 0.5]]
       :order  [[:avg :desc]]})
Register a model with the pgwire server, then use SQL functions:
;; Server setup
(def srv (st/start-server {:port 5432}))
(st/register-table! srv "transactions" tx-data)
(st/register-model! srv "fraud_model" model)
-- Raw anomaly score [0, 1]
SELECT *, ANOMALY_SCORE('fraud_model', amount, freq) AS score
FROM transactions
WHERE ANOMALY_SCORE('fraud_model', amount, freq) > 0.7;
-- Binary prediction (1 = anomaly, 0 = normal)
SELECT *, ANOMALY_PREDICT('fraud_model', amount, freq) AS is_anomaly
FROM transactions;
-- Calibrated probability [0, 1]
SELECT *, ANOMALY_PROBA('fraud_model', amount, freq) AS prob
FROM transactions;
-- Prediction confidence (tree agreement) [0, 1]
SELECT *, ANOMALY_CONFIDENCE('fraud_model', amount, freq) AS conf
FROM transactions;
The column arguments must match the feature names used during training (in order).
Trees are packed into a single long[] array for cache-friendly traversal:
long
Double.doubleToLongBits(c(leafSize))
forest[t * maxNodes + nodeIdx]
n-trees * (2 * sample-size - 1) * 8 bytes = ~2.5 MB for default settings
For each row, traverse each tree from root to leaf. The anomaly score is:
score(x) = 2^(-E(h(x)) / c(psi))
where E(h(x)) is the mean path length of x across all trees, psi is the subsample size (:sample-size), and c(n) = 2*H(n-1) - 2*(n-1)/n is the expected path length of an unsuccessful BST search over n points (with H(i) being the i-th harmonic number).
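The formula can be evaluated directly in plain Clojure. This is a minimal sketch with hypothetical function names; it sums the harmonic number exactly rather than using the logarithmic approximation, which the library may or may not do.

```clojure
(defn harmonic
  "H(i) = 1 + 1/2 + ... + 1/i, by direct summation. H(0) = 0."
  [i]
  (reduce + 0.0 (map #(/ 1.0 %) (range 1 (inc i)))))

(defn c
  "Expected path length of an unsuccessful BST search over n points."
  [n]
  (if (<= n 1)
    0.0
    (- (* 2.0 (harmonic (dec n)))
       (/ (* 2.0 (dec n)) n))))

(defn anomaly-score
  "score(x) = 2^(-E(h(x)) / c(psi)), given the mean path length and sample size psi."
  [mean-path-length sample-size]
  (Math/pow 2.0 (- (/ mean-path-length (c sample-size)))))

(c 256)                       ;; normalizer for the default :sample-size
(anomaly-score (c 256) 256)   ;; average path length
(anomaly-score 3.0 256)       ;; isolated quickly, so a higher score
```

A point whose mean path length equals c(psi) scores exactly 0.5, and shorter paths push the score toward 1, which is why thresholds above 0.5 are used to flag anomalies.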
Scoring is parallelized using morsel-driven execution (64K rows per morsel) across all available cores.
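The chunking idea behind morsel-driven execution can be sketched in a few lines. This is an illustration under assumptions, not the engine's scheduler: pmap stands in for the real worker pool, and morsel-ranges, score-in-morsels, and score-row-fn are hypothetical names.

```clojure
(defn morsel-ranges
  "Split [0, n-rows) into half-open [start end) ranges of at most morsel-size rows."
  [n-rows morsel-size]
  (for [start (range 0 n-rows morsel-size)]
    [start (min n-rows (+ start morsel-size))]))

(defn score-in-morsels
  "Score every row by handing each morsel to a worker.
  Each worker writes only to its own slice of the output array."
  [score-row-fn n-rows morsel-size]
  (let [^doubles out (double-array n-rows)]
    (dorun
      (pmap (fn [[start end]]
              (loop [i (long start)]
                (when (< i end)
                  (aset out i (double (score-row-fn i)))
                  (recur (inc i)))))
            (morsel-ranges n-rows morsel-size)))
    out))

(vec (morsel-ranges 10 4))
(vec (score-in-morsels (fn [i] (* 2 i)) 5 2))
```

Because the ranges do not overlap, workers never write to the same array element and no locking is needed on the output.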
When :contamination is set:
(1 - contamination) percentile
This means predict labels approximately contamination * 100% of the training data as anomalies.
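The percentile rule can be sketched with a nearest-rank percentile over the training scores. The percentile method is an assumption (the text does not say which interpolation is used), and contamination-threshold is a hypothetical name.

```clojure
(defn contamination-threshold
  "Score at the (1 - contamination) percentile of the training scores.
  Assumption: nearest-rank percentile, with no interpolation."
  [scores contamination]
  (let [sorted (vec (sort scores))
        n      (count sorted)
        idx    (-> (Math/ceil (* (- 1.0 contamination) n))
                   long
                   dec
                   (max 0)
                   (min (dec n)))]
    (nth sorted idx)))

(def train-scores [0.31 0.35 0.33 0.40 0.38 0.36 0.34 0.72 0.32 0.81])
(def threshold (contamination-threshold train-scores 0.2))

;; Number of training rows strictly above the threshold
(count (filter #(> % threshold) train-scores))
```

With ten training scores and a contamination of 0.2, two rows end up above the threshold, matching the contamination * 100% rule of thumb.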
All inputs are validated against malli schemas (stratum.specification):
STrainOpts validates :from is a column map, :contamination is in (0, 0.5], etc.
(st/train-iforest {:from "bad"})
;; => ExceptionInfo Invalid input: {:from ["must be a map of keyword → column data ..."]}