The query engine (stratum.query) compiles declarative query maps into optimized execution plans that dispatch to Java SIMD operations. It handles column preparation, expression evaluation, predicate compilation, aggregation dispatch, group-by routing, join execution, and result decoding.
Queries are maps. :from is required; all other keys are optional:
{:from {col-kw data ...}                           ;; Column data (required)
 :join [{:with {col data} :on pred :type :inner}]  ;; Hash joins
 :where [pred ...]                                 ;; Filter predicates
 :select [col-or-expr ...]                         ;; Projection (with :as aliases)
 :agg [agg-spec ...]                               ;; Aggregations
 :group [col-kw ...]                               ;; Group-by columns
 :having [pred ...]                                ;; Post-aggregation filter
 :order [[col dir] ...]                            ;; Sort (:asc/:desc)
 :limit n                                          ;; Result limit
 :offset n                                         ;; Result offset
 :distinct true                                    ;; SELECT DISTINCT
 :result :columns}                                 ;; Columnar output format
Both keyword vectors and s-expressions are supported:
[:< :col 100] ;; keyword form
'(< :col 100) ;; s-expression form (Datahike-compatible)
[:between :col 10 20] ;; range: 10 <= col < 20
[:like :name "%foo%"] ;; string pattern (case-sensitive)
[:ilike :name "%foo%"] ;; string pattern (case-insensitive)
[:in :status [1 2 3]] ;; set membership
[:or [:< :a 5] [:> :b 10]] ;; disjunction
[:is-null :col] ;; NULL check
[:not [:= :col 0]] ;; negation
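Multiple predicates in one :where vector are applied together. A sketch mixing the forms above (st alias assumed, as in the examples below):

```clojure
;; Combining predicate forms in one :where vector (sketch; assumes
;; (require '[stratum.query :as st])).
(st/q {:from   {:status (long-array [1 2 3 1])
                :name   (into-array String ["foo" "bar" "foobar" "baz"])}
       :where  [[:in :status [1 3]]
                [:like :name "%foo%"]]
       :select [:name :status]})
```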
Aggregation specs for :agg:
[:sum :col] ;; SUM(col)
[:avg :col] ;; AVG(col)
[:count] ;; COUNT(*)
[:count-distinct :col] ;; COUNT(DISTINCT col)
[:min :col] [:max :col] ;; MIN/MAX
[:stddev :col] ;; STDDEV(col)
[:variance :col] ;; VARIANCE(col)
[:corr :col1 :col2] ;; CORR(col1, col2)
[:sum [:* :price :qty]] ;; SUM(price * qty) — expression in agg
[:sum-product :price :qty] ;; SUM(price * qty) — direct
[:median :col] ;; MEDIAN(col)
[:percentile :col 0.95] ;; PERCENTILE_CONT(0.95, col)
[:approx-quantile :col 0.95] ;; Approximate quantile via t-digest
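Several aggregation specs can run over one grouped scan. A sketch (st alias assumed):

```clojure
;; Multiple agg specs in one query; assumes (require '[stratum.query :as st]).
(st/q {:from  {:region (long-array [0 0 1 1])
               :price  (double-array [10.0 20.0 30.0 40.0])
               :qty    (long-array [1 2 3 4])}
       :group [:region]
       :agg   [[:sum-product :price :qty]  ;; SUM(price * qty), no intermediate array
               [:avg :price]
               [:count]]})
```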
Expression forms, usable inside :where, :select, :group, and :agg:
[:+ :a :b] ;; Arithmetic: + - * / % **
[:abs :col] ;; Math: abs sqrt log log10 exp round floor ceil sign
[:date-trunc :day :ts] ;; Date truncation: year month day hour minute
[:extract :hour :ts] ;; Date extraction: year month day hour minute second
[:date-add :days 30 :ts] ;; Date arithmetic
[:coalesce :col 0] ;; NULL handling
[:nullif :col 0] ;; Returns NULL if col = value
[:greatest :a :b :c] ;; Maximum of multiple values
[:least :a :b :c] ;; Minimum of multiple values
[:upper :name] ;; String: upper lower substr replace trim concat length
[:cast :col :double] ;; Type cast
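Expressions nest inside other clauses. A sketch combining a date expression in :group with an arithmetic expression in :agg (st alias and epoch-second timestamps are assumptions):

```clojure
;; Nested expressions; assumes (require '[stratum.query :as st]) and
;; that :ts holds epoch-second timestamps.
(st/q {:from  {:ts    (long-array [1704067200 1704153600 1706745600])
               :price (double-array [100.0 250.0 80.0])}
       :group [[:date-trunc :month :ts]]
       :agg   [[:sum [:* :price 1.1]]]})  ;; SUM(price * 1.1) per month
```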
The DSL operates over a single flat namespace. Every clause (:where, :agg, :group, :select, :having, :order) resolves column names by keyword lookup against one merged column map. There is no scope nesting, no table qualification, and no quoting — just keywords:
{:from {:price (double-array [10 20 30])
        :qty   (long-array [1 2 3])}
 :where [[:> :qty 1]]               ;; :qty resolved from :from
 :agg   [[:sum [:* :price :qty]]]}  ;; :price and :qty both visible
:join merges the dimension table's columns into the same flat namespace. After the join, all :from columns and all :with columns are visible to downstream clauses:
{:from {:order-id (long-array [1 2 3])
        :cust-id  (long-array [10 20 10])}
 :join [{:with {:cust-id (long-array [10 20])              ;; key column
                :region  (into-array String ["US" "EU"])}  ;; payload
         :on   [:= :cust-id :cust-id]  ;; left.cust-id = right.cust-id
         :type :inner}]
 :group [:region]    ;; :region comes from :with
 :agg   [[:count]]}  ;; both sides visible
Key collision: when a column name appears in both :from and :with, the right (:with) side wins. The join key column from the right side is dropped after the join (it's redundant — it equals the left-side key by definition).
If you need both sides of a collision, rename the column in one of the input maps before building the query:
(let [orders  {:cust-id orders-cust-id :amount orders-amount}
      renamed {:cust-key dim-cust-id :tier dim-tier}] ;; renamed to avoid collision
  (st/q {:from orders
         :join  [{:with renamed :on [:= :cust-id :cust-key] :type :left}]
         :group [:tier]
         :agg   [[:sum :amount]]}))
SQL CTEs and subqueries name intermediate results:
WITH monthly AS (
  SELECT DATE_TRUNC('month', ts) AS month, SUM(amount) AS total
  FROM orders GROUP BY 1
)
SELECT month, total FROM monthly WHERE total > 1000
In the DSL, intermediate results are Clojure values. Use def or let to bind them:
(let [monthly (st/q {:from orders
                     :group [[:date-trunc :month :ts]]
                     :agg [[:sum :amount]]})]
  ;; monthly is a sequence of maps: [{:date-trunc 1704067200 :sum 5000} ...]
  ;; Extract to column arrays for the next stage:
  (st/q {:from {:month (long-array (map :date-trunc monthly))
                :total (double-array (map :sum monthly))}
         :where [[:> :total 1000]]
         :select [:month :total]}))
When the SQL layer translates a query with CTEs or subqueries, it does exactly this: it executes each CTE eagerly, collects the results into column arrays, and wires them as :from data in the outer query map. No lazy references, no symbolic names at the DSL level.
The SQL layer (stratum.sql) translates SQL strings to query maps. Subquery results are materialized before the outer query runs:
-- SQL
SELECT region, SUM(total) FROM
(SELECT region, SUM(amount) AS total FROM orders GROUP BY region)
WHERE total > 500 GROUP BY region
;; What the SQL layer produces (schematically):
(let [inner-result (st/q {:from orders :group [:region] :agg [[:sum :amount]]})]
  (st/q {:from {:region (into-array String (map :region inner-result))
                :total  (double-array (map :sum inner-result))}
         :where [[:> :total 500]]
         :group [:region]
         :agg   [[:sum :total]]}))
This means the DSL is not less expressive than SQL — it has the same compositional power, expressed through Clojure's own value model rather than SQL's naming syntax.
The q function processes a query through these stages:
prepare-columns: Resolve :from data. Detect column types (long[], double[], dict-encoded). Build column lookup map.
Join execution: If :join is present, build hash tables on dimension keys, probe with fact keys, gather matched rows. Fused join+group-by attempted first for eligible queries.
Expression pre-computation: Evaluate expressions in :where, :agg, :group, :select into materialized arrays using eval-expr-vectorized (Java array ops).
Zone map pruning (index inputs): Classify chunks as skip/stats-only/SIMD using ChunkStats. Materialize only surviving chunks via idx-materialize-to-array-pruned.
Predicate separation: Split predicates into long-typed and double-typed groups. Non-SIMD predicates (OR, IN, NOT-IN) compiled to bytecode masks.
Dispatch: Route to optimal execution strategy (see below).
Result decode: Convert Java arrays to Clojure maps (or columnar format if :result :columns).
Post-processing: Apply :having, :order, :limit, :offset, :distinct.
The engine selects the fastest execution strategy based on query shape:
Is this a projection-only query (no agg, no group)?
  → Yes: materialize columns, apply select/where/order/limit
  → No: continue
Is this a fused extract+count (EXTRACT expr + all-COUNT aggs)?
  → Yes: fusedExtractCountDenseParallel
Is the input a PersistentColumnIndex eligible for chunked group-by?
  → Yes: execute-chunked-group-by (streams over chunks, no materialization)
Is this a fused join+group-by (single INNER join, keyword group cols, no WHERE)?
  → Yes: fusedJoinGroupAggregateDenseRange
Is this a single agg with ≤4 long + ≤4 double predicates on ≥1000 rows?
  → Yes: Is it COUNT?
    → Yes: fusedSimdCountParallel
    → No: fusedSimdParallel (SUM/SUM_PRODUCT/MIN/MAX)
Are all aggs SUM/SUM_PRODUCT/COUNT/AVG with ≤4 aggs?
  → Yes: Are all agg columns long[]?
    → Yes: fusedSimdMultiSumAllLongParallel (LongVector accumulators)
    → No: fusedSimdMultiSumParallel
Is this a group-by?
  → Yes: Is max key ≤ dense limit (200K)?
    → Yes: Has VARIANCE/CORR aggs?
      → Yes: fusedFilterGroupAggregateDenseVarParallel
      → No: All COUNT?
        → Yes: fusedFilterGroupCountDenseParallel
        → No: fusedFilterGroupAggregateDenseParallel
    → No (high cardinality): fusedFilterGroupAggregatePartitioned (radix-partitioned hash)
Is this a percentile/median/approx-quantile?
  → Yes: Scalar path — collect matching values, QuickSelect or t-digest
Fallback: N-pass (one pass per agg) or scalar (< 1000 rows)
Statistical aggregates use specialized algorithms that don't fit the SIMD accumulator pattern:
:median — Sugar for [:percentile :col 0.5].
:percentile — Exact percentile via Hoare's QuickSelect (O(N) average). Copies matching values to a work array and partitions in place. For grouped queries, collects per-group values via ArrayList, then sorts.
:approx-quantile — Approximate quantile via t-digest (Dunning 2019). Bounded memory (~6.4 KB per digest), O(N) insertion. Ungrouped queries use a single digest; grouped queries keep per-group digests in flat arrays. Accuracy is within ±1% for 100K+ rows.
These are dispatched through the scalar aggregation path (execute-scalar-aggs and the collection-agg branch of hash group-by). They cannot use the SIMD fused paths because they must collect individual values rather than accumulate running statistics.
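The exact-percentile selection can be sketched in plain Clojure. This is illustrative only (the engine's implementation is in Java), and it uses nearest-rank selection for simplicity rather than PERCENTILE_CONT interpolation:

```clojure
(defn- aswap
  "Swap elements i and j of double array a in place."
  [^doubles a ^long i ^long j]
  (let [t (aget a i)]
    (aset a i (aget a j))
    (aset a j t)))

(defn quickselect
  "Return the k-th smallest (0-based) element of xs, partitioning the
   array in place. O(n) average, like the engine's percentile work array."
  [^doubles xs ^long k]
  (loop [lo 0 hi (dec (alength xs))]
    (if (= lo hi)
      (aget xs lo)
      (let [pivot (aget xs hi)
            ;; Lomuto partition: everything < pivot ends up left of p
            p (loop [i lo j lo]
                (if (< j hi)
                  (if (< (aget xs j) pivot)
                    (do (aswap xs i j) (recur (inc i) (inc j)))
                    (recur i (inc j)))
                  (do (aswap xs i hi) i)))]
        (cond
          (= p k) (aget xs p)
          (< k p) (recur lo (dec p))
          :else   (recur (inc p) hi))))))

(defn percentile-sketch
  "Nearest-rank percentile over a copy of the matching values."
  [xs q]
  (let [a (double-array xs)
        k (long (Math/floor (* q (dec (alength a)))))]
    (quickselect a k)))
```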
String columns are dictionary-encoded for efficient group-by and LIKE operations:
(q/encode-column (into-array String ["US" "EU" "US" "JP" "EU"]))
;; => {:codes (long-array [0 1 0 2 1])
;;     :dict  (into-array String ["US" "EU" "JP"])}
The codes are sequential integers (0, 1, 2, ...), enabling direct array indexing in dense group-by. Pre-encoding with q/encode-column is critical for performance — per-query encoding adds 3-4x overhead.
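The encoding itself is easy to sketch in plain Clojure. This illustrative version (not the library's q/encode-column, though it produces the same shape) assigns codes in first-seen order:

```clojure
;; Illustrative dictionary encoding: first occurrence order assigns codes 0, 1, 2, ...
(defn encode-column-sketch [strs]
  (let [dict (into [] (distinct strs))  ;; unique strings, first-seen order
        code (zipmap dict (range))]     ;; string -> sequential code
    {:codes (long-array (map code strs))
     :dict  (into-array String dict)}))
```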
LIKE patterns on dictionary-encoded columns first filter the dictionary (e.g., which strings contain "foo"), then check per-row codes against a bitset of matching dictionary entries.
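That two-phase approach can be sketched in plain Clojure. This illustrative version handles only a "%foo%" contains-style pattern (the real engine compiles general LIKE patterns), but shows why it is fast: string work happens once per dictionary entry, not once per row:

```clojure
;; Illustrative two-phase LIKE: scan the (small) dictionary once, then test
;; each row's code against the set of matching dictionary entries.
(defn like-contains-sketch [{:keys [codes dict]} ^String needle]
  (let [matching (set (keep-indexed
                        (fn [code ^String s]
                          (when (.contains s needle) code))
                        dict))]
    ;; Per-row work is a cheap set-membership test on the code, no string ops.
    (long-array (map #(if (matching %) 1 0) codes))))
```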
Predicates that can't be expressed as SIMD operations (:or, :in, :not-in) are compiled to JVM bytecode:
;; [:or [:< :a 5] [:> :b 10]] compiles to:
(let [mask (long-array n)]
(dotimes [i n]
(aset mask i
(if (or (< (aget a i) 5) (> (aget b i) 10)) 1 0)))
mask)
The compiled mask is passed as [:__mask :eq 1] — a SIMD-compatible equality predicate that integrates with the fused filter+aggregate pipeline.
eval-expr-vectorized evaluates expressions to materialized arrays using Java array operations:
;; [:* :price :qty] →
(ColumnOps/arrayMul price-arr qty-arr length)
;; [:- :price 100.0] →
(ColumnOps/arraySubScalar 100.0 price-arr length)
;; [:date-trunc :minute :ts] →
(ColumnOps/arrayDateTruncMinute ts-arr length)
All operations produce new arrays. Scalar values are handled by dedicated scalar-op methods (e.g., arraySubScalar) to avoid allocating broadcast arrays.
eval-expr-to-long is used for date expressions in group-by contexts, returning long[] directly instead of routing through double[] conversion.
By default, results are returned as sequences of maps. The :result :columns option returns a map of column name → array:
;; Default: [{:region "US" :sum 100} {:region "EU" :sum 200}]
;; Columnar: {:region (String[] ...) :sum (double-array ...) :n-rows 2}
For high-cardinality group-by (millions of groups), columnar output is 15x faster because it skips per-row map creation.
Adding :explain true returns the execution plan without running the query:
(q/q (assoc q :explain true))
;; => {:strategy :fused-simd-parallel
;;     :n-rows 6000000
;;     :predicates [{:col :shipdate :type :range :bounds [8766 9131]} ...]
;;     :agg {:type :sum-product :col1 :price :col2 :discount}}