One function. Six keywords. Two expression modes.
Datajure is a Clojure data manipulation library for finance and empirical research. It provides a clean, composable query DSL built directly on tech.ml.dataset.
(require '[datajure.core :as core])
;; Filter, group, aggregate — one call
(core/dt ds
:where #dt/e (> :year 2008)
:by [:species]
:agg {:n core/N :avg #dt/e (mn :mass)})
;; Window functions — same keywords, no new concepts
(core/dt ds
:by [:species]
:within-order [(core/desc :mass)]
:set {:rank #dt/e (win/rank :mass)})
;; Thread for multi-step pipelines
(-> ds
(core/dt :set {:bmi #dt/e (/ :mass (sq :height))})
(core/dt :by [:species] :agg {:avg-bmi #dt/e (mn :bmi)})
(core/dt :order-by [(core/desc :avg-bmi)]))
Datajure is a syntax layer, not an engine — it sits above tech.v3.dataset exactly as data.table sits above R's data frames. Every result is a standard tech.v3.dataset dataset. Full interop with tablecloth, Clerk, Clay, and the Scicloj ecosystem.
Add to your deps.edn:
{:deps {datajure/datajure {:local/root "/path/to/datajure"}}}
Datajure requires Clojure 1.12+ and Java 21+.
:by × :set/:aggTwo orthogonal keywords produce four distinct operations with no new concepts:
No :by | With :by | |
|---|---|---|
:set | Column derivation (+ whole-dataset window if win/* present) | Partitioned window |
:agg | Whole-table summary | Group aggregation |
;; Column derivation — add/update columns, keep all rows
(core/dt ds :set {:bmi #dt/e (/ :mass (sq :height))})
;; Group aggregation — collapse rows per group
(core/dt ds :by [:species] :agg {:n core/N :avg-mass #dt/e (mn :mass)})
;; Whole-table summary — collapse everything
(core/dt ds :agg {:total #dt/e (sm :mass) :n core/N})
;; first-val / last-val — OHLC construction (pre-sort, then aggregate)
(-> trades
(core/dt :order-by [(core/asc :time)])
(core/dt :by [:sym :date]
:agg {:open #dt/e (first-val :price)
:close #dt/e (last-val :price)
:hi #dt/e (mn :price)
:vol #dt/e (sm :size)}))
;; wavg / wsum — VWAP and weighted sum
(core/dt trades :by [:sym :date]
:agg {:vwap #dt/e (wavg :size :price)
:vol #dt/e (wsum :size :price)})
;; Partitioned window — compute within groups, keep all rows
(core/dt ds
:by [:species]
:within-order [(core/desc :mass)]
:set {:rank #dt/e (win/rank :mass)
:cumul #dt/e (win/cumsum :mass)})
;; Whole-dataset window — no :by, entire dataset is one partition
(core/dt ds
:within-order [(core/asc :date)]
:set {:cumret #dt/e (win/cumsum :ret)
:prev #dt/e (win/lag :price 1)})
#dt/e#dt/e is a reader tag that rewrites bare keywords to column accessors. It returns an AST object that dt interprets — vectorized, nil-safe, and pre-validated.
;; With #dt/e — terse, nil-safe, vectorized
(core/dt ds :where #dt/e (> :mass 4000))
(core/dt ds :set {:bmi #dt/e (/ :mass (sq :height))})
;; Without — plain Clojure functions (always works)
(core/dt ds :where #(> (:mass %) 4000))
(core/dt ds :set {:bmi #(/ (:mass %) (Math/pow (:height %) 2))})
#dt/e is opt-in. Users who prefer plain Clojure functions can ignore it entirely.
;; #dt/e comparisons with nil → false (no NullPointerException)
(core/dt ds :where #dt/e (> :mass 4000))
;; Arithmetic with nil propagates nil
(core/dt ds :set {:x #dt/e (+ :a :b)}) ;; nil + 1 → nil
;; Replace nil with a default
(core/dt ds :set {:mass #dt/e (coalesce :mass 0)})
;; Nil-safe division — zero or nil denominator → nil
(core/dt ds :set {:ratio #dt/e (div0 :numerator :denominator)})
;; Multi-branch conditional
(core/dt ds :set {:size #dt/e (cond
(> :mass 5000) "large"
(> :mass 3500) "medium"
:else "small")})
;; Local bindings
(core/dt ds :set {:adj #dt/e (let [bmi (/ :mass (sq :height))
base (if (> :year 2010) 1.1 1.0)]
(* base bmi))})
;; Boolean composition, membership, range
(core/dt ds :where #dt/e (and (> :mass 4000) (not (= :species "Adelie"))))
(core/dt ds :where #dt/e (in :species #{"Gentoo" "Chinstrap"}))
(core/dt ds :where #dt/e (between? :year 2007 2009))
#dt/e returns first-class AST values. Store in vars, reuse across queries:
(def bmi #dt/e (/ :mass (sq :height)))
(def high-mass #dt/e (> :mass 4000))
(core/dt ds :set {:bmi bmi})
(core/dt ds :where high-mass)
(core/dt ds :by [:species] :agg {:avg-bmi #dt/e (mn bmi)})
:select — Polymorphic Column Selection(core/dt ds :select [:species :mass]) ;; explicit list
(core/dt ds :select :type/numerical) ;; all numeric columns
(core/dt ds :select :!type/numerical) ;; all non-numeric
(core/dt ds :select #"body-.*") ;; regex match
(core/dt ds :select [:not :id :timestamp]) ;; exclusion
(core/dt ds :select {:species :sp :mass :m}) ;; select + rename
Available via win/* inside #dt/e. Work in :set context — with :by for partitioned windows, or without :by for whole-dataset windows:
;; Partitioned window — grouped by permno
(core/dt ds
:by [:permno]
:within-order [(core/asc :date)]
:set {:rank #dt/e (win/rank :ret)
:lag-1 #dt/e (win/lag :ret 1)
:cumret #dt/e (win/cumsum :ret)
:regime #dt/e (win/rleid :sign-ret)})
;; Whole-dataset window — no :by, entire dataset is one partition
(core/dt ds
:within-order [(core/asc :date)]
:set {:cumret #dt/e (win/cumsum :ret)
:prev #dt/e (win/lag :price 1)})
Functions: win/rank, win/dense-rank, win/row-number, win/lag, win/lead, win/cumsum, win/cummin, win/cummax, win/cummean, win/rleid, win/delta, win/ratio, win/differ, win/mavg, win/msum, win/mdev, win/mmin, win/mmax, win/ema, win/fills, win/scan.
Inspired by q's deltas and ratios — eliminate verbose lag patterns:
(core/dt ds :by [:permno] :within-order [(core/asc :date)]
:set {:ret #dt/e (- (win/ratio :price) 1) ;; simple return
:price-chg #dt/e (win/delta :price) ;; first differences
:changed #dt/e (win/differ :signal)}) ;; boolean change flag
(core/dt ds :by [:permno] :within-order [(core/asc :date)]
:set {:ma-20 #dt/e (win/mavg :price 20) ;; 20-day moving average
:vol-20 #dt/e (win/mdev :ret 20) ;; 20-day moving std dev
:hi-52w #dt/e (win/mmax :price 252) ;; 52-week high
:ema-10 #dt/e (win/ema :price 10)}) ;; 10-day EMA
(core/dt ds :by [:permno] :within-order [(core/asc :date)]
:set {:price #dt/e (win/fills :price)}) ;; carry forward last known
Generalized cumulative operation inspired by APL/q's scan (\). Supports +, *, max, min — the killer use case is the wealth index:
(core/dt ds :by [:permno] :within-order [(core/asc :date)]
:set {:wealth #dt/e (win/scan * (+ 1 :ret)) ;; cumulative compounding
:cum-vol #dt/e (win/scan + :volume) ;; = win/cumsum
:runmax #dt/e (win/scan max :price)}) ;; running maximum
Cross-column operations within a single row via row/*:
(core/dt ds :set {:total #dt/e (row/sum :q1 :q2 :q3 :q4)
:avg-q #dt/e (row/mean :q1 :q2 :q3 :q4)
:n-miss #dt/e (row/count-nil :q1 :q2 :q3 :q4)})
Functions: row/sum (nil as 0), row/mean, row/min, row/max (skip nil), row/count-nil, row/any-nil?.
Standalone function with cardinality validation and merge diagnostics:
(require '[datajure.join :refer [join]])
(join X Y :on :id :how :left)
(join X Y :on [:firm :date] :how :inner :validate :m:1)
(join X Y :left-on :id :right-on :key :how :left :report true)
;; [datajure] join report: 150 matched, 3 left-only, 0 right-only
;; Thread with dt
(-> (join X Y :on :id :how :left :validate :m:1)
(core/dt :where #dt/e (> :year 2008)
:agg {:total #dt/e (sm :revenue)}))
(require '[datajure.reshape :refer [melt]])
(-> ds
(melt {:id [:species :year] :measure [:mass :flipper :bill]})
(core/dt :by [:species :variable] :agg {:avg #dt/e (mn :value)}))
(require '[datajure.util :as du])
(du/describe ds) ;; summary stats → dataset
(du/describe ds [:mass :height]) ;; subset of columns
(du/clean-column-names messy-ds) ;; "Some Ugly Name!" → :some-ugly-name
(du/mark-duplicates ds [:id :date]) ;; adds :duplicate? column
(du/drop-constant-columns ds) ;; remove zero-variance
(du/coerce-columns ds {:year :int64 :mass :float64})
(require '[datajure.io :as dio])
(def ds (dio/read "data.csv"))
(def ds (dio/read "data.parquet")) ;; needs tech.v3.libs.parquet
(def ds (dio/read "data.tsv.gz")) ;; gzip auto-detected
(dio/write ds "output.csv")
Supported: CSV, TSV, Parquet, Arrow, Excel, Nippy. Gzipped variants auto-detected.
xbarFloor-division bucketing inspired by q's xbar. Primary use case is computed :by for time-series bar generation:
(require '[datajure.core :as core])
;; Numeric bucketing in :by ? price buckets of width 10
(core/dt ds :by [(core/xbar :price 10)] :agg {:n core/N :avg #dt/e (mn :volume)})
;; 5-minute OHLCV bars
(-> trades
(core/dt :order-by [(core/asc :time)])
(core/dt :by [(core/xbar :time 5 :minutes) :sym]
:agg {:open #dt/e (first-val :price)
:close #dt/e (last-val :price)
:vol #dt/e (sm :size)
:n core/N}))
;; Also usable inside #dt/e as a column derivation
(core/dt ds :set {:bucket #dt/e (xbar :price 5)})
Supported temporal units: :seconds, :minutes, :hours, :days, :weeks. Returns nil for nil input.
cutEqual-count (quantile) binning inside #dt/e. The optional :from mask computes breakpoints from a reference subpopulation and applies them to all rows — the reference and binned populations can be different sizes. This directly models the NYSE-breakpoints pattern used in empirical finance:
;; Basic: 5 equal-count bins across all rows
(core/dt ds :set {:size-q #dt/e (cut :mktcap 5)})
;; NYSE breakpoints: compute quintile breakpoints from NYSE stocks only,
;; apply to all stocks (NYSE + AMEX + NASDAQ)
(core/dt ds :set {:size-q #dt/e (cut :mktcap 5 :from (= :exchcd 1))})
;; :from accepts any #dt/e boolean expression
(core/dt ds :set {:size-q #dt/e (cut :mktcap 5 :from (and (= :exchcd 1) (> :year 2000)))})
;; Pre-computed boolean flag column also works
(core/dt ds :set {:size-q #dt/e (cut :mktcap 5 :from :nyse?)})
;; Per-date NYSE breakpoints — the canonical CRSP usage
(-> crsp
(core/dt :where #dt/e (= (month :date) 6))
(core/dt :by [:date]
:set {:size-q #dt/e (cut :mktcap 5 :from (= :exchcd 1))}))
(core/rename ds {:mass :weight-kg :species :penguin-species})
Short aliases for power users:
(require '[datajure.concise :refer [mn sm md sd ct nuniq fst lst wa ws N dt asc desc rename pass-nil]])
(dt ds :by [:species] :agg {:n N :avg #(mn (:mass %))})
| Symbol | Full name |
|---|---|
mn | mean |
sm | sum |
md | median |
sd | standard deviation |
mx | max (column maximum) |
mi | min (column minimum) |
ct | element count |
nuniq | count-distinct |
fst | first-val (first element) |
lst | last-val (last element) |
wa | wavg (weighted average) |
ws | wsum (weighted sum) |
(require '[datajure.clay :as dc])
(dc/install!) ;; auto-renders datasets, #dt/e exprs, describe output
;; Or explicit wrapping:
(dc/view ds)
(dc/view-expr #dt/e (/ :mass (sq :height)))
(dc/view-describe (du/describe ds))
Start a Clay notebook:
(require '[scicloj.clay.v2.api :as clay])
(clay/make! {:source-path "notebooks/datajure_clay_demo.clj"})
(require '[datajure.clerk :as dc])
(dc/install!) ;; registers custom Clerk viewers
*dt* holds the last dataset result (like *1), bound by nREPL middleware:
user=> (core/dt ds :by [:species] :agg {:n core/N})
;; => dataset...
user=> (core/dt core/*dt* :order-by [(core/desc :n)])
Enable in .nrepl.edn: {:middleware [datajure.nrepl/wrap-dt]}
Structured ex-info with suggestions:
(core/dt ds :set {:bmi #dt/e (/ :mass :hieght)})
;; => ExceptionInfo: Unknown column :hieght in :set expression
;; Did you mean: :height (edit distance: 1)
;; Available columns: :species :year :mass :height :flipper
(core/dt ds :set {:a #dt/e (/ :x 1)} :agg {:n core/N})
;; => ExceptionInfo: Cannot combine :set and :agg. Use -> threading.
Fixed, regardless of keyword order in the call:
1. :where — filter rows
2. :set or :agg — derive or aggregate (mutually exclusive)
3. :select — keep listed columns
4. :order-by — sort final output
User writes: #dt/e (/ :mass (sq :height))
↓
AST (pure data, serializable)
↓
compile-expr → fn [ds] → column vector
↓
tech.v3.datatype.functional (dfn)
↓
tech.v3.dataset (columnar, JVM, fast)
Datajure is a syntax layer. Computation delegates to tech.v3.dataset and tech.v3.datatype.functional. Performance matches the underlying engine — no overhead from the DSL.
| Namespace | Purpose |
|---|---|
datajure.core | dt, N, mean, sum, median, stddev, variance, max*, min*, count*, asc, desc, pass-nil, rename, xbar, cut, *dt* |
datajure.expr | AST nodes, compiler, #dt/e reader tag |
datajure.concise | Short aliases for power users |
datajure.window | Window function implementations |
datajure.row | Row-wise function implementations |
datajure.util | describe, clean-column-names, duplicate-rows, etc. |
datajure.io | Unified read/write dispatching on file extension |
datajure.reshape | melt for wide→long |
datajure.join | join with :validate and :report |
datajure.nrepl | nREPL middleware for *dt* auto-binding |
datajure.clerk | Rich Clerk notebook viewers |
datajure.clay | Clay/Kindly notebook integration |
dt is a function — not a macro. Debuggable, composable, predictable.:where always filters — conditional updates go inside :set.#dt/e — no implicit magic.coalesce replaces.dt, six keywords, two expression modes.->, use it.ex-info, Levenshtein suggestions, extensible.# Start nREPL
clj -A:nrepl
# Run all tests
clj -A:nrepl -e "(require '[clojure.test :as t])
(load-file \"test/datajure/core_test.clj\")
(load-file \"test/datajure/concise_test.clj\")
(load-file \"test/datajure/util_test.clj\")
(load-file \"test/datajure/io_test.clj\")
(load-file \"test/datajure/reshape_test.clj\")
(load-file \"test/datajure/join_test.clj\")
(load-file \"test/datajure/nrepl_test.clj\")
(load-file \"test/datajure/clerk_test.clj\")
(load-file \"test/datajure/clay_test.clj\")
(t/run-tests 'datajure.core-test 'datajure.concise-test 'datajure.util-test
'datajure.io-test 'datajure.reshape-test 'datajure.join-test
'datajure.nrepl-test 'datajure.clerk-test 'datajure.clay-test)"
209 tests, 693 assertions.
Datajure v1 was a routing layer across three backends (tablecloth, clojask, geni/Spark). v2 takes a different approach: a single, opinionated syntax layer directly on tech.v3.dataset, inspired by R's data.table and Julia's DataFramesMeta.jl.
Copyright © 2024–2025 Centre for Investment Management, HKU Business School.
Distributed under the Eclipse Public License version 2.0.
Can you improve this documentation?Edit on GitHub
cljdoc builds & hosts documentation for Clojure/Script libraries
| Ctrl+k | Jump to recent docs |
| ← | Move to previous article |
| → | Move to next article |
| Ctrl+/ | Jump to the search field |