Changelog

Unreleased

Added

Bitemporal column convention (stratum.dataset/bitemporal-config, opt-in via :metadata {:bitemporal {:valid {...} :system {...}}} on make-dataset): tags valid-time and/or system-time window columns with :temporal-unit :micros so downstream date kernels dispatch correctly, validates type at construction, and round-trips through sync!/load. Both axes are symmetric and either is optional. Per-axis helpers valid-time-config / system-time-config. Mirrors SQL:2011 column naming (_valid_from, _valid_to, _system_from, _system_to). Pairs with bench tier 10 (VT-Q1 at 1% selectivity, VT-Q2 at 50% plus the Long/MAX_VALUE sentinel, VT-Q3 group-by plus a valid-time filter) and a baseline snapshot at bench/baseline-vt-branch.txt. See doc/temporal-design.md and doc/dataset.md § Bitemporal Windows.
Audit and integrity verification (stratum.audit): verify-chain walks the dataset-commit DAG, recomputes each commit-id, and reports mismatches (layer-1). With :deep? true it additionally walks every column's PSS tree from konserve and confirms node bytes hash back to their addresses (layer-2). Live values (StratumDataset, PersistentColumnIndex) implement the IAuditable protocol (-merkle-root / -recompute-merkle-root) — protocol shape and result-map vocabulary are intentionally identical to datahike's, so bridges can pass results through without translation. Requires :crypto-hash? true on the underlying store. See doc/audit.md.
stratum.api/parquet-dataset (and close-parquet-dataset!): zero-copy lazy-decode reader. Constant-time open — only the parquet footer is parsed up front. Each row group becomes a chunk in a PersistentColumnIndex, decoded on first touch and cached as a heap long[]/double[]. Per-row-group min/max/count/null-count from the parquet metadata feed stratum's zone-map pruning, so chunks the planner can prove irrelevant are never decoded. No konserve persistence — the parquet file is the storage. Read-only (idx-set!/idx-append!/idx-sync! throw). Use this for ad-hoc queries against a parquet file; use index-parquet! instead when you need persistence to konserve.
Streaming Parquet ingest (stratum.parquet/index-parquet!): reads a Parquet file row-group-by-row-group into chunked PersistentColumnIndex columns, syncing periodically to konserve so the chunk heap is reclaimable. Memory bounded by chunk-size × num-cols × 8 B during reading, independent of file size. Wired into stratum.files/index-file-into-store! so the --index and SQL read_parquet+--index paths use it automatically.
stratum.dataset/ds-delete-rows! + bitemporal auto-split: a new transient primitive that drops a set of rows by index, fanning idx-delete! across every column with descending-index ordering so a single call can remove many rows in one shot. upsert! / retract! with :auto-split? true now reshape overlapping rows instead of rejecting: partial-left overlaps (row-vf < new-vf < row-vt) are truncated to new-vf; fully-superseded rows (row-vf >= new-vf) are dropped physically. Reject is still the default. See doc/temporal-design.md § Overlap policy.
SQL DELETE on stratum-index-backed tables + FOR PORTION OF VALID_TIME grammar (Phase D + D+). Tables registered with idx/index-from-seq columns now support DELETE FROM t WHERE … (routes through ds-delete-rows!). SQL:2011 DELETE FROM t FOR PORTION OF VALID_TIME FROM x TO y WHERE … lowers to dataset/retract! with :valid-from + :valid-to in tx-meta, performing the bounded surgery per overlapping row (truncate / shift / split / drop). INSERT INTO t (cols) VALUES (…) FOR PORTION OF VALID_TIME FROM x TO y lowers to dataset/append! with the period stamping the _valid_from / _valid_to columns. UPDATE t SET … FOR PORTION OF VALID_TIME FROM x TO y WHERE … lowers to dataset/bounded-update! — SQL:2011 non-sequenced UPDATE, the 3-way split where the overlap portion gets the new values and the non-overlap parts retain the original. The preprocessor lives in stratum.sql.rewrite/preprocess-sql; literals accept 'YYYY-MM-DD', DATE '…', and ISO instants (converted to :micros). INSERT … ON CONFLICT … FOR PORTION OF VALID_TIME is explicitly rejected — users compose two separate statements.
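The three-way split described for UPDATE … FOR PORTION OF VALID_TIME can be sketched in plain Clojure. This is a minimal illustration under stated assumptions, not stratum's implementation: split-for-portion and the :vf / :vt keys are hypothetical names, periods are treated as half-open [from, to) intervals of long timestamps, and dataset/bounded-update! is not called.

```clojure
;; Hypothetical sketch: rows are maps with :vf (valid-from) and :vt (valid-to),
;; treated as a half-open interval [vf, vt). Not stratum's actual code.
(defn split-for-portion
  "Applies `new-vals` to the part of `row` that overlaps [from, to).
  Returns a vector of 1 to 3 rows: the untouched left remainder, the
  updated overlap, and the untouched right remainder."
  [row from to new-vals]
  (let [{:keys [vf vt]} row
        lo (max vf from)   ; start of the overlap
        hi (min vt to)]    ; end of the overlap
    (if (>= lo hi)
      [row]                                                   ; no overlap: row unchanged
      (cond-> []
        (< vf lo) (conj (assoc row :vt lo))                   ; left part keeps old values
        true      (conj (merge row new-vals {:vf lo :vt hi})) ; overlap gets new values
        (< hi vt) (conj (assoc row :vf hi))))))               ; right part keeps old values

(split-for-portion {:id 1 :price 10 :vf 0 :vt 100} 20 60 {:price 12})
;; => [{:id 1 :price 10 :vf 0 :vt 20}
;;     {:id 1 :price 12 :vf 20 :vt 60}
;;     {:id 1 :price 10 :vf 60 :vt 100}]
```

A row that lies entirely outside the period comes back unchanged, and a row that lies entirely inside it yields only the updated overlap.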

Changed

Parquet I/O via memory mapping: stratum.parquet now uses an mmap-based InputFile (stratum.internal.MmapInputFile) instead of parquet-mr's default LocalInputFile. Eliminates the per-call byte[] allocation in the read(ByteBuffer) path and the kernel→user copy on every read. Affects all three parquet entry points (parquet-dataset, from-parquet, index-parquet!). Files >2 GiB are supported via the foreign-memory API.
Bulk-decode for dict-encoded numeric pages: required (non-null, non-repeated) INT32/INT64/DOUBLE columns whose pages use PLAIN_DICTIONARY / RLE_DICTIONARY now bypass parquet-mr's per-value ColumnReader.readDouble() path. The dictionary page is decoded once per row group into a typed array; data pages are walked directly on their byte[] (RLE+bit-packed) and gathered against the dict 8 lanes at a time. Falls back to ColumnReader for unsupported encodings, nullable columns, and DataPageV2. Roughly 5× faster cold decode on dict-encoded columns; 4–5× on PLAIN columns from the I/O change alone.
stratum.parquet/from-parquet is deprecated: parquet-dataset (queries) and index-parquet! (persistent ingest) cover the same use cases without the eager full-file decode. Its public re-export through stratum.api has been removed; stratum.parquet/from-parquet is still callable directly for backward compatibility but is marked :deprecated in its metadata. SQL read_parquet() now uses parquet-dataset under the hood.
stratum.parquet/from-parquet (heap path) rewritten on top of pre-allocated primitive arrays + dict-encoded strings at ingest. No more ArrayList<Object> boxing or per-row String materialization. Same public API; ~5× lower peak heap on column-heavy files.
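To illustrate why dictionary-encoding strings at ingest avoids per-row String materialization, here is a minimal sketch in plain Clojure. It is an assumption-laden illustration, not the code in stratum.parquet: dict-encode and the :dict / :codes keys are hypothetical names. Each distinct string is stored once, and every row holds only a primitive int code.

```clojure
;; Hypothetical sketch of dictionary-encoding a string column at ingest.
;; Names are illustrative; this is not stratum's implementation.
(defn dict-encode
  "Returns {:dict [distinct strings in first-seen order], :codes int[]}."
  [strings]
  (let [n     (count strings)
        codes (int-array n)]                      ; one primitive int per row
    (loop [i 0, xs (seq strings), index {}, dict []]
      (if xs
        (let [s (first xs)]
          (if-let [code (index s)]
            ;; string seen before: reuse its code
            (do (aset codes i (int code))
                (recur (inc i) (next xs) index dict))
            ;; first occurrence: assign the next free code
            (let [code (count dict)]
              (aset codes i (int code))
              (recur (inc i) (next xs) (assoc index s code) (conj dict s)))))
        {:dict dict :codes codes}))))

(let [{:keys [dict codes]} (dict-encode ["a" "b" "a" "c" "b"])]
  [dict (vec codes)])
;; => [["a" "b" "c"] [0 1 0 2 1]]
```

For a low-cardinality column the dictionary stays tiny, so memory is dominated by the 4-byte codes rather than by one String object per row.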

Removed

stratum.api/from-parquet is no longer exposed. Direct use of stratum.parquet/from-parquet still works but is deprecated in favor of stratum.api/parquet-dataset.

Fixed

Parquet OOM on large files: a 50M-row × 23-column file with two low-cardinality string columns previously needed ~43 GB just for the boxed Long/Double/String intermediate state and ran out of memory. After the changes above, the streaming path needs tens of MB during ingest and the heap path needs ~9 GB.
All-NULL int64 chunk encoding: chunk/chunk-to-bytes used to cast Double/MAX_VALUE (compute-stats's "no values seen" sentinel) to long, which threw "Value out of range" for chunks where every row is NULL. The cast is now guarded by null-count, and such chunks are encoded as constant-NULL (8 bytes).
Self-joins and overlapping column names: JOINs between tables sharing column names (including self-joins) now produce correct results. The SQL layer rewrites qualified column references to unique internal keys, preventing right-side columns from silently overwriting left-side columns.
Window frame dispatch: ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING now correctly routes to the sliding window path instead of falling through to the default running sum.
CORR scalar aggregate: SELECT CORR(x, y) FROM t without GROUP BY no longer crashes.
COALESCE with NULL literals: COALESCE(NULL, 1) and N-ary COALESCE(NULL, NULL, 42) now work correctly. Multi-argument COALESCE is nested into binary pairs.
NOT IN with NULL: WHERE x NOT IN (1, NULL) now correctly returns no rows per SQL three-valued logic.
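The NOT IN behaviour follows from SQL three-valued logic, which can be sketched in plain Clojure with nil standing in for NULL / UNKNOWN. This is an illustration of the semantics only, not stratum's SQL layer; all function names are hypothetical.

```clojure
;; Minimal sketch of SQL three-valued logic; nil stands in for NULL / UNKNOWN.
;; Function names are illustrative, not stratum's.
(defn sql-not-equal [a b]
  (if (or (nil? a) (nil? b)) nil (not= a b)))  ; comparing with NULL is UNKNOWN

(defn sql-and [a b]
  (cond
    (or (false? a) (false? b)) false   ; FALSE dominates
    (or (nil? a) (nil? b))     nil     ; otherwise UNKNOWN propagates
    :else                      true))

(defn sql-not-in
  "x NOT IN (v1, v2, ...) is x <> v1 AND x <> v2 AND ..."
  [x candidates]
  (reduce sql-and true (map #(sql-not-equal x %) candidates)))

(defn where-not-in
  "A WHERE clause keeps a row only when the predicate is TRUE."
  [xs candidates]
  (filterv #(true? (sql-not-in % candidates)) xs))

(where-not-in [1 2 3] [1 nil]) ; => []
(where-not-in [1 2 3] [1])     ; => [2 3]
```

With a NULL in the list, every row evaluates to either FALSE (it matched a non-NULL value) or UNKNOWN (it could not be proven different from NULL), so no row passes the filter.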

v0.1.0

Initial public release.
