- `stratum.api/parquet-dataset` (and `close-parquet-dataset!`): zero-copy lazy-decode reader. Constant-time open — only the Parquet footer is parsed up front. Each row group becomes a chunk in a `PersistentColumnIndex`, decoded on first touch and cached as a heap `long[]`/`double[]`. Per-row-group min/max/count/null-count from the Parquet metadata feed stratum's zone-map pruning, so chunks the planner can prove irrelevant are never decoded. No konserve persistence — the Parquet file is the storage. Read-only (`idx-set!`/`idx-append!`/`idx-sync!` throw). Use this for ad-hoc queries against a Parquet file; use `index-parquet!` instead when you need persistence to konserve.
- `stratum.parquet/index-parquet!`: reads a Parquet file row-group-by-row-group into chunked `PersistentColumnIndex` columns, syncing periodically to konserve so the chunk heap is reclaimable. Memory is bounded by chunk-size × num-cols × 8 B during reading, independent of file size. Wired into `stratum.files/index-file-into-store!`, so the `--index` and SQL `read_parquet` + `--index` paths use it automatically.
- `stratum.parquet` now uses an mmap-based `InputFile` (`stratum.internal.MmapInputFile`) instead of parquet-mr's default `LocalInputFile`. This eliminates the per-call `byte[]` allocation in the `read(ByteBuffer)` path and the kernel→user copy on every read. Affects all three Parquet entry points (`parquet-dataset`, `from-parquet`, `index-parquet!`). Files larger than 2 GiB are supported via the foreign-memory API.
- `PLAIN_DICTIONARY`/`RLE_DICTIONARY` columns now bypass parquet-mr's per-value `ColumnReader.readDouble()` path. The dictionary page is decoded once per row group into a typed array; data pages are walked directly on their `byte[]` (RLE + bit-packed) and gathered against the dictionary 8 lanes at a time. Falls back to `ColumnReader` for unsupported encodings, nullable columns, and DataPageV2. Roughly 5× faster cold decode on dict-encoded columns; 4–5× on `PLAIN` columns from the I/O change alone.
- `stratum.parquet/from-parquet` is deprecated: `parquet-dataset` (queries) and `index-parquet!` (persistent ingest) cover the same use cases without the eager full-file decode. Its public re-export through `stratum.api` has been retired; `stratum.parquet/from-parquet` is still callable directly for backward compatibility but is marked `:deprecated` in its metadata. SQL `read_parquet()` now uses `parquet-dataset` under the hood.
- `stratum.parquet/from-parquet` (heap path) rewritten on top of pre-allocated primitive arrays plus dictionary-encoded strings at ingest. No more `ArrayList<Object>` boxing or per-row `String` materialization. Same public API; roughly 5× lower peak heap on column-heavy files.
- `stratum.api/from-parquet` is no longer exposed. Direct use of `stratum.parquet/from-parquet` still works but is deprecated in favor of `stratum.api/parquet-dataset`.
- `chunk/chunk-to-bytes` cast `Double/MAX_VALUE` (`compute-stats`'s "no values seen" sentinel) to `long`, throwing "Value out of range" for chunks where every row is NULL. Now guarded by the null count and encoded as a constant-NULL chunk (8 bytes).
- `ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING` now correctly routes to the sliding-window path instead of falling through to the default running sum.
- `SELECT CORR(x, y) FROM t` without `GROUP BY` no longer crashes.
- `COALESCE(NULL, 1)` and N-ary `COALESCE(NULL, NULL, 42)` now work correctly. Multi-argument `COALESCE` is rewritten into nested binary pairs.
- `WHERE x NOT IN (1, NULL)` now correctly returns no rows, per SQL three-valued logic.
- Initial public release.
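A minimal usage sketch for the `parquet-dataset` entry above. The open/close names come from the changelog; the shape of the returned handle and the query step are assumptions, since the query API is not shown here:

```clojure
;; Sketch only — the query call is elided because this changelog does not
;; document the query API. Open is constant-time (footer parse only); row
;; groups decode lazily on first touch, so closing promptly is cheap.
(require '[stratum.api :as api])

(let [ds (api/parquet-dataset "events.parquet")]
  (try
    ;; ... run read-only queries against ds; idx-set!/idx-append!/idx-sync!
    ;; would throw, because a parquet-dataset is read-only ...
    (finally
      (api/close-parquet-dataset! ds))))
```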
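The `index-parquet!` entry bounds ingest memory at chunk-size × num-cols × 8 B. A quick back-of-the-envelope check with illustrative numbers (these are not the library's defaults):

```clojure
;; Hypothetical sizes, chosen only to show the arithmetic: with a 64 Ki-row
;; chunk and 32 numeric columns, the working set during ingest is fixed at
;; 16 MiB no matter how large the file is.
(let [chunk-size     65536  ; rows per chunk (assumed)
      num-cols       32     ; columns in the file (assumed)
      bytes-per-cell 8]     ; long[]/double[] cell width
  (/ (* chunk-size num-cols bytes-per-cell) 1024.0 1024.0))
;; => 16.0  (MiB)
```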
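The `COALESCE` fix notes that multi-argument `COALESCE` is rewritten into nested binary pairs. A sketch of that rewrite — this is not the library's planner code, just an illustration of the transformation `COALESCE(a, b, c)` → `COALESCE(a, COALESCE(b, c))` on a toy expression representation:

```clojure
;; Hypothetical helper: rewrite an N-ary COALESCE argument list into
;; right-nested binary pairs, so the engine only needs a 2-arg primitive.
(defn nest-coalesce [args]
  (if (<= (count args) 2)
    (cons :coalesce args)
    (list :coalesce (first args) (nest-coalesce (rest args)))))

(nest-coalesce [nil nil 42])
;; => (:coalesce nil (:coalesce nil 42))
```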
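Why `x NOT IN (1, NULL)` returns no rows: the predicate desugars to `x <> 1 AND x <> NULL`, and any comparison with NULL is UNKNOWN, so the conjunction can never be TRUE. A small three-valued-logic sketch (not stratum's evaluator) with `nil` standing in for SQL UNKNOWN:

```clojure
;; Three-valued AND: FALSE dominates, then UNKNOWN (nil), then TRUE.
(defn and3 [a b]
  (cond (or (false? a) (false? b)) false
        (or (nil? a) (nil? b))     nil
        :else                      true))

;; NOT IN as a conjunction of inequalities; comparing with nil is UNKNOWN.
(defn sql-not-in [x vals]
  (reduce and3 true (map #(when (some? %) (not= x %)) vals)))

(sql-not-in 2 [1 nil]) ;; => nil   (UNKNOWN — the row is filtered out)
(sql-not-in 1 [1 nil]) ;; => false (FALSE absorbs the UNKNOWN)
```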