Parquet file generation for columnar data export.
Schema format:
{:description "Table description"
 :columns [["COL_NAME" :type "description" :required/:optional & opts]
           ...]}
Supported column types:
- :string - UTF-8 string (BINARY with STRING logical type)
- :enum - String with optional validation (BINARY with STRING logical type)
- :boolean - Stored as string "TRUE"/"FALSE" (BINARY with STRING logical type)
- :uuid - UUID as string (BINARY with STRING logical type)
- :date - Date as days since epoch (INT32 with DATE logical type)
- :double - 64-bit floating point (DOUBLE)
- :long - 64-bit signed integer (INT64)
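The :date mapping above (days since the Unix epoch in an INT32) can be illustrated with a small conversion helper; `date->epoch-days` is a hypothetical name, not part of this namespace:

```clojure
(import '(java.time LocalDate))

;; Hypothetical helper illustrating the :date encoding: the DATE logical
;; type stores the day count since 1970-01-01 as an INT32.
(defn date->epoch-days [^String iso-date]
  (int (.toEpochDay (LocalDate/parse iso-date))))

(date->epoch-days "2024-01-15") ;=> 19737
```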
Memory considerations:
- Rows are consumed lazily (one at a time), so prefer passing lazy sequences
- Parquet writer buffers up to one row group (~128MB) internally
- Output byte[] is fully materialized in memory
Example:
(write-parquet-bytes
  {:table-name "orders"
   :schema {:description "Customer orders"
            :columns [["ORDER_ID" :uuid "Order identifier" :required]
                      ["CUSTOMER" :string "Customer name" :required]
                      ["AMOUNT" :double "Order amount" :required]
                      ["STATUS" :enum "Order status" :required :enum ["PENDING" "SHIPPED" "DELIVERED"]]
                      ["CREATED" :date "Creation date" :optional]]}
   :rows [{"ORDER_ID" "123e4567-e89b-12d3-a456-426614174000"
           "CUSTOMER" "Acme Corp"
           "AMOUNT" 1234.56
           "STATUS" "PENDING"
           "CREATED" "2024-01-15"}]
   :compression :gzip})

(get-codec compression)
Returns the Parquet compression codec for the given keyword.
Supported values:
- :uncompressed - No compression (dev, best compatibility)
- :snappy - Snappy compression (fast, moderate compression)
- :zstd - Zstandard compression (good balance of speed and compression)
- :gzip - GZIP compression (slow, high compression, good compatibility)
Returns CompressionCodecName enum value.
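The keyword-to-enum mapping could look like the sketch below, assuming parquet-mr's CompressionCodecName enum is on the classpath; `get-codec-sketch` is an illustrative name, not this namespace's actual implementation:

```clojure
(import '(org.apache.parquet.hadoop.metadata CompressionCodecName))

;; Illustrative mapping from the supported keywords to parquet-mr's enum.
(defn get-codec-sketch [compression]
  (case compression
    :uncompressed CompressionCodecName/UNCOMPRESSED
    :snappy       CompressionCodecName/SNAPPY
    :zstd         CompressionCodecName/ZSTD
    :gzip         CompressionCodecName/GZIP
    (throw (ex-info "Unsupported compression" {:compression compression}))))
```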
(mime-type)
Returns the MIME type for Parquet files.
(partition-rows rows num-partitions)
Partitions a sequence of rows into N roughly equal chunks for parallel processing. Returns a vector of vectors with balanced distribution (max-min size difference ≤ 1).
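The balancing contract (max-min size difference ≤ 1) can be sketched as follows; this is an illustrative implementation of the documented behavior, not necessarily the one used here:

```clojure
;; Sketch: the first (rem n num-partitions) chunks get one extra row,
;; so chunk sizes differ by at most 1.
(defn partition-rows-sketch [rows num-partitions]
  (let [v     (vec rows)
        n     (count v)
        base  (quot n num-partitions)
        extra (rem n num-partitions)]
    (loop [i 0, start 0, out []]
      (if (= i num-partitions)
        out
        (let [size (+ base (if (< i extra) 1 0))]
          (recur (inc i) (+ start size)
                 (conj out (subvec v start (+ start size)))))))))

(partition-rows-sketch (range 10) 3) ;=> [[0 1 2 3] [4 5 6] [7 8 9]]
```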
(schema-fingerprint schema)
Returns a deterministic fingerprint for the schema definition. Intended to change when columns/types/requirements/enums change. Useful for cache invalidation or versioning.
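One way to meet that contract is to hash only the cache-relevant parts of the column definitions; the sketch below illustrates this under that assumption (the actual algorithm is not specified in this documentation):

```clojure
(import '(java.security MessageDigest))

;; Sketch only: hashes column names, types, requiredness, and trailing
;; opts (e.g. enum values), ignoring descriptions, so the fingerprint
;; changes exactly when cache-relevant schema parts change.
(defn schema-fingerprint-sketch [schema]
  (let [relevant (mapv (fn [[col-name col-type _desc req & opts]]
                         [col-name col-type req (vec opts)])
                       (:columns schema))
        digest   (.digest (MessageDigest/getInstance "SHA-256")
                          (.getBytes (pr-str relevant) "UTF-8"))]
    (apply str (map #(format "%02x" (bit-and % 0xff)) digest))))
```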
(write-parquet-bytes
  {:keys [table-name schema rows compression schema-version table-schema threads]
   :or {compression :gzip threads (.. Runtime getRuntime availableProcessors)}})
Writes rows to Parquet format using parallel multi-threaded processing entirely in memory. Partitions are processed as byte arrays in memory and merged in memory.
For row counts below 10,000 or when receiving a lazy seq, automatically falls back to single-threaded write-parquet-bytes-single, since parallel overhead exceeds the benefit or we don't want to fully realize the data.
Arguments (as a map):
- :table-name (required) - Name for the table
- :schema (required) - Schema definition map
- :rows (required) - Sequence of row maps
- :compression (optional) - Compression codec (default :gzip)
- :schema-version (optional) - Version string for metadata
- :table-schema (optional) - Database schema name stored in Parquet key/value metadata
- :threads (optional) - Number of parallel threads (default: available processors)
Returns byte[] containing the Parquet file contents.
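A hypothetical call taking the parallel path (an eagerly realized vector of at least 10,000 rows); passing a lazy seq instead would fall back to write-parquet-bytes-single. The `pq` alias is an assumption for illustration:

```clojure
;; Hypothetical usage; assumes this namespace is required as `pq`.
(def parquet-bytes
  (pq/write-parquet-bytes
    {:table-name  "events"
     :schema      {:description "Event log"
                   :columns [["ID" :long "Event id" :required]]}
     :rows        (mapv (fn [i] {"ID" i}) (range 20000)) ; realized vector
     :compression :zstd
     :threads     4}))
```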
(write-parquet-bytes-single
  {:keys [table-name schema rows compression schema-version table-schema]
   :or {compression :gzip}})
Writes rows to Parquet format in memory and returns the byte array.
Arguments (as a map):
- :table-name (required) - Name for the table (stored in metadata)
- :schema (required) - Schema definition map with :description and :columns
- :rows (required) - Sequence of row maps with string keys matching column names.
  Lazy sequences are preferred for memory efficiency, as rows are consumed
  one at a time without realizing the full collection.
- :compression (optional) - Compression codec (:gzip or :uncompressed, default :gzip)
- :schema-version (optional) - Version string stored in Parquet key/value metadata
- :table-schema (optional) - Database schema name stored in Parquet key/value metadata
Memory characteristics:
- Rows are processed incrementally (lazy seqs supported)
- Parquet writer buffers up to one row group (~128MB) internally
- Output byte[] is fully materialized in memory
Returns byte[] containing the Parquet file contents.