Parquet file generation for columnar data export.
Schema format:
{:description "Table description"
 :columns [["COL_NAME" :type "description" :required/:optional & opts]
           ...]}
Supported column types:
- :string - UTF-8 string (BINARY with STRING logical type)
- :enum - String with optional validation (BINARY with STRING logical type)
- :boolean - Stored as string "TRUE"/"FALSE" (BINARY with STRING logical type)
- :uuid - UUID as string (BINARY with STRING logical type)
- :date - Date as days since epoch (INT32 with DATE logical type)
- :double - 64-bit floating point (DOUBLE)
- :long - 64-bit signed integer (INT64)
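The :date mapping above (days since the Unix epoch in an INT32) can be illustrated with a small conversion helper; `date->epoch-days` is a hypothetical name, not part of this namespace:

```clojure
(import '(java.time LocalDate))

;; Hypothetical helper illustrating the :date encoding: the DATE logical
;; type stores the day count since 1970-01-01 as an INT32.
(defn date->epoch-days [^String iso-date]
  (int (.toEpochDay (LocalDate/parse iso-date))))

(date->epoch-days "2024-01-15") ;=> 19737
```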
Memory considerations:
- Rows are consumed lazily (one at a time), so prefer passing lazy sequences
- Parquet writer buffers up to one row group (~128MB) internally
- Output byte[] is fully materialized in memory
Example:
(write-parquet-bytes
  {:table-name "orders"
   :schema {:description "Customer orders"
            :columns [["ORDER_ID" :uuid "Order identifier" :required]
                      ["CUSTOMER" :string "Customer name" :required]
                      ["AMOUNT" :double "Order amount" :required]
                      ["STATUS" :enum "Order status" :required :enum ["PENDING" "SHIPPED" "DELIVERED"]]
                      ["CREATED" :date "Creation date" :optional]]}
   :rows [{"ORDER_ID" "123e4567-e89b-12d3-a456-426614174000"
           "CUSTOMER" "Acme Corp"
           "AMOUNT" 1234.56
           "STATUS" "PENDING"
           "CREATED" "2024-01-15"}]
   :compression :gzip})

(get-codec compression)
Returns the Parquet compression codec for the given keyword.
Supported values:
- :uncompressed - No compression (dev, best compatibility)
- :snappy - Snappy compression (fast, moderate compression)
- :zstd - Zstandard compression (good balance of speed and compression)
- :gzip - GZIP compression (slow, high compression, good compatibility)
Returns CompressionCodecName enum value.
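The keyword-to-enum mapping could look like the sketch below, assuming parquet-mr's CompressionCodecName enum is on the classpath; `get-codec-sketch` is an illustrative name, not this namespace's actual implementation:

```clojure
(import '(org.apache.parquet.hadoop.metadata CompressionCodecName))

;; Illustrative mapping from the supported keywords to parquet-mr's enum.
(defn get-codec-sketch [compression]
  (case compression
    :uncompressed CompressionCodecName/UNCOMPRESSED
    :snappy       CompressionCodecName/SNAPPY
    :zstd         CompressionCodecName/ZSTD
    :gzip         CompressionCodecName/GZIP
    (throw (ex-info "Unsupported compression" {:compression compression}))))
```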
(mime-type)
Returns the MIME type for Parquet files.
(partition-rows rows num-partitions)
Partitions a sequence of rows into N roughly equal chunks for parallel processing. Returns a vector of vectors with balanced distribution (max-min size difference ≤ 1).
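The balancing contract (max-min size difference ≤ 1) can be sketched as follows; this is an illustrative implementation of the documented behavior, not necessarily the one used here:

```clojure
;; Sketch: the first (rem n num-partitions) chunks get one extra row,
;; so chunk sizes differ by at most 1.
(defn partition-rows-sketch [rows num-partitions]
  (let [v     (vec rows)
        n     (count v)
        base  (quot n num-partitions)
        extra (rem n num-partitions)]
    (loop [i 0, start 0, out []]
      (if (= i num-partitions)
        out
        (let [size (+ base (if (< i extra) 1 0))]
          (recur (inc i) (+ start size)
                 (conj out (subvec v start (+ start size)))))))))

(partition-rows-sketch (range 10) 3) ;=> [[0 1 2 3] [4 5 6] [7 8 9]]
```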
(schema-fingerprint schema)
Returns a deterministic fingerprint for the schema definition. Intended to change when columns/types/requirements/enums change. Useful for cache invalidation or versioning.
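One way to meet that contract is to hash only the cache-relevant parts of the column definitions; the sketch below illustrates this under that assumption (the actual algorithm is not specified in this documentation):

```clojure
(import '(java.security MessageDigest))

;; Sketch only: hashes column names, types, requiredness, and trailing
;; opts (e.g. enum values), ignoring descriptions, so the fingerprint
;; changes exactly when cache-relevant schema parts change.
(defn schema-fingerprint-sketch [schema]
  (let [relevant (mapv (fn [[col-name col-type _desc req & opts]]
                         [col-name col-type req (vec opts)])
                       (:columns schema))
        digest   (.digest (MessageDigest/getInstance "SHA-256")
                          (.getBytes (pr-str relevant) "UTF-8"))]
    (apply str (map #(format "%02x" (bit-and % 0xff)) digest))))
```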
(write-parquet-bytes
  {:keys [table-name schema rows compression schema-version table-schema threads]
   :or {compression :gzip threads (.. Runtime getRuntime availableProcessors)}})
Writes rows to Parquet format using parallel multi-threaded processing entirely in memory. Partitions are processed as byte arrays in memory and merged in memory.
For row counts below 10,000 or when receiving a lazy seq, automatically falls back to single-threaded write-parquet-bytes-single, since parallel overhead exceeds the benefit or we don't want to fully realize the data.
Arguments (as a map):
- :table-name (required) - Name for the table
- :schema (required) - Schema definition map
- :rows (required) - Sequence of row maps
- :compression (optional) - Compression codec (default :gzip)
- :schema-version (optional) - Version string for metadata
- :table-schema (optional) - Database schema name stored in Parquet key/value metadata
- :threads (optional) - Number of parallel threads (default: available processors)
Returns byte[] containing the Parquet file contents.
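A hypothetical call taking the parallel path (an eagerly realized vector of at least 10,000 rows); passing a lazy seq instead would fall back to write-parquet-bytes-single. The `pq` alias is an assumption for illustration:

```clojure
;; Hypothetical usage; assumes this namespace is required as `pq`.
(def parquet-bytes
  (pq/write-parquet-bytes
    {:table-name  "events"
     :schema      {:description "Event log"
                   :columns [["ID" :long "Event id" :required]]}
     :rows        (mapv (fn [i] {"ID" i}) (range 20000)) ; realized vector
     :compression :zstd
     :threads     4}))
```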
(write-parquet-bytes-single
  {:keys [table-name schema rows compression schema-version table-schema]
   :or {compression :gzip}})
Writes rows to Parquet format in memory and returns the byte array.
Arguments (as a map):
- :table-name (required) - Name for the table (stored in metadata)
- :schema (required) - Schema definition map with :description and :columns
- :rows (required) - Sequence of row maps with string keys matching column names.
  Lazy sequences are preferred for memory efficiency, as rows are consumed
  one at a time without realizing the full collection.
- :compression (optional) - Compression codec (:gzip or :uncompressed, default :gzip)
- :schema-version (optional) - Version string stored in Parquet key/value metadata
- :table-schema (optional) - Database schema name stored in Parquet key/value metadata
Memory characteristics:
- Rows are processed incrementally (lazy seqs supported)
- Parquet writer buffers up to one row group (~128MB) internally
- Output byte[] is fully materialized in memory
Returns byte[] containing the Parquet file contents.