sturdy.numpy

A small, focused Clojure library for reading NumPy .npy files.


com.sturdystats/sturdy-numpy {:mvn/version "VERSION"}

Features

  • Reads NumPy .npy files (binary array format)
  • Supports 1D and 2D arrays
  • Supports numeric dtypes:
    • Unsigned integers: u1, u2, u4
    • Signed integers: i1, i2, i4, i8
    • Floating point: f4, f8
  • Supports little-endian and big-endian files
  • Supports C-order and Fortran-order layouts
  • Multiple output representations:
    • Idiomatic Clojure data (vector / vector-of-vectors)
    • tech.v3.dataset (columnar, primitive-backed)
    • Raw Java primitive arrays (lowest-level access)

Example Usage

Clojure Vectors

(require '[sturdy.numpy :as np])

(let [f "test-resources/npy-fixtures/shape_2x3__dtype_u4.npy"]
  (np/npy->vec f))

;; => [[4294967291 12 29] [46 63 80]]

Vectors returned by npy->vec are:

  • Always returned in row-major (C-order) layout
  • Easy to inspect and test
  • Not optimized for large arrays
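
For a 1D array, npy->vec presumably returns a flat vector rather than a vector-of-vectors (per the Features list above). A minimal sketch, using a hypothetical 1D fixture file name that is not one of the fixtures shown elsewhere in this README:

(require '[sturdy.numpy :as np])

;; hypothetical 1D fixture; substitute any 1D .npy file
(np/npy->vec "test-resources/npy-fixtures/shape_5__dtype_i4.npy")

;; => a flat Clojure vector of the file's five int32 values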

tech.v3.dataset

(require '[sturdy.numpy :as np])

(let [f "test-resources/npy-fixtures/shape_2x3__dtype_u4.npy"]
  (np/npy->dataset f))

;; => _unnamed [2 3]:
;;    |        :c1 | :c2 | :c3 |
;;    |-----------:|----:|----:|
;;    | 4294967291 |  12 |  29 |
;;    |         46 |  63 |  80 |

  • 1D arrays → single-column dataset (:c1)
  • 2D arrays → one column per NumPy column
  • Column names are generated as :c1, :c2, … reflecting the order in the original file

This is the recommended entry point for:

  • large arrays
  • database ingestion (e.g. DuckDB)
  • most tasks
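
The value returned is an ordinary tech.v3.dataset, so the standard dataset API applies to it directly. For example, inspecting the 2x3 fixture used above:

(require '[sturdy.numpy :as np])
(require '[tech.v3.dataset :as ds])

(let [d (np/npy->dataset "test-resources/npy-fixtures/shape_2x3__dtype_u4.npy")]
  {:rows  (ds/row-count d)
   :cols  (ds/column-count d)
   :names (ds/column-names d)})

;; => {:rows 2, :cols 3, :names (:c1 :c2 :c3)}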

Preserves dtype

(require '[sturdy.numpy :as np])
(require '[tech.v3.dataset :as ds])
(require '[tech.v3.datatype :as dtype])

(let [f       "test-resources/npy-fixtures/shape_2x3__dtype_u4.npy"
      dataset (np/npy->dataset f)]
  (map dtype/elemwise-datatype (ds/columns dataset)))

;; => (:uint32 :uint32 :uint32)

The same mapping holds across all supported dtypes:

(require '[sturdy.numpy :as np])
(require '[tech.v3.dataset :as ds])
(require '[tech.v3.datatype :as dtype])

(defn get-dtype [dataset]
  (-> (ds/columns dataset) first dtype/elemwise-datatype))

(defn test-file [dtype fname]
  (let [dataset (np/npy->dataset fname)]
    {:expected dtype
     :actual (get-dtype dataset)}))

(for [tp [:u1 :u2 :u4 :i1 :i2 :i4 :i8 :f4 :f8]]
  (let [fname (format "test-resources/npy-fixtures/shape_2x3__dtype_%s.npy"
                      (name tp))]
    (test-file tp fname)))
;; => ({:expected :u1, :actual :uint8}
;;     {:expected :u2, :actual :uint16}
;;     {:expected :u4, :actual :uint32}
;;     {:expected :i1, :actual :int8}
;;     {:expected :i2, :actual :int16}
;;     {:expected :i4, :actual :int32}
;;     {:expected :i8, :actual :int64}
;;     {:expected :f4, :actual :float32}
;;     {:expected :f8, :actual :float64})

Primitive Arrays (advanced)

(require '[sturdy.numpy :as np])

(np/npy->primitive "test-resources/npy-fixtures/shape_2x3__dtype_u4.npy")

;; => {:shape    [2 3]
;;     :dtype    :u4
;;     :fortran? false
;;     :data     #<long[]>}

  • Returns a flat Java primitive array
  • Layout depends on :fortran? (see the indexing sketch below)
  • No reshaping or transposition is performed
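
Because the data comes back as a single flat array, row/column indexing is left to the caller. Below is a minimal indexing sketch against the map shape shown above; element-at is a hypothetical helper written for this README, not part of the library:

(require '[sturdy.numpy :as np])

(defn element-at
  "Hypothetical helper: read element [r c] from the flat :data array,
  honoring the :fortran? flag."
  [{:keys [shape fortran? data]} r c]
  (let [[rows cols] shape]
    (aget data
          (if fortran?
            (+ (* c rows) r)     ;; column-major: each column is contiguous
            (+ (* r cols) c))))) ;; row-major: each row is contiguous

(let [arr (np/npy->primitive "test-resources/npy-fixtures/shape_2x3__dtype_u4.npy")]
  (element-at arr 1 2))

;; => 80 (element [1 2] of the 2x3 u4 fixture shown earlier)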

This API is intended for:

  • custom ingestion pipelines
  • zero-copy workflows
  • advanced performance-sensitive use cases

Experimental: Row-list Datasets

For some workflows (e.g. downstream databases that support list or array types), it can be useful to represent each row of a 2D NumPy array as a single list-valued column rather than as many scalar columns.

The function npy->dataset-rowlists provides this representation:

(require '[sturdy.numpy :as np])

(np/npy->dataset-rowlists "test-resources/npy-fixtures/shape_2x3__dtype_i4.npy")
;; => _unnamed [2 1]:
;;    |        :c1 |
;;    |------------|
;;    | [-5 12 29] |
;;    | [46 63 80] |

(np/npy->dataset-rowlists "test-resources/npy-fixtures/shape_2x3__dtype_f4.npy")
;; => _unnamed [2 1]:
;;    |               :c1 |
;;    |-------------------|
;;    | [-3.0 -1.75 -0.5] |
;;    |   [0.75 2.0 3.25] |

Each row is represented as a zero-copy buffer view over the underlying array data.
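
A minimal sketch of materializing the buffer-valued rows into plain Clojure vectors (this copies the data, so it gives up the zero-copy property):

(require '[sturdy.numpy :as np])
(require '[tech.v3.dataset :as ds])

(let [d (np/npy->dataset-rowlists
          "test-resources/npy-fixtures/shape_2x3__dtype_i4.npy")]
  ;; each cell in :c1 is a buffer view; vec realizes it into a vector
  (mapv vec (ds/column d :c1)))

;; => [[-5 12 29] [46 63 80]]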

Notes:

  • Only 2D arrays are supported.
  • Row-major (C-order) .npy files are supported.
  • Fortran-order files are not currently supported by this helper.
  • The resulting column has element dtype :object (each cell is a buffer), which may not be accepted by all ingestion paths.
  • This helper is experimental and primarily intended for advanced ingestion pipelines or custom database integrations.

Experimental: Sparse UNNESTed Datasets

For very sparse arrays, a column-oriented dataset with one column per NumPy column can be inefficient: most values are zero, and downstream systems often want a sparse or list-based representation anyway.

The function npy->dataset-unnested-nz provides an alternative representation inspired by SQL UNNEST / long-form tables.

Instead of producing one column per NumPy column, it produces a row-wise sparse representation with three columns:

  • :row_no — row index (0-based, int64)
  • :col_no — column index (0-based, int16)
  • :val — value (primitive-backed, dtype preserved)

Only non-zero entries are emitted:

(require '[sturdy.numpy :as np])

(np/npy->dataset-unnested-nz "test-resources/npy-fixtures/shape_2x3__dtype_u4.npy")
;; => _unnamed [6 3]:
;;    | :row_no | :col_no |       :val |
;;    |--------:|--------:|-----------:|
;;    |       0 |       0 | 4294967291 |
;;    |       0 |       1 |         12 |
;;    |       0 |       2 |         29 |
;;    |       1 |       0 |         46 |
;;    |       1 |       1 |         63 |
;;    |       1 |       2 |         80 |

For Fortran-order (order='F') .npy files, the physical order of rows differs, but (row_no, col_no) are computed correctly:

(np/npy->dataset-unnested-nz "test-resources/npy-fixtures/shape_2x3__dtype_u4__order_F.npy")
;; => _unnamed [6 3]:
;;    | :row_no | :col_no |       :val |
;;    |--------:|--------:|-----------:|
;;    |       0 |       0 | 4294967291 |
;;    |       1 |       0 |         46 |
;;    |       0 |       1 |         12 |
;;    |       1 |       1 |         63 |
;;    |       0 |       2 |         29 |
;;    |       1 |       2 |         80 |

The row order is not significant; consumers should treat the dataset as an unordered collection of (row, col, val) triples.
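
If a deterministic order is needed (for example when comparing against the C-order output above), one option is to sort the rows yourself; a small sketch using standard tech.v3.dataset row access:

(require '[sturdy.numpy :as np])
(require '[tech.v3.dataset :as ds])

(let [d (np/npy->dataset-unnested-nz
          "test-resources/npy-fixtures/shape_2x3__dtype_u4__order_F.npy")]
  ;; ds/rows yields map-like rows; sort them by (row, col)
  (->> (ds/rows d)
       (sort-by (juxt :row_no :col_no))
       (mapv (juxt :row_no :col_no :val))))

;; => [[0 0 4294967291] [0 1 12] [0 2 29] [1 0 46] [1 1 63] [1 2 80]]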

Why this format?

This representation is especially useful when:

  • The array is extremely sparse
  • The column count is large
  • You intend to ingest directly into a database such as DuckDB
  • You want to immediately aggregate into list- or sparse-row formats

For example, in DuckDB you might do:

SELECT
  row_no,
  list(col_no ORDER BY col_no) AS inds,
  list(val    ORDER BY col_no) AS vals
FROM staging
GROUP BY row_no;
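
For the 2x3 fixture above (where every entry happens to be non-zero), this query would produce two rows: row_no 0 with inds [0, 1, 2] and vals [4294967291, 12, 29], and row_no 1 with inds [0, 1, 2] and vals [46, 63, 80].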

This yields a compact per-row sparse representation suitable for downstream modeling or analytics.

Performance characteristics

  • Two-pass algorithm:
    1. Count non-zero entries
    2. Allocate exactly-sized primitive arrays and populate them
  • No transposition or per-column materialization
  • No boxing in hot loops
  • Preserves original NumPy dtype (including unsigned integers)
  • :col_no is stored as int16 (columns < 32k)
  • :row_no is stored as int64 (supports millions of rows)

This makes npy->dataset-unnested-nz significantly more memory-efficient than dense columnar datasets when sparsity is high.
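
As a rough back-of-envelope for a u4 array: a dense element costs 4 bytes, while each emitted triple costs about 8 + 2 + 4 = 14 bytes (:row_no + :col_no + :val), so the unnested form uses less memory whenever roughly fewer than 4/14 ≈ 29% of entries are non-zero (ignoring constant per-dataset overhead).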

Notes and limitations

  • Only 1D and 2D arrays are supported
  • Zero is defined as:
    • 0 for integer types
    • 0.0 / -0.0 for floating-point types (exact comparison)
  • NaNs are not treated as zero
  • Row and column indices are 0-based

Non-goals (v0.1.0)

  • Higher-dimensional arrays (3D+)
  • Structured / record dtypes
  • Memory-mapped or streaming IO
  • Writing .npy files
