A small, focused Clojure library for reading NumPy .npy files.
com.sturdystats/sturdy-numpy {:mvn/version "VERSION"}
Supported input:

- .npy files (binary array format)
- Unsigned integers: u1, u2, u4
- Signed integers: i1, i2, i4, i8
- Floating point: f4, f8

Supported output:

- Clojure data (vector / vector-of-vectors)
- tech.v3.dataset (columnar, primitive-backed)

```clojure
(require '[sturdy.numpy :as np])

(let [f "test-resources/npy-fixtures/shape_2x3__dtype_u4.npy"]
  (np/npy->vec f))
;; => [[4294967291 12 29] [46 63 80]]
```
These are
tech.v3.dataset

```clojure
(require '[sturdy.numpy :as np])

(let [f "test-resources/npy-fixtures/shape_2x3__dtype_u4.npy"]
  (np/npy->dataset f))
;; => _unnamed [2 3]:
;; |        :c1 | :c2 | :c3 |
;; |-----------:|----:|----:|
;; | 4294967291 |  12 |  29 |
;; |         46 |  63 |  80 |
```
Columns are named :c1, :c2, … (starting at :c1), reflecting the order in the original file.

This is the recommended entry point for:
dtype

```clojure
(require '[sturdy.numpy :as np])
(require '[tech.v3.dataset :as ds])
(require '[tech.v3.datatype :as dtype])

(let [f "test-resources/npy-fixtures/shape_2x3__dtype_u4.npy"
      dataset (np/npy->dataset f)]
  (map dtype/elemwise-datatype (ds/columns dataset)))
;; => (:uint32 :uint32 :uint32)
```
```clojure
(require '[sturdy.numpy :as np])
(require '[tech.v3.dataset :as ds])
(require '[tech.v3.datatype :as dtype])

(defn get-dtype [dataset]
  (-> (ds/columns dataset) first dtype/elemwise-datatype))

(defn test-file [dtype fname]
  (let [dataset (np/npy->dataset fname)]
    {:expected dtype
     :actual (get-dtype dataset)}))

(for [tp [:u1 :u2 :u4 :i1 :i2 :i4 :i8 :f4 :f8]]
  (let [fname (format "test-resources/npy-fixtures/shape_2x3__dtype_%s.npy"
                      (name tp))]
    (test-file tp fname)))
;; => ({:expected :u1, :actual :uint8}
;;     {:expected :u2, :actual :uint16}
;;     {:expected :u4, :actual :uint32}
;;     {:expected :i1, :actual :int8}
;;     {:expected :i2, :actual :int16}
;;     {:expected :i4, :actual :int32}
;;     {:expected :i8, :actual :int64}
;;     {:expected :f4, :actual :float32}
;;     {:expected :f8, :actual :float64})
```
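The expected/actual pairs above amount to a lookup from NumPy type codes to column datatype keywords. As a convenience, that mapping could be written down as plain data; `npy-code->datatype` is a hypothetical name used for illustration and is not part of sturdy.numpy.

```clojure
;; Hypothetical lookup table (not part of sturdy.numpy), transcribed from
;; the expected/actual pairs shown above.
(def npy-code->datatype
  {:u1 :uint8,   :u2 :uint16, :u4 :uint32
   :i1 :int8,    :i2 :int16,  :i4 :int32, :i8 :int64
   :f4 :float32, :f8 :float64})

(npy-code->datatype :u4)
;; => :uint32
```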
```clojure
(require '[sturdy.numpy :as np])

(np/npy->primitive "shape_2x3__dtype_u4.npy")
;; => {:shape [2 3]
;;     :dtype :u4
;;     :fortran? false
;;     :data #<long[]>}
```
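To show how such a result might be consumed, here is a minimal sketch of element lookup. It assumes `:data` holds the elements as one flat array in file order, which the `:shape` and `:fortran?` keys suggest but the text above does not state outright. `element-at` and `example` are hypothetical names, and the map is built by hand rather than read from a file.

```clojure
(defn element-at
  "Hypothetical helper (not part of sturdy.numpy): returns element [r c]
  of a 2D result, assuming :data is flat and in file order."
  [{:keys [shape fortran? data]} r c]
  (let [[n-rows n-cols] shape
        idx (if fortran?
              (+ r (* c n-rows))    ; column-major (order='F')
              (+ (* r n-cols) c))]  ; row-major (order='C')
    (nth data idx)))

;; Hand-built stand-in for the map shown above.
(def example
  {:shape [2 3] :dtype :u4 :fortran? false
   :data (long-array [4294967291 12 29 46 63 80])})

(element-at example 1 2)
;; => 80
```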
This API is intended for:
For some workflows (e.g. downstream databases that support list or array types), it can be useful to represent each row of a 2D NumPy array as a single list-valued column rather than as many scalar columns.
The function npy->dataset-rowlists provides this representation:
```clojure
(require '[sturdy.numpy :as np])

(np/npy->dataset-rowlists "test-resources/npy-fixtures/shape_2x3__dtype_i4.npy")
;; => _unnamed [2 1]:
;; | :c1        |
;; |------------|
;; | [-5 12 29] |
;; | [46 63 80] |

(np/npy->dataset-rowlists "test-resources/npy-fixtures/shape_2x3__dtype_f4.npy")
;; => _unnamed [2 1]:
;; | :c1               |
;; |-------------------|
;; | [-3.0 -1.75 -0.5] |
;; | [0.75 2.0 3.25]   |
```
Each row is represented as a zero-copy buffer view over the underlying array data.
Notes:
- Only 2D .npy files are supported.
- The resulting column has dtype :object (each cell is a buffer), which may not be accepted by all ingestion paths.

UNNESTed Datasets

For very sparse arrays, a column-oriented dataset with one column per NumPy column can be inefficient: most values are zero, and downstream systems often want a sparse or list-based representation anyway.
The function npy->dataset-unnested-nz provides an alternative representation inspired by SQL UNNEST / long-form tables.
Instead of producing one column per NumPy column, it produces a row-wise sparse representation with three columns:
- :row_no is the row index (0-based, int64)
- :col_no is the column index (0-based, int16)
- :val is the value (primitive-backed, dtype preserved)

Only non-zero entries are emitted.

```clojure
(require '[sturdy.numpy :as np])

(np/npy->dataset-unnested-nz "test-resources/npy-fixtures/shape_2x3__dtype_u4.npy")
;; => _unnamed [6 3]:
;; | :row_no | :col_no |       :val |
;; |--------:|--------:|-----------:|
;; |       0 |       0 | 4294967291 |
;; |       0 |       1 |         12 |
;; |       0 |       2 |         29 |
;; |       1 |       0 |         46 |
;; |       1 |       1 |         63 |
;; |       1 |       2 |         80 |
```
For Fortran-order (order='F') .npy files, the physical order of rows differs, but (row_no, col_no) are computed correctly:
```clojure
(require '[sturdy.numpy :as np])

(np/npy->dataset-unnested-nz "test-resources/npy-fixtures/shape_2x3__dtype_u4__order_F.npy")
;; => _unnamed [6 3]:
;; | :row_no | :col_no |       :val |
;; |--------:|--------:|-----------:|
;; |       0 |       0 | 4294967291 |
;; |       1 |       0 |         46 |
;; |       0 |       1 |         12 |
;; |       1 |       1 |         63 |
;; |       0 |       2 |         29 |
;; |       1 |       2 |         80 |
```
The row order is not significant; consumers should treat the dataset as an unordered collection of (row, col, val) triples.
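The difference in physical order follows from standard row-major versus column-major index arithmetic. The sketch below illustrates that arithmetic only; `flat->row-col` is a hypothetical helper, not part of sturdy.numpy, and the library's internals may differ.

```clojure
(defn flat->row-col
  "Hypothetical helper: maps flat index i to [row col] for a 2D array
  with n-rows x n-cols elements."
  [n-rows n-cols fortran? i]
  (if fortran?
    [(rem i n-rows) (quot i n-rows)]    ; column-major: rows vary fastest
    [(quot i n-cols) (rem i n-cols)]))  ; row-major: columns vary fastest

;; Fortran order for a 2x3 array: same sequence as the table above.
(map #(flat->row-col 2 3 true %) (range 6))
;; => ([0 0] [1 0] [0 1] [1 1] [0 2] [1 2])

;; C order for the same shape.
(map #(flat->row-col 2 3 false %) (range 6))
;; => ([0 0] [0 1] [0 2] [1 0] [1 1] [1 2])
```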
This representation is especially useful when:
For example, in DuckDB you might do:
```sql
SELECT
  row_no,
  list(col_no ORDER BY col_no) AS inds,
  list(val ORDER BY col_no) AS vals
FROM staging
GROUP BY row_no;
```
This yields a compact per-row sparse representation suitable for downstream modeling or analytics.
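For small data, the same grouping can be sketched in plain Clojure over a sequence of row maps. `rows->sparse` is a hypothetical helper, not part of sturdy.numpy; it sorts by `:row_no` only to make the output deterministic, whereas the SQL query makes no such guarantee.

```clojure
(defn rows->sparse
  "Hypothetical helper: groups {:row_no :col_no :val} maps into one
  map per row, with column indices and values ordered by :col_no."
  [triples]
  (->> triples
       (group-by :row_no)
       (map (fn [[row-no entries]]
              (let [sorted (sort-by :col_no entries)]
                {:row_no row-no
                 :inds   (mapv :col_no sorted)
                 :vals   (mapv :val sorted)})))
       (sort-by :row_no)))

;; Triples in the Fortran-order sequence shown earlier.
(rows->sparse
 [{:row_no 0 :col_no 0 :val 4294967291}
  {:row_no 1 :col_no 0 :val 46}
  {:row_no 0 :col_no 1 :val 12}
  {:row_no 1 :col_no 1 :val 63}
  {:row_no 0 :col_no 2 :val 29}
  {:row_no 1 :col_no 2 :val 80}])
;; => ({:row_no 0, :inds [0 1 2], :vals [4294967291 12 29]}
;;     {:row_no 1, :inds [0 1 2], :vals [46 63 80]})
```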
- col_no is stored as int16 (columns < 32k)
- row_no is stored as int64 (supports millions of rows)

This makes npy->dataset-unnested-nz significantly more memory-efficient than dense columnar datasets when sparsity is high.
Values treated as zero:

- 0 for integer types
- 0.0 / -0.0 for floating-point types (exact comparison)

.npy files
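The exact-comparison rule for zero can be illustrated with Clojure's built-in `zero?`. This is only a sketch of the rule as stated; `nonzero-value?` is a hypothetical name, and the library may implement the check differently.

```clojure
(defn nonzero-value?
  "Hypothetical predicate: true unless x compares exactly equal to zero."
  [x]
  (not (zero? x)))

(nonzero-value? 0)        ;; => false
(nonzero-value? 0.0)      ;; => false
(nonzero-value? -0.0)     ;; => false (negative zero still counts as zero)
(nonzero-value? 1.0E-300) ;; => true  (tiny values are kept, no tolerance)
```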