Proximum

⚠️ Early Beta: Proximum is under active development. APIs may change before 1.0 release. Feedback welcome!

📋 Help shape Proximum! We'd love your input. Please fill out our 2-minute feedback survey.

A high-performance, embeddable vector database for Clojure and Java with Git-like versioning and zero-cost branching.

Why Proximum?

Unlike traditional vector databases, Proximum brings persistent data structure semantics to vector search:

✨ Time Travel: Query any historical snapshot
🌿 Zero-Cost Branching: Fork indices for experiments without copying data
🔒 Immutability: All operations return new versions, enabling safe concurrency
💾 True Persistence: Durable storage with structural sharing
🚀 High Performance: SIMD-accelerated search with competitive recall
📦 Pure JVM: No native dependencies; runs on Linux, macOS, and Windows

Perfect for RAG applications, semantic search, and ML experimentation where you need to track versions, A/B test embeddings, or maintain reproducible search results.

Installation

Clojure (deps.edn)

{:deps {org.replikativ/proximum {:mvn/version "LATEST"}}}

Leiningen (project.clj)

[org.replikativ/proximum "LATEST"]

Maven

<dependency>
  <groupId>org.replikativ</groupId>
  <artifactId>proximum</artifactId>
  <version>LATEST</version>
</dependency>

Gradle

implementation 'org.replikativ:proximum:LATEST'

Design Rationale: Why Not JVector?

You might wonder: Why build Proximum instead of extending JVector? After all, JVector (DataStax) is a mature HNSW implementation with quantization support.

Proximum aims for peak performance (see the benchmarks below) while adding Git-like versioning. The two architectures serve different goals: JVector optimizes for quantization at scale; Proximum optimizes for versioning with structural sharing.

Where JVector leads: Quantization (PQ/BQ/NVQ) enables larger-than-memory indices (>10M vectors) by keeping compressed vectors in RAM and correcting with full-resolution reads. This is on Proximum's roadmap.
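
To make the "compressed in RAM, corrected with full-resolution reads" idea concrete, here is a minimal two-pass sketch. It uses plain sign-bit quantization as a stand-in for PQ/BQ/NVQ, keeps the full vectors in an in-memory array instead of reading them from disk, and supports at most 64 dimensions. The class and method names are made up for illustration; this is neither JVector's nor Proximum's implementation.

```java
import java.util.Comparator;
import java.util.stream.IntStream;

// Illustrative two-pass search: coarse ranking on compressed codes, exact rerank on a shortlist.
final class TwoPassSearch {
    // Pack the sign of each component into one bit (up to 64 dimensions).
    static long signBits(float[] v) {
        long bits = 0L;
        for (int i = 0; i < v.length; i++) {
            if (v[i] > 0f) bits |= 1L << i;
        }
        return bits;
    }

    static float squaredDistance(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            float d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Pass 1: rank every vector by Hamming distance between codes, keep `shortlist` ids.
    // Pass 2: rerank the shortlist by exact distance on the full vectors, return the top k ids.
    static int[] search(float[][] vectors, long[] codes, float[] query, int shortlist, int k) {
        long q = signBits(query);
        return IntStream.range(0, vectors.length)
            .boxed()
            .sorted(Comparator.comparingInt((Integer i) -> Long.bitCount(codes[i] ^ q)))
            .limit(shortlist)
            .sorted(Comparator.comparingDouble((Integer i) -> squaredDistance(vectors[i], query)))
            .limit(k)
            .mapToInt(Integer::intValue)
            .toArray();
    }

    public static void main(String[] args) {
        float[][] vectors = { {1f, 1f}, {1f, 0.9f}, {-1f, -1f}, {0.5f, 2f} };
        long[] codes = new long[vectors.length];
        for (int i = 0; i < vectors.length; i++) codes[i] = signBits(vectors[i]);
        int[] top = search(vectors, codes, new float[] {0.9f, 1f}, 3, 2);
        System.out.println(java.util.Arrays.toString(top));
    }
}
```

The first pass touches only one `long` per vector, which is what lets the compressed representation stay in RAM; only the shortlist needs the full-precision data.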

Architectural Comparison

Aspect	JVector	Proximum
Edge Storage	Per-node `Neighbors` objects	Chunked `int[][]` arrays
Fork Cost	O(nodes) — copy all neighbors	O(chunks) — shallow array clone
Versioning	None (write-once)	Git-like (commits, branches, merge)
Persistence	File-centric, single backend	Pluggable (memory, file, S3, etc.)
Structural Sharing	No	Yes (chunk-level copy-on-write)
Quantization	Yes (PQ/BQ/NVQ)	Roadmap

The Chunking Difference

This is the key architectural decision that enables git-like features:

// JVector: Each node has its own neighbor array
DenseIntMap<Neighbors> neighbors;  // ~1M objects for 1M nodes

// Proximum: Nodes are grouped into chunks
int[][] layer0;  // ~1000 chunks for 1M nodes (1024 nodes/chunk)

Why this matters for branching:

// JVector "fork" would require:
for (int i = 0; i < 1_000_000; i++) {
    newNeighbors[i] = oldNeighbors[i].copy();  // O(nodes)
}

// Proximum fork:
int[][] newLayer0 = layer0.clone();  // O(chunks) — shallow clone
// Only ~1000 array references copied, data is shared
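
The snippets above are schematic. A small runnable model of the same idea might look like the following; it is a sketch under simplifying assumptions (one `int` per node instead of a neighbor list, hypothetical class and method names) and not Proximum's actual storage code.

```java
import java.util.Arrays;

// Toy model of chunk-level copy-on-write; names are illustrative only.
final class ChunkedEdges {
    static final int CHUNK_SIZE = 1024; // nodes per chunk, as in the text above

    final int[][] chunks;

    ChunkedEdges(int nodeCount) {
        int chunkCount = (nodeCount + CHUNK_SIZE - 1) / CHUNK_SIZE;
        this.chunks = new int[chunkCount][CHUNK_SIZE];
    }

    private ChunkedEdges(int[][] chunks) {
        this.chunks = chunks;
    }

    // Fork: copy only the outer array of references; every chunk stays shared.
    ChunkedEdges fork() {
        return new ChunkedEdges(chunks.clone());
    }

    // Write: copy the single affected chunk and leave all other chunks shared.
    ChunkedEdges with(int node, int value) {
        int[][] next = chunks.clone();
        int c = node / CHUNK_SIZE;
        next[c] = Arrays.copyOf(chunks[c], CHUNK_SIZE);
        next[c][node % CHUNK_SIZE] = value;
        return new ChunkedEdges(next);
    }

    int get(int node) {
        return chunks[node / CHUNK_SIZE][node % CHUNK_SIZE];
    }

    public static void main(String[] args) {
        ChunkedEdges base = new ChunkedEdges(1_000_000);
        ChunkedEdges branch = base.fork().with(5, 42);
        System.out.println("chunks: " + base.chunks.length);
        System.out.println("base=" + base.get(5) + " branch=" + branch.get(5));
        System.out.println("untouched chunk shared: " + (base.chunks[1] == branch.chunks[1]));
    }
}
```

A write on the branch copies one 1024-entry chunk; the original index still sees its old chunk, and every untouched chunk is the same array object in both versions.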

What Would JVector Need?

Adding git-like features to JVector would require:

Chunk-based edge storage — Replace DenseIntMap<Neighbors> with chunked arrays
Copy-on-write semantics — Track dirty chunks for incremental persistence
Versioning system — Commit graph, branch tracking, merge logic
Storage abstraction — Content-addressable chunks, pluggable backends
Immutable API — fork() method, transient/persistent modes

These are fundamental architectural changes, not additions. JVector's CAS-based concurrent mutation model optimizes for throughput without versioning overhead.

Different Use Cases

JVector Shines At	Proximum Shines At
Very large indices (>10M) with quantization	Knowledge management with history
Memory-constrained deployments	A/B testing embedding models
Simple API, single version	Audit trails and compliance
Java-centric production systems	Clojure persistent data structures

Integration Possibilities

Proximum could integrate with JVector at the algorithm layer:

SIMD distance computation (Panama Vector API)
PQ/BQ/NVQ compression algorithms
Search patterns (two-pass, reranking)

But the versioning layer requires Proximum's specialized architecture.

Lessons from Lucene

We successfully added git-like features to Lucene's HNSW implementation because Lucene's architecture already had:

Segment-based storage (natural chunking)
Read-only index readers (immutable snapshots)
Point-in-time search semantics

JVector lacks these primitives. Proximum builds them from the ground up.

Key Features

🔄 Versioning & Time Travel

Every sync() creates a commit. Query any historical state:

index = index.sync().get();  // Snapshot 1 (blocks until persisted)
// ... make changes ...
index = index.sync().get();  // Snapshot 2

// Time travel to earlier state
ProximumVectorStore historical = index.asOf(commitId);

Use Cases: Audit trails, debugging, A/B testing, reproducible results

🌿 Zero-Cost Branching

Fork an index for experiments without copying data:

index = index.sync().get();
ProximumVectorStore experiment = index.branch("new-model");

// Test different embeddings
experiment = experiment.add(newEmbedding, "doc-1");

// Merge or discard - original unchanged

Use Cases: A/B testing, staging, parallel experiments

🔍 Advanced Features

Filtered Search: Multi-tenant search with ID filtering
Metadata: Attach arbitrary metadata to vectors
Compaction: Reclaim space from deleted vectors
Garbage Collection: Clean up unreachable commits
Crypto-Hash: Tamper-proof audit trail with SHA-512
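
As a rough illustration of how a SHA-512 hash chain can make history tamper-evident, the sketch below derives each commit id from its parent id and its payload, then re-derives the chain to verify it. The commit layout and all names here are assumptions for the example; Proximum's real commit format is described in the Cryptographic Auditability guide.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

// Illustrative hash chain; not Proximum's actual commit format.
final class CommitChain {
    // Commit id = SHA-512(parent id bytes followed by payload bytes), hex-encoded.
    static String commitId(String parentId, byte[] payload) {
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-512");
            sha.update(parentId.getBytes(StandardCharsets.UTF_8));
            sha.update(payload);
            return HexFormat.of().formatHex(sha.digest());
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-512 provider missing", e);
        }
    }

    // Build the chain of ids; the first commit uses the empty string as its parent.
    static String[] build(byte[][] payloads) {
        String[] ids = new String[payloads.length];
        String parent = "";
        for (int i = 0; i < payloads.length; i++) {
            ids[i] = commitId(parent, payloads[i]);
            parent = ids[i];
        }
        return ids;
    }

    // Recompute every id from the payloads and compare with the recorded ids.
    static boolean verify(String[] ids, byte[][] payloads) {
        String parent = "";
        for (int i = 0; i < ids.length; i++) {
            if (!commitId(parent, payloads[i]).equals(ids[i])) return false;
            parent = ids[i];
        }
        return true;
    }

    public static void main(String[] args) {
        byte[][] payloads = {
            "add doc-1".getBytes(StandardCharsets.UTF_8),
            "add doc-2".getBytes(StandardCharsets.UTF_8),
            "delete doc-1".getBytes(StandardCharsets.UTF_8)
        };
        String[] ids = build(payloads);
        System.out.println("intact:   " + verify(ids, payloads));
        payloads[1] = "add doc-X".getBytes(StandardCharsets.UTF_8); // tamper with history
        System.out.println("tampered: " + verify(ids, payloads));
    }
}
```

Because each id depends on its parent, changing any earlier payload invalidates every later id, so a single trusted head hash is enough to check the whole history.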

⚡ Async Operations

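Storage operations are non-blocking: each call returns immediately with a pending result (a core.async channel in Clojure, a CompletableFuture in Java), so the caller is not stalled by I/O.

The following operations are asynchronous: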

sync! / sync() - Persist changes and create commit
flush! / flush() - Force pending writes to storage
gc! / gc() - Garbage collect unreachable commits
close! / close() - Release resources (mmap, file handles)

Clojure - Blocking:

(require '[clojure.core.async :as a])

;; Wait for completion
(let [idx2 (a/<!! (prox/sync! idx))]
  ;; idx2 is the updated index
  (prox/search idx2 query 5))

;; Chain operations
(-> idx
    (prox/insert vector "id")
    (prox/sync!)
    (a/<!!)
    (prox/close!)
    (a/<!!))

Clojure - Async composition:

(a/go
  (let [idx2 (a/<! (prox/sync! idx))
        idx3 (a/<! (prox/flush! idx2))]
    (a/<! (prox/close! idx3))))

Java - Blocking:

// Block with .get() (throws checked exceptions) or .join() (throws unchecked CompletionException)
store = store.sync().get();
store = store.flush().join();
store.close().get();

Java - Async chaining:

store.sync()
     .thenCompose(s -> s.flush())
     .thenCompose(s -> s.close())
     .get();  // Only block at the end
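
The chaining pattern can be tried without a real index. The sketch below uses a hypothetical `DemoStore` stand-in (not part of Proximum) whose operations return `CompletableFuture`, and shows both sequential composition with `thenCompose` and a fallback with `exceptionally`.

```java
import java.io.IOException;
import java.util.concurrent.CompletableFuture;

// Stand-in for an immutable store with async operations; the real API may differ.
final class DemoStore {
    final int version;
    final boolean closed;

    DemoStore(int version, boolean closed) {
        this.version = version;
        this.closed = closed;
    }

    // Each operation completes asynchronously with a new immutable value.
    CompletableFuture<DemoStore> sync() {
        return CompletableFuture.supplyAsync(() -> new DemoStore(version + 1, closed));
    }

    CompletableFuture<DemoStore> close() {
        return CompletableFuture.supplyAsync(() -> new DemoStore(version, true));
    }

    // Simulates a storage failure to show error handling in a chain.
    static CompletableFuture<DemoStore> failingSync() {
        return CompletableFuture.failedFuture(new IOException("simulated storage failure"));
    }

    public static void main(String[] args) {
        DemoStore done = new DemoStore(0, false)
            .sync()
            .thenCompose(DemoStore::sync)   // runs after the first sync completes
            .thenCompose(DemoStore::close)
            .join();                        // block once, at the end
        System.out.println("version=" + done.version + " closed=" + done.closed);

        DemoStore recovered = failingSync()
            .exceptionally(t -> new DemoStore(-1, true)) // fallback value on failure
            .join();
        System.out.println("recovered version=" + recovered.version);
    }
}
```

`thenCompose` is used rather than `thenApply` because each step itself returns a future; `thenApply` would produce a nested `CompletableFuture<CompletableFuture<...>>`.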

Integrations

Spring AI

import org.replikativ.proximum.spring.ProximumVectorStore;

@Bean
public VectorStore vectorStore() {
    return ProximumVectorStore.builder()
        .dimensions(1536)
        .storagePath("/data/vectors")
        .build();
}

📖 Spring AI Integration Guide | Spring Boot RAG Example

LangChain4j

import org.replikativ.proximum.langchain4j.ProximumEmbeddingStore;

EmbeddingStore<TextSegment> store = ProximumEmbeddingStore.builder()
    .dimensions(1536)
    .storagePath("/data/embeddings")
    .build();

📖 LangChain4j Integration Guide

Performance

SIFT-1M (1M vectors, 128-dim, Intel Core Ultra 7):

# clj -M:benchmark -m runner sift1m

...

Library                   Insert (vec/s)  Search QPS   p50 (us)   p99 (us)   Recall@10 
--------------------------------------------------------------------------------------
proximum                  13392           3844         264.1      461.0      98.63%
jvector                   9771            3609         277.4      485.2      95.95%
lucene-hnsw               2395            3036         340.5      467.1      98.53%
hnswlib-java              4260            1007         1033.3     1377.4     98.29%
datalevin/usearch         2492            3616         268.1      375.1      96.96%

Proximum metrics:

Storage: 762.8 MB
Heap usage: 545.7 MB

# clj -M:benchmark -m runner dbpedia-openai-100k

...

Library                   Insert (vec/s)  Search QPS   p50 (us)   p99 (us)   Recall@10 
--------------------------------------------------------------------------------------
proximum                  3726            1648         621.3      874.2      99.10%
jvector                   4267            1795         570.0      798.0      98.65%
lucene-hnsw               969             1306         787.2      1113.7     99.28%
hnswlib-java              2070            609          1688.2     2230.4     99.22%
datalevin/usearch         941             1364         700.5      955.1      98.76%

# clj -M:benchmark -m runner glove100

...

Library                   Insert (vec/s)  Search QPS   p50 (us)   p99 (us)   Recall@10 
--------------------------------------------------------------------------------------
proximum                  7437            2604         382.6      638.6      81.93%
jvector                   7613            2620         361.9      691.4      70.14%
lucene-hnsw               1500            2031         482.9      819.2      81.35%
hnswlib-java              2779            722          1380.9     2236.5     80.91%
datalevin/usearch         1455            2424         403.8      661.8      75.67%
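
For readers unfamiliar with the Recall@10 column: it is conventionally the fraction of the true 10 nearest neighbors that the index actually returns. A minimal sketch of that metric follows, assuming the benchmark runner uses this standard definition (not verified against its source) and that result ids contain no duplicates.

```java
import java.util.HashSet;
import java.util.Set;

// Recall@k = |returned top-k ∩ true top-k| / |true top-k|
final class RecallAtK {
    static double recall(int[] returned, int[] groundTruth, int k) {
        Set<Integer> truth = new HashSet<>();
        for (int i = 0; i < Math.min(k, groundTruth.length); i++) {
            truth.add(groundTruth[i]);
        }
        if (truth.isEmpty()) return 0.0;
        int hits = 0;
        for (int i = 0; i < Math.min(k, returned.length); i++) {
            if (truth.contains(returned[i])) hits++;
        }
        return (double) hits / truth.size();
    }

    public static void main(String[] args) {
        int[] returned = {1, 2, 3, 4};
        int[] exact = {1, 2, 5, 6};
        System.out.println(recall(returned, exact, 4));
    }
}
```

Benchmarks typically average this value over all queries in the test set.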

Key features:

Pure JVM with SIMD acceleration (Java Vector API)
No native dependencies, works on all platforms
Persistent storage with zero-cost branching

Documentation

API Guides:

Clojure Guide - Complete Clojure API with collection protocols
Java Guide - Builder pattern, immutability, and best practices

Integration Guides:

Spring AI Guide - Spring Boot RAG applications
LangChain4j Guide - LangChain4j embedding store integration

Advanced Topics:

Cryptographic Auditability - Tamper-proof commit hashing and verification
Persistence Design - Internal persistence mechanisms (PES, VectorStorage, PSS)

Examples:

Spring Boot RAG Example - Full-featured RAG application with versioning

Examples

Browse working examples in examples/:

Clojure: Semantic search, RAG, collection protocols
Java: Quick start, auditable index, metadata usage

Demo Projects:

Einbetten: Wikipedia semantic search with Datahike + FastEmbed (2,000 articles, ~8,000 chunks)

Requirements

Java: 22+ (the Foreign Function & Memory API was finalized in Java 22)
OS: Linux, macOS, Windows
CPU: AVX2 recommended, AVX-512 for best performance

JVM Options Required:

--add-modules=jdk.incubator.vector
--enable-native-access=ALL-UNNAMED
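
If you want to fail fast when these options are missing, a startup check along the following lines may help. It uses only standard JDK calls; the class name is hypothetical, and whether Proximum performs a similar check itself is not stated here.

```java
// Startup sanity check for the runtime requirements listed above.
final class RuntimeCheck {
    // True if the named module was resolved into the boot layer
    // (incubator modules are only resolved when requested via --add-modules).
    static boolean modulePresent(String name) {
        return ModuleLayer.boot().findModule(name).isPresent();
    }

    // True if the running JVM's feature release is at least `feature`.
    static boolean javaAtLeast(int feature) {
        return Runtime.version().feature() >= feature;
    }

    public static void main(String[] args) {
        System.out.println("Java 22+: " + javaAtLeast(22));
        System.out.println("Vector API module resolved: " + modulePresent("jdk.incubator.vector"));
    }
}
```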

License

Apache-2.0 (Apache License, Version 2.0) - see LICENSE

Contributing

We welcome contributions! See CONTRIBUTING.md for:

Code of conduct
Development workflow
Testing requirements
Licensing (DCO/Apache-2.0)

Support

Issues: GitHub Issues
Discussions: GitHub Discussions
Commercial Support: contact@datahike.io
