Proximum

⚠️ Early Beta: Proximum is under active development. APIs may change before 1.0 release. Feedback welcome!

📋 Help shape Proximum! We'd love your input. Please fill out our 2-minute feedback survey.

A high-performance, embeddable vector database for Clojure and Java with Git-like versioning and zero-cost branching.

Why Proximum?

Unlike traditional vector databases, Proximum brings persistent data structure semantics to vector search:

  • Time Travel: Query any historical snapshot
  • 🌿 Zero-Cost Branching: Fork indices for experiments without copying data
  • 🔒 Immutability: All operations return new versions, enabling safe concurrency
  • 💾 True Persistence: Durable storage with structural sharing
  • 🚀 High Performance: SIMD-accelerated search with competitive recall
  • 📦 Pure JVM: No native dependencies, works everywhere

Perfect for RAG applications, semantic search, and ML experimentation where you need to track versions, A/B test embeddings, or maintain reproducible search results.


Quick Start

Clojure

(require '[proximum.core :as prox])

;; Identifier for the underlying storage; generate your own with (random-uuid)
(def store-id #uuid "465df026-fcd3-4cb3-be44-29a929776250")

;; Create an index - feels like Clojure!
(def idx (prox/create-index {:type :hnsw
                              :dim 384
                              :store-config {:backend :memory
                                             :id store-id}
                              :capacity 10000}))

;; Use collection protocols
(def idx2 (assoc idx "doc-1" (float-array (repeatedly 384 rand))))
(def idx3 (assoc idx2 "doc-2" (float-array (repeatedly 384 rand))))

;; Search for nearest neighbors
(def results (prox/search idx3 (float-array (repeatedly 384 rand)) 5))
; => ({:id "doc-1", :distance 0.234} {:id "doc-2", :distance 0.456} ...)

;; Git-like branching (sync! is async - returns channel)
(require '[clojure.core.async :as a])
(def synced (a/<!! (prox/sync! idx3)))  ; Block until persisted
(def experiment (prox/branch! synced "experiment"))

📖 Full Clojure Guide

Java

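import org.replikativ.proximum.*;
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Create index with builder pattern
try (ProximumVectorStore initial = ProximumVectorStore.builder()
        .dimensions(384)
        .storagePath("/tmp/vectors")
        .build()) {

    // A try-with-resources variable is implicitly final and cannot be reassigned,
    // so newer versions are tracked in a separate variable
    ProximumVectorStore store = initial;

    // Add vectors (immutable - returns new store)
    store = store.add(embedding1, "doc-1");
    store = store.add(embedding2, "doc-2");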

    // Search for nearest neighbors
    List<SearchResult> results = store.search(queryVector, 5);
    // => [SearchResult{id=doc-1, distance=0.234}, ...]

    // Git-like versioning (sync() is async - returns CompletableFuture)
    store = store.sync().get();  // Block until persisted
    UUID snapshot1 = store.getCommitId();

    store = store.add(embedding3, "doc-3");
    store = store.sync().get();

    // Time travel: Query historical state
    ProximumVectorStore historical = ProximumVectorStore.connectCommit(
        Map.of("backend", ":file", "path", "/tmp/vectors"), snapshot1);
    historical.search(queryVector, 5);  // Only sees doc-1, doc-2!

    // Branch for experiments
    ProximumVectorStore experiment = store.branch("experiment");
}

📖 Full Java Guide


Installation

Clojure (deps.edn)

{:deps {org.replikativ/proximum {:mvn/version "LATEST"}}}

Leiningen (project.clj)

[org.replikativ/proximum "LATEST"]

Maven

<dependency>
  <groupId>org.replikativ</groupId>
  <artifactId>proximum</artifactId>
  <version>LATEST</version>
</dependency>

Gradle

implementation 'org.replikativ:proximum:LATEST'

Design Rationale: Why Not JVector?

You might wonder: Why build Proximum instead of extending JVector? After all, JVector (DataStax) is a mature HNSW implementation with quantization support.

The Short Answer

Proximum aims at peak performance (see benchmarks below) while adding git-like versioning. The architectures serve different goals: JVector optimizes for quantization at scale; Proximum optimizes for versioning with structural sharing.

Where JVector leads: Quantization (PQ/BQ/NVQ) enables larger-than-memory indices (>10M vectors) by keeping compressed vectors in RAM and correcting with full-resolution reads. This is on Proximum's roadmap.

Architectural Comparison

Aspect              JVector                        Proximum
--------------------------------------------------------------------------------------
Edge Storage        Per-node Neighbors objects     Chunked int[][] arrays
Fork Cost           O(nodes): copy all neighbors   O(chunks): shallow array clone
Versioning          None (write-once)              Git-like (commits, branches, merge)
Persistence         File-centric, single backend   Pluggable (memory, file, S3, etc.)
Structural Sharing  No                             Yes (chunk-level copy-on-write)
Quantization        Yes (PQ/BQ/NVQ)                Roadmap

The Chunking Difference

This is the key architectural decision that enables git-like features:

// JVector: Each node has its own neighbor array
DenseIntMap<Neighbors> neighbors;  // ~1M objects for 1M nodes

// Proximum: Nodes are grouped into chunks
int[][] layer0;  // ~1000 chunks for 1M nodes (1024 nodes/chunk)

Why this matters for branching:

// JVector "fork" would require:
for (int i = 0; i < 1_000_000; i++) {
    newNeighbors[i] = oldNeighbors[i].copy();  // O(nodes)
}

// Proximum fork:
int[][] newLayer0 = layer0.clone();  // O(chunks) — shallow clone
// Only ~1000 array references copied, data is shared
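
The fork-then-write behavior described above can be sketched in plain Java. This is a toy model, not Proximum's actual classes: `ChunkedForkSketch`, `fork`, `setCopyOnWrite`, and the tiny chunk size are all assumptions made for illustration. It shows that a shallow clone copies only chunk references, and that a write touches only the one chunk it lands in.

```java
/** Toy model of chunked edge storage; names and sizes are illustrative only. */
public class ChunkedForkSketch {
    static final int CHUNK_SIZE = 4; // slots per chunk (the text above uses 1024 nodes/chunk)

    /** Fork: copy only the outer array of chunk references, O(chunks). */
    static int[][] fork(int[][] layer) {
        return layer.clone();
    }

    /** Write one slot, cloning the affected chunk first so other forks never see the change. */
    static void setCopyOnWrite(int[][] layer, int slot, int value) {
        int chunk = slot / CHUNK_SIZE;
        layer[chunk] = layer[chunk].clone();      // only this chunk is duplicated
        layer[chunk][slot % CHUNK_SIZE] = value;
    }

    public static void main(String[] args) {
        int[][] original = new int[3][CHUNK_SIZE]; // 3 chunks, 12 slots, all zero
        int[][] forked = fork(original);

        setCopyOnWrite(forked, 5, 42);             // slot 5 lives in chunk 1, offset 1

        System.out.println(original[1][1]);            // 0: original untouched
        System.out.println(forked[1][1]);              // 42
        System.out.println(original[0] == forked[0]);  // true: chunk 0 is still shared
        System.out.println(original[1] == forked[1]);  // false: chunk 1 was copied
    }
}
```

This sketch clones a chunk on every write for simplicity; a real implementation would presumably track which chunks a fork already owns and clone each one at most once.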

What Would JVector Need?

Adding git-like features to JVector would require:

  1. Chunk-based edge storage: Replace DenseIntMap<Neighbors> with chunked arrays
  2. Copy-on-write semantics: Track dirty chunks for incremental persistence
  3. Versioning system: Commit graph, branch tracking, merge logic
  4. Storage abstraction: Content-addressable chunks, pluggable backends
  5. Immutable API: fork() method, transient/persistent modes

These are fundamental architectural changes, not additions. JVector's CAS-based concurrent mutation model optimizes for throughput without versioning overhead.

Different Use Cases

JVector Shines At                            Proximum Shines At
--------------------------------------------------------------------------------------
Very large indices (>10M) with quantization  Knowledge management with history
Memory-constrained deployments               A/B testing embedding models
Simple API, single version                   Audit trails and compliance
Java-centric production systems              Clojure persistent data structures

Integration Possibilities

Proximum could integrate with JVector at the algorithm layer:

  • SIMD distance computation (Panama Vector API)
  • PQ/BQ/NVQ compression algorithms
  • Search patterns (two-pass, reranking)

But the versioning layer requires Proximum's specialized architecture.

Lessons from Lucene

We successfully added git-like features to Lucene's HNSW implementation because Lucene's architecture already had:

  • Segment-based storage (natural chunking)
  • Read-only index readers (immutable snapshots)
  • Point-in-time search semantics

JVector lacks these primitives. Proximum builds them from the ground up.


Key Features

🔄 Versioning & Time Travel

Every sync() creates a commit. Query any historical state:

index = index.sync().get();  // Snapshot 1 (blocks until persisted)
// ... make changes ...
index = index.sync().get();  // Snapshot 2

// Time travel to earlier state
ProximumVectorStore historical = index.asOf(commitId);

Use Cases: Audit trails, debugging, A/B testing, reproducible results
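
As a mental model for commits and time travel, the following self-contained sketch keeps an immutable snapshot per commit ID. Everything in it (`SnapshotLogSketch`, `add`, `commit`, `asOf`) is hypothetical and uses only the JDK. Unlike Proximum, it copies the whole map on each commit rather than sharing structure, so it illustrates the semantics only, not the cost model.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

/** Toy commit log: each commit freezes the current state under a fresh ID. */
public class SnapshotLogSketch {
    private final Map<String, float[]> working = new HashMap<>();
    private final Map<UUID, Map<String, float[]>> commits = new HashMap<>();

    void add(String id, float[] vector) {
        working.put(id, vector);
    }

    /** Full copy for simplicity; a structurally shared store would avoid this. */
    UUID commit() {
        UUID id = UUID.randomUUID();
        commits.put(id, Map.copyOf(working));
        return id;
    }

    /** Returns the snapshot recorded for the commit, or null if the ID is unknown. */
    Map<String, float[]> asOf(UUID commitId) {
        return commits.get(commitId);
    }

    public static void main(String[] args) {
        SnapshotLogSketch log = new SnapshotLogSketch();
        log.add("doc-1", new float[] {0.1f, 0.2f});
        UUID first = log.commit();

        log.add("doc-2", new float[] {0.3f, 0.4f});
        UUID second = log.commit();

        System.out.println(log.asOf(first).keySet());  // [doc-1]
        System.out.println(log.asOf(second).size());   // 2
    }
}
```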

🌿 Zero-Cost Branching

Fork an index for experiments without copying data:

index = index.sync().get();
ProximumVectorStore experiment = index.branch("new-model");

// Test different embeddings
experiment = experiment.add(newEmbedding, "doc-1");

// Merge or discard - original unchanged

Use Cases: A/B testing, staging, parallel experiments

🔍 Advanced Features

  • Filtered Search: Multi-tenant search with ID filtering
  • Metadata: Attach arbitrary metadata to vectors
  • Compaction: Reclaim space from deleted vectors
  • Garbage Collection: Clean up unreachable commits
  • Crypto-Hash: Tamper-proof audit trail with SHA-512
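
To see why hashing makes a history tamper-evident, here is a minimal hash-chain sketch using the JDK's `MessageDigest` with SHA-512. How Proximum actually derives its hashes is not described in this README, so `HashChainSketch`, `commitHash`, and the payload strings are assumptions for illustration only. Each commit hash covers the parent hash, so changing an early commit changes every later hash.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

/** Toy hash chain: hash(commit) = SHA-512(parent hash bytes followed by payload bytes). */
public class HashChainSketch {
    static String commitHash(String parentHash, byte[] payload) {
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-512");
            sha.update(parentHash.getBytes(StandardCharsets.UTF_8));
            sha.update(payload);
            return HexFormat.of().formatHex(sha.digest());
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-512 is not available in this JVM", e);
        }
    }

    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Honest history
        String c1 = commitHash("", utf8("add doc-1"));
        String c2 = commitHash(c1, utf8("add doc-2"));

        // Same second step, but the first commit was altered
        String c1Tampered = commitHash("", utf8("add doc-X"));
        String c2Tampered = commitHash(c1Tampered, utf8("add doc-2"));

        System.out.println(c1.length());           // 128 hex characters = 512 bits
        System.out.println(c2.equals(c2Tampered)); // false: the change propagates forward
    }
}
```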

⚡ Async Operations

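Storage operations are non-blocking: each call returns immediately with a handle (a core.async channel in Clojure, a CompletableFuture in Java) rather than waiting for I/O to finish.

Async operations (Clojure / Java):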

  • sync! / sync() - Persist changes and create commit
  • flush! / flush() - Force pending writes to storage
  • gc! / gc() - Garbage collect unreachable commits
  • close! / close() - Release resources (mmap, file handles)

Clojure - Blocking:

(require '[clojure.core.async :as a])

;; Wait for completion
(let [idx2 (a/<!! (prox/sync! idx))]
  ;; idx2 is the updated index
  (prox/search idx2 query 5))

;; Chain operations
(-> idx
    (prox/insert vector "id")
    (prox/sync!)
    (a/<!!)
    (prox/close!)
    (a/<!!))

Clojure - Async composition:

(a/go
  (let [idx2 (a/<! (prox/sync! idx))
        idx3 (a/<! (prox/flush! idx2))]
    (a/<! (prox/close! idx3))))

Java - Blocking:

// Block with .get() or .join()
store = store.sync().get();
store = store.flush().join();
store.close().get();

Java - Async chaining:

store.sync()
     .thenCompose(s -> s.flush())
     .thenCompose(s -> s.close())
     .get();  // Only block at the end
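
The chaining pattern can be tried without Proximum on the classpath. The sketch below uses a hypothetical `ToyStore` record as a stand-in whose `sync()` and `flush()` each return a `CompletableFuture` of a new value. It shows why `thenCompose` is the right combinator here: `thenApply` would produce a nested `CompletableFuture<CompletableFuture<ToyStore>>`.

```java
import java.util.concurrent.CompletableFuture;

public class AsyncChainSketch {
    /** Hypothetical stand-in for a store; each step completes asynchronously with a new value. */
    record ToyStore(String log) {
        CompletableFuture<ToyStore> sync() {
            return CompletableFuture.supplyAsync(() -> new ToyStore(log + "sync;"));
        }

        CompletableFuture<ToyStore> flush() {
            return CompletableFuture.supplyAsync(() -> new ToyStore(log + "flush;"));
        }
    }

    public static void main(String[] args) {
        ToyStore done = new ToyStore("")
                .sync()
                .thenCompose(ToyStore::flush) // flattens the future returned by flush()
                .join();                      // block once, at the end of the chain

        System.out.println(done.log());       // sync;flush;
    }
}
```

`join()` is used here instead of `get()` because it throws an unchecked `CompletionException`, whereas `get()` declares the checked `InterruptedException` and `ExecutionException`.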

Integrations

Spring AI

import org.replikativ.proximum.spring.ProximumVectorStore;

@Bean
public VectorStore vectorStore() {
    return ProximumVectorStore.builder()
        .dimensions(1536)
        .storagePath("/data/vectors")
        .build();
}

📖 Spring AI Integration Guide | Spring Boot RAG Example

LangChain4j

import org.replikativ.proximum.langchain4j.ProximumEmbeddingStore;

EmbeddingStore<TextSegment> store = ProximumEmbeddingStore.builder()
    .dimensions(1536)
    .storagePath("/data/embeddings")
    .build();

📖 LangChain4j Integration Guide


Performance

SIFT-1M (1M vectors, 128-dim, Intel Core Ultra 7):

# clj -M:benchmark -m runner sift1m

...

Library                   Insert (vec/s)  Search QPS   p50 (us)   p99 (us)   Recall@10 
--------------------------------------------------------------------------------------
proximum                  13392           3844         264.1      461.0      98.63%
jvector                   9771            3609         277.4      485.2      95.95%
lucene-hnsw               2395            3036         340.5      467.1      98.53%
hnswlib-java              4260            1007         1033.3     1377.4     98.29%
datalevin/usearch         2492            3616         268.1      375.1      96.96%

Proximum metrics:

  • Storage: 762.8 MB
  • Heap usage: 545.7 MB

dbpedia-openai-100k:

# clj -M:benchmark -m runner dbpedia-openai-100k

...

Library                   Insert (vec/s)  Search QPS   p50 (us)   p99 (us)   Recall@10 
--------------------------------------------------------------------------------------
proximum                  3726            1648         621.3      874.2      99.10%
jvector                   4267            1795         570.0      798.0      98.65%
lucene-hnsw               969             1306         787.2      1113.7     99.28%
hnswlib-java              2070            609          1688.2     2230.4     99.22%
datalevin/usearch         941             1364         700.5      955.1      98.76%

glove100:

# clj -M:benchmark -m runner glove100

...

Library                   Insert (vec/s)  Search QPS   p50 (us)   p99 (us)   Recall@10 
--------------------------------------------------------------------------------------
proximum                  7437            2604         382.6      638.6      81.93%
jvector                   7613            2620         361.9      691.4      70.14%
lucene-hnsw               1500            2031         482.9      819.2      81.35%
hnswlib-java              2779            722          1380.9     2236.5     80.91%
datalevin/usearch         1455            2424         403.8      661.8      75.67%
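
For readers unfamiliar with the Recall@10 column: it is assumed here to follow the usual definition, the fraction of the true 10 nearest neighbors that the index returns for a query, averaged over queries. The benchmark harness itself is not shown in this README, so the following per-query sketch (`RecallSketch`, `recallAtK`) is illustrative rather than the code that produced the numbers above.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class RecallSketch {
    /**
     * Recall@k for one query: |top-k returned ∩ top-k ground truth| / |top-k ground truth|.
     * Assumes both lists are ordered best-first, contain distinct IDs, and that
     * the ground truth is non-empty.
     */
    static double recallAtK(List<Integer> returned, List<Integer> groundTruth, int k) {
        Set<Integer> truth = new HashSet<>(groundTruth.subList(0, Math.min(k, groundTruth.size())));
        long hits = returned.stream().limit(k).filter(truth::contains).count();
        return (double) hits / truth.size();
    }

    public static void main(String[] args) {
        List<Integer> truth = List.of(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        List<Integer> got = List.of(1, 2, 3, 4, 5, 6, 7, 8, 9, 99); // one miss

        System.out.println(recallAtK(got, truth, 10)); // 0.9
    }
}
```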

Key features:

  • Pure JVM with SIMD acceleration (Java Vector API)
  • No native dependencies, works on all platforms
  • Persistent storage with zero-cost branching

Documentation

API Guides:

  • Clojure Guide - Complete Clojure API with collection protocols
  • Java Guide - Builder pattern, immutability, and best practices

Integration Guides:

Advanced Topics:

Examples:


Examples

Browse working examples in examples/:

  • Clojure: Semantic search, RAG, collection protocols
  • Java: Quick start, auditable index, metadata usage

Demo Projects:

  • Einbetten: Wikipedia semantic search with Datahike + FastEmbed (2,000 articles, ~8,000 chunks)

Requirements

  • Java: 22+ (the Foreign Function & Memory API was finalized in Java 22)
  • OS: Linux, macOS, Windows
  • CPU: AVX2 recommended, AVX-512 for best performance

JVM Options Required:

--add-modules=jdk.incubator.vector
--enable-native-access=ALL-UNNAMED
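
A small startup check can confirm that the JVM was launched as required. The sketch below uses only standard JDK calls (`Runtime.version()` and `ModuleLayer.boot().findModule`); the class and method names are made up for illustration and are not part of Proximum. Incubator modules are not resolved by default, so the second check is expected to be true only when `--add-modules=jdk.incubator.vector` was passed.

```java
public class JvmCheckSketch {
    /** True when running on Java 22 or newer. */
    static boolean javaVersionOk() {
        return Runtime.version().feature() >= 22;
    }

    /** True when the incubating Vector API module was resolved at startup. */
    static boolean vectorModulePresent() {
        return ModuleLayer.boot().findModule("jdk.incubator.vector").isPresent();
    }

    public static void main(String[] args) {
        System.out.println("Java 22+: " + javaVersionOk());
        System.out.println("jdk.incubator.vector resolved: " + vectorModulePresent());
    }
}
```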

License

Apache-2.0 (Apache License, Version 2.0) - see LICENSE


Contributing

We welcome contributions! See CONTRIBUTING.md for:

  • Code of conduct
  • Development workflow
  • Testing requirements
  • Licensing (DCO/Apache-2.0)

Support
