Scriptum

Copy-on-write branching for Apache Lucene. Git-like snapshot and branch semantics on full-text search indices with structural sharing.

Built on Lucene 10.3.2. Forking a branch takes 3-5ms regardless of index size by sharing immutable segment files.

Core Concepts

Branch: A COW overlay directory sharing base segments with trunk. Each branch has its own commit history.
Snapshot: An immutable reader at a specific commit generation. All commits are retained until explicit GC.
Fork: Creates a new branch by copying segment metadata only (not data). Near-instant regardless of index size.
GC: Explicit garbage collection of old snapshots, respecting branch references to shared segments.
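
To make the fork semantics concrete, here is a minimal in-memory sketch in plain Java. It is a toy model and not Scriptum's implementation: `ToyBranch` and its methods are hypothetical names, and a "segment" is just an immutable list of strings. It only illustrates that a fork copies the list of segment references while the segment data stays shared.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of copy-on-write forking (hypothetical; not Scriptum's classes).
final class ToyBranch {
    final String name;
    // Per-branch metadata: references to immutable segments.
    private final List<List<String>> segments;

    ToyBranch(String name) {
        this(name, new ArrayList<>());
    }

    private ToyBranch(String name, List<List<String>> segments) {
        this.name = name;
        this.segments = segments;
    }

    // Writes append a new immutable segment; existing segments are never modified.
    void addSegment(List<String> docs) {
        segments.add(List.copyOf(docs));
    }

    // Fork copies only the list of references, so its cost depends on the
    // number of segments, not on how many documents they hold.
    ToyBranch fork(String newName) {
        return new ToyBranch(newName, new ArrayList<>(segments));
    }

    int numDocs() {
        return segments.stream().mapToInt(List::size).sum();
    }

    // True if both branches point at the very same segment object at this index.
    boolean sharesSegment(ToyBranch other, int index) {
        return segments.get(index) == other.segments.get(index);
    }
}
```

In this model, a segment added to the fork is invisible to the parent, while segments that existed at fork time are the same objects in both branches.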

API Layers

Layer	Namespace	Use Case
Java	`org.replikativ.scriptum.BranchIndexWriter`	Direct Java usage
Core	`scriptum.core`	Low-level Clojure wrapper
Yggdrasil	`scriptum.yggdrasil`	High-level protocols

Clojure users: use `scriptum.yggdrasil` for the high-level API, or `scriptum.core` for lower-level control.

Java users: use `BranchIndexWriter` directly.

Getting Started

Dependencies

Add to deps.edn:

org.replikativ/scriptum {:mvn/version "0.1.1"}

For Maven (Gradle users can use the same coordinates):

<dependency>
  <groupId>org.replikativ</groupId>
  <artifactId>scriptum</artifactId>
  <version>0.1.1</version>
</dependency>

Build from Source

Java sources must be compiled before use:

clj -T:build compile-java

Quick Start (Clojure)

(require '[scriptum.core :as sc])

;; Create an index
(def writer (sc/create-index "/tmp/my-index"))

;; Add documents
(sc/add-doc writer {:title {:type :text :value "Hello World"}
                    :id    {:type :string :value "doc-1"}})
(sc/commit! writer "Initial commit")

;; Search
(sc/search writer {:match-all {}} 10)
;; => [{:title "Hello World", :id "doc-1", :score 1.0}]

;; Fork a branch
(def feature (sc/fork writer "experiment"))

;; Add to branch (doesn't affect main)
(sc/add-doc feature {:title {:type :text :value "Branch only"}
                     :id    {:type :string :value "doc-2"}})
(sc/commit! feature "Added experimental doc")

;; Main still has 1 doc, branch has 2
(count (sc/search writer {:match-all {}} 100))    ;; => 1
(count (sc/search feature {:match-all {}} 100))   ;; => 2

;; Merge branch back
(sc/merge-from! writer feature)
(sc/commit! writer "Merged experiment")

;; Cleanup
(sc/close! feature)
(sc/close! writer)

API Reference

Lifecycle

(sc/create-index path)              ; create new index at path
(sc/open-branch path branch-name)   ; open existing branch
(sc/fork writer "branch-name")      ; fast fork from writer
(sc/close! writer)                  ; close writer and release resources
(sc/discover-branches path)         ; => ["main" "feature" ...]

;; Accessors
(sc/num-docs writer)                ; document count (excluding deletions)
(sc/max-doc writer)                 ; document count (including deletions)
(sc/branch-name writer)             ; current branch name
(sc/base-path writer)               ; index base path
(sc/main-branch? writer)            ; true if this is the main branch

Document Operations

Field types: :text (analyzed, searchable), :string (exact match), :vector (float array for KNN).

(sc/add-doc writer {:title {:type :text :value "Searchable text"}
                    :tag   {:type :string :value "exact-match"}
                    :embed {:type :vector :value (float-array [0.1 0.2 0.3])
                            :dims 3}})

(sc/delete-docs writer :id "doc-1")           ; delete by field+value
(sc/update-doc writer :id "doc-1" new-fields) ; atomic delete+add

Commit & History

(sc/commit! writer "commit message")    ; persist changes
(sc/flush! writer)                      ; flush without new commit point
(sc/merge-from! writer source-writer)   ; merge segments from another branch

(sc/list-snapshots writer)
;; => [{:generation 1 :uuid "..." :timestamp "..." :message "..." :branch "main"}
;;     {:generation 2 :uuid "..." :timestamp "..." :message "..." :branch "main"}]

Search

;; Term query
(sc/search writer {:term {:field "tag" :value "exact-match"}} 10)

;; Match-all
(sc/search writer {:match-all {}} 100)

;; Custom Lucene query object
(sc/search writer my-lucene-query 10)

;; Returns: [{:field1 "val" :field2 "val" :score 1.0} ...]

Time Travel

;; Get snapshot at specific generation
(def reader (sc/open-reader-at writer 1))

;; Check if a generation still exists (may be GC'd)
(sc/commit-available? writer 1)  ; => true/false

;; Get current immutable snapshot
(def snap (sc/snapshot writer))

;; Execute with auto-closing snapshot
(sc/with-snapshot [reader writer]
  (sc/search reader {:match-all {}} 10))

;; Close the reader opened with open-reader-at above
(.close reader)

Garbage Collection

;; Remove commits older than 1 hour, respecting branch references
(sc/gc! writer)

GC only runs on the main branch and protects all segment files referenced by any branch.
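
The protection rule can be pictured as a set computation. The sketch below is a toy model, not Scriptum's GC: `ToyGc`, `Commit`, and `deletableFiles` are hypothetical names, and the rule that a file survives if any retained commit or any branch references it is an assumption based on the description above.

```java
import java.time.Instant;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of branch-aware GC (hypothetical; not Scriptum's classes).
final class ToyGc {
    // A commit point: its generation, timestamp, and the segment files it references.
    record Commit(long generation, Instant timestamp, Set<String> files) {}

    // Returns files referenced only by commits older than the cutoff,
    // excluding anything a retained commit or any branch still references.
    static Set<String> deletableFiles(List<Commit> mainCommits,
                                      Set<String> branchReferencedFiles,
                                      Instant cutoff) {
        Set<String> expired = new HashSet<>();
        Set<String> live = new HashSet<>(branchReferencedFiles);
        for (Commit c : mainCommits) {
            if (c.timestamp().isBefore(cutoff)) {
                expired.addAll(c.files());
            } else {
                live.addAll(c.files());
            }
        }
        expired.removeAll(live);
        return expired;
    }
}
```

A file shared between an expired commit and a branch stays on disk in this model, which mirrors why a fork remains readable after trunk commits are collected.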

Java API

For Java users, BranchIndexWriter provides the complete API:

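import org.replikativ.scriptum.BranchIndexWriter;
import org.apache.lucene.document.*;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.TopDocs;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;
import java.util.Set;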

// Create an index
BranchIndexWriter main = BranchIndexWriter.create(Path.of("/tmp/my-index"), "main");

// Add documents
Document doc = new Document();
doc.add(new TextField("title", "Hello World", Field.Store.YES));
doc.add(new StringField("id", "doc-1", Field.Store.YES));
main.addDocument(doc);
main.commit("Initial commit");

// Fork a branch (3-5ms regardless of index size)
BranchIndexWriter feature = main.fork("experiment");
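Document anotherDoc = new Document();
anotherDoc.add(new TextField("title", "Branch only", Field.Store.YES));
anotherDoc.add(new StringField("id", "doc-2", Field.Store.YES));
feature.addDocument(anotherDoc);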
feature.commit("Feature work");

// Search
DirectoryReader reader = main.openReader();
IndexSearcher searcher = new IndexSearcher(reader);
TopDocs results = searcher.search(new MatchAllDocsQuery(), 10);
reader.close();

// Merge branch back
main.mergeFrom(feature);

// Time travel - open reader at specific generation
DirectoryReader historical = main.openReaderAt(1);
// ... query the historical state, then release the reader
historical.close();

// Garbage collect old commits
main.gc(Instant.now().minus(Duration.ofHours(1)));

// Discover branches
Set<String> branches = BranchIndexWriter.discoverBranches(Path.of("/tmp/my-index"));

// Cleanup
feature.close();
main.close();

Key Java Methods

Method	Description
`create(path, branchName)`	Create new index
`open(path, branchName)`	Open existing branch
`fork(branchName)`	Fast fork (copies metadata only)
`addDocument(doc)`	Add a document
`deleteDocuments(terms...)`	Delete by terms
`updateDocument(term, doc)`	Atomic delete+add
`commit()` / `commit(message)`	Persist changes
`openReader()`	NRT reader (sees uncommitted)
`openCommittedReader()`	Reader on committed state
`openReaderAt(generation)`	Time travel to specific commit
`isCommitAvailable(generation)`	Check if commit still exists
`listSnapshots()`	Get all commit points
`mergeFrom(source)`	Merge another branch
`gc(beforeInstant)`	Garbage collect old commits
`numDocs()` / `maxDoc()`	Document counts
`getBranchName()`	Current branch name
`isMainBranch()`	Check if main branch

Yggdrasil Integration

Scriptum implements the Yggdrasil protocol stack (Snapshotable, Branchable, Graphable, Mergeable):

(require '[scriptum.yggdrasil :as sy]
         '[yggdrasil.protocols :as p])

(def sys (sy/create "/tmp/my-index" {:system-name "search-index"}))

(p/branches sys)         ; => #{:main}
(p/branch! sys :feature)
(p/checkout sys :feature)
;; ... add docs, commit ...
(p/merge! sys :main)
(p/history sys {:limit 10})

(sy/close! sys)

Scriptum passes the full Yggdrasil compliance test suite (22 tests, 203 assertions).

Performance

Typical results:

Fork latency: 3-5ms (independent of index size)
Indexing: ~50k docs/sec (text fields, SSD)
Search: sub-millisecond for simple queries
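
These figures depend on hardware and index contents. As a rough way to check a latency on your own machine, the following JDK-only sketch times an arbitrary operation and reports the median. `LatencyProbe` is a hypothetical helper, not part of Scriptum, and it does no JIT warm-up, so treat its output as indicative only.

```java
import java.util.Arrays;

// Minimal latency probe (hypothetical helper; not part of Scriptum).
final class LatencyProbe {
    // Runs the task `runs` times and returns the median wall-clock time in nanoseconds.
    static long medianNanos(Runnable task, int runs) {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        long[] samples = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            samples[i] = System.nanoTime() - start;
        }
        return median(samples);
    }

    // Median of the samples (the upper median for even-length input).
    static long median(long[] samples) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        return sorted[sorted.length / 2];
    }
}
```

The median is used rather than the mean so that a single slow run, for example one that coincides with a garbage-collection pause, does not dominate the result.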

Directory Layout

On disk, Scriptum uses this structure:

basePath/                    -- trunk (main branch)
  _0.cfs, _1.cfs, ...       -- shared segment files
  segments_N                 -- main's commit points
  branches/
    feature/                 -- branch overlay
      _10000.cfs, ...        -- branch-specific segments
      segments_N             -- branch's commit points

Branches share base segments via read-only references. Only new writes create branch-specific segment files.
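
Given this layout, the overlay branches of an index can be found by listing the subdirectories of `branches/`. The sketch below uses only the JDK and is an illustration of what the layout implies, not a replacement for `discoverBranches`; `LayoutScan` and `overlayBranches` are hypothetical names, and the trunk itself is deliberately not included in the result.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Stream;

// Inspects the on-disk layout directly (hypothetical helper; not part of Scriptum).
final class LayoutScan {
    // Returns the names of the overlay directories under basePath/branches,
    // or an empty set if the index has no branches directory.
    static Set<String> overlayBranches(Path basePath) throws IOException {
        Set<String> names = new TreeSet<>();
        Path branches = basePath.resolve("branches");
        if (Files.isDirectory(branches)) {
            try (Stream<Path> children = Files.list(branches)) {
                children.filter(Files::isDirectory)
                        .forEach(p -> names.add(p.getFileName().toString()));
            }
        }
        return names;
    }
}
```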

Technical Documentation

See docs/LUCENE_EXTENSION.md for a deep-dive into how Scriptum extends Lucene:

How Lucene segments and commit points work
BranchedDirectory: overlay pattern for COW reads/writes
BranchDeletionPolicy: retaining all commits until explicit GC
BranchAwareMergePolicy: preventing merge of shared segments
Fork operation mechanics and performance analysis
GC with branch protection

Project Structure

src/
  clojure/scriptum/
    core.clj                 # Low-level COW branching API
    yggdrasil.clj            # Yggdrasil protocol adapter
  java/org/replikativ/scriptum/
    BranchIndexWriter.java   # Branch-aware Lucene writer (main Java API)
    BranchedDirectory.java   # COW directory overlay
    BranchAwareMergePolicy.java  # Prevents merging shared segments
    BranchDeletionPolicy.java    # Retains all commits until GC
docs/
  LUCENE_EXTENSION.md        # Technical deep-dive
test/scriptum/
  core_test.clj              # Unit tests
  yggdrasil_test.clj         # Compliance tests

Requirements

Java 21+
Clojure 1.12.0+
Apache Lucene 10.3.2 (pulled from Maven Central)

Development

# Compile Java sources
clj -T:build compile-java

# Run tests
clj -T:build compile-java && clj -M:test

# Start nREPL
clj -T:build compile-java && clj -M:repl

License

Licensed under the Eclipse Public License 2.0.
