Sandbar layers BM25F multi-field weighted scoring on top of Datomic's native Lucene-backed :db/fulltext indexing. The analyzer (Unicode-aware tokenizer + Porter stemmer) is metamodel-driven; per-class :dt/bm25f-weights declare slot weights at the schema layer. Result projection (snippets, facets, structural composition via :where Datalog) composes through one verb. This is the search axis of the four-axis retrieval surface.
A retrieval substrate ought to answer "find by content, ranked" as a first-class question, not as a bolt-on feature. Sandbar's fulltext-search primitive — sandbar.search/search-bm25f — is the answer: one verb, opts-shaped, composes with the rest of the substrate via shared :where Datalog clauses and shared :include projection options.
The implementation choice is settled. Lucene's tokenization, term-frequency indexing, and inverted-list traversal are battle-tested at every scale that matters. Sandbar uses Lucene through Datomic's native :db/fulltext integration for the indexing layer. Scoring is canonical Robertson-Zaragoza BM25F (Robertson, Zaragoza & Taylor 2004) — the multi-field weighted form that single-field Lucene Similarity doesn't natively express. Substrate-quality discipline: the analyzer is class-agnostic; per-class weights are declared at the metamodel, not hardcoded in the substrate.
Doug Cutting's Lucene is the canonical fulltext-indexing library — segment-based inverted indexes, term dictionaries with skip lists, positional information for phrase queries, BKD trees for numeric/spatial. Used by Solr, Elasticsearch, OpenSearch, and (relevantly) Datomic's :db/fulltext attributes. Sandbar inherits Lucene's tokenization + inverted-list traversal directly; it does not reimplement them.
:db/fulltext (Hickey 2012-present)
Datomic attributes declared with :db/fulltext true are indexed by Lucene at transact time. Querying is via (fulltext $ <attribute> <query>), a Datalog form that returns [eid value tx score] tuples. Sandbar's dt/search-fulltext is a thin wrapper over this primitive. Per decisions/datomic_primary_backend_elevation_2026_05_11.md, native Lucene integration is one of the five reasons Datomic was elevated to primary backend.
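As a rough illustration of the primitive that dt/search-fulltext wraps, the Datalog below binds the four-tuple relation that Datomic's fulltext function returns (entity, value, transaction, score). It is a sketch, not Sandbar's code: the name fulltext-query and the choice of :mm.memory/body-raw as a fulltext-indexed attribute are assumptions, and running the query needs a Datomic peer, so the call itself is left as a comment.

```clojure
;; A Datalog query is plain data, so it can be defined without Datomic on the classpath.
;; Assumption: :mm.memory/body-raw is declared with :db/fulltext true.
(def fulltext-query
  '[:find ?e ?score
    :in $ ?q
    :where
    [(fulltext $ :mm.memory/body-raw ?q) [[?e ?value ?tx ?score]]]])

;; With a database value `db` obtained from a Datomic peer (hypothetical here):
;; (datomic.api/q fulltext-query db "recursive rules")
;; returns a set of [eid score] tuples, one per matching datom.
```

The score bound here is Lucene's single-field score for that attribute; combining several attributes with weights is what the BM25F layer adds.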
BM25 is the canonical IR relevance function — Robertson & Spärck-Jones 1976 / Robertson, Walker, Beaulieu et al 1995-1998 / Spärck-Jones, Walker & Robertson 2000. Three terms: term frequency (saturating), inverse document frequency (rarity prior), and document-length normalization.
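For reference, the three terms combine in the commonly cited Okapi form below. This is the textbook formulation, not a transcription of Sandbar's code. Here f(t, D) is the frequency of term t in document D, |D| is the document length, avgdl is the average document length, N is the corpus size, n_t is the number of documents containing t, and k_1 and b are free parameters (values near 1.2 and 0.75 are typical defaults, assumed here rather than taken from Sandbar).

```latex
\[
\mathrm{score}(D, Q) = \sum_{t \in Q} \mathrm{IDF}(t)\,
  \frac{f(t, D)\,(k_1 + 1)}{f(t, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)},
\qquad
\mathrm{IDF}(t) = \ln\frac{N - n_t + 0.5}{n_t + 0.5}
\]
```

The fraction saturates as f(t, D) grows (controlled by k_1), the IDF factor rewards rare terms, and b sets how strongly long documents are penalized.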
BM25F (Robertson, Zaragoza & Taylor 2004; CIKM) is the multi-field extension. Each document has multiple fields (title, body, tags); per-field weights amplify or attenuate the contribution of each field; field-level length normalization respects each field's average length independently. Per decisions/bm25f_canonical_robertson_zaragoza_form.md, Sandbar's implementation is canonical Robertson-Zaragoza form (not a Lucene-Similarity composition approximation).
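The multi-field form can be sketched in a few lines of plain Clojure. This is a minimal illustration of the commonly cited BM25F shape (per-field length normalization, weighted sum into one pseudo term frequency, then a single saturation step), not Sandbar's implementation. The function names, the params map layout, and the use of one shared b for all fields are assumptions made for the example.

```clojure
(defn idf
  "Classic Robertson/Sparck-Jones IDF; can go negative for very common terms."
  [n-docs doc-freq]
  (Math/log (/ (+ (- n-docs doc-freq) 0.5)
               (+ doc-freq 0.5))))

(defn weighted-tf
  "Pseudo term frequency: per-field tf, normalized by that field's own
   average length, then scaled by the field weight and summed."
  [{:keys [weights b avg-len]} doc term]
  (reduce-kv
    (fn [acc field w]
      (let [tokens (get doc field [])
            tf     (count (filter #{term} tokens))
            norm   (+ (- 1.0 b)
                      (* b (/ (count tokens) (double (get avg-len field)))))]
        (+ acc (/ (* w tf) norm))))
    0.0
    weights))

(defn bm25f-score
  "Sum over query terms of idf * tf~ / (k1 + tf~). The constant (k1 + 1)
   numerator of single-field BM25 is omitted; it rescales scores without
   changing their order."
  [{:keys [k1 n-docs doc-freq] :as params} doc query-terms]
  (reduce
    (fn [acc term]
      (let [tf (weighted-tf params doc term)]
        (+ acc (* (idf n-docs (get doc-freq term 0))
                  (/ tf (+ k1 tf))))))
    0.0
    query-terms))

;; Hypothetical corpus statistics and weights, for illustration only.
(def params
  {:weights  {:title 12.0 :body 1.0}
   :avg-len  {:title 2.0 :body 4.0}
   :b        0.75
   :k1       1.2
   :n-docs   100
   :doc-freq {"datomic" 5}})

(def title-hit {:title ["datomic" "rules"] :body ["some" "other" "body" "text"]})
(def body-hit  {:title ["query" "rules"]   :body ["recursive" "rules" "in" "datomic"]})
```

With these inputs a match in the heavily weighted title field outranks the same term matched only in the body, which is the behavior per-field weights are meant to produce.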
The Porter Stemmer (Porter 1980, "An algorithm for suffix stripping") is the canonical algorithmic English stemmer: a five-step suffix-rewriting pipeline that conflates regular morphological variants (running / runs → stem run). Irregular forms such as ran are left unchanged, because the algorithm only strips suffixes. Sandbar's analyzer ships the Porter stemmer ported byte-for-byte from the corpus's reference implementation (etc/lib/analysis.clj); the ported form is used at index time AND query time so the stems match.
Plain whitespace-tokenization fails on real text — diacritics, ligatures, mixed-script content, contraction apostrophes. Sandbar's tokenizer uses Java's java.text.BreakIterator for word-boundary detection, with Unicode normalization (NFD) and combining-mark stripping for diacritic folding. Same form as the corpus's etc/lib/analysis.clj; ported to Sandbar substrate at Stage 4a of the fulltext arc.
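The approach described above can be sketched with the JDK classes it names. This is an illustrative reconstruction, not the contents of etc/lib/analysis.clj: the function names, the lower-casing step, and the rule for discarding punctuation-only segments are assumptions.

```clojure
(require '[clojure.string :as str])
(import '(java.text BreakIterator Normalizer Normalizer$Form)
        '(java.util Locale))

(defn fold-diacritics
  "Decompose to NFD, then strip combining marks (Unicode category M)."
  [s]
  (-> (Normalizer/normalize s Normalizer$Form/NFD)
      (str/replace #"\p{M}+" "")))

(defn tokenize
  "Split text on BreakIterator word boundaries, fold diacritics, lower-case,
   and keep only segments containing at least one letter or digit."
  [^String text]
  (let [it (doto (BreakIterator/getWordInstance Locale/ROOT)
             (.setText text))]
    (loop [start (.first it)
           end   (.next it)
           acc   []]
      (if (= end BreakIterator/DONE)
        acc
        (let [tok (-> (subs text start end) fold-diacritics str/lower-case)]
          (recur end
                 (.next it)
                 (if (re-find #"[\p{L}\p{N}]" tok) (conj acc tok) acc)))))))

(tokenize "Café résumé, naïve!")
;; => ["cafe" "resume" "naive"]
```

Whitespace and punctuation come back from the iterator as their own segments, which is why the final letter-or-digit check is needed before a segment is kept as a token.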
```clojure
(sandbar.search/search-bm25f
  {:class         :mm/Memory
   :query         "datomic recursive rules"
   :field-weights {:mm.memory/name        12.0   ; optional override
                   :mm.memory/description 8.0
                   :mm.memory/body-raw    1.0}
   :limit         20
   :include       [:snippets :scores :facets]
   :where         '[[?e :mm.memory/memory-type :decision]]     ; optional Datalog
   :facet-by      [:mm.memory/memory-type :mm.memory/scope]})  ; optional
;; => {:hits [{:entity <entity-map> :score <num>
;;             :field-scores {:mm.memory/name 8.2 :mm.memory/body-raw 3.1 ...}
;;             :snippets {:mm.memory/body-raw "...**datomic** **recursive** ..."}}
;;            ...]
;;     :total <int>
;;     :facets {:mm.memory/memory-type {:decision 12 :plan 7 :observation 4 ...}}
;;     :timing {:tokenize-ms 1 :score-ms 12 :total-ms 14}}
```
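The :facets entry in the result shape is a per-slot histogram over the matched set. One way to picture it, as a sketch only: facet-counts is a hypothetical helper, and it assumes each hit carries its entity map under :entity as in the result shape shown here.

```clojure
(defn facet-counts
  "For each slot, count how often each value occurs across the hits.
   Hits whose entity lacks the slot are skipped."
  [hits slots]
  (into {}
        (map (fn [slot]
               [slot (frequencies (keep #(get-in % [:entity slot]) hits))]))
        slots))

(facet-counts
  [{:entity {:mm.memory/memory-type :decision}}
   {:entity {:mm.memory/memory-type :decision}}
   {:entity {:mm.memory/memory-type :plan}}]
  [:mm.memory/memory-type])
;; => {:mm.memory/memory-type {:decision 2, :plan 1}}
```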
Three opts power the composition contract:
- :where takes Datalog clauses that filter the candidate set BEFORE ranking: fulltext ∩ structured filter as one query.
- :facet-by takes slot idents over which to compute facet counts on the matched set: aggregation composed with search.
- :include takes projection opts: :snippets (approximate regex-based highlight windows), :scores (per-field score breakdown), :facets (the histogram emission).

Per-class weights live at the metamodel layer:
```clojure
;; schema/mm.edn
{:db/ident         :mm/Memory
 :dt/type          :dt/Class
 :dt/subclass-of   :dt/Resource
 :dt/slots         [:mm.memory/name :mm.memory/description :mm.memory/body-raw ...]
 :dt/bm25f-weights [[:mm.memory/name 12.0]
                    [:mm.memory/description 8.0]
                    [:mm.memory/body-raw 1.0]]}
```
The substrate reads weights from the class entity at search time via dt/bm25f-weights-of — no consumer hardcoding. Weights are caller-overridable via the :field-weights opt; the metamodel declaration is the default.
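The default-plus-override relationship can be pictured as a map merge. This is a sketch under a stated assumption: resolve-weights is a hypothetical helper, and it assumes overrides are applied per slot rather than replacing the declared weights wholesale, which the text does not specify.

```clojure
(defn resolve-weights
  "Turn the declared [[slot weight] ...] pairs into a map, then let any
   caller-supplied weights win slot by slot."
  [declared overrides]
  (merge (into {} declared) overrides))

(resolve-weights
  [[:mm.memory/name 12.0]
   [:mm.memory/description 8.0]
   [:mm.memory/body-raw 1.0]]
  {:mm.memory/name 20.0})
;; => {:mm.memory/name 20.0, :mm.memory/description 8.0, :mm.memory/body-raw 1.0}
```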
:include [:snippets] emits per-slot snippet windows centered on the first query-term hit, with **term** markdown highlighting of all matched-term occurrences within a ~240-char window. Implementation is approximate (regex-based, not Lucene-position-aware) — Lucene's native positional highlighter is a Phase-2 optimization deferred per decisions/query_engine_architectural_cornerstone_2026_05_11.md.
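A regex-based window of this kind might look like the sketch below. It is an approximation for illustration, not Sandbar's snippet code: the snippet function and its arguments are hypothetical, and it matches literal query terms rather than stems.

```clojure
(require '[clojure.string :as str])
(import '(java.util.regex Pattern))

(defn snippet
  "Return a window of about `window` chars roughly centered on the first
   query-term hit, with every term occurrence inside it wrapped in **...**.
   Returns nil when there are no terms or no term occurs in the text."
  [text terms window]
  (when (seq terms)
    (let [alternation (str/join "|" (map #(Pattern/quote %) terms))
          pattern     (re-pattern (str "(?i)\\b(" alternation ")\\b"))
          m           (re-matcher pattern text)]
      (when (.find m)
        (let [half  (quot window 2)
              start (max 0 (- (.start m) half))
              end   (min (count text) (+ start window))]
          (str/replace (subs text start end) pattern "**$1**"))))))

(snippet "Datomic supports recursive rules." ["datomic" "recursive"] 240)
;; => "**Datomic** supports **recursive** rules."
```

Because the window is cut by character offset, it can start or end mid-word, which is one of the ways a regex approach is coarser than a position-aware highlighter.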
Substrate primitives in sandbar.db.datatype:
| Primitive | What it does |
|---|---|
| dt/search-fulltext | Single-attribute Lucene search; returns [[eid score] ...] pairs |
| dt/bm25f-weights-of | Read declared weights for a class; returns {slot weight} map |
| dt/fulltext-indexed? | Predicate: does this attribute have :db/fulltext true? |
The sandbar.search namespace composes these primitives + the analyzer + Datomic class-walking into the consumer-facing verb.
Search is one axis of four (search / aggregate / navigate / orient). The composition contract per decisions/multi_axis_search_composition_2026_05_08.md:
- :facet-by slot list on search-bm25f emits per-slot value counts over the match set.
- :where Datalog clauses constrain the candidate set BEFORE BM25F scoring.
- :from + :via will accept path-grammar to restrict the candidate set to a graph-walk neighborhood.
- degree / backlink-density / recency / freshness over the BM25F-scored set.

- [?e :slot "exact string"] clause, not a BM25F query. Fulltext stems and tokenizes; exact match doesn't survive.
- :FILTER on the navigation axis, or a Datalog clause with clojure.string/includes?.

Per decisions/query_engine_architectural_cornerstone_2026_05_11.md §12:
Profiling cliffs we know about: high-frequency stop-words on a corpus of mostly-similar documents can produce large match sets where BM25F's discrimination is weak. Mitigation: stop-word filtering at index time (Lucene-standard); enable per consumer demand.
- :db/fulltext integration.
- doc/concepts/aggregation.md: the sibling axis that :facet-by composes with
- doc/concepts/navigation.md: the sibling axis that future :from + :via composition will compose with
- doc/guides/searching-the-corpus.md: task-oriented worked examples
- doc/api/mcp-verbs.md: sandbar.search.bm25f MCP entry (forthcoming at Stage 27)
- doc/api/dt-star.md: dt/search-fulltext / dt/bm25f-weights-of / dt/fulltext-indexed? substrate primitives
- decisions/bm25f_canonical_robertson_zaragoza_form.md: ADR locking the canonical form