An active research direction, not a settled design. The question of how Sandbar's storage tiers — filesystem-canonical hierarchy, runtime DB, optional secondary indices — should partition workload is open. This document explains the frame the question lives inside, the primitives Sandbar exposes to enable empirical investigation, and the trade-off axes the answer must navigate. For the design rationale on filesystem-canonical commitment see
project-graph.md; for the codec-layer boundary see codec-layer.md.
Sandbar's storage layer is not a single store — it is a topology of stores held in coherence by project-graph / ingest-graph (see project-graph.md). The filesystem hierarchy is canonical ground-truth; the runtime database is a projected view; auxiliary indices (full-text, vector embeddings, link graphs) may exist as additional projections.
Today, the production deployment is a single Datomic Peer paired with one filesystem hierarchy. That is the simplest point in the design space, not the settled one. The frame Sandbar is built around — codec layer absorbs wire format, project-graph absorbs FS↔DB translation, filters constrain projection — is deliberately constructed to make multi-store experimentation cheap.
This document explains the frame and the open questions. It does not claim answers; the answers come from measurement.
Sadalage & Fowler (2012, NoSQL Distilled) argued that no single storage paradigm is right for every kind of data; applications increasingly mix relational, document, key-value, graph, and search stores per workload. Sandbar adopts the mindset — but with discipline: the model is unified through the metamodel; the stores are projections of that model. No store has its own schema; all share the :dt/Class definitions.
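The unified-model discipline can be made concrete with a small sketch. The :dt/Class shape below is invented for illustration (Sandbar's real metamodel differs); the point is that each store-specific schema form is derived from one class definition, so no store owns a schema of its own.

```clojure
(ns metamodel-sketch)

;; A toy class definition in a :dt/Class-like shape (hypothetical;
;; not Sandbar's actual metamodel representation).
(def decision-class
  {:dt/ident :corpus/Decision
   :dt/slots [{:slot :decision/title  :range :string}
              {:slot :decision/status :range :keyword}]})

(defn ->datomic-attrs
  "Derive Datomic attribute installs from the class definition."
  [{:dt/keys [slots]}]
  (mapv (fn [{:keys [slot range]}]
          {:db/ident       slot
           :db/valueType   (keyword "db.type" (name range))
           :db/cardinality :db.cardinality/one})
        slots))

(defn ->frontmatter-template
  "Derive an empty markdown-frontmatter skeleton from the same definition."
  [{:dt/keys [slots]}]
  (into {} (map (fn [{:keys [slot]}] [(name slot) nil])) slots))

(->frontmatter-template decision-class)
;; => {"title" nil, "status" nil}
```

Both projections are pure functions of the one definition; adding a Tier 3 store would mean adding a third `->...` derivation, not a third schema.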
SPARQL 1.1 Federated Query (Buil-Aranda et al., 2013) standardized the "query a graph that spans multiple endpoints" pattern. Sandbar's federated-query story is younger and less developed — today, queries hit one store at a time — but the model permits the generalization: a query that says "find me decisions newer than X and their cited memories' embeddings nearer than ε" could in principle span three stores (Datomic for the metadata; filesystem for the content; vector store for the embeddings). This is a planned exploration, not a current capability.
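A hedged sketch of the shape such a cross-store query could take, using plain maps as stand-ins for the three stores. Every name here is hypothetical; Sandbar exposes no federated-query API today.

```clojure
(ns federated-sketch
  (:require [clojure.set :as set]))

;; Toy stand-ins for the three stores.
(def metadata-store {:d1 {:created 10} :d2 {:created 3}})   ;; "Datomic"
(def content-store  {:d1 "Adopt Datomic" :d2 "Old note"})   ;; "filesystem"
(def vector-store   {:d1 0.2 :d2 0.9})                      ;; 1-d "embeddings"

(defn newer-than [t]
  (set (for [[id m] metadata-store :when (> (:created m) t)] id)))

(defn nearer-than [q eps]
  (set (for [[id v] vector-store :when (< (Math/abs (- v q)) eps)] id)))

(defn federated-find
  "Each store answers its own sub-question; the planner joins on
  entity id and hydrates content from the canonical tier."
  [t q eps]
  (for [id (set/intersection (newer-than t) (nearer-than q eps))]
    {:id id :content (content-store id)}))

(federated-find 5 0.1 0.2)
;; => ({:id :d1, :content "Adopt Datomic"})
```

The join key is the entity id, which all tiers share because all tiers project the same model.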
Brewer (2000) and the subsequent CAP-theorem literature established that distributed stores must trade off consistency, availability, and partition-tolerance. Sandbar's tiered storage faces a milder version: the coherence between FS and DB is eventually consistent (a project-graph happens at specific moments; the FS and DB may diverge between projections); the availability properties are different per store (filesystem is locally-available; Datomic peer is locally-available; remote stores would not be).
Sandbar's coherence model is transactional inside a single store; eventual across stores. The discipline is to make the divergence intervals short and visible.
Datomic's central insight (Hickey, 2012) — that a database is a value; that history is first-class; that time-travel queries are natural — is what makes the tiered story work. When a project-graph operation produces FS state from DB state, the operation is deterministic for a given DB value. Two project-graph operations against the same DB value produce identical filesystems. This determinism is the precondition for diff-based coherence checks across stores.
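What determinism buys can be shown with a minimal sketch, under stated assumptions: `project` below is a toy stand-in for project-graph (the real API differs), and `divergence` is the diff-based coherence check the paragraph describes.

```clojure
(ns coherence-sketch
  (:require [clojure.data :as data]))

(defn project
  "Toy projection standing in for project-graph: DB value -> {path content}."
  [db]
  (into {} (map (fn [[id doc]] [(str (name id) ".md") (:content doc)])) db))

(defn divergence
  "Paths where the live FS disagrees with the projection of db."
  [db fs]
  (let [[only-projected only-fs _] (data/diff (project db) fs)]
    (into #{} (concat (keys only-projected) (keys only-fs)))))

(def db {:decision-1 {:content "Adopt Datomic"}})

;; Determinism: two projections of the same DB value are identical.
(assert (= (project db) (project db)))

;; A coherent FS diffs to nothing; an out-of-band edit is flagged.
(divergence db {"decision-1.md" "Adopt Datomic"})    ;; => #{}
(divergence db {"decision-1.md" "Adopt PostgreSQL"}) ;; => #{"decision-1.md"}
```

Without determinism, the diff would be noisy on every run and the check would be useless; with it, any non-empty divergence is a real divergence interval to surface.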
Three storage tiers, each with a defined role:
┌─────────────────────────────────────────────────────────────────┐
│ Tier 1 — Filesystem-Canonical Hierarchy │
│ ──────────────────────────────────────── │
│ Source-of-truth for human-authored content. │
│ External tooling (vim, git, grep) operates directly. │
│ Permanent; survives all backend turnover. │
└─────────────────────────────────────────────────────────────────┘
↕ project-graph / ingest-graph
┌─────────────────────────────────────────────────────────────────┐
│ Tier 2 — Runtime Database (Datomic Peer) │
│ ──────────────────────────────────── │
│ Indexed for query — schema, ancestry, slot lookup, type checks. │
│ Holds workflow processes, validation history, MCP subscriptions.│
│ Ephemeral; reconstructible from Tier 1 + history. │
└─────────────────────────────────────────────────────────────────┘
↕ (planned) projection / mirror
┌─────────────────────────────────────────────────────────────────┐
│ Tier 3 — Auxiliary Stores (planned) │
│ ────────────────────────── │
│ Full-text indices, vector embeddings, link graphs, … │
│ Each projected from Tier 2 (or directly from Tier 1). │
│ Independently rebuildable; not authoritative for any class. │
└─────────────────────────────────────────────────────────────────┘
The discipline: no class is authoritatively owned by Tier 2 alone. Every class's ground-truth lives in Tier 1. Tier 2 is fast indexed access; Tier 3 is specialized indexed access. Both are derivable.
The exception that proves the rule: workflow processes and MCP subscriptions. Both are ephemeral runtime state, not human-authored content. These legitimately live in Tier 2 only — there is no canonical filesystem form for a half-running process or an active subscription. This is a deliberate boundary: ephemeral state lives in Tier 2; content lives in Tier 1.
project-graph's :filter option (see project-graph.md) is the experimentation surface: it is what makes the tier-partitioning questions answerable empirically rather than by argument — project a subset of classes, measure, compare. These experiments are cheap because the substrate makes them cheap. The filters are a research tool.
This frame is the load-bearing answer to a deeper question: should Sandbar evolve to meet the corpus's needs, or should the corpus work around Sandbar's limitations?
The standing directive (captured in interaction/substrate_first_friction_corpus_unmet_needs_signal_sandbar_evolution_2026_05_13) is substrate-first: when the corpus has unmet needs, look at the substrate. Tactical workarounds inside the corpus accumulate and become indistinguishable from the corpus's real shape; substrate-level evolution stays clean and benefits every consumer.
This commitment is why multi-store is a substrate concern, not an application concern. When the corpus eventually needs vector search (it will), the right move is to add a Tier 3 vector-store as a substrate-level projection — every other consumer benefits — rather than adding a vector index inside the corpus that no other application can use.
Stated openly, with no claim of resolution:
- Projection cadence. project-graph is invoked explicitly. Should it run continuously? On every transaction? On a debounced timer? The right answer depends on the workload (high-volume writes → debounced; low-volume edits → on-transaction).
- Conflict reconciliation. When the FS and DB are edited independently between projections, how do the edits merge? A merge-with-conflict-markers model is one option; last-writer-wins is another; CRDT-shaped reconciliation a third. No current production has hit this case; design is deferred until it does.
- Schema evolution. When the metamodel changes (a slot's :dt/range changes; a class is renamed), how do older FS files load? Today, the answer is "they don't, until migrated"; a future answer might be lazy migration on ingest.

These are not promised features. They are the questions the frame opens. Some will be answered; some will be deferred; some may turn out to be wrong questions.
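The cadence question is concrete enough to sketch. Below, a debounced projection trigger in core Clojure; `run-projection!` is a hypothetical stand-in for an explicit project-graph call, since Sandbar has no built-in debounce today.

```clojure
(ns cadence-sketch)

(defn debounced
  "Wrap a zero-arg f so that rapid calls collapse: f runs once the
  calls go quiet for ms milliseconds."
  [ms f]
  (let [pending (atom nil)]
    (fn []
      ;; Cancel the previously scheduled run, then reschedule.
      (when-let [old @pending] (future-cancel old))
      (reset! pending (future (Thread/sleep ms) (f))))))

;; Toy projection: just bumps a counter so the collapse is observable.
(def projections (atom 0))
(def run-projection! (debounced 50 #(swap! projections inc)))

;; Three rapid writes trigger a single projection.
(dotimes [_ 3] (run-projection!))
(Thread/sleep 200)
@projections ;; => 1
```

An on-transaction policy would be the degenerate case `ms = 0` with no cancellation; which policy wins is exactly the workload measurement the list above defers to.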
When the question "which classes belong in which tier?" is investigated, these are the dimensions that matter:

- Scale. The filesystem tier degrades at high entity counts (in large directories, ls is slow; git operations are slow). Tier 2 (Datomic) handles much larger entity counts but loses external-tool integration. The crossover point is a measurement, not a guess.

These axes don't yield a single right answer — they yield a research program.
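The crossover measurement itself can be sketched with toy stores: a vector scan stands in for a directory walk, a hash map for an indexed store. None of this is Sandbar API; it shows the methodology of timing both shapes at growing entity counts and looking for where the curves cross.

```clojure
(ns crossover-sketch)

(defn time-ms
  "Wall-clock milliseconds to run the zero-arg f once."
  [f]
  (let [t0 (System/nanoTime)]
    (f)
    (/ (- (System/nanoTime) t0) 1e6)))

;; Tier-1-shaped lookup: linear scan, like walking a directory.
(defn scan-lookup [entries k]
  (some (fn [[id v]] (when (= id k) v)) entries))

;; Tier-2-shaped lookup: indexed access.
(defn index-lookup [index k]
  (get index k))

(doseq [n [1000 100000]]
  (let [entries (mapv (fn [i] [i (str "doc-" i)]) (range n))
        index   (into {} entries)
        k       (dec n)]          ;; worst case for the scan
    (println n "entities, scan:" (time-ms #(scan-lookup entries k))
             "ms, index:" (time-ms #(index-lookup index k)) "ms")))
```

A real measurement would use the actual tiers (a populated directory vs. a Datomic query) and repeat runs to control for caching, but the experimental shape is the same.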
A monolithic store (everything in PostgreSQL, everything in Datomic) trades tier-specialization for simplicity. Sandbar deliberately rejects the monolithic shape because the filesystem-canonical commitment requires Tier 1 to be first-class — and once Tier 1 is first-class, the question of what else benefits from being multi-tiered opens naturally.
Lambda (Marz, 2011) and kappa (Kreps, 2014) architectures separate batch and stream processing tiers, with a serving layer reconciling them. Sandbar's multi-tier story is structurally similar — Tier 1 is the batch layer (durable, canonical, eventual); Tier 2 is the speed layer (fresh, indexed, ephemeral) — but the consumer-visible API is the metamodel, not the tiers. Consumers query dt/*; the substrate routes.
Caches are read-through projections of an authoritative store. Sandbar's Tier 2 is not a cache — it is an independently typed, queryable, write-capable substrate that happens to be derivable from Tier 1. Cache invalidation does not apply; project-graph coherence does.
Notion and Roam are SaaS knowledge stores; Obsidian is a local-file knowledge store. Obsidian is the closest peer to Sandbar's FS-canonical model — files are first-class; the runtime index is derived; consumers can edit with any tool. Obsidian's runtime index is in-memory (Lucene-shaped); Sandbar's is Datomic-shaped with a richer typed model. The shapes converge from different starting points: Obsidian started with files and added structure; Sandbar started with structure and made files canonical.
The multi-store frame is forward-looking. Today's user gets a single Datomic Peer, one filesystem hierarchy, and explicit project-graph calls.

Most consumers don't need to reason about the multi-store frame. They write to the FS or the DB through the API surface, and the substrate handles the projection. The frame matters when:
a new tier is being designed, or an experiment constrains what a projection carries (via :filter).

Polyglot persistence
Federated query
CAP theorem and distributed-systems trade-offs
Lambda / kappa architectures
Datomic's tier model
Peer projects (for context)
project-graph.md — the FS↔DB boundary-layer primitive
codec-layer.md — per-entity wire format
markdown-as-canonical.md — the canonical form Tier 1 holds
metamodel.md — the model the tiers project
doc/guides/sandbar-as-substrate.md — embedding Sandbar in your own application