Garbage Collection

Datahike uses persistent data structures that enable structural sharing—each update creates a new version efficiently by reusing unchanged parts. This allows time-travel queries and git-like versioning, but storage grows over time as old snapshots accumulate.

Garbage collection removes old database snapshots from storage while preserving current branch heads.

GC vs Purging

Don't confuse garbage collection with data purging:

  • Garbage Collection (this document): Removes old database snapshots to reclaim storage. Used for routine storage maintenance.
  • Data Purging: Permanently deletes specific data for privacy compliance (GDPR, HIPAA, CCPA). Used only when legally required.

How Garbage Collection Works

GC whitelists all current branches and marks snapshots as reachable based on a grace period. Snapshots older than the grace period are deleted from storage, but branch heads are always retained regardless of age.

Basic Usage

(require '[datahike.api :as d]
         '[superv.async :refer [<?? S]])

;; Remove only deleted branches, keep all snapshots
(<?? S (d/gc-storage conn))
;; => #{...} ; set of deleted storage blobs

Running without a date removes only deleted branches—all snapshots on active branches are preserved. This is safe to run anytime and reclaims storage from old experimental branches.

Note: gc-storage returns a core.async channel. Use <?? to block until collection finishes, or skip it and let GC run in the background. GC requires no coordination and won't slow down transactions or reads.
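
To consume the channel without blocking, one option is a go block; this is a sketch using plain core.async:

(require '[clojure.core.async :as async])

;; Run GC in the background and log the result when it finishes
(async/go
  (let [deleted (async/<! (d/gc-storage conn))]
    (println "GC removed" (count deleted) "storage blobs")))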

Grace Periods for Distributed Readers

Datahike's Distributed Index Space allows readers to access storage directly without coordination. This is powerful for scalability but means long-running processes might read from old snapshots for hours.

Examples of long-running readers:

  • Reporting jobs: Generate daily/weekly reports by querying yesterday's snapshot
  • Analytics pipelines: Process historical data over several hours
  • Monitoring dashboards: Display metrics from recent snapshots
  • Backup processes: Copy database state while it's being updated

The grace period ensures these readers don't encounter missing data. Snapshots created after the grace period date are kept; older ones are deleted.

(require '[datahike.api :as d]
         '[superv.async :refer [<?? S]])

;; Keep last 7 days of snapshots
(let [seven-days-ago (java.util.Date. (- (System/currentTimeMillis)
                                         (* 7 24 60 60 1000)))]
  (<?? S (d/gc-storage conn seven-days-ago)))

;; Keep last 30 days (common for compliance)
(let [thirty-days-ago (java.util.Date. (- (System/currentTimeMillis)
                                          (* 30 24 60 60 1000)))]
  (<?? S (d/gc-storage conn thirty-days-ago)))

;; Keep last 24 hours (for fast-moving data)
(let [yesterday (java.util.Date. (- (System/currentTimeMillis)
                                    (* 24 60 60 1000)))]
  (<?? S (d/gc-storage conn yesterday)))
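
Since this date arithmetic repeats, a small helper can compute the cutoff; a sketch (the name days-ago is illustrative, not part of the API):

(defn days-ago
  "Returns a java.util.Date n days before now."
  [n]
  (java.util.Date. (- (System/currentTimeMillis)
                      (* n 24 60 60 1000))))

;; Equivalent to the 7-day example above
(<?? S (d/gc-storage conn (days-ago 7)))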

Choosing a grace period:

  • Consider your longest-running reader process
  • Add buffer time for safety (if longest job is 2 hours, use 4-6 hours)
  • Balance storage costs against reader safety
  • Monitor reader patterns before shortening grace periods

Branch heads are always kept regardless of the grace period—only intermediate snapshots are removed.

Online Garbage Collection (Incremental GC)

⚠️ EXPERIMENTAL FEATURE

Online GC automatically deletes freed index nodes during transaction commits, preventing garbage accumulation during bulk imports and high-write workloads.

Online GC is currently an experimental feature. While it has been tested extensively in Clojure/JVM and includes safety mechanisms for multi-branch databases, use with caution in production. We recommend:

  • Thorough testing in your specific use case before production deployment
  • Monitoring freed address counts to verify expected behavior
  • Using it primarily for bulk imports and high-write workloads where it's most beneficial
  • ClojureScript: Online GC is available in CLJS but has not yet been tested under large bulk loads; JVM testing is more comprehensive
  • Reporting any issues at https://github.com/replikativ/datahike/issues

How Online GC Works

Online GC is ONLY safe for single-branch databases. For multi-branch databases, online GC is automatically disabled because freed nodes from one branch may still be referenced by other branches through structural sharing. Use offline GC (d/gc-storage) for multi-branch cleanup instead.

When PSS (Persistent Sorted Set) index trees are modified during transactions, old index nodes become unreachable. Online GC tracks these freed addresses with timestamps and deletes them incrementally:

  1. During transaction (transient mode): PSS calls markFreed() for each replaced index node
  2. At commit time: Freed addresses older than the grace period are batch-deleted
  3. Multi-branch safety check: If multiple branches detected, GC is skipped entirely
  4. No full tree walk: Only freed addresses are deleted; no expensive tree traversal is required
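
As a conceptual illustration of steps 1-2 (not Datahike's actual internals; all names are illustrative), commit-time eligibility amounts to filtering freed addresses by age and capping the batch size:

;; Conceptual sketch only
(defn eligible-for-deletion
  [freed-addresses grace-period-ms max-batch now-ms]
  (->> freed-addresses
       (filter (fn [{:keys [freed-at]}]
                 (>= (- now-ms freed-at) grace-period-ms)))
       (take max-batch)))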

Key benefits:

  • Prevents unbounded storage growth during bulk imports (single-branch only)
  • Incremental deletion: Small batches per commit, low overhead
  • Grace period support: Safe for concurrent readers accessing old snapshots
  • Multi-branch safety: Automatically disabled to prevent corruption
  • Configurable: Can be disabled, tuned, or run in background

Configuration

Enable online GC in your database config:

;; For bulk imports (no concurrent readers, single-branch)
;; See "Address Recycling" section below for details
{:online-gc {:enabled? true
             :grace-period-ms 0          ;; Recycle immediately
             :max-batch 10000}           ;; Large batches for efficiency
 :crypto-hash? false}                   ;; Required for address recycling

;; For production (concurrent readers)
{:online-gc {:enabled? true
             :grace-period-ms 300000     ;; 5 minutes
             :max-batch 1000}}           ;; Smaller batches

;; Disabled (default)
{:online-gc {:enabled? false}}

Configuration options:

  • :enabled? - Enable/disable online GC (default: false)
  • :grace-period-ms - Minimum age in milliseconds before deletion (default: 60000 = 1 minute)
  • :max-batch - Maximum addresses to delete per commit (default: 1000)
  • :sync? - Perform deletions synchronously (always false inside commits, so deletions run asynchronously)
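
The :online-gc map sits alongside the rest of the database configuration. A minimal end-to-end sketch (the store path is illustrative):

(require '[datahike.api :as d])

(def cfg {:store {:backend :file :path "/tmp/online-gc-demo"}
          :online-gc {:enabled? true
                      :grace-period-ms 300000
                      :max-batch 1000}})

(d/create-database cfg)
(def conn (d/connect cfg))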

Background GC Mode

For production systems, run GC in a background thread instead of blocking commits:

(require '[datahike.online-gc :as online-gc]
         '[clojure.core.async :as async])

;; Start background GC
(def stop-ch (online-gc/start-background-gc!
               (:store @conn)
               {:grace-period-ms 60000    ;; 1 minute
                :interval-ms 10000        ;; Run every 10 seconds
                :max-batch 1000}))

;; Later, stop background GC
(async/close! stop-ch)

Background mode advantages:

  • Non-blocking: Doesn't slow down commits
  • Periodic cleanup: Runs every N milliseconds
  • Graceful shutdown: Close channel to stop
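
A common pattern is to tie the background GC loop to the application lifecycle; a sketch (the wrapper name is illustrative):

(defn with-background-gc
  "Runs f with background GC active, stopping it afterwards."
  [store opts f]
  (let [stop-ch (online-gc/start-background-gc! store opts)]
    (try
      (f)
      (finally
        (async/close! stop-ch)))))

;; Usage: run-import! stands in for your application code
(with-background-gc (:store @conn)
                    {:grace-period-ms 60000 :interval-ms 10000 :max-batch 1000}
                    #(run-import! conn))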

Address Recycling (Bulk Import Optimization)

⚠️ EXPERIMENTAL FEATURE

Address recycling is an experimental optimization. It has been designed with safety checks (multi-branch detection, grace periods), but should be thoroughly tested in your environment before production use.

Online GC includes address recycling—freed addresses are reused for new index nodes instead of being deleted from storage. This optimization is particularly powerful for bulk imports.

How it works:

  1. When index trees are modified, old root addresses are marked as freed
  2. Online GC moves eligible addresses to a freelist (grace period applies)
  3. New index nodes reuse addresses from the freelist instead of generating new UUIDs
  4. LMDB overwrites the recycled address with new data
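
A minimal sketch of the freelist idea itself (conceptual and single-threaded, not Datahike's implementation): freeing appends to a list, and allocation pops from it before minting a new UUID:

(def freelist (atom []))

(defn free-address! [addr]
  (swap! freelist conj addr))         ;; O(1) append instead of a delete

(defn allocate-address! []
  (if-let [addr (peek @freelist)]
    (do (swap! freelist pop) addr)    ;; reuse a freed address
    (java.util.UUID/randomUUID)))     ;; otherwise mint a fresh one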

Benefits:

  • Zero delete operations: Converts O(freed_nodes) deletes to O(1) freelist append
  • Reduces LMDB fragmentation: Addresses are reused rather than accumulating
  • Perfect for bulk imports: With :grace-period-ms 0, recycling happens immediately
  • Minimal overhead: No tree traversal or complex reachability analysis

Safety limitations:

Address recycling is ONLY safe for:

  • Single-branch databases (shared nodes across branches would be corrupted)
  • No long-lived readers (or grace period exceeds reader lifetime)
  • Bulk import scenarios (write-only, no concurrent queries)

Online GC applies these safeguards automatically:

  • Multiple branches exist: online GC is skipped entirely, because freed nodes from one branch may still be referenced by other branches through structural sharing; use offline GC instead
  • :crypto-hash? true: address recycling is disabled and online GC falls back to deletion mode

Bulk Import Configuration

For maximum performance during bulk imports where no concurrent readers exist:

;; Optimal bulk import configuration
{:online-gc {:enabled? true
             :grace-period-ms 0        ;; Recycle immediately (no readers)
             :max-batch 10000}         ;; Large batch (only for delete fallback)
 :crypto-hash? false                  ;; Required for recycling
 :branch :db}                         ;; Single branch only

;; Example bulk import
(let [cfg  {:store {:backend :file :path "/data/bulk-import"}
            :online-gc {:enabled? true :grace-period-ms 0}
            :crypto-hash? false}
      _    (d/create-database cfg)        ;; database must exist before connecting
      conn (d/connect cfg)]
  ;; Import millions of entities (entity-batches is your import data)
  (doseq [batch entity-batches]
    (d/transact conn batch))
  ;; Storage stays bounded - addresses are recycled
  (d/release conn))

Bulk import best practices:

  1. Set :grace-period-ms 0 (no concurrent readers to protect)
  2. Use :crypto-hash? false (enables address recycling)
  3. Stay on single branch (:branch :db)
  4. Increase :max-batch for efficiency (only affects delete fallback)
  5. Monitor freed address counts to verify recycling is working

Verifying address recycling:

  • Check logs for "Online GC: recycling N addresses to freelist"
  • If you see "Online GC: skipped (multi-branch detected)", make sure you are on a single branch; multi-branch databases require offline GC instead
  • Freed address counts should drop to zero after each transaction

Online GC vs Offline GC

Online GC (incremental):

  • Runs during commits
  • Deletes only freed index nodes from recent transactions
  • Fast: No tree traversal required
  • With recycling: No delete operations at all, just freelist management
  • ONLY for single-branch databases - automatically disabled for multi-branch
  • Best for: Bulk imports, high-write workloads

Offline GC (d/gc-storage):

  • Runs manually
  • Deletes entire old snapshots by walking all branches
  • Slower: Full tree traversal and marking
  • Handles multi-branch databases safely through reachability analysis
  • Required for multi-branch databases (online GC doesn't work)
  • Best for: Periodic maintenance, deleting old branches, multi-branch cleanup

Use both: Online GC for incremental cleanup during single-branch writes, offline GC for periodic deep cleaning and all multi-branch scenarios.
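
For the periodic offline pass, a simple scheduler suffices; this sketch uses a Java executor with the 7-day grace period from above:

(import '[java.util.concurrent Executors TimeUnit])

;; Run offline GC once a day, keeping one week of snapshots.
;; Shut the scheduler down with (.shutdown gc-scheduler).
(def gc-scheduler
  (doto (Executors/newSingleThreadScheduledExecutor)
    (.scheduleAtFixedRate
      (fn []
        (let [week-ago (java.util.Date. (- (System/currentTimeMillis)
                                           (* 7 24 60 60 1000)))]
          (<?? S (d/gc-storage conn week-ago))))
      0 24 TimeUnit/HOURS)))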

Automatic Garbage Collection

With online GC enabled, garbage collection becomes largely automatic during normal operation. Manual d/gc-storage runs are only needed for:

  • Deleting old branches
  • Periodic deep cleaning (monthly/quarterly)
  • Compliance-driven snapshot removal

When to Run GC

  • After deleting branches: Immediately reclaim storage
  • Periodic maintenance: Weekly/monthly based on storage growth
  • Storage alerts: When approaching capacity limits
  • Version cleanup: After completing long-running migrations
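
For the first case, branch operations live in datahike.experimental.versioning; assuming its delete-branch! function, the flow looks like this sketch:

(require '[datahike.experimental.versioning :as versioning])

;; Delete an experimental branch, then reclaim its storage
(versioning/delete-branch! conn :my-experiment)
(<?? S (d/gc-storage conn))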

What Gets Deleted

GC removes:

  • Old database snapshots older than the grace period
  • Deleted branches and their snapshots
  • Unreachable index nodes from old snapshots

GC preserves:

  • All current branch heads (always)
  • Snapshots created after the grace period
  • All data on retained snapshots (GC doesn't delete data, only snapshots)

Remember: For deleting specific data (GDPR compliance), use data purging, not garbage collection.
