Garbage Collection

Datahike uses persistent data structures that enable structural sharing—each update creates a new version efficiently by reusing unchanged parts. This allows time-travel queries and git-like versioning, but storage grows over time as old snapshots accumulate.

Garbage collection removes old database snapshots from storage while preserving current branch heads.

GC vs Purging

Don't confuse garbage collection with data purging:

  • Garbage Collection (this document): Removes old database snapshots to reclaim storage. Used for routine storage maintenance.
  • Data Purging: Permanently deletes specific data for privacy compliance (GDPR, HIPAA, CCPA). Used only when legally required.

How Garbage Collection Works

GC whitelists all current branches and marks snapshots as reachable based on a grace period. Snapshots older than the grace period are deleted from storage, but branch heads are always retained regardless of age.

Basic Usage

(require '[datahike.api :as d]
         '[superv.async :refer [<?? S]])

;; Remove only deleted branches, keep all snapshots
(<?? S (d/gc-storage conn))
;; => #{...} ; set of deleted storage blobs

Running without a date removes only deleted branches—all snapshots on active branches are preserved. This is safe to run anytime and reclaims storage from old experimental branches.

Note: gc-storage returns a core.async channel. Use <?? to block until collection finishes, or skip it and let GC run in the background. GC requires no coordination and won't slow down transactions or reads.
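
To consume the channel without blocking, one option is a go block; this is a sketch using plain core.async:

(require '[clojure.core.async :as async])

;; Run GC in the background and log the result when it finishes
(async/go
  (let [deleted (async/<! (d/gc-storage conn))]
    (println "GC removed" (count deleted) "storage blobs")))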

Grace Periods for Distributed Readers

Datahike's Distributed Index Space allows readers to access storage directly without coordination. This is powerful for scalability but means long-running processes might read from old snapshots for hours.

Examples of long-running readers:

  • Reporting jobs: Generate daily/weekly reports by querying yesterday's snapshot
  • Analytics pipelines: Process historical data over several hours
  • Monitoring dashboards: Display metrics from recent snapshots
  • Backup processes: Copy database state while it's being updated

The grace period ensures these readers don't encounter missing data. Snapshots created after the grace period date are kept; older ones are deleted.

(require '[datahike.api :as d]
         '[superv.async :refer [<?? S]])

;; Keep last 7 days of snapshots
(let [seven-days-ago (java.util.Date. (- (System/currentTimeMillis)
                                         (* 7 24 60 60 1000)))]
  (<?? S (d/gc-storage conn seven-days-ago)))

;; Keep last 30 days (common for compliance)
(let [thirty-days-ago (java.util.Date. (- (System/currentTimeMillis)
                                          (* 30 24 60 60 1000)))]
  (<?? S (d/gc-storage conn thirty-days-ago)))

;; Keep last 24 hours (for fast-moving data)
(let [yesterday (java.util.Date. (- (System/currentTimeMillis)
                                    (* 24 60 60 1000)))]
  (<?? S (d/gc-storage conn yesterday)))
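
Since this date arithmetic repeats, a small helper can compute the cutoff; a sketch (the name days-ago is illustrative, not part of the API):

(defn days-ago
  "Returns a java.util.Date n days before now."
  [n]
  (java.util.Date. (- (System/currentTimeMillis)
                      (* n 24 60 60 1000))))

;; Equivalent to the 7-day example above
(<?? S (d/gc-storage conn (days-ago 7)))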

Choosing a grace period:

  • Consider your longest-running reader process
  • Add buffer time for safety (if longest job is 2 hours, use 4-6 hours)
  • Balance storage costs against reader safety
  • Monitor reader patterns before shortening grace periods

Branch heads are always kept regardless of the grace period—only intermediate snapshots are removed.

Online Garbage Collection (Incremental GC)

⚠️ EXPERIMENTAL FEATURE

Online GC automatically deletes freed index nodes during transaction commits, preventing garbage accumulation during bulk imports and high-write workloads.

Online GC is currently an experimental feature. While it has been tested extensively in Clojure/JVM and includes safety mechanisms for multi-branch databases, use with caution in production. We recommend:

  • Thorough testing in your specific use case before production deployment
  • Monitoring freed address counts to verify expected behavior
  • Using it primarily for bulk imports and high-write workloads where it's most beneficial
  • ClojureScript: Online GC is available in CLJS but has not yet been tested under large bulk loads; JVM testing is more comprehensive
  • Reporting any issues at https://github.com/replikativ/datahike/issues

How Online GC Works

Online GC is ONLY safe for single-branch databases. For multi-branch databases, online GC is automatically disabled because freed nodes from one branch may still be referenced by other branches through structural sharing. Use offline GC (d/gc-storage) for multi-branch cleanup instead.

When PSS (Persistent Sorted Set) index trees are modified during transactions, old index nodes become unreachable. Online GC tracks these freed addresses with timestamps and deletes them incrementally:

  1. During transaction (transient mode): PSS calls markFreed() for each replaced index node
  2. At commit time: Freed addresses older than the grace period are batch-deleted
  3. Multi-branch safety check: If multiple branches detected, GC is skipped entirely
  4. No full tree walk: Only freed addresses are deleted; no expensive tree traversal is required
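
As a conceptual illustration of steps 1-2 (not Datahike's actual internals; all names are illustrative), commit-time eligibility amounts to filtering freed addresses by age and capping the batch size:

;; Conceptual sketch only
(defn eligible-for-deletion
  [freed-addresses grace-period-ms max-batch now-ms]
  (->> freed-addresses
       (filter (fn [{:keys [freed-at]}]
                 (>= (- now-ms freed-at) grace-period-ms)))
       (take max-batch)))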

Key benefits:

  • Prevents unbounded storage growth during bulk imports (single-branch only)
  • Incremental deletion: Small batches per commit, low overhead
  • Grace period support: Safe for concurrent readers accessing old snapshots
  • Multi-branch safety: Automatically disabled to prevent corruption
  • Configurable: Can be disabled, tuned, or run in background

Configuration

Enable online GC in your database config:

;; For bulk imports (no concurrent readers, single-branch)
;; See "Address Recycling" section below for details
{:online-gc {:enabled? true
             :grace-period-ms 0          ;; Recycle immediately
             :max-batch 10000}           ;; Large batches for efficiency
 :crypto-hash? false}                   ;; Required for address recycling

;; For production (concurrent readers)
{:online-gc {:enabled? true
             :grace-period-ms 300000     ;; 5 minutes
             :max-batch 1000}}           ;; Smaller batches

;; Disabled (default)
{:online-gc {:enabled? false}}

Configuration options:

  • :enabled? - Enable/disable online GC (default: false)
  • :grace-period-ms - Minimum age in milliseconds before deletion (default: 60000 = 1 minute)
  • :max-batch - Maximum addresses to delete per commit (default: 1000)
  • :sync? - Perform deletions synchronously (always false inside commits, so deletions run asynchronously)
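
The :online-gc map sits alongside the rest of the database configuration. A minimal end-to-end sketch (the store path is illustrative):

(require '[datahike.api :as d])

(def cfg {:store {:backend :file :path "/tmp/online-gc-demo"}
          :online-gc {:enabled? true
                      :grace-period-ms 300000
                      :max-batch 1000}})

(d/create-database cfg)
(def conn (d/connect cfg))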

Background GC Mode

For production systems, run GC in a background thread instead of blocking commits:

(require '[datahike.online-gc :as online-gc]
         '[clojure.core.async :as async])

;; Start background GC
(def stop-ch (online-gc/start-background-gc!
               (:store @conn)
               {:grace-period-ms 60000    ;; 1 minute
                :interval-ms 10000        ;; Run every 10 seconds
                :max-batch 1000}))

;; Later, stop background GC
(async/close! stop-ch)

Background mode advantages:

  • Non-blocking: Doesn't slow down commits
  • Periodic cleanup: Runs every N milliseconds
  • Graceful shutdown: Close channel to stop
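
A common pattern is to tie the background GC loop to the application lifecycle; a sketch (the wrapper name is illustrative):

(defn with-background-gc
  "Runs f with background GC active, stopping it afterwards."
  [store opts f]
  (let [stop-ch (online-gc/start-background-gc! store opts)]
    (try
      (f)
      (finally
        (async/close! stop-ch)))))

;; Usage: run-import! stands in for your application code
(with-background-gc (:store @conn)
                    {:grace-period-ms 60000 :interval-ms 10000 :max-batch 1000}
                    #(run-import! conn))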

Address Recycling (Bulk Import Optimization)

⚠️ EXPERIMENTAL FEATURE

Address recycling is an experimental optimization. It has been designed with safety checks (multi-branch detection, grace periods), but should be thoroughly tested in your environment before production use.

Online GC includes address recycling—freed addresses are reused for new index nodes instead of being deleted from storage. This optimization is particularly powerful for bulk imports.

How it works:

  1. When index trees are modified, old root addresses are marked as freed
  2. Online GC moves eligible addresses to a freelist (grace period applies)
  3. New index nodes reuse addresses from the freelist instead of generating new UUIDs
  4. LMDB overwrites the recycled address with new data
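
A minimal sketch of the freelist idea itself (conceptual and single-threaded, not Datahike's implementation): freeing appends to a list, and allocation pops from it before minting a new UUID:

(def freelist (atom []))

(defn free-address! [addr]
  (swap! freelist conj addr))         ;; O(1) append instead of a delete

(defn allocate-address! []
  (if-let [addr (peek @freelist)]
    (do (swap! freelist pop) addr)    ;; reuse a freed address
    (java.util.UUID/randomUUID)))     ;; otherwise mint a fresh one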

Benefits:

  • Zero delete operations: Converts O(freed_nodes) deletes to O(1) freelist append
  • Reduces LMDB fragmentation: Addresses are reused rather than accumulating
  • Perfect for bulk imports: With :grace-period-ms 0, recycling happens immediately
  • Minimal overhead: No tree traversal or complex reachability analysis

Safety limitations:

Address recycling is ONLY safe for:

  • Single-branch databases (shared nodes across branches would be corrupted)
  • No long-lived readers (or grace period exceeds reader lifetime)
  • Bulk import scenarios (write-only, no concurrent queries)

Online GC applies these safeguards automatically:

  • Multiple branches exist: online GC is skipped entirely, because freed nodes from one branch may still be referenced by other branches through structural sharing; use offline GC instead
  • :crypto-hash? true: address recycling is disabled and online GC falls back to deletion mode

Bulk Import Configuration

For maximum performance during bulk imports where no concurrent readers exist:

;; Optimal bulk import configuration
{:online-gc {:enabled? true
             :grace-period-ms 0        ;; Recycle immediately (no readers)
             :max-batch 10000}         ;; Large batch (only for delete fallback)
 :crypto-hash? false                  ;; Required for recycling
 :branch :db}                         ;; Single branch only

;; Example bulk import
(let [cfg  {:store {:backend :file :path "/data/bulk-import"}
            :online-gc {:enabled? true :grace-period-ms 0}
            :crypto-hash? false}
      _    (d/create-database cfg)        ;; database must exist before connecting
      conn (d/connect cfg)]
  ;; Import millions of entities (entity-batches is your import data)
  (doseq [batch entity-batches]
    (d/transact conn batch))
  ;; Storage stays bounded - addresses are recycled
  (d/release conn))

Bulk import best practices:

  1. Set :grace-period-ms 0 (no concurrent readers to protect)
  2. Use :crypto-hash? false (enables address recycling)
  3. Stay on single branch (:branch :db)
  4. Increase :max-batch for efficiency (only affects delete fallback)
  5. Monitor freed address counts to verify recycling is working

Verifying address recycling:

  • Check logs for "Online GC: recycling N addresses to freelist"
  • If you see "Online GC: skipped (multi-branch detected)", make sure you are on a single branch; multi-branch databases require offline GC instead
  • Freed address counts should drop to zero after each transaction

Online GC vs Offline GC

Online GC (incremental):

  • Runs during commits
  • Deletes only freed index nodes from recent transactions
  • Fast: No tree traversal required
  • With recycling: No delete operations at all, just freelist management
  • ONLY for single-branch databases - automatically disabled for multi-branch
  • Best for: Bulk imports, high-write workloads

Offline GC (d/gc-storage):

  • Runs manually
  • Deletes entire old snapshots by walking all branches
  • Slower: Full tree traversal and marking
  • Handles multi-branch databases safely through reachability analysis
  • Required for multi-branch databases (online GC doesn't work)
  • Best for: Periodic maintenance, deleting old branches, multi-branch cleanup

Use both: Online GC for incremental cleanup during single-branch writes, offline GC for periodic deep cleaning and all multi-branch scenarios.
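
For the periodic offline pass, a simple scheduler suffices; this sketch uses a Java executor with the 7-day grace period from above:

(import '[java.util.concurrent Executors TimeUnit])

;; Run offline GC once a day, keeping one week of snapshots.
;; Shut the scheduler down with (.shutdown gc-scheduler).
(def gc-scheduler
  (doto (Executors/newSingleThreadScheduledExecutor)
    (.scheduleAtFixedRate
      (fn []
        (let [week-ago (java.util.Date. (- (System/currentTimeMillis)
                                           (* 7 24 60 60 1000)))]
          (<?? S (d/gc-storage conn week-ago))))
      0 24 TimeUnit/HOURS)))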

Automatic Garbage Collection

With online GC enabled, garbage collection becomes largely automatic during normal operation. Manual d/gc-storage runs are only needed for:

  • Deleting old branches
  • Periodic deep cleaning (monthly/quarterly)
  • Compliance-driven snapshot removal

When to Run GC

  • After deleting branches: Immediately reclaim storage
  • Periodic maintenance: Weekly/monthly based on storage growth
  • Storage alerts: When approaching capacity limits
  • Version cleanup: After completing long-running migrations
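
For the first case, branch operations live in datahike.experimental.versioning; assuming its delete-branch! function, the flow looks like this sketch:

(require '[datahike.experimental.versioning :as versioning])

;; Delete an experimental branch, then reclaim its storage
(versioning/delete-branch! conn :my-experiment)
(<?? S (d/gc-storage conn))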

What Gets Deleted

GC removes:

  • Old database snapshots older than the grace period
  • Deleted branches and their snapshots
  • Unreachable index nodes from old snapshots

GC preserves:

  • All current branch heads (always)
  • Snapshots created after the grace period
  • All data on retained snapshots (GC doesn't delete data, only snapshots)

Remember: For deleting specific data (GDPR compliance), use data purging, not garbage collection.
