Datahike uses persistent data structures that enable structural sharing—each update creates a new version efficiently by reusing unchanged parts. This allows time-travel queries and git-like versioning, but storage grows over time as old snapshots accumulate.
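For example, every transaction yields a new immutable database value that remains queryable later. A minimal sketch, assuming a connection `conn` to a database created with `:schema-flexibility :read` and history enabled:

```clojure
(require '[datahike.api :as d])

;; Each transaction creates a new snapshot; older ones remain queryable
(d/transact conn {:tx-data [{:name "Alice"}]})
(def before (java.util.Date.))
(d/transact conn {:tx-data [{:name "Bob"}]})

;; Time-travel query: the database as it was before the second transaction
(d/q '[:find ?n :where [_ :name ?n]]
     (d/as-of @conn before))
;; => #{["Alice"]}
```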
Garbage collection removes old database snapshots from storage while preserving current branch heads.
Don't confuse garbage collection with data purging: GC reclaims storage from old snapshots and deleted branches, while purging removes specific data (for example, for GDPR compliance) from the database itself.
GC whitelists all current branches and marks snapshots as reachable based on a grace period. Snapshots older than the grace period are deleted from storage, but branch heads are always retained regardless of age.
```clojure
(require '[datahike.api :as d]
         '[superv.async :refer [<?? S]])

;; Remove only deleted branches, keep all snapshots
(<?? S (d/gc-storage conn))
;; => #{...} ; set of deleted storage blobs
```
Running without a date removes only deleted branches—all snapshots on active branches are preserved. This is safe to run anytime and reclaims storage from old experimental branches.
Note: d/gc-storage returns a core.async channel. Use <?? to block on the result, or drop the channel for background execution. GC requires no coordination and won't slow down transactions or reads.
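Both usage patterns, sketched briefly (`deleted` and `gc-ch` are illustrative names):

```clojure
;; Block until GC finishes and inspect what was removed
(def deleted (<?? S (d/gc-storage conn)))
(println "Freed" (count deleted) "storage blobs")

;; Fire and forget: keep the channel only if you want the result later
(def gc-ch (d/gc-storage conn))
```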
Datahike's Distributed Index Space allows readers to access storage directly without coordination. This is powerful for scalability but means long-running processes might read from old snapshots for hours.
Examples of long-running readers include analytics queries, batch exports, and other processes that hold on to a database value for an extended time.
The grace period ensures these readers don't encounter missing data. Snapshots created after the grace period date are kept; older ones are deleted.
```clojure
(require '[datahike.api :as d]
         '[superv.async :refer [<?? S]])

;; Keep last 7 days of snapshots
(let [seven-days-ago (java.util.Date. (- (System/currentTimeMillis)
                                         (* 7 24 60 60 1000)))]
  (<?? S (d/gc-storage conn seven-days-ago)))

;; Keep last 30 days (common for compliance)
(let [thirty-days-ago (java.util.Date. (- (System/currentTimeMillis)
                                          (* 30 24 60 60 1000)))]
  (<?? S (d/gc-storage conn thirty-days-ago)))

;; Keep last 24 hours (for fast-moving data)
(let [yesterday (java.util.Date. (- (System/currentTimeMillis)
                                    (* 24 60 60 1000)))]
  (<?? S (d/gc-storage conn yesterday)))
```
Choosing a grace period: pick a value longer than your longest-running reader. The examples above show common choices: 24 hours for fast-moving data, 7 days as a general default, and 30 days when compliance requires it.
Branch heads are always kept regardless of the grace period—only intermediate snapshots are removed.
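A minimal sketch of running this on a schedule; `schedule-daily-gc!` and the executor setup are illustrative helpers, not part of Datahike's API, and the requires from the examples above are assumed:

```clojure
(import '[java.util.concurrent Executors TimeUnit])

(defonce gc-scheduler (Executors/newSingleThreadScheduledExecutor))

(defn schedule-daily-gc!
  "Illustrative helper: once a day, delete snapshots older than 7 days."
  [conn]
  (.scheduleAtFixedRate gc-scheduler
                        (fn []
                          (let [cutoff (java.util.Date.
                                        (- (System/currentTimeMillis)
                                           (* 7 24 60 60 1000)))]
                            (<?? S (d/gc-storage conn cutoff))))
                        0 24 TimeUnit/HOURS))
```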
⚠️ EXPERIMENTAL FEATURE
Online GC automatically deletes freed index nodes during transaction commits, preventing garbage accumulation during bulk imports and high-write workloads.
Online GC is currently an experimental feature. While it has been tested extensively in Clojure/JVM and includes safety mechanisms for multi-branch databases, use with caution in production. We recommend:
- Thorough testing in your specific use case before production deployment
- Monitoring freed address counts to verify expected behavior
- Using it primarily for bulk imports and high-write workloads where it's most beneficial
- ClojureScript: online GC is available in CLJS but has not yet been tested on large bulk loads; JVM testing is more comprehensive.
- Reporting any issues at https://github.com/replikativ/datahike/issues
Online GC is ONLY safe for single-branch databases. For multi-branch databases, online GC is automatically disabled because freed nodes from one branch may still be referenced by other branches through structural sharing. Use offline GC (d/gc-storage) for multi-branch cleanup instead.
When PSS (Persistent Sorted Set) index trees are modified during transactions, old index nodes become unreachable. The PSS layer calls markFreed() for each replaced index node, and online GC tracks the freed addresses with timestamps and deletes them incrementally at commit time.
Key benefits: garbage does not accumulate during bulk imports and high-write workloads, and cleanup work is spread across commits instead of requiring a separate offline pass.
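Conceptually (an illustrative sketch, not Datahike's actual implementation), the freed addresses form a timestamped freelist that is pruned in bounded batches:

```clojure
(defn collect-deletable
  "Returns up to max-batch addresses freed at least grace-period-ms ago."
  [freelist grace-period-ms max-batch now-ms]
  (->> freelist
       (filter (fn [[_addr freed-at-ms]]
                 (>= (- now-ms freed-at-ms) grace-period-ms)))
       (map first)
       (take max-batch)))

(collect-deletable [["addr1" 0] ["addr2" 30000] ["addr3" 99000]]
                   60000 1000 100000)
;; => ("addr1" "addr2") ; "addr3" is still within the grace period
```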
Enable online GC in your database config:
```clojure
;; For bulk imports (no concurrent readers, single-branch)
;; See "Address Recycling" section below for details
{:online-gc {:enabled? true
             :grace-period-ms 0       ;; Recycle immediately
             :max-batch 10000}        ;; Large batches for efficiency
 :crypto-hash? false}                 ;; Required for address recycling

;; For production (concurrent readers)
{:online-gc {:enabled? true
             :grace-period-ms 300000  ;; 5 minutes
             :max-batch 1000}}        ;; Smaller batches

;; Disabled (default)
{:online-gc {:enabled? false}}
```
Configuration options:
- :enabled? - Enable/disable online GC (default: false)
- :grace-period-ms - Minimum age in milliseconds before deletion (default: 60000 = 1 minute)
- :max-batch - Maximum addresses to delete per commit (default: 1000)
- :sync? - Synchronous deletion (always false inside commits for async operation)

For production systems, run GC in a background thread instead of blocking commits:
```clojure
(require '[datahike.online-gc :as online-gc]
         '[clojure.core.async :as async])

;; Start background GC
(def stop-ch (online-gc/start-background-gc!
              (:store @conn)
              {:grace-period-ms 60000  ;; 1 minute
               :interval-ms 10000      ;; Run every 10 seconds
               :max-batch 1000}))

;; Later, stop background GC
(async/close! stop-ch)
```
Background mode advantages: deletion work runs off the commit path, so transactions are never slowed down, and freed addresses are reclaimed continuously at the configured interval.
⚠️ EXPERIMENTAL FEATURE
Address recycling is an experimental optimization. It has been designed with safety checks (multi-branch detection, grace periods), but should be thoroughly tested in your environment before production use.
Online GC includes address recycling—freed addresses are reused for new index nodes instead of being deleted from storage. This optimization is particularly powerful for bulk imports.
How it works: freed index-node addresses are placed on a freelist, and new index nodes are written to recycled addresses instead of freshly allocated ones, so the freed blobs never need to be deleted from storage.

Benefits: storage stays bounded during sustained writes, and with :grace-period-ms 0, recycling happens immediately.

Safety limitations: Address recycling is ONLY safe for single-branch databases configured with :crypto-hash? false and without concurrent readers that might still reference recycled addresses. Online GC is automatically disabled when multiple branches are detected, and combining :crypto-hash? true with recycling falls back to deletion mode.

For maximum performance during bulk imports where no concurrent readers exist:
```clojure
;; Optimal bulk import configuration
{:online-gc {:enabled? true
             :grace-period-ms 0  ;; Recycle immediately (no readers)
             :max-batch 10000}   ;; Large batch (only for delete fallback)
 :crypto-hash? false             ;; Required for recycling
 :branch :db}                    ;; Single branch only

;; Example bulk import
(let [cfg {:store {:backend :file :path "/data/bulk-import"}
           :online-gc {:enabled? true :grace-period-ms 0}
           :crypto-hash? false}
      conn (d/connect cfg)]
  ;; Import millions of entities
  (doseq [batch entity-batches]
    (d/transact conn {:tx-data batch}))
  ;; Storage stays bounded - addresses are recycled
  (d/release conn))
```
Bulk import best practices:
- Set :grace-period-ms 0 (no concurrent readers to protect)
- Set :crypto-hash? false (enables address recycling)
- Stay on a single branch (the default :branch :db)
- Raise :max-batch for efficiency (only affects the delete fallback)

Verifying address recycling:
"Online GC: recycling N addresses to freelist""Online GC: skipped (multi-branch detected)", ensure single branch
(multi-branch databases require offline GC instead)Online GC (incremental):
Offline GC (d/gc-storage):
Use both: Online GC for incremental cleanup during single-branch writes, offline GC for periodic deep cleaning and all multi-branch scenarios.
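Putting the two together, a sketch assuming the database at the illustrative store path already exists:

```clojure
;; Online GC in the database config handles incremental cleanup...
(def cfg {:store {:backend :file :path "/data/app-db"}
          :online-gc {:enabled? true :grace-period-ms 300000}})
(def conn (d/connect cfg))

;; ...while a periodic offline pass removes snapshots older than 7 days
(let [cutoff (java.util.Date. (- (System/currentTimeMillis)
                                 (* 7 24 60 60 1000)))]
  (<?? S (d/gc-storage conn cutoff)))
```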
With online GC enabled, garbage collection becomes largely automatic during normal operation. Manual d/gc-storage runs are only needed for multi-branch databases, for deleting snapshots past a grace period, and for cleaning up deleted branches.
GC removes: snapshots older than the grace period, branches that have been deleted, and index nodes no longer reachable from any retained snapshot.

GC preserves: branch heads (regardless of age), snapshots newer than the grace period, and the current state of all active branches.
Remember: For deleting specific data (GDPR compliance), use data purging, not garbage collection.
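For reference, a hedged sketch of purging; the exact purge operations are assumed here (see the data purging documentation), and `user-eid` is an illustrative entity id:

```clojure
;; Assumed purge operation: permanently remove an entity and its history
(d/transact conn {:tx-data [[:db.purge/entity user-eid]]})
```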