A B-tree based persistent sorted set for Clojure/Script.
PersistentSortedSet supports:
rseq on slices,lookup to retrieve actual stored keys,replace for single-traversal key updates at same logical position,markFreed.Almost a drop-in replacement for clojure.core/sorted-set, the only difference being this one can’t store nil.
Implementations are provided for Clojure and ClojureScript.
export JAVA8_HOME="/Library/Java/JavaVirtualMachines/jdk1.8.0_202.jdk/Contents/Home"
lein jar
Dependency:
[io.replikativ/persistent-sorted-set "0.3.100"]
The version follows the pattern 0.3.{commit-count} and is automatically incremented with each commit.
Code:
(require '[org.replikativ.persistent-sorted-set :as set])
(set/sorted-set 3 2 1)
;=> #{1 2 3}
(-> (set/sorted-set 1 2 3 4)
(conj 2.5))
;=> #{1 2 2.5 3 4}
(-> (set/sorted-set 1 2 3 4)
(disj 3))
;=> #{1 2 4}
(-> (set/sorted-set 1 2 3 4)
(contains? 3))
;=> true
(-> (apply set/sorted-set (range 10000))
(set/slice 5000 5010))
;=> (5000 5001 5002 5003 5004 5005 5006 5007 5008 5009 5010)
(-> (apply set/sorted-set (range 10000))
(set/rslice 5010 5000))
;=> (5010 5009 5008 5007 5006 5005 5004 5003 5002 5001 5000)
(set/sorted-set-by > 1 2 3)
;=> #{3 2 1}
;; lookup returns the actual stored key
(-> (set/sorted-set 1 2 3 4)
(set/lookup 3))
;=> 3
(-> (set/sorted-set 1 2 3 4)
(set/lookup 99 :not-found))
;=> :not-found
;; replace updates a key at the same logical position
;; (old-key and new-key must compare equal)
(defn cmp-by-id [[id1 _] [id2 _]] (compare id1 id2))
(-> (set/sorted-set-by cmp-by-id [1 :old] [2 :old] [3 :old])
(set/replace [2 :old] [2 :new]))
;=> #{[1 :old] [2 :new] [3 :old]}
One can also efficiently seek on the iterators.
(-> (seq (into (set/sorted-set) (range 10)))
(set/seek 5))
;; => (5 6 7 8 9)
(-> (into (set/sorted-set) (range 100))
(set/rslice 75 25)
(set/seek 60)
(set/seek 30))
;; => (30 29 28 27 26 25)
This project uses CircleCI for continuous integration and deployment:
0.3.{commit-count}, automatically calculatedmain branch, successful builds are automatically:
To ensure consistent code style for easier PR review:
# Check formatting
clj -M:format
# Auto-fix formatting issues
clj -M:ffix
Clojure version allows efficient storage of Persistent Sorted Set on disk/DB/anywhere.
To do that, implement IStorage interface:
(defrecord Storage [*storage]
IStorage
(store [_ node]
(let [address (random-uuid)]
(swap! *storage assoc address
(pr-str
(if (instance? Branch node)
{:level (.level ^Branch node)
:keys (.keys ^Branch node)
:addresses (.addresses ^Branch node)}
(.keys ^Leaf node))))
address))
(restore [_ address]
(let [value (-> (get @*storage address)
(edn/read-string))]
(if (map? value)
(Branch. (int (:level value)) ^java.util.List (:keys value) ^java.util.List (:addresses value))
(Leaf. ^java.util.List value))))
(markFreed [_ address]
;; Optional: track addresses that become obsolete during modifications.
;; Called automatically when tree nodes are replaced during conj/disj.
;; Enables garbage collection of unreachable nodes.
nil)
(accessed [_ address]
;; Optional: track node access for cache management (e.g., LRU).
nil)
(isFreed [_ address]
;; Optional: check if address has been marked as freed.
false)
(freedInfo [_ address]
;; Optional: return debug information about freed addresses.
nil))
Storing Persistent Sorted Set works per node. This will save each node once:
(def set
(into (set/sorted-set) (range 1000000)))
(def storage
(Storage. (atom {})))
(def root
(set/store set storage))
If you try to store once again, no store operations will be issued:
(assert
(= root
(set/store set storage)))
If you modify set and store new one, only nodes that were changed will be stored. For a tree of depth 3, it’s usually just ~3 nodes. The root will be new, though:
(def set2
(into set [-1 -2 -3]))
(assert
(not= root
(set/store set2 storage)))
Finally, one can construct a new set from its stored snapshot. You’ll need address for that:
(def set-lazy
(set/restore root storage))
Restore operation is lazy. By default it won’t do anything, but when you start accessing returned set, IStorage::restore operations will be issued and part of the set will be reconstructed in memory. Only nodes needed for a particular operation will be loaded.
E.g. this will load ~3 nodes for a set of depth 3:
(first set-lazy)
This will load ~50 nodes on default settings:
(take 5000 set-lazy)
Internally Persistent Sorted Set does not caches returned nodes, so don’t be surprised if subsequent first loads the same nodes again. One must implement cache inside IStorage implementation for efficient retrieval of already loaded nodes. Also see IStorage::accessed for access stats, e.g. for LRU.
Any operation that can be done on in-memory PSS can be done on a lazy one, too. It will fetch required nodes when needed, completely transparently for the user. Lazy PSS can exist arbitrary long without ever being fully realized in memory:
(def set3
(conj set-lazy [-1 -2 -3]))
(def set4
(disj set-lazy [4 5 6 7 8]))
(contains? set-lazy 5000)
Last piece of the puzzle: set/walk-addresses. Use it to check which nodes are actually in use by current PSS and optionally clean up garbage in your storage that is not referenced by it anymore:
(let [*alive-addresses (volatile! [])]
(set/walk-addresses set #(vswap! *alive-addresses conj %))
@*alive-addresses)
See test_storage.clj for more examples.
ClojureScript also supports durable storage with async operations. The IStorage interface works the same way, but store and restore methods return promises/async values instead of direct values. This allows integration with IndexedDB, remote storage APIs, and other async storage backends.
Count elements in a range without iterating through them:
(def s (into (set/sorted-set) (range 10000)))
;; Count elements in [1000, 2000]
(set/count-slice s 1000 2000)
;=> 1001
;; Count from beginning to 500
(set/count-slice s nil 500)
;=> 501
;; Count from 9000 to end
(set/count-slice s 9000 nil)
;=> 1000
;; Use custom comparator
(set/count-slice s 1000 2000 my-comparator)
This uses O(log n) traversal by leveraging subtree counts stored in branch nodes.
Access elements by their position (rank) in O(log n) time, enabling efficient percentile and quantile queries:
(def s (into (set/sorted-set* {:measure (NumericStatsOps.)})
(range 1000)))
;; Get median (50th percentile) - O(log n)
(set/get-nth s 500)
;=> [500 0] ; [value local-offset]
;; Get 95th percentile
(set/get-nth s 950)
;=> [950 0]
;; Helper function for percentiles
(defn percentile [s p]
(let [n (count s)
rank (long (* n p))]
(first (set/get-nth s rank))))
(percentile s 0.5) ;=> 500 (median)
(percentile s 0.95) ;=> 950 (95th percentile)
(percentile s 0.25) ;=> 250 (first quartile)
get-nth returns [entry local-offset] where local-offset is useful for weighted statistics (each entry can represent multiple elements). Returns nil if the rank is out of bounds.
Requirements: Must configure :measure with a weight implementation. The built-in NumericStatsOps uses count as weight (each element has weight 1).
Use cases:
ClojureScript provides the same get-nth API:
(require '[org.replikativ.persistent-sorted-set :as set])
(require '[org.replikativ.persistent-sorted-set.impl.numeric-stats :as nstats])
(def s (into (set/sorted-set* {:measure nstats/numeric-stats-ops})
(range 1000)))
(set/get-nth s 500)
;=> [500 0]
Both sync and async modes are supported via the :sync? option.
PersistentSortedSet can maintain aggregate statistics that update incrementally as elements are added or removed. This enables O(log n) queries for sum, count, min, max, variance, etc. over any range.
(import '[org.replikativ.persistent_sorted_set NumericStatsOps])
;; Create set with numeric stats tracking
(def s (into (set/sorted-set* {:measure (NumericStatsOps.)})
[1 2 3 4 5]))
;; Get measure for entire set
(set/measure s)
;=> NumericStats{count=5, sum=15.0, min=1, max=5, ...}
;; Get measure for a range [2, 4]
(set/measure-slice s 2 4)
;=> NumericStats{count=3, sum=9.0, min=2, max=4, ...}
The NumericStats object provides count, sum, sumSq, min, max, plus derived mean(), variance(), and stdDev() methods.
Statistics must form a monoid - they need an identity element and an associative merge operation. Implement the IMeasure interface:
public interface IMeasure<Key, S> {
// Identity element: merge(identity, x) == x
S identity();
// Extract measure from a single key
S extract(Key key);
// Associative merge: merge(a, merge(b, c)) == merge(merge(a, b), c)
S merge(S s1, S s2);
// Remove a key's contribution (see below)
S remove(S current, Key key, Supplier<S> recompute);
}
Example: Counting distinct categories
public class CategoryStats implements IMeasure<Item, Map<String, Long>> {
public Map<String, Long> identity() {
return Collections.emptyMap();
}
public Map<String, Long> extract(Item item) {
return Map.of(item.category(), 1L);
}
public Map<String, Long> merge(Map<String, Long> a, Map<String, Long> b) {
Map<String, Long> result = new HashMap<>(a);
b.forEach((k, v) -> result.merge(k, v, Long::sum));
return result;
}
public Map<String, Long> remove(Map<String, Long> current, Item item,
Supplier<Map<String, Long>> recompute) {
// For invertible stats, compute directly
Map<String, Long> result = new HashMap<>(current);
result.computeIfPresent(item.category(), (k, v) -> v > 1 ? v - 1 : null);
return result;
}
}
Some statistics like min and max can't be updated incrementally when removing elements - if you remove the minimum, you need to scan to find the new one. The remove method receives a recompute supplier for this:
public S remove(S current, Key key, Supplier<S> recompute) {
if (affectsResult(current, key)) {
// Can't compute incrementally, recompute from children
return recompute.get();
}
// Safe to compute incrementally
return subtractKey(current, key);
}
The built-in NumericStatsOps handles this automatically for min/max.
For ClojureScript, implement the IMeasure protocol from org.replikativ.persistent-sorted-set.impl.measure:
(require '[org.replikativ.persistent-sorted-set :as set])
(require '[org.replikativ.persistent-sorted-set.impl.stats :as stats])
(require '[org.replikativ.persistent-sorted-set.impl.numeric-stats :as numeric-stats])
;; Use built-in numeric stats
(def s (into (set/sorted-set* {:measure numeric-stats/numeric-stats-ops})
[1 2 3 4 5]))
;; Or implement custom measure via the IMeasure protocol
(defrecord MyStats [count sum])
(def my-stats-ops
(reify measure/IMeasure
(identity-measure [_]
(->MyStats 0 0))
(extract [_ key]
(->MyStats 1 key))
(merge-measure [_ s1 s2]
(->MyStats (+ (:count s1) (:count s2))
(+ (:sum s1) (:sum s2))))
(remove-measure [_ current key recompute-fn]
(->MyStats (dec (:count current))
(- (:sum current) key)))))
(def s (into (set/sorted-set* {:measure my-stats-ops}) [1 2 3 4 5]))
To reproduce:
[com.datomic/datomic-free "0.9.5703"] locally.lein bench.PersistentTreeSet is Clojure’s Red-black tree based sorted-set.
BTSet is Datomic’s B-tree based sorted set (no transients, no disjoins).
PersistentSortedSet is this implementation.
Numbers I get on my 3.2 GHz i7-8700B:
PersistentTreeSet 143..165ms
BTSet 125..141ms
PersistentSortedSet 105..121ms
PersistentSortedSet (transient) 50..54ms
PersistentTreeSet 51..54ms
BTSet 45..47ms
PersistentSortedSet 46..47ms
PersistentTreeSet 70..77ms
PersistentSortedSet 10..11ms
PersistentTreeSet 116..124ms
BTSet 92..105ms
PersistentSortedSet 56..68ms
For PersistentTreeSet we use ISeq produced by (take-while #(<= % 999999) (.seqFrom set 1 true)).
For PersistentSortedSet we use (.slice set 1 999999).
PersistentTreeSet 238..256ms
PersistentSortedSet 70..91ms
PersistentTreeSet 151..155ms
PersistentSortedSet 91..98ms
PersistentSortedSet (transient) 47..50ms
Copyright © 2019 Nikita Prokopov Copyright © 2024 Christian Weilbach
Licensed under MIT (see LICENSE).
Can you improve this documentation? These fine people already did:
Nikita Prokopov, Christian Weilbach & FiVoEdit on GitHub
cljdoc builds & hosts documentation for Clojure/Script libraries
| Ctrl+k | Jump to recent docs |
| ← | Move to previous article |
| → | Move to next article |
| Ctrl+/ | Jump to the search field |