Index trees are modeled after B+ trees and contain both internal index nodes and leaf partitions. Records in the tree are sorted by their id key, which uniquely identifies each record within the table.
There are three groups of attributes which are used in the index tree structure. At the table level, two input parameters control how the data tree is shaped:
:merkle-db.index/fan-out
An integer which restricts the maximum number of children an index node can
have. In a tree with fan-out f
, every index node except the root
must have between ceiling(f/2)
and f
children. The root is allowed to have
between 2 and f
children. The fan-out must be at least 4, but is often much
higher to reduce tree height (default: 256).:merkle-db.partition/limit
This limit restricts the maximum number of records a single partition can
hold. This effectively scales the size of the units of parallelism available
to batch jobs. In a tree with partition limit p
every partition must contain
between ceiling(p/2)
and p
records, unless the root is a partition
(implying that there are fewer than p
records total).Inside the tree, every node maintains three bookkeeping properties which define the scope of the subtree under that node:
:merkle-db.record/count
:merkle-db.record/first-key
:merkle-db.record/last-key
Index nodes additionally use these attributes to store links and information about their children:
:merkle-db.index/height
:merkle-db.index/keys
:merkle-db.index/children
Record data can be read from the index tree in two fashions; a retrieval of a batch of keys for specific records, or a linear scan of some range of the data.
In the batch case, a simple recursive algorithm is used to divide up the keys at each node and forward them towards the child that contains the range the requested keys are in. Once a partition is reached, the records can be looked up from the tablets and returned back up the recursion stack.
For scanning a range of records, the start and end keys (if provided) can be used to eliminate parts of the tree which do not contain any records in the range. The recursion continues down the first eligible child at each level until it reaches a partition, at which point the records are scanned out starting from the first valid key. This can be modeled as a lazy sequence which only expands a node once the first record from that subtree is requested.
Updating an index tree starts with the root node and a batch of changes to apply to it. The changes may be either insertions supplying new field data for a certain record key or deletions which remove a record.
The batch-update algorithm has the following goals:
O(log_b(n))
nodes for n
partitions updated.Note that the first and third points here also imply that an update should not rewrite nodes that are not being changed if they no longer match the tree parameters. For example, changing the fanout from 4 to 6 means that any index nodes with only two children are no longer valid. An update which doesn't touch those nodes would leave them in the tree as-is for a later process to fix-up.
In the simplest case, we are updating an empty index tree. This is represented by a nil tree root.
If the tree has fewer than :merkle-db.partition/limit
records in it, the tree
will have a single partition node as the root. To update it:
In the general case, there is a tree of existing nodes including one or more layers of intermediate index nodes and a collection of partition leaves. The tree must be carefully updated to preserve the structural parameters. Unlike a regular B+ tree, which can be extended to batches of inserts or deletes, here we want to perform batch updates which include both updates and deletes. This means that some nodes in the tree may need to split while other parts need to merge, and we need to reconcile these while changing as few nodes as possible.
To understand how this is accomplished, let's consider the update to a generic individual index node in the tree. What would such a function look like? For inputs, first we'll need the graph store (so we can store and retrieve nodes) and a set of tree-level parameters specifying the fan-out, partition limit, and field families. We will assume these as a given throughout the algorithm.
Next we'll need the data for the node we're operating on, and the sequence of changes to apply to it. Generally, the changes should fall into the range of record keys that the index subtree covers. Finally, we'll accept a carry input, which will be explained later.
Initially, the obvious choice for an output is the updated version of the input
node. However, this fails to account for the fact that we may need to split the
node if there are many additions, or merge it if there are many deletions.
Instead, the function will return a vector of nodes, along with an integer
specifying the height of the results. If we input an index node n
at height
h
and had no splits or merges, we'd expect to get the result [h [n']]
.
The first thing to do is to take the set of input record changes and divide them up according to which child subtree they belong to. Each index node maintains a list of child links and split keys, so portion out the changes to bucket them by record key.
The result is a sequence of tuples containing the child and a vector of the record changes assigned to that child. Children with no changes will appear with a nil second value.
The next thing to consider is whether the update to this node has a carried input. If so, we need to determine whether to adopt the carried elements as children of the current node or keep carrying them down the subtree. Since carried inputs have the same form as an update result vector, we can check whether the height of the carry is one less than the height of the current node.
If so, adopt the elements as children of the current node by prepending them to the sequence of children and changes as nodes with no changes (as they have already been applied earlier).
If there is a carried input that is not adoptable because it falls deeper in the tree than the current node, save it as the first carried input into the sequence of children.
If there is no carried input, there's nothing to do.
Begin a loop to build up a new vector of updated child nodes. Recurse into the first child, using the carry if present. Examine the result - if the height is the same as the original child, add the elements to the output vector. Otherwise, carry the result into the update on the next child.
After processing all of the original children, there are four possible scenarios:
In the first two cases we are done with the update to this node and can return the result. In the latter two we have further processing to do. For carry-backs, we need to re-run update logic on the final output child. The result will either be an updated set of direct children or another carry output, in which case the process repeats until the results are valid or we're reduced back to a carry output.
Updating partitions actually requires loading data into memory, so the algorithm
constrains the total number of partitions loaded to double the partition limit
p
. Updated sets of records are buffered as the children are processed,
emitting full partitions where possible, merging in underflowing partitions as
needed. The buffer is filled until at least 150% of the partition limit has been
reached, then a 75% full partition is emitted, leaving (at worst) another 75%
full partition if there are no more adjacent to process.
If the result was a sequence of updated (direct) children, then we can now construct a new index node to contain them. Compare the number of children available to the tree's fan-out - if there are not enough children to make a valid half-full node, return the vector of children for carrying. Otherwise divide the children into valid-sized branching groups and build index nodes to contain them.
At this point we've completed the update algorithm for a single node, the result of which was either a sequence of one or more index nodes at the original level, or a sequence of sibling nodes carried up from further down the tree.
After recursively applying changes to the root node and unwinding back to the top of the tree, examine the result.
Can you improve this documentation?Edit on GitHub
cljdoc is a website building & hosting documentation for Clojure/Script libraries
× close