All the features Eva provides with respect to time can be a bit confusing. There are at least three different functions that yield access to the contents of the database with respect to time. This document exists to explain at a very high level how each of these methods work, and the distinctions between them. First, tl;dr:
db
provides access to a snapshot of the most recent set of added datoms in the database known at call time.asOf
provides access to a snapshot of the set of added datoms in the database at some point in time.history
provides access to a snapshot of the set of all datoms across time.We are able to yield the above views of the database in a performant manner
via the maintenance of a variety of indexes. You may have read elsewhere
that there are four primary sorts of datoms that we index: EAVT, AEVT, AVET,
VAET. What is not always expressed is that there are actually two indexes
maintained for each, one over the set of currently asserted datoms in the
database and one for all datoms ever asserted or retracted. The calls to
db
and asOf
rely on the set of four 'currently asserted' indexes,
whereas history
relies on the 'history' indexes that contain all datoms.
The datastructure we use for the indexes is our own implementation of a perisistent Bε-tree (B-epsilon-tree). New flushes to the indexes happen periodically and asynchronously, independent of the main transaction process, which is made durable through a write-ahead-log. This log also contains pointers to the most-recently committed states of the indexes, which permits it to act as a universal point-in-time source of truth for the state of the system.
The primary distinction between the 'current' and 'history' indexes is
that with the 'current' indexes we will actually elide the datoms that
have been retracted when generating the new index states. In the
historic, we readd the datom to the indexes, but with a flag that indicates
that it was retracted at the transaction given in the datom. This allows
us to have an implementation of the current database that is
O(currently-asserted), which is preferable for db
and asOf
.
db
The db
function realizes the most-recently commit database snapshot by
examining the head entry of the transaction log. The index roots are
fetched from the transaction log entry. Since the last flush of the roots
may be several transactions behind the current log entry, a small recent
range of the log is fetched and 'replayed' in memory on top of the indexes
to form an overlay for the intervening transactions. At this point, we
have the most recent viable state of the database, which is then yield
to the user, completing the call to db
.
asOf
The asOf
function behaves almost identically to db
. The primary
distinction is that we rely not on the most recent transaction log entry,
but on an entry specified by the user. This entry will have pointers
to a set of previous arbitrary persistent index roots. These roots are
then used as the basis for reconstructing state, as opposed to the most
recent. Once the database state with respect to the desired point in time
has been constructed, we return the database snapshot to the user.
history
The history
function, as discussed, relies on different indexes than
db
and asOf
, however, these 'historic' indexes are maintained the same
manner as the 'current' indexes. They are flushed periodically
and are referenced in the transaction log entries. Calling 'history' on
a database generated by db
or asOf
will yield the 'same'
database as of the desired point in time, but will contain a strict
superset of the datoms in the originating database. All reads on this
database will be performed against the persistent version of the 'history'
indexes as of the desired point in time, and so will yield not only the
currently asserted datoms, but all previously asserted datoms and
their corresponding retractions.
history
LimitationsAs a result of the 'history' database containing all datoms across time,
it is not able to support the full Database interface. Primarily, this
is the result of many of the assumptions we rely on being made with
respect to time. For example, consider :db.cardinality/one attributes.
The cardinality invariant is maintained with respect to any single
point in time, however, across time, this may be violated, since the
value of the cardinality one attribute can vary over timesteps. As a
result, we cannot rely on many of the assumptions required for features.
As of now, the history
endpoint will support raw index access via
datoms
and can be passed to the q
endpoint for queries, since the
query engine is essentially agnostic to the database itself.
Below is a minimal example to clarify the contracts provided by these various database snapshots.
(defn state-and-time-examples []
(let [conn (eva/connect {:local true})
;; in our first transaction we want to create a new entity
tx1 [[:db/add (eva/tempid :db.part/user -1) :db/doc "new!"]]
res1 @(eva/transact conn tx1)
tids (:tempids res1)
pid (-> tids first val) ;; fetch the permanent id generated for the ent
;; in our second transaction, we want to modify the doc on the entid
tx2 [[:db/add pid :db/doc "actually, this doc is better"]]
res2 @(eva/transact conn tx2)
;; grab the transaction id
tx2-eid (:tx (first (:tx-data res2)))
;; in our third transaction, we decide that we don't actually want
;; the entity at all.
tx3 [[:db.fn/retractEntity pid]]
_ @(eva/transact conn tx3)
;; for convenience , lets make a query to use later
query '[:find ?e ?a ?v ?tx ?added
:in $ ?e
:where
[?e ?a ?v ?tx ?added]]]
;; now, given the 'interesting' history of the above entity id, let us
;; look at it using the different capabilities of eva.
(let [most-recent-db (eva/db conn)]
(println "\n1) What does the entity look like *right now*?")
(println "by datoms:")
(clojure.pprint/pprint (eva/datoms most-recent-db :eavt pid))
(println "by query:")
(clojure.pprint/pprint (eva/q query most-recent-db pid))
(let [most-recent-history (eva/history most-recent-db)]
(println "\n2) What does the *history* of the entity look like *right now*?")
(println "by datoms:")
(clojure.pprint/pprint (eva/datoms most-recent-history :eavt pid))
(println "by query:")
;; NOTE: I'm using the same query here
(clojure.pprint/pprint (eva/q query most-recent-history pid))))
(let [before-retract (.asOf (eva/db conn) tx2-eid)]
(println "\n3) What does the entity look like *before the last transaction*?")
(println "by datoms:")
(clojure.pprint/pprint (eva/datoms before-retract :eavt pid))
(println "by query:")
(clojure.pprint/pprint (eva/q query before-retract pid))
(let [before-retract-history (eva/history before-retract)]
(println "\n4) What does the *history* of the entity look like *before the last transaction*?")
(println "by datoms:")
(clojure.pprint/pprint (eva/datoms before-retract-history :eavt pid))
(println "by query:")
(clojure.pprint/pprint (eva/q query before-retract-history pid))))))
;; yields:
(state-and-time-examples)
1) What does the entity look like *right now*?
by datoms:
#{}
by query:
[]
2) What does the *history* of the entity look like *right now*?
by datoms:
#{#datom[8796093023245
9
"actually, this doc is better"
4398046511146
false]
#datom[8796093023245
9
"actually, this doc is better"
4398046511145
true]
#datom[8796093023245 9 "new!" 4398046511145 false]
#datom[8796093023245 9 "new!" 4398046511144 true]}
by query:
[[8796093023245 9 "new!" 4398046511145 false]
[8796093023245 9 "actually, this doc is better" 4398046511145 true]
[8796093023245 9 "actually, this doc is better" 4398046511146 false]
[8796093023245 9 "new!" 4398046511144 true]]
3) What does the entity look like *before the last transaction*?
by datoms:
#{#datom[8796093023245
9
"actually, this doc is better"
4398046511145
true]}
by query:
[[8796093023245 9 "actually, this doc is better" 4398046511145 true]]
4) What does the *history* of the entity look like *before the last transaction*?
by datoms:
#{#datom[8796093023245
9
"actually, this doc is better"
4398046511145
true]
#datom[8796093023245 9 "new!" 4398046511145 false]
#datom[8796093023245 9 "new!" 4398046511144 true]}
by query:
[[8796093023245 9 "new!" 4398046511145 false]
[8796093023245 9 "actually, this doc is better" 4398046511145 true]
[8796093023245 9 "new!" 4398046511144 true]]
nil
Absolutely! The query engine is almost entirely agnostic to the details of the
database and vice-versa. There's a single ~50 line function shared by the database
snapshots that allows the query engine to use the database as a source of
information, and that's it. There are no restrictions on the number or types of
database snapshots you can feed into the query engine. We can reuse the databases
and scaffolding in the above state-and-time-examples
to demonstrate this:
;; say I want to see all of the changes to my entity id *not* including the
;; 'current' state. In other words, the difference of the 'now' database
;; and the 'history' database w.r.t. the provided entity id.
(let [not-now-but-before-query
'[:find ?e ?a ?v ?tx ?added
:in $db-now $db-now-history ?e
:where
(not [$db-now ?e ?a ?v ?tx ?added])
[$db-now-history ?e ?a ?v ?tx ?added]]
most-recent-db (eva/db conn)
most-recent-history (eva/history most-recent-db)
before-retract (.asOf most-recent-db tx2-eid)
before-retract-history (eva/history before-retract)]
(println "1) Running the query on the most recent database:")
(clojure.pprint/pprint (eva/q not-now-but-before-query most-recent-db most-recent-history pid))
(println "2) Running the query on the database before retraction:")
(clojure.pprint/pprint (eva/q not-now-but-before-query before-retract before-retract-history pid)))
;; yields:
1) Running the query on the most recent database:
[[8796093023247 9 "actually, this doc is better" 4398046511152 false]
[8796093023247 9 "actually, this doc is better" 4398046511151 true]
[8796093023247 9 "new!" 4398046511150 true]
[8796093023247 9 "new!" 4398046511151 false]]
2) Running the query on the database before retraction:
[[8796093023247 9 "new!" 4398046511150 true]
[8796093023247 9 "new!" 4398046511151 false]]
As you can see, the first query is the same as just asking the most-recent historic database
what the state is, since our last transaction was to retract the entity. However, the query
on the 'previous' state omits the 'current' state of the entity by eliding the datom
[8796093023247 9 "actually, this doc is better" 4398046511151 true]
. In this way, we have
written and used a query not only on multiple databases, but, multiple databases, with
multiple semantics (history vs current) and at multiple points in time.
Yes, you can perform basic filtering on the :tx
field of datoms either locally or by
embedding logic into a query to do so. We plan to offer this as a low-level feature on
databases by implementing a since
function on the Database
interface, but haven't had
sufficient need to warrant the time to do so. If this becomes an interesting need for your
team, let us know in the Support: Eva
hipchat room and we'll see what we can do to
prioritize it.
It can be, but we've done a lot of work to examine and mitigate those costs. The ability to reify time in your database is a very powerful feature, and as with most powerful features, there are tradeoffs. Over time we will end up using more overall space than traditional mutative databases, simply because we never perform mutation. This is entirely core to our model. This is how we allow arbitrary snapshotting and allow the read processes to live inside the consumers of the database, entirely decoupled from the write processes.
Now, for mitigating the costs, there are a variety of methods we use:
i) The indexes are actually secondary to the main transaction process. The transactor has control of a write-ahead-log which represents the 'real' source of truth in the process as a whole. The indexing process could be turned off, and the system would still work (in a degraded state) for an extended period of time. Whenever a database snapshot is realized, we retrieve the most recent index roots for the point in time you're interested in, and consume a chunk of the transaction log to 'fast forward' the indexes in memory to the desired state. We also leverage this independence of the indexers from the main transactor process to decouple the asymptotics of updating the indexes, since updating the indexes will always be strictly slower than adding an entry to the log.
ii) We actually utilize the independence in (i) to plan on not updating the indexes for every transaction, because by-and-large it isn't necessary. As of writing, the indexing processes will only commit their updates to storage under one of two conditions: a maximum span since last commit is exceeded or a maximum size of in-memory overlay is exceeded. Both of these are configuration variables that can be tuned to meet the needs of the consuming system, in acknowledgement that there is a time/space tradeoff that is almost certainly not universal. As discussed, the system will use the most recent indexes roots it can, and just read from the transaction log and lazily update in memory, which is a relatively cheap process.
iii) Finally, the write characteristics of the Bε-trees are asymptotically superior to B-trees, so when we do end up writing, we will write asymptotically fewer nodes than nearly all traditional database indexes.
tl;dr: the write-throughput of the system is basically decoupled from the index updates, and we provide tools to allow you to amortize the cost of updating the indexes to whatever extent it makes sense for you. This combined with an inherently write-efficient index datastructure greatly reduces the total costs of index maintenance
Yes, more or less. We strip a few implementation-specific details off of our log-entries
when returning them, but you can use the log
function on the Connection
interface to
access the transaction log entries, and examine the datoms in each for a more 'raw' look
at the changes happening in the database at each transaction.
db
, asOf
, history
, and log
.Can you improve this documentation?Edit on GitHub
cljdoc is a website building & hosting documentation for Clojure/Script libraries
× close