Bitemporality

Why?

The rationale for bitemporality is also explained in this blog post.

A baseline notion of time that is always available is transaction-time: the point at which data is transacted into the database.

Bitemporality is the addition of another time-axis: valid-time.

.Time Axes
[cols="1,2",options="header"]
|===
|Time |Purpose

|transaction-time
|Used for audit purposes, technical requirements such as event sourcing.

|valid-time
|Used for querying data across time, historical analysis.
|===

transaction-time represents the point at which data arrives into the database. This gives us an audit trail and we can see what the state of the database was at a particular point in time. You cannot write a new transaction with a transaction-time that is in the past.

valid-time is an arbitrary time that can originate from an upstream system, or by default is set to transaction-time. Valid time is what users will typically use for query purposes.

Writes can be made in the past of valid-time. Users will normally ask "what is the value of this entity at valid-time?" regardless of whether this history has been rewritten several times at multiple transaction-times.

In Crux, when transaction-time isn’t specified, it is set to now: the time at which the transaction is processed. When writing data, if no specific valid-time is available, valid-time defaults to the transaction-time.
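
As a minimal sketch of these defaults, assuming a started Crux node bound to `node` (the entity id `:some-entity` and attribute `:value` are purely illustrative):

[source,clojure]
----
(require '[crux.api :as crux])

;; no valid-time supplied: valid-time defaults to the transaction-time
(crux/submit-tx node
                [[:crux.tx/put {:crux.db/id :some-entity :value 1}]])

;; explicit valid-time in the past: the fact is recorded as true from
;; 2018-06-01 on the valid-time axis, even though it is transacted now
(crux/submit-tx node
                [[:crux.tx/put {:crux.db/id :some-entity :value 2}
                  #inst "2018-06-01"]])
----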

Valid Time

In situations where your database is not the ultimate owner of the data—where corrections to data can flow in from various sources and at various times—use of transaction-time is inappropriate for historical queries.

Imagine you have a financial trading system and you want to perform calculations based on the official 'end of day', which occurs each day at 17:00 hours. Does all the data arrive into your database at exactly 17:00? Or does the data in fact arrive from an upstream source, meaning we have to allow for some data arriving out of order, and some arriving only after 17:00?

This can often be the case with high-throughput systems where there are clusters of processing nodes enriching the data before it gets to our store.

In this example, we want our queries to include the straggling bits of data for our calculation purposes, and this is where valid-time comes in. When data arrives into our database, it can come with an arbitrary time-stamp that we can use for querying purposes.

We can tolerate data arriving out of order, as we’re not completely dependent on transaction-time.

In an ecosystem of many systems, where one cannot control the ultimate timeline or other systems' ability to write into the past, bitemporality is needed to ensure evolving but consistent views of the data.

Transaction Time

For audit reasons, we might wish to know with certainty the value of a given entity attribute at a given transaction-time. In this case, we want to exclude the possibility of the valid-time past having since been amended, so we need a pre-correction view of the data, fixed by transaction-time.

To achieve this, you can query as-of both a valid-time and a transaction-time.
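
As a minimal sketch, assuming a started Crux node bound to `node` and an existing entity with the (hypothetical) id `:some-entity`, an as-of view can be obtained by passing both times to `crux.api/db`:

[source,clojure]
----
(require '[crux.api :as crux])

;; fix the database view at a bitemporal coordinate:
;; valid-time first, then transaction-time
(def db-as-of
  (crux/db node
           #inst "2019-01-02"    ; valid-time
           #inst "2019-01-03"))  ; transaction-time

;; any read against this value sees only what was known about
;; valid-time 2019-01-02 as of transaction-time 2019-01-03
(crux/entity db-as-of :some-entity)
----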

Only if you want to ensure consistent reads across nodes or to query for audit reasons would you want to consider using both transaction-time and valid-time.

Known Uses

Recording bitemporal information with your data is essential when dealing with lag, corrections, and efficient auditability:

  • Lag is found wherever there is risk of non-trivial delay until an event can be recorded. This is common between systems that communicate over unreliable networks.

  • Corrections are needed as errors are uncovered and as facts are reconciled.

  • Ad-hoc auditing is an otherwise intensive and slow process requiring significant operational complexity.

With Crux you retain visibility of all historical changes whilst compensating for lag, making corrections, and performing audit queries. By default, deleting data only erases visibility of that data from the current perspective. You may of course still evict data completely as the legal status of information changes.
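
As a minimal sketch of this distinction, assuming a started Crux node bound to `node` and an existing entity with the (hypothetical) id `:some-entity`:

[source,clojure]
----
(require '[crux.api :as crux])

;; delete: from this valid-time onwards the entity is no longer visible,
;; but earlier valid-time views of its history remain queryable
(crux/submit-tx node [[:crux.tx/delete :some-entity]])

;; evict: the entity's documents are erased entirely, e.g. when the
;; legal status of the information changes
(crux/submit-tx node [[:crux.tx/evict :some-entity]])
----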

These capabilities are known to be useful for:

  • Event Sourcing (e.g. retroactive and scheduled events)

  • Ingesting out-of-order temporal data from upstream timestamping systems

  • Maintaining a slowly changing dimension for decision support applications

  • Recovering from accidental data changes and application errors

  • Auditing all data changes and performing data forensics when necessary

  • Responding to new regulations and audit requirements

  • Avoiding the need to set up additional databases for historical data and improving end-to-end data governance

  • Development, simulation and testing

  • Live migrations from legacy systems

  • Scheduling and previewing future states (e.g. publishing and content management)

Example Queries

Crime Investigations

This example is based on an academic paper.

Cheng Hian Goh, Hongjun Lu, Kian-Lee Tan. "Indexing temporal data using existing B+-trees". Data & Knowledge Engineering, 1996. DOI: 10.1016/0169-023X(95)00034-P
https://www.comp.nus.edu.sg/~ooibc/stbtree95.pdf
See section 7, "Support for complex queries in bitemporal databases".

During a criminal investigation it is critical to be able to refine a temporal understanding of past events as new evidence is brought to light, errors in documentation are accounted for, and speculation is corroborated. The paper referenced above gives the following query example:

Find all persons who are known to be present in the United States on day 2 (valid time), as of day 3 (transaction time)

The paper then lists a sequence of entry and departure events at various United States border checkpoints. As the investigator, we will step through this sequence to monitor a set of suspects. These events arrive in an undetermined chronological order, based on how and when each checkpoint is able to manually relay the information.

Day 0

Assuming Day 0 for the investigation period is #inst "2018-12-31", the initial documents are ingested using the Day 0 valid time:

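The underlying transactions live in query_test_examples.clj, which isn’t rendered here, so the sketches that follow show what they might look like. They assume a started Crux node bound to `node`, and use illustrative entity ids (`:p1`, `:p2`, ...) and attribute names (`:entry-pt`, `:arrival-time`, `:departure-time`):

[source,clojure]
----
(require '[crux.api :as crux])

;; both documents are put with the Day 0 valid time
(crux/submit-tx
 node
 [[:crux.tx/put
   {:crux.db/id :p2
    :entry-pt :SFO
    :arrival-time #inst "2018-12-31"
    :departure-time :na}
   #inst "2018-12-31"]
  [:crux.tx/put
   {:crux.db/id :p3
    :entry-pt :LA
    :arrival-time #inst "2018-12-31"
    :departure-time :na}
   #inst "2018-12-31"]])
----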

The first document shows that Person 2 was recorded entering via :SFO and the second document shows that Person 3 was recorded entering :LA.

Day 1

No new recorded events arrive on Day 1 (#inst "2019-01-01"), so there are no documents available to ingest.

Day 2

A single event arrives on Day 2 showing Person 4 arriving at :NY:

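Continuing the sketch, Person 4's arrival is put with the Day 2 valid time:

[source,clojure]
----
(crux/submit-tx
 node
 [[:crux.tx/put
   {:crux.db/id :p4
    :entry-pt :NY
    :arrival-time #inst "2019-01-02"
    :departure-time :na}
   #inst "2019-01-02"]])
----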

Day 3

Next, we learn on Day 3 that Person 4 departed from :NY, which is represented as an update to the existing document using the Day 3 valid time:

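In the sketch, the update is a new version of Person 4's document, put with the Day 3 valid time:

[source,clojure]
----
(crux/submit-tx
 node
 [[:crux.tx/put
   {:crux.db/id :p4
    :entry-pt :NY
    :arrival-time #inst "2019-01-02"
    :departure-time #inst "2019-01-03"}
   #inst "2019-01-03"]])
----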

Day 4

On Day 4 we begin to receive events relating to the previous days of the investigation.

First we receive an event showing that Person 1 entered :NY on Day 0, which must be ingested using the Day 0 valid time #inst "2018-12-31":

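Sketched, this is a put transacted on Day 4 but carrying the Day 0 valid time:

[source,clojure]
----
(crux/submit-tx
 node
 [[:crux.tx/put
   {:crux.db/id :p1
    :entry-pt :NY
    :arrival-time #inst "2018-12-31"
    :departure-time :na}
   #inst "2018-12-31"]])
----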

We then receive an event showing that Person 1 departed from :NY on Day 3, so again we ingest this document using the corresponding Day 3 valid time:

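Again sketched as a put with the Day 3 valid time:

[source,clojure]
----
(crux/submit-tx
 node
 [[:crux.tx/put
   {:crux.db/id :p1
    :entry-pt :NY
    :arrival-time #inst "2018-12-31"
    :departure-time #inst "2019-01-03"}
   #inst "2019-01-03"]])
----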

Finally, we receive two events relating to Day 4, which can be ingested using the current valid time:

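Of these two documents, only one is described later in the narrative: the record of Person 3's Day 4 departure, which turns out to be erroneous and is corrected on Day 7. The sketch therefore shows just that one; the second document is not detailed in the prose:

[source,clojure]
----
(crux/submit-tx
 node
 [[:crux.tx/put
   {:crux.db/id :p3
    :entry-pt :LA
    :arrival-time #inst "2018-12-31"
    :departure-time #inst "2019-01-04"} ; later corrected on Day 7
   #inst "2019-01-04"]])
----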

Day 5

On Day 5 there is an event showing that Person 2, having arrived on Day 0 (which we already knew), departed from :SFO on Day 5.

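Sketched as a new version of Person 2's document with the Day 5 valid time:

[source,clojure]
----
(crux/submit-tx
 node
 [[:crux.tx/put
   {:crux.db/id :p2
    :entry-pt :SFO
    :arrival-time #inst "2018-12-31"
    :departure-time #inst "2019-01-05"}
   #inst "2019-01-05"]])
----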

Day 6

No new recorded events arrive on Day 6 (#inst "2019-01-06"), so there are no documents available to ingest.

Day 7

On Day 7 two documents arrive. The first document corrects the previous assertion that Person 3 departed on Day 4, which was misrecorded due to human error. The second document shows that Person 3 has only just departed on Day 7, which is how the previous error was noticed.

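One way to sketch this: the correction re-puts the pre-error state of Person 3's document at the Day 4 valid time, and the second document records the actual departure at the Day 7 valid time:

[source,clojure]
----
(crux/submit-tx
 node
 [[:crux.tx/put
   {:crux.db/id :p3
    :entry-pt :LA
    :arrival-time #inst "2018-12-31"
    :departure-time :na}                ; correction: had not departed on Day 4
   #inst "2019-01-04"]
  [:crux.tx/put
   {:crux.db/id :p3
    :entry-pt :LA
    :arrival-time #inst "2018-12-31"
    :departure-time #inst "2019-01-07"}
   #inst "2019-01-07"]])
----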

Day 8

Two documents have been received relating to new arrivals on Day 8. Note that Person 3 has arrived back in the country again.

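The prose does not say where Person 3 re-entered or who the second arrival was, so the sketch shows only Person 3, with a placeholder entry point:

[source,clojure]
----
(crux/submit-tx
 node
 [[:crux.tx/put
   {:crux.db/id :p3
    :entry-pt :SFO   ; placeholder: entry point not specified above
    :arrival-time #inst "2019-01-08"
    :departure-time :na}
   #inst "2019-01-08"]])
----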

Day 9

On Day 9 we learn that Person 3 also departed on Day 8.

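Sketched as an update to Person 3's latest document, using the Day 8 valid time:

[source,clojure]
----
(crux/submit-tx
 node
 [[:crux.tx/put
   {:crux.db/id :p3
    :entry-pt :SFO   ; placeholder, as above
    :arrival-time #inst "2019-01-08"
    :departure-time #inst "2019-01-08"}
   #inst "2019-01-08"]])
----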

Day 10

A single document arrives showing that Person 5 entered at :LA earlier that day.

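Sketched with the Day 10 valid time:

[source,clojure]
----
(crux/submit-tx
 node
 [[:crux.tx/put
   {:crux.db/id :p5
    :entry-pt :LA
    :arrival-time #inst "2019-01-10"
    :departure-time :na}
   #inst "2019-01-10"]])
----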

Day 11

Similarly to the previous day, a single document arrives showing that Person 7 entered at :NY earlier that day.

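Sketched with the Day 11 valid time:

[source,clojure]
----
(crux/submit-tx
 node
 [[:crux.tx/put
   {:crux.db/id :p7
    :entry-pt :NY
    :arrival-time #inst "2019-01-11"
    :departure-time :na}
   #inst "2019-01-11"]])
----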

Day 12

Finally, on Day 12 we learn that Person 6 entered at :NY that same day.

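Sketched with the Day 12 valid time:

[source,clojure]
----
(crux/submit-tx
 node
 [[:crux.tx/put
   {:crux.db/id :p6
    :entry-pt :NY
    :arrival-time #inst "2019-01-12"
    :departure-time :na}
   #inst "2019-01-12"]])
----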

Question Time

Let’s review the question we need to answer to aid our investigations:

Find all persons who are known to be present in the United States on day 2 (valid time), as of day 3 (transaction time)

We are able to easily express this as a query in Crux:

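Against the sketched document shape, and with the database value fixed at valid-time Day 2 and transaction-time Day 3, it might look like this (a simple pattern match suffices because nobody visible at that bitemporal coordinate had yet departed):

[source,clojure]
----
(crux/q
 (crux/db node #inst "2019-01-02" #inst "2019-01-03")
 '{:find [p entry-pt arrival-time departure-time]
   :where [[p :entry-pt entry-pt]
           [p :arrival-time arrival-time]
           [p :departure-time departure-time]]})
----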

The answer given by Crux is a simple set of the three relevant people along with the details of their last entry and confirmation that none of them were known to have yet departed at this point:

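Against the sketched data, that set would look along these lines (Person 1's Day 0 entry was not transacted until Day 4, so it is invisible at the Day 3 transaction time):

[source,clojure]
----
#{[:p2 :SFO #inst "2018-12-31" :na]
  [:p3 :LA #inst "2018-12-31" :na]
  [:p4 :NY #inst "2019-01-02" :na]}
----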
