This is work in progress and subject to change.
We provide the datahike native executable to access Datahike databases from the command line.
First, download the precompiled binary or build it yourself, and put it on your executable path.
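On Linux this could look like the following (a minimal sketch; the file name of the downloaded artifact and the target directory are assumptions):

$ chmod +x datahike   # make the downloaded binary executable
$ mv datahike ~/bin/  # assuming ~/bin is on your PATH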
To access a database you need to provide the usual configuration for Datahike. Put this into a file myconfig.edn:
{:store {:backend :file
         :path "/home/USERNAME/dh-shared-db"
         :config {:in-place? true}}
 :keep-history? true
 :schema-flexibility :read}
Now you can invoke some of our core API functions on the database. Let us add a fact to the database (be careful to use single quotes (') if you do not want your shell to substitute parts of your Datalog ;) ):
$ datahike transact db:myconfig.edn '[[:db/add -1 :name "Linus"]]'
And retrieve it:
$ datahike query '[:find ?n . :where [?e :name ?n]]' db:myconfig.edn
"Linus" # prints the name
By prefixing a path with db: you can pass multiple database configuration files to the query engine and join over arbitrarily many databases. Everything else is read in as edn and passed to the query engine as well.
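For example, you could join over two stores like this (a sketch; otherconfig.edn is a hypothetical second configuration file, and the query returns names present in both databases):

$ datahike query '[:find ?n :in $ $2 :where [$ ?e :name ?n] [$2 ?e2 :name ?n]]' db:myconfig.edn db:otherconfig.edn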
Provided the filestore is configured with {:in-place? true}, you can even write to the same database from different shells without a dedicated daemon:
$ datahike benchmark db:myconfig.edn 0 50000 100
"Elapsed time: 116335.589411 msecs"
Here we use a provided benchmark helper which transacts facts of the form [eid :name (random-team-member)] for eid=0,...,50000 into the store. 100 denotes the batch size for each transaction, so here we chunk the 50000 facts into 500 transactions.
In a second shell you can now simultaneously add facts in a different range:
$ datahike benchmark db:myconfig.edn 50000 100000 100
To check that everything has been added and no write operations have overwritten each other, count the entities:
$ datahike query '[:find (count ?e) . :in $ :where [?e :name ?n]]' db:myconfig.edn
100000 # check :)
The persistent semantics of Datahike work more like git and less like those of mutable databases such as SQLite or Datalevin. In particular, you can always read and retain snapshots (copies) of the database for free, no matter what else is happening in the system. The current version is tested with memory and file storage, but hopefully many other backends will also work with the native-image.
In principle this shared memory access should even work while a JVM server, e.g. datahike-server, is serving the same database. Note that all reads can happen in parallel; only the writers experience congestion around exclusive file locks. This access pattern does not provide the highest throughput, but it is extremely flexible and easy to start with.
Forking is easy: it is enough to copy the folder of the store (even if the database is currently being written to). The only thing you need to take care of is to copy the DB root first, but place it into the target directory last; it is the file 0594e3b6-9635-5c99-8142-412accf3023b.ksv. Then you can use e.g. rsync (or git) to copy all other (immutable) files into your new folder. In the end you copy the root file in there as well, making sure that all files it references are reachable. Note that this will ensure that you only copy new data each time.
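A minimal sketch of this procedure with rsync (the source and target paths are assumptions):

$ cp /home/USERNAME/dh-shared-db/0594e3b6-9635-5c99-8142-412accf3023b.ksv /tmp/dh-root.ksv # snapshot the root first
$ rsync -a --exclude '0594e3b6-9635-5c99-8142-412accf3023b.ksv' /home/USERNAME/dh-shared-db/ /home/USERNAME/dh-fork/ # copy the immutable files
$ mv /tmp/dh-root.ksv /home/USERNAME/dh-fork/0594e3b6-9635-5c99-8142-412accf3023b.ksv # place the root last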
Now here comes the cool part. You do not need anything more than Datalog itself for merging. A query like this extracts all new facts that are in db1 but not in db2:
datahike query '[:find ?e ?a ?v ?t :in $ $2 :where [$ ?e ?a ?v ?t] (not [$2 ?e ?a ?v ?t])]' db:config1.edn db:config2.edn
Since we cannot update transaction metadata, we should filter out :db/txInstant datoms. We can also use a trick to add :db/add to each element in the results, yielding valid transactions that we can then feed into db2.
datahike query '[:find ?db-add ?e ?a ?v ?t :in $ $2 ?db-add :where [$ ?e ?a ?v ?t] [(not= :db/txInstant ?a)] (not [$2 ?e ?a ?v ?t])]' db:config1.edn db:config2.edn ":db/add" | datahike transact db:config2.edn
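Afterwards you can check the merge by re-running the difference query from above; it should come back empty:

datahike query '[:find ?e ?a ?v ?t :in $ $2 :where [$ ?e ?a ?v ?t] [(not= :db/txInstant ?a)] (not [$2 ?e ?a ?v ?t])]' db:config1.edn db:config2.edn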
Note that this very simple strategy assumes that the entity ids that have been added to db1 do not overlap with potentially new ones added to db2. You can encode conflict resolution strategies and id mappings with Datalog as well, and we are exploring several such strategies at the moment. This strategy is fairly universal, as CRDTs can be expressed in pure Datalog.
While it is not the most efficient way to merge, we plan to provide fast paths
for common patterns in Datalog. Feel free to contact us if you are interested in
complex merging strategies or have related cool ideas.