Datahike supports two types of distributed access: distribution of data and distribution of computation. Distribution of data means that each Datahike runtime can access stores in the distributed index space (DIS), while distribution of computation means that clients send requests to a server on a remote runtime, where they are evaluated.
Datahike's memory model is similar to Datomic's, being built on distributed persistent indices. But while Datomic requires active connections to its transactor, Datahike works with lightweight connections that do not require communication by default.
If you only need to read from a database and never write to it, e.g. a database that a third party provides you access to, it is sufficient to have read access rights to the store; no server setup or additional steps are needed to join against the indices of this external database!
Note: This allows you to massively shard databases. A good design pattern is to create a separate database for each set of facts that you need to update together consistently, e.g. one database per business client.
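As a minimal sketch of this pattern (the store layout and client identifiers are illustrative assumptions), each client gets its own database by varying only the store path:

(require '[datahike.api :as d])

;; Hypothetical helper: one database per business client, sharing the same
;; backend but using a separate store path per client.
(defn client-db-config [client-id]
  {:store {:backend :file
           :scope   "your.domain.com"
           :path    (str "/shared/filesystem/store/" client-id)}
   :keep-history?      true
   :schema-flexibility :read})

;; e.g. one database per client:
;; (d/create-database (client-db-config "acme"))
;; (d/create-database (client-db-config "globex"))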
If you want to provide distributed write access to databases, you need to set up a server as described in the section at the end. Datahike then centralizes all write operations and state changes to the database on this single machine, while all read operations can still happen locally on as many machines as have access to the distributed konserve store (e.g. shared filesystem, JDBC, S3, etc.). The benefit of the single writer is that it provides strong linearization guarantees for transactions, i.e. strong consistency. This memory model is also supported by the CLI, babashka and libdatahike clients.
The client setup is simple: you just add a :writer entry to the configuration of your database, e.g.
{:store {:backend :file :scope "your.domain.com" :path "/shared/filesystem/store"}
 :keep-history? true
 :schema-flexibility :read
 :writer {:backend :datahike-server
          :url "http://localhost:4444"
          :token "securerandompassword"}}
You can now use the normal datahike.api as usual; all operations changing a database, e.g. create-database, delete-database and transact, are sent to the server, while all other calls are executed locally.
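A minimal sketch of this, assuming the server from the section below is reachable at http://localhost:4444:

(require '[datahike.api :as d])

(def cfg {:store {:backend :file :scope "your.domain.com" :path "/shared/filesystem/store"}
          :keep-history? true
          :schema-flexibility :read
          :writer {:backend :datahike-server
                   :url "http://localhost:4444"
                   :token "securerandompassword"}})

;; create-database, transact etc. are forwarded to the writer ...
(d/create-database cfg)
(def conn (d/connect cfg))
(d/transact conn [{:name "Peter" :age 42}])

;; ... while reads run locally against the shared store.
(d/q '[:find ?n ?a :where [?e :name ?n] [?e :age ?a]] @conn)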
An example setup for running Datahike distributed in AWS Lambda without a server can be found here. It configures a singleton lambda for write operations, while reader lambdas can be run multiple times and scale out. This setup can later be upgraded to use dedicated servers through EC2 instances.
Datahike also supports sending all requests to a server. This has the benefit that the server does all the computation and its caches are shared between different clients. The disadvantage is that you cannot easily share in-process information, e.g. call your own functions or closures in queries, without deploying them to the server first.
The remote API has the same call signatures as datahike.api and is located in datahike.api.client. Except for listening and with, all functionality is supported. Given a server is set up (see below), you can interact with it by adding :remote-peer to the config you would otherwise use with datahike.api:
{:store {:backend :mem :id "distributed-datahike"}
 :keep-history? true
 :schema-flexibility :read
 :remote-peer {:backend :datahike-server
               :url "http://localhost:4444"
               :token "securerandompassword"}}
The API will return lightweight remote pointers that follow the same semantics as datahike.api, but do not support any of Datahike's local functionality, i.e. you can only use them with this API.
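A minimal sketch, again assuming a server on http://localhost:4444; the calls mirror datahike.api, but everything is evaluated on the server:

(require '[datahike.api.client :as client])

(def cfg {:store {:backend :mem :id "distributed-datahike"}
          :keep-history? true
          :schema-flexibility :read
          :remote-peer {:backend :datahike-server
                        :url "http://localhost:4444"
                        :token "securerandompassword"}})

(client/create-database cfg)
(def conn (client/connect cfg)) ;; a lightweight remote pointer
(client/transact conn [{:name "Peter" :age 42}])
(client/q '[:find ?n ?a :where [?e :name ?n] [?e :age ?a]] (client/db conn))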
Note that you can combine both kinds of data access, i.e. run a set of servers sharing a single writer among themselves, while they all serve a large set of outside clients through RPCs.
To build the server locally, you only need to clone the repository and run bb http-server-uber to create the jar. The server can then be run with java -jar datahike-http-server-VERSION.jar path/to/config.edn.
The edn configuration file looks like:
{:port 4444
 :level :debug
 :dev-mode true
 :token "securerandompassword"}
port sets the port to run the HTTP server on, and level sets the log level. dev-mode deactivates authentication during development, and if token is provided, you need to send this token as the HTTP header "token" to authenticate.
The server exports a swagger interface on that port and can serialize requests in transit-json, edn and JSON with jsonista tagged literals. The server exposes all referentially transparent calls (those whose results do not change given the same arguments) as GET requests and all other calls as POST requests. In both cases all arguments are sent as a list in the request body.
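As a sketch of this request convention (babashka.http-client and the content types below are assumptions; the exact routes and media types are documented in the swagger interface), creating a database with an edn body could look like this:

(require '[babashka.http-client :as http])

;; POST the argument list as edn; the "token" header is only needed
;; when authentication is enabled on the server.
(http/post "http://localhost:4444/create-database"
           {:headers {"content-type" "application/edn"
                      "accept"       "application/edn"
                      "token"        "securerandompassword"}
            :body    (pr-str [{:schema-flexibility :read}])})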
CORS headers can be set, e.g. by adding
:access-control-allow-origin [#"http://localhost" #"http://localhost:8080"]
The server also experimentally supports HTTP caching for GET requests, e.g. by adding
:cache {:get {:max-age 3600}}
This should be beneficial in case your HTTP client or proxy supports efficient caching and you often run the same queries many times (e.g. to retrieve a daily context in an app against a database that only changes with low frequency).
The remote API supports JSON with embedded tagged literals. Two extensions are provided for convenience (please provide feedback or explore better options if you have ideas!).
Provided you are sending HTTP requests to a datahike-server, you can put the following JSON argument arrays into each request body. You have to provide the "token" header if you use authentication.
POST to "/create-database"
["{:schema-flexibility :read}"]
Note that here you can pass the configuration as an edn string, which is more concise. If you want to speak JSON directly, you would pass
[{"schema-flexibility": ["!kw", "read"]}]
"!kw" annotates a tagged literal here and encodes that "read" is an edn keyword.
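To issue this request from Clojure, a sketch using jsonista and babashka.http-client (both library choices are assumptions; any JSON encoder and HTTP client will do):

(require '[jsonista.core :as j]
         '[babashka.http-client :as http])

;; Encode the argument array with the "!kw" tagged-literal convention and POST it.
(http/post "http://localhost:4444/create-database"
           {:headers {"content-type" "application/json"
                      "accept"       "application/json"
                      "token"        "securerandompassword"}
            :body    (j/write-value-as-string [{"schema-flexibility" ["!kw" "read"]}])})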
The resulting configuration will look like this (with a random DB name):
cfg = {
  "keep-history?": true,
  "search-cache-size": 10000,
  "index": ["!kw", "datahike.index/persistent-set"],
  "store": {"id": "wiggly-field-vole", "scope": "127.0.1.1", "backend": ["!kw", "mem"]},
  "store-cache-size": 1000,
  "attribute-refs?": false,
  "writer": {"backend": ["!kw", "self"]},
  "crypto-hash?": false,
  "remote-peer": null,
  "schema-flexibility": ["!kw", "read"],
  "branch": ["!kw", "db"]
}
You can now use this cfg to connect to this database:
POST to "/connect"
[cfg]
The result will look like:
conn = ["!datahike/Connection",[[["!kw","mem"],"127.0.1.1","wiggly-field-vole"],["!kw","db"]]]
Finally, let's add some data to the database:
POST to "/transact"
[conn, [{"name": "Peter", "age": 42}]]
The result is a comprehensive transaction record (feel free to ignore the details):
[
  "!datahike/TxReport",
  {
    "db-before": ["!datahike/DB",
                  {"store-id": [[["!kw", "mem"], "127.0.1.1", "wiggly-field-vole"], ["!kw", "db"]],
                   "commit-id": ["!uuid", "2c8f71f9-a3c6-4189-ba0c-e183cc29c672"],
                   "max-eid": 1,
                   "max-tx": 536870913}],
    "db-after": ["!datahike/DB",
                 {"store-id": [[["!kw", "mem"], "127.0.1.1", "wiggly-field-vole"], ["!kw", "db"]],
                  "commit-id": ["!uuid", "6ebf8979-cdf0-41f4-b615-30ff81830b0c"],
                  "max-eid": 2,
                  "max-tx": 536870914}],
    "tx-data": [
      ["!datahike/Datom", [536870914, ["!kw", "db/txInstant"], ["!date", "1695952443102"], 536870914, true]],
      ["!datahike/Datom", [2, ["!kw", "age"], 42, 536870914, true]],
      ["!datahike/Datom", [2, ["!kw", "name"], "Peter", 536870914, true]]
    ],
    "tempids": {"db/current-tx": 536870914},
    "tx-meta": {"db/txInstant": ["!date", "1695952443102"],
                "db/commitId": ["!uuid", "6ebf8979-cdf0-41f4-b615-30ff81830b0c"]}
  }
]
Note that you can extract snapshots of the database before and after the commit as db-before and db-after, as well as the facts added to the database as tx-data.
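If you consume this JSON from Clojure, here is a small sketch of extracting the new facts (assuming jsonista; keys are parsed as strings, and tx-report-json is a hypothetical variable holding the response body):

(require '[jsonista.core :as j])

;; The report is encoded as ["!datahike/TxReport", {...}],
;; so the payload is the second element of the parsed vector.
(let [[_tag report] (j/read-value tx-report-json)]
  (get report "tx-data"))
;; => a vector of ["!datahike/Datom" [e a v tx added?]] entries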
To retrieve the current database for your connection, use
POST to "/db"
[conn]
The result looks like:
db = [
  "!datahike/DB",
  {
    "store-id": [[["!kw", "mem"], "127.0.1.1", "wiggly-field-vole"], ["!kw", "db"]],
    "commit-id": ["!uuid", "6ebf8979-cdf0-41f4-b615-30ff81830b0c"],
    "max-eid": 2,
    "max-tx": 536870914
  }
]
You can query this database with the query endpoint. We again recommend using a string for the query DSL instead of direct JSON encoding, unless you want to manipulate queries programmatically in JSON.
GET from "/q"
["[:find ?n ?a :where [?e :name ?n] [?e :age ?a]]", db]
The result set is:
["!set",[["Peter",42]]]
You can also pass strings for pull expressions and for the configurations passed to delete-database and database-exists.
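For example (the route names and the GET/POST choice below are assumptions following the function-name and referential-transparency pattern above), a pull with a string selector and deleting the walkthrough database could look like:

GET from "/pull"
[db, "[:name :age]", 2]

POST to "/delete-database"
[cfg]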