This paper lays out some information and thoughts ahead of the face-to-face discussions.
It presents some options and my thoughts about their trade-offs, so that we arrive having all considered the same things.
Our current model is to have all state required for queries available on every instance of Crux. From our experience it can scale to very large datasets using RocksDB. If the entire index fits on a single instance, it is ideal to keep it there, given that our goal is to serve queries with as low latency as we can.
But not being able to partition/shard the data across instances could be a hindrance to adoption for clients, even though in practice they can scale up on a single instance for a long time. Having a plan for what happens when that limit is reached may be a requirement.
We don't yet know what the maximum data size is for performant queries on a single node.
One choice we can make is what type of queries we intend to support and optimise for.
Is Crux meant to be used primarily for fast-responding lookups, or is it also supposed to support long-running report/batch-gathering style jobs well? If the latter, then support for co-operative query resolution might be interesting even at data scales smaller than the point where sharding becomes necessary.
We can of course choose to support both types of queries and give the user the right options (or somehow infer them) to handle their kind of query well.
Currently we support fully declarative queries via Datalog. If we wish to have these support partitioned queries, some possible options are outlined below.
I'm grouping SQL and other powerful query languages under this "Full Datalog/SQL support" category, as they would probably have similar trade-offs and problems.
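To ground the discussion, here is a minimal sketch of what a fully declarative Datalog query against a single node's local indexes looks like today. The schema is made up, and it assumes the public crux.api entry points (start-node, db, q) in a version where `start-node` with an empty map gives an in-memory node:

```clojure
(require '[crux.api :as crux])

(def node (crux/start-node {})) ; in-memory node, just for illustration

;; names of all people older than 30
(crux/q (crux/db node)
        '{:find  [?name]
          :where [[?e :person/name ?name]
                  [?e :person/age ?age]
                  [(> ?age 30)]]})
```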
A first option is to infer from the query and its inputs which partitions it will involve, and then automatically coordinate and execute the query transparently to the user.
I'm not sure how this would be done; perhaps some restriction of the Datalog capabilities would be necessary.
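As a rough sketch of the idea only: if documents were partitioned by a hash of a known routing attribute, the engine could inspect the bound inputs and touch only the owning shard. Everything here (`shard-for`, `route-query`, the `shard-nodes` argument) is hypothetical and does not exist in Crux:

```clojure
(require '[crux.api :as crux])

(def shard-count 4)

(defn shard-for
  "Hypothetical routing: shard index for a routing value, e.g. a customer id."
  [routing-value]
  (mod (hash routing-value) shard-count))

(defn route-query
  "Run the query only on the shard owning `routing-value` when one could be
   inferred from the query's inputs, otherwise fan out to all shards."
  [shard-nodes routing-value query]
  (if (some? routing-value)
    (crux/q (crux/db (nth shard-nodes (shard-for routing-value))) query)
    (into #{} (mapcat #(crux/q (crux/db %) query)) shard-nodes)))
```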
A second option is to split the query into multiple parts: have the user write a query that will run on multiple instances, and make the user responsible for joining up the parts. The joining of the results could be done with more Datalog.
So query one runs on several instances, and query two runs on the requesting instance with the combined result sets available to it in the query.
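A sketch of what that could look like, assuming the gathered relation can be fed back into the second query via :in (older versions used :args), and assuming `remote-nodes`/`local-node` are handles to the other instances; the schema is made up:

```clojure
(require '[crux.api :as crux])

;; query one: runs on several instances, each over its local data
(def query-one
  '{:find  [?customer ?total]
    :where [[?order :order/customer ?customer]
            [?order :order/total ?total]]})

(defn fan-out
  "Run query one on every remote instance and concatenate the result sets."
  [remote-nodes]
  (mapcat #(crux/q (crux/db %) query-one) remote-nodes))

(defn join-locally
  "Query two: runs on the requesting instance, joining the gathered
   relation against its own indexes with more Datalog."
  [local-node gathered]
  (crux/q (crux/db local-node)
          '{:find  [?name ?total]
            :in    [[[?customer ?total]]]
            :where [[?customer :customer/name ?name]]}
          (vec gathered)))

(comment
  (join-locally local-node (fan-out remote-nodes)))
```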
A third option is having very explicit, addressable shards with extensions in Datalog to address them.
This is inspired by http://www.actordb.com/docs-querymodel.html, where there are essentially separate addressable databases and a language with extensions for addressing and joining them.
In a mode like this, these separate databases would presumably have to be addressed on inserts too.
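A purely hypothetical sketch of how explicitly addressed shards could feel from Clojure, without inventing new Datalog syntax: a map of named shards, with helpers that make the shard explicit for both queries and inserts. None of this exists in Crux today; only crux.api/q, db, submit-tx and start-node are real:

```clojure
(require '[crux.api :as crux])

;; one node per named shard; in-memory nodes just for illustration
(def shard-nodes
  {:tenant-42 (crux/start-node {})
   :tenant-43 (crux/start-node {})})

(defn q-on-shard
  "Run a query against one explicitly named shard."
  [shards shard-key query]
  (crux/q (crux/db (get shards shard-key)) query))

(defn submit-to-shard
  "Inserts would have to be addressed to a shard as well."
  [shards shard-key tx-ops]
  (crux/submit-tx (get shards shard-key) tx-ops))

(submit-to-shard shard-nodes :tenant-42
                 [[:crux.tx/put {:crux.db/id     :invoice-1
                                 :invoice/status :unpaid}]])

(q-on-shard shard-nodes :tenant-42
            '{:find  [?invoice]
              :where [[?invoice :invoice/status :unpaid]]})
```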
A different category is inspired by https://docs.mongodb.com/manual/core/aggregation-pipeline/. In this mode we don't try to infer anything for the user. Instead, the user specifies a distributed pipeline like the ones you would build with Kafka or Onyx, with keys and values that are routed between instances, but we execute it dynamically. One way to build a system like this would be to use some extra low-retention Kafka topics that are only used for resolving pipeline jobs.
This implementation is not necessarily orthogonal to the "Full Datalog/SQL support" category, since those options would probably also need some lower-level cross-instance coordination support.
But if this existed, it could also be the only query interface exposed to the user.
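To make the pipeline idea concrete, a hypothetical data description of such a job, loosely in the style of an Onyx job map. None of these keys are real Crux or Onyx options, and the topic names are only there to show where the intermediate key/value pairs would be routed:

```clojure
;; hypothetical pipeline description: sum order totals per customer
(def report-pipeline
  {:pipeline/stages
   [{:stage/name  :scan-orders
     ;; each instance scans its local indexes with a Datalog fragment
     :stage/where '[[?order :order/customer ?customer]
                    [?order :order/total ?total]]
     ;; results are emitted as key/value pairs onto a low-retention topic
     :stage/emit  '{:key ?customer :value ?total}
     :stage/topic "pipeline.scan-orders"}

    {:stage/name   :sum-per-customer
     ;; instances consume the topic partitioned by key and fold the values
     :stage/reduce '(fnil + 0)
     :stage/topic  "pipeline.sums-per-customer"}]})
```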
The last category is closer to what we have built before. In this mode we only support a declarative pull expression with some filtering extensions, and maybe some aggregation extensions: queries less powerful than the "Full Datalog/SQL support" category, but for that reason probably easier to reason about when inferring and automating co-operation between instances.
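A naive local sketch of that restricted interface: the `pull-query` function and its option map are invented here (with a flat projection standing in for a fuller pull expression), implemented on top of the existing crux.api/q and crux.api/entity just to show the shape:

```clojure
(require '[crux.api :as crux])

(defn pull-query
  "Select the entities whose documents match the (conjunctive) attribute
   filters, then project the requested keys from each document."
  [node {:keys [project filters]}]
  (let [db  (crux/db node)
        ids (map first
                 (crux/q db {:find  '[?e]
                             :where (vec (for [[a v] filters] ['?e a v]))}))]
    (map #(select-keys (crux/entity db %) project) ids)))

(comment
  ;; assuming `node` is a started Crux node as in the earlier sketch
  (pull-query node
              {:project [:order/id :order/total]
               :filters {:order/status :shipped}}))
```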
These are some of the things I thought about, but there are more possibilities, so if you think of any please bring them to the meeting.