Motivation

This paper lays out some information and thoughts ahead of the face-to-face discussions.

It presents some options, and my thoughts on their trade-offs, so that we have all considered the same things beforehand.

Options for Query support

Single instance vs Partitioned cooperation (or limitation)

The model we have today makes all the state required for queries available on every instance of Crux. In our experience this can scale to very large datasets using RocksDB. If the entire index fits on a single instance, keeping it there is ideal when our goal is to provide queries with as low latency as we can.

But lacking the capability to partition/shard the data across instances could hinder adoption by clients, even though in practice they can scale up on a single instance for a long time. Having a plan for what happens when that limit is reached may be a requirement.

We don't yet know what the maximum data size is for performant queries on a single node.

Low latency queries vs long running jobs

One choice we can make is which types of queries we intend to support and optimise for.

Is Crux meant to be used primarily for fast-responding lookups, or is it also supposed to support long-running report/batch-gathering jobs well? If the latter, support for co-operating query resolution might be interesting even at data sizes smaller than the point where sharding becomes necessary.

We can of course choose to support both types of queries and give the user the right options (or somehow infer them) to handle their type of queries well.

Type of query support

Full Datalog/SQL support

Currently we support fully declarative queries via Datalog. I'm grouping SQL and other powerful query languages under this category, as they would probably have similar trade-offs and problems.

If we wish to have these support partitioned queries, some possible options are:

  • Infer from the query and its input which partitions it will involve, then automatically communicate with them and execute the query, transparently to the user.

    I'm not sure how this would be done, but some restriction of the
    Datalog capabilities might be necessary to do it.
  • Split the query into multiple parts. The user writes a query that is run on multiple instances and is then responsible for joining up the parts. The joining of the results could be done with more Datalog.

    So query one runs on several instances, and query two runs on
    the requesting instance with the result sets from those
    instances available to it.
  • Have very explicit, addressable shards, with extensions in Datalog to address them.

    This is inspired by http://www.actordb.com/docs-querymodel.html
    where there are essentially separate addressable databases and a language
    with extensions to address and join them.
    In a model like this, these separate databases would presumably have to
    be addressed on inserts too.
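
To make the second option above more concrete, here is a minimal sketch of the split-and-join idea. It is plain Python rather than Datalog, and everything in it (the in-memory `SHARDS`, `run_on_shard`, `scatter`, `join_on_entity`) is hypothetical and only meant to illustrate the shape of the approach, not how Crux would implement it.

```python
# Hypothetical in-memory "shards": each holds (entity, attribute, value) triples.
SHARDS = [
    [("alice", "city", "Oslo"), ("bob", "city", "London")],
    [("carol", "city", "Oslo"), ("alice", "age", 31)],
]


def run_on_shard(shard, attribute):
    """Query one: return (entity, value) pairs for one attribute on a single shard."""
    return [(e, v) for (e, a, v) in shard if a == attribute]


def scatter(attribute):
    """Run the same per-shard query on every shard and concatenate the partial results."""
    results = []
    for shard in SHARDS:
        results.extend(run_on_shard(shard, attribute))
    return results


def join_on_entity(left, right):
    """Query two: join two gathered result sets on entity, on the requesting instance."""
    index = {}
    for entity, value in right:
        index.setdefault(entity, []).append(value)
    return sorted(
        (entity, lv, rv) for entity, lv in left for rv in index.get(entity, [])
    )


cities = scatter("city")
ages = scatter("age")
joined = join_on_entity(cities, ages)
```

Note that the join only sees what the per-shard queries returned, so the user has to make sure query one returns every value that query two needs to join on.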

Low level pipeline operations

This is inspired by MongoDB's aggregation pipeline (https://docs.mongodb.com/manual/core/aggregation-pipeline/). In this mode we don't try to infer anything for the user. Instead the user specifies a distributed pipeline, like the ones you would build with Kafka or Onyx, with keys and values that are routed between instances, except that we execute it dynamically. One way to build a system like this would be to use some extra Kafka topics with low retention that are used only for resolving pipeline jobs.
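
As a rough illustration of keys and values being routed between instances, the sketch below simulates a two-stage pipeline in a single process. The names (`NUM_INSTANCES`, `route`, `run_stage`, and the two stage functions) are assumptions made up for this example; a real version would move records over the network, for instance via the low-retention topics mentioned above.

```python
import zlib
from collections import defaultdict

NUM_INSTANCES = 3  # hypothetical cluster size


def route(key):
    """Pick the instance that owns a key; crc32 is stable across runs, unlike hash()."""
    return zlib.crc32(key.encode("utf-8")) % NUM_INSTANCES


def run_stage(records, stage_fn):
    """Route each (key, value) record to its owning instance, then run stage_fn there."""
    inboxes = defaultdict(list)
    for key, value in records:
        inboxes[route(key)].append((key, value))
    output = []
    for instance in sorted(inboxes):
        output.extend(stage_fn(inboxes[instance]))
    return output


def rekey_by_city(records):
    """Stage 1: turn (person, city) into (city, 1) so the next stage routes by city."""
    return [(city, 1) for _person, city in records]


def count_per_key(records):
    """Stage 2: sum the values per key among the records that reached this instance."""
    totals = defaultdict(int)
    for key, n in records:
        totals[key] += n
    return list(totals.items())


people = [("alice", "Oslo"), ("bob", "London"), ("carol", "Oslo")]
by_city = run_stage(people, rekey_by_city)
counts = dict(run_stage(by_city, count_per_key))
```

Because every record with the same key is routed to the same instance, each instance can compute complete counts for the keys it owns without talking to the others.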

This approach is not necessarily separate from "Full Datalog/SQL support", since those options would probably also need some lower-level cross-instance coordination support.

But if this existed, it could also be the only interface the user has for queries.

GraphQL / Pull expression with extensions support

This is something closer to what we have built before. In this mode we only support a declarative pull expression with some filtering extensions, and maybe some aggregation extensions. Such queries are less powerful than those in the "Full Datalog/SQL support" category, but for that reason it is probably easier to infer and automate co-operation between instances for them.
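
A minimal sketch of why this might be easier to distribute: resolving a pull expression is a series of lookups by entity id, and each lookup could in principle be sent to whichever instance owns that id. The `DOCS` store, `lookup`, `pull`, and the pattern syntax below are all hypothetical and do not reflect an existing Crux API; filtering and aggregation extensions are left out.

```python
# Hypothetical document store; in a partitioned setup each id would live on one instance.
DOCS = {
    "alice": {"name": "Alice", "age": 31, "friend": "bob"},
    "bob": {"name": "Bob", "age": 25, "friend": None},
}


def lookup(entity_id):
    """Stand-in for fetching a document from whichever instance owns entity_id."""
    return DOCS.get(entity_id)


def pull(entity_id, pattern):
    """Resolve a pull pattern: plain attribute names, or {attribute: sub_pattern} for references."""
    doc = lookup(entity_id)
    if doc is None:
        return None
    result = {}
    for item in pattern:
        if isinstance(item, dict):
            # Nested pull: follow the reference stored under each attribute.
            for attr, sub_pattern in item.items():
                ref = doc.get(attr)
                result[attr] = pull(ref, sub_pattern) if ref is not None else None
        elif item in doc:
            result[item] = doc[item]
    return result


alice = pull("alice", ["name", {"friend": ["name", "age"]}])
```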

Other

These are some of the things I thought about, but there are more possibilities, so if you think of any please bring them to the meeting.
