MonkeyCI needs to store various kinds of information, such as build metadata, build parameters, logs, caches and artifacts.
Most of this information is fairly small and structured, except for logs, caches and artifacts, which can be large blobs. The structured information needs to be searchable to some extent, and of course it must be durable. I would like to keep an open view on which technology is most suited for this, so I don't want to blindly fall back to a relational database. Currently I'm thinking that keeping the information in `edn` (or `json`) files in object storage could be useful. This can then be augmented with some kind of indexing system to allow for searching. The indices themselves could also be stored in `edn`, and could be loaded into Redis or ElasticSearch. As long as there is no income, I will focus on the cheapest solution that gets the job done, without having to re-invent the wheel. OCI also offers an autonomous JSON database, which could serve our needs as well.
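As a small illustration of the edn-in-object-storage idea, here is a minimal sketch; the entity shape and names below are illustrative only, not the actual MonkeyCI code, and the upload call itself (which depends on the cloud SDK) is omitted:

```clojure
(require '[clojure.edn :as edn])

;; Illustrative build entity; the field names are assumptions.
(def build
  {:customer "acme"
   :repo     "website"
   :build-id "b-1234"
   :status   :success})

;; Serializing to edn is a single call; the resulting string can be
;; uploaded to object storage as-is.
(def serialized (pr-str build))

;; Reading it back preserves keywords and other Clojure types, which
;; plain json would not.
(def restored (edn/read-string serialized))
```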
The build process itself only needs to store entities; apart from caches and build parameters, there is no need to read anything. Currently, caches and artifacts are manipulated directly, build parameters are retrieved through the API, and updates are sent through events.
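To make that more concrete, a build-side update could be modeled as a plain event map. The event shape and the dispatch function below are assumptions, not the actual MonkeyCI event API:

```clojure
;; Hypothetical event describing a finished job.
(defn job-finished-event [build-id job-id status]
  {:type      :job/end
   :build-id  build-id
   :job-id    job-id
   :status    status
   :timestamp (System/currentTimeMillis)})

;; Stand-in for whatever actually dispatches events to the server.
(defn post-event! [evt]
  (println "dispatching event:" (pr-str evt)))

(post-event! (job-finished-event "b-1234" "compile" :success))
```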
Initially, we will store everything in object storage, as `edn` files. The advantage over `json` is that `edn` can be appended to: you can have multiple objects in one file. This could be useful for adding log statements, or for updating build progress. The information is stored in a single bucket, organized like `<customer>/<repository>/<build>`. The build id is generated by MonkeyCI, and could be as simple as a UUID.
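Because edn allows multiple top-level forms in one file, appending is straightforward. A minimal sketch, using a local file as a stand-in for an object in the bucket layout above (the helper names are illustrative):

```clojure
(require '[clojure.edn :as edn]
         '[clojure.java.io :as io])
(import '[java.io PushbackReader])

;; Append a new edn form; unlike json, the file stays readable because
;; edn permits multiple top-level objects.
(defn append-edn! [f obj]
  (spit f (str (pr-str obj) "\n") :append true))

;; Read all forms back as a sequence.
(defn read-all-edn [f]
  (with-open [r (PushbackReader. (io/reader f))]
    (doall (take-while some? (repeatedly #(edn/read {:eof nil} r))))))

(comment
  ;; Accumulating build progress in one object per build.
  (append-edn! "acme/website/b-1234/events.edn" {:event :build/start})
  (append-edn! "acme/website/b-1234/events.edn" {:event :job/start :job "compile"})
  (read-all-edn "acme/website/b-1234/events.edn"))
```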
Each build "folder" contains the following information:
Depending on the configuration, this could also just be stored locally, which is what we will do initially and when running in development mode.
Update: We now know that buckets are fairly slow, and OCI also imposes a request limit, which we hit pretty early, even in development mode with only one user. So this is clearly not the way to go, unless we want to put something in front of it, like a microservice that does caching and request grouping.
Instead of buckets, we could also use files, especially if we're prepared to build a microservice that handles the requests. We could use ZeroMQ for this, or something similar. (After playing with it, I know it also has its issues, but let's talk about that later.) It is faster than buckets and we don't have to take request limits into account. But the biggest downside is that it is not easy to scale: we could use NFS and mount it to multiple replicas, but this would still mean we need some way to "lock" the files so changes don't get overwritten by another replica.
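To illustrate that locking concern, here is a sketch using a JVM file lock; over NFS the actual guarantees depend on the server and mount options, so this only shows the shape of the problem, not a solution:

```clojure
(import '[java.io RandomAccessFile])

;; Run `f` while holding an exclusive lock on the file at `path`, so two
;; replicas writing the same entity don't overwrite each other's changes.
(defn with-file-lock [path f]
  (with-open [raf (RandomAccessFile. (str path) "rw")]
    (let [lock (.lock (.getChannel raf))]
      (try
        (f raf)
        (finally
          (.release lock))))))
```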
Another concern is that files are harder to search through. We would have to structure the data carefully in order to quickly find matches, and even then it's not always possible without duplicating information: sometimes you just need to access the data from different points of view. If we were to solve this, we would in reality be re-inventing the relational database, so we may as well just use one.
A good and easy-to-use relational database is MySQL, which is owned by Oracle and therefore well supported on OCI. This does mean we'll introduce another third-party service we need to host and maintain. We could use the cloud-provided service, but this comes at a cost (about €33/month for a basic system). Initially, we could set one up in the cluster ourselves.
Using an RDBMS would make it a lot more flexible for us to look up data, and we could use JSON fields for the more dynamic parts (like job definitions and results). However, since this data type is not standard and not supported by all database systems, it may be better to just use `edn` stored in `VARCHAR` fields.
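A rough idea of what that could look like; the table layout below is an assumption rather than the actual schema, with a plain text column holding the dynamic parts as edn, and next.jdbc is used purely for illustration:

```clojure
(require '[clojure.edn :as edn]
         '[next.jdbc :as jdbc])

;; Illustrative table: fixed, searchable columns plus one text column
;; holding the dynamic parts as edn, so no vendor-specific JSON type
;; is required.
(def builds-ddl
  "CREATE TABLE builds (
     id          VARCHAR(36) PRIMARY KEY,
     repo_id     VARCHAR(36) NOT NULL,
     status      VARCHAR(20),
     details_edn TEXT)")

(defn insert-build! [ds build]
  (jdbc/execute! ds
    ["INSERT INTO builds (id, repo_id, status, details_edn) VALUES (?,?,?,?)"
     (:id build) (:repo-id build) (name (:status build))
     (pr-str (select-keys build [:jobs :results]))]))

(defn build-details [row]
  (edn/read-string (:builds/details_edn row)))
```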
The biggest hurdle here is that we would need to rewrite most of the current entity code, because it is now oriented towards working with files. In order to avoid this, the current implementation uses the same `Protocol` as the other storage systems. This does introduce an additional layer of complexity, but it also makes for easier unit testing and decouples the application layer from the database details.
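The protocol idea looks roughly like this; the names are illustrative, not the actual MonkeyCI protocol, and the in-memory implementation shows why unit testing becomes easy:

```clojure
(defprotocol Storage
  "Abstraction over the underlying storage system."
  (read-obj   [this sid] "Reads the entity at storage id `sid`, or nil.")
  (write-obj  [this sid obj] "Writes the entity, returns `sid`.")
  (delete-obj [this sid] "Deletes the entity, returns true if it existed."))

;; In-memory implementation, useful in unit tests.
(defrecord MemoryStorage [store]
  Storage
  (read-obj [_ sid]
    (get @store sid))
  (write-obj [_ sid obj]
    (swap! store assoc sid obj)
    sid)
  (delete-obj [_ sid]
    (let [existed? (contains? @store sid)]
      (swap! store dissoc sid)
      existed?)))

(defn make-memory-storage []
  (->MemoryStorage (atom {})))
```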
Artifacts are just blobs that will be put into storage after each build step. Since storage is not free, we will have to put a limit on the amount of data, or on the period we store it for. Artifacts are configured at step level, and have a name and one or more paths that will be added to the artifact. We will probably use `tar` and `gzip` to put all files in one package.
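For example, packing a step's artifact could be as simple as shelling out to `tar`; the configuration shape and the destination naming here are assumptions:

```clojure
(require '[clojure.java.shell :as shell])

;; Pack the configured paths into a single compressed archive that can
;; then be uploaded to storage.
(defn pack-artifact! [work-dir {:keys [name paths]}]
  (let [dest (str name ".tgz")]
    (apply shell/sh "tar" "czf" dest (concat paths [:dir work-dir]))
    dest))

(comment
  ;; Step-level artifact configuration: a name plus one or more paths.
  (pack-artifact! "/tmp/build" {:name  "test-reports"
                                :paths ["target/reports" "target/coverage"]}))
```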
Caches are similar to artifacts, but they are not publicly available; instead they are reused between builds. Similar to CircleCI or GitLab, we could assign a key to each cache. This means that caches won't be stored along with the build, but higher up, most likely at repository level. Each build step can hold a `cache` configuration entry, which has a key and a list of paths that need to be cached/restored. Before the step is executed, the cache is restored (if found), and after the step, it is updated. Depending on the configuration, the update will happen only if the step was successful, or regardless of status.
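Putting that together, wrapping a step with its cache configuration could look like the sketch below; the restore/save functions are stand-ins for the real transfer code, and the configuration keys are assumptions:

```clojure
;; Stand-ins for the functions that move cached paths to and from
;; repository-level storage (assumptions, not the real API).
(defn restore-cache! [k paths]
  (println "restoring cache" k "into" paths))

(defn save-cache! [k paths]
  (println "saving cache" k "from" paths))

;; Restore before the step, update after it; whether a failed step still
;; updates the cache is driven by configuration.
(defn run-step-with-caches [{:keys [caches action] :as step}]
  (doseq [{:keys [key paths]} caches]
    (restore-cache! key paths))
  (let [result (action step)]
    (when (or (= :success (:status result))
              (:cache-on-failure? step))
      (doseq [{:keys [key paths]} caches]
        (save-cache! key paths)))
    result))

(comment
  (run-step-with-caches
   {:action (fn [_] {:status :success})
    :caches [{:key "mvn-deps" :paths [".m2/repository"]}]}))
```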