A Clojure wrapper for XGBoost4J: train, store and predict using the full power of XGBoost directly from your REPL.
Clojure is a great language for many things, but there's a field where it could shine and doesn't: data science & machine learning. The main reason is the lack of domain libraries that would help practitioners use off-the-shelf algorithms and solutions to do their work.
Python didn't become the leader in the field because it's inherently better or more performant, but because of scikit-learn, pandas and so on. While as Clojurists we don't really need pandas (dataframes) or similar tools (everything is just a map, or, if you care more about memory and performance, a record), we don't have something like scikit-learn that makes it really easy to train many kinds of machine learning models and somewhat easier to deploy them.
clj-boost clearly isn't an attempt at a scikit-learn replacement (something like that would require years of development), but it does give people a better way to test and deploy their models. Clojure is robust, reliable and fast enough for most uses out there.
This project is at a very early stage, though I have started using it in production without many issues. Please report any issue ASAP so we can get the best out of it and make data science with Clojure more reliable and more fun.
Add to your Leiningen `project.clj`:

```clojure
[clj-boost "0.0.3"]
```
For tools.deps, add to your `deps.edn`:

```clojure
clj-boost {:mvn/version "0.0.3"}
```
Start by requiring `clj-boost.core` in your namespace:

```clojure
(ns tutorial
  (:require [clj-boost.core :refer :all]))
```
XGBoost forces us to use its own data structures in exchange for speed and performance, so the first thing to do is transform your data into a `DMatrix`. You can pass various data structures to `dmatrix`:
It is possible to pass a map with an `:x` and optionally a `:y` key; their values must be either a sequence of sequences or a vector of vectors for `:x`, and a flat vector or sequence for `:y`. From now on, every time I use `x` and `y` I mean: `x` -> training data, `y` -> the objective to learn (required for training the model, optional for prediction).
```clojure
(dmatrix {:x [[0 1 0]
              [1 1 0]]
          :y [1 0]})

(dmatrix {:x [[0 1 0]
              [1 1 0]]})
```
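In practice your Clojure data often lives in row maps rather than nested vectors. A minimal plain-Clojure sketch of reshaping row maps into the `{:x ... :y ...}` form that `dmatrix` accepts (the `rows->xy` helper is hypothetical, not part of clj-boost):

```clojure
;; Hypothetical helper, not part of clj-boost: reshape a sequence of
;; row maps into the {:x ... :y ...} map accepted by dmatrix.
(defn rows->xy [rows feature-keys label-key]
  {:x (mapv (apply juxt feature-keys) rows) ; one feature vector per row
   :y (mapv label-key rows)})               ; flat vector of labels

(rows->xy [{:a 0 :b 1 :c 0 :label 1}
           {:a 1 :b 1 :c 0 :label 0}]
          [:a :b :c]
          :label)
;; => {:x [[0 1 0] [1 1 0]], :y [1 0]}
```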
The input can also be a vector of vectors/sequence of sequences for `x` and optionally a flat vector/sequence for `y`.
```clojure
(dmatrix [[[0 1 0]
           [1 1 0]]
          [1 0]])

(dmatrix [[0 1 0]
          [1 1 0]])
```
When given a string, `dmatrix` tries to load a `DMatrix` stored on disk at the given path:

```clojure
(dmatrix "data/train-set.dmatrix")
```
There's not much we can do with a `DMatrix`; for instance, once it is created it is impossible to convert it back to a regular Clojure data structure. At the moment the only available operation is getting its number of rows:
```clojure
(nrow (dmatrix data))
;; 50
```
Now fitting a model is just a matter of calling `fit` with the `DMatrix` as the first argument and, as the second, a config map with parameters for the model. Parameters are the same across all the XGBoost bindings, so the advice is to use the official XGBoost parameters documentation as a reference.
```clojure
(fit (dmatrix data)
     {:params {:eta 0.1
               :objective "binary:logistic"}
      :rounds 2
      :watches {:train (dmatrix data)
                :valid (dmatrix valid)}
      :early-stopping 10})
```
`fit` returns an XGBoost model instance (a Booster, for friends) that can be stored, used for prediction, or used as a baseline for further training. For the latter option, just add a `:booster` key to the config map with an already trained Booster instance:
```clojure
(fit (dmatrix data)
     {:params {:eta 0.1
               :objective "binary:logistic"}
      :rounds 2
      :watches {:train (dmatrix data)
                :valid (dmatrix valid)}
      :early-stopping 10
      :booster my-booster})
```
`cross-validation` works basically the same way, only you don't get a Booster in return, but the cross-validation results:
```clojure
(cross-validation (dmatrix data)
                  {:params {:eta 0.1
                            :objective "binary:logistic"}
                   :rounds 2
                   :nfold 3})
```
To get predictions there's the `predict` function, which takes a model (a Booster instance) and the data to predict on:
```clojure
(-> (fit (dmatrix data)
         {:params {:eta 0.1
                   :objective "binary:logistic"}
          :rounds 2
          :watches {:train (dmatrix data)
                    :valid (dmatrix valid)}
          :early-stopping 10})
    (predict (dmatrix test-data)))
```
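Note that with the `binary:logistic` objective XGBoost outputs probabilities, not class labels. A minimal plain-Clojure sketch of thresholding them into 0/1 labels (the `probs->labels` helper is hypothetical, not part of clj-boost):

```clojure
;; Hypothetical helper, not part of clj-boost: turn the probabilities
;; returned by predict into 0/1 class labels using a cutoff.
(defn probs->labels [probs threshold]
  (mapv #(if (>= % threshold) 1 0) probs))

(probs->labels [0.2 0.7 0.55] 0.5)
;; => [0 1 1]
```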
Let's say you're working with large data, or you're building an automated pipeline. Naturally, you'll want to `persist` your models and your data for later use or as intermediate results. Then you can `predict` new data by using `load-model`, ready for the data to come in:
```clojure
(persist (dmatrix data) "path/to/my-data")
(persist (dmatrix new-data) "path/to/my-new-data")

(-> (dmatrix "path/to/my-data")
    (fit {:params {:eta 0.1
                   :objective "binary:logistic"}
          :rounds 2
          :watches {:train (dmatrix data)
                    :valid (dmatrix valid)}
          :early-stopping 10})
    (persist "path/to/my-model"))

(-> (load-model "path/to/my-model")
    (predict (dmatrix "path/to/my-new-data")))
```
Since this is a common pattern, you might want to take a look at the `pipe` function: it takes a train `DMatrix`, a test `DMatrix`, a config map and, optionally, a path. `pipe` will train a model using the config as parameters, make predictions on the given test data and, if a path is given, store the model at that path.
```clojure
(pipe (dmatrix data)
      (dmatrix new-data)
      {:params {:eta 0.1
                :objective "binary:logistic"}
       :rounds 2
       :watches {:train (dmatrix data)
                 :valid (dmatrix valid)}
       :early-stopping 10}
      "path/to/my-model")
```
You can find a demo folder in this repo containing self-contained scripts and examples. The doc folder has guides and tutorials, which are also hosted on cljdoc, along with longer posts on my personal blog.
© Alan Marazzi, 2018. Licensed under an Apache-2 license.