clj-boost :sparkler:

A Clojure wrapper for XGBoost4J: train, store and predict using the full power of XGBoost directly from your REPL.

Rationale

Clojure is a great language for doing many things, but there's a field where it could shine and it doesn't: data science & machine learning. The main reason is the lack of domain libraries that would help practitioners to use off-the-shelf algorithms and solutions to do their work.

Python didn't become the leader in the field because it's inherently better or more performant, but because of scikit-learn, pandas and so on. As Clojurists we don't really need pandas (dataframes) or similar stuff (everything is just a map or, if you care more about memory and performance, a record), but we don't have something like scikit-learn that makes it really easy to train many kinds of machine learning models and somewhat easier to deploy them.

clj-boost clearly isn't a shot at scikit-learn - something like that would require years of development - but it's a way to give people a better way to test and deploy their models. Clojure is robust, reliable and fast enough for most of the possible uses out there.

Disclaimer

This project is at a very early stage, though I have started using it in production without many issues. Please let me know of any issue ASAP, so we can get the best out of it and make data science with Clojure more reliable and more fun.

Installation

Add to your Leiningen project.clj:

[clj-boost "0.0.3"]

For tools.deps:

clj-boost {:mvn/version "0.0.3"}

Usage

Start by requiring clj-boost.core in your namespace:

(ns tutorial
  (:require [clj-boost.core :refer :all]))

XGBoost forces us to use its data structures in exchange for speed and performance. So the first thing to do is to transform your data to a DMatrix. You can pass various data structures to dmatrix:

DMatrix

Map

You can pass a map with an :x key and, optionally, a :y key. The value of :x must be a sequence of sequences or a vector of vectors, and the value of :y a flat vector or sequence. From now on, every time I use x and y I mean: x -> training data, y -> the objective to learn (required for training the model, optional for prediction).

(dmatrix {:x [[0 1 0]
              [1 1 0]]
          :y [1 0]})

(dmatrix {:x [[0 1 0]
              [1 1 0]]})
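Data often arrives as a sequence of row maps rather than as nested vectors. The following is a minimal sketch, in plain Clojure, of reshaping such rows into the {:x ... :y ...} map described above. rows->xy and the sample keys (:age, :smoker, :active, :sick) are hypothetical names made up for illustration; they are not part of clj-boost.

```clojure
;; Hypothetical helper, NOT part of clj-boost: reshape a sequence of row maps
;; into the {:x [[...] ...] :y [...]} map that dmatrix accepts.
(defn rows->xy
  "feature-keys fixes the column order of :x; label-key is optional."
  ([rows feature-keys]
   ;; juxt over keywords pulls the features of a row out as one vector
   {:x (mapv (apply juxt feature-keys) rows)})
  ([rows feature-keys label-key]
   (assoc (rows->xy rows feature-keys)
          :y (mapv label-key rows))))

;; Made-up sample data
(def rows
  [{:age 0 :smoker 1 :active 0 :sick 1}
   {:age 1 :smoker 1 :active 0 :sick 0}])

(rows->xy rows [:age :smoker :active] :sick)
;; => {:x [[0 1 0] [1 1 0]], :y [1 0]}
```

The resulting map has the same shape as the literal passed to dmatrix above. Passing the feature keys explicitly keeps the column order stable, which matters because the columns are positional once the data is in nested vectors.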

Vector

The input can also be a vector of vectors/sequence of sequences for x and optionally a flat vector/sequence for y.

(dmatrix [[[0 1 0]
           [1 1 0]]
          [1 0]])
                
(dmatrix [[0 1 0]
          [1 1 0]])

String

When given a string, dmatrix tries to load a DMatrix stored on disk at the given path.

(dmatrix "data/train-set.dmatrix")

There's not much we can do with a DMatrix: once it is created, for instance, it is impossible to go back to a regular data structure. At the moment the only possible operation is to get the number of rows from it:

(nrow (dmatrix data))
;; 50

Fitting

Fitting a model is now just a matter of calling fit with the DMatrix as first argument and, as second argument, a config map with parameters for the model. Parameters are the same for every XGBoost flavor, so the advice is to use the XGBoost parameters page as a reference.

(fit (dmatrix data)
     {:params {:eta 0.1
               :objective "binary:logistic"}
      :rounds 2
      :watches {:train (dmatrix data)
                :valid (dmatrix valid)}
      :early-stopping 10})

fit returns an XGBoost model instance, or a Booster for friends, that can be stored, used for prediction or used as a baseline for further training. For the latter option, just add :booster with an already trained Booster instance to the config map.

(fit (dmatrix data)
     {:params {:eta 0.1
               :objective "binary:logistic"}
      :rounds 2
      :watches {:train (dmatrix data)
                :valid (dmatrix valid)}
      :early-stopping 10
      :booster my-booster})
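Since the config is an ordinary Clojure map, variations of it can be derived with core functions instead of being retyped. The sketch below is one possible approach, in plain Clojure, for keeping a base config and overriding single entries. base-config and with-overrides are hypothetical names, not part of clj-boost.

```clojure
;; Hypothetical helpers, NOT part of clj-boost.
(def base-config
  {:params {:eta 0.1
            :objective "binary:logistic"}
   :rounds 2})

(defn with-overrides
  "Merges overrides into config. Values that are maps on both sides
  (such as :params) are merged key by key; anything else is replaced."
  [config overrides]
  (merge-with (fn [old new]
                (if (and (map? old) (map? new))
                  (merge old new)
                  new))
              config
              overrides))

(with-overrides base-config {:params {:eta 0.3} :rounds 10})
;; => {:params {:eta 0.3, :objective "binary:logistic"}, :rounds 10}
```

The merge goes only one level deep, which is enough for the config shape shown above, where :params and :watches are flat maps.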

Cross-validation

cross-validation works basically the same way, except that it returns the cross-validation results instead of a Booster:

(cross-validation (dmatrix data)
                  {:params {:eta 0.1
                            :objective "binary:logistic"}
                   :rounds 2
                   :nfold 3})

Prediction

To get predictions there's the predict function that takes a model (a Booster instance) and data to predict.

(-> (fit (dmatrix data)
         {:params {:eta 0.1
                   :objective "binary:logistic"}
          :rounds 2
          :watches {:train (dmatrix data)
                    :valid (dmatrix valid)}
          :early-stopping 10})
    (predict (dmatrix test-data)))
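With the "binary:logistic" objective, XGBoost outputs probabilities rather than class labels. Assuming the predictions come back as a flat sequence of numbers (an assumption about the return shape of predict, worth checking at the REPL), a sketch of turning them into 0/1 labels and scoring them could look like this. ->labels and accuracy are hypothetical helpers, not part of clj-boost, and probs is made-up data standing in for real predictions.

```clojure
;; Hypothetical helpers, NOT part of clj-boost.
(defn ->labels
  "Maps each probability to 1 when it is at or above threshold, else 0.
  The threshold defaults to 0.5."
  ([probs] (->labels probs 0.5))
  ([probs threshold]
   (mapv #(if (>= % threshold) 1 0) probs)))

(defn accuracy
  "Fraction of predicted labels equal to the actual ones.
  Expects two non-empty sequences of the same length."
  [predicted actual]
  (/ (count (filter true? (map = predicted actual)))
     (count actual)))

;; Made-up probabilities standing in for the output of predict
(def probs [0.91 0.12 0.55 0.40])

(->labels probs)
;; => [1 0 1 0]

(accuracy (->labels probs) [1 0 0 0])
;; => 3/4
```

accuracy returns a Clojure ratio; wrap the result in double if a decimal number is preferred.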

Persistence

Let's say that you're working with large data or building an automated pipeline. You will want to persist your models and your data for later use or as intermediate results. Later on, you can restore a model with load-model and predict on new data as it comes in:

(persist (dmatrix data) "path/to/my-data")

(persist (dmatrix new-data) "path/to/my-new-data")

(-> (dmatrix "path/to/my-data")
    (fit 
     {:params {:eta 0.1
               :objective "binary:logistic"}
      :rounds 2
      :watches {:train (dmatrix data)
                :valid (dmatrix valid)}
      :early-stopping 10})
    (persist "path/to/my-model"))
    
(-> (load-model "path/to/my-model")
    (predict (dmatrix "path/to/my-new-data")))

Pipe

Since this is a common pattern you might want to take a look at the pipe function: it takes train-dmatrix, test-dmatrix, config and optionally a path. pipe will train a model by using config as parameters, make predictions on given test data and if a path is given it will store the model at path.

(pipe (dmatrix data)
      (dmatrix new-data)
      {:params {:eta 0.1
                :objective "binary:logistic"}
       :rounds 2
       :watches {:train (dmatrix data)
                 :valid (dmatrix valid)}
       :early-stopping 10}
      "path/to/my-model")

Tutorials

You can find a demo folder in this repo with self-contained scripts and examples. The doc folder contains guides and tutorials, which are also hosted on cljdoc, and there are longer posts on my personal blog.

To do

  • [x] Make some tutorials and posts about clj-boost usage
  • [ ] Add CI/CD to the repo
  • [ ] Reach feature parity with XGBoost4J
  • [ ] Add a method to generate config programmatically (atom?)
  • [ ] Add a way to perform grid search over parameters
  • [ ] Find a way to make prediction more tunable
  • [ ] Some facilities (like accuracy, confusion matrix, etc)???

License

© Alan Marazzi, 2018. Licensed under an Apache-2 license.