Basic training and prediction

clj-boost is, to all intents and purposes, a Clojure interface on top of XGBoost, so most of its functionality is (or will be) replicated. With clj-boost we can basically:

  • Serialize data to feed the algorithm
  • Train a model over a dataset
  • Cross-validate models
  • Predict on new data

Data reading

For this example we use the Iris dataset; you can get it directly from here.

(ns basic-train-predict.core
  (:require [clj-boost.core :as boost]
            [clojure.java.io :as io]
            [clojure.data.csv :as csv]))
 
(def iris-path "resources/iris.csv")

(defn generate-iris
  [iris-path]
  (with-open [reader (io/reader iris-path)]
    (into []
          ;; drop the header row, then split each row into
          ;; [4 feature strings, label string]
          (comp (drop 1) (map #(split-at 4 %)))
          (csv/read-csv reader))))

We read the csv file, drop the column names and split the resulting vectors between the features (sepal_length, sepal_width, petal_length, petal_width) and our response variable (species), which in this case takes three classes.
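
At this stage every row is still a pair of string sequences. A quick REPL check looks something like this (the exact values are a sketch, assuming the standard Iris csv whose first data row is 5.1,3.5,1.4,0.2,setosa):

(first (generate-iris iris-path))
;; => [("5.1" "3.5" "1.4" "0.2") ("setosa")]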

The next step is to parse numbers and convert labels to integers while splitting X (features) and Y (response):

(defn parse-float
  [s]
  (Float/parseFloat s))

(def transform-x
  (comp
   (map first)
   (map #(map parse-float %))
   (map vec)))

(def transform-y
  (comp
   (map last)
   (map (fn [label]
          (let [l (first label)]
            (case l
              "setosa"     0
              "versicolor" 1
              "virginica"  2))))))
 
(defn munge-data
  [iris-data]
  (let [x (into [] transform-x iris-data)
        y (into [] transform-y iris-data)]
    (map conj x y)))
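
After munging, each example is a single vector: the four parsed features with the encoded class appended at the end. For instance (again assuming the standard Iris csv):

(first (munge-data (generate-iris iris-path)))
;; => [5.1 3.5 1.4 0.2 0]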

Train-test split

It is always good practice to split the data into a training and a test set by randomly sampling the dataset:

(defn train-test-split
  [dataset n]
  (let [shuffled (shuffle dataset)]
    (split-at n shuffled)))

(defn train-set
  [split-set]
  (let [dset (first split-set)]
    {:x (mapv drop-last dset)
     :y (mapv last dset)}))
 
(defn test-set
  [split-set]
  (let [dset (last split-set)]
    {:x (mapv drop-last dset)
     :y (mapv last dset)}))

We create a map with the :x and :y keys, where :x contains a vector of feature vectors and :y a vector of classes. We need this to create a DMatrix, the data structure required to perform operations with XGBoost.

There are many ways to generate a DMatrix; be sure to check the docs or the dedicated README section to learn how to generate one from your data.
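
In this guide we use the map form that train-model relies on below: dmatrix accepts a map of :x (features) and :y (labels) directly. A minimal sketch:

(boost/dmatrix {:x [[5.1 3.5 1.4 0.2]]
                :y [0]})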

Model training

The fit function takes as its first argument a DMatrix, which must contain both :x and :y values, and as its second a config map with the following keys:

  • :params: training parameters for XGBoost; be sure to check the official docs for all the possible parameters
  • :rounds: the number of boosting iterations
  • :watches: a map of datasets on which XGBoost evaluates performance during training; it must have the shape {:name-you-want dmatrix-data} and you can pass more than one dataset at a time
  • :early-stopping: the number of consecutive rounds after which training stops if the evaluation metric (given in :params) gets worse on any of the :watches
  • :booster: optionally pass an existing XGBoost model to use as a base margin

(defn train-model
  [train-set]
  (let [data   (boost/dmatrix train-set)
        params {:params         {:eta       0.00001
                                 :objective "multi:softmax"
                                 :num_class 3}
                :rounds         2
                :watches        {:train data}
                :early-stopping 10}]
    (boost/fit data params)))

In the train-model function we first create a dmatrix, then define a params map in which we tell XGBoost to use a learning rate (:eta) of 0.00001, that the :objective of the learning task is multi-class classification returning the predicted class directly ("multi:softmax"; use "multi:softprob" if you want class probabilities instead) and that the dataset has 3 classes (:num_class).

In this case the remaining parameters are just an example: XGBoost will perform 2 :rounds of boosting and will evaluate performance on the training data itself (:watches). This is bad practice: usually you want to evaluate on a dataset other than the one you trained on.
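
For example, with a held-out validation set (valid-data here is a hypothetical DMatrix built the same way as data) the :watches entry would look like this:

{:watches {:train      data
           :validation valid-data}}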

fit returns a Booster instance that can be used for prediction or further training, or stored somewhere for future use.
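
Since the returned Booster is a plain XGBoost4J object, one way to store it is Java interop (a minimal sketch; the file path is just an example, and clj-boost may provide a dedicated helper, so check its docs):

;; model is the Booster returned by fit
(.saveModel model "resources/iris.model")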

Predicting new data

predict needs a trained model instance and new data in DMatrix form. It returns a sequence of predictions.

Be aware that new data must have the same format and the same column order as the training data to get a result that makes sense.

(defn predict-model
  [model test-set]
  (boost/predict model (boost/dmatrix test-set)))
 
(defn accuracy
  [predicted real]
  (let [right (map #(compare %1 %2) predicted real)]
    (/ (count (filter zero?  right))
       (count real))))

We also define an accuracy function to evaluate the model's predictions on the test data.
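
accuracy counts the pairs for which compare returns 0, so it yields an exact Clojure ratio. For example:

(accuracy [0 1 2] [0 1 1])
;; => 2/3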

Putting it all together

(defn -main
  []
  (let [split-set    (->> iris-path
                          generate-iris
                          munge-data
                          (train-test-split 120))
        [train test] (map #(% split-set) [train-set test-set])
        model        (train-model train)
        result       (predict-model model test)]
    (println "Prediction:" (mapv int result))
    (println "Real:      " (:y test))
    (println "Accuracy:  " (accuracy result (:y test)))))

Now you can either run (-main) from your REPL or run lein run from the command line, and you'll get a result similar to this:

Prediction: [1 1 2 0 2 2 2 2 2 1 1 0 1 2 0 1 1 1 0 1 0 2 1 1 0 0 1 2 1 1]
Real:       [1 1 2 0 2 2 2 2 2 1 1 0 1 2 0 1 1 1 0 1 0 2 1 1 0 0 1 2 2 1]
Accuracy:   29/30

Be aware that since we didn't fix a seed for the random number generator, you might get slightly different results.
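
If you need reproducible splits, clojure.core/shuffle doesn't take a seed, but you can write a seeded variant with Java interop (a hypothetical helper, not part of clj-boost) and swap it into train-test-split:

(defn seeded-shuffle
  [coll seed]
  ;; Collections/shuffle reorders a copy of coll in place,
  ;; driven by a Random built from the given seed
  (let [lst (java.util.ArrayList. coll)]
    (java.util.Collections/shuffle lst (java.util.Random. seed))
    (vec lst)))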
