clj-boost is to all intents and purposes a Clojure interface on top of XGBoost, so most of its functionality is or will be replicated. What we'll do with clj-boost here is a basic train-and-predict workflow.
For this example we use the Iris dataset; you can get it directly from here:
```clojure
(ns basic-train-predict.core
  (:require [clj-boost.core :as boost]
            [clojure.java.io :as io]
            [clojure.data.csv :as csv]))

(def iris-path "resources/iris.csv")

(defn generate-iris
  [iris-path]
  (with-open [reader (io/reader iris-path)]
    (into []
          (comp (drop 1) (map #(split-at 4 %)))
          (csv/read-csv reader))))
```
We read the csv file, drop the column names and split the resulting vectors into features (sepal_length, sepal_width, petal_length, petal_width) and our response variable (species), which in this case has three classes.
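A quick REPL check of what the transducer does to a single parsed row (the sample values here are made up for illustration):

```clojure
;; A parsed csv row is a vector of strings; split-at 4 divides it into
;; a pair: the four feature values and the label.
(split-at 4 ["5.1" "3.5" "1.4" "0.2" "setosa"])
;; => [("5.1" "3.5" "1.4" "0.2") ("setosa")]
```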
The next step is to parse numbers and convert labels to integers while splitting X (features) and Y (response):
```clojure
(defn parse-float
  [s]
  (Float/parseFloat s))

(def transform-x
  (comp
   (map first)
   (map #(map parse-float %))
   (map vec)))

(def transform-y
  (comp
   (map last)
   (map (fn [label]
          (let [l (first label)]
            (case l
              "setosa"     0
              "versicolor" 1
              "virginica"  2))))))
```
```clojure
(defn munge-data
  [iris-data]
  (let [x (into [] transform-x iris-data)
        y (into [] transform-y iris-data)]
    (map conj x y)))
```
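A REPL sketch of what `munge-data` produces for a single row (values made up): `transform-x` yields vectors of parsed features, `transform-y` the integer labels, and mapping `conj` over both appends each label to its feature vector.

```clojure
;; conj on a vector appends at the end, so each row becomes
;; [feature1 feature2 feature3 feature4 label]
(map conj [[5.1 3.5 1.4 0.2]] [0])
;; => ([5.1 3.5 1.4 0.2 0])
```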
It is always good practice to split the data into training and test sets by randomly sampling the dataset:
```clojure
(defn train-test-split
  [n dataset]
  (let [shuffled (shuffle dataset)]
    (split-at n shuffled)))

(defn train-set
  [split-set]
  (let [dset (first split-set)]
    {:x (mapv drop-last dset)
     :y (mapv last dset)}))

(defn test-set
  [split-set]
  (let [dset (last split-set)]
    {:x (mapv drop-last dset)
     :y (mapv last dset)}))
```
We create a map with the `:x` and `:y` keys, where `:x` contains a vector of feature vectors and `:y` a vector of classes. We need this to create a DMatrix, which is the data structure XGBoost requires to perform its operations.
There are many ways to generate a DMatrix; be sure to check the docs or the dedicated README section to learn how to generate one from your data.
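One way, used later in `train-model` below, is to build it from a map of feature vectors and labels (the literal values here are just a sketch):

```clojure
;; dmatrix from a map with :x (a vector of feature vectors)
;; and :y (a vector of labels), the same shape train-set produces
(boost/dmatrix {:x [[5.1 3.5 1.4 0.2]
                    [7.0 3.2 4.7 1.4]]
                :y [0 1]})
```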
The `fit` function takes a DMatrix as its first argument (it must contain both `:x` and `:y` values) and a config map with the following keys:

- `:params` : training parameters for XGBoost; be sure to check the official docs to see all the possible parameters
- `:rounds` : the number of boosting iterations
- `:watches` : a map of data that XGBoost uses to perform evaluation during training; it must have the shape `{:name-you-want dmatrix-data}`, and you can pass more than one dataset at a time
- `:early-stopping` : the number of consecutive rounds after which training stops if any evaluation metric given in `:params` increases on any of the `:watches`
- `:booster` : optionally pass an existing XGBoost model to use as a base margin

```clojure
(defn train-model
  [train-set]
  (let [data (boost/dmatrix train-set)
        params {:params {:eta 0.00001
                         :objective "multi:softmax"
                         :num_class 3}
                :rounds 2
                :watches {:train data}
                :early-stopping 10}]
    (boost/fit data params)))
```
In the `train-model` function we first create a `dmatrix`, then define a `params` map where we tell XGBoost to use a learning rate (`:eta`) of 0.00001, that the `:objective` of the learning task is multi-class classification, that we want the predicted class as a result (`"multi:softmax"`; use `"multi:softprob"` if you want class probabilities instead) and that the dataset has 3 classes (`:num_class`).
In this case the remaining parameters are just an example: XGBoost will perform 2 `:rounds` of boosting and will evaluate on the training data itself (`:watches`). This is bad practice: usually you want to evaluate on a dataset other than the training set itself.
`fit` returns a Booster instance that can be used for prediction or further training, or can be stored somewhere for future use.
`predict` needs a trained model instance and new data in DMatrix form, and it returns a sequence of predictions.
Be aware that the new data must have the same format and column order as the training data, otherwise the result won't make sense.
```clojure
(defn predict-model
  [model test-set]
  (boost/predict model (boost/dmatrix test-set)))

(defn accuracy
  [predicted real]
  (let [right (map #(compare %1 %2) predicted real)]
    (/ (count (filter zero? right))
       (count real))))
```
Here we also define an `accuracy` function to evaluate the model's predictions against the test data.
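Since both arguments are plain sequences, `accuracy` is easy to check at the REPL; for example, with three of four predictions matching:

```clojure
;; compare returns 0 for equal elements, so we count the zeros
;; and divide by the total, getting a Clojure ratio back
(accuracy [0 1 2 1] [0 1 1 1])
;; => 3/4
```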
```clojure
(defn -main
  []
  (let [split-set (->> iris-path
                       generate-iris
                       munge-data
                       (train-test-split 120))
        [train test] (map #(% split-set) [train-set test-set])
        model (train-model train)
        result (predict-model model test)]
    (println "Prediction:" (mapv int result))
    (println "Real: " (:y test))
    (println "Accuracy: " (accuracy result (:y test)))))
```
Now you can either run `(-main)` in your REPL or `lein run` from the command line, and you'll get a result similar to this:
```
Prediction: [1 1 2 0 2 2 2 2 2 1 1 0 1 2 0 1 1 1 0 1 0 2 1 1 0 0 1 2 1 1]
Real:       [1 1 2 0 2 2 2 2 2 1 1 0 1 2 0 1 1 1 0 1 0 2 1 1 0 0 1 2 2 1]
Accuracy:   29/30
```
Be aware that, since we didn't fix a seed for the random number generator, you might get slightly different results.