Library provides the set of functions to fit univariate distribution to your (uncensored) data.
Entry point: fit
function with supported methods:
:mle
:ks
:cvm
:ad
, :adr
, :adl
, :ad2r
, :ad2l
and :ad2
:qme
:mme
:mps
Additionally you can use:
bootstrap
to generate parameters from set of resampled datainfer
to generate parameters computationally from dataLibrary is highly based on fitdistrplus R package.
For details please read this paper.
fastmath distributions and optimization methods are used here.
For every method target function is created which accepts distribution parameters and returns log-likelihood, MSE/MAE of quantiles or differences between cdfs. Such function is minimized or maximized using one of the available algorithms (gradient based or simplex based). Optimization is bounded. Initial values for optimization are infered from data.
For a bootstrap, sequences of resampled data are created and then each sequence is fitted. Best result (mean or median) is used as a final parametrization. Additionally confidence interval (or other extent like iqr or min-max) is returned.
Values of any target function can be calculated and returned as a fitness measure.
Distributions implementation don't provide higher order moments but can calculate mean (first moment) and variance (second central moment). MME method uses both to match empirical mean and variance (regardless number of parameters to estimate). MSE or MAE is used as a target for optimization.
To run inference just call one of the following functions:
(fit method distribution data params)
(bootstrap method distribution data params)
(infer distribution data params)
where:
method
- one of supported methods as a keyword (like: :mle
or :qme
)distribution
- a name of the distribution as keywords (see below) (like: :beta
)data
- any sequable
of numbersparams
- parametrization (optional, see below)All methods return map with following keys:
:params
- best parametrization:distribution
- distribution object:distribution-name
- name of the distribution:method
- used fitting method:stats
- statistics (see below)For bootstrap you receive additionally:
:ci
- confidence interval (several methods, see below):ci-type
- name of the interval method:all-params
- (optional) list of parameters for each resampled dataset:params
- best parametrization (mean or median, depending on confidence interval)Some validations of the data and initial parameters are made.
(require '[fitdistr.core :refer :all]
'[fitdistr.distributions :refer [distribution-data]])
Proof that matching is accurate enough
(def target-data (->seq (distribution :weibull {:alpha 0.5 :beta 2.2}) 10000))
(fit :ad :weibull target-data {:stats [:mle]})
;; => {:stats
;; {:ad 0.19749431207310408,
;; :mle -19126.212671469282,
;; :aic 38256.425342938564,
;; :bic 38270.84602368252},
;; :params {:alpha 0.5014214878565807, :beta 2.203213102262515},
;; :distribution-name :weibull,
;; :distribution #object[org.apache.commons.math3.distribution.WeibullDistribution 0x430997b7 "org.apache.commons.math3.distribution.WeibullDistribution@430997b7"],
;; :method :ad}
(bootstrap :mle :weibull target-data {:stats #{:ad}})
;; => {:stats
;; {:mle -19126.178345014738,
;; :ad 0.35561024021990306,
;; :aic 38256.356690029475,
;; :bic 38270.77737077343},
;; :mad-median
;; {:alpha [0.4910043451070347 0.5056336146263343 0.4983189798666845],
;; :beta [2.1185018316179285 2.326029409552982 2.222265620585455]},
;; :params {:alpha 0.4983189798666845, :beta 2.222265620585455},
;; :distribution-name :weibull,
;; :distribution #object[org.apache.commons.math3.distribution.WeibullDistribution 0x63a766b9 "org.apache.commons.math3.distribution.WeibullDistribution@63a766b9"]}
(infer :weibull target-data {:stats #{:mle :ad}})
;; => {:stats
;; {:mle -19126.13369575803,
;; :ad 0.22838225327177497,
;; :aic 38256.26739151606,
;; :bic 38270.68807226002},
;; :params {:alpha 0.5012938746206328, :beta 2.215448048490149},
;; :distribution-name :weibull,
;; :distribution #object[org.apache.commons.math3.distribution.WeibullDistribution 0x3a0f2314 "org.apache.commons.math3.distribution.WeibullDistribution@3a0f2314"]}
(def inferred-distribution (fit :ad :weibull target-data {:stats [:mle]}))
inferred-distribution
;; => {:stats
;; {:ad 0.22793837346762302,
;; :mle -18836.462685291066,
;; :aic 37676.92537058213,
;; :bic 37691.34605132609},
;; :params {:alpha 0.5020961308787267, :beta 2.1661515133303646},
;; :distribution-name :weibull,
;; :distribution
;; #object[org.apache.commons.math3.distribution.WeibullDistribution 0x7f421db2 "org.apache.commons.math3.distribution.WeibullDistribution@7f421db2"],
;; :method :ad}
(->distribution inferred-distribution)
;; => #object[org.apache.commons.math3.distribution.WeibullDistribution 0x7f421db2 "org.apache.commons.math3.distribution.WeibullDistribution@7f421db2"]
(cdf inferred-distribution 0.5)
;; => 0.38057731286029817
(cdf inferred-distribution 0.5 10.0)
;; => 0.5035774570603441
(pdf inferred-distribution 0.5)
;; => 0.2979270382668242
(lpdf inferred-distribution 0.5)
;; => -1.2109066604852476
(icdf inferred-distribution 0.5)
;; => 1.0439237628813434
(sample inferred-distribution)
;; => 0.002386261009069703
(log-likelihood inferred-distribution (take 10 target-data))
;; => -24.233608452146168
(likelihood inferred-distribution (take 10 target-data))
;; => 2.988667305414128E-11
(mean inferred-distribution)
;; => 4.299110979375332
(variance inferred-distribution)
;; => 91.33715492734707
(lower-bound inferred-distribution)
;; => 0.0
(upper-bound inferred-distribution)
;; => ##Inf
(distribution-id inferred-distribution)
;; => :weibull
(distribution-parameters inferred-distribution)
;; => [:beta :alpha]
(drandom inferred-distribution)
;; => 7.4061584562769776
(lrandom inferred-distribution)
;; => 4
(irandom inferred-distribution)
;; => 40
(set-seed! inferred-distribution 1337)
(take 10 (->seq inferred-distribution))
;; => (2.5184760984751717
;; 2.9550268761778735
;; 9.930032804583968
;; 11.259341860117786
;; 0.0808352042851777
;; 17.399335542961957
;; 0.0564922448326893
;; 0.32170752149468795
;; 6.063628565109016
;; 2.2215931112225826)
(set-seed! inferred-distribution 1337)
(->seq inferred-distribution 10)
;; => (2.5184760984751717
;; 2.9550268761778735
;; 9.930032804583968
;; 11.259341860117786
;; 0.0808352042851777
;; 17.399335542961957
;; 0.0564922448326893
;; 0.32170752149468795
;; 6.063628565109016
;; 2.2215931112225826)
Search for the best distribution and its parameters. Look at last example where Pareto distribution is wrongly considered best when using inadequate method.
(def atv [0.6 2.8 182.2 0.8 478.0 1.1 215.0 0.7 7.9 316.2 0.2 17780.0 7.8 100.0 0.9 180.0 0.3 300.9
0.6 17.5 10.0 0.1 5.8 87.7 4.1 3.5 4.9 7060.0 0.2 360.0 100.8 2.3 12.3 40.0 2.3 0.1
2.7 2.2 0.4 2.6 0.2 1.0 7.3 3.2 0.8 1.2 33.7 14.0 21.4 7.7 1.0 1.9 0.7 12.6
3.2 7.3 4.9 4000.0 2.5 6.7 3.0 63.0 6.0 1.6 10.1 1.2 1.5 1.2 30.0 3.2 3.5 1.2
0.2 1.9 0.7 17.0 2.8 4.8 1.3 3.7 0.2 1.8 2.6 5.9 2.6 6.3 1.4 0.8 670.0 810.0
1890.0 1800.0 8500.0 21000.0 31.0 20.5 4370.0 1000.0 39891.8
316.2 6400.0 1000.0 7400.0 31622.8])
(defn find-best
[method ds]
(let [selector (if (= method :mle) last first)]
(dissoc (->> (map #(fit method % atv {:stats #{:mle :ad :ks :cvm}}) ds)
(remove (fn [v] (Double/isNaN (method (:stats v)))))
(sort-by (comp method :stats))
(selector))
:distribution)))
(find-best :mle [:weibull :log-normal :gamma :exponential :normal :pareto])
;; => {:stats
;; {:mle -532.4052019871922,
;; :cvm 0.6373592936482382,
;; :ks 0.1672497620724005,
;; :ad 3.4721179220009617,
;; :aic 1068.8104039743844,
;; :bic 1074.0991857726672},
;; :params {:scale 2.553816262077493, :shape 3.147240361221695},
;; :distribution-name :log-normal,
;; :method :mle}
(find-best :ad [:weibull :log-normal :gamma :exponential :normal :pareto])
;; => {:stats
;; {:ad 3.0345123029861156,
;; :cvm 0.4615381958965107,
;; :ks 0.1332827771382316,
;; :mle -532.9364810533066,
;; :aic 1069.8729621066132,
;; :bic 1075.161743904896},
;; :params {:scale 2.2941800698596815, :shape 3.2934516278879205},
;; :distribution-name :log-normal,
;; :method :ad}
(find-best :ks [:weibull :log-normal :gamma :exponential :normal :pareto])
;; => {:stats
;; {:ks 0.10129796316277348,
;; :mle -535.0197747143928,
;; :cvm 0.3675703954412623,
;; :ad 3.830809551957188,
;; :aic 1074.0395494287857,
;; :bic 1079.3283312270685},
;; :params {:scale 2.03465815391538, :shape 2.873339450786136},
;; :distribution-name :log-normal,
;; :method :ks}
This is example from the paper mentioned above (chapter 2).
Here, we try to match three distributions and using qme
method.
(def gb [30.0 10.0 20.0 24.0 20.0 24.0 40.0 20.0 50.0 30.0 26.0 44.0 25.0 20.0 40.0 100.0 30.0 80.0 50.0 40.0 20.0 50.0 50.0 50.0
50.0 80.0 80.0 20.0 60.0 50.0 50.0 50.0 50.0 60.0 50.0 50.0 50.0 50.0 17.0 120.0 80.0 100.0 80.0 100.0 50.0 70.0 50.0 50.0
50.0 30.0 90.0 125.0 50.0 25.0 50.0 75.0 100.0 60.0 100.0 11.5 100.0 100.0 50.0 40.0 50.0 50.0 50.0 150.0 50.0 80.0 100.0 100.0
80.0 20.0 30.0 100.0 70.0 100.0 80.0 80.0 50.0 50.0 50.0 70.0 25.0 40.0 100.0 25.0 130.0 100.0 60.0 60.0 100.0 50.0 50.0 50.0
50.0 50.0 50.0 100.0 30.0 50.0 50.0 50.0 100.0 50.0 50.0 75.0 25.0 20.0 75.0 50.0 50.0 100.0 100.0 80.0 130.0 200.0 40.0 64.0
40.0 80.0 80.0 79.0 80.0 105.0 51.0 130.0 105.0 105.0 80.0 80.0 40.0 40.0 50.0 105.0 105.0 80.0 105.0 40.0 50.0 130.0 105.0 80.0
40.0 80.0 50.0 100.0 200.0 79.0 150.0 80.0 80.0 80.0 80.0 80.0 65.0 130.0 40.0 80.0 80.0 80.0 130.0 51.0 130.0 80.0 80.0 130.0
51.0 25.0 79.0 80.0 50.0 51.0 50.0 105.0 80.0 80.0 130.0 130.0 80.0 51.0 50.0 130.0 115.0 115.0 200.0 160.0 105.0 80.0 80.0 80.0
130.0 130.0 80.0 51.0 50.0 130.0 105.0 105.0 80.0 105.0 80.0 60.0 51.0 105.0 80.0 51.0 40.0 130.0 130.0 40.0 32.0 80.0 25.0 80.0
80.0 80.0 40.0 130.0 130.0 40.0 75.0 40.0 80.0 80.0 150.0 52.5 130.0 80.0 80.0 75.0 40.0 25.5 51.0 50.0 80.0 51.0 130.0 130.0
51.0 130.0 80.0 130.0 80.0 130.0 130.0 25.5 117.0 130.0 80.0 80.0 80.0 80.0])
Histgram with density and cumulative histogram for data.
(fit :qme :gamma gb {:stats [:mle]})
;; => {:stats
;; {:qme 45.41350132338318,
;; :mle -1253.652887342403,
;; :aic 2511.305774684806,
;; :bic 2518.3804432188426},
;; :params {:scale 18.58013425207939, :shape 3.992585123164037},
;; :distribution-name :gamma,
;; :method :qme}
(fit :qme :weibull gb {:stats [:mle]})
;; => {:stats
;; {:qme 52.75300844658587,
;; :mle -1256.1208259372665,
;; :aic 2516.241651874533,
;; :bic 2523.3163204085704},
;; :params {:alpha 2.052173279173565, :beta 83.1667239954531},
;; :distribution-name :weibull,
;; :method :qme}
(fit :qme :log-normal gb {:stats [:mle]})
;; => {:stats
;; {:qme 58.022608223760685,
;; :mle -1267.6351159421142,
;; :aic 2539.2702318842285,
;; :bic 2546.3449004182658},
;; :params {:scale 4.204520160089395, :shape 0.46585471421274693},
;; :distribution-name :log-normal,
;; :method :qme}
Looks that gamma
is the best.
Let's see some plots:
Bootstrap groundbeef data.
(def b-gamma (bootstrap :qme :gamma gb {:all-params? true
:ci-type :min-max-mean
:samples 10000
:size 100}))
(count (:all-params b-gamma))
;; => 10000
(:ci b-gamma)
;; => {:scale [9.436000626775108 28.586945660073017 17.967024829900268],
;; :shape [2.5005015188673814 7.565181039020107 4.238506386828045]}
When low number of quantiles are used, different parameters are infered.
(def b-gamma-lowq (bootstrap :qme :gamma gb {:all-params? true
:quantiles 3
:samples 10000
:size 100}))
A couple of words about distributions. All of them are backed by Apache Commons Math and SMILE libraries. You can call following functions on distribution object or on fitting result.
pdf
- densitylpdf
- log densitycdf
- cumulative densityicdf
- inversed cumulative density or quantileprobability
- pdf for continuous and probability for discrete densitiessample
- random value from distributionlog-likelihood
- log-likelihood of datalikelihood
- likelihood of datamean
and variance
of the distributionlower-bound
and upper-bounds
- distribution supportdrandom
, lrandom
, irandom
- double/long/integer random sample from the distribution->seq
- generate sequence of samplesset-seed!
- set distribution seedParameter names for given distribution match mostly Apache Commons Math scheme and can differ from other sources (like Wikipedia or R). The list of supported distributions can be obtained by calling:
(sort (keys (methods distribution-data)))
;; => (:bb
;; :bernoulli
;; :beta
;; :binomial
;; :cauchy
;; :chi
;; :chi-squared
;; :chi-squared-noncentral
;; :erlang
;; :exponential
;; :f
;; :fatigue-life
;; :frechet
;; :gamma
;; :geometric
;; :gumbel
;; :half-normal
;; :hyperbolic-secant
;; :inverse-gamma
;; :inverse-gaussian
;; :johnson-sb
;; :johnson-sl
;; :johnson-su
;; :laplace
;; :levy
;; :log-logistic
;; :log-normal
;; :logarithmic
;; :logistic
;; :nakagami
;; :negative-binomial
;; :normal
;; :pareto
;; :pascal
;; :pearson-6
;; :poisson
;; :power
;; :rayleigh
;; :t
;; :triangular
;; :weibull)
The list of distribution parameters can be checked by calling (:param-names (fitdistr.distributions/distribution-data :weibull))
Please refer Apache Commons Math API for all details.
For fit
and bootstrap
functions parameter optimization is used to minimize or maximize underlying target function..
Set :optimizer
to select optimizer. Optimizer tuning is possible by setting parameters listed below.
:lbfgsb
(default)
:gradient-h
:gradient
:gradient-h
- step size h
in two-point finite difference formula used to calculate gradient:formula
- :polak-riberie
(default) or :fletcher-reeves
:nelder-mead
:rho
, :khi
, :gamma
, :sigma
:side-length
:multidirectional-simplex
:rho
, :khi
, :gamma
:side-length
:cmaes
- often problem with convergence
:active-cma?
(true):diagonal-only
:check-feasable-count
:stop-fitness
:population-size
:bobyqa
- not satisfying results
:number-of-points
:initial-radius
:stopping-radius
:powell
- also not statisfyingParameters common to all optimizers.
:rel
and :abs
for relative and absolute accuracy:max-evals
for maximum number of function evaluation (default: no limit):max-iters
for maximum number of iterations (default: 1000)Please refer Apache Commons Math API for details.
When calling fitting method you can provide additional parameters which are. All are optional.
:stats
- set of requested fitting function values, the same as fitting methods, default: same as method or :mle
:initial
- vector of inital values for parameters, default: infered from data:quantiles
for :qme
- sequence of quantiles to match or number of quantiles (evenly distributed between 0 and 1), default: 50:strategy
for :qme
- quantile estimation strategy, one of :legacy
:r1
:r2
:r3
:r4
:r5
:r6
:r7
:r8
:r9
, default :legacy
:mse?
for :qme
and :mme
- method of difference measurement, true
- mean squared error or false
- mean absolute error, default true
:optimizer
- as above, default: :nelder-mead
:samples
for bootstrap
- number of resampled data, default: 100:size
for bootstrap
- size of the sequence, default: 10% of data, minimum 100, maximum 5000:ci-type
for bootstrap
- interval type (see below), default: :mad-median
:all-params?
for bootstrap
- return list of parameters for each resampled sequence, default: false
bootstrap
generates sequence of parameters and they also follow some distribution. Library provides various methods to analyze parameters from resampled data. bootstrap
returns three values under :ci
key: [left,right,center]
where: left
and right
form interval and center
is given statistic (mean or median). Following intervals are possible:
:stddev-mean
- standard deviation and mean:mad-median
- median absolute deviation and median:sem-mean
- standard error of mean and mean:iqr-median
- IQR and median:percentile-bc
- Bias corrected percentile method:percentile-bca
- Accelerated bias corrected percentile method:adj-median
- adjacent values and median:ci
- confidence interval based on Student's t-distribution and mean:min-max-mean
- minimum, maximum values and meanExample: values of each type for 10000 samples from N(0,1)
{:stddev-mean (-0.9922 1.0016 0.0047)
:mad-median (-0.6564 0.6762 0.0099)
:sem-mean (-0.0053 0.0147 0.0047)
:iqr-median (-0.663 0.6699 0.0099)
:percentile-bc (-0.0219 0.0331 0.0069)
:percentile-bca (-0.0222 0.0328 0.0069)
:adj-median (-0.6564 0.6762 0.0099)
:ci (-0.0185 0.0279 0.0047)
:min-max-mean (-3.9825 4.7877 0.0047)}
Possible strategies, try different:
:initial
parameter.:gradient
works best, sometimes :powell
:mle
and :ad
family are based on log
function which returns ##-Inf
for probability 0.0. Optimizers can fall into the trap. Often :ks
and :qme
are enough. When using :qme
play with quantiles.Also instead on fitting you can rely on bootstrap
(espacially for big datasets).
:ad
in mge
may converge slow:gradient
on Pareto distribution using :ad
method):levy
requires :gradient
optimizer to converge:johnson-su
inference only for xi and lambda, fits well:johnson-sl
inference doesn't calculate gamma
, doesn't fit well:johnson-sb
inference is wrong, but fits well:frechet
inference is wrong, but fits well:triangular
use inference onlyinfer
doesn't return proper parameters in some cases for :f
, :nakagami
, :levy
Copyright © 2019-2023 GenerateMe
This program and the accompanying materials are made available under the terms of the Eclipse Public License 2.0 which is available at http://www.eclipse.org/legal/epl-2.0.
This Source Code may also be made available under the following Secondary Licenses when the conditions for such availability set forth in the Eclipse Public License, v. 2.0 are satisfied: GNU General Public License as published by the Free Software Foundation, either version 2 of the License, or (at your option) any later version, with the GNU Classpath Exception which is available at https://www.gnu.org/software/classpath/license.html.
Can you improve this documentation? These fine people already did:
generateme, GenerateMe, genmeblog, Marc & Carsten BehringEdit on GitHub
cljdoc is a website building & hosting documentation for Clojure/Script libraries
× close