This namespace provides a comprehensive collection of functions for performing statistical analysis in Clojure. It focuses on providing efficient implementations for common statistical tasks, leveraging fastmath's underlying numerical capabilities.

This namespace covers a wide range of statistical methods, including:

* **Descriptive Statistics**: Measures of central tendency (mean, median, mode, expectile), dispersion (variance, standard deviation, MAD, SEM), and shape (skewness, kurtosis, L-moments).
* **Quantiles and Percentiles**: Functions for calculating percentiles, quantiles, and the median, including weighted versions and various estimation strategies.
* **Intervals and Extents**: Methods for defining ranges within data, such as span, IQR, standard deviation/MAD/SEM extents, percentile/quantile intervals, prediction intervals (PI, HPDI), and fence boundaries for outlier detection.
* **Outlier Detection**: Functions for identifying data points outside conventional fence boundaries.
* **Data Transformation**: Utilities for scaling, centering, trimming, winsorizing, and applying power transformations (Box-Cox, Yeo-Johnson) to data.
* **Correlation and Covariance**: Measures of the linear and monotonic relationship between two or more variables (Pearson, Spearman, Kendall), and functions for generating covariance and correlation matrices.
* **Distance and Similarity Metrics**: Functions for quantifying differences or likeness between data sequences or distributions, including error metrics (MAE, MSE, RMSE), L-p norms, and various distribution dissimilarity/similarity measures.
* **Contingency Tables**: Functions for creating, analyzing, and deriving measures of association and agreement (Cramer's V, Cohen's Kappa) from contingency tables, including specialized functions for 2x2 tables.
* **Binary Classification Metrics**: Functions for generating confusion matrices and calculating a wide array of performance metrics (Accuracy, Precision, Recall, F1, MCC, etc.).
* **Effect Size**: Measures quantifying the magnitude of statistical effects, including difference-based (Cohen's d, Hedges' g, Glass's delta), ratio-based, ordinal/non-parametric (Cliff's Delta, Vargha-Delaney A), and overlap-based (Cohen's U, p-overlap), as well as measures related to explained variance (Eta-squared, Omega-squared, Cohen's f²).
* **Statistical Tests**: Functions for performing hypothesis tests, including:
  - Normality and Shape tests (Skewness, Kurtosis, D'Agostino-Pearson K², Jarque-Bera, Bonett-Seier).
  - Binomial tests and confidence intervals.
  - Location tests (one-sample and two-sample T/Z tests, paired/unpaired).
  - Variance tests (F-test, Levene's, Brown-Forsythe, Fligner-Killeen).
  - Goodness-of-Fit and Independence tests (Power Divergence family including Chi-squared, G-test; AD/KS tests).
  - ANOVA and Rank Sum tests (One-way ANOVA, Kruskal-Wallis).
  - Autocorrelation tests (Durbin-Watson).
* **Time Series Analysis**: Functions for analyzing the dependence structure of time series data, such as Autocorrelation (ACF) and Partial Autocorrelation (PACF).
* **Histograms**: Functions for computing histograms and estimating optimal binning strategies.

This namespace aims to provide a robust set of statistical tools for data analysis and modeling within the Clojure ecosystem.
(acf data)
(acf data lags)
Calculates the Autocorrelation Function (ACF) for a given time series `data`.

The ACF measures the linear dependence between a time series and its lagged values. It helps identify patterns (like seasonality or trend) and inform the selection of models for time series analysis (e.g., in ARIMA modeling).

Parameters:

* `data` (seq of numbers): The time series data.
* `lags` (long or seq of longs, optional):
  * If a number, calculates ACF for lags from 0 up to this maximum lag.
  * If a sequence of numbers, calculates ACF for each lag specified in the sequence.
  * If omitted (1-arity call), calculates ACF for lags from 0 up to `(dec (count data))`.

Returns a sequence of doubles: the autocorrelation coefficients for the specified lags. The value at lag 0 is always 1.0.

See also [[acf-ci]] (calculates ACF with confidence intervals), [[pacf]], [[pacf-ci]].
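A minimal usage sketch, assuming this namespace is required as `fastmath.stats` (aliased to `stats` below); the data values are made up:

```clojure
(require '[fastmath.stats :as stats])

;; A short, noisy upward-trending series.
(def data [1.2 1.9 3.1 3.9 5.2 5.8 7.1 8.0 8.9 10.1])

;; ACF for lags 0..3; the coefficient at lag 0 is always 1.0.
(stats/acf data 3)

;; ACF for selected lags only.
(stats/acf data [0 2 4])
```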
(acf-ci data)
(acf-ci data lags)
(acf-ci data lags alpha)
variance of the sum of squared sample autocorrelations up to each lag.Calculates the Autocorrelation Function (ACF) for a time series and provides approximate confidence intervals. This function computes the ACF of the input time series `data` for specified lags (see [[acf]]) and includes approximate confidence intervals around the ACF estimates. These intervals help determine whether the autocorrelation at a specific lag is statistically significant (i.e., likely non-zero in the population). Parameters: * `data` (seq of numbers): The time series data. * `lags` (long or seq of longs, optional): * If a number, calculates ACF for lags from 0 up to this maximum lag. * If a sequence of numbers, calculates ACF for each lag specified in the sequence. * If omitted (1-arity call), calculates ACF for lags from 0 up to `(dec (count data))`. * `alpha` (double, optional): The significance level for the confidence intervals. Defaults to `0.05` (for a 95% CI). Returns a map containing: * `:ci` (double): The value of the approximate standard confidence interval bound for lags > 0. If the absolute value of an ACF coefficient at lag `k > 0` exceeds this value, it is considered statistically significant. * `:acf` (seq of doubles): The sequence of autocorrelation coefficients at lags from 0 up to `lags` (or specified lags if `lags` is a sequence), calculated using [[acf]]. * `:cis` (seq of doubles): Cumulative confidence intervals for ACF. These are based on the variance of the sum of squared sample autocorrelations up to each lag. See also [[acf]], [[pacf]], [[pacf-ci]].
(ad-test-one-sample xs)
(ad-test-one-sample xs distribution-or-ys)
(ad-test-one-sample xs
distribution-or-ys
{:keys [sides kernel bandwidth]
:or {sides :right kernel :gaussian}})
Performs the Anderson-Darling (AD) test for goodness-of-fit.

This test assesses the null hypothesis that a sample `xs` comes from a specified theoretical distribution or another empirical distribution. It is sensitive to differences in the tails of the distributions.

Parameters:

- `xs` (seq of numbers): The sample data to be tested.
- `distribution-or-ys` (optional):
  - A `fastmath.random` distribution object to test against. If omitted, defaults to the standard normal distribution (`fastmath.random/default-normal`).
  - A sequence of numbers (`ys`). In this case, an empirical distribution is estimated from `ys` using Kernel Density Estimation (KDE) or an enumerated distribution (see `:kernel` option).
- `opts` (map, optional): Options map:
  - `:sides` (keyword, default `:right`): Specifies the side(s) of the A^2 statistic's distribution used for p-value calculation.
    - `:right` (default): Tests if the observed A^2 statistic is significantly large (standard approach for AD test, indicating poor fit).
    - `:left`: Tests if the observed A^2 statistic is significantly small.
    - `:two-sided`: Tests if the observed A^2 statistic is extreme in either tail.
  - `:kernel` (keyword, default `:gaussian`): Used only when `distribution-or-ys` is a sequence. Specifies the method to estimate the empirical distribution:
    - `:gaussian` (or other KDE kernels): Uses Kernel Density Estimation.
    - `:enumerated`: Creates a discrete empirical distribution from `ys`.
  - `:bandwidth` (double, optional): Bandwidth for KDE (if applicable).

Returns a map containing:

- `:A2`: The Anderson-Darling test statistic (A^2).
- `:stat`: Alias for `:A2`.
- `:p-value`: The p-value associated with the test statistic and the specified `:sides`.
- `:n`: Sample size of `xs`.
- `:mean`: Mean of the sample `xs` (for context).
- `:stddev`: Standard deviation of the sample `xs` (for context).
- `:sides`: The alternative hypothesis side used for p-value calculation.
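A minimal sketch (same `stats` alias as above). With no second argument the sample is tested against the standard normal distribution, per the description:

```clojure
(let [xs [0.1 -0.4 1.2 0.3 -0.8 0.5 -0.2 0.9 -1.1 0.0]]
  ;; A large :A2 (and small :p-value) would indicate a poor fit.
  (select-keys (stats/ad-test-one-sample xs) [:A2 :p-value :sides]))
```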
(adjacent-values vs)
(adjacent-values vs estimation-strategy)
(adjacent-values vs q1 q3 m)
Lower and upper adjacent values (LAV and UAV).

Let Q1 be the 25th percentile and Q3 the 75th percentile; IQR is `(- Q3 Q1)`.

* LAV is the smallest value greater than or equal to the LIF = `(- Q1 (* 1.5 IQR))`.
* UAV is the largest value less than or equal to the UIF = `(+ Q3 (* 1.5 IQR))`.
* The third returned value is the median of the samples.

The optional `estimation-strategy` argument can be set to change the quantile estimation type. See [[estimation-strategies]].
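For example (same `stats` alias as above):

```clojure
;; 100 lies above the upper inner fence, so the UAV ignores it.
(stats/adjacent-values [1 2 3 4 5 6 7 8 9 100])
```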
(ameasure [group1 group2])
(ameasure group1 group2)
Calculates the Vargha-Delaney A measure for two independent samples.

A non-parametric effect size measure quantifying the probability that a randomly chosen value from the first sample (`group1`) is greater than a randomly chosen value from the second sample (`group2`).

Parameters:

- `group1`: The first independent sample.
- `group2`: The second independent sample.

Returns the calculated A measure (a double) in the range [0, 1]. A value of 0.5 indicates stochastic equality (distributions are overlapping). Values > 0.5 mean `group1` tends to be larger; values < 0.5 mean `group2` tends to be larger.

Related to [[cliffs-delta]] and the Wilcoxon-Mann-Whitney U test statistic.

See also [[cliffs-delta]], [[wmw-odds]].
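A quick sketch (same `stats` alias). Here every `group1` value exceeds every `group2` value, so by the definition above A should be 1.0:

```clojure
;; Complete separation: group1 is always larger.
(stats/ameasure [6 7 8 9 10] [1 2 3 4 5])
;; => 1.0
```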
(binary-measures confusion-matrix)
(binary-measures actual prediction)
(binary-measures actual prediction true-value)
(binary-measures tp fn fp tn)
Calculates a selected subset of common evaluation metrics for binary classification results.

This function is a convenience wrapper around [[binary-measures-all]], providing a map containing the most frequently used metrics derived from a 2x2 confusion matrix.

The 2x2 confusion matrix is based on True Positives (TP), False Positives (FP), False Negatives (FN), and True Negatives (TN):

|                  | Predicted True | Predicted False |
|:-----------------|:---------------|:----------------|
| **Actual True**  | TP             | FN              |
| **Actual False** | FP             | TN              |

The function accepts the same input formats as [[binary-measures-all]]:

1. `(binary-measures tp fn fp tn)`: Direct input of the four counts.
2. `(binary-measures confusion-matrix)`: Input as a structured representation (map with keys like `:tp`, `:fn`, `:fp`, `:tn`; sequence of sequences `[[TP FP] [FN TN]]`; or flat sequence `[TP FN FP TN]`).
3. `(binary-measures actual prediction)`: Input as two sequences of outcomes.
4. `(binary-measures actual prediction true-value)`: Input as two sequences with a specified encoding for `true` (success).

Parameters:

- `tp`, `fn`, `fp`, `tn` (long): Counts from the confusion matrix.
- `confusion-matrix` (map or sequence): Representation of the confusion matrix.
- `actual`, `prediction` (sequences): Sequences of true and predicted outcomes.
- `true-value` (optional): Specifies how outcomes are converted to boolean `true`/`false`.

Returns a map containing the following selected metrics:

- `:tp` (True Positives)
- `:tn` (True Negatives)
- `:fp` (False Positives)
- `:fn` (False Negatives)
- `:accuracy`
- `:fdr` (False Discovery Rate, 1 - Precision)
- `:f-measure` (F1 Score, harmonic mean of Precision and Recall)
- `:fall-out` (False Positive Rate)
- `:precision` (Positive Predictive Value)
- `:recall` (True Positive Rate / Sensitivity)
- `:sensitivity` (Alias for Recall/TPR)
- `:specificity` (True Negative Rate)
- `:prevalence` (Proportion of positive cases)

See also [[confusion-matrix]], [[binary-measures-all]], [[mcc]], [[contingency-2x2-measures-all]].
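Two hedged call sketches matching the documented input formats (same `stats` alias; counts are illustrative):

```clojure
;; From raw counts: tp=10, fn=2, fp=5, tn=80.
(:accuracy (stats/binary-measures 10 2 5 80))
;; => (10 + 80) / 97, roughly 0.928

;; From actual/predicted outcome sequences (1 = positive).
(stats/binary-measures [1 1 0 0 1] [1 0 0 1 1])
```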
(binary-measures-all confusion-matrix)
(binary-measures-all actual prediction)
(binary-measures-all actual prediction true-value)
(binary-measures-all tp fn fp tn)
Calculates a comprehensive set of evaluation metrics for binary classification results.

This function computes various statistics derived from a 2x2 confusion matrix, summarizing the performance of a binary classifier.

The 2x2 confusion matrix is based on True Positives (TP), False Positives (FP), False Negatives (FN), and True Negatives (TN):

|                  | Predicted True | Predicted False |
|:-----------------|:---------------|:----------------|
| **Actual True**  | TP             | FN              |
| **Actual False** | FP             | TN              |

The function supports several input formats:

1. `(binary-measures-all tp fn fp tn)`: Direct input of the four counts as arguments.
   - `tp` (long): True Positive count.
   - `fn` (long): False Negative count.
   - `fp` (long): False Positive count.
   - `tn` (long): True Negative count.
2. `(binary-measures-all confusion-matrix)`: Input as a structured representation of the confusion matrix.
   - `confusion-matrix`: Can be:
     - A map with keys like `:tp`, `:fn`, `:fp`, `:tn` (e.g., `{:tp 10 :fn 2 :fp 5 :tn 80}`).
     - A sequence of sequences representing rows `[[TP FP] [FN TN]]` (e.g., `[[10 5] [2 80]]`).
     - A flat sequence `[TP FN FP TN]` (e.g., `[10 2 5 80]`).
3. `(binary-measures-all actual prediction)`: Input as two sequences of outcomes.
   - `actual` (sequence): Sequence of true outcomes.
   - `prediction` (sequence): Sequence of predicted outcomes. Must have the same length as `actual`.
   - Values in `actual` and `prediction` are converted to boolean `true`/`false`. By default, any non-`nil` or non-zero numeric value is treated as `true`, and `nil` or `0.0` is treated as `false`.
4. `(binary-measures-all actual prediction true-value)`: Input as two sequences with a specified encoding for `true`.
   - `actual`, `prediction`: Sequences as in the previous arity.
   - `true-value` (optional): Specifies how values in `actual` and `prediction` are converted to boolean `true` (success) or `false` (failure).
     - `nil` (default): Non-`nil`/non-zero (for numbers) is true.
     - Any sequence/set: Values found in this collection are true.
     - A map: Values are mapped according to the map; if a key is not found or maps to `false`, the value is false.
     - A predicate function: Returns `true` if the value satisfies the predicate.

Returns a map containing a wide array of calculated metrics. This includes, but is not limited to:

- Basic Counts: `:tp`, `:fn`, `:fp`, `:tn`
- Totals: `:cp` (Actual Positives), `:cn` (Actual Negatives), `:pcp` (Predicted Positives), `:pcn` (Predicted Negatives), `:total` (Grand Total)
- Rates (often ratios of counts):
  - `:tpr` (True Positive Rate, Recall, Sensitivity, Hit Rate)
  - `:fnr` (False Negative Rate, Miss Rate)
  - `:fpr` (False Positive Rate, Fall-out)
  - `:tnr` (True Negative Rate, Specificity, Selectivity)
  - `:ppv` (Positive Predictive Value, Precision)
  - `:fdr` (False Discovery Rate, `1 - ppv`)
  - `:npv` (Negative Predictive Value)
  - `:for` (False Omission Rate, `1 - npv`)
- Ratios/Odds:
  - `:lr+` (Positive Likelihood Ratio)
  - `:lr-` (Negative Likelihood Ratio)
  - `:dor` (Diagnostic Odds Ratio)
- Combined Scores:
  - `:accuracy`
  - `:ba` (Balanced Accuracy)
  - `:fm` (Fowlkes–Mallows index)
  - `:pt` (Prevalence Threshold)
  - `:ts` (Threat Score, Jaccard index)
  - `:f-measure` / `:f1-score` (F1 Score, special case of F-beta score)
  - `:f-beta` (Function to calculate F-beta for any beta)
  - `:mcc` / `:phi` (Matthews Correlation Coefficient, Phi coefficient)
  - `:bm` (Bookmaker Informedness)
  - `:kappa` (Cohen's Kappa, for 2x2 table)
  - `:mk` (Markedness)

Metrics are generally calculated using standard formulas based on the TP, FN, FP, TN counts. For more details on specific metrics, refer to standard classification literature or the Wikipedia page on [Precision and recall](https://en.wikipedia.org/wiki/Precision_and_recall), which covers many of these concepts.

See also [[confusion-matrix]], [[binary-measures]] (for a selected subset of metrics), [[mcc]], [[contingency-2x2-measures-all]] (for a broader set of 2x2 table measures).
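A sketch (same `stats` alias), assuming `:f-beta` is invocable with a single beta argument, as the description above suggests:

```clojure
(let [m (stats/binary-measures-all {:tp 10 :fn 2 :fp 5 :tn 80})]
  ;; F1 should coincide with F-beta at beta = 1.
  [(:mcc m) (:f1-score m) ((:f-beta m) 1.0)])
```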
(binomial-ci number-of-successes number-of-trials)
(binomial-ci number-of-successes number-of-trials method)
(binomial-ci number-of-successes number-of-trials method alpha)
Calculates a confidence interval for a binomial proportion.

Given the number of observed `successes` in a fixed number of `trials`, this function estimates a confidence interval for the true underlying probability of success (`p`). Different statistical methods are available for calculating the interval, as the accuracy and behavior of the interval can vary, especially for small sample sizes or probabilities close to 0 or 1.

Parameters:

- `number-of-successes` (long): The count of successful outcomes.
- `number-of-trials` (long): The total number of independent trials.
- `method` (keyword, optional): The method used to calculate the confidence interval. Defaults to `:asymptotic`.
- `alpha` (double, optional): The significance level (alpha) for the interval. The confidence level is `1 - alpha`. Defaults to `0.05` (yielding a 95% CI).

Available `method` values:

- `:asymptotic`: Normal approximation interval (Wald interval), based on the Central Limit Theorem. Simple but can be inaccurate for small samples or probabilities near 0 or 1.
- `:agresti-coull`: An adjustment to the asymptotic interval, adding 'pseudo-counts' to improve performance for small samples.
- `:clopper-pearson`: An exact method based on inverting binomial tests. Provides guaranteed coverage but can be overly conservative (wider than necessary).
- `:wilson`: Score interval, derived from the score test. Generally recommended as a good balance of accuracy and coverage for various sample sizes.
- `:prop.test`: Interval typically used with `prop.test` in R, applies a continuity correction.
- `:cloglog`: Confidence interval based on the complementary log-log transformation.
- `:logit`: Confidence interval based on the logit transformation.
- `:probit`: Confidence interval based on the probit transformation (inverse of standard normal CDF).
- `:arcsine`: Confidence interval based on the arcsine transformation.
- `:all`: Applies all available methods and returns a map where keys are method keywords and values are their respective confidence intervals (as triplets).

Returns:

- A vector `[lower-bound, upper-bound, estimated-p]`.
  - `lower-bound` (double): The lower limit of the confidence interval.
  - `upper-bound` (double): The upper limit of the confidence interval.
  - `estimated-p` (double): The observed proportion of successes (`number-of-successes / number-of-trials`).
- If `method` is `:all`, returns a map of results from each method.

See also [[binomial-test]] for performing a hypothesis test on a binomial proportion.
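For example (same `stats` alias):

```clojure
;; 95% Wilson score interval for 7 successes in 20 trials.
(stats/binomial-ci 7 20 :wilson)
;; => [lower upper 0.35], since 7/20 = 0.35

;; Compare every available method at once.
(stats/binomial-ci 7 20 :all)
```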
(binomial-test xs)
(binomial-test xs maybe-params)
(binomial-test number-of-successes
number-of-trials
{:keys [alpha p ci-method sides]
:or {alpha 0.05 p 0.5 ci-method :asymptotic sides :two-sided}})
Performs an exact test of a simple null hypothesis about the probability of success in a Bernoulli experiment, based on the binomial distribution.

This test assesses the null hypothesis that the true probability of success (`p`) in the underlying population is equal to a specified value (default 0.5).

The function can be called in two ways:

1. With counts: `(binomial-test number-of-successes number-of-trials params)`
2. With data: `(binomial-test xs params)`, where `xs` is a sequence of outcomes. In this case, the outcomes in `xs` are converted to true/false based on the `:true-false-conv` parameter (if provided, otherwise numeric 1s are true), and the number of successes and total trials are derived from `xs`.

Parameters:

- `number-of-successes` (long): Observed number of successful outcomes.
- `number-of-trials` (long): Total number of trials.
- `xs` (sequence): Sample data (used in the alternative call signature).
- `params` (map, optional): Options map:
  - `:p` (double, default `0.5`): The hypothesized probability of success under the null hypothesis.
  - `:alpha` (double, default `0.05`): Significance level for confidence interval calculation.
  - `:sides` (keyword, default `:two-sided`): Specifies the alternative hypothesis.
    - `:two-sided` (default): True probability `p` is not equal to the hypothesized `p`.
    - `:one-sided-greater`: True probability `p` is greater than the hypothesized `p`.
    - `:one-sided-less`: True probability `p` is less than the hypothesized `p`.
  - `:ci-method` (keyword, default `:asymptotic`): Method used to calculate the confidence interval for the probability of success. See [[binomial-ci]] and [[binomial-ci-methods]] for available options (e.g., `:wilson`, `:clopper-pearson`).
  - `:true-false-conv` (optional, used only with `xs`): A function, set, or map to convert elements of `xs` into boolean `true` (success) or `false` (failure). See [[binary-measures-all]] documentation for details. If `nil` and `xs` contains numbers, `1.0` is treated as success.

Returns a map containing:

- `:p-value`: The probability of observing a result as extreme as, or more extreme than, the observed number of successes, assuming the null hypothesis is true. Calculated using the binomial distribution.
- `:p`: The hypothesized probability of success used in the test.
- `:successes`: The observed number of successes.
- `:trials`: The total number of trials.
- `:alpha`: Significance level used for the confidence interval.
- `:level`: Confidence level (`1 - alpha`).
- `:sides` / `:test-type`: Alternative hypothesis side used.
- `:stat`: The test statistic (the observed number of successes).
- `:estimate`: The observed proportion of successes (`successes / trials`).
- `:ci-method`: Confidence interval method used.
- `:confidence-interval`: A confidence interval for the true probability of success, calculated using the specified `:ci-method` and adjusted for the `:sides` parameter.
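A minimal sketch (same `stats` alias): testing whether a coin is fair, given 14 heads in 20 flips:

```clojure
(let [result (stats/binomial-test 14 20 {:p 0.5 :ci-method :wilson})]
  (select-keys result [:p-value :estimate :confidence-interval]))
;; :estimate => 0.7
```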
(bonett-seier-test xs)
(bonett-seier-test xs params)
(bonett-seier-test xs geary-kurtosis {:keys [sides] :or {sides :two-sided}})
Performs the Bonett-Seier test for normality based on Geary's 'g' kurtosis measure.

This test assesses the null hypothesis that the data comes from a normally distributed population by checking if the sample Geary's 'g' statistic significantly deviates from the value expected under normality (`sqrt(2/pi)`).

Parameters:

- `xs` (seq of numbers): The sample data. Requires `(count xs) > 3` for variance calculation.
- `geary-kurtosis` (double, optional): A pre-calculated Geary's 'g' kurtosis value. If omitted, it's calculated from `xs`.
- `params` (map, optional): Options map:
  - `:sides` (keyword, default `:two-sided`): Specifies the alternative hypothesis regarding the deviation from normal kurtosis.
    - `:two-sided` (default): The population kurtosis (measured by 'g') is different from normal.
    - `:one-sided-greater`: Population is leptokurtic ('g' < sqrt(2/pi)). Note Geary's 'g' decreases with peakedness.
    - `:one-sided-less`: Population is platykurtic ('g' > sqrt(2/pi)). Note Geary's 'g' increases with flatness.

Returns a map containing:

- `:Z`: The final test statistic (approximately standard normal under H0).
- `:stat`: Alias for `:Z`.
- `:p-value`: The p-value associated with `Z` and the specified `:sides`.
- `:kurtosis`: The Geary's 'g' kurtosis value used in the test.
- `:n`: The sample size.
- `:sides`: The alternative hypothesis side used.

References:

- Bonett, D. G., & Seier, E. (2002). A test of normality with high uniform power. Computational Statistics & Data Analysis, 40(3), 435-445. (Provides theoretical basis)

See also [[kurtosis]], [[kurtosis-test]], [[normality-test]], [[jarque-bera-test]].
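A sketch (same `stats` alias). Uniform data is platykurtic, so for a sample of this size the test may flag a deviation from normality:

```clojure
(let [xs (repeatedly 100 rand)]
  (select-keys (stats/bonett-seier-test xs) [:Z :p-value :kurtosis]))
```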
(bootstrap vs)
(bootstrap vs samples)
(bootstrap vs samples size)
Generates a set of resampled datasets of a given size from the provided data. `samples` defaults to 200; `size` defaults to the size of the input data.
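A sketch (same `stats` alias), assuming the return value is a collection of resampled sequences, per the description above:

```clojure
;; 50 resamples, each of the default size (the input size, 10).
(let [samples (stats/bootstrap (range 10) 50)]
  [(count samples) (count (first samples))])
;; expected: [50 10]
```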
(bootstrap-ci vs)
(bootstrap-ci vs alpha)
(bootstrap-ci vs alpha samples)
(bootstrap-ci vs alpha samples stat-fn)
Bootstrap method for calculating a confidence interval. `alpha` defaults to 0.98, `samples` to 1000. The last parameter is the statistical function used as the measure, default: [[mean]]. Returns the confidence interval and the value of the statistical function.
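For example (same `stats` alias), swapping the default [[mean]] for [[median]]:

```clojure
(stats/bootstrap-ci [1 2 3 4 5 6 7 8 9 100] 0.98 1000 stats/median)
;; => confidence interval plus the median value, per the description
```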
(box-cox-infer-lambda xs)
(box-cox-infer-lambda xs lambda-range)
(box-cox-infer-lambda xs lambda-range opts)
Finds the optimal lambda (λ) parameter for the Box-Cox transformation of a dataset using the Maximum Likelihood Estimation (MLE) method.

The Box-Cox transformation is a family of power transformations often applied to positive data to make it more closely resemble a normal distribution and stabilize variance. This function estimates the lambda value that maximizes the log-likelihood function of the transformed data, assuming the transformed data is normally distributed.

Parameters:

- `xs` (sequence of numbers): The input numerical data sequence.
- `lambda-range` (vector of two numbers, optional): A sequence `[min-lambda, max-lambda]` defining the closed interval within which the optimal lambda is searched. Defaults to `[-3.0, 3.0]`.
- `opts` (map, optional): Additional options affecting the data used for the likelihood calculation. These options are passed to the internal data preparation step. Key options include:
  - `:alpha` (double, default 0.0): A constant value added to `xs` before estimating lambda. This is often used when `xs` contains zero or negative values and the standard Box-Cox (which requires positive input) is desired, or to explore transformations around a shifted location.
  - `:negative?` (boolean, default `false`): If `true`, indicates that the likelihood is estimated based on the modified Box-Cox transformation (Bickel and Doksum approach) suitable for negative values. The estimation process will work with the absolute values of the data shifted by `:alpha`.

Returns the estimated optimal lambda value as a double.

The inferred lambda value can then be used as the `lambda` parameter for the [[box-cox-transformation]] function to apply the actual transformation to the dataset.

See also [[box-cox-transformation]], [[yeo-johnson-infer-lambda]], [[yeo-johnson-transformation]].
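A sketch (same `stats` alias). For data growing roughly exponentially, the inferred lambda is typically near 0, i.e. close to a log transform:

```clojure
(stats/box-cox-infer-lambda [0.5 1.1 1.8 2.7 4.4 7.3 12.2 20.1 33.1 54.6])
```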
(box-cox-transformation xs)
(box-cox-transformation xs lambda)
(box-cox-transformation xs lambda {:keys [scaled? inverse?] :as opts})
Applies the Box-Cox transformation to data.

The Box-Cox transformation is a family of power transformations used to stabilize variance and make data more normally distributed.

Parameters:

- `xs` (seq of numbers): The input data.
- `lambda` (default `0.0`): The power parameter. If `nil` or `[lambda-min, lambda-max]`, `lambda` is inferred using maximum log-likelihood.
- Options map:
  - `alpha` (optional): A shift parameter applied before transformation.
  - `scaled?` (default `false`): Scale by the geometric mean or any other number.
  - `negative?` (default `false`): Allow negative values.
  - `inverse?` (default `false`): Perform the inverse operation; `lambda` can't be inferred.

Returns transformed data.

Related: [[yeo-johnson-transformation]]
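A round-trip sketch (same `stats` alias): lambda = 0 corresponds to a log transform, and `:inverse?` undoes it:

```clojure
(let [xs [0.5 1.1 1.8 2.7 4.4]
      t  (stats/box-cox-transformation xs 0.0)]
  (stats/box-cox-transformation t 0.0 {:inverse? true}))
;; approximately (0.5 1.1 1.8 2.7 4.4)
```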
(brown-forsythe-test xss)
(brown-forsythe-test xss params)
Brown-Forsythe test for homogeneity of variances.

This test is a modification of Levene's test, using the median instead of the mean for calculating the spread within each group. This makes the test more robust against non-normally distributed data.

Calls [[levene-test]] with `:statistic` set to [[median]]. Accepts the same parameters as [[levene-test]], except for `:statistic`.

Parameters:

- `xss` (sequence of sequences): A collection of data groups.
- `params` (map, optional): Options map (see [[levene-test]]).
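For example (same `stats` alias): checking whether three groups share a common variance:

```clojure
(stats/brown-forsythe-test [[1 2 3 4 5] [2 4 6 8 10] [1 1 2 9 10]])
```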
(chisq-test contingency-table-or-xs)
(chisq-test contingency-table-or-xs params)
Chi-squared test: a power divergence test with `lambda` = 1.0.

Performs a power divergence test, which encompasses several common statistical tests like Chi-squared, G-test (likelihood ratio), etc., based on the lambda parameter. This function can perform either a goodness-of-fit test or a test for independence in a contingency table.

Usage:

1. **Goodness-of-Fit (GOF):**
   - Input: `observed-counts` (sequence of numbers) and `:p` (expected probabilities/weights).
   - Input: `data` (sequence of numbers) and `:p` (a distribution object). In this case, a histogram of `data` is created (controlled by `:bins`) and compared against the probability mass/density of the distribution in those bins.
2. **Test for Independence:**
   - Input: `contingency-table` (2D sequence or map format). The `:p` option is ignored.

Options map:

* `:lambda` (double, default: `2/3`): Determines the specific test statistic. Common values:
  * `1.0`: Pearson Chi-squared test ([[chisq-test]]).
  * `0.0`: G-test / Multinomial Likelihood Ratio test ([[multinomial-likelihood-ratio-test]]).
  * `-0.5`: Freeman-Tukey test ([[freeman-tukey-test]]).
  * `-1.0`: Minimum Discrimination Information test ([[minimum-discrimination-information-test]]).
  * `-2.0`: Neyman Modified Chi-squared test ([[neyman-modified-chisq-test]]).
  * `2/3`: Cressie-Read test (default, [[cressie-read-test]]).
* `:p` (seq of numbers or distribution): Expected probabilities/weights (for GOF with counts) or a `fastmath.random` distribution object (for GOF with data). Ignored for independence tests.
* `:alpha` (double, default: `0.05`): Significance level for confidence intervals.
* `:ci-sides` (keyword, default: `:two-sided`): Sides for bootstrap confidence intervals (`:two-sided`, `:one-sided-greater`, `:one-sided-less`).
* `:sides` (keyword, default: `:one-sided-greater`): Alternative hypothesis side for the p-value calculation against the Chi-squared distribution (`:one-sided-greater`, `:one-sided-less`, `:two-sided`).
* `:bootstrap-samples` (long, default: `1000`): Number of bootstrap samples for confidence interval estimation.
* `:ddof` (long, default: `0`): Delta degrees of freedom. Adjustment subtracted from the calculated degrees of freedom.
* `:bins` (number, keyword, or seq): Used only for a GOF test against a distribution. Specifies the number of bins, an estimation method (see [[histogram]]), or explicit bin edges for histogram creation.

Returns a map containing:

- `:stat`: The calculated power divergence test statistic.
- `:chi2`: Alias for `:stat`.
- `:df`: Degrees of freedom for the test.
- `:p-value`: The p-value associated with the test statistic.
- `:n`: Total number of observations.
- `:estimate`: Observed proportions.
- `:expected`: Expected counts or proportions under the null hypothesis.
- `:confidence-interval`: Bootstrap confidence intervals for the observed proportions.
- `:lambda`, `:alpha`, `:sides`, `:ci-sides`: Input options used.
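Two hedged call sketches (same `stats` alias), one per usage mode:

```clojure
;; Goodness-of-fit: is a six-sided die fair, given observed face counts?
;; Equal weights in :p encode the uniform expectation.
(select-keys (stats/chisq-test [18 22 16 25 19 20] {:p [1 1 1 1 1 1]})
             [:stat :df :p-value])

;; Independence: a 2x2 contingency table of counts.
(select-keys (stats/chisq-test [[10 20] [30 40]]) [:stat :df :p-value])
```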
(ci vs)
(ci vs alpha)
Student's t-based confidence interval for the given data. `alpha` defaults to 0.05. The last returned value is the mean.
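For example (same `stats` alias):

```clojure
;; 95% t-based interval; the last value is the mean.
(stats/ci [2 4 4 4 5 5 7 9])
```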
(cliffs-delta [group1 group2])
(cliffs-delta group1 group2)
Calculates Cliff's Delta (δ), a non-parametric effect size measure for assessing the difference between two groups of ordinal or continuous data.

Cliff's Delta quantifies the degree of overlap between two distributions. It represents the probability that a randomly chosen value from the first group is greater than a randomly chosen value from the second group, minus the reverse probability.

Parameters:

- `group1` (seq of numbers): The first sample.
- `group2` (seq of numbers): The second sample.

Returns the calculated Cliff's Delta value as a double.

Interpretation:

- A value of +1 indicates complete separation where every value in `group1` is greater than every value in `group2`.
- A value of -1 indicates complete separation where every value in `group2` is greater than every value in `group1`.
- A value of 0 indicates complete overlap between the distributions.
- Values between -1 and 1 indicate varying degrees of overlap. Commonly cited guidelines for effect size: |δ| < 0.147 (negligible), 0.147 ≤ |δ| < 0.33 (small), 0.33 ≤ |δ| < 0.474 (medium), |δ| ≥ 0.474 (large).

Cliff's Delta is a robust measure, suitable for ordinal data or when assumptions of parametric tests (like normality or equal variances) are violated. It is closely related to the [[wmw-odds]] (Wilcoxon-Mann-Whitney odds) and the [[ameasure]] (Vargha-Delaney A).

See also [[wmw-odds]], [[ameasure]], [[cohens-d]], [[glass-delta]].
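A sketch (same `stats` alias). The values follow directly from the pairwise definition above:

```clojure
;; Complete separation: every group1 value exceeds every group2 value.
(stats/cliffs-delta [6 7 8 9 10] [1 2 3 4 5])
;; => 1.0

;; Partial overlap yields an intermediate value:
;; 19 favorable pairs, 3 unfavorable, out of 25, so (19 - 3) / 25.
(stats/cliffs-delta [3 4 5 6 7] [1 2 3 4 5])
;; => 0.64
```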
(coefficient-matrix vss)
(coefficient-matrix vss measure-fn)
(coefficient-matrix vss measure-fn symmetric?)
Generates a matrix of pairwise coefficients from a sequence of sequences.

This function calculates a matrix where the element at row `i` and column `j` is the result of applying the provided `measure-fn` to the `i`-th sequence and the `j`-th sequence from the input `vss`.

Parameters:

- `vss` (sequence of sequences of numbers): The collection of data sequences. Each inner sequence is treated as a variable or set of observations. All inner sequences should ideally have the same length if the `measure-fn` expects it.
- `measure-fn` (function, optional): A function of two arguments (sequences) that returns a double representing the coefficient or measure between them. Defaults to [[pearson-correlation]].
- `symmetric?` (boolean, optional): If `true`, the function assumes that `measure-fn(a, b)` is equal to `measure-fn(b, a)`. It calculates the upper (or lower) triangle of the matrix and mirrors the values to the other side. This is an optimization for symmetric measures like correlation and covariance. If `false` (default), all pairwise combinations `(i, j)` are calculated independently.

Returns a sequence of sequences (a matrix) of doubles.

Note: While this function's `symmetric?` parameter defaults to `false`, convenience functions like [[correlation-matrix]] and [[covariance-matrix]] wrap this function and explicitly set `symmetric?` to `true` as their respective measures are symmetric.

See also [[correlation-matrix]], [[covariance-matrix]].
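A sketch (same `stats` alias). Since [[pearson-correlation]] is symmetric, `symmetric?` can be set to `true`:

```clojure
(let [x (range 10)
      y (map #(* 2.0 %) x)   ; perfectly correlated with x
      z (map - x)]           ; perfectly anti-correlated with x
  (stats/coefficient-matrix [x y z] stats/pearson-correlation true))
;; diagonal entries are 1.0; x-y entries are 1.0; x-z entries are -1.0
```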
(cohens-d [group1 group2])
(cohens-d group1 group2)
(cohens-d group1 group2 method)
Calculate Cohen's d effect size between two groups. Cohen's d is a standardized measure used to quantify the magnitude of the difference between the means of two independent groups. It expresses the mean difference in terms of standard deviation units. The most common formula for Cohen's d is: d = (mean(group1) - mean(group2)) / pooled_stddev where `pooled_stddev` is the pooled standard deviation of the two groups, calculated under the assumption of equal variances. Parameters: - `group1` (seq of numbers): The first independent sample. - `group2` (seq of numbers): The second independent sample. - `method` (optional keyword): Specifies the method for calculating the pooled standard deviation, affecting the denominator of the formula. Possible values are `:unbiased` (default), `:biased`, or `:avg`. See [[pooled-stddev]] for details on these methods. Returns the calculated Cohen's d effect size as a double. Interpretation guidelines (approximate for normal distributions): - |d| = 0.2: small effect - |d| = 0.5: medium effect - |d| = 0.8: large effect Assumptions: - The two samples are independent. - Data within each group are approximately normally distributed. - The choice of `:method` implies assumptions about equal variances (the default `:unbiased` and `:biased` assume equal variances, while `:avg` does not, though it may be less standard). See also [[hedges-g]] (a version bias-corrected for small sample sizes), [[glass-delta]] (an alternative effect size measure using the control group standard deviation), [[pooled-stddev]].
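A minimal sketch of the three arities; the `stats` alias and the sample data are illustrative:

```clojure
(require '[fastmath.stats :as stats])

(def group1 [22 25 27 30 31 33])
(def group2 [18 20 21 24 26 28])

(stats/cohens-d group1 group2)          ;; default :unbiased pooling
(stats/cohens-d group1 group2 :biased)  ;; alternative pooled stddev
(stats/cohens-d [group1 group2])        ;; both groups in one argument
```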
(cohens-d-corrected [group1 group2])
(cohens-d-corrected group1 group2)
(cohens-d-corrected group1 group2 method)
Calculates Cohen's d effect size corrected for bias in small sample sizes. This function applies a correction factor (derived from the gamma function) to Cohen's d ([[cohens-d]]) to provide a less biased estimate of the population effect size when sample sizes are small. This corrected measure is sometimes referred to as Hedges' g, though this function specifically implements the correction applied to Cohen's d. The correction factor is `(1 - 3 / (4 * df - 1))` where `df` is the degrees of freedom used in the standard Cohen's d calculation. Parameters: - `group1` (seq of numbers): The first independent sample. - `group2` (seq of numbers): The second independent sample. - `method` (optional keyword): Specifies the method for calculating the pooled standard deviation, affecting the denominator of the formula (passed to [[cohens-d]]). Possible values are `:unbiased` (default), `:biased`, or `:avg`. See [[pooled-stddev]] for details on these methods. Returns the calculated bias-corrected Cohen's d effect size as a double. Note: While this function is named `cohens-d-corrected`, Hedges' g (calculated by [[hedges-g-corrected]]) also applies a similar small-sample bias correction. Differences might exist based on the specific correction formula or degree of freedom definition used. This function uses `(count group1) + (count group2) - 2` as the degrees of freedom for the correction by default (when `:unbiased` method is used for `cohens-d`). See also [[cohens-d]], [[hedges-g]], [[hedges-g-corrected]].
(cohens-f [group1 group2])
(cohens-f group1 group2)
(cohens-f group1 group2 type)
Calculates Cohen's f, a measure of effect size derived as the square root of Cohen's f² ([[cohens-f2]]). Cohen's f is a standardized measure quantifying the magnitude of an effect, often used in the context of ANOVA or regression. It is the square root of the ratio of the variance explained by the effect to the unexplained variance. Parameters: - `group1` (seq of numbers): The dependent variable. - `group2` (seq of numbers): The independent variable (or predictor). Must have the same length as `group1`. - `type` (keyword, optional): Specifies the measure of 'Proportion of Variance Explained' used in the underlying [[cohens-f2]] calculation. Defaults to `:eta`. - `:eta` (default): Uses Eta-squared (sample R²), a measure of variance explained in the sample. - `:omega`: Uses Omega-squared, a less biased estimate of variance explained in the population. - `:epsilon`: Uses Epsilon-squared, another less biased estimate of variance explained in the population. - Any function: A function accepting `group1` and `group2` and returning a double representing the proportion of variance explained. Returns the calculated Cohen's f effect size as a double. Values range from 0 upwards. Interpretation: - Values are positive. Larger values indicate a stronger effect (more variance in `group1` explained by `group2`). - Cohen's guidelines for interpreting the magnitude of f² (and by extension, f) are: - $f = 0.10$ (approx. $f^2 = 0.01$): small effect - $f = 0.25$ (approx. $f^2 = 0.0625$): medium effect - $f = 0.40$ (approx. $f^2 = 0.16$): large effect (Note: Guidelines are often quoted for f², interpret f as $\sqrt{f^2}$) See also [[cohens-f2]], [[eta-sq]], [[omega-sq]], [[epsilon-sq]].
(cohens-f2 [group1 group2])
(cohens-f2 group1 group2)
(cohens-f2 group1 group2 type)
Calculates Cohen's f², a measure of effect size often used in ANOVA or regression. Cohen's f² quantifies the magnitude of the effect of an independent variable or set of predictors on a dependent variable, expressed as the ratio of the variance explained by the effect to the unexplained variance. This function allows calculating f² using different measures for the 'Proportion of Variance Explained', specified by the `type` parameter: - `:eta` (default): Uses [[eta-sq]] (Eta-squared), which in this implementation is equivalent to the sample $R^2$ from a linear regression of `group1` on `group2`. This is a measure of the proportion of variance explained in the sample. - `:omega`: Uses [[omega-sq]] (Omega-squared), a less biased estimate of the proportion of variance explained in the population. - `:epsilon`: Uses [[epsilon-sq]] (Epsilon-squared), another less biased estimate of the proportion of variance explained in the population, similar to adjusted $R^2$. - Any function: A function accepting `group1` and `group2` and returning a double representing the proportion of variance explained. Parameters: - `group1` (seq of numbers): The dependent variable. - `group2` (seq of numbers): The independent variable (or predictor). Must have the same length as `group1`. - `type` (keyword, optional): Specifies the measure of 'Proportion of Variance Explained' to use (`:eta`, `:omega`, `:epsilon` or any function). Defaults to `:eta`. Returns the calculated Cohen's f² effect size as a double. Values range from 0 upwards. Interpretation Guidelines (approximate, often used for F-tests in ANOVA/regression): - $f^2 = 0.02$: small effect - $f^2 = 0.15$: medium effect - $f^2 = 0.35$: large effect See also [[cohens-f]], [[eta-sq]], [[omega-sq]], [[epsilon-sq]].
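A brief sketch of the `type` variants and the documented relation to [[cohens-f]]; the alias and data are illustrative:

```clojure
(require '[fastmath.stats :as stats])

(def y [2.1 2.9 4.2 5.1 5.9 7.2]) ;; dependent variable
(def x [1.0 2.0 3.0 4.0 5.0 6.0]) ;; predictor, same length

(stats/cohens-f2 y x)          ;; default :eta (sample R² based)
(stats/cohens-f2 y x :omega)   ;; less biased population estimate
(stats/cohens-f2 y x :epsilon) ;; adjusted-R²-like estimate

;; cohens-f is documented as the square root of cohens-f2:
(stats/cohens-f y x)           ;; ≈ (Math/sqrt (stats/cohens-f2 y x))
```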
(cohens-kappa contingency-table)
(cohens-kappa group1 group2)
Calculates Cohen's Kappa coefficient (κ), a statistic that measures inter-rater agreement for categorical items, while correcting for chance agreement. It is often used to assess the consistency of agreement between two raters or methods. Its value typically ranges from -1 to +1: - `κ = 1`: Perfect agreement. - `κ = 0`: Agreement is no better than chance. - `κ < 0`: Agreement is worse than chance. The function can be called in two ways: 1. With two sequences `group1` and `group2`: The function will automatically construct a 2x2 contingency table from the unique values in the sequences (assuming they represent two binary variables). The mapping of values to table cells (e.g., what corresponds to TP, TN, FP, FN) depends on how `contingency-table` orders the unique values. For direct control over which cell is which, use the contingency table input. 2. With a contingency table: The contingency table can be provided as: - A map where keys are `[row-index, column-index]` tuples and values are counts (e.g., `{[0 0] TP, [0 1] FP, [1 0] FN, [1 1] TN}`). This is the output format of [[contingency-table]] with two inputs. The mapping of indices to TP/TN/FP/FN depends on the order of unique values in the original data if generated by [[contingency-table]], or the explicit structure if created manually or via [[rows->contingency-table]]. Standard convention maps `[0 0]` to TP, `[0 1]` to FP, `[1 0]` to FN, and `[1 1]` to TN for binary outcomes. - A sequence of sequences representing the rows of the table (e.g., `[[TP FP] [FN TN]]`). This is equivalent to [[rows->contingency-table]]. Parameters: - `group1` (sequence): The first sequence of binary outcomes/categories. - `group2` (sequence): The second sequence of binary outcomes/categories. Must have the same length as `group1`. - `contingency-table` (map or sequence of sequences): A pre-computed 2x2 contingency table. The cell values should represent counts (e.g., TP, FN, FP, TN). Returns the calculated Cohen's Kappa coefficient as a double. See also [[weighted-kappa]] (for ordinal data with partial agreement), [[contingency-table]], [[contingency-2x2-measures]], [[binary-measures-all]].
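A minimal sketch of both call styles; the alias and rater data are illustrative:

```clojure
(require '[fastmath.stats :as stats])

;; Two raters labelling the same ten items.
(def rater1 [:yes :yes :no :yes :no :no :yes :yes :no :yes])
(def rater2 [:yes :no  :no :yes :no :yes :yes :yes :no :yes])

;; From raw sequences - a 2x2 table is built internally.
(stats/cohens-kappa rater1 rater2)

;; From explicit row counts, [[TP FP] [FN TN]] by convention.
(stats/cohens-kappa [[20 5] [10 15]])
```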
(cohens-q r1 r2)
(cohens-q group1 group2a group2b)
(cohens-q group1a group2a group1b group2b)
Compares two correlation coefficients by calculating the difference between their Fisher z-transformations. The Fisher z-transformation (`atanh`) of a correlation coefficient `r` helps normalize the sampling distribution of correlation coefficients. The difference between two z'-transformed correlations is often used as a test statistic. The function supports comparing correlations in different scenarios via its arities: - `(cohens-q r1 r2)`: Calculates the difference between the Fisher z-transformations of two correlation values `r1` and `r2` provided directly. This is typically used when comparing two *independent* correlation coefficients (e.g., correlations from two separate studies). Returns `atanh(r1) - atanh(r2)`. - `r1`, `r2` (double): Correlation coefficient values (-1.0 to 1.0). - `(cohens-q group1 group2a group2b)`: Calculates the difference between the correlation of `group1` with `group2a` and the correlation of `group1` with `group2b`. This is commonly used for comparing *dependent* correlations (where `group1` is a common variable). Calculates `atanh(pearson-correlation(group1, group2a)) - atanh(pearson-correlation(group1, group2b))`. - `group1`, `group2a`, `group2b` (sequences): Data sequences from which Pearson correlations are computed. - `(cohens-q group1a group2a group1b group2b)`: Calculates the difference between the correlation of `group1a` with `group2a` and the correlation of `group1b` with `group2b`. This is typically used for comparing two *independent* correlations obtained from two distinct pairs of variables (all four sequences are independent). Calculates `atanh(pearson-correlation(group1a, group2a)) - atanh(pearson-correlation(group1b, group2b))`. - `group1a`, `group2a`, `group1b`, `group2b` (sequences): Data sequences from which Pearson correlations are computed. Returns the difference between the Fisher z-transformed correlation values as a double. Note: For comparing dependent correlations (3-arity case), standard statistical tests (e.g., Steiger's test) are more complex than a simple difference of z-transforms and involve the correlation between `group2a` and `group2b`. This function provides the basic difference value.
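A quick sketch of the 2- and 3-arity forms; the alias and data are illustrative:

```clojure
(require '[fastmath.stats :as stats])

;; Two independent correlation values compared directly:
(stats/cohens-q 0.6 0.3)   ;; atanh(0.6) - atanh(0.3)

;; Dependent correlations sharing the common variable x:
(def x [1 2 3 4 5 6])
(def a [2 4 5 4 5 7])
(def b [6 5 4 4 3 1])
(stats/cohens-q x a b)     ;; atanh(r(x,a)) - atanh(r(x,b))
```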
(cohens-u1 [group1 group2])
(cohens-u1 group1 group2)
Calculates a non-parametric measure of difference or separation between two samples. This function computes a value derived from [[cohens-u2]], which internally quantifies a minimal difference between corresponding quantiles of the two empirical distributions. Parameters: - `group1` (seq of numbers): The first sample. - `group2` (seq of numbers): The second sample. Returns the calculated measure as a double. Interpretation: - Values close to -1 indicate high similarity or maximum overlap between the distributions (as the minimal difference between quantiles approaches zero). - Increasing values indicate greater difference or separation between the distributions (as the minimal difference between quantiles is larger). This measure is symmetric, meaning the order of `group1` and `group2` does not affect the result. It is a non-parametric measure applicable to any data samples. See also [[cohens-u2]] (the measure this calculation is based on), [[cohens-u3]] (related non-parametric measure), [[cohens-u1-normal]] (the version applicable to normal data).
(cohens-u1-normal d)
(cohens-u1-normal group1 group2)
(cohens-u1-normal group1 group2 method)
Calculates Cohen's U1, a measure of non-overlap between two distributions assumed to be normal with equal variances. Cohen's U1 quantifies the proportion of the two distributions that does not overlap. A U1 of 0 means complete overlap (the distributions are identical), while a U1 of 1 means no overlap (complete separation). This measure is calculated directly from Cohen's d statistic ([[cohens-d]]) assuming normal distributions and equal variances. Parameters: - `group1` (seq of numbers): The first sample. - `group2` (seq of numbers): The second sample. - `method` (optional keyword): Specifies the method for calculating the pooled standard deviation used in the underlying [[cohens-d]] calculation. Possible values are `:unbiased` (default), `:biased`, or `:avg`. See [[pooled-stddev]] for details. - `d` (double): A pre-calculated Cohen's d value. If provided, `group1`, `group2`, and `method` are ignored. Returns the calculated Cohen's U1 as a double [0, 1]. Assumptions: - Both samples are drawn from normally distributed populations. - The populations have equal variances (homoscedasticity). See also [[cohens-d]], [[cohens-u2-normal]], [[cohens-u3-normal]], [[p-overlap]] (a non-parametric overlap measure).
(cohens-u2 [group1 group2])
(cohens-u2 group1 group2)
Calculates a measure of overlap between two samples, referred to as Cohen's U2. This function quantifies the degree to which the distributions of `group1` and `group2` overlap. It is related to comparing values at corresponding percentile levels across the two groups or the proportion of values in one group that are below the median of the other. A value of 0.5 indicates complete overlap (the distributions are identical), while values approaching 1 indicate no overlap (complete separation). The measure is symmetric, meaning `(cohens-u2 group1 group2)` is equal to `(cohens-u2 group2 group1)`. This is a non-parametric measure, suitable for any data samples, and does not assume normality, unlike [[cohens-u2-normal]]. Parameters: - `group1`, `group2` (sequences): The two samples directly as arguments. Returns the calculated Cohen's U2 value as a double. The value typically ranges from 0.5 to 1. A value closer to 0.5 indicates substantial overlap between the distributions (e.g., the median of one group is near the median of the other); values closer to 1 indicate less overlap (greater separation between the distributions).
(cohens-u2-normal d)
(cohens-u2-normal group1 group2)
(cohens-u2-normal group1 group2 method)
Calculates Cohen's U2, a measure of overlap between two distributions assumed to be normal with equal variances. Cohen's U2 quantifies the proportion of scores in the lower-scoring group that are below the point located halfway between the means of the two groups (or equivalently, the proportion of scores in the higher-scoring group that are above this halfway point). This measure is calculated from Cohen's d statistic ([[cohens-d]]) using the standard normal cumulative distribution function ($\Phi$): $\Phi(0.5 |d|)$. Parameters: - `group1` (seq of numbers): The first sample. - `group2` (seq of numbers): The second sample. - `method` (optional keyword): Specifies the method for calculating the pooled standard deviation used in the underlying [[cohens-d]] calculation. Possible values are `:unbiased` (default), `:biased`, or `:avg`. See [[pooled-stddev]] for details. - `d` (double): A pre-calculated Cohen's d value. If provided, `group1`, `group2`, and `method` are ignored. Returns the calculated Cohen's U2 as a double in [0.5, 1.0] (since $\Phi(0.5|d|) \ge 0.5$). A value closer to 0.5 indicates greater overlap between the distributions; values closer to 1 indicate less overlap. Assumptions: - Both samples are drawn from normally distributed populations. - The populations have equal variances (homoscedasticity). See also [[cohens-d]], [[cohens-u1-normal]], [[cohens-u3-normal]], [[p-overlap]] (a non-parametric overlap measure).
(cohens-u3 [group1 group2])
(cohens-u3 group1 group2)
(cohens-u3 group1 group2 estimation-strategy)
Calculates Cohen's U3 for two samples. In this implementation, Cohen's U3 is defined as the proportion of values in the second sample (`group2`) that are less than the median of the first sample (`group1`). Parameters: - `group1` (seq of numbers): The first sample. The median of this sample is used as the threshold. - `group2` (seq of numbers): The second sample. Values from this sample are counted if they fall below the median of `group1`. - `estimation-strategy` (optional keyword): The strategy used to estimate the median of `group1`. Defaults to `:legacy`. See [[median]] or [[quantile]] for available strategies (e.g., `:r1` through `:r9`). Returns the calculated proportion as a double between 0.0 and 1.0. Interpretation: - A value close to 0 means most values in `group2` are greater than or equal to the median of `group1`. - A value close to 0.5 means approximately half the values in `group2` are below the median of `group1`. - A value close to 1 means most values in `group2` are less than the median of `group1`. Note: This measure is **not symmetric**. `(cohens-u3 group1 group2)` is generally not equal to `(cohens-u3 group2 group1)`. This is a non-parametric measure, suitable for any data samples, and does not assume normality, unlike [[cohens-u3-normal]]. See also [[cohens-u3-normal]] (the version applicable to normal data), [[cohens-u2]] (a related symmetric non-parametric measure), [[median]], [[quantile]].
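A small sketch highlighting the asymmetry; the alias and data are illustrative:

```clojure
(require '[fastmath.stats :as stats])

(def group1 [1 2 3 4 5 6 7 8 9])    ;; median 5
(def group2 [4 5 6 7 8 9 10 11 12])

;; Proportion of group2 below the median of group1:
;; only 4 falls below 5, so roughly 1/9.
(stats/cohens-u3 group1 group2)

;; Swapped arguments give a different answer (not symmetric).
(stats/cohens-u3 group2 group1)

;; With an explicit median estimation strategy.
(stats/cohens-u3 group1 group2 :r7)
```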
(cohens-u3-normal d)
(cohens-u3-normal group1 group2)
(cohens-u3-normal group1 group2 method)
Calculates Cohen's U3, a measure of overlap between two distributions assumed to be normal with equal variances. Cohen's U3 quantifies the proportion of scores in the lower-scoring group that fall below the mean of the higher-scoring group. It is calculated from Cohen's d statistic ([[cohens-d]]) using the standard normal cumulative distribution function ($\Phi$): `U3 = Φ(d)`. The measure is asymmetric: `U3(group1, group2)` is not necessarily equal to `U3(group2, group1)`. The interpretation depends on which group is considered the 'higher-scoring' one based on the sign of d. By convention, the result often represents the proportion of the *first* group (`group1`) that is below the mean of the *second* group (`group2`) if d is negative, or the proportion of the *second* group (`group2`) that is below the mean of the *first* group (`group1`) if d is positive. Parameters: - `group1` (seq of numbers): The first sample. - `group2` (seq of numbers): The second sample. - `method` (optional keyword): Specifies the method for calculating the pooled standard deviation used in the underlying [[cohens-d]] calculation. Possible values are `:unbiased` (default), `:biased`, or `:avg`. See [[pooled-stddev]] for details. - `d` (double): A pre-calculated Cohen's d value. If provided, `group1`, `group2`, and `method` are ignored. Returns the calculated Cohen's U3 as a double [0.0, 1.0]. A value close to 0.5 suggests significant overlap. Values closer to 0 or 1 suggest less overlap (greater separation between the means). Assumptions: - Both samples are drawn from normally distributed populations. - The populations have equal variances (homoscedasticity). See also [[cohens-d]], [[cohens-u1-normal]], [[cohens-u2-normal]], [[p-overlap]] (a non-parametric overlap measure).
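Since all three normal-theory U measures derive from Cohen's d, their single-argument arities can share one pre-computed d. A minimal sketch (the alias is illustrative):

```clojure
(require '[fastmath.stats :as stats])

(let [d 0.8]
  {:u1 (stats/cohens-u1-normal d)    ;; proportion of non-overlap
   :u2 (stats/cohens-u2-normal d)    ;; Phi(0.5*|d|)
   :u3 (stats/cohens-u3-normal d)})  ;; Phi(d)
```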
(cohens-w contingency-table)
(cohens-w group1 group2)
Calculates Cohen's W effect size for the association between two nominal variables represented in a contingency table. Cohen's W is a measure of association derived from the Pearson's Chi-squared statistic. It quantifies the magnitude of the difference between the observed frequencies and the frequencies expected under the assumption of independence between the variables. Its value ranges from 0 upwards: - A value of 0 indicates no association between the variables. - Larger values indicate a stronger association. The function can be called in two ways: 1. With two sequences `group1` and `group2`: The function will automatically construct a contingency table from the unique values in the sequences. 2. With a contingency table: The contingency table can be provided as: - A map where keys are `[row-index, column-index]` tuples and values are counts (e.g., `{[0 0] 10, [0 1] 5, [1 0] 3, [1 1] 12}`). This is the output format of [[contingency-table]] with two inputs. - A sequence of sequences representing the rows of the table (e.g., `[[10 5] [3 12]]`). This is equivalent to [[rows->contingency-table]]. Parameters: - `group1` (sequence): The first sequence of categorical data. - `group2` (sequence): The second sequence of categorical data. Must have the same length as `group1`. - `contingency-table` (map or sequence of sequences): A pre-computed contingency table. Returns the calculated Cohen's W coefficient as a double. See also [[chisq-test]], [[cramers-v]], [[cramers-c]], [[tschuprows-t]], [[contingency-table]].
(confusion-matrix confusion-mat)
(confusion-matrix actual prediction)
(confusion-matrix actual prediction encode-true)
(confusion-matrix tp fn fp tn)
Creates a 2x2 confusion matrix for binary classification. A confusion matrix summarizes the results of a binary classification problem, showing the counts of True Positives (TP), False Positives (FP), False Negatives (FN), and True Negatives (TN). TP: Actual is True, Predicted is True FP: Actual is False, Predicted is True (Type I error) FN: Actual is True, Predicted is False (Type II error) TN: Actual is False, Predicted is False The function supports several input formats: 1. `(confusion-matrix tp fn fp tn)`: Direct input of the four counts. - `tp` (long): True Positive count. - `fn` (long): False Negative count. - `fp` (long): False Positive count. - `tn` (long): True Negative count. 2. `(confusion-matrix confusion-matrix-representation)`: Input as a structured representation. - `confusion-matrix-representation`: Can be: - A map with keys like `:tp`, `:fn`, `:fp`, `:tn` (e.g., `{:tp 10 :fn 2 :fp 5 :tn 80}`). - A sequence of sequences representing rows `[[TP FP] [FN TN]]` (e.g., `[[10 5] [2 80]]`). - A flat sequence `[TP FN FP TN]` (e.g., `[10 2 5 80]`). 3. `(confusion-matrix actual prediction)`: Input as two sequences of outcomes. - `actual` (sequence): Sequence of true outcomes. - `prediction` (sequence): Sequence of predicted outcomes. Must have the same length as `actual`. Values in `actual` and `prediction` are compared element-wise. By default, any non-`nil` or non-zero value is treated as `true`, and `nil` or `0.0` is treated as `false`. 4. `(confusion-matrix actual prediction encode-true)`: Input as two sequences with a specified encoding for `true`. - `actual`, `prediction`: Sequences as in the previous arity. - `encode-true`: Specifies how values in `actual` and `prediction` are converted to boolean `true` or `false`. - `nil` (default): Non-`nil`/non-zero is true. - Any sequence/set: Values found in this collection are true. - A map: Values are mapped according to the map; if a key is not found or maps to `false`, the value is false. - A predicate function: Returns `true` if the value satisfies the predicate. Returns a map with keys `:tp`, `:fn`, `:fp`, and `:tn` representing the counts. This function is commonly used to prepare input for binary classification metrics like those provided by [[binary-measures-all]] and [[binary-measures]].
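A minimal sketch of the count-based and sequence-based arities; the alias and data are illustrative:

```clojure
(require '[fastmath.stats :as stats])

;; Direct counts: tp, fn, fp, tn.
(stats/confusion-matrix 10 2 5 80)
;; => {:tp 10 :fn 2 :fp 5 :tn 80}

;; From actual vs. predicted sequences, with a set encoding truth.
(def actual     [:pos :pos :neg :pos :neg :neg])
(def prediction [:pos :neg :neg :pos :pos :neg])
(stats/confusion-matrix actual prediction #{:pos})
```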
(contingency-2x2-measures & args)
Calculates a subset of common statistics and measures for a 2x2 contingency table. This function provides a selection of the most frequently used measures from the more comprehensive [[contingency-2x2-measures-all]]. The function accepts the same input formats as [[contingency-2x2-measures-all]]: 1. `(contingency-2x2-measures a b c d)`: Takes the four counts as arguments. 2. `(contingency-2x2-measures [a b c d])`: Takes a sequence of the four counts. 3. `(contingency-2x2-measures [[a b] [c d]])`: Takes a sequence of sequences representing the rows. 4. `(contingency-2x2-measures {:a a :b b :c c :d d})`: Takes a map of counts (accepts `:a/:b/:c/:d` keys). Parameters: - `a, b, c, d` (long): Counts in the 2x2 table cells. - `map-or-seq` (map or sequence): A representation of the 2x2 table. Returns a map containing a selection of measures: - `:OR`: Odds Ratio - `:chi2`: Pearson's Chi-squared statistic - `:yates`: Yates' continuity corrected Chi-squared statistic - `:cochran-mantel-haenszel`: Cochran-Mantel-Haenszel statistic - `:cohens-kappa`: Cohen's Kappa coefficient - `:yules-q`: Yule's Q measure of association - `:holley-guilfords-g`: Holley-Guilford's G measure - `:huberts-gamma`: Hubert's Gamma measure - `:yules-y`: Yule's Y measure of association - `:cramers-v`: Cramer's V measure of association - `:phi`: Phi coefficient (Matthews Correlation Coefficient) - `:scotts-pi`: Scott's Pi measure of agreement - `:cohens-h`: Cohen's H measure - `:PCC`: Pearson's Contingency Coefficient - `:PCC-adjusted`: Adjusted Pearson's Contingency Coefficient - `:TCC`: Tschuprow's Contingency Coefficient - `:F1`: F1 Score - `:bangdiwalas-b`: Bangdiwala's B statistic - `:mcnemars-chi2`: McNemar's Chi-squared test statistic - `:gwets-ac1`: Gwet's AC1 measure For a more comprehensive set of 2x2 measures and their detailed descriptions, see [[contingency-2x2-measures-all]].
(contingency-2x2-measures-all map-or-seq)
(contingency-2x2-measures-all [a b] [c d])
(contingency-2x2-measures-all a b c d)
Calculates a comprehensive set of statistics and measures for a 2x2 contingency table. A 2x2 contingency table cross-tabulates two categorical variables, each with two levels. The table counts are typically represented as: +---+---+ | a | b | +---+---+ | c | d | +---+---+ Where `a, b, c, d` are the counts in the respective cells. This function calculates numerous measures, including: * Chi-squared statistics (Pearson, Yates' corrected, CMH) and their p-values. * Measures of association (Phi, Yule's Q, Holley-Guilford's G, Hubert's Gamma, Yule's Y, Cramer's V, Scott's Pi, Cohen's H, Pearson/Tschuprow's CC). * Measures of agreement (Cohen's Kappa). * Risk and effect size measures (Odds Ratio (OR), Relative Risk (RR), Risk Difference (RD), NNT, etc.). * Table marginals and proportions. The function can be called with the four counts directly or with a representation of the contingency table: 1. `(contingency-2x2-measures-all a b c d)`: Takes the four counts as arguments. 2. `(contingency-2x2-measures-all [a b c d])`: Takes a sequence of the four counts. 3. `(contingency-2x2-measures-all [[a b] [c d]])`: Takes a sequence of sequences representing the rows. 4. `(contingency-2x2-measures-all {:a a :b b :c c :d d})`: Takes a map of counts (accepts `:a/:b/:c/:d` keys). Parameters: - `a` (long): Count in the top-left cell. - `b` (long): Count in the top-right cell. - `c` (long): Count in the bottom-left cell. - `d` (long): Count in the bottom-right cell. - `map-or-seq` (map or sequence): A representation of the 2x2 table as described above. Returns a map containing a wide range of calculated statistics. Keys include: `:n`, `:table`, `:expected`, `:marginals`, `:proportions`, `:p-values` (map), `:OR`, `:lOR`, `:RR`, `:risk` (map), `:SE`, `:measures` (map). See also [[contingency-2x2-measures]] for a selected subset of these measures, [[mcc]] for the Matthews Correlation Coefficient (Phi), and [[binary-measures-all]] for metrics derived from a confusion matrix (often a 2x2 table in binary classification).
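A sketch of the interchangeable input shapes; the alias is illustrative and all four calls describe the same table:

```clojure
(require '[fastmath.stats :as stats])

;; Table:  | 10  5 |
;;         |  3 12 |
(stats/contingency-2x2-measures-all 10 5 3 12)
(stats/contingency-2x2-measures-all [10 5 3 12])
(stats/contingency-2x2-measures-all [[10 5] [3 12]])
(stats/contingency-2x2-measures-all {:a 10 :b 5 :c 3 :d 12})

;; Picking a few documented keys out of the result map.
(select-keys (stats/contingency-2x2-measures-all 10 5 3 12)
             [:n :OR :RR])
```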
(contingency-table & seqs)
Creates a frequency map (contingency table) from one or more sequences. If one sequence `xs` is provided, it returns a simple frequency map of the values in `xs`. If multiple sequences `s1, s2, ..., sn` are provided, it creates a contingency table of the tuples formed by the corresponding elements `[s1_i, s2_i, ..., sn_i]` at each index `i`. The returned map keys are these tuples, and values are their frequencies. Parameters: - `seqs` (one or more sequences): The input sequences. All sequences should ideally have the same length, as elements are paired by index. Returns a map where keys represent unique combinations of values (or single values if only one sequence is input) and values are the counts of these combinations. See also [[rows->contingency-table]], [[contingency-table->marginals]].
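A small sketch of the one- and two-sequence cases (alias illustrative):

```clojure
(require '[fastmath.stats :as stats])

;; One sequence: a plain frequency map.
(stats/contingency-table [:a :b :a :c :a :b])
;; => {:a 3, :b 2, :c 1}

;; Two sequences: keys are [value1 value2] tuples, values are counts.
(stats/contingency-table [:x :x :y :y :x]
                         [:p :q :p :p :p])
```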
(contingency-table->marginals ct)
Calculates marginal sums (row and column totals) and the grand total from a contingency table. A contingency table represents the frequency distribution of observations for two or more categorical variables. This function summarizes these frequencies along the rows and columns. The function accepts two main input formats for the contingency table: 1. A map where keys are `[row-index, column-index]` tuples and values are counts (e.g., `{[0 0] 10, [0 1] 5, [1 0] 3, [1 1] 12}`). This format is produced by [[contingency-table]] when given multiple sequences or by [[rows->contingency-table]]. 2. A sequence of sequences representing the rows of the table, where each inner sequence contains counts for the columns in that row (e.g., `[[10 5] [3 12]]`). The function internally converts this format to the map format. Parameters: - `ct` (map or sequence of sequences): The contingency table input. Returns a map containing: - `:rows`: A sequence of `[row-index, row-total]` pairs. - `:cols`: A sequence of `[column-index, column-total]` pairs. - `:n`: The grand total of all counts in the table. - `:diag`: A sequence of `[[index, index], count]` pairs for cells on the diagonal (where row index equals column index). This is useful for square tables like confusion matrices. See also [[contingency-table]], [[rows->contingency-table]].
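A sketch of both accepted table formats (alias illustrative):

```clojure
(require '[fastmath.stats :as stats])

;; Map form, keyed by [row col] tuples.
(stats/contingency-table->marginals {[0 0] 10, [0 1] 5, [1 0] 3, [1 1] 12})

;; Row form is converted internally; both describe the same table,
;; so each returns row/column totals, the grand total :n (here 30),
;; and the diagonal cells.
(stats/contingency-table->marginals [[10 5] [3 12]])
```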
(correlation [vs1 vs2])
(correlation vs1 vs2)
Calculates the correlation coefficient between two sequences. By default, this function calculates the Pearson product-moment correlation coefficient, which measures the linear relationship between two datasets. This function handles the standard deviation normalization based on whether the inputs `vs1` and `vs2` are treated as samples or populations (it uses sample standard deviation derived from [[variance]]). Parameters: - `[vs1 vs2]` (sequence of two sequences): A sequence containing the two sequences of numbers. - `vs1`, `vs2` (sequences): The two sequences of numbers directly as arguments. Both sequences must have the same length. Returns the calculated correlation coefficient (a value between -1.0 and 1.0) as a double. Returns `NaN` if one or both sequences have zero variance (are constant). See also [[covariance]], [[pearson-correlation]], [[spearman-correlation]], [[kendall-correlation]], [[correlation-matrix]].
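A minimal sketch, including the documented `NaN` case; the alias and data are illustrative:

```clojure
(require '[fastmath.stats :as stats])

(def xs [1 2 3 4 5])
(def ys [2 4 5 4 5])

(stats/correlation xs ys)    ;; Pearson by default
(stats/correlation [xs ys])  ;; same result via the packed arity

;; A constant sequence has zero variance, so this yields NaN.
(stats/correlation [1 1 1 1 1] ys)
```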
(correlation-matrix vss)
(correlation-matrix vss measure)
Generates a matrix of pairwise correlation coefficients from a sequence of sequences.

Given a collection of data sequences `vss`, where each inner sequence represents a variable, this function calculates a square matrix where the element at row `i` and column `j` is the correlation coefficient between the `i`-th and `j`-th sequences in `vss`.

Parameters:

- `vss` (sequence of sequences of numbers): The collection of data sequences. Each inner sequence is treated as a variable. All inner sequences must have the same length.
- `measure` (keyword, optional): Specifies the type of correlation coefficient to calculate. Defaults to `:pearson`.
  - `:pearson` (default): Calculates the Pearson product-moment correlation coefficient.
  - `:kendall`: Calculates Kendall's Tau rank correlation coefficient.
  - `:spearman`: Calculates Spearman's rank correlation coefficient.

Returns a sequence of sequences (a matrix) of doubles representing the correlation matrix. The matrix is symmetric, as correlation is a symmetric measure.

See also [[pearson-correlation]], [[spearman-correlation]], [[kendall-correlation]], [[covariance-matrix]], [[coefficient-matrix]].
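A short sketch of both arities, under the same assumed `stats` alias:

```clojure
;; Three variables, four observations each; the result is a 3x3
;; symmetric matrix with 1.0 on the diagonal.
(stats/correlation-matrix [[1 2 3 4]
                           [2 4 6 8]
                           [4 3 2 1]])

;; Rank-based variant selected by keyword.
(stats/correlation-matrix [[1 2 3 4] [2 4 6 8] [4 3 2 1]] :spearman)
```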
(count= [vs1 vs2-or-val])
(count= vs1 vs2-or-val)
Count equal values in both seqs. Same as [[L0]].

Calculates the number of pairs of corresponding elements that are equal between two sequences, or between a sequence and a single scalar value.

Parameters:

- `vs1` (sequence of numbers): The first sequence.
- `vs2-or-val` (sequence of numbers or single number): The second sequence of numbers, or a single number to compare against each element of `vs1`.

If both inputs are sequences, they must have the same length. If `vs2-or-val` is a single number, it is effectively treated as a sequence of that number repeated `count(vs1)` times.

Returns the count of equal elements as a long integer.
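Two illustrative calls (assumed `stats` alias):

```clojure
;; Three of the four positions match.
(stats/count= [1 2 3 4] [1 0 3 4])
;; => 3

;; Scalar form: counts the elements of vs1 equal to 1.
(stats/count= [1 2 1 1] 1)
;; => 3
```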
(covariance [vs1 vs2])
(covariance vs1 vs2)
Covariance of two sequences. This function calculates the *sample* covariance.

Parameters:

- `[vs1 vs2]` (sequence of two sequences): A sequence containing the two sequences of numbers.
- `vs1`, `vs2` (sequences): The two sequences of numbers directly as arguments.

Both sequences must have the same length.

Returns the calculated sample covariance as a double.

See also [[correlation]], [[covariance-matrix]].
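A small worked example (assumed `stats` alias): with means 2 and 4, the sum of cross-deviations is 4, divided by n - 1 = 2:

```clojure
;; Sample covariance divides by n - 1.
(stats/covariance [1 2 3] [2 4 6])
;; => 2.0
```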
(covariance-matrix vss)
Generates a matrix of pairwise covariance coefficients from a sequence of sequences.

Given a collection of data sequences `vss`, where each inner sequence represents a variable, this function calculates a square matrix where the element at row `i` and column `j` is the sample covariance between the `i`-th and `j`-th sequences in `vss`.

Parameters:

- `vss` (sequence of sequences of numbers): The collection of data sequences. Each inner sequence is treated as a variable. All inner sequences must have the same length.

Returns a sequence of sequences (a matrix) of doubles representing the covariance matrix. The matrix is symmetric, as covariance is a symmetric measure ($Cov(X,Y) = Cov(Y,X)$).

Internally uses [[coefficient-matrix]] with the [[covariance]] function and `symmetric?` set to `true`.

See also [[covariance]], [[correlation-matrix]], [[coefficient-matrix]].
(cramers-c contingency-table)
(cramers-c group1 group2)
Calculates Cramer's C, a measure of association (effect size) between two nominal variables represented in a contingency table.

Its value ranges from 0 to 1, where 0 indicates no association and 1 indicates a perfect association. It is particularly useful for tables larger than 2x2.

The function can be called in two ways:

1. With two sequences `group1` and `group2`: The function will automatically construct a contingency table from the unique values in the sequences.
2. With a contingency table, provided as:
   - A map where keys are `[row-index, column-index]` tuples and values are counts (e.g., `{[0 0] 10, [0 1] 5, [1 0] 3, [1 1] 12}`). This is the output format of [[contingency-table]] with two inputs.
   - A sequence of sequences representing the rows of the table (e.g., `[[10 5] [3 12]]`). This is equivalent to [[rows->contingency-table]].

Parameters:

- `group1` (sequence): The first sequence of categorical data.
- `group2` (sequence): The second sequence of categorical data. Must have the same length as `group1`.
- `contingency-table` (map or sequence of sequences): A pre-computed contingency table.

Returns the calculated Cramer's C coefficient as a double.

See also [[chisq-test]], [[cramers-v]], [[cohens-w]], [[tschuprows-t]], [[contingency-table]].
(cramers-v contingency-table)
(cramers-v group1 group2)
Calculates Cramer's V, a measure of association (effect size) between two nominal variables represented in a contingency table.

Its value ranges from 0 to 1, where 0 indicates no association and 1 indicates a perfect association. It is related to the Pearson's Chi-squared statistic and is useful for tables of any size.

The function can be called in two ways:

1. With two sequences `group1` and `group2`: The function will automatically construct a contingency table from the unique values in the sequences.
2. With a contingency table, provided as:
   - A map where keys are `[row-index, column-index]` tuples and values are counts (e.g., `{[0 0] 10, [0 1] 5, [1 0] 3, [1 1] 12}`). This is the output format of [[contingency-table]] with two inputs.
   - A sequence of sequences representing the rows of the table (e.g., `[[10 5] [3 12]]`). This is equivalent to [[rows->contingency-table]].

Parameters:

- `group1` (sequence): The first sequence of categorical data.
- `group2` (sequence): The second sequence of categorical data. Must have the same length as `group1`.
- `contingency-table` (map or sequence of sequences): A pre-computed contingency table.

Returns the calculated Cramer's V coefficient as a double.

See also [[chisq-test]], [[cramers-c]], [[cohens-w]], [[tschuprows-t]], [[contingency-table]].
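A sketch of both call styles (assumed `stats` alias); the same shapes apply to [[cramers-c]] and [[cramers-v-corrected]]:

```clojure
;; From raw categorical observations.
(stats/cramers-v [:a :a :b :b :b] [:x :y :x :x :y])

;; From a pre-computed contingency table given as rows.
(stats/cramers-v [[10 5] [3 12]])
```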
(cramers-v-corrected contingency-table)
(cramers-v-corrected group1 group2)
Calculates the **corrected Cramer's V**, a measure of association (effect size) between two nominal variables represented in a contingency table, with a correction to reduce bias, particularly for small sample sizes or tables with many cells having small expected counts.

Like the uncorrected Cramer's V ([[cramers-v]]), its value ranges from 0 to 1, where 0 indicates no association and 1 indicates a perfect association. The correction tends to yield a value closer to the true population value in biased situations.

The function can be called in two ways:

1. With two sequences `group1` and `group2`: The function will automatically construct a contingency table from the unique values in the sequences.
2. With a contingency table, provided as:
   - A map where keys are `[row-index, column-index]` tuples and values are counts (e.g., `{[0 0] 10, [0 1] 5, [1 0] 3, [1 1] 12}`). This is the output format of [[contingency-table]] with two inputs.
   - A sequence of sequences representing the rows of the table (e.g., `[[10 5] [3 12]]`). This is equivalent to [[rows->contingency-table]].

Parameters:

- `group1` (sequence): The first sequence of categorical data.
- `group2` (sequence): The second sequence of categorical data. Must have the same length as `group1`.
- `contingency-table` (map or sequence of sequences): A pre-computed contingency table.

Returns the calculated corrected Cramer's V coefficient as a double.

See also [[chisq-test]], [[cramers-v]] (uncorrected), [[cramers-c]], [[cohens-w]], [[tschuprows-t]], [[contingency-table]].
(cressie-read-test contingency-table-or-xs)
(cressie-read-test contingency-table-or-xs params)
Cressie-Read test, a power divergence test for `lambda` = 2/3.

Performs a power divergence test, which encompasses several common statistical tests like Chi-squared, G-test (likelihood ratio), etc., based on the lambda parameter. This function can perform either a goodness-of-fit test or a test for independence in a contingency table.

Usage:

1. **Goodness-of-Fit (GOF):**
   - Input: `observed-counts` (sequence of numbers) and `:p` (expected probabilities/weights).
   - Input: `data` (sequence of numbers) and `:p` (a distribution object). In this case, a histogram of `data` is created (controlled by `:bins`) and compared against the probability mass/density of the distribution in those bins.
2. **Test for Independence:**
   - Input: `contingency-table` (2D sequence or map format). The `:p` option is ignored.

Options map:

* `:lambda` (double, default: `2/3`): Determines the specific test statistic. Common values:
  * `1.0`: Pearson Chi-squared test ([[chisq-test]]).
  * `0.0`: G-test / Multinomial Likelihood Ratio test ([[multinomial-likelihood-ratio-test]]).
  * `-0.5`: Freeman-Tukey test ([[freeman-tukey-test]]).
  * `-1.0`: Minimum Discrimination Information test ([[minimum-discrimination-information-test]]).
  * `-2.0`: Neyman Modified Chi-squared test ([[neyman-modified-chisq-test]]).
  * `2/3`: Cressie-Read test (default, [[cressie-read-test]]).
* `:p` (seq of numbers or distribution): Expected probabilities/weights (for GOF with counts) or a `fastmath.random` distribution object (for GOF with data). Ignored for independence tests.
* `:alpha` (double, default: `0.05`): Significance level for confidence intervals.
* `:ci-sides` (keyword, default: `:two-sided`): Sides for bootstrap confidence intervals (`:two-sided`, `:one-sided-greater`, `:one-sided-less`).
* `:sides` (keyword, default: `:one-sided-greater`): Alternative hypothesis side for the p-value calculation against the Chi-squared distribution (`:one-sided-greater`, `:one-sided-less`, `:two-sided`).
* `:bootstrap-samples` (long, default: `1000`): Number of bootstrap samples for confidence interval estimation.
* `:ddof` (long, default: `0`): Delta degrees of freedom. Adjustment subtracted from the calculated degrees of freedom.
* `:bins` (number, keyword, or seq): Used only for GOF test against a distribution. Specifies the number of bins, an estimation method (see [[histogram]]), or explicit bin edges for histogram creation.

Returns a map containing:

- `:stat`: The calculated power divergence test statistic.
- `:chi2`: Alias for `:stat`.
- `:df`: Degrees of freedom for the test.
- `:p-value`: The p-value associated with the test statistic.
- `:n`: Total number of observations.
- `:estimate`: Observed proportions.
- `:expected`: Expected counts or proportions under the null hypothesis.
- `:confidence-interval`: Bootstrap confidence intervals for the observed proportions.
- `:lambda`, `:alpha`, `:sides`, `:ci-sides`: Input options used.
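Two hedged sketches (assumed `stats` alias); the counts are made up for illustration:

```clojure
;; Goodness-of-fit: are the observed counts compatible with a fair die?
(stats/cressie-read-test [18 22 16 25 19 20]
                         {:p [1/6 1/6 1/6 1/6 1/6 1/6]})

;; Independence: a 2x2 contingency table given as rows.
(stats/cressie-read-test [[10 5] [3 12]])
```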
(dissimilarity method P-observed Q-expected)
(dissimilarity method
P-observed
Q-expected
{:keys [bins probabilities? epsilon log-base power remove-zeros?]
:or
{probabilities? true epsilon 1.0E-6 log-base m/E power 2.0}})
Various PDF distances between two histograms (frequencies) or probabilities.

`Q` can be a distribution object; in that case a histogram is created out of `P`.

Arguments:

* `method` - distance method
* `P-observed` - frequencies, probabilities or actual data (when `Q` is a distribution or `:bins` is set)
* `Q-expected` - frequencies, probabilities or a distribution object (when `P` is data or `:bins` is set)

Options:

* `:probabilities?` - should `P`/`Q` be converted to probabilities, default: `true`.
* `:epsilon` - small number which replaces `0.0` when division or logarithm is used
* `:log-base` - base for logarithms, default: `e`
* `:power` - exponent for `:minkowski` distance, default: `2.0`
* `:bins` - number of bins or bins estimation method, see [[histogram]].

The list of methods: `:euclidean`, `:city-block`, `:manhattan`, `:chebyshev`, `:minkowski`, `:sorensen`, `:gower`, `:soergel`, `:kulczynski`, `:canberra`, `:lorentzian`, `:non-intersection`, `:wave-hedges`, `:czekanowski`, `:motyka`, `:tanimoto`, `:jaccard`, `:dice`, `:bhattacharyya`, `:hellinger`, `:matusita`, `:squared-chord`, `:euclidean-sq`, `:squared-euclidean`, `:pearson-chisq`, `:chisq`, `:neyman-chisq`, `:squared-chisq`, `:symmetric-chisq`, `:divergence`, `:clark`, `:additive-symmetric-chisq`, `:kullback-leibler`, `:jeffreys`, `:k-divergence`, `:topsoe`, `:jensen-shannon`, `:jensen-difference`, `:taneja`, `:kumar-johnson`, `:avg`

See more: "Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions" by Sung-Hyuk Cha.
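A couple of illustrative calls (assumed `stats` alias):

```clojure
;; Jensen-Shannon divergence between two frequency vectors;
;; inputs are normalized to probabilities by default.
(stats/dissimilarity :jensen-shannon [10 20 30] [12 18 30])

;; Euclidean distance on values that are already probabilities,
;; so normalization is switched off.
(stats/dissimilarity :euclidean [0.2 0.3 0.5] [0.25 0.25 0.5]
                     {:probabilities? false})
```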
(durbin-watson rs)
Calculates the Durbin-Watson statistic (d) for a sequence of residuals.

This statistic is used to test for the presence of serial correlation, especially first-order (lag-1) autocorrelation, in the residuals from a regression analysis. Autocorrelation violates the assumption of independent errors.

Parameters:

- `rs` (sequence of numbers): The sequence of residuals from a regression model. The sequence should represent observations ordered by time or sequence index.

Returns the calculated Durbin-Watson statistic as a double. The value ranges from 0 to 4.

Interpretation:

- Values near 2 suggest no first-order autocorrelation.
- Values less than 2 suggest positive autocorrelation (residuals tend to be followed by residuals of the same sign).
- Values greater than 2 suggest negative autocorrelation (residuals tend to be followed by residuals of the opposite sign).
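Two hand-checkable sketches, assuming the standard statistic d = Σ(e_t - e_{t-1})² / Σe_t² (assumed `stats` alias):

```clojure
;; Alternating residuals: d = 20/6 ≈ 3.33, suggesting negative
;; autocorrelation.
(stats/durbin-watson [1 -1 1 -1 1 -1])

;; Steadily drifting residuals: d = 5/91 ≈ 0.055, suggesting strong
;; positive autocorrelation.
(stats/durbin-watson [1 2 3 4 5 6])
```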
(epsilon-sq [group1 group2])
(epsilon-sq group1 group2)
Calculates Epsilon squared (ε²), an effect size measure for the simple linear regression of `group1` on `group2`.

Epsilon squared estimates the proportion of variance in the dependent variable (`group1`) that is accounted for by the independent variable (`group2`) in the population. It is considered a less biased alternative to the sample R-squared ([[r2-determination]]). The calculation is based on the sums of squares from the simple linear regression of `group1` on `group2`.

Parameters:

- `group1` (seq of numbers): The dependent variable.
- `group2` (seq of numbers): The independent variable. Must have the same length as `group1`.

Returns the calculated Epsilon squared value as a double. The value typically ranges from 0.0 to 1.0.

Interpretation:

- 0.0 indicates that `group2` explains none of the variance in `group1` in the population.
- 1.0 indicates that `group2` perfectly explains the variance in `group1` in the population.

Note: While often presented in the context of ANOVA, this implementation applies the formula to the sums of squares obtained from a simple linear regression between the two sequences.

See also [[eta-sq]] (Eta-squared, often based on $R^2$), [[omega-sq]] (another adjusted R²-like measure), [[r2-determination]] (R-squared).
(estimate-bins vs)
(estimate-bins vs bins-or-estimate-method)
Estimate number of bins for histogram.

Possible methods are: `:sqrt`, `:sturges`, `:rice`, `:doane`, `:scott`, `:freedman-diaconis` (default).

The number returned is not higher than the number of samples.
List of estimation strategies for [[percentile]]/[[quantile]] functions.
(eta-sq [group1 group2])
(eta-sq group1 group2)
Calculates a measure of association between two sequences, named `eta-sq` (Eta-squared).

*Note*: The current implementation calculates the R-squared coefficient of determination from a simple linear regression where the first input sequence (`group1`) is treated as the dependent variable and the second (`group2`) as the independent variable. In this context, it quantifies the proportion of the variance in `group1` that is linearly predictable from `group2`.

Parameters:

- `group1` (seq of numbers): The first sequence (treated as dependent variable).
- `group2` (seq of numbers): The second sequence (treated as independent variable).

Returns the calculated R-squared value as a double [0.0, 1.0].

Interpretation:

- 0.0 indicates that `group2` explains none of the variance in `group1` linearly.
- 1.0 indicates that `group2` linearly explains all the variance in `group1`.

While Eta-squared ($\eta^2$) is commonly used in ANOVA to quantify the proportion of variance in a dependent variable explained by group membership, this function's calculation method differs from the standard ANOVA $\eta^2$ unless `group2` explicitly represents numeric codes for two groups.

See also [[r2-determination]] (which is equivalent to this function), [[pearson-correlation]], [[omega-sq]], [[epsilon-sq]], [[one-way-anova-test]].
(expectile vs tau)
(expectile vs weights tau)
Calculate the tau-th expectile of a sequence `vs`.

Expectiles are related to quantiles but are determined by minimizing an asymmetrically weighted sum of squared differences, rather than absolute differences. The `tau` parameter controls the asymmetry. A key property is that the expectile for `tau = 0.5` is equal to the [[mean]].

The calculation involves finding the value `t` such that the weighted sum of `w_i * (v_i - t)` is zero, where the effective weights depend on `tau` and whether `v_i` is above or below `t`.

Parameters:

- `vs`: Sequence of data values.
- `weights` (optional): Sequence of corresponding non-negative weights. Must have the same count as `vs`. If omitted, calculates the unweighted expectile.
- `tau`: The expectile level, a value between 0.0 and 1.0 (inclusive).

Returns the calculated expectile as a double.

See also [[quantile]], [[mean]], [[median]].
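A quick check of the mean property (assumed `stats` alias):

```clojure
;; The 0.5 expectile equals the arithmetic mean: (1+2+3+4+100)/5 = 22.0.
(stats/expectile [1 2 3 4 100] 0.5)
;; => 22.0

;; Larger tau values weight the upper tail more heavily.
(stats/expectile [1 2 3 4 100] 0.9)
```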
(extent vs)
(extent vs mean?)
Return extent (min, max, mean) values from a sequence. The mean is optional and controlled by `mean?` (default: true).
(f-test xs ys)
(f-test xs ys {:keys [sides alpha] :or {sides :two-sided alpha 0.05}})
Performs an F-test to compare the variances of two independent samples.

The test assesses the null hypothesis that the variances of the populations from which `xs` and `ys` are drawn are equal. Assumes independence of samples. The test is sensitive to departures from the assumption that both populations are normally distributed.

Parameters:

- `xs` (seq of numbers): The first sample.
- `ys` (seq of numbers): The second sample.
- `params` (map, optional): Options map:
  - `:sides` (keyword, default `:two-sided`): Specifies the alternative hypothesis regarding the ratio of variances (Var(xs) / Var(ys)).
    - `:two-sided` (default): Variances are not equal (ratio != 1).
    - `:one-sided-greater`: Variance of `xs` is greater than variance of `ys` (ratio > 1).
    - `:one-sided-less`: Variance of `xs` is less than variance of `ys` (ratio < 1).
  - `:alpha` (double, default `0.05`): Significance level for the confidence interval.

Returns a map containing:

- `:F`: The calculated F-statistic (ratio of sample variances: Var(xs) / Var(ys)).
- `:stat`: Alias for `:F`.
- `:estimate`: Alias for `:F`, representing the estimated ratio of variances.
- `:df`: Degrees of freedom as `[numerator-df, denominator-df]`, corresponding to `[(count xs)-1, (count ys)-1]`.
- `:n`: Sample sizes as `[count xs, count ys]`.
- `:nx`: Sample size of `xs`.
- `:ny`: Sample size of `ys`.
- `:sides`: The alternative hypothesis side used (`:two-sided`, `:one-sided-greater`, or `:one-sided-less`).
- `:test-type`: Alias for `:sides`.
- `:p-value`: The p-value associated with the F-statistic and the specified `:sides`.
- `:confidence-interval`: A confidence interval for the true ratio of the population variances (Var(xs) / Var(ys)).
(fligner-killeen-test xss)
(fligner-killeen-test xss {:keys [sides] :or {sides :one-sided-greater}})
Performs the Fligner-Killeen test for homogeneity of variances across two or more groups.

The Fligner-Killeen test is a non-parametric test that assesses the null hypothesis that the variances of the groups are equal. It is robust against departures from normality. The test is based on ranks of the absolute deviations from the group medians.

Parameters:

- `xss` (sequence of sequences): A collection where each element is a sequence representing a group of observations.
- `params` (map, optional): Options map with the following key:
  - `:sides` (keyword, default `:one-sided-greater`): Alternative hypothesis side for the Chi-squared test. Possible values: `:one-sided-greater`, `:one-sided-less`, `:two-sided`.

Returns a map containing:

- `:chi2`: The Fligner-Killeen test statistic (Chi-squared value).
- `:stat`: Alias for `:chi2`.
- `:p-value`: The p-value for the test.
- `:df`: Degrees of freedom for the test (number of groups - 1).
- `:n`: Sequence of sample sizes for each group.
- `:SSt`: Sum of squares between groups (treatment) based on transformed ranks.
- `:SSe`: Sum of squares within groups (error) based on transformed ranks.
- `:DFt`: Degrees of freedom between groups.
- `:DFe`: Degrees of freedom within groups.
- `:MSt`: Mean square between groups.
- `:MSe`: Mean square within groups.
- `:sides`: Test side used.
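A minimal sketch with three made-up groups of differing spread (assumed `stats` alias):

```clojure
(stats/fligner-killeen-test [[10 11 10 12 11]
                             [20 30 15 35 25]
                             [5 6 5 4 6]])
;; => map with :chi2, :df 2, :p-value, ...
```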
(freeman-tukey-test contingency-table-or-xs)
(freeman-tukey-test contingency-table-or-xs params)
Freeman-Tukey test, a power divergence test for `lambda` = -0.5.

Performs a power divergence test, which encompasses several common statistical tests like Chi-squared, G-test (likelihood ratio), etc., based on the lambda parameter. This function can perform either a goodness-of-fit test or a test for independence in a contingency table.

Usage:

1. **Goodness-of-Fit (GOF):**
   - Input: `observed-counts` (sequence of numbers) and `:p` (expected probabilities/weights).
   - Input: `data` (sequence of numbers) and `:p` (a distribution object). In this case, a histogram of `data` is created (controlled by `:bins`) and compared against the probability mass/density of the distribution in those bins.
2. **Test for Independence:**
   - Input: `contingency-table` (2D sequence or map format). The `:p` option is ignored.

Options map:

* `:lambda` (double, default: `2/3`): Determines the specific test statistic. Common values:
  * `1.0`: Pearson Chi-squared test ([[chisq-test]]).
  * `0.0`: G-test / Multinomial Likelihood Ratio test ([[multinomial-likelihood-ratio-test]]).
  * `-0.5`: Freeman-Tukey test ([[freeman-tukey-test]]).
  * `-1.0`: Minimum Discrimination Information test ([[minimum-discrimination-information-test]]).
  * `-2.0`: Neyman Modified Chi-squared test ([[neyman-modified-chisq-test]]).
  * `2/3`: Cressie-Read test (default, [[cressie-read-test]]).
* `:p` (seq of numbers or distribution): Expected probabilities/weights (for GOF with counts) or a `fastmath.random` distribution object (for GOF with data). Ignored for independence tests.
* `:alpha` (double, default: `0.05`): Significance level for confidence intervals.
* `:ci-sides` (keyword, default: `:two-sided`): Sides for bootstrap confidence intervals (`:two-sided`, `:one-sided-greater`, `:one-sided-less`).
* `:sides` (keyword, default: `:one-sided-greater`): Alternative hypothesis side for the p-value calculation against the Chi-squared distribution (`:one-sided-greater`, `:one-sided-less`, `:two-sided`).
* `:bootstrap-samples` (long, default: `1000`): Number of bootstrap samples for confidence interval estimation.
* `:ddof` (long, default: `0`): Delta degrees of freedom. Adjustment subtracted from the calculated degrees of freedom.
* `:bins` (number, keyword, or seq): Used only for GOF test against a distribution. Specifies the number of bins, an estimation method (see [[histogram]]), or explicit bin edges for histogram creation.

Returns a map containing:

- `:stat`: The calculated power divergence test statistic.
- `:chi2`: Alias for `:stat`.
- `:df`: Degrees of freedom for the test.
- `:p-value`: The p-value associated with the test statistic.
- `:n`: Total number of observations.
- `:estimate`: Observed proportions.
- `:expected`: Expected counts or proportions under the null hypothesis.
- `:confidence-interval`: Bootstrap confidence intervals for the observed proportions.
- `:lambda`, `:alpha`, `:sides`, `:ci-sides`: Input options used.
(geomean vs)
(geomean vs weights)
Calculates the geometric mean of a sequence `vs`.

The geometric mean is suitable for averaging ratios or rates of change and requires all values in the sequence to be positive. It is calculated as the n-th root of the product of n numbers.

Parameters:

- `vs`: Sequence of numbers. Non-positive values will result in `NaN` or `0.0` due to the internal use of `log`.
- `weights` (optional): Sequence of non-negative weights corresponding to `vs`. Must have the same count as `vs`.

Returns the calculated geometric mean as a double.

See also [[mean]], [[harmean]], [[powmean]].
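A hand-checkable example (assumed `stats` alias):

```clojure
;; Cube root of 1 * 3 * 9 = 27, i.e. 3.0.
(stats/geomean [1 3 9])
;; => 3.0
```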
(glass-delta [group1 group2])
(glass-delta group1 group2)
Calculates Glass's delta (Δ), an effect size measure for the difference between two group means, using the standard deviation of the control group.

Glass's delta is used to quantify the magnitude of the difference between an experimental group and a control group, specifically when the control group's standard deviation is considered a better estimate of the population standard deviation than a pooled variance.

Parameters:

- `group1` (seq of numbers): The experimental group.
- `group2` (seq of numbers): The control group.

Returns the calculated Glass's delta as a double.

This measure is less common than [[cohens-d]] or [[hedges-g]] but is preferred when the intervention is expected to affect the variance or when `group2` (the control) is clearly the baseline against which variability should be assessed.

See also [[cohens-d]], [[hedges-g]].
(harmean vs)
(harmean vs weights)
Calculates the harmonic mean of a sequence `vs`.

The harmonic mean is the reciprocal of the arithmetic mean of the reciprocals of the observations.

Parameters:

- `vs`: Sequence of numbers. Values must be non-zero.
- `weights` (optional): Sequence of non-negative weights corresponding to `vs`. Must have the same count as `vs`.

Returns the calculated harmonic mean as a double.

See also [[mean]], [[geomean]], [[powmean]].
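A hand-checkable example (assumed `stats` alias):

```clojure
;; 3 / (1/2 + 1/3 + 1/6) = 3.0.
(stats/harmean [2 3 6])
;; => 3.0
```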
(hedges-g [group1 group2])
(hedges-g group1 group2)
Calculates Hedges's g effect size for comparing the means of two independent groups.

Hedges's g is a standardized measure quantifying the magnitude of the difference between the means of two independent groups. It is similar to Cohen's d but uses the *unbiased* pooled standard deviation in the denominator. This implementation calculates g using the unbiased pooled standard deviation as the denominator.

Parameters:

- `group1`, `group2` (sequences): The two independent samples directly as arguments.

Returns the calculated Hedges's g effect size as a double.

Note: This specific function uses the unbiased pooled standard deviation but does *not* apply the small-sample bias correction factor (often denoted as J) sometimes associated with Hedges's g. For a bias-corrected version, see [[hedges-g-corrected]].

This function is equivalent to calling `(cohens-d group1 group2 :unbiased)`.

See also [[cohens-d]], [[hedges-g-corrected]], [[glass-delta]], [[pooled-stddev]].
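A hedged sketch with made-up samples (assumed `stats` alias):

```clojure
;; Standardized mean difference using the unbiased pooled SD;
;; per the docstring, equivalent to (cohens-d group1 group2 :unbiased).
(stats/hedges-g [5.1 5.4 4.9 5.0] [4.2 4.5 4.0 4.3])
```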
(hedges-g* [group1 group2])
(hedges-g* group1 group2)
Calculates a less biased estimate of Hedges's g effect size for comparing the means of two independent groups, using the exact J bias correction.

Hedges's g is a standardized measure of the difference between two means. For small sample sizes, the standard Hedges's g (and Cohen's d) can overestimate the true population effect size. This function applies a specific correction factor, often denoted as J, to mitigate this bias.

The calculation involves:

1. Calculating the standard Hedges's g (equivalent to [[hedges-g]], which uses the unbiased pooled standard deviation).
2. Calculating the J correction factor based on the degrees of freedom (`n1 + n2 - 2`) using the gamma function.
3. Multiplying the standard Hedges's g by the J factor.

The J factor is calculated as `(Gamma(df/2) / (sqrt(df/2) * Gamma((df-1)/2)))`.

Parameters:

- `group1` (seq of numbers): The first independent sample.
- `group2` (seq of numbers): The second independent sample.

Returns the calculated bias-corrected Hedges's g effect size as a double.

This version of Hedges's g is generally preferred over the standard version or Cohen's d when working with small sample sizes, as it provides a more accurate estimate of the population effect size.

Assumptions:

- The two samples are independent.
- Data within each group are approximately normally distributed.
- Equal variances are assumed for calculating the pooled standard deviation.

See also [[cohens-d]], [[hedges-g]] (uncorrected), [[hedges-g-corrected]] (another correction method).
(hedges-g-corrected [group1 group2])
(hedges-g-corrected group1 group2)
Calculates a small-sample bias-corrected effect size for comparing the means of two independent groups, often referred to as a form of Hedges's g.

This function calculates Cohen's d ([[cohens-d]]) using the *unbiased* pooled standard deviation (equivalent to [[hedges-g]]), and then applies a specific correction factor designed to reduce the bias in the effect size estimate for small sample sizes.

The correction factor applied is `(1 - 3 / (4 * df - 1))`, where `df` is the degrees of freedom for the unbiased pooled variance calculation (`n1 + n2 - 2`). This corresponds to calling [[cohens-d-corrected]] with the `:unbiased` method for pooled standard deviation.

Parameters:

- `group1` (seq of numbers): The first independent sample.
- `group2` (seq of numbers): The second independent sample.

Returns the calculated bias-corrected effect size as a double.

Note: This function applies *a* correction factor. For the more standard Hedges's g bias correction using the exact gamma function based correction factor, see [[hedges-g*]].

See also [[cohens-d]], [[cohens-d-corrected]], [[hedges-g]], [[hedges-g*]], [[pooled-stddev]].
(histogram vs)
(histogram vs bins-or-estimate-method)
(histogram vs bins-or-estimate-method [mn mx])
(histogram vs bins-or-estimate-method mn mx)
Calculate histogram.

Estimation method can be a number, a named method (`:sqrt`, `:sturges`, `:rice`, `:doane`, `:scott`, `:freedman-diaconis` (default)) or a sequence of points used as intervals. In the latter case, or when `mn` and `mx` values are provided, data will be filtered to fit in the desired interval(s).

Returns map with keys:

* `:size` - number of bins
* `:step` - average distance between bins
* `:bins` - seq of pairs of range lower value and number of elements
* `:min` - min value
* `:max` - max value
* `:samples` - number of used samples
* `:frequencies` - a map containing counts for bin's average
* `:intervals` - intervals used to create bins
* `:bins-maps` - seq of maps containing:
  * `:min` - lower bound
  * `:mid` - middle value
  * `:max` - upper bound
  * `:step` - actual distance between bins
  * `:count` - number of elements
  * `:avg` - average value
  * `:probability` - probability for bin

If the difference between min and max values is `0`, the number of bins is set to 1.
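A minimal sketch (assumed `stats` alias); the key names follow the return map documented above:

```clojure
(def h (stats/histogram [1.0 2.0 2.0 3.0 3.0 3.0 4.0 4.0 5.0] 4))

(:size h)    ;; => 4
(:samples h) ;; => 9
(:bins h)    ;; seq of [lower-bound count] pairs

;; A named estimation method instead of a fixed bin count.
(stats/histogram [1.0 2.0 2.0 3.0 3.0 3.0 4.0 4.0 5.0] :sturges)
```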
(hpdi-extent vs)
(hpdi-extent vs size)
Highest posterior density interval (HPDI) + median. The `size` parameter is the target probability content of the interval.
(inner-fence-extent vs)
(inner-fence-extent vs estimation-strategy)
Returns the lower inner fence (LIF), the upper inner fence (UIF) and the median.
(iqr vs)
(iqr vs estimation-strategy)
Interquartile range.
(jarque-bera-test xs)
(jarque-bera-test xs params)
(jarque-bera-test xs skew kurt {:keys [sides] :or {sides :one-sided-greater}})
Performs the Jarque-Bera goodness-of-fit test to determine if sample data exhibits skewness and kurtosis consistent with a normal distribution. The test assesses the null hypothesis that the data comes from a normally distributed population (i.e., population skewness is 0 and population excess kurtosis is 0). The test statistic is calculated as: `JB = (n/6) * (S^2 + (1/4)*K^2)` where `n` is the sample size, `S` is the sample skewness (using `:g1` type), and `K` is the excess kurtosis `:g2`. Under the null hypothesis, the JB statistic asymptotically follows a Chi-squared distribution with 2 degrees of freedom. Parameters: - `xs` (seq of numbers): The sample data. - `skew` (double, optional): A pre-calculated sample skewness value (type `:g1`). If omitted, it's calculated from `xs`. - `kurt` (double, optional): A pre-calculated sample *excess* kurtosis value (type `:g2`). If omitted, it's calculated from `xs`. - `params` (map, optional): Options map: - `:sides` (keyword, default `:one-sided-greater`): Specifies the side(s) of the Chi-squared(2) distribution used for p-value calculation. - `:one-sided-greater` (default and standard for JB): Tests if the JB statistic is significantly large, indicating departure from normality. - `:one-sided-less`: Tests if the statistic is significantly small. - `:two-sided`: Tests if the statistic is extreme in either tail. Returns a map containing: - `:Z`: The calculated Jarque-Bera test statistic (labeled `:Z` for consistency, though it follows Chi-squared(2)). - `:stat`: Alias for `:Z`. - `:p-value`: The p-value associated with the test statistic and `:sides`, derived from the Chi-squared(2) distribution. - `:skewness`: The sample skewness (type `:g1`) used in the calculation. - `:kurtosis`: The sample kurtosis (type `:g2`) used in the calculation. See also [[skewness-test]], [[kurtosis-test]], [[normality-test]], [[bonett-seier-test]].
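For instance (a minimal sketch under the assumption that `fastmath.stats` is aliased as `stats`; a uniform sample is used as a clearly non-normal input):

```clojure
(require '[fastmath.stats :as stats])

(def xs (repeatedly 200 rand)) ;; Uniform(0,1), clearly non-normal

(def result (stats/jarque-bera-test xs))
(:stat result)    ;; JB statistic, ~Chi-squared(2) under H0
(:p-value result) ;; small values indicate departure from normality
```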
(jensen-shannon-divergence [vs1 vs2])
(jensen-shannon-divergence vs1 vs2)
Jensen-Shannon divergence of two sequences.
(kendall-correlation [vs1 vs2])
(kendall-correlation vs1 vs2)
Calculates Kendall's rank correlation coefficient (Kendall's Tau) between two sequences. Kendall's Tau is a non-parametric statistic used to measure the ordinal association between two measured quantities. It assesses the degree of similarity between the orderings of data when ranked by each of the quantities. The coefficient value ranges from -1.0 (perfect disagreement in ranking) to 1.0 (perfect agreement in ranking), with 0.0 indicating no monotonic relationship. Unlike Pearson correlation, it does not require the relationship to be linear. Parameters: - `[vs1 vs2]` (sequence of two sequences): A sequence containing the two sequences of numbers. - `vs1`, `vs2` (sequences): The two sequences of numbers directly as arguments. Both input sequences must contain only numbers and must have the same length. Returns the calculated Kendall's Tau coefficient as a double. See also [[pearson-correlation]], [[spearman-correlation]], [[correlation]].
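These properties follow directly from the definition (sketch; the `stats` alias for `fastmath.stats` is an assumption):

```clojure
(require '[fastmath.stats :as stats])

;; perfectly concordant rankings
(stats/kendall-correlation [1 2 3 4 5] [10 20 30 40 50])
;; => 1.0

;; monotonic but non-linear relationship - tau stays at 1.0
(stats/kendall-correlation [1 2 3 4 5] [1 8 27 64 125])
;; => 1.0

;; reversed ordering
(stats/kendall-correlation [1 2 3 4 5] [5 4 3 2 1])
;; => -1.0
```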
(kruskal-test xss)
(kruskal-test xss {:keys [sides] :or {sides :right}})
Performs the Kruskal-Wallis H-test (rank sum test) for independent samples. The Kruskal-Wallis test is a non-parametric alternative to one-way ANOVA. It determines whether there is a statistically significant difference between the distributions of two or more independent groups. It does not assume normality but requires that distributions have a similar shape for the test to be valid. Parameters: - `xss` (vector of sequences): A collection where each element is a sequence representing a group of observations. - `params` (map, optional): a map containing the `:sides` key with values of: `:right` (default), `:left` or `:both`. Returns a map containing: - `:stat`: The Kruskal-Wallis H statistic. - `:n`: Total number of observations across all groups. - `:df`: Degrees of freedom (number of groups - 1). - `:k`: Number of groups. - `:sides`: Test side. - `:p-value`: The p-value for the test (null hypothesis: all groups have the same distribution).
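A usage sketch (the `stats` alias for `fastmath.stats` is an assumption; the groups are illustrative):

```clojure
(require '[fastmath.stats :as stats])

(def group-a [1.2 2.3 1.9 2.8 2.2])
(def group-b [3.1 3.9 2.9 4.2 3.3])
(def group-c [5.0 4.7 5.6 5.1 4.9])

(def result (stats/kruskal-test [group-a group-b group-c]))
(select-keys result [:stat :df :k :p-value])
;; :df => 2 (three groups), :k => 3

;; explicit alternative hypothesis side
(stats/kruskal-test [group-a group-b group-c] {:sides :both})
```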
(ks-test-one-sample xs)
(ks-test-one-sample xs distribution-or-ys)
(ks-test-one-sample xs
distribution-or-ys
{:keys [sides kernel bandwidth distinct?]
:or {sides :two-sided kernel :gaussian distinct? true}})
Performs the one-sample Kolmogorov-Smirnov (KS) test. This test compares the empirical cumulative distribution function (ECDF) of a sample `xs` against a specified theoretical distribution or the ECDF of another empirical sample. It assesses the null hypothesis that `xs` is drawn from the reference distribution. Parameters: - `xs` (seq of numbers): The sample data to be tested. - `distribution-or-ys` (optional): - A `fastmath.random` distribution object to test against. If omitted, defaults to the standard normal distribution (`fastmath.random/default-normal`). - A sequence of numbers (`ys`). In this case, an empirical distribution is estimated from `ys` using Kernel Density Estimation (KDE) or an enumerated distribution (see `:kernel` option). - `opts` (map, optional): Options map: - `:sides` (keyword, default `:two-sided`): Specifies the alternative hypothesis regarding the difference between the ECDF of `xs` and the reference CDF. - `:two-sided` (default): Tests if the ECDF of `xs` is different from the reference CDF. - `:right`: Tests if the ECDF of `xs` is significantly *below* the reference CDF (i.e., `xs` tends to have larger values, stochastically greater). - `:left`: Tests if the ECDF of `xs` is significantly *above* the reference CDF (i.e., `xs` tends to have smaller values, stochastically smaller). - `:kernel` (keyword, default `:gaussian`): Used only when `distribution-or-ys` is a sequence. Specifies the method to estimate the empirical distribution: - `:gaussian` (or other KDE kernels): Uses Kernel Density Estimation. - `:enumerated`: Creates a discrete empirical distribution from `ys`. - `:bandwidth` (double, optional): Bandwidth for KDE (if applicable). - `:distinct?` (boolean or keyword, default `true`): How to handle duplicate values in `xs`. - `true` (default): Removes duplicate values from `xs` before computation. - `false`: Uses all values in `xs`, including duplicates. - `:jitter`: Adds a small amount of random noise to each value in `xs` to break ties. Returns a map containing: - `:n`: Sample size of `xs` (after applying `:distinct?`). - `:dp`: Maximum positive difference (ECDF(xs) - CDF(ref)). - `:dn`: Maximum positive difference (CDF(ref) - ECDF(xs)). - `:d`: The KS test statistic (max absolute difference: `max(dp, dn)`). - `:stat`: The specific statistic used for p-value calculation, depending on `:sides` (`d`, `dp`, or `dn`). - `:p-value`: The p-value associated with the test statistic and the specified `:sides`. - `:sides`: The alternative hypothesis side used.
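An illustrative sketch (the aliases `stats` for `fastmath.stats` and `r` for `fastmath.random` are assumptions, as is the `:uniform-real` distribution key):

```clojure
(require '[fastmath.stats :as stats]
         '[fastmath.random :as r])

(def xs (repeatedly 100 rand))

;; against the default standard normal - expect a tiny p-value
(:p-value (stats/ks-test-one-sample xs))

;; against a matching uniform distribution object
(:p-value (stats/ks-test-one-sample xs (r/distribution :uniform-real)))

;; against another sample via an enumerated empirical distribution
(stats/ks-test-one-sample xs (repeatedly 100 rand) {:kernel :enumerated})
```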
(ks-test-two-samples xs ys)
(ks-test-two-samples xs
ys
{:keys [method sides distinct? correct?]
:or {sides :two-sided distinct? :ties correct? true}})
Performs the two-sample Kolmogorov-Smirnov (KS) test. This test compares the empirical cumulative distribution functions (ECDFs) of two independent samples, `xs` and `ys`, to assess the null hypothesis that they are drawn from the same continuous distribution. Parameters: - `xs` (seq of numbers): The first sample. - `ys` (seq of numbers): The second sample. - `opts` (map, optional): Options map: - `:method` (keyword, optional): Specifies the calculation method for the p-value. - `:exact`: Attempts an exact calculation (suitable for small samples, sensitive to ties). Default if `nx * ny < 10000`. - `:approximate`: Uses the asymptotic Kolmogorov distribution (suitable for larger samples). Default otherwise. - `:sides` (keyword, default `:two-sided`): Specifies the alternative hypothesis. - `:two-sided` (default): Tests if the distributions differ (ECDFs are different). - `:right`: Tests if `xs` is stochastically greater than `ys` (ECDF(xs) is below ECDF(ys)). - `:left`: Tests if `xs` is stochastically smaller than `ys` (ECDF(xs) is above ECDF(ys)). - `:distinct?` (keyword or boolean, default `:ties`): How to handle duplicate values (ties). - `:ties` (default): Includes all points. Passes information about ties to the `:exact` calculation method. Accuracy depends on the exact method's tie handling. - `:jitter`: Adds a small amount of random noise to break ties before comparison. A practical approach if exact tie handling is complex or not required. - `true`: Applies `distinct` to `xs` and `ys` separately before combining. May not resolve all ties between the combined samples. - `false`: Uses the data as-is, without attempting to handle ties explicitly (may lead to less accurate p-values, especially with the exact method). - `:correct?` (boolean, default `true`): Apply continuity correction when using the `:exact` calculation method for a more accurate p-value especially for smaller sample sizes. Returns a map containing: - `:nx`: Number of observations in `xs` (after `:distinct?` processing if applicable). - `:ny`: Number of observations in `ys` (after `:distinct?` processing if applicable). - `:n`: Effective sample size used for asymptotic calculation (`nx*ny / (nx+ny)`). - `:dp`: Maximum positive difference (ECDF(xs) - ECDF(ys)). - `:dn`: Maximum positive difference (ECDF(ys) - ECDF(xs)). - `:d`: The KS test statistic (max absolute difference: `max(dp, dn)`). - `:stat`: The specific statistic used for p-value calculation (`d`, `dp`, or `dn` for exact; scaled version for approximate). - `:KS`: Alias for `:stat`. - `:p-value`: The p-value associated with the test statistic and `:sides`. - `:sides`: The alternative hypothesis side used. - `:method`: The calculation method used (`:exact` or `:approximate`). Note on Ties: The KS test is strictly defined for continuous distributions where ties have zero probability. The presence of ties in sample data affects the p-value calculation. The `:distinct?` option provides ways to manage this, with `:jitter` being a common pragmatic choice.
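For example (sketch; the `stats` alias is an assumption and the samples are synthetic):

```clojure
(require '[fastmath.stats :as stats])

(def xs (repeatedly 60 rand))
(def ys (map #(+ 0.25 %) (repeatedly 60 rand))) ;; shifted to the right

(def result (stats/ks-test-two-samples xs ys))
(select-keys result [:d :stat :p-value :method :nx :ny])

;; jitter any ties and test whether xs is stochastically smaller than ys
(stats/ks-test-two-samples xs ys {:distinct? :jitter :sides :left})
```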
(kullback-leibler-divergence [vs1 vs2])
(kullback-leibler-divergence vs1 vs2)
Kullback-Leibler divergence of two sequences.
(kurtosis vs)
(kurtosis vs typ)
Calculates the kurtosis of a sequence, a measure of the 'tailedness' or 'peakedness' of the distribution compared to a normal distribution. Parameters: - `vs` (seq of numbers): The input sequence. - `typ` (keyword or sequence, optional): Specifies the type of kurtosis measure to calculate. Different types use different algorithms and may have different expected values under normality (e.g., 0 or 3). Defaults to `:G2`. Available `typ` values: - `:G2` (Default): Sample kurtosis based on the fourth standardized moment, as implemented by Apache Commons Math `Kurtosis`. Its value approaches 3 for a large normal sample, but the exact expected value depends on sample size. - `:g2` or `:excess`: Sample excess kurtosis. This is calculated from `:G2` and adjusted for sample bias, such that the expected value for a normal distribution is approximately 0. - `:kurt`: Kurtosis definition where normal = 3. Calculated as `:g2` + 3. - `:b2`: Kurtosis defined as fourth moment divided by standard deviation to the power of 4 - `:geary`: Geary's 'g', a robust measure calculated as `mean_abs_deviation / population_stddev`. Expected value for normal is `sqrt(2/pi) ≈ 0.798`. Lower values indicate leptokurtosis. - `:moors`: Moors' robust kurtosis measure based on octiles. The implementation returns a centered version where the expected value for normal is 0. - `:crow`: Crow-Siddiqui robust kurtosis measure based on quantiles. The implementation returns a centered version where the expected value for normal is 0. Can accept parameters `alpha` and `beta` via sequential type `[:crow alpha beta]`. - `:hogg`: Hogg's robust kurtosis measure based on trimmed means. The implementation returns a centered version where the expected value for normal is 0. Can accept parameters `alpha` and `beta` via sequential type `[:hogg alpha beta]`. - `:l-kurtosis`: L-kurtosis (τ₄), the ratio of the 4th L-moment (λ₄) to the 2nd L-moment (λ₂, L-scale). Calculated directly using [[l-moment]] with the `:ratio?` option set to true. It's a robust measure. Expected value for normal distribution is ≈ 0.1226. Interpretation (for excess kurtosis `:g2`): - Positive values indicate a leptokurtic distribution (heavier tails, more peaked than normal). - Negative values indicate a platykurtic distribution (lighter tails, flatter than normal). - Values near 0 suggest kurtosis similar to a normal distribution. Returns the calculated kurtosis value as a double. See also [[kurtosis-test]], [[bonett-seier-test]], [[normality-test]], [[jarque-bera-test]], [[l-moment]].
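A few of the variants above in use (sketch; the `stats` alias is an assumption, and the `[:hogg 0.05 0.5]` parameters are illustrative values, not recommended defaults):

```clojure
(require '[fastmath.stats :as stats])

(def xs (repeatedly 1000 rand)) ;; uniform data is platykurtic

(stats/kurtosis xs)          ;; default :G2
(stats/kurtosis xs :g2)      ;; excess kurtosis, negative here
(stats/kurtosis xs :geary)   ;; robust; ~0.798 under normality
(stats/kurtosis xs :l-kurtosis)
(stats/kurtosis xs [:hogg 0.05 0.5]) ;; parameterized robust variant
```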
(kurtosis-test xs)
(kurtosis-test xs params)
(kurtosis-test xs kurt {:keys [sides type] :or {sides :two-sided type :kurt}})
Performs a test for normality based on sample kurtosis. This test assesses the null hypothesis that the data comes from a normally distributed population by checking if the sample kurtosis significantly deviates from the kurtosis expected under normality (approximately 3). The test works by: 1. Calculating the sample kurtosis (type configurable via `:type`, default `:kurt`). 2. Standardizing the difference between the sample kurtosis and the expected kurtosis under normality using the theoretical standard error. 3. Applying a further transformation (e.g., Anscombe-Glynn/D'Agostino) to this standardized score to yield a final test statistic `Z` that more closely follows a standard normal distribution under the null hypothesis, especially for smaller sample sizes. Parameters: - `xs` (seq of numbers): The sample data. - `kurt` (double, optional): A pre-calculated kurtosis value. If omitted, it's calculated from `xs`. - `params` (map, optional): Options map: - `:sides` (keyword, default `:two-sided`): Specifies the alternative hypothesis. - `:two-sided` (default): The population kurtosis is different from normal. - `:one-sided-greater`: The population kurtosis is greater than normal (leptokurtic). - `:one-sided-less`: The population kurtosis is less than normal (platykurtic). - `:type` (keyword, default `:kurt`): The type of kurtosis to calculate if `kurt` is not provided. See [[kurtosis]] for options (e.g., `:kurt`, `:G2`, `:g2`). Returns a map containing: - `:Z`: The final test statistic, approximately standard normal under H0. - `:stat`: Alias for `:Z`. - `:p-value`: The p-value associated with `Z` and the specified `:sides`. - `:kurtosis`: The sample kurtosis value used in the test (either provided or calculated). See also [[skewness-test]], [[normality-test]], [[jarque-bera-test]], [[bonett-seier-test]].
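For instance (sketch; the `stats` alias is an assumption):

```clojure
(require '[fastmath.stats :as stats])

(def xs (repeatedly 200 rand)) ;; uniform - lighter tails than normal

(select-keys (stats/kurtosis-test xs) [:Z :p-value :kurtosis])

;; directional alternative: platykurtic (kurtosis less than normal)
(stats/kurtosis-test xs {:sides :one-sided-less})
```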
(l-moment vs order)
(l-moment vs order {:keys [s t sorted? ratio?] :or {s 0 t 0} :as opts})
Calculates L-moment, TL-moment (trimmed) or (T)L-moment ratios. Options: - `:s` (default: 0) - number of left trimmed values - `:t` (default: 0) - number of right trimmed values - `:sorted?` (default: false) - if input is already sorted - `:ratio?` (default: false) - normalized l-moment, l-moment ratio
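For example (sketch; the `stats` alias is an assumption):

```clojure
(require '[fastmath.stats :as stats])

(def xs (repeatedly 500 rand))

(stats/l-moment xs 1)                ;; first L-moment equals the mean
(stats/l-moment xs 2)                ;; L-scale
(stats/l-moment xs 4 {:ratio? true}) ;; tau_4, i.e. L-kurtosis
(stats/l-moment xs 2 {:s 1 :t 1})    ;; trimmed (TL) moment
```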
(l-variation vs)
Coefficient of L-variation (L-CV).
Count equal values in both seqs. Alias for [[count==]]
(L1 [vs1 vs2-or-val])
(L1 vs1 vs2-or-val)
Calculates the L1 distance (Manhattan or City Block distance) between two sequences or a sequence and a constant value. The L1 distance is the sum of the absolute differences between corresponding elements. Parameters: - `vs1` (sequence of numbers): The first sequence. - `vs2-or-val` (sequence of numbers or single number): The second sequence of numbers, or a single number to compare against each element of `vs1`. If both inputs are sequences, they must have the same length. If `vs2-or-val` is a single number, it is effectively treated as a sequence of that number repeated `count(vs1)` times. Returns the calculated L1 distance as a double. See also [[L2]], [[L2sq]], [[LInf]], [[mae]] (Mean Absolute Error).
(L2 [vs1 vs2-or-val])
(L2 vs1 vs2-or-val)
Calculates the L2 distance (Euclidean distance) between two sequences or a sequence and a constant value. This is the standard straight-line distance between two points (vectors) in Euclidean space. It is the square root of the [[L2sq]] distance. Parameters: - `vs1` (sequence of numbers): The first sequence. - `vs2-or-val` (sequence of numbers or single number): The second sequence of numbers, or a single number to compare against each element of `vs1`. If both inputs are sequences, they must have the same length. If `vs2-or-val` is a single number, it is effectively treated as a sequence of that number repeated `count(vs1)` times. Returns the calculated L2 distance as a double. See also [[L1]], [[L2sq]], [[LInf]], [[rmse]] (Root Mean Squared Error).
(L2sq [vs1 vs2-or-val])
(L2sq vs1 vs2-or-val)
Calculates the Squared Euclidean distance between two sequences or a sequence and a constant value. This is the sum of the squared differences between corresponding elements. It is equivalent to the [[rss]] (Residual Sum of Squares). Parameters: - `vs1` (sequence of numbers): The first sequence. - `vs2-or-val` (sequence of numbers or single number): The second sequence of numbers, or a single number to compare against each element of `vs1`. If both inputs are sequences, they must have the same length. If `vs2-or-val` is a single number, it is effectively treated as a sequence of that number repeated `count(vs1)` times. Returns the calculated Squared Euclidean distance as a double. See also [[L1]], [[L2]], [[LInf]], [[rss]] (Residual Sum of Squares), [[mse]] (Mean Squared Error).
(levene-test xss)
(levene-test xss
{:keys [sides statistic scorediff]
:or {sides :one-sided-greater statistic mean scorediff abs}})
Performs Levene's test for homogeneity of variances across two or more groups. Levene's test assesses the null hypothesis that the variances of the groups are equal. It calculates an ANOVA on the absolute deviations of the data points from their group center (mean by default). Parameters: - `xss` (sequence of sequences): A collection where each element is a sequence representing a group of observations. - `params` (map, optional): Options map with the following keys: - `:sides` (keyword, default `:one-sided-greater`): Alternative hypothesis side for the F-test. Possible values: `:one-sided-greater`, `:one-sided-less`, `:two-sided`. - `:statistic` (fn, default [[mean]]): Function to calculate the center of each group (e.g., [[mean]], [[median]]). Using [[median]] results in the Brown-Forsythe test. - `:scorediff` (fn, default [[abs]]): Function applied to the difference between each data point and its group center (e.g., [[abs]], [[sq]]). Returns a map containing: - `:W`: The Levene test statistic (which is an F-statistic). - `:stat`: Alias for `:W`. - `:p-value`: The p-value for the test. - `:df`: Degrees of freedom for the F-statistic ([DFt, DFe]). - `:n`: Sequence of sample sizes for each group. - `:SSt`: Sum of squares between groups (treatment). - `:SSe`: Sum of squares within groups (error). - `:DFt`: Degrees of freedom between groups. - `:DFe`: Degrees of freedom within groups. - `:MSt`: Mean square between groups. - `:MSe`: Mean square within groups. - `:sides`: Test side used. See also [[brown-forsythe-test]].
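For example (sketch; the `stats` alias is an assumption and the data is illustrative):

```clojure
(require '[fastmath.stats :as stats])

(def g1 [22.1 23.5 21.8 24.0 22.9])
(def g2 [19.9 26.2 18.5 27.1 20.3]) ;; visibly wider spread
(def g3 [23.0 22.7 23.3 22.9 23.1])

(select-keys (stats/levene-test [g1 g2 g3]) [:W :df :p-value])

;; median as the group center turns this into the Brown-Forsythe test
(stats/levene-test [g1 g2 g3] {:statistic stats/median})
```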
(LInf [vs1 vs2-or-val])
(LInf vs1 vs2-or-val)
Calculates the L-infinity distance (Chebyshev distance) between two sequences or a sequence and a constant value. The Chebyshev distance is the maximum absolute difference between corresponding elements. Parameters: - `vs1` (sequence of numbers): The first sequence. - `vs2-or-val` (sequence of numbers or single number): The second sequence of numbers, or a single number to compare against each element of `vs1`. If both inputs are sequences, they must have the same length. If `vs2-or-val` is a single number, it is effectively treated as a sequence of that number repeated `count(vs1)` times. Returns the calculated L-infinity distance as a double. See also [[L1]], [[L2]], [[L2sq]].
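The four norms above ([[L1]], [[L2]], [[L2sq]], [[LInf]]) side by side (sketch; the `stats` alias is an assumption, and the results follow by hand from the definitions):

```clojure
(require '[fastmath.stats :as stats])

(def xs [1.0 2.0 3.0])
(def ys [2.0 2.0 5.0])

(stats/L1   xs ys) ;; => 3.0, |1-2| + |2-2| + |3-5|
(stats/L2sq xs ys) ;; => 5.0, 1 + 0 + 4
(stats/L2   xs ys) ;; => ~2.236, sqrt of L2sq
(stats/LInf xs ys) ;; => 2.0, largest absolute difference

;; a single number is compared against every element
(stats/L1 xs 0.0)  ;; => 6.0
```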
Alias for [[median-absolute-deviation]]
(mad-extent vs)
Returns median -/+ median-absolute-deviation and the median.
(mae [vs1 vs2-or-val])
(mae vs1 vs2-or-val)
Calculates the Mean Absolute Error (MAE) between two sequences or a sequence and constant value. MAE is a measure of the difference between two sequences of values. It quantifies the average magnitude of the errors, without considering their direction. Parameters: - `vs1` (sequence of numbers): The first sequence (often the observed or true values). - `vs2-or-val` (sequence of numbers or single number): The second sequence (often the predicted or reference values), or a single number to compare against each element of `vs1`. If both inputs are sequences, they must have the same length. If `vs2-or-val` is a single number, it is effectively treated as a sequence of that number repeated `count(vs1)` times. Returns the calculated Mean Absolute Error as a double. Note: MAE is less sensitive to large outliers than metrics like Mean Squared Error (MSE) because it uses the absolute value of differences rather than the squared difference. See also [[me]] (Mean Error), [[mse]] (Mean Squared Error), [[rmse]] (Root Mean Squared Error).
(mape [vs1 vs2-or-val])
(mape vs1 vs2-or-val)
Calculates the Mean Absolute Percentage Error (MAPE) between two sequences or a sequence and a constant value. MAPE is a measure of prediction accuracy of a forecasting method, for example in time series analysis. It is calculated as the average of the absolute percentage errors. Parameters: - `vs1` (sequence of numbers): The first sequence (conventionally, the actual or true values). - `vs2-or-val` (sequence of numbers or single number): The second sequence (conventionally, the predicted or reference values), or a single number to compare against each element of `vs1`. If both inputs are sequences, they must have the same length. If `vs2-or-val` is a single number, it is effectively treated as a sequence of that number repeated `count(vs1)` times. Returns the calculated Mean Absolute Percentage Error as a double. Note: MAPE is scale-independent and useful for comparing performance across different datasets. However, it is undefined if any of the actual values (`x_i`) are zero, and can be skewed by small actual values. See also [[me]] (Mean Error), [[mae]] (Mean Absolute Error), [[mse]] (Mean Squared Error), [[rmse]] (Root Mean Squared Error).
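The error metrics from the preceding entries ([[me]], [[mae]], [[mape]]) compared on the same data (sketch; the `stats` alias is an assumption):

```clojure
(require '[fastmath.stats :as stats])

(def actual    [10.0 12.0  8.0 11.0])
(def predicted [ 9.0 13.0  8.0 10.0])

(stats/me   actual predicted) ;; signed mean error - opposite errors cancel
(stats/mae  actual predicted) ;; average error magnitude
(stats/mape actual predicted) ;; relative percentage error, scale-independent
```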
(maximum vs)
Finds the maximum value in a sequence of numbers.
(mcc ct)
(mcc group1 group2)
Calculates the Matthews Correlation Coefficient (MCC), also known as the Phi coefficient, for a 2x2 contingency table or binary classification outcomes. MCC is a measure of the quality of binary classifications. It is a balanced measure which can be used even if the classes are of very different sizes. Its value ranges from -1 to +1. - A coefficient of +1 represents a perfect prediction. - 0 represents a prediction no better than random. - -1 represents a perfect inverse prediction. The function can be called in two ways: 1. With two sequences `group1` and `group2`: The function will automatically construct a 2x2 contingency table from the unique values in the sequences (assuming they represent two binary variables). The mapping of values to table cells (e.g., what corresponds to TP, TN, FP, FN) depends on how `contingency-table` orders the unique values. For direct control over which cell is which, use the contingency table input. 2. With a contingency table: The contingency table can be provided as: - A map where keys are `[row-index, column-index]` tuples and values are counts (e.g., `{[0 0] TP, [0 1] FP, [1 0] FN, [1 1] TN}`). This is the output format of [[contingency-table]] with two inputs. - A sequence of sequences representing the rows of the table (e.g., `[[TP FP] [FN TN]]`). This is equivalent to `rows->contingency-table`. Parameters: - `group1` (sequence): The first sequence of binary outcomes/categories. - `group2` (sequence): The second sequence of binary outcomes/categories. Must have the same length as `group1`. - `contingency-table` (map or sequence of sequences): A pre-computed 2x2 contingency table. Returns the calculated Matthews Correlation Coefficient as a double. Note: The implementation uses marginal sums from the contingency table, which is mathematically equivalent to the standard formula but avoids potential division by zero in the denominator product if any marginal sum is zero. See also [[contingency-table]], [[contingency-2x2-measures]], [[binary-measures-all]].
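Both calling conventions in a sketch (the `stats` alias is an assumption; the counts are made up):

```clojure
(require '[fastmath.stats :as stats])

;; rows as [[TP FP]
;;          [FN TN]]
(stats/mcc [[90 10]
            [5  95]])
;; => strongly positive (good classifier)

;; or from two sequences of binary outcomes
(stats/mcc [1 1 0 0 1 0]
           [1 1 0 1 1 0])
```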
(me [vs1 vs2-or-val])
(me vs1 vs2-or-val)
Calculates the Mean Error (ME) between two sequences or a sequence and constant value. Parameters: - `vs1` (sequence of numbers): The first sequence. - `vs2-or-val` (sequence of numbers or single number): The second sequence of numbers, or a single number to compare against each element of `vs1`. Both sequences (`vs1` and `vs2`) must have the same length if both are sequences. If `vs2-or-val` is a single number, it is compared element-wise to `vs1`. Returns the calculated Mean Error as a double. Note: Positive ME indicates that `vs1` values tend to be greater than `vs2` values on average, while negative ME indicates `vs1` values tend to be smaller. ME can be influenced by the magnitude of errors and their signs. It does not directly measure the magnitude of the typical error due to potential cancellation of positive and negative differences. See also [[mae]] (Mean Absolute Error), [[mse]] (Mean Squared Error), [[rmse]] (Root Mean Squared Error).
(mean vs)
(mean vs weights)
Calculates the arithmetic mean (average) of a sequence `vs`. If `weights` are provided, calculates the weighted arithmetic mean. Parameters: - `vs`: Sequence of numbers. - `weights` (optional): Sequence of non-negative weights corresponding to `vs`. Must have the same count as `vs`. Returns the calculated mean as a double. See also [[geomean]], [[harmean]], [[powmean]], [[median]].
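For example (sketch; the `stats` alias is an assumption):

```clojure
(require '[fastmath.stats :as stats])

(stats/mean [1.0 2.0 3.0 4.0])
;; => 2.5

;; weighted mean - the last value counts twice as much
(stats/mean [1.0 2.0 3.0] [1.0 1.0 2.0])
;; => 2.25, (1 + 2 + 2*3) / 4
```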
(mean-absolute-deviation vs)
(mean-absolute-deviation vs center)
Calculates the Mean Absolute Deviation of a sequence `vs`. MeanAD is a measure of the variability of a univariate sample of quantitative data. It is defined as the mean of the absolute deviations from a central point, typically the data's mean. `MeanAD = mean(|X_i - center|)` Parameters: - `vs`: Sequence of numbers. - `center` (optional, double): The central point from which to calculate deviations. If `nil` or not provided, the arithmetic [[mean]] of `vs` is used as the center. Returns the calculated Mean Absolute Deviation as a double. Unlike [[median-absolute-deviation]], which uses the median of absolute deviations from the median, the Mean Absolute Deviation uses the mean of absolute deviations from the mean (or specified center). This makes it more sensitive to outliers than [[median-absolute-deviation]] but less sensitive than the standard deviation. See also [[median-absolute-deviation]], [[stddev]], [[mean]].
(means-ratio [group1 group2])
(means-ratio group1 group2)
(means-ratio group1 group2 adjusted?)
Calculates the ratio of the mean of `group1` to the mean of `group2`. This is a measure of effect size in the 'Ratio Family', comparing the central tendency of two groups multiplicatively. Parameters: - `group1` (seq of numbers): The first independent sample. The mean of this group is the numerator. - `group2` (seq of numbers): The second independent sample. The mean of this group is the denominator. - `adjusted?` (boolean, optional): If `true`, applies a small-sample bias correction to the ratio. Defaults to `false`. Returns the calculated ratio of means as a double. A value greater than 1 indicates that `group1` has a larger mean than `group2`. A value less than 1 indicates `group1` has a smaller mean. A value close to 1 indicates similar means. The `adjusted?` version attempts to provide a less biased estimate of the population mean ratio, particularly for small sample sizes, by incorporating variances into the calculation (based on Bickel and Doksum, see also [[means-ratio-corrected]]). See also [[means-ratio-corrected]] (which is equivalent to calling this with `adjusted?` set to `true`).
(means-ratio-corrected [group1 group2])
(means-ratio-corrected group1 group2)
Calculates a bias-corrected ratio of the mean of `group1` to the mean of `group2`. This function applies a correction (based on Bickel and Doksum) to the simple ratio `mean(group1) / mean(group2)` to reduce bias, particularly for small sample sizes. It is equivalent to calling `(means-ratio group1 group2 true)`. Parameters: - `group1` (seq of numbers): The first independent sample. The mean of this group is the numerator. - `group2` (seq of numbers): The second independent sample. The mean of this group is the denominator. Returns the calculated bias-corrected ratio of means as a double. See also [[means-ratio]] (for the simple, uncorrected ratio).
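Both variants of the ratio-of-means effect size in a sketch (the `stats` alias is an assumption; values are chosen so the uncorrected ratio is easy to verify):

```clojure
(require '[fastmath.stats :as stats])

(def g1 [10.0 12.0 14.0]) ;; mean 12
(def g2 [ 5.0  6.0  7.0]) ;; mean 6

(stats/means-ratio g1 g2)
;; => 2.0

;; equivalent bias-corrected calls
(stats/means-ratio g1 g2 true)
(stats/means-ratio-corrected g1 g2)
```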
(median vs)
(median vs estimation-strategy)
Calculates median of a sequence `vs`. An optional `estimation-strategy` keyword can be provided to specify the method used for estimating the quantile, particularly how interpolation is handled when the desired quantile falls between data points in the sorted sequence. Available `estimation-strategy` values: - `:legacy` (Default): The original method used in Apache Commons Math. - `:r1` through `:r9`: Correspond to the nine quantile estimation algorithms recommended by Hyndman and Fan (1996). Each strategy differs slightly in how it calculates the index (e.g., using `np` or `(n+1)p`) and how it interpolates between points. For detailed mathematical descriptions of each estimation strategy, refer to the [Apache Commons Math Percentile documentation](http://commons.apache.org/proper/commons-math/javadocs/api-3.6.1/org/apache/commons/math3/stat/descriptive/rank/Percentile.EstimationType.html). See also [[quantile]], [[median-3]]
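For example (sketch; the `stats` alias is an assumption, and `:r7` is one of the documented Hyndman-Fan strategies):

```clojure
(require '[fastmath.stats :as stats])

(stats/median [5.0 1.0 3.0 2.0 4.0])
;; => 3.0

;; with an even count the estimation strategy decides the interpolation
(stats/median [1.0 2.0 3.0 4.0])      ;; default :legacy
(stats/median [1.0 2.0 3.0 4.0] :r7)
```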
(median-3 a b c)
Median of three values. See [[median]].
(median-absolute-deviation vs)
(median-absolute-deviation vs center-or-estimation-strategy)
(median-absolute-deviation vs center estimation-strategy)
Calculates the Median Absolute Deviation (MAD) of a sequence `vs`. MAD is a robust measure of the variability of a univariate sample of quantitative data. It is defined as the median of the absolute deviations from the data's median (or a specified center). `MAD = median(|X_i - median(X)|)` Parameters: - `vs`: Sequence of numbers. - `center-or-estimation-strategy` (optional): The central point from which to calculate deviations, or an estimation strategy. If `nil` or not provided, the [[median]] of `vs` is used as the center. If a keyword is provided, it is treated as the estimation strategy for the median. - `estimation-strategy` (optional, keyword): The estimation strategy to use for calculating the median(s). This applies to the calculation of the central value (if `center` is not provided) and to the final median of the absolute deviations. See [[median]] or [[quantile]] for available strategies (e.g., `:legacy`, `:r1` through `:r9`). Returns the calculated MAD as a double. MAD is less sensitive to outliers than the standard deviation. See also [[mean-absolute-deviation]], [[stddev]], [[median]], [[quantile]].
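A short hedged sketch (alias and data are assumptions):

```clojure
(require '[fastmath.stats :as stats]) ; assumed namespace alias

;; deviations from the median 3 are [2 1 0 1 2]; their median is 1
(stats/median-absolute-deviation [1 2 3 4 5])      ;; => 1.0

;; measure deviations from an explicit center instead of the median
(stats/median-absolute-deviation [1 2 3 4 5] 0.0)  ;; median of |x - 0| => 3.0
```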
(minimum vs)
Finds the minimum value in a sequence of numbers.
(minimum-discrimination-information-test contingency-table-or-xs)
(minimum-discrimination-information-test contingency-table-or-xs params)
Minimum discrimination information test, a power divergence test for `lambda` = -1.0. Performs a power divergence test, which encompasses several common statistical tests like Chi-squared, G-test (likelihood ratio), etc., based on the lambda parameter. This function can perform either a goodness-of-fit test or a test for independence in a contingency table. Usage: 1. **Goodness-of-Fit (GOF):** - Input: `observed-counts` (sequence of numbers) and `:p` (expected probabilities/weights). - Input: `data` (sequence of numbers) and `:p` (a distribution object). In this case, a histogram of `data` is created (controlled by `:bins`) and compared against the probability mass/density of the distribution in those bins. 2. **Test for Independence:** - Input: `contingency-table` (2D sequence or map format). The `:p` option is ignored. Options map: * `:lambda` (double, default: `2/3`): Determines the specific test statistic. Common values: * `1.0`: Pearson Chi-squared test ([[chisq-test]]). * `0.0`: G-test / Multinomial Likelihood Ratio test ([[multinomial-likelihood-ratio-test]]). * `-0.5`: Freeman-Tukey test ([[freeman-tukey-test]]). * `-1.0`: Minimum Discrimination Information test ([[minimum-discrimination-information-test]]). * `-2.0`: Neyman Modified Chi-squared test ([[neyman-modified-chisq-test]]). * `2/3`: Cressie-Read test (default, [[cressie-read-test]]). * `:p` (seq of numbers or distribution): Expected probabilities/weights (for GOF with counts) or a `fastmath.random` distribution object (for GOF with data). Ignored for independence tests. * `:alpha` (double, default: `0.05`): Significance level for confidence intervals. * `:ci-sides` (keyword, default: `:two-sided`): Sides for bootstrap confidence intervals (`:two-sided`, `:one-sided-greater`, `:one-sided-less`). * `:sides` (keyword, default: `:one-sided-greater`): Alternative hypothesis side for the p-value calculation against the Chi-squared distribution (`:one-sided-greater`, `:one-sided-less`, `:two-sided`). * `:bootstrap-samples` (long, default: `1000`): Number of bootstrap samples for confidence interval estimation. * `:ddof` (long, default: `0`): Delta degrees of freedom. Adjustment subtracted from the calculated degrees of freedom. * `:bins` (number, keyword, or seq): Used only for GOF test against a distribution. Specifies the number of bins, an estimation method (see [[histogram]]), or explicit bin edges for histogram creation. Returns a map containing: - `:stat`: The calculated power divergence test statistic. - `:chi2`: Alias for `:stat`. - `:df`: Degrees of freedom for the test. - `:p-value`: The p-value associated with the test statistic. - `:n`: Total number of observations. - `:estimate`: Observed proportions. - `:expected`: Expected counts or proportions under the null hypothesis. - `:confidence-interval`: Bootstrap confidence intervals for the observed proportions. - `:lambda`, `:alpha`, `:sides`, `:ci-sides`: Input options used.
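A hedged goodness-of-fit sketch for this power divergence family (the counts, expected probabilities, and `stats` alias are made up for illustration):

```clojure
(require '[fastmath.stats :as stats]) ; assumed namespace alias

;; observed die-roll counts tested against a fair-die hypothesis
(let [{:keys [stat df p-value]}
      (stats/minimum-discrimination-information-test
       [16 18 16 14 12 12]
       {:p [1/6 1/6 1/6 1/6 1/6 1/6]})]
  ;; under the null, stat approximately follows Chi-squared(df)
  [stat df p-value])
```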
(mode vs)
(mode vs method)
(mode vs method opts)
Find the value that appears most often in a dataset `vs`. If multiple values share the same highest frequency (or estimated density/histogram peak), this function returns only the *first* one encountered during processing. The specific mode returned in case of a tie is not guaranteed to be stable. Use [[modes]] if you need all tied modes. For samples potentially drawn from a continuous distribution, several estimation methods are provided via the `method` argument: * `:histogram`: Calculates the mode based on the peak of a histogram constructed from `vs`. Uses interpolation within the bin with the highest frequency. Accepts options via `opts`, primarily `:bins` to control histogram construction (see [[histogram]]). * `:pearson`: Estimates the mode using Pearson's second skewness coefficient formula: `mode ≈ 3 * median - 2 * mean`. Accepts `:estimation-strategy` in `opts` for median calculation (see [[median]]). * `:kde`: Estimates the mode by finding the original data point in `vs` with the highest estimated probability density, based on Kernel Density Estimation (KDE). Accepts KDE options in `opts` like `:kernel`, `:bandwidth`, etc. (passed to `fastmath.kernel.density/kernel-density`). * `:default` (or when `method` is omitted): Finds the exact value that occurs most frequently in `vs`. Suitable for discrete data. The optional `opts` map provides method-specific configuration. See also [[modes]] (returns all modes) and [[wmode]] (for weighted data).
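A brief sketch of the discrete and continuous variants (alias and data are assumptions):

```clojure
(require '[fastmath.stats :as stats]) ; assumed namespace alias

(stats/mode [1 2 2 3 3 3])  ;; => 3 (most frequent exact value)

;; for continuous samples, estimate the density peak instead
(def sample (repeatedly 200 rand))        ; hypothetical continuous data
(stats/mode sample :histogram {:bins 20}) ; histogram-peak estimate
(stats/mode sample :kde {:bandwidth 0.1}) ; KDE-based estimate
```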
(modes vs)
(modes vs method)
(modes vs method opts)
Find the values that appear most often in a dataset `vs`. Returns a sequence of all values tied for the highest frequency. For the default method (discrete data), modes are sorted in increasing order. For samples potentially drawn from a continuous distribution, simply finding the most frequent exact value might not be meaningful. Several estimation methods are provided via the `method` argument: * `:histogram`: Calculates the mode(s) based on the peak(s) of a histogram constructed from `vs`. Uses interpolation within the bin(s) with the highest frequency. Accepts options via `opts`, primarily `:bins` to control histogram construction (see [[histogram]]). * `:pearson`: Estimates the mode using Pearson's second skewness coefficient formula: `mode ≈ 3 * median - 2 * mean`. Accepts `:estimation-strategy` in `opts` for median calculation (see [[median]]). Returns a single estimated mode. * `:kde`: Estimates the mode(s) by finding the original data points in `vs` with the highest estimated probability density, based on Kernel Density Estimation (KDE). Accepts KDE options in `opts` like `:kernel`, `:bandwidth`, etc. (passed to `fastmath.kernel.density/kernel-density`). * `:default` (or when `method` is omitted): Finds the exact value(s) that occur most frequently in `vs`. Suitable for discrete data. The optional `opts` map provides method-specific configuration. See also [[mode]] (returns only the first mode) and [[wmodes]] (for weighted data).
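A one-line sketch of the tie behavior (data is illustrative):

```clojure
(require '[fastmath.stats :as stats]) ; assumed namespace alias

;; 1 and 2 tie for the highest count; both are returned, in increasing order
(stats/modes [1 1 2 2 3])  ;; => (1 2)
```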
(modified-power-transformation xs)
(modified-power-transformation xs lambda)
(modified-power-transformation xs lambda alpha)
Applies a modified power transformation (Bickel and Doksum) to data.
(moment vs)
(moment vs order)
(moment vs order {:keys [absolute? center mean? normalize?] :or {mean? true}})
Calculate moment (central and/or absolute) of given order (default: 2). Additional parameters as a map: * `:absolute?` - calculate sum as absolute values (default: `false`) * `:mean?` - returns mean (proper moment) or just sum of differences (default: `true`) * `:center` - value of center (default: `nil` = mean) * `:normalize?` - apply normalization by standard deviation to the order power
(mse [vs1 vs2-or-val])
(mse vs1 vs2-or-val)
Calculates the Mean Squared Error (MSE) between two sequences or a sequence and a constant value. MSE is a measure of the quality of an estimator or predictor. It quantifies the average of the squared differences between corresponding elements of the input sequences. Parameters: - `vs1` (sequence of numbers): The first sequence (often the observed or true values). - `vs2-or-val` (sequence of numbers or single number): The second sequence (often the predicted or reference values), or a single number to compare against each element of `vs1`. If both inputs are sequences, they must have the same length. If `vs2-or-val` is a single number, it is effectively treated as a sequence of that number repeated `count(vs1)` times. Returns the calculated Mean Squared Error as a double. Note: MSE penalizes larger errors more heavily than smaller errors because the errors are squared. This makes it sensitive to outliers. It is the average of the [[rss]] (Residual Sum of Squares). Its square root is the [[rmse]]. See also [[rss]] (Residual Sum of Squares), [[rmse]] (Root Mean Squared Error), [[me]] (Mean Error), [[mae]] (Mean Absolute Error), [[r2]] (Coefficient of Determination).
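A minimal sketch (alias and data are assumptions):

```clojure
(require '[fastmath.stats :as stats]) ; assumed namespace alias

;; squared errors are [0 0 1]; their mean is 1/3
(stats/mse [1.0 2.0 3.0] [1.0 2.0 4.0])  ;; => ~0.333

;; compare a sequence against a single constant value
(stats/mse [1.0 2.0 3.0] 2.0)            ;; squared errors [1 0 1] => ~0.667
```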
(multinomial-likelihood-ratio-test contingency-table-or-xs)
(multinomial-likelihood-ratio-test contingency-table-or-xs params)
Multinomial likelihood ratio test, a power divergence test for `lambda` = 0.0. Performs a power divergence test, which encompasses several common statistical tests like Chi-squared, G-test (likelihood ratio), etc., based on the lambda parameter. This function can perform either a goodness-of-fit test or a test for independence in a contingency table. Usage: 1. **Goodness-of-Fit (GOF):** - Input: `observed-counts` (sequence of numbers) and `:p` (expected probabilities/weights). - Input: `data` (sequence of numbers) and `:p` (a distribution object). In this case, a histogram of `data` is created (controlled by `:bins`) and compared against the probability mass/density of the distribution in those bins. 2. **Test for Independence:** - Input: `contingency-table` (2D sequence or map format). The `:p` option is ignored. Options map: * `:lambda` (double, default: `2/3`): Determines the specific test statistic. Common values: * `1.0`: Pearson Chi-squared test ([[chisq-test]]). * `0.0`: G-test / Multinomial Likelihood Ratio test ([[multinomial-likelihood-ratio-test]]). * `-0.5`: Freeman-Tukey test ([[freeman-tukey-test]]). * `-1.0`: Minimum Discrimination Information test ([[minimum-discrimination-information-test]]). * `-2.0`: Neyman Modified Chi-squared test ([[neyman-modified-chisq-test]]). * `2/3`: Cressie-Read test (default, [[cressie-read-test]]). * `:p` (seq of numbers or distribution): Expected probabilities/weights (for GOF with counts) or a `fastmath.random` distribution object (for GOF with data). Ignored for independence tests. * `:alpha` (double, default: `0.05`): Significance level for confidence intervals. * `:ci-sides` (keyword, default: `:two-sided`): Sides for bootstrap confidence intervals (`:two-sided`, `:one-sided-greater`, `:one-sided-less`). * `:sides` (keyword, default: `:one-sided-greater`): Alternative hypothesis side for the p-value calculation against the Chi-squared distribution (`:one-sided-greater`, `:one-sided-less`, `:two-sided`). * `:bootstrap-samples` (long, default: `1000`): Number of bootstrap samples for confidence interval estimation. * `:ddof` (long, default: `0`): Delta degrees of freedom. Adjustment subtracted from the calculated degrees of freedom. * `:bins` (number, keyword, or seq): Used only for GOF test against a distribution. Specifies the number of bins, an estimation method (see [[histogram]]), or explicit bin edges for histogram creation. Returns a map containing: - `:stat`: The calculated power divergence test statistic. - `:chi2`: Alias for `:stat`. - `:df`: Degrees of freedom for the test. - `:p-value`: The p-value associated with the test statistic. - `:n`: Total number of observations. - `:estimate`: Observed proportions. - `:expected`: Expected counts or proportions under the null hypothesis. - `:confidence-interval`: Bootstrap confidence intervals for the observed proportions. - `:lambda`, `:alpha`, `:sides`, `:ci-sides`: Input options used.
(neyman-modified-chisq-test contingency-table-or-xs)
(neyman-modified-chisq-test contingency-table-or-xs params)
Neyman modified chi-squared test, a power divergence test for `lambda` = -2.0. Performs a power divergence test, which encompasses several common statistical tests like Chi-squared, G-test (likelihood ratio), etc., based on the lambda parameter. This function can perform either a goodness-of-fit test or a test for independence in a contingency table. Usage: 1. **Goodness-of-Fit (GOF):** - Input: `observed-counts` (sequence of numbers) and `:p` (expected probabilities/weights). - Input: `data` (sequence of numbers) and `:p` (a distribution object). In this case, a histogram of `data` is created (controlled by `:bins`) and compared against the probability mass/density of the distribution in those bins. 2. **Test for Independence:** - Input: `contingency-table` (2D sequence or map format). The `:p` option is ignored. Options map: * `:lambda` (double, default: `2/3`): Determines the specific test statistic. Common values: * `1.0`: Pearson Chi-squared test ([[chisq-test]]). * `0.0`: G-test / Multinomial Likelihood Ratio test ([[multinomial-likelihood-ratio-test]]). * `-0.5`: Freeman-Tukey test ([[freeman-tukey-test]]). * `-1.0`: Minimum Discrimination Information test ([[minimum-discrimination-information-test]]). * `-2.0`: Neyman Modified Chi-squared test ([[neyman-modified-chisq-test]]). * `2/3`: Cressie-Read test (default, [[cressie-read-test]]). * `:p` (seq of numbers or distribution): Expected probabilities/weights (for GOF with counts) or a `fastmath.random` distribution object (for GOF with data). Ignored for independence tests. * `:alpha` (double, default: `0.05`): Significance level for confidence intervals. * `:ci-sides` (keyword, default: `:two-sided`): Sides for bootstrap confidence intervals (`:two-sided`, `:one-sided-greater`, `:one-sided-less`). * `:sides` (keyword, default: `:one-sided-greater`): Alternative hypothesis side for the p-value calculation against the Chi-squared distribution (`:one-sided-greater`, `:one-sided-less`, `:two-sided`). * `:bootstrap-samples` (long, default: `1000`): Number of bootstrap samples for confidence interval estimation. * `:ddof` (long, default: `0`): Delta degrees of freedom. Adjustment subtracted from the calculated degrees of freedom. * `:bins` (number, keyword, or seq): Used only for GOF test against a distribution. Specifies the number of bins, an estimation method (see [[histogram]]), or explicit bin edges for histogram creation. Returns a map containing: - `:stat`: The calculated power divergence test statistic. - `:chi2`: Alias for `:stat`. - `:df`: Degrees of freedom for the test. - `:p-value`: The p-value associated with the test statistic. - `:n`: Total number of observations. - `:estimate`: Observed proportions. - `:expected`: Expected counts or proportions under the null hypothesis. - `:confidence-interval`: Bootstrap confidence intervals for the observed proportions. - `:lambda`, `:alpha`, `:sides`, `:ci-sides`: Input options used.
(normality-test xs)
(normality-test xs params)
(normality-test xs skew kurt {:keys [sides] :or {sides :one-sided-greater}})
Performs the D'Agostino-Pearson K² omnibus test for normality. This test combines the results of the skewness and kurtosis tests to provide an overall assessment of whether the sample data deviates from a normal distribution in terms of either asymmetry or peakedness/tailedness. The test works by: 1. Calculating a normalized test statistic (Z₁) for skewness using [[skewness-test]]. 2. Calculating a normalized test statistic (Z₂) for kurtosis using [[kurtosis-test]]. 3. Combining these into an omnibus statistic: K² = Z₁² + Z₂². 4. Under the null hypothesis that the data comes from a normal distribution, K² approximately follows a Chi-squared distribution with 2 degrees of freedom. Parameters: - `xs` (seq of numbers): The sample data. - `skew` (double, optional): A pre-calculated skewness value (type `:g1` used by default in underlying test). - `kurt` (double, optional): A pre-calculated kurtosis value (type `:kurt` used by default in underlying test). - `params` (map, optional): Options map: - `:sides` (keyword, default `:one-sided-greater`): Specifies the side(s) of the Chi-squared(2) distribution used for p-value calculation. - `:one-sided-greater` (default and standard): Tests if K² is significantly large, indicating departure from normality in skewness, kurtosis, or both. - `:one-sided-less`: Tests if the K² statistic is significantly small. - `:two-sided`: Tests if the K² statistic is extreme in either tail. Returns a map containing: - `:Z`: The calculated K² omnibus test statistic (labeled `:Z` for consistency, though it follows Chi-squared(2)). - `:stat`: Alias for `:Z`. - `:p-value`: The p-value associated with the K² statistic and `:sides`. - `:skewness`: The sample skewness value used (either provided or calculated). - `:kurtosis`: The sample kurtosis value used (either provided or calculated). See also [[skewness-test]], [[kurtosis-test]], [[jarque-bera-test]].
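A hedged sketch (the uniform sample is a deliberately non-normal example; alias and data are assumptions):

```clojure
(require '[fastmath.stats :as stats]) ; assumed namespace alias

(def xs (repeatedly 100 rand)) ; uniform data, so departures from normality are expected

(let [{:keys [stat p-value skewness kurtosis]} (stats/normality-test xs)]
  ;; a small p-value suggests rejecting the normality hypothesis
  [stat p-value skewness kurtosis])
```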
(omega-sq [group1 group2])
(omega-sq group1 group2)
(omega-sq group1 group2 degrees-of-freedom)
Calculates Omega squared (ω²), an effect size measure for the simple linear regression of `group1` on `group2`. Omega squared estimates the proportion of variance in the dependent variable (`group1`) that is accounted for by the independent variable (`group2`) in the population. It is considered a less biased alternative to [[r2-determination]]. Parameters: - `group1` (seq of numbers): The dependent variable. - `group2` (seq of numbers): The independent variable. Must have the same length as `group1`. - `degrees-of-freedom` (double, optional): The degrees of freedom for the regression model. Defaults to 1.0, which is standard for simple linear regression and used in the 2-arity version. Providing a different value allows calculating ω² for cases with multiple predictors if the sums of squares are computed for the overall model. Returns the calculated Omega squared value as a double. The value typically ranges from 0.0 to 1.0. Interpretation: - 0.0 indicates that `group2` explains none of the variance in `group1` in the population. - 1.0 indicates that `group2` perfectly explains the variance in `group1` in the population. Note: While often presented in the context of ANOVA, this implementation applies the formula to the sums of squares obtained from a simple linear regression between the two sequences. The 3-arity version allows specifying a custom degrees of freedom for regression, which might be relevant for calculating overall $\omega^2$ in multiple regression contexts (where `degrees-of-freedom` would be the number of predictors). See also [[eta-sq]] (Eta-squared, often based on $R^2$), [[epsilon-sq]] (another adjusted R²-like measure), [[r2-determination]] (R-squared).
(one-way-anova-test xss)
(one-way-anova-test xss {:keys [sides] :or {sides :one-sided-greater}})
Performs a one-way analysis of variance (ANOVA) test. ANOVA tests the null hypothesis that the means of two or more independent groups are equal. It assumes that the data within each group are normally distributed and have equal variances. Parameters: - `xss` (sequence of sequences): A collection where each element is a sequence representing a group of observations. - `params` (map, optional): Options map with the following key: - `:sides` (keyword, default `:one-sided-greater`): Alternative hypothesis side for the F-test. Possible values: `:one-sided-greater`, `:one-sided-less`, `:two-sided`. Returns a map containing: - `:F`: The F-statistic for the test. - `:stat`: Alias for `:F`. - `:p-value`: The p-value for the test. - `:df`: Degrees of freedom for the F-statistic ([DFt, DFe]). - `:n`: Sequence of sample sizes for each group. - `:SSt`: Sum of squares between groups (treatment). - `:SSe`: Sum of squares within groups (error). - `:DFt`: Degrees of freedom between groups. - `:DFe`: Degrees of freedom within groups. - `:MSt`: Mean square between groups. - `:MSe`: Mean square within groups. - `:sides`: Test side used.
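A minimal sketch with three hypothetical groups (alias and numbers are assumptions):

```clojure
(require '[fastmath.stats :as stats]) ; assumed namespace alias

(def groups [[4.2 4.8 5.1 4.9]
             [5.6 5.9 6.1 5.8]
             [4.9 5.2 5.0 5.3]])

(let [{:keys [F p-value df]} (stats/one-way-anova-test groups)]
  ;; a large F with a small p-value argues against equal group means
  [F p-value df])
```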
(outer-fence-extent vs)
(outer-fence-extent vs estimation-strategy)
Returns LOF (lower outer fence), UOF (upper outer fence) and the median.
(outliers vs)
(outliers vs estimation-strategy)
(outliers vs q1 q3)
Find outliers defined as values outside the inner fences. Let Q1 be the 25th percentile and Q3 the 75th percentile; IQR is `(- Q3 Q1)`. * LIF (Lower Inner Fence) equals `(- Q1 (* 1.5 IQR))`. * UIF (Upper Inner Fence) equals `(+ Q3 (* 1.5 IQR))`. Returns a sequence of outliers. The optional `estimation-strategy` argument changes the estimation type used for quantile calculations. See [[estimation-strategies]].
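A quick sketch (alias and data are assumptions):

```clojure
(require '[fastmath.stats :as stats]) ; assumed namespace alias

;; 100 lies far above the upper inner fence (+ Q3 (* 1.5 IQR))
(stats/outliers [1 2 3 4 5 100])  ;; => (100) under the default strategy
```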
(p-overlap [group1 group2])
(p-overlap group1 group2)
(p-overlap group1
group2
{:keys [kde bandwidth min-iterations steps]
:or {kde :gaussian min-iterations 3 steps 500}})
Calculates the overlapping index between the estimated distributions of two samples using Kernel Density Estimation (KDE). This function estimates the probability density function (PDF) for `group1` and `group2` using KDE and then calculates the area of overlap between the two estimated PDFs. The area of overlap is the integral of the minimum of the two density functions. Parameters: - `group1` (seq of numbers): The first sample. - `group2` (seq of numbers): The second sample. - `opts` (map, optional): Options map for KDE and integration: - `:kde` (keyword, default `:gaussian`): The kernel function to use for KDE. See `fastmath.kernel.density/kernel-density+` for options. - `:bandwidth` (double, optional): The bandwidth for KDE. If omitted, it is automatically estimated. - `:min-iterations` (long, default 3): Minimum number of iterations for Romberg integration. - `:steps` (long, default 500): Number of steps (subintervals) for numerical integration over the relevant range. Returns the calculated overlapping index as a double, representing the area of overlap between the two estimated distributions. A value closer to 1 indicates greater overlap, while a value closer to 0 indicates less overlap. This measure quantifies the degree to which two distributions share common values and can be seen as a measure of similarity.
(p-value stat)
(p-value distribution stat)
(p-value distribution stat sides)
Calculates the p-value for a given test statistic based on a reference probability distribution. The p-value represents the probability of observing a test statistic as extreme as, or more extreme than, the provided `stat`, assuming the null hypothesis is true (where the null hypothesis implies `stat` follows the given `distribution`). Parameters: - `distribution` (distribution object, optional): The probability distribution object (from `fastmath.random`) that the test statistic follows under the null hypothesis. Defaults to the standard normal distribution (`fastmath.random/default-normal`) if omitted. - `stat` (double): The observed value of the test statistic. - `sides` (keyword, optional): Specifies the type of alternative hypothesis and how 'extremeness' is defined. Defaults to `:two-sided`. - `:two-sided` or `:both`: Alternative hypothesis is that the true parameter is different from the null value (tests for extremeness in either tail). Calculates `2 * min(CDF(stat), CCDF(stat))` (adjusted for discrete). - `:one-sided-greater` or `:right`: Alternative hypothesis is that the true parameter is greater than the null value (tests for extremeness in the right tail). Calculates `CCDF(stat)` (adjusted for discrete). - `:one-sided-less`, `:left`, or `:one-sided`: Alternative hypothesis is that the true parameter is less than the null value (tests for extremeness in the left tail). Calculates `CDF(stat)`. Note: For discrete distributions, a continuity correction (`stat - 1` for CCDF calculations) is applied when calculating right-tail or two-tail probabilities involving the upper tail. This ensures the probability mass *at* the statistic value is correctly accounted for. Returns the calculated p-value (a double between 0.0 and 1.0).
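A hedged sketch (aliases are assumptions; the chi-squared parameter key follows the usual `fastmath.random` convention and should be checked against that namespace's docs):

```clojure
(require '[fastmath.stats :as stats]   ; assumed namespace alias
         '[fastmath.random :as r])     ; assumed alias for fastmath.random

;; standard normal by default: two-sided p-value of z = 1.96
(stats/p-value 1.96)  ;; => ~0.05

;; right-tail p-value against an explicit distribution;
;; Chi-squared(2) has its upper 5% critical value near 5.99
(stats/p-value (r/distribution :chi-squared {:degrees-of-freedom 2})
               5.99 :one-sided-greater)  ;; => ~0.05
```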
(pacf data)
(pacf data lags)
Calculates the Partial Autocorrelation Function (PACF) for a given time series `data`. The PACF measures the linear dependence between a time series and its lagged values *after removing* the effects of the intermediate lags. It helps identify the direct relationship at each lag and is used to determine the order of autoregressive (AR) components in time series models (e.g., ARIMA). Parameters: * `data` (seq of numbers): The time series data. * `lags` (long, optional): The maximum lag for which to calculate the PACF. If omitted, calculates PACF for lags from 0 up to `(dec (count data))`. Returns a sequence of doubles representing the partial autocorrelation coefficients for the specified lags. The value at lag 0 is always 0.0. See also [[acf]], [[acf-ci]], [[pacf-ci]].
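A short sketch on synthetic AR(1)-like data (alias and generator are assumptions):

```clojure
(require '[fastmath.stats :as stats]) ; assumed namespace alias

;; x_t = 0.7 * x_{t-1} + noise
(def series
  (reductions (fn [x _] (+ (* 0.7 x) (- (rand) 0.5))) 0.0 (range 200)))

;; 11 coefficients: lag 0 (always 0.0) through lag 10;
;; for AR(1) data only the lag-1 coefficient should stand out
(stats/pacf series 10)
```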
(pacf-ci data)
(pacf-ci data lags)
(pacf-ci data lags alpha)
Calculates the Partial Autocorrelation Function (PACF) for a time series and provides approximate confidence intervals. This function computes the PACF of the input time series `data` for specified lags (see [[pacf]]) and includes approximate confidence intervals around the PACF estimates. These intervals help determine whether the partial autocorrelation at a specific lag is statistically significant (i.e., likely non-zero in the population). Parameters: * `data` (seq of numbers): The time series data. * `lags` (long, optional): The maximum lag for which to calculate the PACF and CI. If omitted, calculates for lags up to `(dec (count data))`. * `alpha` (double, optional): The significance level for the confidence intervals. Defaults to `0.05` (for a 95% CI). Returns a map containing: * `:ci` (double): The value of the approximate standard confidence interval bound for lags > 0. If the absolute value of a PACF coefficient at lag `k > 0` exceeds this value, it is considered statistically significant. * `:pacf` (seq of doubles): The sequence of partial autocorrelation coefficients at lags from 0 up to `lags` (calculated using [[pacf]]). See also [[pacf]], [[acf]], [[acf-ci]].
(pearson-correlation [vs1 vs2])
(pearson-correlation vs1 vs2)
Calculates the Pearson product-moment correlation coefficient between two sequences. This function measures the linear relationship between two datasets. The coefficient value ranges from -1.0 (perfect negative linear correlation) to 1.0 (perfect positive linear correlation), with 0.0 indicating no linear correlation. Parameters: - `[vs1 vs2]` (sequence of two sequences): A sequence containing the two sequences of numbers. - `vs1`, `vs2` (sequences): The two sequences of numbers directly as arguments. Both input sequences must contain only numbers and must have the same length. Returns the calculated Pearson correlation coefficient as a double. Returns `NaN` if either sequence has zero variance (i.e., all elements are the same). See also [[correlation]] (general correlation, defaults to Pearson), [[spearman-correlation]], [[kendall-correlation]], [[correlation-matrix]].
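A minimal sketch (alias and data are assumptions):

```clojure
(require '[fastmath.stats :as stats]) ; assumed namespace alias

(stats/pearson-correlation [1 2 3 4] [2 4 6 8])  ;; => 1.0  (perfect positive linear)
(stats/pearson-correlation [1 2 3 4] [8 6 4 2])  ;; => -1.0 (perfect negative linear)
(stats/pearson-correlation [1 2 3 4] [5 5 5 5])  ;; => NaN  (zero-variance input)
```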
(pearson-r [group1 group2])
(pearson-r group1 group2)
Calculates the Pearson `r` correlation coefficient between two sequences. This function is an alias for [[pearson-correlation]]. See [[pearson-correlation]] for detailed documentation, parameters, and usage examples.
(percentile vs p)
(percentile vs p estimation-strategy)
Calculates the p-th percentile of a sequence `vs`. The percentile `p` is a value between 0 and 100, inclusive. An optional `estimation-strategy` keyword can be provided to specify the method used for estimating the percentile, particularly how interpolation is handled when the desired percentile falls between data points in the sorted sequence. Available `estimation-strategy` values: - `:legacy` (Default): The original method used in Apache Commons Math. - `:r1` through `:r9`: Correspond to the nine quantile estimation algorithms recommended by Hyndman and Fan (1996). Each strategy differs slightly in how it calculates the index (e.g., using `np` or `(n+1)p`) and how it interpolates between points. For detailed mathematical descriptions of each estimation strategy, refer to the [Apache Commons Math Percentile documentation](http://commons.apache.org/proper/commons-math/javadocs/api-3.6.1/org/apache/commons/math3/stat/descriptive/rank/Percentile.EstimationType.html). See also [[quantile]] (which uses a 0.0-1.0 range) and [[percentiles]].
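A brief sketch (alias and data are assumptions):

```clojure
(require '[fastmath.stats :as stats]) ; assumed namespace alias

(def vs [1 2 3 4 5 6 7 8 9 10])

(stats/percentile vs 25)      ; first quartile, default :legacy strategy
(stats/percentile vs 25 :r7)  ; same quartile under R's default type-7 rule
```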
(percentile-bc-extent vs)
(percentile-bc-extent vs p)
(percentile-bc-extent vs p1 p2)
(percentile-bc-extent vs p1 p2 estimation-strategy)
Return bias corrected percentile range and mean for bootstrap samples. See https://projecteuclid.org/euclid.ss/1032280214 `p` - calculates extent of bias corrected `p` and `100-p` (default: `p=2.5`) Set `estimation-strategy` to `:r7` to get the same result as in R `coxed::bca`.
(percentile-bca-extent vs)
(percentile-bca-extent vs p)
(percentile-bca-extent vs p1 p2)
(percentile-bca-extent vs p1 p2 estimation-strategy)
(percentile-bca-extent vs p1 p2 accel estimation-strategy)
Return bias corrected percentile range and mean for bootstrap samples. Also accounts for variance variations through the acceleration parameter. See https://projecteuclid.org/euclid.ss/1032280214 `p` - calculates extent of bias corrected `p` and `100-p` (default: `p=2.5`) Set `estimation-strategy` to `:r7` to get the same result as in R `coxed::bca`.
(percentile-extent vs)
(percentile-extent vs p)
(percentile-extent vs p1 p2)
(percentile-extent vs p1 p2 estimation-strategy)
Return percentile range and median. `p` - calculates extent of `p` and `100-p` (default: `p=25`)
(percentiles vs)
(percentiles vs ps)
(percentiles vs ps estimation-strategy)
Calculates the sequence of p-th percentiles of a sequence `vs`. Percentiles `ps` is a sequence of values between 0 and 100, inclusive. An optional `estimation-strategy` keyword can be provided to specify the method used for estimating the percentile, particularly how interpolation is handled when the desired percentile falls between data points in the sorted sequence. Available `estimation-strategy` values: - `:legacy` (Default): The original method used in Apache Commons Math. - `:r1` through `:r9`: Correspond to the nine quantile estimation algorithms recommended by Hyndman and Fan (1996). Each strategy differs slightly in how it calculates the index (e.g., using `np` or `(n+1)p`) and how it interpolates between points. For detailed mathematical descriptions of each estimation strategy, refer to the [Apache Commons Math Percentile documentation](http://commons.apache.org/proper/commons-math/javadocs/api-3.6.1/org/apache/commons/math3/stat/descriptive/rank/Percentile.EstimationType.html). See also [[quantiles]] (which uses a 0.0-1.0 range) and [[percentile]].
(pi vs)
(pi vs size)
(pi vs size estimation-strategy)
Returns the prediction interval (PI) as a map, with quantile intervals based on the interval `size`. Quantiles are `(1-size)/2` and `1-(1-size)/2`.
(pi-extent vs)
(pi-extent vs size)
(pi-extent vs size estimation-strategy)
Returns the prediction interval (PI) extent, quantile intervals based on the interval `size`, plus the median. Quantiles are `(1-size)/2` and `1-(1-size)/2`.
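A brief usage sketch (the interval size 0.89 is an arbitrary illustrative choice):

```clojure
(require '[fastmath.stats :as stats])

(def vs (range 1 101))

;; 89% interval: quantiles at (1-0.89)/2 = 0.055 and 0.945
(stats/pi vs 0.89)

;; Same interval, returned as an extent together with the median
(stats/pi-extent vs 0.89)
```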
(pooled-mad groups)
(pooled-mad groups const)
Calculate pooled median absolute deviation for samples. `const` is a scaling constant, which defaults to approximately 1.4826.
(pooled-stddev groups)
(pooled-stddev groups method)
Calculate pooled standard deviation for samples using the given method. Methods: * `:unbiased` - sqrt of weighted average of variances (default) * `:biased` - biased version of `:unbiased`, no count correction. * `:avg` - sqrt of average of variances
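A minimal sketch comparing the three methods (group data is illustrative):

```clojure
(require '[fastmath.stats :as stats])

(def groups [[1.0 2.0 3.0 4.0] [2.0 4.0 6.0] [5.0 5.5 6.0 6.5 7.0]])

(stats/pooled-stddev groups)          ;; :unbiased, weighted by group sizes
(stats/pooled-stddev groups :biased)  ;; same but without count correction
(stats/pooled-stddev groups :avg)     ;; sqrt of the plain average of variances
```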
(pooled-variance groups)
(pooled-variance groups method)
Calculate pooled variance for samples using the given method. Methods: * `:unbiased` - weighted average of variances (default) * `:biased` - biased version of `:unbiased`, no count correction. * `:avg` - average of variances
(population-stddev vs)
(population-stddev vs mu)
Calculate population standard deviation of `vs`. See [[stddev]].
(population-variance vs)
(population-variance vs mu)
Calculate population variance of `vs`. See [[variance]].
(population-wstddev vs weights)
Calculate population weighted standard deviation of `vs`.
(population-wvariance vs freqs)
Calculate weighted population variance of `vs`.
(power-divergence-test contingency-table-or-xs)
(power-divergence-test contingency-table-or-xs
{:keys [lambda ci-sides sides p alpha bootstrap-samples
ddof bins]
:or {lambda m/TWO_THIRD
sides :one-sided-greater
ci-sides :two-sided
alpha 0.05
bootstrap-samples 1000
ddof 0}})
Performs a power divergence test, which encompasses several common statistical tests like Chi-squared, G-test (likelihood ratio), etc., based on the lambda parameter. This function can perform either a goodness-of-fit test or a test for independence in a contingency table. Usage: 1. **Goodness-of-Fit (GOF):** - Input: `observed-counts` (sequence of numbers) and `:p` (expected probabilities/weights). - Input: `data` (sequence of numbers) and `:p` (a distribution object). In this case, a histogram of `data` is created (controlled by `:bins`) and compared against the probability mass/density of the distribution in those bins. 2. **Test for Independence:** - Input: `contingency-table` (2D sequence or map format). The `:p` option is ignored. Options map: * `:lambda` (double, default: `2/3`): Determines the specific test statistic. Common values: * `1.0`: Pearson Chi-squared test ([[chisq-test]]). * `0.0`: G-test / Multinomial Likelihood Ratio test ([[multinomial-likelihood-ratio-test]]). * `-0.5`: Freeman-Tukey test ([[freeman-tukey-test]]). * `-1.0`: Minimum Discrimination Information test ([[minimum-discrimination-information-test]]). * `-2.0`: Neyman Modified Chi-squared test ([[neyman-modified-chisq-test]]). * `2/3`: Cressie-Read test (default, [[cressie-read-test]]). * `:p` (seq of numbers or distribution): Expected probabilities/weights (for GOF with counts) or a `fastmath.random` distribution object (for GOF with data). Ignored for independence tests. * `:alpha` (double, default: `0.05`): Significance level for confidence intervals. * `:ci-sides` (keyword, default: `:two-sided`): Sides for bootstrap confidence intervals (`:two-sided`, `:one-sided-greater`, `:one-sided-less`). * `:sides` (keyword, default: `:one-sided-greater`): Alternative hypothesis side for the p-value calculation against the Chi-squared distribution (`:one-sided-greater`, `:one-sided-less`, `:two-sided`). * `:bootstrap-samples` (long, default: `1000`): Number of bootstrap samples for confidence interval estimation. * `:ddof` (long, default: `0`): Delta degrees of freedom. Adjustment subtracted from the calculated degrees of freedom. * `:bins` (number, keyword, or seq): Used only for GOF test against a distribution. Specifies the number of bins, an estimation method (see [[histogram]]), or explicit bin edges for histogram creation. Returns a map containing: - `:stat`: The calculated power divergence test statistic. - `:chi2`: Alias for `:stat`. - `:df`: Degrees of freedom for the test. - `:p-value`: The p-value associated with the test statistic. - `:n`: Total number of observations. - `:estimate`: Observed proportions. - `:expected`: Expected counts or proportions under the null hypothesis. - `:confidence-interval`: Bootstrap confidence intervals for the observed proportions. - `:lambda`, `:alpha`, `:sides`, `:ci-sides`: Input options used.
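A goodness-of-fit sketch under assumed illustrative counts; the `:lambda` values follow the table above:

```clojure
(require '[fastmath.stats :as stats])

;; Goodness-of-fit: observed die-roll counts against a fair-die hypothesis
(def observed [18 24 23 17 22 16])
(def fair (repeat 6 1/6))

;; Default Cressie-Read statistic (lambda = 2/3)
(stats/power-divergence-test observed {:p fair})

;; Pearson Chi-squared (lambda = 1.0) and G-test (lambda = 0.0) variants
(:p-value (stats/power-divergence-test observed {:lambda 1.0 :p fair}))
(:p-value (stats/power-divergence-test observed {:lambda 0.0 :p fair}))
```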
(power-transformation xs)
(power-transformation xs lambda)
(power-transformation xs lambda alpha)
Applies a power transformation to data.
(powmean vs power)
(powmean vs weights power)
Calculates the generalized power mean (also known as the Hölder mean) of a sequence `vs`. The power mean is a generalization of the Pythagorean means (arithmetic, geometric, harmonic) and other means like the quadratic mean (RMS). It is defined for a non-zero real number `power`. Parameters: - `vs`: Sequence of numbers. Constraints depend on the `power`: - For `power > 0`, values should be non-negative. - For `power = 0`, values must be positive (reduces to geometric mean). - For `power < 0`, values must be positive and non-zero. - `weights` (optional): Sequence of non-negative weights corresponding to `vs`. Must have the same count as `vs`. - `power` (double): The exponent defining the mean. Special Cases: - `power = 0`: Returns the [[geomean]]. - `power = 1`: Returns the arithmetic [[mean]]. - `power = -1`: Equivalent to the [[harmean]]. (Handled by the general formula) - `power = 2`: Returns the Root Mean Square (RMS) or quadratic mean. - `power = inf`: Returns maximum. - `power = -inf`: Returns minimum. - The implementation includes optimized paths for `power` values 1/3, 0.5, 2, and 3. Returns the calculated power mean as a double. See also [[mean]], [[geomean]], [[harmean]].
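The special cases above can be checked directly; a minimal sketch with illustrative data:

```clojure
(require '[fastmath.stats :as stats])

(def vs [1.0 2.0 4.0 8.0])

(stats/powmean vs 1.0)   ;; arithmetic mean => 3.75
(stats/powmean vs 0.0)   ;; geometric mean => 64^(1/4) ≈ 2.828
(stats/powmean vs -1.0)  ;; harmonic mean => 32/15 ≈ 2.133
(stats/powmean vs 2.0)   ;; quadratic mean (RMS)
```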
(psnr [vs1 vs2-or-val])
(psnr vs1 vs2-or-val)
(psnr vs1 vs2-or-val max-value)
Peak Signal-to-Noise Ratio (PSNR). PSNR is a measure used to quantify the quality of reconstruction of lossy compression codecs (e.g., for images or video). It is calculated using the Mean Squared Error (MSE) between the original and compressed images/signals. A higher PSNR generally indicates a higher quality signal reconstruction (i.e., less distortion). Parameters: - `vs1` (sequence of numbers): The first sequence (conventionally, the original or reference signal/data). - `vs2-or-val` (sequence of numbers or single number): The second sequence (conventionally, the reconstructed or noisy signal/data), or a single number to compare against each element of `vs1`. - `max-value` (optional, double): The maximum possible value of a sample in the data. If not provided, the function automatically determines the maximum value present across both input sequences (`vs1` and `vs2` if a sequence, or `vs1` and the scalar value if `vs2-or-val` is a number). Providing an explicit `max-value` is often more appropriate based on the data type's theoretical maximum range (e.g., 255 for 8-bit). If `vs2-or-val` is a sequence, both `vs1` and `vs2` must have the same length. Returns the calculated Peak Signal-to-Noise Ratio as a double. Returns `-Double/Infinity` if the MSE is zero (perfect match). Returns `NaN` if MSE is non-positive. See also [[mse]], [[rmse]].
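A minimal sketch (the 8-bit samples are illustrative):

```clojure
(require '[fastmath.stats :as stats])

;; Hypothetical 8-bit signal and a slightly degraded reconstruction
(def original [52.0 55.0 61.0 66.0 70.0 61.0 64.0 73.0])
(def degraded [51.0 56.0 60.0 67.0 69.0 63.0 64.0 72.0])

;; Peak value inferred from the data...
(stats/psnr original degraded)

;; ...or fixed to the theoretical 8-bit maximum
(stats/psnr original degraded 255.0)
```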
(quantile vs q)
(quantile vs q estimation-strategy)
Calculates the q-th quantile of a sequence `vs`. The quantile `q` is a value between 0.0 and 1.0, inclusive. An optional `estimation-strategy` keyword can be provided to specify the method used for estimating the quantile, particularly how interpolation is handled when the desired quantile falls between data points in the sorted sequence. Available `estimation-strategy` values: - `:legacy` (Default): The original method used in Apache Commons Math. - `:r1` through `:r9`: Correspond to the nine quantile estimation algorithms recommended by Hyndman and Fan (1996). Each strategy differs slightly in how it calculates the index (e.g., using `np` or `(n+1)p`) and how it interpolates between points. For detailed mathematical descriptions of each estimation strategy, refer to the [Apache Commons Math Percentile documentation](http://commons.apache.org/proper/commons-math/javadocs/api-3.6.1/org/apache/commons/math3/stat/descriptive/rank/Percentile.EstimationType.html). See also [[percentile]] (which uses a 0-100 range) and [[quantiles]].
(quantile-extent vs)
(quantile-extent vs q)
(quantile-extent vs q1 q2)
(quantile-extent vs q1 q2 estimation-strategy)
Return quantile range and median. `q` - calculates extent of `q` and `1.0-q` (default: `q=0.25`)
(quantiles vs)
(quantiles vs qs)
(quantiles vs qs estimation-strategy)
Calculates the sequence of q-th quantiles of a sequence `vs`. Quantiles `qs` is a sequence of values between 0.0 and 1.0, inclusive. An optional `estimation-strategy` keyword can be provided to specify the method used for estimating the quantile, particularly how interpolation is handled when the desired quantile falls between data points in the sorted sequence. Available `estimation-strategy` values: - `:legacy` (Default): The original method used in Apache Commons Math. - `:r1` through `:r9`: Correspond to the nine quantile estimation algorithms recommended by Hyndman and Fan (1996). Each strategy differs slightly in how it calculates the index (e.g., using `np` or `(n+1)p`) and how it interpolates between points. For detailed mathematical descriptions of each estimation strategy, refer to the [Apache Commons Math Percentile documentation](http://commons.apache.org/proper/commons-math/javadocs/api-3.6.1/org/apache/commons/math3/stat/descriptive/rank/Percentile.EstimationType.html). See also [[percentiles]] (which uses a 0-100 range) and [[quantile]].
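A short sketch showing the 0-1 versus 0-100 scales (illustrative data):

```clojure
(require '[fastmath.stats :as stats])

(def vs [1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0])

;; quantiles works on the 0.0-1.0 scale...
(stats/quantiles vs [0.25 0.5 0.75] :r7)

;; ...and agrees with percentiles on the 0-100 scale
(stats/percentiles vs [25 50 75] :r7)
```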
(r2 [vs1 vs2-or-val])
(r2 vs1 vs2-or-val)
(r2 vs1 vs2-or-val no-of-variables)
Calculates the Coefficient of Determination ($R^2$) or adjusted version between two sequences or a sequence and a constant value. $R^2$ is a statistical measure that represents the proportion of the variance in the dependent variable that is predictable from the independent variable(s) in a statistical model. It indicates how well the model fits the observed data. The standard $R^2$ is calculated as $1 - (RSS / TSS)$, where: - $RSS$ (Residual Sum of Squares) is the sum of the squared differences between the observed values (`vs1`) and the predicted/reference values (`vs2` or `vs2-or-val`). See [[rss]]. - $TSS$ (Total Sum of Squares) is the sum of the squared differences between the observed values (`vs1`) and their mean. This is calculated using [[moment]] of order 2 with `:mean?` set to `false`. This function has two arities: 1. `(r2 vs1 vs2-or-val)`: Calculates the standard $R^2$. - `vs1` (seq of numbers): The sequence of observed or actual values. - `vs2-or-val` (seq of numbers or single number): The sequence of predicted or reference values, or a single constant value to compare against. Returns the calculated standard $R^2$ as a double. For simple linear regression, this is equal to the square of the Pearson correlation coefficient ([[r2-determination]]). $R^2$ typically ranges from 0 to 1 in this context, but can be negative if the chosen model fits the data worse than a horizontal line through the mean of the observed data. 2. `(r2 vs1 vs2-or-val no-of-variables)`: Calculates the **Adjusted $R^2$**. The adjusted $R^2$ is a modified version of $R^2$ that has been adjusted for the number of predictors in the model. It increases only if the new term improves the model more than would be expected by chance. The formula for adjusted $R^2$ is: $$ R^2_{adj} = 1 - (1 - R^2) \frac{n-1}{n-p-1} $$ where $n$ is the number of observations (length of `vs1`) and $p$ is the number of independent variables (`no-of-variables`). - `vs1` (seq of numbers): The sequence of observed or actual values. - `vs2-or-val` (seq of numbers or single number): The sequence of predicted or reference values, or a single constant value to compare against. - `no-of-variables` (double): The number of independent variables ($p$) used in the model that produced the `vs2-or-val` predictions. Returns the calculated adjusted $R^2$ as a double. Both `vs1` and `vs2` (if `vs2-or-val` is a sequence) must have the same length. See also [[rss]], [[mse]], [[rmse]], [[pearson-correlation]], [[r2-determination]].
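A minimal sketch of both arities (the observed/predicted pairs are illustrative):

```clojure
(require '[fastmath.stats :as stats])

(def observed  [2.1 3.9 6.2 8.0 9.8])
(def predicted [2.0 4.0 6.0 8.0 10.0])

;; Standard R^2
(stats/r2 observed predicted)

;; Adjusted R^2 for a model with one predictor (p = 1)
(stats/r2 observed predicted 1)
```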
(r2-determination [group1 group2])
(r2-determination group1 group2)
Calculates the Coefficient of Determination ($R^2$) between two sequences. This function computes the square of the Pearson product-moment correlation coefficient ([[pearson-correlation]]) between `group1` and `group2`. $R^2$ measures the proportion of the variance in one variable that is predictable from the other variable in a linear relationship. For a simple linear regression with one independent variable, this value is equivalent to the $R^2$ calculated from the Residual Sum of Squares (RSS) and Total Sum of Squares (TSS). Parameters: - `group1` (seq of numbers): The first sequence. - `group2` (seq of numbers): The second sequence. Both sequences must have the same length. Returns the calculated $R^2$ value (a double between 0.0 and 1.0) as a double. Returns `NaN` if the Pearson correlation cannot be calculated (e.g., one sequence is constant). See also [[r2]] (for general $R^2$ and adjusted $R^2$), [[pearson-correlation]].
(rank-epsilon-sq xss)
Calculates Rank Epsilon-squared (ε²), a measure of effect size for the Kruskal-Wallis H-test. Rank Epsilon-squared is a non-parametric measure quantifying the proportion of the total variability (based on ranks) in the dependent variable that is associated with group membership (the independent variable). It is analogous to Eta-squared or Epsilon-squared in one-way ANOVA but used for the rank-based Kruskal-Wallis test. This function calculates Epsilon-squared based on the Kruskal-Wallis H statistic (`H`) and the total number of observations (`n`) across all groups. Parameters: - `xss` (sequence of sequences): A collection where each element is a sequence representing a group of observations, as used in [[kruskal-test]]. Returns the calculated Rank Epsilon-squared value as a double, ranging from 0 to 1. Interpretation: - A value of 0 indicates no difference in the distributions across groups. - A value closer to 1 indicates that a large proportion of the variability is due to differences between group ranks. Rank Epsilon-squared is a useful supplement to the Kruskal-Wallis test, providing a measure of the magnitude of the group effect that is not sensitive to assumptions about the data distribution shape (beyond having similar shapes for valid interpretation of the Kruskal-Wallis test itself). See also [[kruskal-test]], [[rank-eta-sq]] (another rank-based effect size).
(rank-eta-sq xss)
Calculates the Rank Eta-squared (η²), an effect size measure for the Kruskal-Wallis H-test. Rank Eta-squared is a non-parametric measure quantifying the proportion of the total variability (based on ranks) in the dependent variable that is associated with group membership (the independent variable). It is analogous to Eta-squared in one-way ANOVA but used for the rank-based Kruskal-Wallis test. The statistic is calculated based on the Kruskal-Wallis H statistic, the number of groups (`k`), and the total number of observations (`n`). Parameters: - `xss` (sequence of sequences): A collection where each element is a sequence representing a group of observations, as used in [[kruskal-test]]. Returns the calculated Rank Eta-squared value as a double, ranging from 0 to 1. Interpretation: - A value of 0 indicates no difference in the distributions across groups (all variability is within groups). - A value closer to 1 indicates that a large proportion of the variability is due to differences between group ranks. Rank Eta-squared is a useful supplement to the Kruskal-Wallis test, providing a measure of the magnitude of the group effect that is not sensitive to assumptions about the data distribution shape (beyond having similar shapes for valid interpretation of the Kruskal-Wallis test itself). See also [[kruskal-test]], [[rank-epsilon-sq]] (another rank-based effect size).
(remove-outliers vs)
(remove-outliers vs estimation-strategy)
(remove-outliers vs q1 q3)
Remove outliers defined as values outside inner fences. Let Q1 be the 25th percentile and Q3 the 75th percentile; IQR is `(- Q3 Q1)`. * LIF (Lower Inner Fence) equals `(- Q1 (* 1.5 IQR))`. * UIF (Upper Inner Fence) equals `(+ Q3 (* 1.5 IQR))`. Returns a sequence without outliers. Optional `estimation-strategy` argument can be set to change the quantile estimation type. See [[estimation-strategies]].
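A minimal sketch (illustrative data with one obvious outlier):

```clojure
(require '[fastmath.stats :as stats])

(def vs [1 2 3 4 5 6 7 8 9 100])

;; 100 sits far above the upper inner fence, so it is removed
(stats/remove-outliers vs)

;; Fence positions depend on how Q1/Q3 are estimated
(stats/remove-outliers vs :r7)
```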
(rescale vs)
(rescale vs low high)
Linearly rescale data to a desired range, `[0,1]` by default.
(rmse [vs1 vs2-or-val])
(rmse vs1 vs2-or-val)
Calculates the Root Mean Squared Error (RMSE) between two sequences or a sequence and a constant value. RMSE is the square root of the [[mse]] (Mean Squared Error). It represents the standard deviation of the residuals (prediction errors) and has the same units as the original data, making it more interpretable than MSE. It measures the average magnitude of the errors, penalizing larger errors more than smaller ones due to the squaring involved. Parameters: - `vs1` (sequence of numbers): The first sequence (often the observed or true values). - `vs2-or-val` (sequence of numbers or single number): The second sequence (often the predicted or reference values), or a single number to compare against each element of `vs1`. If both inputs are sequences, they must have the same length. If `vs2-or-val` is a single number, it is effectively treated as a sequence of that number repeated `count(vs1)` times. Returns the calculated Root Mean Squared Error as a double. See also [[mse]] (Mean Squared Error), [[rss]] (Residual Sum of Squares), [[me]] (Mean Error), [[mae]] (Mean Absolute Error), [[r2]] (Coefficient of Determination).
(robust-standardize vs)
(robust-standardize vs q)
Normalize samples to have median = 0 and MAD = 1. If `q` argument is used, scaling is done by quantile difference (Q_q, Q_(1-q)). Set 0.25 for IQR.
(rows->contingency-table xss)
Converts a sequence of sequences (representing rows of counts) into a map-based contingency table. This function takes a collection where each inner sequence is treated as a row of counts in a grid or matrix. It transforms this matrix representation into a map where keys are `[row-index, column-index]` tuples and values are the non-zero counts at that intersection. This is particularly useful for converting structured count data, like the output of some grouping or tabulation processes, into a format suitable for functions expecting a contingency table map (like `contingency-table->marginals` or chi-squared tests). Parameters: - `xss` (sequence of sequences of numbers): A collection where each inner sequence `xs_i` contains counts for row `i`. Values within `xs_i` are interpreted as counts for columns `0, 1, ...`. Returns a map where keys are `[row-index, column-index]` vectors and values are the corresponding non-zero counts from the input matrix. Zero counts are omitted from the output map. See also [[contingency-table]] (for building tables from raw data), [[contingency-table->marginals]].
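A minimal sketch; the expected shape of the result follows the description above (entry order in the map may vary):

```clojure
(require '[fastmath.stats :as stats])

;; Rows of counts; the zero cell is omitted from the resulting map
(stats/rows->contingency-table [[10 20] [30 0]])
;; => {[0 0] 10, [0 1] 20, [1 0] 30}
```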
(rss [vs1 vs2-or-val])
(rss vs1 vs2-or-val)
Calculates the Residual Sum of Squares (RSS) between two sequences or a sequence and a constant value. RSS is a measure of the discrepancy between data and a model, often used in regression analysis to quantify the total squared difference between observed values and predicted (or reference) values. Parameters: - `vs1` (sequence of numbers): The first sequence (often observed values). - `vs2-or-val` (sequence of numbers or single number): The second sequence (often predicted or reference values), or a single number to compare against each element of `vs1`. If both sequences (`vs1` and `vs2`) are provided, they must have the same length. If `vs2-or-val` is a single number, it is effectively treated as a sequence of that number repeated `count(vs1)` times. Returns the calculated Residual Sum of Squares as a double. See also [[mse]] (Mean Squared Error), [[rmse]] (Root Mean Squared Error), [[r2]] (Coefficient of Determination).
(sem vs)
Calculates the Standard Error of the Mean (SEM) for a sequence `vs`. The SEM estimates the standard deviation of the sample mean, providing an indication of how accurately the sample mean represents the population mean. It is calculated as: `SEM = stddev(vs) / sqrt(count(vs))` where `stddev(vs)` is the sample standard deviation and `count(vs)` is the sample size. Parameters: - `vs`: Sequence of numbers. Returns the calculated SEM as a double. A smaller SEM indicates that the sample mean is likely to be a more precise estimate of the population mean. See also [[stddev]], [[mean]].
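The formula can be verified directly; a minimal sketch with illustrative data:

```clojure
(require '[fastmath.stats :as stats])

(def vs [1.0 2.0 3.0 4.0 5.0])

(stats/sem vs)
;; equivalent to stddev / sqrt(n): here sqrt(2.5) / sqrt(5) ≈ 0.7071
(/ (stats/stddev vs) (Math/sqrt (count vs)))
```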
(similarity method P-observed Q-expected)
(similarity method
P-observed
Q-expected
{:keys [bins probabilities? epsilon]
:or {probabilities? true epsilon 1.0E-6}})
Various PDF similarities between two histograms (frequencies) or probabilities. Q can be a distribution object; in that case, a histogram is created from P. Arguments: * `method` - similarity method * `P-observed` - frequencies, probabilities or actual data (when Q is a distribution) * `Q-expected` - frequencies, probabilities or distribution object (when P is data) Options: * `:probabilities?` - should P/Q be converted to probabilities, default: `true`. * `:epsilon` - small number which replaces `0.0` when division or logarithm is used. * `:bins` - number of bins or bins estimation method, see [[histogram]]. The list of methods: `:intersection`, `:czekanowski`, `:motyka`, `:kulczynski`, `:ruzicka`, `:inner-product`, `:harmonic-mean`, `:cosine`, `:jaccard`, `:dice`, `:fidelity`, `:squared-chord`. See more: Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions by Sung-Hyuk Cha
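A minimal sketch (the frequencies are illustrative; only methods and options documented above are used):

```clojure
(require '[fastmath.stats :as stats])

(def P [10 20 30 40])  ;; observed frequencies
(def Q [15 15 35 35])  ;; expected frequencies

;; Frequencies are normalized to probabilities by default
(stats/similarity :cosine P Q)
(stats/similarity :jaccard P Q)

;; Compare the raw counts instead
(stats/similarity :intersection P Q {:probabilities? false})
```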
(skewness vs)
(skewness vs typ)
Calculate skewness from sequence, a measure of the asymmetry of the probability distribution about its mean. Parameters: - `vs` (seq of numbers): The input sequence. - `typ` (keyword or sequence, optional): Specifies the type of skewness measure to calculate. Defaults to `:G1`. Available `typ` values: - `:G1` (Default): Sample skewness based on the third standardized moment, as implemented by Apache Commons Math `Skewness`. Adjusted for sample size bias. - `:g1` or `:pearson`: Pearson's moment coefficient of skewness (g1), a bias-adjusted version of the third standardized moment. Expected value 0 for symmetric distributions. - `:b1`: Sample skewness coefficient (b1), related to `:g1`. - `:B1` or `:yule`: Yule's coefficient (robust), based on quantiles. Takes an optional quantile `u` (default 0.25) via sequence `[:B1 u]` or `[:yule u]`. - `:B3`: Robust measure comparing the mean and median relative to the mean absolute deviation around the median. - `:skew`: An adjusted skewness definition sometimes used in bootstrap (BCa) calculations. - `:mode`: Pearson's second skewness coefficient: `(mean - mode) / stddev`. Requires calculating the mode. Mode calculation method can be specified via sequence `[:mode method opts]`, see [[mode]]. - `:median`: Robust measure: `3 * (mean - median) / stddev`. - `:bowley`: Bowley's coefficient (robust), based on quartiles (Q1, Q2, Q3). Also known as Yule-Bowley coefficient. Calculated as `(Q3 + Q1 - 2*Q2) / (Q3 - Q1)`. - `:hogg`: Hogg's robust measure based on the ratio of differences between trimmed means. - `:l-skewness`: L-skewness (τ₃), the ratio of the 3rd L-moment (λ₃) to the 2nd L-moment (λ₂, L-scale). Calculated directly using [[l-moment]] with the `:ratio?` option set to true. It's a robust measure of asymmetry. Expected value 0 for symmetric distributions. Interpretation: - Positive values generally indicate a distribution skewed to the right (tail is longer on the right). - Negative values generally indicate a distribution skewed to the left (tail is longer on the left). - Values near 0 suggest relative symmetry. Returns the calculated skewness value as a double. See also [[skewness-test]], [[normality-test]], [[jarque-bera-test]], [[l-moment]].
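A short sketch of several `typ` variants (the sample data is illustrative):

```clojure
(require '[fastmath.stats :as stats])

(def vs [1 1 2 2 2 3 3 4 5 9])  ;; right-skewed sample

(stats/skewness vs)              ;; default :G1, positive for this sample
(stats/skewness vs :bowley)      ;; robust, quartile-based
(stats/skewness vs [:B1 0.1])    ;; Yule's coefficient with u = 0.1
(stats/skewness vs :l-skewness)  ;; L-moment ratio τ₃
```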
(skewness-test xs)
(skewness-test xs params)
(skewness-test xs skew {:keys [sides type] :or {sides :two-sided type :g1}})
Performs the D'Agostino test for normality based on sample skewness. This test assesses the null hypothesis that the data comes from a normally distributed population by checking if the sample skewness significantly deviates from the zero skewness expected under normality. The test works by: 1. Calculating the sample skewness (type configurable via `:type`, default `:g1`). 2. Standardizing the sample skewness relative to its expected value (0) and standard error under the null hypothesis. 3. Applying a further transformation (inverse hyperbolic sine based) to this standardized score to yield a final test statistic `Z` that more closely follows a standard normal distribution under the null hypothesis. Parameters: - `xs` (seq of numbers): The sample data. - `skew` (double, optional): A pre-calculated skewness value. If omitted, it's calculated from `xs`. - `params` (map, optional): Options map: - `:sides` (keyword, default `:two-sided`): Specifies the alternative hypothesis. - `:two-sided` (default): The population skewness is different from 0. - `:one-sided-greater`: The population skewness is greater than 0 (right-skewed). - `:one-sided-less`: The population skewness is less than 0 (left-skewed). - `:type` (keyword, default `:g1`): The type of skewness to calculate if `skew` is not provided. Note that the internal normalization constants are derived based on `:g1`. See [[skewness]] for options. Returns a map containing: - `:Z`: The final test statistic, approximately standard normal under H0. - `:stat`: Alias for `:Z`. - `:p-value`: The p-value associated with `Z` and the specified `:sides`. - `:skewness`: The sample skewness value used in the test (either provided or calculated). See also [[kurtosis-test]], [[normality-test]], [[jarque-bera-test]].
(span vs)
Width of the sample: maximum value minus minimum value.
(spearman-correlation [vs1 vs2])
(spearman-correlation vs1 vs2)
Calculates Spearman's rank correlation coefficient between two sequences. Spearman's rank correlation is a non-parametric measure of the monotonic relationship between two datasets. It assesses how well the relationship between two variables can be described using a monotonic function. It does not require the data to be linearly related or follow a specific distribution. The coefficient is calculated on the ranks of the data rather than the raw values. Parameters: - `[vs1 vs2]` (sequence of two sequences): A sequence containing the two sequences of numbers. - `vs1`, `vs2` (sequences): The two sequences of numbers directly as arguments. Both sequences must have the same length. Returns the calculated Spearman rank correlation coefficient (a value between -1.0 and 1.0) as a double. A value of 1 indicates a perfect monotonic increasing relationship, -1 a perfect monotonic decreasing relationship, and 0 no monotonic relationship. See also [[pearson-correlation]], [[kendall-correlation]], [[correlation]].
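A minimal sketch contrasting monotonic and linear association (illustrative data):

```clojure
(require '[fastmath.stats :as stats])

(def xs [1.0 2.0 3.0 4.0 5.0])
(def ys (map #(* % % %) xs))  ;; monotonic but non-linear (cubes)

(stats/spearman-correlation xs ys)  ;; => 1.0, perfect monotonic relationship
(stats/pearson-correlation xs ys)   ;; < 1.0, since the relation is not linear
```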
(standardize vs)
Normalize samples to have mean = 0 and stddev = 1.
(stats-map vs)
(stats-map vs estimation-strategy)
Calculates a comprehensive set of descriptive statistics for a numerical dataset. This function computes various summary measures and returns them as a map, providing a quick overview of the data's central tendency, dispersion, shape, and potential outliers. Parameters: - `vs` (seq of numbers): The input sequence of numerical data. - `estimation-strategy` (keyword, optional): Specifies the method for calculating quantiles (including median, quartiles, and values used for fences). Defaults to `:legacy`. See [[percentile]] or [[quantile]] for available strategies (e.g., `:r1` through `:r9`). Returns a map where keys are statistic names (as keywords) and values are their calculated measures: - `:Size`: The number of data points in the sequence (count). - `:Min`: The minimum value (see [[minimum]]). - `:Max`: The maximum value (see [[maximum]]). - `:Range`: The difference between the maximum and minimum values (Max - Min). - `:Mean`: The arithmetic average (see [[mean]]). - `:Median`: The middle value (see [[median]] with `estimation-strategy`). - `:Mode`: The most frequent value (see [[mode]] with default method). - `:Q1`: The first quartile (25th percentile) (see [[percentile]] with `estimation-strategy`). - `:Q3`: The third quartile (75th percentile) (see [[percentile]] with `estimation-strategy`). - `:Total`: The sum of all values (see [[sum]]). - `:SD`: The sample standard deviation (see [[stddev]]). - `:Variance`: The sample variance (SD^2, see [[variance]]). - `:MAD`: The Median Absolute Deviation (see [[median-absolute-deviation]]). - `:SEM`: The Standard Error of the Mean (see [[sem]]). - `:LAV`: The Lower Adjacent Value (smallest value within the inner fence, see [[adjacent-values]]). - `:UAV`: The Upper Adjacent Value (largest value within the inner fence, see [[adjacent-values]]). - `:IQR`: The Interquartile Range (Q3 - Q1). - `:LOF`: The Lower Outer Fence (Q1 - 3*IQR, see [[outer-fence-extent]]). - `:UOF`: The Upper Outer Fence (Q3 + 3*IQR, see [[outer-fence-extent]]). - `:LIF`: The Lower Inner Fence (Q1 - 1.5*IQR, see [[inner-fence-extent]]). - `:UIF`: The Upper Inner Fence (Q3 + 1.5*IQR, see [[inner-fence-extent]]). - `:Outliers`: A sequence of data points falling outside the inner fences (see [[outliers]]). - `:Kurtosis`: A measure of tailedness/peakedness (see [[kurtosis]] with default `:G2` type). - `:Skewness`: A measure of asymmetry (see [[skewness]] with default `:G1` type). This function is a convenient way to get a standard set of summary statistics for a dataset in a single call.
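A minimal sketch (illustrative data; `select-keys` just trims the output for display):

```clojure
(require '[fastmath.stats :as stats])

(def vs [1 2 3 4 5 6 7 8 9 100])

;; One call produces the full summary; pick out a few entries
(select-keys (stats/stats-map vs)
             [:Size :Mean :Median :SD :IQR :Outliers])

;; Quartile-based entries honor the estimation strategy
(:Q1 (stats/stats-map vs :r7))
```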
(stddev vs)
(stddev vs mu)
Calculate standard deviation of `vs`. See [[population-stddev]].
(sum vs)
(sum vs compensation-method)
Sum of all `vs` values. Possible compensated summation methods are: `:kahan`, `:neumayer` and `:klein`.
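A minimal sketch of when compensation matters (the data is contrived to expose rounding loss):

```clojure
(require '[fastmath.stats :as stats])

;; Adding 1.0 to 1.0e16 falls below double-precision resolution, so a
;; naive left-to-right sum can lose the small contributions entirely.
(def vs (cons 1.0e16 (repeat 1000 1.0)))

(stats/sum vs)         ;; plain summation
(stats/sum vs :kahan)  ;; Kahan compensated summation
(stats/sum vs :klein)  ;; Klein's second-order compensation
```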
(t-test-one-sample xs)
(t-test-one-sample xs m)
Performs a one-sample Student's t-test to compare the sample mean against a hypothesized population mean. This test assesses the null hypothesis that the true population mean is equal to `mu`. It is suitable when the population standard deviation is unknown and is estimated from the sample. Parameters: - `xs` (seq of numbers): The sample data. - `params` (map, optional): Options map: - `:alpha` (double, default `0.05`): Significance level for the confidence interval. - `:sides` (keyword, default `:two-sided`): Specifies the alternative hypothesis. - `:two-sided` (default): The true mean is not equal to `mu`. - `:one-sided-greater`: The true mean is greater than `mu`. - `:one-sided-less`: The true mean is less than `mu`. - `:mu` (double, default `0.0`): The hypothesized population mean under the null hypothesis. Returns a map containing: - `:t`: The calculated t-statistic. - `:stat`: Alias for `:t`. - `:df`: Degrees of freedom (`n-1`). - `:p-value`: The p-value associated with the t-statistic and `:sides`. - `:confidence-interval`: Confidence interval for the true population mean. - `:estimate`: The calculated sample mean. - `:n`: The sample size. - `:mu`: The hypothesized population mean used in the test. - `:stderr`: The standard error of the mean (calculated from the sample). - `:alpha`: Significance level used. - `:sides`: Alternative hypothesis side used. - `:test-type`: Alias for `:sides`. Assumptions: - The data are independent observations. - The data are drawn from a population that is approximately normally distributed. (The t-test is relatively robust to moderate violations, especially with larger sample sizes). See also [[z-test-one-sample]] for large samples or known population standard deviation.
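A minimal sketch (the sample values and `:mu` are illustrative):

```clojure
(require '[fastmath.stats :as stats])

(def xs [4.8 5.2 5.0 4.9 5.3 5.1 4.7 5.4])

;; H0: the true mean equals 5.0 (two-sided by default)
(def result (stats/t-test-one-sample xs {:mu 5.0}))

(select-keys result [:t :df :p-value :confidence-interval :estimate])
```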
(t-test-two-samples xs ys)
(t-test-two-samples xs
ys
{:keys [paired? equal-variances?]
:or {paired? false equal-variances? false}
:as params})
Performs a two-sample Student's t-test to compare the means of two samples.

This function can perform:

- An **unpaired t-test** (assuming independent samples) using either:
  - **Welch's t-test** (default: `:equal-variances? false`): Does not assume equal population variances. Uses the Satterthwaite approximation for degrees of freedom. Recommended unless variances are known to be equal.
  - **Student's t-test** (`:equal-variances? true`): Assumes equal population variances and uses a pooled variance estimate.
- A **paired t-test** (`:paired? true`): Assumes observations in `xs` and `ys` are paired (e.g., before/after measurements on the same subjects). This performs a one-sample t-test on the differences between paired observations.

The test assesses the null hypothesis that the true difference between the population means (or the mean of the differences for the paired test) is equal to `mu`.

Parameters:

- `xs` (seq of numbers): The first sample.
- `ys` (seq of numbers): The second sample.
- `params` (map, optional): Options map:
  - `:alpha` (double, default `0.05`): Significance level for the confidence interval.
  - `:sides` (keyword, default `:two-sided`): Specifies the alternative hypothesis.
    - `:two-sided` (default): The true difference in means is not equal to `mu`.
    - `:one-sided-greater`: The true difference (`mean(xs) - mean(ys)` or `mean(diff)`) is greater than `mu`.
    - `:one-sided-less`: The true difference (`mean(xs) - mean(ys)` or `mean(diff)`) is less than `mu`.
  - `:mu` (double, default `0.0`): The hypothesized difference in means under the null hypothesis.
  - `:paired?` (boolean, default `false`): If `true`, performs a paired t-test (requires `xs` and `ys` to have the same length). If `false`, performs an unpaired test.
  - `:equal-variances?` (boolean, default `false`): Used only when `paired?` is `false`. If `true`, assumes equal population variances (Student's). If `false`, does not assume equal variances (Welch's).

Returns a map containing:

- `:t`: The calculated t-statistic.
- `:stat`: Alias for `:t`.
- `:df`: Degrees of freedom used for the t-distribution.
- `:p-value`: The p-value associated with the t-statistic and `:sides`.
- `:confidence-interval`: Confidence interval for the true difference in means.
- `:estimate`: The observed difference between sample means (`mean(xs) - mean(ys)` or `mean(differences)`).
- `:n`: Sample sizes as `[count xs, count ys]` (or `count diffs` if paired).
- `:nx`: Sample size of `xs` (if unpaired).
- `:ny`: Sample size of `ys` (if unpaired).
- `:estimated-mu`: Observed sample means as `[mean xs, mean ys]` (if unpaired).
- `:mu`: The hypothesized difference under the null hypothesis.
- `:stderr`: The standard error of the difference between the means (or of the mean difference if paired).
- `:alpha`: Significance level used.
- `:sides`: Alternative hypothesis side used.
- `:test-type`: Alias for `:sides`.
- `:paired?`: Boolean indicating if a paired test was performed.
- `:equal-variances?`: Boolean indicating the variance assumption used (if unpaired).

Assumptions:

- Independence of observations (within and between groups for unpaired).
- Normality of the underlying populations (or of the differences for paired). The t-test is relatively robust to violations of normality, especially with larger sample sizes.
- Equal variances (only if `:equal-variances? true`).
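A sketch of the three variants on made-up data (namespace alias assumed):

```clojure
(require '[fastmath.stats :as stats]) ; namespace name assumed

(def xs [23.1 25.4 22.8 26.0 24.3 25.1])
(def ys [21.0 22.5 20.8 23.1 21.9 22.7])

;; Welch's unpaired t-test -- the default, no equal-variance assumption.
(stats/t-test-two-samples xs ys)

;; Student's pooled-variance t-test.
(stats/t-test-two-samples xs ys {:equal-variances? true})

;; Paired t-test, e.g. before/after measurements on the same subjects;
;; xs and ys must have equal lengths.
(stats/t-test-two-samples xs ys {:paired? true})
```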
(trim vs)
(trim vs quantile)
(trim vs quantile estimation-strategy)
(trim vs low high nan)
Return trimmed data. Trimming is done using quantiles; by default the quantile is set to 0.2.
(trim-lower vs)
(trim-lower vs quantile)
(trim-lower vs quantile estimation-strategy)
Trim data below the given quantile (default: 0.2).
(trim-upper vs)
(trim-upper vs quantile)
(trim-upper vs quantile estimation-strategy)
Trim data above the given quantile (default: 0.2).
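As a rough sketch, trimming a sequence containing outliers (illustrative data; namespace alias assumed):

```clojure
(require '[fastmath.stats :as stats]) ; namespace name assumed

(def vs [-50 1 2 3 4 5 6 7 8 9 100])

;; Two-sided trim at the default quantile of 0.2.
(stats/trim vs)

;; One-sided trims at quantile 0.1.
(stats/trim-lower vs 0.1) ; drop the lower tail only
(stats/trim-upper vs 0.1) ; drop the upper tail only
```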
(tschuprows-t contingency-table)
(tschuprows-t group1 group2)
Calculates Tschuprow's T, a measure of association between two nominal variables represented in a contingency table.

Tschuprow's T is derived from the Pearson's Chi-squared statistic and measures the strength of the association. Its value ranges from 0 to 1.

- A value of 0 indicates no association between the variables.
- A value of 1 indicates perfect association, but only when the number of rows (`r`) equals the number of columns (`k`) in the contingency table. If `r != k`, Tschuprow's T cannot reach 1, making Cramer's V ([[cramers-v]]) often preferred as it can reach 1 for any table size.

The function can be called in two ways:

1. With two sequences `group1` and `group2`: The function will automatically construct a contingency table from the unique values in the sequences.
2. With a contingency table: The contingency table can be provided as:
   - A map where keys are `[row-index, column-index]` tuples and values are counts (e.g., `{[0 0] 10, [0 1] 5, [1 0] 3, [1 1] 12}`). This is the output format of [[contingency-table]] with two inputs.
   - A sequence of sequences representing the rows of the table (e.g., `[[10 5] [3 12]]`). This is equivalent to [[rows->contingency-table]].

Parameters:

- `group1` (sequence): The first sequence of categorical data.
- `group2` (sequence): The second sequence of categorical data. Must have the same length as `group1`.
- `contingency-table` (map or sequence of sequences): A pre-computed contingency table.

Returns the calculated Tschuprow's T coefficient as a double.

See also [[chisq-test]], [[cramers-c]], [[cramers-v]], [[cohens-w]], [[contingency-table]].
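Both calling styles, sketched on toy inputs (the rows-form table is the one from the example above; namespace alias assumed):

```clojure
(require '[fastmath.stats :as stats]) ; namespace name assumed

;; From two sequences of categorical observations...
(stats/tschuprows-t [:a :a :b :b :a] [:x :y :x :y :y])

;; ...or from a pre-computed contingency table in rows form.
(stats/tschuprows-t [[10 5] [3 12]])
```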
(variance vs)
(variance vs mu)
Calculate variance of `vs`. See [[population-variance]].
(variation vs)
Calculates the coefficient of variation (CV) for a sequence `vs`.

The CV is a standardized measure of dispersion of a probability distribution or frequency distribution. It is defined as the ratio of the standard deviation to the mean:

`CV = stddev(vs) / mean(vs)`

This measure is unitless and allows for comparison of variability between datasets with different means or different units.

Parameters:

- `vs`: Sequence of numbers.

Returns the calculated coefficient of variation as a double.

Note: The CV is undefined if the mean is zero, and may be misleading if the mean is close to zero or if the data can take both positive and negative values. All values in `vs` should ideally be positive.

See also [[stddev]], [[mean]].
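Because the CV is scale-free, multiplying every value by a constant leaves it unchanged, as this sketch shows (namespace alias assumed):

```clojure
(require '[fastmath.stats :as stats]) ; namespace name assumed

;; Same relative spread at two different scales -- identical CV,
;; since the second dataset is the first multiplied by 100.
(stats/variation [10.0 12.0 9.5 11.2 10.8])
(stats/variation [1000.0 1200.0 950.0 1120.0 1080.0])
```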
(weighted-kappa contingency-table)
(weighted-kappa contingency-table weights)
Calculates Cohen's weighted Kappa coefficient (κ) for a contingency table, allowing for partial agreement between categories, typically used for ordinal data.

Weighted Kappa measures inter-rater agreement, similar to [[cohens-kappa]], but assigns different penalties to disagreements based on their magnitude. Disagreements between closely related categories are penalized less than disagreements between distantly related categories.

The function can be called in two ways:

1. With two sequences `group1` and `group2`: The function will automatically construct a contingency table from the unique values in the sequences. These values are assumed to be ordinal and their position in the sorted unique value list determines their index. The mapping of values to table indices might need verification.
2. With a contingency table: The contingency table can be provided as:
   - A map where keys are `[row-index, column-index]` tuples and values are counts (e.g., `{[0 0] 10, [0 1] 5, [1 0] 3, [1 1] 12}`). This is the output format of [[contingency-table]] with two inputs. Indices are assumed to represent the ordered categories.
   - A sequence of sequences representing the rows of the table (e.g., `[[10 5] [3 12]]`). This is equivalent to [[rows->contingency-table]]. The row and column indices are assumed to correspond to the ordered categories.

Parameters:

- `group1` (sequence): The first sequence of ordinal outcomes/categories.
- `group2` (sequence): The second sequence of ordinal outcomes/categories. Must have the same length as `group1`.
- `contingency-table` (map or sequence of sequences): A pre-computed contingency table where row and column indices correspond to ordered categories.
- `weights` (keyword, function, or map, optional): Specifies the weighting scheme to quantify the difference between categories. Defaults to `:equal-spacing`.
  - `:equal-spacing` (default, linear weights): Penalizes disagreements linearly with the distance between categories. Weight is `1 - |i-j|/R`, where `i` is the row index, `j` is the column index, and `R` is the maximum dimension of the table (`max(max_row_index, max_col_index)`).
  - `:fleiss-cohen` (quadratic weights): Penalizes disagreements quadratically with the distance. Weight is `1 - (|i-j|/R)^2`.
  - Function `(fn [R id1 id2])`: A custom function that takes the maximum dimension `R`, row index `id1`, and column index `id2`, and returns the weight (typically between 0 and 1, where 1 is perfect agreement).
  - Map `{[id1 id2] weight}`: A custom map providing weights for specific `[row-index, column-index]` pairs. Missing pairs default to a weight of 0.0.

Returns the calculated weighted Cohen's Kappa coefficient as a double.

Interpretation:

- `κ_w = 1`: Perfect agreement.
- `κ_w = 0`: Agreement is no better than chance.
- `κ_w < 0`: Agreement is worse than chance.

See also [[cohens-kappa]] (unweighted Kappa).
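A sketch with two raters scoring items on a 3-point ordinal scale (made-up counts; the row/column orientation in the comment is illustrative; namespace alias assumed):

```clojure
(require '[fastmath.stats :as stats]) ; namespace name assumed

;; Rows: rater 1's category; columns: rater 2's category.
(def table [[20  5  1]
            [ 4 15  6]
            [ 1  3 10]])

;; Linear (equal-spacing) weights -- the default.
(stats/weighted-kappa table)

;; Quadratic (Fleiss-Cohen) weights penalize distant disagreements more.
(stats/weighted-kappa table :fleiss-cohen)
```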
(winsor vs)
(winsor vs quantile)
(winsor vs quantile estimation-strategy)
(winsor vs low high nan)
Return winsorized data. Clipping thresholds are determined by quantiles; by default the quantile is set to 0.2.
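Contrasting winsorizing with trimming on the same data (illustrative; namespace alias assumed):

```clojure
(require '[fastmath.stats :as stats]) ; namespace name assumed

(def vs [-50 1 2 3 4 5 6 7 8 9 100])

;; trim removes the extreme values...
(stats/trim vs 0.1)

;; ...while winsor replaces them with the quantile boundary values,
;; preserving the original element count.
(stats/winsor vs 0.1)
```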
(wmedian vs ws)
(wmedian vs ws method)
Calculates the median of a sequence `vs` with corresponding weights `ws`.

Parameters:

- `vs`: Sequence of data values.
- `ws`: Sequence of corresponding non-negative weights. Must have the same count as `vs`.
- `method` (optional keyword): Specifies the interpolation method used when the target quantile (`q=0.5`) falls between points in the weighted ECDF. Defaults to `:linear`.
  - `:linear`: Performs linear interpolation between the data values corresponding to the cumulative weights surrounding `q=0.5`.
  - `:step`: Uses a step function (specifically, step-before) based on the weighted ECDF. The result is the data value whose cumulative weight range includes `q=0.5`.
  - `:average`: Computes the average of the step-before and step-after interpolation methods.

See also: [[wquantile]], [[quantile]].
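A quick sketch of how weights shift the median (namespace alias assumed):

```clojure
(require '[fastmath.stats :as stats]) ; namespace name assumed

(def vs [1.0 2.0 3.0 4.0 5.0])
(def ws [1.0 1.0 1.0 1.0 10.0]) ; 5.0 carries most of the total weight

;; The heavy weight on 5.0 pulls the weighted median to the right
;; of the unweighted median (3.0).
(stats/wmedian vs ws)

;; Step-before interpolation instead of the default :linear.
(stats/wmedian vs ws :step)
```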
(wmode vs)
(wmode vs weights)
Returns the primary weighted mode of a sequence `vs`.

The mode is the value that appears most often in a dataset. This function generalizes the mode concept by considering weights associated with each value. A value's contribution to the mode calculation is proportional to its weight.

If multiple values share the same highest total weight (i.e., there are ties for the mode), this function returns only the first one encountered during processing. The specific mode returned in case of a tie is not guaranteed to be stable across different runs or environments. Use [[wmodes]] if you need all tied modes.

Parameters:

- `vs`: Sequence of data values. Can contain any data type (numbers, keywords, etc.).
- `weights` (optional): Sequence of non-negative weights corresponding to `vs`. Must have the same count as `vs`. Defaults to a sequence of 1.0s if omitted, effectively calculating the unweighted mode.

Returns a single value representing the mode (or one of the modes if ties exist).

See also [[wmodes]] (returns all modes) and [[mode]] (for unweighted numeric data).
(wmodes vs)
(wmodes vs weights)
Returns the weighted mode(s) of a sequence `vs`.

The mode is the value that appears most often in a dataset. This function generalizes the mode concept by considering weights associated with each value. A value's contribution to the mode calculation is proportional to its weight.

Parameters:

- `vs`: Sequence of data values. Can contain any data type (numbers, keywords, etc.).
- `weights` (optional): Sequence of non-negative weights corresponding to `vs`. Must have the same count as `vs`. Defaults to a sequence of 1.0s if omitted, effectively calculating the unweighted modes.

Returns a sequence containing all values that have the highest total weight. If there are ties (multiple values share the same maximum total weight), all tied values are included in the returned sequence. The order of modes in the returned sequence is not guaranteed.

See also [[wmode]] (returns only one mode in case of ties) and [[modes]] (for unweighted numeric data).
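A sketch covering both functions (namespace alias assumed):

```clojure
(require '[fastmath.stats :as stats]) ; namespace name assumed

;; :c wins on total weight (5.0) despite occurring only once:
;; :a totals 2.0, :b totals 4.0.
(stats/wmode [:a :b :c :b :a] [1.0 2.0 5.0 2.0 1.0])

;; With tied totals, wmodes returns every tied value
;; (here :a and :b, order not guaranteed); wmode would return one.
(stats/wmodes [:a :b :c] [2.0 2.0 1.0])
```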
(wmw-odds [group1 group2])
(wmw-odds group1 group2)
Calculates the Wilcoxon-Mann-Whitney odds (often denoted as ψ) for two independent samples.

This non-parametric effect size measure quantifies the odds that a randomly chosen observation from the first group (`group1`) is greater than a randomly chosen observation from the second group (`group2`).

The statistic is directly related to [[cliffs-delta]] (δ): ψ = (1 + δ) / (1 - δ).

Parameters:

- `group1` (seq of numbers): The first independent sample.
- `group2` (seq of numbers): The second independent sample.

Returns the calculated WMW odds as a double.

Interpretation:

- A value greater than 1 indicates that values from `group1` tend to be larger than values from `group2`.
- A value less than 1 indicates that values from `group1` tend to be smaller than values from `group2`.
- A value of 1 indicates stochastic equality between the distributions (50/50 odds).

This measure is robust to violations of normality and is suitable for ordinal data. It is closely related to Cliff's Delta (δ) and the Mann-Whitney U test statistic.

See also [[cliffs-delta]], [[ameasure]].
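A minimal sketch on made-up samples (namespace alias assumed):

```clojure
(require '[fastmath.stats :as stats]) ; namespace name assumed

(def treatment [5.1 6.2 7.3 6.8 5.9])
(def control   [4.2 5.0 4.8 5.3 4.6])

;; Odds that a random treatment value exceeds a random control value;
;; results above 1 favor the first group.
(stats/wmw-odds treatment control)
```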
(wquantile vs ws q)
(wquantile vs ws q method)
Calculates the q-th weighted quantile of a sequence `vs` with corresponding weights `ws`.

The quantile `q` is a value between 0.0 and 1.0, inclusive. The calculation involves constructing a weighted empirical cumulative distribution function (ECDF) and interpolating to find the value at quantile `q`.

Parameters:

- `vs`: Sequence of data values.
- `ws`: Sequence of corresponding non-negative weights. Must have the same count as `vs`.
- `q`: The quantile level (0.0 < q <= 1.0).
- `method` (optional keyword): Specifies the interpolation method used when `q` falls between points in the weighted ECDF. Defaults to `:linear`.
  - `:linear`: Performs linear interpolation between the data values corresponding to the cumulative weights surrounding `q`.
  - `:step`: Uses a step function (specifically, step-before) based on the weighted ECDF. The result is the data value whose cumulative weight range includes `q`.
  - `:average`: Computes the average of the step-before and step-after interpolation methods. Useful when `q` corresponds exactly to a cumulative weight boundary.

See also: [[wmedian]], [[wquantiles]], [[quantile]].
(wquantiles vs ws)
(wquantiles vs ws qs)
(wquantiles vs ws qs method)
Calculates the sequence of weighted quantiles of a sequence `vs` with corresponding weights `ws`, one for each quantile level in `qs`.

Each quantile level in `qs` is a value between 0.0 and 1.0, inclusive. The calculation involves constructing a weighted empirical cumulative distribution function (ECDF) and interpolating to find the value at each quantile level.

Parameters:

- `vs`: Sequence of data values.
- `ws`: Sequence of corresponding non-negative weights. Must have the same count as `vs`.
- `qs`: Sequence of quantile levels (0.0 < q <= 1.0).
- `method` (optional keyword): Specifies the interpolation method used when a quantile level `q` falls between points in the weighted ECDF. Defaults to `:linear`.
  - `:linear`: Performs linear interpolation between the data values corresponding to the cumulative weights surrounding `q`.
  - `:step`: Uses a step function (specifically, step-before) based on the weighted ECDF. The result is the data value whose cumulative weight range includes `q`.
  - `:average`: Computes the average of the step-before and step-after interpolation methods. Useful when `q` corresponds exactly to a cumulative weight boundary.

See also: [[wquantile]], [[quantiles]].
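A sketch of single and multiple weighted quantiles (namespace alias assumed):

```clojure
(require '[fastmath.stats :as stats]) ; namespace name assumed

(def vs [1.0 2.0 3.0 4.0 5.0])
(def ws [1.0 1.0 1.0 1.0 10.0])

;; A single weighted quantile...
(stats/wquantile vs ws 0.25)

;; ...or several levels at once, e.g. weighted quartiles.
(stats/wquantiles vs ws [0.25 0.5 0.75])

;; Step-before interpolation instead of the default :linear.
(stats/wquantile vs ws 0.25 :step)
```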
(wstddev vs freqs)
Calculate weighted (unbiased) standard deviation of `vs`.
(wvariance vs freqs)
Calculate weighted (unbiased) variance of `vs`.
(yeo-johnson-infer-lambda xs)
(yeo-johnson-infer-lambda xs lambda-range)
(yeo-johnson-infer-lambda xs lambda-range {:keys [alpha] :or {alpha 0.0}})
Find the optimal `lambda` parameter for the Yeo-Johnson transformation using the maximum log-likelihood method.
(yeo-johnson-transformation xs)
(yeo-johnson-transformation xs lambda)
(yeo-johnson-transformation xs lambda {:keys [alpha inverse?] :or {alpha 0.0}})
Applies the Yeo-Johnson transformation to a dataset.

This transformation is used to stabilize variance and make data more normally distributed. It extends the Box-Cox transformation to allow for zero and negative values.

Parameters:

- `xs`: The input dataset.
- `lambda` (default: 0.0): The power parameter controlling the transformation. If `lambda` is `nil` or a range `[lambda-min, lambda-max]`, it will be inferred using the maximum log-likelihood method.
- Options map:
  - `:alpha` (optional): A shift parameter applied before transformation.
  - `:inverse?` (optional): Performs the inverse operation; `lambda` must be provided (it can't be inferred).

Returns:

- A transformed sequence of numbers.

Related: [[box-cox-transformation]]
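A sketch of transforming skewed data that includes zeros and negatives (which Box-Cox cannot handle), then inverting the transform (namespace alias assumed):

```clojure
(require '[fastmath.stats :as stats]) ; namespace name assumed

(def xs [-2.0 -0.5 0.0 0.3 1.0 2.5 7.0 20.0])

;; Transform with an explicit lambda...
(stats/yeo-johnson-transformation xs 0.5)

;; ...or infer lambda by maximum log-likelihood first.
(def lambda (stats/yeo-johnson-infer-lambda xs))
(def ys (stats/yeo-johnson-transformation xs lambda))

;; Round-trip back to the original scale.
(stats/yeo-johnson-transformation ys lambda {:inverse? true})
```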
(z-test-one-sample xs)
(z-test-one-sample xs m)
Performs a one-sample Z-test to compare the sample mean against a hypothesized population mean.

This test assesses the null hypothesis that the true population mean is equal to `mu`. It typically assumes either a known population standard deviation or relies on a large sample size (e.g., n > 30) where the sample standard deviation provides a reliable estimate. This implementation uses the sample standard deviation to calculate the standard error.

Parameters:

- `xs` (seq of numbers): The sample data.
- `params` (map, optional): Options map:
  - `:alpha` (double, default `0.05`): Significance level for the confidence interval.
  - `:sides` (keyword, default `:two-sided`): Specifies the alternative hypothesis.
    - `:two-sided` (default): The true mean is not equal to `mu`.
    - `:one-sided-greater`: The true mean is greater than `mu`.
    - `:one-sided-less`: The true mean is less than `mu`.
  - `:mu` (double, default `0.0`): The hypothesized population mean under the null hypothesis.

Returns a map containing:

- `:z`: The calculated Z-statistic.
- `:stat`: Alias for `:z`.
- `:p-value`: The p-value associated with the Z-statistic and the specified `:sides`.
- `:confidence-interval`: Confidence interval for the true population mean.
- `:estimate`: The calculated sample mean.
- `:n`: The sample size.
- `:mu`: The hypothesized population mean used in the test.
- `:stderr`: The standard error of the mean (calculated using sample standard deviation).
- `:alpha`: Significance level used.
- `:sides`: Alternative hypothesis side used.
- `:test-type`: Alias for `:sides`.

See also [[t-test-one-sample]] for smaller samples or when the population standard deviation is unknown.
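For illustration, a one-sample Z-test on a reasonably large random sample (namespace alias assumed):

```clojure
(require '[fastmath.stats :as stats]) ; namespace name assumed

;; 100 values uniformly spread between 50 and 60, so the true mean is 55.
(def xs (repeatedly 100 #(+ 50.0 (rand 10.0))))

(def result (stats/z-test-one-sample xs {:mu 55.0}))
(select-keys result [:z :p-value :confidence-interval :estimate])
```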
(z-test-two-samples xs ys)
(z-test-two-samples xs
ys
{:keys [paired? equal-variances?]
:or {paired? false equal-variances? false}
:as params})
Performs a two-sample Z-test to compare the means of two independent or paired samples.

This test assesses the null hypothesis that the difference between the population means is equal to `mu` (default 0). It typically assumes known population variances or relies on large sample sizes where sample variances provide good estimates. This implementation calculates the standard error using the provided sample variances.

Parameters:

- `xs` (seq of numbers): The first sample.
- `ys` (seq of numbers): The second sample.
- `params` (map, optional): Options map:
  - `:alpha` (double, default `0.05`): Significance level for the confidence interval.
  - `:sides` (keyword, default `:two-sided`): Specifies the alternative hypothesis.
    - `:two-sided` (default): The true difference in means is not equal to `mu`.
    - `:one-sided-greater`: The true difference in means (`mean(xs) - mean(ys)`) is greater than `mu`.
    - `:one-sided-less`: The true difference in means (`mean(xs) - mean(ys)`) is less than `mu`.
  - `:mu` (double, default `0.0`): The hypothesized difference in means under the null hypothesis.
  - `:paired?` (boolean, default `false`): If `true`, performs a paired Z-test by applying [[z-test-one-sample]] to the differences between paired observations in `xs` and `ys` (requires `xs` and `ys` to have the same length). If `false`, performs a two-sample test assuming independence.
  - `:equal-variances?` (boolean, default `false`): Used only when `paired?` is `false`. If `true`, assumes population variances are equal and calculates a pooled standard error. If `false`, calculates the standard error without assuming equal variances (Welch's approach adapted for the Z-test). This affects the standard error calculation, but the standard normal distribution is still used for inference.

Returns a map containing:

- `:z`: The calculated Z-statistic.
- `:stat`: Alias for `:z`.
- `:p-value`: The p-value associated with the Z-statistic and the specified `:sides`.
- `:confidence-interval`: Confidence interval for the true difference in means.
- `:estimate`: The observed difference between sample means (`mean(xs) - mean(ys)`).
- `:n`: Sample sizes as `[count xs, count ys]`.
- `:nx`: Sample size of `xs`.
- `:ny`: Sample size of `ys`.
- `:estimated-mu`: The observed sample means as `[mean xs, mean ys]`.
- `:mu`: The hypothesized difference under the null hypothesis.
- `:stderr`: The standard error of the difference between the means.
- `:alpha`: Significance level used.
- `:sides`: Alternative hypothesis side used.
- `:test-type`: Alias for `:sides`.
- `:paired?`: Boolean indicating if a paired test was performed.
- `:equal-variances?`: Boolean indicating the assumption used for standard error calculation (if unpaired).

See also [[t-test-two-samples]] for smaller samples or when population variances are unknown.
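A sketch of the unpaired and paired variants on simulated samples (namespace alias assumed):

```clojure
(require '[fastmath.stats :as stats]) ; namespace name assumed

(def xs (repeatedly 200 #(+ 10.0 (rand 2.0))))
(def ys (repeatedly 180 #(+ 10.3 (rand 2.0))))

;; Unpaired Z-test without the equal-variance assumption (the default).
(def result (stats/z-test-two-samples xs ys))
(select-keys result [:z :p-value :confidence-interval :estimate])

;; Paired variant requires equal-length samples.
(stats/z-test-two-samples (take 180 xs) ys {:paired? true})
```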