
fastmath.stats


This namespace provides a comprehensive collection of functions for
performing statistical analysis in Clojure. It focuses on efficient
implementations of common statistical tasks, leveraging fastmath's underlying
numerical capabilities.

This namespace covers a wide range of statistical methods, including:

*   **Descriptive Statistics**: Measures of central tendency (mean, median, mode, expectile),
    dispersion (variance, standard deviation, MAD, SEM), and shape (skewness, kurtosis, L-moments).
*   **Quantiles and Percentiles**: Functions for calculating percentiles, quantiles, and the median,
    including weighted versions and various estimation strategies.
*   **Intervals and Extents**: Methods for defining ranges within data, such as span, IQR,
    standard deviation/MAD/SEM extents, percentile/quantile intervals, prediction intervals (PI, HPDI),
    and fence boundaries for outlier detection.
*   **Outlier Detection**: Functions for identifying data points outside conventional fence boundaries.
*   **Data Transformation**: Utilities for scaling, centering, trimming, winsorizing,
    and applying power transformations (Box-Cox, Yeo-Johnson) to data.
*   **Correlation and Covariance**: Measures of the linear and monotonic relationship
    between two or more variables (Pearson, Spearman, Kendall), and functions for
    generating covariance and correlation matrices.
*   **Distance and Similarity Metrics**: Functions for quantifying differences or
    likeness between data sequences or distributions, including error metrics (MAE, MSE, RMSE),
    L-p norms, and various distribution dissimilarity/similarity measures.
*   **Contingency Tables**: Functions for creating, analyzing, and deriving measures
    of association and agreement (Cramer's V, Cohen's Kappa) from contingency tables,
    including specialized functions for 2x2 tables.
*   **Binary Classification Metrics**: Functions for generating confusion matrices
    and calculating a wide array of performance metrics (Accuracy, Precision, Recall, F1, MCC, etc.).
*   **Effect Size**: Measures quantifying the magnitude of statistical effects,
    including difference-based (Cohen's d, Hedges' g, Glass's delta), ratio-based,
    ordinal/non-parametric (Cliff's Delta, Vargha-Delaney A), and overlap-based (Cohen's U, p-overlap),
    as well as measures related to explained variance (Eta-squared, Omega-squared, Cohen's f²).
*   **Statistical Tests**: Functions for performing hypothesis tests, including:
    -   Normality and Shape tests (Skewness, Kurtosis, D'Agostino-Pearson K², Jarque-Bera, Bonett-Seier).
    -   Binomial tests and confidence intervals.
    -   Location tests (one-sample and two-sample T/Z tests, paired/unpaired).
    -   Variance tests (F-test, Levene's, Brown-Forsythe, Fligner-Killeen).
    -   Goodness-of-Fit and Independence tests (Power Divergence family including Chi-squared, G-test; AD/KS tests).
    -   ANOVA and Rank Sum tests (One-way ANOVA, Kruskal-Wallis).
    -   Autocorrelation tests (Durbin-Watson).
*   **Time Series Analysis**: Functions for analyzing the dependence structure of
    time series data, such as Autocorrelation (ACF) and Partial Autocorrelation (PACF).
*   **Histograms**: Functions for computing histograms and estimating optimal binning strategies.

This namespace aims to provide a robust set of statistical tools for data analysis
and modeling within the Clojure ecosystem.

->confusion-matrix (deprecated)

acf

(acf data)
(acf data lags)


Calculates the Autocorrelation Function (ACF) for a given time series `data`.

The ACF measures the linear dependence between a time series and its lagged values.
It helps identify patterns (like seasonality or trend) and inform the selection of
models for time series analysis (e.g., in ARIMA modeling).

Parameters:

* `data` (seq of numbers): The time series data.
* `lags` (long or seq of longs, optional):
  * If a number, calculates ACF for lags from 0 up to this maximum lag.
  * If a sequence of numbers, calculates ACF for each lag specified in the sequence.
  * If omitted (1-arity call), calculates ACF for lags from 0 up to `(dec (count data))`.

Returns a sequence of doubles: the autocorrelation coefficients for the specified lags.
The value at lag 0 is always 1.0.

See also [[acf-ci]] (Calculates ACF with confidence intervals), [[pacf]], [[pacf-ci]].
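A minimal usage sketch, assuming the namespace is aliased as `stats` (the series values are purely illustrative):

```clojure
(require '[fastmath.stats :as stats])

;; an illustrative series with an alternating pattern
(def series [1.0 2.0 1.5 2.5 1.2 2.2 1.4 2.4])

;; ACF for lags 0..3; the coefficient at lag 0 is always 1.0
(stats/acf series 3)

;; ACF only at the listed lags
(stats/acf series [1 2 4])
```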

acf-ci

(acf-ci data)
(acf-ci data lags)
(acf-ci data lags alpha)


Calculates the Autocorrelation Function (ACF) for a time series and provides approximate confidence intervals.

This function computes the ACF of the input time series `data` for specified lags
(see [[acf]]) and includes approximate confidence intervals around the ACF
estimates. These intervals help determine whether the autocorrelation at a
specific lag is statistically significant (i.e., likely non-zero in the population).

Parameters:

* `data` (seq of numbers): The time series data.
* `lags` (long or seq of longs, optional):
  * If a number, calculates ACF for lags from 0 up to this maximum lag.
  * If a sequence of numbers, calculates ACF for each lag specified in the sequence.
  * If omitted (1-arity call), calculates ACF for lags from 0 up to `(dec (count data))`.
* `alpha` (double, optional): The significance level for the confidence intervals.
  Defaults to `0.05` (for a 95% CI).

Returns a map containing:

* `:ci` (double): The value of the approximate standard confidence interval bound
  for lags > 0. If the absolute value of an ACF
  coefficient at lag `k > 0` exceeds this value, it is considered statistically significant.
* `:acf` (seq of doubles): The sequence of autocorrelation coefficients
  at lags from 0 up to `lags` (or specified lags if `lags` is a sequence), calculated
  using [[acf]].
* `:cis` (seq of doubles): Cumulative confidence intervals for ACF. These are based on the
  variance of the sum of squared sample autocorrelations up to each lag.

See also [[acf]], [[pacf]], [[pacf-ci]].
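As a sketch of how the returned `:ci` bound might be used to flag significant lags (the alias and data are assumptions, not part of the API):

```clojure
(require '[fastmath.stats :as stats])

(def series [1.0 2.0 1.5 2.5 1.2 2.2 1.4 2.4 1.3 2.3])

;; ACF for lags 0..4 with a 95% significance bound (alpha defaults to 0.05)
(let [{:keys [ci acf]} (stats/acf-ci series 4)]
  ;; keep [lag coefficient] pairs whose magnitude exceeds the bound,
  ;; skipping lag 0 (always 1.0)
  (->> (map-indexed vector acf)
       rest
       (filter (fn [[_ r]] (> (Math/abs (double r)) ci)))))
```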

ad-test-one-sample

(ad-test-one-sample xs)
(ad-test-one-sample xs distribution-or-ys)
(ad-test-one-sample xs
                    distribution-or-ys
                    {:keys [sides kernel bandwidth]
                     :or {sides :right kernel :gaussian}})

Performs the Anderson-Darling (AD) test for goodness-of-fit.

This test assesses the null hypothesis that a sample `xs` comes from a
specified theoretical distribution or another empirical distribution. It is
sensitive to differences in the tails of the distributions.

Parameters:

- `xs` (seq of numbers): The sample data to be tested.
- `distribution-or-ys` (optional):
  - A `fastmath.random` distribution object to test against. If omitted, defaults
    to the standard normal distribution (`fastmath.random/default-normal`).
  - A sequence of numbers (`ys`). In this case, an empirical distribution is
    estimated from `ys` using Kernel Density Estimation (KDE) or an enumerated
    distribution (see `:kernel` option).
- `opts` (map, optional): Options map:
  - `:sides` (keyword, default `:right`): Specifies the side(s) of the
    A^2 statistic's distribution used for p-value calculation.
    - `:right` (default): Tests if the observed A^2 statistic is significantly
      large (standard approach for AD test, indicating poor fit).
    - `:left`: Tests if the observed A^2 statistic is significantly small.
    - `:two-sided`: Tests if the observed A^2 statistic is extreme in either tail.
  - `:kernel` (keyword, default `:gaussian`): Used only when `distribution-or-ys`
    is a sequence. Specifies the method to estimate the empirical distribution:
      - `:gaussian` (or other KDE kernels): Uses Kernel Density Estimation.
      - `:enumerated`: Creates a discrete empirical distribution from `ys`.
  - `:bandwidth` (double, optional): Bandwidth for KDE (if applicable).

Returns a map containing:

- `:A2`: The Anderson-Darling test statistic (A^2).
- `:stat`: Alias for `:A2`.
- `:p-value`: The p-value associated with the test statistic and the specified `:sides`.
- `:n`: Sample size of `xs`.
- `:mean`: Mean of the sample `xs` (for context).
- `:stddev`: Standard deviation of the sample `xs` (for context).
- `:sides`: The alternative hypothesis side used for p-value calculation.
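A hedged usage sketch (the sample values are made up; the default reference distribution is the standard normal):

```clojure
(require '[fastmath.stats :as stats])

;; one-sample test against the default standard normal
(def xs [-1.2 0.3 0.8 -0.5 1.1 0.0 -0.7 0.4 1.6 -1.0])
(stats/ad-test-one-sample xs)
;; returns a map with :A2, :stat, :p-value, :n, :mean, :stddev and :sides

;; test against an empirical distribution built from another sample
(def ys [0.1 -0.2 0.5 -0.8 1.3 0.2 -0.4 0.9])
(stats/ad-test-one-sample xs ys {:kernel :enumerated})
```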

adjacent-values

(adjacent-values vs)
(adjacent-values vs estimation-strategy)
(adjacent-values vs q1 q3 m)


Lower and upper adjacent values (LAV and UAV).

Let Q1 be the 25th percentile and Q3 the 75th percentile. IQR is `(- Q3 Q1)`.

* LAV is the smallest value greater than or equal to the LIF = `(- Q1 (* 1.5 IQR))`.
* UAV is the largest value less than or equal to the UIF = `(+ Q3 (* 1.5 IQR))`.
* The third returned value is the median of the samples.

Optional `estimation-strategy` argument can be set to change the quantile estimation type. See [[estimation-strategies]].
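For example (values chosen so that one point clearly lies outside the fences):

```clojure
(require '[fastmath.stats :as stats])

;; 100 falls far beyond the upper inner fence, so the UAV is taken
;; from the remaining values; the third element is the median
(stats/adjacent-values [1 2 3 4 5 6 100])
```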

ameasure

(ameasure [group1 group2])
(ameasure group1 group2)


Calculates the Vargha-Delaney A measure for two independent samples.

A non-parametric effect size measure quantifying the probability that a randomly chosen value from the first sample (`group1`) is greater than a randomly chosen value from the second sample (`group2`).

Parameters:

- `group1`: The first independent sample.
- `group2`: The second independent sample.

Returns the calculated A measure (a double) in the range [0, 1].
A value of 0.5 indicates stochastic equality (neither group tends to produce larger values). Values > 0.5 mean `group1` tends to be larger; values < 0.5 mean `group2` tends to be larger.

Related to [[cliffs-delta]] and the Wilcoxon-Mann-Whitney U test statistic.

See also [[cliffs-delta]], [[wmw-odds]].
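Two illustrative calls (data invented to show the extremes of the scale):

```clojure
(require '[fastmath.stats :as stats])

;; complete separation: every value in group1 exceeds every value
;; in group2, so A should be at its maximum of 1.0
(stats/ameasure [5 6 7 8] [1 2 3 4])

;; identical groups: stochastic equality, A should be 0.5
(stats/ameasure [1 2 3 4] [1 2 3 4])
```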

binary-measures

(binary-measures confusion-matrix)
(binary-measures actual prediction)
(binary-measures actual prediction true-value)
(binary-measures tp fn fp tn)


Calculates a selected subset of common evaluation metrics for binary classification results.

This function is a convenience wrapper around [[binary-measures-all]], providing
a map containing the most frequently used metrics derived from a 2x2 confusion matrix.

The 2x2 confusion matrix is based on True Positives (TP), False Positives (FP),
False Negatives (FN), and True Negatives (TN):

|                | Predicted True | Predicted False |
|:---------------|:---------------|:----------------|
| **Actual True**  | TP             | FN              |
| **Actual False** | FP             | TN              |

The function accepts the same input formats as [[binary-measures-all]]:

1.  `(binary-measures tp fn fp tn)`: Direct input of the four counts.
2.  `(binary-measures confusion-matrix)`: Input as a structured representation
    (map with keys like `:tp`, `:fn`, `:fp`, `:tn`; sequence of sequences
    `[ [TP FP] [FN TN] ]`; or flat sequence `[TP FN FP TN]`).
3.  `(binary-measures actual prediction)`: Input as two sequences of outcomes.
4.  `(binary-measures actual prediction true-value)`: Input as two sequences with
    a specified encoding for `true` (success).

Parameters:

- `tp, fn, fp, tn` (long): Counts from the confusion matrix.
- `confusion-matrix` (map or sequence): Representation of the confusion matrix.
- `actual`, `prediction` (sequences): Sequences of true and predicted outcomes.
- `true-value` (optional): Specifies how outcomes are converted to boolean `true`/`false`.

Returns a map containing the following selected metrics:

- `:tp` (True Positives)
- `:tn` (True Negatives)
- `:fp` (False Positives)
- `:fn` (False Negatives)
- `:accuracy`
- `:fdr` (False Discovery Rate, 1 - Precision)
- `:f-measure` (F1 Score, harmonic mean of Precision and Recall)
- `:fall-out` (False Positive Rate)
- `:precision` (Positive Predictive Value)
- `:recall` (True Positive Rate / Sensitivity)
- `:sensitivity` (Alias for Recall/TPR)
- `:specificity` (True Negative Rate)
- `:prevalence` (Proportion of positive cases)

See also [[confusion-matrix]], [[binary-measures-all]], [[mcc]], [[contingency-2x2-measures-all]].
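A small sketch of the two most common call shapes (the outcome data is invented):

```clojure
(require '[fastmath.stats :as stats])

;; from two outcome sequences: here tp=3, fn=1, fp=1, tn=3
(def actual    [true true true  false false false true false])
(def predicted [true true false false true  false true false])
(stats/binary-measures actual predicted)
;; returns a map with :accuracy, :precision, :recall, :specificity, ...

;; or directly from the four counts, in tp fn fp tn order
(stats/binary-measures 3 1 1 3)
```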

binary-measures-all

(binary-measures-all confusion-matrix)
(binary-measures-all actual prediction)
(binary-measures-all actual prediction true-value)
(binary-measures-all tp fn fp tn)


Calculates a comprehensive set of evaluation metrics for binary classification results.

This function computes various statistics derived from a 2x2 confusion matrix,
summarizing the performance of a binary classifier.

The 2x2 confusion matrix is based on True Positives (TP), False Positives (FP),
False Negatives (FN), and True Negatives (TN):

|                | Predicted True | Predicted False |
|:---------------|:---------------|:----------------|
| **Actual True**  | TP             | FN              |
| **Actual False** | FP             | TN              |

The function supports several input formats:

1.  `(binary-measures-all tp fn fp tn)`: Direct input of the four counts as arguments.
    - `tp` (long): True Positive count.
    - `fn` (long): False Negative count.
    - `fp` (long): False Positive count.
    - `tn` (long): True Negative count.

2.  `(binary-measures-all confusion-matrix)`: Input as a structured representation of the confusion matrix.
    - `confusion-matrix`: Can be:
      - A map with keys like `:tp`, `:fn`, `:fp`, `:tn` (e.g., `{:tp 10 :fn 2 :fp 5 :tn 80}`).
      - A sequence of sequences representing rows `[[TP FP] [FN TN]]` (e.g., `[[10 5] [2 80]]`).
      - A flat sequence `[TP FN FP TN]` (e.g., `[10 2 5 80]`).

3.  `(binary-measures-all actual prediction)`: Input as two sequences of outcomes.
    - `actual` (sequence): Sequence of true outcomes.
    - `prediction` (sequence): Sequence of predicted outcomes. Must have the same length as `actual`.
    Values in `actual` and `prediction` are converted to boolean `true`/`false`. By default,
    any non-`nil` or non-zero numeric value is treated as `true`, and `nil` or `0.0` is
    treated as `false`.

4.  `(binary-measures-all actual prediction true-value)`: Input as two sequences with a specified encoding for `true`.
    - `actual`, `prediction`: Sequences as in the previous arity.
    - `true-value` (optional): Specifies how values in `actual` and `prediction` are converted to boolean `true` (success) or `false` (failure).
      - `nil` (default): Non-`nil`/non-zero (for numbers) is true.
      - Any sequence/set: Values found in this collection are true.
      - A map: Values are mapped according to the map; if a key is not found or maps to `false`, the value is false.
      - A predicate function: Returns `true` if the value satisfies the predicate.

Returns a map containing a wide array of calculated metrics. This includes, but is not limited to:

- Basic Counts: `:tp`, `:fn`, `:fp`, `:tn`
- Totals: `:cp` (Actual Positives), `:cn` (Actual Negatives), `:pcp` (Predicted Positives), `:pcn` (Predicted Negatives), `:total` (Grand Total)
- Rates (often ratios of counts):
  - `:tpr` (True Positive Rate, Recall, Sensitivity, Hit Rate)
  - `:fnr` (False Negative Rate, Miss Rate)
  - `:fpr` (False Positive Rate, Fall-out)
  - `:tnr` (True Negative Rate, Specificity, Selectivity)
  - `:ppv` (Positive Predictive Value, Precision)
  - `:fdr` (False Discovery Rate, `1 - ppv`)
  - `:npv` (Negative Predictive Value)
  - `:for` (False Omission Rate, `1 - npv`)
- Ratios/Odds:
  - `:lr+` (Positive Likelihood Ratio)
  - `:lr-` (Negative Likelihood Ratio)
  - `:dor` (Diagnostic Odds Ratio)
- Combined Scores:
  - `:accuracy`
  - `:ba` (Balanced Accuracy)
  - `:fm` (Fowlkes–Mallows index)
  - `:pt` (Prevalence Threshold)
  - `:ts` (Threat Score, Jaccard index)
  - `:f-measure` / `:f1-score` (F1 Score, special case of F-beta score)
  - `:f-beta` (Function to calculate F-beta for any beta)
  - `:mcc` / `:phi` (Matthews Correlation Coefficient, Phi coefficient)
  - `:bm` (Bookmaker Informedness)
  - `:kappa` (Cohen's Kappa, for 2x2 table)
  - `:mk` (Markedness)

Metrics are generally calculated using standard formulas based on the TP, FN, FP, TN counts.
For more details on specific metrics, refer to standard classification literature or
the Wikipedia page on [Precision and recall](https://en.wikipedia.org/wiki/Precision_and_recall),
which covers many of these concepts.

See also [[confusion-matrix]], [[binary-measures]] (for a selected subset of metrics),
[[mcc]], [[contingency-2x2-measures-all]] (for a broader set of 2x2 table measures).
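A sketch using the count-based and map-based arities (the counts are invented):

```clojure
(require '[fastmath.stats :as stats])

;; tp=10, fn=2, fp=5, tn=80
(def m (stats/binary-measures-all 10 2 5 80))

;; pick out a few of the returned metrics
(select-keys m [:accuracy :mcc :f1-score])

;; the same confusion matrix given as a map
(stats/binary-measures-all {:tp 10 :fn 2 :fp 5 :tn 80})
```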

binomial-ci

(binomial-ci number-of-successes number-of-trials)
(binomial-ci number-of-successes number-of-trials method)
(binomial-ci number-of-successes number-of-trials method alpha)

Calculates a confidence interval for a binomial proportion.

Given the number of observed successes in a fixed number of trials, this function
estimates a confidence interval for the true underlying probability of success (`p`).

Different statistical methods are available for calculating the interval, as the
accuracy and behavior of the interval can vary, especially for small sample sizes
or probabilities close to 0 or 1.

Parameters:

* `number-of-successes` (long): The count of successful outcomes.
* `number-of-trials` (long): The total number of independent trials.
* `method` (keyword, optional): The method used to calculate the confidence interval. Defaults to `:asymptotic`.
* `alpha` (double, optional): The significance level for the interval. The confidence level is `1 - alpha`. Defaults to `0.05` (yielding a 95% CI).

Available `method` values:

* `:asymptotic`: Normal approximation interval (Wald interval), based on the Central Limit Theorem. Simple but can be inaccurate for small samples or probabilities near 0 or 1.
* `:agresti-coull`: An adjustment to the asymptotic interval, adding 'pseudo-counts' to improve performance for small samples.
* `:clopper-pearson`: An exact method based on inverting binomial tests. Provides guaranteed coverage but can be overly conservative (wider than necessary).
* `:wilson`: Score interval, derived from the score test. Generally recommended as a good balance of accuracy and coverage for various sample sizes.
* `:prop.test`: Interval typically used with `prop.test` in R, applies a continuity correction.
* `:cloglog`: Confidence interval based on the complementary log-log transformation.
* `:logit`: Confidence interval based on the logit transformation.
* `:probit`: Confidence interval based on the probit transformation (inverse of the standard normal CDF).
* `:arcsine`: Confidence interval based on the arcsine transformation.
* `:all`: Applies all available methods and returns a map where keys are method keywords and values are their respective confidence intervals (as triplets).

Returns:

* A vector `[lower-bound, upper-bound, estimated-p]`.
  * `lower-bound` (double): The lower limit of the confidence interval.
  * `upper-bound` (double): The upper limit of the confidence interval.
  * `estimated-p` (double): The observed proportion of successes (`number-of-successes / number-of-trials`).

If `method` is `:all`, returns a map of results from each method.

See also [[binomial-test]] for performing a hypothesis test on a binomial proportion.

Calculates a confidence interval for a binomial proportion.

Given the number of observed `successes` in a fixed number of `trials`, this function
estimates a confidence interval for the true underlying probability of success (`p`).

Different statistical methods are available for calculating the interval, as the
accuracy and behavior of the interval can vary, especially for small sample sizes
or probabilities close to 0 or 1.

Parameters:

- `number-of-successes` (long): The count of successful outcomes.
- `number-of-trials` (long): The total number of independent trials.
- `method` (keyword, optional): The method used to calculate the confidence interval.
  Defaults to `:asymptotic`.
- `alpha` (double, optional): The significance level (alpha) for the interval.
  The confidence level is `1 - alpha`. Defaults to `0.05` (yielding a 95% CI).

Available `method` values:

- `:asymptotic`: Normal approximation interval (Wald interval), based on the Central Limit Theorem. Simple but can be inaccurate for small samples or probabilities near 0 or 1.
- `:agresti-coull`: An adjustment to the asymptotic interval, adding 'pseudo-counts' to improve performance for small samples.
- `:clopper-pearson`: An exact method based on inverting binomial tests. Provides guaranteed coverage but can be overly conservative (wider than necessary).
- `:wilson`: Score interval, derived from the score test. Generally recommended as a good balance of accuracy and coverage for various sample sizes.
- `:prop.test`: Interval typically used with `prop.test` in R, applies a continuity correction.
- `:cloglog`: Confidence interval based on the complementary log-log transformation.
- `:logit`: Confidence interval based on the logit transformation.
- `:probit`: Confidence interval based on the probit transformation (inverse of standard normal CDF).
- `:arcsine`: Confidence interval based on the arcsine transformation.
- `:all`: Applies all available methods and returns a map where keys are method keywords and values are their respective confidence intervals (as triplets).

Returns:

- A vector `[lower-bound, upper-bound, estimated-p]`.
  - `lower-bound` (double): The lower limit of the confidence interval.
  - `upper-bound` (double): The upper limit of the confidence interval.
  - `estimated-p` (double): The observed proportion of successes (`number-of-successes / number-of-trials`).

If `method` is `:all`, returns a map of results from each method.

See also [[binomial-test]] for performing a hypothesis test on a binomial proportion.
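For intuition, the default `:asymptotic` (Wald) interval can be sketched from scratch. This is an illustration, not the library's code; the 97.5% normal quantile is hardcoded as 1.96 (for a 95% interval) since the JDK offers no inverse normal CDF:

```clojure
(defn wald-ci
  "Normal-approximation (Wald) 95% CI for a binomial proportion.
   Returns [lower upper estimated-p], mirroring binomial-ci's triplet shape."
  [successes trials]
  (let [p  (/ (double successes) trials)
        se (Math/sqrt (/ (* p (- 1.0 p)) trials))
        z  1.96]
    [(- p (* z se)) (+ p (* z se)) p]))

(wald-ci 7 20)   ;; third element is the observed proportion 7/20 = 0.35
```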

binomial-ci-methods


binomial-test

(binomial-test xs)
(binomial-test xs maybe-params)
(binomial-test number-of-successes
               number-of-trials
               {:keys [alpha p ci-method sides]
                :or {alpha 0.05 p 0.5 ci-method :asymptotic sides :two-sided}})

Performs an exact test of a simple null hypothesis about the probability of success
in a Bernoulli experiment, based on the binomial distribution.

This test assesses the null hypothesis that the true probability of success (`p`)
in the underlying population is equal to a specified value (default 0.5).

The function can be called in two ways:

1. With counts: `(binomial-test number-of-successes number-of-trials params)`
2. With data: `(binomial-test xs params)`, where `xs` is a sequence of outcomes.
   In this case, the outcomes in `xs` are converted to true/false based on the
   `:true-false-conv` parameter (if provided, otherwise numeric 1s are true),
   and the number of successes and total trials are derived from `xs`.

Parameters:

- `number-of-successes` (long): Observed number of successful outcomes.
- `number-of-trials` (long): Total number of trials.
- `xs` (sequence): Sample data (used in the alternative call signature).
- `params` (map, optional): Options map:
  - `:p` (double, default `0.5`): The hypothesized probability of success under the null hypothesis.
  - `:alpha` (double, default `0.05`): Significance level for confidence interval calculation.
  - `:sides` (keyword, default `:two-sided`): Specifies the alternative hypothesis.
    - `:two-sided` (default): True probability `p` is not equal to the hypothesized `p`.
    - `:one-sided-greater`: True probability `p` is greater than the hypothesized `p`.
    - `:one-sided-less`: True probability `p` is less than the hypothesized `p`.
  - `:ci-method` (keyword, default `:asymptotic`): Method used to calculate the confidence interval for the probability of success. See [[binomial-ci]] and [[binomial-ci-methods]] for available options (e.g., `:wilson`, `:clopper-pearson`).
  - `:true-false-conv` (optional, used only with `xs`): A function, set, or map to convert elements of `xs` into boolean `true` (success) or `false` (failure). See [[binary-measures-all]] documentation for details. If `nil` and `xs` contains numbers, `1.0` is treated as success.

Returns a map containing:

- `:p-value`: The probability of observing a result as extreme as, or more extreme than, the observed number of successes, assuming the null hypothesis is true. Calculated using the binomial distribution.
- `:p`: The hypothesized probability of success used in the test.
- `:successes`: The observed number of successes.
- `:trials`: The total number of trials.
- `:alpha`: Significance level used for the confidence interval.
- `:level`: Confidence level (`1 - alpha`).
- `:sides` / `:test-type`: Alternative hypothesis side used.
- `:stat`: The test statistic (the observed number of successes).
- `:estimate`: The observed proportion of successes (`successes / trials`).
- `:ci-method`: Confidence interval method used.
- `:confidence-interval`: A confidence interval for the true probability of success, calculated using the specified `:ci-method` and adjusted for the `:sides` parameter.
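The exact p-value is a tail sum of the binomial distribution. For the `:one-sided-greater` case this is straightforward to write out (a from-scratch sketch of the idea, not the library's implementation):

```clojure
(defn n-choose-k
  "Binomial coefficient C(n, k) computed as a running product (double)."
  [n k]
  (reduce (fn [acc i] (/ (* acc (- (inc n) i)) i))
          1.0 (range 1 (inc k))))

(defn one-sided-greater-p
  "P(X >= successes) for X ~ Binomial(trials, p)."
  [successes trials p]
  (reduce + (for [k (range successes (inc trials))]
              (* (n-choose-k trials k)
                 (Math/pow p k)
                 (Math/pow (- 1.0 p) (- trials k))))))

;; 10 successes in 10 trials under the null p = 0.5: the p-value is 0.5^10
(one-sided-greater-p 10 10 0.5)
```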

bonett-seier-test

(bonett-seier-test xs)
(bonett-seier-test xs params)
(bonett-seier-test xs geary-kurtosis {:keys [sides] :or {sides :two-sided}})

Performs the Bonett-Seier test for normality based on Geary's 'g' kurtosis measure.

This test assesses the null hypothesis that the data comes from a normally
distributed population by checking if the sample Geary's 'g' statistic
significantly deviates from the value expected under normality (`sqrt(2/pi)`).

Parameters:

- `xs` (seq of numbers): The sample data. Requires `(count xs) > 3` for variance calculation.
- `geary-kurtosis` (double, optional): A pre-calculated Geary's 'g' kurtosis value.
  If omitted, it's calculated from `xs`.
- `params` (map, optional): Options map:
  - `:sides` (keyword, default `:two-sided`): Specifies the alternative hypothesis
    regarding the deviation from normal kurtosis.
    - `:two-sided` (default): The population kurtosis (measured by 'g') is different from normal.
    - `:one-sided-greater`: Population is leptokurtic ('g' < sqrt(2/pi)). Note Geary's 'g' decreases with peakedness.
    - `:one-sided-less`: Population is platykurtic ('g' > sqrt(2/pi)). Note Geary's 'g' increases with flatness.

Returns a map containing:

- `:Z`: The final test statistic (approximately standard normal under H0).
- `:stat`: Alias for `:Z`.
- `:p-value`: The p-value associated with `Z` and the specified `:sides`.
- `:kurtosis`: The Geary's 'g' kurtosis value used in the test.
- `:n`: The sample size.
- `:sides`: The alternative hypothesis side used.

References:
- Bonett, D. G., & Seier, E. (2002). A test of normality with high uniform power.
  Computational Statistics & Data Analysis, 40(3), 435-445. (Provides theoretical basis)

See also [[kurtosis]], [[kurtosis-test]], [[normality-test]], [[jarque-bera-test]].
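Geary's 'g' is, in one common formulation, the mean absolute deviation about the mean divided by the population standard deviation; for a normal population this ratio approaches `sqrt(2/pi) ≈ 0.798`. A from-scratch sketch of that ratio (fastmath's own computation may differ in details):

```clojure
(defn gearys-g
  "Mean absolute deviation about the mean over the population standard deviation."
  [xs]
  (let [n   (count xs)
        m   (/ (reduce + xs) (double n))
        mad (/ (reduce + (map #(Math/abs (- % m)) xs)) n)
        sd  (Math/sqrt (/ (reduce + (map #(let [d (- % m)] (* d d)) xs)) n))]
    (/ mad sd)))

;; For a symmetric two-point sample the ratio is exactly 1
(gearys-g [0.0 0.0 1.0 1.0])   ;; => 1.0
```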

bootstrap (deprecated)

(bootstrap vs)
(bootstrap vs samples)
(bootstrap vs samples size)

Generates a set of samples of the given `size` from the provided data.

Default `samples` is 200; `size` defaults to the size of the input data.

bootstrap-ci (deprecated)

(bootstrap-ci vs)
(bootstrap-ci vs alpha)
(bootstrap-ci vs alpha samples)
(bootstrap-ci vs alpha samples stat-fn)

Bootstrap method to calculate a confidence interval.

`alpha` defaults to 0.98, `samples` to 1000. The last parameter is the statistic function used, default: [[mean]].

Returns the confidence interval and the value of the statistic function.

box-cox-infer-lambda

(box-cox-infer-lambda xs)
(box-cox-infer-lambda xs lambda-range)
(box-cox-infer-lambda xs lambda-range opts)

Finds the optimal lambda (λ) parameter for the Box-Cox transformation of a dataset using the Maximum Likelihood Estimation (MLE) method.

The Box-Cox transformation is a family of power transformations often applied to positive data to make it more closely resemble a normal distribution and stabilize variance. This function estimates the lambda value that maximizes the log-likelihood function of the transformed data, assuming the transformed data is normally distributed.

Parameters:

- `xs` (sequence of numbers): The input numerical data sequence.
- `lambda-range` (vector of two numbers, optional): A sequence `[min-lambda, max-lambda]` defining the closed interval within which the optimal lambda is searched. Defaults to `[-3.0, 3.0]`.
- `opts` (map, optional): Additional options affecting the data used for the likelihood calculation. These options are passed to the internal data preparation step. Key options include:
  - `:alpha` (double, default 0.0): A constant value added to `xs` before estimating lambda. This is often used when `xs` contains zero or negative values and the standard Box-Cox (which requires positive input) is desired, or to explore transformations around a shifted location.
  - `:negative?` (boolean, default `false`): If `true`, indicates that the likelihood is estimated based on the modified Box-Cox transformation (Bickel and Doksum approach) suitable for negative values. The estimation process will work with the absolute values of the data shifted by `:alpha`.

Returns the estimated optimal lambda value as a double.

The inferred lambda value can then be used as the `lambda` parameter for the [[box-cox-transformation]] function to apply the actual transformation to the dataset.

See also [[box-cox-transformation]], [[yeo-johnson-infer-lambda]], [[yeo-johnson-transformation]].
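Putting the two steps together, based on the signatures documented above (hypothetical data and namespace alias; numeric results depend on the data and the implementation):

```clojure
(require '[fastmath.stats :as stats])

;; Hypothetical positive, right-skewed data
(def xs [0.5 1.2 1.3 2.0 2.4 3.1 5.8 9.7 14.2])

;; 1. Infer lambda by MLE over the default range [-3.0, 3.0]
(def lambda (stats/box-cox-infer-lambda xs))

;; 2. Apply the transformation with the inferred lambda
(stats/box-cox-transformation xs lambda)
```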

box-cox-transformation

(box-cox-transformation xs)
(box-cox-transformation xs lambda)
(box-cox-transformation xs lambda {:keys [scaled? inverse?] :as opts})

Applies the Box-Cox transformation to data.

The Box-Cox transformation is a family of power transformations used to stabilize variance and make data more normally distributed.

Parameters:

- `xs` (seq of numbers): The input data.
- `lambda` (default `0.0`): The power parameter. If `nil` or `[lambda-min, lambda-max]`, `lambda` is inferred using maximum log-likelihood.
- Options map:
  - `:alpha` (optional): A shift parameter applied before transformation.
  - `:scaled?` (default `false`): Scale by the geometric mean or any other given number.
  - `:negative?` (default `false`): Allow negative values.
  - `:inverse?` (default `false`): Perform the inverse operation; in this case `lambda` can't be inferred.

Returns transformed data.

Related: [[yeo-johnson-transformation]]
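The core transform itself is a one-liner per point: `(x^λ - 1)/λ` for `λ ≠ 0` and `ln x` for `λ = 0`. A from-scratch sketch of just that formula (ignoring the shift/scale/negative options):

```clojure
(defn box-cox
  "Basic Box-Cox transform of a single positive value x for parameter lambda."
  [lambda x]
  (if (zero? lambda)
    (Math/log x)
    (/ (- (Math/pow x lambda) 1.0) lambda)))

(box-cox 0.0 1.0)                          ;; log transform of 1 => 0.0
(map (partial box-cox 1.0) [1.0 4.0 9.0])  ;; shift by -1 => (0.0 3.0 8.0)
(box-cox 0.5 4.0)                          ;; square-root-like transform
```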

brown-forsythe-test

(brown-forsythe-test xss)
(brown-forsythe-test xss params)

Brown-Forsythe test for homogeneity of variances.

This test is a modification of Levene's test, using the median instead of the mean
for calculating the spread within each group. This makes the test more robust
against non-normally distributed data.

Calls [[levene-test]] with `:statistic` set to [[median]]. Accepts the same parameters
as [[levene-test]], except for `:statistic`.

Parameters:
- `xss` (sequence of sequences): A collection of data groups.
- `params` (map, optional): Options map (see [[levene-test]]).
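The modification amounts to replacing every observation with its absolute deviation from the group median before the ANOVA step. A sketch of just that centering step (illustrative; `median*` is a local helper, not the library's [[median]]):

```clojure
(defn median*
  "Median of a numeric sequence (helper for this sketch)."
  [xs]
  (let [s (vec (sort xs)) n (count s) h (quot n 2)]
    (if (odd? n) (s h) (/ (+ (s (dec h)) (s h)) 2.0))))

(defn median-abs-deviations
  "Brown-Forsythe centering: |x - median| within one group."
  [group]
  (let [med (median* group)]
    (map #(Math/abs (- % med)) group)))

;; The median-based spread is robust to the outlier in the first group
(map median-abs-deviations [[1.0 2.0 9.0] [4.0 5.0 6.0]])
;; => ((1.0 0.0 7.0) (1.0 0.0 1.0))
```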

chisq-test

(chisq-test contingency-table-or-xs)
(chisq-test contingency-table-or-xs params)

Chi-squared test, a power divergence test with `lambda` = 1.0.

Performs a power divergence test, which encompasses several common statistical tests like Chi-squared, G-test (likelihood ratio), etc., based on the lambda parameter. This function can perform either a goodness-of-fit test or a test for independence in a contingency table.

Usage:

1. **Goodness-of-Fit (GOF):**
   - Input: `observed-counts` (sequence of numbers) and `:p` (expected probabilities/weights).
   - Input: `data` (sequence of numbers) and `:p` (a distribution object). In this case, a histogram of `data` is created (controlled by `:bins`) and compared against the probability mass/density of the distribution in those bins.

2. **Test for Independence:**
   - Input: `contingency-table` (2D sequence or map format). The `:p` option is ignored.

Options map:

- `:lambda` (double, default: `2/3`): Determines the specific test statistic. Common values:
  - `1.0`: Pearson Chi-squared test ([[chisq-test]]).
  - `0.0`: G-test / Multinomial Likelihood Ratio test ([[multinomial-likelihood-ratio-test]]).
  - `-0.5`: Freeman-Tukey test ([[freeman-tukey-test]]).
  - `-1.0`: Minimum Discrimination Information test ([[minimum-discrimination-information-test]]).
  - `-2.0`: Neyman Modified Chi-squared test ([[neyman-modified-chisq-test]]).
  - `2/3`: Cressie-Read test (default, [[cressie-read-test]]).
- `:p` (seq of numbers or distribution): Expected probabilities/weights (for GOF with counts) or a `fastmath.random` distribution object (for GOF with data). Ignored for independence tests.
- `:alpha` (double, default: `0.05`): Significance level for confidence intervals.
- `:ci-sides` (keyword, default: `:two-sided`): Sides for bootstrap confidence intervals (`:two-sided`, `:one-sided-greater`, `:one-sided-less`).
- `:sides` (keyword, default: `:one-sided-greater`): Alternative hypothesis side for the p-value calculation against the Chi-squared distribution (`:one-sided-greater`, `:one-sided-less`, `:two-sided`).
- `:bootstrap-samples` (long, default: `1000`): Number of bootstrap samples for confidence interval estimation.
- `:ddof` (long, default: `0`): Delta degrees of freedom. Adjustment subtracted from the calculated degrees of freedom.
- `:bins` (number, keyword, or seq): Used only for GOF test against a distribution. Specifies the number of bins, an estimation method (see [[histogram]]), or explicit bin edges for histogram creation.

Returns a map containing:

- `:stat`: The calculated power divergence test statistic.
- `:chi2`: Alias for `:stat`.
- `:df`: Degrees of freedom for the test.
- `:p-value`: The p-value associated with the test statistic.
- `:n`: Total number of observations.
- `:estimate`: Observed proportions.
- `:expected`: Expected counts or proportions under the null hypothesis.
- `:confidence-interval`: Bootstrap confidence intervals for the observed proportions.
- `:lambda`, `:alpha`, `:sides`, `:ci-sides`: Input options used.
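At `lambda` = 1.0 the power-divergence statistic reduces to the familiar Pearson form, the sum over cells of `(observed - expected)^2 / expected`. A from-scratch sketch on a fair-die goodness-of-fit example:

```clojure
(defn pearson-chi2
  "Pearson chi-squared statistic from observed and expected counts."
  [observed expected]
  (reduce + (map (fn [o e] (let [d (- o e)] (/ (* d d) e)))
                 observed expected)))

;; 60 die rolls, expecting 10 per face under fairness
(pearson-chi2 [8 9 10 11 12 10] (repeat 6 10.0))
;; => 1.0
```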

ci

(ci vs)
(ci vs alpha)

Student's t-based confidence interval for given data. `alpha` defaults to 0.05.

The last value of the returned interval is the mean.


cliffs-delta

(cliffs-delta [group1 group2])
(cliffs-delta group1 group2)

Calculates Cliff's Delta (δ), a non-parametric effect size measure for assessing the difference between two groups of ordinal or continuous data.

Cliff's Delta quantifies the degree of overlap between two distributions. It represents the probability that a randomly chosen value from the first group is greater than a randomly chosen value from the second group, minus the reverse probability.

Parameters:

- `group1` (seq of numbers): The first sample.
- `group2` (seq of numbers): The second sample.

Returns the calculated Cliff's Delta value as a double.

Interpretation:

- A value of +1 indicates complete separation where every value in `group1` is greater than every value in `group2`.
- A value of -1 indicates complete separation where every value in `group2` is greater than every value in `group1`.
- A value of 0 indicates complete overlap between the distributions.
- Values between -1 and 1 indicate varying degrees of overlap. Cohen (1988) suggested guidelines for effect size: |δ| < 0.147 (negligible), 0.147 ≤ |δ| < 0.33 (small), 0.33 ≤ |δ| < 0.474 (medium), |δ| ≥ 0.474 (large).

Cliff's Delta is a robust measure, suitable for ordinal data or when assumptions of parametric tests (like normality or equal variances) are violated. It is closely related to the [[wmw-odds]] (Wilcoxon-Mann-Whitney odds) and the [[ameasure]] (Vargha-Delaney A).

See also [[wmw-odds]], [[ameasure]], [[cohens-d]], [[glass-delta]].
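The definition translates almost directly into code. A from-scratch sketch over all cross-group pairs (quadratic in the input sizes; the library likely computes it more efficiently):

```clojure
(defn cliffs-delta-ref
  "Proportion of pairs (x, y) with x > y minus the proportion with x < y."
  [group1 group2]
  (let [signs (for [x group1 y group2]
                (Math/signum (double (compare x y))))]
    (/ (reduce + signs) (count signs))))

(cliffs-delta-ref [10 11 12] [1 2 3])  ;; complete separation => 1.0
(cliffs-delta-ref [1 2 3] [1 2 3])     ;; complete overlap => 0.0
```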

coefficient-matrix

(coefficient-matrix vss)
(coefficient-matrix vss measure-fn)
(coefficient-matrix vss measure-fn symmetric?)


cohens-d

(cohens-d [group1 group2])
(cohens-d group1 group2)
(cohens-d group1 group2 method)

Calculate Cohen's d effect size between two groups.

Cohen's d is a standardized measure used to quantify the magnitude of the difference between the means of two independent groups. It expresses the mean difference in terms of standard deviation units.

The most common formula for Cohen's d is:

d = (mean(group1) - mean(group2)) / pooled_stddev

where pooled_stddev is the pooled standard deviation of the two groups, calculated under the assumption of equal variances.

Parameters:

  • group1 (seq of numbers): The first independent sample.
  • group2 (seq of numbers): The second independent sample.
  • method (optional keyword): Specifies the method for calculating the pooled standard deviation, affecting the denominator of the formula. Possible values are :unbiased (default), :biased, or :avg. See pooled-stddev for details on these methods.

Returns the calculated Cohen's d effect size as a double.

Interpretation guidelines (approximate for normal distributions):

  • |d| = 0.2: small effect
  • |d| = 0.5: medium effect
  • |d| = 0.8: large effect

Assumptions:

  • The two samples are independent.
  • Data within each group are approximately normally distributed.
  • The choice of :method implies assumptions about equal variances (default :unbiased and :biased assume equal variances, while :avg does not but might be less standard).

See also hedges-g (a version bias-corrected for small sample sizes), glass-delta (an alternative effect size measure using the control group standard deviation), pooled-stddev.
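The formula above can be sketched in plain Clojure for the unbiased pooled standard deviation; the helper names (`mean*`, `variance*`, `cohens-d-sketch`) are illustrative, not part of fastmath:

```clojure
(defn mean* [xs] (/ (reduce + xs) (double (count xs))))

;; Unbiased sample variance (divides by n - 1).
(defn variance* [xs]
  (let [m (mean* xs)]
    (/ (reduce + (map #(let [d (- % m)] (* d d)) xs))
       (dec (count xs)))))

;; d = (mean1 - mean2) / pooled-sd, pooling with (n1-1) and (n2-1) weights.
(defn cohens-d-sketch [g1 g2]
  (let [n1 (count g1) n2 (count g2)
        pooled (Math/sqrt (/ (+ (* (dec n1) (variance* g1))
                                (* (dec n2) (variance* g2)))
                             (- (+ n1 n2) 2)))]
    (/ (- (mean* g1) (mean* g2)) pooled)))

(cohens-d-sketch [2 4 6] [1 3 5])
;; => 0.5 (a mean difference of 1 against a pooled stddev of 2)
```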


cohens-d-corrected

(cohens-d-corrected [group1 group2])
(cohens-d-corrected group1 group2)
(cohens-d-corrected group1 group2 method)

Calculates Cohen's d effect size corrected for bias in small sample sizes.

This function applies a correction factor (derived from the gamma function) to Cohen's d (cohens-d) to provide a less biased estimate of the population effect size when sample sizes are small. This corrected measure is sometimes referred to as Hedges' g, though this function specifically implements the correction applied to Cohen's d.

The correction factor is (1 - 3 / (4 * df - 1)) where df is the degrees of freedom used in the standard Cohen's d calculation.

Parameters:

  • group1 (seq of numbers): The first independent sample.
  • group2 (seq of numbers): The second independent sample.
  • method (optional keyword): Specifies the method for calculating the pooled standard deviation, affecting the denominator of the formula (passed to cohens-d). Possible values are :unbiased (default), :biased, or :avg. See pooled-stddev for details on these methods.

Returns the calculated bias-corrected Cohen's d effect size as a double.

Note: While this function is named cohens-d-corrected, Hedges' g (calculated by hedges-g-corrected) also applies a similar small-sample bias correction. Differences might exist based on the specific correction formula or degree of freedom definition used. This function uses (count group1) + (count group2) - 2 as the degrees of freedom for the correction by default (when :unbiased method is used for cohens-d).

See also cohens-d, hedges-g, hedges-g-corrected.
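The correction factor described above can be sketched in plain Clojure, using df = n1 + n2 - 2 as stated (the name `corrected-d-sketch` is illustrative, not part of fastmath):

```clojure
;; Apply the small-sample bias correction (1 - 3/(4*df - 1)) to a
;; pre-computed Cohen's d value.
(defn corrected-d-sketch [d n1 n2]
  (let [df (- (+ n1 n2) 2)]
    (* d (- 1.0 (/ 3.0 (- (* 4.0 df) 1.0))))))

(corrected-d-sketch 0.5 10 10)
;; df = 18, factor = 1 - 3/71, so the result is roughly 0.479
```

As the factor approaches 1 with growing samples, the corrected value converges to the uncorrected Cohen's d.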


cohens-f

(cohens-f [group1 group2])
(cohens-f group1 group2)
(cohens-f group1 group2 type)

Calculates Cohen's f, a measure of effect size derived as the square root of Cohen's f² (cohens-f2).

Cohen's f is a standardized measure quantifying the magnitude of an effect, often used in the context of ANOVA or regression. It is the square root of the ratio of the variance explained by the effect to the unexplained variance.

Parameters:

  • group1 (seq of numbers): The dependent variable.
  • group2 (seq of numbers): The independent variable (or predictor). Must have the same length as group1.
  • type (keyword, optional): Specifies the measure of 'Proportion of Variance Explained' used in the underlying cohens-f2 calculation. Defaults to :eta.
    • :eta (default): Uses Eta-squared (sample R²), a measure of variance explained in the sample.
    • :omega: Uses Omega-squared, a less biased estimate of variance explained in the population.
    • :epsilon: Uses Epsilon-squared, another less biased estimate of variance explained in the population.
    • Any function: A function accepting group1 and group2 and returning a double representing the proportion of variance explained.

Returns the calculated Cohen's f effect size as a double. Values range from 0 upwards.

Interpretation:

  • Values are positive. Larger values indicate a stronger effect (more variance in group1 explained by group2).
  • Cohen's guidelines for interpreting the magnitude of f² (and by extension, f) are:
    • $f = 0.10$ (approx. $f^2 = 0.01$): small effect
    • $f = 0.25$ (approx. $f^2 = 0.0625$): medium effect
    • $f = 0.40$ (approx. $f^2 = 0.16$): large effect (Note: Guidelines are often quoted for f², interpret f as $\sqrt{f^2}$)

See also cohens-f2, eta-sq, omega-sq, epsilon-sq.


cohens-f2

(cohens-f2 [group1 group2])
(cohens-f2 group1 group2)
(cohens-f2 group1 group2 type)

Calculates Cohen's f², a measure of effect size often used in ANOVA or regression.

Cohen's f² quantifies the magnitude of the effect of an independent variable or set of predictors on a dependent variable, expressed as the ratio of the variance explained by the effect to the unexplained variance.

This function allows calculating f² using different measures for the 'Proportion of Variance Explained', specified by the type parameter:

  • :eta (default): Uses eta-sq (Eta-squared), which in this implementation is equivalent to the sample $R^2$ from a linear regression of group1 on group2. This is a measure of the proportion of variance explained in the sample.
  • :omega: Uses omega-sq (Omega-squared), a less biased estimate of the proportion of variance explained in the population.
  • :epsilon: Uses epsilon-sq (Epsilon-squared), another less biased estimate of the proportion of variance explained in the population, similar to adjusted $R^2$.
  • Any function: A function accepting group1 and group2 and returning a double representing the proportion of variance explained.

Parameters:

  • group1 (seq of numbers): The dependent variable.
  • group2 (seq of numbers): The independent variable (or predictor). Must have the same length as group1.
  • type (keyword, optional): Specifies the measure of 'Proportion of Variance Explained' to use (:eta, :omega, :epsilon or any function). Defaults to :eta.

Returns the calculated Cohen's f² effect size as a double. Values range from 0 upwards.

Interpretation Guidelines (approximate, often used for F-tests in ANOVA/regression):

  • $f^2 = 0.02$: small effect
  • $f^2 = 0.15$: medium effect
  • $f^2 = 0.35$: large effect

See also cohens-f, eta-sq, omega-sq, epsilon-sq.
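The "ratio of explained to unexplained variance" can be sketched in plain Clojure, given a proportion-of-variance-explained value such as eta-squared (the name `cohens-f2-sketch` is illustrative, not part of fastmath):

```clojure
;; f² = PVE / (1 - PVE), where PVE is e.g. eta-squared (sample R²).
(defn cohens-f2-sketch [pve]
  (/ pve (- 1.0 pve)))

(cohens-f2-sketch 0.13)
;; R² = 0.13 yields f² ≈ 0.149, close to the "medium" guideline of 0.15
```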


cohens-kappa

(cohens-kappa contingency-table)
(cohens-kappa group1 group2)

Calculates Cohen's Kappa coefficient (κ), a statistic that measures inter-rater agreement for categorical items, while correcting for chance agreement.

It is often used to assess the consistency of agreement between two raters or methods. Its value typically ranges from -1 to +1:

  • κ = 1: Perfect agreement.
  • κ = 0: Agreement is no better than chance.
  • κ < 0: Agreement is worse than chance.

The function can be called in two ways:

  1. With two sequences group1 and group2: The function will automatically construct a 2x2 contingency table from the unique values in the sequences (assuming they represent two binary variables). The mapping of values to table cells (e.g., what corresponds to TP, TN, FP, FN) depends on how contingency-table orders the unique values. For direct control over which cell is which, use the contingency table input.

  2. With a contingency table: The contingency table can be provided as:

    • A map where keys are [row-index, column-index] tuples and values are counts (e.g., {[0 0] TP, [0 1] FP, [1 0] FN, [1 1] TN}). This is the output format of contingency-table with two inputs. The mapping of indices to TP/TN/FP/FN depends on the order of unique values in the original data if generated by contingency-table, or the explicit structure if created manually or via rows->contingency-table. Standard convention maps [0 0] to TP, [0 1] to FP, [1 0] to FN, and [1 1] to TN for binary outcomes.
    • A sequence of sequences representing the rows of the table (e.g., [[TP FP] [FN TN]]). This is equivalent to rows->contingency-table.

Parameters:

  • group1 (sequence): The first sequence of binary outcomes/categories.
  • group2 (sequence): The second sequence of binary outcomes/categories. Must have the same length as group1.
  • contingency-table (map or sequence of sequences): A pre-computed 2x2 contingency table. The cell values should represent counts (e.g., TP, FN, FP, TN).

Returns the calculated Cohen's Kappa coefficient as a double.

See also weighted-kappa (for ordinal data with partial agreement), contingency-table, contingency-2x2-measures, binary-measures-all.
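For a 2x2 table in the `[[TP FP] [FN TN]]` row layout described above, the chance-corrected agreement can be sketched in plain Clojure (the name `kappa-sketch` is illustrative, not the fastmath implementation):

```clojure
;; kappa = (po - pe) / (1 - pe), where po is the observed agreement
;; (diagonal proportion) and pe is the agreement expected by chance
;; from the row and column marginals.
(defn kappa-sketch [[[tp fp] [fng tn]]]
  (let [n  (double (+ tp fp fng tn))
        po (/ (+ tp tn) n)
        pe (+ (* (/ (+ tp fp) n) (/ (+ tp fng) n))
              (* (/ (+ fng tn) n) (/ (+ fp tn) n)))]
    (/ (- po pe) (- 1.0 pe))))

(kappa-sketch [[20 5] [10 15]])
;; po = 0.7, pe = 0.5, so kappa ≈ 0.4
```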


cohens-q

(cohens-q r1 r2)
(cohens-q group1 group2a group2b)
(cohens-q group1a group2a group1b group2b)

Compares two correlation coefficients by calculating the difference between their Fisher z-transformations.

The Fisher z-transformation (atanh) of a correlation coefficient r helps normalize the sampling distribution of correlation coefficients. The difference between two z'-transformed correlations is often used as a test statistic.

The function supports comparing correlations in different scenarios via its arities:

  • (cohens-q r1 r2): Calculates the difference between the Fisher z-transformations of two correlation values r1 and r2 provided directly. This is typically used when comparing two independent correlation coefficients (e.g., correlations from two separate studies). Returns atanh(r1) - atanh(r2).
    • r1, r2 (double): Correlation coefficient values (-1.0 to 1.0).
  • (cohens-q group1 group2a group2b): Calculates the difference between the correlation of group1 with group2a and the correlation of group1 with group2b. This is commonly used for comparing dependent correlations (where group1 is a common variable). Calculates atanh(pearson-correlation(group1, group2a)) - atanh(pearson-correlation(group1, group2b)).
    • group1, group2a, group2b (sequences): Data sequences from which Pearson correlations are computed.
  • (cohens-q group1a group2a group1b group2b): Calculates the difference between the correlation of group1a with group2a and the correlation of group1b with group2b. This is typically used for comparing two independent correlations obtained from two distinct pairs of variables (all four sequences are independent). Calculates atanh(pearson-correlation(group1a, group2a)) - atanh(pearson-correlation(group1b, group2b)).
    • group1a, group2a, group1b, group2b (sequences): Data sequences from which Pearson correlations are computed.

Returns the difference between the Fisher z-transformed correlation values as a double.

Note: For comparing dependent correlations (3-arity case), standard statistical tests (e.g., Steiger's test) are more complex than a simple difference of z-transforms and involve the correlation between group2a and group2b. This function provides the basic difference value.
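The 2-arity case reduces to simple arithmetic on the z-transforms, which can be sketched in plain Clojure (the name `cohens-q-sketch` is illustrative, not the fastmath implementation):

```clojure
;; q = atanh(r1) - atanh(r2), with atanh written out explicitly.
(defn cohens-q-sketch [r1 r2]
  (letfn [(atanh [r] (* 0.5 (Math/log (/ (+ 1.0 r) (- 1.0 r)))))]
    (- (atanh r1) (atanh r2))))

(cohens-q-sketch 0.5 0.3)
;; atanh(0.5) ≈ 0.5493, atanh(0.3) ≈ 0.3095, so q ≈ 0.2398
```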


cohens-u1

(cohens-u1 [group1 group2])
(cohens-u1 group1 group2)

Calculates a non-parametric measure of difference or separation between two samples.

This function computes a value derived from cohens-u2, which internally quantifies a minimal difference between corresponding quantiles of the two empirical distributions.

Parameters:

  • group1 (seq of numbers): The first sample.
  • group2 (seq of numbers): The second sample.

Returns the calculated measure as a double.

Interpretation:

  • Values close to 0 indicate high similarity or maximum overlap between the distributions (as the minimal difference between quantiles approaches zero).

This measure is symmetric, meaning the order of group1 and group2 does not affect the result. It is a non-parametric measure applicable to any data samples.

See also cohens-u2 (the measure this calculation is based on), cohens-u3 (related non-parametric measure), cohens-u1-normal (the version applicable to normal data).


cohens-u1-normal

(cohens-u1-normal d)
(cohens-u1-normal group1 group2)
(cohens-u1-normal group1 group2 method)

Calculates Cohen's U1, a measure of non-overlap between two distributions assumed to be normal with equal variances.

Cohen's U1 quantifies the proportion of the two distributions that does not overlap. A U1 of 0 means complete overlap (the distributions are identical), while a U1 of 1 means no overlap at all.

This measure is calculated directly from Cohen's d statistic (cohens-d) assuming normal distributions and equal variances.

Parameters:

  • group1 (seq of numbers): The first sample.
  • group2 (seq of numbers): The second sample.
  • method (optional keyword): Specifies the method for calculating the pooled standard deviation used in the underlying cohens-d calculation. Possible values are :unbiased (default), :biased, or :avg. See pooled-stddev for details.
  • d (double): A pre-calculated Cohen's d value. If provided, group1, group2, and method are ignored.

Returns the calculated Cohen's U1 as a double [0, 1].

Assumptions:

  • Both samples are drawn from normally distributed populations.
  • The populations have equal variances (homoscedasticity).

See also cohens-d, cohens-u2-normal, cohens-u3-normal, p-overlap (a non-parametric overlap measure).


cohens-u2

(cohens-u2 [group1 group2])
(cohens-u2 group1 group2)

Calculates a measure of overlap between two samples, referred to as Cohen's U2.

This function quantifies the degree to which the distributions of group1 and group2 overlap. It is related to comparing values at corresponding percentile levels across the two groups or the proportion of values in one group that are below the median of the other. A value near 0.5 indicates complete overlap (the distributions are essentially identical), while values approaching 0 or 1 indicate greater separation.

The measure is symmetric, meaning (cohens-u2 group1 group2) is equal to (cohens-u2 group2 group1).

This is a non-parametric measure, suitable for any data samples, and does not assume normality, unlike cohens-u2-normal.

Parameters:

  • group1, group2 (sequences): The two samples directly as arguments.

Returns the calculated Cohen's U2 value as a double. The value typically ranges from 0 to 1. A value closer to 0.5 indicates substantial overlap between the distributions (e.g., the median of one group is near the median of the other); values closer to 0 or 1 indicate less overlap (greater separation between the distributions).


cohens-u2-normal

(cohens-u2-normal d)
(cohens-u2-normal group1 group2)
(cohens-u2-normal group1 group2 method)

Calculates Cohen's U2, a measure of overlap between two distributions assumed to be normal with equal variances.

Cohen's U2 quantifies the proportion of scores in the lower-scoring group that are below the point located halfway between the means of the two groups (or equivalently, the proportion of scores in the higher-scoring group that are above this halfway point). This measure is calculated from Cohen's d statistic ([[cohens-d]]) using the standard normal cumulative distribution function ($\Phi$): $\Phi(0.5 |d|)$.

Parameters:

- `group1` (seq of numbers): The first sample.
- `group2` (seq of numbers): The second sample.
- `method` (optional keyword): Specifies the method for calculating the pooled standard deviation used in the underlying [[cohens-d]] calculation. Possible values are `:unbiased` (default), `:biased`, or `:avg`. See [[pooled-stddev]] for details.
- `d` (double): A pre-calculated Cohen's d value. If provided, `group1`, `group2`, and `method` are ignored.

Returns the calculated Cohen's U2 as a double [0.0, 1.0]. A value closer to 0.5 indicates greater overlap between the distributions; values closer to 0 or 1 indicate less overlap.

Assumptions:
- Both samples are drawn from normally distributed populations.
- The populations have equal variances (homoscedasticity).

See also [[cohens-d]], [[cohens-u1-normal]], [[cohens-u3-normal]], [[p-overlap]] (a non-parametric overlap measure).
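A sketch of the three arities (the `stats` alias and sample data are illustrative assumptions):

```clojure
(require '[fastmath.stats :as stats])

(def g1 [4.1 4.8 5.0 5.3 5.9 6.2])
(def g2 [5.5 6.0 6.4 6.8 7.1 7.9])

(stats/cohens-u2-normal g1 g2)       ;; default :unbiased pooled stddev
(stats/cohens-u2-normal g1 g2 :avg)  ;; alternative pooling method
(stats/cohens-u2-normal 0.8)         ;; from a pre-computed Cohen's d
```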
sourceraw docstring

cohens-u3clj

(cohens-u3 [group1 group2])
(cohens-u3 group1 group2)
(cohens-u3 group1 group2 estimation-strategy)

Calculates Cohen's U3 for two samples.

In this implementation, Cohen's U3 is defined as the proportion of values
in the second sample (`group2`) that are less than the median of the first
sample (`group1`).

Parameters:

- `group1` (seq of numbers): The first sample. The median of this sample is used as the threshold.
- `group2` (seq of numbers): The second sample. Values from this sample are counted if they fall below the median of `group1`.
- `estimation-strategy` (optional keyword): The strategy used to estimate the median of `group1`.
  Defaults to `:legacy`. See [[median]] or [[quantile]] for available strategies
  (e.g., `:r1` through `:r9`).

Returns the calculated proportion as a double between 0.0 and 1.0.

Interpretation:

- A value close to 0 means most values in `group2` are greater than or equal to the median of `group1`.
- A value close to 0.5 means approximately half the values in `group2` are below the median of `group1`.
- A value close to 1 means most values in `group2` are less than the median of `group1`.

Note: This measure is **not symmetric**. `(cohens-u3 group1 group2)` is generally
not equal to `(cohens-u3 group2 group1)`.

This is a non-parametric measure, suitable for any data samples, and does not
assume normality, unlike [[cohens-u3-normal]].

See also [[cohens-u3-normal]] (the version applicable to normal data), [[cohens-u2]]
(a related symmetric non-parametric measure), [[median]], [[quantile]].
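A sketch showing the asymmetry of the measure (the `stats` alias and sample vectors are illustrative assumptions):

```clojure
(require '[fastmath.stats :as stats])

(def g1 [4.1 4.8 5.0 5.3 5.9 6.2])
(def g2 [5.5 6.0 6.4 6.8 7.1 7.9])

(stats/cohens-u3 g1 g2)      ;; proportion of g2 below the median of g1
(stats/cohens-u3 g2 g1)      ;; generally a different value
(stats/cohens-u3 g1 g2 :r7)  ;; alternative median estimation strategy
```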
sourceraw docstring

cohens-u3-normalclj

(cohens-u3-normal d)
(cohens-u3-normal group1 group2)
(cohens-u3-normal group1 group2 method)

Calculates Cohen's U3, a measure of overlap between two distributions assumed to be normal with equal variances.

Cohen's U3 quantifies the proportion of scores in the lower-scoring group that fall
below the mean of the higher-scoring group. It is calculated from Cohen's d statistic
([[cohens-d]]) using the standard normal cumulative distribution function ($\Phi$):
`U3 = Φ(d)`.

The measure is asymmetric: `U3(group1, group2)` is not necessarily equal to
`U3(group2, group1)`. The interpretation depends on which group is considered
the 'higher-scoring' one based on the sign of d. By convention, the result
often represents the proportion of the *first* group (`group1`) that is below the
mean of the *second* group (`group2`) if d is negative, or the proportion of the
*second* group (`group2`) that is below the mean of the *first* group (`group1`) if d is positive.

Parameters:

- `group1` (seq of numbers): The first sample.
- `group2` (seq of numbers): The second sample.
- `method` (optional keyword): Specifies the method for calculating the pooled standard deviation
  used in the underlying [[cohens-d]] calculation. Possible values are `:unbiased` (default),
  `:biased`, or `:avg`. See [[pooled-stddev]] for details.
- `d` (double): A pre-calculated Cohen's d value. If provided, `group1`, `group2`, and `method` are ignored.

Returns the calculated Cohen's U3 as a double [0.0, 1.0].
A value close to 0.5 suggests significant overlap. Values closer to 0 or 1 suggest
less overlap (greater separation between the means).

Assumptions:
- Both samples are drawn from normally distributed populations.
- The populations have equal variances (homoscedasticity).

See also [[cohens-d]], [[cohens-u1-normal]], [[cohens-u2-normal]], [[p-overlap]] (a non-parametric overlap measure).
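A sketch of the arities, mirroring [[cohens-u2-normal]] (the `stats` alias and sample data are illustrative assumptions):

```clojure
(require '[fastmath.stats :as stats])

(def g1 [4.1 4.8 5.0 5.3 5.9 6.2])
(def g2 [5.5 6.0 6.4 6.8 7.1 7.9])

(stats/cohens-u3-normal g1 g2)       ;; default :unbiased pooled stddev
(stats/cohens-u3-normal g1 g2 :avg)  ;; alternative pooling method
(stats/cohens-u3-normal 0.8)         ;; from a pre-computed Cohen's d
```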
sourceraw docstring

cohens-wclj

(cohens-w contingency-table)
(cohens-w group1 group2)

Calculates Cohen's W effect size for the association between two nominal
variables represented in a contingency table.

Cohen's W is a measure of association derived from the Pearson's Chi-squared
statistic. It quantifies the magnitude of the difference between the observed
frequencies and the frequencies expected under the assumption of independence
between the variables.

Its value ranges from 0 upwards:
- A value of 0 indicates no association between the variables.
- Larger values indicate a stronger association.

The function can be called in two ways:

1.  With two sequences `group1` and `group2`:
    The function will automatically construct a contingency table from
    the unique values in the sequences.
2.  With a contingency table:
    The contingency table can be provided as:
    - A map where keys are `[row-index, column-index]` tuples and values are counts
      (e.g., `{[0 0] 10, [0 1] 5, [1 0] 3, [1 1] 12}`). This is the output format
      of [[contingency-table]] with two inputs.
    - A sequence of sequences representing the rows of the table
      (e.g., `[[10 5] [3 12]]`). This is equivalent to [[rows->contingency-table]].

Parameters:

- `group1` (sequence): The first sequence of categorical data.
- `group2` (sequence): The second sequence of categorical data. Must have the same length as `group1`.
- `contingency-table` (map or sequence of sequences): A pre-computed contingency table.

Returns the calculated Cohen's W coefficient as a double.

See also [[chisq-test]], [[cramers-v]], [[cramers-c]], [[tschuprows-t]], [[contingency-table]].
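A sketch of both calling conventions (the `stats` alias and the data are illustrative assumptions):

```clojure
(require '[fastmath.stats :as stats])

;; from two sequences of categorical data
(stats/cohens-w [:a :a :b :b :a] [:x :y :x :x :y])

;; from a pre-computed contingency table given as rows
(stats/cohens-w [[10 5] [3 12]])
```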
sourceraw docstring

confusion-matrixclj

(confusion-matrix confusion-mat)
(confusion-matrix actual prediction)
(confusion-matrix actual prediction encode-true)
(confusion-matrix tp fn fp tn)

Creates a 2x2 confusion matrix for binary classification.

A confusion matrix summarizes the results of a binary classification problem, showing
the counts of True Positives (TP), False Positives (FP), False Negatives (FN),
and True Negatives (TN).

TP: Actual is True, Predicted is True
FP: Actual is False, Predicted is True (Type I error)
FN: Actual is True, Predicted is False (Type II error)
TN: Actual is False, Predicted is False

The function supports several input formats:

1.  `(confusion-matrix tp fn fp tn)`: Direct input of the four counts.
    - `tp` (long): True Positive count.
    - `fn` (long): False Negative count.
    - `fp` (long): False Positive count.
    - `tn` (long): True Negative count.

2.  `(confusion-matrix confusion-matrix-representation)`: Input as a structured representation.
    - `confusion-matrix-representation`: Can be:
      - A map with keys like `:tp`, `:fn`, `:fp`, `:tn` (e.g., `{:tp 10 :fn 2 :fp 5 :tn 80}`).
      - A sequence of sequences representing rows `[[TP FP] [FN TN]]` (e.g., `[[10 5] [2 80]]`).
      - A flat sequence `[TP FN FP TN]` (e.g., `[10 2 5 80]`).

3.  `(confusion-matrix actual prediction)`: Input as two sequences of outcomes.
    - `actual` (sequence): Sequence of true outcomes.
    - `prediction` (sequence): Sequence of predicted outcomes. Must have the same length as `actual`.
    Values in `actual` and `prediction` are compared element-wise. By default,
    any non-`nil` or non-zero value is treated as `true`, and `nil` or `0.0` is
    treated as `false`.

4.  `(confusion-matrix actual prediction encode-true)`: Input as two sequences with a specified encoding for `true`.
    - `actual`, `prediction`: Sequences as in the previous arity.
    - `encode-true`: Specifies how values in `actual` and `prediction` are converted to boolean `true` or `false`.
      - `nil` (default): Non-`nil`/non-zero is true.
      - Any sequence/set: Values found in this collection are true.
      - A map: Values are mapped according to the map; if a key is not found or maps to `false`, the value is false.
      - A predicate function: Returns `true` if the value satisfies the predicate.

Returns a map with keys `:tp`, `:fn`, `:fp`, and `:tn` representing the counts.

This function is commonly used to prepare input for binary classification
metrics like those provided by [[binary-measures-all]] and [[binary-measures]].
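A sketch of two of the input formats (the `stats` alias and the data are illustrative assumptions; the returned map follows from the docstring above):

```clojure
(require '[fastmath.stats :as stats])

;; direct counts: tp fn fp tn
(stats/confusion-matrix 10 2 5 80)
;; => {:tp 10, :fn 2, :fp 5, :tn 80}

;; from actual/predicted sequences with an explicit truth encoding
(stats/confusion-matrix [:pos :pos :neg :neg]
                        [:pos :neg :pos :neg]
                        #{:pos})
```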
sourceraw docstring

contingency-2x2-measuresclj

(contingency-2x2-measures & args)

Calculates a subset of common statistics and measures for a 2x2 contingency table.

This function provides a selection of the most frequently used measures from the
more comprehensive [[contingency-2x2-measures-all]].

The function accepts the same input formats as [[contingency-2x2-measures-all]]:

1.  `(contingency-2x2-measures a b c d)`: Takes the four counts as arguments.
2.  `(contingency-2x2-measures [a b c d])`: Takes a sequence of the four counts.
3.  `(contingency-2x2-measures [[a b] [c d]])`: Takes a sequence of sequences representing the rows.
4.  `(contingency-2x2-measures {:a a :b b :c c :d d})`: Takes a map of counts (accepts `:a/:b/:c/:d` keys).

Parameters:

- `a, b, c, d` (long): Counts in the 2x2 table cells.
- `map-or-seq` (map or sequence): A representation of the 2x2 table.

Returns a map containing a selection of measures:

- `:OR`: Odds Ratio
- `:chi2`: Pearson's Chi-squared statistic
- `:yates`: Yates' continuity corrected Chi-squared statistic
- `:cochran-mantel-haenszel`: Cochran-Mantel-Haenszel statistic
- `:cohens-kappa`: Cohen's Kappa coefficient
- `:yules-q`: Yule's Q measure of association
- `:holley-guilfords-g`: Holley-Guilford's G measure
- `:huberts-gamma`: Hubert's Gamma measure
- `:yules-y`: Yule's Y measure of association
- `:cramers-v`: Cramer's V measure of association
- `:phi`: Phi coefficient (Matthews Correlation Coefficient)
- `:scotts-pi`: Scott's Pi measure of agreement
- `:cohens-h`: Cohen's H measure
- `:PCC`: Pearson's Contingency Coefficient
- `:PCC-adjusted`: Adjusted Pearson's Contingency Coefficient
- `:TCC`: Tschuprow's Contingency Coefficient
- `:F1`: F1 Score
- `:bangdiwalas-b`: Bangdiwala's B statistic
- `:mcnemars-chi2`: McNemar's Chi-squared test statistic
- `:gwets-ac1`: Gwet's AC1 measure

For a more comprehensive set of 2x2 measures and their detailed descriptions, see [[contingency-2x2-measures-all]].
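A sketch of the input formats and of picking a single measure from the result (the `stats` alias and the counts are illustrative assumptions):

```clojure
(require '[fastmath.stats :as stats])

(stats/contingency-2x2-measures 10 5 3 12)        ;; four counts
(stats/contingency-2x2-measures [[10 5] [3 12]])  ;; rows

;; the result is a map, so individual measures can be extracted
(:OR (stats/contingency-2x2-measures 10 5 3 12))
```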
sourceraw docstring

contingency-2x2-measures-allclj

(contingency-2x2-measures-all map-or-seq)
(contingency-2x2-measures-all [a b] [c d])
(contingency-2x2-measures-all a b c d)

Calculates a comprehensive set of statistics and measures for a 2x2 contingency table.

A 2x2 contingency table cross-tabulates two categorical variables, each with two levels.
The table counts are typically represented as:

+---+---+
| a | b |
+---+---+
| c | d |
+---+---+

Where `a, b, c, d` are the counts in the respective cells. 

This function calculates numerous measures, including:

*   Chi-squared statistics (Pearson, Yates' corrected, CMH) and their p-values.
*   Measures of association (Phi, Yule's Q, Holley-Guilford's G, Hubert's Gamma, Yule's Y, Cramer's V, Scott's Pi, Cohen's H, Pearson/Tschuprow's CC).
*   Measures of agreement (Cohen's Kappa).
*   Risk and effect size measures (Odds Ratio (OR), Relative Risk (RR), Risk Difference (RD), NNT, etc.).
*   Table marginals and proportions.

The function can be called with the four counts directly or with a representation
of the contingency table:

1.  `(contingency-2x2-measures-all a b c d)`: Takes the four counts as arguments.
2.  `(contingency-2x2-measures-all [a b c d])`: Takes a sequence of the four counts.
3.  `(contingency-2x2-measures-all [[a b] [c d]])`: Takes a sequence of sequences representing the rows.
4.  `(contingency-2x2-measures-all {:a a :b b :c c :d d})`: Takes a map of counts (accepts `:a/:b/:c/:d` keys).

Parameters:

- `a` (long): Count in the top-left cell.
- `b` (long): Count in the top-right cell.
- `c` (long): Count in the bottom-left cell.
- `d` (long): Count in the bottom-right cell.
- `map-or-seq` (map or sequence): A representation of the 2x2 table as described above.

Returns a map containing a wide range of calculated statistics. Keys include:
`:n`, `:table`, `:expected`, `:marginals`, `:proportions`, `:p-values` (map), `:OR`, `:lOR`, `:RR`, `:risk` (map), `:SE`, `:measures` (map).

See also [[contingency-2x2-measures]] for a selected subset of these measures,
[[mcc]] for the Matthews Correlation Coefficient (Phi), and [[binary-measures-all]]
for metrics derived from a confusion matrix (often a 2x2 table in binary classification).
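A sketch of navigating the returned map (the `stats` alias and the counts are illustrative assumptions; the keys come from the docstring above):

```clojure
(require '[fastmath.stats :as stats])

(def m (stats/contingency-2x2-measures-all 10 5 3 12))

(:n m)         ;; grand total of the table
(:OR m)        ;; odds ratio
(:p-values m)  ;; map of p-values for the chi-squared statistics
(:measures m)  ;; map of association/agreement measures
```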
sourceraw docstring

contingency-tableclj

(contingency-table & seqs)

Creates a frequency map (contingency table) from one or more sequences.

If one sequence `xs` is provided, it returns a simple frequency map of the values
in `xs`.

If multiple sequences `s1, s2, ..., sn` are provided, it creates a contingency table
of the tuples formed by the corresponding elements `[s1_i, s2_i, ..., sn_i]` at
each index `i`. The returned map keys are these tuples, and values are their
frequencies.

Parameters:

- `seqs` (one or more sequences): The input sequences. All sequences should ideally
  have the same length, as elements are paired by index.

Returns a map where keys represent unique combinations of values (or single values
if only one sequence is input) and values are the counts of these combinations.

See also [[rows->contingency-table]], [[contingency-table->marginals]].
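A sketch of the one- and two-sequence cases (the `stats` alias and the data are illustrative assumptions; the single-sequence result follows from the docstring above):

```clojure
(require '[fastmath.stats :as stats])

;; one sequence: a plain frequency map
(stats/contingency-table [:a :b :a :a])
;; => {:a 3, :b 1}

;; two sequences: keys are [v1 v2] tuples paired by index
(stats/contingency-table [:a :a :b] [:x :y :x])
```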
sourceraw docstring

contingency-table->marginalsclj

(contingency-table->marginals ct)

Calculates marginal sums (row and column totals) and the grand total from a contingency table.

A contingency table represents the frequency distribution of observations for two or
more categorical variables. This function summarizes these frequencies along the
rows and columns.

The function accepts two main input formats for the contingency table:

1.  A map where keys are `[row-index, column-index]` tuples and values are counts (e.g., `{[0 0] 10, [0 1] 5, [1 0] 3, [1 1] 12}`). This format is produced by [[contingency-table]] when given multiple sequences or by [[rows->contingency-table]].
2.  A sequence of sequences representing the rows of the table, where each inner sequence contains counts for the columns in that row (e.g., `[[10 5] [3 12]]`). The function internally converts this format to the map format.

Parameters:

- `ct` (map or sequence of sequences): The contingency table input.

Returns a map containing:

- `:rows`: A sequence of `[row-index, row-total]` pairs.
- `:cols`: A sequence of `[column-index, column-total]` pairs.
- `:n`: The grand total of all counts in the table.
- `:diag`: A sequence of `[[index, index], count]` pairs for cells on the diagonal
  (where row index equals column index). This is useful for square tables like
  confusion matrices.

See also [[contingency-table]], [[rows->contingency-table]].
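A sketch using the rows format (the `stats` alias and the counts are illustrative assumptions):

```clojure
(require '[fastmath.stats :as stats])

(stats/contingency-table->marginals [[10 5] [3 12]])
;; result contains :rows and :cols marginal pairs,
;; the grand total :n (here 10+5+3+12 = 30), and :diag entries
```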
sourceraw docstring

correlationclj

(correlation [vs1 vs2])
(correlation vs1 vs2)

Calculates the correlation coefficient between two sequences.

By default, this function calculates the Pearson product-moment correlation
coefficient, which measures the linear relationship between two datasets.

This function handles the standard deviation normalization based on whether
the inputs `vs1` and `vs2` are treated as samples or populations (it uses
sample standard deviation derived from [[variance]]).

Parameters:

- `[vs1 vs2]` (sequence of two sequences): A sequence containing the two sequences of numbers.
- `vs1`, `vs2` (sequences): The two sequences of numbers directly as arguments.

Both sequences must have the same length.

Returns the calculated correlation coefficient (a value between -1.0 and 1.0) as a double.
Returns `NaN` if one or both sequences have zero variance (are constant).

See also [[covariance]], [[pearson-correlation]], [[spearman-correlation]], [[kendall-correlation]], [[correlation-matrix]].
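A sketch of both calling conventions (the `stats` alias and the data are illustrative assumptions):

```clojure
(require '[fastmath.stats :as stats])

(stats/correlation [1 2 3 4 5] [2 4 6 8 10])    ;; perfectly linear, near 1.0
(stats/correlation [[1 2 3 4 5] [5 4 3 2 1]])   ;; single-argument form, near -1.0
```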
sourceraw docstring

correlation-matrixclj

(correlation-matrix vss)
(correlation-matrix vss measure)

Generates a matrix of pairwise correlation coefficients from a sequence of sequences.

Given a collection of data sequences `vss`, where each inner sequence represents
a variable, this function calculates a square matrix where the element at row `i`
and column `j` is the correlation coefficient between the `i`-th and `j`-th
sequences in `vss`.

Parameters:

- `vss` (sequence of sequences of numbers): The collection of data sequences.
  Each inner sequence is treated as a variable. All inner sequences must have the same length.
- `measure` (keyword, optional): Specifies the type of correlation coefficient to calculate.
  Defaults to `:pearson`.
  - `:pearson` (default): Calculates the Pearson product-moment correlation coefficient.
  - `:kendall`: Calculates Kendall's Tau rank correlation coefficient.
  - `:spearman`: Calculates Spearman's rank correlation coefficient.

Returns a sequence of sequences (a matrix) of doubles representing the correlation matrix.
The matrix is symmetric, as correlation is a symmetric measure.

See also [[pearson-correlation]], [[spearman-correlation]], [[kendall-correlation]],
[[covariance-matrix]], [[coefficient-matrix]].
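A hypothetical usage sketch (assumes fastmath is available, aliased as `stats`), using three variables where the expected matrix has 1.0 on the diagonal:

```clojure
(require '[fastmath.stats :as stats])

(def vss [[1 2 3 4]     ;; variable A
          [2 4 6 8]     ;; variable B, linear in A
          [4 3 2 1]])   ;; variable C, inversely related to A

;; 3x3 Pearson correlation matrix (default measure)
(stats/correlation-matrix vss)

;; rank-based alternative
(stats/correlation-matrix vss :spearman)
```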

count=

(count= [vs1 vs2-or-val])
(count= vs1 vs2-or-val)

Count equal values in both seqs. Same as [[L0]]

Calculates the number of pairs of corresponding elements that are equal between
two sequences, or between a sequence and a single scalar value.

Parameters:

- `vs1` (sequence of numbers): The first sequence.
- `vs2-or-val` (sequence of numbers or single number): The second sequence of
  numbers, or a single number to compare against each element of `vs1`.

If both inputs are sequences, they must have the same length. If `vs2-or-val`
is a single number, it is effectively treated as a sequence of that number
repeated `count(vs1)` times.

Returns the count of equal elements as a long integer.
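A hypothetical sketch of both call styles (assumes fastmath is available, aliased as `stats`):

```clojure
(require '[fastmath.stats :as stats])

;; element-wise comparison of two sequences: equal at positions 0 and 2
(stats/count= [1 2 3 4] [1 0 3 0])
;; => 2

;; comparison against a scalar: two elements equal 1
(stats/count= [1 2 1 3] 1)
;; => 2
```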

covariance

(covariance [vs1 vs2])
(covariance vs1 vs2)

Covariance of two sequences.

This function calculates the *sample* covariance.

Parameters:

- `[vs1 vs2]` (sequence of two sequences): A sequence containing the two sequences of numbers.
- `vs1`, `vs2` (sequences): The two sequences of numbers directly as arguments.

Both sequences must have the same length.

Returns the calculated sample covariance as a double.

See also [[correlation]], [[covariance-matrix]].
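A hypothetical sketch of both arities (assumes fastmath is available, aliased as `stats`):

```clojure
(require '[fastmath.stats :as stats])

;; sample covariance of two sequences of equal length
(stats/covariance [1 2 3 4] [2 4 6 8])
;; positive (10/3 ≈ 3.33): the series move together

;; the pair can also be passed as a single argument
(stats/covariance [[1 2 3 4] [2 4 6 8]])
```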

covariance-matrix

(covariance-matrix vss)

Generates a matrix of pairwise covariance coefficients from a sequence of sequences.

Given a collection of data sequences `vss`, where each inner sequence represents
a variable, this function calculates a square matrix where the element at row `i`
and column `j` is the sample covariance between the `i`-th and `j`-th sequences
in `vss`.

Parameters:

- `vss` (sequence of sequences of numbers): The collection of data sequences.
  Each inner sequence is treated as a variable. All inner sequences must have the same length.

Returns a sequence of sequences (a matrix) of doubles representing the covariance matrix.
The matrix is symmetric, as covariance is a symmetric measure ($Cov(X,Y) = Cov(Y,X)$).

Internally uses [[coefficient-matrix]] with the [[covariance]] function and `symmetric?` set to `true`.

See also [[covariance]], [[correlation-matrix]], [[coefficient-matrix]].

cramers-c

(cramers-c contingency-table)
(cramers-c group1 group2)

Calculates Cramer's C, a measure of association (effect size) between two
nominal variables represented in a contingency table.

Its value ranges from 0 to 1, where 0 indicates no association and 1 indicates
a perfect association. It is particularly useful for tables larger than 2x2.

The function can be called in two ways:

1.  With two sequences `group1` and `group2`:
    The function will automatically construct a contingency table from
    the unique values in the sequences.
2.  With a contingency table:
    The contingency table can be provided as:
    - A map where keys are `[row-index, column-index]` tuples and values are counts
      (e.g., `{[0 0] 10, [0 1] 5, [1 0] 3, [1 1] 12}`). This is the output format
      of [[contingency-table]] with two inputs.
    - A sequence of sequences representing the rows of the table
      (e.g., `[[10 5] [3 12]]`). This is equivalent to `rows->contingency-table`.

Parameters:

- `group1` (sequence): The first sequence of categorical data.
- `group2` (sequence): The second sequence of categorical data. Must have the same length as `group1`.
- `contingency-table` (map or sequence of sequences): A pre-computed contingency table.

Returns the calculated Cramer's C coefficient as a double.

See also [[chisq-test]], [[cramers-v]], [[cohens-w]], [[tschuprows-t]], [[contingency-table]].

cramers-v

(cramers-v contingency-table)
(cramers-v group1 group2)

Calculates Cramer's V, a measure of association (effect size) between two
nominal variables represented in a contingency table.

Its value ranges from 0 to 1, where 0 indicates no association and 1 indicates
a perfect association. It is related to the Pearson's Chi-squared statistic
and is useful for tables of any size.

The function can be called in two ways:

1.  With two sequences `group1` and `group2`:
    The function will automatically construct a contingency table from
    the unique values in the sequences.
2.  With a contingency table:
    The contingency table can be provided as:
    - A map where keys are `[row-index, column-index]` tuples and values are counts
      (e.g., `{[0 0] 10, [0 1] 5, [1 0] 3, [1 1] 12}`). This is the output format
      of [[contingency-table]] with two inputs.
    - A sequence of sequences representing the rows of the table
      (e.g., `[[10 5] [3 12]]`). This is equivalent to `rows->contingency-table`.

Parameters:

- `group1` (sequence): The first sequence of categorical data.
- `group2` (sequence): The second sequence of categorical data. Must have the same length as `group1`.
- `contingency-table` (map or sequence of sequences): A pre-computed contingency table.

Returns the calculated Cramer's V coefficient as a double.

See also [[chisq-test]], [[cramers-c]], [[cohens-w]], [[tschuprows-t]], [[contingency-table]].
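A hypothetical sketch of both call styles, reusing the 2x2 table from the example above (assumes fastmath is available, aliased as `stats`):

```clojure
(require '[fastmath.stats :as stats])

;; from a pre-computed contingency table given as rows of counts
(stats/cramers-v [[10 5] [3 12]])

;; or directly from two categorical sequences of equal length
(stats/cramers-v [:a :a :b :b :a] [:x :y :x :x :y])
```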

cramers-v-corrected

(cramers-v-corrected contingency-table)
(cramers-v-corrected group1 group2)

Calculates the **corrected Cramer's V**, a measure of association (effect size)
between two nominal variables represented in a contingency table, with a correction
to reduce bias, particularly for small sample sizes or tables with many cells
having small expected counts.

Like the uncorrected Cramer's V ([[cramers-v]]), its value ranges from 0 to 1,
where 0 indicates no association and 1 indicates a perfect association. The
correction tends to yield a value closer to the true population value in
biased situations.

The function can be called in two ways:

1.  With two sequences `group1` and `group2`:
    The function will automatically construct a contingency table from
    the unique values in the sequences.
2.  With a contingency table:
    The contingency table can be provided as:
    - A map where keys are `[row-index, column-index]` tuples and values are counts
      (e.g., `{[0 0] 10, [0 1] 5, [1 0] 3, [1 1] 12}`). This is the output format
      of [[contingency-table]] with two inputs.
    - A sequence of sequences representing the rows of the table
      (e.g., `[[10 5] [3 12]]`). This is equivalent to [[rows->contingency-table]].

Parameters:

- `group1` (sequence): The first sequence of categorical data.
- `group2` (sequence): The second sequence of categorical data. Must have the same length as `group1`.
- `contingency-table` (map or sequence of sequences): A pre-computed contingency table.

Returns the calculated corrected Cramer's V coefficient as a double.

See also [[chisq-test]], [[cramers-v]] (uncorrected), [[cramers-c]], [[cohens-w]],
[[tschuprows-t]], [[contingency-table]].

cressie-read-test

(cressie-read-test contingency-table-or-xs)
(cressie-read-test contingency-table-or-xs params)

Cressie-Read test, a power divergence test for `lambda` = 2/3.

Performs a power divergence test, which encompasses several common statistical tests
like Chi-squared, G-test (likelihood ratio), etc., based on the lambda parameter.
This function can perform either a goodness-of-fit test or a test for independence
in a contingency table.

Usage:

1. **Goodness-of-Fit (GOF):**
   - Input: `observed-counts` (sequence of numbers) and `:p` (expected probabilities/weights).
   - Input: `data` (sequence of numbers) and `:p` (a distribution object).
     In this case, a histogram of `data` is created (controlled by `:bins`) and
     compared against the probability mass/density of the distribution in those bins.
2. **Test for Independence:**
   - Input: `contingency-table` (2D sequence or map format). The `:p` option is ignored.

Options map:

- `:lambda` (double, default: `2/3`): Determines the specific test statistic. Common values:
  - `1.0`: Pearson Chi-squared test ([[chisq-test]]).
  - `0.0`: G-test / Multinomial Likelihood Ratio test ([[multinomial-likelihood-ratio-test]]).
  - `-0.5`: Freeman-Tukey test ([[freeman-tukey-test]]).
  - `-1.0`: Minimum Discrimination Information test ([[minimum-discrimination-information-test]]).
  - `-2.0`: Neyman Modified Chi-squared test ([[neyman-modified-chisq-test]]).
  - `2/3`: Cressie-Read test (default, [[cressie-read-test]]).
- `:p` (seq of numbers or distribution): Expected probabilities/weights (for GOF with counts)
  or a `fastmath.random` distribution object (for GOF with data). Ignored for independence tests.
- `:alpha` (double, default: `0.05`): Significance level for confidence intervals.
- `:ci-sides` (keyword, default: `:two-sided`): Sides for bootstrap confidence intervals
  (`:two-sided`, `:one-sided-greater`, `:one-sided-less`).
- `:sides` (keyword, default: `:one-sided-greater`): Alternative hypothesis side for the p-value calculation
  against the Chi-squared distribution (`:one-sided-greater`, `:one-sided-less`, `:two-sided`).
- `:bootstrap-samples` (long, default: `1000`): Number of bootstrap samples for confidence interval estimation.
- `:ddof` (long, default: `0`): Delta degrees of freedom, an adjustment subtracted from the calculated degrees of freedom.
- `:bins` (number, keyword, or seq): Used only for the GOF test against a distribution.
  Specifies the number of bins, an estimation method (see [[histogram]]), or explicit bin edges for histogram creation.

Returns a map containing:

- `:stat`: The calculated power divergence test statistic.
- `:chi2`: Alias for `:stat`.
- `:df`: Degrees of freedom for the test.
- `:p-value`: The p-value associated with the test statistic.
- `:n`: Total number of observations.
- `:estimate`: Observed proportions.
- `:expected`: Expected counts or proportions under the null hypothesis.
- `:confidence-interval`: Bootstrap confidence intervals for the observed proportions.
- `:lambda`, `:alpha`, `:sides`, `:ci-sides`: Input options used.
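A hypothetical sketch of both modes (assumes fastmath is available, aliased as `stats`; the counts are illustrative):

```clojure
(require '[fastmath.stats :as stats])

;; goodness-of-fit: observed counts vs. uniform expected probabilities
(stats/cressie-read-test [89 37 30 28 2] {:p [0.2 0.2 0.2 0.2 0.2]})

;; test for independence on a 2x2 contingency table (:p is ignored)
(stats/cressie-read-test [[10 5] [3 12]])
```

Both calls return the result map described above (`:stat`, `:df`, `:p-value`, ...).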

demean

(demean vs)

Subtracts the mean of `vs` from each of its elements, returning a centered (zero-mean) sequence.
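A hypothetical sketch (assumes fastmath is available, aliased as `stats`):

```clojure
(require '[fastmath.stats :as stats])

;; the mean of [1 2 3 4 5] is 3, so each value is shifted by -3
(stats/demean [1 2 3 4 5])
;; centered values: -2, -1, 0, 1, 2
```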

dissimilarity

(dissimilarity method P-observed Q-expected)
(dissimilarity method
               P-observed
               Q-expected
               {:keys [bins probabilities? epsilon log-base power remove-zeros?]
                :or
                  {probabilities? true epsilon 1.0E-6 log-base m/E power 2.0}})

Various PDF distances between two histograms (frequencies) or probability vectors.

`Q` can be a distribution object; in that case a histogram is created from `P`.

Arguments:

* `method` - distance method
* `P-observed` - frequencies, probabilities or actual data (when Q is a distribution or `:bins` is set)
* `Q-expected` - frequencies, probabilities or distribution object (when P is data or `:bins` is set)

Options:

* `:probabilities?` - should P/Q be converted to probabilities, default: `true`.
* `:epsilon` - small number which replaces `0.0` when division or logarithm is used
* `:log-base` - base for logarithms, default: `e`
* `:power` - exponent for `:minkowski` distance, default: `2.0`
* `:bins` - number of bins or bins estimation method, see [[histogram]].

The list of methods: `:euclidean`, `:city-block`, `:manhattan`, `:chebyshev`, `:minkowski`, `:sorensen`, `:gower`, `:soergel`, `:kulczynski`, `:canberra`, `:lorentzian`, `:non-intersection`, `:wave-hedges`, `:czekanowski`, `:motyka`, `:tanimoto`, `:jaccard`, `:dice`, `:bhattacharyya`, `:hellinger`, `:matusita`, `:squared-chord`, `:euclidean-sq`, `:squared-euclidean`, `:pearson-chisq`, `:chisq`, `:neyman-chisq`, `:squared-chisq`, `:symmetric-chisq`, `:divergence`, `:clark`, `:additive-symmetric-chisq`, `:kullback-leibler`, `:jeffreys`, `:k-divergence`, `:topsoe`, `:jensen-shannon`, `:jensen-difference`, `:taneja`, `:kumar-johnson`, `:avg`

See more: Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions by Sung-Hyuk Cha
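A hypothetical sketch showing two of the methods listed above (assumes fastmath is available, aliased as `stats`):

```clojure
(require '[fastmath.stats :as stats])

;; Hellinger distance between two frequency vectors
;; (normalized to probabilities internally, per the :probabilities? default)
(stats/dissimilarity :hellinger [10 20 30 40] [25 25 25 25])

;; Minkowski distance with a custom exponent
(stats/dissimilarity :minkowski [1 2 3] [3 2 1] {:power 3.0})
```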

durbin-watson

(durbin-watson rs)

Calculates the Durbin-Watson statistic (d) for a sequence of residuals.

This statistic is used to test for the presence of serial correlation,
especially first-order (lag-1) autocorrelation, in the residuals from a
regression analysis. Autocorrelation violates the assumption of independent errors.

Parameters:

- `rs` (sequence of numbers): The sequence of residuals from a regression model.
  The sequence should represent observations ordered by time or sequence index.

Returns the calculated Durbin-Watson statistic as a double. The value ranges from 0 to 4.

Interpretation:

- Values near 2 suggest no first-order autocorrelation.
- Values less than 2 suggest positive autocorrelation (residuals tend to be followed by residuals of the same sign).
- Values greater than 2 suggest negative autocorrelation (residuals tend to be followed by residuals of the opposite sign).
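The interpretation above can be sketched with two synthetic residual sequences (hypothetical usage; assumes fastmath is available, aliased as `stats`):

```clojure
(require '[fastmath.stats :as stats])

;; alternating residuals: each value is followed by the opposite sign,
;; so d lands well above 2 (negative autocorrelation)
(stats/durbin-watson [1 -1 1 -1 1 -1])

;; slowly drifting residuals: successive differences are tiny relative
;; to the residuals themselves, so d lands near 0 (positive autocorrelation)
(stats/durbin-watson [1.0 0.9 0.8 0.7 0.6 0.5])
```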

epsilon-sq

(epsilon-sq [group1 group2])
(epsilon-sq group1 group2)

Calculates Epsilon squared (ε²), an effect size measure for the simple linear regression of `group1` on `group2`.

Epsilon squared estimates the proportion of variance in the dependent variable (`group1`)
that is accounted for by the independent variable (`group2`) in the population. It is
considered a less biased alternative to the sample R-squared ([[r2-determination]]).

The calculation is based on the sums of squares from the simple linear regression of
`group1` on `group2`.

Parameters:

- `group1` (seq of numbers): The dependent variable.
- `group2` (seq of numbers): The independent variable. Must have the same length as `group1`.

Returns the calculated Epsilon squared value as a double. The value typically ranges
from 0.0 to 1.0.

Interpretation:

- 0.0 indicates that `group2` explains none of the variance in `group1` in the population.
- 1.0 indicates that `group2` perfectly explains the variance in `group1` in the population.

Note: While often presented in the context of ANOVA, this implementation applies the
formula to the sums of squares obtained from a simple linear regression between the
two sequences.

See also [[eta-sq]] (Eta-squared, often based on $R^2$), [[omega-sq]] (another adjusted
R²-like measure), [[r2-determination]] (R-squared).

estimate-bins

(estimate-bins vs)
(estimate-bins vs bins-or-estimate-method)

Estimate number of bins for histogram.

Possible methods are: `:sqrt` `:sturges` `:rice` `:doane` `:scott` `:freedman-diaconis` (default).

The number returned is not higher than number of samples.
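A hypothetical sketch (assumes fastmath is available, aliased as `stats`):

```clojure
(require '[fastmath.stats :as stats])

(def vs (repeatedly 200 rand))

;; default estimator (:freedman-diaconis)
(stats/estimate-bins vs)

;; explicit method
(stats/estimate-bins vs :sturges)
```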

estimation-strategies-list

List of estimation strategies for [[percentile]]/[[quantile]] functions.

eta-sq

(eta-sq [group1 group2])
(eta-sq group1 group2)

Calculates a measure of association between two sequences, named `eta-sq` (Eta-squared).

*Note*: The current implementation calculates the R-squared coefficient of determination from a simple linear regression where the first input sequence (`group1`) is treated as the dependent variable and the second (`group2`) as the independent variable. In this context, it quantifies the proportion of the variance in `group1` that is linearly predictable from `group2`.

Parameters:

- `group1` (seq of numbers): The first sequence (treated as dependent variable).
- `group2` (seq of numbers): The second sequence (treated as independent variable).

Returns the calculated R-squared value as a double [0.0, 1.0].

Interpretation:

- 0.0 indicates that `group2` explains none of the variance in `group1` linearly.
- 1.0 indicates that `group2` linearly explains all the variance in `group1`.

While Eta-squared ($\eta^2$) is commonly used in ANOVA to quantify the proportion of variance in a dependent variable explained by group membership, this function's calculation method differs from the standard ANOVA $\eta^2$ unless `group2` explicitly represents numeric codes for two groups.

See also [[r2-determination]] (which is equivalent to this function), [[pearson-correlation]], [[omega-sq]], [[epsilon-sq]], [[one-way-anova-test]].
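A hypothetical sketch (assumes fastmath is available, aliased as `stats`; the data is illustrative):

```clojure
(require '[fastmath.stats :as stats])

;; proportion of variance in ys linearly predictable from xs
(def xs [1 2 3 4 5])
(def ys [2.1 3.9 6.2 7.8 10.1])

(stats/eta-sq ys xs)
;; close to 1.0 for a nearly linear relationship
```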

expectile

(expectile vs tau)
(expectile vs weights tau)

Calculate the tau-th expectile of a sequence `vs`.

Expectiles are related to quantiles but are determined by minimizing an
asymmetrically weighted sum of squared differences, rather than absolute
differences. The `tau` parameter controls the asymmetry.

A key property is that the expectile for `tau = 0.5` is equal to the [[mean]].

The calculation involves finding the value `t` such that the weighted sum
of `w_i * (v_i - t)` is zero, where the effective weights depend on `tau` and whether
`v_i` is above or below `t`.

Parameters:

- `vs`: Sequence of data values.
- `weights` (optional): Sequence of corresponding non-negative weights.
  Must have the same count as `vs`. If omitted, calculates the unweighted expectile.
- `tau`: The expectile level, a value between 0.0 and 1.0 (inclusive).

Returns the calculated expectile as a double.

See also [[quantile]], [[mean]], [[median]].
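The fixed-point characterization above can be sketched iteratively in plain Clojure (a hypothetical unweighted sketch, not the fastmath implementation): weight each value by `tau` if it lies above the current estimate `t` and by `1 - tau` otherwise, then recompute `t` as the weighted mean until it stabilizes.

```clojure
;; Iterative expectile sketch: find t where the asymmetrically weighted
;; deviations balance. For tau = 0.5 all weights are equal, so t is the mean.
(defn expectile-sketch [vs tau]
  (loop [t (/ (reduce + vs) (double (count vs))) i 0]
    (let [ws (map #(if (> % t) tau (- 1.0 tau)) vs)
          t' (/ (reduce + (map * ws vs)) (reduce + ws))]
      (if (or (= i 100) (< (Math/abs (- t' t)) 1e-12))
        t'
        (recur t' (inc i))))))

(expectile-sketch [1.0 2.0 3.0 4.0] 0.5) ;; => 2.5, the mean
```

Raising `tau` above 0.5 pulls the expectile above the mean, mirroring how higher quantile levels move the quantile upward.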

extentclj

(extent vs)
(extent vs mean?)

Returns the extent (min, max, mean) of a sequence. The mean is included by default (`mean?` default: true).

f-testclj

(f-test xs ys)
(f-test xs ys {:keys [sides alpha] :or {sides :two-sided alpha 0.05}})

Performs an F-test to compare the variances of two independent samples.

The test assesses the null hypothesis that the variances of the populations
from which `xs` and `ys` are drawn are equal.

Assumes independence of samples. The test is sensitive to departures from
the assumption that both populations are normally distributed.

Parameters:

- `xs` (seq of numbers): The first sample.
- `ys` (seq of numbers): The second sample.
- `params` (map, optional): Options map:
  - `:sides` (keyword, default `:two-sided`): Specifies the alternative hypothesis
    regarding the ratio of variances (Var(xs) / Var(ys)).
    - `:two-sided` (default): Variances are not equal (ratio != 1).
    - `:one-sided-greater`: Variance of `xs` is greater than variance of `ys` (ratio > 1).
    - `:one-sided-less`: Variance of `xs` is less than variance of `ys` (ratio < 1).
  - `:alpha` (double, default `0.05`): Significance level for the confidence interval.

Returns a map containing:

- `:F`: The calculated F-statistic (ratio of sample variances: Var(xs) / Var(ys)).
- `:stat`: Alias for `:F`.
- `:estimate`: Alias for `:F`, representing the estimated ratio of variances.
- `:df`: Degrees of freedom as `[numerator-df, denominator-df]`, corresponding to `[(count xs)-1, (count ys)-1]`.
- `:n`: Sample sizes as `[count xs, count ys]`.
- `:nx`: Sample size of `xs`.
- `:ny`: Sample size of `ys`.
- `:sides`: The alternative hypothesis side used (`:two-sided`, `:one-sided-greater`, or `:one-sided-less`).
- `:test-type`: Alias for `:sides`.
- `:p-value`: The p-value associated with the F-statistic and the specified `:sides`.
- `:confidence-interval`: A confidence interval for the true ratio of the population variances (Var(xs) / Var(ys)).
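The statistic itself is simple to sketch in plain Clojure (hypothetical helper names; the p-value and confidence-interval machinery is omitted): the ratio of the unbiased sample variances, with degrees of freedom one less than each sample size.

```clojure
;; F statistic sketch: Var(xs) / Var(ys) using unbiased (n-1) sample variances.
(defn- variance* [xs]
  (let [n (count xs)
        m (/ (reduce + xs) (double n))]
    (/ (reduce + (map #(let [d (- % m)] (* d d)) xs))
       (dec n))))

(defn f-stat-sketch [xs ys]
  {:F  (/ (variance* xs) (variance* ys))
   :df [(dec (count xs)) (dec (count ys))]})

(f-stat-sketch [1.0 2.0 3.0 4.0 5.0] [2.0 4.0 6.0 8.0 10.0])
;; => {:F 0.25, :df [4 4]}
```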

fligner-killeen-testclj

(fligner-killeen-test xss)
(fligner-killeen-test xss {:keys [sides] :or {sides :one-sided-greater}})

Performs the Fligner-Killeen test for homogeneity of variances across two or more groups.

The Fligner-Killeen test is a non-parametric test that assesses the null hypothesis
that the variances of the groups are equal. It is robust against departures from normality.
The test is based on ranks of the absolute deviations from the group medians.

Parameters:

- `xss` (sequence of sequences): A collection where each element is a sequence representing a group of observations.
- `params` (map, optional): Options map with the following key:
  - `:sides` (keyword, default `:one-sided-greater`): Alternative hypothesis side for the Chi-squared test.
    Possible values: `:one-sided-greater`, `:one-sided-less`, `:two-sided`.

Returns a map containing:

- `:chi2`: The Fligner-Killeen test statistic (Chi-squared value).
- `:stat`: Alias for `:chi2`.
- `:p-value`: The p-value for the test.
- `:df`: Degrees of freedom for the test (number of groups - 1).
- `:n`: Sequence of sample sizes for each group.
- `:SSt`: Sum of squares between groups (treatment) based on transformed ranks.
- `:SSe`: Sum of squares within groups (error) based on transformed ranks.
- `:DFt`: Degrees of freedom between groups.
- `:DFe`: Degrees of freedom within groups.
- `:MSt`: Mean square between groups.
- `:MSe`: Mean square within groups.
- `:sides`: Test side used.

freeman-tukey-testclj

(freeman-tukey-test contingency-table-or-xs)
(freeman-tukey-test contingency-table-or-xs params)

Freeman-Tukey test, a power divergence test with `lambda = -0.5`.

Performs a power divergence test, which encompasses several common statistical tests
  like Chi-squared, G-test (likelihood ratio), etc., based on the lambda parameter.
  This function can perform either a goodness-of-fit test or a test for independence
  in a contingency table.

  Usage:

  1.  **Goodness-of-Fit (GOF):**
      - Input: `observed-counts` (sequence of numbers) and `:p` (expected probabilities/weights).
      - Input: `data` (sequence of numbers) and `:p` (a distribution object).
        In this case, a histogram of `data` is created (controlled by `:bins`) and
        compared against the probability mass/density of the distribution in those bins.

  2.  **Test for Independence:**
      - Input: `contingency-table` (2D sequence or map format). The `:p` option is ignored.

  Options map:

  * `:lambda` (double, default: `2/3`): Determines the specific test statistic. Common values:
      * `1.0`: Pearson Chi-squared test ([[chisq-test]]).
      * `0.0`: G-test / Multinomial Likelihood Ratio test ([[multinomial-likelihood-ratio-test]]).
      * `-0.5`: Freeman-Tukey test ([[freeman-tukey-test]]).
      * `-1.0`: Minimum Discrimination Information test ([[minimum-discrimination-information-test]]).
      * `-2.0`: Neyman Modified Chi-squared test ([[neyman-modified-chisq-test]]).
      * `2/3`: Cressie-Read test (default, [[cressie-read-test]]).
  * `:p` (seq of numbers or distribution): Expected probabilities/weights (for GOF with counts)
    or a `fastmath.random` distribution object (for GOF with data). Ignored for independence tests.
  * `:alpha` (double, default: `0.05`): Significance level for confidence intervals.
  * `:ci-sides` (keyword, default: `:two-sided`): Sides for bootstrap confidence intervals
    (`:two-sided`, `:one-sided-greater`, `:one-sided-less`).
  * `:sides` (keyword, default: `:one-sided-greater`): Alternative hypothesis side for the p-value calculation
    against the Chi-squared distribution (`:one-sided-greater`, `:one-sided-less`, `:two-sided`).
  * `:bootstrap-samples` (long, default: `1000`): Number of bootstrap samples for confidence interval estimation.
  * `:ddof` (long, default: `0`): Delta degrees of freedom. Adjustment subtracted from the calculated degrees of freedom.
  * `:bins` (number, keyword, or seq): Used only for GOF test against a distribution.
    Specifies the number of bins, an estimation method (see [[histogram]]), or explicit bin edges for histogram creation.

  Returns a map containing:

  - `:stat`: The calculated power divergence test statistic.
  - `:chi2`: Alias for `:stat`.
  - `:df`: Degrees of freedom for the test.
  - `:p-value`: The p-value associated with the test statistic.
  - `:n`: Total number of observations.
  - `:estimate`: Observed proportions.
  - `:expected`: Expected counts or proportions under the null hypothesis.
  - `:confidence-interval`: Bootstrap confidence intervals for the observed proportions.
  - `:lambda`, `:alpha`, `:sides`, `:ci-sides`: Input options used.
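At `lambda = -0.5` the power-divergence statistic reduces (when observed and expected totals agree) to the classic Freeman-Tukey form `4 * sum((sqrt(O) - sqrt(E))^2)`. A goodness-of-fit sketch in plain Clojure with hypothetical counts and a uniform null (not the fastmath implementation):

```clojure
;; Freeman-Tukey statistic sketch: 4 * sum of squared differences of the
;; square roots of observed and expected counts.
(defn freeman-tukey-stat [observed expected]
  (* 4.0 (reduce + (map (fn [o e]
                          (let [d (- (Math/sqrt o) (Math/sqrt e))]
                            (* d d)))
                        observed expected))))

(let [observed [10.0 20.0 30.0]
      n        (reduce + observed)
      expected (repeat 3 (/ n 3.0))]   ;; uniform null: 20 expected per cell
  (freeman-tukey-stat observed expected))
;; ~10.90, compared against a Chi-squared distribution with df = 2
```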

geomeanclj

(geomean vs)
(geomean vs weights)

Calculates the geometric mean of a sequence `vs`.

The geometric mean is suitable for averaging ratios or rates of change and requires
all values in the sequence to be positive. It is calculated as the n-th root
of the product of n numbers.

Parameters:

- `vs`: Sequence of numbers. Non-positive values will result in `NaN` or `0.0` due
        to the internal use of `log`.
- `weights` (optional): Sequence of non-negative weights corresponding to `vs`.
  Must have the same count as `vs`.

Returns the calculated geometric mean as a double.

See also [[mean]], [[harmean]], [[powmean]].
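The log-domain computation mentioned above can be sketched in plain Clojure (a hypothetical sketch, not the fastmath implementation): the geometric mean is `exp` of the arithmetic mean of the logs, which is exactly why non-positive inputs produce `NaN` or `0.0`.

```clojure
;; Geometric mean sketch: exp(mean(log v)). Equivalent to the n-th root
;; of the product, but numerically safer for long sequences.
(defn geomean-sketch [vs]
  (Math/exp (/ (reduce + (map #(Math/log %) vs))
               (double (count vs)))))

(geomean-sketch [2.0 8.0]) ;; ~4.0, i.e. (Math/sqrt (* 2 8))
```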

glass-deltaclj

(glass-delta [group1 group2])
(glass-delta group1 group2)

Calculates Glass's delta (Δ), an effect size measure for the difference
between two group means, using the standard deviation of the control group.

Glass's delta is used to quantify the magnitude of the difference between an
experimental group and a control group, specifically when the control group's
standard deviation is considered a better estimate of the population
standard deviation than a pooled variance.

Parameters:

- `group1` (seq of numbers): The experimental group.
- `group2` (seq of numbers): The control group.

Returns the calculated Glass's delta as a double.

This measure is less common than [[cohens-d]] or [[hedges-g]] but is preferred
when the intervention is expected to affect the variance or when group2 (the control)
is clearly the baseline against which variability should be assessed.

See also [[cohens-d]], [[hedges-g]].
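The defining feature, scaling the mean difference by the control group's standard deviation only, can be sketched in plain Clojure (hypothetical helper names and data, not the fastmath implementation):

```clojure
;; Glass's delta sketch: (mean experimental - mean control) / sd(control),
;; using the unbiased (n-1) standard deviation of the control group.
(defn- mean* [xs] (/ (reduce + xs) (double (count xs))))
(defn- sd* [xs]
  (let [m (mean* xs)]
    (Math/sqrt (/ (reduce + (map #(let [d (- % m)] (* d d)) xs))
                  (dec (count xs))))))

(defn glass-delta-sketch [experimental control]
  (/ (- (mean* experimental) (mean* control))
     (sd* control)))

(glass-delta-sketch [8.0 10.0 12.0] [1.0 2.0 3.0 4.0 5.0])
;; ~4.43, a very large effect on the control group's scale
```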

harmeanclj

(harmean vs)
(harmean vs weights)

Calculates the harmonic mean of a sequence `vs`.

The harmonic mean is the reciprocal of the arithmetic mean of the reciprocals
of the observations.

Parameters:

- `vs`: Sequence of numbers. Values must be non-zero.
- `weights` (optional): Sequence of non-negative weights corresponding to `vs`.
  Must have the same count as `vs`.

Returns the calculated harmonic mean as a double.

See also [[mean]], [[geomean]], [[powmean]].
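The definition above is a one-liner in plain Clojure (a hypothetical sketch, not the fastmath implementation):

```clojure
;; Harmonic mean sketch: n divided by the sum of reciprocals.
(defn harmean-sketch [vs]
  (/ (double (count vs))
     (reduce + (map #(/ 1.0 %) vs))))

(harmean-sketch [1.0 2.0 4.0]) ;; ~1.714, i.e. 3 / (1 + 1/2 + 1/4)
```

The harmonic mean is the natural average for rates, e.g. the average speed over legs of equal distance.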

hedges-gclj

(hedges-g [group1 group2])
(hedges-g group1 group2)

Calculates Hedges's g effect size for comparing the means of two independent groups.

Hedges's g is a standardized measure quantifying the magnitude of the difference
between the means of two independent groups. It is similar to Cohen's d but
uses the *unbiased* pooled standard deviation in the denominator.

This implementation calculates g using the unbiased pooled standard deviation as the denominator.

Parameters:

- `group1`, `group2` (sequences): The two independent samples directly as arguments.

Returns the calculated Hedges's g effect size as a double.

Note: This specific function uses the unbiased pooled standard deviation but does
*not* apply the small-sample bias correction factor (often denoted as J)
sometimes associated with Hedges's g. For a bias-corrected version, see [[hedges-g-corrected]].
This function is equivalent to calling `(cohens-d group1 group2 :unbiased)`.

See also [[cohens-d]], [[hedges-g-corrected]], [[glass-delta]], [[pooled-stddev]].
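A from-scratch sketch in plain Clojure (hypothetical helper names, not the fastmath implementation): the mean difference over the unbiased pooled standard deviation `sqrt(((n1-1)*s1^2 + (n2-1)*s2^2) / (n1 + n2 - 2))`.

```clojure
;; Hedges's g sketch with the unbiased pooled standard deviation.
(defn- mean* [xs] (/ (reduce + xs) (double (count xs))))
(defn- var* [xs]
  (let [m (mean* xs)]
    (/ (reduce + (map #(let [d (- % m)] (* d d)) xs))
       (dec (count xs)))))

(defn hedges-g-sketch [g1 g2]
  (let [n1 (count g1) n2 (count g2)
        pooled (Math/sqrt (/ (+ (* (dec n1) (var* g1))
                                (* (dec n2) (var* g2)))
                             (+ n1 n2 -2)))]
    (/ (- (mean* g1) (mean* g2)) pooled)))

(hedges-g-sketch [4.0 5.0 6.0] [1.0 2.0 3.0]) ;; => 3.0
```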

hedges-g*clj

(hedges-g* [group1 group2])
(hedges-g* group1 group2)

Calculates a less biased estimate of Hedges's g effect size for comparing the means of two independent groups, using the exact J bias correction.

Hedges's g is a standardized measure of the difference between two means. For small sample sizes, the standard Hedges's g (and Cohen's d) can overestimate the true population effect size. This function applies a specific correction factor, often denoted as J, to mitigate this bias.

The calculation involves:
1. Calculating the standard Hedges's g (equivalent to [[hedges-g]], which uses the unbiased pooled standard deviation).
2. Calculating the J correction factor based on the degrees of freedom (`n1 + n2 - 2`) using the gamma function.
3. Multiplying the standard Hedges's g by the J factor.

The J factor is calculated as `(Gamma(df/2) / (sqrt(df/2) * Gamma((df-1)/2)))`.

Parameters:

- `group1` (seq of numbers): The first independent sample.
- `group2` (seq of numbers): The second independent sample.

Returns the calculated bias-corrected Hedges's g effect size as a double.

This version of Hedges's g is generally preferred over the standard version or Cohen's d when working with small sample sizes, as it provides a more accurate estimate of the population effect size.

Assumptions:
- The two samples are independent.
- Data within each group are approximately normally distributed.
- Equal variances are assumed for calculating the pooled standard deviation.

See also [[cohens-d]], [[hedges-g]] (uncorrected), [[hedges-g-corrected]] (another correction method).

hedges-g-correctedclj

(hedges-g-corrected [group1 group2])
(hedges-g-corrected group1 group2)

Calculates a small-sample bias-corrected effect size for comparing the means
of two independent groups, often referred to as a form of Hedges's g.

This function calculates Cohen's d ([[cohens-d]]) using the *unbiased*
pooled standard deviation (equivalent to [[hedges-g]]), and then applies
a specific correction factor designed to reduce the bias in the effect size
estimate for small sample sizes.

The correction factor applied is `(1 - 3 / (4 * df - 1))`, where `df` is the
degrees of freedom for the unbiased pooled variance calculation (`n1 + n2 - 2`).
This corresponds to calling [[cohens-d-corrected]] with the `:unbiased` method
for pooled standard deviation.

Parameters:

- `group1` (seq of numbers): The first independent sample.
- `group2` (seq of numbers): The second independent sample.

Returns the calculated bias-corrected effect size as a double.

Note: This function applies *a* correction factor. For the more
standard Hedges's g bias correction using the exact gamma function
based correction factor, see [[hedges-g*]].

See also [[cohens-d]], [[cohens-d-corrected]], [[hedges-g]], [[hedges-g*]],
[[pooled-stddev]].
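The correction factor itself can be sketched in plain Clojure (a hypothetical sketch of the formula stated above, not the fastmath implementation): the effect size is multiplied by `(1 - 3/(4*df - 1))` with `df = n1 + n2 - 2`, which shrinks it slightly toward zero.

```clojure
;; Small-sample correction factor sketch: 1 - 3/(4*df - 1), df = n1 + n2 - 2.
(defn small-sample-correction [n1 n2]
  (let [df (+ n1 n2 -2)]
    (- 1.0 (/ 3.0 (- (* 4.0 df) 1.0)))))

(small-sample-correction 10 10) ;; ~0.958 for df = 18
```

As the samples grow, `df` increases and the factor approaches 1, so the correction matters mainly for small groups.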

histogramclj

(histogram vs)
(histogram vs bins-or-estimate-method)
(histogram vs bins-or-estimate-method [mn mx])
(histogram vs bins-or-estimate-method mn mx)

Calculate histogram.

Estimation method can be a number, named method: `:sqrt` `:sturges` `:rice` `:doane` `:scott` `:freedman-diaconis` (default) or a sequence of points used as intervals.
In the latter case or when `mn` and `mx` values are provided - data will be filtered to fit in desired interval(s).

Returns map with keys:

* `:size` - number of bins
* `:step` - average distance between bins
* `:bins` - seq of pairs of range lower value and number of elements
* `:min` - min value
* `:max` - max value
* `:samples` - number of used samples
* `:frequencies` - a map containing counts for bin's average
* `:intervals` - intervals used to create bins
* `:bins-maps` - seq of maps containing:
  * `:min` - lower bound
  * `:mid` - middle value
  * `:max` - upper bound
  * `:step` - actual distance between bins 
  * `:count` - number of elements
  * `:avg` - average value
  * `:probability` - probability for bin

If difference between min and max values is `0`, number of bins is set to 1.
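As an illustration of the named estimators, the standard Sturges rule picks `ceil(log2 n) + 1` bins. A plain Clojure sketch (the exact formulas fastmath uses for each named method may differ in detail):

```clojure
;; Sturges bin-count rule sketch: ceil(log2 n) + 1.
(defn sturges-bins [n]
  (inc (long (Math/ceil (/ (Math/log n) (Math/log 2.0))))))

(sturges-bins 100) ;; => 8
```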

hpdi-extentclj

(hpdi-extent vs)
(hpdi-extent vs size)

Highest Posterior Density (HPD) interval + median.

`size` parameter is the target probability content of the interval.

inner-fence-extentclj

(inner-fence-extent vs)
(inner-fence-extent vs estimation-strategy)

Returns the lower inner fence (LIF), upper inner fence (UIF), and the median.

iqrclj

(iqr vs)
(iqr vs estimation-strategy)

Interquartile range.

jarque-bera-testclj

(jarque-bera-test xs)
(jarque-bera-test xs params)
(jarque-bera-test xs skew kurt {:keys [sides] :or {sides :one-sided-greater}})

Performs the Jarque-Bera goodness-of-fit test to determine if sample data
exhibits skewness and kurtosis consistent with a normal distribution.

The test assesses the null hypothesis that the data comes from a normally
distributed population (i.e., population skewness is 0 and population excess
kurtosis is 0).

The test statistic is calculated as:
`JB = (n/6) * (S^2 + (1/4)*K^2)`
where `n` is the sample size, `S` is the sample skewness (using `:g1` type),
and `K` is the excess kurtosis `:g2`.
Under the null hypothesis, the JB statistic asymptotically follows a Chi-squared
distribution with 2 degrees of freedom.

Parameters:

- `xs` (seq of numbers): The sample data.
- `skew` (double, optional): A pre-calculated sample skewness value (type `:g1`).
  If omitted, it's calculated from `xs`.
- `kurt` (double, optional): A pre-calculated sample *excess* kurtosis value (type `:g2`).
  If omitted, it's calculated from `xs`.
- `params` (map, optional): Options map:
  - `:sides` (keyword, default `:one-sided-greater`): Specifies the side(s) of the
    Chi-squared(2) distribution used for p-value calculation.
    - `:one-sided-greater` (default and standard for JB): Tests if the JB statistic is
      significantly large, indicating departure from normality.
    - `:one-sided-less`: Tests if the statistic is significantly small.
    - `:two-sided`: Tests if the statistic is extreme in either tail.

Returns a map containing:

- `:Z`: The calculated Jarque-Bera test statistic (labeled `:Z` for consistency,
         though it follows Chi-squared(2)).
- `:stat`: Alias for `:Z`.
- `:p-value`: The p-value associated with the test statistic and `:sides`, derived
               from the Chi-squared(2) distribution.
- `:skewness`: The sample skewness (type `:g1`) used in the calculation.
- `:kurtosis`: The sample kurtosis (type `:g2`) used in the calculation.

See also [[skewness-test]], [[kurtosis-test]], [[normality-test]], [[bonett-seier-test]].
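Given pre-computed sample skewness `S` and excess kurtosis `K`, the statistic above is direct to sketch in plain Clojure (a hypothetical sketch, not the fastmath implementation):

```clojure
;; Jarque-Bera statistic sketch: JB = (n/6) * (S^2 + K^2/4).
(defn jarque-bera-stat [n s k]
  (* (/ n 6.0) (+ (* s s) (* 0.25 k k))))

(jarque-bera-stat 60 0.5 1.0) ;; => 5.0, compared against Chi-squared(2)
```

Under normality both `S` and `K` are near zero, so JB stays small; large skewness or kurtosis inflates it quadratically.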

jensen-shannon-divergencecljdeprecated

(jensen-shannon-divergence [vs1 vs2])
(jensen-shannon-divergence vs1 vs2)

Jensen-Shannon divergence of two sequences.

kendall-correlationclj

(kendall-correlation [vs1 vs2])
(kendall-correlation vs1 vs2)

Calculates Kendall's rank correlation coefficient (Kendall's Tau) between two sequences.

Kendall's Tau is a non-parametric statistic used to measure the ordinal association
between two measured quantities. It assesses the degree of similarity between the
orderings of data when ranked by each of the quantities.

The coefficient value ranges from -1.0 (perfect disagreement in ranking) to 1.0
(perfect agreement in ranking), with 0.0 indicating no monotonic relationship.
Unlike Pearson correlation, it does not require the relationship to be linear.

Parameters:

- `[vs1 vs2]` (sequence of two sequences): A sequence containing the two sequences of numbers.
- `vs1`, `vs2` (sequences): The two sequences of numbers directly as arguments.

Both input sequences must contain only numbers and must have the same length.

Returns the calculated Kendall's Tau coefficient as a double.

See also [[pearson-correlation]], [[spearman-correlation]], [[correlation]].
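A usage sketch of the arities listed above (the `stats` alias is an assumption):

```clojure
(require '[fastmath.stats :as stats])

;; Perfectly concordant rankings: every pair of points agrees in order
(stats/kendall-correlation [1 2 3 4 5] [10 20 30 40 50])
;; => 1.0

;; Perfectly discordant rankings
(stats/kendall-correlation [1 2 3 4 5] [5 4 3 2 1])
;; => -1.0

;; Single-argument arity: one sequence holding both sequences
(stats/kendall-correlation [[1 2 3 4 5] [5 4 3 2 1]])
;; => -1.0
```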

kruskal-test (clj)

(kruskal-test xss)
(kruskal-test xss {:keys [sides] :or {sides :right}})

Performs the Kruskal-Wallis H-test (rank sum test) for independent samples.

The Kruskal-Wallis test is a non-parametric alternative to one-way ANOVA.
It determines whether there is a statistically significant difference between the distributions of two or more independent groups. It does not assume normality but requires that distributions have a similar shape for the test to be valid.

Parameters:

- `data-groups` (vector of sequences): A collection where each element is a sequence 
  representing a group of observations.
- `opts` (map, optional): a map containing a `:sides` key with one of `:right` (default), `:left`, or `:both`.

Returns a map containing:

- `:stat`: The Kruskal-Wallis H statistic.
- `:n`: Total number of observations across all groups.
- `:df`: Degrees of freedom (number of groups - 1).
- `:k`: Number of groups.
- `:sides`: The test side used.
- `:p-value`: The p-value for the test (null hypothesis: all groups have the same distribution).
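A usage sketch with illustrative data (the `stats` alias and group values are assumptions):

```clojure
(require '[fastmath.stats :as stats])

;; Three independent groups (5 + 4 + 5 = 14 observations)
(def groups [[2.9 3.0 2.5 2.6 3.2]
             [3.8 2.7 4.0 2.4]
             [2.8 3.4 3.7 2.2 2.0]])

;; Default right-sided test; per the return map above, :k is 3, :n is 14, :df is 2
(stats/kruskal-test groups)

;; Explicitly selecting the side
(stats/kruskal-test groups {:sides :both})
```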

ks-test-one-sample (clj)

(ks-test-one-sample xs)
(ks-test-one-sample xs distribution-or-ys)
(ks-test-one-sample xs
                    distribution-or-ys
                    {:keys [sides kernel bandwidth distinct?]
                     :or {sides :two-sided kernel :gaussian distinct? true}})

Performs the one-sample Kolmogorov-Smirnov (KS) test.

This test compares the empirical cumulative distribution function (ECDF) of a
sample `xs` against a specified theoretical distribution or the ECDF of
another empirical sample. It assesses the null hypothesis that `xs` is drawn
from the reference distribution.

Parameters:

- `xs` (seq of numbers): The sample data to be tested.
- `distribution-or-ys` (optional):
  - A `fastmath.random` distribution object to test against. If omitted, defaults
    to the standard normal distribution (`fastmath.random/default-normal`).
  - A sequence of numbers (`ys`). In this case, an empirical distribution is
    estimated from `ys` using Kernel Density Estimation (KDE) or an enumerated
    distribution (see `:kernel` option).
- `opts` (map, optional): Options map:
  - `:sides` (keyword, default `:two-sided`): Specifies the alternative hypothesis
    regarding the difference between the ECDF of `xs` and the reference CDF.
    - `:two-sided` (default): Tests if the ECDF of `xs` is different from the reference CDF.
    - `:right`: Tests if the ECDF of `xs` is significantly *below* the reference CDF (i.e., `xs` tends to have larger values, stochastically greater).
    - `:left`: Tests if the ECDF of `xs` is significantly *above* the reference CDF (i.e., `xs` tends to have smaller values, stochastically smaller).
  - `:kernel` (keyword, default `:gaussian`): Used only when `distribution-or-ys`
    is a sequence. Specifies the method to estimate the empirical distribution:
    - `:gaussian` (or other KDE kernels): Uses Kernel Density Estimation.
    - `:enumerated`: Creates a discrete empirical distribution from `ys`.
  - `:bandwidth` (double, optional): Bandwidth for KDE (if applicable).
  - `:distinct?` (boolean or keyword, default `true`): How to handle duplicate values in `xs`.
    - `true` (default): Removes duplicate values from `xs` before computation.
    - `false`: Uses all values in `xs`, including duplicates.
    - `:jitter`: Adds a small amount of random noise to each value in `xs` to break ties.

Returns a map containing:

- `:n`: Sample size of `xs` (after applying `:distinct?`).
- `:dp`: Maximum positive difference (ECDF(xs) - CDF(ref)).
- `:dn`: Maximum positive difference (CDF(ref) - ECDF(xs)).
- `:d`: The KS test statistic (max absolute difference: `max(dp, dn)`).
- `:stat`: The specific statistic used for p-value calculation, depending on `:sides` (`d`, `dp`, or `dn`).
- `:p-value`: The p-value associated with the test statistic and the specified `:sides`.
- `:sides`: The alternative hypothesis side used.
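A usage sketch (assuming `fastmath.random/grand` draws from a standard normal distribution; the aliases and data are illustrative):

```clojure
(require '[fastmath.random :as r]
         '[fastmath.stats :as stats])

;; Sample drawn from a standard normal, tested against the default normal reference
(def xs (repeatedly 200 r/grand))
(stats/ks-test-one-sample xs)
;; a large :p-value is expected here, since xs matches the reference

;; Compare against another empirical sample via KDE, one-sided alternative
(def ys (map #(+ 0.5 %) xs))
(stats/ks-test-one-sample xs ys {:sides :left :kernel :gaussian})
```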

ks-test-two-samples (clj)

(ks-test-two-samples xs ys)
(ks-test-two-samples xs
                     ys
                     {:keys [method sides distinct? correct?]
                      :or {sides :two-sided distinct? :ties correct? true}})

Performs the two-sample Kolmogorov-Smirnov (KS) test.

This test compares the empirical cumulative distribution functions (ECDFs) of two
independent samples, `xs` and `ys`, to assess the null hypothesis that they
are drawn from the same continuous distribution.

Parameters:

- `xs` (seq of numbers): The first sample.
- `ys` (seq of numbers): The second sample.
- `opts` (map, optional): Options map:
  - `:method` (keyword, optional): Specifies the calculation method for the p-value.
      - `:exact`: Attempts an exact calculation (suitable for small samples, sensitive to ties). Default if `nx * ny < 10000`.
      - `:approximate`: Uses the asymptotic Kolmogorov distribution (suitable for larger samples). Default otherwise.
  - `:sides` (keyword, default `:two-sided`): Specifies the alternative hypothesis.
      - `:two-sided` (default): Tests if the distributions differ (ECDFs are different).
      - `:right`: Tests if `xs` is stochastically greater than `ys` (ECDF(xs) is below ECDF(ys)).
      - `:left`: Tests if `xs` is stochastically smaller than `ys` (ECDF(xs) is above ECDF(ys)).
  - `:distinct?` (keyword or boolean, default `:ties`): How to handle duplicate values (ties).
      - `:ties` (default): Includes all points. Passes information about ties to the `:exact` calculation method. Accuracy depends on the exact method's tie handling.
      - `:jitter`: Adds a small amount of random noise to break ties before comparison. A practical approach if exact tie handling is complex or not required.
      - `true`: Applies `distinct` to `xs` and `ys` separately before combining. May not resolve all ties between the combined samples.
      - `false`: Uses the data as-is, without attempting to handle ties explicitly (may lead to less accurate p-values, especially with the exact method).
  - `:correct?` (boolean, default `true`): Apply continuity correction when using the `:exact` calculation method for a more accurate p-value especially for smaller sample sizes.

Returns a map containing:

- `:nx`: Number of observations in `xs` (after `:distinct?` processing if applicable).
- `:ny`: Number of observations in `ys` (after `:distinct?` processing if applicable).
- `:n`: Effective sample size used for asymptotic calculation (`nx*ny / (nx+ny)`).
- `:dp`: Maximum positive difference (ECDF(xs) - ECDF(ys)).
- `:dn`: Maximum positive difference (ECDF(ys) - ECDF(xs)).
- `:d`: The KS test statistic (max absolute difference: `max(dp, dn)`).
- `:stat`: The specific statistic used for p-value calculation (`d`, `dp`, or `dn` for exact; scaled version for approximate).
- `:KS`: Alias for `:stat`.
- `:p-value`: The p-value associated with the test statistic and `:sides`.
- `:sides`: The alternative hypothesis side used.
- `:method`: The calculation method used (`:exact` or `:approximate`).

Note on Ties: The KS test is strictly defined for continuous distributions where ties have zero probability.
The presence of ties in sample data affects the p-value calculation. The `:distinct?` option provides ways to manage this, with `:jitter` being a common pragmatic choice.
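A usage sketch with small illustrative samples (the `stats` alias and values are assumptions):

```clojure
(require '[fastmath.stats :as stats])

(def xs [1.1 2.3 0.7 1.9 2.8 1.4 0.2 2.1])
(def ys [2.0 3.1 2.6 3.8 1.7 2.9 3.3 2.4])

;; Defaults: :two-sided alternative, exact method (nx*ny < 10000 here)
(stats/ks-test-two-samples xs ys)

;; Break any ties with jitter and force the asymptotic p-value
(stats/ks-test-two-samples xs ys {:distinct? :jitter :method :approximate})
```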

kullback-leibler-divergence (clj, deprecated)

(kullback-leibler-divergence [vs1 vs2])
(kullback-leibler-divergence vs1 vs2)

Kullback-Leibler divergence of two sequences.

kurtosis (clj)

(kurtosis vs)
(kurtosis vs typ)

Calculates the kurtosis of a sequence, a measure of the 'tailedness' or 'peakedness'
of the distribution compared to a normal distribution.

Parameters:

- `vs` (seq of numbers): The input sequence.
- `typ` (keyword or sequence, optional): Specifies the type of kurtosis measure to calculate.
  Different types use different algorithms and may have different expected values
  under normality (e.g., 0 or 3). Defaults to `:G2`.

Available `typ` values:

- `:G2` (Default): Sample kurtosis based on the fourth standardized moment, as
  implemented by Apache Commons Math `Kurtosis`. Its value approaches 3 for
  a large normal sample, but the exact expected value depends on sample size.
- `:g2` or `:excess`: Sample excess kurtosis. This is calculated from `:G2`
  and adjusted for sample bias, such that the expected value for a normal
  distribution is approximately 0.
- `:kurt`: Kurtosis definition where normal = 3. Calculated as `:g2` + 3.
- `:b2`: Kurtosis defined as the fourth central moment divided by the fourth power of the standard deviation.
- `:geary`: Geary's 'g', a robust measure calculated as `mean_abs_deviation / population_stddev`.
  Expected value for normal is `sqrt(2/pi) ≈ 0.798`. Lower values indicate leptokurtosis.
- `:moors`: Moors' robust kurtosis measure based on octiles. The implementation
  returns a centered version where the expected value for normal is 0.
- `:crow`: Crow-Siddiqui robust kurtosis measure based on quantiles. The implementation
  returns a centered version where the expected value for normal is 0.
  Can accept parameters `alpha` and `beta` via sequential type `[:crow alpha beta]`.
- `:hogg`: Hogg's robust kurtosis measure based on trimmed means. The implementation
  returns a centered version where the expected value for normal is 0.
  Can accept parameters `alpha` and `beta` via sequential type `[:hogg alpha beta]`.
- `:l-kurtosis`: L-kurtosis (τ₄), the ratio of the 4th L-moment (λ₄) to the
  2nd L-moment (λ₂, L-scale). Calculated directly using [[l-moment]] with the
  `:ratio?` option set to true. It's a robust measure.
  Expected value for normal distribution is ≈ 0.1226.

Interpretation (for excess kurtosis `:g2`):

- Positive values indicate a leptokurtic distribution (heavier tails, more peaked than normal).
- Negative values indicate a platykurtic distribution (lighter tails, flatter than normal).
- Values near 0 suggest kurtosis similar to a normal distribution.

Returns the calculated kurtosis value as a double.

See also [[kurtosis-test]], [[bonett-seier-test]], [[normality-test]], [[jarque-bera-test]], [[l-moment]].
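A usage sketch of the type keywords above (the `stats` alias and data values are illustrative):

```clojure
(require '[fastmath.stats :as stats])

;; One large outlier (9.0) makes this sample heavy-tailed
(def vs [2.0 3.1 1.8 2.5 2.2 9.0 2.7 2.4 2.1 2.9])

(stats/kurtosis vs)        ;; default :G2 (Apache Commons Math definition)
(stats/kurtosis vs :g2)    ;; excess kurtosis, approximately 0 under normality
(stats/kurtosis vs :geary) ;; robust; approximately 0.798 under normality

;; Parameterized robust variants use a sequential type
(stats/kurtosis vs [:crow 0.025 0.25])
```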

kurtosis-test (clj)

(kurtosis-test xs)
(kurtosis-test xs params)
(kurtosis-test xs kurt {:keys [sides type] :or {sides :two-sided type :kurt}})

Performs a test for normality based on sample kurtosis.

This test assesses the null hypothesis that the data comes from a normally
distributed population by checking if the sample kurtosis significantly deviates
from the kurtosis expected under normality (approximately 3).

The test works by:

1. Calculating the sample kurtosis (type configurable via `:type`, default `:kurt`).
2. Standardizing the difference between the sample kurtosis and the expected
   kurtosis under normality using the theoretical standard error.
3. Applying a further transformation (e.g., Anscombe-Glynn/D'Agostino) to this standardized
   score to yield a final test statistic `Z` that more closely follows a
   standard normal distribution under the null hypothesis, especially for
   smaller sample sizes.

Parameters:

- `xs` (seq of numbers): The sample data.
- `kurt` (double, optional): A pre-calculated kurtosis value. If omitted, it's calculated from `xs`.
- `params` (map, optional): Options map:
  - `:sides` (keyword, default `:two-sided`): Specifies the alternative hypothesis.
    - `:two-sided` (default): The population kurtosis is different from normal.
    - `:one-sided-greater`: The population kurtosis is greater than normal (leptokurtic).
    - `:one-sided-less`: The population kurtosis is less than normal (platykurtic).
  - `:type` (keyword, default `:kurt`): The type of kurtosis to calculate if `kurt` is not provided. See [[kurtosis]] for options (e.g., `:kurt`, `:G2`, `:g2`).

Returns a map containing:

- `:Z`: The final test statistic, approximately standard normal under H0.
- `:stat`: Alias for `:Z`.
- `:p-value`: The p-value associated with `Z` and the specified `:sides`.
- `:kurtosis`: The sample kurtosis value used in the test (either provided or calculated).

See also [[skewness-test]], [[normality-test]], [[jarque-bera-test]], [[bonett-seier-test]].
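A usage sketch (the `stats` alias and sample values are illustrative assumptions):

```clojure
(require '[fastmath.stats :as stats])

(def xs [1.2 -0.4 0.8 2.1 -1.3 0.5 -0.2 1.7 -0.9 0.3 1.1 -0.6])

;; Two-sided test with the default :kurt kurtosis type
(stats/kurtosis-test xs)
;; returns a map with :Z :stat :p-value :kurtosis

;; One-sided alternative: is the population leptokurtic?
(stats/kurtosis-test xs {:sides :one-sided-greater :type :g2})
```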

l-moment (clj)

(l-moment vs order)
(l-moment vs order {:keys [s t sorted? ratio?] :or {s 0 t 0} :as opts})

Calculates L-moment, TL-moment (trimmed) or (T)L-moment ratios.

Options:

- `:s` (default: 0) - number of left trimmed values
- `:t` (default: 0) - number of right trimmed values
- `:sorted?` (default: false) - set to true if the input is already sorted
- `:ratio?` (default: false) - return the normalized L-moment (L-moment ratio)
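A usage sketch of the options above (the `stats` alias is assumed; the first L-moment is the sample mean, so its value here is hand-checkable):

```clojure
(require '[fastmath.stats :as stats])

(def vs [1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0])

(stats/l-moment vs 1)                 ;; first L-moment equals the sample mean, 4.5
(stats/l-moment vs 2)                 ;; second L-moment, L-scale
(stats/l-moment vs 4 {:ratio? true})  ;; L-kurtosis, tau-4

;; TL-moment: trim one value from each tail before estimating
(stats/l-moment vs 2 {:s 1 :t 1})
```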

l-variation (clj)

(l-variation vs)

Coefficient of L-variation, L-CV

L0 (clj)

Count equal values in both seqs. Alias for [[count==]]

L1 (clj)

(L1 [vs1 vs2-or-val])
(L1 vs1 vs2-or-val)

Calculates the L1 distance (Manhattan or City Block distance) between two sequences or a sequence and a constant value.

The L1 distance is the sum of the absolute differences between corresponding elements.

Parameters:

- `vs1` (sequence of numbers): The first sequence.
- `vs2-or-val` (sequence of numbers or single number): The second sequence of numbers, or a single number to compare against each element of `vs1`.

If both inputs are sequences, they must have the same length. If `vs2-or-val`
is a single number, it is effectively treated as a sequence of that number
repeated `count(vs1)` times.

Returns the calculated L1 distance as a double.

See also [[L2]], [[L2sq]], [[LInf]], [[mae]] (Mean Absolute Error).

L2 (clj)

(L2 [vs1 vs2-or-val])
(L2 vs1 vs2-or-val)

Calculates the L2 distance (Euclidean distance) between two sequences or a sequence and a constant value.

This is the standard straight-line distance between two points (vectors) in Euclidean space.
It is the square root of the [[L2sq]] distance.

Parameters:

- `vs1` (sequence of numbers): The first sequence.
- `vs2-or-val` (sequence of numbers or single number): The second sequence of numbers, or a single number to compare against each element of `vs1`.

If both inputs are sequences, they must have the same length. If `vs2-or-val`
is a single number, it is effectively treated as a sequence of that number
repeated `count(vs1)` times.

Returns the calculated L2 distance as a double.

See also [[L1]], [[L2sq]], [[LInf]], [[rmse]] (Root Mean Squared Error).

L2sq (clj)

(L2sq [vs1 vs2-or-val])
(L2sq vs1 vs2-or-val)

Calculates the Squared Euclidean distance between two sequences or a sequence and a constant value.

This is the sum of the squared differences between corresponding elements.
It is equivalent to the [[rss]] (Residual Sum of Squares).

Parameters:

- `vs1` (sequence of numbers): The first sequence.
- `vs2-or-val` (sequence of numbers or single number): The second sequence of numbers, or a single number to compare against each element of `vs1`.

If both inputs are sequences, they must have the same length. If `vs2-or-val`
is a single number, it is effectively treated as a sequence of that number
repeated `count(vs1)` times.

Returns the calculated Squared Euclidean distance as a double.

See also [[L1]], [[L2]], [[LInf]], [[rss]] (Residual Sum of Squares), [[mse]] (Mean Squared Error).
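The L-p distances differ only in how the element-wise differences are aggregated; a sketch with hand-checkable values (the `stats` alias is assumed):

```clojure
(require '[fastmath.stats :as stats])

;; Element-wise absolute differences: |1-2|=1, |2-2|=0, |3-5|=2
(stats/L1   [1 2 3] [2 2 5]) ;; => 3.0  (sum of absolute differences)
(stats/L2sq [1 2 3] [2 2 5]) ;; => 5.0  (1 + 0 + 4)
(stats/L2   [1 2 3] [2 2 5]) ;; square root of 5, about 2.236

;; A constant second argument is compared against every element
(stats/L1 [1 2 3] 2) ;; => 2.0  (1 + 0 + 1)
```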

levene-test (clj)

(levene-test xss)
(levene-test xss
             {:keys [sides statistic scorediff]
              :or {sides :one-sided-greater statistic mean scorediff abs}})

Performs Levene's test for homogeneity of variances across two or more groups.

Levene's test assesses the null hypothesis that the variances of the groups are equal.
It calculates an ANOVA on the absolute deviations of the data points from their group
center (mean by default).

Parameters:

- `xss` (sequence of sequences): A collection where each element is a sequence representing a group of observations.
- `params` (map, optional): Options map with the following keys:
  - `:sides` (keyword, default `:one-sided-greater`): Alternative hypothesis side for the F-test.
    Possible values: `:one-sided-greater`, `:one-sided-less`, `:two-sided`.
  - `:statistic` (fn, default [[mean]]): Function to calculate the center of each group (e.g., [[mean]], [[median]]). Using [[median]] results in the Brown-Forsythe test.
  - `:scorediff` (fn, default [[abs]]): Function applied to the difference between each data point and its group center (e.g., [[abs]], [[sq]]).

Returns a map containing:

- `:W`: The Levene test statistic (which is an F-statistic).
- `:stat`: Alias for `:W`.
- `:p-value`: The p-value for the test.
- `:df`: Degrees of freedom for the F-statistic ([DFt, DFe]).
- `:n`: Sequence of sample sizes for each group.
- `:SSt`: Sum of squares between groups (treatment).
- `:SSe`: Sum of squares within groups (error).
- `:DFt`: Degrees of freedom between groups.
- `:DFe`: Degrees of freedom within groups.
- `:MSt`: Mean square between groups.
- `:MSe`: Mean square within groups.
- `:sides`: Test side used.

See also [[brown-forsythe-test]].
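The procedure described above — a one-way ANOVA on absolute deviations from each group's center — can be sketched in plain Python. This is a simplified illustration of the default settings (mean center, `abs` score difference) that returns only the W statistic, not fastmath's full result map:

```python
def levene_w(groups):
    """Levene W: one-way ANOVA F statistic on |x - group mean| scores."""
    means = [sum(g) / len(g) for g in groups]
    # Absolute deviations from each group's center (mean by default;
    # substituting the median here gives the Brown-Forsythe test).
    zs = [[abs(x - m) for x in g] for g, m in zip(groups, means)]
    n = sum(len(z) for z in zs)
    k = len(zs)
    grand = sum(x for z in zs for x in z) / n
    zbars = [sum(z) / len(z) for z in zs]
    sst = sum(len(z) * (zb - grand) ** 2 for z, zb in zip(zs, zbars))
    sse = sum((x - zb) ** 2 for z, zb in zip(zs, zbars) for x in z)
    dft, dfe = k - 1, n - k
    return (sst / dft) / (sse / dfe)
```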

LInfclj

(LInf [vs1 vs2-or-val])
(LInf vs1 vs2-or-val)

Calculates the L-infinity distance (Chebyshev distance) between two sequences or a sequence and a constant value.

The Chebyshev distance is the maximum absolute difference between corresponding elements.

Parameters:

- `vs1` (sequence of numbers): The first sequence.
- `vs2-or-val` (sequence of numbers or single number): The second sequence of numbers, or a single number to compare against each element of `vs1`.

If both inputs are sequences, they must have the same length. If `vs2-or-val`
is a single number, it is effectively treated as a sequence of that number
repeated `count(vs1)` times.

Returns the calculated L-infinity distance as a double.

See also [[L1]], [[L2]], [[L2sq]].
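A minimal Python sketch of the Chebyshev distance with the same scalar-broadcasting behaviour (`linf` is a hypothetical name, not fastmath's API):

```python
def linf(vs1, vs2_or_val):
    """Chebyshev distance: largest absolute elementwise difference."""
    if isinstance(vs2_or_val, (int, float)):
        vs2 = [vs2_or_val] * len(vs1)
    else:
        vs2 = list(vs2_or_val)
    return float(max(abs(a - b) for a, b in zip(vs1, vs2)))
```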

madclj

Alias for [[median-absolute-deviation]]

mad-extentclj

(mad-extent vs)

-/+ median-absolute-deviation and median

maeclj

(mae [vs1 vs2-or-val])
(mae vs1 vs2-or-val)

Calculates the Mean Absolute Error (MAE) between two sequences or a sequence and a constant value.

MAE is a measure of the difference between two sequences of values. It quantifies
the average magnitude of the errors, without considering their direction.

Parameters:

- `vs1` (sequence of numbers): The first sequence (often the observed or true values).
- `vs2-or-val` (sequence of numbers or single number): The second sequence
  (often the predicted or reference values), or a single number to compare
  against each element of `vs1`.

If both inputs are sequences, they must have the same length. If `vs2-or-val`
is a single number, it is effectively treated as a sequence of that number
repeated `count(vs1)` times.

Returns the calculated Mean Absolute Error as a double.

Note: MAE is less sensitive to large outliers than metrics like Mean Squared Error (MSE)
because it uses the absolute value of differences rather than the squared difference.

See also [[me]] (Mean Error), [[mse]] (Mean Squared Error), [[rmse]] (Root Mean Squared Error).
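The metric is the mean of the absolute elementwise differences, sketched here in Python (an illustration of the formula, not the fastmath implementation):

```python
def mae(vs1, vs2_or_val):
    """Mean Absolute Error: average magnitude of elementwise differences."""
    if isinstance(vs2_or_val, (int, float)):
        vs2 = [vs2_or_val] * len(vs1)
    else:
        vs2 = list(vs2_or_val)
    return sum(abs(a - b) for a, b in zip(vs1, vs2)) / len(vs1)
```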

mapeclj

(mape [vs1 vs2-or-val])
(mape vs1 vs2-or-val)

Calculates the Mean Absolute Percentage Error (MAPE) between two sequences
or a sequence and a constant value.

MAPE is a measure of prediction accuracy of a forecasting method, for example
in time series analysis. It is calculated as the average of the absolute
percentage errors.

Parameters:

- `vs1` (sequence of numbers): The first sequence (conventionally, the actual or true values).
- `vs2-or-val` (sequence of numbers or single number): The second sequence
  (conventionally, the predicted or reference values), or a single number to
  compare against each element of `vs1`.

If both inputs are sequences, they must have the same length. If `vs2-or-val`
is a single number, it is effectively treated as a sequence of that number
repeated `count(vs1)` times.

Returns the calculated Mean Absolute Percentage Error as a double.

Note: MAPE is scale-independent and useful for comparing performance across
different datasets. However, it is undefined if any of the actual values (`x_i`)
are zero, and can be skewed by small actual values.

See also [[me]] (Mean Error), [[mae]] (Mean Absolute Error), [[mse]] (Mean Squared Error), [[rmse]] (Root Mean Squared Error).
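A Python sketch of the formula — note that the denominator uses the actual values (`vs1`), which is why zeros there make the metric undefined (illustrative code, not fastmath's implementation):

```python
def mape(vs1, vs2_or_val):
    """Mean Absolute Percentage Error; vs1 holds the actual values."""
    if isinstance(vs2_or_val, (int, float)):
        vs2 = [vs2_or_val] * len(vs1)
    else:
        vs2 = list(vs2_or_val)
    # Division by each actual value x_i: undefined when any x_i is zero.
    return sum(abs(a - b) / abs(a) for a, b in zip(vs1, vs2)) / len(vs1)
```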

maximumclj

(maximum vs)

Finds the maximum value in a sequence of numbers.

mccclj

(mcc ct)
(mcc group1 group2)

Calculates the Matthews Correlation Coefficient (MCC), also known as the Phi coefficient,
for a 2x2 contingency table or binary classification outcomes.

MCC is a measure of the quality of binary classifications. It is a balanced
measure which can be used even if the classes are of very different sizes.
Its value ranges from -1 to +1.

- A coefficient of +1 represents a perfect prediction.
- 0 represents a prediction no better than random.
- -1 represents a perfect inverse prediction.

The function can be called in two ways:

1.  With two sequences `group1` and `group2`:
    The function will automatically construct a 2x2 contingency table from
    the unique values in the sequences (assuming they represent two binary
    variables). The mapping of values to table cells (e.g., what corresponds
    to TP, TN, FP, FN) depends on how `contingency-table` orders the unique values.
    For direct control over which cell is which, use the contingency table input.

2.  With a contingency table:
    The contingency table can be provided as:
    - A map where keys are `[row-index, column-index]` tuples and values are counts
      (e.g., `{[0 0] TP, [0 1] FP, [1 0] FN, [1 1] TN}`). This is the output format
      of [[contingency-table]] with two inputs.
    - A sequence of sequences representing the rows of the table
      (e.g., `[[TP FP] [FN TN]]`). This is equivalent to `rows->contingency-table`.

Parameters:

- `group1` (sequence): The first sequence of binary outcomes/categories.
- `group2` (sequence): The second sequence of binary outcomes/categories.
  Must have the same length as `group1`.
- `contingency-table` (map or sequence of sequences): A pre-computed 2x2 contingency table.

Returns the calculated Matthews Correlation Coefficient as a double.

Note: The implementation uses marginal sums from the contingency table, which
is mathematically equivalent to the standard formula but avoids potential
division by zero in the denominator product if any marginal sum is zero.

See also [[contingency-table]], [[contingency-2x2-measures]], [[binary-measures-all]].
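For a pre-computed 2x2 table the coefficient reduces to a one-line formula. A Python sketch of the textbook form, with cells named after the `[[TP FP] [FN TN]]` layout above (fastmath's marginal-sum form is algebraically equivalent but avoids the explicit zero check):

```python
import math

def mcc_2x2(tp, fp, fn, tn):
    """Matthews Correlation Coefficient from 2x2 cell counts."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Textbook convention: return 0 when any marginal sum is zero.
    return num / den if den else 0.0
```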

meclj

(me [vs1 vs2-or-val])
(me vs1 vs2-or-val)

Calculates the Mean Error (ME) between two sequences or a sequence and a constant value.

Parameters:

- `vs1` (sequence of numbers): The first sequence.
- `vs2-or-val` (sequence of numbers or single number): The second sequence of
  numbers, or a single number to compare against each element of `vs1`.

If both inputs are sequences, they must have the same length.
If `vs2-or-val` is a single number, it is compared element-wise to `vs1`.

Returns the calculated Mean Error as a double.

Note: Positive ME indicates that `vs1` values tend to be greater than `vs2` values
on average, while negative ME indicates `vs1` values tend to be smaller. ME can be
influenced by the magnitude of errors and their signs. It does not directly measure
the magnitude of the typical error due to potential cancellation of positive and
negative differences.

See also [[mae]] (Mean Absolute Error), [[mse]] (Mean Squared Error), [[rmse]] (Root Mean Squared Error).
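The signed averaging — and the cancellation the note warns about — can be seen in a short Python sketch (illustrative, not fastmath's implementation):

```python
def me(vs1, vs2_or_val):
    """Mean Error: signed average of elementwise differences vs1 - vs2."""
    if isinstance(vs2_or_val, (int, float)):
        vs2 = [vs2_or_val] * len(vs1)
    else:
        vs2 = list(vs2_or_val)
    return sum(a - b for a, b in zip(vs1, vs2)) / len(vs1)
```

For example, `me([1, -1], 0)` is `0.0` even though every element is off by 1 — the positive and negative errors cancel, which MAE would not hide.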

meanclj

(mean vs)
(mean vs weights)

Calculates the arithmetic mean (average) of a sequence `vs`.

If `weights` are provided, calculates the weighted arithmetic mean.

Parameters:

- `vs`: Sequence of numbers.
- `weights` (optional): Sequence of non-negative weights corresponding to `vs`.
  Must have the same count as `vs`.

Returns the calculated mean as a double.

See also [[geomean]], [[harmean]], [[powmean]], [[median]].
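Both arities can be sketched in Python — the weighted form divides the weight-scaled sum by the total weight (illustrative code, not fastmath's implementation):

```python
def mean(vs, weights=None):
    """Arithmetic mean; weighted arithmetic mean when weights are given."""
    if weights is None:
        return sum(vs) / len(vs)
    if len(weights) != len(vs):
        raise ValueError("weights must have the same count as vs")
    return sum(w * v for v, w in zip(vs, weights)) / sum(weights)
```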

mean-absolute-deviationclj

(mean-absolute-deviation vs)
(mean-absolute-deviation vs center)

Calculates the Mean Absolute Deviation of a sequence `vs`.

MeanAD is a measure of the variability of a univariate sample of quantitative data.
It is defined as the mean of the absolute deviations from a central point,
typically the data's mean.

`MeanAD = mean(|X_i - center|)`

Parameters:

- `vs`: Sequence of numbers.
- `center` (optional, double): The central point from which to calculate deviations.
  If `nil` or not provided, the arithmetic [[mean]] of `vs` is used as the center.

Returns the calculated Mean Absolute Deviation as a double.

Unlike [[median-absolute-deviation]], which uses the median of absolute deviations
from the median, the Mean Absolute Deviation uses the mean of absolute deviations
from the mean (or specified center). This makes it more sensitive to outliers
than [[median-absolute-deviation]] but less sensitive than the standard deviation.

See also [[median-absolute-deviation]], [[stddev]], [[mean]].
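The formula `MeanAD = mean(|X_i - center|)` translates directly (a Python illustration, not fastmath's implementation):

```python
def mean_absolute_deviation(vs, center=None):
    """Mean of absolute deviations from center (default: the mean of vs)."""
    if center is None:
        center = sum(vs) / len(vs)
    return sum(abs(v - center) for v in vs) / len(vs)
```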

means-ratioclj

(means-ratio [group1 group2])
(means-ratio group1 group2)
(means-ratio group1 group2 adjusted?)

Calculates the ratio of the mean of `group1` to the mean of `group2`.

This is a measure of effect size in the 'Ratio Family', comparing the central tendency
of two groups multiplicatively.

Parameters:

- `group1` (seq of numbers): The first independent sample. The mean of this group is the numerator.
- `group2` (seq of numbers): The second independent sample. The mean of this group is the denominator.
- `adjusted?` (boolean, optional): If `true`, applies a small-sample bias correction to the ratio.
  Defaults to `false`.

Returns the calculated ratio of means as a double.

A value greater than 1 indicates that `group1` has a larger mean than `group2`.
A value less than 1 indicates `group1` has a smaller mean.
A value close to 1 indicates similar means.

The `adjusted?` version attempts to provide a less biased estimate of the population
mean ratio, particularly for small sample sizes, by incorporating variances into the calculation
(based on Bickel and Doksum, see also [[means-ratio-corrected]]).

See also [[means-ratio-corrected]] (which is equivalent to calling this with `adjusted?` set to `true`).

means-ratio-correctedclj

(means-ratio-corrected [group1 group2])
(means-ratio-corrected group1 group2)

Calculates a bias-corrected ratio of the mean of `group1` to the mean of `group2`.

This function applies a correction (based on Bickel and Doksum) to the simple
ratio `mean(group1) / mean(group2)` to reduce bias, particularly for small
sample sizes.

It is equivalent to calling `(means-ratio group1 group2 true)`.

Parameters:

- `group1` (seq of numbers): The first independent sample. The mean of this group
  is the numerator.
- `group2` (seq of numbers): The second independent sample. The mean of this group
  is the denominator.

Returns the calculated bias-corrected ratio of means as a double.

See also [[means-ratio]] (for the simple, uncorrected ratio).

medianclj

(median vs)
(median vs estimation-strategy)

Calculates the median of a sequence `vs`.

An optional `estimation-strategy` keyword can be provided to specify the
method used for estimating the quantile, particularly how interpolation is
handled when the desired quantile falls between data points in the sorted
sequence.

Available `estimation-strategy` values:

- `:legacy` (Default): The original method used in Apache Commons Math.
- `:r1` through `:r9`: Correspond to the nine quantile estimation algorithms
    recommended by Hyndman and Fan (1996). Each strategy differs slightly in how it calculates the index (e.g., using `np` or `(n+1)p`) and how it interpolates between points.

For detailed mathematical descriptions of each estimation strategy, refer to
the [Apache Commons Math Percentile documentation](http://commons.apache.org/proper/commons-math/javadocs/api-3.6.1/org/apache/commons/math3/stat/descriptive/rank/Percentile.EstimationType.html).

See also [[quantile]], [[median-3]]
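To make the index/interpolation idea concrete, here is one of the Hyndman-Fan strategies (`:r7`, the default in R and NumPy) sketched in Python. This illustrates the interpolation mechanism only — it is not fastmath's `:legacy` default:

```python
def quantile_r7(vs, p):
    """Hyndman-Fan R-7 estimate: linear interpolation at index (n-1)*p."""
    xs = sorted(vs)
    h = (len(xs) - 1) * p
    lo = int(h)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (h - lo) * (xs[hi] - xs[lo])

def median_r7(vs):
    return quantile_r7(vs, 0.5)
```

For an even-length sequence the desired quantile falls between two data points, so the result is interpolated: `median_r7([1, 2, 3, 4])` gives `2.5`.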

median-3clj

(median-3 a b c)

Median of three values. See [[median]].

median-absolute-deviationclj

(median-absolute-deviation vs)
(median-absolute-deviation vs center-or-estimation-strategy)
(median-absolute-deviation vs center estimation-strategy)

Calculates the Median Absolute Deviation (MAD) of a sequence `vs`.

MAD is a robust measure of the variability of a univariate sample of quantitative
data. It is defined as the median of the absolute deviations from the data's median
(or a specified center).

`MAD = median(|X_i - median(X)|)`

Parameters:

- `vs`: Sequence of numbers.
- `center-or-estimation-strategy` (optional): The central point from which to calculate deviations or estimation strategy.
  If `nil` or not provided, the [[median]] of `vs` is used as the center. If keyword, it's treated as estimation strategy for median.
- `estimation-strategy` (optional, keyword): The estimation strategy to use for
  calculating the median(s). This applies to the calculation of the central
  value (if `center` is not provided) and to the final median of the absolute
  deviations. See [[median]] or [[quantile]] for available strategies (e.g.,
  `:legacy`, `:r1` through `:r9`).

Returns the calculated MAD as a double.

MAD is less sensitive to outliers than the standard deviation.

See also [[mean-absolute-deviation]], [[stddev]], [[median]], [[quantile]].
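The formula `MAD = median(|X_i - median(X)|)` and its robustness can be demonstrated in a few lines of Python (illustrative, without the estimation-strategy options):

```python
from statistics import median

def median_absolute_deviation(vs, center=None):
    """MAD = median(|x - center|), center defaulting to median(vs)."""
    c = median(vs) if center is None else center
    return median(abs(v - c) for v in vs)
```

For `[1, 2, 3, 4, 100]` the MAD is `1` — the extreme outlier barely moves it, while the standard deviation of the same data would be dominated by it.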

minimumclj

(minimum vs)

Finds the minimum value in a sequence of numbers.

minimum-discrimination-information-testclj

(minimum-discrimination-information-test contingency-table-or-xs)
(minimum-discrimination-information-test contingency-table-or-xs params)

Minimum discrimination information test, a power divergence test with `lambda` = -1.0.

Performs a power divergence test, which encompasses several common statistical tests
  like Chi-squared, G-test (likelihood ratio), etc., based on the lambda parameter.
  This function can perform either a goodness-of-fit test or a test for independence
  in a contingency table.

  Usage:

  1.  **Goodness-of-Fit (GOF):**
      - Input: `observed-counts` (sequence of numbers) and `:p` (expected probabilities/weights).
      - Input: `data` (sequence of numbers) and `:p` (a distribution object).
        In this case, a histogram of `data` is created (controlled by `:bins`) and
        compared against the probability mass/density of the distribution in those bins.

  2.  **Test for Independence:**
      - Input: `contingency-table` (2D sequence or map format). The `:p` option is ignored.

  Options map:

  * `:lambda` (double, default: `2/3`): Determines the specific test statistic. Common values:
      * `1.0`: Pearson Chi-squared test ([[chisq-test]]).
      * `0.0`: G-test / Multinomial Likelihood Ratio test ([[multinomial-likelihood-ratio-test]]).
      * `-0.5`: Freeman-Tukey test ([[freeman-tukey-test]]).
      * `-1.0`: Minimum Discrimination Information test ([[minimum-discrimination-information-test]]).
      * `-2.0`: Neyman Modified Chi-squared test ([[neyman-modified-chisq-test]]).
      * `2/3`: Cressie-Read test (default, [[cressie-read-test]]).
  * `:p` (seq of numbers or distribution): Expected probabilities/weights (for GOF with counts)
    or a `fastmath.random` distribution object (for GOF with data). Ignored for independence tests.
  * `:alpha` (double, default: `0.05`): Significance level for confidence intervals.
  * `:ci-sides` (keyword, default: `:two-sided`): Sides for bootstrap confidence intervals
    (`:two-sided`, `:one-sided-greater`, `:one-sided-less`).
  * `:sides` (keyword, default: `:one-sided-greater`): Alternative hypothesis side for the p-value calculation
    against the Chi-squared distribution (`:one-sided-greater`, `:one-sided-less`, `:two-sided`).
  * `:bootstrap-samples` (long, default: `1000`): Number of bootstrap samples for confidence interval estimation.
  * `:ddof` (long, default: `0`): Delta degrees of freedom. Adjustment subtracted from the calculated degrees of freedom.
  * `:bins` (number, keyword, or seq): Used only for GOF test against a distribution.
    Specifies the number of bins, an estimation method (see [[histogram]]), or explicit bin edges for histogram creation.

  Returns a map containing:

  - `:stat`: The calculated power divergence test statistic.
  - `:chi2`: Alias for `:stat`.
  - `:df`: Degrees of freedom for the test.
  - `:p-value`: The p-value associated with the test statistic.
  - `:n`: Total number of observations.
  - `:estimate`: Observed proportions.
  - `:expected`: Expected counts or proportions under the null hypothesis.
  - `:confidence-interval`: Bootstrap confidence intervals for the observed proportions.
  - `:lambda`, `:alpha`, `:sides`, `:ci-sides`: Input options used.
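The family of statistics behind the `:lambda` values above can be sketched directly from the Cressie-Read defining formula. A Python illustration of the statistic only (the p-value, degrees of freedom, and bootstrap intervals are omitted); `lambda` 0 and -1 are handled as limits of the general expression:

```python
import math

def power_divergence_stat(observed, expected, lam):
    """Cressie-Read power divergence statistic for a given lambda."""
    pairs = list(zip(observed, expected))
    if lam == 0:   # limit case: G-test / likelihood ratio
        return 2 * sum(o * math.log(o / e) for o, e in pairs)
    if lam == -1:  # limit case: minimum discrimination information
        return 2 * sum(e * math.log(e / o) for o, e in pairs)
    # General case: 2/(lam*(lam+1)) * sum(O * ((O/E)^lam - 1))
    return (2 / (lam * (lam + 1))) * sum(
        o * ((o / e) ** lam - 1) for o, e in pairs)
```

At `lam = 1.0` and equal totals this reproduces the Pearson chi-squared statistic.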

modeclj

(mode vs)
(mode vs method)
(mode vs method opts)

Find the value that appears most often in a dataset `vs`.

If multiple values share the same highest frequency (or estimated density/histogram peak),
this function returns only the *first* one encountered during processing. The specific
mode returned in case of a tie is not guaranteed to be stable. Use [[modes]] if
you need all tied modes.

For samples potentially drawn from a continuous distribution, several estimation
methods are provided via the `method` argument:

* `:histogram`: Calculates the mode based on the peak of a histogram constructed from `vs`.
  Uses interpolation within the bin with the highest frequency.
  Accepts options via `opts`, primarily `:bins` to control histogram
  construction (see [[histogram]]).
* `:pearson`: Estimates the mode using Pearson's second skewness coefficient
  formula: `mode ≈ 3 * median - 2 * mean`. Accepts `:estimation-strategy`
  in `opts` for median calculation (see [[median]]).
* `:kde`: Estimates the mode by finding the original data point in `vs`
  with the highest estimated probability density, based on Kernel Density
  Estimation (KDE). Accepts KDE options in `opts` like `:kernel`, `:bandwidth`,
  etc. (passed to `fastmath.kernel.density/kernel-density`).
* `:default` (or when `method` is omitted): Finds the exact value that occurs
  most frequently in `vs`. Suitable for discrete data.

The optional `opts` map provides method-specific configuration.

See also [[modes]] (returns all modes) and [[wmode]] (for weighted data).
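As an illustration of the `:default` method, here is a minimal sketch using only `clojure.core` (`naive-mode` is a hypothetical helper, not part of this namespace; fastmath's implementation is optimized but follows the same idea):

```clojure
;; Sketch of the :default (discrete) mode: pick the value with the
;; highest frequency. On ties, whichever entry max-key sees last wins,
;; mirroring the "returned mode is not stable on ties" caveat above.
(defn naive-mode [vs]
  (key (apply max-key val (frequencies vs))))

(naive-mode [1 2 2 3 3 3])
;; => 3
```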
sourceraw docstring

modesclj

(modes vs)
(modes vs method)
(modes vs method opts)

Find the values that appear most often in a dataset `vs`.

Returns a sequence of all the most frequently appearing values. For the default method
(discrete data), modes are sorted in increasing order.

For samples potentially drawn from a continuous distribution, simply finding the
most frequent exact value might not be meaningful. Several estimation methods
are provided via the `method` argument:

* `:histogram`: Calculates the mode(s) based on the peak(s) of a histogram constructed
  from `vs`. Uses interpolation within the bin(s) with the highest frequency.
  Accepts options via `opts`, primarily `:bins` to control histogram
  construction (see [[histogram]]).
* `:pearson`: Estimates the mode using Pearson's second skewness coefficient
  formula: `mode ≈ 3 * median - 2 * mean`. Accepts `:estimation-strategy`
  in `opts` for median calculation (see [[median]]). Returns a single estimated mode.
* `:kde`: Estimates the mode(s) by finding the original data points in `vs`
  with the highest estimated probability density, based on Kernel Density
  Estimation (KDE). Accepts KDE options in `opts` like `:kernel`, `:bandwidth`,
  etc. (passed to `fastmath.kernel.density/kernel-density`).
* `:default` (or when `method` is omitted): Finds the exact value(s) that occur
  most frequently in `vs`. Suitable for discrete data.

The optional `opts` map provides method-specific configuration.

See also [[mode]] (returns only the first mode) and [[wmodes]] (for weighted data).
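A minimal sketch of the `:default` behaviour, keeping every tied value (`naive-modes` is a hypothetical helper for illustration only):

```clojure
;; Sketch of the :default method of modes: keep every value that
;; attains the maximal frequency, sorted in increasing order.
(defn naive-modes [vs]
  (let [fs (frequencies vs)
        m  (apply max (vals fs))]
    (->> fs
         (filter #(= m (val %)))
         (map key)
         sort)))

(naive-modes [1 1 2 2 3])
;; => (1 2)
```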
sourceraw docstring

modified-power-transformationcljdeprecated

(modified-power-transformation xs)
(modified-power-transformation xs lambda)
(modified-power-transformation xs lambda alpha)

Applies a modified power transformation (Bickel and Doksum) to data.
sourceraw docstring

momentclj

(moment vs)
(moment vs order)
(moment vs order {:keys [absolute? center mean? normalize?] :or {mean? true}})

Calculate moment (central or/and absolute) of given order (default: 2).

Additional parameters as a map:

* `:absolute?` - calculate sum as absolute values (default: `false`)
* `:mean?` - returns mean (proper moment) or just sum of differences (default: `true`)
* `:center` - value of center (default: `nil` = mean)
* `:normalize?` - normalize the result by dividing by the standard deviation raised to the `order` power
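For illustration, a stripped-down sketch of the defaults (`:mean?` true, center at the mean, no normalization); `naive-moment` is a hypothetical helper, not this function:

```clojure
;; Central moment of a given order: mean of (x - mu)^order.
;; The 2nd central moment is the (biased) sample variance.
(defn naive-moment [vs order]
  (let [n  (count vs)
        mu (/ (reduce + vs) n)]
    (/ (reduce + (map #(Math/pow (- % mu) order) vs)) n)))

(naive-moment [1.0 2.0 3.0 4.0] 2)
;; => 1.25
```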
sourceraw docstring

mseclj

(mse [vs1 vs2-or-val])
(mse vs1 vs2-or-val)

Calculates the Mean Squared Error (MSE) between two sequences or a sequence and a constant value.

MSE is a measure of the quality of an estimator or predictor. It quantifies
the average of the squared differences between corresponding elements of the
input sequences.

Parameters:

- `vs1` (sequence of numbers): The first sequence (often the observed or true values).
- `vs2-or-val` (sequence of numbers or single number): The second sequence
  (often the predicted or reference values), or a single number to compare
  against each element of `vs1`.

If both inputs are sequences, they must have the same length. If `vs2-or-val`
is a single number, it is effectively treated as a sequence of that number
repeated `count(vs1)` times.

Returns the calculated Mean Squared Error as a double.

Note: MSE penalizes larger errors more heavily than smaller errors because the
errors are squared. This makes it sensitive to outliers. It is the average
of the [[rss]] (Residual Sum of Squares). Its square root is the [[rmse]].

See also [[rss]] (Residual Sum of Squares), [[rmse]] (Root Mean Squared Error),
[[me]] (Mean Error), [[mae]] (Mean Absolute Error), [[r2]] (Coefficient of Determination).
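The two-sequence case can be sketched directly from the definition (`naive-mse` is a hypothetical helper, shown only to make the formula concrete):

```clojure
;; MSE as the mean of squared differences; equals rss divided by n.
(defn naive-mse [vs1 vs2]
  (/ (reduce + (map (fn [a b] (let [d (- a b)] (* d d))) vs1 vs2))
     (count vs1)))

(naive-mse [1.0 2.0 3.0] [1.0 2.0 5.0])
;; => ~1.3333 (squared errors 0, 0, 4 averaged over 3)
```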
sourceraw docstring

multinomial-likelihood-ratio-testclj

(multinomial-likelihood-ratio-test contingency-table-or-xs)
(multinomial-likelihood-ratio-test contingency-table-or-xs params)

Multinomial likelihood ratio test (G-test), a power divergence test for `lambda` = 0.0

Performs a power divergence test, which encompasses several common statistical tests
  like Chi-squared, G-test (likelihood ratio), etc., based on the lambda parameter.
  This function can perform either a goodness-of-fit test or a test for independence
  in a contingency table.

  Usage:

  1.  **Goodness-of-Fit (GOF):**
      - Input: `observed-counts` (sequence of numbers) and `:p` (expected probabilities/weights).
      - Input: `data` (sequence of numbers) and `:p` (a distribution object).
        In this case, a histogram of `data` is created (controlled by `:bins`) and
        compared against the probability mass/density of the distribution in those bins.

  2.  **Test for Independence:**
      - Input: `contingency-table` (2D sequence or map format). The `:p` option is ignored.

  Options map:

  * `:lambda` (double, default: `2/3`): Determines the specific test statistic. Common values:
      * `1.0`: Pearson Chi-squared test ([[chisq-test]]).
      * `0.0`: G-test / Multinomial Likelihood Ratio test ([[multinomial-likelihood-ratio-test]]).
      * `-0.5`: Freeman-Tukey test ([[freeman-tukey-test]]).
      * `-1.0`: Minimum Discrimination Information test ([[minimum-discrimination-information-test]]).
      * `-2.0`: Neyman Modified Chi-squared test ([[neyman-modified-chisq-test]]).
      * `2/3`: Cressie-Read test (default, [[cressie-read-test]]).
  * `:p` (seq of numbers or distribution): Expected probabilities/weights (for GOF with counts)
    or a `fastmath.random` distribution object (for GOF with data). Ignored for independence tests.
  * `:alpha` (double, default: `0.05`): Significance level for confidence intervals.
  * `:ci-sides` (keyword, default: `:two-sided`): Sides for bootstrap confidence intervals
    (`:two-sided`, `:one-sided-greater`, `:one-sided-less`).
  * `:sides` (keyword, default: `:one-sided-greater`): Alternative hypothesis side for the p-value calculation
    against the Chi-squared distribution (`:one-sided-greater`, `:one-sided-less`, `:two-sided`).
  * `:bootstrap-samples` (long, default: `1000`): Number of bootstrap samples for confidence interval estimation.
  * `:ddof` (long, default: `0`): Delta degrees of freedom. Adjustment subtracted from the calculated degrees of freedom.
  * `:bins` (number, keyword, or seq): Used only for GOF test against a distribution.
    Specifies the number of bins, an estimation method (see [[histogram]]), or explicit bin edges for histogram creation.

  Returns a map containing:

  - `:stat`: The calculated power divergence test statistic.
  - `:chi2`: Alias for `:stat`.
  - `:df`: Degrees of freedom for the test.
  - `:p-value`: The p-value associated with the test statistic.
  - `:n`: Total number of observations.
  - `:estimate`: Observed proportions.
  - `:expected`: Expected counts or proportions under the null hypothesis.
  - `:confidence-interval`: Bootstrap confidence intervals for the observed proportions.
  - `:lambda`, `:alpha`, `:sides`, `:ci-sides`: Input options used.
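For the goodness-of-fit case with counts, the statistic is the `lambda → 0` limit of the power divergence family, i.e. the G statistic. A sketch of just that formula (`g-statistic` is a hypothetical helper; observed and expected counts are assumed positive):

```clojure
;; G = 2 * sum(O_i * ln(O_i / E_i)), the lambda -> 0 limit of the
;; power divergence statistic. Zero when observed matches expected.
(defn g-statistic [observed expected]
  (* 2.0 (reduce + (map (fn [o e] (* o (Math/log (/ o e))))
                        observed expected))))

(g-statistic [10.0 10.0] [10.0 10.0])
;; => 0.0
```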
sourceraw docstring

neyman-modified-chisq-testclj

(neyman-modified-chisq-test contingency-table-or-xs)
(neyman-modified-chisq-test contingency-table-or-xs params)

Neyman modified chi-squared test, a power divergence test for `lambda` = -2.0

Performs a power divergence test, which encompasses several common statistical tests
  like Chi-squared, G-test (likelihood ratio), etc., based on the lambda parameter.
  This function can perform either a goodness-of-fit test or a test for independence
  in a contingency table.

  Usage:

  1.  **Goodness-of-Fit (GOF):**
      - Input: `observed-counts` (sequence of numbers) and `:p` (expected probabilities/weights).
      - Input: `data` (sequence of numbers) and `:p` (a distribution object).
        In this case, a histogram of `data` is created (controlled by `:bins`) and
        compared against the probability mass/density of the distribution in those bins.

  2.  **Test for Independence:**
      - Input: `contingency-table` (2D sequence or map format). The `:p` option is ignored.

  Options map:

  * `:lambda` (double, default: `2/3`): Determines the specific test statistic. Common values:
      * `1.0`: Pearson Chi-squared test ([[chisq-test]]).
      * `0.0`: G-test / Multinomial Likelihood Ratio test ([[multinomial-likelihood-ratio-test]]).
      * `-0.5`: Freeman-Tukey test ([[freeman-tukey-test]]).
      * `-1.0`: Minimum Discrimination Information test ([[minimum-discrimination-information-test]]).
      * `-2.0`: Neyman Modified Chi-squared test ([[neyman-modified-chisq-test]]).
      * `2/3`: Cressie-Read test (default, [[cressie-read-test]]).
  * `:p` (seq of numbers or distribution): Expected probabilities/weights (for GOF with counts)
    or a `fastmath.random` distribution object (for GOF with data). Ignored for independence tests.
  * `:alpha` (double, default: `0.05`): Significance level for confidence intervals.
  * `:ci-sides` (keyword, default: `:two-sided`): Sides for bootstrap confidence intervals
    (`:two-sided`, `:one-sided-greater`, `:one-sided-less`).
  * `:sides` (keyword, default: `:one-sided-greater`): Alternative hypothesis side for the p-value calculation
    against the Chi-squared distribution (`:one-sided-greater`, `:one-sided-less`, `:two-sided`).
  * `:bootstrap-samples` (long, default: `1000`): Number of bootstrap samples for confidence interval estimation.
  * `:ddof` (long, default: `0`): Delta degrees of freedom. Adjustment subtracted from the calculated degrees of freedom.
  * `:bins` (number, keyword, or seq): Used only for GOF test against a distribution.
    Specifies the number of bins, an estimation method (see [[histogram]]), or explicit bin edges for histogram creation.

  Returns a map containing:

  - `:stat`: The calculated power divergence test statistic.
  - `:chi2`: Alias for `:stat`.
  - `:df`: Degrees of freedom for the test.
  - `:p-value`: The p-value associated with the test statistic.
  - `:n`: Total number of observations.
  - `:estimate`: Observed proportions.
  - `:expected`: Expected counts or proportions under the null hypothesis.
  - `:confidence-interval`: Bootstrap confidence intervals for the observed proportions.
  - `:lambda`, `:alpha`, `:sides`, `:ci-sides`: Input options used.
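At `lambda` = -2 the power divergence statistic reduces to the Neyman modified chi-squared form, sketched here for reference (`neyman-statistic` is a hypothetical helper; observed counts are assumed positive):

```clojure
;; Neyman modified chi-squared: sum((O_i - E_i)^2 / O_i) -- note the
;; observed counts in the denominator, unlike Pearson's statistic.
(defn neyman-statistic [observed expected]
  (reduce + (map (fn [o e] (let [d (- o e)] (/ (* d d) o)))
                 observed expected)))

(neyman-statistic [10.0 10.0] [8.0 12.0])
;; => 0.8
```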
sourceraw docstring

normality-testclj

(normality-test xs)
(normality-test xs params)
(normality-test xs skew kurt {:keys [sides] :or {sides :one-sided-greater}})

Performs the D'Agostino-Pearson K² omnibus test for normality.

This test combines the results of the skewness and kurtosis tests to provide
an overall assessment of whether the sample data deviates from a normal distribution
in terms of either asymmetry or peakedness/tailedness.

The test works by:
1. Calculating a normalized test statistic (Z₁) for skewness using [[skewness-test]].
2. Calculating a normalized test statistic (Z₂) for kurtosis using [[kurtosis-test]].
3. Combining these into an omnibus statistic: K² = Z₁² + Z₂².
4. Under the null hypothesis that the data comes from a normal distribution,
   K² approximately follows a Chi-squared distribution with 2 degrees of freedom.

Parameters:

- `xs` (seq of numbers): The sample data.
- `skew` (double, optional): A pre-calculated skewness value (type `:g1` used by default in underlying test).
- `kurt` (double, optional): A pre-calculated kurtosis value (type `:kurt` used by default in underlying test).
- `params` (map, optional): Options map:
  - `:sides` (keyword, default `:one-sided-greater`): Specifies the side(s) of the
    Chi-squared(2) distribution used for p-value calculation.
    - `:one-sided-greater` (default and standard): Tests if K² is significantly large,
      indicating departure from normality in skewness, kurtosis, or both.
    - `:one-sided-less`: Tests if the K² statistic is significantly small.
    - `:two-sided`: Tests if the K² statistic is extreme in either tail.

Returns a map containing:

- `:Z`: The calculated K² omnibus test statistic (labeled `:Z` for consistency,
         though it follows Chi-squared(2)).
- `:stat`: Alias for `:Z`.
- `:p-value`: The p-value associated with the K² statistic and `:sides`.
- `:skewness`: The sample skewness value used (either provided or calculated).
- `:kurtosis`: The sample kurtosis value used (either provided or calculated).

See also [[skewness-test]], [[kurtosis-test]], [[jarque-bera-test]].
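Steps 3 and 4 above can be sketched compactly: for 2 degrees of freedom the chi-squared upper-tail probability has the closed form `exp(-x/2)`, so the default one-sided-greater p-value follows directly from the two normalized statistics (`k2-p-value` is a hypothetical helper, not this function):

```clojure
;; Combine the skewness statistic Z1 and kurtosis statistic Z2 into
;; K^2 = Z1^2 + Z2^2, then use the chi-squared(2) upper tail
;; exp(-K^2 / 2) for the one-sided-greater p-value.
(defn k2-p-value [z1 z2]
  (let [k2 (+ (* z1 z1) (* z2 z2))]
    {:stat k2 :p-value (Math/exp (* -0.5 k2))}))

(k2-p-value 1.0 1.0)
;; => {:stat 2.0, :p-value 0.36787944117144233}
```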
sourceraw docstring

omega-sqclj

(omega-sq [group1 group2])
(omega-sq group1 group2)
(omega-sq group1 group2 degrees-of-freedom)

Calculates Omega squared (ω²), an effect size measure for the simple linear regression of `group1` on `group2`.

Omega squared estimates the proportion of variance in the dependent variable (`group1`) that is accounted for by the independent variable (`group2`) in the population. It is considered a less biased alternative to [[r2-determination]].

Parameters:

- `group1` (seq of numbers): The dependent variable.
- `group2` (seq of numbers): The independent variable. Must have the same length as `group1`.
- `degrees-of-freedom` (double, optional): The degrees of freedom for the regression model. Defaults to 1.0, which is standard for simple linear regression and used in the 2-arity version. Providing a different value allows calculating ω² for cases with multiple predictors if the sums of squares are computed for the overall model.

Returns the calculated Omega squared value as a double. The value typically ranges from 0.0 to 1.0.

Interpretation:

- 0.0 indicates that `group2` explains none of the variance in `group1` in the population.
- 1.0 indicates that `group2` perfectly explains the variance in `group1` in the population.

Note: While often presented in the context of ANOVA, this implementation applies the formula to the sums of squares obtained from a simple linear regression between the two sequences. The 3-arity version allows specifying a custom degrees of freedom for regression, which might be relevant for calculating overall $\omega^2$ in multiple regression contexts (where `degrees-of-freedom` would be the number of predictors).

See also [[eta-sq]] (Eta-squared, often based on $R^2$), [[epsilon-sq]] (another adjusted R²-like measure), [[r2-determination]] (R-squared).
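For reference, the classical estimator takes the textbook form below, with $SS_t$ the treatment (regression) sum of squares, $SS_{total}$ the total sum of squares, $MS_e$ the mean squared error, and $df$ the model degrees of freedom; this is the standard formula, stated here as an assumption about the implementation:

$$\omega^2 = \frac{SS_t - df \cdot MS_e}{SS_{total} + MS_e}$$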
sourceraw docstring

one-way-anova-testclj

(one-way-anova-test xss)
(one-way-anova-test xss {:keys [sides] :or {sides :one-sided-greater}})

Performs a one-way analysis of variance (ANOVA) test.

ANOVA tests the null hypothesis that the means of two or more independent groups
are equal. It assumes that the data within each group are normally distributed
and have equal variances.

Parameters:

- `xss` (sequence of sequences): A collection where each element is a sequence
  representing a group of observations.
- `params` (map, optional): Options map with the following key:
  - `:sides` (keyword, default `:one-sided-greater`): Alternative hypothesis side for the F-test.
    Possible values: `:one-sided-greater`, `:one-sided-less`, `:two-sided`.

Returns a map containing:

- `:F`: The F-statistic for the test.
- `:stat`: Alias for `:F`.
- `:p-value`: The p-value for the test.
- `:df`: Degrees of freedom for the F-statistic ([DFt, DFe]).
- `:n`: Sequence of sample sizes for each group.
- `:SSt`: Sum of squares between groups (treatment).
- `:SSe`: Sum of squares within groups (error).
- `:DFt`: Degrees of freedom between groups.
- `:DFe`: Degrees of freedom within groups.
- `:MSt`: Mean square between groups.
- `:MSe`: Mean square within groups.
- `:sides`: Test side used.
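The sum-of-squares decomposition behind the returned map can be sketched in plain Clojure (`f-statistic` is a hypothetical helper showing only the F computation; fastmath computes this more carefully):

```clojure
;; F = MSt / MSe, where MSt = SSt / (k - 1) (between groups) and
;; MSe = SSe / (n - k) (within groups), for k groups and n total points.
(defn f-statistic [xss]
  (let [ns    (map count xss)
        n     (reduce + ns)
        k     (count xss)
        means (map #(/ (reduce + %) (double (count %))) xss)
        grand (/ (reduce + (map (fn [m c] (* m c)) means ns)) n)
        sst   (reduce + (map (fn [m c] (* c (Math/pow (- m grand) 2))) means ns))
        sse   (reduce + (mapcat (fn [xs m] (map #(Math/pow (- % m) 2) xs)) xss means))
        mst   (/ sst (dec k))
        mse   (/ sse (- n k))]
    (/ mst mse)))

(f-statistic [[1.0 2.0 3.0] [4.0 5.0 6.0]])
;; => 13.5
```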
sourceraw docstring

outer-fence-extentclj

(outer-fence-extent vs)
(outer-fence-extent vs estimation-strategy)

Returns the lower outer fence (LOF), upper outer fence (UOF) and the median. Outer fences are conventionally placed at `(- Q1 (* 3.0 IQR))` and `(+ Q3 (* 3.0 IQR))`.
sourceraw docstring

outliersclj

(outliers vs)
(outliers vs estimation-strategy)
(outliers vs q1 q3)

Find outliers defined as values outside inner fences.

Let Q1 be the 25th percentile and Q3 the 75th percentile. IQR is `(- Q3 Q1)`.

* LIF (Lower Inner Fence) equals `(- Q1 (* 1.5 IQR))`.
* UIF (Upper Inner Fence) equals `(+ Q3 (* 1.5 IQR))`.

Returns a sequence of outliers.

Optional `estimation-strategy` argument can be set to change quantile calculations estimation type. See [[estimation-strategies]].
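With quartiles already in hand (as in the 3-arity version), the fence logic amounts to a simple filter; `fence-outliers` is a hypothetical helper for illustration:

```clojure
;; Keep values strictly outside the inner fences
;; [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
(defn fence-outliers [vs q1 q3]
  (let [iqr (- q3 q1)
        lif (- q1 (* 1.5 iqr))
        uif (+ q3 (* 1.5 iqr))]
    (filter #(or (< % lif) (> % uif)) vs)))

(fence-outliers [1 2 3 4 100] 2 4)
;; => (100)   ; IQR = 2, fences at -1 and 7
```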
sourceraw docstring

p-overlapclj

(p-overlap [group1 group2])
(p-overlap group1 group2)
(p-overlap group1
           group2
           {:keys [kde bandwidth min-iterations steps]
            :or {kde :gaussian min-iterations 3 steps 500}})

Calculates the overlapping index between the estimated distributions of two samples using Kernel Density Estimation (KDE).

This function estimates the probability density function (PDF) for `group1` and `group2` using KDE and then calculates the area of overlap between the two estimated PDFs. The area of overlap is the integral of the minimum of the two density functions.

Parameters:

- `group1` (seq of numbers): The first sample.
- `group2` (seq of numbers): The second sample.
- `opts` (map, optional): Options map for KDE and integration:
  - `:kde` (keyword, default `:gaussian`): The kernel function to use for KDE. See `fastmath.kernel.density/kernel-density+` for options.
  - `:bandwidth` (double, optional): The bandwidth for KDE. If omitted, it is automatically estimated.
  - `:min-iterations` (long, default 3): Minimum number of iterations for Romberg integration.
  - `:steps` (long, default 500): Number of steps (subintervals) for numerical integration over the relevant range.

Returns the calculated overlapping index as a double, representing the area of overlap between the two estimated distributions. A value closer to 1 indicates greater overlap, while a value closer to 0 indicates less overlap.

This measure quantifies the degree to which two distributions share common values and can be seen as a measure of similarity.
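
A usage sketch with illustrative data, using only the options documented above:

```clojure
(require '[fastmath.stats :as stats])

;; two heavily overlapping samples; an index near 1 indicates
;; the estimated densities are very similar
(stats/p-overlap [1.0 1.2 1.1 0.9 1.05 1.15]
                 [1.1 1.3 1.0 0.95 1.2 1.05])

;; set the KDE bandwidth explicitly instead of relying on auto-estimation
(stats/p-overlap [1.0 1.2 1.1] [3.0 3.2 3.1]
                 {:kde :gaussian :bandwidth 0.5})
```
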
p-value

(p-value stat)
(p-value distribution stat)
(p-value distribution stat sides)

Calculates the p-value for a given test statistic based on a reference probability distribution.

The p-value represents the probability of observing a test statistic as extreme as,
or more extreme than, the provided `stat`, assuming the null hypothesis is true
(where the null hypothesis implies `stat` follows the given `distribution`).

Parameters:

- `distribution` (distribution object, optional): The probability distribution object
  (from `fastmath.random`) that the test statistic follows under the null
  hypothesis. Defaults to the standard normal distribution (`fastmath.random/default-normal`)
  if omitted.
- `stat` (double): The observed value of the test statistic.
- `sides` (keyword, optional): Specifies the type of alternative hypothesis and
  how 'extremeness' is defined. Defaults to `:two-sided`.
  - `:two-sided` or `:both`: Alternative hypothesis is that the true parameter is
    different from the null value (tests for extremeness in either tail).
    Calculates `2 * min(CDF(stat), CCDF(stat))` (adjusted for discrete).
  - `:one-sided-greater` or `:right`: Alternative hypothesis is that the true
    parameter is greater than the null value (tests for extremeness in the right tail).
    Calculates `CCDF(stat)` (adjusted for discrete).
  - `:one-sided-less`, `:left`, or `:one-sided`: Alternative hypothesis is that the true
    parameter is less than the null value (tests for extremeness in the left tail).
    Calculates `CDF(stat)`.

Note: For discrete distributions, a continuity correction (`stat - 1` for CCDF calculations)
is applied when calculating right-tail or two-tail probabilities involving the
upper tail. This ensures the probability mass *at* the statistic value is correctly
accounted for.

Returns the calculated p-value (a double between 0.0 and 1.0).
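
For instance (a sketch; the chi-squared distribution key and its `:degrees-of-freedom` parameter follow `fastmath.random` conventions):

```clojure
(require '[fastmath.stats :as stats]
         '[fastmath.random :as r])

;; standard normal by default: two-sided p-value of z = 1.96 is ~0.05
(stats/p-value 1.96)

;; explicit distribution and side, e.g. a chi-squared right-tail test
;; with 3 degrees of freedom at the critical value 7.81 gives p ~0.05
(stats/p-value (r/distribution :chi-squared {:degrees-of-freedom 3})
               7.81
               :one-sided-greater)
```
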
pacf

(pacf data)
(pacf data lags)

Calculates the Partial Autocorrelation Function (PACF) for a given time series `data`.

The PACF measures the linear dependence between a time series and its lagged values *after removing* the effects of the intermediate lags. It helps identify the direct relationship at each lag and is used to determine the order of autoregressive (AR) components in time series models (e.g., ARIMA).

Parameters:

* `data` (seq of numbers): The time series data.
* `lags` (long, optional): The maximum lag for which to calculate the PACF. If omitted, calculates PACF for lags from 0 up to `(dec (count data))`.

Returns a sequence of doubles representing the partial autocorrelation coefficients for the specified lags. The value at lag 0 is always 0.0.

See also [[acf]], [[acf-ci]], [[pacf-ci]].
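
For example, an AR(1)-like series should show one dominant PACF spike at lag 1 (a sketch with simulated data):

```clojure
(require '[fastmath.stats :as stats])

;; AR(1)-style series: x_t = 0.8 * x_{t-1} + noise
(def xs (reductions (fn [prev _] (+ (* 0.8 prev) (rand)))
                    0.0
                    (range 200)))

;; 11 coefficients for lags 0..10; lag 0 is always 0.0,
;; lag 1 should be large, higher lags near zero
(stats/pacf xs 10)
```
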
pacf-ci

(pacf-ci data)
(pacf-ci data lags)
(pacf-ci data lags alpha)

Calculates the Partial Autocorrelation Function (PACF) for a time series and provides approximate confidence intervals.

This function computes the PACF of the input time series `data` for specified lags
(see [[pacf]]) and includes approximate confidence intervals around the PACF
estimates. These intervals help determine whether the partial autocorrelation at
a specific lag is statistically significant (i.e., likely non-zero in the population).

Parameters:

* `data` (seq of numbers): The time series data.
* `lags` (long, optional): The maximum lag for which to calculate the PACF and CI.
  If omitted, calculates for lags up to `(dec (count data))`.
* `alpha` (double, optional): The significance level for the confidence intervals.
  Defaults to `0.05` (for a 95% CI).

Returns a map containing:

* `:ci` (double): The value of the approximate standard confidence interval bound
  for lags > 0. If the absolute value of a PACF
  coefficient at lag `k > 0` exceeds this value, it is considered statistically significant.
* `:pacf` (seq of doubles): The sequence of partial autocorrelation coefficients
  at lags from 0 up to `lags` (calculated using [[pacf]]).

See also [[pacf]], [[acf]], [[acf-ci]].
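
A sketch of the significance check described above, applied to white noise where no lag should be significant:

```clojure
(require '[fastmath.stats :as stats])

(let [{:keys [ci pacf]} (stats/pacf-ci (repeatedly 200 rand) 10)]
  ;; keep only lags whose coefficient magnitude exceeds the CI bound;
  ;; for white noise this is usually empty or nearly so
  (filter (fn [[lag v]] (> (Math/abs (double v)) ci))
          (map-indexed vector pacf)))
```
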
pearson-correlation

(pearson-correlation [vs1 vs2])
(pearson-correlation vs1 vs2)

Calculates the Pearson product-moment correlation coefficient between two sequences.

This function measures the linear relationship between two datasets. The coefficient
value ranges from -1.0 (perfect negative linear correlation) to 1.0 (perfect
positive linear correlation), with 0.0 indicating no linear correlation.

Parameters:

- `[vs1 vs2]` (sequence of two sequences): A sequence containing the two sequences of numbers.
- `vs1`, `vs2` (sequences): The two sequences of numbers directly as arguments.

Both input sequences must contain only numbers and must have the same length.

Returns the calculated Pearson correlation coefficient as a double. Returns `NaN` if
either sequence has zero variance (i.e., all elements are the same).

See also [[correlation]] (general correlation, defaults to Pearson), [[spearman-correlation]],
[[kendall-correlation]], [[correlation-matrix]].
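
For example (outputs follow directly from the definition above; the `NaN` case is the documented zero-variance behavior):

```clojure
(require '[fastmath.stats :as stats])

(stats/pearson-correlation [1 2 3 4 5] [2 4 6 8 10])  ;; perfectly linear: ~1.0
(stats/pearson-correlation [[1 2 3 4 5] [5 4 3 2 1]]) ;; single-argument form: ~-1.0
(stats/pearson-correlation [1 1 1 1] [1 2 3 4])       ;; zero variance: NaN
```
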
pearson-r

(pearson-r [group1 group2])
(pearson-r group1 group2)

Calculates the Pearson `r` correlation coefficient between two sequences.

This function is an alias for [[pearson-correlation]].

See [[pearson-correlation]] for detailed documentation, parameters, and usage examples.
percentile

(percentile vs p)
(percentile vs p estimation-strategy)

Calculates the p-th percentile of a sequence `vs`.

The percentile `p` is a value between 0 and 100, inclusive.

An optional `estimation-strategy` keyword can be provided to specify the
method used for estimating the percentile, particularly how interpolation is
handled when the desired percentile falls between data points in the sorted
sequence.

Available `estimation-strategy` values:

- `:legacy` (Default): The original method used in Apache Commons Math.
- `:r1` through `:r9`: Correspond to the nine quantile estimation algorithms
    recommended by Hyndman and Fan (1996). Each strategy differs slightly in how it calculates the index
    (e.g., using `np` or `(n+1)p`) and how it interpolates between points.

For detailed mathematical descriptions of each estimation strategy, refer to
the [Apache Commons Math Percentile documentation](http://commons.apache.org/proper/commons-math/javadocs/api-3.6.1/org/apache/commons/math3/stat/descriptive/rank/Percentile.EstimationType.html).

See also [[quantile]] (which uses a 0.0-1.0 range) and [[percentiles]].
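
For example (a sketch; `:r7` should reproduce R's default `quantile` type 7, which gives 5.5 for the median of 1..10):

```clojure
(require '[fastmath.stats :as stats])

(def vs [1 2 3 4 5 6 7 8 9 10])

(stats/percentile vs 50)      ;; median with the default :legacy strategy
(stats/percentile vs 50 :r7)  ;; Hyndman-Fan R-7, matching R's default
```
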
percentile-bc-extent

(percentile-bc-extent vs)
(percentile-bc-extent vs p)
(percentile-bc-extent vs p1 p2)
(percentile-bc-extent vs p1 p2 estimation-strategy)

Return bias corrected percentile range and mean for bootstrap samples.
See https://projecteuclid.org/euclid.ss/1032280214

`p` - calculates extent of bias corrected `p` and `100-p` (default: `p=2.5`)

Set `estimation-strategy` to `:r7` to get the same result as in R `coxed::bca`.
percentile-bca-extent

(percentile-bca-extent vs)
(percentile-bca-extent vs p)
(percentile-bca-extent vs p1 p2)
(percentile-bca-extent vs p1 p2 estimation-strategy)
(percentile-bca-extent vs p1 p2 accel estimation-strategy)

Return bias corrected percentile range and mean for bootstrap samples. Also accounts for variance
variations through the acceleration parameter.
See https://projecteuclid.org/euclid.ss/1032280214

`p` - calculates extent of bias corrected `p` and `100-p` (default: `p=2.5`)

Set `estimation-strategy` to `:r7` to get the same result as in R `coxed::bca`.
percentile-extent

(percentile-extent vs)
(percentile-extent vs p)
(percentile-extent vs p1 p2)
(percentile-extent vs p1 p2 estimation-strategy)

Return percentile range and median.

`p` - calculates extent of `p` and `100-p` (default: `p=25`)
percentiles

(percentiles vs)
(percentiles vs ps)
(percentiles vs ps estimation-strategy)

Calculates the sequence of p-th percentiles of a sequence `vs`.

Percentiles `ps` is a sequence of values between 0 and 100, inclusive.

An optional `estimation-strategy` keyword can be provided to specify the
method used for estimating the percentile, particularly how interpolation is
handled when the desired percentile falls between data points in the sorted
sequence.

Available `estimation-strategy` values:

- `:legacy` (Default): The original method used in Apache Commons Math.
- `:r1` through `:r9`: Correspond to the nine quantile estimation algorithms
    recommended by Hyndman and Fan (1996). Each strategy differs slightly in how it calculates the index (e.g., using `np` or `(n+1)p`) and how it interpolates between points.

For detailed mathematical descriptions of each estimation strategy, refer to
the [Apache Commons Math Percentile documentation](http://commons.apache.org/proper/commons-math/javadocs/api-3.6.1/org/apache/commons/math3/stat/descriptive/rank/Percentile.EstimationType.html).

See also [[quantiles]] (which uses a 0.0-1.0 range) and [[percentile]].
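
For example, all three quartiles in one call (a sketch; under `:r7` these values should match R's `quantile(1:10, type = 7)`, i.e. 3.25, 5.5, 7.75):

```clojure
(require '[fastmath.stats :as stats])

;; quartiles in one pass instead of three percentile calls
(stats/percentiles [1 2 3 4 5 6 7 8 9 10] [25 50 75] :r7)
```
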
pi

(pi vs)
(pi vs size)
(pi vs size estimation-strategy)

Returns the prediction interval (PI) as a map, with quantile intervals based on the interval `size`.

Quantiles are `(1-size)/2` and `1-(1-size)/2`.
pi-extent

(pi-extent vs)
(pi-extent vs size)
(pi-extent vs size estimation-strategy)

Returns the PI extent: quantile intervals based on the interval `size`, plus the median.

Quantiles are `(1-size)/2` and `1-(1-size)/2`.
pooled-mad

(pooled-mad groups)
(pooled-mad groups const)

Calculate pooled median absolute deviation for samples.

`const` is a scaling constant which defaults to ~1.4826 (the consistency factor for normally distributed data).
pooled-stddev

(pooled-stddev groups)
(pooled-stddev groups method)

Calculate pooled standard deviation for samples and method.

Methods:

* `:unbiased` - sqrt of weighted average of variances (default)
* `:biased` - biased version of `:unbiased`, no count correction.
* `:avg` - sqrt of average of variances
pooled-variance

(pooled-variance groups)
(pooled-variance groups method)

Calculate pooled variance for samples and method.

Methods:

* `:unbiased` - weighted average of variances (default)
* `:biased` - biased version of `:unbiased`, no count correction.
* `:avg` - average of variances
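
A sketch of the two methods, assuming the `:unbiased` weighted average uses the conventional `(n_i - 1)` weights:

```clojure
(require '[fastmath.stats :as stats])

(def groups [[1.0 2.0 3.0 4.0] [2.0 4.0 6.0]])

(stats/pooled-variance groups)       ;; default :unbiased
(stats/pooled-variance groups :avg)  ;; plain average of group variances

;; :unbiased should correspond to (Σ (n_i - 1) s_i²) / (Σ (n_i - 1))
(let [ns (map count groups)
      ss (map stats/variance groups)]
  (/ (reduce + (map (fn [n s] (* (dec n) s)) ns ss))
     (reduce + (map dec ns))))
```
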
population-stddev

(population-stddev vs)
(population-stddev vs mu)

Calculate population standard deviation of `vs`.

See [[stddev]].
population-variance

(population-variance vs)
(population-variance vs mu)

Calculate population variance of `vs`.

See [[variance]].
population-wstddev

(population-wstddev vs weights)

Calculate population weighted standard deviation of `vs`.
population-wvariance

(population-wvariance vs freqs)

Calculate weighted population variance of `vs`.
power-divergence-test

(power-divergence-test contingency-table-or-xs)
(power-divergence-test contingency-table-or-xs
                       {:keys [lambda ci-sides sides p alpha bootstrap-samples
                               ddof bins]
                        :or {lambda m/TWO_THIRD
                             sides :one-sided-greater
                             ci-sides :two-sided
                             alpha 0.05
                             bootstrap-samples 1000
                             ddof 0}})

Performs a power divergence test, which encompasses several common statistical tests
like Chi-squared, G-test (likelihood ratio), etc., based on the lambda parameter.
This function can perform either a goodness-of-fit test or a test for independence
in a contingency table.

Usage:

1.  **Goodness-of-Fit (GOF):**
    - Input: `observed-counts` (sequence of numbers) and `:p` (expected probabilities/weights).
    - Input: `data` (sequence of numbers) and `:p` (a distribution object).
      In this case, a histogram of `data` is created (controlled by `:bins`) and
      compared against the probability mass/density of the distribution in those bins.

2.  **Test for Independence:**
    - Input: `contingency-table` (2D sequence or map format). The `:p` option is ignored.

Options map:

* `:lambda` (double, default: `2/3`): Determines the specific test statistic. Common values:
    * `1.0`: Pearson Chi-squared test ([[chisq-test]]).
    * `0.0`: G-test / Multinomial Likelihood Ratio test ([[multinomial-likelihood-ratio-test]]).
    * `-0.5`: Freeman-Tukey test ([[freeman-tukey-test]]).
    * `-1.0`: Minimum Discrimination Information test ([[minimum-discrimination-information-test]]).
    * `-2.0`: Neyman Modified Chi-squared test ([[neyman-modified-chisq-test]]).
    * `2/3`: Cressie-Read test (default, [[cressie-read-test]]).
* `:p` (seq of numbers or distribution): Expected probabilities/weights (for GOF with counts)
  or a `fastmath.random` distribution object (for GOF with data). Ignored for independence tests.
* `:alpha` (double, default: `0.05`): Significance level for confidence intervals.
* `:ci-sides` (keyword, default: `:two-sided`): Sides for bootstrap confidence intervals
  (`:two-sided`, `:one-sided-greater`, `:one-sided-less`).
* `:sides` (keyword, default: `:one-sided-greater`): Alternative hypothesis side for the p-value calculation
  against the Chi-squared distribution (`:one-sided-greater`, `:one-sided-less`, `:two-sided`).
* `:bootstrap-samples` (long, default: `1000`): Number of bootstrap samples for confidence interval estimation.
* `:ddof` (long, default: `0`): Delta degrees of freedom. Adjustment subtracted from the calculated degrees of freedom.
* `:bins` (number, keyword, or seq): Used only for GOF test against a distribution.
  Specifies the number of bins, an estimation method (see [[histogram]]), or explicit bin edges for histogram creation.

Returns a map containing:

- `:stat`: The calculated power divergence test statistic.
- `:chi2`: Alias for `:stat`.
- `:df`: Degrees of freedom for the test.
- `:p-value`: The p-value associated with the test statistic.
- `:n`: Total number of observations.
- `:estimate`: Observed proportions.
- `:expected`: Expected counts or proportions under the null hypothesis.
- `:confidence-interval`: Bootstrap confidence intervals for the observed proportions.
- `:lambda`, `:alpha`, `:sides`, `:ci-sides`: Input options used.
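
Two sketches of the usage modes described above (the counts are illustrative):

```clojure
(require '[fastmath.stats :as stats])

;; independence test on a 2x2 contingency table with the Pearson
;; chi-squared statistic (lambda = 1.0); the default lambda 2/3
;; gives the Cressie-Read statistic instead
(-> (stats/power-divergence-test [[30 10] [15 25]] {:lambda 1.0})
    (select-keys [:stat :df :p-value]))

;; goodness-of-fit: observed counts against expected proportions
(stats/power-divergence-test [89 37 30 28 2]
                             {:lambda 1.0
                              :p [0.4 0.2 0.2 0.15 0.05]})
```
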
power-transformation (deprecated)

(power-transformation xs)
(power-transformation xs lambda)
(power-transformation xs lambda alpha)

Applies a power transformation to data.
powmean

(powmean vs power)
(powmean vs weights power)

Calculates the generalized power mean (also known as the Hölder mean) of a sequence `vs`.

The power mean is a generalization of the Pythagorean means (arithmetic, geometric, harmonic)
and other means like the quadratic mean (RMS). It is defined for a non-zero real number `power`.

Parameters:

- `vs`: Sequence of numbers. Constraints depend on the `power`:
  - For `power > 0`, values should be non-negative.
  - For `power = 0`, values must be positive (reduces to geometric mean).
  - For `power < 0`, values must be positive and non-zero.
- `weights` (optional): Sequence of non-negative weights corresponding to `vs`.
  Must have the same count as `vs`.
- `power` (double): The exponent defining the mean.

Special Cases:

- `power = 0`: Returns the [[geomean]].
- `power = 1`: Returns the arithmetic [[mean]].
- `power = -1`: Equivalent to the [[harmean]]. (Handled by the general formula)
- `power = 2`: Returns the Root Mean Square (RMS) or quadratic mean.
- `power = inf`: Returns the maximum.
- `power = -inf`: Returns the minimum.
- The implementation includes optimized paths for `power` values 1/3, 0.5, 2, and 3.

Returns the calculated power mean as a double.

See also [[mean]], [[geomean]], [[harmean]].
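
The special cases above, sketched on a small sample:

```clojure
(require '[fastmath.stats :as stats])

(stats/powmean [1 2 3 4] 1)   ;; arithmetic mean: 2.5
(stats/powmean [1 2 3 4] 2)   ;; quadratic mean (RMS): sqrt(30/4) ≈ 2.74
(stats/powmean [1 2 3 4] -1)  ;; harmonic mean: 1.92
(stats/powmean [1 2 3 4] 0)   ;; geometric mean: 24^(1/4) ≈ 2.21
```
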
psnr

(psnr [vs1 vs2-or-val])
(psnr vs1 vs2-or-val)
(psnr vs1 vs2-or-val max-value)

Peak Signal-to-Noise Ratio (PSNR).

PSNR is a measure used to quantify the quality of reconstruction of lossy
compression codecs (e.g., for images or video). It is calculated using the
Mean Squared Error (MSE) between the original and compressed images/signals.
A higher PSNR generally indicates a higher quality signal reconstruction
(i.e., less distortion).

Parameters:

- `vs1` (sequence of numbers): The first sequence (conventionally, the original or reference signal/data).
- `vs2-or-val` (sequence of numbers or single number): The second sequence
  (conventionally, the reconstructed or noisy signal/data), or a single number
  to compare against each element of `vs1`.
- `max-value` (optional, double): The maximum possible value of a sample in the data.
  If not provided, the function automatically determines the maximum value present
  across both input sequences (`vs1` and `vs2` if a sequence, or `vs1` and the scalar value
  if `vs2-or-val` is a number). Providing an explicit `max-value` is often more
  appropriate based on the data type's theoretical maximum range (e.g., 255 for 8-bit).

If `vs2-or-val` is a sequence, both `vs1` and `vs2` must have the same length.

Returns the calculated Peak Signal-to-Noise Ratio as a double. Returns `##Inf`
(positive infinity) if the MSE is zero (a perfect match).

See also [[mse]], [[rmse]].
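A minimal usage sketch based on the arities above (not executed here; `stats` is an assumed alias for this namespace):

```clojure
(require '[fastmath.stats :as stats])

(def original [52.0 60.0 58.0 61.0])
(def degraded [51.0 59.0 59.0 60.0])

;; max-value inferred from the data:
(stats/psnr original degraded)
;; explicit theoretical maximum, e.g. for 8-bit samples:
(stats/psnr original degraded 255.0)
;; compare a sequence against a constant:
(stats/psnr original 60.0)
```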
sourceraw docstring

quantileclj

(quantile vs q)
(quantile vs q estimation-strategy)

Calculates the q-th quantile of a sequence vs.

The quantile q is a value between 0.0 and 1.0, inclusive.

An optional estimation-strategy keyword can be provided to specify the method used for estimating the quantile, particularly how interpolation is handled when the desired quantile falls between data points in the sorted sequence.

Available estimation-strategy values:

  • :legacy (Default): The original method used in Apache Commons Math.
  • :r1 through :r9: Correspond to the nine quantile estimation algorithms recommended by Hyndman and Fan (1996). Each strategy differs slightly in how it calculates the index (e.g., using np or (n+1)p) and how it interpolates between points.

For detailed mathematical descriptions of each estimation strategy, refer to the Apache Commons Math Percentile documentation.

See also percentile (which uses a 0-100 range) and quantiles.

Calculates the q-th quantile of a sequence `vs`.

The quantile `q` is a value between 0.0 and 1.0, inclusive.

An optional `estimation-strategy` keyword can be provided to specify the
method used for estimating the quantile, particularly how interpolation is
handled when the desired quantile falls between data points in the sorted
sequence.

Available `estimation-strategy` values:

- `:legacy` (Default): The original method used in Apache Commons Math.
- `:r1` through `:r9`: Correspond to the nine quantile estimation algorithms
    recommended by Hyndman and Fan (1996). Each strategy differs slightly in how it calculates the index (e.g., using `np` or `(n+1)p`) and how it interpolates between points.

For detailed mathematical descriptions of each estimation strategy, refer to
the [Apache Commons Math Percentile documentation](http://commons.apache.org/proper/commons-math/javadocs/api-3.6.1/org/apache/commons/math3/stat/descriptive/rank/Percentile.EstimationType.html).

See also [[percentile]] (which uses a 0-100 range) and [[quantiles]].
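A minimal usage sketch based on the arities above (not executed here; `stats` is an assumed alias for this namespace):

```clojure
(require '[fastmath.stats :as stats])

(def vs [1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0])

(stats/quantile vs 0.5)       ;; median, default :legacy strategy
(stats/quantile vs 0.25 :r7)  ;; lower quartile, R-7 estimator (Hyndman & Fan)
```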
sourceraw docstring

quantile-extentclj

(quantile-extent vs)
(quantile-extent vs q)
(quantile-extent vs q1 q2)
(quantile-extent vs q1 q2 estimation-strategy)

Returns the quantile range and median.

q - calculates the extent of q and 1.0-q (default: q=0.25)

Returns the quantile range and median.

`q` - calculates the extent of `q` and `1.0-q` (default: `q=0.25`)
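A usage sketch based on the arities above (not executed here; the exact shape of the returned extent is an assumption, conventionally lower/upper bound plus median):

```clojure
(require '[fastmath.stats :as stats])

(def vs [1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0])

(stats/quantile-extent vs)             ;; extent of the 0.25 and 0.75 quantiles
(stats/quantile-extent vs 0.1)         ;; extent of the 0.1 and 0.9 quantiles
(stats/quantile-extent vs 0.1 0.9 :r7) ;; explicit bounds and estimation strategy
```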
sourceraw docstring

quantilesclj

(quantiles vs)
(quantiles vs qs)
(quantiles vs qs estimation-strategy)

Calculates the sequence of q-th quantiles of a sequence vs.

Quantiles qs is a sequence of values between 0.0 and 1.0, inclusive.

An optional estimation-strategy keyword can be provided to specify the method used for estimating the quantile, particularly how interpolation is handled when the desired quantile falls between data points in the sorted sequence.

Available estimation-strategy values:

  • :legacy (Default): The original method used in Apache Commons Math.
  • :r1 through :r9: Correspond to the nine quantile estimation algorithms recommended by Hyndman and Fan (1996). Each strategy differs slightly in how it calculates the index (e.g., using np or (n+1)p) and how it interpolates between points.

For detailed mathematical descriptions of each estimation strategy, refer to the Apache Commons Math Percentile documentation.

See also percentiles (which uses a 0-100 range) and quantile.

Calculates the sequence of q-th quantiles of a sequence `vs`.

Quantiles `qs` is a sequence of values between 0.0 and 1.0, inclusive.

An optional `estimation-strategy` keyword can be provided to specify the
method used for estimating the quantile, particularly how interpolation is
handled when the desired quantile falls between data points in the sorted
sequence.

Available `estimation-strategy` values:

- `:legacy` (Default): The original method used in Apache Commons Math.
- `:r1` through `:r9`: Correspond to the nine quantile estimation algorithms
    recommended by Hyndman and Fan (1996). Each strategy differs slightly in how it calculates the index
    (e.g., using `np` or `(n+1)p`) and how it interpolates between points.

For detailed mathematical descriptions of each estimation strategy, refer to
the [Apache Commons Math Percentile documentation](http://commons.apache.org/proper/commons-math/javadocs/api-3.6.1/org/apache/commons/math3/stat/descriptive/rank/Percentile.EstimationType.html).

See also [[percentiles]] (which uses a 0-100 range) and [[quantile]].
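A minimal usage sketch based on the arities above (not executed here; `stats` is an assumed alias for this namespace):

```clojure
(require '[fastmath.stats :as stats])

(def vs [1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0])

(stats/quantiles vs [0.25 0.5 0.75])  ;; quartiles, default :legacy strategy
(stats/quantiles vs [0.05 0.95] :r7)  ;; 5th/95th with the R-7 estimator
```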
sourceraw docstring

r2clj

(r2 [vs1 vs2-or-val])
(r2 vs1 vs2-or-val)
(r2 vs1 vs2-or-val no-of-variables)

Calculates the Coefficient of Determination ($R^2$) or adjusted version between two sequences or a sequence and a constant value.

$R^2$ is a statistical measure that represents the proportion of the variance in the dependent variable that is predictable from the independent variable(s) in a statistical model. It indicates how well the model fits the observed data.

The standard $R^2$ is calculated as $1 - (RSS / TSS)$, where:

  • $RSS$ (Residual Sum of Squares) is the sum of the squared differences between the observed values (vs1) and the predicted/reference values (vs2 or vs2-or-val). See rss.
  • $TSS$ (Total Sum of Squares) is the sum of the squared differences between the observed values (vs1) and their mean. This is calculated using moment of order 2 with :mean? set to false.

This function has two arities:

  1. (r2 vs1 vs2-or-val): Calculates the standard $R^2$.

    • vs1 (seq of numbers): The sequence of observed or actual values.
    • vs2-or-val (seq of numbers or single number): The sequence of predicted or reference values, or a single constant value to compare against.

    Returns the calculated standard $R^2$ as a double. For simple linear regression, this is equal to the square of the Pearson correlation coefficient (r2-determination). $R^2$ typically ranges from 0 to 1 in this context, but can be negative if the chosen model fits the data worse than a horizontal line through the mean of the observed data.

  2. (r2 vs1 vs2-or-val no-of-variables): Calculates the Adjusted $R^2$. The adjusted $R^2$ is a modified version of $R^2$ that has been adjusted for the number of predictors in the model. It increases only if the new term improves the model more than would be expected by chance. The formula for adjusted $R^2$ is: $$ R^2_{adj} = 1 - (1 - R^2) \frac{n-1}{n-p-1} $$ where $n$ is the number of observations (length of vs1) and $p$ is the number of independent variables (no-of-variables).

    • vs1 (seq of numbers): The sequence of observed or actual values.
    • vs2-or-val (seq of numbers or single number): The sequence of predicted or reference values, or a single constant value to compare against.
    • no-of-variables (double): The number of independent variables ($p$) used in the model that produced the vs2-or-val predictions.

    Returns the calculated adjusted $R^2$ as a double.

Both vs1 and vs2 (if vs2-or-val is a sequence) must have the same length.

See also rss, mse, rmse, pearson-correlation, r2-determination.

Calculates the Coefficient of Determination ($R^2$) or adjusted version between two sequences or a sequence and a constant value.

$R^2$ is a statistical measure that represents the proportion of the variance in the
dependent variable that is predictable from the independent variable(s) in a
statistical model. It indicates how well the model fits the observed data.

The standard $R^2$ is calculated as $1 - (RSS / TSS)$, where:
- $RSS$ (Residual Sum of Squares) is the sum of the squared differences
  between the observed values (`vs1`) and the predicted/reference values (`vs2` or `vs2-or-val`).
  See [[rss]].
- $TSS$ (Total Sum of Squares) is the sum of the squared differences
  between the observed values (`vs1`) and their mean. This is calculated
  using [[moment]] of order 2 with `:mean?` set to `false`.

This function has two arities:

1.  `(r2 vs1 vs2-or-val)`: Calculates the standard $R^2$.

    - `vs1` (seq of numbers): The sequence of observed or actual values.
    - `vs2-or-val` (seq of numbers or single number): The sequence of predicted or
      reference values, or a single constant value to compare against.

    Returns the calculated standard $R^2$ as a double. For simple linear regression,
    this is equal to the square of the Pearson correlation coefficient ([[r2-determination]]).
    $R^2$ typically ranges from 0 to 1 in this context, but can be negative
    if the chosen model fits the data worse than a horizontal line through the mean
    of the observed data.

2.  `(r2 vs1 vs2-or-val no-of-variables)`: Calculates the **Adjusted $R^2$**.
    The adjusted $R^2$ is a modified version of $R^2$ that has been adjusted
    for the number of predictors in the model. It increases only if the new term
    improves the model more than would be expected by chance.
    The formula for adjusted $R^2$ is:
    $$ R^2_{adj} = 1 - (1 - R^2) \frac{n-1}{n-p-1} $$
    where $n$ is the number of observations (length of `vs1`) and $p$ is the
    number of independent variables (`no-of-variables`).

    - `vs1` (seq of numbers): The sequence of observed or actual values.
    - `vs2-or-val` (seq of numbers or single number): The sequence of predicted or
      reference values, or a single constant value to compare against.
    - `no-of-variables` (double): The number of independent variables ($p$) used in the model
      that produced the `vs2-or-val` predictions.

    Returns the calculated adjusted $R^2$ as a double.

Both `vs1` and `vs2` (if `vs2-or-val` is a sequence) must have the same length.

See also [[rss]], [[mse]], [[rmse]], [[pearson-correlation]], [[r2-determination]].
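A minimal usage sketch of both arities (not executed here; `stats` is an assumed alias for this namespace):

```clojure
(require '[fastmath.stats :as stats])

(def observed  [2.0 4.1 5.9 8.2])
(def predicted [2.1 4.0 6.0 8.0])

(stats/r2 observed predicted)    ;; standard R^2
(stats/r2 observed predicted 1)  ;; adjusted R^2 with one predictor (p = 1)
```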
sourceraw docstring

r2-determinationclj

(r2-determination [group1 group2])
(r2-determination group1 group2)

Calculates the Coefficient of Determination ($R^2$) between two sequences.

This function computes the square of the Pearson product-moment correlation coefficient (pearson-correlation) between group1 and group2.

$R^2$ measures the proportion of the variance in one variable that is predictable from the other variable in a linear relationship. For a simple linear regression with one independent variable, this value is equivalent to the $R^2$ calculated from the Residual Sum of Squares (RSS) and Total Sum of Squares (TSS).

Parameters:

  • group1 (seq of numbers): The first sequence.
  • group2 (seq of numbers): The second sequence.

Both sequences must have the same length.

Returns the calculated $R^2$ value as a double between 0.0 and 1.0. Returns NaN if the Pearson correlation cannot be calculated (e.g., one sequence is constant).

See also r2 (for general $R^2$ and adjusted $R^2$), pearson-correlation.

Calculates the Coefficient of Determination ($R^2$) between two sequences.

This function computes the square of the Pearson product-moment correlation
coefficient ([[pearson-correlation]]) between `group1` and `group2`.

$R^2$ measures the proportion of the variance in one variable that is predictable
from the other variable in a linear relationship. For a simple linear regression
with one independent variable, this value is equivalent to the $R^2$ calculated
from the Residual Sum of Squares (RSS) and Total Sum of Squares (TSS).

Parameters:

- `group1` (seq of numbers): The first sequence.
- `group2` (seq of numbers): The second sequence.

Both sequences must have the same length.

Returns the calculated $R^2$ value as a double between 0.0 and 1.0.
Returns `NaN` if the Pearson correlation cannot be calculated (e.g., one sequence is constant).

See also [[r2]] (for general $R^2$ and adjusted $R^2$), [[pearson-correlation]].
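A sketch illustrating the definitional relationship to the Pearson correlation (not executed here; `stats` is an assumed alias for this namespace):

```clojure
(require '[fastmath.stats :as stats])

(def g1 [2.0 4.1 5.9 8.2])
(def g2 [2.1 4.0 6.0 8.0])

(stats/r2-determination g1 g2)
;; equivalent by definition:
(let [r (stats/pearson-correlation g1 g2)]
  (* r r))
```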
sourceraw docstring

rank-epsilon-sqclj

(rank-epsilon-sq xss)

Calculates Rank Epsilon-squared (ε²), a measure of effect size for the Kruskal-Wallis H-test.

Rank Epsilon-squared is a non-parametric measure quantifying the proportion of the total variability (based on ranks) in the dependent variable that is associated with group membership (the independent variable). It is analogous to Eta-squared or Epsilon-squared in one-way ANOVA but used for the rank-based Kruskal-Wallis test.

This function calculates Epsilon-squared based on the Kruskal-Wallis H statistic (H) and the total number of observations (n) across all groups.

Parameters:

  • xss (sequence of sequences): A collection where each element is a sequence representing a group of observations, as used in kruskal-test.

Returns the calculated Rank Epsilon-squared value as a double, ranging from 0 to 1.

Interpretation:

  • A value of 0 indicates no difference in the distributions across groups.
  • A value closer to 1 indicates that a large proportion of the variability is due to differences between group ranks.

Rank Epsilon-squared is a useful supplement to the Kruskal-Wallis test, providing a measure of the magnitude of the group effect that is not sensitive to assumptions about the data distribution shape (beyond having similar shapes for valid interpretation of the Kruskal-Wallis test itself).

See also kruskal-test, rank-eta-sq (another rank-based effect size).

Calculates Rank Epsilon-squared (ε²), a measure of effect size for the Kruskal-Wallis H-test.

Rank Epsilon-squared is a non-parametric measure quantifying the proportion of the
total variability (based on ranks) in the dependent variable that is associated
with group membership (the independent variable). It is analogous to Eta-squared
or Epsilon-squared in one-way ANOVA but used for the rank-based Kruskal-Wallis test.

This function calculates Epsilon-squared based on the Kruskal-Wallis H statistic (`H`)
and the total number of observations (`n`) across all groups.

Parameters:

- `xss` (sequence of sequences): A collection where each element is a sequence
  representing a group of observations, as used in [[kruskal-test]].

Returns the calculated Rank Epsilon-squared value as a double, ranging from 0 to 1.

Interpretation:

- A value of 0 indicates no difference in the distributions across groups.
- A value closer to 1 indicates that a large proportion of the variability is
  due to differences between group ranks.

Rank Epsilon-squared is a useful supplement to the Kruskal-Wallis test, providing a
measure of the magnitude of the group effect that is not sensitive to assumptions
about the data distribution shape (beyond having similar shapes for valid
interpretation of the Kruskal-Wallis test itself).

See also [[kruskal-test]], [[rank-eta-sq]] (another rank-based effect size).
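A minimal usage sketch with three hypothetical groups (not executed here; `stats` is an assumed alias for this namespace):

```clojure
(require '[fastmath.stats :as stats])

(def groups [[1.2 2.3 1.9 2.0]
             [3.4 3.1 2.8 3.0]
             [5.0 4.7 5.2 4.9]])

;; well-separated group ranks should yield a value close to 1.0
(stats/rank-epsilon-sq groups)
```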
sourceraw docstring

rank-eta-sqclj

(rank-eta-sq xss)

Calculates the Rank Eta-squared (η²), an effect size measure for the Kruskal-Wallis H-test.

Rank Eta-squared is a non-parametric measure quantifying the proportion of the total variability (based on ranks) in the dependent variable that is associated with group membership (the independent variable). It is analogous to Eta-squared in one-way ANOVA but used for the rank-based Kruskal-Wallis test.

The statistic is calculated based on the Kruskal-Wallis H statistic, the number of groups (k), and the total number of observations (n).

Parameters:

  • xss (sequence of sequences): A collection where each element is a sequence representing a group of observations, as used in kruskal-test.

Returns the calculated Rank Eta-squared value as a double, ranging from 0 to 1.

Interpretation:

  • A value of 0 indicates no difference in the distributions across groups (all variability is within groups).
  • A value closer to 1 indicates that a large proportion of the variability is due to differences between group ranks.

Rank Eta-squared is a useful supplement to the Kruskal-Wallis test, providing a measure of the magnitude of the group effect that is not sensitive to assumptions about the data distribution shape (beyond having similar shapes for valid interpretation of the Kruskal-Wallis test itself).

See also kruskal-test, rank-epsilon-sq (another rank-based effect size).

Calculates the Rank Eta-squared (η²), an effect size measure for the Kruskal-Wallis H-test.

Rank Eta-squared is a non-parametric measure quantifying the proportion of the
total variability (based on ranks) in the dependent variable that is associated
with group membership (the independent variable). It is analogous to Eta-squared
in one-way ANOVA but used for the rank-based Kruskal-Wallis test.

The statistic is calculated based on the Kruskal-Wallis H statistic, the number
of groups (`k`), and the total number of observations (`n`).

Parameters:

- `xss` (sequence of sequences): A collection where each element is a sequence
  representing a group of observations, as used in [[kruskal-test]].

Returns the calculated Rank Eta-squared value as a double, ranging from 0 to 1.

Interpretation:
- A value of 0 indicates no difference in the distributions across groups (all variability is within groups).
- A value closer to 1 indicates that a large proportion of the variability is
  due to differences between group ranks.

Rank Eta-squared is a useful supplement to the Kruskal-Wallis test, providing a
measure of the magnitude of the group effect that is not sensitive to assumptions
about the data distribution shape (beyond having similar shapes for valid
interpretation of the Kruskal-Wallis test itself).

See also [[kruskal-test]], [[rank-epsilon-sq]] (another rank-based effect size).
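A minimal usage sketch, mirroring the input shape used by [[kruskal-test]] (not executed here; `stats` is an assumed alias for this namespace):

```clojure
(require '[fastmath.stats :as stats])

;; each inner sequence is one group of observations
(stats/rank-eta-sq [[1.2 2.3 1.9 2.0]
                    [3.4 3.1 2.8 3.0]
                    [5.0 4.7 5.2 4.9]])
```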
sourceraw docstring

remove-outliersclj

(remove-outliers vs)
(remove-outliers vs estimation-strategy)
(remove-outliers vs q1 q3)

Remove outliers defined as values outside inner fences.

Let Q1 be the 25th percentile and Q3 the 75th percentile. IQR is (- Q3 Q1).

  • LIF (Lower Inner Fence) equals (- Q1 (* 1.5 IQR)).
  • UIF (Upper Inner Fence) equals (+ Q3 (* 1.5 IQR)).

Returns a sequence without outliers.

Optional estimation-strategy argument can be set to change quantile calculations estimation type. See [[estimation-strategies]].

Remove outliers defined as values outside inner fences.

Let Q1 be the 25th percentile and Q3 the 75th percentile. IQR is `(- Q3 Q1)`.

* LIF (Lower Inner Fence) equals `(- Q1 (* 1.5 IQR))`.
* UIF (Upper Inner Fence) equals `(+ Q3 (* 1.5 IQR))`.

Returns a sequence without outliers.

Optional `estimation-strategy` argument can be set to change quantile calculations estimation type. See [[estimation-strategies]].
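A minimal usage sketch (not executed here; `stats` is an assumed alias for this namespace):

```clojure
(require '[fastmath.stats :as stats])

;; 100 lies well above the upper inner fence and should be dropped:
(stats/remove-outliers [1 2 3 4 5 100])
;; with an explicit quantile estimation strategy:
(stats/remove-outliers [1 2 3 4 5 100] :r7)
```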
sourceraw docstring

rescaleclj

(rescale vs)
(rescale vs low high)

Linearly rescales data to the desired range, [0,1] by default.

Linearly rescales data to the desired range, [0,1] by default.
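A minimal usage sketch (not executed here; `stats` is an assumed alias for this namespace):

```clojure
(require '[fastmath.stats :as stats])

(stats/rescale [2.0 4.0 6.0])           ;; minimum maps to 0.0, maximum to 1.0
(stats/rescale [2.0 4.0 6.0] -1.0 1.0)  ;; target range [-1, 1]
```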
sourceraw docstring

rmseclj

(rmse [vs1 vs2-or-val])
(rmse vs1 vs2-or-val)

Calculates the Root Mean Squared Error (RMSE) between two sequences or a sequence and a constant value.

RMSE is the square root of the mse (Mean Squared Error). It represents the standard deviation of the residuals (prediction errors) and has the same units as the original data, making it more interpretable than MSE. It measures the average magnitude of the errors, penalizing larger errors more than smaller ones due to the squaring involved.

Parameters:

  • vs1 (sequence of numbers): The first sequence (often the observed or true values).
  • vs2-or-val (sequence of numbers or single number): The second sequence (often the predicted or reference values), or a single number to compare against each element of vs1.

If both inputs are sequences, they must have the same length. If vs2-or-val is a single number, it is effectively treated as a sequence of that number repeated count(vs1) times.

Returns the calculated Root Mean Squared Error as a double.

See also mse (Mean Squared Error), rss (Residual Sum of Squares), me (Mean Error), mae (Mean Absolute Error), r2 (Coefficient of Determination).

Calculates the Root Mean Squared Error (RMSE) between two sequences or a sequence and a constant value.

RMSE is the square root of the [[mse]] (Mean Squared Error). It represents the
standard deviation of the residuals (prediction errors) and has the same units
as the original data, making it more interpretable than MSE. It measures the
average magnitude of the errors, penalizing larger errors more than smaller ones
due to the squaring involved.

Parameters:

- `vs1` (sequence of numbers): The first sequence (often the observed or true values).
- `vs2-or-val` (sequence of numbers or single number): The second sequence
  (often the predicted or reference values), or a single number to compare
  against each element of `vs1`.

If both inputs are sequences, they must have the same length. If `vs2-or-val`
is a single number, it is effectively treated as a sequence of that number
repeated `count(vs1)` times.

Returns the calculated Root Mean Squared Error as a double.

See also [[mse]] (Mean Squared Error), [[rss]] (Residual Sum of Squares),
[[me]] (Mean Error), [[mae]] (Mean Absolute Error), [[r2]] (Coefficient of Determination).
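A minimal usage sketch of both input forms (not executed here; `stats` is an assumed alias for this namespace):

```clojure
(require '[fastmath.stats :as stats])

;; residuals are [0 0 2], so RMSE = sqrt(4/3)
(stats/rmse [1.0 2.0 3.0] [1.0 2.0 5.0])
;; against a constant value:
(stats/rmse [1.0 2.0 3.0] 2.0)
```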
sourceraw docstring

robust-standardizeclj

(robust-standardize vs)
(robust-standardize vs q)

Normalize samples to have median = 0 and MAD = 1.

If the q argument is used, scaling is done by the quantile difference (Q_q, Q_(1-q)). Set q to 0.25 to use the IQR.

Normalize samples to have median = 0 and MAD = 1.

If the `q` argument is used, scaling is done by the quantile difference (Q_q, Q_(1-q)). Set `q` to 0.25 to use the IQR.
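A minimal usage sketch (not executed here; `stats` is an assumed alias for this namespace):

```clojure
(require '[fastmath.stats :as stats])

(def vs [10.0 12.0 11.0 14.0 100.0])  ;; sample with one extreme value

(stats/robust-standardize vs)       ;; median -> 0, MAD -> 1
(stats/robust-standardize vs 0.25)  ;; scale by IQR instead of MAD
```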
sourceraw docstring

rows->contingency-tableclj

(rows->contingency-table xss)

Converts a sequence of sequences (representing rows of counts) into a map-based contingency table.

This function takes a collection where each inner sequence is treated as a row of counts in a grid or matrix. It transforms this matrix representation into a map where keys are [row-index, column-index] tuples and values are the non-zero counts at that intersection.

This is particularly useful for converting structured count data, like the output of some grouping or tabulation processes, into a format suitable for functions expecting a contingency table map (like contingency-table->marginals or chi-squared tests).

Parameters:

  • xss (sequence of sequences of numbers): A collection where each inner sequence xs_i contains counts for row i. Values within xs_i are interpreted as counts for columns 0, 1, ....

Returns a map where keys are [row-index, column-index] vectors and values are the corresponding non-zero counts from the input matrix. Zero counts are omitted from the output map.

See also contingency-table (for building tables from raw data), contingency-table->marginals.

Converts a sequence of sequences (representing rows of counts) into a map-based contingency table.

This function takes a collection where each inner sequence is treated as a row
of counts in a grid or matrix. It transforms this matrix representation into a
map where keys are `[row-index, column-index]` tuples and values are the
non-zero counts at that intersection.

This is particularly useful for converting structured count data, like the
output of some grouping or tabulation processes, into a format suitable for
functions expecting a contingency table map (like `contingency-table->marginals`
or chi-squared tests).

Parameters:

- `xss` (sequence of sequences of numbers): A collection where each inner
  sequence `xs_i` contains counts for row `i`. Values within `xs_i` are
  interpreted as counts for columns `0, 1, ...`.

Returns a map where keys are `[row-index, column-index]` vectors and values
are the corresponding non-zero counts from the input matrix. Zero counts are
omitted from the output map.

See also [[contingency-table]] (for building tables from raw data), [[contingency-table->marginals]].
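A minimal usage sketch (not executed here; `stats` is an assumed alias for this namespace):

```clojure
(require '[fastmath.stats :as stats])

(stats/rows->contingency-table [[5 0 2]
                                [1 3 0]])
;; per the description above, zero counts are omitted, so the result
;; should have the shape {[0 0] 5, [0 2] 2, [1 0] 1, [1 1] 3}
```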
sourceraw docstring

rssclj

(rss [vs1 vs2-or-val])
(rss vs1 vs2-or-val)

Calculates the Residual Sum of Squares (RSS) between two sequences or a sequence and a constant value.

RSS is a measure of the discrepancy between data and a model, often used in regression analysis to quantify the total squared difference between observed values and predicted (or reference) values.

Parameters:

  • vs1 (sequence of numbers): The first sequence (often observed values).
  • vs2-or-val (sequence of numbers or single number): The second sequence (often predicted or reference values), or a single number to compare against each element of vs1.

If both sequences (vs1 and vs2) are provided, they must have the same length. If vs2-or-val is a single number, it is effectively treated as a sequence of that number repeated count(vs1) times.

Returns the calculated Residual Sum of Squares as a double.

See also mse (Mean Squared Error), rmse (Root Mean Squared Error), r2 (Coefficient of Determination).

Calculates the Residual Sum of Squares (RSS) between two sequences or a sequence and a constant value.

RSS is a measure of the discrepancy between data and a model, often used in
regression analysis to quantify the total squared difference between observed
values and predicted (or reference) values.

Parameters:

- `vs1` (sequence of numbers): The first sequence (often observed values).
- `vs2-or-val` (sequence of numbers or single number): The second sequence
  (often predicted or reference values), or a single number to compare
  against each element of `vs1`.

If both sequences (`vs1` and `vs2`) are provided, they must have the same length.
If `vs2-or-val` is a single number, it is effectively treated as a sequence
of that number repeated `count(vs1)` times.

Returns the calculated Residual Sum of Squares as a double.

See also [[mse]] (Mean Squared Error), [[rmse]] (Root Mean Squared Error), [[r2]] (Coefficient of Determination).
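A minimal usage sketch of both input forms (not executed here; `stats` is an assumed alias for this namespace):

```clojure
(require '[fastmath.stats :as stats])

;; (1-1.5)^2 + (2-2)^2 + (3-2.5)^2 = 0.5
(stats/rss [1.0 2.0 3.0] [1.5 2.0 2.5])
;; residuals against a constant:
(stats/rss [1.0 2.0 3.0] 2.0)
```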
sourceraw docstring

second-momentcljdeprecated

source

semclj

(sem vs)

Calculates the Standard Error of the Mean (SEM) for a sequence vs.

The SEM estimates the standard deviation of the sample mean, providing an indication of how accurately the sample mean represents the population mean. It is calculated as:

SEM = stddev(vs) / sqrt(count(vs))

where stddev(vs) is the sample standard deviation and count(vs) is the sample size.

Parameters:

  • vs: Sequence of numbers.

Returns the calculated SEM as a double.

A smaller SEM indicates that the sample mean is likely to be a more precise estimate of the population mean.

See also stddev, mean.

Calculates the Standard Error of the Mean (SEM) for a sequence `vs`.

The SEM estimates the standard deviation of the sample mean, providing an
indication of how accurately the sample mean represents the population mean.
It is calculated as:

`SEM = stddev(vs) / sqrt(count(vs))`

where `stddev(vs)` is the sample standard deviation and `count(vs)` is the
sample size.

Parameters:

- `vs`: Sequence of numbers.

Returns the calculated SEM as a double.

A smaller SEM indicates that the sample mean is likely to be a more precise
estimate of the population mean.

See also [[stddev]], [[mean]].
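A sketch illustrating the formula above (not executed here; `stats` is an assumed alias for this namespace):

```clojure
(require '[fastmath.stats :as stats])

(def vs [2.0 4.0 4.0 4.0 5.0 5.0 7.0 9.0])

(stats/sem vs)
;; equivalently, per the definition SEM = stddev / sqrt(n):
(/ (stats/stddev vs) (Math/sqrt (count vs)))
```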
sourceraw docstring

sem-extentclj

(sem-extent vs)

Extent given by mean -/+ SEM, and the mean.

Extent given by mean -/+ SEM, and the mean.
sourceraw docstring

similarityclj

(similarity method P-observed Q-expected)
(similarity method
            P-observed
            Q-expected
            {:keys [bins probabilities? epsilon]
             :or {probabilities? true epsilon 1.0E-6}})

Various PDF similarities between two histograms (frequencies) or probabilities.

Q can be a distribution object; in that case, a histogram is created from P.

Arguments:

  • method - distance method
  • P-observed - frequencies, probabilities or actual data (when Q is a distribution)
  • Q-expected - frequencies, probabilities or distribution object (when P is a data)

Options:

  • :probabilities? - should P/Q be converted to probabilities, default: true.
  • :epsilon - a small number which replaces 0.0 when division or logarithm is used.
  • :bins - number of bins or bins estimation method, see histogram.

The list of methods: :intersection, :czekanowski, :motyka, :kulczynski, :ruzicka, :inner-product, :harmonic-mean, :cosine, :jaccard, :dice, :fidelity, :squared-chord

See more: Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions by Sung-Hyuk Cha

Various PDF similarities between two histograms (frequencies) or probabilities.

Q can be a distribution object; in that case, a histogram is created from `P`.

Arguments:

* `method` - distance method
* `P-observed` - frequencies, probabilities or actual data (when Q is a distribution)
* `Q-expected` - frequencies, probabilities or distribution object (when P is a data)

Options:

* `:probabilities?` - should P/Q be converted to probabilities, default: `true`.
* `:epsilon` - a small number which replaces `0.0` when division or logarithm is used.
* `:bins` - number of bins or bins estimation method, see [[histogram]].

The list of methods: `:intersection`, `:czekanowski`, `:motyka`, `:kulczynski`, `:ruzicka`, `:inner-product`, `:harmonic-mean`, `:cosine`, `:jaccard`, `:dice`, `:fidelity`, `:squared-chord`

See more: Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions by Sung-Hyuk Cha
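A minimal usage sketch with two hypothetical frequency vectors (not executed here; `stats` is an assumed alias for this namespace):

```clojure
(require '[fastmath.stats :as stats])

(def P [10 20 30 40])  ;; observed frequencies
(def Q [12 18 33 37])  ;; expected frequencies

(stats/similarity :cosine P Q)
;; keep raw frequencies instead of normalizing to probabilities:
(stats/similarity :jaccard P Q {:probabilities? false})
```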
sourceraw docstring

skewnessclj

(skewness vs)
(skewness vs typ)

Calculate skewness from sequence, a measure of the asymmetry of the probability distribution about its mean.

Parameters:

  • vs (seq of numbers): The input sequence.
  • typ (keyword or sequence, optional): Specifies the type of skewness measure to calculate. Defaults to :G1.

Available typ values:

  • :G1 (Default): Sample skewness based on the third standardized moment, as implemented by Apache Commons Math Skewness. Adjusted for sample size bias.
  • :g1 or :pearson: Pearson's moment coefficient of skewness (g1), a bias-adjusted version of the third standardized moment. Expected value 0 for symmetric distributions.
  • :b1: Sample skewness coefficient (b1), related to :g1.
  • :B1 or :yule: Yule's coefficient (robust), based on quantiles. Takes an optional quantile u (default 0.25) via sequence [:B1 u] or [:yule u].
  • :B3: Robust measure comparing the mean and median relative to the mean absolute deviation around the median.
  • :skew: An adjusted skewness definition sometimes used in bootstrap (BCa) calculations.
  • :mode: Pearson's second skewness coefficient: (mean - mode) / stddev. Requires calculating the mode. Mode calculation method can be specified via sequence [:mode method opts], see mode.
  • :median: Robust measure: 3 * (mean - median) / stddev.
  • :bowley: Bowley's coefficient (robust), based on quartiles (Q1, Q2, Q3). Also known as Yule-Bowley coefficient. Calculated as (Q3 + Q1 - 2*Q2) / (Q3 - Q1).
  • :hogg: Hogg's robust measure based on the ratio of differences between trimmed means.
  • :l-skewness: L-skewness (τ₃), the ratio of the 3rd L-moment (λ₃) to the 2nd L-moment (λ₂, L-scale). Calculated directly using l-moment with the :ratio? option set to true. It's a robust measure of asymmetry. Expected value 0 for symmetric distributions.

Interpretation:

  • Positive values generally indicate a distribution skewed to the right (tail is longer on the right).
  • Negative values generally indicate a distribution skewed to the left (tail is longer on the left).
  • Values near 0 suggest relative symmetry.

Returns the calculated skewness value as a double.

See also skewness-test, normality-test, jarque-bera-test, l-moment.

source

skewness-testclj

(skewness-test xs)
(skewness-test xs params)
(skewness-test xs skew {:keys [sides type] :or {sides :two-sided type :g1}})

Performs the D'Agostino test for normality based on sample skewness.

This test assesses the null hypothesis that the data comes from a normally distributed population by checking if the sample skewness significantly deviates from the zero skewness expected under normality.

The test works by:

  1. Calculating the sample skewness (type configurable via :type, default :g1).
  2. Standardizing the sample skewness relative to its expected value (0) and standard error under the null hypothesis.
  3. Applying a further transformation (inverse hyperbolic sine based) to this standardized score to yield a final test statistic Z that more closely follows a standard normal distribution under the null hypothesis.

Parameters:

  • xs (seq of numbers): The sample data.
  • skew (double, optional): A pre-calculated skewness value. If omitted, it's calculated from xs.
  • params (map, optional): Options map:
    • :sides (keyword, default :two-sided): Specifies the alternative hypothesis.
      • :two-sided (default): The population skewness is different from 0.
      • :one-sided-greater: The population skewness is greater than 0 (right-skewed).
      • :one-sided-less: The population skewness is less than 0 (left-skewed).
    • :type (keyword, default :g1): The type of skewness to calculate if skew is not provided. Note that the internal normalization constants are derived based on :g1. See skewness for options.

Returns a map containing:

  • :Z: The final test statistic, approximately standard normal under H0.
  • :stat: Alias for :Z.
  • :p-value: The p-value associated with Z and the specified :sides.
  • :skewness: The sample skewness value used in the test (either provided or calculated).

See also kurtosis-test, normality-test, jarque-bera-test.
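The standardization in step 2 divides the sample skewness by its standard error under normality; a minimal Python sketch of that step (the asinh-based refinement of step 3 is omitted, and the names here are illustrative, not the library's):

```python
import math

def ses(n):
    # standard error of sample skewness under the normality null hypothesis
    return math.sqrt(6.0 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))

def g1(xs):
    # third standardized central moment
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

def naive_z(xs):
    # standardized skewness score (step 2); step 3 refines this further
    return g1(xs) / ses(len(xs))
```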

source

spanclj

(span vs)

Width of the sample: the maximum value minus the minimum value.

source

spearman-correlationclj

(spearman-correlation [vs1 vs2])
(spearman-correlation vs1 vs2)

Calculates Spearman's rank correlation coefficient between two sequences.

Spearman's rank correlation is a non-parametric measure of the monotonic relationship between two datasets. It assesses how well the relationship between two variables can be described using a monotonic function. It does not require the data to be linearly related or follow a specific distribution. The coefficient is calculated on the ranks of the data rather than the raw values.

Parameters:

  • [vs1 vs2] (sequence of two sequences): A sequence containing the two sequences of numbers.
  • vs1, vs2 (sequences): The two sequences of numbers directly as arguments.

Both sequences must have the same length.

Returns the calculated Spearman rank correlation coefficient (a value between -1.0 and 1.0) as a double. A value of 1 indicates a perfect monotonic increasing relationship, -1 a perfect monotonic decreasing relationship, and 0 no monotonic relationship.

See also pearson-correlation, kendall-correlation, correlation.
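Conceptually, Spearman's coefficient is Pearson's correlation applied to the ranks of the data. A self-contained Python sketch of that idea (ties get average ranks; illustrative only, not the library implementation):

```python
def ranks(xs):
    # 1-based ranks; tied values receive the average of their positions
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def spearman(a, b):
    return pearson(ranks(a), ranks(b))
```

A perfectly monotonic but non-linear pairing, e.g. `[1 2 3 4 5]` against `[1 4 9 16 25]`, scores 1.0 under Spearman even though Pearson on the raw values would be below 1.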

source

standardizeclj

(standardize vs)

Normalize samples to have mean = 0 and stddev = 1.
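The transformation is the usual z-score; as a quick formula sketch in Python (using the sample standard deviation with the n-1 denominator here; not the library code):

```python
def standardize(vs):
    # z-score each value: (x - mean) / stddev
    n = len(vs)
    m = sum(vs) / n
    sd = (sum((x - m) ** 2 for x in vs) / (n - 1)) ** 0.5
    return [(x - m) / sd for x in vs]
```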

source

stats-mapclj

(stats-map vs)
(stats-map vs estimation-strategy)

Calculates a comprehensive set of descriptive statistics for a numerical dataset.

This function computes various summary measures and returns them as a map, providing a quick overview of the data's central tendency, dispersion, shape, and potential outliers.

Parameters:

  • vs (seq of numbers): The input sequence of numerical data.
  • estimation-strategy (keyword, optional): Specifies the method for calculating quantiles (including median, quartiles, and values used for fences). Defaults to :legacy. See percentile or quantile for available strategies (e.g., :r1 through :r9).

Returns a map where keys are statistic names (as keywords) and values are their calculated measures:

  • :Size: The number of data points in the sequence (count).
  • :Min: The minimum value (see minimum).
  • :Max: The maximum value (see maximum).
  • :Range: The difference between the maximum and minimum values (Max - Min).
  • :Mean: The arithmetic average (see mean).
  • :Median: The middle value (see median with estimation-strategy).
  • :Mode: The most frequent value (see mode with default method).
  • :Q1: The first quartile (25th percentile) (see percentile with estimation-strategy).
  • :Q3: The third quartile (75th percentile) (see percentile with estimation-strategy).
  • :Total: The sum of all values (see sum).
  • :SD: The sample standard deviation (see stddev).
  • :Variance: The sample variance (SD^2, see variance).
  • :MAD: The Median Absolute Deviation (see median-absolute-deviation).
  • :SEM: The Standard Error of the Mean (see sem).
  • :LAV: The Lower Adjacent Value (smallest value within the inner fence, see adjacent-values).
  • :UAV: The Upper Adjacent Value (largest value within the inner fence, see adjacent-values).
  • :IQR: The Interquartile Range (Q3 - Q1).
  • :LOF: The Lower Outer Fence (Q1 - 3*IQR, see outer-fence-extent).
  • :UOF: The Upper Outer Fence (Q3 + 3*IQR, see outer-fence-extent).
  • :LIF: The Lower Inner Fence (Q1 - 1.5*IQR, see inner-fence-extent).
  • :UIF: The Upper Inner Fence (Q3 + 1.5*IQR, see inner-fence-extent).
  • :Outliers: A sequence of data points falling outside the inner fences (see outliers).
  • :Kurtosis: A measure of tailedness/peakedness (see kurtosis with default :G2 type).
  • :Skewness: A measure of asymmetry (see skewness with default :G1 type).

This function is a convenient way to get a standard set of summary statistics for a dataset in a single call.
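The four fence entries in the returned map are simple arithmetic on Q1, Q3, and the IQR; in Python terms (a hypothetical helper named `fences`, not part of the library):

```python
def fences(q1, q3):
    # Tukey-style inner and outer fences from the quartiles
    iqr = q3 - q1
    return {"LIF": q1 - 1.5 * iqr, "UIF": q3 + 1.5 * iqr,
            "LOF": q1 - 3.0 * iqr, "UOF": q3 + 3.0 * iqr}
```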

source

stddevclj

(stddev vs)
(stddev vs mu)

Calculate standard deviation of vs.

See population-stddev.

source

stddev-extentclj

(stddev-extent vs)

-/+ stddev and mean

source

sumclj

(sum vs)
(sum vs compensation-method)

Sum of all vs values.

Possible compensated summation methods are :kahan, :neumayer, and :klein.
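As an illustration of why compensated summation matters, here is a Neumaier-style compensated sum in plain Python (a sketch of the technique behind such options, not the library's implementation):

```python
def neumaier_sum(xs):
    # compensated summation: carry the accumulated rounding error in c
    total = 0.0
    c = 0.0
    for x in xs:
        t = total + x
        if abs(total) >= abs(x):
            c += (total - t) + x   # low-order digits of x were lost
        else:
            c += (x - t) + total   # low-order digits of total were lost
        total = t
    return total + c

# naive float summation loses the 1.0 here; the compensated sum keeps it
data = [1e16, 1.0, -1e16]
```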

source

t-test-one-sampleclj

(t-test-one-sample xs)
(t-test-one-sample xs m)

Performs a one-sample Student's t-test to compare the sample mean against a hypothesized population mean.

This test assesses the null hypothesis that the true population mean is equal to mu. It is suitable when the population standard deviation is unknown and is estimated from the sample.

Parameters:

  • xs (seq of numbers): The sample data.
  • params (map, optional): Options map:
    • :alpha (double, default 0.05): Significance level for the confidence interval.
    • :sides (keyword, default :two-sided): Specifies the alternative hypothesis.
      • :two-sided (default): The true mean is not equal to mu.
      • :one-sided-greater: The true mean is greater than mu.
      • :one-sided-less: The true mean is less than mu.
    • :mu (double, default 0.0): The hypothesized population mean under the null hypothesis.

Returns a map containing:

  • :t: The calculated t-statistic.
  • :stat: Alias for :t.
  • :df: Degrees of freedom (n-1).
  • :p-value: The p-value associated with the t-statistic and :sides.
  • :confidence-interval: Confidence interval for the true population mean.
  • :estimate: The calculated sample mean.
  • :n: The sample size.
  • :mu: The hypothesized population mean used in the test.
  • :stderr: The standard error of the mean (calculated from the sample).
  • :alpha: Significance level used.
  • :sides: Alternative hypothesis side used.
  • :test-type: Alias for :sides.

Assumptions:

  • The data are independent observations.
  • The data are drawn from a population that is approximately normally distributed. (The t-test is relatively robust to moderate violations, especially with larger sample sizes).

See also z-test-one-sample for large samples or known population standard deviation.
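The statistic itself is the standardized distance of the sample mean from mu; a compact Python sketch of that computation (the p-value lookup against the t-distribution is omitted; this is not the library code):

```python
import math

def one_sample_t(xs, mu=0.0):
    # t = (mean - mu) / (s / sqrt(n)), with df = n - 1
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    stderr = sd / math.sqrt(n)
    return (mean - mu) / stderr, n - 1  # (t statistic, degrees of freedom)
```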

source

t-test-two-samplesclj

(t-test-two-samples xs ys)
(t-test-two-samples xs
                    ys
                    {:keys [paired? equal-variances?]
                     :or {paired? false equal-variances? false}
                     :as params})

Performs a two-sample Student's t-test to compare the means of two samples.

This function can perform:

  • An unpaired t-test (assuming independent samples) using either:
    • Welch's t-test (default: :equal-variances? false): Does not assume equal population variances. Uses the Satterthwaite approximation for degrees of freedom. Recommended unless variances are known to be equal.
    • Student's t-test (:equal-variances? true): Assumes equal population variances and uses a pooled variance estimate.
  • A paired t-test (:paired? true): Assumes observations in xs and ys are paired (e.g., before/after measurements on the same subjects). This performs a one-sample t-test on the differences between paired observations.

The test assesses the null hypothesis that the true difference between the population means (or the mean of the differences for paired test) is equal to mu.

Parameters:

  • xs (seq of numbers): The first sample.
  • ys (seq of numbers): The second sample.
  • params (map, optional): Options map:
    • :alpha (double, default 0.05): Significance level for the confidence interval.
    • :sides (keyword, default :two-sided): Specifies the alternative hypothesis.
      • :two-sided (default): The true difference in means is not equal to mu.
      • :one-sided-greater: The true difference (mean(xs) - mean(ys) or mean(diff)) is greater than mu.
      • :one-sided-less: The true difference (mean(xs) - mean(ys) or mean(diff)) is less than mu.
    • :mu (double, default 0.0): The hypothesized difference in means under the null hypothesis.
    • :paired? (boolean, default false): If true, performs a paired t-test (requires xs and ys to have the same length). If false, performs an unpaired test.
    • :equal-variances? (boolean, default false): Used only when paired? is false. If true, assumes equal population variances (Student's). If false, does not assume equal variances (Welch's).

Returns a map containing:

  • :t: The calculated t-statistic.
  • :stat: Alias for :t.
  • :df: Degrees of freedom used for the t-distribution.
  • :p-value: The p-value associated with the t-statistic and :sides.
  • :confidence-interval: Confidence interval for the true difference in means.
  • :estimate: The observed difference between sample means (mean(xs) - mean(ys) or mean(differences)).
  • :n: Sample sizes as [count xs, count ys] (or count diffs if paired).
  • :nx: Sample size of xs (if unpaired).
  • :ny: Sample size of ys (if unpaired).
  • :estimated-mu: Observed sample means as [mean xs, mean ys] (if unpaired).
  • :mu: The hypothesized difference under the null hypothesis.
  • :stderr: The standard error of the difference between the means (or of the mean difference if paired).
  • :alpha: Significance level used.
  • :sides: Alternative hypothesis side used.
  • :test-type: Alias for :sides.
  • :paired?: Boolean indicating if a paired test was performed.
  • :equal-variances?: Boolean indicating the variance assumption used (if unpaired).

Assumptions:

  • Independence of observations (within and between groups for unpaired).
  • Normality of the underlying populations (or of the differences for paired). The t-test is relatively robust to violations of normality, especially with larger sample sizes.
  • Equal variances (only if :equal-variances? true).
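The Welch statistic and its Satterthwaite degrees of freedom can be written out directly; a plain-Python sketch (p-value lookup omitted; illustrative, not the library code):

```python
import math

def welch_t(xs, ys):
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)
    sx, sy = vx / nx, vy / ny            # squared standard errors
    t = (mx - my) / math.sqrt(sx + sy)
    # Satterthwaite approximation of the degrees of freedom
    df = (sx + sy) ** 2 / (sx ** 2 / (nx - 1) + sy ** 2 / (ny - 1))
    return t, df
```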
source

trimclj

(trim vs)
(trim vs quantile)
(trim vs quantile estimation-strategy)
(trim vs low high nan)

Returns trimmed data. Trimming is done using quantiles; the default quantile is 0.2.
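The idea in index terms looks like this (a simplified integer cutoff in Python; fastmath's estimation strategies interpolate between order statistics, and this is not the library code):

```python
def trim(vs, q=0.2):
    # keep the central (1 - 2q) fraction of the sorted data
    s = sorted(vs)
    cut = int(len(s) * q)   # simple cutoff; real strategies interpolate
    return s[cut:len(s) - cut]
```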

source

trim-lowerclj

(trim-lower vs)
(trim-lower vs quantile)
(trim-lower vs quantile estimation-strategy)

Trim data below the given quantile, default: 0.2.

source

trim-upperclj

(trim-upper vs)
(trim-upper vs quantile)
(trim-upper vs quantile estimation-strategy)

Trim data above the given quantile, default: 0.2.

source

tschuprows-tclj

(tschuprows-t contingency-table)
(tschuprows-t group1 group2)

Calculates Tschuprow's T, a measure of association between two nominal variables represented in a contingency table.

Tschuprow's T is derived from the Pearson's Chi-squared statistic and measures the strength of the association. Its value ranges from 0 to 1.

  • A value of 0 indicates no association between the variables.
  • A value of 1 indicates perfect association, but only when the number of rows (r) equals the number of columns (k) in the contingency table. If r != k, Tschuprow's T cannot reach 1, making Cramer's V (cramers-v) often preferred as it can reach 1 for any table size.

The function can be called in two ways:

  1. With two sequences group1 and group2: The function will automatically construct a contingency table from the unique values in the sequences.
  2. With a contingency table: The contingency table can be provided as:
    • A map where keys are [row-index, column-index] tuples and values are counts (e.g., {[0 0] 10, [0 1] 5, [1 0] 3, [1 1] 12}). This is the output format of contingency-table with two inputs.
    • A sequence of sequences representing the rows of the table (e.g., [[10 5] [3 12]]). This is equivalent to rows->contingency-table.

Parameters:

  • group1 (sequence): The first sequence of categorical data.
  • group2 (sequence): The second sequence of categorical data. Must have the same length as group1.
  • contingency-table (map or sequence of sequences): A pre-computed contingency table.

Returns the calculated Tschuprow's T coefficient as a double.

See also chisq-test, cramers-c, cramers-v, cohens-w, contingency-table.
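Tschuprow's T is derived from the chi-squared statistic as T = sqrt(chi2 / (n * sqrt((r-1)*(k-1)))). A small Python sketch over a row-based table (illustrative only, not the library implementation):

```python
import math

def tschuprows_t(table):
    # table: sequence of rows of counts
    n = sum(sum(row) for row in table)
    rows = [sum(row) for row in table]
    cols = [sum(col) for col in zip(*table)]
    chi2 = sum((table[i][j] - rows[i] * cols[j] / n) ** 2
               / (rows[i] * cols[j] / n)
               for i in range(len(rows)) for j in range(len(cols)))
    r, k = len(rows), len(cols)
    return math.sqrt(chi2 / (n * math.sqrt((r - 1) * (k - 1))))
```

For a square table with perfect association, such as `[[5 0] [0 5]]`, T reaches 1.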

source

ttest-one-sampleclj (deprecated)

source

ttest-two-samplesclj (deprecated)

source

varianceclj

(variance vs)
(variance vs mu)

Calculate variance of vs.

See population-variance.

source

variationclj

(variation vs)

Calculates the coefficient of variation (CV) for a sequence vs.

The CV is a standardized measure of dispersion of a probability distribution or frequency distribution. It is defined as the ratio of the standard deviation to the mean:

CV = stddev(vs) / mean(vs)

This measure is unitless and allows for comparison of variability between datasets with different means or different units.

Parameters:

  • vs: Sequence of numbers.

Returns the calculated coefficient of variation as a double.

Note: The CV is undefined if the mean is zero, and may be misleading if the mean is close to zero or if the data can take both positive and negative values. All values in vs should ideally be positive.

See also stddev, mean.
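The ratio is straightforward to write out; a formula sketch in Python (sample standard deviation with n-1 denominator; not the library code):

```python
def variation(vs):
    # CV = sample stddev / mean; undefined when the mean is zero
    n = len(vs)
    mean = sum(vs) / n
    sd = (sum((x - mean) ** 2 for x in vs) / (n - 1)) ** 0.5
    return sd / mean
```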

source

weighted-kappaclj

(weighted-kappa contingency-table)
(weighted-kappa contingency-table weights)

Calculates Cohen's weighted Kappa coefficient (κ) for a contingency table, allowing for partial agreement between categories, typically used for ordinal data.

Weighted Kappa measures inter-rater agreement, similar to cohens-kappa, but assigns different penalties to disagreements based on their magnitude. Disagreements between closely related categories are penalized less than disagreements between distantly related categories.

The function can be called in two ways:

  1. With two sequences group1 and group2: The function will automatically construct a contingency table from the unique values in the sequences. These values are assumed to be ordinal and their position in the sorted unique value list determines their index. The mapping of values to table indices might need verification.

  2. With a contingency table: The contingency table can be provided as:

    • A map where keys are [row-index, column-index] tuples and values are counts (e.g., {[0 0] 10, [0 1] 5, [1 0] 3, [1 1] 12}). This is the output format of contingency-table with two inputs. Indices are assumed to represent the ordered categories.
    • A sequence of sequences representing the rows of the table (e.g., [[10 5] [3 12]]). This is equivalent to rows->contingency-table. The row and column indices are assumed to correspond to the ordered categories.

Parameters:

  • group1 (sequence): The first sequence of ordinal outcomes/categories.
  • group2 (sequence): The second sequence of ordinal outcomes/categories. Must have the same length as group1.
  • contingency-table (map or sequence of sequences): A pre-computed contingency table where row and column indices correspond to ordered categories.
  • weights (keyword, function, or map, optional): Specifies the weighting scheme to quantify the difference between categories. Defaults to :equal-spacing.
    • :equal-spacing (default, linear weights): Penalizes disagreements linearly with the distance between categories. Weight is 1 - |i-j|/R, where i is row index, j is column index, and R is the maximum dimension of the table (max(max_row_index, max_col_index)).
    • :fleiss-cohen (quadratic weights): Penalizes disagreements quadratically with the distance. Weight is 1 - (|i-j|/R)^2.
    • (function (fn [R id1 id2])): A custom function that takes the maximum dimension R, row index id1, and column index id2 and returns the weight (typically between 0 and 1, where 1 is perfect agreement).
    • (map {[id1 id2] weight}): A custom map providing weights for specific [row-index, column-index] pairs. Missing pairs default to a weight of 0.0.

Returns the calculated weighted Cohen's Kappa coefficient as a double.

Interpretation:

  • κ_w = 1: Perfect agreement.
  • κ_w = 0: Agreement is no better than chance.
  • κ_w < 0: Agreement is worse than chance.

See also cohens-kappa (unweighted Kappa).
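To make the `:equal-spacing` scheme concrete, here is a minimal, self-contained sketch of linear-weighted kappa for a table given as a sequence of row sequences. `weighted-kappa-sketch` is a hypothetical helper written from the formulas above, not the fastmath implementation.

```clojure
;; Sketch only: linear (:equal-spacing) weighted kappa, table given as rows.
(defn weighted-kappa-sketch
  [rows]
  (let [nr (count rows)
        nc (count (first rows))
        R (double (max (dec nr) (dec nc)))
        total (double (reduce + (mapcat identity rows)))
        row-sums (mapv #(reduce + %) rows)
        col-sums (apply mapv + rows)
        w (fn [i j] (- 1.0 (/ (Math/abs (- i j)) R))) ; weight 1 - |i-j|/R
        ;; observed weighted agreement
        po (/ (reduce + (for [i (range nr) j (range nc)]
                          (* (w i j) (nth (nth rows i) j))))
              total)
        ;; chance-expected weighted agreement from the marginals
        pe (/ (reduce + (for [i (range nr) j (range nc)]
                          (* (w i j) (/ (* (row-sums i) (col-sums j)) total))))
              total)]
    (/ (- po pe) (- 1.0 pe))))

(weighted-kappa-sketch [[10 5] [3 12]])
```

For a 2x2 table the linear weights are 1 on the diagonal and 0 off it, so the result coincides with plain (unweighted) Cohen's kappa.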


winsor

(winsor vs)
(winsor vs quantile)
(winsor vs quantile estimation-strategy)
(winsor vs low high nan)

Returns winsorized data. Trimming thresholds are determined by quantiles; the quantile level defaults to 0.2.
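The behaviour can be sketched with a naive rounded-index quantile; fastmath's own estimation strategies are more refined, and `winsor-sketch` is a hypothetical helper.

```clojure
;; Sketch only: clamp values outside the [q, 1-q] quantile range.
(defn winsor-sketch
  [vs q]
  (let [sorted (vec (sort vs))
        n (count sorted)
        ;; naive quantile: nearest rank by rounded index
        at (fn [p] (sorted (Math/round (* p (double (dec n))))))
        lo (at q)
        hi (at (- 1.0 q))]
    (map #(-> % (max lo) (min hi)) vs)))

(winsor-sketch [1 2 3 4 100] 0.2)
;; => (2 2 3 4 4)
```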


wmean (deprecated)

(wmean vs)
(wmean vs weights)

Weighted mean


wmedian

(wmedian vs ws)
(wmedian vs ws method)

Calculates median of a sequence vs with corresponding weights ws.

Parameters:

  • vs: Sequence of data values.
  • ws: Sequence of corresponding non-negative weights. Must have the same count as vs.
  • method (optional keyword): Specifies the interpolation method used when q=0.5 falls between points in the weighted ECDF. Defaults to :linear.
    • :linear: Performs linear interpolation between the data values corresponding to the cumulative weights surrounding q=0.5.
    • :step: Uses a step function (specifically, step-before) based on the weighted ECDF. The result is the data value whose cumulative weight range includes q=0.5.
    • :average: Computes the average of the step-before and step-after interpolation methods.

See also: wquantile, quantile.


wmode

(wmode vs)
(wmode vs weights)

Returns the primary weighted mode of a sequence vs.

The mode is the value that appears most often in a dataset. This function generalizes the mode concept by considering weights associated with each value. A value's contribution to the mode calculation is proportional to its weight.

If multiple values share the same highest total weight (i.e., there are ties for the mode), this function returns only the first one encountered during processing. The specific mode returned in case of a tie is not guaranteed to be stable across different runs or environments. Use wmodes if you need all tied modes.

Parameters:

  • vs: Sequence of data values. Can contain any data type (numbers, keywords, etc.).
  • weights (optional): Sequence of non-negative weights corresponding to vs. Must have the same count as vs. Defaults to a sequence of 1.0s if omitted, effectively calculating the unweighted mode.

Returns a single value representing the mode (or one of the modes if ties exist).

See also wmodes (returns all modes) and mode (for unweighted numeric data).
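The idea can be sketched by totalling the weights per distinct value and picking a value with the maximal total. `wmode-sketch` is a hypothetical helper, not the fastmath implementation.

```clojure
;; Sketch only: weighted mode via per-value weight totals.
;; Ties resolve arbitrarily, as the docstring warns.
(defn wmode-sketch
  [vs ws]
  (let [totals (reduce (fn [m [v w]] (update m v (fnil + 0.0) w))
                       {}
                       (map vector vs ws))]
    (key (apply max-key val totals))))

(wmode-sketch [:a :b :a :c] [1 1 1 5])
;; => :c  (:c has total weight 5 vs 2 for :a)
```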


wmodes

(wmodes vs)
(wmodes vs weights)

Returns the weighted mode(s) of a sequence vs.

The mode is the value that appears most often in a dataset. This function generalizes the mode concept by considering weights associated with each value. A value's contribution to the mode calculation is proportional to its weight.

Parameters:

  • vs: Sequence of data values. Can contain any data type (numbers, keywords, etc.).
  • weights (optional): Sequence of non-negative weights corresponding to vs. Must have the same count as vs. Defaults to a sequence of 1.0s if omitted, effectively calculating the unweighted modes.

Returns a sequence containing all values that have the highest total weight. If there are ties (multiple values share the same maximum total weight), all tied values are included in the returned sequence. The order of modes in the returned sequence is not guaranteed.

See also wmode (returns only one mode in case of ties) and modes (for unweighted numeric data).


wmw-odds

(wmw-odds [group1 group2])
(wmw-odds group1 group2)

Calculates the Wilcoxon-Mann-Whitney odds (often denoted as ψ) for two independent samples.

This non-parametric effect size measure quantifies the odds that a randomly chosen observation from the first group (group1) is greater than a randomly chosen observation from the second group (group2).

The statistic is directly related to cliffs-delta (δ): ψ = (1 + δ) / (1 - δ).

Parameters:

  • group1 (seq of numbers): The first independent sample.
  • group2 (seq of numbers): The second independent sample.

Returns the calculated WMW odds as a double.

Interpretation:

  • A value greater than 1 indicates that values from group1 tend to be larger than values from group2.
  • A value less than 1 indicates that values from group1 tend to be smaller than values from group2.
  • A value of 1 indicates stochastic equality between the distributions (50/50 odds).

This measure is robust to violations of normality and is suitable for ordinal data. It is closely related to Cliff's Delta (δ) and the Mann-Whitney U test statistic.

See also cliffs-delta, ameasure.
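A minimal sketch via the Cliff's delta relation above, counting cross-group pairs directly (tie-handling refinements omitted; `wmw-odds-sketch` is a hypothetical helper):

```clojure
;; Sketch only: psi = (1 + delta) / (1 - delta), with delta counted
;; from all cross-group pair comparisons.
(defn wmw-odds-sketch
  [group1 group2]
  (let [cmps (for [x group1 y group2] (compare x y))
        gt (count (filter pos? cmps))
        lt (count (filter neg? cmps))
        delta (/ (- gt lt) (double (* (count group1) (count group2))))]
    (/ (+ 1.0 delta) (- 1.0 delta))))

(wmw-odds-sketch [3 4 5] [1 2 3])
;; odds well above 1: group1 tends to be larger
```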


wquantile

(wquantile vs ws q)
(wquantile vs ws q method)

Calculates the q-th weighted quantile of a sequence vs with corresponding weights ws.

The quantile q is a value between 0.0 and 1.0, inclusive.

The calculation involves constructing a weighted empirical cumulative distribution function (ECDF) and interpolating to find the value at quantile q.

Parameters:

  • vs: Sequence of data values.
  • ws: Sequence of corresponding non-negative weights. Must have the same count as vs.
  • q: The quantile level (0.0 < q <= 1.0).
  • method (optional keyword): Specifies the interpolation method used when q falls between points in the weighted ECDF. Defaults to :linear.
    • :linear: Performs linear interpolation between the data values corresponding to the cumulative weights surrounding q.
    • :step: Uses a step function (specifically, step-before) based on the weighted ECDF. The result is the data value whose cumulative weight range includes q.
    • :average: Computes the average of the step-before and step-after interpolation methods. Useful when q corresponds exactly to a cumulative weight boundary.

See also: wmedian, wquantiles, quantile.
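The `:step` method can be sketched as a walk over the normalized cumulative weights, returning the first value whose cumulative weight reaches `q`. `wquantile-step-sketch` is a hypothetical helper, not the fastmath implementation.

```clojure
;; Sketch only: step-style weighted quantile from the weighted ECDF.
(defn wquantile-step-sketch
  [vs ws q]
  (let [pairs (sort-by first (map vector vs ws))
        total (double (reduce + ws))]
    (loop [[[v w] & more] pairs
           acc 0.0]
      (let [acc (+ acc (/ w total))]   ; accumulate normalized weight
        (if (or (>= acc q) (empty? more))
          v
          (recur more acc))))))

(wquantile-step-sketch [10 20 30 40] [1 1 1 1] 0.75)
;; => 30
```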


wquantiles

(wquantiles vs ws)
(wquantiles vs ws qs)
(wquantiles vs ws qs method)

Calculates the weighted quantiles of a sequence vs at levels qs, with corresponding weights ws.

Quantiles qs is a sequence of values between 0.0 and 1.0, inclusive.

The calculation involves constructing a weighted empirical cumulative distribution function (ECDF) and interpolating to find the value at quantiles qs.

Parameters:

  • vs: Sequence of data values.
  • ws: Sequence of corresponding non-negative weights. Must have the same count as vs.
  • qs: Sequence of quantile levels (0.0 < q <= 1.0).
  • method (optional keyword): Specifies the interpolation method used when qs falls between points in the weighted ECDF. Defaults to :linear.
    • :linear: Performs linear interpolation between the data values corresponding to the cumulative weights surrounding q.
    • :step: Uses a step function (specifically, step-before) based on the weighted ECDF. The result is the data value whose cumulative weight range includes q.
    • :average: Computes the average of the step-before and step-after interpolation methods. Useful when q corresponds exactly to a cumulative weight boundary.

See also: wquantile, quantiles.


wstddev

(wstddev vs freqs)

Calculates the weighted (unbiased) standard deviation of vs.


wvariance

(wvariance vs freqs)

Calculates the weighted (unbiased) variance of vs.
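Both wstddev and wvariance can be sketched under the assumption that `freqs` act as repeat counts, using the frequency-weighted unbiased formula s^2 = sum(f * (x - mu)^2) / (sum(f) - 1). The `-sketch` names are hypothetical helpers, not fastmath functions.

```clojure
;; Sketch only: frequency-weighted unbiased variance and standard deviation.
(defn wvariance-sketch
  [vs freqs]
  (let [total (double (reduce + freqs))
        mu (/ (reduce + (map * vs freqs)) total)               ; weighted mean
        ss (reduce + (map (fn [x f] (* f (Math/pow (- x mu) 2.0)))
                          vs freqs))]
    (/ ss (dec total))))

(defn wstddev-sketch
  [vs freqs]
  (Math/sqrt (wvariance-sketch vs freqs)))

(wvariance-sketch [1 2 3] [1 1 1])
;; => 1.0  (all frequencies 1 reduces to the plain sample variance)
```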


yeo-johnson-infer-lambda

(yeo-johnson-infer-lambda xs)
(yeo-johnson-infer-lambda xs lambda-range)
(yeo-johnson-infer-lambda xs lambda-range {:keys [alpha] :or {alpha 0.0}})

Finds the optimal lambda parameter for the Yeo-Johnson transformation using the maximum log-likelihood method.


yeo-johnson-transformation

(yeo-johnson-transformation xs)
(yeo-johnson-transformation xs lambda)
(yeo-johnson-transformation xs lambda {:keys [alpha inverse?] :or {alpha 0.0}})

Applies the Yeo-Johnson transformation to a dataset.

This transformation is used to stabilize variance and make data more normally distributed. It extends the Box-Cox transformation to allow for zero and negative values.

Parameters:

  • xs: The input dataset.
  • lambda (default: 0.0): The power parameter controlling the transformation. If lambda is nil or a range [lambda-min, lambda-max], it will be inferred using the maximum log-likelihood method.
  • Options map:
    • :alpha (optional): A shift parameter applied before transformation.
    • :inverse? (optional): Perform inverse operation, lambda should be provided (can't be inferred).

Returns:

  • A transformed sequence of numbers.

Related: box-cox-transformation
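The transform itself, written out per case for a fixed lambda (alpha shift and lambda inference omitted), is a useful reference. `yeo-johnson-sketch` is a hypothetical helper, not the fastmath implementation.

```clojure
;; Sketch only: the Yeo-Johnson transform, case by case.
(defn yeo-johnson-sketch
  [lambda x]
  (if (>= x 0.0)
    (if (zero? lambda)
      (Math/log (inc x))                              ; ln(x+1)
      (/ (dec (Math/pow (inc x) lambda)) lambda))     ; ((x+1)^l - 1)/l
    (if (== lambda 2.0)
      (- (Math/log (- 1.0 x)))                        ; -ln(1-x)
      (- (/ (dec (Math/pow (- 1.0 x) (- 2.0 lambda))) ; -((1-x)^(2-l) - 1)/(2-l)
            (- 2.0 lambda))))))

(map (partial yeo-johnson-sketch 1.0) [-1.0 0.0 2.0])
;; => (-1.0 0.0 2.0)  (lambda = 1 is the identity)
```

Unlike Box-Cox, every branch is defined for zero and negative inputs.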


z-test-one-sample

(z-test-one-sample xs)
(z-test-one-sample xs m)

Performs a one-sample Z-test to compare the sample mean against a hypothesized population mean.

This test assesses the null hypothesis that the true population mean is equal to mu. It typically assumes either a known population standard deviation or relies on a large sample size (e.g., n > 30) where the sample standard deviation provides a reliable estimate. This implementation uses the sample standard deviation to calculate the standard error.

Parameters:

  • xs (seq of numbers): The sample data.
  • params (map, optional): Options map:
    • :alpha (double, default 0.05): Significance level for the confidence interval.
    • :sides (keyword, default :two-sided): Specifies the alternative hypothesis.
      • :two-sided (default): The true mean is not equal to mu.
      • :one-sided-greater: The true mean is greater than mu.
      • :one-sided-less: The true mean is less than mu.
    • :mu (double, default 0.0): The hypothesized population mean under the null hypothesis.

Returns a map containing:

  • :z: The calculated Z-statistic.
  • :stat: Alias for :z.
  • :p-value: The p-value associated with the Z-statistic and the specified :sides.
  • :confidence-interval: Confidence interval for the true population mean.
  • :estimate: The calculated sample mean.
  • :n: The sample size.
  • :mu: The hypothesized population mean used in the test.
  • :stderr: The standard error of the mean (calculated using sample standard deviation).
  • :alpha: Significance level used.
  • :sides: Alternative hypothesis side used.
  • :test-type: Alias for :sides.

See also t-test-one-sample for smaller samples or when the population standard deviation is unknown.
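The core of the test can be sketched as below: the Z statistic with the standard error built from the sample standard deviation, as described above. The p-value and confidence interval additionally need a normal CDF/quantile, which fastmath supplies; `z-stat-sketch` is a hypothetical helper.

```clojure
;; Sketch only: one-sample Z statistic against hypothesized mean mu.
(defn z-stat-sketch
  [xs mu]
  (let [n (count xs)
        mean (/ (reduce + xs) (double n))
        variance (/ (reduce + (map #(Math/pow (- % mean) 2.0) xs))
                    (dec n))                     ; sample variance (n-1)
        stderr (Math/sqrt (/ variance n))]       ; standard error of the mean
    {:estimate mean :stderr stderr :z (/ (- mean mu) stderr)}))

(:z (z-stat-sketch [4 5 6 5 4 6] 5.0))
;; => 0.0  (sample mean equals the hypothesized mu)
```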


z-test-two-samples

(z-test-two-samples xs ys)
(z-test-two-samples xs
                    ys
                    {:keys [paired? equal-variances?]
                     :or {paired? false equal-variances? false}
                     :as params})

Performs a two-sample Z-test to compare the means of two independent or paired samples.

This test assesses the null hypothesis that the difference between the population means is equal to mu (default 0). It typically assumes known population variances or relies on large sample sizes where sample variances provide good estimates. This implementation calculates the standard error using the provided sample variances.

Parameters:

  • xs (seq of numbers): The first sample.
  • ys (seq of numbers): The second sample.
  • params (map, optional): Options map:
    • :alpha (double, default 0.05): Significance level for the confidence interval.
    • :sides (keyword, default :two-sided): Specifies the alternative hypothesis.
      • :two-sided (default): The true difference in means is not equal to mu.
      • :one-sided-greater: The true difference in means (mean(xs) - mean(ys)) is greater than mu.
      • :one-sided-less: The true difference in means (mean(xs) - mean(ys)) is less than mu.
    • :mu (double, default 0.0): The hypothesized difference in means under the null hypothesis.
    • :paired? (boolean, default false): If true, performs a paired Z-test by applying z-test-one-sample to the differences between paired observations in xs and ys (requires xs and ys to have the same length). If false, performs a two-sample test assuming independence.
    • :equal-variances? (boolean, default false): Used only when paired? is false. If true, assumes population variances are equal and calculates a pooled standard error. If false, calculates the standard error without assuming equal variances (Welch's approach adapted for Z-test). This affects the standard error calculation but the standard normal distribution is still used for inference.

Returns a map containing:

  • :z: The calculated Z-statistic.
  • :stat: Alias for :z.
  • :p-value: The p-value associated with the Z-statistic and the specified :sides.
  • :confidence-interval: Confidence interval for the true difference in means.
  • :estimate: The observed difference between sample means (mean(xs) - mean(ys)).
  • :n: Sample sizes as [count xs, count ys].
  • :nx: Sample size of xs.
  • :ny: Sample size of ys.
  • :estimated-mu: The observed sample means as [mean xs, mean ys].
  • :mu: The hypothesized difference under the null hypothesis.
  • :stderr: The standard error of the difference between the means.
  • :alpha: Significance level used.
  • :sides: Alternative hypothesis side used.
  • :test-type: Alias for :sides.
  • :paired?: Boolean indicating if a paired test was performed.
  • :equal-variances?: Boolean indicating the assumption used for standard error calculation (if unpaired).

See also t-test-two-samples for smaller samples or when population variances are unknown.
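The unpaired, unequal-variances case can be sketched as below: a Welch-style standard error for the difference in means, matching the `:equal-variances? false` description above. `z-two-sample-sketch` is a hypothetical helper, not the fastmath implementation.

```clojure
;; Sketch only: two-sample Z statistic without the equal-variances assumption.
(defn z-two-sample-sketch
  [xs ys mu]
  (let [mean (fn [s] (/ (reduce + s) (double (count s))))
        svar (fn [s] (let [m (mean s)]          ; sample variance (n-1)
                       (/ (reduce + (map #(Math/pow (- % m) 2.0) s))
                          (dec (count s)))))
        diff (- (mean xs) (mean ys))
        stderr (Math/sqrt (+ (/ (svar xs) (count xs))
                             (/ (svar ys) (count ys))))]
    {:estimate diff :stderr stderr :z (/ (- diff mu) stderr)}))

(:z (z-two-sample-sketch [1 2 3] [1 2 3] 0.0))
;; => 0.0  (identical samples, no difference in means)
```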

