scicloj.metamorph.ml.design-matrix

Design matrix construction for machine learning pipelines.

This namespace provides utilities to transform datasets into numeric design matrices suitable for machine learning models. It supports deriving new features, transforming existing columns, managing target variables, and expanding complex column types (arrays, maps).

Main Entry Point:

  • create-design-matrix: Transform a dataset into a design matrix with custom specs

Design Matrix Specification Syntax:

Column specifications use [column-name transformation] pairs where:

  • Transformations are Clojure expressions (quoted with ')
  • Expressions can reference column names directly as symbols
  • Expressions are evaluated in order, so a later expression can use a column produced by an earlier one
  • Non-listed columns are removed from the output

Shorthand Syntax:

  • :column-name Keeps column unchanged (identity function)
  • [nil '(+ a b)] Auto-generates column name for derived column
  • ['(+ a b)] Same as above
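
The shorthand forms above can be read as abbreviations of full [column-name transformation] pairs. The following is a minimal sketch of that expansion in plain Clojure, not the library's implementation: `normalize-spec` is a hypothetical helper, and the `:derived-N` naming scheme is a placeholder, since the format of the auto-generated names is not stated here.

```clojure
;; Illustrative only: `normalize-spec` and the :derived-N names are
;; assumptions made for this sketch, not part of the library.
(defn normalize-spec
  "Expands one shorthand spec into a [column-name expression] pair."
  [idx spec]
  (cond
    ;; :column-name -> keep the column unchanged
    (keyword? spec)
    [spec 'identity]

    ;; ['(+ a b)] -> no column name supplied
    (and (vector? spec) (= 1 (count spec)))
    [(keyword (str "derived-" idx)) (first spec)]

    ;; [nil '(+ a b)] -> column name explicitly nil
    (and (vector? spec) (nil? (first spec)))
    [(keyword (str "derived-" idx)) (second spec)]

    ;; [column-name expression] -> already a full pair
    :else spec))

(def normalized
  (vec (map-indexed normalize-spec
                    [:petal-length
                     [nil '(+ a b)]
                     ['(* a b)]
                     [:ratio '(/ a b)]])))
```

After expansion every entry has the same two-element shape, which is why the two bracketed shorthands are described as equivalent.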

Available Aliases (no qualification needed):

  • ds - tech.v3.dataset
  • tc - tablecloth.api
  • tcc - tablecloth.column.api
  • All of clojure.core

Features:

  • Derives new columns from existing data
  • Expands array and map columns into separate columns
  • Automatically converts categorical columns to numbers
  • Sets inference target(s) for supervised learning
  • Chains transformations in dependency order

Example:

(create-design-matrix
  iris-data
  [:species]                          ; target column
  [[:petal-length identity]           ; keep as-is
   [:sepal-ratio '(/ :sepal-length    ; derive new feature
                     :sepal-width)]])

Limitations:

  • Does not automatically expand categorical variables (specify manually)
  • For linear regression, fastmath/ols offers a :transformer option using R formulas
  • Design matrix approach is more flexible but less compact than R formula syntax

See also: fastmath.ml/lm for linear regression with formula-based transformations

create-design-matrix (clj)

(create-design-matrix ds targets-specs features-specs)

Converts the given dataset into a full numeric dataset.

  • ds is the tech.v3.dataset to transform
  • targets-specs are the specifications for transforming the target variables
  • features-specs are the specifications for transforming the features

The 'spec' can express several types of dataset transformations in a compact way:

  • add new derived columns
  • remove columns
  • rename columns
  • convert columns to categorical
  • set inference target

Column specs are generally given as pairs of [colname function].

The function must be given as a list (quoted with ') and can refer to column names.

Specs are evaluated from top to bottom and can refer to each other.

Columns that are not listed are removed.
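
The evaluation rules above (top-to-bottom evaluation, references between specs, removal of unlisted columns) can be modelled with a small sketch in plain Clojure. It is a conceptual model only and does not reflect how the library evaluates specs internally: rows are plain maps, column references are written as keywords as in the example in this docstring, function symbols are resolved in clojure.core, and `eval-expr` and `apply-specs` are hypothetical names.

```clojure
;; Conceptual model only: `eval-expr` and `apply-specs` are hypothetical
;; helpers, and the row-by-row evaluation is an assumption of this sketch.
(defn eval-expr
  "Evaluates a quoted expression against one row (column name -> value).
  Keywords are looked up in the row; symbols resolve in clojure.core."
  [row expr]
  (cond
    (keyword? expr) (get row expr)
    (seq? expr)     (let [[f & args] expr]
                      (apply (ns-resolve 'clojure.core f)
                             (map #(eval-expr row %) args)))
    :else           expr))

(defn apply-specs
  "Applies [column-name expression] specs in order to a vector of rows.
  Each resulting row keeps only the listed columns."
  [rows specs]
  (mapv (fn [row]
          (let [full (reduce (fn [r [col expr]]
                               ;; later specs see columns added earlier
                               (assoc r col (eval-expr r expr)))
                             row
                             specs)]
            (select-keys full (map first specs))))
        rows))

(def rows [{:a 1 :b 2 :c 3 :unused 0}
           {:a 4 :b 5 :c 6 :unused 0}])

(def result
  (apply-specs rows [[:sum  '(+ :a :b :c)]
                     [:mean '(/ :sum 3)]]))
```

Here `:mean` depends on `:sum`, which only works because `:sum` is listed first, and the source columns `:a`, `:b`, `:c`, and `:unused` are dropped because no spec lists them.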

Special syntax:

  • :a-column keeps the column as-is (calls the identity fn)
  • [nil '(+ a b)] or ['(+ a b)] auto-generates the column name

Symbols from clojure.core can be used without qualification. Other functions must be fully qualified, unless they are reached through one of the following aliases:

  • ds (tech.v3.dataset)
  • tc (tablecloth.api)
  • tcc (tablecloth.column.api)

Example:

(dm/create-design-matrix
  ds
  [:y]
  [[:sum '(+ :a :b :c)]])

This will:

  • set the inference target to :y
  • create a new derived variable :sum, the sum of :a, :b, and :c
  • remove all columns except :y and :sum

This covers a range of cases, but is not as complete as R formulae. In particular, it does not automatically expand categorical variables, although these can be specified manually.

See design_matrix_test.clj for more examples.

For the model type :fastmath/ols (linear regression), arbitrary 'row transformations' can be expressed in a different way, using the :transformer option; see the fastmath.ml/lm documentation.
