Design matrix construction for machine learning pipelines.
This namespace provides utilities to transform datasets into numeric design
matrices suitable for machine learning models. It supports deriving new features,
transforming existing columns, managing target variables, and expanding complex
column types (arrays, maps).
Main Entry Point:
- `create-design-matrix`: Transform a dataset into a design matrix with custom specs
Design Matrix Specification Syntax:
Column specifications use [column-name transformation] pairs where:
- Transformations are Clojure expressions (quoted with ')
- Expressions can reference column names directly as symbols
- Expressions are evaluated in order and can chain
- Non-listed columns are removed from the output
Shorthand Syntax:
- :column-name Keeps column unchanged (identity function)
- [nil '(+ a b)] Auto-generates column name for derived column
- ['(+ a b)] Same as above
Available Aliases (no qualification needed):
- `ds` - tech.v3.dataset
- `tc` - tablecloth.api
- `tcc` - tablecloth.column.api
- All of clojure.core
Features:
- Derives new columns from existing data
- Expands array and map columns into separate columns
- Automatically converts categorical columns to numbers
- Sets inference target(s) for supervised learning
- Chains transformations in dependency order
Example:
    (create-design-matrix
      iris-data
      [:species]                        ; target column
      [[:petal-length identity]         ; keep as-is
       [:sepal-ratio '(/ :sepal-length  ; derive new feature
                         :sepal-width)]])
Limitations:
- Does not automatically expand categorical variables (specify manually)
- For linear regression, the :fastmath/ols model type offers a :transformer option using R formulas
- Design matrix approach is more flexible but less compact than R formula syntax
See also: `fastmath.ml/lm` for linear regression with formula-based transformations

(create-design-matrix ds targets-specs features-specs)
Converts the given dataset into a full numeric dataset.
* `ds` is the tech.v3.dataset to transform
* `targets-specs` are the specifications for how to transform the target variables
* `features-specs` are the specifications for how to transform the features
The 'spec' can express several types of dataset transformations in a compact way:
- add new derived columns
- remove columns
- rename columns
- convert columns to categorical
- set inference target
Column specs are generally given as pairs of [colname function].
The function must be given as a quoted list (using ') and can refer to column names.
Specs are evaluated from top to bottom, and later specs can refer to columns created by earlier ones.
Columns that are not listed are removed.
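The evaluation rules above (top-to-bottom order, later specs seeing earlier results, unlisted columns dropped) can be illustrated with a small stand-alone sketch. It uses only clojure.core and clojure.walk, works on a plain map of column vectors rather than a real dataset, and evaluates each expression row by row. The helpers `eval-spec` and `apply-specs` are hypothetical names made up for this illustration; they are not part of this namespace and do not reflect how `create-design-matrix` is actually implemented.

```clojure
(ns design-matrix-sketch
  (:require [clojure.walk :as walk]))

;; Hypothetical helper (not library code): evaluate one quoted expression
;; row by row against a map of column-name -> vector of values.
;; Assumes a non-empty map whose vectors all have the same length.
(defn eval-spec [columns expr]
  (let [n (count (val (first columns)))]
    (mapv (fn [i]
            ;; replace every keyword that names a column with that
            ;; column's value in row i, then evaluate the expression
            (eval (walk/postwalk
                   (fn [x]
                     (if (and (keyword? x) (contains? columns x))
                       (nth (columns x) i)
                       x))
                   expr)))
          (range n))))

;; Hypothetical helper (not library code): apply [colname expr] pairs
;; from top to bottom, so later specs can use earlier results, then
;; keep only the columns that were listed.
(defn apply-specs [columns specs]
  (let [all (reduce (fn [cols [col-name expr]]
                      (assoc cols col-name (eval-spec cols expr)))
                    columns
                    specs)]
    (select-keys all (map first specs))))

(def data {:a [1 2] :b [10 20] :c [100 200] :y [0 1]})

(apply-specs data [[:sum '(+ :a :b :c)]
                   [:double-sum '(* 2 :sum)]])
;; => {:sum [111 222], :double-sum [222 444]}
```

The second spec refers to `:sum`, which only exists because the first spec ran before it; `:a`, `:b`, `:c` and `:y` are absent from the result because they were not listed.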
Special syntax:
- :a-column keeps column as-is (calls `identity` fn)
- [nil '(+ a b)] or ['(+ a b)] auto-generates the column name
The following aliases can be used as part of the spec
(other functions need to be fully qualified).
Symbols from clojure.core can be used without qualification.
- ds (tech.v3.dataset)
- tc (tablecloth.api)
- tcc (tablecloth.column.api)
Example:
    (dm/create-design-matrix
      ds
      [:y]
      [[:sum '(+ :a :b :c)]])
This will:
- set the inference target to :y
- create a new derived column :sum, the sum of :a, :b and :c
- remove all columns except :y and :sum
This covers a range of cases, but is not as complete as `R formulae`.
In particular, it does not automatically expand categorical variables,
but these can be specified manually.
See `design_matrix_test.clj` for more examples.
(For the model type :fastmath/ols (linear regression), a different way
of expressing arbitrary 'row transformations' is supported via the :transformer option;
see the `fastmath.ml/lm` documentation.)