Design matrix construction for machine learning pipelines.

This namespace provides utilities to transform datasets into numeric design matrices suitable for machine learning models. It supports deriving new features, transforming existing columns, managing target variables, and expanding complex column types (arrays, maps).

Main Entry Point:
- `create-design-matrix`: Transform a dataset into a design matrix with custom specs

Design Matrix Specification Syntax:
Column specifications use [column-name transformation] pairs where:
- Transformations are Clojure expressions (quoted with ')
- Expressions can reference column names directly as symbols
- Expressions are evaluated in order and can chain
- Non-listed columns are removed from the output

Shorthand Syntax:
- :column-name Keeps column unchanged (identity function)
- [nil '(+ a b)] Auto-generates column name for derived column
- ['(+ a b)] Same as above

Available Aliases (no qualification needed):
- `ds` - tech.v3.dataset
- `tc` - tablecloth.api
- `tcc` - tablecloth.column.api
- All of clojure.core

Features:
- Derives new columns from existing data
- Expands array and map columns into separate columns
- Automatically converts categorical columns to numbers
- Sets inference target(s) for supervised learning
- Chains transformations in dependency order

Limitations:
- Does not automatically expand categorical variables (specify manually)
- Design matrix approach is more flexible but less compact than R formula syntax

See also: `fastmath.ml/lm` for linear regression with formula-based transformations
(create-design-matrix ds targets-specs features-specs)

Converts the given dataset into a full numeric dataset.
* `ds` is the tech.v3.dataset to transform
* `targets-specs` are the specifications for transforming the target variables
* `features-specs` are the specifications for transforming the features
The 'spec' can express several types of dataset transformations in a compact way:
- add new derived columns
- remove columns
- rename columns
- convert columns to categorical
- set inference target
Column specs are generally given as pairs of [colname function].
The function needs to be given as a list (quoted with '), and can refer to column names.
Specs are evaluated from top to bottom, and can refer to each other.
Columns that are not listed are removed.
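The evaluation rules above (quoted expressions, top-to-bottom order, later specs seeing earlier results, unlisted columns dropped) can be illustrated with a small plain-Clojure sketch. This is not the library's implementation: `derive-column` and `apply-specs` are hypothetical helpers, the "dataset" is just a map of column keyword to vector, and expressions are applied row by row for simplicity, which may differ from how `create-design-matrix` evaluates them.

```clojure
;; Hypothetical sketch, not the library's code: a "dataset" here is a plain
;; map of column keyword -> vector of values.

(defn derive-column
  "Evaluates the quoted expression `expr` once per row. Every existing
  column is bound to a symbol of the same name (:a -> a)."
  [columns expr]
  (let [names (vec (keys columns))
        args  (mapv (comp symbol name) names)
        ;; Build (fn [a b ...] expr) and compile it with eval.
        f     (eval (list 'fn args expr))]
    (apply mapv f (map columns names))))

(defn apply-specs
  "Applies [col-name expr] pairs from top to bottom, so later specs can
  refer to earlier results. Only the listed columns are kept."
  [columns specs]
  (let [all (reduce (fn [cols [col-name expr]]
                      (assoc cols col-name (derive-column cols expr)))
                    columns
                    specs)]
    (select-keys all (map first specs))))

(def data {:a [1 2 3] :b [10 20 30] :c [100 200 300]})

(def result
  (apply-specs data
               [[:sum    '(+ a b c)]
                [:double '(* 2 sum)]]))   ; refers to the spec above it

;; result => {:sum [111 222 333], :double [222 444 666]}
```

Note that :a, :b and :c are absent from `result` because no spec lists them.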
Special syntax:
- :a-column keeps the column as-is (calls the `identity` fn)
- [nil '(+ a b)] or ['(+ a b)] autogenerates the column name
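One way to read the shorthand forms is as abbreviations that expand into full [colname function] pairs. The sketch below shows such an expansion in plain Clojure. `normalize-spec` is a hypothetical helper, and the generated column name (the printed expression turned into a keyword) is purely an assumption, since the naming scheme the library uses is not stated here.

```clojure
;; Hypothetical sketch: expand shorthand specs into [col-name expr] pairs.

(defn normalize-spec
  [spec]
  (cond
    ;; :a-column => [:a-column (identity a-column)]
    (keyword? spec)
    [spec (list 'identity (symbol (name spec)))]

    ;; ['(+ a b)] is treated the same as [nil '(+ a b)]
    (and (vector? spec) (= 1 (count spec)))
    (normalize-spec [nil (first spec)])

    ;; [nil expr] => generate a name (ASSUMED scheme: the printed expression)
    (and (vector? spec) (nil? (first spec)))
    [(keyword (pr-str (second spec))) (second spec)]

    ;; Already a full [col-name expr] pair
    :else spec))

(def normalized
  (mapv normalize-spec
        [:a
         [nil '(+ a b)]
         ['(+ a b)]
         [:sum '(+ a b)]]))
```

Under these assumptions the second and third entries of `normalized` are identical, which mirrors the statement that both forms autogenerate the column name.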
The following aliases can be used as part of the spec (other functions need to be fully qualified):
- ds (tech.v3.dataset)
- tc (tablecloth.api)
- tcc (tablecloth.column.api)
Symbols from clojure.core can be used without qualification.
Example:
```
(dm/create-design-matrix
  ds
  [:y]
  [[:sum '(+ :a :b :c)]])
```
This will:
- set the inference target to :y
- create a new derived column :sum, the sum of :a, :b and :c
- remove all columns except :y and :sum
This covers a range of cases, but is not as complete as `R formulae`.
In particular, it does not handle automatic expansion of categorical variables,
but these can be specified manually.
See `design_matrix_test.clj` for more examples.
(For model type :fastmath/ols, i.e. linear regression, a different way of
expressing arbitrary 'row transformations' is supported via the :transformer option;
see the `fastmath.ml/lm` documentation.)