This namespace defines a set of functions that can be applied to data sets to modify the
dataset in some way: transforming nominal attributes into binary attributes, removing
attributes, etc.
There are a number of ways to use the filtering API. The most straightforward and
idiomatic Clojure way is to use the provided filter fns:
;; ds is the dataset
(def ds (make-dataset :test [:a :b {:c [:g :m]}]
                      [[1 2 :g]
                       [2 3 :m]
                       [4 5 :g]]))
(def filtered-ds
  (-> ds
      (add-attribute {:type :nominal, :column 1, :name "pet", :labels ["dog" "cat"]})
      (remove-attributes {:attributes [:a :c]})))
The above functions rely on lower-level fns that create and apply the filters, which you may
also use if you need more control over the actual filter objects:
(def filter (make-filter :remove-attributes {:dataset-format ds :attributes [:a :c]}))
;; We apply the filter to the original data set and obtain the new one
(def filtered-ds (filter-apply filter ds))
The previous sample of code could be rewritten with the make-apply-filter function:
(def filtered-ds (make-apply-filter :remove-attributes {:attributes [:a :c]} ds))

(add-attribute ds)
(add-attribute ds attributes)

Mapping of Weka's attribute types from clj-ml keywords to the -T flag's representation.
(clj-streamable ds)
(clj-streamable ds attributes)

(deffilter filter-name)
Defines the filter's fn that creates a fn to make and apply the filter.
Mapping of clj-ml keywords to actual Weka classes.
(filter-apply filter dataset)
Filters an input dataset using the provided filter and generates an output dataset. The first argument is the filter and the second is the dataset to which the filter should be applied.
(make-apply-filter kind options dataset)
Creates a new filter with the provided options and applies it to the provided dataset. The :dataset-format attribute used to make the filter is set to the dataset passed as an argument if no other value is provided. Applying this filter is equivalent to calling make-filter followed by filter-apply.
(make-apply-filters filter-options dataset)
Creates new filters with the provided options and applies them to the provided dataset. The :dataset-format attribute used to make each filter is set to the dataset passed as an argument if no other value is provided.
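A minimal sketch of chaining two filters in one call. It assumes, which the description above does not state, that filter-options is a sequence of [kind options] pairs applied in order; the dataset and filter options are borrowed from the namespace example.

```clojure
(require '[clj-ml.data :refer [make-dataset]]
         '[clj-ml.filters :refer [make-apply-filters]])

;; Same toy dataset as in the namespace documentation.
(def ds (make-dataset :test [:a :b {:c [:g :m]}]
                      [[1 2 :g]
                       [2 3 :m]
                       [4 5 :g]]))

;; ASSUMPTION: filter-options is a sequence of [kind options] pairs,
;; applied left to right. :dataset-format is left out so that it
;; defaults to the dataset passed as the last argument.
(def filtered-ds
  (make-apply-filters
    [[:add-attribute {:type :nominal, :column 1, :name "pet", :labels ["dog" "cat"]}]
     [:remove-attributes {:attributes [:a :c]}]]
    ds))
```

If the assumption holds, this should behave like the threaded add-attribute / remove-attributes example in the namespace documentation.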
(make-filter kind options)
Creates a filter for the provided attributes format. The first argument must be a keyword
identifying the kind of filter to generate.
Currently the following filters are supported:
- :supervised-discretize
- :unsupervised-discretize
- :pki-unsupervised-discretize
- :supervised-nominal-to-binary
- :unsupervised-nominal-to-binary
- :numeric-to-nominal
- :string-to-word-vector
- :add-attribute
- :reorder-attributes
- :remove-attributes
- :remove-percentage
- :remove-range
- :remove-useless-attributes
- :resample-unsupervised
- :resample-supervised
- :select-append-attributes
- :replace-missing-values
- :project-attributes
- :clj-streamable
- :clj-batch
The second parameter is a map of attributes for the filter.
All filters require a :dataset-format parameter:
- :dataset-format
The dataset where the filter is going to be applied or a
description of the format of its attributes. Sample value:
dataset, (dataset-format dataset)
An example of usage:
(make-filter :remove-attributes {:attributes [0 1] :dataset-format dataset})
Documentation for the different filters:
* :supervised-discretize
An instance filter that discretizes a range of numeric attributes
in the dataset into nominal attributes. Discretization is by Fayyad
& Irani's MDL method (the default).
Parameters:
- :attributes
Index of the attributes to be discretized, sample value: [0,4,6]
The attributes may also be specified by name: [:some-name, "another-name"]
- :invert
Invert matching sense of the columns, sample value: true
- :kononenko
Use Kononenko's MDL criterion, sample value: true
* :unsupervised-discretize
Unsupervised version of the discretize filter. Discretization is by simple
binning.
Parameters:
- :attributes
Index of the attributes to be discretized, sample value: [0,4,6]
The attributes may also be specified by name: [:some-name, "another-name"]
- :unset-class
Does not take class attribute into account for the application
of the filter, sample value: true
- :binary
- :equal-frequency
Use equal frequency instead of equal width discretization, sample
value: true
- :optimize
Optimize the number of bins using a leave-one-out estimate of
estimated entropy. Ignores the :binary attribute. Sample value: true
- :number-bins
Defines the number of bins to divide the numeric attributes into,
sample value: 3
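A small sketch of the parameters above in use, assuming the toy dataset from the namespace documentation. It relies only on make-dataset and make-apply-filter as described on this page; the var name binned-ds is illustrative.

```clojure
(require '[clj-ml.data :refer [make-dataset]]
         '[clj-ml.filters :refer [make-apply-filter]])

(def ds (make-dataset :test [:a :b {:c [:g :m]}]
                      [[1 2 :g]
                       [2 3 :m]
                       [4 5 :g]]))

;; Discretize the numeric attributes :a and :b into 3 bins,
;; using equal-frequency rather than equal-width binning.
(def binned-ds
  (make-apply-filter :unsupervised-discretize
                     {:attributes [:a :b]
                      :number-bins 3
                      :equal-frequency true}
                     ds))
```

After the filter runs, the selected attributes are expected to be nominal rather than numeric, while :c is left as it was.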
* :pki-unsupervised-discretize
Discretizes numeric attributes using equal frequency binning, where the number of bins is
equal to the square root of the number of non-missing values.
Parameters:
- :attributes
Index of the attributes to be discretized, sample value: [0,4,6]
The attributes may also be specified by name: [:some-name, "another-name"]
- :unset-class
Does not take class attribute into account for the application
of the filter, sample value: true
- :binary
* :supervised-nominal-to-binary
Converts nominal attributes into binary numeric attributes. An attribute with k values
is transformed into k binary attributes if the class is nominal.
Parameters:
- :also-binary
Sets if binary attributes are to be coded as nominal ones, sample value: true
- :for-each-nominal
One binary attribute is created for each nominal value, not only when the
nominal attribute has more than two values.
* :unsupervised-nominal-to-binary
Unsupervised version of the :nominal-to-binary filter
Parameters:
- :attributes
Index of the attributes to be binarized. Sample value: [0 1 2]
The attributes may also be specified by name: [:some-name, "another-name"]
- :also-binary
Sets if binary attributes are to be coded as nominal ones, sample value: true
- :for-each-nominal
One binary attribute is created for each nominal value, not only when the
nominal attribute has more than two values. Sample value: true
* :numeric-to-nominal
Transforms numeric attributes into nominal ones.
Parameters:
- :attributes
Index of the attributes to be transformed. Sample value: [0 1 2]
The attributes may also be specified by name: [:some-name, "another-name"]
- :invert
Invert the selection of the columns. Sample value: true
* :string-to-word-vector
TODO
* :add-attribute
Adds a new attribute to the dataset. The new attribute will contain all missing values.
Parameters:
- :type
Type of the new attribute. Valid options: :numeric, :nominal, :string, :date. Defaults to :numeric.
- :name
Name of the new attribute.
- :column
Index of where to insert the attribute, indexed by 0. You may also pass in "first" and "last".
Sample values: "first", 0, 1, "last"
The default is: "last"
- :labels
Vector of valid nominal values. This only applies when the type is :nominal.
- :format
The format of the date values (see ISO-8601). This only applies when the type is :date.
The default is: "yyyy-MM-dd'T'HH:mm:ss"
* :reorder-attributes
Reorder attributes.
Parameters:
- :attributes
New ordering of the attributes. Sample value: ["2-last" "1"],
which moves the attribute currently at position 1 to the end.
Be sure to quote all attributes so that number indexes are not
automatically incremented by 1 (Weka indexes start at 1).
* :remove-attributes
Remove the columns identified by the provided attributes from the data set.
Parameters:
- :attributes
Index of the attributes to remove. Sample value: [0 1 2]
The attributes may also be specified by name: [:some-name, "another-name"]
* :remove-useless-attributes
Remove attributes that do not vary at all or that vary too much. All constant
attributes are deleted automatically, along with any that exceed the maximum percentage
of variance parameter. The maximum variance test is only applied to nominal attributes.
Parameters:
- :max-variance
Maximum variance percentage allowed (default 99).
Note: percentage, not decimal. e.g. 89 not 0.89
If you pass in a decimal Weka silently sets it to 0.0.
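A brief sketch of the percentage convention described above. The dataset is made up for illustration: its :constant column never varies, which is the kind of attribute this filter is meant to drop.

```clojure
(require '[clj-ml.data :refer [make-dataset]]
         '[clj-ml.filters :refer [make-apply-filter]])

;; Hypothetical toy dataset: :constant holds the same value in every row.
(def ds (make-dataset :test [:a :constant {:c [:g :m]}]
                      [[1 7 :g]
                       [2 7 :m]
                       [4 7 :g]]))

;; Pass the threshold as a percentage (89), not as a fraction (0.89).
(def cleaned-ds
  (make-apply-filter :remove-useless-attributes {:max-variance 89} ds))
```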
* :resample-unsupervised
"Produces a random subsample of a dataset using either sampling
with replacement or without replacement. The original dataset
must fit entirely in memory. The number of instances in the
generated dataset may be specified. When used in batch mode,
subsequent batches are NOT resampled." -- from Weka JavaDoc.
Parameters:
- :seed
Random number seed (integer)
- :size-percent
"The size of the output dataset, as a percentage of
the input dataset (default 100)" (integer)
- :no-replacement
Sample without replacement when true; default is false, i.e., with replacement (boolean)
- :invert
Inverts the selection; can only be true if :no-replacement is true (boolean)
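A minimal sketch of drawing a half-size sample without replacement, using only the parameters listed above. The four-row dataset and the seed value are arbitrary choices for illustration.

```clojure
(require '[clj-ml.data :refer [make-dataset]]
         '[clj-ml.filters :refer [make-apply-filter]])

;; Hypothetical four-row dataset.
(def ds (make-dataset :test [:a :b]
                      [[1 2]
                       [2 3]
                       [4 5]
                       [6 7]]))

;; Keep 50% of the rows, sampling without replacement.
;; The fixed seed makes the draw repeatable between runs.
(def sample-ds
  (make-apply-filter :resample-unsupervised
                     {:seed 42
                      :size-percent 50
                      :no-replacement true}
                     ds))
```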
* :resample-supervised
"Produces a random subsample of a dataset using either sampling
with replacement or without replacement. The original dataset
must fit entirely in memory. The number of instances in the
generated dataset may be specified. The dataset must have a
nominal class attribute. If not, use the unsupervised
version. The filter can be made to maintain the class
distribution in the subsample, or to bias the class distribution
toward a uniform distribution. When used in batch mode (i.e. in
the FilteredClassifier), subsequent batches are NOT resampled."
-- from Weka JavaDoc.
Parameters:
- :seed
Random number seed (integer)
- :size-percent
"The size of the output dataset, as a percentage of
the input dataset (default 100)" (integer)
- :bias
"Bias factor towards uniform class distribution. 0 =
distribution in input data -- 1 = uniform
distribution. (default 0)" (0 or 1)
- :no-replacement
Sample without replacement when true; default is false, i.e., with replacement (boolean)
- :invert
Inverts the selection; can only be true if :no-replacement is true (boolean)
* :stratified-remove-folds-supervised
"This filter takes a dataset and outputs a specified fold for cross validation.
If you do not want the folds to be stratified use the unsupervised version."
-- from Weka JavaDoc
Parameters:
- :num-folds
Specifies number of folds dataset is split into. (default 10)
- :fold
Specifies which fold is selected. (default 1)
- :seed
Specifies random number seed. (default 0, no randomizing)
- :invert
Specifies if inverse of selection is to be output.
* :select-append-attributes
Append a copy of the selected columns at the end of the dataset.
Parameters:
- :attributes
Index of the attributes. Sample value: [1 2 3]
The attributes may also be specified by name: [:some-name, "another-name"]
- :invert
Invert the selection of the columns. Sample value: true
* :replace-missing-values
Replaces all missing values for nominal and numeric attributes
in a dataset with the modes and means from the training data.
Parameters:
- :unset-class-temporarily
Unsets the class index temporarily before the filter is
applied to the data. Sample value: true; default: false
* :project-attributes
Project some columns from the provided dataset
Parameters:
- :invert
Invert the selection of columns. Sample value: true
* :clj-streamable
Allows you to create a custom streamable filter with clojure functions.
A streamable filter is appropriate when you don't need to iterate over
the entire dataset before processing it.
Parameters:
- :process
This function will receive individual weka.core.Instance objects (rows
of the dataset) and should return a newly processed Instance. The
actual Instance is passed in and you may change it directly. However, a better
approach is to copy the Instance with the copy method or Instance
constructor and return a modified version of the copy.
- :determine-dataset-format
This function will receive the dataset's weka.core.Instances object with
no actual Instance objects (i.e. just the format encoded in the attributes).
You must return an Instances object that contains the new format of the
filtered dataset. Passing this fn is optional. If you are not changing
the format of the dataset, omitting this function will use the
current format.
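A minimal sketch of a :process function that follows the copy-then-modify advice above. The helper name double-first-value and the two-column dataset are illustrative. :determine-dataset-format is omitted because the format does not change.

```clojure
(require '[clj-ml.data :refer [make-dataset]]
         '[clj-ml.filters :refer [make-apply-filter]])
(import '(weka.core Instance))

(def ds (make-dataset :test [:a :b]
                      [[1 2]
                       [2 3]
                       [4 5]]))

(defn double-first-value
  "Returns a modified copy of the row, leaving the original untouched."
  [^Instance instance]
  (let [^Instance row (.copy instance)]
    ;; Attribute indexes are 0-based on the Weka Instance API.
    (.setValue row (int 0) (* 2.0 (.value instance (int 0))))
    row))

;; Each row is handed to the fn on its own, so no full pass is needed.
(def doubled-ds
  (make-apply-filter :clj-streamable {:process double-first-value} ds))
```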
* :clj-batch
Allows you to create a custom batch filter with clojure functions.
A batch filter is appropriate when you need to iterate over
the entire dataset before processing it.
Parameters:
- :process
This function will receive the entire dataset as a weka.core.Instances
object. A processed Instances object should be returned with the
new Instance objects added to it. The format of the dataset (Instances)
that is returned from this will be returned from the filter (see below).
- :determine-dataset-format
This function will receive the dataset's weka.core.Instances object with
no actual Instance objects (i.e. just the format encoded in the attributes).
You must return an Instances object that contains the new format of the
filtered dataset. Passing this fn is optional.
For many batch filters you need to process the entire dataset to determine
the correct format (e.g. filters that operate on nominal attributes). For
this reason the clj-batch filter will *always* use the format of the dataset
that the process fn outputs. In other words, if you need to operate on the
entire dataset before determining the format then this should be done in the
process fn and nothing needs to be passed for this fn.
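A small sketch of a batch :process function, chosen because it needs a statistic of the whole dataset (the mean of the first attribute) before it can decide which rows to keep. The helper name keep-rows-above-mean and the dataset are illustrative; the Weka calls used are the standard Instances and Instance methods.

```clojure
(require '[clj-ml.data :refer [make-dataset]]
         '[clj-ml.filters :refer [make-apply-filter]])
(import '(weka.core Instance Instances))

(def ds (make-dataset :test [:a :b]
                      [[1 2]
                       [2 3]
                       [4 5]]))

(defn keep-rows-above-mean
  "Keeps only the rows whose first value exceeds the column mean."
  [^Instances instances]
  (let [mean   (.meanOrMode instances (int 0))
        ;; Empty Instances object with the same format as the input.
        result (Instances. instances (int 0))]
    (doseq [i (range (.numInstances instances))]
      (let [^Instance row (.instance instances (int i))]
        (when (> (.value row (int 0)) mean)
          (.add result row))))
    result))

(def above-mean-ds
  (make-apply-filter :clj-batch {:process keep-rows-above-mean} ds))
```

No :determine-dataset-format is passed here, since the filter uses the format of whatever the process fn returns.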
For examples on how to use the filters, especially the clojure filters, you may
refer to filters_test.clj of clj-ml.

(numeric-to-nominal ds)
(numeric-to-nominal ds attributes)
(pki-unsupervised-discretize ds)
(pki-unsupervised-discretize ds attributes)
(project-attributes ds)
(project-attributes ds attributes)
(random-subset ds)
(random-subset ds attributes)
(remove-attributes ds)
(remove-attributes ds attributes)
(remove-percentage ds)
(remove-percentage ds attributes)
(remove-range ds)
(remove-range ds attributes)
(remove-useless-attributes ds)
(remove-useless-attributes ds attributes)
(reorder-attributes ds)
(reorder-attributes ds attributes)
(replace-missing-values ds)
(replace-missing-values ds attributes)
(resample-supervised ds)
(resample-supervised ds attributes)
(resample-unsupervised ds)
(resample-unsupervised ds attributes)
(select-append-attributes ds)
(select-append-attributes ds attributes)
(stratified-remove-folds-supervised ds)
(stratified-remove-folds-supervised ds attributes)
(string-to-word-vector ds)
(string-to-word-vector ds attributes)
(supervised-discretize ds)
(supervised-discretize ds attributes)
(supervised-nominal-to-binary ds)
(supervised-nominal-to-binary ds attributes)
(unsupervised-discretize ds)
(unsupervised-discretize ds attributes)
(unsupervised-nominal-to-binary ds)
(unsupervised-nominal-to-binary ds attributes)