This namespace defines a set of functions that can be applied to data sets to modify the dataset in some way: transforming nominal attributes into binary attributes, removing attributes etc.
There are several ways to use the filtering API. The most straightforward and idiomatic Clojure way is to use the provided filter fns:

  ;; ds is the dataset
  (def ds (make-dataset :test
                        [:a :b {:c [:g :m]}]
                        [[1 2 :g]
                         [2 3 :m]
                         [4 5 :g]]))

  (def filtered-ds
    (-> ds
        (add-attribute {:type :nominal, :column 1, :name "pet", :labels ["dog" "cat"]})
        (remove-attributes {:attributes [:a :c]})))
The above functions rely on lower-level fns that create and apply the filters. You may use these directly if you need more control over the actual filter objects:

  (def filter (make-filter :remove-attributes {:dataset-format ds :attributes [:a :c]}))

  ;; We apply the filter to the original dataset and obtain the new one
  (def filtered-ds (filter-apply filter ds))
The previous sample of code could be rewritten with the make-apply-filter function:
(def filtered-ds (make-apply-filter :remove-attributes {:attributes [:a :c]} ds))
(add-attribute ds)
(add-attribute ds attributes)
Mapping of Weka's attribute types from clj-ml keywords to the -T flag's representation.
(clj-streamable ds)
(clj-streamable ds attributes)
(deffilter filter-name)
Defines a fn, named after the filter, that makes the filter and applies it to a dataset.
Mapping of clj-ml keywords to actual Weka classes
(filter-apply filter dataset)
Filters an input dataset using the provided filter and generates an output dataset. The first argument is the filter and the second is the dataset to which the filter should be applied.
(make-apply-filter kind options dataset)
Creates a new filter with the provided options and applies it to the provided dataset. If the options do not include a :dataset-format value, it is set to the dataset passed as an argument.
Calling this function is equivalent to calling make-filter followed by filter-apply.
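As a rough illustration of that equivalence, the sketch below builds the same :remove-attributes filter both ways. It assumes make-dataset is referred from clj-ml.data and reuses the toy dataset from the namespace introduction. The var names are illustrative, and Weka's Instances.numAttributes is used only to inspect the results.

```clojure
(require '[clj-ml.data :refer [make-dataset]]
         '[clj-ml.filters :refer [make-filter filter-apply make-apply-filter]])

;; Same toy dataset as in the namespace introduction.
(def ds (make-dataset :test
                      [:a :b {:c [:g :m]}]
                      [[1 2 :g]
                       [2 3 :m]
                       [4 5 :g]]))

;; Two steps: build the filter explicitly, then apply it.
(def remove-a-c
  (make-filter :remove-attributes {:dataset-format ds
                                   :attributes [:a :c]}))
(def two-step (filter-apply remove-a-c ds))

;; One step: :dataset-format is taken from the dataset argument.
(def one-step
  (make-apply-filter :remove-attributes {:attributes [:a :c]} ds))

;; Both results should have the same number of remaining attributes.
(println (.numAttributes two-step) (.numAttributes one-step))
```

The two-step form is mainly useful when the same filter object needs to be kept around, for instance to apply it to more than one dataset with the same format.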
(make-apply-filters filter-options dataset)
Creates new filters with the provided options and applies them to the provided dataset. If the options for a filter do not include a :dataset-format value, it is set to the dataset passed as an argument.
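The shape of filter-options is not spelled out above. One plausible reading, and it is an assumption here, is a sequence of [kind options] pairs applied in order, each filter receiving the output of the previous one. Under that assumption, a call might look like the sketch below, shown next to the same chain written by hand with make-apply-filter.

```clojure
(require '[clj-ml.data :refer [make-dataset]]
         '[clj-ml.filters :refer [make-apply-filter make-apply-filters]])

(def ds (make-dataset :test
                      [:a :b {:c [:g :m]}]
                      [[1 2 :g]
                       [2 3 :m]
                       [4 5 :g]]))

(def pet-options
  {:type :nominal, :column 1, :name "pet", :labels ["dog" "cat"]})

;; ASSUMPTION: filter-options is a sequence of [kind options] pairs,
;; applied from first to last.
(def chained
  (make-apply-filters [[:add-attribute pet-options]
                       [:remove-attributes {:attributes [:a :c]}]]
                      ds))

;; The same chain written by hand; the dataset is the last argument
;; of make-apply-filter, so ->> threads it through.
(def by-hand
  (->> ds
       (make-apply-filter :add-attribute pet-options)
       (make-apply-filter :remove-attributes {:attributes [:a :c]})))

(println (.numAttributes chained) (.numAttributes by-hand))
```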
(make-filter kind options)
Creates a filter for the provided attributes format. The first argument must be a keyword identifying the kind of filter to generate. Currently the following filters are supported:

  - :supervised-discretize
  - :unsupervised-discretize
  - :pki-unsupervised-discretize
  - :supervised-nominal-to-binary
  - :unsupervised-nominal-to-binary
  - :numeric-to-nominal
  - :string-to-word-vector
  - :add-attribute
  - :reorder-attributes
  - :remove-attributes
  - :remove-percentage
  - :remove-range
  - :remove-useless-attributes
  - :resample-unsupervised
  - :resample-supervised
  - :select-append-attributes
  - :replace-missing-values
  - :project-attributes
  - :clj-streamable
  - :clj-batch

The second parameter is a map of attributes for the filter. All filters require a :dataset-format parameter:

  - :dataset-format
      The dataset to which the filter is going to be applied, or a description of the format of its attributes.
      Sample value: dataset, (dataset-format dataset)

An example of usage:

  (make-filter :remove-attributes {:attributes [0 1] :dataset-format dataset})

Documentation for the different filters:

* :supervised-discretize

  An instance filter that discretizes a range of numeric attributes in the dataset into nominal attributes. Discretization is by Fayyad & Irani's MDL method (the default).

  Parameters:

    - :attributes
        Index of the attributes to be discretized. Sample value: [0,4,6]
        The attributes may also be specified by name: [:some-name, "another-name"]
    - :invert
        Invert the matching sense of the columns. Sample value: true
    - :kononenko
        Use Kononenko's MDL criterion. Sample value: true

* :unsupervised-discretize

  Unsupervised version of the discretize filter. Discretization is by simple binning.

  Parameters:

    - :attributes
        Index of the attributes to be discretized. Sample value: [0,4,6]
        The attributes may also be specified by name: [:some-name, "another-name"]
    - :unset-class
        Does not take the class attribute into account when applying the filter. Sample value: true
    - :binary
    - :equal-frequency
        Use equal frequency instead of equal width discretization. Sample value: true
    - :optimize
        Optimize the number of bins using a leave-one-out estimate of the estimated entropy. Ignores the :binary attribute. Sample value: true
    - :number-bins
        Defines the number of bins to divide the numeric attributes into. Sample value: 3

* :pki-unsupervised-discretize

  Discretizes numeric attributes using equal frequency binning, where the number of bins is equal to the square root of the number of non-missing values.

  Parameters:

    - :attributes
        Index of the attributes to be discretized. Sample value: [0,4,6]
        The attributes may also be specified by name: [:some-name, "another-name"]
    - :unset-class
        Does not take the class attribute into account when applying the filter. Sample value: true
    - :binary

* :supervised-nominal-to-binary

  Converts nominal attributes into binary numeric attributes. An attribute with k values is transformed into k binary attributes if the class is nominal.

  Parameters:

    - :also-binary
        Sets if binary attributes are to be coded as nominal ones. Sample value: true
    - :for-each-nominal
        Creates one binary attribute for each nominal value, even when the nominal attribute has only two values.

* :unsupervised-nominal-to-binary

  Unsupervised version of the :supervised-nominal-to-binary filter.

  Parameters:

    - :attributes
        Index of the attributes to be binarized. Sample value: [0 1 2]
        The attributes may also be specified by name: [:some-name, "another-name"]
    - :also-binary
        Sets if binary attributes are to be coded as nominal ones. Sample value: true
    - :for-each-nominal
        Creates one binary attribute for each nominal value, even when the nominal attribute has only two values. Sample value: true

* :numeric-to-nominal

  Transforms numeric attributes into nominal ones.

  Parameters:

    - :attributes
        Index of the attributes to be transformed. Sample value: [0 1 2]
        The attributes may also be specified by name: [:some-name, "another-name"]
    - :invert
        Invert the selection of the columns. Sample value: true

* :string-to-word-vector

  TODO

* :add-attribute

  Adds a new attribute to the dataset. The new attribute will contain all missing values.

  Parameters:

    - :type
        Type of the new attribute. Valid options: :numeric, :nominal, :string, :date. Defaults to :numeric.
    - :name
        Name of the new attribute.
    - :column
        Index of where to insert the attribute, indexed by 0. You may also pass in "first" and "last".
        Sample values: "first", 0, 1, "last". The default is "last".
    - :labels
        Vector of valid nominal values. This only applies when the type is :nominal.
    - :format
        The format of the date values (see ISO-8601). This only applies when the type is :date.
        The default is "yyyy-MM-dd'T'HH:mm:ss".

* :reorder-attributes

  Reorders attributes.

  Parameters:

    - :attributes
        New ordering of the attributes. Sample value: ["2-last" "1"], which moves the attribute currently at position 1 to the end.
        Be sure to quote all attributes so that number indexes are not automatically incremented by 1 (Weka indexes start at 1).

* :remove-attributes

  Removes the columns identified by the provided attributes from the dataset.

  Parameters:

    - :attributes
        Index of the attributes to remove. Sample value: [0 1 2]
        The attributes may also be specified by name: [:some-name, "another-name"]

* :remove-useless-attributes

  Removes attributes that do not vary at all or that vary too much. All constant attributes are deleted automatically, along with any that exceed the maximum percentage of variance parameter. The maximum variance test is only applied to nominal attributes.

  Parameters:

    - :max-variance
        Maximum variance percentage allowed (default 99).
        Note: this is a percentage, not a decimal, e.g. 89 not 0.89.
        If you pass in a decimal, Weka silently sets it to 0.0.

* :resample-unsupervised

  "Produces a random subsample of a dataset using either sampling with replacement or without replacement. The original dataset must fit entirely in memory. The number of instances in the generated dataset may be specified. When used in batch mode, subsequent batches are NOT resampled." -- from Weka JavaDoc.

  Parameters:

    - :seed
        Random number seed (integer)
    - :size-percent
        "The size of the output dataset, as a percentage of the input dataset (default 100)" (integer)
    - :no-replacement
        Sample without replacement; default is false, i.e., with replacement (boolean)
    - :invert
        Inverts the selection; can only be true if :no-replacement is true (boolean)

* :resample-supervised

  "Produces a random subsample of a dataset using either sampling with replacement or without replacement. The original dataset must fit entirely in memory. The number of instances in the generated dataset may be specified. The dataset must have a nominal class attribute. If not, use the unsupervised version. The filter can be made to maintain the class distribution in the subsample, or to bias the class distribution toward a uniform distribution. When used in batch mode (i.e. in the FilteredClassifier), subsequent batches are NOT resampled." -- from Weka JavaDoc.

  Parameters:

    - :seed
        Random number seed (integer)
    - :size-percent
        "The size of the output dataset, as a percentage of the input dataset (default 100)" (integer)
    - :bias
        "Bias factor towards uniform class distribution. 0 = distribution in input data -- 1 = uniform distribution. (default 0)" (0 or 1)
    - :no-replacement
        Sample without replacement; default is false, i.e., with replacement (boolean)
    - :invert
        Inverts the selection; can only be true if :no-replacement is true (boolean)

* :stratified-remove-folds-supervised

  "This filter takes a dataset and outputs a specified fold for cross validation. If you do not want the folds to be stratified use the unsupervised version." -- from Weka JavaDoc

  Parameters:

    - :num-folds
        Specifies the number of folds the dataset is split into. (default 10)
    - :fold
        Specifies which fold is selected. (default 1)
    - :seed
        Specifies the random number seed. (default 0, no randomizing)
    - :invert
        Specifies if the inverse of the selection is to be output.

* :select-append-attributes

  Appends a copy of the selected columns at the end of the dataset.

  Parameters:

    - :attributes
        Index of the attributes. Sample value: [1 2 3]
        The attributes may also be specified by name: [:some-name, "another-name"]
    - :invert
        Invert the selection of the columns. Sample value: true

* :replace-missing-values

  Replaces all missing values for nominal and numeric attributes in a dataset with the modes and means from the training data.

  Parameters:

    - :unset-class-temporarily
        Unsets the class index temporarily before the filter is applied to the data. Sample value: true; default: false

* :project-attributes

  Projects some columns from the provided dataset.

  Parameters:

    - :invert
        Invert the selection of columns. Sample value: true

* :clj-streamable

  Allows you to create a custom streamable filter with Clojure functions. A streamable filter is appropriate when you don't need to iterate over the entire dataset before processing it.

  Parameters:

    - :process
        This function will receive individual weka.core.Instance objects (rows of the dataset) and should return a newly processed Instance. The actual Instance is passed in and you may change it directly. However, a better approach is to copy the Instance with the copy method or Instance constructor and return a modified version of the copy.
    - :determine-dataset-format
        This function will receive the dataset's weka.core.Instances object with no actual Instance objects (i.e. just the format encoded in the attributes). You must return an Instances object that contains the new format of the filtered dataset. Passing this fn is optional. If you are not changing the format of the dataset, omit this function and the current format is used.

* :clj-batch

  Allows you to create a custom batch filter with Clojure functions. A batch filter is appropriate when you need to iterate over the entire dataset before processing it.

  Parameters:

    - :process
        This function will receive the entire dataset as a weka.core.Instances object. A processed Instances object should be returned with the new Instance objects added to it. The format of the dataset (Instances) that is returned from this will be returned from the filter (see below).
    - :determine-dataset-format
        This function will receive the dataset's weka.core.Instances object with no actual Instance objects (i.e. just the format encoded in the attributes). You must return an Instances object that contains the new format of the filtered dataset. Passing this fn is optional. For many batch filters you need to process the entire dataset to determine the correct format (e.g. filters that operate on nominal attributes). For this reason the clj-batch filter will *always* use the format of the dataset that the process fn outputs. In other words, if you need to operate on the entire dataset before determining the format, do this in the process fn and pass nothing for this fn.

For examples on how to use the filters, especially the Clojure filters, you may refer to filters_test.clj of clj-ml.
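To make the :clj-streamable contract more concrete, here is a minimal sketch of a :process fn that follows the copy-then-modify advice above. The dataset and the double-first-value fn are hypothetical. The copy, value and setValue calls are methods of Weka's Instance, not of clj-ml. Omitting :determine-dataset-format relies on the statement above that the current format is then reused.

```clojure
(require '[clj-ml.data :refer [make-dataset]]
         '[clj-ml.filters :refer [make-apply-filter]])
(import 'weka.core.Instance)

;; Hypothetical dataset with two numeric attributes.
(def numbers (make-dataset :numbers
                           [:x :y]
                           [[1 2]
                            [3 4]
                            [5 6]]))

(defn double-first-value
  "Returns a copy of the instance whose first attribute value is doubled.
  The original instance is left untouched."
  [^Instance instance]
  (let [^Instance result (.copy instance)]
    (.setValue result (int 0) (* 2.0 (.value instance (int 0))))
    result))

;; No :determine-dataset-format is passed because the format is unchanged.
(def doubled
  (make-apply-filter :clj-streamable {:process double-first-value} numbers))

(println (.value (.instance doubled 0) (int 0)))
```

Working on a copy keeps the input dataset unchanged, which matters if the same dataset is filtered more than once.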
(numeric-to-nominal ds)
(numeric-to-nominal ds attributes)
(pki-unsupervised-discretize ds)
(pki-unsupervised-discretize ds attributes)
(project-attributes ds)
(project-attributes ds attributes)
(random-subset ds)
(random-subset ds attributes)
(remove-attributes ds)
(remove-attributes ds attributes)
(remove-percentage ds)
(remove-percentage ds attributes)
(remove-range ds)
(remove-range ds attributes)
(remove-useless-attributes ds)
(remove-useless-attributes ds attributes)
(reorder-attributes ds)
(reorder-attributes ds attributes)
(replace-missing-values ds)
(replace-missing-values ds attributes)
(resample-supervised ds)
(resample-supervised ds attributes)
(resample-unsupervised ds)
(resample-unsupervised ds attributes)
(select-append-attributes ds)
(select-append-attributes ds attributes)
(stratified-remove-folds-supervised ds)
(stratified-remove-folds-supervised ds attributes)
(string-to-word-vector ds)
(string-to-word-vector ds attributes)
(supervised-discretize ds)
(supervised-discretize ds attributes)
(unsupervised-discretize ds)
(unsupervised-discretize ds attributes)
(unsupervised-nominal-to-binary ds)
(unsupervised-nominal-to-binary ds attributes)
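Each of the filter fns listed above has the same two arities. The sketch below shows both on the toy dataset from the namespace introduction. It assumes that the one-argument form behaves like passing an empty options map, which the arglists suggest but do not state. The var names are illustrative.

```clojure
(require '[clj-ml.data :refer [make-dataset]]
         '[clj-ml.filters :refer [remove-attributes replace-missing-values]])

(def ds (make-dataset :test
                      [:a :b {:c [:g :m]}]
                      [[1 2 :g]
                       [2 3 :m]
                       [4 5 :g]]))

;; One-argument arity: only the dataset is passed.
;; ASSUMPTION: this is the same as passing an empty options map.
(def filled (replace-missing-values ds))

;; Two-argument arity: the second argument is the options map
;; described under make-filter (without :dataset-format).
(def only-b (remove-attributes ds {:attributes [:a :c]}))

(println (.numAttributes filled) (.numAttributes only-b))
```

Because the dataset is the first argument, these fns compose with -> as in the namespace introduction.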