(binariser params)Binarize a column of continuous features given a threshold.
Since 3.0.0, Binarizer can map multiple columns at once by setting the inputCols parameter. Note that when both the inputCol and inputCols parameters are set, an Exception will be thrown. The threshold parameter is used for single-column usage, and thresholds is for multiple columns.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Binarizer.html
Timestamp: 2020-10-19T01:56:05.331Z
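The thresholding rule can be sketched in plain Python. This is a minimal illustration, not the Spark implementation: it assumes values strictly greater than the threshold map to 1.0 and all others to 0.0, the helper names `binarize` and `binarize_columns` are hypothetical, and lists stand in for DataFrame columns.

```python
def binarize(values, threshold=0.0):
    """Map each value to 1.0 if it is strictly greater than threshold, else 0.0."""
    return [1.0 if v > threshold else 0.0 for v in values]


def binarize_columns(columns, thresholds):
    """Multi-column variant: one threshold per column, mirroring inputCols/thresholds."""
    if len(columns) != len(thresholds):
        raise ValueError("need exactly one threshold per column")
    return [binarize(col, t) for col, t in zip(columns, thresholds)]


single = binarize([0.1, 0.5, 0.8], threshold=0.5)
multi = binarize_columns([[0.1, 0.9], [3.0, 7.0]], thresholds=[0.5, 5.0])
```

Note that a value equal to the threshold (0.5 above) maps to 0.0 under this assumption.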
(binarizer params)Binarize a column of continuous features given a threshold.
Since 3.0.0, Binarizer can map multiple columns at once by setting the inputCols parameter. Note that when both the inputCol and inputCols parameters are set, an Exception will be thrown. The threshold parameter is used for single-column usage, and thresholds is for multiple columns.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Binarizer.html
Timestamp: 2020-10-19T01:56:05.331Z
(bucketed-random-projection-lsh params)This BucketedRandomProjectionLSH implements Locality Sensitive Hashing functions for Euclidean distance metrics.
The input is dense or sparse vectors, each of which represents a point in the Euclidean distance space. The output will be vectors of configurable dimension. Hash values in the same dimension are calculated by the same hash function.
References:
1. Wikipedia on Stable Distributions
2. Wang, Jingdong et al. "Hashing for similarity search: A survey." arXiv preprint arXiv:1408.2927 (2014).
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/BucketedRandomProjectionLSH.html
Timestamp: 2020-10-19T01:56:05.693Z
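A rough sketch of how a random-projection hash for Euclidean distance can work. It assumes each hash function projects the point onto a random unit vector and floors the result divided by a bucket length; the names `make_hash_functions`, `hash_point` and `bucket_length` are illustrative and not taken from the text above.

```python
import math
import random


def make_hash_functions(dim, num_hash_tables, seed=0):
    """Draw one random unit vector per hash table (one output dimension each)."""
    rng = random.Random(seed)
    vectors = []
    for _ in range(num_hash_tables):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in v))
        vectors.append([c / norm for c in v])
    return vectors


def hash_point(point, vectors, bucket_length):
    """Hash value per dimension: floor(dot(point, v) / bucket_length)."""
    return [
        math.floor(sum(p * c for p, c in zip(point, v)) / bucket_length)
        for v in vectors
    ]


vectors = make_hash_functions(dim=2, num_hash_tables=3)
origin_hash = hash_point([0.0, 0.0], vectors, bucket_length=2.0)
point_hash = hash_point([1.0, 1.0], vectors, bucket_length=2.0)
```

The same unit vector is reused for every point in a given output dimension, which is what lets nearby points tend to collide in the same bucket.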
(bucketiser params)Bucketizer maps a column of continuous features to a column of feature buckets.
Since 2.3.0, Bucketizer can map multiple columns at once by setting the inputCols parameter. Note that when both the inputCol and inputCols parameters are set, an Exception will be thrown. The splits parameter is only used for single column usage, and splitsArray is for multiple columns.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Bucketizer.html
Timestamp: 2020-10-19T01:56:06.060Z
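A minimal sketch of how splits can map values to bucket indices. It assumes that n+1 strictly increasing splits define n buckets, that each bucket covers [x, y), and that the last bucket also includes its upper bound; `bucketize` is a hypothetical helper name.

```python
from bisect import bisect_right


def bucketize(value, splits):
    """Return the bucket index of value for strictly increasing splits."""
    if value < splits[0] or value > splits[-1]:
        raise ValueError(f"{value} is outside the splits range")
    if value == splits[-1]:
        # The last bucket is closed on the right under this assumption.
        return float(len(splits) - 2)
    return float(bisect_right(splits, value) - 1)


splits = [float("-inf"), -0.5, 0.0, 0.5, float("inf")]
buckets = [bucketize(v, splits) for v in [-999.9, -0.5, -0.3, 0.0, 0.2, 999.9]]
```

Using infinite outer splits, as here, covers all real values so no input falls outside the range.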
(bucketizer params)Bucketizer maps a column of continuous features to a column of feature buckets.
Since 2.3.0, Bucketizer can map multiple columns at once by setting the inputCols parameter. Note that when both the inputCol and inputCols parameters are set, an Exception will be thrown. The splits parameter is only used for single column usage, and splitsArray is for multiple columns.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Bucketizer.html
Timestamp: 2020-10-19T01:56:06.060Z
(chi-sq-selector params)Chi-Squared feature selection, which selects categorical features to use for predicting a categorical label. The selector supports different selection methods: numTopFeatures, percentile, fpr, fdr, fwe.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/ChiSqSelector.html
Timestamp: 2020-10-19T01:56:06.428Z
(count-vectoriser params)Extracts a vocabulary from document collections and generates a CountVectorizerModel.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/CountVectorizer.html
Timestamp: 2020-10-19T01:56:06.801Z
(count-vectorizer params)Extracts a vocabulary from document collections and generates a CountVectorizerModel.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/CountVectorizer.html
Timestamp: 2020-10-19T01:56:06.801Z
(dct params)A feature transformer that takes the 1D discrete cosine transform of a real vector. No zero padding is performed on the input vector. It returns a real vector of the same length representing the DCT. The return vector is scaled such that the transform matrix is unitary (aka scaled DCT-II).
More information on DCT-II in Discrete cosine transform (Wikipedia).
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/DCT.html
Timestamp: 2020-10-19T01:56:07.160Z
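The scaled DCT-II described above can be written out directly. This is an O(n^2) pure-Python sketch for illustration only; it assumes the standard orthonormal scaling (sqrt(1/N) for the first coefficient, sqrt(2/N) for the rest), and `dct_ii_unitary` is a hypothetical name.

```python
import math


def dct_ii_unitary(x):
    """1D DCT-II of a real vector, scaled so the transform matrix is unitary."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        total = sum(x[i] * math.cos(math.pi / n * (i + 0.5) * k) for i in range(n))
        out.append(scale * total)
    return out


signal = [0.0, 1.0, -2.0, 3.0]
coeffs = dct_ii_unitary(signal)
constant = dct_ii_unitary([1.0, 1.0, 1.0, 1.0])
```

Because the transform is unitary, the output has the same length and the same sum of squares as the input, and a constant signal puts all of its energy in the first coefficient.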
(discrete-cosine-transform params)A feature transformer that takes the 1D discrete cosine transform of a real vector. No zero padding is performed on the input vector. It returns a real vector of the same length representing the DCT. The return vector is scaled such that the transform matrix is unitary (aka scaled DCT-II).
More information on DCT-II in Discrete cosine transform (Wikipedia).
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/DCT.html
Timestamp: 2020-10-19T01:56:07.160Z
(elementwise-product params)Outputs the Hadamard product (i.e., the element-wise product) of each input vector with a provided "weight" vector. In other words, it scales each column of the dataset by a scalar multiplier.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/ElementwiseProduct.html
Timestamp: 2020-10-19T01:56:07.551Z
(feature-hasher params)Feature hashing projects a set of categorical or numerical features into a feature vector of specified dimension (typically substantially smaller than that of the original feature space). This is done using the hashing trick (https://en.wikipedia.org/wiki/Feature_hashing) to map features to indices in the feature vector.
The FeatureHasher transformer operates on multiple columns. Each column may contain either numeric or categorical features. Behavior and handling of column data types is as follows:
 -Numeric columns: For numeric features, the hash value of the column name is used to map the feature value to its index in the feature vector. By default, numeric features are not treated as categorical (even when they are integers). To treat them as categorical, specify the relevant columns in categoricalCols.
 -String columns: For categorical features, the hash value of the string "column_name=value" is used to map to the vector index, with an indicator value of 1.0. Thus, categorical features are "one-hot" encoded (similarly to using OneHotEncoder with dropLast=false).
 -Boolean columns: Boolean values are treated in the same way as string columns. That is, boolean features are represented as "column_name=true" or "column_name=false", with an indicator value of 1.0.
Null (missing) values are ignored (implicitly zero in the resulting feature vector).
The hash function used here is also the MurmurHash 3 used in HashingTF. Since a simple modulo on the hashed value is used to determine the vector index, it is advisable to use a power of two as the numFeatures parameter; otherwise the features will not be mapped evenly to the vector indices.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/FeatureHasher.html
Timestamp: 2020-10-19T01:56:07.938Z
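The column handling rules above can be sketched as follows. This is an illustration under stated assumptions: CRC32 from the standard library stands in for MurmurHash 3 (so the indices will not match Spark's), a dict stands in for a sparse vector, and `stand_in_hash` and `hash_features` are hypothetical names.

```python
import zlib


def stand_in_hash(text):
    """Stand-in for MurmurHash 3 (assumption: CRC32 is used only for illustration)."""
    return zlib.crc32(text.encode("utf-8"))


def hash_features(row, num_features, categorical_cols=()):
    """Hash one row (column name -> value) into a sparse {index: value} vector."""
    vector = {}
    for name, value in row.items():
        if value is None:
            continue  # null values are ignored (implicitly zero)
        if isinstance(value, bool):
            key, weight = f"{name}={str(value).lower()}", 1.0
        elif isinstance(value, str) or name in categorical_cols:
            key, weight = f"{name}={value}", 1.0
        else:
            key, weight = name, float(value)  # numeric: hash the column name
        index = stand_in_hash(key) % num_features  # simple modulo on the hash
        vector[index] = vector.get(index, 0.0) + weight
    return vector


row = {"real": 2.2, "bool": True, "city": "Paris", "missing": None}
hashed = hash_features(row, num_features=16)
```

Colliding features are added together in this sketch, so the values always sum to the numeric value plus one per categorical or boolean feature, whatever indices the hash assigns.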
(hashing-tf params)Maps a sequence of terms to their term frequencies using the hashing trick. Currently we use Austin Appleby's MurmurHash 3 algorithm (MurmurHash3_x86_32) to calculate the hash code value for the term object. Since a simple modulo is used to transform the hash function to a column index, it is advisable to use a power of two as the numFeatures parameter; otherwise the features will not be mapped evenly to the columns.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/HashingTF.html
Timestamp: 2020-10-19T01:56:08.308Z
(idf params)Compute the Inverse Document Frequency (IDF) given a collection of documents.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/IDF.html
Timestamp: 2020-10-19T01:56:08.857Z
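A small sketch of an IDF computation. The formula is not stated above; this assumes the smoothed form idf(t) = log((m + 1) / (d(t) + 1)), where m is the number of documents and d(t) the number of documents containing term t. Token lists stand in for term-frequency vectors, and `inverse_document_frequency` is a hypothetical name.

```python
import math


def inverse_document_frequency(documents):
    """Return {term: idf} using the assumed formula log((m + 1) / (d(t) + 1))."""
    m = len(documents)
    doc_freq = {}
    for doc in documents:
        for term in set(doc):  # count each term at most once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log((m + 1) / (df + 1)) for term, df in doc_freq.items()}


docs = [["spark", "ml"], ["spark", "sql"], ["spark"]]
idf = inverse_document_frequency(docs)
```

Under this formula a term that appears in every document, such as "spark" here, gets an IDF of zero.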
(imputer params)Imputation estimator for completing missing values, either using the mean or the median of the columns in which the missing values are located. The input columns should be of numeric type. Currently Imputer does not support categorical features (SPARK-15041) and possibly creates incorrect values for a categorical feature.
Note that when an input column is integer, the imputed value is cast (truncated) to an integer type. For example, if the input column is IntegerType (1, 2, 4, null), the output will be IntegerType (1, 2, 4, 2) after mean imputation.
Note that the mean/median value is computed after filtering out missing values. All Null values in the input columns are treated as missing, and so are also imputed. For computing median, DataFrameStatFunctions.approxQuantile is used with a relative error of 0.001.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Imputer.html
Timestamp: 2020-10-19T01:56:09.241Z
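The integer example above can be reproduced with a short sketch. It covers mean imputation only, treats `None` as the missing value, and uses the hypothetical helper name `impute_mean`; the median path (which relies on approxQuantile) is not modelled.

```python
def impute_mean(values, as_integer=False):
    """Replace None entries with the mean of the non-missing entries."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)  # computed after filtering out missing values
    fill = int(mean) if as_integer else mean  # int() truncates toward zero
    return [fill if v is None else v for v in values]


imputed_int = impute_mean([1, 2, 4, None], as_integer=True)
imputed_float = impute_mean([1.0, None, 3.0])
```

The mean of 1, 2 and 4 is about 2.33, which truncates to 2 for an integer column.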
(index-to-string params)A Transformer that maps a column of indices back to a new column of corresponding string values. The index-string mapping is either from the ML attributes of the input column, or from user-supplied labels (which take precedence over ML attributes).
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/IndexToString.html
Timestamp: 2020-10-19T01:56:09.599Z
(interaction params)Implements the feature interaction transform. This transformer takes in Double and Vector type columns and outputs a flattened vector of their feature interactions. To handle interaction, we first one-hot encode any nominal features. Then, a vector of the feature cross-products is produced.
For example, given the input feature values Double(2) and Vector(3, 4), the output would be Vector(6, 8) if all input features were numeric. If the first feature was instead nominal with four categories, the output would then be Vector(0, 0, 0, 0, 3, 4, 0, 0).
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Interaction.html
Timestamp: 2020-10-19T01:56:09.965Z
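Both cases in the example above can be checked with a small sketch. It assumes the nominal feature holds category index 2 and is one-hot encoded without dropping any category; `one_hot` and `interact` are hypothetical helper names.

```python
def one_hot(index, num_categories):
    """One-hot encode a category index, keeping all categories."""
    return [1.0 if i == index else 0.0 for i in range(num_categories)]


def interact(*features):
    """Flattened vector of cross-products of the given feature vectors."""
    result = [1.0]
    for feature in features:
        result = [a * b for a in result for b in feature]
    return result


# Double(2) crossed with Vector(3, 4), all numeric.
numeric = interact([2.0], [3.0, 4.0])
# First feature nominal with four categories, holding category index 2.
nominal = interact(one_hot(2, 4), [3.0, 4.0])
```

In the nominal case each of the four categories contributes a copy of the second vector, and only the copy for the active category is non-zero.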
(max-abs-scaler params)Rescale each feature individually to the range [-1, 1] by dividing by the largest absolute value of that feature. It does not shift/center the data, and thus does not destroy any sparsity.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/MaxAbsScaler.html
Timestamp: 2020-10-19T01:56:10.658Z
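A minimal sketch of the rescaling, using lists of rows in place of a vector column. The helper name `max_abs_scale` is hypothetical, and leaving an all-zero feature at zero is an assumption of this sketch.

```python
def max_abs_scale(rows):
    """Divide every feature by the largest absolute value seen for that feature."""
    num_features = len(rows[0])
    max_abs = [max(abs(row[j]) for row in rows) for j in range(num_features)]
    return [
        [row[j] / max_abs[j] if max_abs[j] != 0.0 else 0.0 for j in range(num_features)]
        for row in rows
    ]


scaled = max_abs_scale([[1.0, 0.1, -8.0], [2.0, 1.0, -4.0], [4.0, 10.0, 8.0]])
```

Zeros stay zero because nothing is subtracted from the data, which is why sparsity is preserved.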
(min-hash-lsh params)LSH class for Jaccard distance.
The input can be dense or sparse vectors, but it is more efficient if it is sparse. For example, Vectors.sparse(10, Array((2, 1.0), (3, 1.0), (5, 1.0))) means there are 10 elements in the space. This set contains elements 2, 3, and 5. Also, any input vector must have at least 1 non-zero index, and all non-zero values are treated as binary "1" values.
References: Wikipedia on MinHash
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/MinHashLSH.html
Timestamp: 2020-10-19T01:56:11.035Z
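To make the set interpretation concrete, the sketch below treats each sparse vector as the set of its non-zero indices, computes the Jaccard distance, and builds a simple min-hash signature. The hash family ((1 + i) * a + b) mod prime, the modulus value, and the names `jaccard_distance` and `min_hash_signature` are assumptions for illustration, not details given above.

```python
import random

PRIME = 2038074743  # large prime modulus (illustrative choice)


def jaccard_distance(a, b):
    """1 - |A intersect B| / |A union B| over sets of non-zero indices."""
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / len(a | b)


def min_hash_signature(indices, coefficients):
    """One min-hash per (a, b) pair: the minimum hashed index of the set."""
    return [min(((1 + i) * a + b) % PRIME for i in indices) for a, b in coefficients]


rng = random.Random(42)
coefficients = [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(5)]

set_a = [2, 3, 5]  # non-zero indices of Vectors.sparse(10, ...)
set_b = [3, 5, 7]
distance = jaccard_distance(set_a, set_b)
signature_a = min_hash_signature(set_a, coefficients)
signature_b = min_hash_signature(set_b, coefficients)
```

The two sets share 2 of 4 distinct elements, so their Jaccard distance is 0.5; signatures of similar sets tend to agree in more positions than those of dissimilar sets.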
(min-max-scaler params)Rescale each feature individually to a common range [min, max] linearly using column summary statistics, which is also known as min-max normalization or Rescaling. The rescaled value for feature E is calculated as:
\(Rescaled(e_i) = \frac{e_i - E_{min}}{E_{max} - E_{min}} * (max - min) + min\)
For the case \(E_{max} == E_{min}\), \(Rescaled(e_i) = 0.5 * (max + min)\).
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/MinMaxScaler.html
Timestamp: 2020-10-19T01:56:11.407Z
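A short sketch of the rescaling for a single feature column. It assumes the usual min-max formula (e_i - E_min) / (E_max - E_min) * (max - min) + min together with the constant-column rule stated above; `min_max_scale`, `new_min` and `new_max` are hypothetical names.

```python
def min_max_scale(column, new_min=0.0, new_max=1.0):
    """Linearly rescale one feature column to [new_min, new_max]."""
    e_min, e_max = min(column), max(column)
    if e_max == e_min:
        # Constant column: every value maps to the midpoint of the target range.
        return [0.5 * (new_max + new_min) for _ in column]
    return [
        (e - e_min) / (e_max - e_min) * (new_max - new_min) + new_min
        for e in column
    ]


scaled = min_max_scale([1.0, 2.0, 3.0])
constant = min_max_scale([7.0, 7.0], new_min=-1.0, new_max=1.0)
```

The constant-column branch avoids a division by zero when E_max equals E_min.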
(n-gram params)A feature transformer that converts the input array of strings into an array of n-grams. Null values in the input array are ignored. It returns an array of n-grams where each n-gram is represented by a space-separated string of words.
When the input is empty, an empty array is returned. When the input array length is less than n (number of elements per n-gram), no n-grams are returned.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/NGram.html
Timestamp: 2020-10-19T01:56:11.769Z
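The behaviour described above, including the short-input cases, can be sketched with a sliding window. The helper name `n_grams` is hypothetical, and null handling is left out of this sketch.

```python
def n_grams(tokens, n=2):
    """Return space-separated n-grams; empty when there are fewer than n tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


bigrams = n_grams(["Hi", "I", "heard", "about", "Spark"], n=2)
too_short = n_grams(["Spark"], n=2)
empty = n_grams([], n=2)
```

When the input has fewer than n tokens the range is empty, so no n-grams are produced.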
(normaliser params)Normalize a vector to have unit norm using the given p-norm.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Normalizer.html
Timestamp: 2020-10-19T01:56:12.133Z
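A minimal sketch of p-norm normalization for a single vector. The helper name `normalize` is hypothetical, and returning a zero vector unchanged is an assumption of this sketch.

```python
import math


def normalize(vector, p=2.0):
    """Scale vector so that its p-norm equals 1 (p may be math.inf)."""
    if p == math.inf:
        norm = max(abs(v) for v in vector)
    else:
        norm = sum(abs(v) ** p for v in vector) ** (1.0 / p)
    if norm == 0.0:
        return list(vector)  # nothing to scale
    return [v / norm for v in vector]


l2 = normalize([3.0, 4.0])
l1 = normalize([1.0, -1.0, 2.0], p=1.0)
l_inf = normalize([1.0, -4.0], p=math.inf)
```

With p = 2 the vector (3, 4) has norm 5, so it becomes (0.6, 0.8).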
(normalizer params)Normalize a vector to have unit norm using the given p-norm.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Normalizer.html
Timestamp: 2020-10-19T01:56:12.133Z
(one-hot-encoder params)A one-hot encoder that maps a column of category indices to a column of binary vectors, with at most a single one-value per row that indicates the input category index. For example, with 5 categories, an input value of 2.0 would map to an output vector of [0.0, 0.0, 1.0, 0.0]. The last category is not included by default (configurable via dropLast), because including it would make the vector entries sum to one and hence be linearly dependent. So an input value of 4.0 maps to [0.0, 0.0, 0.0, 0.0].
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/OneHotEncoder.html
Timestamp: 2020-10-19T01:56:12.690Z
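The 5-category example above can be reproduced with a short sketch. A dense list stands in for the output vector, and `one_hot_encode` and `drop_last` are hypothetical names mirroring dropLast.

```python
def one_hot_encode(index, num_categories, drop_last=True):
    """Encode a category index as a binary vector, optionally dropping the last category."""
    size = num_categories - 1 if drop_last else num_categories
    return [1.0 if i == int(index) else 0.0 for i in range(size)]


middle = one_hot_encode(2.0, 5)
last = one_hot_encode(4.0, 5)
last_kept = one_hot_encode(4.0, 5, drop_last=False)
```

With the last category dropped, index 4.0 has no position of its own and is represented by the all-zeros vector.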
(pca params)PCA trains a model to project vectors to a lower dimensional space of the top k principal components, where k is the PCA k parameter.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/PCA.html
Timestamp: 2020-10-19T01:56:13.048Z
(polynomial-expansion params)Perform feature expansion in a polynomial space. As the Wikipedia article on polynomial expansion puts it, "In mathematics, an expansion of a product of sums expresses it as a sum of products by using the fact that multiplication distributes over addition". For example, expanding the 2-variable feature vector (x, y) with degree 2 gives (x, x * x, y, x * y, y * y).
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/PolynomialExpansion.html
Timestamp: 2020-10-19T01:56:13.405Z
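A rough sketch that enumerates the monomials of a polynomial expansion, assuming Python 3.8+ for `math.prod`. It produces the same terms as the (x, y) example above but in a different order (grouped by degree), so it illustrates the idea rather than the exact output layout; `polynomial_expand` is a hypothetical name.

```python
from itertools import combinations_with_replacement
from math import prod


def polynomial_expand(vector, degree=2):
    """All monomials of total degree 1..degree, without the constant term."""
    terms = []
    for d in range(1, degree + 1):
        for combo in combinations_with_replacement(vector, d):
            terms.append(prod(combo))
    return terms


# With x = 2 and y = 3 this yields x, y, x*x, x*y, y*y.
expanded = polynomial_expand([2.0, 3.0], degree=2)
```

For two variables and degree 2 there are five terms, matching the count in the example above.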
(quantile-discretiser params)QuantileDiscretizer takes a column with continuous features and outputs a column with binned categorical features. The number of bins can be set using the numBuckets parameter. It is possible that the number of buckets used will be smaller than this value, for example, if there are too few distinct values of the input to create enough distinct quantiles. Since 2.3.0, QuantileDiscretizer can map multiple columns at once by setting the inputCols parameter. If both the inputCol and inputCols parameters are set, an Exception will be thrown. To specify the number of buckets for each column, the numBucketsArray parameter can be set, or if the number of buckets should be the same across columns, numBuckets can be set as a convenience. Note that in the multiple-column case, the relative error is applied to all columns.
NaN handling: null and NaN values are ignored during QuantileDiscretizer fitting. Fitting produces a Bucketizer model for making predictions. During the transformation, Bucketizer will raise an error when it finds NaN values in the dataset, but the user can also choose to either keep or remove NaN values within the dataset by setting handleInvalid. If the user chooses to keep NaN values, they will be handled specially and placed into their own bucket. For example, if 4 buckets are used, then non-NaN data will be put into buckets[0-3], but NaNs will be counted in a special bucket[4].
Algorithm: The bin ranges are chosen using an approximate algorithm (see the documentation for org.apache.spark.sql.DataFrameStatFunctions.approxQuantile for a detailed description). The precision of the approximation can be controlled with the relativeError parameter. The lower and upper bin bounds will be -Infinity and +Infinity, covering all real values.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/QuantileDiscretizer.html
Timestamp: 2020-10-19T01:56:13.770Z
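The fitting and NaN handling described above can be sketched as follows. This illustration uses exact quantiles of the sorted data in place of approxQuantile, so split points may differ from Spark's; `fit_splits` and `transform` are hypothetical names, and the "skip" option is modelled by returning `None`.

```python
import math
from bisect import bisect_right


def fit_splits(values, num_buckets):
    """Choose split points from quantiles, ignoring null and NaN values."""
    clean = sorted(v for v in values if v is not None and not math.isnan(v))
    inner = []
    for i in range(1, num_buckets):
        q = clean[min(len(clean) - 1, int(i / num_buckets * len(clean)))]
        if not inner or q > inner[-1]:  # keep only distinct quantiles
            inner.append(q)
    return [-math.inf] + inner + [math.inf]


def transform(value, splits, handle_invalid="error"):
    """Bucket index for value; NaN handling follows handle_invalid."""
    num_buckets = len(splits) - 1
    if math.isnan(value):
        if handle_invalid == "keep":
            return num_buckets  # NaNs get their own extra bucket
        if handle_invalid == "skip":
            return None
        raise ValueError("NaN value found")
    return min(bisect_right(splits, value) - 1, num_buckets - 1)


data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, float("nan")]
splits = fit_splits(data, num_buckets=4)
indices = [transform(v, splits, handle_invalid="keep") for v in data]
few_distinct = fit_splits([1.0, 1.0, 1.0, 1.0], num_buckets=4)
```

With 4 buckets the non-NaN data lands in buckets 0 to 3 and the NaN in bucket 4, and an input with too few distinct values ends up with fewer buckets than requested.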
(quantile-discretizer params)QuantileDiscretizer takes a column with continuous features and outputs a column with binned categorical features. The number of bins can be set using the numBuckets parameter. It is possible that the number of buckets used will be smaller than this value, for example, if there are too few distinct values of the input to create enough distinct quantiles. Since 2.3.0, QuantileDiscretizer can map multiple columns at once by setting the inputCols parameter. If both the inputCol and inputCols parameters are set, an Exception will be thrown. To specify the number of buckets for each column, the numBucketsArray parameter can be set, or if the number of buckets should be the same across columns, numBuckets can be set as a convenience. Note that in the multiple-column case, the relative error is applied to all columns.
NaN handling: null and NaN values are ignored during QuantileDiscretizer fitting. Fitting produces a Bucketizer model for making predictions. During the transformation, Bucketizer will raise an error when it finds NaN values in the dataset, but the user can also choose to either keep or remove NaN values within the dataset by setting handleInvalid. If the user chooses to keep NaN values, they will be handled specially and placed into their own bucket. For example, if 4 buckets are used, then non-NaN data will be put into buckets[0-3], but NaNs will be counted in a special bucket[4].
Algorithm: The bin ranges are chosen using an approximate algorithm (see the documentation for org.apache.spark.sql.DataFrameStatFunctions.approxQuantile for a detailed description). The precision of the approximation can be controlled with the relativeError parameter. The lower and upper bin bounds will be -Infinity and +Infinity, covering all real values.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/QuantileDiscretizer.html
Timestamp: 2020-10-19T01:56:13.770Z
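The bucket layout described for QuantileDiscretizer can be illustrated with a small pure-Python sketch. It is not the Spark implementation: `bucket_index`, `inner_splits`, and `handle_invalid` are hypothetical names chosen for illustration, the split points are supplied by hand rather than estimated with approxQuantile, and the half-open interval convention is an assumption.

```python
import bisect
import math


def bucket_index(value, inner_splits, handle_invalid="error"):
    """Hypothetical helper: map a value to a bucket index.

    The outer bounds are always -Infinity and +Infinity, so every
    real value falls into some bucket.
    """
    splits = [-math.inf] + sorted(inner_splits) + [math.inf]
    num_buckets = len(splits) - 1

    if math.isnan(value):
        if handle_invalid == "keep":
            # NaNs get their own extra bucket after the regular ones.
            return num_buckets
        if handle_invalid == "skip":
            # Signal to the caller that the row should be dropped.
            return None
        raise ValueError("NaN encountered while handle_invalid is 'error'")

    # Assumed convention: bucket i holds values in [splits[i], splits[i + 1]).
    # The clamp keeps +Infinity inside the last regular bucket.
    return min(bisect.bisect_right(splits, value) - 1, num_buckets - 1)


# Three interior splits give 4 regular buckets (0-3); NaN goes to bucket 4.
inner = [0.0, 10.0, 20.0]
print([bucket_index(v, inner, "keep") for v in (-5.0, 0.0, 15.0, 25.0, math.nan)])
```

With three interior split points there are four regular buckets, so the kept NaN lands in index 4, matching the buckets[0-3] plus bucket[4] example in the description.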
(regex-tokeniser params)A regex-based tokenizer that extracts tokens either by using the provided regex pattern to split the text (default) or by repeatedly matching the regex (if gaps is false). Optional parameters also allow filtering tokens by a minimum length. It returns an array of strings that can be empty.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/RegexTokenizer.html
Timestamp: 2020-10-19T01:56:14.327Z
(regex-tokenizer params)A regex-based tokenizer that extracts tokens either by using the provided regex pattern to split the text (default) or by repeatedly matching the regex (if gaps is false). Optional parameters also allow filtering tokens by a minimum length. It returns an array of strings that can be empty.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/RegexTokenizer.html
Timestamp: 2020-10-19T01:56:14.327Z
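The difference between the two modes of the regex tokenizer may be easier to see in a short sketch. This is plain Python using the standard `re` module, not the Spark class: `regex_tokenize` and its parameter names are hypothetical, the default pattern is an assumption, and the pattern is assumed to contain no capturing groups.

```python
import re


def regex_tokenize(text, pattern=r"\s+", gaps=True, min_token_length=1):
    """Hypothetical helper mirroring the two modes described above."""
    if gaps:
        # The pattern describes the separators: split the text on it.
        tokens = re.split(pattern, text)
    else:
        # The pattern describes the tokens: collect every match.
        tokens = [m.group(0) for m in re.finditer(pattern, text)]
    # Drop tokens shorter than the minimum length (this also removes
    # the empty strings that splitting can produce).
    return [t for t in tokens if len(t) >= min_token_length]


print(regex_tokenize("the quick  brown fox"))              # split on whitespace
print(regex_tokenize("a-1, b-22", r"\w+", gaps=False))     # match word characters
print(regex_tokenize("a bb ccc", min_token_length=2))      # length filter
print(regex_tokenize("   "))                               # empty result
```

The last call shows how the result can be an empty array: the input contains only separators, so nothing survives the length filter.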
(robust-scaler params)Scale features using statistics that are robust to outliers. RobustScaler removes the median and scales the data according to the quantile range. The quantile range is by default the IQR (interquartile range, the range between the 1st quartile = 25th percentile and the 3rd quartile = 75th percentile) but can be configured. Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. The median and quantile range are then stored to be used on later data using the transform method. Standardization of a dataset is a common requirement for many machine learning estimators. Typically this is done by removing the mean and scaling to unit variance. However, outliers can often influence the sample mean and variance in a negative way. In such cases, the median and the quantile range often give better results. Note that NaN values are ignored in the computation of medians and ranges.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/RobustScaler.html
Timestamp: 2020-10-19T01:56:15.260Z
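As a rough illustration of the RobustScaler description, the sketch below fits a median and an interquartile range on one feature and applies them to new values. It is a simplified stand-in, not Spark's code: the function names are hypothetical, quantiles are computed exactly with linear interpolation (an assumed convention), and the quantile range is assumed to be non-zero.

```python
import math


def quantile(sorted_vals, q):
    """Exact quantile with linear interpolation (assumed convention)."""
    pos = q * (len(sorted_vals) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def fit_robust(values, lower=0.25, upper=0.75):
    """Return (median, quantile range), ignoring NaN values."""
    clean = sorted(v for v in values if not math.isnan(v))
    median = quantile(clean, 0.5)
    q_range = quantile(clean, upper) - quantile(clean, lower)
    return median, q_range


def transform_robust(values, median, q_range):
    """Center on the stored median and scale by the stored range."""
    return [(v - median) / q_range for v in values]


# The outlier 100.0 barely affects the fitted statistics.
median, q_range = fit_robust([1.0, 2.0, math.nan, 3.0, 4.0, 100.0])
print(median, q_range)
print(transform_robust([1.0, 3.0, 100.0], median, q_range))
```

The fitted median and range depend only on the middle of the data, so the single large value changes how far it is mapped from zero but not how the other values are scaled.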
(sql-transformer params)Implements transformations that are defined by a SQL statement. Currently only SQL syntax like 'SELECT ... FROM __THIS__ ...' is supported, where '__THIS__' represents the underlying table of the input dataset. The select clause specifies the fields, constants, and expressions to display in the output; it can be any select clause that Spark SQL supports. Users can also use Spark SQL built-in functions and UDFs to operate on these selected columns. For example, SQLTransformer supports statements like:
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/SQLTransformer.html
Timestamp: 2020-10-19T01:56:15.611Z
(standard-scaler params)Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set.
The "unit std" is computed using the corrected sample standard deviation, which is computed as the square root of the unbiased sample variance.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/StandardScaler.html
Timestamp: 2020-10-19T01:56:16.163Z
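The distinction between the corrected sample standard deviation and the population standard deviation can be checked with Python's `statistics` module. The sketch below is only an illustration of the formula in the description: `fit_standard` and `transform_standard` are hypothetical names, and a single feature is handled as a plain list.

```python
import statistics


def fit_standard(values):
    """Return (mean, corrected sample standard deviation).

    statistics.stdev divides the squared deviations by n - 1, i.e. it is
    the square root of the unbiased sample variance.
    """
    return statistics.mean(values), statistics.stdev(values)


def transform_standard(values, mean, std):
    """Remove the mean and scale to unit standard deviation."""
    return [(v - mean) / std for v in values]


data = [1.0, 2.0, 3.0]
mean, std = fit_standard(data)
print(mean, std)                             # corrected (n - 1) estimate
print(statistics.pstdev(data))               # population (n) estimate, smaller
print(transform_standard(data, mean, std))
```

For small samples the two estimates differ noticeably, which is why the description spells out which one defines the "unit std".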
(stop-words-remover params)A feature transformer that filters out stop words from input.
Since 3.0.0, StopWordsRemover can filter stop words from multiple columns at once by setting the inputCols parameter. Note that when both the inputCol and inputCols parameters are set, an Exception will be thrown.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/StopWordsRemover.html
Timestamp: 2020-10-19T01:56:16.540Z
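A minimal sketch of stop-word filtering over one or several token columns follows. It is plain Python rather than the Spark transformer: the function names, the `case_sensitive` flag, and the representation of columns as a dict of token lists are all assumptions made for illustration.

```python
def remove_stop_words(tokens, stop_words, case_sensitive=False):
    """Hypothetical helper: drop every token that is a stop word."""
    if case_sensitive:
        blocked = set(stop_words)
        return [t for t in tokens if t not in blocked]
    blocked = {w.lower() for w in stop_words}
    return [t for t in tokens if t.lower() not in blocked]


def remove_stop_words_multi(columns, stop_words):
    """Apply the same filter to several named token columns at once."""
    return {
        name: remove_stop_words(tokens, stop_words)
        for name, tokens in columns.items()
    }


stop_words = ["the", "a", "is"]
columns = {
    "title": ["The", "quick", "fox"],
    "body": ["a", "fox", "is", "quick"],
}
print(remove_stop_words_multi(columns, stop_words))
```

Each column is filtered independently with the same stop-word list, which is the idea behind supplying several input columns in one step.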
(string-indexer params)A label indexer that maps string column(s) of labels to ML column(s) of label indices. If the input columns are numeric, we cast them to string and index the string values. The indices are in [0, numLabels). By default, this is ordered by label frequencies so the most frequent label gets index 0. The ordering behavior is controlled by setting stringOrderType.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/StringIndexer.html
Timestamp: 2020-10-19T01:56:16.905Z
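The default frequency-based ordering of the string indexer can be sketched in a few lines of plain Python. This is an illustration only: `fit_string_indexer` is a hypothetical name, and breaking ties alphabetically is an assumption that the description above does not state.

```python
from collections import Counter


def fit_string_indexer(labels):
    """Map each distinct label to an index in [0, numLabels).

    Labels are cast to strings first, then ordered by descending
    frequency, so the most frequent label gets index 0.
    Ties are broken alphabetically (an assumption of this sketch).
    """
    counts = Counter(str(label) for label in labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}


labels = ["a", "b", "c", "a", "a", "c"]
mapping = fit_string_indexer(labels)
print(mapping)
print([mapping[label] for label in labels])
```

Here "a" occurs three times, "c" twice, and "b" once, so they receive indices 0, 1, and 2 respectively.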
(tokeniser params)A tokenizer that converts the input string to lowercase and then splits it on whitespace.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Tokenizer.html
Timestamp: 2020-10-19T01:56:17.265Z
(tokenizer params)A tokenizer that converts the input string to lowercase and then splits it on whitespace.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Tokenizer.html
Timestamp: 2020-10-19T01:56:17.265Z
(vector-assembler params)A feature transformer that merges multiple columns into a vector column.
This requires one pass over the entire dataset. If column lengths must be inferred from the data, an additional call to the Dataset 'first' method is required; see the 'handleInvalid' parameter.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/VectorAssembler.html
Timestamp: 2020-10-19T01:56:17.622Z
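Merging several columns into one vector column amounts to concatenating their values in a fixed order. The sketch below shows that idea for a single row in plain Python; `assemble` is a hypothetical name, rows are modelled as dicts, and vector-valued columns are modelled as lists or tuples, none of which reflects Spark's internal representation.

```python
def assemble(row, input_cols):
    """Concatenate scalar and vector-valued columns into one flat vector."""
    vector = []
    for name in input_cols:
        value = row[name]
        if isinstance(value, (list, tuple)):
            # A vector-valued column contributes all of its elements.
            vector.extend(float(x) for x in value)
        else:
            # A scalar column contributes a single element.
            vector.append(float(value))
    return vector


row = {"hour": 18, "user_features": [0.0, 10.0, 0.5], "clicked": 1.0}
print(assemble(row, ["hour", "user_features", "clicked"]))
```

The length of the output depends on the length of each vector-valued input, which is why the sizes of those columns have to be known or inferred.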
(vector-indexer params)Class for indexing categorical feature columns in a dataset of Vector.
This has 2 usage modes:
This returns a model which can transform categorical features to use 0-based indices.
Index stability:
TODO: Future extensions: The following functionality is planned for the future:
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/VectorIndexer.html
Timestamp: 2020-10-19T01:56:18.174Z
(vector-size-hint params)A feature transformer that adds size information to the metadata of a vector column. VectorAssembler needs size information for its input columns and cannot be used on streaming dataframes without this metadata.
Note: VectorSizeHint modifies inputCol to include size metadata and does not have an outputCol.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/VectorSizeHint.html
Timestamp: 2020-10-19T01:56:18.723Z
(word-2-vec params)Word2Vec trains a model of Map(String, Vector), i.e. it transforms a word into a code for further natural language processing or machine learning processes.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Word2Vec.html
Timestamp: 2020-10-19T01:56:19.459Z
(word2vec params)Word2Vec trains a model of Map(String, Vector), i.e. it transforms a word into a code for further natural language processing or machine learning processes.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Word2Vec.html
Timestamp: 2020-10-19T01:56:19.459Z