
zero-one.geni.ml.feature


binariser (clj)

(binariser params)

Binarize a column of continuous features given a threshold.

Since 3.0.0, Binarizer can map multiple columns at once by setting the inputCols parameter. Note that when both the inputCol and inputCols parameters are set, an Exception will be thrown. The threshold parameter is used for single column usage, and thresholds is for multiple columns.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Binarizer.html

Timestamp: 2020-10-19T01:56:05.331Z

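For example, a minimal single-column sketch (the kebab-case keys are an assumed mapping of the threshold, inputCol and outputCol parameters named above; ml/transform and ratings-df are likewise an assumed helper and a placeholder dataframe):

(require '[zero-one.geni.ml :as ml]
         '[zero-one.geni.ml.feature :as feature])

;; Values above 0.5 become 1.0, everything else 0.0 (assumed param keys).
(def binarise-rating
  (feature/binariser {:input-col "rating" :output-col "label" :threshold 0.5}))

;; Apply to a dataframe that has a numeric "rating" column (placeholder).
(ml/transform ratings-df binarise-rating)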

binarizer (clj)

(binarizer params)

Binarize a column of continuous features given a threshold.

Since 3.0.0, Binarizer can map multiple columns at once by setting the inputCols parameter. Note that when both the inputCol and inputCols parameters are set, an Exception will be thrown. The threshold parameter is used for single column usage, and thresholds is for multiple columns.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Binarizer.html

Timestamp: 2020-10-19T01:56:05.331Z


bucketed-random-projection-lsh (clj)

(bucketed-random-projection-lsh params)

This BucketedRandomProjectionLSH implements Locality Sensitive Hashing functions for Euclidean distance metrics.

The input is dense or sparse vectors, each of which represents a point in the Euclidean distance space. The output will be vectors of configurable dimension. Hash values in the same dimension are calculated by the same hash function.

References:

  1. Wikipedia on Stable Distributions
  2. Wang, Jingdong et al. "Hashing for similarity search: A survey." arXiv preprint arXiv:1408.2927 (2014).

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/BucketedRandomProjectionLSH.html

Timestamp: 2020-10-19T01:56:05.693Z

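A hedged construction sketch (aliases as in the binariser example; bucketLength and numHashTables are Spark parameter names assumed here, not taken from this page). The resulting estimator would then be fitted to a dataframe of vectors before querying neighbours:

(feature/bucketed-random-projection-lsh
  {:input-col "features" :output-col "hashes"
   :bucket-length 2.0 :num-hash-tables 3})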

bucketiser (clj)

(bucketiser params)

Bucketizer maps a column of continuous features to a column of feature buckets.

Since 2.3.0, Bucketizer can map multiple columns at once by setting the inputCols parameter. Note that when both the inputCol and inputCols parameters are set, an Exception will be thrown. The splits parameter is only used for single column usage, and splitsArray is for multiple columns.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Bucketizer.html

Timestamp: 2020-10-19T01:56:06.060Z

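For single-column usage, splits must be increasing and cover the value range; a hedged sketch with open-ended buckets (assumed kebab-case keys, aliases as in the binariser example):

;; ##-Inf / ##Inf are Clojure's literals for negative/positive infinity.
(feature/bucketiser
  {:input-col "age" :output-col "age-bucket"
   :splits [##-Inf 18.0 35.0 65.0 ##Inf]})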

bucketizer (clj)

(bucketizer params)

Bucketizer maps a column of continuous features to a column of feature buckets.

Since 2.3.0, Bucketizer can map multiple columns at once by setting the inputCols parameter. Note that when both the inputCol and inputCols parameters are set, an Exception will be thrown. The splits parameter is only used for single column usage, and splitsArray is for multiple columns.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Bucketizer.html

Timestamp: 2020-10-19T01:56:06.060Z


chi-sq-selector (clj)

(chi-sq-selector params)

Chi-Squared feature selection, which selects categorical features to use for predicting a categorical label. The selector supports different selection methods: numTopFeatures, percentile, fpr, fdr, fwe.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/ChiSqSelector.html

Timestamp: 2020-10-19T01:56:06.428Z

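A hedged sketch using the numTopFeatures selection method (featuresCol/labelCol are assumed Spark parameter names; ml/fit and training-df are an assumed helper and a placeholder):

(def selector
  (feature/chi-sq-selector
    {:features-col "features" :label-col "label"
     :output-col "selected" :num-top-features 10}))

;; Fit on labelled training data to obtain the selector model.
(ml/fit training-df selector)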

count-vectoriser (clj)

(count-vectoriser params)

Extracts a vocabulary from document collections and generates a CountVectorizerModel.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/CountVectorizer.html

Timestamp: 2020-10-19T01:56:06.801Z

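A hedged sketch; vocabSize and minDF are Spark parameter names assumed here, and the estimator is fitted to produce the CountVectorizerModel:

(feature/count-vectoriser
  {:input-col "tokens" :output-col "term-counts"
   :vocab-size 10000 :min-df 2})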

count-vectorizer (clj)

(count-vectorizer params)

Extracts a vocabulary from document collections and generates a CountVectorizerModel.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/CountVectorizer.html

Timestamp: 2020-10-19T01:56:06.801Z


dct (clj)

(dct params)

A feature transformer that takes the 1D discrete cosine transform of a real vector. No zero padding is performed on the input vector. It returns a real vector of the same length representing the DCT. The return vector is scaled such that the transform matrix is unitary (aka scaled DCT-II).

More information on DCT-II in Discrete cosine transform (Wikipedia).

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/DCT.html

Timestamp: 2020-10-19T01:56:07.160Z


discrete-cosine-transform (clj)

(discrete-cosine-transform params)

A feature transformer that takes the 1D discrete cosine transform of a real vector. No zero padding is performed on the input vector. It returns a real vector of the same length representing the DCT. The return vector is scaled such that the transform matrix is unitary (aka scaled DCT-II).

More information on DCT-II in Discrete cosine transform (Wikipedia).

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/DCT.html

Timestamp: 2020-10-19T01:56:07.160Z


elementwise-product (clj)

(elementwise-product params)

Outputs the Hadamard product (i.e., the element-wise product) of each input vector with a provided "weight" vector. In other words, it scales each column of the dataset by a scalar multiplier.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/ElementwiseProduct.html

Timestamp: 2020-10-19T01:56:07.551Z


feature-hasher (clj)

(feature-hasher params)

Feature hashing projects a set of categorical or numerical features into a feature vector of specified dimension (typically substantially smaller than that of the original feature space). This is done using the hashing trick (https://en.wikipedia.org/wiki/Feature_hashing) to map features to indices in the feature vector.

The FeatureHasher transformer operates on multiple columns. Each column may contain either numeric or categorical features. Behavior and handling of column data types is as follows:

  - Numeric columns: For numeric features, the hash value of the column name is used to map the feature value to its index in the feature vector. By default, numeric features are not treated as categorical (even when they are integers). To treat them as categorical, specify the relevant columns in categoricalCols.
  - String columns: For categorical features, the hash value of the string "column_name=value" is used to map to the vector index, with an indicator value of 1.0. Thus, categorical features are "one-hot" encoded (similarly to using OneHotEncoder with dropLast=false).
  - Boolean columns: Boolean values are treated in the same way as string columns. That is, boolean features are represented as "column_name=true" or "column_name=false", with an indicator value of 1.0.

Null (missing) values are ignored (implicitly zero in the resulting feature vector).

The hash function used here is also the MurmurHash 3 used in HashingTF. Since a simple modulo on the hashed value is used to determine the vector index, it is advisable to use a power of two as the numFeatures parameter; otherwise the features will not be mapped evenly to the vector indices.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/FeatureHasher.html

Timestamp: 2020-10-19T01:56:07.938Z

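A hedged sketch using a power-of-two numFeatures as advised above (kebab-case keys assumed, aliases as in the binariser example):

(feature/feature-hasher
  {:input-cols ["age" "clicked" "country"]
   :output-col "features"
   :num-features 256                ;; power of two, per the note above
   :categorical-cols ["age"]})      ;; treat the integer column as categorical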

hashing-tf (clj)

(hashing-tf params)

Maps a sequence of terms to their term frequencies using the hashing trick. Currently we use Austin Appleby's MurmurHash 3 algorithm (MurmurHash3_x86_32) to calculate the hash code value for the term object. Since a simple modulo is used to transform the hash function to a column index, it is advisable to use a power of two as the numFeatures parameter; otherwise the features will not be mapped evenly to the columns.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/HashingTF.html

Timestamp: 2020-10-19T01:56:08.308Z

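A hedged sketch with a power-of-two numFeatures, as advised above:

(feature/hashing-tf
  {:input-col "tokens" :output-col "raw-features" :num-features 1024})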

idf (clj)

(idf params)

Compute the Inverse Document Frequency (IDF) given a collection of documents.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/IDF.html

Timestamp: 2020-10-19T01:56:08.857Z

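IDF is an estimator fitted on term-frequency vectors (for example, the hashing-tf output above); minDocFreq is a Spark parameter name assumed here, and ml/fit and tf-df are an assumed helper and a placeholder:

(def idf-estimator
  (feature/idf {:input-col "raw-features" :output-col "features" :min-doc-freq 1}))

(ml/fit tf-df idf-estimator)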

imputer (clj)

(imputer params)

Imputation estimator for completing missing values, either using the mean or the median of the columns in which the missing values are located. The input columns should be of numeric type. Currently Imputer does not support categorical features (SPARK-15041) and possibly creates incorrect values for a categorical feature.

Note that when an input column is integer, the imputed value is cast (truncated) to an integer type. For example, if the input column is IntegerType (1, 2, 4, null), the output will be IntegerType (1, 2, 4, 2) after mean imputation.

Note that the mean/median value is computed after filtering out missing values. All Null values in the input columns are treated as missing, and so are also imputed. For computing median, DataFrameStatFunctions.approxQuantile is used with a relative error of 0.001.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Imputer.html

Timestamp: 2020-10-19T01:56:09.241Z

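A hedged sketch imputing column medians; strategy, inputCols and outputCols are Spark parameter names assumed here:

(feature/imputer
  {:input-cols  ["age" "income"]
   :output-cols ["age-imputed" "income-imputed"]
   :strategy    "median"})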

index-to-string (clj)

(index-to-string params)

A Transformer that maps a column of indices back to a new column of corresponding string values. The index-string mapping is either from the ML attributes of the input column, or from user-supplied labels (which take precedence over ML attributes).

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/IndexToString.html

Timestamp: 2020-10-19T01:56:09.599Z


interaction (clj)

(interaction params)

Implements the feature interaction transform. This transformer takes in Double and Vector type columns and outputs a flattened vector of their feature interactions. To handle interaction, we first one-hot encode any nominal features. Then, a vector of the feature cross-products is produced.

For example, given the input feature values Double(2) and Vector(3, 4), the output would be Vector(6, 8) if all input features were numeric. If the first feature was instead nominal with four categories, the output would then be Vector(0, 0, 0, 0, 3, 4, 0, 0).

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Interaction.html

Timestamp: 2020-10-19T01:56:09.965Z


max-abs-scaler (clj)

(max-abs-scaler params)

Rescale each feature individually to range [-1, 1] by dividing through the largest maximum absolute value in each feature. It does not shift/center the data, and thus does not destroy any sparsity.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/MaxAbsScaler.html

Timestamp: 2020-10-19T01:56:10.658Z


min-hash-lsh (clj)

(min-hash-lsh params)

LSH class for Jaccard distance.

The input can be dense or sparse vectors, but it is more efficient if it is sparse. For example, Vectors.sparse(10, Array((2, 1.0), (3, 1.0), (5, 1.0))) means there are 10 elements in the space. This set contains elements 2, 3, and 5. Also, any input vector must have at least 1 non-zero index, and all non-zero values are treated as binary "1" values.

References: Wikipedia on MinHash

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/MinHashLSH.html

Timestamp: 2020-10-19T01:56:11.035Z


min-max-scaler (clj)

(min-max-scaler params)

Rescale each feature individually to a common range [min, max] linearly using column summary statistics, which is also known as min-max normalization or Rescaling. The rescaled value for feature E is calculated as \(Rescaled(e_i) = \frac{e_i - E_{min}}{E_{max} - E_{min}} * (max - min) + min\).

For the case \(E_{max} == E_{min}\), \(Rescaled(e_i) = 0.5 * (max + min)\).

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/MinMaxScaler.html

Timestamp: 2020-10-19T01:56:11.407Z

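A hedged sketch spelling out the default [0, 1] target range (min and max are the Spark parameter names):

(feature/min-max-scaler
  {:input-col "features" :output-col "scaled-features" :min 0.0 :max 1.0})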

n-gram (clj)

(n-gram params)

A feature transformer that converts the input array of strings into an array of n-grams. Null values in the input array are ignored. It returns an array of n-grams where each n-gram is represented by a space-separated string of words.

When the input is empty, an empty array is returned. When the input array length is less than n (number of elements per n-gram), no n-grams are returned.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/NGram.html

Timestamp: 2020-10-19T01:56:11.769Z

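A hedged sketch producing bigrams from a tokenised column (n is the Spark parameter name):

(feature/n-gram {:input-col "tokens" :output-col "bigrams" :n 2})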

normaliser (clj)

(normaliser params)

Normalize a vector to have unit norm using the given p-norm.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Normalizer.html

Timestamp: 2020-10-19T01:56:12.133Z


normalizer (clj)

(normalizer params)

Normalize a vector to have unit norm using the given p-norm.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Normalizer.html

Timestamp: 2020-10-19T01:56:12.133Z


one-hot-encoder (clj)

(one-hot-encoder params)

A one-hot encoder that maps a column of category indices to a column of binary vectors, with at most a single one-value per row that indicates the input category index. For example with 5 categories, an input value of 2.0 would map to an output vector of [0.0, 0.0, 1.0, 0.0]. The last category is not included by default (configurable via dropLast), because it makes the vector entries sum up to one, and hence linearly dependent. So an input value of 4.0 maps to [0.0, 0.0, 0.0, 0.0].

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/OneHotEncoder.html

Timestamp: 2020-10-19T01:56:12.690Z

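A hedged sketch keeping the last category via dropLast=false, as discussed above (inputCols/outputCols keys assumed):

(feature/one-hot-encoder
  {:input-cols ["colour-index"] :output-cols ["colour-vec"] :drop-last false})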

pca (clj)

(pca params)

PCA trains a model to project vectors to a lower dimensional space of the top k principal components.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/PCA.html

Timestamp: 2020-10-19T01:56:13.048Z

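A hedged sketch keeping the top 3 principal components (k is the Spark parameter name; PCA is an estimator and is fitted before transforming):

(feature/pca {:input-col "features" :output-col "pca-features" :k 3})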

polynomial-expansion (clj)

(polynomial-expansion params)

Perform feature expansion in a polynomial space. As described in Polynomial expansion (Wikipedia), "In mathematics, an expansion of a product of sums expresses it as a sum of products by using the fact that multiplication distributes over addition". Take a 2-variable feature vector as an example: (x, y); if we want to expand it with degree 2, then we get (x, x * x, y, x * y, y * y).

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/PolynomialExpansion.html

Timestamp: 2020-10-19T01:56:13.405Z

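A hedged sketch of the degree-2 expansion described above:

(feature/polynomial-expansion
  {:input-col "features" :output-col "poly-features" :degree 2})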

quantile-discretiser (clj)

(quantile-discretiser params)

QuantileDiscretizer takes a column with continuous features and outputs a column with binned categorical features. The number of bins can be set using the numBuckets parameter. It is possible that the number of buckets used will be smaller than this value, for example, if there are too few distinct values of the input to create enough distinct quantiles. Since 2.3.0, QuantileDiscretizer can map multiple columns at once by setting the inputCols parameter. If both of the inputCol and inputCols parameters are set, an Exception will be thrown. To specify the number of buckets for each column, the numBucketsArray parameter can be set, or if the number of buckets should be the same across columns, numBuckets can be set as a convenience. Note that in multiple columns case, relative error is applied to all columns.

NaN handling: null and NaN values will be ignored from the column during QuantileDiscretizer fitting. This will produce a Bucketizer model for making predictions. During the transformation, Bucketizer will raise an error when it finds NaN values in the dataset, but the user can also choose to either keep or remove NaN values within the dataset by setting handleInvalid. If the user chooses to keep NaN values, they will be handled specially and placed into their own bucket, for example, if 4 buckets are used, then non-NaN data will be put into buckets[0-3], but NaNs will be counted in a special bucket[4].

Algorithm: The bin ranges are chosen using an approximate algorithm (see the documentation for org.apache.spark.sql.DataFrameStatFunctions.approxQuantile for a detailed description). The precision of the approximation can be controlled with the relativeError parameter. The lower and upper bin bounds will be -Infinity and +Infinity, covering all real values.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/QuantileDiscretizer.html

Timestamp: 2020-10-19T01:56:13.770Z

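A hedged single-column sketch using the numBuckets, relativeError and handleInvalid parameters named above (NaNs are kept in their own bucket):

(feature/quantile-discretiser
  {:input-col "income" :output-col "income-bucket"
   :num-buckets 4 :relative-error 0.001 :handle-invalid "keep"})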

quantile-discretizer (clj)

(quantile-discretizer params)

QuantileDiscretizer takes a column with continuous features and outputs a column with binned categorical features. The number of bins can be set using the numBuckets parameter. It is possible that the number of buckets used will be smaller than this value, for example, if there are too few distinct values of the input to create enough distinct quantiles. Since 2.3.0, QuantileDiscretizer can map multiple columns at once by setting the inputCols parameter. If both of the inputCol and inputCols parameters are set, an Exception will be thrown. To specify the number of buckets for each column, the numBucketsArray parameter can be set, or if the number of buckets should be the same across columns, numBuckets can be set as a convenience. Note that in multiple columns case, relative error is applied to all columns.

NaN handling: null and NaN values will be ignored from the column during QuantileDiscretizer fitting. This will produce a Bucketizer model for making predictions. During the transformation, Bucketizer will raise an error when it finds NaN values in the dataset, but the user can also choose to either keep or remove NaN values within the dataset by setting handleInvalid. If the user chooses to keep NaN values, they will be handled specially and placed into their own bucket, for example, if 4 buckets are used, then non-NaN data will be put into buckets[0-3], but NaNs will be counted in a special bucket[4].

Algorithm: The bin ranges are chosen using an approximate algorithm (see the documentation for org.apache.spark.sql.DataFrameStatFunctions.approxQuantile for a detailed description). The precision of the approximation can be controlled with the relativeError parameter. The lower and upper bin bounds will be -Infinity and +Infinity, covering all real values.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/QuantileDiscretizer.html

Timestamp: 2020-10-19T01:56:13.770Z


regex-tokeniser (clj)

(regex-tokeniser params)

A regex based tokenizer that extracts tokens either by using the provided regex pattern to split the text (default) or repeatedly matching the regex (if gaps is false). Optional parameters also allow filtering tokens using a minimal length. It returns an array of strings that can be empty.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/RegexTokenizer.html

Timestamp: 2020-10-19T01:56:14.327Z

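A hedged sketch splitting on runs of non-word characters; pattern, gaps and minTokenLength are Spark parameter names assumed here:

(feature/regex-tokeniser
  {:input-col "sentence" :output-col "tokens"
   :pattern "\\W+" :gaps true :min-token-length 2})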

regex-tokenizer (clj)

(regex-tokenizer params)

A regex based tokenizer that extracts tokens either by using the provided regex pattern to split the text (default) or repeatedly matching the regex (if gaps is false). Optional parameters also allow filtering tokens using a minimal length. It returns an array of strings that can be empty.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/RegexTokenizer.html

Timestamp: 2020-10-19T01:56:14.327Z


robust-scaler (clj)

(robust-scaler params)

Scale features using statistics that are robust to outliers. RobustScaler removes the median and scales the data according to the quantile range. The quantile range is by default IQR (Interquartile Range, quantile range between the 1st quartile = 25th quantile and the 3rd quartile = 75th quantile) but can be configured. Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Median and quantile range are then stored to be used on later data using the transform method.

Standardization of a dataset is a common requirement for many machine learning estimators. Typically this is done by removing the mean and scaling to unit variance. However, outliers can often influence the sample mean / variance in a negative way. In such cases, the median and the quantile range often give better results. Note that NaN values are ignored in the computation of medians and ranges.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/RobustScaler.html

Timestamp: 2020-10-19T01:56:15.260Z


sql-transformer (clj)

(sql-transformer params)

Implements the transformations which are defined by SQL statement. Currently we only support SQL syntax like 'SELECT ... FROM __THIS__ ...' where '__THIS__' represents the underlying table of the input dataset. The select clause specifies the fields, constants, and expressions to display in the output; it can be any select clause that Spark SQL supports. Users can also use Spark SQL built-in functions and UDFs to operate on these selected columns. For example, SQLTransformer supports statements like "SELECT a, a + b AS a_b FROM __THIS__".

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/SQLTransformer.html

Timestamp: 2020-10-19T01:56:15.611Z

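A hedged sketch; statement is the Spark parameter name assumed here, and __THIS__ stands for the input dataset:

(feature/sql-transformer
  {:statement "SELECT *, a + b AS a_b FROM __THIS__"})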

standard-scaler (clj)

(standard-scaler params)

Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set.

The "unit std" is computed using the corrected sample standard deviation, which is computed as the square root of the unbiased sample variance.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/StandardScaler.html

Timestamp: 2020-10-19T01:56:16.163Z

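A hedged sketch; withMean and withStd are Spark parameter names assumed here, and the resulting estimator is fitted before use:

(feature/standard-scaler
  {:input-col "features" :output-col "scaled-features"
   :with-mean true :with-std true})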

stop-words-remover (clj)

(stop-words-remover params)

A feature transformer that filters out stop words from input.

Since 3.0.0, StopWordsRemover can filter out multiple columns at once by setting the inputCols parameter. Note that when both the inputCol and inputCols parameters are set, an Exception will be thrown.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/StopWordsRemover.html

Timestamp: 2020-10-19T01:56:16.540Z

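A hedged single-column sketch; stopWords and caseSensitive are Spark parameter names assumed here:

(feature/stop-words-remover
  {:input-col "tokens" :output-col "filtered-tokens"
   :stop-words ["a" "an" "the"] :case-sensitive false})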

string-indexer (clj)

(string-indexer params)

A label indexer that maps string column(s) of labels to ML column(s) of label indices. If the input columns are numeric, we cast them to string and index the string values. The indices are in [0, numLabels). By default, this is ordered by label frequencies so the most frequent label gets index 0. The ordering behavior is controlled by setting stringOrderType.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/StringIndexer.html

Timestamp: 2020-10-19T01:56:16.905Z

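A hedged sketch ordering labels by descending frequency (stringOrderType and handleInvalid as named above; ml/fit and colours-df are an assumed helper and a placeholder):

(def indexer
  (feature/string-indexer
    {:input-col "colour" :output-col "colour-index"
     :string-order-type "frequencyDesc" :handle-invalid "keep"}))

(ml/fit colours-df indexer)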

tokeniser (clj)

(tokeniser params)

A tokenizer that converts the input string to lowercase and then splits it by white spaces.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Tokenizer.html

Timestamp: 2020-10-19T01:56:17.265Z


tokenizer (clj)

(tokenizer params)

A tokenizer that converts the input string to lowercase and then splits it by white spaces.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Tokenizer.html

Timestamp: 2020-10-19T01:56:17.265Z


vector-assembler (clj)

(vector-assembler params)

A feature transformer that merges multiple columns into a vector column.

This requires one pass over the entire dataset. In case we need to infer column lengths from the data we require an additional call to the 'first' Dataset method, see 'handleInvalid' parameter.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/VectorAssembler.html

Timestamp: 2020-10-19T01:56:17.622Z

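A hedged sketch assembling numeric and vector columns into a single feature vector (handleInvalid as named above):

(feature/vector-assembler
  {:input-cols ["age" "income" "colour-vec"] :output-col "features"
   :handle-invalid "skip"})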

vector-indexer (clj)

(vector-indexer params)

Class for indexing categorical feature columns in a dataset of Vector.

This has 2 usage modes:

This returns a model which can transform categorical features to use 0-based indices.

Index stability:

TODO: Future extensions: The following functionality is planned for the future:

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/VectorIndexer.html

Timestamp: 2020-10-19T01:56:18.174Z

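A hedged sketch; maxCategories is a Spark parameter name assumed here (features with more distinct values are treated as continuous), and the estimator is fitted before transforming:

(feature/vector-indexer
  {:input-col "features" :output-col "indexed-features" :max-categories 10})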

vector-size-hint (clj)

(vector-size-hint params)

A feature transformer that adds size information to the metadata of a vector column. VectorAssembler needs size information for its input columns and cannot be used on streaming dataframes without this metadata.

Note: VectorSizeHint modifies inputCol to include size metadata and does not have an outputCol.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/VectorSizeHint.html

Timestamp: 2020-10-19T01:56:18.723Z


word-2-vec (clj)

(word-2-vec params)

Word2Vec trains a model of Map(String, Vector), i.e. transforms a word into a code for further natural language processing or machine learning process.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Word2Vec.html

Timestamp: 2020-10-19T01:56:19.459Z

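A hedged sketch; vectorSize and minCount are Spark parameter names assumed here, and the estimator is fitted on a column of tokenised text:

(feature/word-2-vec
  {:input-col "tokens" :output-col "embedding"
   :vector-size 100 :min-count 5})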

word2vec (clj)

(word2vec params)

Word2Vec trains a model of Map(String, Vector), i.e. transforms a word into a code for further natural language processing or machine learning process.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Word2Vec.html

Timestamp: 2020-10-19T01:56:19.459Z

