
zero-one.geni.ml


aft-survival-regression (clj)

(aft-survival-regression params)

Fit a parametric survival regression model named accelerated failure time (AFT) model (see Accelerated failure time model (Wikipedia)) based on the Weibull distribution of the survival time.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.html

Timestamp: 2020-10-19T01:55:51.453Z

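A minimal usage sketch (not from the geni docs): survival-df is a hypothetical DataFrame that already has "features", "label" and "censor" columns, and the parameter keys are assumed to follow geni's kebab-case mapping of the Spark params.

(require '[zero-one.geni.ml :as ml])

(def aft
  (ml/aft-survival-regression {:features-col "features"
                               :label-col    "label"
                               :censor-col   "censor"}))

;; survival-df is hypothetical; fit is documented later in this section
(def aft-model (ml/fit survival-df aft))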

als (clj)

(als params)

Alternating Least Squares (ALS) matrix factorization.

ALS attempts to estimate the ratings matrix R as the product of two lower-rank matrices, X and Y, i.e. X * Yt = R. Typically these approximations are called 'factor' matrices. The general approach is iterative. During each iteration, one of the factor matrices is held constant, while the other is solved for using least squares. The newly-solved factor matrix is then held constant while solving for the other factor matrix.

This is a blocked implementation of the ALS factorization algorithm that groups the two sets of factors (referred to as "users" and "products") into blocks and reduces communication by only sending one copy of each user vector to each product block on each iteration, and only for the product blocks that need that user's feature vector. This is achieved by pre-computing some information about the ratings matrix to determine the "out-links" of each user (which blocks of products it will contribute to) and "in-link" information for each product (which of the feature vectors it receives from each user block it will depend on). This allows us to send only an array of feature vectors between each user block and product block, and have the product block find the users' ratings and update the products based on these messages.

For implicit preference data, the algorithm used is based on "Collaborative Filtering for Implicit Feedback Datasets", available at https://doi.org/10.1109/ICDM.2008.22, adapted for the blocked approach used here.

Essentially instead of finding the low-rank approximations to the rating matrix R, this finds the approximations for a preference matrix P where the elements of P are 1 if r is greater than 0 and 0 if r is less than or equal to 0. The ratings then act as 'confidence' values related to strength of indicated user preferences rather than explicit ratings given to items.

Note: the input rating dataset to the ALS implementation should be deterministic. Nondeterministic data can cause failures when fitting the ALS model. For example, an order-sensitive operation like sampling after a repartition makes the dataset output nondeterministic, as in dataset.repartition(2).sample(false, 0.5, 1618). Checkpointing the sampled dataset or adding a sort before sampling can help make the dataset deterministic.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/recommendation/ALS.html

Timestamp: 2020-10-19T01:56:00.419Z

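A minimal usage sketch: ratings-df and its column names are hypothetical, and the keys are assumed to follow geni's kebab-case mapping of the Spark ALS params.

(require '[zero-one.geni.ml :as ml])

(def als-estimator
  (ml/als {:user-col   "user-id"
           :item-col   "item-id"
           :rating-col "rating"
           :rank       10
           :reg-param  0.1}))

;; ratings-df is hypothetical
(def als-model (ml/fit ratings-df als-estimator))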

alternating-least-squares (clj)

(alternating-least-squares params)

Alternating Least Squares (ALS) matrix factorization.

ALS attempts to estimate the ratings matrix R as the product of two lower-rank matrices, X and Y, i.e. X * Yt = R. Typically these approximations are called 'factor' matrices. The general approach is iterative. During each iteration, one of the factor matrices is held constant, while the other is solved for using least squares. The newly-solved factor matrix is then held constant while solving for the other factor matrix.

This is a blocked implementation of the ALS factorization algorithm that groups the two sets of factors (referred to as "users" and "products") into blocks and reduces communication by only sending one copy of each user vector to each product block on each iteration, and only for the product blocks that need that user's feature vector. This is achieved by pre-computing some information about the ratings matrix to determine the "out-links" of each user (which blocks of products it will contribute to) and "in-link" information for each product (which of the feature vectors it receives from each user block it will depend on). This allows us to send only an array of feature vectors between each user block and product block, and have the product block find the users' ratings and update the products based on these messages.

For implicit preference data, the algorithm used is based on "Collaborative Filtering for Implicit Feedback Datasets", available at https://doi.org/10.1109/ICDM.2008.22, adapted for the blocked approach used here.

Essentially instead of finding the low-rank approximations to the rating matrix R, this finds the approximations for a preference matrix P where the elements of P are 1 if r is greater than 0 and 0 if r is less than or equal to 0. The ratings then act as 'confidence' values related to strength of indicated user preferences rather than explicit ratings given to items.

Note: the input rating dataset to the ALS implementation should be deterministic. Nondeterministic data can cause failures when fitting the ALS model. For example, an order-sensitive operation like sampling after a repartition makes the dataset output nondeterministic, as in dataset.repartition(2).sample(false, 0.5, 1618). Checkpointing the sampled dataset or adding a sort before sampling can help make the dataset deterministic.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/recommendation/ALS.html

Timestamp: 2020-10-19T01:56:00.419Z


approx-nearest-neighbors (clj)

(approx-nearest-neighbors dataset model key-v n-nearest)
(approx-nearest-neighbors dataset model key-v n-nearest dist-col)

Params: (dataset: Dataset[_], key: Vector, numNearestNeighbors: Int, distCol: String)

Result: Dataset[_]

Given a large dataset and an item, approximately find at most k items which have the closest distance to the item. If the outputCol is missing, the method will transform the data; if the outputCol exists, it will use the outputCol. This allows caching of the transformed data when necessary.

dataset: The dataset to search for nearest neighbors of the key.

key: Feature vector representing the item to search for.

numNearestNeighbors: The maximum number of nearest neighbors.

distCol: Output column for storing the distance between each result row and the key.

Returns: A dataset containing at most k items closest to the key. A column "distCol" is added to show the distance between each row and the key.

This method is experimental and will likely change behavior in the next release.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/MinHashLSHModel.html

Timestamp: 2020-10-19T01:56:46.799Z

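A usage sketch: fitted-lsh is a hypothetical LSH model (e.g. fitted from bucketed-random-projection-lsh below), items-df is a hypothetical DataFrame, and key-v is a Spark Vector built elsewhere.

(require '[zero-one.geni.ml :as ml])

;; the five approximate nearest neighbours of key-v in items-df,
;; with distances written to a "distance" column
(ml/approx-nearest-neighbors items-df fitted-lsh key-v 5 "distance")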

approx-nearest-neighbours (clj)

(approx-nearest-neighbours dataset model key-v n-nearest)
(approx-nearest-neighbours dataset model key-v n-nearest dist-col)

Params: (dataset: Dataset[_], key: Vector, numNearestNeighbors: Int, distCol: String)

Result: Dataset[_]

Given a large dataset and an item, approximately find at most k items which have the closest distance to the item. If the outputCol is missing, the method will transform the data; if the outputCol exists, it will use the outputCol. This allows caching of the transformed data when necessary.

The dataset to search for nearest neighbors of the key.

Feature vector representing the item to search for.

The maximum number of nearest neighbors.

Output column for storing the distance between each result row and the key.

A dataset containing at most k items closest to the key. A column "distCol" is added to show the distance between each row and the key.

This method is experimental and will likely change behavior in the next release.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/MinHashLSHModel.html

Timestamp: 2020-10-19T01:56:46.799Z


approx-similarity-join (clj)

(approx-similarity-join dataset-a dataset-b model threshold)
(approx-similarity-join dataset-a dataset-b model threshold dist-col)

Params: (datasetA: Dataset[_], datasetB: Dataset[_], threshold: Double, distCol: String)

Result: Dataset[_]

Join two datasets to approximately find all pairs of rows whose distance is smaller than the threshold. If the outputCol is missing, the method will transform the data; if the outputCol exists, it will use the outputCol. This allows caching of the transformed data when necessary.

datasetA: One of the datasets to join.

datasetB: Another dataset to join.

threshold: The threshold for the distance of row pairs.

distCol: Output column for storing the distance between each pair of rows.

Returns: A joined dataset containing pairs of rows. The original rows are in columns "datasetA" and "datasetB", and a column "distCol" is added to show the distance between each pair.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/MinHashLSHModel.html

Timestamp: 2020-10-19T01:56:46.802Z

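A usage sketch with the same hypothetical fitted-lsh model as above; df-a and df-b are hypothetical DataFrames.

(require '[zero-one.geni.ml :as ml])

;; all pairs of rows from df-a and df-b whose distance is below 0.6,
;; with the distance written to a "distance" column
(ml/approx-similarity-join df-a df-b fitted-lsh 0.6 "distance")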

association-rules (clj)

(association-rules model)

Params:

Result: DataFrame

Get association rules fitted using the minConfidence. Returns a dataframe with four fields, "antecedent", "consequent", "confidence" and "lift", where "antecedent" and "consequent" are Array[T], whereas "confidence" and "lift" are Double.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/fpm/FPGrowthModel.html

Timestamp: 2020-10-19T01:56:43.538Z

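A usage sketch: fp-model is a hypothetical model fitted with the fp-growth estimator documented later in this section.

(require '[zero-one.geni.core :as g]
         '[zero-one.geni.ml :as ml])

;; antecedent / consequent / confidence / lift as a DataFrame
(-> fp-model ml/association-rules g/show)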

best-model (clj)

(best-model model)

Params:

Result: Model[_]

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/tuning/CrossValidatorModel.html

Timestamp: 2020-10-19T01:56:45.449Z


binariser (clj)

(binariser params)

Binarize a column of continuous features given a threshold.

Since 3.0.0, Binarizer can map multiple columns at once by setting the inputCols parameter. Note that when both the inputCol and inputCols parameters are set, an Exception will be thrown. The threshold parameter is used for single column usage, and thresholds is for multiple columns.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Binarizer.html

Timestamp: 2020-10-19T01:56:05.331Z


binarizer (clj)

(binarizer params)

Binarize a column of continuous features given a threshold.

Since 3.0.0, Binarizer can map multiple columns at once by setting the inputCols parameter. Note that when both the inputCol and inputCols parameters are set, an Exception will be thrown. The threshold parameter is used for single column usage, and thresholds is for multiple columns.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Binarizer.html

Timestamp: 2020-10-19T01:56:05.331Z

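A minimal sketch, assuming geni's kebab-case parameter keys and the transform function in this namespace; scores-df is a hypothetical DataFrame with a numeric "score" column.

(require '[zero-one.geni.ml :as ml])

(def bin
  (ml/binarizer {:input-col  "score"
                 :output-col "label"
                 :threshold  0.5}))

;; scores-df is hypothetical
(ml/transform scores-df bin)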

binary-classification-evaluator (clj)

(binary-classification-evaluator params)

Evaluator for binary classification, which expects input columns rawPrediction, label and an optional weight column. The rawPrediction column can be of type double (binary 0/1 prediction, or probability of label 1) or of type vector (length-2 vector of raw predictions, scores, or label probabilities).

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.html

Timestamp: 2020-10-19T01:56:00.765Z


binary-summary (clj)

(binary-summary model)

Params:

Result: BinaryLogisticRegressionTrainingSummary

Gets summary of model on training set. An exception is thrown if hasSummary is false or it is a multiclass model.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/LogisticRegressionModel.html

Timestamp: 2020-10-19T01:56:46.093Z


bisecting-k-means (clj)

(bisecting-k-means params)

A bisecting k-means algorithm based on the paper "A comparison of document clustering techniques" by Steinbach, Karypis, and Kumar, with modification to fit Spark. The algorithm starts from a single cluster that contains all points. Iteratively it finds divisible clusters on the bottom level and bisects each of them using k-means, until there are k leaf clusters in total or no leaf clusters are divisible. The bisecting steps of clusters on the same level are grouped together to increase parallelism. If bisecting all divisible clusters on the bottom level would result in more than k leaf clusters, larger clusters get higher priority.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/BisectingKMeans.html

Timestamp: 2020-10-19T01:56:03.281Z

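A minimal sketch (features-df is a hypothetical DataFrame with a Vector "features" column; keys assumed to follow geni's kebab-case naming):

(require '[zero-one.geni.ml :as ml])

(def bkm (ml/bisecting-k-means {:k 4 :features-col "features"}))

(def bkm-model (ml/fit features-df bkm))

;; the k leaf-cluster centres (see cluster-centers below)
(ml/cluster-centers bkm-model)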

boundaries (clj)

(boundaries model)

Params:

Result: Vector

Boundaries in increasing order for which predictions are known.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/IsotonicRegressionModel.html

Timestamp: 2020-10-19T01:56:44.821Z


bucketed-random-projection-lsh (clj)

(bucketed-random-projection-lsh params)

This BucketedRandomProjectionLSH implements Locality Sensitive Hashing functions for Euclidean distance metrics.

The input is dense or sparse vectors, each of which represents a point in the Euclidean distance space. The output will be vectors of configurable dimension. Hash values in the same dimension are calculated by the same hash function.

References:

1. Wikipedia on Stable Distributions

2. Wang, Jingdong et al. "Hashing for similarity search: A survey." arXiv preprint arXiv:1408.2927 (2014).

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/BucketedRandomProjectionLSH.html

Timestamp: 2020-10-19T01:56:05.693Z

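A minimal sketch (features-df is a hypothetical DataFrame with a Vector "features" column); the fitted model can then be used with approx-nearest-neighbors and approx-similarity-join above.

(require '[zero-one.geni.ml :as ml])

(def brp
  (ml/bucketed-random-projection-lsh {:input-col       "features"
                                      :output-col      "hashes"
                                      :bucket-length   2.0
                                      :num-hash-tables 3}))

(def brp-model (ml/fit features-df brp))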

bucketiser (clj)

(bucketiser params)

Bucketizer maps a column of continuous features to a column of feature buckets.

Since 2.3.0, Bucketizer can map multiple columns at once by setting the inputCols parameter. Note that when both the inputCol and inputCols parameters are set, an Exception will be thrown. The splits parameter is only used for single column usage, and splitsArray is for multiple columns.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Bucketizer.html

Timestamp: 2020-10-19T01:56:06.060Z


bucketizer (clj)

(bucketizer params)

Bucketizer maps a column of continuous features to a column of feature buckets.

Since 2.3.0, Bucketizer can map multiple columns at once by setting the inputCols parameter. Note that when both the inputCol and inputCols parameters are set, an Exception will be thrown. The splits parameter is only used for single column usage, and splitsArray is for multiple columns.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Bucketizer.html

Timestamp: 2020-10-19T01:56:06.060Z

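A minimal sketch (people-df is a hypothetical DataFrame with a numeric "age" column; it is assumed that geni passes the Clojure vector of split points through to the Spark splits parameter):

(require '[zero-one.geni.ml :as ml])

(def bucket
  (ml/bucketizer {:input-col  "age"
                  :output-col "age-bucket"
                  :splits     [##-Inf 18.0 35.0 60.0 ##Inf]}))

;; people-df is hypothetical
(ml/transform people-df bucket)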

category-maps (clj)

(category-maps model)

Params:

Result: Map[Int, Map[Double, Int]]

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/VectorIndexerModel.html

Timestamp: 2020-10-19T01:56:31.705Z


category-sizes (clj)

(category-sizes model)

Params:

Result: Array[Int]

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/OneHotEncoderModel.html

Timestamp: 2020-10-19T01:56:28.967Z


chi-sq-selector (clj)

(chi-sq-selector params)

Chi-Squared feature selection, which selects categorical features to use for predicting a categorical label. The selector supports different selection methods: numTopFeatures, percentile, fpr, fdr, fwe.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/ChiSqSelector.html

Timestamp: 2020-10-19T01:56:06.428Z


chi-square-test (clj)

(chi-square-test dataframe features-col label-col)

Chi-square hypothesis testing for categorical data.

See Wikipedia for more information on the Chi-squared test.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/stat/ChiSquareTest$.html

Timestamp: 2020-10-19T01:55:49.886Z

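A usage sketch built directly on the arglist above; features-df is a hypothetical DataFrame with a Vector "features" column and a categorical "label" column.

(require '[zero-one.geni.core :as g]
         '[zero-one.geni.ml :as ml])

(-> (ml/chi-square-test features-df "features" "label")
    g/show)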

cluster-centers (clj)

(cluster-centers model)

Params:

Result: Array[Vector]

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/KMeansModel.html

Timestamp: 2020-10-19T01:56:36.922Z


clustering-evaluator (clj)

(clustering-evaluator params)

Evaluator for clustering results. The metric computes the Silhouette measure using the specified distance measure.

The Silhouette is a measure for the validation of the consistency within clusters. It ranges between 1 and -1, where a value close to 1 means that the points in a cluster are close to the other points in the same cluster and far from the points of the other clusters.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.html

Timestamp: 2020-10-19T01:56:01.116Z

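A minimal sketch: predictions-df is a hypothetical DataFrame produced by a fitted clustering model, and evaluate is the function documented later in this section.

(require '[zero-one.geni.ml :as ml])

(def silhouette
  (ml/clustering-evaluator {:features-col     "features"
                            :prediction-col   "prediction"
                            :distance-measure "squaredEuclidean"}))

;; the Silhouette score for the clustering
(ml/evaluate predictions-df silhouette)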

coefficient-matrix (clj)

(coefficient-matrix model)
Params: 

Result: Matrix



Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/LogisticRegressionModel.html

Timestamp: 2020-10-19T01:56:46.098Z

coefficients (clj)

(coefficients model)
Params: 

Result: Vector



Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/LinearRegressionModel.html

Timestamp: 2020-10-19T01:56:36.282Z

corr (clj, multimethod)

Column: Aggregate function: returns the Pearson Correlation Coefficient for two columns.

Dataset: Calculates the Pearson Correlation Coefficient of two columns of a DataFrame.


count-vectoriser (clj)

(count-vectoriser params)

Extracts a vocabulary from document collections and generates a CountVectorizerModel.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/CountVectorizer.html

Timestamp: 2020-10-19T01:56:06.801Z


count-vectorizer (clj)

(count-vectorizer params)

Extracts a vocabulary from document collections and generates a CountVectorizerModel.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/CountVectorizer.html

Timestamp: 2020-10-19T01:56:06.801Z


cross-validator (clj)

(cross-validator {:keys [estimator evaluator estimator-param-maps num-folds seed
                         parallelism]})

K-fold cross validation performs model selection by splitting the dataset into a set of non-overlapping randomly partitioned folds which are used as separate training and test datasets e.g., with k=3 folds, K-fold cross validation will generate 3 (training, test) dataset pairs, each of which uses 2/3 of the data for training and 1/3 for testing. Each fold is used as the test set exactly once.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/tuning/CrossValidator.html

Timestamp: 2020-10-19T01:55:48.855Z

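A construction sketch using the keys from the arglist above. The estimator and evaluator come from constructors in this namespace (logistic-regression is assumed to exist although it is outside this excerpt), the grid is shown as a single empty Spark ParamMap purely for illustration, and the exact collection type geni expects for :estimator-param-maps may differ by version; real tuning would supply a grid of candidate parameter values.

(require '[zero-one.geni.ml :as ml])
(import '(org.apache.spark.ml.param ParamMap))

(def cv
  (ml/cross-validator
    {:estimator            (ml/logistic-regression {:max-iter 10})
     :evaluator            (ml/binary-classification-evaluator {})
     :estimator-param-maps [(ParamMap.)]   ;; placeholder grid
     :num-folds            3
     :seed                 42}))

;; (ml/best-model (ml/fit training-df cv)) would return the winning sub-model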

dct (clj)

(dct params)

A feature transformer that takes the 1D discrete cosine transform of a real vector. No zero padding is performed on the input vector. It returns a real vector of the same length representing the DCT. The return vector is scaled such that the transform matrix is unitary (aka scaled DCT-II).

More information on DCT-II in Discrete cosine transform (Wikipedia).

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/DCT.html

Timestamp: 2020-10-19T01:56:07.160Z


decision-tree-classifier (clj)

(decision-tree-classifier params)

Decision tree learning algorithm (http://en.wikipedia.org/wiki/Decision_tree_learning) for classification. It supports both binary and multiclass labels, as well as both continuous and categorical features.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.html

Timestamp: 2020-10-19T01:55:55.948Z

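A minimal sketch (training-df is a hypothetical DataFrame with "features" and "label" columns; keys assumed to follow geni's kebab-case naming):

(require '[zero-one.geni.ml :as ml])

(def dt
  (ml/decision-tree-classifier {:max-depth    5
                                :label-col    "label"
                                :features-col "features"}))

(def dt-model (ml/fit training-df dt))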

decision-tree-regressor (clj)

(decision-tree-regressor params)

Decision tree learning algorithm for regression. It supports both continuous and categorical features.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/DecisionTreeRegressor.html

Timestamp: 2020-10-19T01:55:52.001Z


depth (clj)

(depth model)

Params:

Result: Int

Depth of the tree. E.g.: Depth 0 means 1 leaf node. Depth 1 means 1 internal node and 2 leaf nodes.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/DecisionTreeClassificationModel.html

Timestamp: 2020-10-19T01:56:41.586Z


describe-topics (clj)

Params: (maxTermsPerTopic: Int)

Result: DataFrame

Return the topics described by their top-weighted terms.

Maximum number of terms to collect for each topic. Default value of 10.

Local DataFrame with one topic per Row, with columns:

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/LDAModel.html

Timestamp: 2020-10-19T01:56:42.892Z


discrete-cosine-transform (clj)

(discrete-cosine-transform params)

A feature transformer that takes the 1D discrete cosine transform of a real vector. No zero padding is performed on the input vector. It returns a real vector of the same length representing the DCT. The return vector is scaled such that the transform matrix is unitary (aka scaled DCT-II).

More information on DCT-II in Discrete cosine transform (Wikipedia).

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/DCT.html

Timestamp: 2020-10-19T01:56:07.160Z


distributed? (clj)

(distributed? model)

Params:

Result: Boolean

Indicates whether this instance is of type DistributedLDAModel

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/LDAModel.html

Timestamp: 2020-10-19T01:56:42.877Z


elementwise-product (clj)

(elementwise-product params)

Outputs the Hadamard product (i.e., the element-wise product) of each input vector with a provided "weight" vector. In other words, it scales each column of the dataset by a scalar multiplier.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/ElementwiseProduct.html

Timestamp: 2020-10-19T01:56:07.551Z


estimated-doc-concentration (clj)

(estimated-doc-concentration model)

Params:

Result: Vector

Value for docConcentration estimated from data. If Online LDA was used and optimizeDocConcentration was set to false, then this returns the fixed (given) value for the docConcentration parameter.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/LDAModel.html

Timestamp: 2020-10-19T01:56:42.897Z


evaluate (clj)

(evaluate dataframe evaluator)

Params: (dataset: Dataset[_])

Result: LinearRegressionSummary

Evaluates the model on a test dataset.

Test dataset to evaluate model on.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/LinearRegressionModel.html

Timestamp: 2020-10-19T01:56:36.292Z


feature-hasher (clj)

(feature-hasher params)

Feature hashing projects a set of categorical or numerical features into a feature vector of specified dimension (typically substantially smaller than that of the original feature space). This is done using the hashing trick (https://en.wikipedia.org/wiki/Feature_hashing) to map features to indices in the feature vector.

The FeatureHasher transformer operates on multiple columns. Each column may contain either numeric or categorical features. Behavior and handling of column data types is as follows:

- Numeric columns: For numeric features, the hash value of the column name is used to map the feature value to its index in the feature vector. By default, numeric features are not treated as categorical (even when they are integers). To treat them as categorical, specify the relevant columns in categoricalCols.

- String columns: For categorical features, the hash value of the string "column_name=value" is used to map to the vector index, with an indicator value of 1.0. Thus, categorical features are "one-hot" encoded (similarly to using OneHotEncoder with dropLast=false).

- Boolean columns: Boolean values are treated in the same way as string columns. That is, boolean features are represented as "column_name=true" or "column_name=false", with an indicator value of 1.0.

Null (missing) values are ignored (implicitly zero in the resulting feature vector).

The hash function used here is also the MurmurHash 3 used in HashingTF. Since a simple modulo on the hashed value is used to determine the vector index, it is advisable to use a power of two as the numFeatures parameter; otherwise the features will not be mapped evenly to the vector indices.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/FeatureHasher.html

Timestamp: 2020-10-19T01:56:07.938Z

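A minimal sketch (cars-df and its column names are hypothetical); numFeatures is kept at a power of two, as recommended above.

(require '[zero-one.geni.ml :as ml])

(def hasher
  (ml/feature-hasher {:input-cols       ["make" "colour" "doors" "automatic"]
                      :categorical-cols ["doors"]
                      :num-features     1024
                      :output-col       "features"}))

;; cars-df is hypothetical
(ml/transform cars-df hasher)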

feature-importances (clj)

(feature-importances model)

Params:

Result: Vector

Estimate of the importance of each feature.

Each feature's importance is the average of its importance across all trees in the ensemble. The importance vector is normalized to sum to 1. This method is suggested by Hastie et al. (Hastie, Tibshirani, Friedman. "The Elements of Statistical Learning, 2nd Edition." 2001.) and follows the implementation from scikit-learn.

DecisionTreeClassificationModel.featureImportances

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/RandomForestClassificationModel.html

Timestamp: 2020-10-19T01:56:37.595Z


features-col (clj)

(features-col model)
Params: 

Result: String



Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/LinearRegressionModel.html

Timestamp: 2020-10-19T01:56:36.314Z

find-frequent-sequential-patterns (clj)

(find-frequent-sequential-patterns dataset prefix-span)

Params: (dataset: Dataset[_])

Result: DataFrame

Finds the complete set of frequent sequential patterns in the input sequences of itemsets.

A dataset or a dataframe containing a sequence column which is

A DataFrame that contains columns of sequence and corresponding frequency. The schema of it will be:

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/fpm/PrefixSpan.html

Timestamp: 2020-10-19T01:56:35.709Z


find-patterns (clj)

(find-patterns dataset prefix-span)

Params: (dataset: Dataset[_])

Result: DataFrame

Finds the complete set of frequent sequential patterns in the input sequences of itemsets.

A dataset or a dataframe containing a sequence column which is

A DataFrame that contains columns of sequence and corresponding frequency. The schema of it will be:

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/fpm/PrefixSpan.html

Timestamp: 2020-10-19T01:56:35.709Z


fit (clj)

(fit dataframe estimator)

Params: (dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*)

Result: M

Fits a single model to the input data with optional parameters.

dataset: input dataset

firstParamPair: the first param pair, overrides embedded params

otherParamPairs: other param pairs. These values override any specified in this Estimator's embedded ParamMap.

Returns: fitted model

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/ProbabilisticClassifier.html

Timestamp: 2020-10-19T01:56:44.210Z

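In geni the var above is normally called with just a DataFrame and an estimator, as in the arglist. A sketch, assuming the linear-regression constructor (outside this excerpt) and a hypothetical training-df with "features" and "label" columns:

(require '[zero-one.geni.ml :as ml])

(def lr (ml/linear-regression {:max-iter 10 :reg-param 0.01}))

(def lr-model (ml/fit training-df lr))

;; see coefficients above
(ml/coefficients lr-model)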

fm-classifier (clj)

(fm-classifier params)

Factorization Machines learning algorithm for classification. It supports normal gradient descent and AdamW solver.

The implementation is based upon:

S. Rendle. "Factorization machines" 2010.

FM is able to estimate interactions even in problems with huge sparsity (like advertising and recommendation system). FM formula is:

FM classification model uses logistic loss which can be solved by gradient descent method, and regularization terms like L2 are usually added to the loss function to prevent overfitting.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/FMClassifier.html

Timestamp: 2020-10-19T01:55:56.340Z


fm-regressor (clj)

(fm-regressor params)

Factorization Machines learning algorithm for regression. It supports normal gradient descent and AdamW solver.

The implementation is based upon:

S. Rendle. "Factorization machines" 2010.

FM is able to estimate interactions even in problems with huge sparsity (like advertising and recommendation system). FM formula is:

FM regression model uses MSE loss which can be solved by gradient descent method, and regularization terms like L2 are usually added to the loss function to prevent overfitting.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/FMRegressor.html

Timestamp: 2020-10-19T01:55:52.555Z


fp-growth (clj)

(fp-growth params)

A parallel FP-growth algorithm to mine frequent itemsets. The algorithm is described in Li et al., PFP: Parallel FP-Growth for Query Recommendation. PFP distributes computation in such a way that each worker executes an independent group of mining tasks. The FP-Growth algorithm is described in Han et al., Mining frequent patterns without candidate generation. Note null values in the itemsCol column are ignored during fit().

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/fpm/FPGrowth.html

Timestamp: 2020-10-19T01:55:59.709Z

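A minimal sketch (baskets-df is a hypothetical DataFrame with an array-of-strings "items" column):

(require '[zero-one.geni.ml :as ml])

(def fpg
  (ml/fp-growth {:items-col      "items"
                 :min-support    0.2
                 :min-confidence 0.6}))

(def fp-model (ml/fit baskets-df fpg))

;; see frequent-item-sets and association-rules elsewhere in this section
(ml/frequent-item-sets fp-model)
(ml/association-rules fp-model)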

freq-itemsets (clj)

(freq-itemsets model)

Params:

Result: DataFrame

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/fpm/FPGrowthModel.html

Timestamp: 2020-10-19T01:56:43.556Z


frequent-item-sets (clj)

(frequent-item-sets model)

Params:

Result: DataFrame

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/fpm/FPGrowthModel.html

Timestamp: 2020-10-19T01:56:43.556Z


frequent-pattern-growth (clj)

(frequent-pattern-growth params)

A parallel FP-growth algorithm to mine frequent itemsets. The algorithm is described in Li et al., PFP: Parallel FP-Growth for Query Recommendation. PFP distributes computation in such a way that each worker executes an independent group of mining tasks. The FP-Growth algorithm is described in Han et al., Mining frequent patterns without candidate generation. Note null values in the itemsCol column are ignored during fit().

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/fpm/FPGrowth.html

Timestamp: 2020-10-19T01:55:59.709Z


gaussian-mixture (clj)

(gaussian-mixture params)

Gaussian Mixture clustering.

This class performs expectation maximization for multivariate Gaussian Mixture Models (GMMs). A GMM represents a composite distribution of independent Gaussian distributions with associated "mixing" weights specifying each's contribution to the composite.

Given a set of sample points, this class will maximize the log-likelihood for a mixture of k Gaussians, iterating until the log-likelihood changes by less than convergenceTol, or until it has reached the max number of iterations. While this process is generally guaranteed to converge, it is not guaranteed to find a global optimum.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/GaussianMixture.html

Timestamp: 2020-10-19T01:56:03.645Z

Gaussian Mixture clustering.

This class performs expectation maximization for multivariate Gaussian
Mixture Models (GMMs).  A GMM represents a composite distribution of
independent Gaussian distributions with associated "mixing" weights
specifying each's contribution to the composite.

Given a set of sample points, this class will maximize the log-likelihood
for a mixture of k Gaussians, iterating until the log-likelihood changes by
less than convergenceTol, or until it has reached the max number of iterations.
While this process is generally guaranteed to converge, it is not guaranteed
to find a global optimum.


Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/GaussianMixture.html

Timestamp: 2020-10-19T01:56:03.645Z
sourceraw docstring
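
A rough usage sketch, assuming a dataframe features-df whose :features column already holds Spark ML vectors (for example, assembled with geni's wrapper around VectorAssembler):

(require '[zero-one.geni.core :as g] '[zero-one.geni.ml :as ml])

;; Fit a 2-component mixture; :k, :max-iter and :seed mirror the Spark params.
(def gmm-model
  (ml/fit features-df
          (ml/gaussian-mixture {:k 2 :max-iter 100 :seed 1})))

;; Component means and covariances as a DataFrame (see gaussians-df below).
(g/show (ml/gaussians-df gmm-model))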

gaussians-dfclj

(gaussians-df model)

Params:

Result: DataFrame

Retrieve Gaussian distributions as a DataFrame. Each row represents a Gaussian Distribution. Two columns are defined: mean and cov. Schema:

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/GaussianMixtureModel.html

Timestamp: 2020-10-19T01:56:40.217Z

Params: 

Result: DataFrame

Retrieve Gaussian distributions as a DataFrame.
Each row represents a Gaussian Distribution.
Two columns are defined: mean and cov.
Schema:

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/GaussianMixtureModel.html

Timestamp: 2020-10-19T01:56:40.217Z
sourceraw docstring

gbt-classifierclj

(gbt-classifier params)

Gradient-Boosted Trees (GBTs) (http://en.wikipedia.org/wiki/Gradient_boosting) learning algorithm for classification. It supports binary labels, as well as both continuous and categorical features.

The implementation is based upon: J.H. Friedman. "Stochastic Gradient Boosting." 1999.

Notes on Gradient Boosting vs. TreeBoost:

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/GBTClassifier.html

Timestamp: 2020-10-19T01:55:56.899Z

Gradient-Boosted Trees (GBTs) (http://en.wikipedia.org/wiki/Gradient_boosting)
learning algorithm for classification.
It supports binary labels, as well as both continuous and categorical features.

The implementation is based upon: J.H. Friedman. "Stochastic Gradient Boosting." 1999.

Notes on Gradient Boosting vs. TreeBoost:

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/GBTClassifier.html

Timestamp: 2020-10-19T01:55:56.899Z
sourceraw docstring

gbt-regressorclj

(gbt-regressor params)

Gradient-Boosted Trees (GBTs) learning algorithm for regression. It supports both continuous and categorical features.

The implementation is based upon: J.H. Friedman. "Stochastic Gradient Boosting." 1999.

Notes on Gradient Boosting vs. TreeBoost:

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/GBTRegressor.html

Timestamp: 2020-10-19T01:55:53.108Z

Gradient-Boosted Trees (GBTs)
learning algorithm for regression.
It supports both continuous and categorical features.

The implementation is based upon: J.H. Friedman. "Stochastic Gradient Boosting." 1999.

Notes on Gradient Boosting vs. TreeBoost:

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/GBTRegressor.html

Timestamp: 2020-10-19T01:55:53.108Z
sourceraw docstring

generalised-linear-regressionclj

(generalised-linear-regression params)

Fit a Generalized Linear Model (see Generalized linear model (Wikipedia)) specified by giving a symbolic description of the linear predictor (link function) and a description of the error distribution (family). It supports "gaussian", "binomial", "poisson", "gamma" and "tweedie" as family. Valid link functions for each family are listed below. The first link function of each family is the default one.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.html

Timestamp: 2020-10-19T01:55:53.908Z

Fit a Generalized Linear Model
(see 
Generalized linear model (Wikipedia))
specified by giving a symbolic description of the linear
predictor (link function) and a description of the error distribution (family).
It supports "gaussian", "binomial", "poisson", "gamma" and "tweedie" as family.
Valid link functions for each family are listed below. The first link function of each family
is the default one.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.html

Timestamp: 2020-10-19T01:55:53.908Z
sourceraw docstring

generalized-linear-regressionclj

(generalized-linear-regression params)

Fit a Generalized Linear Model (see Generalized linear model (Wikipedia)) specified by giving a symbolic description of the linear predictor (link function) and a description of the error distribution (family). It supports "gaussian", "binomial", "poisson", "gamma" and "tweedie" as family. Valid link functions for each family are listed below. The first link function of each family is the default one.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.html

Timestamp: 2020-10-19T01:55:53.908Z

Fit a Generalized Linear Model
(see 
Generalized linear model (Wikipedia))
specified by giving a symbolic description of the linear
predictor (link function) and a description of the error distribution (family).
It supports "gaussian", "binomial", "poisson", "gamma" and "tweedie" as family.
Valid link functions for each family are listed below. The first link function of each family
is the default one.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.html

Timestamp: 2020-10-19T01:55:53.908Z
sourceraw docstring
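
A hedged sketch of a Poisson GLM with a log link, assuming a training-df with a numeric :label column and a Spark ML vector :features column:

(require '[zero-one.geni.core :as g] '[zero-one.geni.ml :as ml])

;; :family and :link mirror the Spark params of the same names.
(def glm-model
  (ml/fit training-df
          (ml/generalized-linear-regression
            {:family "poisson" :link "log" :max-iter 25 :reg-param 0.1})))

;; Adds a prediction column to the input frame.
(g/show (ml/transform training-df glm-model))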

get-features-colclj

(get-features-col model)
Params: 

Result: String



Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/LinearRegressionModel.html

Timestamp: 2020-10-19T01:56:36.314Z
sourceraw docstring

get-input-colclj

(get-input-col model)

Params:

Result: String

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/MinHashLSHModel.html

Timestamp: 2020-10-19T01:56:46.823Z

Params: 

Result: String



Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/MinHashLSHModel.html

Timestamp: 2020-10-19T01:56:46.823Z
sourceraw docstring

get-input-colsclj

(get-input-cols model)

Params:

Result: Array[String]

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/OneHotEncoderModel.html

Timestamp: 2020-10-19T01:56:28.991Z

Params: 

Result: Array[String]



Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/OneHotEncoderModel.html

Timestamp: 2020-10-19T01:56:28.991Z
sourceraw docstring

get-label-colclj

(get-label-col model)
Params: 

Result: String



Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/LinearRegressionModel.html

Timestamp: 2020-10-19T01:56:36.316Z
sourceraw docstring

get-num-treesclj

(get-num-trees model)
Params: 

Result: Int



Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/RandomForestClassificationModel.html

Timestamp: 2020-10-19T01:56:37.621Z
sourceraw docstring

get-output-colclj

(get-output-col model)

Params:

Result: String

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/MinHashLSHModel.html

Timestamp: 2020-10-19T01:56:46.826Z

Params: 

Result: String



Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/MinHashLSHModel.html

Timestamp: 2020-10-19T01:56:46.826Z
sourceraw docstring

get-output-colsclj

(get-output-cols model)

Params:

Result: Array[String]

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/OneHotEncoderModel.html

Timestamp: 2020-10-19T01:56:28.994Z

Params: 

Result: Array[String]



Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/OneHotEncoderModel.html

Timestamp: 2020-10-19T01:56:28.994Z
sourceraw docstring

get-prediction-colclj

(get-prediction-col model)
Params: 

Result: String



Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/LinearRegressionModel.html

Timestamp: 2020-10-19T01:56:36.320Z
sourceraw docstring

get-probability-colclj

(get-probability-col model)
Params: 

Result: String



Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/RandomForestClassificationModel.html

Timestamp: 2020-10-19T01:56:37.625Z
sourceraw docstring

get-raw-prediction-colclj

(get-raw-prediction-col model)
Params: 

Result: String



Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/RandomForestClassificationModel.html

Timestamp: 2020-10-19T01:56:37.626Z
sourceraw docstring

get-sizeclj

(get-size model)

Params:

Result: Int


Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/VectorSizeHint.html

Timestamp: 2020-10-19T01:56:32.378Z

Params: 

Result: Int


Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/VectorSizeHint.html

Timestamp: 2020-10-19T01:56:32.378Z
sourceraw docstring

get-thresholdsclj

(get-thresholds model)
Params: 

Result: Array[Double]



Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/RandomForestClassificationModel.html

Timestamp: 2020-10-19T01:56:37.629Z
sourceraw docstring

glmclj

(glm params)

Fit a Generalized Linear Model (see Generalized linear model (Wikipedia)) specified by giving a symbolic description of the linear predictor (link function) and a description of the error distribution (family). It supports "gaussian", "binomial", "poisson", "gamma" and "tweedie" as family. Valid link functions for each family are listed below. The first link function of each family is the default one.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.html

Timestamp: 2020-10-19T01:55:53.908Z

Fit a Generalized Linear Model
(see 
Generalized linear model (Wikipedia))
specified by giving a symbolic description of the linear
predictor (link function) and a description of the error distribution (family).
It supports "gaussian", "binomial", "poisson", "gamma" and "tweedie" as family.
Valid link functions for each family are listed below. The first link function of each family
is the default one.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.html

Timestamp: 2020-10-19T01:55:53.908Z
sourceraw docstring

gmmclj

(gmm params)

Gaussian Mixture clustering.

This class performs expectation maximization for multivariate Gaussian Mixture Models (GMMs). A GMM represents a composite distribution of independent Gaussian distributions with associated "mixing" weights specifying each's contribution to the composite.

Given a set of sample points, this class will maximize the log-likelihood for a mixture of k Gaussians, iterating until the log-likelihood changes by less than convergenceTol, or until it has reached the max number of iterations. While this process is generally guaranteed to converge, it is not guaranteed to find a global optimum.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/GaussianMixture.html

Timestamp: 2020-10-19T01:56:03.645Z

Gaussian Mixture clustering.

This class performs expectation maximization for multivariate Gaussian
Mixture Models (GMMs).  A GMM represents a composite distribution of
independent Gaussian distributions with associated "mixing" weights
specifying each's contribution to the composite.

Given a set of sample points, this class will maximize the log-likelihood
for a mixture of k Gaussians, iterating until the log-likelihood changes by
less than convergenceTol, or until it has reached the max number of iterations.
While this process is generally guaranteed to converge, it is not guaranteed
to find a global optimum.


Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/GaussianMixture.html

Timestamp: 2020-10-19T01:56:03.645Z
sourceraw docstring

hashing-tfclj

(hashing-tf params)

Maps a sequence of terms to their term frequencies using the hashing trick. Currently we use Austin Appleby's MurmurHash 3 algorithm (MurmurHash3_x86_32) to calculate the hash code value for the term object. Since a simple modulo is used to transform the hash function to a column index, it is advisable to use a power of two as the numFeatures parameter; otherwise the features will not be mapped evenly to the columns.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/HashingTF.html

Timestamp: 2020-10-19T01:56:08.308Z

Maps a sequence of terms to their term frequencies using the hashing trick.
Currently we use Austin Appleby's MurmurHash 3 algorithm (MurmurHash3_x86_32)
to calculate the hash code value for the term object.
Since a simple modulo is used to transform the hash function to a column index,
it is advisable to use a power of two as the numFeatures parameter;
otherwise the features will not be mapped evenly to the columns.


Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/HashingTF.html

Timestamp: 2020-10-19T01:56:08.308Z
sourceraw docstring
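
A sketch of the usual TF-IDF pairing with the idf estimator documented below, assuming a words-df with a :words column of string arrays (for example, produced by geni's tokenizer wrapper). Note the power-of-two :num-features, per the advice above:

(require '[zero-one.geni.core :as g] '[zero-one.geni.ml :as ml])

(def tf-stage
  (ml/hashing-tf {:input-col "words" :output-col "raw-features" :num-features 1024}))

;; Hash to term frequencies, then fit IDF on the hashed frame.
(def tf-df     (ml/transform words-df tf-stage))
(def idf-model (ml/fit tf-df (ml/idf {:input-col "raw-features" :output-col "features"})))

(g/show (ml/transform tf-df idf-model))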

idfclj

(idf params)

Compute the Inverse Document Frequency (IDF) given a collection of documents.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/IDF.html

Timestamp: 2020-10-19T01:56:08.857Z

Compute the Inverse Document Frequency (IDF) given a collection of documents.


Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/IDF.html

Timestamp: 2020-10-19T01:56:08.857Z
sourceraw docstring

idf-vectorclj

(idf-vector model)

Params:

Result: Vector

Returns the IDF vector.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/IDFModel.html

Timestamp: 2020-10-19T01:56:34.931Z

Params: 

Result: Vector

Returns the IDF vector.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/IDFModel.html

Timestamp: 2020-10-19T01:56:34.931Z
sourceraw docstring

imputerclj

(imputer params)

Imputation estimator for completing missing values, either using the mean or the median of the columns in which the missing values are located. The input columns should be of numeric type. Currently Imputer does not support categorical features (SPARK-15041) and possibly creates incorrect values for a categorical feature.

Note that when an input column is of integer type, the imputed value is cast (truncated) to an integer type. For example, if the input column is IntegerType (1, 2, 4, null), the output will be IntegerType (1, 2, 4, 2) after mean imputation.

Note that the mean/median value is computed after filtering out missing values. All Null values in the input columns are treated as missing, and so are also imputed. For computing median, DataFrameStatFunctions.approxQuantile is used with a relative error of 0.001.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Imputer.html

Timestamp: 2020-10-19T01:56:09.241Z

Imputation estimator for completing missing values, either using the mean or the median
of the columns in which the missing values are located. The input columns should be of
numeric type. Currently Imputer does not support categorical features
(SPARK-15041) and possibly creates incorrect values for a categorical feature.

Note that when an input column is of integer type, the imputed value is cast (truncated) to an integer type.
For example, if the input column is IntegerType (1, 2, 4, null),
the output will be IntegerType (1, 2, 4, 2) after mean imputation.

Note that the mean/median value is computed after filtering out missing values.
All Null values in the input columns are treated as missing, and so are also imputed. For
computing median, DataFrameStatFunctions.approxQuantile is used with a relative error of 0.001.


Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Imputer.html

Timestamp: 2020-10-19T01:56:09.241Z
sourceraw docstring
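
A small sketch using NaN as the (default) missing-value marker, assuming geni's g/table->dataset helper and kebab-case param keys:

(require '[zero-one.geni.core :as g] '[zero-one.geni.ml :as ml])

(def with-missing
  (g/table->dataset [[1.0 Double/NaN] [2.0 3.0] [Double/NaN 4.0] [4.0 5.0]]
                    [:a :b]))

;; Median imputation into fresh output columns.
(def imputer-model
  (ml/fit with-missing
          (ml/imputer {:input-cols ["a" "b"]
                       :output-cols ["a-imputed" "b-imputed"]
                       :strategy "median"})))

(g/show (ml/transform with-missing imputer-model))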

index-to-stringclj

(index-to-string params)

A Transformer that maps a column of indices back to a new column of corresponding string values. The index-string mapping is either from the ML attributes of the input column, or from user-supplied labels (which take precedence over ML attributes).

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/IndexToString.html

Timestamp: 2020-10-19T01:56:09.599Z

A Transformer that maps a column of indices back to a new column of corresponding
string values.
The index-string mapping is either from the ML attributes of the input column,
or from user-supplied labels (which take precedence over ML attributes).


Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/IndexToString.html

Timestamp: 2020-10-19T01:56:09.599Z
sourceraw docstring

input-colclj

(input-col model)

Params:

Result: String

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/MinHashLSHModel.html

Timestamp: 2020-10-19T01:56:46.823Z

Params: 

Result: String



Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/MinHashLSHModel.html

Timestamp: 2020-10-19T01:56:46.823Z
sourceraw docstring

input-colsclj

(input-cols model)

Params:

Result: Array[String]

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/OneHotEncoderModel.html

Timestamp: 2020-10-19T01:56:28.991Z

Params: 

Result: Array[String]



Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/OneHotEncoderModel.html

Timestamp: 2020-10-19T01:56:28.991Z
sourceraw docstring

interactionclj

(interaction params)

Implements the feature interaction transform. This transformer takes in Double and Vector type columns and outputs a flattened vector of their feature interactions. To handle interaction, we first one-hot encode any nominal features. Then, a vector of the feature cross-products is produced.

For example, given the input feature values Double(2) and Vector(3, 4), the output would be Vector(6, 8) if all input features were numeric. If the first feature was instead nominal with four categories, the output would then be Vector(0, 0, 0, 0, 3, 4, 0, 0).

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Interaction.html

Timestamp: 2020-10-19T01:56:09.965Z

Implements the feature interaction transform. This transformer takes in Double and Vector type
columns and outputs a flattened vector of their feature interactions. To handle interaction,
we first one-hot encode any nominal features. Then, a vector of the feature cross-products is
produced.

For example, given the input feature values Double(2) and Vector(3, 4), the output would be
Vector(6, 8) if all input features were numeric. If the first feature was instead nominal
with four categories, the output would then be Vector(0, 0, 0, 0, 3, 4, 0, 0).


Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Interaction.html

Timestamp: 2020-10-19T01:56:09.965Z
sourceraw docstring

interceptclj

(intercept model)
Params: 

Result: Double



Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/LinearRegressionModel.html

Timestamp: 2020-10-19T01:56:36.333Z
sourceraw docstring

intercept-vectorclj

(intercept-vector model)
Params: 

Result: Vector



Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/LogisticRegressionModel.html

Timestamp: 2020-10-19T01:56:46.167Z
sourceraw docstring

is-distributedclj

(is-distributed model)

Params:

Result: Boolean

Indicates whether this instance is of type DistributedLDAModel

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/LDAModel.html

Timestamp: 2020-10-19T01:56:42.877Z

Params: 

Result: Boolean

Indicates whether this instance is of type DistributedLDAModel

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/LDAModel.html

Timestamp: 2020-10-19T01:56:42.877Z
sourceraw docstring

isotonic-regressionclj

(isotonic-regression params)

Isotonic regression.

Currently implemented using parallelized pool adjacent violators algorithm. Only univariate (single feature) algorithm supported.

Uses org.apache.spark.mllib.regression.IsotonicRegression.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/IsotonicRegression.html

Timestamp: 2020-10-19T01:55:54.264Z

Isotonic regression.

Currently implemented using parallelized pool adjacent violators algorithm.
Only univariate (single feature) algorithm supported.

Uses org.apache.spark.mllib.regression.IsotonicRegression.


Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/IsotonicRegression.html

Timestamp: 2020-10-19T01:55:54.264Z
sourceraw docstring

item-factorsclj

(item-factors model)

Params:

Result: DataFrame

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/recommendation/ALSModel.html

Timestamp: 2020-10-19T01:56:42.288Z

Params: 

Result: DataFrame



Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/recommendation/ALSModel.html

Timestamp: 2020-10-19T01:56:42.288Z
sourceraw docstring

k-meansclj

(k-means params)

K-means clustering with support for k-means|| initialization proposed by Bahmani et al.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/KMeans.html

Timestamp: 2020-10-19T01:56:04.224Z

K-means clustering with support for k-means|| initialization proposed by Bahmani et al.


Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/KMeans.html

Timestamp: 2020-10-19T01:56:04.224Z
sourceraw docstring
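
A minimal sketch, assuming a features-df whose :features column already holds Spark ML vectors:

(require '[zero-one.geni.core :as g] '[zero-one.geni.ml :as ml])

(def kmeans-model
  (ml/fit features-df (ml/k-means {:k 3 :max-iter 20 :seed 42})))

;; Adds a prediction column with the assigned cluster index.
(g/show (ml/transform features-df kmeans-model))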

kolmogorov-smirnov-testclj

(kolmogorov-smirnov-test dataframe sample-col dist-name params)

Conduct the two-sided Kolmogorov Smirnov (KS) test for data sampled from a continuous distribution. By comparing the largest difference between the empirical cumulative distribution of the sample data and the theoretical distribution we can provide a test for the null hypothesis that the sample data comes from that theoretical distribution. For more information on KS Test:

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/stat/KolmogorovSmirnovTest$.html

Timestamp: 2020-10-19T01:55:50.540Z

Conduct the two-sided Kolmogorov Smirnov (KS) test for data sampled from a
continuous distribution. By comparing the largest difference between the empirical cumulative
distribution of the sample data and the theoretical distribution we can provide a test for
the null hypothesis that the sample data comes from that theoretical distribution.
For more information on KS Test:

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/stat/KolmogorovSmirnovTest$.html

Timestamp: 2020-10-19T01:55:50.540Z
sourceraw docstring
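
A hedged sketch of testing a numeric column against a standard normal distribution; the column name is passed as a string, and params is assumed to be a sequence of distribution parameters (mean and standard deviation for "norm"):

(require '[zero-one.geni.core :as g] '[zero-one.geni.ml :as ml])

(def sample-df
  (g/table->dataset [[0.1] [-0.3] [0.8] [1.2] [-0.5]] [:sample]))

;; Returns a DataFrame containing the KS statistic and p-value.
(g/show (ml/kolmogorov-smirnov-test sample-df "sample" "norm" [0.0 1.0]))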

label-colclj

(label-col model)
Params: 

Result: String



Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/LinearRegressionModel.html

Timestamp: 2020-10-19T01:56:36.316Z
sourceraw docstring

labelsclj

(labels model)

Params:

Result: Array[String]

(Since version 3.0.0)

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/StringIndexerModel.html

Timestamp: 2020-10-19T01:56:31.154Z

Params: 

Result: Array[String]

(Since version 3.0.0)

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/StringIndexerModel.html

Timestamp: 2020-10-19T01:56:31.154Z
sourceraw docstring

latent-dirichlet-allocationclj

(latent-dirichlet-allocation params)

Latent Dirichlet Allocation (LDA), a topic model designed for text documents.

Terminology:

Original LDA paper (journal version): Blei, Ng, and Jordan. "Latent Dirichlet Allocation." JMLR, 2003.

Input data (featuresCol): LDA is given a collection of documents as input data, via the featuresCol parameter. Each document is specified as a Vector of length vocabSize, where each entry is the count for the corresponding term (word) in the document. Feature transformers such as org.apache.spark.ml.feature.Tokenizer and org.apache.spark.ml.feature.CountVectorizer can be useful for converting text to word count vectors.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/LDA.html

Timestamp: 2020-10-19T01:56:04.609Z

Latent Dirichlet Allocation (LDA), a topic model designed for text documents.

Terminology:

Original LDA paper (journal version):
 Blei, Ng, and Jordan.  "Latent Dirichlet Allocation."  JMLR, 2003.

Input data (featuresCol):
 LDA is given a collection of documents as input data, via the featuresCol parameter.
 Each document is specified as a Vector of length vocabSize, where each entry is the
 count for the corresponding term (word) in the document.  Feature transformers such as
 org.apache.spark.ml.feature.Tokenizer and org.apache.spark.ml.feature.CountVectorizer
 can be useful for converting text to word count vectors.


Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/LDA.html

Timestamp: 2020-10-19T01:56:04.609Z
sourceraw docstring

ldaclj

(lda params)

Latent Dirichlet Allocation (LDA), a topic model designed for text documents.

Terminology:

Original LDA paper (journal version): Blei, Ng, and Jordan. "Latent Dirichlet Allocation." JMLR, 2003.

Input data (featuresCol): LDA is given a collection of documents as input data, via the featuresCol parameter. Each document is specified as a Vector of length vocabSize, where each entry is the count for the corresponding term (word) in the document. Feature transformers such as org.apache.spark.ml.feature.Tokenizer and org.apache.spark.ml.feature.CountVectorizer can be useful for converting text to word count vectors.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/LDA.html

Timestamp: 2020-10-19T01:56:04.609Z

Latent Dirichlet Allocation (LDA), a topic model designed for text documents.

Terminology:

Original LDA paper (journal version):
 Blei, Ng, and Jordan.  "Latent Dirichlet Allocation."  JMLR, 2003.

Input data (featuresCol):
 LDA is given a collection of documents as input data, via the featuresCol parameter.
 Each document is specified as a Vector of length vocabSize, where each entry is the
 count for the corresponding term (word) in the document.  Feature transformers such as
 org.apache.spark.ml.feature.Tokenizer and org.apache.spark.ml.feature.CountVectorizer
 can be useful for converting text to word count vectors.


Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/LDA.html

Timestamp: 2020-10-19T01:56:04.609Z
sourceraw docstring

linear-regressionclj

(linear-regression params)

Linear regression.

The learning objective is to minimize the specified loss function, with regularization. This supports two kinds of loss: squaredError (the default) and huber (a hybrid of squared error for relatively small errors and absolute error for relatively large ones).

This supports multiple types of regularization: none (ordinary least squares), L2 (ridge regression), L1 (Lasso), and L2 + L1 (elastic net).

The squared error and huber objective functions are given in the Spark scaladoc linked below.

Note: Fitting with huber loss only supports none and L2 regularization.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/LinearRegression.html

Timestamp: 2020-10-19T01:55:54.848Z

Linear regression.

The learning objective is to minimize the specified loss function, with regularization.
This supports two kinds of loss:

This supports multiple types of regularization:

The squared error objective function is:



The huber objective function is:



where



Note: Fitting with huber loss only supports none and L2 regularization.


Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/LinearRegression.html

Timestamp: 2020-10-19T01:55:54.848Z
sourceraw docstring
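
A sketch of an elastic-net fit, assuming a training-df with :label and vector-typed :features columns; :reg-param sets the overall regularization strength and :elastic-net-param mixes the L1/L2 penalties:

(require '[zero-one.geni.core :as g] '[zero-one.geni.ml :as ml])

(def lr-model
  (ml/fit training-df
          (ml/linear-regression {:max-iter 10
                                 :reg-param 0.3
                                 :elastic-net-param 0.5})))

(ml/intercept lr-model)                      ;; fitted intercept (see intercept above)
(g/show (ml/transform training-df lr-model)) ;; adds a prediction column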

linear-svcclj

(linear-svc params)

Linear SVM Classifier

This binary classifier optimizes the Hinge Loss using the OWLQN optimizer. Only supports L2 regularization currently.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/LinearSVC.html

Timestamp: 2020-10-19T01:55:57.279Z

Linear SVM Classifier

This binary classifier optimizes the Hinge Loss using the OWLQN optimizer.
Only supports L2 regularization currently.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/LinearSVC.html

Timestamp: 2020-10-19T01:55:57.279Z
sourceraw docstring

log-likelihoodclj

(log-likelihood dataset model)

Params: (dataset: Dataset[_])

Result: Double

Calculates a lower bound on the log likelihood of the entire corpus.

See Equation (16) in the Online LDA paper (Hoffman et al., 2010).

WARNING: If this model is an instance of DistributedLDAModel (produced when optimizer is set to "em"), this involves collecting a large topicsMatrix to the driver. This implementation may be changed in the future.

test corpus to use for calculating log likelihood

variational lower bound on the log likelihood of the entire corpus

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/LDAModel.html

Timestamp: 2020-10-19T01:56:42.959Z

Params: (dataset: Dataset[_])

Result: Double

Calculates a lower bound on the log likelihood of the entire corpus.

See Equation (16) in the Online LDA paper (Hoffman et al., 2010).

WARNING: If this model is an instance of DistributedLDAModel (produced when optimizer
         is set to "em"), this involves collecting a large topicsMatrix to the driver.
         This implementation may be changed in the future.


test corpus to use for calculating log likelihood

variational lower bound on the log likelihood of the entire corpus

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/LDAModel.html

Timestamp: 2020-10-19T01:56:42.959Z
sourceraw docstring

log-perplexityclj

(log-perplexity dataset model)

Params: (dataset: Dataset[_])

Result: Double

Calculate an upper bound on perplexity. (Lower is better.) See Equation (16) in the Online LDA paper (Hoffman et al., 2010).

WARNING: If this model is an instance of DistributedLDAModel (produced when optimizer is set to "em"), this involves collecting a large topicsMatrix to the driver. This implementation may be changed in the future.

test corpus to use for calculating perplexity

Variational upper bound on log perplexity per token.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/LDAModel.html

Timestamp: 2020-10-19T01:56:42.961Z

Params: (dataset: Dataset[_])

Result: Double

Calculate an upper bound on perplexity.  (Lower is better.)
See Equation (16) in the Online LDA paper (Hoffman et al., 2010).

WARNING: If this model is an instance of DistributedLDAModel (produced when optimizer
         is set to "em"), this involves collecting a large topicsMatrix to the driver.
         This implementation may be changed in the future.


test corpus to use for calculating perplexity

Variational upper bound on log perplexity per token.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/LDAModel.html

Timestamp: 2020-10-19T01:56:42.961Z
sourceraw docstring

logistic-regressionclj

(logistic-regression params)

Logistic regression. Supports multinomial logistic (softmax) regression and binomial logistic regression.

This class supports fitting the traditional logistic regression model by LBFGS/OWLQN and the bound (box) constrained logistic regression model by LBFGSB.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/LogisticRegression.html

Timestamp: 2020-10-19T01:55:57.830Z

Logistic regression. Supports:

This class supports fitting traditional logistic regression model by LBFGS/OWLQN and
bound (box) constrained logistic regression model by LBFGSB.


Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/LogisticRegression.html

Timestamp: 2020-10-19T01:55:57.830Z
sourceraw docstring
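
A minimal binomial example sketch, assuming training and test frames with :label and vector-typed :features columns:

(require '[zero-one.geni.core :as g] '[zero-one.geni.ml :as ml])

(def lr-model
  (ml/fit training-df
          (ml/logistic-regression {:max-iter 10 :reg-param 0.01})))

;; Transforming adds rawPrediction, probability and prediction columns.
(g/show (ml/transform test-df lr-model))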

max-absclj

(max-abs model)

Params:

Result: Vector

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/MaxAbsScalerModel.html

Timestamp: 2020-10-19T01:56:33.682Z

Params: 

Result: Vector



Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/MaxAbsScalerModel.html

Timestamp: 2020-10-19T01:56:33.682Z
sourceraw docstring

max-abs-scalerclj

(max-abs-scaler params)

Rescale each feature individually to range [-1, 1] by dividing through the largest maximum absolute value in each feature. It does not shift/center the data, and thus does not destroy any sparsity.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/MaxAbsScaler.html

Timestamp: 2020-10-19T01:56:10.658Z

Rescale each feature individually to range [-1, 1] by dividing through the largest maximum
absolute value in each feature. It does not shift/center the data, and thus does not destroy
any sparsity.


Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/MaxAbsScaler.html

Timestamp: 2020-10-19T01:56:10.658Z
sourceraw docstring

meanclj

(mean model)

Params:

Result: Vector

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/StandardScalerModel.html

Timestamp: 2020-10-19T01:56:33.051Z

Params: 

Result: Vector



Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/StandardScalerModel.html

Timestamp: 2020-10-19T01:56:33.051Z
sourceraw docstring

min-hash-lshclj

(min-hash-lsh params)

LSH class for Jaccard distance.

The input can be dense or sparse vectors, but it is more efficient if it is sparse. For example, Vectors.sparse(10, Array((2, 1.0), (3, 1.0), (5, 1.0))) means there are 10 elements in the space. This set contains elements 2, 3, and 5. Also, any input vector must have at least 1 non-zero index, and all non-zero values are treated as binary "1" values.

References: Wikipedia on MinHash

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/MinHashLSH.html

Timestamp: 2020-10-19T01:56:11.035Z

LSH class for Jaccard distance.

The input can be dense or sparse vectors, but it is more efficient if it is sparse. For example,
   Vectors.sparse(10, Array((2, 1.0), (3, 1.0), (5, 1.0)))
means there are 10 elements in the space. This set contains elements 2, 3, and 5. Also, any
input vector must have at least 1 non-zero index, and all non-zero values are
treated as binary "1" values.

References:
Wikipedia on MinHash


Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/MinHashLSH.html

Timestamp: 2020-10-19T01:56:11.035Z
sourceraw docstring
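
A sketch of hashing binary feature vectors, assuming a binary-df whose :features column holds sparse Spark ML vectors with at least one non-zero entry per row:

(require '[zero-one.geni.core :as g] '[zero-one.geni.ml :as ml])

(def lsh-model
  (ml/fit binary-df
          (ml/min-hash-lsh {:input-col "features"
                            :output-col "hashes"
                            :num-hash-tables 3})))

;; Adds the :hashes column of MinHash signatures.
(g/show (ml/transform binary-df lsh-model))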

min-max-scalerclj

(min-max-scaler params)

Rescale each feature individually to a common range [min, max] linearly using column summary statistics, which is also known as min-max normalization or Rescaling. The rescaled value for feature E is calculated as Rescaled(e_i) = (e_i - E_{min}) / (E_{max} - E_{min}) * (max - min) + min.

For the case E_{max} == E_{min}, Rescaled(e_i) = 0.5 * (max + min).

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/MinMaxScaler.html

Timestamp: 2020-10-19T01:56:11.407Z

Rescale each feature individually to a common range [min, max] linearly using column summary
statistics, which is also known as min-max normalization or Rescaling. The rescaled value for
feature E is calculated as:



For the case \(E_{max} == E_{min}\), \(Rescaled(e_i) = 0.5 * (max + min)\).


Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/MinMaxScaler.html

Timestamp: 2020-10-19T01:56:11.407Z
sourceraw docstring

mlp-classifierclj

(mlp-classifier params)

Classifier trainer based on the Multilayer Perceptron. Each layer has sigmoid activation function, output layer has softmax. Number of inputs has to be equal to the size of feature vectors. Number of outputs has to be equal to the total number of labels.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.html

Timestamp: 2020-10-19T01:55:58.225Z

Classifier trainer based on the Multilayer Perceptron.
Each layer has sigmoid activation function, output layer has softmax.
Number of inputs has to be equal to the size of feature vectors.
Number of outputs has to be equal to the total number of labels.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.html

Timestamp: 2020-10-19T01:55:58.225Z
sourceraw docstring

multiclass-classification-evaluatorclj

(multiclass-classification-evaluator params)

Evaluator for multiclass classification, which expects input columns: prediction, label, weight (optional) and probability (only for logLoss).

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/evaluation/MulticlassClassificationEvaluator.html

Timestamp: 2020-10-19T01:56:01.471Z

Evaluator for multiclass classification, which expects input columns: prediction, label,
weight (optional) and probability (only for logLoss).


Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/evaluation/MulticlassClassificationEvaluator.html

Timestamp: 2020-10-19T01:56:01.471Z
sourceraw docstring

multilabel-classification-evaluatorclj

(multilabel-classification-evaluator params)

:: Experimental :: Evaluator for multi-label classification, which expects two input columns: prediction and label.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/evaluation/MultilabelClassificationEvaluator.html

Timestamp: 2020-10-19T01:56:01.814Z

:: Experimental ::
Evaluator for multi-label classification, which expects two input
columns: prediction and label.


Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/evaluation/MultilabelClassificationEvaluator.html

Timestamp: 2020-10-19T01:56:01.814Z
sourceraw docstring

multilayer-perceptron-classifierclj

(multilayer-perceptron-classifier params)

Classifier trainer based on the Multilayer Perceptron. Each layer has sigmoid activation function, output layer has softmax. Number of inputs has to be equal to the size of feature vectors. Number of outputs has to be equal to the total number of labels.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.html

Timestamp: 2020-10-19T01:55:58.225Z

Classifier trainer based on the Multilayer Perceptron.
Each layer has sigmoid activation function, output layer has softmax.
Number of inputs has to be equal to the size of feature vectors.
Number of outputs has to be equal to the total number of labels.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.html

Timestamp: 2020-10-19T01:55:58.225Z
sourceraw docstring

n-gramclj

(n-gram params)

A feature transformer that converts the input array of strings into an array of n-grams. Null values in the input array are ignored. It returns an array of n-grams where each n-gram is represented by a space-separated string of words.

When the input is empty, an empty array is returned. When the input array length is less than n (number of elements per n-gram), no n-grams are returned.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/NGram.html

Timestamp: 2020-10-19T01:56:11.769Z

A feature transformer that converts the input array of strings into an array of n-grams. Null
values in the input array are ignored.
It returns an array of n-grams where each n-gram is represented by a space-separated string of
words.

When the input is empty, an empty array is returned.
When the input array length is less than n (number of elements per n-gram), no n-grams are
returned.


Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/NGram.html

Timestamp: 2020-10-19T01:56:11.769Z
sourceraw docstring
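
NGram is a plain transformer, so no fitting step is needed. A sketch, assuming the nested Clojure vector becomes an array<string> column:

(require '[zero-one.geni.core :as g] '[zero-one.geni.ml :as ml])

(def tokens-df
  (g/table->dataset [[["hello" "brave" "new" "world"]]] [:tokens]))

;; Produces ["hello brave" "brave new" "new world"] in the :bigrams column.
(g/show (ml/transform tokens-df
                      (ml/n-gram {:input-col "tokens" :output-col "bigrams" :n 2})))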

naive-bayesclj

(naive-bayes params)

Naive Bayes Classifiers. It supports Multinomial NB (see here) which can handle finitely supported discrete data. For example, by converting documents into TF-IDF vectors, it can be used for document classification. By treating every vector as binary (0/1) data, it can also be used as Bernoulli NB (see here). The input feature values for Multinomial NB and Bernoulli NB must be nonnegative. Since 3.0.0, it supports Complement NB which is an adaptation of the Multinomial NB. Specifically, Complement NB uses statistics from the complement of each class to compute the model's coefficients. The inventors of Complement NB show empirically that the parameter estimates for CNB are more stable than those for Multinomial NB. Like Multinomial NB, the input feature values for Complement NB must be nonnegative. Since 3.0.0, it also supports Gaussian NB (see here) which can handle continuous data.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/NaiveBayes.html

Timestamp: 2020-10-19T01:55:58.596Z

Naive Bayes Classifiers.
It supports Multinomial NB
(see 
here)
which can handle finitely supported discrete data. For example, by converting documents into
TF-IDF vectors, it can be used for document classification. By making every vector a
binary (0/1) data, it can also be used as Bernoulli NB
(see 
here).
The input feature values for Multinomial NB and Bernoulli NB must be nonnegative.
Since 3.0.0, it supports Complement NB which is an adaptation of the Multinomial NB. Specifically,
Complement NB uses statistics from the complement of each class to compute the model's coefficients.
The inventors of Complement NB show empirically that the parameter estimates for CNB are more stable
than those for Multinomial NB. Like Multinomial NB, the input feature values for Complement NB must
be nonnegative.
Since 3.0.0, it also supports Gaussian NB
(see 
here)
which can handle continuous data.


Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/NaiveBayes.html

Timestamp: 2020-10-19T01:55:58.596Z
sourceraw docstring

normaliserclj

(normaliser params)

Normalize a vector to have unit norm using the given p-norm.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Normalizer.html

Timestamp: 2020-10-19T01:56:12.133Z

Normalize a vector to have unit norm using the given p-norm.


Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Normalizer.html

Timestamp: 2020-10-19T01:56:12.133Z
sourceraw docstring

normalizerclj

(normalizer params)

Normalize a vector to have unit norm using the given p-norm.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Normalizer.html

Timestamp: 2020-10-19T01:56:12.133Z

Normalize a vector to have unit norm using the given p-norm.


Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Normalizer.html

Timestamp: 2020-10-19T01:56:12.133Z
sourceraw docstring

num-classesclj

(num-classes model)

Params:

Result: Int

Number of classes (values which the label can take).

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/RandomForestClassificationModel.html

Timestamp: 2020-10-19T01:56:37.671Z

Params: 

Result: Int

Number of classes (values which the label can take).

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/RandomForestClassificationModel.html

Timestamp: 2020-10-19T01:56:37.671Z
sourceraw docstring

num-featuresclj

(num-features model)

Params:

Result: Int

Returns the number of features the model was trained on. If unknown, returns -1

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/LinearRegressionModel.html

Timestamp: 2020-10-19T01:56:36.360Z

Params: 

Result: Int

Returns the number of features the model was trained on. If unknown, returns -1

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/LinearRegressionModel.html

Timestamp: 2020-10-19T01:56:36.360Z
sourceraw docstring

num-nodesclj

(num-nodes model)

Params:

Result: Int

Number of nodes in tree, including leaf nodes.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/DecisionTreeClassificationModel.html

Timestamp: 2020-10-19T01:56:41.668Z

Params: 

Result: Int

Number of nodes in tree, including leaf nodes.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/DecisionTreeClassificationModel.html

Timestamp: 2020-10-19T01:56:41.668Z
sourceraw docstring

one-hot-encoderclj

(one-hot-encoder params)

A one-hot encoder that maps a column of category indices to a column of binary vectors, with at most a single one-value per row that indicates the input category index. For example with 5 categories, an input value of 2.0 would map to an output vector of [0.0, 0.0, 1.0, 0.0]. The last category is not included by default (configurable via dropLast), because it makes the vector entries sum up to one, and hence linearly dependent. So an input value of 4.0 maps to [0.0, 0.0, 0.0, 0.0].

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/OneHotEncoder.html

Timestamp: 2020-10-19T01:56:12.690Z

A one-hot encoder that maps a column of category indices to a column of binary vectors, with
at most a single one-value per row that indicates the input category index.
For example with 5 categories, an input value of 2.0 would map to an output vector of
[0.0, 0.0, 1.0, 0.0].
The last category is not included by default (configurable via dropLast),
because it makes the vector entries sum up to one, and hence linearly dependent.
So an input value of 4.0 maps to [0.0, 0.0, 0.0, 0.0].


Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/OneHotEncoder.html

Timestamp: 2020-10-19T01:56:12.690Z
sourceraw docstring
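
A sketch matching the example above: with category indices 0.0 through 4.0 and the default drop-last behaviour, index 4.0 encodes to the all-zero vector:

(require '[zero-one.geni.core :as g] '[zero-one.geni.ml :as ml])

(def categories-df
  (g/table->dataset [[0.0] [1.0] [2.0] [4.0]] [:category-index]))

(def ohe-model
  (ml/fit categories-df
          (ml/one-hot-encoder {:input-cols ["category-index"]
                               :output-cols ["category-vec"]})))

(g/show (ml/transform categories-df ohe-model))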

one-vs-restclj

(one-vs-rest params)

Reduction of Multiclass Classification to Binary Classification. Performs reduction using one against all strategy. For a multiclass classification with k classes, train k models (one per class). Each example is scored against all k models and the model with highest score is picked to label the example.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/OneVsRest.html

Timestamp: 2020-10-19T01:55:58.960Z

Reduction of Multiclass Classification to Binary Classification.
Performs reduction using one against all strategy.
For a multiclass classification with k classes, train k models (one per class).
Each example is scored against all k models and the model with highest score
is picked to label the example.


Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/OneVsRest.html

Timestamp: 2020-10-19T01:55:58.960Z
sourceraw docstring

original-maxclj

(original-max model)

Params:

Result: Vector

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/MinMaxScalerModel.html

Timestamp: 2020-10-19T01:56:28.393Z

Params: 

Result: Vector



Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/MinMaxScalerModel.html

Timestamp: 2020-10-19T01:56:28.393Z
sourceraw docstring

original-minclj

(original-min model)

Params:

Result: Vector

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/MinMaxScalerModel.html

Timestamp: 2020-10-19T01:56:28.394Z

Params: 

Result: Vector



Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/MinMaxScalerModel.html

Timestamp: 2020-10-19T01:56:28.394Z
sourceraw docstring

output-colclj

(output-col model)

Params:

Result: String

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/MinHashLSHModel.html

Timestamp: 2020-10-19T01:56:46.826Z

Params: 

Result: String



Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/MinHashLSHModel.html

Timestamp: 2020-10-19T01:56:46.826Z
sourceraw docstring

output-colsclj

(output-cols model)

Params:

Result: Array[String]

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/OneHotEncoderModel.html

Timestamp: 2020-10-19T01:56:28.994Z

Params: 

Result: Array[String]



Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/OneHotEncoderModel.html

Timestamp: 2020-10-19T01:56:28.994Z
sourceraw docstring

param-gridclj

(param-grid grids)

Builder for a param grid used in grid search-based model selection.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/tuning/ParamGridBuilder.html

Timestamp: 2020-10-19T01:55:49.184Z

Builder for a param grid used in grid search-based model selection.


Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/tuning/ParamGridBuilder.html

Timestamp: 2020-10-19T01:55:49.184Z
sourceraw docstring

param-grid-builderclj

(param-grid-builder grids)

Builder for a param grid used in grid search-based model selection.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/tuning/ParamGridBuilder.html

Timestamp: 2020-10-19T01:55:49.184Z

Builder for a param grid used in grid search-based model selection.


Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/tuning/ParamGridBuilder.html

Timestamp: 2020-10-19T01:55:49.184Z
sourceraw docstring

paramsclj

(params stage)

Params:

Result: Array[Param[_]]

Returns all params sorted by their names. The default implementation uses Java reflection to list all public methods that have no arguments and return Param.

Developer should not use this method in constructor because we cannot guarantee that this variable gets initialized before other params.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/fpm/PrefixSpan.html

Timestamp: 2020-10-19T01:56:35.738Z

Params: 

Result: Array[Param[_]]

Returns all params sorted by their names. The default implementation uses Java reflection to
list all public methods that have no arguments and return Param.


Developer should not use this method in constructor because we cannot guarantee that
this variable gets initialized before other params.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/fpm/PrefixSpan.html

Timestamp: 2020-10-19T01:56:35.738Z
sourceraw docstring

pcclj

(pc model)

Params:

Result: DenseMatrix

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/PCAModel.html

Timestamp: 2020-10-19T01:56:29.844Z

Params: 

Result: DenseMatrix



Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/PCAModel.html

Timestamp: 2020-10-19T01:56:29.844Z
sourceraw docstring

pcaclj

(pca params)

PCA trains a model to project vectors to a lower dimensional space of the top k principal components.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/PCA.html

Timestamp: 2020-10-19T01:56:13.048Z

PCA trains a model to project vectors to a lower dimensional space of the top k
principal components.


Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/PCA.html

Timestamp: 2020-10-19T01:56:13.048Z
sourceraw docstring

piclj

(pi model)

Params:

Result: Vector

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/NaiveBayesModel.html

Timestamp: 2020-10-19T01:56:39.617Z

Params: 

Result: Vector



Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/NaiveBayesModel.html

Timestamp: 2020-10-19T01:56:39.617Z
sourceraw docstring

pipelineclj

(pipeline & stages)

A simple pipeline, which acts as an estimator. A Pipeline consists of a sequence of stages, each of which is either an Estimator or a Transformer. When Pipeline.fit is called, the stages are executed in order. If a stage is an Estimator, its Estimator.fit method will be called on the input dataset to fit a model. Then the model, which is a transformer, will be used to transform the dataset as the input to the next stage. If a stage is a Transformer, its Transformer.transform method will be called to produce the dataset for the next stage. The fitted model from a Pipeline is a PipelineModel, which consists of fitted models and transformers, corresponding to the pipeline stages. If there are no stages, the pipeline acts as an identity transformer.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/Pipeline.html

Timestamp: 2020-10-19T01:55:50.903Z

A simple pipeline, which acts as an estimator. A Pipeline consists of a sequence of stages, each
of which is either an Estimator or a Transformer. When Pipeline.fit is called, the
stages are executed in order. If a stage is an Estimator, its Estimator.fit method will
be called on the input dataset to fit a model. Then the model, which is a transformer, will be
used to transform the dataset as the input to the next stage. If a stage is a Transformer,
its Transformer.transform method will be called to produce the dataset for the next stage.
The fitted model from a Pipeline is a PipelineModel, which consists of fitted models and
transformers, corresponding to the pipeline stages. If there are no stages, the pipeline acts as
an identity transformer.


Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/Pipeline.html

Timestamp: 2020-10-19T01:55:50.903Z
sourceraw docstring
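
A sketch of a three-stage text-classification pipeline, assuming a training frame with :text and :label columns; the tokenizer stage is an assumption about geni's wrapper for Spark's Tokenizer, while hashing-tf and logistic-regression are documented above:

(require '[zero-one.geni.core :as g] '[zero-one.geni.ml :as ml])

(def text-pipeline
  (ml/pipeline
    (ml/tokenizer {:input-col "text" :output-col "words"})
    (ml/hashing-tf {:input-col "words" :output-col "features" :num-features 1024})
    (ml/logistic-regression {:max-iter 10 :reg-param 0.001})))

;; Fitting the pipeline returns a PipelineModel; transforming applies every stage in order.
(def pipeline-model (ml/fit training-df text-pipeline))
(g/show (ml/transform test-df pipeline-model))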

polynomial-expansionclj

(polynomial-expansion params)

Perform feature expansion in a polynomial space. As said in wikipedia of Polynomial Expansion, which is available at Polynomial expansion (Wikipedia) , "In mathematics, an expansion of a product of sums expresses it as a sum of products by using the fact that multiplication distributes over addition". Take a 2-variable feature vector as an example: (x, y), if we want to expand it with degree 2, then we get (x, x * x, y, x * y, y * y).

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/PolynomialExpansion.html

Timestamp: 2020-10-19T01:56:13.405Z

Perform feature expansion in a polynomial space. As said in wikipedia of Polynomial Expansion,
which is available at
Polynomial expansion (Wikipedia)
, "In mathematics, an expansion of a product of sums expresses it as a sum of products by using
the fact that multiplication distributes over addition". Take a 2-variable feature vector
as an example: (x, y), if we want to expand it with degree 2, then we get
(x, x * x, y, x * y, y * y).


Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/PolynomialExpansion.html

Timestamp: 2020-10-19T01:56:13.405Z
sourceraw docstring

power-iteration-clusteringclj

(power-iteration-clustering params)

Power Iteration Clustering (PIC), a scalable graph clustering algorithm developed by Lin and Cohen. From the abstract: PIC finds a very low-dimensional embedding of a dataset using truncated power iteration on a normalized pair-wise similarity matrix of the data.

This class is not yet an Estimator/Transformer, use assignClusters method to run the PowerIterationClustering algorithm.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/PowerIterationClustering.html

Timestamp: 2020-10-19T01:56:04.968Z

Power Iteration Clustering (PIC), a scalable graph clustering algorithm developed by
Lin and Cohen. From
the abstract: PIC finds a very low-dimensional embedding of a dataset using truncated power
iteration on a normalized pair-wise similarity matrix of the data.

This class is not yet an Estimator/Transformer, use assignClusters method to run the
PowerIterationClustering algorithm.


Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/PowerIterationClustering.html

Timestamp: 2020-10-19T01:56:04.968Z
sourceraw docstring

prediction-colclj

(prediction-col model)
Params: 

Result: String



Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/LinearRegressionModel.html

Timestamp: 2020-10-19T01:56:36.320Z

prefix-spanclj

(prefix-span params)

A parallel PrefixSpan algorithm to mine frequent sequential patterns. The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth (see here). This class is not yet an Estimator/Transformer; use the findFrequentSequentialPatterns method to run the PrefixSpan algorithm.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/fpm/PrefixSpan.html

Timestamp: 2020-10-19T01:56:00.046Z


principal-componentsclj

(principal-components model)

Params:

Result: DenseMatrix

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/PCAModel.html

Timestamp: 2020-10-19T01:56:29.844Z


probability-colclj

(probability-col model)
Params: 

Result: String



Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/RandomForestClassificationModel.html

Timestamp: 2020-10-19T01:56:37.625Z

quantile-discretiserclj

(quantile-discretiser params)

QuantileDiscretizer takes a column with continuous features and outputs a column with binned categorical features. The number of bins can be set using the numBuckets parameter. It is possible that the number of buckets used will be smaller than this value, for example, if there are too few distinct values of the input to create enough distinct quantiles. Since 2.3.0, QuantileDiscretizer can map multiple columns at once by setting the inputCols parameter. If both the inputCol and inputCols parameters are set, an Exception will be thrown. To specify the number of buckets for each column, the numBucketsArray parameter can be set, or if the number of buckets should be the same across columns, numBuckets can be set as a convenience. Note that in the multiple-column case, the relative error is applied to all columns.

NaN handling: null and NaN values will be ignored in the column during QuantileDiscretizer fitting. This will produce a Bucketizer model for making predictions. During the transformation, Bucketizer will raise an error when it finds NaN values in the dataset, but the user can also choose to either keep or remove NaN values within the dataset by setting handleInvalid. If the user chooses to keep NaN values, they will be handled specially and placed into their own bucket: for example, if 4 buckets are used, then non-NaN data will be put into buckets[0-3], while NaNs will be counted in a special bucket[4].

Algorithm: The bin ranges are chosen using an approximate algorithm (see the documentation for org.apache.spark.sql.DataFrameStatFunctions.approxQuantile for a detailed description). The precision of the approximation can be controlled with the relativeError parameter. The lower and upper bin bounds will be -Infinity and +Infinity, covering all real values.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/QuantileDiscretizer.html

Timestamp: 2020-10-19T01:56:13.770Z

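A minimal sketch of binning a continuous column into four buckets, assuming the ml alias for zero-one.geni.ml, kebab-case forms of the Spark params above (:num-buckets, :handle-invalid) and an existing DataFrame raw-df.

(def discretiser
  (ml/quantile-discretiser {:input-col      :hour
                            :output-col     :hour-bucket
                            :num-buckets    4
                            :handle-invalid "keep"}))  ;; keep NaNs in their own bucket

(def bucketiser-model (ml/fit raw-df discretiser))     ;; fitting produces a Bucketizer
(ml/transform raw-df bucketiser-model)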

quantile-discretizerclj

(quantile-discretizer params)

QuantileDiscretizer takes a column with continuous features and outputs a column with binned categorical features. The number of bins can be set using the numBuckets parameter. It is possible that the number of buckets used will be smaller than this value, for example, if there are too few distinct values of the input to create enough distinct quantiles. Since 2.3.0, QuantileDiscretizer can map multiple columns at once by setting the inputCols parameter. If both the inputCol and inputCols parameters are set, an Exception will be thrown. To specify the number of buckets for each column, the numBucketsArray parameter can be set, or if the number of buckets should be the same across columns, numBuckets can be set as a convenience. Note that in the multiple-column case, the relative error is applied to all columns.

NaN handling: null and NaN values will be ignored in the column during QuantileDiscretizer fitting. This will produce a Bucketizer model for making predictions. During the transformation, Bucketizer will raise an error when it finds NaN values in the dataset, but the user can also choose to either keep or remove NaN values within the dataset by setting handleInvalid. If the user chooses to keep NaN values, they will be handled specially and placed into their own bucket: for example, if 4 buckets are used, then non-NaN data will be put into buckets[0-3], while NaNs will be counted in a special bucket[4].

Algorithm: The bin ranges are chosen using an approximate algorithm (see the documentation for org.apache.spark.sql.DataFrameStatFunctions.approxQuantile for a detailed description). The precision of the approximation can be controlled with the relativeError parameter. The lower and upper bin bounds will be -Infinity and +Infinity, covering all real values.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/QuantileDiscretizer.html

Timestamp: 2020-10-19T01:56:13.770Z


random-forest-classifierclj

(random-forest-classifier params)

Random Forest learning algorithm for classification. It supports both binary and multiclass labels, as well as both continuous and categorical features.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/RandomForestClassifier.html

Timestamp: 2020-10-19T01:55:59.351Z

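A minimal sketch with a couple of common hyper-parameters, assuming the ml alias for zero-one.geni.ml, kebab-case forms of the Spark params (numTrees, maxDepth) and existing training-df/test-df DataFrames with :features and :label columns.

(def classifier
  (ml/random-forest-classifier {:features-col :features
                                :label-col    :label
                                :num-trees    50
                                :max-depth    5}))

(def rf-model (ml/fit training-df classifier))
(ml/transform test-df rf-model)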

random-forest-regressorclj

(random-forest-regressor params)

Random Forest learning algorithm for regression. It supports both continuous and categorical features.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/RandomForestRegressor.html

Timestamp: 2020-10-19T01:55:55.394Z


ranking-evaluatorclj

(ranking-evaluator params)

:: Experimental :: Evaluator for ranking, which expects two input columns: prediction and label.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/evaluation/RankingEvaluator.html

Timestamp: 2020-10-19T01:56:02.374Z


raw-prediction-colclj

(raw-prediction-col model)
Params: 

Result: String



Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/RandomForestClassificationModel.html

Timestamp: 2020-10-19T01:56:37.626Z

read-stage!clj

(read-stage! model-cls path)

Load a saved PipelineStage.


recommend-for-all-itemsclj

(recommend-for-all-items model num-users)

Params: (numUsers: Int)

Result: DataFrame

Returns top numUsers users recommended for each item, for all items.

numUsers: max number of recommendations for each item

returns: a DataFrame of (itemCol: Int, recommendations), where recommendations are stored as an array of (userCol: Int, rating: Float) Rows.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/recommendation/ALSModel.html

Timestamp: 2020-10-19T01:56:42.310Z


recommend-for-all-usersclj

(recommend-for-all-users model num-items)

Params: (numItems: Int)

Result: DataFrame

Returns top numItems items recommended for each user, for all users.

numItems: max number of recommendations for each user

returns: a DataFrame of (userCol: Int, recommendations), where recommendations are stored as an array of (itemCol: Int, rating: Float) Rows.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/recommendation/ALSModel.html

Timestamp: 2020-10-19T01:56:42.315Z

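A minimal sketch of producing top-10 recommendations from a fitted ALS model, assuming the ml alias for zero-one.geni.ml and that als-model is the result of fitting an als estimator on a ratings DataFrame.

(ml/recommend-for-all-users als-model 10)   ;; (userCol, recommendations) rows
(ml/recommend-for-all-items als-model 10)   ;; (itemCol, recommendations) rows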

recommend-for-item-subsetclj

(recommend-for-item-subset model items-df num-users)

Params: (dataset: Dataset[_], numUsers: Int)

Result: DataFrame

Returns top numUsers users recommended for each item id in the input data set. Note that if there are duplicate ids in the input dataset, only one set of recommendations per unique id will be returned.

dataset: a Dataset containing a column of item ids. The column name must match itemCol.

numUsers: max number of recommendations for each item.

returns: a DataFrame of (itemCol: Int, recommendations), where recommendations are stored as an array of (userCol: Int, rating: Float) Rows.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/recommendation/ALSModel.html

Timestamp: 2020-10-19T01:56:42.317Z


recommend-for-user-subsetclj

(recommend-for-user-subset model users-df num-items)

Params: (dataset: Dataset[_], numItems: Int)

Result: DataFrame

Returns top numItems items recommended for each user id in the input data set. Note that if there are duplicate ids in the input dataset, only one set of recommendations per unique id will be returned.

dataset: a Dataset containing a column of user ids. The column name must match userCol.

numItems: max number of recommendations for each user.

returns: a DataFrame of (userCol: Int, recommendations), where recommendations are stored as an array of (itemCol: Int, rating: Float) Rows.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/recommendation/ALSModel.html

Timestamp: 2020-10-19T01:56:42.319Z


recommend-itemsclj

(recommend-items model num-items)
(recommend-items model users-df num-items)

Params: (numItems: Int)

Result: DataFrame

Returns top numItems items recommended for each user, for all users.

numItems: max number of recommendations for each user

returns: a DataFrame of (userCol: Int, recommendations), where recommendations are stored as an array of (itemCol: Int, rating: Float) Rows.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/recommendation/ALSModel.html

Timestamp: 2020-10-19T01:56:42.315Z


recommend-usersclj

(recommend-users model num-users)
(recommend-users model items-df num-users)

Params: (numUsers: Int)

Result: DataFrame

Returns top numUsers users recommended for each item, for all items.

numUsers: max number of recommendations for each item

returns: a DataFrame of (itemCol: Int, recommendations), where recommendations are stored as an array of (userCol: Int, rating: Float) Rows.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/recommendation/ALSModel.html

Timestamp: 2020-10-19T01:56:42.310Z


regex-tokeniserclj

(regex-tokeniser params)

A regex based tokenizer that extracts tokens either by using the provided regex pattern to split the text (default) or repeatedly matching the regex (if gaps is false). Optional parameters also allow filtering tokens using a minimal length. It returns an array of strings that can be empty.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/RegexTokenizer.html

Timestamp: 2020-10-19T01:56:14.327Z


regex-tokenizerclj

(regex-tokenizer params)

A regex based tokenizer that extracts tokens either by using the provided regex pattern to split the text (default) or repeatedly matching the regex (if gaps is false). Optional parameters also allow filtering tokens using a minimal length. It returns an array of strings that can be empty.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/RegexTokenizer.html

Timestamp: 2020-10-19T01:56:14.327Z

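A minimal sketch that splits on runs of non-word characters instead of the default whitespace, assuming the ml alias for zero-one.geni.ml, kebab-case forms of the Spark params (pattern, minTokenLength) and an existing sentences-df DataFrame.

(def word-splitter
  (ml/regex-tokenizer {:input-col        :sentence
                       :output-col       :words
                       :pattern          "\\W+"
                       :min-token-length 2}))

(ml/transform sentences-df word-splitter)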

regression-evaluatorclj

(regression-evaluator params)

Evaluator for regression, which expects input columns prediction, label and an optional weight column.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/evaluation/RegressionEvaluator.html

Timestamp: 2020-10-19T01:56:02.721Z

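A minimal sketch of an RMSE evaluator, assuming the ml alias for zero-one.geni.ml and kebab-case forms of the Spark params (metricName, labelCol, predictionCol). Such an evaluator is typically passed as the :evaluator of train-validation-split further down this page.

(def rmse-evaluator
  (ml/regression-evaluator {:label-col      :label
                            :prediction-col :prediction
                            :metric-name    "rmse"}))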

robust-scalerclj

(robust-scaler params)

Scale features using statistics that are robust to outliers. RobustScaler removes the median and scales the data according to the quantile range. The quantile range is by default IQR (Interquartile Range, quantile range between the 1st quartile = 25th quantile and the 3rd quartile = 75th quantile) but can be configured. Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Median and quantile range are then stored to be used on later data using the transform method. Standardization of a dataset is a common requirement for many machine learning estimators. Typically this is done by removing the mean and scaling to unit variance. However, outliers can often influence the sample mean / variance in a negative way. In such cases, the median and the quantile range often give better results. Note that NaN values are ignored in the computation of medians and ranges.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/RobustScaler.html

Timestamp: 2020-10-19T01:56:15.260Z


root-nodeclj

(root-node model)

Params:

Result: Node

Root of the decision tree

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/DecisionTreeClassificationModel.html

Timestamp: 2020-10-19T01:56:41.689Z


scaleclj

(scale model)
Params: 

Result: Double



Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/LinearRegressionModel.html

Timestamp: 2020-10-19T01:56:36.368Z

sql-transformerclj

(sql-transformer params)

Implements the transformations which are defined by a SQL statement. Currently we only support SQL syntax like 'SELECT ... FROM __THIS__ ...' where '__THIS__' represents the underlying table of the input dataset. The select clause specifies the fields, constants, and expressions to display in the output; it can be any select clause that Spark SQL supports. Users can also use Spark SQL built-in functions and UDFs to operate on these selected columns. For example, SQLTransformer supports statements like the one sketched below.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/SQLTransformer.html

Timestamp: 2020-10-19T01:56:15.611Z

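A minimal sketch, assuming the ml alias for zero-one.geni.ml and that :statement is the kebab-case key for the Spark statement param; __THIS__ stands for the input dataset as described above.

(def add-ratio
  (ml/sql-transformer
    {:statement "SELECT *, a / b AS a_over_b FROM __THIS__"}))

(ml/transform input-df add-ratio)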

stagesclj

(stages model)

Params:

Result: Array[Transformer]

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/PipelineModel.html

Timestamp: 2020-10-19T01:56:38.367Z


standard-scalerclj

(standard-scaler params)

Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set.

The "unit std" is computed using the

corrected sample standard deviation, which is computed as the square root of the unbiased sample variance.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/StandardScaler.html

Timestamp: 2020-10-19T01:56:16.163Z

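A minimal sketch, assuming the ml alias for zero-one.geni.ml and that :with-mean/:with-std are the kebab-case forms of withMean/withStd.

(def scaler
  (ml/standard-scaler {:input-col  :features
                       :output-col :scaled-features
                       :with-mean  true
                       :with-std   true}))

(def scaler-model (ml/fit training-df scaler))   ;; learns column means and stds
(ml/transform training-df scaler-model)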

stdclj

(std model)

Params:

Result: Vector

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/StandardScalerModel.html

Timestamp: 2020-10-19T01:56:33.073Z


stop-words-removerclj

(stop-words-remover params)

A feature transformer that filters out stop words from input.

Since 3.0.0, StopWordsRemover can filter out multiple columns at once by setting the inputCols parameter. Note that when both the inputCol and inputCols parameters are set, an Exception will be thrown.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/StopWordsRemover.html

Timestamp: 2020-10-19T01:56:16.540Z


string-indexerclj

(string-indexer params)

A label indexer that maps string column(s) of labels to ML column(s) of label indices. If the input columns are numeric, we cast them to string and index the string values. The indices are in [0, numLabels). By default, this is ordered by label frequencies so the most frequent label gets index 0. The ordering behavior is controlled by setting stringOrderType.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/StringIndexer.html

Timestamp: 2020-10-19T01:56:16.905Z

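A minimal sketch, assuming the ml alias for zero-one.geni.ml and that :string-order-type is the kebab-case form of stringOrderType.

(def indexer
  (ml/string-indexer {:input-col         :category
                      :output-col        :category-index
                      :string-order-type "frequencyDesc"}))  ;; most frequent label -> index 0

(def indexer-model (ml/fit raw-df indexer))
(ml/transform raw-df indexer-model)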

summaryclj

(summary model)

Params:

Result: LinearRegressionTrainingSummary

Gets the summary (e.g. residuals, MSE, R-squared) of the model on the training set. An exception is thrown if hasSummary is false.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/LinearRegressionModel.html

Timestamp: 2020-10-19T01:56:36.383Z


supported-optimisersclj

(supported-optimisers model)

Params:

Result: Array[String]

Supported values for Param optimizer.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/LDAModel.html

Timestamp: 2020-10-19T01:56:42.991Z


supported-optimizersclj

(supported-optimizers model)

Params:

Result: Array[String]

Supported values for Param optimizer.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/LDAModel.html

Timestamp: 2020-10-19T01:56:42.991Z


surrogate-dfclj

(surrogate-df model)

Params:

Result: DataFrame

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/ImputerModel.html

Timestamp: 2020-10-19T01:56:30.491Z


thetaclj

(theta model)

Params:

Result: Matrix

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/NaiveBayesModel.html

Timestamp: 2020-10-19T01:56:39.648Z


thresholdsclj

(thresholds model)
Params: 

Result: Array[Double]



Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/RandomForestClassificationModel.html

Timestamp: 2020-10-19T01:56:37.629Z

tokeniserclj

(tokeniser params)

A tokenizer that converts the input string to lowercase and then splits it by white spaces.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Tokenizer.html

Timestamp: 2020-10-19T01:56:17.265Z


tokenizerclj

(tokenizer params)

A tokenizer that converts the input string to lowercase and then splits it by white spaces.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Tokenizer.html

Timestamp: 2020-10-19T01:56:17.265Z


total-num-nodesclj

(total-num-nodes model)

Params:

Result: Int

Total number of nodes, summed over all trees in the ensemble.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/RandomForestClassificationModel.html

Timestamp: 2020-10-19T01:56:37.716Z


train-validation-splitclj

(train-validation-split {:keys [estimator evaluator estimator-param-maps seed
                                parallelism]})

Validation for hyper-parameter tuning. Randomly splits the input dataset into train and validation sets, and uses evaluation metric on the validation set to select the best model. Similar to CrossValidator, but only splits the set once.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/tuning/TrainValidationSplit.html

Timestamp: 2020-10-19T01:55:49.563Z

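A minimal sketch wiring an estimator and an evaluator together, assuming the ml alias for zero-one.geni.ml. Here param-maps stands for a sequence of estimator param maps (built, for instance, with a ParamGridBuilder-style helper); its construction is omitted.

(def tvs
  (ml/train-validation-split
    {:estimator            (ml/random-forest-regressor {:label-col :label})
     :evaluator            (ml/regression-evaluator {:label-col   :label
                                                     :metric-name "rmse"})
     :estimator-param-maps param-maps
     :parallelism          2
     :seed                 42}))

(def best-model (ml/fit training-df tvs))   ;; keeps the model with the best metric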

transformclj

(transform dataframe transformer)

Params: (dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*)

Result: DataFrame

Transforms the dataset with optional parameters

dataset: input dataset

firstParamPair: the first param pair, overwrites embedded params

otherParamPairs: other param pairs, overwrite embedded params

returns: transformed dataset

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/LinearRegressionModel.html

Timestamp: 2020-10-19T01:56:36.391Z


tree-weightsclj

(tree-weights model)

Params:

Result: Array[Double]

Weights for each tree, zippable with trees

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/RandomForestClassificationModel.html

Timestamp: 2020-10-19T01:56:37.737Z


treesclj

(trees model)

Params:

Result: Array[DecisionTreeClassificationModel]

Trees in this ensemble. Warning: These have null parent Estimators.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/RandomForestClassificationModel.html

Timestamp: 2020-10-19T01:56:37.739Z


uidclj

(uid model)

Params:

Result: String

An immutable unique ID for the object and its derivatives.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/fpm/PrefixSpan.html

Timestamp: 2020-10-19T01:56:35.754Z


user-factorsclj

(user-factors model)

Params:

Result: DataFrame

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/recommendation/ALSModel.html

Timestamp: 2020-10-19T01:56:42.347Z


vector->arrayclj

(vector->array expr)
(vector->array expr dtype)

Params: (v: Column, dtype: String = "float64")

Result: Column

Converts a column of MLlib sparse/dense vectors into a column of dense arrays.

returns: an array<float> if dtype is float32, or array<double> if dtype is float64

Since: 3.0.0

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/functions$.html

Timestamp: 2020-10-19T01:56:27.317Z

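A minimal sketch; vector->array returns a Column, so it is used inside a select. Assumes g/select from zero-one.geni.core, the ml alias for zero-one.geni.ml, and that column arguments may be given as keywords.

(require '[zero-one.geni.core :as g])

(g/select predictions-df
          (ml/vector->array :features)             ;; array<double>, the float64 default
          (ml/vector->array :features "float32"))  ;; array<float>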

vector-assemblerclj

(vector-assembler params)

A feature transformer that merges multiple columns into a vector column.

This requires one pass over the entire dataset. In case we need to infer column lengths from the data we require an additional call to the 'first' Dataset method, see 'handleInvalid' parameter.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/VectorAssembler.html

Timestamp: 2020-10-19T01:56:17.622Z

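A minimal sketch merging three numeric columns into one vector column, assuming the ml alias for zero-one.geni.ml and that :handle-invalid is the kebab-case form of the handleInvalid param mentioned above.

(def assembler
  (ml/vector-assembler {:input-cols     [:age :height :weight]
                        :output-col     :features
                        :handle-invalid "skip"}))   ;; drop rows with invalid entries

(ml/transform people-df assembler)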

vector-indexerclj

(vector-indexer params)

Class for indexing categorical feature columns in a dataset of Vector.

This has 2 usage modes:

This returns a model which can transform categorical features to use 0-based indices.

Index stability:

TODO: Future extensions: The following functionality is planned for the future:

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/VectorIndexer.html

Timestamp: 2020-10-19T01:56:18.174Z


vector-size-hintclj

(vector-size-hint params)

A feature transformer that adds size information to the metadata of a vector column. VectorAssembler needs size information for its input columns and cannot be used on streaming dataframes without this metadata.

Note: VectorSizeHint modifies inputCol to include size metadata and does not have an outputCol.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/VectorSizeHint.html

Timestamp: 2020-10-19T01:56:18.723Z


vector-to-arrayclj

(vector-to-array expr)
(vector-to-array expr dtype)

Params: (v: Column, dtype: String = "float64")

Result: Column

Converts a column of MLlib sparse/dense vectors into a column of dense arrays.

returns: an array<float> if dtype is float32, or array<double> if dtype is float64

Since: 3.0.0

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/functions$.html

Timestamp: 2020-10-19T01:56:27.317Z


vocab-sizeclj

(vocab-size model)

Params:

Result: Int

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/LDAModel.html

Timestamp: 2020-10-19T01:56:43.011Z


vocabularyclj

(vocabulary model)

Params:

Result: Array[String]

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/CountVectorizerModel.html

Timestamp: 2020-10-19T01:56:34.357Z


weightsclj

(weights model)

Params:

Result: Array[Double]

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/GaussianMixtureModel.html

Timestamp: 2020-10-19T01:56:40.312Z


word-2-vecclj

(word-2-vec params)

Word2Vec trains a model of Map(String, Vector), i.e. transforms a word into a code for further natural language processing or machine learning process.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Word2Vec.html

Timestamp: 2020-10-19T01:56:19.459Z


word2vecclj

(word2vec params)

Word2Vec trains a model of Map(String, Vector), i.e. transforms a word into a code for further natural language processing or machine learning process.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Word2Vec.html

Timestamp: 2020-10-19T01:56:19.459Z


write-native-model!clj

(write-native-model! model path)

Save the native XGBoost's Booster to file.


write-stage!clj

(write-stage! stage path)
(write-stage! stage path options)

Save a PipelineStage to the specified path.

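A minimal sketch of a save/load round trip, assuming the ml alias for zero-one.geni.ml, that pipeline-model is a fitted PipelineModel, and that model-cls in read-stage! is the stage's underlying Spark class.

(import '(org.apache.spark.ml PipelineModel))

(ml/write-stage! pipeline-model "target/my-pipeline-model")
(def reloaded (ml/read-stage! PipelineModel "target/my-pipeline-model"))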

xgboost-classifierclj

(xgboost-classifier params)
Gradient boosting classifier based on xgboost.

XGBoost docs: https://xgboost.readthedocs.io/en/latest/

XGBoost4J docs: https://xgboost.readthedocs.io/en/latest/jvm/scaladocs/xgboost4j-spark/ml/dmlc/xgboost4j/scala/spark/XGBoostClassifier.html

xgboost-regressorclj

(xgboost-regressor params)
Gradient boosting regressor based on xgboost.

XGBoost docs: https://xgboost.readthedocs.io/en/latest/

XGBoost4J docs: https://xgboost.readthedocs.io/en/latest/jvm/scaladocs/xgboost4j-spark/ml/dmlc/xgboost4j/scala/spark/XGBoostRegressor.html
