(aft-survival-regression params)
Fit a parametric survival regression model, the accelerated failure time (AFT) model (see Accelerated failure time model (Wikipedia)), based on the Weibull distribution of the survival time.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.html
Timestamp: 2020-10-19T01:55:51.453Z
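For orientation, a minimal Scala sketch of the underlying Spark estimator that this wrapper exposes, assuming an active SparkSession `spark`; the tiny inline dataset is illustrative only:

  import org.apache.spark.ml.linalg.Vectors
  import org.apache.spark.ml.regression.AFTSurvivalRegression

  // label = observed survival time, censor = 1.0 if the event occurred, 0.0 if censored.
  val training = spark.createDataFrame(Seq(
    (1.218, 1.0, Vectors.dense(1.560, -0.605)),
    (2.949, 0.0, Vectors.dense(0.346, 2.158)),
    (3.627, 0.0, Vectors.dense(1.380, 0.231)),
    (0.273, 1.0, Vectors.dense(0.520, 1.151)),
    (4.199, 0.0, Vectors.dense(0.795, -0.226))
  )).toDF("label", "censor", "features")

  val aft = new AFTSurvivalRegression()
    .setQuantileProbabilities(Array(0.3, 0.6))
    .setQuantilesCol("quantiles")

  val model = aft.fit(training)
  println(s"Coefficients: ${model.coefficients}, intercept: ${model.intercept}, scale: ${model.scale}")
  model.transform(training).show(false)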
(als params)
Alternating Least Squares (ALS) matrix factorization.
ALS attempts to estimate the ratings matrix R as the product of two lower-rank matrices, X and Y, i.e. X * Yt = R. Typically these approximations are called 'factor' matrices. The general approach is iterative. During each iteration, one of the factor matrices is held constant, while the other is solved for using least squares. The newly-solved factor matrix is then held constant while solving for the other factor matrix.
This is a blocked implementation of the ALS factorization algorithm that groups the two sets of factors (referred to as "users" and "products") into blocks and reduces communication by only sending one copy of each user vector to each product block on each iteration, and only for the product blocks that need that user's feature vector. This is achieved by pre-computing some information about the ratings matrix to determine the "out-links" of each user (which blocks of products it will contribute to) and "in-link" information for each product (which of the feature vectors it receives from each user block it will depend on). This allows us to send only an array of feature vectors between each user block and product block, and have the product block find the users' ratings and update the products based on these messages.
For implicit preference data, the algorithm used is based on "Collaborative Filtering for Implicit Feedback Datasets", available at https://doi.org/10.1109/ICDM.2008.22, adapted for the blocked approach used here.
Essentially instead of finding the low-rank approximations to the rating matrix R, this finds the approximations for a preference matrix P where the elements of P are 1 if r is greater than 0 and 0 if r is less than or equal to 0. The ratings then act as 'confidence' values related to strength of indicated user preferences rather than explicit ratings given to items.
Note: the input rating dataset to the ALS implementation should be deterministic. Nondeterministic data can cause failures when fitting the ALS model. For example, an order-sensitive operation like sampling after a repartition makes the dataset output nondeterministic, as in dataset.repartition(2).sample(false, 0.5, 1618). Checkpointing the sampled dataset or adding a sort before sampling can help make the dataset deterministic.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/recommendation/ALS.html
Timestamp: 2020-10-19T01:56:00.419Z
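For orientation, a minimal Scala sketch of the underlying Spark estimator, assuming an active SparkSession `spark` and a `ratings` DataFrame with userId, movieId and rating columns (hypothetical names):

  import org.apache.spark.ml.evaluation.RegressionEvaluator
  import org.apache.spark.ml.recommendation.ALS

  val als = new ALS()
    .setMaxIter(5)
    .setRegParam(0.01)
    .setUserCol("userId")
    .setItemCol("movieId")
    .setRatingCol("rating")
  val model = als.fit(ratings)

  // Drop NaN predictions for users/items unseen during training before evaluating.
  model.setColdStartStrategy("drop")
  val predictions = model.transform(ratings)
  val rmse = new RegressionEvaluator()
    .setMetricName("rmse")
    .setLabelCol("rating")
    .setPredictionCol("prediction")
    .evaluate(predictions)

  // Top-10 item recommendations for every user.
  val userRecs = model.recommendForAllUsers(10)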
(alternating-least-squares params)
Alternating Least Squares (ALS) matrix factorization.
ALS attempts to estimate the ratings matrix R as the product of two lower-rank matrices, X and Y, i.e. X * Yt = R. Typically these approximations are called 'factor' matrices. The general approach is iterative. During each iteration, one of the factor matrices is held constant, while the other is solved for using least squares. The newly-solved factor matrix is then held constant while solving for the other factor matrix.
This is a blocked implementation of the ALS factorization algorithm that groups the two sets of factors (referred to as "users" and "products") into blocks and reduces communication by only sending one copy of each user vector to each product block on each iteration, and only for the product blocks that need that user's feature vector. This is achieved by pre-computing some information about the ratings matrix to determine the "out-links" of each user (which blocks of products it will contribute to) and "in-link" information for each product (which of the feature vectors it receives from each user block it will depend on). This allows us to send only an array of feature vectors between each user block and product block, and have the product block find the users' ratings and update the products based on these messages.
For implicit preference data, the algorithm used is based on "Collaborative Filtering for Implicit Feedback Datasets", available at https://doi.org/10.1109/ICDM.2008.22, adapted for the blocked approach used here.
Essentially instead of finding the low-rank approximations to the rating matrix R, this finds the approximations for a preference matrix P where the elements of P are 1 if r is greater than 0 and 0 if r is less than or equal to 0. The ratings then act as 'confidence' values related to strength of indicated user preferences rather than explicit ratings given to items.
Note: the input rating dataset to the ALS implementation should be deterministic. Nondeterministic data can cause failures when fitting the ALS model. For example, an order-sensitive operation like sampling after a repartition makes the dataset output nondeterministic, as in dataset.repartition(2).sample(false, 0.5, 1618). Checkpointing the sampled dataset or adding a sort before sampling can help make the dataset deterministic.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/recommendation/ALS.html
Timestamp: 2020-10-19T01:56:00.419Z
(approx-nearest-neighbors dataset model key-v n-nearest)
(approx-nearest-neighbors dataset model key-v n-nearest dist-col)
Params: (dataset: Dataset[_], key: Vector, numNearestNeighbors: Int, distCol: String)
Result: Dataset[_]
Given a large dataset and an item, approximately find at most k items which have the closest distance to the item. If the outputCol is missing, the method will transform the data; if the outputCol exists, it will use the outputCol. This allows caching of the transformed data when necessary.
The dataset to search for nearest neighbors of the key.
Feature vector representing the item to search for.
The maximum number of nearest neighbors.
Output column for storing the distance between each result row and the key.
A dataset containing at most k items closest to the key. A column "distCol" is added to show the distance between each row and the key.
This method is experimental and will likely change behavior in the next release.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/MinHashLSHModel.html
Timestamp: 2020-10-19T01:56:46.799Z
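A minimal Scala sketch of approxNearestNeighbors on a fitted MinHashLSHModel, assuming an active SparkSession `spark`; the tiny dataset is illustrative only:

  import org.apache.spark.ml.feature.MinHashLSH
  import org.apache.spark.ml.linalg.Vectors

  val df = spark.createDataFrame(Seq(
    (0, Vectors.sparse(6, Seq((0, 1.0), (1, 1.0), (2, 1.0)))),
    (1, Vectors.sparse(6, Seq((2, 1.0), (3, 1.0), (4, 1.0)))),
    (2, Vectors.sparse(6, Seq((0, 1.0), (2, 1.0), (4, 1.0))))
  )).toDF("id", "features")

  val mh = new MinHashLSH()
    .setNumHashTables(5)
    .setInputCol("features")
    .setOutputCol("hashes")
  val model = mh.fit(df)

  // Approximately find the 2 rows with the smallest Jaccard distance to the key.
  val key = Vectors.sparse(6, Seq((1, 1.0), (3, 1.0)))
  model.approxNearestNeighbors(df, key, 2).show(false)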
(approx-nearest-neighbours dataset model key-v n-nearest)
(approx-nearest-neighbours dataset model key-v n-nearest dist-col)
Params: (dataset: Dataset[_], key: Vector, numNearestNeighbors: Int, distCol: String)
Result: Dataset[_]
Given a large dataset and an item, approximately find at most k items which have the closest distance to the item. If the outputCol is missing, the method will transform the data; if the outputCol exists, it will use the outputCol. This allows caching of the transformed data when necessary.
The dataset to search for nearest neighbors of the key.
Feature vector representing the item to search for.
The maximum number of nearest neighbors.
Output column for storing the distance between each result row and the key.
A dataset containing at most k items closest to the key. A column "distCol" is added to show the distance between each row and the key.
This method is experimental and will likely change behavior in the next release.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/MinHashLSHModel.html
Timestamp: 2020-10-19T01:56:46.799Z
(approx-similarity-join dataset-a dataset-b model threshold)
(approx-similarity-join dataset-a dataset-b model threshold dist-col)
Params: (datasetA: Dataset[_], datasetB: Dataset[_], threshold: Double, distCol: String)
Result: Dataset[_]
Join two datasets to approximately find all pairs of rows whose distance are smaller than the threshold. If the outputCol is missing, the method will transform the data; if the outputCol exists, it will use the outputCol. This allows caching of the transformed data when necessary.
One of the datasets to join.
Another dataset to join.
The threshold for the distance of row pairs.
Output column for storing the distance between each pair of rows.
A joined dataset containing pairs of rows. The original rows are in columns "datasetA" and "datasetB", and a column "distCol" is added to show the distance between each pair.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/MinHashLSHModel.html
Timestamp: 2020-10-19T01:56:46.802Z
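Continuing the MinHashLSH sketch above (same assumed `spark`, `df` and fitted `model`), an approxSimilarityJoin call; the distance column name here is user-chosen:

  import org.apache.spark.sql.functions.col

  val dfB = spark.createDataFrame(Seq(
    (3, Vectors.sparse(6, Seq((1, 1.0), (3, 1.0), (5, 1.0)))),
    (4, Vectors.sparse(6, Seq((2, 1.0), (3, 1.0), (5, 1.0))))
  )).toDF("id", "features")

  // Pairs of rows from df and dfB whose Jaccard distance is below 0.6.
  model.approxSimilarityJoin(df, dfB, 0.6, "JaccardDistance")
    .select(col("datasetA.id").alias("idA"),
            col("datasetB.id").alias("idB"),
            col("JaccardDistance"))
    .show()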
(association-rules model)
Params:
Result: DataFrame
Get association rules fitted using the minConfidence. Returns a dataframe with four fields, "antecedent", "consequent", "confidence" and "lift", where "antecedent" and "consequent" are Array[T], whereas "confidence" and "lift" are Double.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/fpm/FPGrowthModel.html
Timestamp: 2020-10-19T01:56:43.538Z
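A minimal Scala sketch showing where associationRules (and the freqItemsets documented under (freq-itemsets model) below) come from, assuming an active SparkSession `spark`:

  import spark.implicits._
  import org.apache.spark.ml.fpm.FPGrowth

  val dataset = spark.createDataset(Seq("1 2 5", "1 2 3 5", "1 2"))
    .map(t => t.split(" ")).toDF("items")

  val fpgrowth = new FPGrowth()
    .setItemsCol("items")
    .setMinSupport(0.5)
    .setMinConfidence(0.6)
  val model = fpgrowth.fit(dataset)

  model.freqItemsets.show()        // frequent itemsets and their counts
  model.associationRules.show()    // antecedent, consequent, confidence, lift
  model.transform(dataset).show()  // predicted consequents per input row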
(best-model model)
Params:
Result: Model[_]
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/tuning/CrossValidatorModel.html
Timestamp: 2020-10-19T01:56:45.449Z
(binariser params)
Binarize a column of continuous features given a threshold.
Since 3.0.0, Binarizer can map multiple columns at once by setting the inputCols parameter. Note that when both the inputCol and inputCols parameters are set, an Exception will be thrown. The threshold parameter is used for single column usage, and thresholds is for multiple columns.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Binarizer.html
Timestamp: 2020-10-19T01:56:05.331Z
(binarizer params)
Binarize a column of continuous features given a threshold.
Since 3.0.0, Binarizer can map multiple columns at once by setting the inputCols parameter. Note that when both the inputCol and inputCols parameters are set, an Exception will be thrown. The threshold parameter is used for single column usage, and thresholds is for multiple columns.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Binarizer.html
Timestamp: 2020-10-19T01:56:05.331Z
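A minimal Scala sketch of the single-column usage, assuming an active SparkSession `spark`:

  import org.apache.spark.ml.feature.Binarizer

  val df = spark.createDataFrame(Seq((0, 0.1), (1, 0.8), (2, 0.2)))
    .toDF("id", "feature")

  val binarizer = new Binarizer()
    .setInputCol("feature")
    .setOutputCol("binarized_feature")
    .setThreshold(0.5)   // values greater than 0.5 map to 1.0, the rest to 0.0

  binarizer.transform(df).show()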
(binary-classification-evaluator params)
Evaluator for binary classification, which expects input columns rawPrediction, label and an optional weight column. The rawPrediction column can be of type double (binary 0/1 prediction, or probability of label 1) or of type vector (length-2 vector of raw predictions, scores, or label probabilities).
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.html
Timestamp: 2020-10-19T01:56:00.765Z
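A minimal Scala sketch, assuming `predictions` is the output of a fitted binary classifier (so it carries rawPrediction and label columns):

  import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

  val evaluator = new BinaryClassificationEvaluator()
    .setLabelCol("label")
    .setRawPredictionCol("rawPrediction")
    .setMetricName("areaUnderROC")   // or "areaUnderPR"

  val auc = evaluator.evaluate(predictions)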
(binary-summary model)
Params:
Result: BinaryLogisticRegressionTrainingSummary
Gets summary of model on training set. An exception is thrown if hasSummary is false or it is a multiclass model.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/LogisticRegressionModel.html
Timestamp: 2020-10-19T01:56:46.093Z
(bisecting-k-means params)
A bisecting k-means algorithm based on the paper "A comparison of document clustering techniques" by Steinbach, Karypis, and Kumar, with modification to fit Spark. The algorithm starts from a single cluster that contains all points. Iteratively it finds divisible clusters on the bottom level and bisects each of them using k-means, until there are k leaf clusters in total or no leaf clusters are divisible. The bisecting steps of clusters on the same level are grouped together to increase parallelism. If bisecting all divisible clusters on the bottom level would result in more than k leaf clusters, larger clusters get higher priority.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/BisectingKMeans.html
Timestamp: 2020-10-19T01:56:03.281Z
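A minimal Scala sketch, assuming a `dataset` DataFrame with a "features" vector column:

  import org.apache.spark.ml.clustering.BisectingKMeans
  import org.apache.spark.ml.evaluation.ClusteringEvaluator

  val bkm = new BisectingKMeans().setK(2).setSeed(1)
  val model = bkm.fit(dataset)

  val predictions = model.transform(dataset)
  val silhouette = new ClusteringEvaluator().evaluate(predictions)

  // Centers of the k leaf clusters.
  model.clusterCenters.foreach(println)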
(boundaries model)
Params:
Result: Vector
Boundaries in increasing order for which predictions are known.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/IsotonicRegressionModel.html
Timestamp: 2020-10-19T01:56:44.821Z
(bucketed-random-projection-lsh params)
This BucketedRandomProjectionLSH implements Locality Sensitive Hashing functions for Euclidean distance metrics.
The input is dense or sparse vectors, each of which represents a point in the Euclidean distance space. The output will be vectors of configurable dimension. Hash values in the same dimension are calculated by the same hash function.
References:
1. Wikipedia on Stable Distributions
2. Wang, Jingdong et al. "Hashing for similarity search: A survey." arXiv preprint arXiv:1408.2927 (2014).
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/BucketedRandomProjectionLSH.html
Timestamp: 2020-10-19T01:56:05.693Z
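A minimal Scala sketch, assuming an active SparkSession `spark`; the dataset is illustrative only:

  import org.apache.spark.ml.feature.BucketedRandomProjectionLSH
  import org.apache.spark.ml.linalg.Vectors

  val dfA = spark.createDataFrame(Seq(
    (0, Vectors.dense(1.0, 1.0)),
    (1, Vectors.dense(1.0, -1.0)),
    (2, Vectors.dense(-1.0, -1.0)),
    (3, Vectors.dense(-1.0, 1.0))
  )).toDF("id", "features")

  val brp = new BucketedRandomProjectionLSH()
    .setBucketLength(2.0)
    .setNumHashTables(3)
    .setInputCol("features")
    .setOutputCol("hashes")

  val model = brp.fit(dfA)
  // Hashed vectors; the fitted model also supports approxNearestNeighbors and
  // approxSimilarityJoin with Euclidean distance.
  model.transform(dfA).show(false)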
(bucketiser params)
Bucketizer maps a column of continuous features to a column of feature buckets.
Since 2.3.0, Bucketizer can map multiple columns at once by setting the inputCols parameter. Note that when both the inputCol and inputCols parameters are set, an Exception will be thrown. The splits parameter is only used for single column usage, and splitsArray is for multiple columns.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Bucketizer.html
Timestamp: 2020-10-19T01:56:06.060Z
(bucketizer params)
Bucketizer maps a column of continuous features to a column of feature buckets.
Since 2.3.0, Bucketizer can map multiple columns at once by setting the inputCols parameter. Note that when both the inputCol and inputCols parameters are set, an Exception will be thrown. The splits parameter is only used for single column usage, and splitsArray is for multiple columns.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Bucketizer.html
Timestamp: 2020-10-19T01:56:06.060Z
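A minimal Scala sketch of the single-column usage, assuming an active SparkSession `spark`:

  import org.apache.spark.ml.feature.Bucketizer

  val splits = Array(Double.NegativeInfinity, -0.5, 0.0, 0.5, Double.PositiveInfinity)
  val data = Array(-999.9, -0.5, -0.3, 0.0, 0.2, 999.9)
  val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")

  val bucketizer = new Bucketizer()
    .setInputCol("features")
    .setOutputCol("bucketedFeatures")
    .setSplits(splits)   // each value maps to the bucket [splits(i), splits(i+1)) it falls into

  bucketizer.transform(df).show()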
(category-maps model)
Params:
Result: Map[Int, Map[Double, Int]]
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/VectorIndexerModel.html
Timestamp: 2020-10-19T01:56:31.705Z
(category-sizes model)
Params:
Result: Array[Int]
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/OneHotEncoderModel.html
Timestamp: 2020-10-19T01:56:28.967Z
(chi-sq-selector params)
Chi-Squared feature selection, which selects categorical features to use for predicting a categorical label. The selector supports different selection methods: numTopFeatures, percentile, fpr, fdr, fwe.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/ChiSqSelector.html
Timestamp: 2020-10-19T01:56:06.428Z
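A minimal Scala sketch using the numTopFeatures selection method, assuming an active SparkSession `spark`:

  import org.apache.spark.ml.feature.ChiSqSelector
  import org.apache.spark.ml.linalg.Vectors

  val df = spark.createDataFrame(Seq(
    (7, Vectors.dense(0.0, 0.0, 18.0, 1.0), 1.0),
    (8, Vectors.dense(0.0, 1.0, 12.0, 0.0), 0.0),
    (9, Vectors.dense(1.0, 0.0, 15.0, 0.1), 0.0)
  )).toDF("id", "features", "clicked")

  val selector = new ChiSqSelector()
    .setNumTopFeatures(1)
    .setFeaturesCol("features")
    .setLabelCol("clicked")
    .setOutputCol("selectedFeatures")

  selector.fit(df).transform(df).show()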
(chi-square-test dataframe features-col label-col)
Chi-square hypothesis testing for categorical data.
See Wikipedia for more information on the Chi-squared test.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/stat/ChiSquareTest$.html
Timestamp: 2020-10-19T01:55:49.886Z
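A minimal Scala sketch, assuming an active SparkSession `spark`; one test is run per feature against the label:

  import org.apache.spark.ml.linalg.{Vector, Vectors}
  import org.apache.spark.ml.stat.ChiSquareTest

  val df = spark.createDataFrame(Seq(
    (0.0, Vectors.dense(0.5, 10.0)),
    (0.0, Vectors.dense(1.5, 20.0)),
    (1.0, Vectors.dense(1.5, 30.0)),
    (0.0, Vectors.dense(3.5, 30.0)),
    (1.0, Vectors.dense(3.5, 40.0))
  )).toDF("label", "features")

  val chi = ChiSquareTest.test(df, "features", "label").head
  println(s"pValues = ${chi.getAs[Vector](0)}")
  println(s"degreesOfFreedom = ${chi.getSeq[Int](1).mkString("[", ",", "]")}")
  println(s"statistics = ${chi.getAs[Vector](2)}")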
(cluster-centers model)
Params:
Result: Array[Vector]
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/KMeansModel.html
Timestamp: 2020-10-19T01:56:36.922Z
(clustering-evaluator params)
Evaluator for clustering results. The metric computes the Silhouette measure using the specified distance measure.
The Silhouette is a measure for the validation of the consistency within clusters. It ranges between 1 and -1, where a value close to 1 means that the points in a cluster are close to the other points in the same cluster and far from the points of the other clusters.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.html
Timestamp: 2020-10-19T01:56:01.116Z
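A minimal Scala sketch, assuming `predictions` is the output of a fitted clustering model (it must carry the prediction and features columns):

  import org.apache.spark.ml.evaluation.ClusteringEvaluator

  val evaluator = new ClusteringEvaluator()
    .setDistanceMeasure("squaredEuclidean")   // or "cosine"
  val silhouette = evaluator.evaluate(predictions)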
(coefficient-matrix model)
Params:
Result: Matrix
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/LogisticRegressionModel.html
Timestamp: 2020-10-19T01:56:46.098Z
(coefficients model)
Params:
Result: Vector
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/LinearRegressionModel.html
Timestamp: 2020-10-19T01:56:36.282Z
Column: Aggregate function: returns the Pearson Correlation Coefficient for two columns.
Dataset: Calculates the Pearson Correlation Coefficient of two columns of a DataFrame.
(count-vectoriser params)
Extracts a vocabulary from document collections and generates a CountVectorizerModel.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/CountVectorizer.html
Timestamp: 2020-10-19T01:56:06.801Z
(count-vectorizer params)
Extracts a vocabulary from document collections and generates a CountVectorizerModel.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/CountVectorizer.html
Timestamp: 2020-10-19T01:56:06.801Z
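A minimal Scala sketch, assuming an active SparkSession `spark`:

  import org.apache.spark.ml.feature.CountVectorizer

  val df = spark.createDataFrame(Seq(
    (0, Array("a", "b", "c")),
    (1, Array("a", "b", "b", "c", "a"))
  )).toDF("id", "words")

  val cvModel = new CountVectorizer()
    .setInputCol("words")
    .setOutputCol("features")
    .setVocabSize(3)   // keep at most 3 terms in the vocabulary
    .setMinDF(2)       // a term must appear in at least 2 documents
    .fit(df)

  cvModel.transform(df).show(false)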
(cross-validator {:keys [estimator evaluator estimator-param-maps num-folds seed
parallelism]})
K-fold cross validation performs model selection by splitting the dataset into a set of non-overlapping, randomly partitioned folds which are used as separate training and test datasets. For example, with k=3 folds, K-fold cross validation will generate 3 (training, test) dataset pairs, each of which uses 2/3 of the data for training and 1/3 for testing. Each fold is used as the test set exactly once.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/tuning/CrossValidator.html
Timestamp: 2020-10-19T01:55:48.855Z
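A minimal Scala sketch wiring an estimator, an evaluator and a parameter grid into cross validation; `training` (a DataFrame with label/features columns) is assumed:

  import org.apache.spark.ml.classification.LogisticRegression
  import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
  import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

  val lr = new LogisticRegression().setMaxIter(10)

  val paramGrid = new ParamGridBuilder()
    .addGrid(lr.regParam, Array(0.1, 0.01))
    .addGrid(lr.elasticNetParam, Array(0.0, 0.5))
    .build()

  val cv = new CrossValidator()
    .setEstimator(lr)
    .setEvaluator(new BinaryClassificationEvaluator())
    .setEstimatorParamMaps(paramGrid)
    .setNumFolds(3)
    .setParallelism(2)

  val cvModel = cv.fit(training)
  val best = cvModel.bestModel   // the model refit on the best parameter combination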
(dct params)
A feature transformer that takes the 1D discrete cosine transform of a real vector. No zero padding is performed on the input vector. It returns a real vector of the same length representing the DCT. The return vector is scaled such that the transform matrix is unitary (aka scaled DCT-II).
More information on DCT-II in Discrete cosine transform (Wikipedia).
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/DCT.html
Timestamp: 2020-10-19T01:56:07.160Z
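A minimal Scala sketch, assuming an active SparkSession `spark`:

  import org.apache.spark.ml.feature.DCT
  import org.apache.spark.ml.linalg.Vectors

  val data = Seq(
    Vectors.dense(0.0, 1.0, -2.0, 3.0),
    Vectors.dense(-1.0, 2.0, 4.0, -7.0))
  val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")

  val dct = new DCT()
    .setInputCol("features")
    .setOutputCol("featuresDCT")
    .setInverse(false)   // forward (scaled) DCT-II; true applies the inverse transform

  dct.transform(df).select("featuresDCT").show(false)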
(decision-tree-classifier params)
Decision tree learning algorithm (http://en.wikipedia.org/wiki/Decision_tree_learning) for classification. It supports both binary and multiclass labels, as well as both continuous and categorical features.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.html
Timestamp: 2020-10-19T01:55:55.948Z
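A minimal Scala sketch, assuming `training` is a DataFrame with an indexed "label" column and a vector "features" column:

  import org.apache.spark.ml.classification.DecisionTreeClassifier

  val dt = new DecisionTreeClassifier()
    .setLabelCol("label")
    .setFeaturesCol("features")
    .setMaxDepth(5)

  val model = dt.fit(training)
  println(s"Learned classification tree model:\n${model.toDebugString}")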
(decision-tree-regressor params)
Decision tree learning algorithm for regression. It supports both continuous and categorical features.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/DecisionTreeRegressor.html
Timestamp: 2020-10-19T01:55:52.001Z
(depth model)
Params:
Result: Int
Depth of the tree. E.g.: Depth 0 means 1 leaf node. Depth 1 means 1 internal node and 2 leaf nodes.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/DecisionTreeClassificationModel.html
Timestamp: 2020-10-19T01:56:41.586Z
Params: (maxTermsPerTopic: Int)
Result: DataFrame
Return the topics described by their top-weighted terms.
Maximum number of terms to collect for each topic. Default value of 10.
Local DataFrame with one topic per Row, with columns:
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/LDAModel.html
Timestamp: 2020-10-19T01:56:42.892Z
(discrete-cosine-transform params)
A feature transformer that takes the 1D discrete cosine transform of a real vector. No zero padding is performed on the input vector. It returns a real vector of the same length representing the DCT. The return vector is scaled such that the transform matrix is unitary (aka scaled DCT-II).
More information on DCT-II in Discrete cosine transform (Wikipedia).
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/DCT.html
Timestamp: 2020-10-19T01:56:07.160Z
(distributed? model)
Params:
Result: Boolean
Indicates whether this instance is of type DistributedLDAModel
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/LDAModel.html
Timestamp: 2020-10-19T01:56:42.877Z
(elementwise-product params)
Outputs the Hadamard product (i.e., the element-wise product) of each input vector with a provided "weight" vector. In other words, it scales each column of the dataset by a scalar multiplier.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/ElementwiseProduct.html
Timestamp: 2020-10-19T01:56:07.551Z
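A minimal Scala sketch, assuming an active SparkSession `spark`:

  import org.apache.spark.ml.feature.ElementwiseProduct
  import org.apache.spark.ml.linalg.Vectors

  val df = spark.createDataFrame(Seq(
    ("a", Vectors.dense(1.0, 2.0, 3.0)),
    ("b", Vectors.dense(4.0, 5.0, 6.0))
  )).toDF("id", "vector")

  val transformer = new ElementwiseProduct()
    .setScalingVec(Vectors.dense(0.0, 1.0, 2.0))   // the "weight" vector
    .setInputCol("vector")
    .setOutputCol("transformedVector")

  // Each output vector is the Hadamard product of the input vector and the weight vector.
  transformer.transform(df).show()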
(estimated-doc-concentration model)
Params:
Result: Vector
Value for docConcentration estimated from data. If Online LDA was used and optimizeDocConcentration was set to false, then this returns the fixed (given) value for the docConcentration parameter.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/LDAModel.html
Timestamp: 2020-10-19T01:56:42.897Z
Params: Result: Vector Value for docConcentration estimated from data. If Online LDA was used and optimizeDocConcentration was set to false, then this returns the fixed (given) value for the docConcentration parameter. Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/LDAModel.html Timestamp: 2020-10-19T01:56:42.897Z
(evaluate dataframe evaluator)
Params: (dataset: Dataset[_])
Result: LinearRegressionSummary
Evaluates the model on a test dataset.
Test dataset to evaluate model on.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/LinearRegressionModel.html
Timestamp: 2020-10-19T01:56:36.292Z
(feature-hasher params)
Feature hashing projects a set of categorical or numerical features into a feature vector of specified dimension (typically substantially smaller than that of the original feature space). This is done using the hashing trick (https://en.wikipedia.org/wiki/Feature_hashing) to map features to indices in the feature vector.
The FeatureHasher transformer operates on multiple columns. Each column may contain either numeric or categorical features. Behavior and handling of column data types is as follows:
- Numeric columns: For numeric features, the hash value of the column name is used to map the feature value to its index in the feature vector. By default, numeric features are not treated as categorical (even when they are integers). To treat them as categorical, specify the relevant columns in categoricalCols.
- String columns: For categorical features, the hash value of the string "column_name=value" is used to map to the vector index, with an indicator value of 1.0. Thus, categorical features are "one-hot" encoded (similarly to using OneHotEncoder with dropLast=false).
- Boolean columns: Boolean values are treated in the same way as string columns. That is, boolean features are represented as "column_name=true" or "column_name=false", with an indicator value of 1.0.
Null (missing) values are ignored (implicitly zero in the resulting feature vector).
The hash function used here is also the MurmurHash 3 used in HashingTF. Since a simple modulo on the hashed value is used to determine the vector index, it is advisable to use a power of two as the numFeatures parameter; otherwise the features will not be mapped evenly to the vector indices.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/FeatureHasher.html
Timestamp: 2020-10-19T01:56:07.938Z
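A minimal Scala sketch, assuming an active SparkSession `spark`; column names are illustrative:

  import org.apache.spark.ml.feature.FeatureHasher

  val df = spark.createDataFrame(Seq(
    (2.2, true, "1", "foo"),
    (3.3, false, "2", "bar"),
    (4.4, false, "3", "baz"),
    (5.5, false, "4", "foo")
  )).toDF("real", "bool", "stringNum", "string")

  val hasher = new FeatureHasher()
    .setInputCols("real", "bool", "stringNum", "string")
    .setNumFeatures(256)   // a power of two, as advised above
    .setOutputCol("features")

  hasher.transform(df).show(false)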
(feature-importances model)
Params:
Result: Vector
Estimate of the importance of each feature.
Each feature's importance is the average of its importance across all trees in the ensemble. The importance vector is normalized to sum to 1. This method is suggested by Hastie et al. (Hastie, Tibshirani, Friedman. "The Elements of Statistical Learning, 2nd Edition." 2001.) and follows the implementation from scikit-learn.
See also: DecisionTreeClassificationModel.featureImportances
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/RandomForestClassificationModel.html
Timestamp: 2020-10-19T01:56:37.595Z
(features-col model)
Params:
Result: String
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/LinearRegressionModel.html
Timestamp: 2020-10-19T01:56:36.314Z
(find-frequent-sequential-patterns dataset prefix-span)
Params: (dataset: Dataset[_])
Result: DataFrame
Finds the complete set of frequent sequential patterns in the input sequences of itemsets.
A dataset or a dataframe containing a sequence column, which is of ArrayType(ArrayType(T)) type.
A DataFrame that contains columns of sequence and corresponding frequency. The schema of it will be: sequence: ArrayType(ArrayType(T)) (T is the item type) and freq: Long.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/fpm/PrefixSpan.html
Timestamp: 2020-10-19T01:56:35.709Z
(find-patterns dataset prefix-span)
Params: (dataset: Dataset[_])
Result: DataFrame
Finds the complete set of frequent sequential patterns in the input sequences of itemsets.
A dataset or a dataframe containing a sequence column, which is of ArrayType(ArrayType(T)) type.
A DataFrame that contains columns of sequence and corresponding frequency. The schema of it will be: sequence: ArrayType(ArrayType(T)) (T is the item type) and freq: Long.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/fpm/PrefixSpan.html
Timestamp: 2020-10-19T01:56:35.709Z
(fit dataframe estimator)
Params: (dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*)
Result: M
Fits a single model to the input data with optional parameters.
input dataset
the first param pair, overrides embedded params
other param pairs. These values override any specified in this Estimator's embedded ParamMap.
fitted model
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/ProbabilisticClassifier.html
Timestamp: 2020-10-19T01:56:44.210Z
(fm-classifier params)
Factorization Machines learning algorithm for classification. It supports normal gradient descent and AdamW solver.
The implementation is based upon:
S. Rendle. "Factorization machines" 2010.
FM is able to estimate interactions even in problems with huge sparsity (like advertising and recommendation systems). The second-order FM formula is y(x) = w0 + sum_i w_i x_i + sum_{i<j} <v_i, v_j> x_i x_j, where w0 is the global bias, w_i are per-feature weights, and v_i are learned factor vectors whose inner products capture pairwise interactions.
FM classification model uses logistic loss which can be solved by gradient descent method, and regularization terms like L2 are usually added to the loss function to prevent overfitting.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/FMClassifier.html
Timestamp: 2020-10-19T01:55:56.340Z
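A minimal Scala sketch, assuming `training` is a DataFrame with a 0/1 "label" column and a "features" vector column (features ideally scaled, e.g. to [0, 1]):

  import org.apache.spark.ml.classification.FMClassifier

  val fm = new FMClassifier()
    .setLabelCol("label")
    .setFeaturesCol("features")
    .setFactorSize(8)      // dimensionality of the pairwise-interaction factors v_i
    .setStepSize(0.01)

  val model = fm.fit(training)
  println(s"Factors: ${model.factors}")
  println(s"Linear: ${model.linear}, Intercept: ${model.intercept}")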
(fm-regressor params)
Factorization Machines learning algorithm for regression. It supports normal gradient descent and AdamW solver.
The implementation is based upon:
S. Rendle. "Factorization machines" 2010.
FM is able to estimate interactions even in problems with huge sparsity (like advertising and recommendation systems). The second-order FM formula is y(x) = w0 + sum_i w_i x_i + sum_{i<j} <v_i, v_j> x_i x_j, where w0 is the global bias, w_i are per-feature weights, and v_i are learned factor vectors whose inner products capture pairwise interactions.
FM regression model uses MSE loss which can be solved by gradient descent method, and regularization terms like L2 are usually added to the loss function to prevent overfitting.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/FMRegressor.html
Timestamp: 2020-10-19T01:55:52.555Z
(fp-growth params)
A parallel FP-growth algorithm to mine frequent itemsets. The algorithm is described in Li et al., PFP: Parallel FP-Growth for Query Recommendation. PFP distributes computation in such a way that each worker executes an independent group of mining tasks. The FP-Growth algorithm is described in Han et al., Mining frequent patterns without candidate generation. Note null values in the itemsCol column are ignored during fit().
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/fpm/FPGrowth.html
Timestamp: 2020-10-19T01:55:59.709Z
(freq-itemsets model)
Params:
Result: DataFrame
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/fpm/FPGrowthModel.html
Timestamp: 2020-10-19T01:56:43.556Z
(frequent-item-sets model)
Params:
Result: DataFrame
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/fpm/FPGrowthModel.html
Timestamp: 2020-10-19T01:56:43.556Z
(frequent-pattern-growth params)
A parallel FP-growth algorithm to mine frequent itemsets. The algorithm is described in Li et al., PFP: Parallel FP-Growth for Query Recommendation. PFP distributes computation in such a way that each worker executes an independent group of mining tasks. The FP-Growth algorithm is described in Han et al., Mining frequent patterns without candidate generation. Note null values in the itemsCol column are ignored during fit().
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/fpm/FPGrowth.html
Timestamp: 2020-10-19T01:55:59.709Z
(gaussian-mixture params)
Gaussian Mixture clustering.
This class performs expectation maximization for multivariate Gaussian Mixture Models (GMMs). A GMM represents a composite distribution of independent Gaussian distributions with associated "mixing" weights specifying each's contribution to the composite.
Given a set of sample points, this class will maximize the log-likelihood for a mixture of k Gaussians, iterating until the log-likelihood changes by less than convergenceTol, or until it has reached the max number of iterations. While this process is generally guaranteed to converge, it is not guaranteed to find a global optimum.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/GaussianMixture.html
Timestamp: 2020-10-19T01:56:03.645Z
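A minimal Scala sketch, assuming a `dataset` DataFrame with a "features" vector column:

  import org.apache.spark.ml.clustering.GaussianMixture

  val gmm = new GaussianMixture().setK(2).setSeed(1L)
  val model = gmm.fit(dataset)

  // Mixing weight, mean and covariance of each fitted Gaussian component.
  for (i <- 0 until model.getK) {
    println(s"Gaussian $i: weight=${model.weights(i)}\n" +
      s"mu=${model.gaussians(i).mean}\nsigma=\n${model.gaussians(i).cov}\n")
  }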
(gaussians-df model)
Params:
Result: DataFrame
Retrieve Gaussian distributions as a DataFrame. Each row represents a Gaussian Distribution. Two columns are defined: mean and cov. Schema:
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/GaussianMixtureModel.html
Timestamp: 2020-10-19T01:56:40.217Z
(gbt-classifier params)
Gradient-Boosted Trees (GBTs) (http://en.wikipedia.org/wiki/Gradient_boosting) learning algorithm for classification. It supports binary labels, as well as both continuous and categorical features.
The implementation is based upon: J.H. Friedman. "Stochastic Gradient Boosting." 1999.
Notes on Gradient Boosting vs. TreeBoost:
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/GBTClassifier.html
Timestamp: 2020-10-19T01:55:56.899Z
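A minimal Scala sketch, assuming `training` is a DataFrame with a binary "label" column and a vector "features" column:

  import org.apache.spark.ml.classification.GBTClassifier

  val gbt = new GBTClassifier()
    .setLabelCol("label")
    .setFeaturesCol("features")
    .setMaxIter(10)                     // number of boosting iterations (trees)
    .setFeatureSubsetStrategy("auto")

  val model = gbt.fit(training)
  println(s"Learned GBT classification model:\n${model.toDebugString}")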
(gbt-regressor params)
Gradient-Boosted Trees (GBTs) learning algorithm for regression. It supports both continuous and categorical features.
The implementation is based upon: J.H. Friedman. "Stochastic Gradient Boosting." 1999.
Notes on Gradient Boosting vs. TreeBoost:
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/GBTRegressor.html
Timestamp: 2020-10-19T01:55:53.108Z
(generalised-linear-regression params)
Fit a Generalized Linear Model (see Generalized linear model (Wikipedia)) specified by giving a symbolic description of the linear predictor (link function) and a description of the error distribution (family). It supports "gaussian", "binomial", "poisson", "gamma" and "tweedie" as family. Valid link functions for each family are listed below; the first link function of each family is the default one. gaussian: "identity", "log", "inverse". binomial: "logit", "probit", "cloglog". poisson: "log", "identity", "sqrt". gamma: "inverse", "identity", "log". tweedie: power link function specified through "linkPower" (the default link power in the tweedie family is 1 - variancePower).
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.html
Timestamp: 2020-10-19T01:55:53.908Z
(generalized-linear-regression params)
Fit a Generalized Linear Model (see Generalized linear model (Wikipedia)) specified by giving a symbolic description of the linear predictor (link function) and a description of the error distribution (family). It supports "gaussian", "binomial", "poisson", "gamma" and "tweedie" as family. Valid link functions for each family are listed in the Source link below; the first link function of each family is the default one.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.html
Timestamp: 2020-10-19T01:55:53.908Z
(get-features-col model)
Params:
Result: String
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/LinearRegressionModel.html
Timestamp: 2020-10-19T01:56:36.314Z
(get-input-col model)
Params:
Result: String
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/MinHashLSHModel.html
Timestamp: 2020-10-19T01:56:46.823Z
(get-input-cols model)
Params:
Result: Array[String]
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/OneHotEncoderModel.html
Timestamp: 2020-10-19T01:56:28.991Z
(get-label-col model)
Params:
Result: String
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/LinearRegressionModel.html
Timestamp: 2020-10-19T01:56:36.316Z
(get-num-trees model)
Params:
Result: Int
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/RandomForestClassificationModel.html
Timestamp: 2020-10-19T01:56:37.621Z
(get-output-col model)
Params:
Result: String
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/MinHashLSHModel.html
Timestamp: 2020-10-19T01:56:46.826Z
(get-output-cols model)
Params:
Result: Array[String]
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/OneHotEncoderModel.html
Timestamp: 2020-10-19T01:56:28.994Z
(get-prediction-col model)
Params:
Result: String
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/LinearRegressionModel.html
Timestamp: 2020-10-19T01:56:36.320Z
(get-probability-col model)
Params:
Result: String
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/RandomForestClassificationModel.html
Timestamp: 2020-10-19T01:56:37.625Z
(get-raw-prediction-col model)
Params:
Result: String
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/RandomForestClassificationModel.html
Timestamp: 2020-10-19T01:56:37.626Z
(get-size model)
Params:
Result: Int
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/VectorSizeHint.html
Timestamp: 2020-10-19T01:56:32.378Z
(get-thresholds model)
Params:
Result: Array[Double]
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/RandomForestClassificationModel.html
Timestamp: 2020-10-19T01:56:37.629Z
(glm params)
Fit a Generalized Linear Model (see Generalized linear model (Wikipedia)) specified by giving a symbolic description of the linear predictor (link function) and a description of the error distribution (family). It supports "gaussian", "binomial", "poisson", "gamma" and "tweedie" as family. Valid link functions for each family are listed in the Source link below; the first link function of each family is the default one.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.html
Timestamp: 2020-10-19T01:55:53.908Z
(gmm params)
Gaussian Mixture clustering.
This class performs expectation maximization for multivariate Gaussian Mixture Models (GMMs). A GMM represents a composite distribution of independent Gaussian distributions with associated "mixing" weights specifying each's contribution to the composite.
Given a set of sample points, this class will maximize the log-likelihood for a mixture of k Gaussians, iterating until the log-likelihood changes by less than convergenceTol, or until it has reached the max number of iterations. While this process is generally guaranteed to converge, it is not guaranteed to find a global optimum.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/GaussianMixture.html
Timestamp: 2020-10-19T01:56:03.645Z
(hashing-tf params)
Maps a sequence of terms to their term frequencies using the hashing trick. Currently we use Austin Appleby's MurmurHash 3 algorithm (MurmurHash3_x86_32) to calculate the hash code value for the term object. Since a simple modulo is used to transform the hash function to a column index, it is advisable to use a power of two as the numFeatures parameter; otherwise the features will not be mapped evenly to the columns.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/HashingTF.html
Timestamp: 2020-10-19T01:56:08.308Z
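For illustration, a minimal Scala sketch against the HashingTF class documented at the Source link above; it assumes an existing SparkSession named spark (as in spark-shell), and the column names and data are illustrative. Note the power-of-two numFeatures, as recommended above.
  import org.apache.spark.ml.feature.HashingTF
  import spark.implicits._

  // Each row holds a pre-tokenised sequence of terms.
  val sentences = Seq(
    Seq("spark", "ml", "hashing", "trick"),
    Seq("spark", "sql", "dataframe")
  ).toDF("words")

  // numFeatures is a power of two so the modulo maps terms evenly onto columns.
  val hashingTF = new HashingTF()
    .setInputCol("words")
    .setOutputCol("rawFeatures")
    .setNumFeatures(1 << 10)

  hashingTF.transform(sentences).select("rawFeatures").show(truncate = false)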
(idf params)
Compute the Inverse Document Frequency (IDF) given a collection of documents.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/IDF.html
Timestamp: 2020-10-19T01:56:08.857Z
(idf-vector model)
Params:
Result: Vector
Returns the IDF vector.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/IDFModel.html
Timestamp: 2020-10-19T01:56:34.931Z
(imputer params)
Imputation estimator for completing missing values, either using the mean or the median of the columns in which the missing values are located. The input columns should be of numeric type. Currently Imputer does not support categorical features (SPARK-15041) and possibly creates incorrect values for a categorical feature.
Note that when an input column is of integer type, the imputed value is cast (truncated) to an integer type. For example, if the input column is IntegerType (1, 2, 4, null), the output will be IntegerType (1, 2, 4, 2) after mean imputation.
Note that the mean/median value is computed after filtering out missing values. All Null values in the input columns are treated as missing, and so are also imputed. For computing median, DataFrameStatFunctions.approxQuantile is used with a relative error of 0.001.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Imputer.html
Timestamp: 2020-10-19T01:56:09.241Z
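A minimal Scala sketch of the Imputer estimator documented at the Source link above, assuming an existing SparkSession named spark; the column names and data are illustrative. It shows median imputation with separate output columns.
  import org.apache.spark.ml.feature.Imputer
  import spark.implicits._

  // Missing values are encoded as null here; NaN is also treated as missing by default.
  val df = Seq(
    (Some(1.0), Some(10.0)),
    (Some(2.0), None),
    (None, Some(30.0)),
    (Some(4.0), Some(40.0))
  ).toDF("a", "b")

  val imputer = new Imputer()
    .setInputCols(Array("a", "b"))
    .setOutputCols(Array("a_imputed", "b_imputed"))
    .setStrategy("median")   // or "mean" (the default)

  imputer.fit(df).transform(df).show()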
(index-to-string params)
A Transformer that maps a column of indices back to a new column of corresponding string values. The index-string mapping is either from the ML attributes of the input column, or from user-supplied labels (which take precedence over ML attributes).
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/IndexToString.html
Timestamp: 2020-10-19T01:56:09.599Z
(input-col model)
Params:
Result: String
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/MinHashLSHModel.html
Timestamp: 2020-10-19T01:56:46.823Z
(input-cols model)
Params:
Result: Array[String]
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/OneHotEncoderModel.html
Timestamp: 2020-10-19T01:56:28.991Z
(interaction params)
Implements the feature interaction transform. This transformer takes in Double and Vector type columns and outputs a flattened vector of their feature interactions. To handle interaction, we first one-hot encode any nominal features. Then, a vector of the feature cross-products is produced.
For example, given the input feature values Double(2) and Vector(3, 4), the output would be Vector(6, 8) if all input features were numeric. If the first feature was instead nominal with four categories, the output would then be Vector(0, 0, 0, 0, 3, 4, 0, 0).
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Interaction.html
Timestamp: 2020-10-19T01:56:09.965Z
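A minimal Scala sketch of the Interaction transformer documented at the Source link above, reproducing the numeric example in the text; it assumes an existing SparkSession named spark, and the column names are illustrative.
  import org.apache.spark.ml.feature.Interaction
  import org.apache.spark.ml.linalg.Vectors
  import spark.implicits._

  val df = Seq(
    (2.0, Vectors.dense(3.0, 4.0)),
    (5.0, Vectors.dense(1.0, 2.0))
  ).toDF("x", "vec")

  // Flattened cross-products of "x" and "vec": e.g. (2.0, [3, 4]) -> [6.0, 8.0].
  val interaction = new Interaction()
    .setInputCols(Array("x", "vec"))
    .setOutputCol("interacted")

  interaction.transform(df).show(truncate = false)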
(intercept model)
Params:
Result: Double
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/LinearRegressionModel.html
Timestamp: 2020-10-19T01:56:36.333Z
(intercept-vector model)
Params:
Result: Vector
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/LogisticRegressionModel.html
Timestamp: 2020-10-19T01:56:46.167Z
(is-distributed model)
Params:
Result: Boolean
Indicates whether this instance is of type DistributedLDAModel
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/LDAModel.html
Timestamp: 2020-10-19T01:56:42.877Z
(isotonic-regression params)
Isotonic regression.
Currently implemented using the parallelized pool adjacent violators algorithm. Only the univariate (single feature) algorithm is supported.
Uses org.apache.spark.mllib.regression.IsotonicRegression.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/IsotonicRegression.html
Timestamp: 2020-10-19T01:55:54.264Z
(item-factors model)
Params:
Result: DataFrame
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/recommendation/ALSModel.html
Timestamp: 2020-10-19T01:56:42.288Z
(k-means params)
K-means clustering with support for k-means|| initialization proposed by Bahmani et al.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/KMeans.html
Timestamp: 2020-10-19T01:56:04.224Z
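A minimal Scala sketch of the KMeans estimator documented at the Source link above, assuming an existing SparkSession named spark; the data is illustrative.
  import org.apache.spark.ml.clustering.KMeans
  import org.apache.spark.ml.linalg.Vectors
  import spark.implicits._

  val df = Seq(
    Vectors.dense(0.0, 0.0), Vectors.dense(1.0, 1.0),
    Vectors.dense(9.0, 8.0), Vectors.dense(8.0, 9.0)
  ).map(Tuple1.apply).toDF("features")

  val kmeans = new KMeans().setK(2).setSeed(1L)
  val model  = kmeans.fit(df)

  model.clusterCenters.foreach(println)   // the two fitted centroids
  model.transform(df).show()              // adds a "prediction" column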
(kolmogorov-smirnov-test dataframe sample-col dist-name params)
Conduct the two-sided Kolmogorov-Smirnov (KS) test for data sampled from a continuous distribution. By comparing the largest difference between the empirical cumulative distribution of the sample data and the theoretical distribution, we can provide a test for the null hypothesis that the sample data comes from that theoretical distribution. For more information on the KS test, see the Source link below.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/stat/KolmogorovSmirnovTest$.html
Timestamp: 2020-10-19T01:55:50.540Z
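A minimal Scala sketch of the KolmogorovSmirnovTest object documented at the Source link above, assuming an existing SparkSession named spark; the sample data is illustrative. It tests whether a column is drawn from a standard normal distribution.
  import org.apache.spark.ml.stat.KolmogorovSmirnovTest
  import spark.implicits._

  val df = Seq(0.1, -0.4, 0.9, -1.2, 0.5, 1.7, -0.3).toDF("value")

  // distName "norm" takes the mean and standard deviation as extra parameters.
  val result = KolmogorovSmirnovTest.test(df, "value", "norm", 0.0, 1.0).head()

  println(s"p-value = ${result.getDouble(0)}, statistic = ${result.getDouble(1)}")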
(label-col model)
Params:
Result: String
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/LinearRegressionModel.html
Timestamp: 2020-10-19T01:56:36.316Z
(labels model)
Params:
Result: Array[String]
(Deprecated since version 3.0.0)
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/StringIndexerModel.html
Timestamp: 2020-10-19T01:56:31.154Z
(latent-dirichlet-allocation params)
Latent Dirichlet Allocation (LDA), a topic model designed for text documents.
Terminology:
Original LDA paper (journal version): Blei, Ng, and Jordan. "Latent Dirichlet Allocation." JMLR, 2003.
Input data (featuresCol): LDA is given a collection of documents as input data, via the featuresCol parameter. Each document is specified as a Vector of length vocabSize, where each entry is the count for the corresponding term (word) in the document. Feature transformers such as org.apache.spark.ml.feature.Tokenizer and org.apache.spark.ml.feature.CountVectorizer can be useful for converting text to word count vectors.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/LDA.html
Timestamp: 2020-10-19T01:56:04.609Z
(lda params)
Latent Dirichlet Allocation (LDA), a topic model designed for text documents.
Terminology:
Original LDA paper (journal version): Blei, Ng, and Jordan. "Latent Dirichlet Allocation." JMLR, 2003.
Input data (featuresCol): LDA is given a collection of documents as input data, via the featuresCol parameter. Each document is specified as a Vector of length vocabSize, where each entry is the count for the corresponding term (word) in the document. Feature transformers such as org.apache.spark.ml.feature.Tokenizer and org.apache.spark.ml.feature.CountVectorizer can be useful for converting text to word count vectors.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/LDA.html
Timestamp: 2020-10-19T01:56:04.609Z
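A minimal Scala sketch of the LDA estimator documented at the Source link above, assuming an existing SparkSession named spark; the token data is illustrative. It uses CountVectorizer to build the term-count vectors expected in featuresCol.
  import org.apache.spark.ml.clustering.LDA
  import org.apache.spark.ml.feature.CountVectorizer
  import spark.implicits._

  val docs = Seq(
    Seq("spark", "mllib", "topic", "model"),
    Seq("latent", "dirichlet", "allocation", "topic"),
    Seq("spark", "dataframe", "sql")
  ).toDF("tokens")

  // Convert token sequences into term-count vectors.
  val cv     = new CountVectorizer().setInputCol("tokens").setOutputCol("features")
  val counts = cv.fit(docs).transform(docs)

  val model = new LDA().setK(2).setMaxIter(10).fit(counts)

  model.describeTopics(3).show(truncate = false)
  println(s"lower bound on log likelihood: ${model.logLikelihood(counts)}")
  println(s"upper bound on log perplexity: ${model.logPerplexity(counts)}")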
(linear-regression params)
Linear regression.
The learning objective is to minimize the specified loss function, with regularization. This supports two kinds of loss: squaredError (a.k.a. squared loss) and huber (a hybrid of squared error for relatively small errors and absolute error for relatively large ones, with the scale parameter estimated from the training data).
This supports multiple types of regularization: none (a.k.a. ordinary least squares), L2 (ridge regression), L1 (Lasso), and L2 + L1 (elastic net).
The squared error and huber objective functions, and the definition of the scale parameter used by the huber loss, are given in the Source link below.
Note: Fitting with huber loss only supports none and L2 regularization.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/LinearRegression.html
Timestamp: 2020-10-19T01:55:54.848Z
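A minimal Scala sketch of the LinearRegression estimator documented at the Source link above, assuming an existing SparkSession named spark; the training data is illustrative. It uses huber loss, which (per the note above) only supports none/L2 regularization, so elasticNetParam stays at its default of 0.0.
  import org.apache.spark.ml.regression.LinearRegression
  import org.apache.spark.ml.linalg.Vectors
  import spark.implicits._

  val train = Seq(
    (1.0, Vectors.dense(0.0, 1.1)),
    (2.0, Vectors.dense(1.0, 1.9)),
    (3.0, Vectors.dense(2.0, 3.2))
  ).toDF("label", "features")

  val lr = new LinearRegression()
    .setLoss("huber")
    .setRegParam(0.1)
    .setMaxIter(50)

  val model = lr.fit(train)
  println(s"coefficients: ${model.coefficients}, intercept: ${model.intercept}")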
(linear-svc params)
Linear SVM Classifier
This binary classifier optimizes the Hinge Loss using the OWLQN optimizer. Only supports L2 regularization currently.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/LinearSVC.html
Timestamp: 2020-10-19T01:55:57.279Z
(log-likelihood dataset model)
Params: (dataset: Dataset[_])
Result: Double
Calculates a lower bound on the log likelihood of the entire corpus.
See Equation (16) in the Online LDA paper (Hoffman et al., 2010).
WARNING: If this model is an instance of DistributedLDAModel (produced when optimizer is set to "em"), this involves collecting a large topicsMatrix to the driver. This implementation may be changed in the future.
dataset: the test corpus to use for calculating the log likelihood.
Returns a variational lower bound on the log likelihood of the entire corpus.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/LDAModel.html
Timestamp: 2020-10-19T01:56:42.959Z
(log-perplexity dataset model)
Params: (dataset: Dataset[_])
Result: Double
Calculate an upper bound on perplexity. (Lower is better.) See Equation (16) in the Online LDA paper (Hoffman et al., 2010).
WARNING: If this model is an instance of DistributedLDAModel (produced when optimizer is set to "em"), this involves collecting a large topicsMatrix to the driver. This implementation may be changed in the future.
dataset: the test corpus to use for calculating perplexity.
Returns a variational upper bound on log perplexity per token.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/LDAModel.html
Timestamp: 2020-10-19T01:56:42.961Z
(logistic-regression params)
Logistic regression. Supports both binomial (binary) and multinomial (softmax) logistic regression.
This class supports fitting the traditional logistic regression model via LBFGS/OWLQN and the bound (box) constrained logistic regression model via LBFGSB.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/LogisticRegression.html
Timestamp: 2020-10-19T01:55:57.830Z
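A minimal Scala sketch of the LogisticRegression estimator documented at the Source link above, assuming an existing SparkSession named spark; the training data is illustrative.
  import org.apache.spark.ml.classification.LogisticRegression
  import org.apache.spark.ml.linalg.Vectors
  import spark.implicits._

  val train = Seq(
    (0.0, Vectors.dense(0.0, 1.0)),
    (1.0, Vectors.dense(2.0, 0.5)),
    (0.0, Vectors.dense(0.3, 1.2)),
    (1.0, Vectors.dense(1.8, 0.2))
  ).toDF("label", "features")

  val lr = new LogisticRegression()
    .setMaxIter(20)
    .setRegParam(0.01)
    .setElasticNetParam(0.5)   // mix of L1 and L2 regularization
    .setFamily("binomial")     // "multinomial" for softmax regression

  val model = lr.fit(train)
  println(s"coefficients: ${model.coefficients}, intercept: ${model.intercept}")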
(max-abs model)
Params:
Result: Vector
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/MaxAbsScalerModel.html
Timestamp: 2020-10-19T01:56:33.682Z
(max-abs-scaler params)
Rescale each feature individually to the range [-1, 1] by dividing by the maximum absolute value of that feature. It does not shift/center the data, and thus does not destroy any sparsity.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/MaxAbsScaler.html
Timestamp: 2020-10-19T01:56:10.658Z
(mean model)
Params:
Result: Vector
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/StandardScalerModel.html
Timestamp: 2020-10-19T01:56:33.051Z
(min-hash-lsh params)
LSH class for Jaccard distance.
The input can be dense or sparse vectors, but it is more efficient if it is sparse. For example, Vectors.sparse(10, Array((2, 1.0), (3, 1.0), (5, 1.0))) means there are 10 elements in the space. This set contains elements 2, 3, and 5. Also, any input vector must have at least 1 non-zero index, and all non-zero values are treated as binary "1" values.
References: Wikipedia on MinHash
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/MinHashLSH.html
Timestamp: 2020-10-19T01:56:11.035Z
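A minimal Scala sketch of the MinHashLSH estimator documented at the Source link above, assuming an existing SparkSession named spark; the sparse binary vectors are illustrative (a non-zero entry means the element is present in the set).
  import org.apache.spark.ml.feature.MinHashLSH
  import org.apache.spark.ml.linalg.Vectors
  import spark.implicits._

  val dfA = Seq(
    (0, Vectors.sparse(10, Array(2, 3, 5), Array(1.0, 1.0, 1.0))),
    (1, Vectors.sparse(10, Array(1, 3, 7), Array(1.0, 1.0, 1.0)))
  ).toDF("id", "features")

  val dfB = Seq(
    (2, Vectors.sparse(10, Array(2, 3, 9), Array(1.0, 1.0, 1.0)))
  ).toDF("id", "features")

  val mh    = new MinHashLSH().setNumHashTables(5).setInputCol("features").setOutputCol("hashes")
  val model = mh.fit(dfA)

  // Approximate similarity join, keeping pairs with Jaccard distance below 0.8.
  model.approxSimilarityJoin(dfA, dfB, 0.8, "JaccardDistance").show(truncate = false)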
(min-max-scaler params)
Rescale each feature individually to a common range [min, max] linearly using column summary statistics, which is also known as min-max normalization or rescaling. The rescaled value for feature E is calculated as \(Rescaled(e_i) = \frac{e_i - E_{min}}{E_{max} - E_{min}} * (max - min) + min\).
For the case \(E_{max} == E_{min}\), \(Rescaled(e_i) = 0.5 * (max + min)\).
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/MinMaxScaler.html
Timestamp: 2020-10-19T01:56:11.407Z
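A minimal Scala sketch of the MinMaxScaler estimator documented at the Source link above, assuming an existing SparkSession named spark; the data is illustrative.
  import org.apache.spark.ml.feature.MinMaxScaler
  import org.apache.spark.ml.linalg.Vectors
  import spark.implicits._

  val df = Seq(
    Vectors.dense(1.0, 0.1),
    Vectors.dense(2.0, 1.1),
    Vectors.dense(3.0, 10.1)
  ).map(Tuple1.apply).toDF("features")

  // Rescale every feature into [0, 1] using the per-column min and max found during fit.
  val scaler = new MinMaxScaler()
    .setInputCol("features")
    .setOutputCol("scaled")
    .setMin(0.0)
    .setMax(1.0)

  scaler.fit(df).transform(df).show(truncate = false)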
(mlp-classifier params)
Classifier trainer based on the Multilayer Perceptron. Each layer has a sigmoid activation function, and the output layer has softmax. The number of inputs has to be equal to the size of the feature vectors. The number of outputs has to be equal to the total number of labels.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.html
Timestamp: 2020-10-19T01:55:58.225Z
(multiclass-classification-evaluator params)
Evaluator for multiclass classification, which expects input columns: prediction, label, weight (optional) and probability (only for logLoss).
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/evaluation/MulticlassClassificationEvaluator.html
Timestamp: 2020-10-19T01:56:01.471Z
(multilabel-classification-evaluator params)
:: Experimental :: Evaluator for multi-label classification, which expects two input columns: prediction and label.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/evaluation/MultilabelClassificationEvaluator.html
Timestamp: 2020-10-19T01:56:01.814Z
(multilayer-perceptron-classifier params)
Classifier trainer based on the Multilayer Perceptron. Each layer has a sigmoid activation function, and the output layer has softmax. The number of inputs has to be equal to the size of the feature vectors. The number of outputs has to be equal to the total number of labels.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.html
Timestamp: 2020-10-19T01:55:58.225Z
(n-gram params)
A feature transformer that converts the input array of strings into an array of n-grams. Null values in the input array are ignored. It returns an array of n-grams where each n-gram is represented by a space-separated string of words.
When the input is empty, an empty array is returned. When the input array length is less than n (number of elements per n-gram), no n-grams are returned.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/NGram.html
Timestamp: 2020-10-19T01:56:11.769Z
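A minimal Scala sketch of the NGram transformer documented at the Source link above, assuming an existing SparkSession named spark; the data is illustrative and includes a row shorter than n to show the empty-result case.
  import org.apache.spark.ml.feature.NGram
  import spark.implicits._

  val df = Seq(
    Seq("to", "be", "or", "not", "to", "be"),
    Seq("hi")                 // shorter than n, so it yields no n-grams
  ).toDF("words")

  val ngram = new NGram().setN(2).setInputCol("words").setOutputCol("bigrams")
  ngram.transform(df).show(truncate = false)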
(naive-bayes params)
Naive Bayes Classifiers. It supports Multinomial NB (see here), which can handle finitely supported discrete data. For example, by converting documents into TF-IDF vectors, it can be used for document classification. By making every vector binary (0/1) data, it can also be used as Bernoulli NB (see here). The input feature values for Multinomial NB and Bernoulli NB must be nonnegative. Since 3.0.0, it supports Complement NB, which is an adaptation of Multinomial NB. Specifically, Complement NB uses statistics from the complement of each class to compute the model's coefficients. The inventors of Complement NB show empirically that the parameter estimates for CNB are more stable than those for Multinomial NB. Like Multinomial NB, the input feature values for Complement NB must be nonnegative. Since 3.0.0, it also supports Gaussian NB (see here), which can handle continuous data.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/NaiveBayes.html
Timestamp: 2020-10-19T01:55:58.596Z
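A minimal Scala sketch of the NaiveBayes estimator documented at the Source link above, assuming an existing SparkSession named spark; the nonnegative feature values (e.g. term counts) are illustrative.
  import org.apache.spark.ml.classification.NaiveBayes
  import org.apache.spark.ml.linalg.Vectors
  import spark.implicits._

  val train = Seq(
    (0.0, Vectors.dense(1.0, 0.0, 0.0)),
    (1.0, Vectors.dense(0.0, 2.0, 1.0)),
    (1.0, Vectors.dense(0.0, 1.0, 3.0))
  ).toDF("label", "features")

  // Other model types: "bernoulli", "complement", "gaussian".
  val nb    = new NaiveBayes().setModelType("multinomial")
  val model = nb.fit(train)
  model.transform(train).select("label", "prediction", "probability").show(truncate = false)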
(normaliser params)
Normalize a vector to have unit norm using the given p-norm.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Normalizer.html
Timestamp: 2020-10-19T01:56:12.133Z
(normalizer params)
Normalize a vector to have unit norm using the given p-norm.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Normalizer.html
Timestamp: 2020-10-19T01:56:12.133Z
(num-classes model)
Params:
Result: Int
Number of classes (values which the label can take).
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/RandomForestClassificationModel.html
Timestamp: 2020-10-19T01:56:37.671Z
(num-features model)
Params:
Result: Int
Returns the number of features the model was trained on. If unknown, returns -1.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/LinearRegressionModel.html
Timestamp: 2020-10-19T01:56:36.360Z
(num-nodes model)
Params:
Result: Int
Number of nodes in tree, including leaf nodes.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/DecisionTreeClassificationModel.html
Timestamp: 2020-10-19T01:56:41.668Z
(one-hot-encoder params)
A one-hot encoder that maps a column of category indices to a column of binary vectors, with at most a single one-value per row that indicates the input category index. For example with 5 categories, an input value of 2.0 would map to an output vector of [0.0, 0.0, 1.0, 0.0]. The last category is not included by default (configurable via dropLast), because it makes the vector entries sum up to one, and hence linearly dependent. So an input value of 4.0 maps to [0.0, 0.0, 0.0, 0.0].
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/OneHotEncoder.html
Timestamp: 2020-10-19T01:56:12.690Z
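A minimal Scala sketch of the OneHotEncoder estimator documented at the Source link above, assuming an existing SparkSession named spark; the category indices are illustrative. In Spark 3.x the encoder is an Estimator, so a fit step determines the category sizes.
  import org.apache.spark.ml.feature.OneHotEncoder
  import spark.implicits._

  val df = Seq(0.0, 1.0, 2.0, 4.0, 1.0).toDF("categoryIndex")

  val encoder = new OneHotEncoder()
    .setInputCols(Array("categoryIndex"))
    .setOutputCols(Array("categoryVec"))
    // .setDropLast(false)   // keep all categories instead of dropping the last one

  encoder.fit(df).transform(df).show(truncate = false)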
(one-vs-rest params)
Reduction of Multiclass Classification to Binary Classification. Performs the reduction using the one-against-all strategy. For a multiclass classification with k classes, train k models (one per class). Each example is scored against all k models and the model with the highest score is picked to label the example.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/OneVsRest.html
Timestamp: 2020-10-19T01:55:58.960Z
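A minimal Scala sketch of the OneVsRest estimator documented at the Source link above, assuming an existing SparkSession named spark; the three-class training data is illustrative.
  import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}
  import org.apache.spark.ml.linalg.Vectors
  import spark.implicits._

  val train = Seq(
    (0.0, Vectors.dense(0.0, 1.0)),
    (1.0, Vectors.dense(1.0, 0.0)),
    (2.0, Vectors.dense(2.0, 2.0))
  ).toDF("label", "features")

  // One binary LogisticRegression model is trained per class.
  val ovr   = new OneVsRest().setClassifier(new LogisticRegression().setMaxIter(10))
  val model = ovr.fit(train)
  model.transform(train).select("label", "prediction").show()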
(original-max model)
Params:
Result: Vector
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/MinMaxScalerModel.html
Timestamp: 2020-10-19T01:56:28.393Z
(original-min model)
Params:
Result: Vector
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/MinMaxScalerModel.html
Timestamp: 2020-10-19T01:56:28.394Z
(output-col model)
Params:
Result: String
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/MinHashLSHModel.html
Timestamp: 2020-10-19T01:56:46.826Z
(output-cols model)
Params:
Result: Array[String]
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/OneHotEncoderModel.html
Timestamp: 2020-10-19T01:56:28.994Z
(param-grid grids)
Builder for a param grid used in grid search-based model selection.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/tuning/ParamGridBuilder.html
Timestamp: 2020-10-19T01:55:49.184Z
(param-grid-builder grids)
Builder for a param grid used in grid search-based model selection.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/tuning/ParamGridBuilder.html
Timestamp: 2020-10-19T01:55:49.184Z
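A minimal Scala sketch of the ParamGridBuilder documented at the Source link above; the chosen estimator and parameter values are illustrative. The resulting grid is typically passed to CrossValidator or TrainValidationSplit via setEstimatorParamMaps.
  import org.apache.spark.ml.classification.LogisticRegression
  import org.apache.spark.ml.tuning.ParamGridBuilder

  val lr = new LogisticRegression()

  // Cartesian product: 3 x 2 = 6 candidate parameter maps for model selection.
  val grid = new ParamGridBuilder()
    .addGrid(lr.regParam, Array(0.01, 0.1, 1.0))
    .addGrid(lr.elasticNetParam, Array(0.0, 0.5))
    .build()

  println(s"${grid.length} candidate parameter maps")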
(params stage)
Params:
Result: Array[Param[_]]
Returns all params sorted by their names. The default implementation uses Java reflection to list all public methods that have no arguments and return Param.
Developers should not use this method in the constructor, because we cannot guarantee that this variable gets initialized before other params.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/fpm/PrefixSpan.html
Timestamp: 2020-10-19T01:56:35.738Z
(pc model)
Params:
Result: DenseMatrix
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/PCAModel.html
Timestamp: 2020-10-19T01:56:29.844Z
(pca params)
PCA trains a model to project vectors to a lower dimensional space of the top k principal components.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/PCA.html
Timestamp: 2020-10-19T01:56:13.048Z
(pi model)
Params:
Result: Vector
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/NaiveBayesModel.html
Timestamp: 2020-10-19T01:56:39.617Z
(pipeline & stages)
A simple pipeline, which acts as an estimator. A Pipeline consists of a sequence of stages, each of which is either an Estimator or a Transformer. When Pipeline.fit is called, the stages are executed in order. If a stage is an Estimator, its Estimator.fit method will be called on the input dataset to fit a model. Then the model, which is a transformer, will be used to transform the dataset as the input to the next stage. If a stage is a Transformer, its Transformer.transform method will be called to produce the dataset for the next stage. The fitted model from a Pipeline is a PipelineModel, which consists of fitted models and transformers, corresponding to the pipeline stages. If there are no stages, the pipeline acts as an identity transformer.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/Pipeline.html
Timestamp: 2020-10-19T01:55:50.903Z
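A minimal Scala sketch of the Pipeline class documented at the Source link above, assuming an existing SparkSession named spark; the text data and column names are illustrative. The stages are Tokenizer and HashingTF (Transformers) followed by LogisticRegression (an Estimator), and fitting yields a PipelineModel.
  import org.apache.spark.ml.Pipeline
  import org.apache.spark.ml.classification.LogisticRegression
  import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
  import spark.implicits._

  val train = Seq(
    (0.0, "spark is great"),
    (1.0, "hadoop map reduce"),
    (0.0, "spark sql dataframe")
  ).toDF("label", "text")

  val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
  val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features").setNumFeatures(1 << 8)
  val lr        = new LogisticRegression().setMaxIter(10)

  val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
  val model    = pipeline.fit(train)          // a PipelineModel
  model.transform(train).select("text", "prediction").show(truncate = false)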
(polynomial-expansion params)
Perform feature expansion in a polynomial space. As described in Polynomial expansion (Wikipedia): "In mathematics, an expansion of a product of sums expresses it as a sum of products by using the fact that multiplication distributes over addition". Take a 2-variable feature vector as an example: (x, y), if we want to expand it with degree 2, then we get (x, x * x, y, x * y, y * y).
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/PolynomialExpansion.html
Timestamp: 2020-10-19T01:56:13.405Z
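A minimal Scala sketch of the PolynomialExpansion transformer documented at the Source link above, reproducing the (x, y) degree-2 example in the text; it assumes an existing SparkSession named spark, and the column names are illustrative.
  import org.apache.spark.ml.feature.PolynomialExpansion
  import org.apache.spark.ml.linalg.Vectors
  import spark.implicits._

  val df = Seq(Vectors.dense(2.0, 3.0)).map(Tuple1.apply).toDF("features")

  // Degree-2 expansion of (x, y): (x, x*x, y, x*y, y*y).
  val expander = new PolynomialExpansion()
    .setInputCol("features")
    .setOutputCol("poly")
    .setDegree(2)

  expander.transform(df).show(truncate = false)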
(power-iteration-clustering params)
Power Iteration Clustering (PIC), a scalable graph clustering algorithm developed by Lin and Cohen. From the abstract: PIC finds a very low-dimensional embedding of a dataset using truncated power iteration on a normalized pair-wise similarity matrix of the data.
This class is not yet an Estimator/Transformer; use the assignClusters method to run the PowerIterationClustering algorithm.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/PowerIterationClustering.html
Timestamp: 2020-10-19T01:56:04.968Z
(prediction-col model)
Params:
Result: String
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/LinearRegressionModel.html
Timestamp: 2020-10-19T01:56:36.320Z
(prefix-span params)
A parallel PrefixSpan algorithm to mine frequent sequential patterns. The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth (see here). This class is not yet an Estimator/Transformer; use the findFrequentSequentialPatterns method to run the PrefixSpan algorithm.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/fpm/PrefixSpan.html
Timestamp: 2020-10-19T01:56:00.046Z
(principal-components model)
Params:
Result: DenseMatrix
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/PCAModel.html
Timestamp: 2020-10-19T01:56:29.844Z
(probability-col model)
Params:
Result: String
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/RandomForestClassificationModel.html
Timestamp: 2020-10-19T01:56:37.625Z
(quantile-discretiser params)
QuantileDiscretizer takes a column with continuous features and outputs a column with binned categorical features. The number of bins can be set using the numBuckets parameter. It is possible that the number of buckets used will be smaller than this value, for example, if there are too few distinct values of the input to create enough distinct quantiles. Since 2.3.0, QuantileDiscretizer can map multiple columns at once by setting the inputCols parameter. If both of the inputCol and inputCols parameters are set, an Exception will be thrown. To specify the number of buckets for each column, the numBucketsArray parameter can be set, or if the number of buckets should be the same across columns, numBuckets can be set as a convenience. Note that in multiple columns case, relative error is applied to all columns.
NaN handling: null and NaN values will be ignored from the column during QuantileDiscretizer fitting. This will produce a Bucketizer model for making predictions. During the transformation, Bucketizer will raise an error when it finds NaN values in the dataset, but the user can also choose to either keep or remove NaN values within the dataset by setting handleInvalid. If the user chooses to keep NaN values, they will be handled specially and placed into their own bucket, for example, if 4 buckets are used, then non-NaN data will be put into buckets[0-3], but NaNs will be counted in a special bucket[4].
Algorithm: The bin ranges are chosen using an approximate algorithm (see the documentation for org.apache.spark.sql.DataFrameStatFunctions.approxQuantile for a detailed description). The precision of the approximation can be controlled with the relativeError parameter. The lower and upper bin bounds will be -Infinity and +Infinity, covering all real values.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/QuantileDiscretizer.html
Timestamp: 2020-10-19T01:56:13.770Z
(quantile-discretizer params)
QuantileDiscretizer takes a column with continuous features and outputs a column with binned categorical features. The number of bins can be set using the numBuckets parameter. It is possible that the number of buckets used will be smaller than this value, for example, if there are too few distinct values of the input to create enough distinct quantiles. Since 2.3.0, QuantileDiscretizer can map multiple columns at once by setting the inputCols parameter. If both of the inputCol and inputCols parameters are set, an Exception will be thrown. To specify the number of buckets for each column, the numBucketsArray parameter can be set, or if the number of buckets should be the same across columns, numBuckets can be set as a convenience. Note that in multiple columns case, relative error is applied to all columns.
NaN handling: null and NaN values will be ignored from the column during QuantileDiscretizer fitting. This will produce a Bucketizer model for making predictions. During the transformation, Bucketizer will raise an error when it finds NaN values in the dataset, but the user can also choose to either keep or remove NaN values within the dataset by setting handleInvalid. If the user chooses to keep NaN values, they will be handled specially and placed into their own bucket, for example, if 4 buckets are used, then non-NaN data will be put into buckets[0-3], but NaNs will be counted in a special bucket[4].
Algorithm: The bin ranges are chosen using an approximate algorithm (see the documentation for org.apache.spark.sql.DataFrameStatFunctions.approxQuantile for a detailed description). The precision of the approximation can be controlled with the relativeError parameter. The lower and upper bin bounds will be -Infinity and +Infinity, covering all real values.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/QuantileDiscretizer.html
Timestamp: 2020-10-19T01:56:13.770Z
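A minimal Scala sketch of the QuantileDiscretizer estimator documented at the Source link above, assuming an existing SparkSession named spark; the data is illustrative and includes a NaN to show the handleInvalid behaviour described above.
  import org.apache.spark.ml.feature.QuantileDiscretizer
  import spark.implicits._

  val df = Seq(0.1, 0.4, 1.2, 1.5, Double.NaN, 3.3, 4.8).toDF("value")

  val discretizer = new QuantileDiscretizer()
    .setInputCol("value")
    .setOutputCol("bucket")
    .setNumBuckets(3)
    .setRelativeError(0.001)
    .setHandleInvalid("keep")   // NaNs go into their own extra bucket instead of raising an error

  // fit produces a Bucketizer model whose splits come from approximate quantiles.
  discretizer.fit(df).transform(df).show()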
(random-forest-classifier params)
Random Forest learning algorithm for classification. It supports both binary and multiclass labels, as well as both continuous and categorical features.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/RandomForestClassifier.html
Timestamp: 2020-10-19T01:55:59.351Z
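For example, a sketch of fitting a random forest classifier, assuming ml aliases the library's ML namespace and train-df / test-df are DataFrames that already carry the default "features" and "label" columns; the hyper-parameters are illustrative only.
(def classifier
  (ml/random-forest-classifier {:num-trees 50 :max-depth 5}))
;; fit on the training set, then add prediction columns to the test set
(def model (ml/fit train-df classifier))
(ml/transform test-df model)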
(random-forest-regressor params)
Random Forest learning algorithm for regression. It supports both continuous and categorical features.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/RandomForestRegressor.html
Timestamp: 2020-10-19T01:55:55.394Z
(ranking-evaluator params)
(Experimental) Evaluator for ranking, which expects two input columns: prediction and label.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/evaluation/RankingEvaluator.html
Timestamp: 2020-10-19T01:56:02.374Z
(raw-prediction-col model)
Params:
Result: String
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/RandomForestClassificationModel.html
Timestamp: 2020-10-19T01:56:37.626Z
(read-stage! model-cls path)
Load a saved PipelineStage.
(recommend-for-all-items model num-users)
Params: (numUsers: Int)
Result: DataFrame
Returns top numUsers users recommended for each item, for all items.
max number of recommendations for each item
a DataFrame of (itemCol: Int, recommendations), where recommendations are stored as an array of (userCol: Int, rating: Float) Rows.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/recommendation/ALSModel.html
Timestamp: 2020-10-19T01:56:42.310Z
(recommend-for-all-users model num-items)
Params: (numItems: Int)
Result: DataFrame
Returns top numItems items recommended for each user, for all users.
max number of recommendations for each user
a DataFrame of (userCol: Int, recommendations), where recommendations are stored as an array of (itemCol: Int, rating: Float) Rows.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/recommendation/ALSModel.html
Timestamp: 2020-10-19T01:56:42.315Z
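A short usage sketch for the two bulk-recommendation calls above, assuming als-model is an ALSModel fitted elsewhere and ml is the assumed namespace alias.
;; top 10 items per user, as (userCol, recommendations) rows
(def user-recs (ml/recommend-for-all-users als-model 10))
;; top 10 users per item, as (itemCol, recommendations) rows
(def item-recs (ml/recommend-for-all-items als-model 10))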
(recommend-for-item-subset model items-df num-users)
Params: (dataset: Dataset[_], numUsers: Int)
Result: DataFrame
Returns top numUsers users recommended for each item id in the input data set. Note that if there are duplicate ids in the input dataset, only one set of recommendations per unique id will be returned.
a Dataset containing a column of item ids. The column name must match itemCol.
max number of recommendations for each item.
a DataFrame of (itemCol: Int, recommendations), where recommendations are stored as an array of (userCol: Int, rating: Float) Rows.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/recommendation/ALSModel.html
Timestamp: 2020-10-19T01:56:42.317Z
(recommend-for-user-subset model users-df num-items)
Params: (dataset: Dataset[_], numItems: Int)
Result: DataFrame
Returns top numItems items recommended for each user id in the input data set. Note that if there are duplicate ids in the input dataset, only one set of recommendations per unique id will be returned.
a Dataset containing a column of user ids. The column name must match userCol.
max number of recommendations for each user.
a DataFrame of (userCol: Int, recommendations), where recommendations are stored as an array of (itemCol: Int, rating: Float) Rows.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/recommendation/ALSModel.html
Timestamp: 2020-10-19T01:56:42.319Z
(recommend-items model num-items)
(recommend-items model users-df num-items)
Params: (numItems: Int)
Result: DataFrame
Returns top numItems items recommended for each user, for all users.
max number of recommendations for each user
a DataFrame of (userCol: Int, recommendations), where recommendations are stored as an array of (itemCol: Int, rating: Float) Rows.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/recommendation/ALSModel.html
Timestamp: 2020-10-19T01:56:42.315Z
(recommend-users model num-users)
(recommend-users model items-df num-users)
Params: (numUsers: Int)
Result: DataFrame
Returns top numUsers users recommended for each item, for all items.
max number of recommendations for each item
a DataFrame of (itemCol: Int, recommendations), where recommendations are stored as an array of (userCol: Int, rating: Float) Rows.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/recommendation/ALSModel.html
Timestamp: 2020-10-19T01:56:42.310Z
(regex-tokeniser params)
A regex based tokenizer that extracts tokens either by using the provided regex pattern to split the text (default) or repeatedly matching the regex (if gaps is false). Optional parameters also allow filtering tokens using a minimal length. It returns an array of strings that can be empty.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/RegexTokenizer.html
Timestamp: 2020-10-19T01:56:14.327Z
(regex-tokenizer params)
A regex based tokenizer that extracts tokens either by using the provided regex pattern to split the text (default) or repeatedly matching the regex (if gaps is false). Optional parameters also allow filtering tokens using a minimal length. It returns an array of strings that can be empty.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/RegexTokenizer.html
Timestamp: 2020-10-19T01:56:14.327Z
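For example, splitting text on runs of non-word characters; the column names and pattern are assumptions for illustration, and ml aliases the library's ML namespace.
(def tokeniser
  (ml/regex-tokenizer {:input-col "sentence"
                       :output-col "words"
                       :pattern "\\W+"
                       :min-token-length 2}))
;; "words" becomes an array<string> column
(ml/transform dataset tokeniser)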
(regression-evaluator params)
Evaluator for regression, which expects input columns prediction, label and an optional weight column.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/evaluation/RegressionEvaluator.html
Timestamp: 2020-10-19T01:56:02.721Z
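A sketch of scoring a fitted model's output, assuming predictions is a DataFrame produced by a regressor's transform and that the library exposes an evaluate helper of the shape (evaluate dataframe evaluator).
(def evaluator
  (ml/regression-evaluator {:label-col "label"
                            :prediction-col "prediction"
                            :metric-name "rmse"}))
;; returns the root-mean-square error as a double
(ml/evaluate predictions evaluator)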
(robust-scaler params)
Scale features using statistics that are robust to outliers. RobustScaler removes the median and scales the data according to the quantile range. The quantile range is by default IQR (Interquartile Range, quantile range between the 1st quartile = 25th quantile and the 3rd quartile = 75th quantile) but can be configured. Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Median and quantile range are then stored to be used on later data using the transform method. Standardization of a dataset is a common requirement for many machine learning estimators. Typically this is done by removing the mean and scaling to unit variance. However, outliers can often influence the sample mean / variance in a negative way. In such cases, the median and the quantile range often give better results. Note that NaN values are ignored in the computation of medians and ranges.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/RobustScaler.html
Timestamp: 2020-10-19T01:56:15.260Z
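A minimal sketch, assuming dataset already has a vector-valued "features" column and ml aliases the library's ML namespace; enabling centering is an illustrative choice, not the Spark default.
(def scaler
  (ml/robust-scaler {:input-col "features"
                     :output-col "scaled-features"
                     :with-centering true
                     :with-scaling true}))
;; learns the per-feature median and quantile range on dataset
(def scaler-model (ml/fit dataset scaler))
(ml/transform dataset scaler-model)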
(root-node model)
Params:
Result: Node
Root of the decision tree
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/DecisionTreeClassificationModel.html
Timestamp: 2020-10-19T01:56:41.689Z
(scale model)
Params:
Result: Double
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/LinearRegressionModel.html
Timestamp: 2020-10-19T01:56:36.368Z
(sql-transformer params)
Implements transformations defined by a SQL statement. Currently only SQL syntax like 'SELECT ... FROM __THIS__ ...' is supported, where '__THIS__' represents the underlying table of the input dataset. The select clause specifies the fields, constants, and expressions to display in the output; it can be any select clause that Spark SQL supports. Users can also use Spark SQL built-in functions and UDFs to operate on these selected columns. For example, SQLTransformer supports statements like the one sketched below.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/SQLTransformer.html
Timestamp: 2020-10-19T01:56:15.611Z
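A sketch of the kind of statement it accepts; the columns v1 and v2 are assumed to exist in dataset and are purely illustrative, and ml is the assumed namespace alias.
(def sql-trans
  (ml/sql-transformer
    {:statement "SELECT *, (v1 + v2) AS v3, (v1 * v2) AS v4 FROM __THIS__"}))
(ml/transform dataset sql-trans)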
(stages model)
Params:
Result: Array[Transformer]
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/PipelineModel.html
Timestamp: 2020-10-19T01:56:38.367Z
(standard-scaler params)
Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set.
The "unit std" is computed using the
corrected sample standard deviation, which is computed as the square root of the unbiased sample variance.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/StandardScaler.html
Timestamp: 2020-10-19T01:56:16.163Z
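For instance, a sketch that standardizes an existing vector column; setting with-mean true produces dense output, so it is shown here only as an illustrative option. Aliases and column names are assumptions.
(def scaler
  (ml/standard-scaler {:input-col "features"
                       :output-col "scaled-features"
                       :with-mean true
                       :with-std true}))
;; column-wise statistics are computed from dataset during fitting
(def scaler-model (ml/fit dataset scaler))
(ml/transform dataset scaler-model)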
(std model)
Params:
Result: Vector
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/StandardScalerModel.html
Timestamp: 2020-10-19T01:56:33.073Z
(stop-words-remover params)
A feature transformer that filters out stop words from input.
Since 3.0.0, StopWordsRemover can filter out multiple columns at once by setting the inputCols parameter. Note that when both the inputCol and inputCols parameters are set, an Exception will be thrown.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/StopWordsRemover.html
Timestamp: 2020-10-19T01:56:16.540Z
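A short sketch, assuming tokenised-df has an array<string> column named "words" (for example, the output of one of the tokenizers above) and ml is the assumed namespace alias.
(def remover
  (ml/stop-words-remover {:input-col "words" :output-col "filtered-words"}))
(ml/transform tokenised-df remover)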
(string-indexer params)
A label indexer that maps string column(s) of labels to ML column(s) of label indices. If the input columns are numeric, we cast them to string and index the string values. The indices are in [0, numLabels). By default, this is ordered by label frequencies so the most frequent label gets index 0. The ordering behavior is controlled by setting stringOrderType.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/StringIndexer.html
Timestamp: 2020-10-19T01:56:16.905Z
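For example, indexing a string column; "frequencyDesc" is the default ordering and is spelled out here only for clarity, while the column names and the ml alias are assumptions.
(def indexer
  (ml/string-indexer {:input-col "category"
                      :output-col "category-index"
                      :string-order-type "frequencyDesc"}))
(def indexer-model (ml/fit dataset indexer))
(ml/transform dataset indexer-model)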
(summary model)
Params:
Result: LinearRegressionTrainingSummary
Gets the summary (e.g. residuals, MSE, R-squared) of the model on the training set. An exception is thrown if hasSummary is false.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/LinearRegressionModel.html
Timestamp: 2020-10-19T01:56:36.383Z
(supported-optimisers model)
Params:
Result: Array[String]
Supported values for Param optimizer.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/LDAModel.html
Timestamp: 2020-10-19T01:56:42.991Z
(supported-optimizers model)
Params:
Result: Array[String]
Supported values for Param optimizer.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/LDAModel.html
Timestamp: 2020-10-19T01:56:42.991Z
(surrogate-df model)
Params:
Result: DataFrame
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/ImputerModel.html
Timestamp: 2020-10-19T01:56:30.491Z
(theta model)
Params:
Result: Matrix
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/NaiveBayesModel.html
Timestamp: 2020-10-19T01:56:39.648Z
(thresholds model)
Params:
Result: Array[Double]
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/RandomForestClassificationModel.html
Timestamp: 2020-10-19T01:56:37.629Z
(tokeniser params)
A tokenizer that converts the input string to lowercase and then splits it by white spaces.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Tokenizer.html
Timestamp: 2020-10-19T01:56:17.265Z
(tokenizer params)
A tokenizer that converts the input string to lowercase and then splits it by white spaces.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Tokenizer.html
Timestamp: 2020-10-19T01:56:17.265Z
(total-num-nodes model)
Params:
Result: Int
Total number of nodes, summed over all trees in the ensemble.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/RandomForestClassificationModel.html
Timestamp: 2020-10-19T01:56:37.716Z
(train-validation-split {:keys [estimator evaluator estimator-param-maps seed
parallelism]})
Validation for hyper-parameter tuning. Randomly splits the input dataset into train and validation sets, and uses evaluation metric on the validation set to select the best model. Similar to CrossValidator, but only splits the set once.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/tuning/TrainValidationSplit.html
Timestamp: 2020-10-19T01:55:49.563Z
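A sketch wiring the keys listed in the signature above; linear-regression is used only as a placeholder estimator, param-maps stands in for an estimator-param-maps value built elsewhere, and ml is the assumed namespace alias.
(def tvs
  (ml/train-validation-split
    {:estimator            (ml/linear-regression {})
     :evaluator            (ml/regression-evaluator {:metric-name "rmse"})
     :estimator-param-maps param-maps
     :parallelism          2
     :seed                 42}))
;; fitting evaluates each param map on the validation split and returns a model wrapping the best one
(def best-model (ml/fit train-df tvs))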
(transform dataframe transformer)
Params: (dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*)
Result: DataFrame
Transforms the dataset with optional parameters
input dataset
the first param pair, overwrite embedded params
other param pairs, overwrite embedded params
transformed dataset
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/LinearRegressionModel.html
Timestamp: 2020-10-19T01:56:36.391Z
(tree-weights model)
Params:
Result: Array[Double]
Weights for each tree, zippable with trees
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/RandomForestClassificationModel.html
Timestamp: 2020-10-19T01:56:37.737Z
(trees model)
Params:
Result: Array[DecisionTreeClassificationModel]
Trees in this ensemble. Warning: These have null parent Estimators.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/RandomForestClassificationModel.html
Timestamp: 2020-10-19T01:56:37.739Z
(uid model)
Params:
Result: String
An immutable unique ID for the object and its derivatives.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/fpm/PrefixSpan.html
Timestamp: 2020-10-19T01:56:35.754Z
(user-factors model)
Params:
Result: DataFrame
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/recommendation/ALSModel.html
Timestamp: 2020-10-19T01:56:42.347Z
(vector->array expr)
(vector->array expr dtype)
Params: (v: Column, dtype: String = "float64")
Result: Column
Converts a column of MLlib sparse/dense vectors into a column of dense arrays.
an array<float> if dtype is float32, or array<double> if dtype is float64
3.0.0
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/functions$.html
Timestamp: 2020-10-19T01:56:27.317Z
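For example, assuming vector->array is exposed through the ml alias and the core DataFrame namespace is aliased as g with with-column and col helpers:
;; dense double arrays (the default float64 dtype)
(g/with-column dataset "features-arr" (ml/vector->array (g/col "features")))
;; single-precision output instead
(g/with-column dataset "features-arr" (ml/vector->array (g/col "features") "float32"))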
(vector-assembler params)
A feature transformer that merges multiple columns into a vector column.
This requires one pass over the entire dataset. In case we need to infer column lengths from the data we require an additional call to the 'first' Dataset method, see 'handleInvalid' parameter.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/VectorAssembler.html
Timestamp: 2020-10-19T01:56:17.622Z
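A minimal sketch; the input column names are illustrative, and "skip" simply drops rows containing invalid values, per the handleInvalid parameter mentioned above.
(def assembler
  (ml/vector-assembler {:input-cols ["age" "height" "weight"]
                        :output-col "features"
                        :handle-invalid "skip"}))
(ml/transform dataset assembler)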
(vector-indexer params)
Class for indexing categorical feature columns in a dataset of Vector.
It has two usage modes: automatically identifying categorical features, where features with at most maxCategories distinct values are treated as categorical and the rest as continuous, or indexing all features when every feature is already categorical. In both modes it returns a model which can transform categorical features to use 0-based indices.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/VectorIndexer.html
Timestamp: 2020-10-19T01:56:18.174Z
(vector-size-hint params)
A feature transformer that adds size information to the metadata of a vector column. VectorAssembler needs size information for its input columns and cannot be used on streaming dataframes without this metadata.
Note: VectorSizeHint modifies inputCol to include size metadata and does not have an outputCol.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/VectorSizeHint.html
Timestamp: 2020-10-19T01:56:18.723Z
(vector-to-array expr)
(vector-to-array expr dtype)
Params: (v: Column, dtype: String = "float64")
Result: Column
Converts a column of MLlib sparse/dense vectors into a column of dense arrays.
an array<float> if dtype is float32, or array<double> if dtype is float64
3.0.0
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/functions$.html
Timestamp: 2020-10-19T01:56:27.317Z
(vocab-size model)
Params:
Result: Int
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/LDAModel.html
Timestamp: 2020-10-19T01:56:43.011Z
(vocabulary model)
Params:
Result: Array[String]
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/CountVectorizerModel.html
Timestamp: 2020-10-19T01:56:34.357Z
(weights model)
Params:
Result: Array[Double]
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/GaussianMixtureModel.html
Timestamp: 2020-10-19T01:56:40.312Z
(word-2-vec params)
Word2Vec trains a model of Map(String, Vector), i.e. transforms a word into a code for further natural language processing or machine learning process.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Word2Vec.html
Timestamp: 2020-10-19T01:56:19.459Z
(word2vec params)
Word2Vec trains a model of Map(String, Vector), i.e. transforms a word into a code for further natural language processing or machine learning process.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Word2Vec.html
Timestamp: 2020-10-19T01:56:19.459Z
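For instance, a sketch that learns 64-dimensional word vectors from a tokenized column; the sizes are illustrative, and tokenised-df is assumed to have an array<string> column "words".
(def w2v
  (ml/word2vec {:input-col "words"
                :output-col "word-vector"
                :vector-size 64
                :min-count 2}))
(def w2v-model (ml/fit tokenised-df w2v))
;; each row's word vectors are averaged into a single vector in "word-vector"
(ml/transform tokenised-df w2v-model)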
(write-native-model! model path)
Save the native XGBoost `Booster` to a file.
(write-stage! stage path)
(write-stage! stage path options)
Save a PipelineStage to the specified path.
(xgboost-classifier params)
Gradient boosting classifier based on xgboost.
XGBoost docs: https://xgboost.readthedocs.io/en/latest/
XGBoost4J docs: https://xgboost.readthedocs.io/en/latest/jvm/scaladocs/xgboost4j-spark/ml/dmlc/xgboost4j/scala/spark/XGBoostClassifier.html
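A sketch under the assumption that the wrapper forwards XGBoost4J-Spark parameters in kebab-case; the hyper-parameters and the train-df / test-df DataFrames are illustrative.
(def xgb
  (ml/xgboost-classifier {:max-depth 4
                          :num-round 50
                          :objective "binary:logistic"}))
(def xgb-model (ml/fit train-df xgb))
(ml/transform test-df xgb-model)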
(xgboost-regressor params)
Gradient boosting regressor based on xgboost.
XGBoost docs: https://xgboost.readthedocs.io/en/latest/
XGBoost4J docs: https://xgboost.readthedocs.io/en/latest/jvm/scaladocs/xgboost4j-spark/ml/dmlc/xgboost4j/scala/spark/XGBoostRegressor.html