(aft-survival-regression params)
Fit a parametric survival regression model, the accelerated failure time (AFT) model (see Accelerated failure time model (Wikipedia)), based on the Weibull distribution of the survival time.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.html
Timestamp: 2020-10-19T01:55:51.453Z
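For orientation, a minimal Scala sketch of the underlying Spark estimator that this wrapper exposes, assuming an active SparkSession `spark`; the tiny inline dataset is illustrative only:

  import org.apache.spark.ml.linalg.Vectors
  import org.apache.spark.ml.regression.AFTSurvivalRegression

  // label = observed survival time, censor = 1.0 if the event occurred, 0.0 if censored.
  val training = spark.createDataFrame(Seq(
    (1.218, 1.0, Vectors.dense(1.560, -0.605)),
    (2.949, 0.0, Vectors.dense(0.346, 2.158)),
    (3.627, 0.0, Vectors.dense(1.380, 0.231)),
    (0.273, 1.0, Vectors.dense(0.520, 1.151)),
    (4.199, 0.0, Vectors.dense(0.795, -0.226))
  )).toDF("label", "censor", "features")

  val aft = new AFTSurvivalRegression()
    .setQuantileProbabilities(Array(0.3, 0.6))
    .setQuantilesCol("quantiles")

  val model = aft.fit(training)
  println(s"Coefficients: ${model.coefficients}, intercept: ${model.intercept}, scale: ${model.scale}")
  model.transform(training).show(false)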
(als params)
Alternating Least Squares (ALS) matrix factorization.
ALS attempts to estimate the ratings matrix R as the product of two lower-rank matrices, X and Y, i.e. X * Yt = R. Typically these approximations are called 'factor' matrices. The general approach is iterative. During each iteration, one of the factor matrices is held constant, while the other is solved for using least squares. The newly-solved factor matrix is then held constant while solving for the other factor matrix.
This is a blocked implementation of the ALS factorization algorithm that groups the two sets of factors (referred to as "users" and "products") into blocks and reduces communication by only sending one copy of each user vector to each product block on each iteration, and only for the product blocks that need that user's feature vector. This is achieved by pre-computing some information about the ratings matrix to determine the "out-links" of each user (which blocks of products it will contribute to) and "in-link" information for each product (which of the feature vectors it receives from each user block it will depend on). This allows us to send only an array of feature vectors between each user block and product block, and have the product block find the users' ratings and update the products based on these messages.
For implicit preference data, the algorithm used is based on "Collaborative Filtering for Implicit Feedback Datasets", available at https://doi.org/10.1109/ICDM.2008.22, adapted for the blocked approach used here.
Essentially instead of finding the low-rank approximations to the rating matrix R, this finds the approximations for a preference matrix P where the elements of P are 1 if r is greater than 0 and 0 if r is less than or equal to 0. The ratings then act as 'confidence' values related to strength of indicated user preferences rather than explicit ratings given to items.
Note: the input rating dataset to the ALS implementation should be deterministic. Nondeterministic data can cause failures when fitting the ALS model. For example, an order-sensitive operation like sampling after a repartition makes the dataset output nondeterministic, as in dataset.repartition(2).sample(false, 0.5, 1618). Checkpointing the sampled dataset or adding a sort before sampling can help make the dataset deterministic.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/recommendation/ALS.html
Timestamp: 2020-10-19T01:56:00.419Z
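For orientation, a minimal Scala sketch of the underlying Spark estimator, assuming an active SparkSession `spark` and a `ratings` DataFrame with userId, movieId and rating columns (hypothetical names):

  import org.apache.spark.ml.evaluation.RegressionEvaluator
  import org.apache.spark.ml.recommendation.ALS

  val als = new ALS()
    .setMaxIter(5)
    .setRegParam(0.01)
    .setUserCol("userId")
    .setItemCol("movieId")
    .setRatingCol("rating")
  val model = als.fit(ratings)

  // Drop NaN predictions for users/items unseen during training before evaluating.
  model.setColdStartStrategy("drop")
  val predictions = model.transform(ratings)
  val rmse = new RegressionEvaluator()
    .setMetricName("rmse")
    .setLabelCol("rating")
    .setPredictionCol("prediction")
    .evaluate(predictions)

  // Top-10 item recommendations for every user.
  val userRecs = model.recommendForAllUsers(10)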
(alternating-least-squares params)
Alternating Least Squares (ALS) matrix factorization.
ALS attempts to estimate the ratings matrix R as the product of two lower-rank matrices, X and Y, i.e. X * Yt = R. Typically these approximations are called 'factor' matrices. The general approach is iterative. During each iteration, one of the factor matrices is held constant, while the other is solved for using least squares. The newly-solved factor matrix is then held constant while solving for the other factor matrix.
This is a blocked implementation of the ALS factorization algorithm that groups the two sets of factors (referred to as "users" and "products") into blocks and reduces communication by only sending one copy of each user vector to each product block on each iteration, and only for the product blocks that need that user's feature vector. This is achieved by pre-computing some information about the ratings matrix to determine the "out-links" of each user (which blocks of products it will contribute to) and "in-link" information for each product (which of the feature vectors it receives from each user block it will depend on). This allows us to send only an array of feature vectors between each user block and product block, and have the product block find the users' ratings and update the products based on these messages.
For implicit preference data, the algorithm used is based on "Collaborative Filtering for Implicit Feedback Datasets", available at https://doi.org/10.1109/ICDM.2008.22, adapted for the blocked approach used here.
Essentially instead of finding the low-rank approximations to the rating matrix R, this finds the approximations for a preference matrix P where the elements of P are 1 if r is greater than 0 and 0 if r is less than or equal to 0. The ratings then act as 'confidence' values related to strength of indicated user preferences rather than explicit ratings given to items.
Note: the input rating dataset to the ALS implementation should be deterministic. Nondeterministic data can cause failures when fitting the ALS model. For example, an order-sensitive operation like sampling after a repartition makes the dataset output nondeterministic, as in dataset.repartition(2).sample(false, 0.5, 1618). Checkpointing the sampled dataset or adding a sort before sampling can help make the dataset deterministic.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/recommendation/ALS.html
Timestamp: 2020-10-19T01:56:00.419Z
(approx-nearest-neighbors dataset model key-v n-nearest)
(approx-nearest-neighbors dataset model key-v n-nearest dist-col)
Params: (dataset: Dataset[_], key: Vector, numNearestNeighbors: Int, distCol: String)
Result: Dataset[_]
Given a large dataset and an item, approximately find at most k items which have the closest distance to the item. If the outputCol is missing, the method will transform the data; if the outputCol exists, it will use the outputCol. This allows caching of the transformed data when necessary.
The dataset to search for nearest neighbors of the key.
Feature vector representing the item to search for.
The maximum number of nearest neighbors.
Output column for storing the distance between each result row and the key.
A dataset containing at most k items closest to the key. A column "distCol" is added to show the distance between each row and the key.
This method is experimental and will likely change behavior in the next release.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/MinHashLSHModel.html
Timestamp: 2020-10-19T01:56:46.799Z
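A minimal Scala sketch of approxNearestNeighbors on a fitted MinHashLSHModel, assuming an active SparkSession `spark`; the tiny dataset is illustrative only:

  import org.apache.spark.ml.feature.MinHashLSH
  import org.apache.spark.ml.linalg.Vectors

  val df = spark.createDataFrame(Seq(
    (0, Vectors.sparse(6, Seq((0, 1.0), (1, 1.0), (2, 1.0)))),
    (1, Vectors.sparse(6, Seq((2, 1.0), (3, 1.0), (4, 1.0)))),
    (2, Vectors.sparse(6, Seq((0, 1.0), (2, 1.0), (4, 1.0))))
  )).toDF("id", "features")

  val mh = new MinHashLSH()
    .setNumHashTables(5)
    .setInputCol("features")
    .setOutputCol("hashes")
  val model = mh.fit(df)

  // Approximately find the 2 rows with the smallest Jaccard distance to the key.
  val key = Vectors.sparse(6, Seq((1, 1.0), (3, 1.0)))
  model.approxNearestNeighbors(df, key, 2).show(false)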
(approx-nearest-neighbours dataset model key-v n-nearest)
(approx-nearest-neighbours dataset model key-v n-nearest dist-col)
Params: (dataset: Dataset[_], key: Vector, numNearestNeighbors: Int, distCol: String)
Result: Dataset[_]
Given a large dataset and an item, approximately find at most k items which have the closest distance to the item. If the outputCol is missing, the method will transform the data; if the outputCol exists, it will use the outputCol. This allows caching of the transformed data when necessary.
The dataset to search for nearest neighbors of the key.
Feature vector representing the item to search for.
The maximum number of nearest neighbors.
Output column for storing the distance between each result row and the key.
A dataset containing at most k items closest to the key. A column "distCol" is added to show the distance between each row and the key.
This method is experimental and will likely change behavior in the next release.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/MinHashLSHModel.html
Timestamp: 2020-10-19T01:56:46.799Z
(approx-similarity-join dataset-a dataset-b model threshold)
(approx-similarity-join dataset-a dataset-b model threshold dist-col)
Params: (datasetA: Dataset[_], datasetB: Dataset[_], threshold: Double, distCol: String)
Result: Dataset[_]
Join two datasets to approximately find all pairs of rows whose distance are smaller than the threshold. If the outputCol is missing, the method will transform the data; if the outputCol exists, it will use the outputCol. This allows caching of the transformed data when necessary.
One of the datasets to join.
Another dataset to join.
The threshold for the distance of row pairs.
Output column for storing the distance between each pair of rows.
A joined dataset containing pairs of rows. The original rows are in columns "datasetA" and "datasetB", and a column "distCol" is added to show the distance between each pair.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/MinHashLSHModel.html
Timestamp: 2020-10-19T01:56:46.802Z
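Continuing the MinHashLSH sketch above (same assumed `spark`, `df` and fitted `model`), an approxSimilarityJoin call; the distance column name here is user-chosen:

  import org.apache.spark.sql.functions.col

  val dfB = spark.createDataFrame(Seq(
    (3, Vectors.sparse(6, Seq((1, 1.0), (3, 1.0), (5, 1.0)))),
    (4, Vectors.sparse(6, Seq((2, 1.0), (3, 1.0), (5, 1.0))))
  )).toDF("id", "features")

  // Pairs of rows from df and dfB whose Jaccard distance is below 0.6.
  model.approxSimilarityJoin(df, dfB, 0.6, "JaccardDistance")
    .select(col("datasetA.id").alias("idA"),
            col("datasetB.id").alias("idB"),
            col("JaccardDistance"))
    .show()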
(association-rules model)
Params:
Result: DataFrame
Get association rules fitted using the minConfidence. Returns a dataframe with four fields, "antecedent", "consequent", "confidence" and "lift", where "antecedent" and "consequent" are Array[T], whereas "confidence" and "lift" are Double.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/fpm/FPGrowthModel.html
Timestamp: 2020-10-19T01:56:43.538Z
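A minimal Scala sketch showing where associationRules (and the freqItemsets documented under (freq-itemsets model) below) come from, assuming an active SparkSession `spark`:

  import spark.implicits._
  import org.apache.spark.ml.fpm.FPGrowth

  val dataset = spark.createDataset(Seq("1 2 5", "1 2 3 5", "1 2"))
    .map(t => t.split(" ")).toDF("items")

  val fpgrowth = new FPGrowth()
    .setItemsCol("items")
    .setMinSupport(0.5)
    .setMinConfidence(0.6)
  val model = fpgrowth.fit(dataset)

  model.freqItemsets.show()        // frequent itemsets and their counts
  model.associationRules.show()    // antecedent, consequent, confidence, lift
  model.transform(dataset).show()  // predicted consequents per input row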
(best-model model)
Params:
Result: Model[_]
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/tuning/CrossValidatorModel.html
Timestamp: 2020-10-19T01:56:45.449Z
(binariser params)
Binarize a column of continuous features given a threshold.
Since 3.0.0, Binarizer can map multiple columns at once by setting the inputCols parameter. Note that when both the inputCol and inputCols parameters are set, an Exception will be thrown. The threshold parameter is used for single column usage, and thresholds is for multiple columns.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Binarizer.html
Timestamp: 2020-10-19T01:56:05.331Z
(binarizer params)
Binarize a column of continuous features given a threshold.
Since 3.0.0, Binarizer can map multiple columns at once by setting the inputCols parameter. Note that when both the inputCol and inputCols parameters are set, an Exception will be thrown. The threshold parameter is used for single column usage, and thresholds is for multiple columns.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Binarizer.html
Timestamp: 2020-10-19T01:56:05.331Z
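A minimal Scala sketch of the single-column usage, assuming an active SparkSession `spark`:

  import org.apache.spark.ml.feature.Binarizer

  val df = spark.createDataFrame(Seq((0, 0.1), (1, 0.8), (2, 0.2)))
    .toDF("id", "feature")

  val binarizer = new Binarizer()
    .setInputCol("feature")
    .setOutputCol("binarized_feature")
    .setThreshold(0.5)   // values greater than 0.5 map to 1.0, the rest to 0.0

  binarizer.transform(df).show()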
(binary-classification-evaluator params)
Evaluator for binary classification, which expects input columns rawPrediction, label and an optional weight column. The rawPrediction column can be of type double (binary 0/1 prediction, or probability of label 1) or of type vector (length-2 vector of raw predictions, scores, or label probabilities).
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.html
Timestamp: 2020-10-19T01:56:00.765Z
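A minimal Scala sketch, assuming `predictions` is the output of a fitted binary classifier (so it carries rawPrediction and label columns):

  import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

  val evaluator = new BinaryClassificationEvaluator()
    .setLabelCol("label")
    .setRawPredictionCol("rawPrediction")
    .setMetricName("areaUnderROC")   // or "areaUnderPR"

  val auc = evaluator.evaluate(predictions)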
(binary-summary model)
Params:
Result: BinaryLogisticRegressionTrainingSummary
Gets summary of model on training set. An exception is thrown if hasSummary is false or it is a multiclass model.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/LogisticRegressionModel.html
Timestamp: 2020-10-19T01:56:46.093Z
(bisecting-k-means params)
A bisecting k-means algorithm based on the paper "A comparison of document clustering techniques" by Steinbach, Karypis, and Kumar, with modification to fit Spark. The algorithm starts from a single cluster that contains all points. Iteratively it finds divisible clusters on the bottom level and bisects each of them using k-means, until there are k leaf clusters in total or no leaf clusters are divisible. The bisecting steps of clusters on the same level are grouped together to increase parallelism. If bisecting all divisible clusters on the bottom level would result in more than k leaf clusters, larger clusters get higher priority.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/BisectingKMeans.html
Timestamp: 2020-10-19T01:56:03.281Z
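A minimal Scala sketch, assuming a `dataset` DataFrame with a "features" vector column:

  import org.apache.spark.ml.clustering.BisectingKMeans
  import org.apache.spark.ml.evaluation.ClusteringEvaluator

  val bkm = new BisectingKMeans().setK(2).setSeed(1)
  val model = bkm.fit(dataset)

  val predictions = model.transform(dataset)
  val silhouette = new ClusteringEvaluator().evaluate(predictions)

  // Centers of the k leaf clusters.
  model.clusterCenters.foreach(println)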
(boundaries model)
Params:
Result: Vector
Boundaries in increasing order for which predictions are known.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/IsotonicRegressionModel.html
Timestamp: 2020-10-19T01:56:44.821Z
(bucketed-random-projection-lsh params)
This BucketedRandomProjectionLSH implements Locality Sensitive Hashing functions for Euclidean distance metrics.
The input is dense or sparse vectors, each of which represents a point in the Euclidean distance space. The output will be vectors of configurable dimension. Hash values in the same dimension are calculated by the same hash function.
References:
1. Wikipedia on Stable Distributions
2. Wang, Jingdong et al. "Hashing for similarity search: A survey." arXiv preprint arXiv:1408.2927 (2014).
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/BucketedRandomProjectionLSH.html
Timestamp: 2020-10-19T01:56:05.693Z
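A minimal Scala sketch, assuming an active SparkSession `spark`; the dataset is illustrative only:

  import org.apache.spark.ml.feature.BucketedRandomProjectionLSH
  import org.apache.spark.ml.linalg.Vectors

  val dfA = spark.createDataFrame(Seq(
    (0, Vectors.dense(1.0, 1.0)),
    (1, Vectors.dense(1.0, -1.0)),
    (2, Vectors.dense(-1.0, -1.0)),
    (3, Vectors.dense(-1.0, 1.0))
  )).toDF("id", "features")

  val brp = new BucketedRandomProjectionLSH()
    .setBucketLength(2.0)
    .setNumHashTables(3)
    .setInputCol("features")
    .setOutputCol("hashes")

  val model = brp.fit(dfA)
  // Hashed vectors; the fitted model also supports approxNearestNeighbors and
  // approxSimilarityJoin with Euclidean distance.
  model.transform(dfA).show(false)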
(bucketiser params)
Bucketizer maps a column of continuous features to a column of feature buckets.
Since 2.3.0, Bucketizer can map multiple columns at once by setting the inputCols parameter. Note that when both the inputCol and inputCols parameters are set, an Exception will be thrown. The splits parameter is only used for single column usage, and splitsArray is for multiple columns.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Bucketizer.html
Timestamp: 2020-10-19T01:56:06.060Z
(bucketizer params)
Bucketizer maps a column of continuous features to a column of feature buckets.
Since 2.3.0, Bucketizer can map multiple columns at once by setting the inputCols parameter. Note that when both the inputCol and inputCols parameters are set, an Exception will be thrown. The splits parameter is only used for single column usage, and splitsArray is for multiple columns.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Bucketizer.html
Timestamp: 2020-10-19T01:56:06.060Z
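A minimal Scala sketch of the single-column usage, assuming an active SparkSession `spark`:

  import org.apache.spark.ml.feature.Bucketizer

  val splits = Array(Double.NegativeInfinity, -0.5, 0.0, 0.5, Double.PositiveInfinity)
  val data = Array(-999.9, -0.5, -0.3, 0.0, 0.2, 999.9)
  val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")

  val bucketizer = new Bucketizer()
    .setInputCol("features")
    .setOutputCol("bucketedFeatures")
    .setSplits(splits)   // each value maps to the bucket [splits(i), splits(i+1)) it falls into

  bucketizer.transform(df).show()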
(category-maps model)
Params:
Result: Map[Int, Map[Double, Int]]
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/VectorIndexerModel.html
Timestamp: 2020-10-19T01:56:31.705Z
(category-sizes model)
Params:
Result: Array[Int]
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/OneHotEncoderModel.html
Timestamp: 2020-10-19T01:56:28.967Z
(chi-sq-selector params)
Chi-Squared feature selection, which selects categorical features to use for predicting a categorical label. The selector supports different selection methods: numTopFeatures, percentile, fpr, fdr, fwe.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/ChiSqSelector.html
Timestamp: 2020-10-19T01:56:06.428Z
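A minimal Scala sketch using the numTopFeatures selection method, assuming an active SparkSession `spark`:

  import org.apache.spark.ml.feature.ChiSqSelector
  import org.apache.spark.ml.linalg.Vectors

  val df = spark.createDataFrame(Seq(
    (7, Vectors.dense(0.0, 0.0, 18.0, 1.0), 1.0),
    (8, Vectors.dense(0.0, 1.0, 12.0, 0.0), 0.0),
    (9, Vectors.dense(1.0, 0.0, 15.0, 0.1), 0.0)
  )).toDF("id", "features", "clicked")

  val selector = new ChiSqSelector()
    .setNumTopFeatures(1)
    .setFeaturesCol("features")
    .setLabelCol("clicked")
    .setOutputCol("selectedFeatures")

  selector.fit(df).transform(df).show()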
(chi-square-test dataframe features-col label-col)
Chi-square hypothesis testing for categorical data.
See Wikipedia for more information on the Chi-squared test.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/stat/ChiSquareTest$.html
Timestamp: 2020-10-19T01:55:49.886Z
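A minimal Scala sketch, assuming an active SparkSession `spark`; one test is run per feature against the label:

  import org.apache.spark.ml.linalg.{Vector, Vectors}
  import org.apache.spark.ml.stat.ChiSquareTest

  val df = spark.createDataFrame(Seq(
    (0.0, Vectors.dense(0.5, 10.0)),
    (0.0, Vectors.dense(1.5, 20.0)),
    (1.0, Vectors.dense(1.5, 30.0)),
    (0.0, Vectors.dense(3.5, 30.0)),
    (1.0, Vectors.dense(3.5, 40.0))
  )).toDF("label", "features")

  val chi = ChiSquareTest.test(df, "features", "label").head
  println(s"pValues = ${chi.getAs[Vector](0)}")
  println(s"degreesOfFreedom = ${chi.getSeq[Int](1).mkString("[", ",", "]")}")
  println(s"statistics = ${chi.getAs[Vector](2)}")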
(cluster-centers model)
Params:
Result: Array[Vector]
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/KMeansModel.html
Timestamp: 2020-10-19T01:56:36.922Z
(clustering-evaluator params)
Evaluator for clustering results. The metric computes the Silhouette measure using the specified distance measure.
The Silhouette is a measure for the validation of the consistency within clusters. It ranges between 1 and -1, where a value close to 1 means that the points in a cluster are close to the other points in the same cluster and far from the points of the other clusters.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.html
Timestamp: 2020-10-19T01:56:01.116Z
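A minimal Scala sketch, assuming `predictions` is the output of a fitted clustering model (it must carry the prediction and features columns):

  import org.apache.spark.ml.evaluation.ClusteringEvaluator

  val evaluator = new ClusteringEvaluator()
    .setDistanceMeasure("squaredEuclidean")   // or "cosine"
  val silhouette = evaluator.evaluate(predictions)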
(coefficient-matrix model)
Params:
Result: Matrix
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/LogisticRegressionModel.html
Timestamp: 2020-10-19T01:56:46.098Z
(coefficients model)
Params:
Result: Vector
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/LinearRegressionModel.html
Timestamp: 2020-10-19T01:56:36.282Z
Column: Aggregate function: returns the Pearson Correlation Coefficient for two columns.
Dataset: Calculates the Pearson Correlation Coefficient of two columns of a DataFrame.
(count-vectoriser params)
Extracts a vocabulary from document collections and generates a CountVectorizerModel.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/CountVectorizer.html
Timestamp: 2020-10-19T01:56:06.801Z
(count-vectorizer params)
Extracts a vocabulary from document collections and generates a CountVectorizerModel.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/CountVectorizer.html
Timestamp: 2020-10-19T01:56:06.801Z
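A minimal Scala sketch, assuming an active SparkSession `spark`:

  import org.apache.spark.ml.feature.CountVectorizer

  val df = spark.createDataFrame(Seq(
    (0, Array("a", "b", "c")),
    (1, Array("a", "b", "b", "c", "a"))
  )).toDF("id", "words")

  val cvModel = new CountVectorizer()
    .setInputCol("words")
    .setOutputCol("features")
    .setVocabSize(3)   // keep at most 3 terms in the vocabulary
    .setMinDF(2)       // a term must appear in at least 2 documents
    .fit(df)

  cvModel.transform(df).show(false)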
(cross-validator {:keys [estimator evaluator estimator-param-maps num-folds seed
parallelism]})
K-fold cross validation performs model selection by splitting the dataset into a set of non-overlapping, randomly partitioned folds which are used as separate training and test datasets. For example, with k=3 folds, K-fold cross validation will generate 3 (training, test) dataset pairs, each of which uses 2/3 of the data for training and 1/3 for testing. Each fold is used as the test set exactly once.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/tuning/CrossValidator.html
Timestamp: 2020-10-19T01:55:48.855Z
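A minimal Scala sketch wiring an estimator, an evaluator and a parameter grid into cross validation; `training` (a DataFrame with label/features columns) is assumed:

  import org.apache.spark.ml.classification.LogisticRegression
  import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
  import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

  val lr = new LogisticRegression().setMaxIter(10)

  val paramGrid = new ParamGridBuilder()
    .addGrid(lr.regParam, Array(0.1, 0.01))
    .addGrid(lr.elasticNetParam, Array(0.0, 0.5))
    .build()

  val cv = new CrossValidator()
    .setEstimator(lr)
    .setEvaluator(new BinaryClassificationEvaluator())
    .setEstimatorParamMaps(paramGrid)
    .setNumFolds(3)
    .setParallelism(2)

  val cvModel = cv.fit(training)
  val best = cvModel.bestModel   // the model refit on the best parameter combination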
(dct params)
A feature transformer that takes the 1D discrete cosine transform of a real vector. No zero padding is performed on the input vector. It returns a real vector of the same length representing the DCT. The return vector is scaled such that the transform matrix is unitary (aka scaled DCT-II).
More information on DCT-II in Discrete cosine transform (Wikipedia).
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/DCT.html
Timestamp: 2020-10-19T01:56:07.160Z
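A minimal Scala sketch, assuming an active SparkSession `spark`:

  import org.apache.spark.ml.feature.DCT
  import org.apache.spark.ml.linalg.Vectors

  val data = Seq(
    Vectors.dense(0.0, 1.0, -2.0, 3.0),
    Vectors.dense(-1.0, 2.0, 4.0, -7.0))
  val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")

  val dct = new DCT()
    .setInputCol("features")
    .setOutputCol("featuresDCT")
    .setInverse(false)   // forward (scaled) DCT-II; true applies the inverse transform

  dct.transform(df).select("featuresDCT").show(false)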
(decision-tree-classifier params)
Decision tree learning algorithm (http://en.wikipedia.org/wiki/Decision_tree_learning) for classification. It supports both binary and multiclass labels, as well as both continuous and categorical features.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.html
Timestamp: 2020-10-19T01:55:55.948Z
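A minimal Scala sketch, assuming `training` is a DataFrame with an indexed "label" column and a vector "features" column:

  import org.apache.spark.ml.classification.DecisionTreeClassifier

  val dt = new DecisionTreeClassifier()
    .setLabelCol("label")
    .setFeaturesCol("features")
    .setMaxDepth(5)

  val model = dt.fit(training)
  println(s"Learned classification tree model:\n${model.toDebugString}")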
(decision-tree-regressor params)
Decision tree learning algorithm for regression. It supports both continuous and categorical features.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/DecisionTreeRegressor.html
Timestamp: 2020-10-19T01:55:52.001Z
(depth model)
Params:
Result: Int
Depth of the tree. E.g.: Depth 0 means 1 leaf node. Depth 1 means 1 internal node and 2 leaf nodes.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/DecisionTreeClassificationModel.html
Timestamp: 2020-10-19T01:56:41.586Z
Params: (maxTermsPerTopic: Int)
Result: DataFrame
Return the topics described by their top-weighted terms.
Maximum number of terms to collect for each topic. Default value of 10.
Local DataFrame with one topic per Row, with columns:
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/LDAModel.html
Timestamp: 2020-10-19T01:56:42.892Z
(discrete-cosine-transform params)
A feature transformer that takes the 1D discrete cosine transform of a real vector. No zero padding is performed on the input vector. It returns a real vector of the same length representing the DCT. The return vector is scaled such that the transform matrix is unitary (aka scaled DCT-II).
More information on DCT-II in Discrete cosine transform (Wikipedia).
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/DCT.html
Timestamp: 2020-10-19T01:56:07.160Z
(distributed? model)
Params:
Result: Boolean
Indicates whether this instance is of type DistributedLDAModel
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/LDAModel.html
Timestamp: 2020-10-19T01:56:42.877Z
(elementwise-product params)
Outputs the Hadamard product (i.e., the element-wise product) of each input vector with a provided "weight" vector. In other words, it scales each column of the dataset by a scalar multiplier.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/ElementwiseProduct.html
Timestamp: 2020-10-19T01:56:07.551Z
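A minimal Scala sketch, assuming an active SparkSession `spark`:

  import org.apache.spark.ml.feature.ElementwiseProduct
  import org.apache.spark.ml.linalg.Vectors

  val df = spark.createDataFrame(Seq(
    ("a", Vectors.dense(1.0, 2.0, 3.0)),
    ("b", Vectors.dense(4.0, 5.0, 6.0))
  )).toDF("id", "vector")

  val transformer = new ElementwiseProduct()
    .setScalingVec(Vectors.dense(0.0, 1.0, 2.0))   // the "weight" vector
    .setInputCol("vector")
    .setOutputCol("transformedVector")

  // Each output vector is the Hadamard product of the input vector and the weight vector.
  transformer.transform(df).show()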
(estimated-doc-concentration model)
Params:
Result: Vector
Value for docConcentration estimated from data. If Online LDA was used and optimizeDocConcentration was set to false, then this returns the fixed (given) value for the docConcentration parameter.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/LDAModel.html
Timestamp: 2020-10-19T01:56:42.897Z
Params: Result: Vector Value for docConcentration estimated from data. If Online LDA was used and optimizeDocConcentration was set to false, then this returns the fixed (given) value for the docConcentration parameter. Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/LDAModel.html Timestamp: 2020-10-19T01:56:42.897Z
(evaluate dataframe evaluator)
Params: (dataset: Dataset[_])
Result: LinearRegressionSummary
Evaluates the model on a test dataset.
Test dataset to evaluate model on.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/LinearRegressionModel.html
Timestamp: 2020-10-19T01:56:36.292Z
(feature-hasher params)
Feature hashing projects a set of categorical or numerical features into a feature vector of specified dimension (typically substantially smaller than that of the original feature space). This is done using the hashing trick (https://en.wikipedia.org/wiki/Feature_hashing) to map features to indices in the feature vector.
The FeatureHasher transformer operates on multiple columns. Each column may contain either numeric or categorical features. Behavior and handling of column data types is as follows:
- Numeric columns: For numeric features, the hash value of the column name is used to map the feature value to its index in the feature vector. By default, numeric features are not treated as categorical (even when they are integers). To treat them as categorical, specify the relevant columns in categoricalCols.
- String columns: For categorical features, the hash value of the string "column_name=value" is used to map to the vector index, with an indicator value of 1.0. Thus, categorical features are "one-hot" encoded (similarly to using OneHotEncoder with dropLast=false).
- Boolean columns: Boolean values are treated in the same way as string columns. That is, boolean features are represented as "column_name=true" or "column_name=false", with an indicator value of 1.0.
Null (missing) values are ignored (implicitly zero in the resulting feature vector).
The hash function used here is also the MurmurHash 3 used in HashingTF. Since a simple modulo on the hashed value is used to determine the vector index, it is advisable to use a power of two as the numFeatures parameter; otherwise the features will not be mapped evenly to the vector indices.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/FeatureHasher.html
Timestamp: 2020-10-19T01:56:07.938Z
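A minimal Scala sketch, assuming an active SparkSession `spark`; column names are illustrative:

  import org.apache.spark.ml.feature.FeatureHasher

  val df = spark.createDataFrame(Seq(
    (2.2, true, "1", "foo"),
    (3.3, false, "2", "bar"),
    (4.4, false, "3", "baz"),
    (5.5, false, "4", "foo")
  )).toDF("real", "bool", "stringNum", "string")

  val hasher = new FeatureHasher()
    .setInputCols("real", "bool", "stringNum", "string")
    .setNumFeatures(256)   // a power of two, as advised above
    .setOutputCol("features")

  hasher.transform(df).show(false)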
(feature-importances model)
Params:
Result: Vector
Estimate of the importance of each feature.
Each feature's importance is the average of its importance across all trees in the ensemble. The importance vector is normalized to sum to 1. This method is suggested by Hastie et al. (Hastie, Tibshirani, Friedman. "The Elements of Statistical Learning, 2nd Edition." 2001.) and follows the implementation from scikit-learn.
See also: DecisionTreeClassificationModel.featureImportances
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/RandomForestClassificationModel.html
Timestamp: 2020-10-19T01:56:37.595Z
(features-col model)
Params:
Result: String
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/LinearRegressionModel.html
Timestamp: 2020-10-19T01:56:36.314Z
(find-frequent-sequential-patterns dataset prefix-span)
Params: (dataset: Dataset[_])
Result: DataFrame
Finds the complete set of frequent sequential patterns in the input sequences of itemsets.
A dataset or a dataframe containing a sequence column, which is of ArrayType(ArrayType(T)) type.
A DataFrame that contains columns of sequence and corresponding frequency. The schema of it will be: sequence: ArrayType(ArrayType(T)) (T is the item type) and freq: Long.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/fpm/PrefixSpan.html
Timestamp: 2020-10-19T01:56:35.709Z
(find-patterns dataset prefix-span)
Params: (dataset: Dataset[_])
Result: DataFrame
Finds the complete set of frequent sequential patterns in the input sequences of itemsets.
A dataset or a dataframe containing a sequence column, which is of ArrayType(ArrayType(T)) type.
A DataFrame that contains columns of sequence and corresponding frequency. The schema of it will be: sequence: ArrayType(ArrayType(T)) (T is the item type) and freq: Long.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/fpm/PrefixSpan.html
Timestamp: 2020-10-19T01:56:35.709Z
(fit dataframe estimator)
Params: (dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*)
Result: M
Fits a single model to the input data with optional parameters.
input dataset
the first param pair, overrides embedded params
other param pairs. These values override any specified in this Estimator's embedded ParamMap.
fitted model
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/ProbabilisticClassifier.html
Timestamp: 2020-10-19T01:56:44.210Z
(fm-classifier params)
Factorization Machines learning algorithm for classification. It supports normal gradient descent and AdamW solver.
The implementation is based upon:
S. Rendle. "Factorization machines" 2010.
FM is able to estimate interactions even in problems with huge sparsity (like advertising and recommendation systems). The second-order FM formula is y(x) = w0 + sum_i w_i x_i + sum_{i<j} <v_i, v_j> x_i x_j, where w0 is the global bias, w_i are per-feature weights, and v_i are learned factor vectors whose inner products capture pairwise interactions.
FM classification model uses logistic loss which can be solved by gradient descent method, and regularization terms like L2 are usually added to the loss function to prevent overfitting.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/FMClassifier.html
Timestamp: 2020-10-19T01:55:56.340Z
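A minimal Scala sketch, assuming `training` is a DataFrame with a 0/1 "label" column and a "features" vector column (features ideally scaled, e.g. to [0, 1]):

  import org.apache.spark.ml.classification.FMClassifier

  val fm = new FMClassifier()
    .setLabelCol("label")
    .setFeaturesCol("features")
    .setFactorSize(8)      // dimensionality of the pairwise-interaction factors v_i
    .setStepSize(0.01)

  val model = fm.fit(training)
  println(s"Factors: ${model.factors}")
  println(s"Linear: ${model.linear}, Intercept: ${model.intercept}")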
(fm-regressor params)
Factorization Machines learning algorithm for regression. It supports normal gradient descent and AdamW solver.
The implementation is based upon:
S. Rendle. "Factorization machines" 2010.
FM is able to estimate interactions even in problems with huge sparsity (like advertising and recommendation systems). The second-order FM formula is y(x) = w0 + sum_i w_i x_i + sum_{i<j} <v_i, v_j> x_i x_j, where w0 is the global bias, w_i are per-feature weights, and v_i are learned factor vectors whose inner products capture pairwise interactions.
FM regression model uses MSE loss which can be solved by gradient descent method, and regularization terms like L2 are usually added to the loss function to prevent overfitting.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/FMRegressor.html
Timestamp: 2020-10-19T01:55:52.555Z
(fp-growth params)
A parallel FP-growth algorithm to mine frequent itemsets. The algorithm is described in Li et al., PFP: Parallel FP-Growth for Query Recommendation. PFP distributes computation in such a way that each worker executes an independent group of mining tasks. The FP-Growth algorithm is described in Han et al., Mining frequent patterns without candidate generation. Note null values in the itemsCol column are ignored during fit().
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/fpm/FPGrowth.html
Timestamp: 2020-10-19T01:55:59.709Z
(freq-itemsets model)
Params:
Result: DataFrame
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/fpm/FPGrowthModel.html
Timestamp: 2020-10-19T01:56:43.556Z
(frequent-item-sets model)
Params:
Result: DataFrame
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/fpm/FPGrowthModel.html
Timestamp: 2020-10-19T01:56:43.556Z
(frequent-pattern-growth params)
A parallel FP-growth algorithm to mine frequent itemsets. The algorithm is described in Li et al., PFP: Parallel FP-Growth for Query Recommendation. PFP distributes computation in such a way that each worker executes an independent group of mining tasks. The FP-Growth algorithm is described in Han et al., Mining frequent patterns without candidate generation. Note null values in the itemsCol column are ignored during fit().
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/fpm/FPGrowth.html
Timestamp: 2020-10-19T01:55:59.709Z
(gaussian-mixture params)
Gaussian Mixture clustering.
This class performs expectation maximization for multivariate Gaussian Mixture Models (GMMs). A GMM represents a composite distribution of independent Gaussian distributions with associated "mixing" weights specifying each's contribution to the composite.
Given a set of sample points, this class will maximize the log-likelihood for a mixture of k Gaussians, iterating until the log-likelihood changes by less than convergenceTol, or until it has reached the max number of iterations. While this process is generally guaranteed to converge, it is not guaranteed to find a global optimum.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/GaussianMixture.html
Timestamp: 2020-10-19T01:56:03.645Z
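A minimal Scala sketch, assuming a `dataset` DataFrame with a "features" vector column:

  import org.apache.spark.ml.clustering.GaussianMixture

  val gmm = new GaussianMixture().setK(2).setSeed(1L)
  val model = gmm.fit(dataset)

  // Mixing weight, mean and covariance of each fitted Gaussian component.
  for (i <- 0 until model.getK) {
    println(s"Gaussian $i: weight=${model.weights(i)}\n" +
      s"mu=${model.gaussians(i).mean}\nsigma=\n${model.gaussians(i).cov}\n")
  }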
(gaussians-df model)
Params:
Result: DataFrame
Retrieve Gaussian distributions as a DataFrame. Each row represents a Gaussian Distribution. Two columns are defined: mean and cov. Schema:
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/GaussianMixtureModel.html
Timestamp: 2020-10-19T01:56:40.217Z
(gbt-classifier params)
Gradient-Boosted Trees (GBTs) (http://en.wikipedia.org/wiki/Gradient_boosting) learning algorithm for classification. It supports binary labels, as well as both continuous and categorical features.
The implementation is based upon: J.H. Friedman. "Stochastic Gradient Boosting." 1999.
Notes on Gradient Boosting vs. TreeBoost:
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/GBTClassifier.html
Timestamp: 2020-10-19T01:55:56.899Z
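A minimal Scala sketch, assuming `training` is a DataFrame with a binary "label" column and a vector "features" column:

  import org.apache.spark.ml.classification.GBTClassifier

  val gbt = new GBTClassifier()
    .setLabelCol("label")
    .setFeaturesCol("features")
    .setMaxIter(10)                     // number of boosting iterations (trees)
    .setFeatureSubsetStrategy("auto")

  val model = gbt.fit(training)
  println(s"Learned GBT classification model:\n${model.toDebugString}")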
(gbt-regressor params)
Gradient-Boosted Trees (GBTs) learning algorithm for regression. It supports both continuous and categorical features.
The implementation is based upon: J.H. Friedman. "Stochastic Gradient Boosting." 1999.
Notes on Gradient Boosting vs. TreeBoost:
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/GBTRegressor.html
Timestamp: 2020-10-19T01:55:53.108Z
(generalised-linear-regression params)
Fit a Generalized Linear Model (see Generalized linear model (Wikipedia)) specified by giving a symbolic description of the linear predictor (link function) and a description of the error distribution (family). It supports "gaussian", "binomial", "poisson", "gamma" and "tweedie" as family. Valid link functions for each family are listed below; the first link function of each family is the default one. gaussian: "identity", "log", "inverse". binomial: "logit", "probit", "cloglog". poisson: "log", "identity", "sqrt". gamma: "inverse", "identity", "log". tweedie: power link function specified through "linkPower" (the default link power in the tweedie family is 1 - variancePower).
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.html
Timestamp: 2020-10-19T01:55:53.908Z
(generalized-linear-regression params)
Fit a Generalized Linear Model (see Generalized linear model (Wikipedia)) specified by giving a symbolic description of the linear predictor (link function) and a description of the error distribution (family). It supports "gaussian", "binomial", "poisson", "gamma" and "tweedie" as family. Valid link functions for each family are listed in the Source link below; the first link function of each family is the default one.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.html
Timestamp: 2020-10-19T01:55:53.908Z
(get-features-col model)
Params:
Result: String
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/LinearRegressionModel.html
Timestamp: 2020-10-19T01:56:36.314Z
(get-input-col model)
Params:
Result: String
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/MinHashLSHModel.html
Timestamp: 2020-10-19T01:56:46.823Z
(get-input-cols model)
Params:
Result: Array[String]
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/OneHotEncoderModel.html
Timestamp: 2020-10-19T01:56:28.991Z
(get-label-col model)
Params:
Result: String
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/LinearRegressionModel.html
Timestamp: 2020-10-19T01:56:36.316Z
(get-num-trees model)
Params:
Result: Int
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/RandomForestClassificationModel.html
Timestamp: 2020-10-19T01:56:37.621Z
(get-output-col model)
Params:
Result: String
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/MinHashLSHModel.html
Timestamp: 2020-10-19T01:56:46.826Z
(get-output-cols model)
Params:
Result: Array[String]
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/OneHotEncoderModel.html
Timestamp: 2020-10-19T01:56:28.994Z
(get-prediction-col model)
Params:
Result: String
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/LinearRegressionModel.html
Timestamp: 2020-10-19T01:56:36.320Z
(get-probability-col model)
Params:
Result: String
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/RandomForestClassificationModel.html
Timestamp: 2020-10-19T01:56:37.625Z
(get-raw-prediction-col model)
Params:
Result: String
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/RandomForestClassificationModel.html
Timestamp: 2020-10-19T01:56:37.626Z
(get-size model)
Params:
Result: Int
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/VectorSizeHint.html
Timestamp: 2020-10-19T01:56:32.378Z
(get-thresholds model)
Params:
Result: Array[Double]
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/RandomForestClassificationModel.html
Timestamp: 2020-10-19T01:56:37.629Z
(glm params)
Fit a Generalized Linear Model (see Generalized linear model (Wikipedia)) specified by giving a symbolic description of the linear predictor (link function) and a description of the error distribution (family). It supports "gaussian", "binomial", "poisson", "gamma" and "tweedie" as family. Valid link functions for each family are listed in the Source link below; the first link function of each family is the default one.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.html
Timestamp: 2020-10-19T01:55:53.908Z
(gmm params)
Gaussian Mixture clustering.
This class performs expectation maximization for multivariate Gaussian Mixture Models (GMMs). A GMM represents a composite distribution of independent Gaussian distributions with associated "mixing" weights specifying each's contribution to the composite.
Given a set of sample points, this class will maximize the log-likelihood for a mixture of k Gaussians, iterating until the log-likelihood changes by less than convergenceTol, or until it has reached the max number of iterations. While this process is generally guaranteed to converge, it is not guaranteed to find a global optimum.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/GaussianMixture.html
Timestamp: 2020-10-19T01:56:03.645Z
(hashing-tf params)
Maps a sequence of terms to their term frequencies using the hashing trick. Currently we use Austin Appleby's MurmurHash 3 algorithm (MurmurHash3_x86_32) to calculate the hash code value for the term object. Since a simple modulo is used to transform the hash function to a column index, it is advisable to use a power of two as the numFeatures parameter; otherwise the features will not be mapped evenly to the columns.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/HashingTF.html
Timestamp: 2020-10-19T01:56:08.308Z
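For illustration, a minimal Scala sketch against the HashingTF class documented at the Source link above; it assumes an existing SparkSession named spark (as in spark-shell), and the column names and data are illustrative. Note the power-of-two numFeatures, as recommended above.
  import org.apache.spark.ml.feature.HashingTF
  import spark.implicits._

  // Each row holds a pre-tokenised sequence of terms.
  val sentences = Seq(
    Seq("spark", "ml", "hashing", "trick"),
    Seq("spark", "sql", "dataframe")
  ).toDF("words")

  // numFeatures is a power of two so the modulo maps terms evenly onto columns.
  val hashingTF = new HashingTF()
    .setInputCol("words")
    .setOutputCol("rawFeatures")
    .setNumFeatures(1 << 10)

  hashingTF.transform(sentences).select("rawFeatures").show(truncate = false)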
(idf params)
Compute the Inverse Document Frequency (IDF) given a collection of documents.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/IDF.html
Timestamp: 2020-10-19T01:56:08.857Z
(idf-vector model)
Params:
Result: Vector
Returns the IDF vector.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/IDFModel.html
Timestamp: 2020-10-19T01:56:34.931Z
(imputer params)
Imputation estimator for completing missing values, either using the mean or the median of the columns in which the missing values are located. The input columns should be of numeric type. Currently Imputer does not support categorical features (SPARK-15041) and possibly creates incorrect values for a categorical feature.
Note that when an input column is of integer type, the imputed value is cast (truncated) to an integer type. For example, if the input column is IntegerType (1, 2, 4, null), the output will be IntegerType (1, 2, 4, 2) after mean imputation.
Note that the mean/median value is computed after filtering out missing values. All Null values in the input columns are treated as missing, and so are also imputed. For computing median, DataFrameStatFunctions.approxQuantile is used with a relative error of 0.001.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Imputer.html
Timestamp: 2020-10-19T01:56:09.241Z
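A minimal Scala sketch of the Imputer estimator documented at the Source link above, assuming an existing SparkSession named spark; the column names and data are illustrative. It shows median imputation with separate output columns.
  import org.apache.spark.ml.feature.Imputer
  import spark.implicits._

  // Missing values are encoded as null here; NaN is also treated as missing by default.
  val df = Seq(
    (Some(1.0), Some(10.0)),
    (Some(2.0), None),
    (None, Some(30.0)),
    (Some(4.0), Some(40.0))
  ).toDF("a", "b")

  val imputer = new Imputer()
    .setInputCols(Array("a", "b"))
    .setOutputCols(Array("a_imputed", "b_imputed"))
    .setStrategy("median")   // or "mean" (the default)

  imputer.fit(df).transform(df).show()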
(index-to-string params)
A Transformer that maps a column of indices back to a new column of corresponding string values. The index-string mapping is either from the ML attributes of the input column, or from user-supplied labels (which take precedence over ML attributes).
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/IndexToString.html
Timestamp: 2020-10-19T01:56:09.599Z
(input-col model)
Params:
Result: String
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/MinHashLSHModel.html
Timestamp: 2020-10-19T01:56:46.823Z
(input-cols model)
Params:
Result: Array[String]
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/OneHotEncoderModel.html
Timestamp: 2020-10-19T01:56:28.991Z
(interaction params)
Implements the feature interaction transform. This transformer takes in Double and Vector type columns and outputs a flattened vector of their feature interactions. To handle interaction, we first one-hot encode any nominal features. Then, a vector of the feature cross-products is produced.
For example, given the input feature values Double(2) and Vector(3, 4), the output would be Vector(6, 8) if all input features were numeric. If the first feature was instead nominal with four categories, the output would then be Vector(0, 0, 0, 0, 3, 4, 0, 0).
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Interaction.html
Timestamp: 2020-10-19T01:56:09.965Z
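A minimal Scala sketch of the Interaction transformer documented at the Source link above, reproducing the numeric example in the text; it assumes an existing SparkSession named spark, and the column names are illustrative.
  import org.apache.spark.ml.feature.Interaction
  import org.apache.spark.ml.linalg.Vectors
  import spark.implicits._

  val df = Seq(
    (2.0, Vectors.dense(3.0, 4.0)),
    (5.0, Vectors.dense(1.0, 2.0))
  ).toDF("x", "vec")

  // Flattened cross-products of "x" and "vec": e.g. (2.0, [3, 4]) -> [6.0, 8.0].
  val interaction = new Interaction()
    .setInputCols(Array("x", "vec"))
    .setOutputCol("interacted")

  interaction.transform(df).show(truncate = false)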
(intercept model)
Params:
Result: Double
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/LinearRegressionModel.html
Timestamp: 2020-10-19T01:56:36.333Z
(intercept-vector model)
Params:
Result: Vector
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/LogisticRegressionModel.html
Timestamp: 2020-10-19T01:56:46.167Z
(is-distributed model)
Params:
Result: Boolean
Indicates whether this instance is of type DistributedLDAModel
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/LDAModel.html
Timestamp: 2020-10-19T01:56:42.877Z
(isotonic-regression params)
Isotonic regression.
Currently implemented using the parallelized pool adjacent violators algorithm. Only the univariate (single feature) algorithm is supported.
Uses org.apache.spark.mllib.regression.IsotonicRegression.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/IsotonicRegression.html
Timestamp: 2020-10-19T01:55:54.264Z
(item-factors model)
Params:
Result: DataFrame
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/recommendation/ALSModel.html
Timestamp: 2020-10-19T01:56:42.288Z
(k-means params)
K-means clustering with support for k-means|| initialization proposed by Bahmani et al.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/KMeans.html
Timestamp: 2020-10-19T01:56:04.224Z
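A minimal Scala sketch of the KMeans estimator documented at the Source link above, assuming an existing SparkSession named spark; the data is illustrative.
  import org.apache.spark.ml.clustering.KMeans
  import org.apache.spark.ml.linalg.Vectors
  import spark.implicits._

  val df = Seq(
    Vectors.dense(0.0, 0.0), Vectors.dense(1.0, 1.0),
    Vectors.dense(9.0, 8.0), Vectors.dense(8.0, 9.0)
  ).map(Tuple1.apply).toDF("features")

  val kmeans = new KMeans().setK(2).setSeed(1L)
  val model  = kmeans.fit(df)

  model.clusterCenters.foreach(println)   // the two fitted centroids
  model.transform(df).show()              // adds a "prediction" column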
(kolmogorov-smirnov-test dataframe sample-col dist-name params)
Conduct the two-sided Kolmogorov-Smirnov (KS) test for data sampled from a continuous distribution. By comparing the largest difference between the empirical cumulative distribution of the sample data and the theoretical distribution, we can provide a test for the null hypothesis that the sample data comes from that theoretical distribution. For more information on the KS test, see the Source link below.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/stat/KolmogorovSmirnovTest$.html
Timestamp: 2020-10-19T01:55:50.540Z
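A minimal Scala sketch of the KolmogorovSmirnovTest object documented at the Source link above, assuming an existing SparkSession named spark; the sample data is illustrative. It tests whether a column is drawn from a standard normal distribution.
  import org.apache.spark.ml.stat.KolmogorovSmirnovTest
  import spark.implicits._

  val df = Seq(0.1, -0.4, 0.9, -1.2, 0.5, 1.7, -0.3).toDF("value")

  // distName "norm" takes the mean and standard deviation as extra parameters.
  val result = KolmogorovSmirnovTest.test(df, "value", "norm", 0.0, 1.0).head()

  println(s"p-value = ${result.getDouble(0)}, statistic = ${result.getDouble(1)}")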
(label-col model)
Params:
Result: String
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/LinearRegressionModel.html
Timestamp: 2020-10-19T01:56:36.316Z
(labels model)
Params:
Result: Array[String]
(Deprecated since version 3.0.0)
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/StringIndexerModel.html
Timestamp: 2020-10-19T01:56:31.154Z
(latent-dirichlet-allocation params)
Latent Dirichlet Allocation (LDA), a topic model designed for text documents.
Terminology:
Original LDA paper (journal version): Blei, Ng, and Jordan. "Latent Dirichlet Allocation." JMLR, 2003.
Input data (featuresCol): LDA is given a collection of documents as input data, via the featuresCol parameter. Each document is specified as a Vector of length vocabSize, where each entry is the count for the corresponding term (word) in the document. Feature transformers such as org.apache.spark.ml.feature.Tokenizer and org.apache.spark.ml.feature.CountVectorizer can be useful for converting text to word count vectors.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/LDA.html
Timestamp: 2020-10-19T01:56:04.609Z
(lda params)
Latent Dirichlet Allocation (LDA), a topic model designed for text documents.
Terminology:
Original LDA paper (journal version): Blei, Ng, and Jordan. "Latent Dirichlet Allocation." JMLR, 2003.
Input data (featuresCol): LDA is given a collection of documents as input data, via the featuresCol parameter. Each document is specified as a Vector of length vocabSize, where each entry is the count for the corresponding term (word) in the document. Feature transformers such as org.apache.spark.ml.feature.Tokenizer and org.apache.spark.ml.feature.CountVectorizer can be useful for converting text to word count vectors.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/LDA.html
Timestamp: 2020-10-19T01:56:04.609Z
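A minimal Scala sketch of the LDA estimator documented at the Source link above, assuming an existing SparkSession named spark; the token data is illustrative. It uses CountVectorizer to build the term-count vectors expected in featuresCol.
  import org.apache.spark.ml.clustering.LDA
  import org.apache.spark.ml.feature.CountVectorizer
  import spark.implicits._

  val docs = Seq(
    Seq("spark", "mllib", "topic", "model"),
    Seq("latent", "dirichlet", "allocation", "topic"),
    Seq("spark", "dataframe", "sql")
  ).toDF("tokens")

  // Convert token sequences into term-count vectors.
  val cv     = new CountVectorizer().setInputCol("tokens").setOutputCol("features")
  val counts = cv.fit(docs).transform(docs)

  val model = new LDA().setK(2).setMaxIter(10).fit(counts)

  model.describeTopics(3).show(truncate = false)
  println(s"lower bound on log likelihood: ${model.logLikelihood(counts)}")
  println(s"upper bound on log perplexity: ${model.logPerplexity(counts)}")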
(linear-regression params)
Linear regression.
The learning objective is to minimize the specified loss function, with regularization. This supports two kinds of loss: squaredError (a.k.a. squared loss) and huber (a hybrid of squared error for relatively small errors and absolute error for relatively large ones, with the scale parameter estimated from the training data).
This supports multiple types of regularization: none (a.k.a. ordinary least squares), L2 (ridge regression), L1 (Lasso), and L2 + L1 (elastic net).
The squared error and huber objective functions, and the definition of the scale parameter used by the huber loss, are given in the Source link below.
Note: Fitting with huber loss only supports none and L2 regularization.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/LinearRegression.html
Timestamp: 2020-10-19T01:55:54.848Z
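A minimal Scala sketch of the LinearRegression estimator documented at the Source link above, assuming an existing SparkSession named spark; the training data is illustrative. It uses huber loss, which (per the note above) only supports none/L2 regularization, so elasticNetParam stays at its default of 0.0.
  import org.apache.spark.ml.regression.LinearRegression
  import org.apache.spark.ml.linalg.Vectors
  import spark.implicits._

  val train = Seq(
    (1.0, Vectors.dense(0.0, 1.1)),
    (2.0, Vectors.dense(1.0, 1.9)),
    (3.0, Vectors.dense(2.0, 3.2))
  ).toDF("label", "features")

  val lr = new LinearRegression()
    .setLoss("huber")
    .setRegParam(0.1)
    .setMaxIter(50)

  val model = lr.fit(train)
  println(s"coefficients: ${model.coefficients}, intercept: ${model.intercept}")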
(linear-svc params)
Linear SVM Classifier
This binary classifier optimizes the Hinge Loss using the OWLQN optimizer. Only supports L2 regularization currently.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/LinearSVC.html
Timestamp: 2020-10-19T01:55:57.279Z
(log-likelihood dataset model)
Params: (dataset: Dataset[_])
Result: Double
Calculates a lower bound on the log likelihood of the entire corpus.
See Equation (16) in the Online LDA paper (Hoffman et al., 2010).
WARNING: If this model is an instance of DistributedLDAModel (produced when optimizer is set to "em"), this involves collecting a large topicsMatrix to the driver. This implementation may be changed in the future.
dataset: the test corpus to use for calculating the log likelihood.
Returns a variational lower bound on the log likelihood of the entire corpus.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/LDAModel.html
Timestamp: 2020-10-19T01:56:42.959Z
(log-perplexity dataset model)
Params: (dataset: Dataset[_])
Result: Double
Calculate an upper bound on perplexity. (Lower is better.) See Equation (16) in the Online LDA paper (Hoffman et al., 2010).
WARNING: If this model is an instance of DistributedLDAModel (produced when optimizer is set to "em"), this involves collecting a large topicsMatrix to the driver. This implementation may be changed in the future.
dataset: the test corpus to use for calculating perplexity.
Returns a variational upper bound on log perplexity per token.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/LDAModel.html
Timestamp: 2020-10-19T01:56:42.961Z
(logistic-regression params)
Logistic regression. Supports both binomial (binary) and multinomial (softmax) logistic regression.
This class supports fitting the traditional logistic regression model via LBFGS/OWLQN and the bound (box) constrained logistic regression model via LBFGSB.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/LogisticRegression.html
Timestamp: 2020-10-19T01:55:57.830Z
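A minimal Scala sketch of the LogisticRegression estimator documented at the Source link above, assuming an existing SparkSession named spark; the training data is illustrative.
  import org.apache.spark.ml.classification.LogisticRegression
  import org.apache.spark.ml.linalg.Vectors
  import spark.implicits._

  val train = Seq(
    (0.0, Vectors.dense(0.0, 1.0)),
    (1.0, Vectors.dense(2.0, 0.5)),
    (0.0, Vectors.dense(0.3, 1.2)),
    (1.0, Vectors.dense(1.8, 0.2))
  ).toDF("label", "features")

  val lr = new LogisticRegression()
    .setMaxIter(20)
    .setRegParam(0.01)
    .setElasticNetParam(0.5)   // mix of L1 and L2 regularization
    .setFamily("binomial")     // "multinomial" for softmax regression

  val model = lr.fit(train)
  println(s"coefficients: ${model.coefficients}, intercept: ${model.intercept}")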
(max-abs model)
Params:
Result: Vector
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/MaxAbsScalerModel.html
Timestamp: 2020-10-19T01:56:33.682Z
(max-abs-scaler params)
Rescale each feature individually to the range [-1, 1] by dividing by the maximum absolute value of that feature. It does not shift/center the data, and thus does not destroy any sparsity.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/MaxAbsScaler.html
Timestamp: 2020-10-19T01:56:10.658Z
(mean model)
Params:
Result: Vector
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/StandardScalerModel.html
Timestamp: 2020-10-19T01:56:33.051Z
(min-hash-lsh params)
LSH class for Jaccard distance.
The input can be dense or sparse vectors, but it is more efficient if it is sparse. For example, Vectors.sparse(10, Array((2, 1.0), (3, 1.0), (5, 1.0))) means there are 10 elements in the space. This set contains elements 2, 3, and 5. Also, any input vector must have at least 1 non-zero index, and all non-zero values are treated as binary "1" values.
References: Wikipedia on MinHash
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/MinHashLSH.html
Timestamp: 2020-10-19T01:56:11.035Z
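A minimal Scala sketch of the MinHashLSH estimator documented at the Source link above, assuming an existing SparkSession named spark; the sparse binary vectors are illustrative (a non-zero entry means the element is present in the set).
  import org.apache.spark.ml.feature.MinHashLSH
  import org.apache.spark.ml.linalg.Vectors
  import spark.implicits._

  val dfA = Seq(
    (0, Vectors.sparse(10, Array(2, 3, 5), Array(1.0, 1.0, 1.0))),
    (1, Vectors.sparse(10, Array(1, 3, 7), Array(1.0, 1.0, 1.0)))
  ).toDF("id", "features")

  val dfB = Seq(
    (2, Vectors.sparse(10, Array(2, 3, 9), Array(1.0, 1.0, 1.0)))
  ).toDF("id", "features")

  val mh    = new MinHashLSH().setNumHashTables(5).setInputCol("features").setOutputCol("hashes")
  val model = mh.fit(dfA)

  // Approximate similarity join, keeping pairs with Jaccard distance below 0.8.
  model.approxSimilarityJoin(dfA, dfB, 0.8, "JaccardDistance").show(truncate = false)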
(min-max-scaler params)
Rescale each feature individually to a common range [min, max] linearly using column summary statistics, which is also known as min-max normalization or rescaling. The rescaled value for feature E is calculated as \(Rescaled(e_i) = \frac{e_i - E_{min}}{E_{max} - E_{min}} * (max - min) + min\).
For the case \(E_{max} == E_{min}\), \(Rescaled(e_i) = 0.5 * (max + min)\).
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/MinMaxScaler.html
Timestamp: 2020-10-19T01:56:11.407Z
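A minimal Scala sketch of the MinMaxScaler estimator documented at the Source link above, assuming an existing SparkSession named spark; the data is illustrative.
  import org.apache.spark.ml.feature.MinMaxScaler
  import org.apache.spark.ml.linalg.Vectors
  import spark.implicits._

  val df = Seq(
    Vectors.dense(1.0, 0.1),
    Vectors.dense(2.0, 1.1),
    Vectors.dense(3.0, 10.1)
  ).map(Tuple1.apply).toDF("features")

  // Rescale every feature into [0, 1] using the per-column min and max found during fit.
  val scaler = new MinMaxScaler()
    .setInputCol("features")
    .setOutputCol("scaled")
    .setMin(0.0)
    .setMax(1.0)

  scaler.fit(df).transform(df).show(truncate = false)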
(mlp-classifier params)
Classifier trainer based on the Multilayer Perceptron. Each layer has a sigmoid activation function, and the output layer has softmax. The number of inputs has to be equal to the size of the feature vectors. The number of outputs has to be equal to the total number of labels.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.html
Timestamp: 2020-10-19T01:55:58.225Z
(multiclass-classification-evaluator params)
Evaluator for multiclass classification, which expects input columns: prediction, label, weight (optional) and probability (only for logLoss).
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/evaluation/MulticlassClassificationEvaluator.html
Timestamp: 2020-10-19T01:56:01.471Z
(multilabel-classification-evaluator params)
:: Experimental :: Evaluator for multi-label classification, which expects two input columns: prediction and label.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/evaluation/MultilabelClassificationEvaluator.html
Timestamp: 2020-10-19T01:56:01.814Z
(multilayer-perceptron-classifier params)
Classifier trainer based on the Multilayer Perceptron. Each layer has a sigmoid activation function, and the output layer has softmax. The number of inputs has to be equal to the size of the feature vectors. The number of outputs has to be equal to the total number of labels.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.html
Timestamp: 2020-10-19T01:55:58.225Z
(n-gram params)
A feature transformer that converts the input array of strings into an array of n-grams. Null values in the input array are ignored. It returns an array of n-grams where each n-gram is represented by a space-separated string of words.
When the input is empty, an empty array is returned. When the input array length is less than n (number of elements per n-gram), no n-grams are returned.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/NGram.html
Timestamp: 2020-10-19T01:56:11.769Z
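A minimal Scala sketch of the NGram transformer documented at the Source link above, assuming an existing SparkSession named spark; the data is illustrative and includes a row shorter than n to show the empty-result case.
  import org.apache.spark.ml.feature.NGram
  import spark.implicits._

  val df = Seq(
    Seq("to", "be", "or", "not", "to", "be"),
    Seq("hi")                 // shorter than n, so it yields no n-grams
  ).toDF("words")

  val ngram = new NGram().setN(2).setInputCol("words").setOutputCol("bigrams")
  ngram.transform(df).show(truncate = false)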
(naive-bayes params)
Naive Bayes Classifiers. It supports Multinomial NB (see here), which can handle finitely supported discrete data. For example, by converting documents into TF-IDF vectors, it can be used for document classification. By making every vector binary (0/1) data, it can also be used as Bernoulli NB (see here). The input feature values for Multinomial NB and Bernoulli NB must be nonnegative. Since 3.0.0, it supports Complement NB, which is an adaptation of Multinomial NB. Specifically, Complement NB uses statistics from the complement of each class to compute the model's coefficients. The inventors of Complement NB show empirically that the parameter estimates for CNB are more stable than those for Multinomial NB. Like Multinomial NB, the input feature values for Complement NB must be nonnegative. Since 3.0.0, it also supports Gaussian NB (see here), which can handle continuous data.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/NaiveBayes.html
Timestamp: 2020-10-19T01:55:58.596Z
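A minimal Scala sketch of the NaiveBayes estimator documented at the Source link above, assuming an existing SparkSession named spark; the nonnegative feature values (e.g. term counts) are illustrative.
  import org.apache.spark.ml.classification.NaiveBayes
  import org.apache.spark.ml.linalg.Vectors
  import spark.implicits._

  val train = Seq(
    (0.0, Vectors.dense(1.0, 0.0, 0.0)),
    (1.0, Vectors.dense(0.0, 2.0, 1.0)),
    (1.0, Vectors.dense(0.0, 1.0, 3.0))
  ).toDF("label", "features")

  // Other model types: "bernoulli", "complement", "gaussian".
  val nb    = new NaiveBayes().setModelType("multinomial")
  val model = nb.fit(train)
  model.transform(train).select("label", "prediction", "probability").show(truncate = false)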
(normaliser params)
Normalize a vector to have unit norm using the given p-norm.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Normalizer.html
Timestamp: 2020-10-19T01:56:12.133Z
(normalizer params)
Normalize a vector to have unit norm using the given p-norm.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Normalizer.html
Timestamp: 2020-10-19T01:56:12.133Z
(num-classes model)
Params:
Result: Int
Number of classes (values which the label can take).
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/RandomForestClassificationModel.html
Timestamp: 2020-10-19T01:56:37.671Z
(num-features model)
Params:
Result: Int
Returns the number of features the model was trained on. If unknown, returns -1.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/LinearRegressionModel.html
Timestamp: 2020-10-19T01:56:36.360Z
(num-nodes model)
Params:
Result: Int
Number of nodes in tree, including leaf nodes.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/DecisionTreeClassificationModel.html
Timestamp: 2020-10-19T01:56:41.668Z
(one-hot-encoder params)
A one-hot encoder that maps a column of category indices to a column of binary vectors, with at most a single one-value per row that indicates the input category index. For example with 5 categories, an input value of 2.0 would map to an output vector of [0.0, 0.0, 1.0, 0.0]. The last category is not included by default (configurable via dropLast), because it makes the vector entries sum up to one, and hence linearly dependent. So an input value of 4.0 maps to [0.0, 0.0, 0.0, 0.0].
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/OneHotEncoder.html
Timestamp: 2020-10-19T01:56:12.690Z
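A minimal Scala sketch of the OneHotEncoder estimator documented at the Source link above, assuming an existing SparkSession named spark; the category indices are illustrative. In Spark 3.x the encoder is an Estimator, so a fit step determines the category sizes.
  import org.apache.spark.ml.feature.OneHotEncoder
  import spark.implicits._

  val df = Seq(0.0, 1.0, 2.0, 4.0, 1.0).toDF("categoryIndex")

  val encoder = new OneHotEncoder()
    .setInputCols(Array("categoryIndex"))
    .setOutputCols(Array("categoryVec"))
    // .setDropLast(false)   // keep all categories instead of dropping the last one

  encoder.fit(df).transform(df).show(truncate = false)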
(one-vs-rest params)
Reduction of Multiclass Classification to Binary Classification. Performs the reduction using the one-against-all strategy. For a multiclass classification with k classes, train k models (one per class). Each example is scored against all k models and the model with the highest score is picked to label the example.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/OneVsRest.html
Timestamp: 2020-10-19T01:55:58.960Z
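A minimal Scala sketch of the OneVsRest estimator documented at the Source link above, assuming an existing SparkSession named spark; the three-class training data is illustrative.
  import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}
  import org.apache.spark.ml.linalg.Vectors
  import spark.implicits._

  val train = Seq(
    (0.0, Vectors.dense(0.0, 1.0)),
    (1.0, Vectors.dense(1.0, 0.0)),
    (2.0, Vectors.dense(2.0, 2.0))
  ).toDF("label", "features")

  // One binary LogisticRegression model is trained per class.
  val ovr   = new OneVsRest().setClassifier(new LogisticRegression().setMaxIter(10))
  val model = ovr.fit(train)
  model.transform(train).select("label", "prediction").show()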
(original-max model)
Params:
Result: Vector
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/MinMaxScalerModel.html
Timestamp: 2020-10-19T01:56:28.393Z
(original-min model)
Params:
Result: Vector
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/MinMaxScalerModel.html
Timestamp: 2020-10-19T01:56:28.394Z
(output-col model)
Params:
Result: String
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/MinHashLSHModel.html
Timestamp: 2020-10-19T01:56:46.826Z
(output-cols model)
Params:
Result: Array[String]
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/OneHotEncoderModel.html
Timestamp: 2020-10-19T01:56:28.994Z
(param-grid grids)
Builder for a param grid used in grid search-based model selection.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/tuning/ParamGridBuilder.html
Timestamp: 2020-10-19T01:55:49.184Z
(param-grid-builder grids)
Builder for a param grid used in grid search-based model selection.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/tuning/ParamGridBuilder.html
Timestamp: 2020-10-19T01:55:49.184Z
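A minimal Scala sketch of the ParamGridBuilder documented at the Source link above; the chosen estimator and parameter values are illustrative. The resulting grid is typically passed to CrossValidator or TrainValidationSplit via setEstimatorParamMaps.
  import org.apache.spark.ml.classification.LogisticRegression
  import org.apache.spark.ml.tuning.ParamGridBuilder

  val lr = new LogisticRegression()

  // Cartesian product: 3 x 2 = 6 candidate parameter maps for model selection.
  val grid = new ParamGridBuilder()
    .addGrid(lr.regParam, Array(0.01, 0.1, 1.0))
    .addGrid(lr.elasticNetParam, Array(0.0, 0.5))
    .build()

  println(s"${grid.length} candidate parameter maps")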
(params stage)
Params:
Result: Array[Param[_]]
Returns all params sorted by their names. The default implementation uses Java reflection to list all public methods that have no arguments and return Param.
Developers should not use this method in the constructor, because we cannot guarantee that this variable gets initialized before other params.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/fpm/PrefixSpan.html
Timestamp: 2020-10-19T01:56:35.738Z
(pc model)
Params:
Result: DenseMatrix
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/PCAModel.html
Timestamp: 2020-10-19T01:56:29.844Z
(pca params)
PCA trains a model to project vectors to a lower dimensional space of the top k principal components.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/PCA.html
Timestamp: 2020-10-19T01:56:13.048Z
(pi model)
Params:
Result: Vector
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/NaiveBayesModel.html
Timestamp: 2020-10-19T01:56:39.617Z
(pipeline & stages)
A simple pipeline, which acts as an estimator. A Pipeline consists of a sequence of stages, each of which is either an Estimator or a Transformer. When Pipeline.fit is called, the stages are executed in order. If a stage is an Estimator, its Estimator.fit method will be called on the input dataset to fit a model. Then the model, which is a transformer, will be used to transform the dataset as the input to the next stage. If a stage is a Transformer, its Transformer.transform method will be called to produce the dataset for the next stage. The fitted model from a Pipeline is a PipelineModel, which consists of fitted models and transformers, corresponding to the pipeline stages. If there are no stages, the pipeline acts as an identity transformer.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/Pipeline.html
Timestamp: 2020-10-19T01:55:50.903Z
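A minimal Scala sketch of the Pipeline class documented at the Source link above, assuming an existing SparkSession named spark; the text data and column names are illustrative. The stages are Tokenizer and HashingTF (Transformers) followed by LogisticRegression (an Estimator), and fitting yields a PipelineModel.
  import org.apache.spark.ml.Pipeline
  import org.apache.spark.ml.classification.LogisticRegression
  import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
  import spark.implicits._

  val train = Seq(
    (0.0, "spark is great"),
    (1.0, "hadoop map reduce"),
    (0.0, "spark sql dataframe")
  ).toDF("label", "text")

  val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
  val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features").setNumFeatures(1 << 8)
  val lr        = new LogisticRegression().setMaxIter(10)

  val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
  val model    = pipeline.fit(train)          // a PipelineModel
  model.transform(train).select("text", "prediction").show(truncate = false)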
(polynomial-expansion params)
Perform feature expansion in a polynomial space. As described in Polynomial expansion (Wikipedia): "In mathematics, an expansion of a product of sums expresses it as a sum of products by using the fact that multiplication distributes over addition". Take a 2-variable feature vector as an example: (x, y), if we want to expand it with degree 2, then we get (x, x * x, y, x * y, y * y).
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/PolynomialExpansion.html
Timestamp: 2020-10-19T01:56:13.405Z
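A minimal Scala sketch of the PolynomialExpansion transformer documented at the Source link above, reproducing the (x, y) degree-2 example in the text; it assumes an existing SparkSession named spark, and the column names are illustrative.
  import org.apache.spark.ml.feature.PolynomialExpansion
  import org.apache.spark.ml.linalg.Vectors
  import spark.implicits._

  val df = Seq(Vectors.dense(2.0, 3.0)).map(Tuple1.apply).toDF("features")

  // Degree-2 expansion of (x, y): (x, x*x, y, x*y, y*y).
  val expander = new PolynomialExpansion()
    .setInputCol("features")
    .setOutputCol("poly")
    .setDegree(2)

  expander.transform(df).show(truncate = false)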
(power-iteration-clustering params)
Power Iteration Clustering (PIC), a scalable graph clustering algorithm developed by Lin and Cohen. From the abstract: PIC finds a very low-dimensional embedding of a dataset using truncated power iteration on a normalized pair-wise similarity matrix of the data.
This class is not yet an Estimator/Transformer; use the assignClusters method to run the PowerIterationClustering algorithm.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/PowerIterationClustering.html
Timestamp: 2020-10-19T01:56:04.968Z
(prediction-col model)
Params:
Result: String
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/LinearRegressionModel.html
Timestamp: 2020-10-19T01:56:36.320Z
(prefix-span params)
A parallel PrefixSpan algorithm to mine frequent sequential patterns. The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth (see here). This class is not yet an Estimator/Transformer; use the findFrequentSequentialPatterns method to run the PrefixSpan algorithm.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/fpm/PrefixSpan.html
Timestamp: 2020-10-19T01:56:00.046Z
(principal-components model)
Params:
Result: DenseMatrix
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/PCAModel.html
Timestamp: 2020-10-19T01:56:29.844Z
(probability-col model)
Params:
Result: String
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/RandomForestClassificationModel.html
Timestamp: 2020-10-19T01:56:37.625Z
(quantile-discretiser params)
QuantileDiscretizer takes a column with continuous features and outputs a column with binned categorical features. The number of bins can be set using the numBuckets parameter. It is possible that the number of buckets used will be smaller than this value, for example, if there are too few distinct values of the input to create enough distinct quantiles. Since 2.3.0, QuantileDiscretizer can map multiple columns at once by setting the inputCols parameter. If both of the inputCol and inputCols parameters are set, an Exception will be thrown. To specify the number of buckets for each column, the numBucketsArray parameter can be set, or if the number of buckets should be the same across columns, numBuckets can be set as a convenience. Note that in multiple columns case, relative error is applied to all columns.
NaN handling: null and NaN values will be ignored from the column during QuantileDiscretizer fitting. This will produce a Bucketizer model for making predictions. During the transformation, Bucketizer will raise an error when it finds NaN values in the dataset, but the user can also choose to either keep or remove NaN values within the dataset by setting handleInvalid. If the user chooses to keep NaN values, they will be handled specially and placed into their own bucket, for example, if 4 buckets are used, then non-NaN data will be put into buckets[0-3], but NaNs will be counted in a special bucket[4].
Algorithm: The bin ranges are chosen using an approximate algorithm (see the documentation for org.apache.spark.sql.DataFrameStatFunctions.approxQuantile for a detailed description). The precision of the approximation can be controlled with the relativeError parameter. The lower and upper bin bounds will be -Infinity and +Infinity, covering all real values.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/QuantileDiscretizer.html
Timestamp: 2020-10-19T01:56:13.770Z
(quantile-discretizer params)
QuantileDiscretizer takes a column with continuous features and outputs a column with binned categorical features. The number of bins can be set using the numBuckets parameter. It is possible that the number of buckets used will be smaller than this value, for example, if there are too few distinct values of the input to create enough distinct quantiles. Since 2.3.0, QuantileDiscretizer can map multiple columns at once by setting the inputCols parameter. If both of the inputCol and inputCols parameters are set, an Exception will be thrown. To specify the number of buckets for each column, the numBucketsArray parameter can be set, or if the number of buckets should be the same across columns, numBuckets can be set as a convenience. Note that in multiple columns case, relative error is applied to all columns.
NaN handling: null and NaN values will be ignored from the column during QuantileDiscretizer fitting. This will produce a Bucketizer model for making predictions. During the transformation, Bucketizer will raise an error when it finds NaN values in the dataset, but the user can also choose to either keep or remove NaN values within the dataset by setting handleInvalid. If the user chooses to keep NaN values, they will be handled specially and placed into their own bucket, for example, if 4 buckets are used, then non-NaN data will be put into buckets[0-3], but NaNs will be counted in a special bucket[4].
Algorithm: The bin ranges are chosen using an approximate algorithm (see the documentation for org.apache.spark.sql.DataFrameStatFunctions.approxQuantile for a detailed description). The precision of the approximation can be controlled with the relativeError parameter. The lower and upper bin bounds will be -Infinity and +Infinity, covering all real values.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/QuantileDiscretizer.html
Timestamp: 2020-10-19T01:56:13.770Z
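A minimal Scala sketch of the QuantileDiscretizer estimator documented at the Source link above, assuming an existing SparkSession named spark; the data is illustrative and includes a NaN to show the handleInvalid behaviour described above.
  import org.apache.spark.ml.feature.QuantileDiscretizer
  import spark.implicits._

  val df = Seq(0.1, 0.4, 1.2, 1.5, Double.NaN, 3.3, 4.8).toDF("value")

  val discretizer = new QuantileDiscretizer()
    .setInputCol("value")
    .setOutputCol("bucket")
    .setNumBuckets(3)
    .setRelativeError(0.001)
    .setHandleInvalid("keep")   // NaNs go into their own extra bucket instead of raising an error

  // fit produces a Bucketizer model whose splits come from approximate quantiles.
  discretizer.fit(df).transform(df).show()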
(random-forest-classifier params)
Random Forest learning algorithm for classification. It supports both binary and multiclass labels, as well as both continuous and categorical features.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/RandomForestClassifier.html
Timestamp: 2020-10-19T01:55:59.351Z
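For example, a sketch of fitting a random forest classifier, assuming ml aliases the library's ML namespace and train-df / test-df are DataFrames that already carry the default "features" and "label" columns; the hyper-parameters are illustrative only.
(def classifier
  (ml/random-forest-classifier {:num-trees 50 :max-depth 5}))
;; fit on the training set, then add prediction columns to the test set
(def model (ml/fit train-df classifier))
(ml/transform test-df model)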
(random-forest-regressor params)
Random Forest learning algorithm for regression. It supports both continuous and categorical features.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/RandomForestRegressor.html
Timestamp: 2020-10-19T01:55:55.394Z
(ranking-evaluator params)
(Experimental) Evaluator for ranking, which expects two input columns: prediction and label.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/evaluation/RankingEvaluator.html
Timestamp: 2020-10-19T01:56:02.374Z
(raw-prediction-col model)
Params:
Result: String
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/RandomForestClassificationModel.html
Timestamp: 2020-10-19T01:56:37.626Z
(read-stage! model-cls path)
Load a saved PipelineStage.
(recommend-for-all-items model num-users)
Params: (numUsers: Int)
Result: DataFrame
Returns top numUsers users recommended for each item, for all items.
max number of recommendations for each item
a DataFrame of (itemCol: Int, recommendations), where recommendations are stored as an array of (userCol: Int, rating: Float) Rows.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/recommendation/ALSModel.html
Timestamp: 2020-10-19T01:56:42.310Z
(recommend-for-all-users model num-items)
Params: (numItems: Int)
Result: DataFrame
Returns top numItems items recommended for each user, for all users.
max number of recommendations for each user
a DataFrame of (userCol: Int, recommendations), where recommendations are stored as an array of (itemCol: Int, rating: Float) Rows.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/recommendation/ALSModel.html
Timestamp: 2020-10-19T01:56:42.315Z
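A short usage sketch for the two bulk-recommendation calls above, assuming als-model is an ALSModel fitted elsewhere and ml is the assumed namespace alias.
;; top 10 items per user, as (userCol, recommendations) rows
(def user-recs (ml/recommend-for-all-users als-model 10))
;; top 10 users per item, as (itemCol, recommendations) rows
(def item-recs (ml/recommend-for-all-items als-model 10))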
(recommend-for-item-subset model items-df num-users)
Params: (dataset: Dataset[_], numUsers: Int)
Result: DataFrame
Returns top numUsers users recommended for each item id in the input data set. Note that if there are duplicate ids in the input dataset, only one set of recommendations per unique id will be returned.
a Dataset containing a column of item ids. The column name must match itemCol.
max number of recommendations for each item.
a DataFrame of (itemCol: Int, recommendations), where recommendations are stored as an array of (userCol: Int, rating: Float) Rows.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/recommendation/ALSModel.html
Timestamp: 2020-10-19T01:56:42.317Z
(recommend-for-user-subset model users-df num-items)
Params: (dataset: Dataset[_], numItems: Int)
Result: DataFrame
Returns top numItems items recommended for each user id in the input data set. Note that if there are duplicate ids in the input dataset, only one set of recommendations per unique id will be returned.
a Dataset containing a column of user ids. The column name must match userCol.
max number of recommendations for each user.
a DataFrame of (userCol: Int, recommendations), where recommendations are stored as an array of (itemCol: Int, rating: Float) Rows.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/recommendation/ALSModel.html
Timestamp: 2020-10-19T01:56:42.319Z
(recommend-items model num-items)
(recommend-items model users-df num-items)
Params: (numItems: Int)
Result: DataFrame
Returns top numItems items recommended for each user, for all users.
max number of recommendations for each user
a DataFrame of (userCol: Int, recommendations), where recommendations are stored as an array of (itemCol: Int, rating: Float) Rows.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/recommendation/ALSModel.html
Timestamp: 2020-10-19T01:56:42.315Z
(recommend-users model num-users)
(recommend-users model items-df num-users)
Params: (numUsers: Int)
Result: DataFrame
Returns top numUsers users recommended for each item, for all items.
max number of recommendations for each item
a DataFrame of (itemCol: Int, recommendations), where recommendations are stored as an array of (userCol: Int, rating: Float) Rows.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/recommendation/ALSModel.html
Timestamp: 2020-10-19T01:56:42.310Z
(regex-tokeniser params)
A regex based tokenizer that extracts tokens either by using the provided regex pattern to split the text (default) or repeatedly matching the regex (if gaps is false). Optional parameters also allow filtering tokens using a minimal length. It returns an array of strings that can be empty.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/RegexTokenizer.html
Timestamp: 2020-10-19T01:56:14.327Z
(regex-tokenizer params)
A regex based tokenizer that extracts tokens either by using the provided regex pattern to split the text (default) or repeatedly matching the regex (if gaps is false). Optional parameters also allow filtering tokens using a minimal length. It returns an array of strings that can be empty.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/RegexTokenizer.html
Timestamp: 2020-10-19T01:56:14.327Z
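For example, splitting text on runs of non-word characters; the column names and pattern are assumptions for illustration, and ml aliases the library's ML namespace.
(def tokeniser
  (ml/regex-tokenizer {:input-col "sentence"
                       :output-col "words"
                       :pattern "\\W+"
                       :min-token-length 2}))
;; "words" becomes an array<string> column
(ml/transform dataset tokeniser)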
(regression-evaluator params)
Evaluator for regression, which expects input columns prediction, label and an optional weight column.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/evaluation/RegressionEvaluator.html
Timestamp: 2020-10-19T01:56:02.721Z
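A sketch of scoring a fitted model's output, assuming predictions is a DataFrame produced by a regressor's transform and that the library exposes an evaluate helper of the shape (evaluate dataframe evaluator).
(def evaluator
  (ml/regression-evaluator {:label-col "label"
                            :prediction-col "prediction"
                            :metric-name "rmse"}))
;; returns the root-mean-square error as a double
(ml/evaluate predictions evaluator)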
(robust-scaler params)
Scale features using statistics that are robust to outliers. RobustScaler removes the median and scales the data according to the quantile range. The quantile range is by default IQR (Interquartile Range, quantile range between the 1st quartile = 25th quantile and the 3rd quartile = 75th quantile) but can be configured. Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Median and quantile range are then stored to be used on later data using the transform method. Standardization of a dataset is a common requirement for many machine learning estimators. Typically this is done by removing the mean and scaling to unit variance. However, outliers can often influence the sample mean / variance in a negative way. In such cases, the median and the quantile range often give better results. Note that NaN values are ignored in the computation of medians and ranges.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/RobustScaler.html
Timestamp: 2020-10-19T01:56:15.260Z
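A minimal sketch, assuming dataset already has a vector-valued "features" column and ml aliases the library's ML namespace; enabling centering is an illustrative choice, not the Spark default.
(def scaler
  (ml/robust-scaler {:input-col "features"
                     :output-col "scaled-features"
                     :with-centering true
                     :with-scaling true}))
;; learns the per-feature median and quantile range on dataset
(def scaler-model (ml/fit dataset scaler))
(ml/transform dataset scaler-model)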
(root-node model)
Params:
Result: Node
Root of the decision tree
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/DecisionTreeClassificationModel.html
Timestamp: 2020-10-19T01:56:41.689Z
(scale model)
Params:
Result: Double
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/LinearRegressionModel.html
Timestamp: 2020-10-19T01:56:36.368Z
(sql-transformer params)
Implements transformations defined by a SQL statement. Currently only SQL syntax like 'SELECT ... FROM __THIS__ ...' is supported, where '__THIS__' represents the underlying table of the input dataset. The select clause specifies the fields, constants, and expressions to display in the output; it can be any select clause that Spark SQL supports. Users can also use Spark SQL built-in functions and UDFs to operate on these selected columns. For example, SQLTransformer supports statements like the one sketched below.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/SQLTransformer.html
Timestamp: 2020-10-19T01:56:15.611Z
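A sketch of the kind of statement it accepts; the columns v1 and v2 are assumed to exist in dataset and are purely illustrative, and ml is the assumed namespace alias.
(def sql-trans
  (ml/sql-transformer
    {:statement "SELECT *, (v1 + v2) AS v3, (v1 * v2) AS v4 FROM __THIS__"}))
(ml/transform dataset sql-trans)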
(stages model)
Params:
Result: Array[Transformer]
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/PipelineModel.html
Timestamp: 2020-10-19T01:56:38.367Z
(standard-scaler params)
Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set.
The "unit std" is computed using the
corrected sample standard deviation, which is computed as the square root of the unbiased sample variance.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/StandardScaler.html
Timestamp: 2020-10-19T01:56:16.163Z
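For instance, a sketch that standardizes an existing vector column; setting with-mean true produces dense output, so it is shown here only as an illustrative option. Aliases and column names are assumptions.
(def scaler
  (ml/standard-scaler {:input-col "features"
                       :output-col "scaled-features"
                       :with-mean true
                       :with-std true}))
;; column-wise statistics are computed from dataset during fitting
(def scaler-model (ml/fit dataset scaler))
(ml/transform dataset scaler-model)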
(std model)
Params:
Result: Vector
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/StandardScalerModel.html
Timestamp: 2020-10-19T01:56:33.073Z
(stop-words-remover params)
A feature transformer that filters out stop words from input.
Since 3.0.0, StopWordsRemover can filter out multiple columns at once by setting the inputCols parameter. Note that when both the inputCol and inputCols parameters are set, an Exception will be thrown.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/StopWordsRemover.html
Timestamp: 2020-10-19T01:56:16.540Z
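A short sketch, assuming tokenised-df has an array<string> column named "words" (for example, the output of one of the tokenizers above) and ml is the assumed namespace alias.
(def remover
  (ml/stop-words-remover {:input-col "words" :output-col "filtered-words"}))
(ml/transform tokenised-df remover)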
(string-indexer params)
A label indexer that maps string column(s) of labels to ML column(s) of label indices. If the input columns are numeric, we cast them to string and index the string values. The indices are in [0, numLabels). By default, this is ordered by label frequencies so the most frequent label gets index 0. The ordering behavior is controlled by setting stringOrderType.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/StringIndexer.html
Timestamp: 2020-10-19T01:56:16.905Z
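For example, indexing a string column; "frequencyDesc" is the default ordering and is spelled out here only for clarity, while the column names and the ml alias are assumptions.
(def indexer
  (ml/string-indexer {:input-col "category"
                      :output-col "category-index"
                      :string-order-type "frequencyDesc"}))
(def indexer-model (ml/fit dataset indexer))
(ml/transform dataset indexer-model)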
(summary model)
Params:
Result: LinearRegressionTrainingSummary
Gets the summary (e.g. residuals, MSE, R-squared) of the model on the training set. An exception is thrown if hasSummary is false.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/LinearRegressionModel.html
Timestamp: 2020-10-19T01:56:36.383Z
(supported-optimisers model)
Params:
Result: Array[String]
Supported values for Param optimizer.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/LDAModel.html
Timestamp: 2020-10-19T01:56:42.991Z
(supported-optimizers model)
Params:
Result: Array[String]
Supported values for Param optimizer.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/LDAModel.html
Timestamp: 2020-10-19T01:56:42.991Z
(surrogate-df model)
Params:
Result: DataFrame
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/ImputerModel.html
Timestamp: 2020-10-19T01:56:30.491Z
(theta model)
Params:
Result: Matrix
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/NaiveBayesModel.html
Timestamp: 2020-10-19T01:56:39.648Z
(thresholds model)
Params:
Result: Array[Double]
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/RandomForestClassificationModel.html
Timestamp: 2020-10-19T01:56:37.629Z
(tokeniser params)
A tokenizer that converts the input string to lowercase and then splits it by white spaces.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Tokenizer.html
Timestamp: 2020-10-19T01:56:17.265Z
(tokenizer params)
A tokenizer that converts the input string to lowercase and then splits it by white spaces.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Tokenizer.html
Timestamp: 2020-10-19T01:56:17.265Z
(total-num-nodes model)
Params:
Result: Int
Total number of nodes, summed over all trees in the ensemble.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/RandomForestClassificationModel.html
Timestamp: 2020-10-19T01:56:37.716Z
(train-validation-split {:keys [estimator evaluator estimator-param-maps seed
parallelism]})
Validation for hyper-parameter tuning. Randomly splits the input dataset into train and validation sets, and uses evaluation metric on the validation set to select the best model. Similar to CrossValidator, but only splits the set once.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/tuning/TrainValidationSplit.html
Timestamp: 2020-10-19T01:55:49.563Z
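A sketch wiring the keys listed in the signature above; linear-regression is used only as a placeholder estimator, param-maps stands in for an estimator-param-maps value built elsewhere, and ml is the assumed namespace alias.
(def tvs
  (ml/train-validation-split
    {:estimator            (ml/linear-regression {})
     :evaluator            (ml/regression-evaluator {:metric-name "rmse"})
     :estimator-param-maps param-maps
     :parallelism          2
     :seed                 42}))
;; fitting evaluates each param map on the validation split and returns a model wrapping the best one
(def best-model (ml/fit train-df tvs))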
(transform dataframe transformer)
Params: (dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*)
Result: DataFrame
Transforms the dataset with optional parameters
input dataset
the first param pair, overwrite embedded params
other param pairs, overwrite embedded params
transformed dataset
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/regression/LinearRegressionModel.html
Timestamp: 2020-10-19T01:56:36.391Z
(tree-weights model)
Params:
Result: Array[Double]
Weights for each tree, zippable with trees
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/RandomForestClassificationModel.html
Timestamp: 2020-10-19T01:56:37.737Z
(trees model)
Params:
Result: Array[DecisionTreeClassificationModel]
Trees in this ensemble. Warning: These have null parent Estimators.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/classification/RandomForestClassificationModel.html
Timestamp: 2020-10-19T01:56:37.739Z
(uid model)
Params:
Result: String
An immutable unique ID for the object and its derivatives.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/fpm/PrefixSpan.html
Timestamp: 2020-10-19T01:56:35.754Z
(user-factors model)
Params:
Result: DataFrame
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/recommendation/ALSModel.html
Timestamp: 2020-10-19T01:56:42.347Z
(vector->array expr)
(vector->array expr dtype)
Params: (v: Column, dtype: String = "float64")
Result: Column
Converts a column of MLlib sparse/dense vectors into a column of dense arrays.
an array<float> if dtype is float32, or array<double> if dtype is float64
3.0.0
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/functions$.html
Timestamp: 2020-10-19T01:56:27.317Z
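For example, assuming vector->array is exposed through the ml alias and the core DataFrame namespace is aliased as g with with-column and col helpers:
;; dense double arrays (the default float64 dtype)
(g/with-column dataset "features-arr" (ml/vector->array (g/col "features")))
;; single-precision output instead
(g/with-column dataset "features-arr" (ml/vector->array (g/col "features") "float32"))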
(vector-assembler params)
A feature transformer that merges multiple columns into a vector column.
This requires one pass over the entire dataset. In case we need to infer column lengths from the data we require an additional call to the 'first' Dataset method, see 'handleInvalid' parameter.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/VectorAssembler.html
Timestamp: 2020-10-19T01:56:17.622Z
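A minimal sketch; the input column names are illustrative, and "skip" simply drops rows containing invalid values, per the handleInvalid parameter mentioned above.
(def assembler
  (ml/vector-assembler {:input-cols ["age" "height" "weight"]
                        :output-col "features"
                        :handle-invalid "skip"}))
(ml/transform dataset assembler)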
(vector-indexer params)
Class for indexing categorical feature columns in a dataset of Vector.
It has two usage modes: automatically identifying categorical features, where features with at most maxCategories distinct values are treated as categorical and the rest as continuous, or indexing all features when every feature is already categorical. In both modes it returns a model which can transform categorical features to use 0-based indices.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/VectorIndexer.html
Timestamp: 2020-10-19T01:56:18.174Z
(vector-size-hint params)
A feature transformer that adds size information to the metadata of a vector column. VectorAssembler needs size information for its input columns and cannot be used on streaming dataframes without this metadata.
Note: VectorSizeHint modifies inputCol to include size metadata and does not have an outputCol.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/VectorSizeHint.html
Timestamp: 2020-10-19T01:56:18.723Z
(vector-to-array expr)
(vector-to-array expr dtype)
Params: (v: Column, dtype: String = "float64")
Result: Column
Converts a column of MLlib sparse/dense vectors into a column of dense arrays.
an array<float> if dtype is float32, or array<double> if dtype is float64
3.0.0
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/functions$.html
Timestamp: 2020-10-19T01:56:27.317Z
(vocab-size model)
Params:
Result: Int
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/LDAModel.html
Timestamp: 2020-10-19T01:56:43.011Z
(vocabulary model)
Params:
Result: Array[String]
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/CountVectorizerModel.html
Timestamp: 2020-10-19T01:56:34.357Z
(weights model)
Params:
Result: Array[Double]
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/clustering/GaussianMixtureModel.html
Timestamp: 2020-10-19T01:56:40.312Z
(word-2-vec params)
Word2Vec trains a model of Map(String, Vector), i.e. transforms a word into a code for further natural language processing or machine learning process.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Word2Vec.html
Timestamp: 2020-10-19T01:56:19.459Z
(word2vec params)
Word2Vec trains a model of Map(String, Vector), i.e. transforms a word into a code for further natural language processing or machine learning process.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/ml/feature/Word2Vec.html
Timestamp: 2020-10-19T01:56:19.459Z
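For instance, a sketch that learns 64-dimensional word vectors from a tokenized column; the sizes are illustrative, and tokenised-df is assumed to have an array<string> column "words".
(def w2v
  (ml/word2vec {:input-col "words"
                :output-col "word-vector"
                :vector-size 64
                :min-count 2}))
(def w2v-model (ml/fit tokenised-df w2v))
;; each row's word vectors are averaged into a single vector in "word-vector"
(ml/transform tokenised-df w2v-model)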
(write-native-model! model path)
Save the native XGBoost `Booster` to a file.
(write-stage! stage path)
(write-stage! stage path options)
Save a PipelineStage to the specified path.
(xgboost-classifier params)
Gradient boosting classifier based on xgboost.
XGBoost docs: https://xgboost.readthedocs.io/en/latest/
XGBoost4J docs: https://xgboost.readthedocs.io/en/latest/jvm/scaladocs/xgboost4j-spark/ml/dmlc/xgboost4j/scala/spark/XGBoostClassifier.html
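A sketch under the assumption that the wrapper forwards XGBoost4J-Spark parameters in kebab-case; the hyper-parameters and the train-df / test-df DataFrames are illustrative.
(def xgb
  (ml/xgboost-classifier {:max-depth 4
                          :num-round 50
                          :objective "binary:logistic"}))
(def xgb-model (ml/fit train-df xgb))
(ml/transform test-df xgb-model)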
(xgboost-regressor params)
Gradient boosting regressor based on xgboost.
XGBoost docs: https://xgboost.readthedocs.io/en/latest/
XGBoost4J docs: https://xgboost.readthedocs.io/en/latest/jvm/scaladocs/xgboost4j-spark/ml/dmlc/xgboost4j/scala/spark/XGBoostRegressor.html