(add cms item)
(add cms item cnt)
Params: (item: Any)
Result: Unit
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/util/sketch/CountMinSketch.html
Timestamp: 2020-10-19T01:56:26.095Z
(agg dataframe & args)
Params: (aggExpr: (String, String), aggExprs: (String, String)*)
Result: DataFrame
(Scala-specific) Aggregates on the entire Dataset without groups.
2.0.0
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html
Timestamp: 2020-10-19T01:56:20.739Z
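To illustrate this overload, a minimal Scala sketch against the Spark API documented above; the SparkSession named spark and the sample columns (name, age, salary) are assumptions for illustration only.

```scala
// Assumes an active SparkSession named `spark` (e.g. in spark-shell).
import spark.implicits._

val ds = Seq(("a", 30, 100.0), ("b", 40, 200.0)).toDF("name", "age", "salary")

// Aggregate over the entire Dataset without groups, using (column -> aggregate-function) pairs.
ds.agg("age" -> "max", "salary" -> "avg").show()
```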
(agg-all dataframe agg-fn)
Aggregates on all columns of the entire Dataset without groups.
(approx-quantile dataframe col-or-cols probs rel-error)
Params: (col: String, probabilities: Array[Double], relativeError: Double)
Result: Array[Double]
Calculates the approximate quantiles of a numerical column of a DataFrame.
The result of this algorithm has the following deterministic bound: If the DataFrame has N elements and if we request the quantile at probability p up to error err, then the algorithm will return a sample x from the DataFrame so that the exact rank of x is close to (p * N). More precisely, floor((p - err) * N) <= rank(x) <= ceil((p + err) * N).
This method implements a variation of the Greenwald-Khanna algorithm (with some speed optimizations). The algorithm was first present in Space-efficient Online Computation of Quantile Summaries by Greenwald and Khanna.
the name of the numerical column
a list of quantile probabilities. Each number must belong to [0, 1]. For example, 0 is the minimum, 0.5 is the median, 1 is the maximum.
The relative target precision to achieve (greater than or equal to 0). If set to zero, the exact quantiles are computed, which could be very expensive. Note that values greater than 1 are accepted but give the same result as 1.
the approximate quantiles at the given probabilities
2.0.0
null and NaN values will be removed from the numerical column before calculation. If the dataframe is empty or the column only contains null or NaN, an empty array is returned.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/DataFrameStatFunctions.html
Timestamp: 2020-10-19T01:56:24.640Z
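A small Scala sketch of the approxQuantile call documented above; the spark session, the value column, and the data are illustrative assumptions.

```scala
// Assumes an active SparkSession named `spark`.
import spark.implicits._

val df = (1 to 100).map(_.toDouble).toDF("value")

// Approximate quartiles of `value` with a relative error target of 1%.
val quartiles: Array[Double] =
  df.stat.approxQuantile("value", Array(0.25, 0.5, 0.75), 0.01)
```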
(bit-size bloom)
Params: ()
Result: Long
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/util/sketch/BloomFilter.html
Timestamp: 2020-10-19T01:56:25.738Z
(bloom-filter dataframe expr expected-num-items num-bits-or-fpp)
Params: (colName: String, expectedNumItems: Long, fpp: Double)
Result: BloomFilter
Builds a Bloom filter over a specified column.
name of the column over which the filter is built
expected number of items which will be put into the filter.
expected false positive probability of the filter.
2.0.0
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/DataFrameStatFunctions.html
Timestamp: 2020-10-19T01:56:24.647Z
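A hedged Scala sketch of the bloomFilter call documented above, assuming an active spark session and a made-up id column.

```scala
// Assumes an active SparkSession named `spark`.
import spark.implicits._

val ids = (1 to 1000).toDF("id")

// Build a Bloom filter over `id`, expecting ~1000 items with a 3% false-positive probability.
val bloom = ids.stat.bloomFilter("id", 1000L, 0.03)

bloom.mightContain(42)    // true: 42 was inserted
bloom.mightContain(5000)  // usually false; a `true` here would be a false positive
```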
(cache dataframe)
Params: ()
Result: Dataset.this.type
Persist this Dataset with the default storage level (MEMORY_AND_DISK).
1.6.0
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html
Timestamp: 2020-10-19T01:56:20.750Z
(checkpoint dataframe)
(checkpoint dataframe eager)
Params: ()
Result: Dataset[T]
Eagerly checkpoint a Dataset and return the new Dataset. Checkpointing can be used to truncate the logical plan of this Dataset, which is especially useful in iterative algorithms where the plan may grow exponentially. It will be saved to files inside the checkpoint directory set with SparkContext#setCheckpointDir.
2.1.0
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html
Timestamp: 2020-10-19T01:56:20.752Z
(col-regex dataframe col-name)
Params: (colName: String)
Result: Column
Selects column based on the column name specified as a regex and returns it as Column.
2.3.0
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html
Timestamp: 2020-10-19T01:56:20.758Z
(collect dataframe)
Params: ()
Result: Array[T]
Returns an array that contains all rows in this Dataset.
Running collect requires moving all the data into the application's driver process, and doing so on a very large dataset can crash the driver process with OutOfMemoryError.
For Java API, use collectAsList.
1.6.0
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html
Timestamp: 2020-10-19T01:56:20.759Z
(collect-col dataframe col-name)
Returns a vector that contains all rows in the column of the Dataset.
(collect-vals dataframe)
Returns the vector values of the Dataset collected.
(column-names dataframe)
Returns all column names as an array of strings.
(columns dataframe)
Returns all column names as an array of keywords.
(compatible? bloom other)
Params: (other: BloomFilter)
Result: Boolean
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/util/sketch/BloomFilter.html
Timestamp: 2020-10-19T01:56:25.740Z
(confidence cms)
Params: ()
Result: Double
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/util/sketch/CountMinSketch.html
Timestamp: 2020-10-19T01:56:26.102Z
(count-min-sketch dataframe expr eps-or-depth confidence-or-width seed)
Params: (colName: String, depth: Int, width: Int, seed: Int)
Result: CountMinSketch
Builds a Count-min Sketch over a specified column.
name of the column over which the sketch is built
depth of the sketch
width of the sketch
random seed
a CountMinSketch over column colName
2.0.0
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/DataFrameStatFunctions.html
Timestamp: 2020-10-19T01:56:24.659Z
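A minimal Scala sketch of the countMinSketch call; the session, column name, and sketch dimensions are assumptions chosen for illustration.

```scala
// Assumes an active SparkSession named `spark`.
import spark.implicits._

val words = Seq("a", "b", "a", "c", "a").toDF("word")

// depth and width control the accuracy/memory trade-off; the seed makes the sketch reproducible.
val cms = words.stat.countMinSketch("word", 3, 1000, 42)

cms.estimateCount("a")  // ~3; a Count-min Sketch may overestimate but never underestimates
```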
(cov dataframe col-name1 col-name2)
Params: (col1: String, col2: String)
Result: Double
Calculate the sample covariance of two numerical columns of a DataFrame.
the name of the first column
the name of the second column
the covariance of the two columns.
1.4.0
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/DataFrameStatFunctions.html
Timestamp: 2020-10-19T01:56:24.661Z
(cross-join left right)
Params: (right: Dataset[_])
Result: DataFrame
Explicit cartesian join with another DataFrame.
Right side of the join operation.
2.1.0
Cartesian joins are very expensive without an extra filter that can be pushed down.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html
Timestamp: 2020-10-19T01:56:20.770Z
(crosstab dataframe col-name1 col-name2)
Params: (col1: String, col2: String)
Result: DataFrame
Computes a pair-wise frequency table of the given columns. Also known as a contingency table. The number of distinct values for each column should be less than 1e4. At most 1e6 non-zero pair frequencies will be returned. The first column of each row will be the distinct values of col1 and the column names will be the distinct values of col2. The name of the first column will be col1_col2. Counts will be returned as Longs. Pairs that have no occurrences will have zero as their counts. Null elements will be replaced by "null", and back ticks will be dropped from elements if they exist.
The name of the first column. Distinct items will make the first item of each row.
The name of the second column. Distinct items will make the column names of the DataFrame.
A DataFrame containing the contingency table.
1.4.0
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/DataFrameStatFunctions.html
Timestamp: 2020-10-19T01:56:24.664Z
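A short Scala sketch of crosstab over two assumed columns (user, page), given an active spark session.

```scala
// Assumes an active SparkSession named `spark`.
import spark.implicits._

val visits = Seq(("alice", "home"), ("alice", "search"), ("bob", "home"))
  .toDF("user", "page")

// One row per distinct `user`, one column per distinct `page`; cells hold pair counts.
visits.stat.crosstab("user", "page").show()
```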
(cube dataframe & exprs)
Params: (cols: Column*)
Result: RelationalGroupedDataset
Create a multi-dimensional cube for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.
2.0.0
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html
Timestamp: 2020-10-19T01:56:20.778Z
(depth cms)
Params: ()
Result: Int
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/util/sketch/CountMinSketch.html
Timestamp: 2020-10-19T01:56:26.103Z
(describe dataframe & col-names)
Params: (cols: String*)
Result: DataFrame
Computes basic statistics for numeric and string columns, including count, mean, stddev, min, and max. If no columns are given, this function computes statistics for all numerical or string columns.
This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting Dataset. If you want to programmatically compute summary statistics, use the agg function instead.
Use summary for expanded statistics and control over which statistics to compute.
Columns to compute statistics on.
1.6.0
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html
Timestamp: 2020-10-19T01:56:20.780Z
(distinct dataframe)
Params: ()
Result: Dataset[T]
Returns a new Dataset that contains only the unique rows from this Dataset. This is an alias for dropDuplicates.
2.0.0
Equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom equals function defined on T.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html
Timestamp: 2020-10-19T01:56:20.781Z
(drop dataframe & col-names)
Params: (colName: String)
Result: DataFrame
Returns a new Dataset with a column dropped. This is a no-op if schema doesn't contain column name.
This method can only be used to drop top-level columns. The colName string is treated literally, without further interpretation.
2.0.0
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html
Timestamp: 2020-10-19T01:56:20.785Z
(drop-duplicates dataframe & col-names)
Params: ()
Result: Dataset[T]
Returns a new Dataset that contains only the unique rows from this Dataset. This is an alias for distinct.
For a static batch Dataset, it just drops duplicate rows. For a streaming Dataset, it will keep all data across triggers as intermediate state to drop duplicate rows. You can use withWatermark to limit how late the duplicate data can be, and the system will accordingly limit the state. In addition, data older than the watermark will be dropped to avoid any possibility of duplicates.
2.0.0
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html
Timestamp: 2020-10-19T01:56:20.791Z
(drop-na dataframe)
(drop-na dataframe min-non-nulls-or-cols)
(drop-na dataframe min-non-nulls cols)
Params: ()
Result: DataFrame
Returns a new DataFrame that drops rows containing any null or NaN values.
1.3.1
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/DataFrameNaFunctions.html
Timestamp: 2020-10-19T01:56:23.886Z
(dtypes dataframe)
Params:
Result: Array[(String, String)]
Returns all column names and their data types as an array.
1.6.0
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html
Timestamp: 2020-10-19T01:56:20.792Z
(empty? dataframe)
Params:
Result: Boolean
Returns true if the Dataset is empty.
2.4.0
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html
Timestamp: 2020-10-19T01:56:20.840Z
(estimate-count cms item)
Params: (item: Any)
Result: Long
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/util/sketch/CountMinSketch.html
Timestamp: 2020-10-19T01:56:26.104Z
(except dataframe other)
Params: (other: Dataset[T])
Result: Dataset[T]
Returns a new Dataset containing rows in this Dataset but not in another Dataset. This is equivalent to EXCEPT DISTINCT in SQL.
2.0.0
Equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom equals function defined on T.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html
Timestamp: 2020-10-19T01:56:20.796Z
(except-all dataframe other)
Params: (other: Dataset[T])
Result: Dataset[T]
Returns a new Dataset containing rows in this Dataset but not in another Dataset while preserving the duplicates. This is equivalent to EXCEPT ALL in SQL.
2.4.0
Equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom equals function defined on T. Also as standard in SQL, this function resolves columns by position (not by name).
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html
Timestamp: 2020-10-19T01:56:20.798Z
(expected-fpp bloom)
Params: ()
Result: Double
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/util/sketch/BloomFilter.html
Timestamp: 2020-10-19T01:56:25.739Z
(fill-na dataframe value)
(fill-na dataframe value cols)
Params: (value: Long)
Result: DataFrame
Returns a new DataFrame that replaces null or NaN values in numeric columns with value.
2.2.0
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/DataFrameNaFunctions.html
Timestamp: 2020-10-19T01:56:23.908Z
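A brief Scala sketch of the underlying na.fill overloads; the spark session and the sample data are assumptions.

```scala
// Assumes an active SparkSession named `spark`.
import spark.implicits._

val df = Seq((Some(1L), "a"), (None, "b")).toDF("n", "s")

// Replace null/NaN values in numeric columns with 0.
df.na.fill(0L).show()

// Restrict the replacement to specific columns (another documented overload).
df.na.fill(0L, Seq("n")).show()
```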
(first-vals dataframe)
Returns the vector values of the first row in the Dataset collected.
(freq-items dataframe col-names)
(freq-items dataframe col-names support)
Params: (cols: Array[String], support: Double)
Result: DataFrame
Finds frequent items for columns, possibly with false positives, using the frequent element count algorithm proposed by Karp, Schenker, and Papadimitriou. The support should be greater than 1e-4.
This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame.
the names of the columns to search frequent items in.
The minimum frequency for an item to be considered frequent. Should be greater than 1e-4.
A Local DataFrame with the Array of frequent items for each column.
1.4.0
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/DataFrameStatFunctions.html
Timestamp: 2020-10-19T01:56:24.676Z
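A small Scala sketch of freqItems with an assumed spark session and illustrative columns.

```scala
// Assumes an active SparkSession named `spark`.
import spark.implicits._

val df = Seq((1, "a"), (1, "b"), (1, "a"), (2, "a")).toDF("id", "tag")

// Items occurring in at least 40% of rows per column, possibly with false positives.
df.stat.freqItems(Array("id", "tag"), 0.4).show()
```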
(group-by dataframe & exprs)
Params: (cols: Column*)
Result: RelationalGroupedDataset
Groups the Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.
2.0.0
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html
Timestamp: 2020-10-19T01:56:20.827Z
(head dataframe)
(head dataframe n-rows)
Params: (n: Int)
Result: Array[T]
Returns the first n rows.
1.6.0
This method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver's memory.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html
Timestamp: 2020-10-19T01:56:20.834Z
(head-vals dataframe)
(head-vals dataframe n-rows)
Returns the vector values of the first n rows in the Dataset collected.
(hint dataframe hint-name & args)
Params: (name: String, parameters: Any*)
Result: Dataset[T]
Specifies some hint on the current Dataset. As an example, the sketch below marks one side of a join so that its plan can be broadcast.
2.2.0
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html
Timestamp: 2020-10-19T01:56:20.835Z
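A minimal Scala sketch of the broadcast hint mentioned above, assuming a spark session and two small illustrative DataFrames.

```scala
// Assumes an active SparkSession named `spark`.
import spark.implicits._

val df1 = Seq((1, "a"), (2, "b")).toDF("id", "x")
val df2 = Seq((1, "c"), (2, "d")).toDF("id", "y")

// Mark df2 as small enough to broadcast, so the join can use a broadcast-hash strategy.
df1.join(df2.hint("broadcast"), "id").show()
```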
(input-files dataframe)
Params:
Result: Array[String]
Returns a best-effort snapshot of the files that compose this Dataset. This method simply asks each constituent BaseRelation for its respective files and takes the union of all results. Depending on the source relations, this may not find all input files. Duplicates are removed.
2.0.0
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html
Timestamp: 2020-10-19T01:56:20.837Z
(intersect dataframe other)
Params: (other: Dataset[T])
Result: Dataset[T]
Returns a new Dataset containing rows only in both this Dataset and another Dataset. This is equivalent to INTERSECT in SQL.
1.6.0
Equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom equals function defined on T.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html
Timestamp: 2020-10-19T01:56:20.838Z
(intersect-all dataframe other)
Params: (other: Dataset[T])
Result: Dataset[T]
Returns a new Dataset containing rows only in both this Dataset and another Dataset while preserving the duplicates. This is equivalent to INTERSECT ALL in SQL.
2.4.0
Equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom equals function defined on T. Also as standard in SQL, this function resolves columns by position (not by name).
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html
Timestamp: 2020-10-19T01:56:20.839Z
(is-compatible bloom other)
Params: (other: BloomFilter)
Result: Boolean
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/util/sketch/BloomFilter.html
Timestamp: 2020-10-19T01:56:25.740Z
(is-empty dataframe)
Params:
Result: Boolean
Returns true if the Dataset is empty.
2.4.0
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html
Timestamp: 2020-10-19T01:56:20.840Z
(is-local dataframe)
Params:
Result: Boolean
Returns true if the collect and take methods can be run locally (without any Spark executors).
1.6.0
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html
Timestamp: 2020-10-19T01:56:20.843Z
(is-streaming dataframe)
Params:
Result: Boolean
Returns true if this Dataset contains one or more sources that continuously return data as it arrives. A Dataset that reads data from a streaming source must be executed as a StreamingQuery using the start() method in DataStreamWriter. Methods that return a single answer, e.g. count() or collect(), will throw an AnalysisException when there is a streaming source present.
2.0.0
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html
Timestamp: 2020-10-19T01:56:20.844Z
(join left right expr)
(join left right expr join-type)
Params: (right: Dataset[_])
Result: DataFrame
Join with another DataFrame.
Behaves as an INNER JOIN and requires a subsequent join predicate.
Right side of the join operation.
2.0.0
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html
Timestamp: 2020-10-19T01:56:20.856Z
(join-with left right condition)
(join-with left right condition join-type)
Params: (other: Dataset[U], condition: Column, joinType: String)
Result: Dataset[(T, U)]
Joins this Dataset returning a Tuple2 for each pair where condition evaluates to true.
This is similar to the relation join function with one important difference in the result schema. Since joinWith preserves objects present on either side of the join, the result schema is similarly nested into a tuple under the column names _1 and _2.
This type of join can be useful both for preserving type-safety with the original object types as well as working with relational data where either side of the join has column names in common.
Right side of the join.
Join expression.
Type of join to perform. Default inner. Must be one of: inner, cross, outer, full, fullouter, full_outer, left, leftouter, left_outer, right, rightouter, right_outer.
1.6.0
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html
Timestamp: 2020-10-19T01:56:20.860Z
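A short Scala sketch of joinWith showing the nested _1/_2 result schema; the spark session and the tuple Datasets are assumptions.

```scala
// Assumes an active SparkSession named `spark`.
import spark.implicits._

val left  = Seq((1, "a"), (2, "b")).toDS()   // Dataset[(Int, String)]
val right = Seq((1, 10.0), (3, 30.0)).toDS() // Dataset[(Int, Double)]

// Result schema is nested: column `_1` holds left tuples, `_2` holds right tuples.
val joined = left.joinWith(right, left("_1") === right("_1"), "inner")
joined.show()
```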
(last-vals dataframe)
Returns the vector values of the last row in the Dataset collected.
(limit dataframe n-rows)
Params: (n: Int)
Result: Dataset[T]
Returns a new Dataset by taking the first n rows. The difference between this function and head is that head is an action and returns an array (by triggering query execution) while limit returns a new Dataset.
2.0.0
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html
Timestamp: 2020-10-19T01:56:20.861Z
(local? dataframe)
Params:
Result: Boolean
Returns true if the collect and take methods can be run locally (without any Spark executors).
1.6.0
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html
Timestamp: 2020-10-19T01:56:20.843Z
(merge-in-place bloom-or-cms other)
Params: (other: BloomFilter)
Result: BloomFilter
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/util/sketch/BloomFilter.html
Timestamp: 2020-10-19T01:56:25.741Z
(might-contain bloom item)
Params: (item: Any)
Result: Boolean
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/util/sketch/BloomFilter.html
Timestamp: 2020-10-19T01:56:25.742Z
(order-by dataframe & exprs)
Params: (sortCol: String, sortCols: String*)
Result: Dataset[T]
Returns a new Dataset sorted by the given expressions. This is an alias of the sort function.
2.0.0
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html
Timestamp: 2020-10-19T01:56:20.884Z
(partitions dataframe)
Params:
Result: List[Partition]
Set of partitions in this RDD.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.891Z
(persist dataframe)
(persist dataframe new-level)
Params: ()
Result: Dataset.this.type
Persist this Dataset with the default storage level (MEMORY_AND_DISK).
1.6.0
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html
Timestamp: 2020-10-19T01:56:20.886Z
(pivot grouped expr)
(pivot grouped expr values)
Params: (pivotColumn: String)
Result: RelationalGroupedDataset
Pivots a column of the current DataFrame and performs the specified aggregation.
There are two versions of the pivot function: one that requires the caller to specify the list of distinct values to pivot on, and one that does not. The latter is more concise but less efficient, because Spark needs to first compute the list of distinct values internally.
Name of the column to pivot.
1.6.0
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/RelationalGroupedDataset.html
Timestamp: 2020-10-19T01:56:23.317Z
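A Scala sketch of both pivot variants described above, assuming a spark session and an illustrative (year, course, earnings) table.

```scala
// Assumes an active SparkSession named `spark`.
import spark.implicits._
import org.apache.spark.sql.functions.sum

val sales = Seq((2012, "java", 20000), (2012, "scala", 15000), (2013, "scala", 48000))
  .toDF("year", "course", "earnings")

// Variant that lets Spark discover the distinct `course` values to pivot on.
sales.groupBy("year").pivot("course").agg(sum("earnings")).show()

// Variant that specifies the pivot values up front (more efficient).
sales.groupBy("year").pivot("course", Seq("java", "scala")).agg(sum("earnings")).show()
```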
(print-schema dataframe)
Params: ()
Result: Unit
Prints the schema to the console in a nice tree format.
1.6.0
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html
Timestamp: 2020-10-19T01:56:20.888Z
(put bloom item)
Params: (item: Any)
Result: Boolean
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/util/sketch/BloomFilter.html
Timestamp: 2020-10-19T01:56:25.746Z
(random-split dataframe weights)
(random-split dataframe weights seed)
Params: (weights: Array[Double], seed: Long)
Result: Array[Dataset[T]]
Randomly splits this Dataset with the provided weights.
weights for splits, will be normalized if they don't sum to 1.
Seed for sampling. For Java API, use randomSplitAsList.
2.0.0
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html
Timestamp: 2020-10-19T01:56:20.892Z
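A minimal Scala sketch of randomSplit with an assumed spark session and illustrative data.

```scala
// Assumes an active SparkSession named `spark`.
import spark.implicits._

val df = (1 to 100).toDF("n")

// 80/20 split; the weights are normalized if they do not sum to 1, and the seed fixes the split.
val Array(train, test) = df.randomSplit(Array(0.8, 0.2), 42L)
```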
(rdd dataframe)
Params:
Result: RDD[T]
Represents the content of the Dataset as an RDD of T.
1.6.0
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html
Timestamp: 2020-10-19T01:56:20.894Z
(relative-error cms)
Params: ()
Result: Double
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/util/sketch/CountMinSketch.html
Timestamp: 2020-10-19T01:56:26.106Z
(rename-columns dataframe rename-map)
Returns a new Dataset with a column renamed according to the rename-map.
(repartition dataframe & args)
Params: (numPartitions: Int)
Result: Dataset[T]
Returns a new Dataset that has exactly numPartitions partitions.
1.6.0
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html
Timestamp: 2020-10-19T01:56:20.901Z
(repartition-by-range dataframe & args)
Params: (numPartitions: Int, partitionExprs: Column*)
Result: Dataset[T]
Returns a new Dataset partitioned by the given partitioning expressions into numPartitions. The resulting Dataset is range partitioned.
At least one partition-by expression must be specified. When no explicit sort order is specified, "ascending nulls first" is assumed. Note, the rows are not sorted in each partition of the resulting Dataset.
Note that due to performance reasons this method uses sampling to estimate the ranges. Hence, the output may not be consistent, since sampling can return different values. The sample size can be controlled by the config spark.sql.execution.rangeExchange.sampleSizePerPartition.
2.3.0
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html
Timestamp: 2020-10-19T01:56:20.904Z
(replace-na dataframe cols replacement)
Params: (col: String, replacement: Map[T, T])
Result: DataFrame
Replaces values matching keys in replacement map with the corresponding values.
name of the column to apply the value replacement. If col is "*", replacement is applied on all string, numeric or boolean columns.
value replacement map. Key and value of replacement map must have the same type, and can only be doubles, strings or booleans. The map value can have nulls.
1.3.1
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/DataFrameNaFunctions.html
Timestamp: 2020-10-19T01:56:23.927Z
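A brief Scala sketch of na.replace, including the "*" form, against an assumed spark session and sample data.

```scala
// Assumes an active SparkSession named `spark`.
import spark.implicits._

val df = Seq(("alice", 180.0), ("bob", -1.0)).toDF("name", "height")

// Replace the sentinel value -1.0 with NaN in the `height` column.
df.na.replace("height", Map(-1.0 -> Double.NaN)).show()

// "*" applies the replacement to all string, numeric or boolean columns.
df.na.replace("*", Map("alice" -> "Alice")).show()
```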
(rollup dataframe & exprs)
Params: (cols: Column*)
Result: RelationalGroupedDataset
Create a multi-dimensional rollup for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.
2.0.0
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html
Timestamp: 2020-10-19T01:56:20.907Z
(sample dataframe fraction)
(sample dataframe fraction with-replacement)
Params: (fraction: Double, seed: Long)
Result: Dataset[T]
Returns a new Dataset by sampling a fraction of rows (without replacement), using a user-supplied seed.
Fraction of rows to generate, range [0.0, 1.0].
Seed for sampling.
2.3.0
This is NOT guaranteed to provide exactly the fraction of the count of the given Dataset.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html
Timestamp: 2020-10-19T01:56:20.913Z
(sample-by dataframe expr fractions seed)
Params: (col: String, fractions: Map[T, Double], seed: Long)
Result: DataFrame
Returns a stratified sample without replacement based on the fraction given on each stratum.
stratum type
column that defines strata
sampling fraction for each stratum. If a stratum is not specified, we treat its fraction as zero.
random seed
a new DataFrame that represents the stratified sample
1.5.0
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/DataFrameStatFunctions.html
Timestamp: 2020-10-19T01:56:24.694Z
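A small Scala sketch of sampleBy with an assumed spark session and an illustrative key column.

```scala
// Assumes an active SparkSession named `spark`.
import spark.implicits._

val df = Seq((0, "a"), (0, "b"), (1, "c"), (1, "d"), (2, "e")).toDF("key", "value")

// Keep ~10% of rows with key 0 and ~20% with key 1; unlisted strata (key 2) get fraction 0.
df.stat.sampleBy("key", Map(0 -> 0.1, 1 -> 0.2), 36L).show()
```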
(select dataframe & exprs)
Params: (cols: Column*)
Result: DataFrame
Selects a set of column-based expressions.
2.0.0
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html
Timestamp: 2020-10-19T01:56:20.931Z
(select-expr dataframe & exprs)
Params: (exprs: String*)
Result: DataFrame
Selects a set of SQL expressions. This is a variant of select that accepts SQL expressions.
2.0.0
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html
Timestamp: 2020-10-19T01:56:20.933Z
(show dataframe)
(show dataframe options)
Params: (numRows: Int)
Result: Unit
Displays the Dataset in a tabular form. Strings longer than 20 characters will be truncated, and all cells will be aligned right; see the sketch below for an example.
Number of rows to show
1.6.0
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html
Timestamp: 2020-10-19T01:56:20.945Z
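A minimal Scala sketch of show with an assumed spark session and sample rows.

```scala
// Assumes an active SparkSession named `spark`.
import spark.implicits._

val df = Seq(("alice", 30), ("bob", 25), ("carol", 41)).toDF("name", "age")

df.show(2)  // prints the first 2 rows as a right-aligned ASCII table on the driver's stdout
```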
(show-vertical dataframe)
(show-vertical dataframe options)
Displays the Dataset in a list-of-records form.
(sort dataframe & exprs)
Params: (sortCol: String, sortCols: String*)
Result: Dataset[T]
Returns a new Dataset sorted by the given expressions. This is an alias of the sort function.
2.0.0
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html
Timestamp: 2020-10-19T01:56:20.884Z
(sort-within-partitions dataframe & exprs)
Params: (sortCol: String, sortCols: String*)
Result: Dataset[T]
Returns a new Dataset with each partition sorted by the given expressions.
This is the same operation as "SORT BY" in SQL (Hive QL).
2.0.0
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html
Timestamp: 2020-10-19T01:56:20.950Z
(spark-session dataframe)
Params:
Result: SparkSession
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html
Timestamp: 2020-10-19T01:56:20.951Z
(sql-context dataframe)
Params:
Result: SQLContext
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html
Timestamp: 2020-10-19T01:56:20.952Z
(storage-level dataframe)
Params:
Result: StorageLevel
Get the Dataset's current storage level, or StorageLevel.NONE if not persisted.
2.1.0
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html
Timestamp: 2020-10-19T01:56:20.954Z
(streaming? dataframe)
Params:
Result: Boolean
Returns true if this Dataset contains one or more sources that continuously return data as it arrives. A Dataset that reads data from a streaming source must be executed as a StreamingQuery using the start() method in DataStreamWriter. Methods that return a single answer, e.g. count() or collect(), will throw an AnalysisException when there is a streaming source present.
2.0.0
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html
Timestamp: 2020-10-19T01:56:20.844Z
(summary dataframe & stat-names)
Params: (statistics: String*)
Result: DataFrame
Computes specified statistics for numeric and string columns. Available statistics are: count, mean, stddev, min, max, and arbitrary approximate percentiles specified as a percentage (e.g. 75%).
If no statistics are given, this function computes count, mean, stddev, min, approximate quartiles (percentiles at 25%, 50%, and 75%), and max.
This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting Dataset. If you want to programmatically compute summary statistics, use the agg function instead.
To do a summary for specific columns, first select them; see the sketch below.
See also describe for basic statistics.
Statistics from the above list to be computed.
2.3.0
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html
Timestamp: 2020-10-19T01:56:20.957Z
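A short Scala sketch of summary, including the select-then-summary pattern mentioned above; the spark session and the columns are assumptions.

```scala
// Assumes an active SparkSession named `spark`.
import spark.implicits._

val ds = Seq(("alice", 30, 180.0), ("bob", 25, 175.5)).toDF("name", "age", "height")

// Default statistics for all columns.
ds.summary().show()

// Specific statistics for specific columns: select the columns first.
ds.select("age", "height").summary("count", "min", "25%", "75%", "max").show()
```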
(tail dataframe n-rows)
Params: (n: Int)
Result: Array[T]
Returns the last n rows in the Dataset.
Running tail requires moving data into the application's driver process, and doing so with a very large n can crash the driver process with OutOfMemoryError.
3.0.0
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html
Timestamp: 2020-10-19T01:56:20.959Z
(tail-vals dataframe n-rows)
Returns the values of the last n rows in the Dataset, collected to the driver as vectors.
(take dataframe n-rows)
Params: (n: Int)
Result: Array[T]
Returns the first n rows in the Dataset.
Running take requires moving data into the application's driver process, and doing so with a very large n can crash the driver process with OutOfMemoryError.
1.6.0
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html
Timestamp: 2020-10-19T01:56:20.961Z
(take-vals dataframe n-rows)
Returns the values of the first n rows in the Dataset, collected to the driver as vectors.
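A short sketch contrasting the row-returning and value-returning variants, assuming df is an existing DataFrame; all four move data into the driver process, so keep n small:

  (take df 3)       ;; first 3 rows as Spark Row objects
  (take-vals df 3)  ;; first 3 rows as vectors of plain values
  (tail df 3)       ;; last 3 rows as Spark Row objects
  (tail-vals df 3)  ;; last 3 rows as vectors of plain values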
(to-byte-array cms)
Params: ()
Result: Array[Byte]
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/util/sketch/CountMinSketch.html
Timestamp: 2020-10-19T01:56:26.107Z
(total-count cms)
Params: ()
Result: Long
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/util/sketch/CountMinSketch.html
Timestamp: 2020-10-19T01:56:26.108Z
(union & dataframes)
Params: (other: Dataset[T])
Result: Dataset[T]
Returns a new Dataset containing the union of rows in this Dataset and another Dataset.
This is equivalent to UNION ALL in SQL. To do a SQL-style set union (that does deduplication of elements), use this function followed by a distinct.
As is standard in SQL, this function resolves columns by position (not by name); see the sketch after the union-by-name entry below.
Notice that the column positions in the schema aren't necessarily matched with the fields in the strongly typed objects in a Dataset. This function resolves columns by their positions in the schema, not the fields in the strongly typed objects. Use unionByName to resolve columns by field name in the typed objects.
2.0.0
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html
Timestamp: 2020-10-19T01:56:20.974Z
(union-by-name & dataframes)
Params: (other: Dataset[T])
Result: Dataset[T]
Returns a new Dataset containing the union of rows in this Dataset and another Dataset.
This is different from both UNION ALL and UNION DISTINCT in SQL. To do a SQL-style set union (that does deduplication of elements), use this function followed by a distinct.
The difference between this function and union is that this function resolves columns by name (not by position), as the sketch below illustrates.
2.3.0
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html
Timestamp: 2020-10-19T01:56:20.978Z
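A sketch of the positional versus by-name behaviour. Assumptions: df-ab and df-ba are hypothetical DataFrames with columns [a b] and [b a] respectively, and distinct is this library's deduplication function (documented outside this section):

  (union df-ab df-ba)            ;; resolves by position: df-ba's b values end up under column a
  (union-by-name df-ab df-ba)    ;; resolves by name: a aligns with a, b with b
  (distinct (union df-ab df-ba)) ;; SQL-style UNION DISTINCT: union followed by distinct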
(unpersist dataframe)
(unpersist dataframe blocking)
Params: (blocking: Boolean)
Result: Dataset.this.type
Mark the Dataset as non-persistent, and remove all blocks for it from memory and disk. This will not un-persist any cached data that is built upon this Dataset.
Whether to block until all blocks are deleted.
1.6.0
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html
Timestamp: 2020-10-19T01:56:20.980Z
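Usage is straightforward; the sketch below assumes df was previously cached:

  (unpersist df)       ;; drop df's cached blocks without waiting
  (unpersist df true)  ;; block until all blocks have been deleted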
(width cms)
Params: ()
Result: Int
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/util/sketch/CountMinSketch.html
Timestamp: 2020-10-19T01:56:26.108Z
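A small round-trip sketch for the CountMinSketch helpers above. Since this section does not show a wrapper constructor, the sketch object is built directly through Spark's CountMinSketch.create static method and items are added via interop; the eps, confidence, and seed values are arbitrary:

  (let [cms (org.apache.spark.util.sketch.CountMinSketch/create 0.001 0.99 (int 42))]
    (.add cms "clicks")             ;; count one occurrence of "clicks"
    (.add cms "clicks" 3)           ;; count three more occurrences
    {:total (total-count cms)       ;; => 4, total number of items added so far
     :width (width cms)             ;; number of counters per hash row
     :bytes (to-byte-array cms)})   ;; serialised sketch, e.g. for storage or broadcast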
(with-column dataframe col-name expr)
Params: (colName: String, col: Column)
Result: DataFrame
Returns a new Dataset by adding a column or replacing the existing column that has the same name.
The column's expression must only refer to attributes supplied by this Dataset. It is an error to add a column that refers to some other Dataset.
2.0.0
This method introduces a projection internally. Therefore, calling it multiple times, for instance via loops in order to add multiple columns, can generate big plans which can cause performance issues and even a StackOverflowException. To avoid this, use select with the multiple columns at once.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html
Timestamp: 2020-10-19T01:56:20.987Z
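A minimal sketch, assuming df is an existing DataFrame, "source" is a hypothetical column name, and lit is this library's literal-Column constructor (documented outside this section):

  (with-column df "source" (lit "2020-export"))  ;; add or replace the "source" column
  ;; chaining many with-column calls grows the query plan; when adding several
  ;; columns, prefer a single select that lists every column you need at once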
(with-column-renamed dataframe old-name new-name)
Params: (existingName: String, newName: String)
Result: DataFrame
Returns a new Dataset with a column renamed. This is a no-op if schema doesn't contain existingName.
2.0.0
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html
Timestamp: 2020-10-19T01:56:20.988Z
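For example (column names are hypothetical):

  (with-column-renamed df "age" "age-years")  ;; rename age to age-years
  (with-column-renamed df "missing" "x")      ;; no-op: the schema has no column called "missing"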