
zero-one.geni.core.dataset


addclj

(add cms item)
(add cms item cnt)

Params: (item: Any)

Result: Unit

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/util/sketch/CountMinSketch.html

Timestamp: 2020-10-19T01:56:26.095Z


aggclj

(agg dataframe & args)

Params: (aggExpr: (String, String), aggExprs: (String, String)*)

Result: DataFrame

(Scala-specific) Aggregates on the entire Dataset without groups.

2.0.0

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html

Timestamp: 2020-10-19T01:56:20.739Z

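A minimal usage sketch. Assumptions: the library is required as (require '[zero-one.geni.core :as g]), which in Geni's README examples re-exports these dataset functions together with column/aggregate helpers such as g/min and g/max; agg is assumed to accept a map from new column names to aggregate expressions; df and the column names are illustrative. Later sketches on this page reuse the same alias.

(g/agg df {:min-price (g/min :price)
           :max-price (g/max :price)})
;; => a DataFrame with a single row holding both aggregates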

agg-allclj

(agg-all dataframe agg-fn)

Aggregates on all columns of the entire Dataset without groups.


approx-quantileclj

(approx-quantile dataframe col-or-cols probs rel-error)

Params: (col: String, probabilities: Array[Double], relativeError: Double)

Result: Array[Double]

Calculates the approximate quantiles of a numerical column of a DataFrame.

The result of this algorithm has the following deterministic bound: if the DataFrame has N elements and we request the quantile at probability p up to error err, then the algorithm will return a sample x from the DataFrame so that the exact rank of x is close to (p * N). More precisely, floor((p - err) * N) <= rank(x) <= ceil((p + err) * N).

This method implements a variation of the Greenwald-Khanna algorithm (with some speed optimizations). The algorithm was first presented in Space-efficient Online Computation of Quantile Summaries by Greenwald and Khanna.

col: the name of the numerical column.

probabilities: a list of quantile probabilities. Each number must belong to [0, 1]. For example, 0 is the minimum, 0.5 is the median, 1 is the maximum.

relativeError: the relative target precision to achieve (greater than or equal to 0). If set to zero, the exact quantiles are computed, which could be very expensive. Note that values greater than 1 are accepted but give the same result as 1.

Returns: the approximate quantiles at the given probabilities.

2.0.0

null and NaN values will be removed from the numerical column before calculation. If the dataframe is empty or the column only contains null or NaN, an empty array is returned.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/DataFrameStatFunctions.html

Timestamp: 2020-10-19T01:56:24.640Z

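A usage sketch following the arglist above (g and df as in the earlier sketch; :price and the probabilities are illustrative):

(g/approx-quantile df :price [0.25 0.5 0.75] 0.05)
;; => a vector of three doubles: the approximate 25th, 50th and 75th percentiles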

bit-sizeclj

(bit-size bloom)

Params: ()

Result: Long

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/util/sketch/BloomFilter.html

Timestamp: 2020-10-19T01:56:25.738Z


bloom-filterclj

(bloom-filter dataframe expr expected-num-items num-bits-or-fpp)

Params: (colName: String, expectedNumItems: Long, fpp: Double)

Result: BloomFilter

Builds a Bloom filter over a specified column.

colName: name of the column over which the filter is built.

expectedNumItems: expected number of items which will be put into the filter.

fpp: expected false positive probability of the filter.

2.0.0

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/DataFrameStatFunctions.html

Timestamp: 2020-10-19T01:56:24.647Z

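A sketch that builds a filter and probes it with might-contain (documented further down this page). Assumptions: passing a double as num-bits-or-fpp selects the false-positive-probability variant; :id and the literal values are illustrative.

(def id-filter (g/bloom-filter df :id 10000 0.01))
(g/might-contain id-filter 1234)
;; => true or false; false positives are possible, false negatives are not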

cacheclj

(cache dataframe)

Params: ()

Result: Dataset.this.type

Persist this Dataset with the default storage level (MEMORY_AND_DISK).

1.6.0

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html

Timestamp: 2020-10-19T01:56:20.750Z


checkpointclj

(checkpoint dataframe)
(checkpoint dataframe eager)

Params: ()

Result: Dataset[T]

Eagerly checkpoint a Dataset and return the new Dataset. Checkpointing can be used to truncate the logical plan of this Dataset, which is especially useful in iterative algorithms where the plan may grow exponentially. It will be saved to files inside the checkpoint directory set with SparkContext#setCheckpointDir.

2.1.0

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html

Timestamp: 2020-10-19T01:56:20.752Z


col-regexclj

(col-regex dataframe col-name)

Params: (colName: String)

Result: Column

Selects column based on the column name specified as a regex and returns it as Column.

2.3.0

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html

Timestamp: 2020-10-19T01:56:20.758Z


collectclj

(collect dataframe)

Params: ()

Result: Array[T]

Returns an array that contains all rows in this Dataset.

Running collect requires moving all the data into the application's driver process, and doing so on a very large dataset can crash the driver process with OutOfMemoryError.

For Java API, use collectAsList.

1.6.0

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html

Timestamp: 2020-10-19T01:56:20.759Z


collect-colclj

(collect-col dataframe col-name)

Returns a vector that contains all rows in the column of the Dataset.


collect-valsclj

(collect-vals dataframe)

Returns the vector values of the Dataset collected.


column-namesclj

(column-names dataframe)

Returns all column names as an array of strings.


columnsclj

(columns dataframe)

Returns all column names as an array of keywords.


compatible?clj

(compatible? bloom other)

Params: (other: BloomFilter)

Result: Boolean

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/util/sketch/BloomFilter.html

Timestamp: 2020-10-19T01:56:25.740Z


confidenceclj

(confidence cms)

Params: ()

Result: Double

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/util/sketch/CountMinSketch.html

Timestamp: 2020-10-19T01:56:26.102Z


count-min-sketchclj

(count-min-sketch dataframe expr eps-or-depth confidence-or-width seed)

Params: (colName: String, depth: Int, width: Int, seed: Int)

Result: CountMinSketch

Builds a Count-min Sketch over a specified column.

colName: name of the column over which the sketch is built.

depth: depth of the sketch.

width: width of the sketch.

seed: random seed.

Returns: a CountMinSketch over column colName.

2.0.0

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/DataFrameStatFunctions.html

Timestamp: 2020-10-19T01:56:24.659Z

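A sketch that builds a sketch and queries it with estimate-count (documented further down this page). Assumptions: passing doubles as eps-or-depth and confidence-or-width selects the eps/confidence variant; the column and item are illustrative.

(def cms (g/count-min-sketch df :product-id 0.01 0.95 42))
(g/estimate-count cms "SKU-1")
;; => approximate number of occurrences; a count-min sketch may overestimate but never underestimates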

covclj

(cov dataframe col-name1 col-name2)

Params: (col1: String, col2: String)

Result: Double

Calculate the sample covariance of two numerical columns of a DataFrame.

col1: the name of the first column.

col2: the name of the second column.

Returns: the covariance of the two columns.

1.4.0

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/DataFrameStatFunctions.html

Timestamp: 2020-10-19T01:56:24.661Z


cross-joinclj

(cross-join left right)

Params: (right: Dataset[_])

Result: DataFrame

Explicit cartesian join with another DataFrame.

Right side of the join operation.

2.1.0

Cartesian joins are very expensive without an extra filter that can be pushed down.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html

Timestamp: 2020-10-19T01:56:20.770Z


crosstabclj

(crosstab dataframe col-name1 col-name2)

Params: (col1: String, col2: String)

Result: DataFrame

Computes a pair-wise frequency table of the given columns. Also known as a contingency table. The number of distinct values for each column should be less than 1e4. At most 1e6 non-zero pair frequencies will be returned. The first column of each row will be the distinct values of col1 and the column names will be the distinct values of col2. The name of the first column will be col1_col2. Counts will be returned as Longs. Pairs that have no occurrences will have zero as their counts. Null elements will be replaced by "null", and back ticks will be dropped from elements if they exist.

col1: the name of the first column. Distinct items will make the first item of each row.

col2: the name of the second column. Distinct items will make the column names of the DataFrame.

Returns: a DataFrame containing the contingency table.

1.4.0

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/DataFrameStatFunctions.html

Timestamp: 2020-10-19T01:56:24.664Z

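An illustrative sketch (column names assumed):

(g/show (g/crosstab df :payment-type :country))
;; the first column is named payment-type_country; the remaining columns are the distinct values of :country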

cubeclj

(cube dataframe & exprs)

Params: (cols: Column*)

Result: RelationalGroupedDataset

Create a multi-dimensional cube for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.

2.0.0

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html

Timestamp: 2020-10-19T01:56:20.778Z


depthclj

(depth cms)

Params: ()

Result: Int

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/util/sketch/CountMinSketch.html

Timestamp: 2020-10-19T01:56:26.103Z


describeclj

(describe dataframe & col-names)

Params: (cols: String*)

Result: DataFrame

Computes basic statistics for numeric and string columns, including count, mean, stddev, min, and max. If no columns are given, this function computes statistics for all numerical or string columns.

This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting Dataset. If you want to programmatically compute summary statistics, use the agg function instead.

Use summary for expanded statistics and control over which statistics to compute.

Columns to compute statistics on.

1.6.0

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html

Timestamp: 2020-10-19T01:56:20.780Z


distinctclj

(distinct dataframe)

Params: ()

Result: Dataset[T]

Returns a new Dataset that contains only the unique rows from this Dataset. This is an alias for dropDuplicates.

2.0.0

Equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom equals function defined on T.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html

Timestamp: 2020-10-19T01:56:20.781Z


dropclj

(drop dataframe & col-names)

Params: (colName: String)

Result: DataFrame

Returns a new Dataset with a column dropped. This is a no-op if the schema doesn't contain the column name.

This method can only be used to drop top-level columns. The colName string is treated literally, without further interpretation.

2.0.0

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html

Timestamp: 2020-10-19T01:56:20.785Z


drop-duplicatesclj

(drop-duplicates dataframe & col-names)

Params: ()

Result: Dataset[T]

Returns a new Dataset that contains only the unique rows from this Dataset. This is an alias for distinct.

For a static batch Dataset, it just drops duplicate rows. For a streaming Dataset, it will keep all data across triggers as intermediate state to drop duplicates rows. You can use withWatermark to limit how late the duplicate data can be and system will accordingly limit the state. In addition, too late data older than watermark will be dropped to avoid any possibility of duplicates.

2.0.0

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html

Timestamp: 2020-10-19T01:56:20.791Z


drop-naclj

(drop-na dataframe)
(drop-na dataframe min-non-nulls-or-cols)
(drop-na dataframe min-non-nulls cols)

Params: ()

Result: DataFrame

Returns a new DataFrame that drops rows containing any null or NaN values.

1.3.1

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/DataFrameNaFunctions.html

Timestamp: 2020-10-19T01:56:23.886Z


dtypesclj

(dtypes dataframe)

Params:

Result: Array[(String, String)]

Returns all column names and their data types as an array.

1.6.0

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html

Timestamp: 2020-10-19T01:56:20.792Z


empty?clj

(empty? dataframe)

Params:

Result: Boolean

Returns true if the Dataset is empty.

2.4.0

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html

Timestamp: 2020-10-19T01:56:20.840Z


estimate-countclj

(estimate-count cms item)

Params: (item: Any)

Result: Long

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/util/sketch/CountMinSketch.html

Timestamp: 2020-10-19T01:56:26.104Z


exceptclj

(except dataframe other)

Params: (other: Dataset[T])

Result: Dataset[T]

Returns a new Dataset containing rows in this Dataset but not in another Dataset. This is equivalent to EXCEPT DISTINCT in SQL.

2.0.0

Equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom equals function defined on T.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html

Timestamp: 2020-10-19T01:56:20.796Z


except-allclj

(except-all dataframe other)

Params: (other: Dataset[T])

Result: Dataset[T]

Returns a new Dataset containing rows in this Dataset but not in another Dataset while preserving the duplicates. This is equivalent to EXCEPT ALL in SQL.

2.4.0

Equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom equals function defined on T. Also as standard in SQL, this function resolves columns by position (not by name).

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html

Timestamp: 2020-10-19T01:56:20.798Z


expected-fppclj

(expected-fpp bloom)

Params: ()

Result: Double

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/util/sketch/BloomFilter.html

Timestamp: 2020-10-19T01:56:25.739Z


fill-naclj

(fill-na dataframe value)
(fill-na dataframe value cols)

Params: (value: Long)

Result: DataFrame

Returns a new DataFrame that replaces null or NaN values in numeric columns with value.

2.2.0

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/DataFrameNaFunctions.html

Timestamp: 2020-10-19T01:56:23.908Z

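A sketch of both arities (values and column names are illustrative; the cols argument is assumed to be a sequence of column names):

(g/fill-na df 0)            ;; replace null/NaN in all numeric columns with 0
(g/fill-na df 0.0 [:price]) ;; restrict the replacement to the :price column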

first-valsclj

(first-vals dataframe)

Returns the vector values of the first row in the Dataset collected.


freq-itemsclj

(freq-items dataframe col-names)
(freq-items dataframe col-names support)

Params: (cols: Array[String], support: Double)

Result: DataFrame

Finds frequent items for columns, possibly with false positives, using the frequent element count algorithm proposed by Karp, Schenker, and Papadimitriou. The support should be greater than 1e-4.

This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame.

cols: the names of the columns to search frequent items in.

support: the minimum frequency for an item to be considered frequent. Should be greater than 1e-4.

Returns: a local DataFrame with the Array of frequent items for each column.

1.4.0

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/DataFrameStatFunctions.html

Timestamp: 2020-10-19T01:56:24.676Z

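A sketch of both arities (col-names is assumed to be a vector of column keywords; the column name and support are illustrative):

(g/freq-items df [:payment-type])     ;; default support
(g/freq-items df [:payment-type] 0.1) ;; only items appearing in roughly 10% of rows or more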

group-byclj

(group-by dataframe & exprs)

Params: (cols: Column*)

Result: RelationalGroupedDataset

Groups the Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.

2.0.0

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html

Timestamp: 2020-10-19T01:56:20.827Z

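A typical group-then-aggregate pipeline, assuming agg also accepts the RelationalGroupedDataset returned here along with a map of aggregate expressions, and that g/mean and g/max are available from the top-level namespace (names illustrative):

(-> df
    (g/group-by :suburb)
    (g/agg {:avg-price (g/mean :price)
            :max-rooms (g/max :rooms)})
    (g/show))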

headclj

(head dataframe)
(head dataframe n-rows)

Params: (n: Int)

Result: Array[T]

Returns the first n rows.

1.6.0

this method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver's memory.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html

Timestamp: 2020-10-19T01:56:20.834Z


head-valsclj

(head-vals dataframe)
(head-vals dataframe n-rows)

Returns the vector values of the first n rows in the Dataset collected.


hintclj

(hint dataframe hint-name & args)

Params: (name: String, parameters: Any*)

Result: Dataset[T]

Specifies some hint on the current Dataset. For example, a "broadcast" hint indicates that one side of a join plan can be broadcast.

2.2.0

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html

Timestamp: 2020-10-19T01:56:20.835Z


input-filesclj

(input-files dataframe)

Params:

Result: Array[String]

Returns a best-effort snapshot of the files that compose this Dataset. This method simply asks each constituent BaseRelation for its respective files and takes the union of all results. Depending on the source relations, this may not find all input files. Duplicates are removed.

2.0.0

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html

Timestamp: 2020-10-19T01:56:20.837Z


intersectclj

(intersect dataframe other)

Params: (other: Dataset[T])

Result: Dataset[T]

Returns a new Dataset containing rows only in both this Dataset and another Dataset. This is equivalent to INTERSECT in SQL.

1.6.0

Equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom equals function defined on T.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html

Timestamp: 2020-10-19T01:56:20.838Z


intersect-allclj

(intersect-all dataframe other)

Params: (other: Dataset[T])

Result: Dataset[T]

Returns a new Dataset containing rows only in both this Dataset and another Dataset while preserving the duplicates. This is equivalent to INTERSECT ALL in SQL.

2.4.0

Equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom equals function defined on T. Also as standard in SQL, this function resolves columns by position (not by name).

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html

Timestamp: 2020-10-19T01:56:20.839Z


is-compatibleclj

(is-compatible bloom other)

Params: (other: BloomFilter)

Result: Boolean

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/util/sketch/BloomFilter.html

Timestamp: 2020-10-19T01:56:25.740Z


is-emptyclj

(is-empty dataframe)

Params:

Result: Boolean

Returns true if the Dataset is empty.

2.4.0

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html

Timestamp: 2020-10-19T01:56:20.840Z


is-localclj

(is-local dataframe)

Params:

Result: Boolean

Returns true if the collect and take methods can be run locally (without any Spark executors).

1.6.0

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html

Timestamp: 2020-10-19T01:56:20.843Z


is-streamingclj

(is-streaming dataframe)

Params:

Result: Boolean

Returns true if this Dataset contains one or more sources that continuously return data as it arrives. A Dataset that reads data from a streaming source must be executed as a StreamingQuery using the start() method in DataStreamWriter. Methods that return a single answer, e.g. count() or collect(), will throw an AnalysisException when there is a streaming source present.

2.0.0

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html

Timestamp: 2020-10-19T01:56:20.844Z


joinclj

(join left right expr)
(join left right expr join-type)

Params: (right: Dataset[_])

Result: DataFrame

Join with another DataFrame.

Behaves as an INNER JOIN and requires a subsequent join predicate.

Right side of the join operation.

2.0.0

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html

Timestamp: 2020-10-19T01:56:20.856Z

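A sketch of the two arities, assuming expr may simply be a shared column name as in Geni's examples (DataFrames and column names are illustrative):

(g/join orders customers :customer-id)        ;; inner join on :customer-id
(g/join orders customers :customer-id "left") ;; left outer join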

join-withclj

(join-with left right condition)
(join-with left right condition join-type)

Params: (other: Dataset[U], condition: Column, joinType: String)

Result: Dataset[(T, U)]

Joins this Dataset returning a Tuple2 for each pair where condition evaluates to true.

This is similar to the relation join function with one important difference in the result schema. Since joinWith preserves objects present on either side of the join, the result schema is similarly nested into a tuple under the column names _1 and _2.

This type of join can be useful both for preserving type-safety with the original object types as well as working with relational data where either side of the join has column names in common.

other: right side of the join.

condition: join expression.

joinType: type of join to perform. Default inner. Must be one of: inner, cross, outer, full, fullouter, full_outer, left, leftouter, left_outer, right, rightouter, right_outer.

1.6.0

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html

Timestamp: 2020-10-19T01:56:20.860Z


last-valsclj

(last-vals dataframe)

Returns the vector values of the last row in the Dataset collected.


limitclj

(limit dataframe n-rows)

Params: (n: Int)

Result: Dataset[T]

Returns a new Dataset by taking the first n rows. The difference between this function and head is that head is an action and returns an array (by triggering query execution) while limit returns a new Dataset.

2.0.0

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html

Timestamp: 2020-10-19T01:56:20.861Z


local?clj

(local? dataframe)

Params:

Result: Boolean

Returns true if the collect and take methods can be run locally (without any Spark executors).

1.6.0

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html

Timestamp: 2020-10-19T01:56:20.843Z


merge-in-placeclj

(merge-in-place bloom-or-cms other)

Params: (other: BloomFilter)

Result: BloomFilter

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/util/sketch/BloomFilter.html

Timestamp: 2020-10-19T01:56:25.741Z


might-containclj

(might-contain bloom item)

Params: (item: Any)

Result: Boolean

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/util/sketch/BloomFilter.html

Timestamp: 2020-10-19T01:56:25.742Z


order-byclj

(order-by dataframe & exprs)

Params: (sortCol: String, sortCols: String*)

Result: Dataset[T]

Returns a new Dataset sorted by the given expressions. This is an alias of the sort function.

2.0.0

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html

Timestamp: 2020-10-19T01:56:20.884Z


partitionsclj

(partitions dataframe)

Params:

Result: List[Partition]

Set of partitions in this RDD.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html

Timestamp: 2020-10-19T01:56:48.891Z


persistclj

(persist dataframe)
(persist dataframe new-level)

Params: ()

Result: Dataset.this.type

Persist this Dataset with the default storage level (MEMORY_AND_DISK).

1.6.0

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html

Timestamp: 2020-10-19T01:56:20.886Z


pivotclj

(pivot grouped expr)
(pivot grouped expr values)

Params: (pivotColumn: String)

Result: RelationalGroupedDataset

Pivots a column of the current DataFrame and performs the specified aggregation.

There are two versions of pivot function: one that requires the caller to specify the list of distinct values to pivot on, and one that does not. The latter is more concise but less efficient, because Spark needs to first compute the list of distinct values internally.

Name of the column to pivot.

1.6.0

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/RelationalGroupedDataset.html

Timestamp: 2020-10-19T01:56:23.317Z

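A sketch of pivoting a grouped dataset, assuming agg can be chained after pivot and g/sum is available as an aggregate helper (names illustrative):

(-> df
    (g/group-by :year)
    (g/pivot :quarter)               ;; or (g/pivot :quarter [1 2 3 4]) to fix the pivot values
    (g/agg {:total (g/sum :amount)})
    (g/show))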

print-schemaclj

(print-schema dataframe)

Params: ()

Result: Unit

Prints the schema to the console in a nice tree format.

1.6.0

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html

Timestamp: 2020-10-19T01:56:20.888Z


putclj

(put bloom item)

Params: (item: Any)

Result: Boolean

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/util/sketch/BloomFilter.html

Timestamp: 2020-10-19T01:56:25.746Z


random-splitclj

(random-split dataframe weights)
(random-split dataframe weights seed)

Params: (weights: Array[Double], seed: Long)

Result: Array[Dataset[T]]

Randomly splits this Dataset with the provided weights.

weights: weights for splits, will be normalized if they don't sum to 1.

seed: seed for sampling. For Java API, use randomSplitAsList.

2.0.0

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html

Timestamp: 2020-10-19T01:56:20.892Z

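A sketch using destructuring (weights and seed illustrative; g/count is assumed to return the row count of a DataFrame):

(let [[train test] (g/random-split df [0.8 0.2] 42)]
  [(g/count train) (g/count test)])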

rddclj

(rdd dataframe)

Params:

Result: RDD[T]

Represents the content of the Dataset as an RDD of T.

1.6.0

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html

Timestamp: 2020-10-19T01:56:20.894Z


relative-errorclj

(relative-error cms)

Params: ()

Result: Double

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/util/sketch/CountMinSketch.html

Timestamp: 2020-10-19T01:56:26.106Z


rename-columnsclj

(rename-columns dataframe rename-map)

Returns a new Dataset with columns renamed according to the rename-map.

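An illustrative sketch (old and new column names assumed):

(g/rename-columns df {:SellerG :seller
                      :Regionname :region})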

repartitionclj

(repartition dataframe & args)

Params: (numPartitions: Int)

Result: Dataset[T]

Returns a new Dataset that has exactly numPartitions partitions.

1.6.0

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html

Timestamp: 2020-10-19T01:56:20.901Z


repartition-by-rangeclj

(repartition-by-range dataframe & args)

Params: (numPartitions: Int, partitionExprs: Column*)

Result: Dataset[T]

Returns a new Dataset partitioned by the given partitioning expressions into numPartitions. The resulting Dataset is range partitioned.

At least one partition-by expression must be specified. When no explicit sort order is specified, "ascending nulls first" is assumed. Note, the rows are not sorted in each partition of the resulting Dataset.

Note that due to performance reasons this method uses sampling to estimate the ranges. Hence, the output may not be consistent, since sampling can return different values. The sample size can be controlled by the config spark.sql.execution.rangeExchange.sampleSizePerPartition.

2.3.0

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html

Timestamp: 2020-10-19T01:56:20.904Z


replace-naclj

(replace-na dataframe cols replacement)

Params: (col: String, replacement: Map[T, T])

Result: DataFrame

Replaces values matching keys in replacement map with the corresponding values.

col: name of the column on which to apply the value replacement. If col is "*", replacement is applied to all string, numeric or boolean columns.

replacement: value replacement map. Key and value of the replacement map must have the same type, and can only be doubles, strings or booleans. The map value can have nulls.

1.3.1

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/DataFrameNaFunctions.html

Timestamp: 2020-10-19T01:56:23.927Z


rollupclj

(rollup dataframe & exprs)

Params: (cols: Column*)

Result: RelationalGroupedDataset

Create a multi-dimensional rollup for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.

2.0.0

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html

Timestamp: 2020-10-19T01:56:20.907Z


sampleclj

(sample dataframe fraction)
(sample dataframe fraction with-replacement)

Params: (fraction: Double, seed: Long)

Result: Dataset[T]

Returns a new Dataset by sampling a fraction of rows (without replacement), using a user-supplied seed.

fraction: fraction of rows to generate, range [0.0, 1.0].

seed: seed for sampling.

2.3.0

This is NOT guaranteed to provide exactly the fraction of the count of the given Dataset.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html

Timestamp: 2020-10-19T01:56:20.913Z


sample-byclj

(sample-by dataframe expr fractions seed)

Params: (col: String, fractions: Map[T, Double], seed: Long)

Result: DataFrame

Returns a stratified sample without replacement based on the fraction given on each stratum.

T: stratum type.

col: column that defines strata.

fractions: sampling fraction for each stratum. If a stratum is not specified, we treat its fraction as zero.

seed: random seed.

Returns: a new DataFrame that represents the stratified sample.

1.5.0

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/DataFrameStatFunctions.html

Timestamp: 2020-10-19T01:56:24.694Z

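A sketch of a stratified sample, with fractions keyed by stratum value (column name and fractions illustrative):

(g/sample-by df :label {0 0.1, 1 0.5} 42)
;; keeps roughly 10% of rows where :label is 0 and 50% where :label is 1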

selectclj

(select dataframe & exprs)

Params: (cols: Column*)

Result: DataFrame

Selects a set of column based expressions.

2.0.0

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html

Timestamp: 2020-10-19T01:56:20.931Z

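An illustrative sketch; per the Params above, column expressions are also accepted (column names assumed):

(g/select df :price :rooms)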

select-exprclj

(select-expr dataframe & exprs)

Params: (exprs: String*)

Result: DataFrame

Selects a set of SQL expressions. This is a variant of select that accepts SQL expressions.

2.0.0

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html

Timestamp: 2020-10-19T01:56:20.933Z


showclj

(show dataframe)
(show dataframe options)

Params: (numRows: Int)

Result: Unit

Displays the Dataset in a tabular form. Strings longer than 20 characters will be truncated, and all cells will be right-aligned.

Number of rows to show

1.6.0

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html

Timestamp: 2020-10-19T01:56:20.945Z

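A small pipeline sketch combining order-by, limit, and show (all documented on this page), assuming g/desc builds a descending sort expression (column name illustrative):

(-> df
    (g/order-by (g/desc :price))
    (g/limit 5)
    (g/show))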

show-verticalclj

(show-vertical dataframe)
(show-vertical dataframe options)

Displays the Dataset in a list-of-records form.


sortclj

(sort dataframe & exprs)

Params: (sortCol: String, sortCols: String*)

Result: Dataset[T]

Returns a new Dataset sorted by the given expressions. order-by is an alias of this function.

2.0.0

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html

Timestamp: 2020-10-19T01:56:20.884Z


sort-within-partitionsclj

(sort-within-partitions dataframe & exprs)

Params: (sortCol: String, sortCols: String*)

Result: Dataset[T]

Returns a new Dataset with each partition sorted by the given expressions.

This is the same operation as "SORT BY" in SQL (Hive QL).

2.0.0

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html

Timestamp: 2020-10-19T01:56:20.950Z
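
For comparison with sort above, a per-partition sort (no global ordering), under the same assumptions:

(require '[zero-one.geni.core :as g])

;; Sort rows inside each partition only, like SQL's SORT BY.
(g/sort-within-partitions df :age)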

sourceraw docstring

spark-sessionclj

(spark-session dataframe)

Params:

Result: SparkSession

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html

Timestamp: 2020-10-19T01:56:20.951Z

sourceraw docstring

sql-contextclj

(sql-context dataframe)

Params:

Result: SQLContext

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html

Timestamp: 2020-10-19T01:56:20.952Z

sourceraw docstring

storage-levelclj

(storage-level dataframe)

Params:

Result: StorageLevel

Get the Dataset's current storage level, or StorageLevel.NONE if not persisted.

2.1.0

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html

Timestamp: 2020-10-19T01:56:20.954Z

sourceraw docstring

streaming?clj

(streaming? dataframe)

Params:

Result: Boolean

Returns true if this Dataset contains one or more sources that continuously return data as it arrives. A Dataset that reads data from a streaming source must be executed as a StreamingQuery using the start() method in DataStreamWriter. Methods that return a single answer, e.g. count() or collect(), will throw an AnalysisException when there is a streaming source present.

2.0.0

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html

Timestamp: 2020-10-19T01:56:20.844Z
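
A quick sketch of the read-only accessors documented above (spark-session, sql-context, storage-level) together with streaming?, assuming the g alias:

(require '[zero-one.geni.core :as g])

(g/spark-session df) ;; the SparkSession that owns df
(g/sql-context df)   ;; the legacy SQLContext
(g/storage-level df) ;; StorageLevel.NONE unless df has been persisted
(g/streaming? df)    ;; false for a plain batch dataframe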

sourceraw docstring

summaryclj

(summary dataframe & stat-names)

Params: (statistics: String*)

Result: DataFrame

Computes specified statistics for numeric and string columns. Available statistics are: count, mean, stddev, min, max, and arbitrary approximate percentiles specified as a percentage (e.g. 75%).

If no statistics are given, this function computes count, mean, stddev, min, approximate quartiles (percentiles at 25%, 50%, and 75%), and max.

This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting Dataset. If you want to programmatically compute summary statistics, use the agg function instead.

To do a summary for specific columns, first select them, as in the sketch below.

See also describe for basic statistics.

Statistics from the list above to be computed.

2.3.0

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html

Timestamp: 2020-10-19T01:56:20.957Z
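
A sketch, assuming the g alias; statistic names are passed as strings, and selecting columns first narrows the summary as noted above:

(require '[zero-one.geni.core :as g])

;; Default statistics: count, mean, stddev, min, 25%, 50%, 75%, max.
(g/show (g/summary df))

;; Chosen statistics for a single column.
(g/show (g/summary (g/select df :age) "count" "min" "75%" "max"))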

sourceraw docstring

tailclj

(tail dataframe n-rows)

Params: (n: Int)

Result: Array[T]

Returns the last n rows in the Dataset.

Running tail requires moving data into the application's driver process, and doing so with a very large n can crash the driver process with OutOfMemoryError.

3.0.0

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html

Timestamp: 2020-10-19T01:56:20.959Z
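
A sketch, assuming the g alias; mind the driver-memory caveat above for large n:

(require '[zero-one.geni.core :as g])

;; Collect the last 3 rows to the driver as Spark Row objects.
(g/tail df 3)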

sourceraw docstring

tail-valsclj

(tail-vals dataframe n-rows)

Collects the last n rows of the Dataset and returns their values as vectors.

sourceraw docstring

takeclj

(take dataframe n-rows)

Params: (n: Int)

Result: Array[T]

Returns the first n rows in the Dataset.

Running take requires moving data into the application's driver process, and doing so with a very large n can crash the driver process with OutOfMemoryError.

1.6.0

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html

Timestamp: 2020-10-19T01:56:20.961Z

sourceraw docstring

take-valsclj

(take-vals dataframe n-rows)

Collects the first n rows of the Dataset and returns their values as vectors.
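
A sketch contrasting take with take-vals, assuming the g alias:

(require '[zero-one.geni.core :as g])

;; First 2 rows as Spark Row objects.
(g/take df 2)

;; First 2 rows as plain Clojure vectors of column values.
(g/take-vals df 2)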

sourceraw docstring

to-byte-arrayclj

(to-byte-array cms)

Params: ()

Result: Array[Byte]

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/util/sketch/CountMinSketch.html

Timestamp: 2020-10-19T01:56:26.107Z

sourceraw docstring

total-countclj

(total-count cms)

Params: ()

Result: Long

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/util/sketch/CountMinSketch.html

Timestamp: 2020-10-19T01:56:26.108Z

sourceraw docstring

unionclj

(union & dataframes)

Params: (other: Dataset[T])

Result: Dataset[T]

Returns a new Dataset containing union of rows in this Dataset and another Dataset.

This is equivalent to UNION ALL in SQL. To do a SQL-style set union (that does deduplication of elements), use this function followed by a distinct.

As is standard in SQL, this function resolves columns by position (not by name).

Notice that the column positions in the schema aren't necessarily matched with the fields in the strongly typed objects in a Dataset. This function resolves columns by their positions in the schema, not the fields in the strongly typed objects. Use unionByName to resolve columns by field name in the typed objects.

2.0.0

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html

Timestamp: 2020-10-19T01:56:20.974Z
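
A sketch, assuming the g alias and two dataframes df1 and df2 whose columns are in the same order; g/distinct is a separate dataset function assumed to be available for the deduplicating variant:

(require '[zero-one.geni.core :as g])

;; UNION ALL semantics: rows are combined by position, duplicates kept.
(g/union df1 df2)

;; SQL-style UNION (deduplicated), assuming g/distinct.
(g/distinct (g/union df1 df2))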

sourceraw docstring

union-by-nameclj

(union-by-name & dataframes)

Params: (other: Dataset[T])

Result: Dataset[T]

Returns a new Dataset containing union of rows in this Dataset and another Dataset.

This is different from both UNION ALL and UNION DISTINCT in SQL. To do a SQL-style set union (that does deduplication of elements), use this function followed by a distinct.

The difference between this function and union is that this function resolves columns by name (not by position).

2.3.0

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html

Timestamp: 2020-10-19T01:56:20.978Z
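
A sketch of why union-by-name exists, assuming the g alias and two dataframes with the same column names in different orders:

(require '[zero-one.geni.core :as g])

;; df1 has columns [:a :b]; df2 has the same columns ordered [:b :a].
;; union would pair the columns up by position; union-by-name matches them by name.
(g/union-by-name df1 df2)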

sourceraw docstring

unpersistclj

(unpersist dataframe)
(unpersist dataframe blocking)

Params: (blocking: Boolean)

Result: Dataset.this.type

Mark the Dataset as non-persistent, and remove all blocks for it from memory and disk. This will not un-persist any cached data that is built upon this Dataset.

Whether to block until all blocks are deleted.

1.6.0

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html

Timestamp: 2020-10-19T01:56:20.980Z
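
A sketch, assuming the g alias and that df was previously cached or persisted:

(require '[zero-one.geni.core :as g])

;; Drop df's cached blocks without waiting.
(g/unpersist df)

;; Or block until all blocks are actually deleted.
(g/unpersist df true)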

sourceraw docstring

widthclj

(width cms)

Params: ()

Result: Int

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/util/sketch/CountMinSketch.html

Timestamp: 2020-10-19T01:56:26.108Z

sourceraw docstring

with-columnclj

(with-column dataframe col-name expr)

Params: (colName: String, col: Column)

Result: DataFrame

Returns a new Dataset by adding a column or replacing the existing column that has the same name.

The column's expression must only refer to attributes supplied by this Dataset; it is an error to add a column that refers to some other Dataset.

2.0.0

This method introduces a projection internally. Therefore, calling it multiple times, for instance via loops to add multiple columns, can generate big plans, which can cause performance issues and even a StackOverflowException. To avoid this, use select with the multiple columns at once.

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html

Timestamp: 2020-10-19T01:56:20.987Z
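
A sketch, assuming the g alias, a df with :price and :tax columns, and the g/lit and g/+ column helpers from outside this namespace; per the note above, prefer a single select when adding many columns at once:

(require '[zero-one.geni.core :as g])

;; Add (or replace) a constant column, assuming g/lit.
(g/with-column df :source (g/lit "2020-export"))

;; Add a derived column, assuming the g/+ column arithmetic helper.
(g/with-column df :total (g/+ :price :tax))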

sourceraw docstring

with-column-renamedclj

(with-column-renamed dataframe old-name new-name)

Params: (existingName: String, newName: String)

Result: DataFrame

Returns a new Dataset with a column renamed. This is a no-op if the schema doesn't contain existingName.

2.0.0

Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/sql/Dataset.html

Timestamp: 2020-10-19T01:56:20.988Z
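
A sketch, assuming the g alias and that keyword column names are coerced as usual in Geni:

(require '[zero-one.geni.core :as g])

;; Rename :age to :age-years; a no-op if df has no :age column.
(g/with-column-renamed df :age :age-years)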

sourceraw docstring
