This namespace provides the main API for writing Spark tasks. Most operations in this namespace place the RDD last in the argument list, just like Clojure collection functions. This lets you compose them using the thread-last macro (`->>`), making it simple to migrate existing Clojure code.
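For example, a pipeline composed with `->>` reads much like ordinary Clojure sequence code. This is a hypothetical sketch: it assumes this namespace is aliased as `spark` and that `numbers` is an existing RDD of longs.

```clojure
;; Thread-last composition over an RDD, mirroring Clojure collection code.
(->> numbers
     (spark/filter odd?)    ; keep only odd elements
     (spark/map #(* % %))   ; square each one
     (spark/collect))       ; realize the results on the driver
```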
(aggregate aggregator combiner zero rdd)
Aggregate the elements of each partition in `rdd` using `aggregator`, then merge the results for all partitions using `combiner`. Both functions will be seeded with the neutral `zero` value. This is an action that causes computation.
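As a sketch of the calling convention (assuming this namespace is aliased as `spark` and `numbers` is an RDD of longs), here is an aggregate that computes a running sum and count in one pass, using a `[sum count]` pair as the zero value:

```clojure
;; aggregator folds one element into a partition's accumulator;
;; combiner merges the accumulators from two partitions.
(spark/aggregate
  (fn aggregator [[sum cnt] x] [(+ sum x) (inc cnt)])
  (fn combiner [[s1 c1] [s2 c2]] [(+ s1 s2) (+ c1 c2)])
  [0 0]
  numbers)
```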
(broadcast spark-context value)
Broadcast a read-only variable to the cluster, returning a reference for reading it in distributed functions. The variable data will be sent to each cluster only once. The returned broadcast value can be resolved with `deref` or the `@` reader macro.
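A minimal sketch of the pattern, assuming `ctx` is a SparkContext, `ids` is an RDD of keys, and this namespace is aliased as `spark`:

```clojure
;; Broadcast a lookup table once, then deref it inside a distributed
;; function instead of capturing the full map in every task closure.
(let [lookup (spark/broadcast ctx {"a" 1, "b" 2})]
  (spark/map (fn [k] (get @lookup k)) ids))
```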
(cartesian rdd1 rdd2)
Construct an RDD representing the cartesian product of two RDDs. Returns a new pair RDD containing all combinations of elements between the datasets.
(cogroup rdd1 rdd2)
(cogroup rdd1 rdd2 rdd3)
(cogroup rdd1 rdd2 rdd3 rdd4)
Produce a new RDD containing an element for each key `k` in the given pair RDDs mapped to a tuple of the values from all RDDs as lists. If the input RDDs have types `(K, A)`, `(K, B)`, and `(K, C)`, the grouped RDD will have type `(K, (list(A), list(B), list(C)))`.
(cogroup-partitioned rdd1 rdd2 partitions)
(cogroup-partitioned rdd1 rdd2 rdd3 partitions)
(cogroup-partitioned rdd1 rdd2 rdd3 rdd4 partitions)
Produce a new RDD containing an element for each key `k` in the given pair RDDs mapped to a tuple of the values from all RDDs as lists. The resulting RDD partitions may be controlled by setting `partitions` to an integer number or a `Partitioner` instance. If the input RDDs have types `(K, A)`, `(K, B)`, and `(K, C)`, the grouped RDD will have type `(K, (List(A), List(B), List(C)))`.
(collect rdd)
Collect the elements of `rdd` into a vector on the driver. Be careful not to realize large datasets with this, as the driver will likely run out of memory. This is an action that causes computation.
(combine-by-key seq-fn conj-fn merge-fn rdd)
(combine-by-key seq-fn conj-fn merge-fn num-partitions rdd)
Combine the elements for each key using a set of aggregation functions. If `rdd` contains pairs of `(K, V)`, the resulting RDD will contain pairs of type `(K, C)`. Callers must provide three functions:

- `seq-fn` which turns a V into a C (for example, `vector`)
- `conj-fn` to add a V to a C (for example, `conj`)
- `merge-fn` to combine two C's into a single result
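Using the example functions named above, a sketch (assuming `pairs` is a pair RDD and this namespace is aliased as `spark`) that collects all values per key into vectors:

```clojure
(spark/combine-by-key
  vector   ; seq-fn: start a new collection from the first value seen
  conj     ; conj-fn: add another value from the same partition
  into     ; merge-fn: combine collections from different partitions
  pairs)
```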
(count rdd)
Count the number of elements in `rdd`. This is an action that causes computation.
(count-by-key rdd)
Count the distinct key values in `rdd`. Returns a map of keys to integer counts. This is an action that causes computation.
(count-by-value rdd)
Count the distinct values in `rdd`. Returns a map of values to integer counts. This is an action that causes computation.
(distinct rdd)
(distinct num-partitions rdd)
Construct an RDD containing only a single copy of each distinct element in `rdd`. Optionally accepts a number of partitions to size the resulting RDD with.
(filter f rdd)
Filter the elements of `rdd` to the ones which satisfy the predicate `f`.
(first rdd)
Find the first element of `rdd`. This is an action that causes computation.
(fold f zero rdd)
Aggregate the elements of each partition in `rdd`, followed by the results for all the partitions, by using the given associative function `f` and a neutral `zero` value. This is an action that causes computation.
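A minimal sketch, assuming `numbers` is an RDD of longs. Because the same zero seeds every partition, it must be a true identity for `f`:

```clojure
;; Sum all elements; 0 is the identity for +.
(spark/fold + 0 numbers)
```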
(foreach f rdd)
Apply the function `f` to all elements of `rdd`. The function will run on the executors where the data resides. Consider `foreach-partition` for efficiency if handling an element requires costly resource acquisition such as a database connection. This is an action that causes computation.
(foreach-partition f rdd)
Apply the function `f` to all elements of `rdd` by calling it with a sequence of each partition's elements. The function will run on the executors where the data resides. This is an action that causes computation.
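A sketch of the costly-resource pattern this function enables. `open-conn!` and `write-row!` are hypothetical placeholders for caller-supplied functions; the point is that one connection is opened per partition rather than per element:

```clojure
(spark/foreach-partition
  (fn [elements]
    ;; one connection for the whole partition
    (with-open [conn (open-conn!)]
      (doseq [e elements]
        (write-row! conn e))))
  rdd)
```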
(full-outer-join rdd1 rdd2)
(full-outer-join rdd1 rdd2 partitions)
Perform a full outer join of `rdd1` and `rdd2`. For each element `(k, v)` in `rdd1`, the resulting RDD will either contain all pairs `(k, (Some(v), Some(w)))` for `(k, w)` in `rdd2`, or the pair `(k, (Some(v), None))` if no elements in `rdd2` have key `k`. Similarly, for each element `(k, w)` in `rdd2`, the resulting RDD will either contain all pairs `(k, (Some(v), Some(w)))` for `v` in `rdd1`, or the pair `(k, (None, Some(w)))` if no elements in `rdd1` have key `k`. Hash-partitions the resulting RDD using the existing partitioner/parallelism level unless `partitions` is provided as an integer number or a partitioner instance.
(group-by f rdd)
(group-by f num-partitions rdd)
Group the elements of `rdd` using a key function `f`. Returns a pair RDD with each generated key and all matching elements as a value sequence.
(group-by-key rdd)
(group-by-key num-partitions rdd)
Group the entries in the pair `rdd` by key. Returns a new pair RDD with one entry per key, containing all of the matching values as a sequence.
(intersection rdd1 rdd2)
Construct an RDD representing the intersection of elements which are in both RDDs.
(into coll rdd)
(into coll xf rdd)
Collect the elements of `rdd` into a collection on the driver. Behaves like `clojure.core/into`, including accepting an optional transducer. Automatically coerces Scala tuples into Clojure vectors. Be careful not to realize large datasets with this, as the driver will likely run out of memory. This is an action that causes computation.
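A sketch of the transducer arity, assuming `numbers` is a small RDD of longs and this namespace is aliased as `spark`:

```clojure
;; Collect into a set, applying a transducer on the way in,
;; exactly as clojure.core/into would over a local sequence.
(spark/into #{} (comp (filter even?) (map inc)) numbers)
```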
(join rdd1 rdd2)
(join rdd1 rdd2 partitions)
Construct an RDD containing all pairs of elements with matching keys in `rdd1` and `rdd2`. Each pair of elements will be returned as a tuple of `(k, (v, w))`, where `(k, v)` is in `rdd1` and `(k, w)` is in `rdd2`. Performs a hash join across the cluster. Optionally, `partitions` may be provided as an integer number or a partitioner instance to control the partitioning of the resulting RDD.
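A sketch, assuming `names` and `ages` are hypothetical pair RDDs keyed by user id:

```clojure
;; An element for id 1 would look like (1, ("alice", 37)).
(spark/join names ages)
```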
(key-by f rdd)
Creates pairs from the elements in `rdd` by using `f` to compute a key for each value.
(keys rdd)
Transform `rdd` by replacing each pair with its key. Returns a new RDD representing the keys.
(left-outer-join rdd1 rdd2)
(left-outer-join rdd1 rdd2 partitions)
Perform a left outer join of `rdd1` and `rdd2`. For each element `(k, v)` in `rdd1`, the resulting RDD will either contain all pairs `(k, (v, Some(w)))` for `(k, w)` in `rdd2`, or the pair `(k, (v, None))` if no elements in `rdd2` have key `k`. Hash-partitions the resulting RDD using the existing partitioner/parallelism level unless `partitions` is provided as an integer number or a partitioner instance.
(lookup rdd k)
Find all values in the `rdd` pairs whose key is `k`. The key must be serializable with the Java serializer (not Kryo) for this to work. This is an action that causes computation.
(map f rdd)
Map the function `f` over each element of `rdd`. Returns a new RDD representing the transformed elements.
(map->pairs f rdd)
Map the function `f` over each element of `rdd`. Returns a new pair RDD representing the transformed elements.
(map-partitions f rdd)
(map-partitions f preserve-partitioning? rdd)
Map the function `f` over each partition in `rdd`, producing a sequence of results. Returns an RDD representing the concatenation of all the partition results. The function will be called with an iterator of the elements of each partition.
(map-partitions->pairs f rdd)
(map-partitions->pairs f preserve-partitioning? rdd)
Map the function `f` over each partition in `rdd`, producing a sequence of key-value pairs. The function will be called with an iterator of the elements of the partition.
(map-partitions-indexed f rdd)
Map the function `f` over each partition in `rdd`, producing a sequence of results. Returns an RDD representing the concatenation of all the partition results. The function will be called with the partition index and an iterator of the elements of each partition.
(map-vals f rdd)
Map the function `f` over each value of the pairs in `rdd`. Returns a new pair RDD representing the transformed pairs.
(mapcat f rdd)
Map the function `f` over each element in `rdd` to produce a sequence of results. Returns an RDD representing the concatenation of all element results.
(mapcat->pairs f rdd)
Map the function `f` over each element in `rdd` to produce a sequence of key-value pairs. Returns a new pair RDD representing the concatenation of all result pairs.
(mapcat-vals f rdd)
Map the function `f` over each value of the pairs in `rdd` to produce a collection of values. Returns a new pair RDD representing the concatenated keys and values.
(max rdd)
(max compare-fn rdd)
Find the maximum element in `rdd` in the ordering defined by `compare-fn`. This is an action that causes computation.
(min rdd)
(min compare-fn rdd)
Find the minimum element in `rdd` in the ordering defined by `compare-fn`. This is an action that causes computation.
(reduce f rdd)
Aggregate the elements of `rdd` using the function `f`. The reducing function must accept two arguments and should be commutative and associative so that it can be computed correctly in parallel. This is an action that causes computation.
(reduce-by-key f rdd)
Aggregate the pairs of `rdd` which share a key by combining all of the values with the reducing function `f`. Returns a new pair RDD with one entry per unique key, holding the aggregated values.
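The classic word-count pipeline is a natural fit here. A sketch, assuming `lines` is an RDD of strings and this namespace is aliased as `spark`:

```clojure
(require '[clojure.string :as str])

(->> lines
     (spark/mapcat #(str/split % #"\s+"))    ; split each line into words
     (spark/map->pairs (fn [word] [word 1])) ; pair each word with a count of 1
     (spark/reduce-by-key +))                ; sum the counts per word
```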
(right-outer-join rdd1 rdd2)
(right-outer-join rdd1 rdd2 partitions)
Perform a right outer join of `rdd1` and `rdd2`. For each element `(k, w)` in `rdd2`, the resulting RDD will either contain all pairs `(k, (Some(v), w))` for `(k, v)` in `rdd1`, or the pair `(k, (None, w))` if no elements in `rdd1` have key `k`. Hash-partitions the resulting RDD using the existing partitioner/parallelism level unless `partitions` is provided as an integer number or a partitioner instance.
(sample fraction rdd)
(sample fraction replacement? rdd)
(sample fraction replacement? seed rdd)
Generate a randomly sampled subset of `rdd` with roughly `fraction` of the original elements. Callers can optionally select whether the sample happens with replacement, and a random seed to control the sample.
(sort-by-key rdd)
(sort-by-key ascending? rdd)
(sort-by-key compare-fn ascending? rdd)
(sort-by-key compare-fn ascending? num-partitions rdd)
Reorder the elements of `rdd` so that they are sorted according to their natural order or the given comparator `compare-fn` if provided. The result may be ordered ascending or descending, depending on `ascending?`.
(subtract rdd1 rdd2)
Remove all elements from `rdd1` that are present in `rdd2`.
(subtract-by-key rdd1 rdd2)
Construct an RDD representing all pairs in `rdd1` for which there is no pair with a matching key in `rdd2`.
(take n rdd)
Take the first `n` elements of the RDD. This currently scans the partitions _one by one_ on the **driver**, so it will be slow if a lot of elements are required. In that case, use `collect` to get the whole RDD instead. This is an action that causes computation.
(take-ordered n rdd)
(take-ordered n compare-fn rdd)
Take the first `n` (smallest) elements from this RDD as defined by the elements' natural order or specified comparator. This currently scans the partitions _one by one_ on the **driver**, so it will be slow if a lot of elements are required. In that case, use `collect` to get the whole RDD instead. This is an action that causes computation.
(union rdd1 rdd2)
(union rdd1 rdd2 & rdds)
Construct a union of the elements in the provided RDDs. Any identical elements will appear multiple times.
(vals rdd)
Transform `rdd` by replacing each pair with its value. Returns a new RDD representing the values.
(zip-indexed rdd)
Zip the elements in `rdd` with their indices. Returns a new pair RDD with the element/index tuples. The ordering is first based on the partition index and then the ordering of items within each partition. So the first item in the first partition gets index 0, and the last item in the last partition receives the largest index. This method needs to trigger a spark job when `rdd` contains more than one partition.
(zip-unique-ids rdd)
Zip the elements in `rdd` with unique long identifiers. Returns a new pair RDD with the element/id tuples. Items in the kth partition will get ids `k`, `n+k`, `2*n+k`, ..., where `n` is the number of partitions. So the ids won't be sequential and there may be gaps, but this method _won't_ trigger a spark job, unlike `zip-indexed`.
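The id scheme can be illustrated in plain Clojure, independent of Spark. With `n` partitions, the item at offset `i` within partition `k` receives id `k + i*n`:

```clojure
(defn unique-id
  "Compute the id assigned to the item at offset i in partition k,
  given n total partitions."
  [n k i]
  (+ k (* i n)))

;; With 3 partitions, partition 1's items get ids 1, 4, 7, ...
(map #(unique-id 3 1 %) (range 3))
;; => (1 4 7)
```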