Liking cljdoc? Tell your friends :D

sparkplug.core

This namespace provides the main API for writing Spark tasks.

Most operations in this namespace place the RDD last in the argument list, just like Clojure collection functions. This lets you compose them using the thread-last macro (->>), making it simple to migrate existing Clojure code.

This namespace provides the main API for writing Spark tasks.

Most operations in this namespace place the RDD last in the argument list,
just like Clojure collection functions. This lets you compose them using the
thread-last macro (`->>`), making it simple to migrate existing Clojure
code.
raw docstring

aggregateclj

(aggregate aggregator combiner zero rdd)

Aggregate the elements of each partition in rdd using aggregator, then merge the results for all partitions using combiner. Both functions will be seeded with the neutral zero value.

This is an action that causes computation.

Aggregate the elements of each partition in `rdd` using `aggregator`, then
merge the results for all partitions using `combiner`. Both functions will be
seeded with the neutral `zero` value.

This is an action that causes computation.
sourceraw docstring

aggregate-by-keyclj

(aggregate-by-key aggregator combiner zero rdd)
(aggregate-by-key aggregator combiner zero partitioner-or-num-partitions rdd)

When called on an RDD of (K, V) pairs, returns an RDD of (K, U) pairs where the values for each key are aggregated using the given 2-arg aggregator function, 2-arg combiner function, and a neutral zero value. Allows an aggregated value type that is different than the input value type, while avoiding unnecessary allocations. The number of reduce tasks is configurable by optionally passing a number of partitions or a partitioner.

When called on an RDD of (K, V) pairs, returns an RDD of (K, U) pairs where
the values for each key are aggregated using the given 2-arg aggregator
function, 2-arg combiner function, and a neutral zero value. Allows an
aggregated value type that is different than the input value type, while
avoiding unnecessary allocations. The number of reduce tasks is configurable
by optionally passing a number of partitions or a partitioner.
sourceraw docstring

broadcastclj

(broadcast spark-context value)

Broadcast a read-only variable to the cluster, returning a reference for reading it in distributed functions. The variable data will be sent to each cluster only once.

The returned broadcast value can be resolved with deref or the @ reader macro.

Broadcast a read-only variable to the cluster, returning a reference for
reading it in distributed functions. The variable data will be sent to each
cluster only once.

The returned broadcast value can be resolved with `deref` or the `@` reader
macro.
sourceraw docstring

cartesianclj

(cartesian rdd1 rdd2)

Construct an RDD representing the cartesian product of two RDDs. Returns a new pair RDD containing all combinations of elements between the datasets.

Construct an RDD representing the cartesian product of two RDDs. Returns a
new pair RDD containing all combinations of elements between the datasets.
sourceraw docstring

cogroupclj

(cogroup rdd1 rdd2)
(cogroup rdd1 rdd2 rdd3)
(cogroup rdd1 rdd2 rdd3 rdd4)

Produe a new RDD containing an element for each key k in the given pair RDDs mapped to a tuple of the values from all RDDs as lists.

If the input RDDs have types (K, A), (K, B), and (K, C), the grouped RDD will have type (K, (list(A), list(B), list(C))).

Produe a new RDD containing an element for each key `k` in the given pair
RDDs mapped to a tuple of the values from all RDDs as lists.

If the input RDDs have types `(K, A)`, `(K, B)`, and `(K, C)`, the grouped
RDD will have type `(K, (list(A), list(B), list(C)))`.
sourceraw docstring

cogroup-partitionedclj

(cogroup-partitioned rdd1 rdd2 partitions)
(cogroup-partitioned rdd1 rdd2 rdd3 partitions)
(cogroup-partitioned rdd1 rdd2 rdd3 rdd4 partitions)

Produe a new RDD containing an element for each key k in the given pair RDDs mapped to a tuple of the values from all RDDs as lists. The resulting RDD partitions may be controlled by setting partitions to an integer number or a Partitioner instance.

If the input RDDs have types (K, A), (K, B), and (K, C), the grouped RDD will have type (K, (List(A), List(B), List(C))).

Produe a new RDD containing an element for each key `k` in the given pair
RDDs mapped to a tuple of the values from all RDDs as lists. The resulting
RDD partitions may be controlled by setting `partitions` to an integer number
or a `Partitioner` instance.

If the input RDDs have types `(K, A)`, `(K, B)`, and `(K, C)`, the grouped
RDD will have type `(K, (List(A), List(B), List(C)))`.
sourceraw docstring

collectclj

(collect rdd)

Collect the elements of rdd into a vector on the driver. Be careful not to realize large datasets with this, as the driver will likely run out of memory.

This is an action that causes computation.

Collect the elements of `rdd` into a vector on the driver. Be careful not to
realize large datasets with this, as the driver will likely run out of
memory.

This is an action that causes computation.
sourceraw docstring

combine-by-keyclj

(combine-by-key seq-fn conj-fn merge-fn rdd)
(combine-by-key seq-fn conj-fn merge-fn num-partitions rdd)

Combine the elements for each key using a set of aggregation functions.

If rdd contains pairs of (K, V), the resulting RDD will contain pairs of type (K, C). Callers must provide three functions:

  • seq-fn which turns a V into a C (for example, vector)
  • conj-fn to add a V to a C (for example, conj)
  • merge-fn to combine two C's into a single result
Combine the elements for each key using a set of aggregation functions.

If `rdd` contains pairs of `(K, V)`, the resulting RDD will contain pairs of
type `(K, C)`. Callers must provide three functions:
- `seq-fn` which turns a V into a C (for example, `vector`)
- `conj-fn` to add a V to a C (for example, `conj`)
- `merge-fn` to combine two C's into a single result
sourceraw docstring

countclj

(count rdd)

Count the number of elements in rdd.

This is an action that causes computation.

Count the number of elements in `rdd`.

This is an action that causes computation.
sourceraw docstring

count-by-keyclj

(count-by-key rdd)

Count the distinct key values in rdd. Returns a map of keys to integer counts.

This is an action that causes computation.

Count the distinct key values in `rdd`. Returns a map of keys to integer
counts.

This is an action that causes computation.
sourceraw docstring

count-by-valueclj

(count-by-value rdd)

Count the distinct values in rdd. Returns a map of values to integer counts.

This is an action that causes computation.

Count the distinct values in `rdd`. Returns a map of values to integer
counts.

This is an action that causes computation.
sourceraw docstring

distinctclj

(distinct rdd)
(distinct num-partitions rdd)

Construct an RDD containing only a single copy of each distinct element in rdd. Optionally accepts a number of partitions to size the resulting RDD with.

Construct an RDD containing only a single copy of each distinct element in
`rdd`. Optionally accepts a number of partitions to size the resulting RDD
with.
sourceraw docstring

filterclj

(filter f rdd)

Filter the elements of rdd to the ones which satisfy the predicate f.

Filter the elements of `rdd` to the ones which satisfy the predicate `f`.
sourceraw docstring

firstclj

(first rdd)

Find the first element of rdd.

This is an action that causes computation.

Find the first element of `rdd`.

This is an action that causes computation.
sourceraw docstring

foldclj

(fold f zero rdd)

Aggregate the elements of each partition in rdd, followed by the results for all the partitions, by using the given associative function f and a neutral zero value.

This is an action that causes computation.

Aggregate the elements of each partition in `rdd`, followed by the results
for all the partitions, by using the given associative function `f` and a
neutral `zero` value.

This is an action that causes computation.
sourceraw docstring

foreachclj

(foreach f rdd)

Apply the function f to all elements of rdd. The function will run on the executors where the data resides.

Consider foreach-partition for efficiency if handling an element requires costly resource acquisition such as a database connection.

This is an action that causes computation.

Apply the function `f` to all elements of `rdd`. The function will run on
the executors where the data resides.

Consider `foreach-partition` for efficiency if handling an element requires
costly resource acquisition such as a database connection.

This is an action that causes computation.
sourceraw docstring

foreach-partitionclj

(foreach-partition f rdd)

Apply the function f to all elements of rdd by calling it with a sequence of each partition's elements. The function will run on the executors where the data resides.

This is an action that causes computation.

Apply the function `f` to all elements of `rdd` by calling it with a
sequence of each partition's elements. The function will run on the executors
where the data resides.

This is an action that causes computation.
sourceraw docstring

full-outer-joinclj

(full-outer-join rdd1 rdd2)
(full-outer-join rdd1 rdd2 partitions)

Perform a full outer join of rdd1 and rdd2.

For each element (k, v) in rdd1, the resulting RDD will either contain all pairs (k, (Some(v), Some(w))) for (k, w) in rdd2, or the pair (k, (Some(v), None)) if no elements in other have key k. Similarly, for each element (k, w) in rdd2, the resulting RDD will either contain all pairs (k, (Some(v), Some(w))) for v in rdd1, or the pair (k, (None, Some(w))) if no elements in rdd1 have key k.

Hash-partitions the resulting RDD using the existing partitioner/parallelism level unless partitions is be provided as an integer number or a partitioner instance.

Perform a full outer join of `rdd1` and `rdd2`.

For each element `(k, v)` in `rdd1`, the resulting RDD will either contain all
pairs `(k, (Some(v), Some(w)))` for `(k, w)` in `rdd2`, or the pair
`(k, (Some(v), None))` if no elements in other have key `k`. Similarly, for
each element `(k, w)` in `rdd2`, the resulting RDD will either contain all
pairs `(k, (Some(v), Some(w)))` for `v` in `rdd1`, or the pair
`(k, (None, Some(w)))` if no elements in `rdd1` have key `k`.

Hash-partitions the resulting RDD using the existing partitioner/parallelism
level unless `partitions` is be provided as an integer number or a
partitioner instance.
sourceraw docstring

group-byclj

(group-by f rdd)
(group-by f num-partitions rdd)

Group the elements of rdd using a key function f. Returns a pair RDD with each generated key and all matching elements as a value sequence.

Group the elements of `rdd` using a key function `f`. Returns a pair RDD
with each generated key and all matching elements as a value sequence.
sourceraw docstring

group-by-keyclj

(group-by-key rdd)
(group-by-key num-partitions rdd)

Group the entries in the pair rdd by key. Returns a new pair RDD with one entry per key, containing all of the matching values as a sequence.

Group the entries in the pair `rdd` by key. Returns a new pair RDD with one
entry per key, containing all of the matching values as a sequence.
sourceraw docstring

intersectionclj

(intersection rdd1 rdd2)

Construct an RDD representing the intersection of elements which are in both RDDs.

Construct an RDD representing the intersection of elements which are in both
RDDs.
sourceraw docstring

intoclj

(into coll rdd)
(into coll xf rdd)

Collect the elements of rdd into a collection on the driver. Behaves like clojure.core/into, including accepting an optional transducer. Automatically coerces Scala tuples into Clojure vectors.

Be careful not to realize large datasets with this, as the driver will likely run out of memory.

This is an action that causes computation.

Collect the elements of `rdd` into a collection on the driver. Behaves like
`clojure.core/into`, including accepting an optional transducer.
Automatically coerces Scala tuples into Clojure vectors.

Be careful not to realize large datasets with this, as the driver will likely
run out of memory.

This is an action that causes computation.
sourceraw docstring

joinclj

(join rdd1 rdd2)
(join rdd1 rdd2 partitions)

Construct an RDD containing all pairs of elements with matching keys in rdd1 and rdd2. Each pair of elements will be returned as a tuple of (k, (v, w)), where (k, v) is in rdd1 and (k, w) is in rdd2.

Performs a hash join across the cluster. Optionally, partitions may be provided as an integer number or a partitioner instance to control the partitioning of the resulting RDD.

Construct an RDD containing all pairs of elements with matching keys in
`rdd1` and `rdd2`. Each pair of elements will be returned as a tuple of
`(k, (v, w))`, where `(k, v)` is in `rdd1` and `(k, w)` is in `rdd2`.

Performs a hash join across the cluster. Optionally, `partitions` may be
provided as an integer number or a partitioner instance to control the
partitioning of the resulting RDD.
sourceraw docstring

key-byclj

(key-by f rdd)

Creates pairs from the elements in rdd by using f to compute a key for each value.

Creates pairs from the elements in `rdd` by using `f` to compute a key for
each value.
sourceraw docstring

keysclj

(keys rdd)

Transform rdd by replacing each pair with its key. Returns a new RDD representing the keys.

Transform `rdd` by replacing each pair with its key. Returns a new RDD
representing the keys.
sourceraw docstring

left-outer-joinclj

(left-outer-join rdd1 rdd2)
(left-outer-join rdd1 rdd2 partitions)

Perform a left outer join of rdd1 and rdd2.

For each element (k, v) in rdd1, the resulting RDD will either contain all pairs (k, (v, Some(w))) for (k, w) in rdd2, or the pair (k, (v, None)) if no elements in rdd2 have key k.

Hash-partitions the resulting RDD using the existing partitioner/parallelism level unless partitions is be provided as an integer number or a partitioner instance.

Perform a left outer join of `rdd1` and `rdd2`.

For each element `(k, v)` in `rdd1`, the resulting RDD will either contain
all pairs `(k, (v, Some(w)))` for `(k, w)` in `rdd2`, or the pair
`(k, (v, None))` if no elements in `rdd2` have key `k`.

Hash-partitions the resulting RDD using the existing partitioner/parallelism
level unless `partitions` is be provided as an integer number or a
partitioner instance.
sourceraw docstring

lookupclj

(lookup rdd k)

Find all values in the rdd pairs whose keys is k. The key must be serializable with the Java serializer (not Kryo) for this to work.

This is an action that causes computation.

Find all values in the `rdd` pairs whose keys is `k`. The key must be
serializable with the Java serializer (not Kryo) for this to work.

This is an action that causes computation.
sourceraw docstring

mapclj

(map f rdd)

Map the function f over each element of rdd. Returns a new RDD representing the transformed elements.

Map the function `f` over each element of `rdd`. Returns a new RDD
representing the transformed elements.
sourceraw docstring

map->pairsclj

(map->pairs f rdd)

Map the function f over each element of rdd. Returns a new pair RDD representing the transformed elements.

Map the function `f` over each element of `rdd`. Returns a new pair RDD
representing the transformed elements.
sourceraw docstring

map-partitionsclj

(map-partitions f rdd)
(map-partitions f preserve-partitioning? rdd)

Map the function f over each partition in rdd, producing a sequence of results. Returns an RDD representing the concatenation of all the partition results. The function will be called with an iterator of the elements of each partition.

Map the function `f` over each partition in `rdd`, producing a sequence of
results. Returns an RDD representing the concatenation of all the partition
results. The function will be called with an iterator of the elements of each
partition.
sourceraw docstring

map-partitions->pairsclj

(map-partitions->pairs f rdd)
(map-partitions->pairs f preserve-partitioning? rdd)

Map the function f over each partition in rdd, producing a sequence of key-value pairs. The function will be called with an iterator of the elements of the partition.

Map the function `f` over each partition in `rdd`, producing a sequence of
key-value pairs. The function will be called with an iterator of the elements
of the partition.
sourceraw docstring

map-partitions-indexedclj

(map-partitions-indexed f rdd)

Map the function f over each partition in rdd, producing a sequence of results. Returns an RDD representing the concatenation of all the partition results. The function will be called with the partition index and an iterator of the elements of each partition.

Map the function `f` over each partition in `rdd`, producing a sequence of
results. Returns an RDD representing the concatenation of all the partition
results. The function will be called with the partition index and an iterator
of the elements of each partition.
sourceraw docstring

map-valsclj

(map-vals f rdd)

Map the function f over each value of the pairs in rdd. Returns a new pair RDD representing the transformed pairs.

Map the function `f` over each value of the pairs in `rdd`. Returns a new
pair RDD representing the transformed pairs.
sourceraw docstring

mapcatclj

(mapcat f rdd)

Map the function f over each element in rdd to produce a sequence of results. Returns an RDD representing the concatenation of all element results.

Map the function `f` over each element in `rdd` to produce a sequence of
results. Returns an RDD representing the concatenation of all element
results.
sourceraw docstring

mapcat->pairsclj

(mapcat->pairs f rdd)

Map the function f over each element in rdd to produce a sequence of key-value pairs. Returns a new pair RDD representing the concatenation of all result pairs.

Map the function `f` over each element in `rdd` to produce a sequence of
key-value pairs. Returns a new pair RDD representing the concatenation of all
result pairs.
sourceraw docstring

mapcat-valsclj

(mapcat-vals f rdd)

Map the function f over each value of the pairs in rdd to produce a collection of values. Returns a new pair RDD representing the concatenated keys and values.

Map the function `f` over each value of the pairs in `rdd` to produce a
collection of values. Returns a new pair RDD representing the concatenated
keys and values.
sourceraw docstring

maxclj

(max rdd)
(max compare-fn rdd)

Find the maximum element in rdd in the ordering defined by compare-fn.

This is an action that causes computation.

Find the maximum element in `rdd` in the ordering defined by `compare-fn`.

This is an action that causes computation.
sourceraw docstring

minclj

(min rdd)
(min compare-fn rdd)

Find the minimum element in rdd in the ordering defined by compare-fn.

This is an action that causes computation.

Find the minimum element in `rdd` in the ordering defined by `compare-fn`.

This is an action that causes computation.
sourceraw docstring

reduceclj

(reduce f rdd)

Aggregate the elements of rdd using the function f. The reducing function must accept two arguments and should be commutative and associative so that it can be computed correctly in parallel.

This is an action that causes computation.

Aggregate the elements of `rdd` using the function `f`. The reducing
function must accept two arguments and should be commutative and associative
so that it can be computed correctly in parallel.

This is an action that causes computation.
sourceraw docstring

reduce-by-keyclj

(reduce-by-key f rdd)

Aggregate the pairs of rdd which share a key by combining all of the values with the reducing function f. Returns a new pair RDD with one entry per unique key, holding the aggregated values.

Aggregate the pairs of `rdd` which share a key by combining all of the
values with the reducing function `f`. Returns a new pair RDD with one entry
per unique key, holding the aggregated values.
sourceraw docstring

right-outer-joinclj

(right-outer-join rdd1 rdd2)
(right-outer-join rdd1 rdd2 partitions)

Perform a right outer join of rdd1 and rdd2.

For each element (k, w) in rdd2, the resulting RDD will either contain all pairs (k, (Some(v), w)) for (k, v) in rdd1, or the pair (k, (None, w)) if no elements in rdd1 have key k.

Hash-partitions the resulting RDD using the existing partitioner/parallelism level unless partitions is be provided as an integer number or a partitioner instance.

Perform a right outer join of `rdd1` and `rdd2`.

For each element `(k, w)` in `rdd2`, the resulting RDD will either contain
all pairs `(k, (Some(v), w))` for `(k, v)` in `rdd1`, or the pair
`(k, (None, w))` if no elements in `rdd1` have key `k`.

Hash-partitions the resulting RDD using the existing partitioner/parallelism
level unless `partitions` is be provided as an integer number or a
partitioner instance.
sourceraw docstring

sampleclj

(sample fraction rdd)
(sample fraction replacement? rdd)
(sample fraction replacement? seed rdd)

Generate a randomly sampled subset of rdd with roughly fraction of the original elements. Callers can optionally select whether the sample happens with replacement, and a random seed to control the sample.

Generate a randomly sampled subset of `rdd` with roughly `fraction` of the
original elements. Callers can optionally select whether the sample happens
with replacement, and a random seed to control the sample.
sourceraw docstring

sort-by-keyclj

(sort-by-key rdd)
(sort-by-key ascending? rdd)
(sort-by-key compare-fn ascending? rdd)
(sort-by-key compare-fn ascending? num-partitions rdd)

Reorder the elements of rdd so that they are sorted according to their natural order or the given comparator f if provided. The result may be ordered ascending or descending, depending on ascending?.

Reorder the elements of `rdd` so that they are sorted according to their
natural order or the given comparator `f` if provided. The result may be
ordered ascending or descending, depending on `ascending?`.
sourceraw docstring

subtractclj

(subtract rdd1 rdd2)

Remove all elements from rdd1 that are present in rdd2.

Remove all elements from `rdd1` that are present in `rdd2`.
sourceraw docstring

subtract-by-keyclj

(subtract-by-key rdd1 rdd2)

Construct an RDD representing all pairs in rdd1 for which there is no pair with a matching key in rdd2.

Construct an RDD representing all pairs in `rdd1` for which there is no pair
with a matching key in `rdd2`.
sourceraw docstring

takeclj

(take n rdd)

Take the first n elements of the RDD.

This currently scans the partitions one by one on the driver, so it will be slow if a lot of elements are required. In that case, use collect to get the whole RDD instead.

This is an action that causes computation.

Take the first `n` elements of the RDD.

This currently scans the partitions _one by one_ on the **driver**, so it
will be slow if a lot of elements are required. In that case, use `collect`
to get the whole RDD instead.

This is an action that causes computation.
sourceraw docstring

take-orderedclj

(take-ordered n rdd)
(take-ordered n compare-fn rdd)

Take the first n (smallest) elements from this RDD as defined by the elements' natural order or specified comparator.

This currently scans the partitions one by one on the driver, so it will be slow if a lot of elements are required. In that case, use collect to get the whole RDD instead.

This is an action that causes computation.

Take the first `n` (smallest) elements from this RDD as defined by the
elements' natural order or specified comparator.

This currently scans the partitions _one by one_ on the **driver**, so it
will be slow if a lot of elements are required. In that case, use `collect`
to get the whole RDD instead.

This is an action that causes computation.
sourceraw docstring

unionclj

(union rdd1 rdd2)
(union rdd1 rdd2 & rdds)

Construct a union of the elements in the provided RDDs. Any identical elements will appear multiple times.

Construct a union of the elements in the provided RDDs. Any identical
elements will appear multiple times.
sourceraw docstring

valsclj

(vals rdd)

Transform rdd by replacing each pair with its value. Returns a new RDD representing the values.

Transform `rdd` by replacing each pair with its value. Returns a new RDD
representing the values.
sourceraw docstring

zip-indexedclj

(zip-indexed rdd)

Zip the elements in rdd with their indices. Returns a new pair RDD with the element/index tuples.

The ordering is first based on the partition index and then the ordering of items within each partition. So the first item in the first partition gets index 0, and the last item in the last partition receives the largest index.

This method needs to trigger a spark job when rdd contains more than one partition.

Zip the elements in `rdd` with their indices. Returns a new pair RDD with
the element/index tuples.

The ordering is first based on the partition index and then the ordering of
items within each partition. So the first item in the first partition gets
index 0, and the last item in the last partition receives the largest index.

This method needs to trigger a spark job when `rdd` contains more than one
partition.
sourceraw docstring

zip-unique-idsclj

(zip-unique-ids rdd)

Zip the elements in rdd with unique long identifiers. Returns a new pair RDD with the element/id tuples.

Items in the kth partition will get ids k, n+k, 2*n+k, ..., where n is the number of partitions. So the ids won't be sequential and there may be gaps, but this method won't trigger a spark job, unlike zip-indexed.

Zip the elements in `rdd` with unique long identifiers. Returns a new pair
RDD with the element/id tuples.

Items in the kth partition will get ids `k`, `n+k`, `2*n+k`, ..., where `n`
is the number of partitions. So the ids won't be sequential and there may be
gaps, but this method _won't_ trigger a spark job, unlike `zip-indexed`.
sourceraw docstring

cljdoc is a website building & hosting documentation for Clojure/Script libraries

× close