sparkplug.rdd

This namespace provides the main API for writing Spark tasks.

Most operations in this namespace place the RDD last in the argument list, just like Clojure collection functions. This lets you compose them using the thread-last macro (`->>`), making it simple to migrate existing Clojure code.
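For example, a minimal sketch of a thread-last pipeline. It assumes `sc` is an initialized JavaSparkContext and aliases this namespace as `rdd`; both assumptions carry through the examples below.

(require '[sparkplug.rdd :as rdd])

;; Build, repartition, name, and cache an RDD in one pipeline.
(def numbers
  (->> (rdd/parallelize sc (range 1000))
       (rdd/repartition 4)
       (rdd/set-name "numbers")
       (rdd/cache!)))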

binary-files

(binary-files spark-context path)
(binary-files spark-context path num-partitions)

Read a directory of binary files from the given URL as a pair RDD of paths to byte streams.
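A sketch (the HDFS path is hypothetical):

;; Pair RDD of file path -> byte stream for every file under the directory.
(def archives
  (rdd/binary-files sc "hdfs:///data/archives"))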

cache!

(cache! rdd)
(cache! level rdd)

Sets the storage level of `rdd` to persist its values across operations after the first time it is computed. By default, this uses the `:memory-only` level, but an alternate may be specified by `level`.

This can only be used to assign a new storage level if the RDD does not have a storage level set already.
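For example, assuming `:memory-and-disk` is one of the keywords in `storage-levels`:

;; Persist with the default :memory-only level.
(def lines (rdd/cache! (rdd/text-file sc "events.log")))

;; Or spill to disk when memory runs short.
(->> (rdd/text-file sc "events.log")
     (rdd/cache! :memory-and-disk))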

checkpoint!

(checkpoint! rdd)

Mark `rdd` for checkpointing. It will be saved to a file inside the checkpoint directory set on the Spark context and all references to its parent RDDs will be removed.

This function must be called before any job has been executed on this RDD. It is strongly recommended that this RDD is persisted in memory, otherwise saving it to a file will require recomputation.
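A sketch; the checkpoint directory is set through the underlying Spark context, and the path here is hypothetical:

;; The context needs a checkpoint directory before checkpointing.
(.setCheckpointDir sc "/tmp/checkpoints")

(def events (rdd/cache! (rdd/text-file sc "events.log")))
(rdd/checkpoint! events)
(rdd/checkpointed? events)   ;; check checkpoint status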

checkpointed?

(checkpointed? rdd)

True if `rdd` has been marked for checkpointing.

coalesce

(coalesce num-partitions rdd)
(coalesce num-partitions shuffle? rdd)

Decrease the number of partitions in `rdd` to `num-partitions`. Useful for running operations more efficiently after filtering down a large dataset.
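A sketch, where `filtered` stands for an RDD that has been winnowed down:

;; Collapse to 4 partitions without a shuffle.
(rdd/coalesce 4 filtered)

;; Pass shuffle? to rebalance data across the new partitions.
(rdd/coalesce 4 true filtered)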

empty

(empty spark-context)

Construct a new empty RDD.

hash-partitioner

(hash-partitioner n)
(hash-partitioner key-fn n)

Construct a partitioner which will hash keys to distribute them uniformly over `n` buckets. Optionally accepts a `key-fn` which will be called on each key before hashing it.
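For example, a sketch that hashes map keys by a hypothetical `:user-id` field:

;; 16 buckets, hashing each key directly.
(def p16 (rdd/hash-partitioner 16))

;; 16 buckets, hashing (:user-id key) instead of the whole key.
(def by-user (rdd/hash-partitioner :user-id 16))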

name

(name rdd)

Return the current name for `rdd`.

num-partitions

(num-partitions rdd)

Returns the number of partitions in `rdd`.

parallelize

(parallelize spark-context coll)
(parallelize spark-context min-partitions coll)

Distribute a local collection to form an RDD. Optionally accepts a number of partitions to slice the collection into.
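For example:

(rdd/parallelize sc (range 1000))
(rdd/parallelize sc 8 (range 1000))   ;; slice into at least 8 partitions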

parallelize-pairs

(parallelize-pairs spark-context coll)
(parallelize-pairs spark-context min-partitions coll)

Distributes a local collection to form a pair RDD. Optionally accepts a number of partitions to slice the collection into.
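A sketch, assuming each element of the collection is a key/value pair (here, two-element vectors):

(rdd/parallelize-pairs sc [["a" 1] ["b" 2] ["c" 3]])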

partition-by

(partition-by partitioner rdd)

Return a copy of `rdd` partitioned by the given `partitioner`.
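For example, applying a hash partitioner to a hypothetical pair RDD:

(->> user-pairs
     (rdd/partition-by (rdd/hash-partitioner 16)))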

partitioner

(partitioner rdd)

Return the partitioner associated with `rdd`, or nil if there is no custom partitioner.

partitions

(partitions rdd)

Return a vector of the partitions in `rdd`.

repartition

(repartition n rdd)

Returns a new `rdd` with exactly `n` partitions.

This method can increase or decrease the level of parallelism in this RDD. Internally, this uses a shuffle to redistribute data.

If you are decreasing the number of partitions in this RDD, consider using `coalesce`, which can avoid performing a shuffle.
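For example, redistributing ahead of a wide operation (always a full shuffle):

(rdd/repartition 200 events)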

repartition-and-sort-within-partitions

(repartition-and-sort-within-partitions partitioner pair-rdd)
(repartition-and-sort-within-partitions partitioner comparator pair-rdd)

Repartition the RDD according to the given partitioner and, within each resulting partition, sort records by their keys. This is more efficient than calling repartition and then sorting within each partition because it can push the sorting down into the shuffle machinery.
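A sketch on a hypothetical pair RDD; Clojure functions implement java.util.Comparator, so `compare` can serve as the comparator:

(->> user-pairs
     (rdd/repartition-and-sort-within-partitions (rdd/hash-partitioner 32) compare))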

save-as-text-file

(save-as-text-file path rdd)

Write the elements of `rdd` as a text file (or set of text files) in a given directory `path` in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.
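For example (the output path is hypothetical and, per the usual Spark behavior, must not already exist):

(->> results
     (rdd/save-as-text-file "hdfs:///tmp/output"))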

set-name

(set-name name-str rdd)

Set the name of `rdd` to `name-str`.
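For example, assuming `set-name` returns the renamed RDD:

(rdd/name (rdd/set-name "user-events" events))   ;; => "user-events"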

storage-level

(storage-level rdd)

Return the keyword representing the storage level in the `storage-levels` map, or the raw value if not found.

storage-levels

Keyword mappings for available RDD storage levels.
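For example, a level assigned by `cache!` should round-trip back as a keyword, assuming `:disk-only` is among the mappings and `cache!` returns the persisted RDD:

(rdd/storage-level (rdd/cache! :disk-only events))   ;; => :disk-only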

text-file

(text-file spark-context filename)
(text-file spark-context min-partitions filename)

Read a text file from a URL into an RDD of the lines in the file. Optionally accepts a number of partitions to slice the file into.
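For example (paths hypothetical):

(rdd/text-file sc "hdfs:///logs/app.log")
(rdd/text-file sc 16 "hdfs:///logs/app.log")   ;; at least 16 partitions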

uncache!

(uncache! rdd)
(uncache! blocking? rdd)

Mark `rdd` as non-persistent, and remove all blocks for it from memory and disk. Blocks until all data has been removed unless `blocking?` is provided and false.
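For example:

(rdd/uncache! events)         ;; blocks until all blocks are removed
(rdd/uncache! false events)   ;; returns immediately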

whole-text-files

(whole-text-files spark-context filename)
(whole-text-files spark-context min-partitions filename)

Read a directory of text files from a URL into an RDD. Each element of the RDD is a pair of the file path and the full contents of the file.
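A sketch (the directory is hypothetical):

;; Pair RDD of [path contents] for each file in the directory.
(rdd/whole-text-files sc "hdfs:///data/docs")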
