sparkplug.rdd

Clojure only.

This namespace provides the main API for writing Spark tasks.

Most operations in this namespace place the RDD last in the argument list, just like Clojure collection functions. This lets you compose them using the thread-last macro (`->>`), making it simple to migrate existing Clojure code.
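As a rough illustration of the thread-last style described above, the sketch below chains three functions from this namespace. It assumes `sc` is an already-initialized Spark context (creating one is outside this namespace) and that the sparkplug dependency is on the classpath. The function name `coalesced-partition-count` is hypothetical.

```clojure
(require '[sparkplug.rdd :as rdd])

(defn coalesced-partition-count
  "Distribute a local vector over 8 partitions, coalesce it down to 2, and
  count the partitions of the result. `sc` is assumed to be a live Spark
  context supplied by the caller."
  [sc]
  (->> (rdd/parallelize sc 8 (vec (range 1000))) ; yields the initial RDD
       (rdd/coalesce 2)                          ; RDD is threaded in last
       (rdd/partitions)                          ; vector of partitions
       (count)))
```

Because each call takes the RDD as its final argument, every step slots into `->>` the same way `map` or `filter` would on a local collection.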

binary-files^clj

(binary-files spark-context path)

(binary-files spark-context path num-partitions)

Read a directory of binary files from the given URL as a pair RDD of paths to byte streams.

cache!^clj

(cache! rdd)

(cache! level rdd)

Sets the storage level of `rdd` to persist its values across operations after the first time it is computed. By default, this uses the `:memory-only` level, but an alternate may be specified by `level`.

This can only be used to assign a new storage level if the RDD does not have a storage level set already.
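One possible way to scope caching around a piece of work is sketched below, pairing `cache!` with `uncache!` from this namespace. The helper name `with-cached` and the callback `f` are hypothetical. The sketch assumes `rdd` has no storage level yet and that `f` runs at least one Spark job, since nothing is cached until the RDD is first computed.

```clojure
(require '[sparkplug.rdd :as rdd])

(defn with-cached
  "Persist `rdd` at the default level while `(f rdd)` runs, then release it.
  Assumes `rdd` has no storage level yet, since `cache!` cannot reassign one."
  [f rdd]
  (rdd/cache! rdd)
  (try
    (f rdd)
    (finally
      ;; With no `blocking?` argument this waits until the data is removed.
      (rdd/uncache! rdd))))
```

The RDD stays in the last argument position so the helper can itself be used inside a `->>` pipeline.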

checkpoint!^clj

(checkpoint! rdd)

Mark `rdd` for checkpointing. It will be saved to a file inside the checkpoint directory set on the Spark context and all references to its parent RDDs will be removed.

This function must be called before any job has been executed on this RDD. It is strongly recommended that this RDD is persisted in memory; otherwise, saving it to a file will require recomputation.
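The recommendation to persist before checkpointing could be followed with a small helper like the one below. This is a sketch only: `cache-and-checkpoint!` is a hypothetical name, and it assumes the Spark context already has a checkpoint directory configured and that no job has run on `rdd` yet.

```clojure
(require '[sparkplug.rdd :as rdd])

(defn cache-and-checkpoint!
  "Persist `rdd` in memory, then mark it for checkpointing. Must run before
  any job has executed on `rdd`. Returns `rdd`."
  [rdd]
  ;; Caching first means the checkpoint file can be written from the cached
  ;; values instead of recomputing the RDD.
  (rdd/cache! :memory-only rdd)
  (rdd/checkpoint! rdd)
  rdd)
```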

checkpointed?^clj

(checkpointed? rdd)

True if `rdd` has been marked for checkpointing.

coalesce^clj

(coalesce num-partitions rdd)

(coalesce num-partitions shuffle? rdd)

Decrease the number of partitions in `rdd` to `num-partitions`. Useful for running operations more efficiently after filtering down a large dataset.

empty^clj

(empty spark-context)

Construct a new empty RDD.

hash-partitioner^clj

(hash-partitioner n)

(hash-partitioner key-fn n)

Construct a partitioner which will hash keys to distribute them uniformly over `n` buckets. Optionally accepts a `key-fn` which will be called on each key before hashing it.
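To make the bucketing idea concrete, here is a conceptual sketch in plain Clojure of assigning a key to one of `n` buckets by hashing. It is not sparkplug's implementation, which may use a different hash function, and `bucket-for` is a hypothetical name used only for illustration.

```clojure
(defn bucket-for
  "Illustrative only: pick a bucket in [0, n) for key `k` by applying `key-fn`
  and hashing the result."
  [key-fn n k]
  ;; `mod` (unlike `rem`) never returns a negative value for a positive `n`,
  ;; so negative hash codes still land in a valid bucket.
  (mod (hash (key-fn k)) n))

;; Keys that `key-fn` maps to the same value always share a bucket.
(= (bucket-for first 8 ["user-1" :click])
   (bucket-for first 8 ["user-1" :view]))
```

This is the reason a `key-fn` can be useful: it lets composite keys be grouped by one of their components.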

name^clj

(name rdd)

Return the current name for `rdd`.

parallelize^clj

(parallelize spark-context coll)

(parallelize spark-context min-partitions coll)

Distribute a local collection to form an RDD. Optionally accepts a number of partitions to slice the collection into.

parallelize-pairs^clj

(parallelize-pairs spark-context coll)

(parallelize-pairs spark-context min-partitions coll)

Distribute a local collection to form a pair RDD. Optionally accepts a number of partitions to slice the collection into.

partition-by^clj

(partition-by partitioner rdd)

Return a copy of `rdd` partitioned by the given `partitioner`.
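A minimal sketch of combining `partition-by` with `hash-partitioner` and `partitioner` from this namespace follows. The helper names are hypothetical, and `pairs` is assumed to be a pair RDD, for example one produced by `parallelize-pairs`.

```clojure
(require '[sparkplug.rdd :as rdd])

(defn hash-partitioned
  "Return a copy of the pair RDD `pairs` spread over `n` hash partitions."
  [n pairs]
  (rdd/partition-by (rdd/hash-partitioner n) pairs))

(defn custom-partitioned?
  "True if `rdd` reports a partitioner, i.e. `partitioner` does not return nil."
  [rdd]
  (some? (rdd/partitioner rdd)))
```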

partitioner^clj

(partitioner rdd)

Return the partitioner associated with `rdd`, or nil if there is no custom partitioner.

partitions^clj

(partitions rdd)

Return a vector of the partitions in `rdd`.

repartition^clj

(repartition n rdd)

Returns a new RDD with exactly `n` partitions.

This method can increase or decrease the level of parallelism in this RDD. Internally, this uses a shuffle to redistribute data.

If you are decreasing the number of partitions in this RDD, consider using `coalesce`, which can avoid performing a shuffle.
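The advice above about preferring `coalesce` when shrinking could be captured in a helper along these lines. This is a sketch under the assumption that counting the vector returned by `partitions` gives the current partition count; `resize-partitions` is a hypothetical name.

```clojure
(require '[sparkplug.rdd :as rdd])

(defn resize-partitions
  "Return an RDD with `n` partitions, using `coalesce` when shrinking and
  `repartition` (which shuffles) only when growing."
  [n rdd]
  (let [current (count (rdd/partitions rdd))]
    (cond
      (< n current) (rdd/coalesce n rdd)     ; may avoid a shuffle
      (> n current) (rdd/repartition n rdd)  ; shuffle is required to grow
      :else rdd)))                           ; already the requested size
```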

save-as-text-file^clj

(save-as-text-file path rdd)

Write the elements of `rdd` as a text file (or set of text files) in a given directory `path` in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call `toString` on each element to convert it to a line of text in the file.

set-name^clj

(set-name name-str rdd)

Set the name of `rdd` to `name-str`.

storage-level^clj

(storage-level rdd)

Return the keyword representing the storage level in the `storage-levels` map, or the raw value if not found.

storage-levels^clj

Keyword mappings for available RDD storage levels.
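Since `storage-levels` is described as a map keyed by keyword, one way to guard against typos in a level name is to check membership before caching. The sketch below is illustrative: `cache-at!` is a hypothetical helper, and the only level keyword this page confirms is `:memory-only`.

```clojure
(require '[sparkplug.rdd :as rdd])

(defn cache-at!
  "Cache `rdd` at `level`, rejecting keywords missing from `storage-levels`."
  [level rdd]
  (when-not (contains? rdd/storage-levels level)
    (throw (ex-info "Unknown storage level"
                    {:level level
                     :known (keys rdd/storage-levels)})))
  (rdd/cache! level rdd))
```

Evaluating `(keys rdd/storage-levels)` at a REPL is a quick way to see which keywords are available in the installed version.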

text-file^clj

(text-file spark-context filename)

(text-file spark-context min-partitions filename)

Read a text file from a URL into an RDD of the lines in the file. Optionally accepts a number of partitions to slice the file into.
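As a small end-to-end sketch, `text-file` can be paired with `save-as-text-file` to copy lines from one location to another. The name `copy-lines` is hypothetical, `sc` is assumed to be a live Spark context, and both paths are placeholders supplied by the caller. Note that Spark refuses to write into an output directory that already exists.

```clojure
(require '[sparkplug.rdd :as rdd])

(defn copy-lines
  "Read the lines of `in-path` and write them out under the directory
  `out-path`, which must not exist yet."
  [sc in-path out-path]
  (->> (rdd/text-file sc in-path)          ; RDD of lines
       (rdd/save-as-text-file out-path)))  ; one output file per partition
```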

uncache!^clj

(uncache! rdd)

(uncache! blocking? rdd)

Mark `rdd` as non-persistent, and remove all blocks for it from memory and disk. Blocks until all data has been removed unless `blocking?` is provided and false.

whole-text-files^clj

(whole-text-files spark-context filename)

(whole-text-files spark-context min-partitions filename)

Read a directory of text files from a URL into an RDD. Each element of the RDD is a pair of the file path and the full contents of the file.
