This namespace provides the main API for writing Spark tasks. Most operations in this namespace place the RDD last in the argument list, just like Clojure collection functions. This lets you compose them using the thread-last macro (`->>`), making it simple to migrate existing Clojure code.
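For example, a small pipeline might look like this (a sketch: `spark` is an assumed alias for this namespace, and `sc` an already-initialized Spark context):

```clojure
;; `spark` and `sc` are assumptions, not part of this namespace's API.
(->> (spark/parallelize sc (range 1000))
     (spark/repartition 4)
     (spark/cache!)
     (spark/set-name "numbers"))
```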
(binary-files spark-context path)
(binary-files spark-context path num-partitions)
Read a directory of binary files from the given URL as a pair RDD of paths to byte streams.
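A sketch of both arities, with the same assumed `spark` alias and `sc` context; the path is hypothetical:

```clojure
;; Pair RDD of [file-path byte-stream] for each file in the directory:
(spark/binary-files sc "hdfs:///data/blobs")
;; With an explicit number of partitions:
(spark/binary-files sc "hdfs:///data/blobs" 16)
```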
(cache! rdd)
(cache! level rdd)
Sets the storage level of `rdd` to persist its values across operations after the first time it is computed. By default, this uses the `:memory-only` level, but an alternate may be specified by `level`. This can only be used to assign a new storage level if the RDD does not have a storage level set already.
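A sketch of both arities; the `:memory-and-disk` keyword is an assumed entry in `storage-levels`:

```clojure
;; Persist with the default :memory-only level:
(def nums (spark/cache! (spark/parallelize sc (range 1000))))
(spark/storage-level nums)   ; => :memory-only
;; Give a fresh RDD an explicit level (a level can only be set once):
(def lines (spark/text-file sc "hdfs:///data/input.txt"))
(spark/cache! :memory-and-disk lines)
```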
(checkpoint! rdd)
Mark `rdd` for checkpointing. It will be saved to a file inside the checkpoint directory set on the Spark context and all references to its parent RDDs will be removed. This function must be called before any job has been executed on this RDD. It is strongly recommended that this RDD is persisted in memory, otherwise saving it to a file will require recomputation.
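A sketch; setting the checkpoint directory is assumed to go through Java interop on the underlying Spark context:

```clojure
;; Set the checkpoint directory on the context first (Java interop,
;; assuming `sc` is a JavaSparkContext):
(.setCheckpointDir sc "hdfs:///tmp/checkpoints")
;; Persist before checkpointing to avoid recomputing the lineage:
(def nums (spark/cache! (spark/parallelize sc (range 1000))))
(spark/checkpoint! nums)
```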
(checkpointed? rdd)
True if `rdd` has been marked for checkpointing.
(coalesce num-partitions rdd)
(coalesce num-partitions shuffle? rdd)
Decrease the number of partitions in `rdd` to `num-partitions`. Useful for running operations more efficiently after filtering down a large dataset. When `shuffle?` is true, the data is redistributed via a shuffle; otherwise partitions are merged locally without one.
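A sketch of narrowing after a large read; the paths are hypothetical:

```clojure
;; After filtering a large dataset, shrink to fewer partitions
;; without a full shuffle:
(->> (spark/text-file sc 100 "hdfs:///data/big.txt")
     ;; ... narrowing transformations elided ...
     (spark/coalesce 10))
;; Pass shuffle? as true to redistribute the data evenly:
(->> (spark/text-file sc "hdfs:///data/big.txt")
     (spark/coalesce 10 true))
```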
(empty spark-context)
Construct a new empty RDD.
(hash-partitioner n)
(hash-partitioner key-fn n)
Construct a partitioner which will hash keys to distribute them uniformly over `n` buckets. Optionally accepts a `key-fn` which will be called on each key before hashing it.
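A sketch, combining with `partition-by`:

```clojure
;; Hash raw keys into 8 buckets:
(def pairs (spark/parallelize-pairs sc [["apple" 1] ["avocado" 2] ["beet" 3]]))
(spark/partition-by (spark/hash-partitioner 8) pairs)
;; Hash on a derived key instead, here the first character, so keys
;; sharing an initial land in the same partition:
(spark/partition-by (spark/hash-partitioner first 8) pairs)
```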
(name rdd)
Return the current name for `rdd`.
(num-partitions rdd)
Returns the number of partitions in `rdd`.
(parallelize spark-context coll)
(parallelize spark-context min-partitions coll)
Distribute a local collection to form an RDD. Optionally accepts a number of partitions to slice the collection into.
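Both arities, sketched:

```clojure
(spark/parallelize sc [1 2 3 4 5])
;; Slice into at least 8 partitions:
(spark/parallelize sc 8 (range 10000))
```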
(parallelize-pairs spark-context coll)
(parallelize-pairs spark-context min-partitions coll)
Distributes a local collection to form a pair RDD. Optionally accepts a number of partitions to slice the collection into.
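A sketch; each element of the collection is a key/value pair:

```clojure
(spark/parallelize-pairs sc [[:a 1] [:b 2] [:c 3]])
;; Slice into at least 4 partitions:
(spark/parallelize-pairs sc 4 (map vector (range 100) (repeat :x)))
```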
(partition-by partitioner rdd)
Return a copy of `rdd` partitioned by the given `partitioner`.
(partitioner rdd)
Return the partitioner associated with `rdd`, or nil if there is no custom partitioner.
(partitions rdd)
Return a vector of the partitions in `rdd`.
(repartition n rdd)
Returns a new `rdd` with exactly `n` partitions. This method can increase or decrease the level of parallelism in this RDD. Internally, this uses a shuffle to redistribute data. If you are decreasing the number of partitions in this RDD, consider using `coalesce`, which can avoid performing a shuffle.
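A sketch contrasting the two directions:

```clojure
;; Widen a source read with few partitions to increase parallelism:
(->> (spark/text-file sc "hdfs:///data/one-big-file.txt")
     (spark/repartition 64))
;; To reduce partitions, prefer coalesce, which can avoid a shuffle:
(->> (spark/parallelize sc 64 (range 1000))
     (spark/coalesce 8))
```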
(repartition-and-sort-within-partitions partitioner pair-rdd)
(repartition-and-sort-within-partitions partitioner comparator pair-rdd)
Repartition the RDD according to the given partitioner and, within each resulting partition, sort records by their keys. This is more efficient than calling repartition and then sorting within each partition because it can push the sorting down into the shuffle machinery.
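A sketch; whether a plain Clojure function is accepted as the `comparator` is an assumption (Clojure functions do implement `java.util.Comparator`):

```clojure
(def pairs (spark/parallelize-pairs sc [["b" 2] ["c" 3] ["a" 1]]))
;; Partition by key hash, sorting each partition by key during the shuffle:
(spark/repartition-and-sort-within-partitions
  (spark/hash-partitioner 4)
  pairs)
;; With an explicit comparator, here sorting keys in descending order:
(spark/repartition-and-sort-within-partitions
  (spark/hash-partitioner 4)
  (fn [a b] (compare b a))
  pairs)
```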
(save-as-text-file path rdd)
Write the elements of `rdd` as a text file (or set of text files) in a given directory `path` in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.
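A sketch; the output path is hypothetical:

```clojure
;; Writes part files under the given directory:
(->> (spark/parallelize sc (range 100))
     (spark/save-as-text-file "hdfs:///tmp/numbers"))
```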
(set-name name-str rdd)
Set the name of `rdd` to `name-str`.
(storage-level rdd)
Return the keyword representing the storage level in the `storage-levels` map, or the raw value if not found.
storage-levels
Keyword mappings for available RDD storage levels.
(text-file spark-context filename)
(text-file spark-context min-partitions filename)
Read a text file from a URL into an RDD of the lines in the file. Optionally accepts a number of partitions to slice the file into.
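Both arities, sketched with hypothetical paths:

```clojure
;; One RDD element per line of the file:
(spark/text-file sc "hdfs:///logs/app.log")
;; With a minimum number of partitions:
(spark/text-file sc 32 "s3a://bucket/big.txt")
```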
(uncache! rdd)
(uncache! blocking? rdd)
Mark `rdd` as non-persistent, and remove all blocks for it from memory and disk. Blocks until all data has been removed unless `blocking?` is provided and false.
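A sketch of both arities:

```clojure
(def nums (spark/cache! (spark/parallelize sc (range 1000))))
;; Evict, blocking until the blocks are removed:
(spark/uncache! nums)
;; Or request non-blocking removal:
(spark/uncache! false nums)
```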
(whole-text-files spark-context filename)
(whole-text-files spark-context min-partitions filename)
Read a directory of text files from a URL into an RDD. Each element of the RDD is a pair of the file path and the full contents of the file.
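A sketch; handy for directories of many small files:

```clojure
;; Pair RDD of [path contents] per file:
(spark/whole-text-files sc "hdfs:///data/docs")
;; With a minimum number of partitions:
(spark/whole-text-files sc 16 "hdfs:///data/docs")
```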