(aggregate rdd zero seq-op comb-op)
Params: (zeroValue: U)
(seqOp: Function2[U, T, U], combOp: Function2[U, U, U])
Result: U
Aggregate the elements of each partition, and then the results for all the partitions, using given combine functions and a neutral "zero value". This function can return a different result type, U, than the type of this RDD, T. Thus, we need one operation for merging a T into an U and one operation for merging two U's, as in scala.TraversableOnce. Both of these functions are allowed to modify and return their first argument instead of creating a new U to avoid memory allocation.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.803Z
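For illustration, a minimal sketch of a one-pass sum-and-count using aggregate (this and the later sketches assume the library is required under an alias such as `rdd`, that `spark` is an existing context, and that plain Clojure functions are accepted where the underlying API expects Spark Function objects):

(def nums (rdd/parallelize spark [1 2 3 4 5]))

;; zero value is [sum count]; the seq-op folds one element into the accumulator,
;; the comb-op merges two per-partition accumulators
(rdd/aggregate nums
               [0 0]
               (fn [[s c] x] [(+ s x) (inc c)])
               (fn [[s1 c1] [s2 c2]] [(+ s1 s2) (+ c1 c2)]))
;; => [15 5]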
(aggregate-by-key rdd zero seq-fn comb-fn)
(aggregate-by-key rdd zero num-partitions seq-fn comb-fn)
Params: (zeroValue: U, partitioner: Partitioner, seqFunc: Function2[U, V, U], combFunc: Function2[U, U, U])
Result: JavaPairRDD[K, U]
Aggregate the values of each key, using given combine functions and a neutral "zero value". This function can return a different result type, U, than the type of the values in this RDD, V. Thus, we need one operation for merging a V into a U and one operation for merging two U's, as in scala.TraversableOnce. The former operation is used for merging values within a partition, and the latter is used for merging values between partitions. To avoid memory allocation, both of these functions are allowed to modify and return their first argument instead of creating a new U.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaPairRDD.html
Timestamp: 2020-10-19T01:56:48.007Z
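A per-key analogue, accumulating a [sum count] pair for every key (same assumptions as the aggregate sketch above; pairs are written here as two-element vectors):

(def pairs (rdd/parallelize-pairs spark [["a" 1] ["a" 2] ["b" 3]]))

(rdd/aggregate-by-key pairs
                      [0 0]
                      (fn [[s c] v] [(+ s v) (inc c)])
                      (fn [[s1 c1] [s2 c2]] [(+ s1 s2) (+ c1 c2)]))
;; => pair RDD with ("a" [3 2]) and ("b" [3 1])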
(app-name)
(app-name spark)
Params:
Result: String
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaSparkContext.html
Timestamp: 2020-10-19T01:56:49.487Z
Params: (path: String, minPartitions: Int)
Result: JavaPairRDD[String, PortableDataStream]
Read a directory of binary files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI as a byte array. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.
minPartitions is a suggested value for the minimal number of input splits. Small files are preferred; very large files may cause bad performance.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaSparkContext.html
Timestamp: 2020-10-19T01:56:49.492Z
(broadcast value)
(broadcast spark value)
Params: (value: T)
Result: Broadcast[T]
Broadcast a read-only variable to the cluster, returning a org.apache.spark.broadcast.Broadcast object for reading it in distributed functions. The variable will be sent to each cluster only once.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaSparkContext.html
Timestamp: 2020-10-19T01:56:49.495Z
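A sketch of broadcasting a small lookup map and reading it inside a distributed function; the returned Broadcast object's value is read with .value (same assumptions as above, plus the assumption that a plain Clojure function works as the map argument):

(def country-names (rdd/broadcast spark {"fr" "France" "de" "Germany"}))

(-> (rdd/parallelize spark ["fr" "de" "fr"])
    ;; resolve each code against the broadcast map on the executors
    (rdd/map (fn [code] (get (.value country-names) code)))
    rdd/collect)
;; => ["France" "Germany" "France"]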
(cache rdd)
Params: ()
Result: JavaRDD[T]
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.805Z
(cartesian)
(cartesian rdd)
(cartesian left right)
(cartesian left right & rdds)
Params: (other: JavaRDDLike[U, _])
Result: JavaPairRDD[T, U]
Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of elements (a, b) where a is in this and b is in other.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.807Z
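A small sketch of pairing every element of one RDD with every element of another (same assumptions as above):

(def xs (rdd/parallelize spark [1 2]))
(def ys (rdd/parallelize spark ["a" "b"]))

(rdd/collect (rdd/cartesian xs ys))
;; => pairs (1 "a") (1 "b") (2 "a") (2 "b")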
(checkpoint-dir)
(checkpoint-dir spark)
Params:
Result: Optional[String]
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaSparkContext.html
Timestamp: 2020-10-19T01:56:49.509Z
(checkpointed? rdd)
Params:
Result: Boolean
Return whether this RDD has been checkpointed or not
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.861Z
(coalesce rdd num-partitions)
(coalesce rdd num-partitions shuffle)
Params: (numPartitions: Int)
Result: JavaRDD[T]
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.812Z
(cogroup this other1)
(cogroup this other1 other2)
(cogroup this other1 other2 other3)
Params: (other: JavaPairRDD[K, W], partitioner: Partitioner)
Result: JavaPairRDD[K, (Iterable[V], Iterable[W])]
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaPairRDD.html
Timestamp: 2020-10-19T01:56:48.034Z
(collect rdd)
Params: ()
Result: List[T]
Return an array that contains all of the elements in this RDD.
This method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver's memory.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.813Z
(collect-async rdd)
Params: ()
Result: JavaFutureAction[List[T]]
The asynchronous version of collect, which returns a future for retrieving an array containing all of the elements in this RDD.
This method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver's memory.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.814Z
(collect-partitions rdd partition-ids)
Params: (partitionIds: Array[Int])
Result: Array[List[T]]
Return an array that contains all of the elements in a specific partition of this RDD.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.816Z
(combine-by-key rdd create-fn merge-value-fn merge-combiner-fn)
(combine-by-key rdd
create-fn
merge-value-fn
merge-combiner-fn
partitions-or-partitioner)
Params: (createCombiner: Function[V, C], mergeValue: Function2[C, V, C], mergeCombiners: Function2[C, C, C], partitioner: Partitioner, mapSideCombine: Boolean, serializer: Serializer)
Result: JavaPairRDD[K, C]
Generic function to combine the elements for each key using a custom set of aggregation functions. Turns a JavaPairRDD[(K, V)] into a result of type JavaPairRDD[(K, C)], for a "combined type" C.
Users provide three functions: createCombiner, which turns a V into a C (e.g., creates a one-element list); mergeValue, to merge a V into a C (e.g., adds it to the end of a list); and mergeCombiners, to combine two C's into a single one.
In addition, users can control the partitioning of the output RDD, the serializer that is used for the shuffle, and whether to perform map-side aggregation (if a mapper can produce multiple items with the same key).
V and C can be different -- for example, one might group an RDD of type (Int, Int) into an RDD of type (Int, List[Int]).
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaPairRDD.html
Timestamp: 2020-10-19T01:56:48.051Z
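A sketch of the classic combine-by-key use, collecting the values of each key into a vector (same assumptions as above): create-fn builds a combiner from the first value seen for a key, merge-value-fn folds further values into it, and merge-combiner-fn merges combiners coming from different partitions.

(def pairs (rdd/parallelize-pairs spark [["a" 1] ["a" 2] ["b" 3]]))

(rdd/combine-by-key pairs
                    vector   ;; create-fn:         V -> C, e.g. 1 -> [1]
                    conj     ;; merge-value-fn:    C, V -> C, e.g. [1], 2 -> [1 2]
                    into)    ;; merge-combiner-fn: C, C -> C, e.g. [1], [2] -> [1 2]
;; => pair RDD with ("a" [1 2]) and ("b" [3])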
(conf)
(conf spark)
Params:
Result: SparkConf
Return a copy of this JavaSparkContext's configuration. The configuration cannot be changed at runtime.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaSparkContext.html
Timestamp: 2020-10-19T01:56:49.511Z
(context rdd)
Params:
Result: SparkContext
The org.apache.spark.SparkContext that this RDD was created on.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.817Z
(count rdd)
Params: ()
Result: Long
Return the number of elements in the RDD.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.818Z
(count-approx rdd timeout)
(count-approx rdd timeout confidence)
Params: (timeout: Long, confidence: Double)
Result: PartialResult[BoundedDouble]
Approximate version of count() that returns a potentially incomplete result within a timeout, even if not all tasks have finished.
The confidence is the probability that the error bounds of the result will contain the true value. That is, if countApprox were called repeatedly with confidence 0.9, we would expect 90% of the results to contain the true count. The confidence must be in the range [0,1] or an exception will be thrown.
timeout: maximum time to wait for the job, in milliseconds.
confidence: the desired statistical confidence in the result.
Returns a potentially incomplete result, with error bounds.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.820Z
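A sketch of asking for an approximate count within a 500 ms budget and then reading the result; final-value and final? on the returned PartialResult are documented later in this section (same assumptions as above):

(def big (rdd/parallelize spark (range 100000)))

(let [result (rdd/count-approx big 500 0.95)]
  ;; blocks until the final value is available and returns the bounded estimate
  (rdd/final-value result))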
(count-approx-distinct rdd relative-sd)
Params: (relativeSD: Double)
Result: Long
Return approximate number of distinct elements in the RDD.
The algorithm used is based on streamlib's implementation of "HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm".
relativeSD: relative accuracy. Smaller values create counters that require more space. It must be greater than 0.000017.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.822Z
(count-approx-distinct-by-key rdd relative-sd)
(count-approx-distinct-by-key rdd relative-sd partitions-or-partitioner)
Params: (relativeSD: Double, partitioner: Partitioner)
Result: JavaPairRDD[K, Long]
Return approximate number of distinct values for each key in this RDD.
The algorithm used is based on streamlib's implementation of "HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm".
relativeSD: relative accuracy. Smaller values create counters that require more space. It must be greater than 0.000017.
partitioner: partitioner of the resulting RDD.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaPairRDD.html
Timestamp: 2020-10-19T01:56:48.061Z
(count-async rdd)
Params: ()
Result: JavaFutureAction[Long]
The asynchronous version of count, which returns a future for counting the number of elements in this RDD.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.823Z
(count-by-key rdd)
Params: ()
Result: Map[K, Long]
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaPairRDD.html
Timestamp: 2020-10-19T01:56:48.063Z
(count-by-key-approx rdd timeout)
(count-by-key-approx rdd timeout confidence)
Params: (timeout: Long)
Result: PartialResult[Map[K, BoundedDouble]]
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaPairRDD.html
Timestamp: 2020-10-19T01:56:48.065Z
(count-by-value rdd)
Params: ()
Result: Map[T, Long]
Return the count of each unique value in this RDD as a map of (value, count) pairs. The final combine step happens locally on the master, equivalent to running a single reduce task.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.824Z
(default-min-partitions)
(default-min-partitions spark)
Params:
Result: Integer
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaSparkContext.html
Timestamp: 2020-10-19T01:56:49.503Z
(default-parallelism)
(default-parallelism spark)
Params:
Result: Integer
Default level of parallelism to use when not given by user (e.g. parallelize and makeRDD).
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaSparkContext.html
Timestamp: 2020-10-19T01:56:49.504Z
Flag for controlling the storage of an RDD.
The DataFrame is stored only on disk, and the CPU computation time is high because I/O is involved.
Flag for controlling the storage of an RDD.
Same as the disk-only storage level, but replicates each partition to two cluster nodes.
(distinct rdd)
(distinct rdd num-partitions)
Params: ()
Result: JavaRDD[T]
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.829Z
(empty-rdd)
(empty-rdd spark)
Params:
Result: JavaRDD[T]
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaSparkContext.html
Timestamp: 2020-10-19T01:56:49.505Z
(empty? rdd)
Params: ()
Result: Boolean
true if and only if the RDD contains no elements at all. Note that an RDD may be empty even when it has at least 1 partition.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.862Z
(filter rdd f)
Params: (f: Function[T, Boolean])
Result: JavaRDD[T]
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.832Z
(final-value result)
Params: ()
Result: R
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/partial/PartialResult.html
Timestamp: 2020-10-19T01:56:47.226Z
(final? result)
Params:
Result: Boolean
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/partial/PartialResult.html
Timestamp: 2020-10-19T01:56:47.229Z
(first rdd)
Params: ()
Result: T
Return the first element in this RDD.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.839Z
(flat-map rdd f)
Params: (f: FlatMapFunction[T, U])
Result: JavaRDD[U]
Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.840Z
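A sketch of flat-map splitting lines into words; the function returns a collection for each element, and the collections are flattened into one RDD (same assumptions as above):

(-> (rdd/parallelize spark ["hello world" "lazy evaluation"])
    (rdd/flat-map (fn [line] (vec (.split ^String line " "))))
    rdd/collect)
;; => ["hello" "world" "lazy" "evaluation"]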
(flat-map-to-pair rdd f)
Params: (f: PairFlatMapFunction[T, K2, V2])
Result: JavaPairRDD[K2, V2]
Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.842Z
(flat-map-values rdd f)
Params: (f: FlatMapFunction[V, U])
Result: JavaPairRDD[K, U]
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaPairRDD.html
Timestamp: 2020-10-19T01:56:48.082Z
(fold rdd zero f)
Params: (zeroValue: T)
(f: Function2[T, T, T])
Result: T
Aggregate the elements of each partition, and then the results for all the partitions, using a given associative function and a neutral "zero value". The function op(t1, t2) is allowed to modify t1 and return it as its result value to avoid object allocation; however, it should not modify t2.
This behaves somewhat differently from fold operations implemented for non-distributed collections in functional languages like Scala. This fold operation may be applied to partitions individually, and then fold those results into the final result, rather than apply the fold to each element sequentially in some defined ordering. For functions that are not commutative, the result may differ from that of a fold applied to a non-distributed collection.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.844Z
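A sketch of fold with addition, which is associative and commutative, so the per-partition folding order does not affect the result (same assumptions as above):

(-> (rdd/parallelize spark [1 2 3 4])
    (rdd/fold 0 +))
;; => 10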
(fold-by-key rdd zero f)
(fold-by-key rdd zero partitions-or-partitioner f)
Params: (zeroValue: V, partitioner: Partitioner, func: Function2[V, V, V])
Result: JavaPairRDD[K, V]
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaPairRDD.html
Timestamp: 2020-10-19T01:56:48.088Z
(foreach rdd f)
Params: (f: VoidFunction[T])
Result: Unit
Applies a function f to all elements of this RDD.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.845Z
(foreach-async rdd f)
Params: (f: VoidFunction[T])
Result: JavaFutureAction[Void]
The asynchronous version of the foreach action, which applies a function f to all the elements of this RDD.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.846Z
(foreach-partition rdd f)
Params: (f: VoidFunction[Iterator[T]])
Result: Unit
Applies a function f to each partition of this RDD.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.847Z
(foreach-partition-async rdd f)
Params: (f: VoidFunction[Iterator[T]])
Result: JavaFutureAction[Void]
The asynchronous version of the foreachPartition action, which applies a function f to each partition of this RDD.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.849Z
(full-outer-join left right)
(full-outer-join left right partitions-or-partitioner)
Params: (other: JavaPairRDD[K, W], partitioner: Partitioner)
Result: JavaPairRDD[K, (Optional[V], Optional[W])]
Perform a full outer join of this and other. For each element (k, v) in this, the resulting RDD will either contain all pairs (k, (Some(v), Some(w))) for w in other, or the pair (k, (Some(v), None)) if no elements in other have key k. Similarly, for each element (k, w) in other, the resulting RDD will either contain all pairs (k, (Some(v), Some(w))) for v in this, or the pair (k, (None, Some(w))) if no elements in this have key k. Uses the given Partitioner to partition the output RDD.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaPairRDD.html
Timestamp: 2020-10-19T01:56:48.102Z
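A sketch of a full outer join over two pair RDDs; keys present on only one side come back with an absent value on the other side (same assumptions as above):

(def ages   (rdd/parallelize-pairs spark [["alice" 30] ["bob" 25]]))
(def cities (rdd/parallelize-pairs spark [["alice" "Paris"] ["carol" "Oslo"]]))

(rdd/collect (rdd/full-outer-join ages cities))
;; => one entry for "alice" (both sides present), one for "bob" (age only)
;;    and one for "carol" (city only), the missing side being an absent Optional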
(get-num-partitions rdd)
Params:
Result: Int
Return the number of partitions in this RDD.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.852Z
(get-storage-level rdd)
Params:
Result: StorageLevel
Get the RDD's current storage level, or StorageLevel.NONE if none is set.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.853Z
(glom rdd)
Params: ()
Result: JavaRDD[List[T]]
Return an RDD created by coalescing all elements within each partition into an array.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.854Z
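A sketch of glom, which makes the physical partitioning visible by turning each partition into a single list (same assumptions as above):

(-> (rdd/parallelize spark (range 6))
    rdd/glom
    rdd/collect)
;; => one list per partition, e.g. [[0 1] [2 3] [4 5]] when there are three partitions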
(group-by rdd f)
(group-by rdd f num-partitions)
Params: (f: Function[T, U])
Result: JavaPairRDD[U, Iterable[T]]
Return an RDD of grouped elements. Each group consists of a key and a sequence of elements mapping to that key.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.857Z
(group-by-key rdd)
(group-by-key rdd num-partitions)
Params: (partitioner: Partitioner)
Result: JavaPairRDD[K, Iterable[V]]
Group the values for each key in the RDD into a single sequence. Allows controlling the partitioning of the resulting key-value pair RDD by passing a Partitioner.
If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using JavaPairRDD.reduceByKey or JavaPairRDD.combineByKey will provide much better performance.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaPairRDD.html
Timestamp: 2020-10-19T01:56:48.115Z
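A sketch of grouping all values per key (same assumptions as above); as the note above says, prefer reduceByKey/combineByKey-style aggregation when a sum or average is all that is needed:

(-> (rdd/parallelize-pairs spark [["a" 1] ["a" 2] ["b" 3]])
    rdd/group-by-key
    rdd/collect)
;; => ("a" [1 2]) and ("b" [3]) as key/iterable pairs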
(id rdd)
Params:
Result: Int
A unique ID for this RDD (within its SparkContext).
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.859Z
(initial-value result)
Params:
Result: R
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/partial/PartialResult.html
Timestamp: 2020-10-19T01:56:47.228Z
(intersection)
(intersection rdd)
(intersection left right)
(intersection left right & rdds)
Params: (other: JavaRDD[T])
Result: JavaRDD[T]
Return the intersection of this RDD and another one. The output will not contain any duplicate elements, even if the input RDDs did.
This method performs a shuffle internally.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.860Z
(is-checkpointed rdd)
Params:
Result: Boolean
Return whether this RDD has been checkpointed or not
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.861Z
(is-empty rdd)
Params: ()
Result: Boolean
true if and only if the RDD contains no elements at all. Note that an RDD may be empty even when it has at least 1 partition.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.862Z
(is-initial-value-final result)
Params:
Result: Boolean
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/partial/PartialResult.html
Timestamp: 2020-10-19T01:56:47.229Z
(is-local)
(is-local spark)
Params:
Result: Boolean
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaSparkContext.html
Timestamp: 2020-10-19T01:56:49.531Z
(jars)
(jars spark)
Params:
Result: List[String]
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaSparkContext.html
Timestamp: 2020-10-19T01:56:49.532Z
(java-spark-context spark)
Converts a SparkSession to a JavaSparkContext.
(join left right)
(join left right partitions-or-partitioner)
Params: (other: JavaPairRDD[K, W], partitioner: Partitioner)
Result: JavaPairRDD[K, (V, W)]
Return an RDD containing all pairs of elements with matching keys in this and other. Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in this and (k, v2) is in other. Uses the given Partitioner to partition the output RDD.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaPairRDD.html
Timestamp: 2020-10-19T01:56:48.131Z
(key-by rdd f)
Params: (f: Function[T, U])
Result: JavaPairRDD[U, T]
Creates tuples of the elements in this RDD by applying f.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.865Z
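A sketch of key-by, turning a plain RDD into a pair RDD keyed by a derived value (same assumptions as above):

(-> (rdd/parallelize spark ["apple" "banana" "cherry"])
    (rdd/key-by count)   ;; key each word by its length
    rdd/collect)
;; => (5 "apple") (6 "banana") (6 "cherry")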
(keys rdd)
Params: ()
Result: JavaRDD[K]
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaPairRDD.html
Timestamp: 2020-10-19T01:56:48.139Z
(left-outer-join left right)
(left-outer-join left right partitions-or-partitioner)
Params: (other: JavaPairRDD[K, W], partitioner: Partitioner)
Result: JavaPairRDD[K, (V, Optional[W])]
Perform a left outer join of this and other. For each element (k, v) in this, the resulting RDD will either contain all pairs (k, (v, Some(w))) for w in other, or the pair (k, (v, None)) if no elements in other have key k. Uses the given Partitioner to partition the output RDD.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaPairRDD.html
Timestamp: 2020-10-19T01:56:48.144Z
(local-property k)
(local-property spark k)
Params: (key: String)
Result: String
Get a local property set in this thread, or null if it is missing. See org.apache.spark.api.java.JavaSparkContext.setLocalProperty.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaSparkContext.html
Timestamp: 2020-10-19T01:56:49.512Z
(local?)
(local? spark)
Params:
Result: Boolean
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaSparkContext.html
Timestamp: 2020-10-19T01:56:49.531Z
(lookup rdd k)
Params: (key: K)
Result: List[V]
Return the list of values in the RDD for key key. This operation is done efficiently if the RDD has a known partitioner by only searching the partition that the key maps to.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaPairRDD.html
Timestamp: 2020-10-19T01:56:48.145Z
(map rdd f)
Params: (f: Function[T, R])
Result: JavaRDD[R]
Return a new RDD by applying a function to all elements of this RDD.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.867Z
(map-partitions rdd f)
(map-partitions rdd f preserves-partitioning)
Params: (f: FlatMapFunction[Iterator[T], U])
Result: JavaRDD[U]
Return a new RDD by applying a function to each partition of this RDD.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.870Z
(map-partitions-to-pair rdd f)
(map-partitions-to-pair rdd f preserves-partitioning)
Params: (f: PairFlatMapFunction[Iterator[T], K2, V2])
Result: JavaPairRDD[K2, V2]
Return a new RDD by applying a function to each partition of this RDD.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.875Z
(map-partitions-with-index rdd f)
(map-partitions-with-index rdd f preserves-partitioning)
Params: (f: Function2[Integer, Iterator[T], Iterator[R]], preservesPartitioning: Boolean = false)
Result: JavaRDD[R]
Return a new RDD by applying a function to each partition of this RDD, while tracking the index of the original partition.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.877Z
(map-to-pair rdd f)
Params: (f: PairFunction[T, K2, V2])
Result: JavaPairRDD[K2, V2]
Return a new RDD by applying a function to all elements of this RDD.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.883Z
(map-values rdd f)
Params: (f: Function[V, U])
Result: JavaPairRDD[K, U]
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaPairRDD.html
Timestamp: 2020-10-19T01:56:48.160Z
(mapcat rdd f)
Params: (f: FlatMapFunction[T, U])
Result: JavaRDD[U]
Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.840Z
(mapcat-to-pair rdd f)
Params: (f: PairFlatMapFunction[T, K2, V2])
Result: JavaPairRDD[K2, V2]
Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.842Z
(master)
(master spark)
Params:
Result: String
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaSparkContext.html
Timestamp: 2020-10-19T01:56:49.532Z
(max rdd cmp)
Params: (comp: Comparator[T])
Result: T
Returns the maximum element from this RDD as defined by the specified Comparator[T].
comp: the comparator that defines ordering.
Returns the maximum of the RDD.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.884Z
Flag for controlling the storage of an RDD.
The default behavior of the DataFrame or Dataset. With this storage level, the DataFrame is stored in JVM memory as deserialized objects. When the required storage is greater than the available memory, it stores the excess partitions on disk and reads them back from disk when they are required. It is slower because I/O is involved.
Flag for controlling the storage of an RDD.
Same as the memory-and-disk storage level, but replicates each partition to two cluster nodes.
Flag for controlling the storage of an RDD.
Same as the memory-and-disk storage level, the difference being that it serializes the DataFrame objects in memory, and on disk when space is not available.
Flag for controlling the storage of an RDD.
Same as the memory-and-disk-ser storage level, but replicates each partition to two cluster nodes.
Flag for controlling the storage of an RDD.
Flag for controlling the storage of an RDD.
Same as the memory-only storage level, but replicates each partition to two cluster nodes.
Flag for controlling the storage of an RDD.
Same as memory-only, but it stores the RDD as serialized objects in JVM memory. It takes less memory (space-efficient) than memory-only, since it saves objects in serialized form, at the cost of a few additional CPU cycles to deserialize them.
Flag for controlling the storage of an RDD.
Same as the memory-only-ser storage level, but replicates each partition to two cluster nodes.
(min rdd cmp)
Params: (comp: Comparator[T])
Result: T
Returns the minimum element from this RDD as defined by the specified Comparator[T].
comp: the comparator that defines ordering.
Returns the minimum of the RDD.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.885Z
(name rdd)
Params: ()
Result: String
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.886Z
Flag for controlling the storage of an RDD.
No caching.
(num-partitions rdd)
Params:
Result: Int
Return the number of partitions in this RDD.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.852Z
Flag for controlling the storage of an RDD.
Off-heap refers to objects (serialised to a byte array) that are managed by the operating system but stored outside the process heap in native memory (therefore, they are not processed by the garbage collector). Accessing this data is slightly slower than accessing the on-heap storage but still faster than reading/writing from a disk. The downside is that the user has to manage the allocated memory manually.
(parallelise data)
(parallelise spark data)
Params: (list: List[T], numSlices: Int)
Result: JavaRDD[T]
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaSparkContext.html
Timestamp: 2020-10-19T01:56:49.544Z
(parallelise-doubles data)
(parallelise-doubles spark data)
Params: (list: List[Double], numSlices: Int)
Result: JavaDoubleRDD
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaSparkContext.html
Timestamp: 2020-10-19T01:56:49.546Z
(parallelise-pairs data)
(parallelise-pairs spark data)
Params: (list: List[(K, V)], numSlices: Int)
Result: JavaPairRDD[K, V]
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaSparkContext.html
Timestamp: 2020-10-19T01:56:49.549Z
(parallelize data)
(parallelize spark data)
Params: (list: List[T], numSlices: Int)
Result: JavaRDD[T]
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaSparkContext.html
Timestamp: 2020-10-19T01:56:49.544Z
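A sketch of building an RDD from a local collection and checking how it was split; num-partitions and collect are documented elsewhere in this section (same assumptions as above):

(def xs (rdd/parallelize spark [1 2 3 4 5]))

(rdd/num-partitions xs)   ;; number of slices the data was split into
(rdd/collect xs)          ;; => [1 2 3 4 5]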
(parallelize-doubles data)
(parallelize-doubles spark data)
Params: (list: List[Double], numSlices: Int)
Result: JavaDoubleRDD
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaSparkContext.html
Timestamp: 2020-10-19T01:56:49.546Z
(parallelize-pairs data)
(parallelize-pairs spark data)
Params: (list: List[(K, V)], numSlices: Int)
Result: JavaPairRDD[K, V]
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaSparkContext.html
Timestamp: 2020-10-19T01:56:49.549Z
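A rough usage sketch of the parallelize family above. The namespace alias `g` (required from a hypothetical `example.rdd` namespace) and the `spark` context value are illustration-only assumptions, not part of the documented API; only the arities shown above are taken from this page.

```clojure
;; Hypothetical namespace alias; substitute this library's actual namespace.
(require '[example.rdd :as g])

;; `spark` is assumed to be a JavaSparkContext obtained elsewhere (see spark-context below).
(def numbers (g/parallelize spark [1 2 3 4 5]))              ; JavaRDD
(def prices  (g/parallelize-doubles spark [1.5 2.25 3.0]))   ; JavaDoubleRDD
;; Pairs are written here as two-element vectors; the exact pair representation
;; expected by this wrapper is an assumption.
(def pairs   (g/parallelize-pairs spark [["a" 1] ["b" 2]]))  ; JavaPairRDD
```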
(partition-by rdd partitioner)
Params: (partitioner: Partitioner)
Result: JavaPairRDD[K, V]
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaPairRDD.html
Timestamp: 2020-10-19T01:56:48.168Z
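A sketch of partition-by under the same assumptions as the parallelize example above, using Spark's built-in HashPartitioner:

```clojure
(import 'org.apache.spark.HashPartitioner)

;; Redistribute the pair RDD across 8 partitions by key hash.
(def by-hash (g/partition-by pairs (HashPartitioner. 8)))
```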
(partitioner rdd)
Params:
Result: Optional[Partitioner]
The partitioner of this RDD.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.890Z
(partitions rdd)
Params:
Result: List[Partition]
Set of partitions in this RDD.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.891Z
(persist rdd storage)
Params: (newLevel: StorageLevel)
Result: JavaRDD[T]
Set this RDD's storage level to persist its values across operations after the first time it is computed. This can only be used to assign a new storage level if the RDD does not have a storage level set yet.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.892Z
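A sketch of setting an explicit storage level via Spark's Java-facing StorageLevels constants, under the same assumptions as above:

```clojure
(import 'org.apache.spark.api.java.StorageLevels)

;; Keep the data in memory, spilling to disk when it does not fit.
(def cached (g/persist numbers StorageLevels/MEMORY_AND_DISK))
```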
(persistent-rdds)
(persistent-rdds spark)
Params:
Result: Map[Integer, JavaRDD[_]]
Returns a Java map of JavaRDDs that have marked themselves as persistent via cache() call.
This does not necessarily mean the caching or computation was successful.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaSparkContext.html
Timestamp: 2020-10-19T01:56:49.513Z
(random-split rdd weights)
(random-split rdd weights seed)
Params: (weights: Array[Double])
Result: Array[JavaRDD[T]]
Randomly splits this RDD with the provided weights.
weights: weights for splits; will be normalized if they don't sum to 1.
Returns: the split RDDs in an array.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.902Z
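Continuing the sketch above, a reproducible 80/20 split (passing the weights as a Clojure vector of doubles is an assumed representation of Array[Double]):

```clojure
(def splits (g/random-split numbers [0.8 0.2] 42))  ; fixed seed for reproducibility
(def train  (first splits))
(def test   (second splits))
```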
(rdd? value)
Tests if `value` is an instance of `JavaRDD`.
(reduce rdd f)
Params: (f: Function2[T, T, T])
Result: T
Reduces the elements of this RDD using the specified commutative and associative binary operator.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.904Z
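A sketch of reduce, assuming plain Clojure functions are accepted where Spark expects a Function2:

```clojure
(g/reduce numbers +)   ; => 15 for the [1 2 3 4 5] RDD above
```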
(reduce-by-key rdd f)
(reduce-by-key rdd f partitions-or-partitioner)
Params: (partitioner: Partitioner, func: Function2[V, V, V])
Result: JavaPairRDD[K, V]
Merge the values for each key using an associative and commutative reduce function. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaPairRDD.html
Timestamp: 2020-10-19T01:56:48.188Z
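A word-count style sketch of reduce-by-key under the same assumptions as above; per the arities listed, the optional trailing argument supplies a partition count or a Partitioner:

```clojure
(def word-pairs (g/parallelize-pairs spark [["a" 1] ["b" 1] ["a" 1]]))

(def counts    (g/reduce-by-key word-pairs +))     ; ("a" 2), ("b" 1)
(def counts-4p (g/reduce-by-key word-pairs + 4))   ; same result, across 4 partitions
```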
(reduce-by-key-locally rdd f)
Params: (func: Function2[V, V, V])
Result: Map[K, V]
Merge the values for each key using an associative and commutative reduce function, but return the result immediately to the master as a Map. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaPairRDD.html
Timestamp: 2020-10-19T01:56:48.190Z
(repartition rdd num-partitions)
Params: (numPartitions: Int)
Result: JavaRDD[T]
Return a new RDD that has exactly numPartitions partitions.
Can increase or decrease the level of parallelism in this RDD. Internally, this uses a shuffle to redistribute data.
If you are decreasing the number of partitions in this RDD, consider using coalesce, which can avoid performing a shuffle.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.905Z
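A sketch pairing repartition with num-partitions from above, continuing the running example:

```clojure
(def wide (g/repartition numbers 16))
(g/num-partitions wide)   ; => 16
```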
(repartition-and-sort-within-partitions rdd partitioner)
(repartition-and-sort-within-partitions rdd partitioner cmp)
Params: (partitioner: Partitioner)
Result: JavaPairRDD[K, V]
Repartition the RDD according to the given partitioner and, within each resulting partition, sort records by their keys.
This is more efficient than calling repartition and then sorting within each partition because it can push the sorting down into the shuffle machinery.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaPairRDD.html
Timestamp: 2020-10-19T01:56:48.193Z
(resources)
(resources spark)
Params:
Result: Map[String, ResourceInformation]
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaSparkContext.html
Timestamp: 2020-10-19T01:56:49.550Z
(right-outer-join left right)
(right-outer-join left right partitions-or-partitioner)
Params: (other: JavaPairRDD[K, W], partitioner: Partitioner)
Result: JavaPairRDD[K, (Optional[V], W)]
Perform a right outer join of this and other. For each element (k, w) in other, the resulting RDD will either contain all pairs (k, (Some(v), w)) for v in this, or the pair (k, (None, w)) if no elements in this have key k. Uses the given Partitioner to partition the output RDD.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaPairRDD.html
Timestamp: 2020-10-19T01:56:48.198Z
(sample rdd with-replacement fraction)
(sample rdd with-replacement fraction seed)
Params: (withReplacement: Boolean, fraction: Double)
Result: JavaRDD[T]
Return a sampled subset of this RDD with a random seed.
withReplacement: whether elements can be sampled multiple times (replaced when sampled out).
fraction: expected size of the sample as a fraction of this RDD's size. Without replacement: the probability that each element is chosen; fraction must be in [0, 1]. With replacement: the expected number of times each element is chosen; fraction must be greater than or equal to 0.
This is NOT guaranteed to provide exactly the fraction of the count of the given RDD.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.908Z
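A sketch of sample, continuing the running example above:

```clojure
;; Roughly 10% of the elements, without replacement, with a fixed seed.
(def tenth (g/sample numbers false 0.1 42))
```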
(sample-by-key rdd with-replacement fractions)
(sample-by-key rdd with-replacement fractions seed)
Params: (withReplacement: Boolean, fractions: Map[K, Double], seed: Long)
Result: JavaPairRDD[K, V]
Return a subset of this RDD sampled by key (via stratified sampling).
Create a sample of this RDD using variable sampling rates for different keys as specified by fractions, a key to sampling rate map, via simple random sampling with one pass over the RDD, to produce a sample of size that's approximately equal to the sum of math.ceil(numItems * samplingRate) over all key values.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaPairRDD.html
Timestamp: 2020-10-19T01:56:48.203Z
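A stratified-sampling sketch for sample-by-key; passing the fractions as a Clojure map is an assumption about how this wrapper represents Map[K, Double]:

```clojure
;; Keep ~50% of the "a" pairs and ~10% of the "b" pairs.
(def stratified (g/sample-by-key word-pairs false {"a" 0.5 "b" 0.1} 42))
```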
(sample-by-key-exact rdd with-replacement fractions)
(sample-by-key-exact rdd with-replacement fractions seed)
Params: (withReplacement: Boolean, fractions: Map[K, Double], seed: Long)
Result: JavaPairRDD[K, V]
Return a subset of this RDD sampled by key (via stratified sampling) containing exactly math.ceil(numItems * samplingRate) for each stratum (group of pairs with the same key).
This method differs from sampleByKey in that we make additional passes over the RDD to create a sample size that's exactly equal to the sum of math.ceil(numItems * samplingRate) over all key values with a 99.99% confidence. When sampling without replacement, we need one additional pass over the RDD to guarantee sample size; when sampling with replacement, we need two additional passes.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaPairRDD.html
Timestamp: 2020-10-19T01:56:48.206Z
(save-as-text-file rdd path)
Params: (path: String)
Result: Unit
Save this RDD as a text file, using string representations of elements.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.911Z
(sc)
(sc spark)
Params:
Result: SparkContext
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaSparkContext.html
Timestamp: 2020-10-19T01:56:49.550Z
(sort-by-key rdd)
(sort-by-key rdd asc)
Params: ()
Result: JavaPairRDD[K, V]
Sort the RDD by key, so that each partition contains a sorted range of the elements in ascending order. Calling collect or save on the resulting RDD will return or output an ordered list of records (in the save case, they will be written to multiple part-X files in the filesystem, in order of the keys).
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaPairRDD.html
Timestamp: 2020-10-19T01:56:48.231Z
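A sketch of sort-by-key, continuing the example above; treating the optional second argument as an ascending? boolean flag is an assumption based on the two arities listed:

```clojure
(def ascending  (g/sort-by-key word-pairs))         ; ascending by key
(def descending (g/sort-by-key word-pairs false))   ; descending by key
```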
(spark-context)
(spark-context spark)
Params:
Result: SparkContext
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaSparkContext.html
Timestamp: 2020-10-19T01:56:49.550Z
(spark-home)
(spark-home spark)
Params: ()
Result: Optional[String]
Get Spark's home location from a value set through the constructor, the spark.home Java property, or the SPARK_HOME environment variable (in that order of preference). If none of these is set, return None.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaSparkContext.html
Timestamp: 2020-10-19T01:56:49.518Z
(storage-level rdd)
Params:
Result: StorageLevel
Get the RDD's current storage level, or StorageLevel.NONE if none is set.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.853Z
(subtract)
(subtract rdd)
(subtract left right)
(subtract left right arg)
(subtract left right arg & rdds)
Params: (other: JavaRDD[T])
Result: JavaRDD[T]
Return an RDD with the elements from this that are not in other.
Uses this RDD's partitioner/partition size, because even if other is huge, the resulting RDD will be no larger than this one.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.917Z
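A sketch of subtract, continuing the running example:

```clojure
(def evens (g/parallelize spark [2 4]))
(def odds  (g/subtract numbers evens))   ; 1, 3, 5
```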
(subtract-by-key left right)
(subtract-by-key left right partitions-or-partitioner)
Params: (other: JavaPairRDD[K, W])
Result: JavaPairRDD[K, V]
Return an RDD with the pairs from this whose keys are not in other.
Uses this RDD's partitioner/partition size, because even if other is huge, the resulting RDD will be no larger than this one.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaPairRDD.html
Timestamp: 2020-10-19T01:56:48.240Z
(take rdd n)
Params: (num: Int)
Result: List[T]
Take the first num elements of the RDD. This currently scans the partitions one by one, so it will be slow if a lot of partitions are required. In that case, use collect() to get the whole RDD instead.
Note: this method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver's memory.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.923Z
(take-async rdd n)
Params: (num: Int)
Result: JavaFutureAction[List[T]]
The asynchronous version of the take action, which returns a future for retrieving the first num elements of this RDD.
Note: this method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver's memory.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.924Z
(take-ordered rdd n)
(take-ordered rdd n cmp)
Params: (num: Int, comp: Comparator[T])
Result: List[T]
Returns the first k (smallest) elements from this RDD as defined by the specified Comparator[T] and maintains the order.
num: k, the number of elements to return.
comp: the comparator that defines the order.
Returns: an array of top elements.
Note: this method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver's memory.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.927Z
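A sketch of take-ordered under the same assumptions as above; clojure.core/comparator is used to build the java.util.Comparator the optional argument expects:

```clojure
(g/take-ordered numbers 3)                   ; => (1 2 3), natural ascending order
(g/take-ordered numbers 3 (comparator >))    ; => (5 4 3), reversed ordering
```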
(take-sample rdd with-replacement n)
(take-sample rdd with-replacement n seed)
Params: (withReplacement: Boolean, num: Int)
Result: List[T]
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.929Z
Params: (path: String)
Result: JavaRDD[String]
Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and return it as an RDD of Strings. The text files must be encoded as UTF-8.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaSparkContext.html
Timestamp: 2020-10-19T01:56:49.570Z
(top rdd n)
(top rdd n cmp)
Params: (num: Int, comp: Comparator[T])
Result: List[T]
Returns the top k (largest) elements from this RDD as defined by the specified Comparator[T] and maintains the order.
num: k, the number of top elements to return.
comp: the comparator that defines the order.
Returns: an array of top elements.
Note: this method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver's memory.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.935Z
(union)
(union rdd)
(union left right)
(union left right & rdds)
Params: (other: JavaRDD[T])
Result: JavaRDD[T]
Return the union of this RDD and another one. Any identical elements will appear multiple times (use .distinct() to eliminate them).
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.942Z
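A sketch of union over the RDDs defined in the sketches above:

```clojure
;; Concatenation: identical elements (here 2 and 4) appear twice.
(def all (g/union numbers evens))
```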
(unpersist rdd)
(unpersist rdd blocking)
Params: ()
Result: JavaRDD[T]
Mark the RDD as non-persistent, and remove all blocks for it from memory and disk. This method blocks until all blocks are deleted.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.944Z
(vals rdd)
Params: ()
Result: JavaRDD[V]
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaPairRDD.html
Timestamp: 2020-10-19T01:56:48.266Z
(values rdd)
Params: ()
Result: JavaRDD[V]
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaPairRDD.html
Timestamp: 2020-10-19T01:56:48.266Z
(version)
(version spark)
Params:
Result: String
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaSparkContext.html
Timestamp: 2020-10-19T01:56:49.576Z
Params: (path: String, minPartitions: Int)
Result: JavaPairRDD[String, String]
Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file. The text files must be encoded as UTF-8.
For example, reading a directory of small text files produces an RDD with one (file path, file content) pair per file.
minPartitions: a suggested value for the minimal number of splits for the input data.
Note: small files are preferred; large files are also allowable, but may cause bad performance.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaSparkContext.html
Timestamp: 2020-10-19T01:56:49.582Z
(zip left right)
Params: (other: JavaRDDLike[U, _])
Result: JavaPairRDD[T, U]
Zips this RDD with another one, returning key-value pairs with the first element in each RDD, second element in each RDD, etc. Assumes that the two RDDs have the same number of partitions and the same number of elements in each partition (e.g. one was made through a map on the other).
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.950Z
(zip-partitions left right f)
Params: (other: JavaRDDLike[U, _], f: FlatMapFunction2[Iterator[T], Iterator[U], V])
Result: JavaRDD[V]
Zip this RDD's partitions with one (or more) RDD(s) and return a new RDD by applying a function to the zipped partitions. Assumes that all the RDDs have the same number of partitions, but does not require them to have the same number of elements in each partition.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.952Z
(zip-with-index rdd)
Params: ()
Result: JavaPairRDD[T, Long]
Zips this RDD with its element indices. The ordering is first based on the partition index and then the ordering of items within each partition. So the first item in the first partition gets index 0, and the last item in the last partition receives the largest index. This is similar to Scala's zipWithIndex, but it uses Long instead of Int as the index type. This method needs to trigger a Spark job when this RDD contains more than one partition.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.953Z
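A sketch of zip-with-index under the same assumptions as the earlier examples:

```clojure
(def letters (g/parallelize spark ["a" "b" "c"]))
(def indexed (g/zip-with-index letters))   ; ("a" 0), ("b" 1), ("c" 2)
```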
(zip-with-unique-id rdd)
Params: ()
Result: JavaPairRDD[T, Long]
Zips this RDD with generated unique Long ids. Items in the kth partition will get ids k, n+k, 2*n+k, ..., where n is the number of partitions. So there may exist gaps, but this method won't trigger a spark job, which is different from org.apache.spark.rdd.RDD#zipWithIndex.
Source: https://spark.apache.org/docs/3.0.1/api/scala/org/apache/spark/api/java/JavaRDD.html
Timestamp: 2020-10-19T01:56:48.954Z