thurber

Apache Beam and Google Cloud Dataflow on ~~steroids~~ Clojure. The walkthrough explains everything.

Release notes are here.

Quickstart
Project Goals
Documentation
Demos
- Word Count
- Mobile Gaming Examples
I/O Transforms
Performance
- Tips
More Help
Donate

Quickstart

Clone this repository and cd into it.
Start a REPL with lein repl.
Copy and paste the following:

(ns try-thurber
  (:require [thurber :as th]
            [thurber.sugar :refer :all]))

(->
  (th/create-pipeline)

  (th/apply!
    (read-text-file
      "demo/word_count/lorem.txt")
    (th/fn* extract-words [sentence]
      (remove empty? (.split sentence "[^\\p{L}]+")))
    (count-per-element)
    (th/fn* format-as-text
      [[k v]] (format "%s: %d" k v))
    (log-sink))

  (th/run-pipeline!))

Output:

...
INFO thurber - extremely: 1
INFO thurber - undertakes: 1
INFO thurber - pleasure: 7
INFO thurber - you: 2
...

Project Goals

Enable Clojure
- Bring Clojure's powerful, expressive toolkit (destructuring, immutability, REPL, async tools, etc.) to Apache Beam.
REPL Oriented
- Functions are idiomatic, pure Clojure functions by default. (For example, lazy sequences are supported, which makes iterative event output optional.)
- Develop and test pipelines incrementally from the REPL.
- Evaluate/learn Beam semantics (windowing, triggering) interactively.
Avoid Macros
- Limit macro infection. Most thurber constructions are macro-less, and use of any thurber macro construction (like inline functions) is optional.
AOT Nothing
- Fully dynamic experience. Reload namespaces at whim. thurber's dependencies on Beam, Clojure, etc. are completely dynamic/floatable, with no forced AOT'd dependencies.
No Lock-in
- Pipelines can be composed of Clojure and Java transforms. Incrementally refactor your pipeline to Clojure or back to Java.
Not Afraid of Java Interop
- Wherever Clojure's Java Interop is performant and works cleanly with Beam's fluent API, encourage it; facade/sugar functions are simple to create and left to your own domain-specific implementations.
Completeness
- Support all Beam capabilities (Transforms, State & Timers, Side Inputs, Output Tags, etc.)
Performance
- Be finely tuned for data streaming.
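
The REPL-oriented goal above can be tried without Beam at all. The following is a minimal sketch, not part of thurber: it restates the two Quickstart step functions as ordinary Clojure functions and uses clojure.core's frequencies as a local stand-in for count-per-element. The namespace name and the word-counts helper are hypothetical and exist only for illustration.

```clojure
(ns try-pure-fns)

;; Plain Clojure versions of the two Quickstart step functions.
;; Nothing here depends on Beam or thurber, so each form can be
;; evaluated and tested directly at the REPL.

(defn extract-words
  "Splits a sentence on runs of non-letter characters and drops
  empty tokens. Returns a lazy sequence of words."
  [^String sentence]
  (remove empty? (.split sentence "[^\\p{L}]+")))

(defn format-as-text
  "Formats a [word count] pair as \"word: count\"."
  [[k v]]
  (format "%s: %d" k v))

;; Hypothetical helper: simulates the pipeline locally.
;; `frequencies` stands in for count-per-element (an assumption made
;; for this sketch; it is not how Beam computes counts).
(defn word-counts [lines]
  (->> lines
       (mapcat extract-words)
       frequencies
       (map format-as-text)
       sort))

(word-counts ["pleasure, you" "you undertakes pleasure"])
;; => ("pleasure: 2" "undertakes: 1" "you: 2")
```

Because the step functions are pure, they can be unit tested in isolation before being wired into a pipeline.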

Documentation

Demos

Each namespace in the demo/ source directory is a pipeline written in Clojure using thurber. Comments in the source highlight salient aspects of thurber usage.

Along with the code walkthrough, these are the best way to learn thurber's API, and they serve as recipes for various scenarios (use of tags, side inputs, windowing, combining, Beam's State API, etc.).

To execute a demo, start a REPL and evaluate (demo!) from within the respective namespace.

Word Count

The word_count package contains ports of Beam's Word Count Examples to Clojure/thurber.

Mobile Gaming Examples

Beam's Mobile Gaming Examples (documented here) have been ported to Clojure using thurber.

These are fully functional ports. They require deployment to GCP Dataflow:

How to Run Beam Mobile Gaming Examples (thurber): Detailed Instructions

I/O Transforms

Beam has many I/O transforms — see here.

KafkaIO, for example, has some configuration nuances:

./demo/kafka/simple-consumer shows how to configure a Kafka-consuming pipeline using thurber/Clojure.

If you need help using thurber/Clojure with another I/O transform, you can open an issue to request any thurber demo code you'd like to see.

Performance

Streaming/big data implies hot code paths. thurber's core has been tuned for performance in various ways, but you may benefit from tuning your own pipeline code:

Performance Tuning Tips

Use Clojure type hints liberally within your stream functions.
- The cost of Java method invocation-by-reflection can be very high, and type hints can have a large impact in these cases.
- A helpful list of primitive type hint aliases can be found here.
Use Clojure's high-performance primitive operations.
Follow Clojure's optimization tips.
- For example: aget is explicitly overloaded for primitive arrays — type hinting is key here.
Compare the gaming demos user-score and user-score-opt; the latter is an optimized version of the former pipeline. (The optimized version performs comparably to the Java demo in the Beam source.)
Be explicit which JVM/JDK version is executing your code at runtime. Mature JVM versions have stronger performance in many cases than earlier versions.
- Note: Dataflow will pick a JVM/JDK version for your runtime/worker nodes based on the Java version you use to launch your pipeline!
Profile your pipeline!
- If deploying to GCP, use Dataflow profiling to zero in on areas to optimize.
When in doubt or in a bind, you can always fall back to Java for sensitive code paths.
- Note: This should rarely, if ever, be needed to achieve optimal performance.
In general (this is not Clojure/thurber-specific) you should understand Beam "fusion" and when to break fusion to achieve greater linear scalability. More info here.
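
Several of the tips above can be illustrated in plain Clojure. The following is a minimal sketch with hypothetical function names (shout-reflective, shout-hinted, sum-scores, jvm-version); it uses only clojure.core and the JDK, not thurber or Beam.

```clojure
;; Ask the compiler to report every reflective Java call it emits.
(set! *warn-on-reflection* true)

;; Without a hint the compiler cannot resolve .toUpperCase at compile
;; time, so it emits a reflective call and prints a reflection warning.
(defn shout-reflective [s]
  (.toUpperCase s))

;; With ^String the call compiles to a direct method invocation.
(defn shout-hinted [^String s]
  (.toUpperCase s))

;; The ^longs hint lets aget resolve to its primitive long[] overload,
;; so the loop adds primitive longs instead of boxed numbers.
(defn sum-scores ^long [^longs scores]
  (areduce scores i acc 0
           (+ acc (aget scores i))))

;; Reports which JVM version is actually executing the code.
(defn jvm-version []
  (System/getProperty "java.version"))

(sum-scores (long-array [1 2 3]))
;; => 6
```

Both shout functions return the same result; the difference is only in how the method call is dispatched. Logging (jvm-version) from worker code is one way to confirm which JVM your pipeline is really running on.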

More Help

Ask a question by opening an issue.

License

Like Clojure, distributed under the Eclipse Public License.
