Functions are idiomatic, pure Clojure functions by default. (For example, lazy
sequences are supported as return values, so imperative, one-at-a-time event output is optional/unnecessary; see the sketch below.)
Develop and test pipelines incrementally from the REPL.
Limit macro infection. Most thurber constructions are macro-less; use of any
thurber macro constructions (such as inline functions) is optional.
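A minimal sketch of what this looks like in practice (the namespace name is illustrative and the pipeline wiring is omitted; only the transform function itself is shown):

```clojure
(ns my.pipeline
  (:require [clojure.string :as str]))

;; An ordinary, macro-free, pure Clojure function. When used as a transform
;; step, returning a (lazy) sequence emits each element downstream; no
;; imperative per-element output calls are required.
(defn extract-words [^String sentence]
  (remove str/blank? (str/split sentence #"[^\p{L}]+")))

;; Because it is just a function, it is trivially testable from the REPL
;; with no Beam machinery involved:
(extract-words "the quick brown fox")
;; => ("the" "quick" "brown" "fox")
```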
AOT Nothing
Fully dynamic experience. Reload namespaces at whim. thurber's dependencies on
Beam, Clojure, etc. are completely dynamic/floatable; there are no forced
AOT-compiled dependencies.
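For example, with tools.deps you choose the Beam and Clojure versions yourself. The sketch below is illustrative only: the versions are placeholders and the thurber coordinate shown is hypothetical (consult Clojars/the project docs for the real one).

```clojure
;; deps.edn (sketch): versions are placeholders; because nothing is
;; AOT-compiled, Beam and Clojure versions can float to whatever you choose.
{:deps
 {org.clojure/clojure                 {:mvn/version "1.11.1"}
  org.apache.beam/beam-sdks-java-core {:mvn/version "2.41.0"}
  ;; hypothetical coordinate for thurber itself
  com.github.atdixon/thurber          {:mvn/version "0.0.0"}}}
```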
No Lock-in
Pipelines can be composed of Clojure and Java transforms.
Incrementally refactor your pipeline to Clojure or back to Java.
Not Afraid of Java Interop
Wherever Clojure's Java interop is performant and works cleanly with Beam's
fluent API, thurber encourages its direct use; facade/sugar functions are simple
to create and are left to your own domain-specific implementations.
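A hedged sketch of a mixed pipeline: the th/create-pipeline and th/apply! names are used here as assumed thurber entry points (check the actual API), while the TextIO and Count calls are plain Beam fluent API driven through ordinary Clojure interop.

```clojure
(ns my.mixed-pipeline
  (:require [thurber :as th]) ; alias assumed
  (:import (org.apache.beam.sdk.io TextIO)
           (org.apache.beam.sdk.transforms Count)))

;; A plain Clojure step, referenced by var below.
(defn extract-words [^String line]
  (remove empty? (.split line "[^\\p{L}]+")))

(defn run-demo! []
  (let [pipeline (th/create-pipeline)] ; assumed thurber fn
    (th/apply! pipeline                ; assumed thurber fn
      ;; Java transform via Beam's fluent builder API:
      (-> (TextIO/read) (.from "lorem.txt"))
      ;; Clojure transform, supplied as a var:
      #'extract-words
      ;; Stock Java transform:
      (Count/perElement))
    (.run pipeline)))
```

Any individual step can be swapped between Java and Clojure without disturbing the rest of the composition.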
Completeness
Support all Beam capabilities (Transforms, State & Timers, Side Inputs,
Output Tags, etc.)
Each namespace in the demo/ source directory is a pipeline written in Clojure
using thurber. Comments in the source highlight salient aspects of thurber usage.
Along with the code walkthrough, these are the best way to learn
thurber's API, and they serve as recipes for various scenarios (use of tags, side inputs,
windowing, combining, Beam's State API, etc.).
To execute a demo, start a REPL and evaluate (demo!) from within the respective namespace.
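For example, assuming a hypothetical demo namespace name:

```clojure
;; From a REPL with the demo/ sources on the classpath. The namespace name
;; below is hypothetical; substitute whichever demo you want to run.
(require 'word-count)
(in-ns 'word-count)
(demo!)
```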
Streaming/big data implies hot code paths. thurber's core has been tuned for performance in various ways,
but you may benefit from tuning your own pipeline code:
For example: aget is explicitly overloaded for primitive arrays; type hinting is key here (see the sketch below).
Compare the gaming demos user-score and user-score-opt;
the latter is an optimized version of the former pipeline. (The optimized version is comparable in
performance to the equivalent Java demo in the Beam source.)
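To illustrate the type-hinting point (this is generic Clojure, not thurber-specific), a ^longs hint lets the compiler resolve aget to its primitive-array overload and keeps reflection off the hot path:

```clojure
(set! *warn-on-reflection* true)

;; With the ^longs hint, aget and alength compile down to fast primitive-array
;; operations; without it they go through a much slower reflective path.
(defn sum-hinted ^long [^longs xs]
  (loop [i 0 acc 0]
    (if (< i (alength xs))
      (recur (inc i) (+ acc (aget xs i)))
      acc)))

(sum-hinted (long-array [1 2 3 4]))
;; => 10
```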
Be explicit about which JVM/JDK version executes your code at runtime. Mature JVM versions often have
stronger performance than earlier versions.
Note: Dataflow will pick a JVM/JDK version for your runtime/worker nodes based on the Java version you
use to launch your pipeline!
Profile your pipeline!
If deploying to GCP, use Dataflow profiling
to zero in on areas to optimize.
When in doubt or in a bind, you can always fall back to Java for sensitive code paths.
Note: this should rarely, if ever, be needed to achieve optimal performance.
In general (this is not Clojure/thurber-specific), you should understand Beam "fusion" and when to break fusion to achieve
greater linear scalability. More info here.
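For instance, one common runner-agnostic way to break fusion is to place a Reshuffle between a high-fanout step and the expensive step that follows. A minimal sketch using Beam's Java API via interop (how the transform is applied into your pipeline is elided):

```clojure
(ns my.fusion-sketch
  (:import (org.apache.beam.sdk.transforms Reshuffle)))

;; Inserting this transform between a step that fans out and an expensive
;; downstream step prevents the runner from fusing the two into one stage,
;; so the expensive work can redistribute across workers.
(defn fusion-break []
  (Reshuffle/viaRandomKey))
```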