Apache Beam and Google Cloud Dataflow on ~~steroids~~ Clojure.
This is alpha software. Bleeding-edge and all that. API subject to mood swings.
cd into this repository, then run lein repl.
    (ns try-thurber
      (:require [thurber :as th]
                [clojure.string :as str])
      (:import (org.apache.beam.sdk.io TextIO)))

    (defn- extract-words [sentence]
      (remove empty? (str/split sentence #"[^\p{L}]+")))

    (.run
      (doto (th/create-pipeline)
        (th/apply!
          (-> (TextIO/read)
              (.from "demo/word_count/lorem.txt"))
          #'extract-words
          #'th/->kv
          (th/count-per-key)
          (th/inline
            (fn format-as-text
              [[k v]] (format "%s: %d" k v)))
          #'th/log-elem*)))
You should see streaming word counts:
...
INFO thurber - extremely: 1
INFO thurber - undertakes: 1
INFO thurber - pleasure: 7
INFO thurber - you: 2
...
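
For intuition, here is a rough in-memory analogue of the pipeline above, written in plain Clojure with no Beam or thurber involved. It is only a sketch: it assumes the ->kv and count-per-key steps amount to counting occurrences of each word, which the logged output suggests. The word-counts helper and the sample sentences are made up for illustration and are not part of thurber.

```clojure
(require '[clojure.string :as str])

;; Same splitting logic as extract-words in the pipeline above:
;; split on runs of non-letter characters and drop empty strings.
(defn extract-words [sentence]
  (remove empty? (str/split sentence #"[^\p{L}]+")))

;; Hypothetical helper (not a thurber function): counts words across
;; all lines in memory instead of in a distributed PCollection.
(defn word-counts [lines]
  (frequencies (mapcat extract-words lines)))

;; Same formatting as the inline format-as-text step.
(defn format-as-text [[k v]]
  (format "%s: %d" k v))

;; Sorted only to make the printed order deterministic.
(run! println
      (sort (map format-as-text
                 (word-counts ["pleasure and pain" "pain, or pleasure?"]))))
;; and: 1
;; or: 1
;; pain: 2
;; pleasure: 2
```

Unlike this sketch, the real pipeline runs each step as a Beam transform, so elements may be processed in parallel and logged in any order.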
Each namespace in the demo/ source directory is a pipeline written in Clojure using thurber. Comments in the source highlight salient aspects of thurber usage.
These demos are the best way to learn thurber's API and serve as recipes for various scenarios (use of tags, side inputs, windowing, combining, Beam's State API, etc.).
To execute a demo, start a REPL and evaluate (demo!) from within the respective namespace.
The word_count package contains ports of Beam's Word Count Examples to Clojure/thurber.
Beam's Mobile Gaming Examples have been ported to Clojure using thurber.
These are fully functional ports but require deployment to GCP Dataflow. (How-to notes coming soon.)
First make your pipeline work. Then make it fast.
Streaming/big data implies hot code paths.
Use Clojure type hints liberally.
If deploying to GCP, use Dataflow profiling to zero in on areas to optimize.
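
As a minimal sketch of the type-hint advice above: without a hint, Clojure resolves Java interop calls by reflection at runtime, which is costly on a per-element code path. The function names below are hypothetical and not taken from thurber or its demos.

```clojure
;; Ask the compiler to report interop calls it cannot resolve statically.
(set! *warn-on-reflection* true)

;; No hint: the compiler emits a reflection warning for .length,
;; and the method is looked up reflectively on every call.
(defn word-length-reflective [word]
  (.length word))

;; With a ^String hint the call compiles to a direct method invocation.
(defn word-length [^String word]
  (.length word))

(word-length "pleasure")
;; => 8
```

Both functions return the same result; only the hinted version avoids the reflective lookup, which matters when a function runs once per element in a large or unbounded collection.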
Copyright © 2020 Aaron Dixon
Like Clojure, distributed under the Eclipse Public License.