(alpha release coming soon)
Apache Beam and Google Cloud Dataflow on ~~steroids~~ Clojure.
To try it, `cd` into this repository, start a REPL with `lein repl`, and evaluate:
```clojure
(ns try-thurber
  (:require [thurber :as th]
            [clojure.string :as str])
  (:import (org.apache.beam.sdk.io TextIO)))

(defn- extract-words [sentence]
  (remove empty? (str/split sentence #"[^\p{L}]+")))

(defn- format-as-text [[k v]]
  (format "%s: %d" k v))

(.run
  (doto (th/create-pipeline)
    (th/apply!
      (-> (TextIO/read)
          (.from "demo/word_count/lorem.txt"))
      #'extract-words
      #'th/->kv
      (th/count-per-key)
      #'format-as-text
      #'th/log-elem*)))
```
Beam's Java API is oriented around constructing a `Pipeline` and issuing a series of `.apply` invocations to mutate the pipeline and build up a directed, acyclic graph of stages.

`thurber/apply!` applies a series of transforms to a `Pipeline` or `PCollection`. (When pipeline graphs fork, multiple `apply!` invocations may be needed to create the different paths.)
`thurber/comp*` is how composite transforms are made. The result of `comp*` is a transform that can be used again within a subsequent call to `apply!` or `comp*`.
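As a sketch, the counting stages from the word-count example above could be grouped into one reusable composite (whether `comp*` also accepts an optional name argument is not shown here):

```clojure
;; A sketch: group the counting stages from the word-count example
;; into a single composite transform.
(def count-words
  (th/comp* #'extract-words
            #'th/->kv
            (th/count-per-key)))

;; count-words can then be used wherever a transform is expected, e.g.:
;; (th/apply! pipeline source count-words #'format-as-text)
```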
Any `PTransform` instance can be provided to `apply!` and `comp*`, but thurber supports other transform representations:

- A var referencing a Clojure function (e.g., `#'extract-words`) acts as a `ParDo` over each element. If the function returns a `seq`, then all items in the seq are emitted (often these seqs are lazy and produce many elements); if it returns `nil`, no automatic output occurs.
- Some `ParDo` implementations will need first-class access to the current `ProcessContext` instance. `thurber/*process-context*` is always bound to this current context; such functions can emit through the context directly and return `nil` so that no automatic emissions occur.
- `thurber/partial*`, `filter*`, `combine-per-key`, and other thurber functions produce transform representations that can be used in `apply!` and `comp*`.
Every Beam `PCollection` must have a `Coder`. The `thurber/nippy` coder de/serializes nippy-supported data types (this includes all Clojure core data types); `thurber/nippy-kv` codes Beam `KV` values whose key and value are nippy-supported data types.

Coders can be specified as Clojure metadata on function vars, or directly within a thurber transform representation. The default coder is `thurber/nippy`, which is appropriate for var-referenced functions (`ParDo`s) that output Clojure data types.
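As a heavily hedged sketch of the metadata approach — the `:th/coder` key below is an assumption, not confirmed API; consult the demos for the actual mechanism:

```clojure
;; Hypothetical: :th/coder is an assumed metadata key for overriding
;; the default thurber/nippy coder on a var-referenced function.
(defn- ^{:th/coder th/nippy-kv} my-kv-producing-fn [elem]
  ;; ...produce a Beam KV from elem...
  )
```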
Each namespace in the `demo/` source directory is a pipeline written in Clojure using thurber. Comments in the source highlight salient aspects of thurber usage. These demos are the best way to learn thurber's API and serve as recipes for various scenarios (use of tags, side inputs, windowing, combining, Beam's State API, etc.).

To execute a demo, start a REPL and evaluate `(demo!)` from within the respective namespace.
The word_count
package contains ports of Beam's
Word Count Examples
to Clojure/thurber.
Beam's Mobile Gaming Examples have been ported to Clojure using thurber.
These are fully functional ports but require deployment to GCP Dataflow. (How-to notes coming soon.)
First make your pipeline work; then make it fast. Streaming and big-data workloads imply hot code paths, so use Clojure type hints liberally.
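For example, a type hint on a Java interop call avoids reflection on a hot path (plain Clojure, independent of thurber):

```clojure
;; Surface reflective call sites during development.
(set! *warn-on-reflection* true)

(defn- shout [^String s]
  ;; The ^String hint makes .toUpperCase a direct, non-reflective call.
  (.toUpperCase s))
```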
Copyright © 2020 Aaron Dixon
Distributed under the Eclipse Public License, the same license as Clojure.