Concepts — org.onyxplatform/onyx 0.14.6

We’ll take a quick overview of some terms you’ll see in the rest of this user guide.

A segment is the unit of data in Onyx, and it’s represented by a Clojure map. Segments represent the data flowing through the cluster. Segments are the only shape of data that Onyx allows you to emit between functions.

Task

A task is the smallest unit of work in Onyx. It represents an activity of either input, processing, or output.

Workflow

A workflow is the structural specification of an Onyx program. Its purpose is to articulate the paths that data flows through the cluster at runtime. It is specified via a directed, acyclic graph.

The workflow representation is a Clojure vector of vectors. Each inner vector contains exactly two elements, which are keywords. The keywords represent nodes in the graph, and the vector represents a directed edge from the first node to the second.

;;;    in
;;;    |
;;; increment
;;;    |
;;;  output
[[:in :increment] [:increment :out]]

;;;            input
;;;             /\
;;; processing-1 processing-2
;;;     |             |
;;;  output-1      output-2

[[:input :processing-1]
 [:input :processing-2]
 [:processing-1 :output-1]
 [:processing-2 :output-2]]

;;;            input
;;;             /\
;;; processing-1 processing-2
;;;         \      /
;;;          output

[[:input :processing-1]
 [:input :processing-2]
 [:processing-1 :output]
 [:processing-2 :output]]

Example projects: flat-workflow, multi-output-workflow

Catalog

All inputs, outputs, reducers, and functions in a workflow must be described via a catalog. A catalog is a vector of maps, strikingly similar to Datomic’s schema. Configuration and docstrings are described in the catalog.

Example:

[{:onyx/name :in
  :onyx/plugin :onyx.plugin.core-async/input
  :onyx/type :input
  :onyx/medium :core.async
  :onyx/batch-size batch-size
  :onyx/max-peers 1
  :onyx/doc "Reads segments from a core.async channel"}

 {:onyx/name :inc
  :onyx/fn :onyx.peer.min-peers-test/my-inc
  :onyx/type :function
  :onyx/batch-size batch-size}

 {:onyx/name :out
  :onyx/plugin :onyx.plugin.core-async/output
  :onyx/type :output
  :onyx/medium :core.async
  :onyx/batch-size batch-size
  :onyx/max-peers 1
  :onyx/doc "Writes segments to a core.async channel"}]

Flow Conditions

In contrast to a workflow, flow conditions specify on a segment-by-segment basis which direction data should flow determined by predicate functions. This is helpful for conditionally processing a segment based off of its content.

Example:

[{:flow/from :input-stream
  :flow/to [:process-adults]
  :flow/predicate :my.ns/adult?
  :flow/doc "Emits segment if this segment is an adult."}

Function

A function is a construct that receives segments and emits segments for further processing. It literally is a Clojure function.

Lifecycle

A lifecycle is a construct that describes the lifetime of a task. There is an entire chapter devoted to lifecycles, but to be brief, a lifecycle allows you to hook in and execute arbitrary code at critical points during a task. A lifecycle carries a context map that you can merge results back into for use later.

Windows

Windows are a construct that partitions a possible unbounded sequence of data into finite pieces, allowing aggregations to be specified. This lets you treat an infinite sequence of data as if it were finite over a given period of time.

Plugin

A plugin is a means for hooking into data sources to extract data as input and produce data as output. Onyx comes with a few plugins, but you can craft your own, too.

Sentinel

A sentinel is a value that can be pushed into Onyx to signal the end of a stream of data. This effectively lets Onyx switch between streaming and batching mode. The sentinel in Onyx is represented by the Clojure keyword :done.

Peer

A Peer is a node in the cluster responsible for processing data. A single "peer" refers to a physical machine, though we often use the terms peer and virtual peer interchangeably when the difference doesn’t matter.

Virtual Peer

A Virtual Peer refers to a single peer process running on a single physical machine. A single Virtual Peer executes at most one task at a time.

Job

A job is the collection of a workflow, catalog, flow conditions, lifecycles, and execution parameters. A job is most coarse unit of work, and every task is associated with exactly one job - hence a peer can only be working at most one job at any given time.