Monitoring

When you run an Onyx cluster in production, knowing what your application-specific code is doing is not enough. You also need insight into how Onyx is operating internally so that you can tune performance. To support this, Onyx exposes a set of callbacks that are triggered on certain actions.

onyx-peer-http-query exposes a Prometheus endpoint at /metrics. Alternatively, any monitoring Java agent that exports JMX metrics can be used to send metrics to other providers such as New Relic.

Health Checks

onyx-peer-http-query contains an Aeron health check that we strongly recommend monitoring.

Task Monitoring

Onyx monitors numerous metrics related to each peer’s task operations and task states.

Each of the following metrics is scoped by job id, task, and peer id in the following format:

JMX tag: job.JOBID.task.TASKNAME.peer-id.PEERID.slot-id.SLOT-ID.METRIC ATTRIBUTE_NAME VALUE

Prometheus tag: METRIC_VALUETYPE{job=JOBID, task=TASKNAME, peer_id=PEERID} VALUE
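
To make the two naming schemes concrete, here is a minimal Python sketch that splits a JMX tag into its parts and reads one Prometheus sample line. It assumes the /metrics endpoint serves the standard Prometheus text format with quoted label values; the function names are hypothetical and not part of Onyx, and task names containing the literal markers (for example `.task.`) would not parse correctly.

```python
import re

# JMX layout from the text above:
# job.JOBID.task.TASKNAME.peer-id.PEERID.slot-id.SLOT-ID.METRIC
JMX_TAG = re.compile(
    r"^job\.(?P<job>.+?)\.task\.(?P<task>.+?)"
    r"\.peer-id\.(?P<peer_id>.+?)\.slot-id\.(?P<slot_id>[^.]+)\.(?P<metric>.+)$"
)

# One sample line in the Prometheus text format: name{label="value",...} value
PROM_SAMPLE = re.compile(
    r"^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)"
)
PROM_LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_jmx_tag(tag):
    """Split a JMX metric name into job, task, peer_id, slot_id, and metric."""
    match = JMX_TAG.match(tag)
    if match is None:
        raise ValueError(f"not a task-scoped metric tag: {tag!r}")
    return match.groupdict()


def parse_prometheus_sample(line):
    """Return (name, labels, value) for a sample line, or None for comments and blanks."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = PROM_SAMPLE.match(line)
    if match is None:
        raise ValueError(f"not a Prometheus sample line: {line!r}")
    labels = dict(PROM_LABEL.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))
```

With both parsers in place, readings from either source can be filtered by job, task, or peer id before they are charted or alerted on.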

Latency Metrics

Metric	Description
recover_latency	Time to fully recover the job after reallocation. Measured from the time the coordinator sends the barrier with epoch 0.
checkpoint_serialization_latency	Latency to serialize the checkpoint for the task.
checkpoint_store_latency	Latency to store the checkpoint in durable storage.
serialization_latency	Latency to serialize segments for messaging.
since_heartbeat	Time since this peer last sent a heartbeat.
since_received_heartbeat	Maximum time since a heartbeat was received from any peer. This is a good gauge of when a peer may be timed out by the peer receiving heartbeats.
task_lifecycle_apply_fn	Latency to call `:onyx/fn` on a batch of messages.
task_lifecycle_read_batch	Latency to read a batch of messages from the messenger or the input medium.
task_lifecycle_write_batch	Latency to write a batch of messages to the messenger or the output medium.

Available Metric Types

50thPercentile
75thPercentile
95thPercentile
98thPercentile
99thPercentile
999thPercentile
Count
FifteenMinuteRate
FiveMinuteRate
Max
Mean
MeanRate
Min
OneMinuteRate
StdDev
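
Because since_received_heartbeat indicates how close a peer is to being timed out, one way to use it is to compare a reading against the configured liveness timeout. The sketch below is only an illustration: the function name, the `timeout_ms` parameter, the warning fraction, and the assumption that the metric is reported in milliseconds are all hypothetical.

```python
def heartbeat_status(since_received_ms, timeout_ms, warn_fraction=0.5):
    """Classify a since_received_heartbeat reading against a liveness timeout.

    since_received_ms: latest reading, assumed to be in milliseconds.
    timeout_ms: the timeout after which a silent peer is dropped (hypothetical
        parameter; supply the value configured for your cluster).
    warn_fraction: share of the timeout at which to start warning.
    """
    if timeout_ms <= 0:
        raise ValueError("timeout_ms must be positive")
    if not 0 < warn_fraction < 1:
        raise ValueError("warn_fraction must be between 0 and 1")
    if since_received_ms >= timeout_ms:
        return "critical"
    if since_received_ms >= warn_fraction * timeout_ms:
        return "warn"
    return "ok"
```

Feeding this the Max value type rather than the Mean is probably the safer choice, since a single late heartbeat is what triggers a timeout.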

Gauges / Counters

Metric	Description
checkpoint_size	Size of the last checkpoint.
checkpoint_read_bytes	Number of bytes read from checkpointed storage.
checkpoint_written_bytes	Number of bytes written to checkpointed storage.
replica_version	The job’s allocation replica version that the peer is currently processing. All peers should have the same replica_version in normal operation, as peers with different replica versions are quarantined from each other.
lifecycle_index	The index of the current lifecycle stage for this peer. This gives an indication of what state the peer is currently in. See onyx.log for the mapping between lifecycle indexes and names for this peer.
current_lifecycle_duration	The amount of time that the peer has been in the current lifecycle state. A good indication of whether a task may be blocked. See lifecycle_index to find out which stage it is stuck in.
offset	Storage medium offset for use by input/output plugins. For example, a Kafka plugin may report the offset that has been read up to in a topic partition.
epoch	The barrier epoch that the peer is up to.
subscription_errors	Number of errors thrown by messenger subscription.
publication_errors	Number of errors thrown by messenger publication.
written_bytes	Total number of bytes written via the messenger.
read_bytes	Total number of bytes read from the messenger.
peer-group.scheduler-lag	Number of milliseconds that the peer group is behind the coordination log.
peer-group.peers-shutting-down	Number of peers that are currently shutting down. A persistently high number is a sign that something is going wrong with the peers, and that peers are being blocked on shutdown.

Available Metric Types

Value
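
Two of the gauges above lend themselves to simple checks: replica_version should agree across the peers of a job, and a large current_lifecycle_duration suggests a blocked task. The following sketch shows both checks over readings that have already been collected. The function names and the tuple layout of the readings are hypothetical, and the duration threshold must be given in whatever unit the metric is reported in.

```python
from collections import defaultdict


def replica_version_mismatches(readings):
    """Find jobs whose peers report different replica versions.

    readings: iterable of (job, peer_id, replica_version) tuples.
    Returns {job: {replica_version: [peer_id, ...]}} for disagreeing jobs only.
    """
    by_job = defaultdict(lambda: defaultdict(list))
    for job, peer_id, version in readings:
        by_job[job][version].append(peer_id)
    return {
        job: {version: sorted(peers) for version, peers in versions.items()}
        for job, versions in by_job.items()
        if len(versions) > 1
    }


def stuck_peers(readings, max_duration):
    """List peers that have stayed in one lifecycle stage longer than max_duration.

    readings: iterable of (peer_id, lifecycle_index, current_lifecycle_duration).
    Returns (peer_id, lifecycle_index) pairs so the stage can be looked up in onyx.log.
    """
    return [
        (peer_id, lifecycle_index)
        for peer_id, lifecycle_index, duration in readings
        if duration > max_duration
    ]
```

A brief disagreement in replica_version is expected during reallocation, so an alert on the first check would normally require the mismatch to persist for some time.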

Rate Metrics

Metric	Description
peer_group_peer_errors	Rate of errors thrown by peers in this peer group.
epoch_rate	Barrier flow rate.
task_lifecycle_apply_fn_throughput	Throughput for `:onyx/fn` application in segments.
task_lifecycle_read_batch_throughput	Throughput read from the input medium or messenger in segments.
task_lifecycle_write_batch_throughput	Throughput written to output medium or messenger in segments.

Available Metric Types

Count
FifteenMinuteRate
FiveMinuteRate
MeanRate
OneMinuteRate

Coordination Monitoring Events

This is the list of all monitoring events that you can register hooks for. The keys listed are present in the map that is passed to the callback function. The names of the events should readily identify what has taken place to trigger the callback.

Event Name	Keys
`:peer-group.since-heartbeat`	
`:zookeeper-write-log-entry`	`:event`, `:latency`, `:bytes`
`:zookeeper-read-log-entry`	`:event`, `:latency`, `:bytes`
`:zookeeper-write-catalog`	`:event`, `:latency`, `:bytes`
`:zookeeper-write-workflow`	`:event`, `:latency`, `:bytes`
`:zookeeper-write-flow-conditions`	`:event`, `:latency`, `:bytes`
`:zookeeper-write-lifecycles`	`:event`, `:latency`, `:bytes`
`:zookeeper-write-windows`	`:event`, `:latency`, `:bytes`
`:zookeeper-write-triggers`	`:event`, `:latency`, `:bytes`
`:zookeeper-write-job-metadata`	`:event`, `:latency`, `:bytes`
`:zookeeper-write-task`	`:event`, `:latency`, `:bytes`
`:zookeeper-write-job-hash`	`:event`, `:latency`, `:bytes`
`:zookeeper-write-chunk`	`:event`, `:latency`, `:bytes`
`:zookeeper-write-job-scheduler`	`:event`, `:latency`, `:bytes`
`:zookeeper-write-messaging`	`:event`, `:latency`, `:bytes`
`:zookeeper-write-exception`	`:event`, `:latency`, `:bytes`
`:zookeeper-force-write-chunk`	`:event`, `:latency`, `:bytes`
`:zookeeper-write-origin`	`:event`, `:latency`, `:bytes`
`:zookeeper-read-catalog`	`:event`, `:latency`
`:zookeeper-read-workflow`	`:event`, `:latency`
`:zookeeper-read-flow-conditions`	`:event`, `:latency`
`:zookeeper-read-lifecycles`	`:event`, `:latency`
`:zookeeper-read-windows`	`:event`, `:latency`
`:zookeeper-read-triggers`	`:event`, `:latency`
`:zookeeper-read-job-metadata`	`:event`, `:latency`
`:zookeeper-read-task`	`:event`, `:latency`
`:zookeeper-read-job-hash`	`:event`, `:latency`
`:zookeeper-read-chunk`	`:event`, `:latency`
`:zookeeper-read-origin`	`:event`, `:latency`
`:zookeeper-read-job-scheduler`	`:event`, `:latency`
`:zookeeper-read-messaging`	`:event`, `:latency`
`:zookeeper-read-exception`	`:event`, `:latency`
`:zookeeper-gc-log-entry`	`:event`, `:latency`, `:position`
`:group-prepare-join`	`:event`, `:id`
`:group-notify-join`	`:event`, `:id`
`:group-accept-join`	
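
Onyx callbacks are Clojure functions, so the Python below is not a hook implementation. It only models the maps passed to a callback as plain dictionaries, to illustrate how the keys listed in the table might be aggregated per event once they have been forwarded to a metrics pipeline. The function name and the dictionary layout are hypothetical.

```python
from collections import defaultdict


def summarize_events(event_maps):
    """Total the latency and bytes reported for each event name.

    event_maps: iterable of dicts with an "event" key and, depending on the
    event, optional "latency" and "bytes" keys (other keys are ignored).
    """
    totals = defaultdict(lambda: {"count": 0, "latency": 0, "bytes": 0})
    for event_map in event_maps:
        entry = totals[event_map["event"]]
        entry["count"] += 1
        entry["latency"] += event_map.get("latency", 0)
        entry["bytes"] += event_map.get("bytes", 0)
    return {name: dict(entry) for name, entry in totals.items()}
```

Events that carry neither `:latency` nor `:bytes`, such as the group join events, still contribute to the count.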
