When setting up an Onyx cluster in production, it’s helpful to know what Onyx itself is doing. Onyx exposes a set of callbacks that are triggered on certain actions.
When you’re taking Onyx to production, it’s not enough to know what your application-specific code is doing. You also need insight into how Onyx is operating internally so that you can tune its performance.
onyx-peer-http-query exposes a Prometheus endpoint at /metrics.
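As a rough sketch of how the /metrics endpoint might be polled from a script, the snippet below fetches the page and keeps only the sample lines. It assumes the endpoint serves plain text with `#` comment lines, as Prometheus endpoints usually do. The helper names (`metrics_url`, `fetch_metrics`, `sample_lines`) are hypothetical and not part of onyx-peer-http-query.

```python
import urllib.request


def metrics_url(base_url):
    """Build the /metrics URL from a base address such as http://host:port."""
    return base_url.rstrip("/") + "/metrics"


def fetch_metrics(base_url, timeout=5.0):
    """Return the raw text served at <base_url>/metrics."""
    with urllib.request.urlopen(metrics_url(base_url), timeout=timeout) as response:
        return response.read().decode("utf-8")


def sample_lines(text):
    """Drop blank lines and '#' comment lines, keeping only sample lines."""
    return [
        line.strip()
        for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]
```

A call such as `sample_lines(fetch_metrics("http://localhost:8080"))` would then return one string per reported sample. The host and port here are placeholders; use whatever address your peer's HTTP query server is configured to listen on.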
Alternatively, any monitoring Java agent that exports JMX metrics can be used to
export metrics to other providers such as New Relic.
onyx-peer-http-query contains an Aeron health check that we strongly recommend monitoring.
Onyx reports numerous metrics related to each peer’s task operations and task states.
Each of the following metrics is scoped by job id, task, and peer id in the following format:
JMX tag: job.JOBID.task.TASKNAME.peer-id.PEERID.slot-id.SLOT-ID.METRIC ATTRIBUTE_NAME VALUE
Prometheus tag: METRIC_VALUETYPE{job=JOBID, task=TASKNAME, peer_id=PEERID} VALUE
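To make the Prometheus tag format concrete, here is a minimal parsing sketch. It assumes each sample line follows the `METRIC_VALUETYPE{job=..., task=..., peer_id=...} VALUE` shape shown above, that label values may or may not be quoted, and that label values contain no commas. The function name `parse_sample` and the returned dictionary layout are hypothetical. The value types are the ones listed in this document.

```python
import re

# Value types listed in this document for Onyx metrics.
VALUE_TYPES = (
    "50thPercentile", "75thPercentile", "95thPercentile", "98thPercentile",
    "99thPercentile", "999thPercentile", "Count", "FifteenMinuteRate",
    "FiveMinuteRate", "Max", "Mean", "MeanRate", "Min", "OneMinuteRate",
    "StdDev", "Value",
)

# name{labels} value
SAMPLE_RE = re.compile(r"^(?P<name>[^{\s]+)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$")


def parse_sample(line):
    """Split one sample line into metric, value type, labels, and value."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised sample line: {line!r}")

    labels = {}
    for pair in match.group("labels").split(","):
        if pair.strip():
            key, _, raw = pair.partition("=")
            labels[key.strip()] = raw.strip().strip('"')

    # Separate the trailing _VALUETYPE suffix from the metric name, if present.
    name = match.group("name")
    metric, value_type = name, None
    for candidate in VALUE_TYPES:
        suffix = "_" + candidate
        if name.endswith(suffix):
            metric, value_type = name[: -len(suffix)], candidate
            break

    return {
        "metric": metric,
        "value_type": value_type,
        "labels": labels,
        "value": float(match.group("value")),
    }
```

For example, `parse_sample('task_lifecycle_read_batch_50thPercentile{job="j1", task="in", peer_id="p1"} 12.5')` yields the metric `task_lifecycle_read_batch` with value type `50thPercentile`. A production parser should use a Prometheus client library instead, since this sketch does not handle escaped characters in label values.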
Metric | Description |
---|---|
recover_latency | Time to fully recover the job after reallocation, measured from the time the coordinator sends the barrier with epoch 0. |
checkpoint_serialization_latency | Latency to serialize the checkpoint for the task. |
checkpoint_store_latency | Latency to store the checkpoint in durable storage. |
serialization_latency | Latency to serialize segments for messaging. |
since_heartbeat | Time since this peer last sent a heartbeat. |
since_received_heartbeat | Maximum time since a heartbeat was received from any peer. This is a good gauge of when a peer may be timed out by the peer receiving heartbeats. |
task_lifecycle_apply_fn | Latency to call |
task_lifecycle_read_batch | Latency to read a batch of messages from messenger or the input medium. |
task_lifecycle_write_batch | Latency to write a batch of messages to the messenger or the output medium. |
Available Metric Types
50thPercentile
75thPercentile
95thPercentile
98thPercentile
99thPercentile
999thPercentile
Count
FifteenMinuteRate
FiveMinuteRate
Max
Mean
MeanRate
Min
OneMinuteRate
StdDev
Metric | Description |
---|---|
checkpoint_size | Size of the last checkpoint. |
checkpoint_read_bytes | Number of bytes read from checkpointed storage. |
checkpoint_written_bytes | Number of bytes written to checkpointed storage. |
replica_version | The job’s allocation replica version that the peer is currently processing. All peers should have the same replica_version in normal operation, as peers with different replica versions are quarantined from each other. |
lifecycle_index | The index of the current lifecycle stage for this peer. This indicates what state the peer is currently in. See onyx.log for the mapping between lifecycle indexes and index names for this peer. |
current_lifecycle_duration | The amount of time that the peer has been in the current lifecycle state. This is a good indication of whether a task may be blocked. See lifecycle_index to determine which stage it is stuck in. |
offset | Storage medium offset for use by input/output plugins. For example, a Kafka plugin may report the offset that has been read up to in a topic partition. |
epoch | The barrier epoch that the peer is up to. |
subscription_errors | Number of errors thrown by messenger subscription. |
publication_errors | Number of errors thrown by messenger publication. |
written_bytes | Total number of bytes written via the messenger. |
read_bytes | Total number of bytes read from the messenger. |
peer-group.scheduler-lag | Number of milliseconds that the peer group is behind the coordination log. |
peer-group.peers-shutting-down | Number of peers that are currently shutting down. A persistently high number is a sign that something is going wrong with the peers, and that peers are being blocked on shutdown. |
Available Metric Types
Value
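The descriptions of replica_version and current_lifecycle_duration suggest two simple health checks, sketched below. The sketch assumes samples have already been collected into dictionaries with `metric`, `labels` (containing `job`, `task`, and `peer_id`), and `value` keys. That layout, the function names, and the `max_duration` threshold are all hypothetical. The threshold must be chosen in whatever unit the metric is reported in.

```python
from collections import defaultdict


def replica_version_mismatches(samples):
    """Return {job: {version: [peer_id, ...]}} for jobs whose peers disagree."""
    by_job = defaultdict(lambda: defaultdict(list))
    for sample in samples:
        if sample["metric"] == "replica_version":
            labels = sample["labels"]
            by_job[labels["job"]][sample["value"]].append(labels["peer_id"])
    # A job is suspicious only if more than one distinct version is reported.
    return {job: dict(versions) for job, versions in by_job.items() if len(versions) > 1}


def possibly_blocked(samples, max_duration):
    """Return (peer_id, task, duration) for peers above max_duration."""
    return [
        (sample["labels"]["peer_id"], sample["labels"]["task"], sample["value"])
        for sample in samples
        if sample["metric"] == "current_lifecycle_duration"
        and sample["value"] > max_duration
    ]
```

Peers flagged by `possibly_blocked` can then be cross-referenced with their lifecycle_index value to see which lifecycle stage they are sitting in. Brief replica_version disagreement may occur while a reallocation is in progress, so an alert would likely want to require the mismatch to persist across several polls.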
Metric | Description |
---|---|
peer_group_peer_errors | Rate of errors thrown by peers in this peer group. |
epoch_rate | Barrier flow rate. |
task_lifecycle_apply_fn_throughput | Throughput for |
task_lifecycle_read_batch_throughput | Throughput read from the input medium or messenger in segments. |
task_lifecycle_write_batch_throughput | Throughput written to output medium or messenger in segments. |
Available Metric Types
Count
FifteenMinuteRate
FiveMinuteRate
MeanRate
OneMinuteRate
This is the list of all monitoring events that you can register hooks for. The keys listed are present in the map that is passed to the callback function. The names of the events should readily identify what has taken place to trigger the callback.
Event Name | Keys |
---|---|