clj-otel
clj-otel
Observability is the ability to measure a system’s current state based on the telemetry data it generates, such as traces, metrics, logs and events.
An observable system may be queried to describe its composition and behaviour. As systems are often highly distributed in nature, observability addresses the need to understand the behaviour with context across services, technologies and environments.
Monitoring is a complementary activity to observability, where typically a dashboard is configured to report predetermined metrics and alerts are set to trigger on specific conditions in the system. Monitoring is oriented to detection of known types of issues, as the monitored metrics or conditions are determined in advance.
Observability enables detection and exploration of issues as they occur and assists with root cause analysis. By querying an observable system, unexpected types of issue can be detected, classified and monitored as they arise, addressing the problem of "unknown unknowns". Questions on the actual behaviour of an observable system may be more readily answered:
"What caused this failure?"
"What caused this change in behaviour?"
"Why do a particular user’s requests fail?"
"Is my latest canary deployment stable?"
"Should I roll back the latest deployment?"
"Are my intended performance improvements in the latest deployment working as expected?"
"What uses my service, and what are the services it uses?"
"What mix of request types are issued to the system? Is that mix consistent across users?"
"Which areas should I work on to improve experience for most users?"
"How well is the system behaving for my premium users?"
"What user segment suffers high response times and why?"
"What is the biggest contributor to the system’s poor performance?"
"Why is the system’s error rate worse than last month? Are there new error cases?"
"What is the cause of my service level indicators (SLIs) not meeting the service level objectives (SLOs)?"
Through enhanced visibility of system behaviour, observability improves understanding of the impact of changes made by developers and operators.
OpenTelemetry is a CNCF incubating project, formed after the merger of OpenTracing and OpenCensus.
OpenTelemetry’s Mission: to enable effective observability by making high-quality, portable telemetry ubiquitous[1].
OpenTelemetry is a set of APIs, SDKs, tooling, integrations and semantic conventions that are designed for the creation and management of telemetry data. OpenTelemetry provides a vendor-agnostic implementation per language (Java, Go, JavaScript, C++ and others) that can be configured to export telemetry data to backends of your choice. The Java implementation supports both automatic instrumentation of a large number of libraries, frameworks and application servers and manual instrumentation of library or application source code. OpenTelemetry supports various backends, including OSS projects (Jaeger, Zipkin, Prometheus, Grafana Tempo, SigNoz) and commercial products (Honeycomb, Lightstep, Aspecto, Dynatrace, New Relic, Datadog and more).
The OpenTelemetry project also provides the OpenTelemetry Collector, a vendor-agnostic deployable process which receives, processes and exports telemetry data.
Work on the OpenTelemetry specification and implementations is in progress by a large community of contributors. Support for traces is the most mature, followed by metrics and then logs. See a summary of OpenTelemetry status.
There is a growing consensus in the commercial monitoring and APM tools market to adopt OpenTelemetry as a standard for telemetry in cloud native software. Many commercial products today offer ingest of OpenTelemetry trace data and support for metrics and logs is growing. Several vendors have created their own distributions of the OpenTelemetry SDK and OpenTelemetry Collector, which are customisations tailored for use with their products.
clj-otel
enable?clj-otel
extends the reach of OpenTelemetry to Clojure libraries and applications by providing:
A small idiomatic Clojure API that wraps the OpenTelemetry implementation for Java. This enables manual instrumentation of Clojure libraries and applications using pure Clojure.
Ring middleware and Pedestal interceptors for server span support.
Support for creating spans around asynchronous Clojure code.
A Clojure wrapper for programmatic configuration of the OpenTelemetry SDK.
clj-otel
is an umbrella project for several Clojure modules clj-otel-*
.
They depend on the OpenTelemetry implementation for Java opentelemetry-java
and the OpenTelemetry instrumentation agent provided by opentelemetry-java-instrumentation
.
OpenTelemetry exports telemetry data to a variety of telemetry backends. The choice of backend(s) is applied when configuring system components for deployment.
Query and presentation capabilities vary between backends. Many backends predate OpenTelemetry and were conceived as solutions focussed on tracing, monitoring or application performance management (APM). They have since been retrofitted to ingest telemetry data from OpenTelemetry.
The following sections are incomplete selections of open-source software (OSS) and commercial backends that accept telemetry data from OpenTelemetry.
Some commercial telemetry backends have a free version with a reduced capacity or feature set. |
The general workflow for using OpenTelemetry with your library or application is:
Add instrumentation to your library or application such that it exports telemetry data.
Configure system components to control how the telemetry data are processed and exported, either directly to telemetry backends or via OpenTelemetry Collector instance(s).
Use telemetry backend features to explore system behaviour described by the telemetry data.
Instrumenting a library or application involves adding behaviour such that it exports telemetry data as it runs.
Automatic instrumentation achieves this by dynamically altering the library or application at runtime. For the Java platform, automatic instrumentation is performed by the OpenTelemetry instrumentation agent, a Java agent that runs with the application. Many libraries, frameworks and application servers are supported by the agent out of the box. For example, the agent will create server spans for requests received by a Jetty server, and client spans for requests issued by an Apache HttpClient instance.
Manual instrumentation involves adding program code to the library or application at design time, using the OpenTelemetry API.
The clj-otel-api
module in this project wraps the OpenTelemetry API for Java in an idiomatic Clojure facade.
Manual instrumentation program code depends on the OpenTelemetry API, never the OpenTelemetry SDK. |
It is possible to combine automatic and manual instrumentation. For example, attributes and events can be added using manual instrumentation to a span automatically created by the agent. Extra spans can also be added using manual instrumentation. These are examples of manually enriching the telemetry data produced by automatic instrumentation.
Make use of automatic instrumentation if possible for your application, as this is a quick way to get high quality telemetry with almost no effort. Use manual instrumentation to enrich the telemetry data, or if your application does not use a framework supported by the agent. |
In observability terms, telemetry data is an aggregation of data from four sources: traces, metrics, logs and events. In the OpenTelemetry data model, data sources are traces, metrics and logs. Events are treated as a specific type of log or captured as part of a trace.
A trace represents the flow of a single transaction throughout the system. A trace comprises a tree of spans, where a span represents a unit of work in a service and the parent-child relationship between the spans represent dependencies between them. The root span of a trace typically describes the entire transaction and the other spans in the trace describe units of work performed as part of the transaction. Traces provide context for system activity performed in spans.
Span data may include a span kind, name, attributes, start/end timestamps, links to other spans, a list of events and a status.
The span kind indicates the relationship between the span and its parent and children in the trace. The span kind is one of:
CLIENT
: Covers the client side of issuing a synchronous request, where the client side waits until a response is received.
SERVER
: Covers the server side of handling a synchronous request, where the remote client waits for a response.
PRODUCER
: Covers initiation of an asynchronous request, where the corresponding consumer span may start after the producer span ends.
CONSUMER
: Covers processing of an asynchronous producer request.
INTERNAL
: An internal operation within the local application or service.
The span name should identify a class of spans and not include data.
The events in a span are timestamped records that may include attributes. Exceptions thrown in a span’s scope are captured as events.
The span status has a code Ok
or Error
, and in case of Error
may also have a string description.
A metric is a numerical measurement over a period of time. Metrics are used to indicate quantitative aspects of system health, such as resource (memory, disk, compute, network) usage, error rate, message queue length, and request response time.
A service log is made of lines of text (possibly structured e.g. in JSON format) written when certain points in the service code are executed. Logs are well suited to ad-hoc debugging and capture of low-level details.
Events are captured as either a specific type of log or as a span event. Events are records that describe actions taken by the system over time, or environmental changes that occurred which are significant to the system, such as a service deployment or change in configuration.
Attributes may be attached to some telemetry data such as spans and resources.
Attributes are a map where each entry has a string key and a value which is a boolean, long, double, string or an array of one of those types.
Attributes with nil
values are dropped.
See the specification for attributes and attribute naming.
A resource captures information about the entity for which telemetry data is recorded. For example, information on the host and JVM version may be part of a resource. Resources are included as part of other telemetry data such as traces and metrics.
The OpenTelemetry SDK contains resource implementations which capture a variety of host and process information.
Baggage is mechanism for propagating telemetry metadata and is represented as a simple map.
It is intended as a means to add contextual information at a point in a transaction, to be read by a downstream service later in the same transaction and then used as an element of telemetry data e.g. an attribute.
For example, a user identifier may put in the baggage to indicate the principal of a request and subsequent spans in the trace include a principal
attribute.
OpenTelemetry defines a rich set of conventions for telemetry data. This semantic unification across vendors and technologies promotes analysis of telemetry data created in heterogeneous, polyglot systems. In particular, semantic attributes for spans and metrics are defined for common base technologies like HTTP, database, RPC, messaging, FaaS (Function as a Service) and others. See OpenTelemetry semantic conventions documentation.
clj-otel
follows the semantic conventions for areas such as span exception events and manually created HTTP client and server spans.
A context acts as an immutable map that holds values that are transmitted across API boundaries and threads. A context may contain a span, baggage and possibly other values. A new context is created from an existing context with the addition of a new key-value association.
The current context is a thread local io.opentelemetry.context.Context
object.
It is used as a default for many functions in this project clj-otel
and methods of the underlying Java library opentelemetry-java
.
The current context is safe to use when manually instrumenting synchronous code.
The current context cannot be used when manually instrumenting asynchronous code. See Instrumenting asynchronous Clojure code. |
Context propagation is the mechanism used to transmit context values across API boundaries and threads. Context propagation enables traces to become distributed traces, joining clients to servers and producers to consumers. In practice, this is achieved by injecting and extracting header values in HTTP requests using a text map propagator.
OpenTelemetry provides text map propagators for the following protocols:
The W3C Trace Context and W3C baggage header propagation protocols are the most commonly used protocols for propagation of trace context and baggage.
When manually instrumenting asynchronous Clojure code with this library clj-otel
, it is not possible to use the current context.
This is because async Clojure function evaluations share threads, but each evaluation is associated with a distinct context.
The async function must instead maintain a reference to the associated context during evaluation, rather than use the current context.
Some functions in this library clj-otel
take a :context
or :parent
option to indicate the associated context to use, as an alternative to the default current context.
Sampling is the process of selecting some elements from a set and deriving observations on the complete set based on analysis of those selected elements. Sampling is needed when the volume of raw data is too high to analyse cost-effectively.
Trace sampling may be applied at any number of points between the instrumented application and the telemetry backend. OpenTelemetry provides sampler implementations which may be applied in the application and/or the Collector. Some telemetry backends may also apply sampling to trace data they receive, either automatically or with some developer intervention.
Exporters emit telemetry data to consumers, such as the Collector and telemetry backends. Exporters can be push or pull based.
OpenTelemetry Protocol (OTLP) is the OpenTelemetry native protocol for encoding, transport and delivery of telemetry data. OTLP is currently implemented over gRPC and HTTP transports.
Almost all telemetry backends that integrate with OpenTelemetry accept telemetry data in OTLP format. An application or OpenTelemetry Collector exports data to these backends using an OTLP exporter.
The OpenTelemetry SDK implements the creation, sampling, batching and export of telemetry data. The SDK acts as an implementation of the OpenTelemetry API. For an application to export telemetry data, the SDK and its dependencies need to be present and configured at runtime.
The SDK and its dependencies are added to an application in one of the following ways:
By using the OpenTelemetry instrumentation agent: In this option, the SDK and its dependencies are present but do not appear on the application classpath. Also, autoconfiguration is used for configuring the SDK.
By using the opentelemetry-sdk-extension-autoconfigure
library as an application dependency: This option is for autoconfiguration of the SDK where the OpenTelemetry instrumentation agent is not present.
The relevant optional SDK libraries (exporters, extensions, etc.) also need to be added as runtime dependencies.
By adding the SDK as a compile-time dependency to the application: This option is for programmatic configuration of the SDK. The relevant optional SDK libraries also need to be added as compile-time dependencies.
If the SDK is not present at application runtime, all OpenTelemetry API calls default to a no-op implementation where no telemetry data is created.
Autoconfiguration of the OpenTelemetry SDK refers to configuration using system properties or environment variables. Configuration of the OpenTelemetry instrumentation agent uses the same mechanism.
See documentation for SDK autoconfiguration and instrumentation agent configuration.
The SDK can be programmatically configured, as an alternative to autoconfiguration. This is a fallback option if autoconfiguration lacks the desired options.
This project clj-otel
provides a module clj-otel-sdk
for configuring the SDK in Clojure, as well as other support modules clj-otel-exporter-*
,clj-otel-extension-*
and clj-otel-sdk-extension-*
for programmatic access to various optional components.
An OpenTelemetry distro (or "distribution") supplied by a vendor is a repackaging of reference OpenTelemetry software, customised with the purpose of ease of use with the vendor’s products. They are not forks, in that they do not extend or change the OpenTelemetry API.
It is not a requirement to use a vendor’s distro, it should always be possible to use the reference OpenTelemetry software and configure it as appropriate. The obvious advantage to using a distro is ease of use, but a disadvantage is that sometimes the version of distro lags behind the reference OpenTelemetry version.
The OpenTelemetry Collector is a vendor-agnostic deployable process to manage telemetry data as it flows from instrumented applications to telemetry backends. The Collector can transform telemetry data, for example insert or filter attributes. It removes the need to run multiple, vendor-specific agents and collectors, when working with multiple telemetry data formats and telemetry backends.
It is not required to use the OpenTelemetry Collector, though its use is recommended to simplify telemetry data management in larger systems that have many instrumented services. Some exporters provided by OpenTelemetry have default options set to target a Collector instance running on the same host.
The following are alternatives to OpenTelemetry in the Clojure ecosystem, which are concerned with the creation or processing of telemetry data.
μ/log : Micro-logging library that logs events and data, not words
ken : Observability library to instrument Clojure code
clojure.log4j2 : Sugar for using Log4j2 from clojure, including MapMessage
support
timbre-json-appender : Structured log appender for Timbre using jsonista
cartus : Structured logging abstraction with multiple backends
Cambium : Structured logging for Clojure
clj-journal : Structured logging to systemd journal using native systemd libraries and JNA (Java Native Access)
μ/trace : Micro distributed tracing library with the focus on tracking data with custom attributes; a subsystem of μ/log
ken tracing support
opencensus-clojure : Clojure wrapper for opencensus-java
opentracing-clj : OpenTracing API support for Clojure, built on top of opentracing-java
omni-trace : Clojure(Script) tracing for debugging
Riemann : Network event stream processing system, in Clojure
micrometer-clj : Clojure wrapper for Java Micrometer library
metrics-clojure : Clojure façade around Dropwizard Metrics library
TRACKit! : Clojure wrapper for Dropwizard Metrics library
aws-metrics-collector : Clojure AWS Cloudwatch metric collector
Can you improve this documentation?Edit on GitHub
cljdoc is a website building & hosting documentation for Clojure/Script libraries
× close