Concepts

Observability, OpenTelemetry and clj-otel

What is observability?

Observability is the ability to measure a system’s current state based on the telemetry data it generates, such as traces, metrics, logs and events.

An observable system may be queried to describe its composition and behaviour. As systems are often highly distributed in nature, observability addresses the need to understand the behaviour with context across services, technologies and environments.

Monitoring is a complementary activity to observability, where typically a dashboard is configured to report predetermined metrics and alerts are set to trigger on specific conditions in the system. Monitoring is oriented to the detection of known types of issues, as the monitored metrics or conditions are determined in advance.

The goal of observability

Observability enables detection and exploration of issues as they occur and assists with root cause analysis. By querying an observable system, unexpected types of issue can be detected, characterised and monitored as they arise, addressing the problem of "unknown unknowns". Questions on the actual behaviour of an observable system may be more readily answered:

  • "What caused this failure?"

  • "What caused this change in behaviour?"

  • "Why do a particular user’s requests fail?"

  • "Is my latest canary deployment stable?"

  • "Should I roll back the latest deployment?"

  • "Are my intended performance improvements in the latest deployment working as expected?"

  • "What uses my service, and what are the services it uses?"

  • "What mix of request types are issued to the system? Is that mix consistent across users?"

  • "Which areas should I work on to improve experience for most users?"

  • "How well is the system behaving for my premium users?"

  • "What user segment suffers high response times and why?"

  • "What is the biggest contributor to the system’s poor performance?"

  • "Why is the system’s error rate worse than last month? Are there new error cases?"

  • "What is the cause of my service level indicators (SLIs) not meeting the service level objectives (SLOs)?"

Through enhanced visibility of system behaviour, observability improves understanding of the impact of changes made by developers and operators.

About OpenTelemetry

OpenTelemetry is a CNCF incubating project formed after the merger of OpenTracing and OpenCensus.

OpenTelemetry’s Mission: to enable effective observability by making high-quality, portable telemetry ubiquitous[1].
— OpenTelemetry community

OpenTelemetry is a set of APIs, SDKs, tooling, integrations and semantic conventions for creating and managing telemetry data. OpenTelemetry provides a vendor-agnostic implementation per language (Java, Go, JavaScript, C++ and others) that exports telemetry data to backends of your choice. The Java implementation supports automatic instrumentation of many libraries, frameworks and application servers and manual instrumentation of library or application source code. OpenTelemetry supports various backends, including OSS projects (Jaeger, Zipkin, Prometheus, Grafana Tempo, SigNoz) and commercial products (Honeycomb, Lightstep, Aspecto, Dynatrace, New Relic, Datadog and more).

The OpenTelemetry project also provides the OpenTelemetry Collector, a vendor-agnostic deployable process which receives, processes and exports telemetry data.

OpenTelemetry status and adoption

Work on the OpenTelemetry specification and implementations is in progress by a large community of contributors. Support for traces is the most mature, followed by metrics and logs. See a summary of OpenTelemetry status.

There is a growing consensus in the commercial monitoring and APM tools market to adopt OpenTelemetry as a standard for telemetry in cloud-native software. Today many commercial products accept OpenTelemetry trace data. Product support for metrics and logs is growing. Several vendors have created distributions of the OpenTelemetry SDK and OpenTelemetry Collector, which feature customisations tailored for use with their products.

What does this project clj-otel enable?

clj-otel extends the reach of OpenTelemetry to Clojure libraries and applications by providing:

  • A small idiomatic Clojure API that wraps the OpenTelemetry implementation for Java. This API enables manual instrumentation of Clojure libraries and applications using pure Clojure.

  • Ring middleware and Pedestal interceptors for server span support.

  • Support for creating spans around asynchronous Clojure code.

  • A Clojure wrapper for programmatic configuration of the OpenTelemetry SDK.

clj-otel is an umbrella project for several Clojure modules clj-otel-*. They depend on the OpenTelemetry implementation for Java opentelemetry-java and the OpenTelemetry instrumentation agent provided by opentelemetry-java-instrumentation.

Supported telemetry backends

OpenTelemetry exports telemetry data to a variety of telemetry backends. The choice of backend(s) is applied when configuring system components for deployment.

Query and presentation capabilities vary between backends. Many backends predate OpenTelemetry, having been conceived as solutions focused on tracing, monitoring or application performance management (APM). These backends have since been modified to ingest telemetry data from OpenTelemetry.

The following sections are incomplete selections of open-source software (OSS) and commercial backends that accept telemetry data from OpenTelemetry.

Using OpenTelemetry

The general workflow for using OpenTelemetry with your library or application is:

  1. Add instrumentation to your library or application such that it exports telemetry data.

  2. Configure system components to control how the telemetry data are processed and exported, either directly to telemetry backends or via OpenTelemetry Collector instance(s).

  3. Use telemetry backend features to explore system behaviour described by the telemetry data.

Instrumenting libraries and applications

Instrumenting a library or application involves adding behaviour such that it exports telemetry data as it runs.

Automatic instrumentation dynamically alters the library or application at runtime to export telemetry data. For the Java platform, automatic instrumentation is performed by the OpenTelemetry instrumentation agent, a Java agent that runs with the application. Many libraries, frameworks and application servers are supported by the agent out of the box. For example, the agent will create server spans for requests received by a Jetty server, and client spans for requests issued by an Apache HttpClient instance.

If possible, use automatic instrumentation for your application, as this is a quick way to get high quality telemetry with almost no effort.

Manual instrumentation is the process of adding program code to the library or application at design time to export telemetry data using the OpenTelemetry API. The clj-otel-api module in this project wraps the OpenTelemetry API for Java in an idiomatic Clojure facade.

Manual instrumentation program code depends on the OpenTelemetry API, never the OpenTelemetry SDK.

Any combination of automatic and manual instrumentation may be used:

  • Use solely automatic instrumentation to quickly add telemetry without changing any program code.

  • Use solely manual instrumentation if it is not possible to use the instrumentation agent, or the instrumented application does not use a library or framework supported by the agent.

  • Combine automatic and manual instrumentation for enriched telemetry. For example, to enrich a span produced by automatic instrumentation, attributes and events may be added using manual instrumentation.

OpenTelemetry data model

In observability terms, telemetry data comes from four sources: traces, metrics, logs and events. In the OpenTelemetry data model, data sources are traces, metrics and logs. Events are treated as a specific type of log or captured as part of a trace.

Traces

A trace represents the flow of a single transaction throughout the system. A trace comprises a tree of spans, where a span represents a unit of work in a service. Parent-child relationships between spans describe dependencies between them. The root span of a trace typically describes the entire transaction. The other spans represent units of work performed as part of the transaction. Traces provide context for system activity performed in spans.

Span data may include a span kind, name, attributes, start/end timestamps, links to other spans, a list of events and a status.

  • The span kind indicates the relationship between the span and its parent and children in the trace. The span kind is one of:

    • CLIENT : Covers the client side of issuing a synchronous request, where the client side waits until a response is received.

    • SERVER : Covers the server side of handling a synchronous request, where the remote client waits for a response.

    • PRODUCER : Covers initiation of an asynchronous request, where the corresponding consumer span may start after the producer span ends.

    • CONSUMER : Covers processing of an asynchronous producer request.

    • INTERNAL : An internal operation within the local application or service.

  • The span name should identify a class of spans and not include data.

  • The events in a span are timestamped records that may include attributes. Exceptions thrown in a span’s scope are captured as events.

  • The span status has a code of Unset (the default), Ok or Error, and in the case of Error may also have a string description.

See specifications for span and span kind.
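
As a rough illustration of the span data described above, the following Python sketch models spans as plain records and links them into a small trace. It is not the OpenTelemetry API or the clj-otel API; the class names, field names and example values are hypothetical and chosen only to mirror the terms used in this section.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class SpanKind(Enum):
    CLIENT = "CLIENT"
    SERVER = "SERVER"
    PRODUCER = "PRODUCER"
    CONSUMER = "CONSUMER"
    INTERNAL = "INTERNAL"


@dataclass
class Event:
    """A timestamped record inside a span (hypothetical model)."""
    name: str
    timestamp_ns: int
    attributes: dict = field(default_factory=dict)


@dataclass
class Span:
    """A unit of work in a service (hypothetical model, not the real API)."""
    name: str                      # identifies a class of spans, contains no data
    kind: SpanKind
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]  # None for the root span of the trace
    start_ns: int
    end_ns: int
    attributes: dict = field(default_factory=dict)
    events: list = field(default_factory=list)
    status_code: str = "UNSET"     # the specification defines Unset, Ok and Error
    status_description: Optional[str] = None


def children_of(spans, parent):
    """Return the spans whose parent is the given span."""
    return [s for s in spans if s.parent_span_id == parent.span_id]


# Root span: the server side of the whole transaction.
root = Span("GET /users/:id", SpanKind.SERVER, "t1", "s1", None, 0, 50)

# Child span: a client call made while handling the request, which failed.
child = Span("SELECT users", SpanKind.CLIENT, "t1", "s2", "s1", 10, 40)
child.events.append(Event("exception", 35, {"exception.type": "TimeoutError"}))
child.status_code = "ERROR"
child.status_description = "query timed out"

trace = [root, child]
```

Note that the span name `GET /users/:id` names a class of requests; the actual user identifier would belong in an attribute rather than in the name.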

Metrics

A metric is a numerical measurement over a period of time. Metrics are used to indicate quantitative aspects of system health, such as resource (memory, disk, compute, network) usage, error rate, message queue length, and request response time.

Logs

A service log is made of lines of text (possibly structured e.g. in JSON format) written when certain points in the service code are executed. Logs are well suited to ad-hoc debugging and capture of low-level details.

Events

Events are captured as either a specific type of log or as a span event. Events are records that describe actions taken by the system or significant environmental changes, such as a service deployment or change in configuration.

Attributes

Attributes may be attached to some telemetry data such as spans and resources. Attributes are a map where each entry has a string key. Each entry value is a boolean, long, double, string or an array of one of those types. Entries with a nil value are dropped.

OpenTelemetry recommends using namespaced attribute names to prevent clashes. See the specification for attributes and attribute naming.
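
The attribute rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how clj-otel or the OpenTelemetry SDK implement attributes: the function name `clean_attributes`, the `com.example.*` keys and the choice to raise an error for unsupported values are all hypothetical.

```python
ALLOWED_SCALARS = (bool, int, float, str)


def _is_homogeneous_array(value):
    """True if value is a list/tuple whose members share one allowed scalar type."""
    if not isinstance(value, (list, tuple)):
        return False
    return all(
        isinstance(v, ALLOWED_SCALARS) and type(v) is type(value[0])
        for v in value
    )


def clean_attributes(attrs):
    """Return a new attribute map with string keys and supported values only."""
    cleaned = {}
    for key, value in attrs.items():
        if value is None:
            continue  # entries with a nil value are dropped
        if isinstance(value, ALLOWED_SCALARS):
            cleaned[str(key)] = value
        elif _is_homogeneous_array(value):
            cleaned[str(key)] = list(value)
        else:
            # Assumption of this sketch: reject anything else loudly.
            raise ValueError(f"unsupported attribute value for {key!r}")
    return cleaned


# Namespaced keys (hypothetical names) help prevent clashes between libraries.
attrs = clean_attributes({
    "com.example.order.id": 1234,
    "com.example.order.tags": ["gift", "express"],
    "com.example.order.coupon": None,
})
print(attrs)
```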

Resources

A resource captures information about the entity for which telemetry data is recorded. For example, information on the host and JVM version may be part of a resource. Resources form part of the telemetry data.

The OpenTelemetry SDK contains resource implementations which capture host and process information.

Baggage

Baggage is a mechanism for propagating telemetry metadata and is represented as a simple map. Contextual information is added to the baggage at one point in a transaction, read by a downstream service later in the same transaction, and then used as an element of telemetry data, e.g. an attribute. For example, a user identifier is put in the baggage to indicate the principal of a request, and subsequent spans in the trace include a principal attribute.
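
To make the principal example concrete, here is a simplified Python sketch of carrying baggage between services in a W3C `baggage` HTTP header. It is an illustration only and not the OpenTelemetry propagator: the function names and the `app.principal` attribute key are hypothetical, and optional member properties (the part after `;`) are ignored.

```python
from urllib.parse import quote, unquote


def encode_baggage(baggage):
    """Encode a baggage map as a `baggage` header value (simplified)."""
    return ",".join(
        f"{key}={quote(str(value), safe='')}" for key, value in baggage.items()
    )


def decode_baggage(header):
    """Decode a `baggage` header value into a map, ignoring properties."""
    baggage = {}
    for member in header.split(","):
        member = member.strip()
        if "=" not in member:
            continue  # skip empty or malformed members
        key, _, value = member.partition("=")
        value = value.split(";", 1)[0]  # drop optional ;properties
        baggage[key.strip()] = unquote(value.strip())
    return baggage


# Upstream service: record the principal of the request in the baggage.
header_value = encode_baggage({"principal": "alice@example.com"})

# Downstream service: read the baggage and copy it to a span attribute.
received = decode_baggage(header_value)
span_attributes = {"app.principal": received["principal"]}
```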

Semantic conventions

OpenTelemetry defines a rich set of conventions for telemetry data. This semantic unification across vendors and technologies promotes analysis of telemetry data created in heterogeneous, polyglot systems. In particular, semantic attributes for spans and metrics are defined for base technologies like HTTP, database, RPC, messaging, FaaS (Function as a Service) and others. See OpenTelemetry semantic conventions documentation.

clj-otel follows the semantic conventions for areas such as span exception events and manually created HTTP client and server spans.

Context

A context is an immutable map that holds values transmitted across API boundaries and threads. A context may contain a span, baggage and possibly other values. A new context is created by adding a key-value association to an existing context.
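
The "new context from an existing context" idea can be sketched as follows. This Python class is a hypothetical stand-in for illustration, not `io.opentelemetry.context.Context`; the method names `get` and `with_value` are assumptions of the sketch.

```python
from types import MappingProxyType


class Context:
    """A minimal immutable context: adding a value yields a new context."""

    def __init__(self, values=None):
        # Copy the input and expose it read-only so it cannot be mutated.
        self._values = MappingProxyType(dict(values or {}))

    def get(self, key, default=None):
        return self._values.get(key, default)

    def with_value(self, key, value):
        # The existing context is left untouched; a new one is returned.
        return Context({**self._values, key: value})


ROOT = Context()
with_span = ROOT.with_value("span", "span-A")
with_baggage = with_span.with_value("baggage", {"principal": "alice"})
```

Because each context is never modified after creation, it can be handed to another thread or function without risk of the receiver seeing later changes.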

Current context

The current context is a thread-local io.opentelemetry.context.Context object. It is the default for many functions in this project clj-otel and methods of the underlying Java library opentelemetry-java. The current context is safe to use when manually instrumenting synchronous code.

Do not use the current context when manually instrumenting asynchronous code. See Instrumenting asynchronous Clojure code.

Context propagation

Context propagation is the mechanism used to transmit context values across API boundaries and threads. Context propagation enables traces to become distributed traces, joining clients to servers and producers to consumers. In practice HTTP request header values are injected and extracted using a text map propagator.

OpenTelemetry provides text map propagators for the following protocols:

The W3C Trace Context and W3C baggage header propagation protocols are the most commonly used.
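
As an illustration of injection and extraction, the sketch below writes and reads the W3C Trace Context `traceparent` header, whose version `00` layout is `version-traceid-parentid-flags` in lowercase hex. It is a simplified stand-in for a text map propagator, not the OpenTelemetry implementation: the function names and the shape of the returned map are hypothetical, and only version `00` is handled.

```python
import re

# version 00: 2 hex digits, 32 hex trace ID, 16 hex parent span ID, 2 hex flags
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def inject(trace_id, span_id, sampled, headers):
    """Client side: write the current span's identity into outgoing headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return headers


def extract(headers):
    """Server side: read the remote parent span from incoming headers."""
    match = _TRACEPARENT.fullmatch(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None  # all-zero identifiers are invalid
    return {
        "trace_id": trace_id,
        "span_id": span_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


outgoing = inject(
    "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", True, {}
)
parent = extract(outgoing)
```

A server span created with `parent` as its parent joins the client's trace, which is how the two halves of a request end up in one distributed trace.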

Instrumenting asynchronous Clojure code

When manually instrumenting asynchronous Clojure code with this library clj-otel, it is not possible to use the current context. This is because async Clojure function evaluations share threads, but each evaluation is associated with a distinct context. The async function must instead maintain a reference to the associated context during evaluation, rather than use the current context. Some functions in this library clj-otel take a :context or :parent option to indicate the associated context to use, as an alternative to the default current context.
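
The problem can be shown with a small, deterministic Python sketch in which two asynchronous evaluations share one thread. It does not use clj-otel; all names are hypothetical, and a `threading.local` object stands in for the thread-local current context.

```python
import threading

_current = threading.local()  # stands in for the thread-local current context


def start_async(name, pending):
    """Start an async evaluation and queue its continuation to run later."""
    context = {"span": name}     # the context associated with this evaluation
    _current.context = context   # also becomes the thread's current context

    def resume_with_current():
        # Unsafe: reads whatever context is current when the continuation runs.
        return _current.context["span"]

    def resume_with_reference():
        # Safe: uses the context reference kept by this evaluation.
        return context["span"]

    pending.append((resume_with_current, resume_with_reference))


pending = []
start_async("request-A", pending)  # evaluation A starts, then yields the thread
start_async("request-B", pending)  # evaluation B starts on the same thread

unsafe_a, safe_a = pending[0]      # evaluation A now resumes
print(unsafe_a())  # request-B: the current context belongs to evaluation B
print(safe_a())    # request-A: the retained reference is still correct
```

Keeping a reference to the associated context, as `resume_with_reference` does, corresponds to passing the context explicitly via an option such as `:context` or `:parent` rather than relying on the default.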

Trace sampling

Sampling is the selection of some elements from a set and deriving observations on the complete set based on analysis of those selected elements. Sampling is needed when the volume of raw data is too high to analyse cost-effectively.

Trace sampling may occur at any number of points between the instrumented application and the telemetry backend. OpenTelemetry provides sampler implementations which may be applied in the application and/or the Collector. Some telemetry backends may also apply sampling to trace data they receive, either automatically or with some developer intervention.
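
One common approach is to decide per trace from the trace ID, so that every service reaches the same decision for the same trace. The Python sketch below is written in the spirit of a ratio-based sampler; it is an assumption-laden illustration and not the exact algorithm used by the OpenTelemetry SDK or Collector. The function name `should_sample` is hypothetical.

```python
def should_sample(trace_id_hex, ratio):
    """Decide deterministically whether to keep a trace.

    trace_id_hex: 32 lowercase hex characters identifying the trace.
    ratio: fraction of traces to keep, between 0.0 and 1.0.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    # Treat the low 64 bits of the trace ID as a uniformly distributed number
    # and keep the trace if it falls below the threshold for this ratio.
    value = int(trace_id_hex[-16:], 16)
    return value < ratio * 2**64


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
keep_all = should_sample(trace_id, 1.0)
keep_none = should_sample(trace_id, 0.0)
```

Because the decision depends only on the trace ID and the ratio, all spans of a trace are kept or dropped together, which avoids traces with missing spans.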

Exporters

Exporters emit telemetry data to consumers, such as the Collector and telemetry backends. Exporters can be push or pull based.

OpenTelemetry Protocol - OTLP

OpenTelemetry Protocol (OTLP) is the OpenTelemetry native protocol for encoding, transport and delivery of telemetry data. OTLP is currently implemented over gRPC and HTTP transports.

Almost all telemetry backends that integrate with OpenTelemetry accept telemetry data in OTLP format. An application or OpenTelemetry Collector exports data to these backends using an OTLP exporter.

Using the OpenTelemetry SDK

The OpenTelemetry SDK implements the creation, sampling, batching and export of telemetry data. The SDK acts as an implementation of the OpenTelemetry API. For an application to export telemetry data, the SDK and its dependencies should be present and configured at runtime.

The SDK and its dependencies are added to an application in one of the following ways:

  • By using the OpenTelemetry instrumentation agent: In this option, the SDK and its dependencies are present but do not appear on the application classpath. Also, autoconfiguration is used for configuring the SDK.

  • By using the opentelemetry-sdk-extension-autoconfigure library as an application dependency: This option is for autoconfiguration of the SDK where the OpenTelemetry instrumentation agent is not present. The relevant optional SDK libraries (exporters, extensions, etc.) also need to be added as runtime dependencies.

  • By adding the SDK as a compile-time dependency to the application: This option is for programmatic configuration of the SDK. The relevant optional SDK libraries also need to be added as compile-time dependencies.

If the SDK is not present at application runtime, all OpenTelemetry API calls default to a no-op implementation where no telemetry data is created.
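
The relationship between API and SDK can be pictured with the following Python sketch. It is a conceptual model only, with hypothetical class and function names, and does not reproduce the real OpenTelemetry classes: instrumented code calls an API that defaults to a no-op, and installing an SDK-like implementation makes the same calls produce telemetry.

```python
class NoopSpan:
    """API default: accepts calls and records nothing."""

    def set_attribute(self, key, value):
        pass

    def end(self):
        pass


class NoopTracer:
    def start_span(self, name):
        return NoopSpan()


class RecordingSpan:
    """SDK-like implementation: keeps data and hands it over on end."""

    def __init__(self, name, finished):
        self._data = {"name": name, "attributes": {}}
        self._finished = finished

    def set_attribute(self, key, value):
        self._data["attributes"][key] = value

    def end(self):
        self._finished.append(self._data)


class RecordingTracer:
    def __init__(self):
        self.finished = []

    def start_span(self, name):
        return RecordingSpan(name, self.finished)


_tracer = NoopTracer()  # used when no SDK has been installed


def get_tracer():
    return _tracer


def set_tracer(tracer):
    global _tracer
    _tracer = tracer


def handle_request():
    """Instrumented code: depends on the API functions only."""
    span = get_tracer().start_span("handle-request")
    span.set_attribute("app.example", True)
    span.end()


handle_request()          # no SDK installed: runs, but creates no telemetry

sdk = RecordingTracer()
set_tracer(sdk)
handle_request()          # SDK installed: the same code now records a span
```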

Autoconfiguration

Autoconfiguration of the OpenTelemetry SDK refers to configuration using system properties or environment variables. Configuration of the OpenTelemetry instrumentation agent uses the same mechanism.
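
Each autoconfiguration setting has a system property name and a matching environment variable name, where the latter is the former in upper case with `.` and `-` replaced by `_`. The Python sketch below models that naming rule and a lookup that prefers system properties. The helper names are hypothetical, and the precedence shown is an assumption of this sketch that should be checked against the autoconfigure documentation for the SDK version in use.

```python
def env_var_name(property_name):
    """Derive the environment variable name for a system property name."""
    return property_name.upper().replace(".", "_").replace("-", "_")


def resolve(property_name, system_properties, environment, default=None):
    """Look up a setting, preferring system properties (assumed precedence)."""
    if property_name in system_properties:
        return system_properties[property_name]
    return environment.get(env_var_name(property_name), default)


environment = {
    "OTEL_SERVICE_NAME": "checkout",
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://localhost:4317",
}
system_properties = {"otel.service.name": "checkout-canary"}

service_name = resolve("otel.service.name", system_properties, environment)
endpoint = resolve("otel.exporter.otlp.endpoint", system_properties, environment)
```

In this example the service name comes from the system property, as would be set with `-Dotel.service.name=checkout-canary`, while the endpoint comes from the environment.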

Programmatic configuration

The SDK can be programmatically configured, as an alternative to autoconfiguration. This is a fallback option if autoconfiguration lacks the desired options.

This project clj-otel provides a module clj-otel-sdk for configuring the SDK in Clojure, as well as other support modules clj-otel-exporter-*, clj-otel-extension-* and clj-otel-sdk-extension-* for programmatic access to various optional components.

OpenTelemetry distros

An OpenTelemetry distro (or "distribution") supplied by a vendor is a repackaging of reference OpenTelemetry software, customised for ease of use with the vendor’s products. Distros are not forks, in that they do not extend or change the OpenTelemetry API.

It is not a requirement to use a vendor’s distro, since it should always be possible to use the reference OpenTelemetry software and configure it appropriately. The main advantage of using a distro is ease of use. A disadvantage is that the distro version sometimes lags behind the reference OpenTelemetry version.

OpenTelemetry Collector

The OpenTelemetry Collector is a vendor-agnostic deployable process to manage telemetry data as it flows from instrumented applications to telemetry backends. The Collector can transform telemetry data by, for example, inserting or filtering attributes. It removes the need to run multiple vendor-specific agents and collectors when working with several telemetry data formats and telemetry backends.

It is not required to use the OpenTelemetry Collector, though it simplifies telemetry data management in large systems with many instrumented services. Some exporters provided by OpenTelemetry have default options set to target a Collector instance running on the same host.
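
The receive, process and export flow of the Collector can be pictured with a short Python sketch. This is a conceptual model only: the real Collector is a separate deployable process configured declaratively, and every name below (`insert_attribute`, `drop_attributes`, `run_pipeline`, the attribute keys) is hypothetical.

```python
def insert_attribute(key, value):
    """Build a processor that adds an attribute to every span."""
    def processor(span):
        return {**span, "attributes": {**span["attributes"], key: value}}
    return processor


def drop_attributes(*keys):
    """Build a processor that filters out the named attributes."""
    def processor(span):
        kept = {k: v for k, v in span["attributes"].items() if k not in keys}
        return {**span, "attributes": kept}
    return processor


def run_pipeline(received, processors, exporters):
    """Pass each received span through the processors, then to every exporter."""
    for span in received:
        for process in processors:
            span = process(span)
        for export in exporters:
            export(span)


received = [{
    "name": "GET /orders",
    "attributes": {
        "http.request.method": "GET",
        "app.user.email": "alice@example.com",
    },
}]

backend_a, backend_b = [], []  # stand-ins for two telemetry backends
run_pipeline(
    received,
    processors=[
        insert_attribute("deployment.environment", "staging"),
        drop_attributes("app.user.email"),
    ],
    exporters=[backend_a.append, backend_b.append],
)
```

One pipeline serving two backends reflects the point made above: a single vendor-agnostic process replaces several vendor-specific agents.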

Alternative Clojure telemetry projects

The following are alternatives to OpenTelemetry in the Clojure ecosystem, concerned with telemetry data creation or processing.

Events & structured logs

  • μ/log : Micro-logging library that logs events and data, not words

  • ken : Observability library to instrument Clojure code

  • clojure.log4j2 : Sugar for using Log4j2 from Clojure, including MapMessage support

  • timbre-json-appender : Structured log appender for Timbre using jsonista

  • cartus : Structured logging abstraction with multiple backends

  • Cambium : Structured logging for Clojure

  • clj-journal : Structured logging to systemd journal using native systemd libraries and JNA (Java Native Access)

Traces

Unstructured logs

  • Timbre : Pure Clojure/Script logging library

  • clj-loga : Custom log formatting for Timbre

Metrics

Monitoring

  • salutem : Health check library

  • sereno : Uptime monitoring application

  • plumon : Clojure monitoring service with pluggable monitorables
