What Is Distributed Tracing

Published by Steve Flanders

If you’ve been working with microservices applications, you have most likely heard the phrase “distributed tracing”. That’s for good reason, as the complexity of distributed applications makes monitoring via traditional logs and metrics insufficient for debugging purposes. You may be wondering what distributed tracing is, why it matters, and what options exist to get you started.

In this post, I will cover the answers to all these questions.

The Distributed Tracing Trap

Before we jump into the inner workings of distributed tracing, it’s critical to point out one gotcha. Historically, vendors have provided their own SDKs and agents for collecting trace data. In the cloud-native era, open standards and open source solutions have commoditized the once-proprietary process of making your applications “trace-ready” and collecting the data. It is highly recommended to avoid, or migrate away from, the vendor-proprietary trap for instrumentation and data collection.

What is Distributed Tracing?

Distributed tracing is the collection of data related to end-to-end requests within a distributed or microservice-based application. An individual end-to-end request is known as a trace, and a trace is made up of one or more spans, where a span represents a single call within the request. A call may be to a separate microservice or to a function within a microservice. The general structure of a trace looks like the following:

[Figure: distributed trace waterfall]

The first span in a trace is known as the root span. It can be followed by one or more child spans. Child spans can also be nested as deep as the call stack goes. A span includes a service name, operation name, duration and, optionally, additional metadata.
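To make that structure concrete, here is a minimal, hypothetical sketch in Python. It is not the data model of any particular tracing library; the field and service names are made up for illustration. It shows a root span and a nested child span sharing the same trace ID.

```python
# Hypothetical, illustrative data model (not any particular library's types)
# showing how a trace decomposes into spans.
import uuid
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class Span:
    trace_id: str                  # shared by every span in the same trace
    span_id: str
    parent_span_id: Optional[str]  # None identifies the root span
    service_name: str
    operation_name: str
    duration_ms: float
    metadata: Dict[str, str] = field(default_factory=dict)

trace_id = uuid.uuid4().hex
root = Span(trace_id, uuid.uuid4().hex, None, "frontend", "GET /checkout", 182.0)
child = Span(trace_id, uuid.uuid4().hex, root.span_id, "payments", "charge", 95.0,
             metadata={"currency": "USD"})
```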

How Distributed Tracing Works

Distributed tracing data is generated through the instrumentation of applications, libraries, and frameworks. The instrumentation handles the creation of unique trace and span IDs, keeps track of duration, adds metadata, and handles context data. The context part, known as context propagation, is the most critical piece. It is responsible for passing context, such as the trace ID, between function and microservice calls, enabling an observer to view the entire transaction and each stop along the way. How context is propagated depends on the RPC mechanism you use. In the case of REST it is header-based (i.e. your application must pass headers along on service-to-service calls). In order to work properly, all services within a request must use the same context propagation format.
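As a rough illustration, here is what header-based propagation can look like for a REST call using the W3C Trace Context traceparent header. In practice an instrumentation library injects and extracts these headers for you; the helper functions below are my own, and the requests dependency is assumed.

```python
# Hypothetical sketch of header-based context propagation over REST.
# Real instrumentation libraries inject/extract these headers automatically.
import requests  # assumed HTTP client; any client that accepts headers works

def trace_headers(trace_id: str, parent_span_id: str) -> dict:
    # W3C Trace Context "traceparent" header: version-traceid-parentid-flags
    return {"traceparent": f"00-{trace_id}-{parent_span_id}-01"}

def call_downstream(url: str, trace_id: str, span_id: str) -> requests.Response:
    # Every outbound call forwards the same trace ID so the backend can
    # stitch all spans into a single end-to-end trace.
    return requests.get(url, headers=trace_headers(trace_id, span_id))
```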

Why is context propagation so critical? In addition to being required for distributed tracing, it can be used to enhance other observability data. For example, what if you could query for all the logs that were generated for a given request (i.e. distributed trace) through your application? With context propagation enabled, you can add information such as trace ID and span ID to your log messages, making this possible (more on this in a future post)!
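Here is a minimal sketch, using only the standard Python logging module, of how trace and span IDs might be attached to log messages. The field names and the way the IDs are obtained are assumptions for illustration; a tracing library would normally supply the current IDs.

```python
# Minimal sketch: stamping log records with trace/span IDs using only the
# standard library. How you obtain the current IDs depends on your tracer.
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s trace_id=%(trace_id)s "
           "span_id=%(span_id)s %(message)s",
)
log = logging.getLogger("checkout")

def log_with_context(message: str, trace_id: str, span_id: str) -> None:
    # `extra` makes the IDs available to the formatter above; log only
    # through this helper so every record carries both fields.
    log.info(message, extra={"trace_id": trace_id, "span_id": span_id})

log_with_context("charge accepted",
                 trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
                 span_id="00f067aa0ba902b7")
```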

Enabling Distributed Tracing

How do you enable distributed tracing? The two primary options are:

  1. Traffic inspection / Service mesh with context propagation — leverage existing proxy instrumentation to send trace data
  2. Code instrumentation with context propagation — create or integrate a language-specific tracing library at the application layer

Given that most people do not leverage traffic inspection or a service mesh in production today, I will focus on code instrumentation (a minimal code sketch follows the list below). The steps to get started are:

  1. Add a client library dependency
    • Choose a context propagation format
    • Instantiate a tracer
    • Configure a destination
  2. Add instrumentation to all service-to-service communication
  3. Enhance spans with useful metadata
    • Add key/value labels
    • Add events/logs
  4. Add additional instrumentation
    • Integrations (e.g. DB calls)
    • Function-level instrumentation
    • Async calls

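To make steps 1 and 3 concrete, here is a minimal sketch using the OpenCensus Python client. The module paths, constructor arguments, and the Zipkin destination are from memory and intended only as illustration; check the documentation for the library version you install. Step 2 and step 4 are usually covered by framework and client integrations rather than hand-written code.

```python
# Illustrative sketch of steps 1 and 3 with the OpenCensus Python client.
# Module paths and constructor arguments may differ across library versions.
from opencensus.trace import samplers
from opencensus.trace.tracer import Tracer
from opencensus.ext.zipkin.trace_exporter import ZipkinExporter
from opencensus.trace.propagation.trace_context_http_header_format import (
    TraceContextPropagator,  # W3C Trace Context propagation format
)

# Step 1: instantiate a tracer with a sampler, a context propagation
# format, and a destination (a Zipkin-compatible backend in this example).
tracer = Tracer(
    sampler=samplers.AlwaysOnSampler(),
    exporter=ZipkinExporter(service_name="checkout",
                            host_name="localhost", port=9411),
    propagator=TraceContextPropagator(),
)

# Step 3: wrap a unit of work in a span and enhance it with metadata.
with tracer.span(name="charge-credit-card") as span:
    span.add_attribute("customer.tier", "gold")       # key/value label
    span.add_annotation("payment gateway accepted")   # event/log on the span
```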
Leveraging open standards and open source data collection is one of the easiest ways to ensure that you get value out of your trace data without locking you into a vendor. Let’s walk through each of the decisions that need to be made as well as the options available today.

Context Propagation

Client Instrumentation

Note: Also known as client libraries

Data Collection

This includes an agent and/or collector

Backend / UI

Where to Begin

If starting today, selecting either OpenCensus or OpenTracing for instrumentation is recommended. Later this year, OpenTelemetry will become available, and will be backwards compatible with both OpenCensus and OpenTracing. For header propagation, W3C would be ideal, but if the client libraries you have chosen do not support W3C yet then B3 is a good alternative. Regardless of the instrumentation and context propagation format you choose, you should deploy and leverage the OpenCensus Service for data collection. This Service provides an Agent and Collector that can be used to receive common open-source and commercial formats and send distributed tracing as well as metric data to one or more backends. This eliminates the need to stand up collection mechanisms per vendor and provides advanced functionality required to handle observability data including buffering and retry at scale, encryption, data redaction, and tail-based sampling.
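For the data collection piece, the application can export to a locally running OpenCensus Agent instead of directly to a backend. The sketch below again uses the OpenCensus Python client; the module path, parameter names, and the agent port are assumptions from memory and may differ in your version.

```python
# Hypothetical sketch: exporting to a local OpenCensus Agent, which then
# forwards the data to one or more backends via the Collector.
# Module path, parameter names, and port are assumptions; verify them
# against the version of the library you install.
from opencensus.ext.ocagent.trace_exporter import TraceExporter

agent_exporter = TraceExporter(service_name="checkout",
                               endpoint="localhost:55678")
# Pass `agent_exporter` as the tracer's exporter (see the earlier sketch);
# the Agent/Collector then handle buffering, retry, and backend fan-out.
```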

Summary

The move to distributed architectures has disrupted the traditional monitoring and troubleshooting landscape. In distributed architectures, context and correlation are required in order to solve availability as well as performance issues within an application or environment. Distributed tracing provides the context and correlation that are missing from traditional metrics and logs. In the cloud-native world, open source and open standards are commoditizing the instrumentation, propagation, and collection of observability data. These technologies are reducing the friction related to telemetry collection and are laying the foundation for more powerful observability. When looking for an observability solution, you should be looking at open standards and open source data collection options, like those provided by OpenTelemetry, and sending the data to either an internally managed or a commercial backend. If you are in the middle of this journey, or haven’t yet seen the value from the tracing data you’ve collected, reach out to me at steve at omnition dot io and I can help you gain more insight from the data, whether via instrumentation or by using a backend like the one we have created at Omnition.

Categories: [Observability]

Tags: [OpenCensus OpenTelemetry distributed tracing microservices cloud-native]
