What Is Distributed Tracing
Published by Steve Flanders on
If you’ve been working with microservices applications, you have most likely heard the phrase “distributed tracing”. That’s for good reason, as the complexity of distributed applications makes monitoring via traditional logs and metrics insufficient for debugging purposes. You may be wondering what distributed tracing is, why it matters, and what options exist to get you started.
In this post, I will cover the answers to all these questions.
The Distributed Tracing Trap
Before we jump into the inner workings of distributed tracing, it's critical to point out one gotcha. Historically, vendors have provided their own SDKs and agents for collecting trace data. In the cloud-native era, open standards and open-source solutions have commoditized the previously proprietary process of making your applications "trace-ready" and collecting the data. You should avoid, or migrate away from, the vendor-proprietary trap for instrumentation and data collection.
What is Distributed Tracing?
Distributed tracing is the collection of data related to end-to-end requests within a distributed or microservice-based application. An individual end-to-end request is known as a trace, and a trace is made up of one or more spans, where each span represents a single call within the request. A call may be to a separate microservice or to a function within a microservice. The general structure of a trace looks like the following:
The first span in a trace is known as the root span. It can be followed by one or more child spans. Child spans can also be nested as deep as the call stack goes. A span includes a service name, operation name, duration and, optionally, additional metadata.
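To make that structure concrete, here is a minimal sketch of a span in plain Python. This is not the data model of any particular tracing library; the field names are illustrative:

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    """A simplified span: a single call within a distributed trace."""
    service_name: str
    operation_name: str
    trace_id: str                           # shared by every span in the trace
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    parent_span_id: Optional[str] = None    # None marks the root span
    duration_ms: float = 0.0
    metadata: dict = field(default_factory=dict)

# A two-span trace: a frontend call that fans out to a backend service.
trace_id = uuid.uuid4().hex
root = Span("frontend", "GET /checkout", trace_id)
child = Span("payments", "charge_card", trace_id, parent_span_id=root.span_id)
```

Every span carries the same trace ID, while the parent span ID encodes the nesting of calls, so a backend can reassemble the tree.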
How Distributed Tracing Works
Distributed tracing data is generated through the instrumentation of applications, libraries, and frameworks. The instrumentation handles the creation of unique trace/span IDs, keeping track of durations, adding metadata, and handling context data. The context part, known as context propagation, is the most critical piece. It is responsible for passing context, such as the trace ID, between function and microservice calls, enabling an observer to view the entire transaction and each stop along the way. Context propagation is performed based on the RPC mechanism you use. In the case of REST, it is header-based (i.e. your application must pass headers in all service-to-service calls). To work properly, all services within a request must use the same context propagation format.
Why is context propagation so critical? In addition to being required for distributed tracing, it can be used to enhance other observability data. For example, what if you could query for all the logs that were generated for a given request (i.e. distributed trace) through your application? With context propagation enabled you can add information such as trace ID and span ID to your log messages making this possible (more on this in a future post)!
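As a sketch of the idea, here is how trace context can be stamped onto log messages using only Python's standard `logging` module (the field names and values are illustrative):

```python
import io
import logging

# Formatter that expects trace context on each record (supplied via `extra`).
handler = logging.StreamHandler(io.StringIO())
handler.setFormatter(logging.Formatter(
    "%(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"))
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)

# The active span's context is attached to every log line it emits.
ctx = {"trace_id": "4bf92f3577b34da6", "span_id": "00f067aa0ba902b7"}
log.info("payment authorized", extra=ctx)

output = handler.stream.getvalue()
```

With the trace ID present on every line, a log backend can return all log messages for a given request with a single query.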
Enabling Distributed Tracing
How do you enable distributed tracing? The two primary options are:
- Traffic inspection / Service mesh with context propagation — leverage existing proxy instrumentation to send trace data
- Code instrumentation with context propagation — creating or implementing a language-specific library at the application layer
Given that most people do not leverage traffic inspection or a service mesh in production today, I will focus on code instrumentation. The steps to get started are:
- Add a client library dependency
- Choose a context propagation format
- Instantiate a tracer
- Configure a destination
- Add instrumentation to all service-to-service communication
- Enhance spans with useful metadata
- Add key/value labels
- Add events/logs
- Add additional instrumentation
- Integrations (e.g. DB calls)
- Function-level instrumentation
- Async calls
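The instrumentation steps above can be sketched with a toy tracer. This is not a real client library (in practice you would use an open-source one, as discussed below); it only illustrates instantiating a tracer, wrapping calls in spans, and enhancing spans with key/value labels:

```python
import time
import uuid
from contextlib import contextmanager

class MiniTracer:
    """A toy tracer illustrating the steps above, not a production library."""
    def __init__(self, service_name):
        self.service_name = service_name
        self.finished = []   # stand-in for "configure a destination"
        self._stack = []     # tracks the active (parent) span

    @contextmanager
    def span(self, operation, **labels):
        s = {
            "service": self.service_name,
            "operation": operation,
            "trace_id": self._stack[0]["trace_id"] if self._stack else uuid.uuid4().hex,
            "span_id": uuid.uuid4().hex[:16],
            "parent_id": self._stack[-1]["span_id"] if self._stack else None,
            "labels": labels,   # key/value metadata on the span
            "start": time.time(),
        }
        self._stack.append(s)
        try:
            yield s
        finally:
            self._stack.pop()
            s["duration_ms"] = (time.time() - s["start"]) * 1000
            self.finished.append(s)

tracer = MiniTracer("checkout")
with tracer.span("GET /checkout", http_method="GET"):
    with tracer.span("query_inventory"):   # nested call becomes a child span
        pass

root, = [s for s in tracer.finished if s["parent_id"] is None]
```

A real library does the same bookkeeping, plus context propagation across process boundaries and export to a collector.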
Leveraging open standards and open source data collection is one of the easiest ways to ensure that you get value out of your trace data without locking you into a vendor. Let’s walk through each of the decisions that need to be made as well as the options available today.
Context Propagation
- B3 (from Zipkin) – probably the most common propagation format today, because it was the first open-source solution
- Uber (from Jaeger) – the second most widely used format, given Jaeger's inclusion in the CNCF
- W3C Trace Context – a new standard that is about to GA (this is what you should be moving to, as it is an agreed-upon standard created by thought leaders in the space)
- Vendor / Service provider – these proprietary formats should be avoided, as they lead to vendor lock-in and create challenges deriving context from services that use alternative means of context propagation
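To make the W3C Trace Context format tangible, here is a minimal parser for its `traceparent` header, which encodes a version, a 32-hex-character trace ID, a 16-hex-character parent span ID, and trace flags (a sketch; real libraries do stricter validation):

```python
def parse_traceparent(value):
    """Parse a W3C Trace Context `traceparent` header of the form
    version-traceid-parentid-flags, e.g. 00-<32 hex>-<16 hex>-<2 hex>."""
    version, trace_id, parent_id, flags = value.split("-")
    assert version == "00" and len(trace_id) == 32 and len(parent_id) == 16
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": int(flags, 16) & 0x01 == 1,  # lowest flag bit = sampled
    }

ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
```

The fixed, vendor-neutral layout is what lets services written against different libraries still continue the same trace.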
Instrumentation Libraries
Note: Also known as client libraries
- Zipkin – Released in 2012 and originally backed by Twitter; OpenTracing compatible
- Jaeger – Released in 2016, backed by Uber, and a CNCF project; OpenTracing compatible
- OpenTracing – Released in 2016 and a CNCF project
- OpenCensus – Released in 2018 and backed by Google, Omnition and Microsoft
- OpenTelemetry – Announced in 2019 and a CNCF project. It will replace OpenTracing and OpenCensus, combining the best of both worlds by providing an implementation as well as support for many client libraries (you should move to this once available)
Data Collection
Note: This includes an agent and/or collector
- Jaeger – Also supports Zipkin format
- OpenCensus Service – Created by Google and Omnition and supports many formats including Zipkin and Jaeger
- OpenTelemetry Service – This will replace the OpenCensus Service (should move to this once available)
- Vendor / Service provider – these proprietary (sometimes open-source) agents and collectors should be avoided as they lead to vendor lock-in
Backend / UI
- There are lots of commercial players in this space – and for good reason, given the need to support large scale and rich analytics. This is the only proprietary solution you should consider for your observability stack.
Where to Begin
If starting today, selecting either OpenCensus or OpenTracing for instrumentation is recommended. Later this year, OpenTelemetry will become available, and will be backwards compatible with both OpenCensus and OpenTracing. For header propagation, W3C would be ideal, but if the client libraries you have chosen do not support W3C yet then B3 is a good alternative. Regardless of the instrumentation and context propagation format you choose, you should deploy and leverage the OpenCensus Service for data collection. This Service provides an Agent and Collector that can be used to receive common open-source and commercial formats and send distributed tracing as well as metric data to one or more backends. This eliminates the need to stand up collection mechanisms per vendor and provides advanced functionality required to handle observability data including buffering and retry at scale, encryption, data redaction, and tail-based sampling.
The move to distributed architectures has disrupted the traditional monitoring and troubleshooting landscape. In distributed architectures, context and correlation are required in order to solve availability as well as performance issues within an application or environment. Distributed tracing provides the context and correlation that are missing from traditional metrics and logs. In the cloud-native world, open source and open standards are commoditizing the instrumentation, propagation, and collection of observability data. These technologies are reducing the friction related to telemetry collection and are laying the foundation for more powerful observability. When evaluating an observability solution, you should look for open standards and open-source data collection options like those provided by OpenTelemetry, sending to either an internally managed or commercial backend. If you are in the middle of this journey, or haven't yet seen the value from the tracing data you've collected, reach out to me at steve at omnition dot io and I can help you gain more insight from the data, whether via instrumentation or by using a backend like we have created at Omnition.