Intelligent Sampling With OpenCensus
Published by Steve Flanders on
Historically, traces have been sampled. Ideally, you’d keep all traces, but that is not always an option. OpenCensus is the first open-source, vendor-agnostic technology to offer intelligent (tail-based) sampling. In this post, I would like to demonstrate why intelligent sampling is more valuable than traditional head-based sampling and how you can configure it in OpenCensus.
This is a big deal!
A Background on Trace Sampling
Tracing data is notoriously verbose given that it tracks every request within an application. For high-volume applications this results in a high volume of trace data. Some of these traces, such as traces that contain errors, are very important, while others, like every HTTP status code 200 with very similar latency, are only valuable when calculating Service Level Indicators. If you have used or configured distributed tracing then you have likely come across the term “sampling”. As the name implies, sampling means you only collect and analyze a subset of the tracing data available. The typical reasons for configuring sampling include the perceived need to:
- Reduce application overhead: do not introduce any tracing overhead unless you plan to capture the trace
- Reduce operational overhead: reduce the amount of data that needs to be processed and stored
- Reduce cost: reduce the amount of CPU, memory, network, and disk required
Sampling has been the de facto method for capturing trace data, for better or worse. Let’s walk through the two different types of sampling leveraged in the distributed tracing world.
With head-based sampling, the sampling decision is made at the beginning of the request (i.e. when the root span is about to be created). Head-based sampling is the most common form of sampling because it is easy to implement. The problem is that a decision made at the beginning of the request cannot take into account what happens later in the request, such as an error or unusually high latency. This means head-based sampling is good at reducing verbosity but bad at ensuring relevant traces are kept.
With tail-based sampling (also known as intelligent sampling), the sampling decision is made at the end of the request (i.e. when all spans for a given trace ID have been received). As a result, it is possible to capture relevant traces while minimally or never sampling less relevant traces. While tail-based sampling is often more desirable than head-based sampling, it does introduce additional complexity. For example, all spans for a given trace need to be processed by the same system and the trace cannot be ingested by a backend until the entire trace is collected and a sampling decision is made (it is hard to know when a trace is complete). In addition, this analysis requires more compute power and will need to be factored into how you architect your collection infrastructure.
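To make the difference concrete, here is a minimal sketch in Python. It is not OpenCensus code: the `Span` class and both decision functions are hypothetical names invented for illustration, and the tail-based rule (keep any trace that contains an error) is just one possible policy.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    trace_id: str
    name: str
    is_error: bool = False


def head_decision(rate: float, rng: random.Random) -> bool:
    """Decide when the root span starts, before any outcome is known."""
    return rng.random() < rate


def tail_decision(spans: List[Span]) -> bool:
    """Decide after the whole trace has arrived: keep it if any span failed."""
    return any(span.is_error for span in spans)


trace = [
    Span("t1", "GET /checkout"),
    Span("t1", "charge-card", is_error=True),  # the failure happens late in the request
]

# The head-based decision never looks at the spans, so at a 1% rate
# this failing trace is almost always discarded.
kept_by_head = head_decision(rate=0.01, rng=random.Random())

# The tail-based decision sees the error and keeps the trace.
kept_by_tail = tail_decision(trace)
```

The head-based function has no access to the spans at all, which is exactly why it cannot favor the interesting traces.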
Sampling in OpenCensus
The OpenCensus Service offers both head-based and tail-based sampling in the OpenCensus Collector. While tail-based sampling must be configured at the collection layer, head-based sampling in the Collector removes one more configuration parameter from the client instrumentation, making reconfiguration even easier. Let’s walk through how to configure each sampling type.
Note: While head-based sampling will be added to the OpenCensus Agent in the future, tail-based sampling must happen in the OpenCensus Collector since the sampling decision needs to be made after the entire trace has been collected.
Head-based sampling is configured via a probabilistic rate defined as a percentage.
With head-based sampling, any Collector can receive a span for a given trace. Hashing ensures that all Collectors sample a given trace consistently.
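The idea of hash-based consistency can be sketched as follows. This is an illustration under stated assumptions, not the Collector's implementation: the function name `should_sample`, the use of SHA-256, the `seed` parameter, and the bucket count of 10,000 are all my own choices for the example.

```python
import hashlib


def should_sample(trace_id: str, percentage: float, seed: int = 0) -> bool:
    """Map the trace ID to a stable bucket in [0, 10000) and compare it to the rate."""
    digest = hashlib.sha256(f"{seed}:{trace_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percentage * 100  # e.g. 5% keeps buckets 0..499


# Two independent "Collectors" evaluating the same trace IDs with the same
# rate and seed reach the same decision for every trace.
trace_ids = [f"{i:032x}" for i in range(10_000)]
collector_a = [should_sample(t, 5.0) for t in trace_ids]
collector_b = [should_sample(t, 5.0) for t in trace_ids]
```

Because the decision depends only on the trace ID, the rate, and the seed, no coordination between Collectors is needed, provided they all share the same rate and seed.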
Tail-based sampling is configured based on a policy per exporter. Today, a single sampling policy can be applied to each defined exporter (the plan is to support multiple policies per exporter in the future). The following sampling policies are supported:
- rate limiting: the maximum number of spans per second to export
- string tag filter: traces with the specified key/string-value tags are exported
- numeric tag filter: traces with the specified key/numeric-value tags are exported
- always sample: send all traces as complete traces
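The four policies above can be sketched as small Python callables that take a completed trace and return whether to export it. Everything here is hypothetical and for illustration only: the `Trace` class, the function and parameter names, the min/max range used for the numeric filter, and the per-second counting window for rate limiting are assumptions, not the Collector's actual API or configuration keys.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Union


@dataclass
class Trace:
    trace_id: str
    span_count: int
    tags: Dict[str, Union[str, int, float]] = field(default_factory=dict)


def always_sample(trace: Trace) -> bool:
    """Export every trace as a complete trace."""
    return True


def string_tag_filter(key: str, values: List[str]) -> Callable[[Trace], bool]:
    """Export traces whose tag `key` equals one of the given strings."""
    def policy(trace: Trace) -> bool:
        return trace.tags.get(key) in values
    return policy


def numeric_tag_filter(key: str, min_value: float, max_value: float) -> Callable[[Trace], bool]:
    """Export traces whose numeric tag `key` falls within [min_value, max_value]."""
    def policy(trace: Trace) -> bool:
        value = trace.tags.get(key)
        return isinstance(value, (int, float)) and min_value <= value <= max_value
    return policy


class RateLimiting:
    """Export traces until the span budget for the current second is used up."""

    def __init__(self, spans_per_second: int, clock: Callable[[], float] = time.monotonic):
        self.spans_per_second = spans_per_second
        self.clock = clock
        self._second = None
        self._spans_this_second = 0

    def __call__(self, trace: Trace) -> bool:
        second = int(self.clock())
        if second != self._second:  # a new one-second window starts
            self._second, self._spans_this_second = second, 0
        if self._spans_this_second + trace.span_count > self.spans_per_second:
            return False
        self._spans_this_second += trace.span_count
        return True


failed = Trace("a", span_count=3, tags={"http.status_code": 500, "env": "prod"})
keep_server_errors = numeric_tag_filter("http.status_code", 500, 599)
keep_prod = string_tag_filter("env", ["prod"])
```

Since every policy has the same shape (a trace in, a boolean out), attaching one policy to each exporter amounts to storing one such callable per exporter.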
Configuration is done via the sampling configuration section.
In addition to the tail-based policies defined above, it is important to understand the decision-wait configuration parameter. This parameter specifies how long to wait before applying the sampling policy. If you know you have traces that take longer than ten seconds to complete then you should change this value.
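The effect of the wait can be illustrated with a small buffer that holds spans per trace ID and applies the policy once the wait has elapsed since the first span was seen. This is a sketch under my own assumptions, not the Collector's code: `DecisionWaitBuffer`, `has_error`, and `run` are hypothetical names, and timestamps are passed in explicitly to keep the example deterministic.

```python
from typing import Callable, Dict, List, Tuple


class DecisionWaitBuffer:
    def __init__(self, decision_wait: float, policy: Callable[[List[dict]], bool]):
        self.decision_wait = decision_wait
        self.policy = policy
        # trace_id -> (time the first span was seen, spans received so far)
        self._pending: Dict[str, Tuple[float, List[dict]]] = {}

    def add_span(self, span: dict, now: float) -> None:
        _, spans = self._pending.setdefault(span["trace_id"], (now, []))
        spans.append(span)

    def flush(self, now: float) -> List[List[dict]]:
        """Apply the policy to every trace whose wait has elapsed; return the kept ones."""
        kept = []
        while self._pending:
            trace_id, (first_seen, spans) = next(iter(self._pending.items()))
            if now - first_seen < self.decision_wait:
                break  # dicts keep insertion order, so the remaining traces are younger
            del self._pending[trace_id]
            if self.policy(spans):
                kept.append(spans)
        return kept


def has_error(spans: List[dict]) -> bool:
    return any(span.get("error") for span in spans)


def run(decision_wait: float) -> List[int]:
    """Replay a 12-second trace and report the span count of each exported trace."""
    buf = DecisionWaitBuffer(decision_wait, policy=has_error)
    buf.add_span({"trace_id": "t1", "name": "root"}, now=0.0)
    exported = buf.flush(now=10.0)
    buf.add_span({"trace_id": "t1", "name": "slow-db-call", "error": True}, now=12.0)
    exported += buf.flush(now=30.0)
    return [len(spans) for spans in exported]


print(run(10.0))  # [1]: the decision came too early, so only a fragment is exported
print(run(30.0))  # [2]: the longer wait lets the complete trace be evaluated
```

With the ten-second wait, the root span is evaluated (and dropped) before the failing span arrives, and the late span is later exported on its own as an incomplete trace.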
Given that tail-based sampling requires all spans for a given trace to arrive at the same Collector, you must either use a single Collector or leverage an external load-balancing technique that routes by trace ID.
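One simple way to route by trace ID is to hash the ID and pick a Collector by taking the result modulo the number of Collectors. The sketch below assumes this approach; the function name `pick_collector`, the use of SHA-256, and the Collector addresses are all hypothetical.

```python
import hashlib
from typing import List


def pick_collector(trace_id: str, collectors: List[str]) -> str:
    """Send every span of a trace to the same Collector by hashing its trace ID."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(collectors)
    return collectors[index]


# Hypothetical Collector addresses, for illustration only.
collectors = ["collector-0:55678", "collector-1:55678", "collector-2:55678"]

# Spans of the same trace may arrive from different services,
# but they all resolve to the same destination.
first_hop = pick_collector("4bf92f3577b34da6a3ce929d0e0e4736", collectors)
second_hop = pick_collector("4bf92f3577b34da6a3ce929d0e0e4736", collectors)
```

Note that plain modulo hashing remaps most trace IDs when the number of Collectors changes, so traces in flight during a scaling event can be split; a consistent-hashing scheme reduces that effect.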
With the OpenCensus Collector, you can configure head-based or tail-based sampling of your distributed tracing data. It supports a variety of sampling policies today and has an extensible backend, making it easy to add more policies as desired. Of course, you also have the option of not sampling (the default in the Agent/Collector) or configuring sampling in the client instrumentation. The OpenCensus Service provides choice to ensure all of your business requirements are met. If you are considering enabling sampling, drop me an email at steve [at] omnition [dot] io, as enabling complete trace collection is possible with low overhead and cost while providing complete observability into your application.