Intelligent Sampling With OpenCensus

Published by Steve Flanders

Historically, traces have been sampled. Ideally, you’d keep all traces; however, that is not always an option. OpenCensus is the first open-source, vendor-agnostic technology to offer intelligent (tail-based) sampling. In this post, I would like to demonstrate why intelligent sampling is valuable over traditional head-based sampling and how you can configure it in OpenCensus.

This is a big deal!

A Background on Trace Sampling

Tracing data is notoriously verbose given that it tracks every request within an application. For high-volume applications this results in a high volume of traces. Some of these traces, such as traces that contain errors, are very important, while others, like every HTTP 200 response with very similar latency, are only valuable when calculating Service Level Indicators. If you have used or configured distributed tracing then you have likely come across the term “sampling”. As the name implies, sampling means you only collect and analyze a subset of the available tracing data. Sampling typically gets configured because of the perceived cost and overhead of collecting, transmitting, and storing every trace.

Sampling has been the de facto method for capturing trace data, for better or worse. Let’s walk through the two different types of sampling leveraged in the distributed tracing world.

Head-based

With head-based sampling, the sampling decision is made at the beginning of the request (i.e. when the root span is created). Head-based sampling is the most common form of sampling because it is easy to implement. The problem is that the decision is made at the beginning of the request, so it is unaware of what might happen later in the request, such as an error. This means head-based sampling is good at reducing verbosity but bad at ensuring relevant traces are kept.
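To make this concrete, here is a minimal sketch of what head-based sampling typically looks like in client instrumentation, using the OpenCensus Go library; the 5% rate and the span name are arbitrary examples, not values taken from the Collector configuration shown later.

package main

import (
	"context"

	"go.opencensus.io/trace"
)

func main() {
	// Head-based: the sampler runs when the root span is started, so the
	// keep/drop decision is made before anything else happens in the request.
	trace.ApplyConfig(trace.Config{
		DefaultSampler: trace.ProbabilitySampler(0.05), // keep roughly 5% of traces
	})

	ctx, span := trace.StartSpan(context.Background(), "example/request")
	defer span.End()
	_ = ctx // child spans created from ctx inherit the root decision
}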

Tail-based

With tail-based sampling (also known as intelligent sampling), the sampling decision is made at the end of the request (i.e. once all spans for a given trace ID have been received). As a result, it is possible to keep the relevant traces while keeping few, or none, of the less relevant ones. While tail-based sampling is often more desirable than head-based sampling, it does introduce additional complexity. For example, all spans for a given trace need to be processed by the same system, and the trace cannot be sent to a backend until the entire trace has been collected and a sampling decision has been made (it is hard to know when a trace is complete). In addition, this analysis requires more compute power, which needs to be factored into how you architect your collection infrastructure.
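The Collector handles all of this for you, but as a rough sketch of the core idea (simplified Go, not the Collector’s actual implementation, with a placeholder Span type and a keep-traces-with-errors policy), tail-based sampling buffers spans by trace ID, waits a decision window, and then evaluates the whole trace:

package sampling

import "time"

// Span is a simplified placeholder; only the fields this sketch needs.
type Span struct {
	TraceID  string
	HasError bool
}

// tailSample buffers spans per trace ID, waits decisionWait from the first
// span of each trace, then keeps the whole trace only if the policy matches
// (here: any span in the trace recorded an error).
func tailSample(in <-chan Span, decisionWait time.Duration, keep func([]Span)) {
	traces := map[string][]Span{}
	firstSeen := map[string]time.Time{}
	tick := time.NewTicker(time.Second)
	defer tick.Stop()
	for {
		select {
		case s := <-in:
			if _, ok := firstSeen[s.TraceID]; !ok {
				firstSeen[s.TraceID] = time.Now()
			}
			traces[s.TraceID] = append(traces[s.TraceID], s)
		case now := <-tick.C:
			for id, start := range firstSeen {
				if now.Sub(start) < decisionWait {
					continue // trace may still be receiving spans
				}
				for _, s := range traces[id] {
					if s.HasError {
						keep(traces[id])
						break
					}
				}
				delete(traces, id)
				delete(firstSeen, id)
			}
		}
	}
}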

Sampling in OpenCensus

The OpenCensus Service offers both head-based and tail-based sampling in the OpenCensus Collector. While tail-based sampling must be configured at the collection layer, moving head-based sampling into the Collector makes it possible to remove one more configuration parameter from the client instrumentation, making reconfiguration even easier. Let’s walk through how to configure each sampling type.

Note: While head-based sampling will be added to the OpenCensus Agent in the future, tail-based sampling must happen in the OpenCensus Collector since the sampling decision needs to be made after the entire trace has been collected.

Head-based

Head-based sampling is configured via a probabilistic rate defined as a percentage. For example:

sampling:
  mode: head
  policies:
    # sample based on hashing the trace ID
    probabilistic:
      configuration:
        sampling-percentage: 5
        # must have a unique hash-seed per cluster tier
        # most will have a single tier
        hash-seed: 1

With head-based sampling, any Collector can receive spans for a given trace. Hashing the trace ID ensures that all Collectors make the same sampling decision for that trace.
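A simplified illustration of that idea (not the Collector’s exact hashing scheme) is to hash the seed together with the trace ID and compare the result against the configured percentage; every Collector computes the same number for the same trace ID, so the decision is consistent everywhere.

package sampling

import "hash/fnv"

// sampleTrace makes a deterministic head-sampling decision: the same traceID
// and hashSeed always produce the same answer, no matter which Collector
// instance evaluates them.
func sampleTrace(traceID []byte, hashSeed uint32, percentage uint64) bool {
	h := fnv.New64a()
	seed := []byte{byte(hashSeed), byte(hashSeed >> 8), byte(hashSeed >> 16), byte(hashSeed >> 24)}
	h.Write(seed)
	h.Write(traceID)
	return h.Sum64()%100 < percentage
}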

Tail-based

Tail-based sampling is configured as a policy per exporter. Today, a single sampling policy can be applied to each defined exporter (the plan is to support multiple policies per exporter in the future). Supported policies include string-attribute-filter and numeric-attribute-filter, both of which appear in the example below.

Configuration is done via the sampling configuration section. For example:

sampling:
  mode: tail
  # amount of time from seeing the first
  # span in a trace until making the
  # sampling decision
  decision-wait: 10s
  # maximum number of traces kept in
  # memory
  num-traces: 10000
  policies:
    # user-defined policy name
    my-string-attribute-filter:
      # exporters the policy applies to
      exporters:
        - jaeger
      policy: string-attribute-filter
      configuration:
        key: key1
        values:
          - value1
          - value2
    my-numeric-attribute-filter:
      exporters:
        - zipkin
      policy: numeric-attribute-filter
      configuration:
        key: key1
        min-value: 0
        max-value: 100

In addition to the tail-based policies defined above, it is important to understand the decision-wait configuration parameter. This parameter specifies how long to wait, after the first span of a trace is seen, before applying the sampling policy. If you know you have traces that take longer than ten seconds to complete, you should increase this value.

Given that tail-based sampling requires all spans for a given trace to arrive at the same Collector, you must either use a single Collector or leverage an external load-balancing technique that performs trace-ID-based routing.
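As a rough sketch of what such a routing layer could do (the endpoint list is purely hypothetical), hashing the trace ID into a fixed set of Collector endpoints sends every span of a trace to the same instance:

package sampling

import "hash/fnv"

// collectorFor consistently routes all spans of a trace to the same Collector
// by hashing the trace ID into a fixed list of backend endpoints.
func collectorFor(traceID []byte, endpoints []string) string {
	h := fnv.New64a()
	h.Write(traceID)
	return endpoints[h.Sum64()%uint64(len(endpoints))]
}

Note that adding or removing endpoints with this simple modulo scheme reshuffles which Collector owns which trace, so in practice a consistent-hashing approach is preferable.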

Summary

With the OpenCensus Collector you can configure head-based or tail-based sampling of your distributed tracing data. It supports a variety of sampling policies today and has an extensible backend, making it easy to add more policies as desired. Of course, you also have the option of not sampling (the default in the Agent/Collector) or configuring sampling in the client instrumentation. The OpenCensus Service provides choice to ensure all of your business requirements are met. If you are considering enabling sampling, drop me an email at steve [at] omnition [dot] io; complete trace collection is possible with low overhead and cost while providing complete observability into your application.

Categories: [Observability]

Tags: [OpenCensus, OpenCensus Service, sampling]
