Zipkin 1.29

@codefromthecrypt released this 07 Aug 09:32

Zipkin 1.29 models messaging spans, shows errors in the service graph and supports Elasticsearch 6

Message tracing

Producing and consuming messages from a broker, such as RabbitMQ or Kafka, is similar to one-way RPC but not the same. For example, one message can have multiple consumers, and the producer often can't know whether this will be the case. Also, particularly in Kafka, consuming a message is often completely decoupled from processing it, and consumption may happen in bulk.

Through community discussion, notably advice from @bogdandrutu from Census, we reached this conclusion for message tracing with Zipkin:

  • A message consumer span should always be a child of the producing span (and not a linked trace)
    • If using B3, this means X-B3-SpanId is the parent of the consumer span
  • "ms" and "mr" annotate message send and receive events
    • span2 format replaces these with Span.Kind.PRODUCER, CONSUMER
  • If producer and consumer spans include duration, it should only reflect local batching delay
    • time spent processing a message should be in a different child span
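
The parent/child rules above can be sketched in a few lines of Python. This is an illustrative model only, not Zipkin or tracer library code: the Span class and the helper functions are hypothetical, and only the X-B3-TraceId and X-B3-SpanId header names come from B3.

```python
import secrets
from dataclasses import dataclass
from typing import Dict, Optional


def new_id() -> str:
    """Return a random 64-bit id as 16 lowercase hex characters."""
    return secrets.token_hex(8)


@dataclass
class Span:  # hypothetical model, not the Zipkin span type
    trace_id: str
    id: str
    parent_id: Optional[str]
    kind: Optional[str]  # "PRODUCER", "CONSUMER", or None for a local span
    name: str


def inject_b3(producer: Span) -> Dict[str, str]:
    """Copy the producer span's identifiers into message headers."""
    return {"X-B3-TraceId": producer.trace_id, "X-B3-SpanId": producer.id}


def consumer_span(headers: Dict[str, str]) -> Span:
    """Start a consumer span whose parent is the producer's X-B3-SpanId."""
    return Span(
        trace_id=headers["X-B3-TraceId"],
        id=new_id(),
        parent_id=headers["X-B3-SpanId"],
        kind="CONSUMER",
        name="receive",
    )


def processing_span(consumer: Span) -> Span:
    """Time spent processing a message goes in a separate child span."""
    return Span(consumer.trace_id, new_id(), consumer.id, None, "process")


# One message, two independent consumers: both are children of the producer.
producer = Span(new_id(), new_id(), None, "PRODUCER", "send")
headers = inject_b3(producer)
first = consumer_span(headers)
second = consumer_span(headers)
work = processing_span(first)
```

Because each consumer creates its own span ID under the same parent, the producer does not need to know how many consumers exist.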

There are diagrams of how instrumentation works with this model on the website. You can also look at @ImFlog's Kafka 0.11 tracing work in progress. If you have more questions or want to share your work, contact us on gitter.

Visualizing error count between services

Thanks to @hfgbarrigas' initial work, and lots of review support by @shakuzen,
we now have errorCount on dependency links, indicating how many of the calls
counted in callCount between services were in error.

MySQL users who want this need to add the error_count column:

alter table zipkin_dependencies add `error_count` BIGINT

The UI is relatively simple: it colors the line yellow when 50% or more of calls are in error, and red when 75% or more are. These rates can be overridden or disabled with configuration.
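
As a rough sketch of the coloring rule, assuming both thresholds are inclusive and that passing None stands in for a disabled rate. The function and parameter names are hypothetical and are not the UI's actual configuration keys.

```python
from typing import Optional


def link_color(call_count: int, error_count: int,
               warn_rate: Optional[float] = 0.5,
               danger_rate: Optional[float] = 0.75) -> str:
    """Pick a line color for a dependency link from its error rate."""
    if call_count <= 0:
        return "default"  # no calls, so there is no error rate to show
    rate = error_count / call_count
    # Check the higher threshold first so red wins over yellow.
    if danger_rate is not None and rate >= danger_rate:
        return "red"
    if warn_rate is not None and rate >= warn_rate:
        return "yellow"
    return "default"
```

For example, `link_color(100, 50)` returns `"yellow"` and `link_color(100, 75)` returns `"red"`.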

Example link detail screen (screenshot)

Example of when >50% of calls are in error (screenshot)

Example of when >75% of calls are in error (screenshot)

Trace instrumentation's contract is easy: add the "error" tag, for example on HTTP 500. When aggregating links, the value of the "error" tag isn't important. Please update to the latest versions of instrumentation if you don't see errors yet. For example, zipkin-ruby recently added support for this thanks to @jcarres-mdsol.
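
A minimal sketch of that contract, assuming each call between services has already been reduced to a (parent service, child service, tags) tuple. This is not the actual dependency linker: the function name and input shape are hypothetical. The point is that only the presence of the "error" tag counts, not its value.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Call = Tuple[str, str, Dict[str, str]]  # (parent service, child service, tags)


def aggregate_links(calls: Iterable[Call]) -> Dict[Tuple[str, str], Dict[str, int]]:
    """Count calls and errored calls per (parent, child) service pair."""
    links = defaultdict(lambda: {"callCount": 0, "errorCount": 0})
    for parent, child, tags in calls:
        link = links[(parent, child)]
        link["callCount"] += 1
        if "error" in tags:  # the tag's value is ignored on purpose
            link["errorCount"] += 1
    return dict(links)


calls = [
    ("web", "api", {"http.status_code": "500", "error": "500"}),
    ("web", "api", {}),
    ("web", "api", {"error": ""}),
    ("api", "db", {}),
]
links = aggregate_links(calls)
```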

Elasticsearch 6

Currently, Zipkin's Elasticsearch storage uses one index for all types: spans, dependencies (and a special service name index). Elasticsearch 6 no longer supports multiple types per index. Instead, we write separate indexes for spans and dependency links when Elasticsearch 6 is detected. Incidentally, we also use the new span2 json format, which is simplified and more efficient.
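
To make the difference concrete, here is a sketch of choosing an index per document type based on the detected major version. The function, the prefix, and the index name patterns are all assumptions for illustration and may not match the names Zipkin actually writes.

```python
def index_for(doc_type: str, day: str, es_major_version: int,
              prefix: str = "zipkin") -> str:
    """Return a daily index name for a document type (illustrative naming only)."""
    if es_major_version >= 6:
        # One type per index: the type becomes part of the index name.
        return f"{prefix}:{doc_type}-{day}"
    # Older versions: every type shares the same daily index.
    return f"{prefix}-{day}"
```

With Elasticsearch 6, `index_for("span", day, 6)` and `index_for("dependency", day, 6)` give different indexes. With an older version, they give the same one.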

The next version will support the same single-type indexing with Elasticsearch 2.4+. If you can't wait that long, look at #1674 for the experimental flag you can use today.

Thanks to @anuraaga @ImFlog @xeraa and @jcarres-mdsol for advice and support leading to this feature. The next release will thank those who test it!