Refactor metrics processing #34

nadiamoe · 2025-01-14T13:41:24Z

Note

Reviewers: This PR changes the output.go file substantially, making the diff useless. I recommend to look a the new code only, and focus efforts on considering whether the tests are sufficient to prove backwards-compatibility.

This PR overhauls how metrics are processed in the SM output extension, deliberately taking a few tradeoffs against performance to make a code that is going to be chatty and verbose as easy to read as possible.

How the new code works

The new code works like this:

When samples arrive through func (o *Output) AddMetricSamples(containers []metrics.SampleContainer), samples are added to a map in the metricStore. This map indexes metrics by timeseries, with the value being the numeric value of the timeseries. If a sample for a given timeseries arrives more than once, it is immediately aggregated this early, depending on the metric type: Counters are added to its previous value, gauges replace the previous value, and Rates and Trends are averaged. At this point, metrics are aggregated but their labels and names are untouched.

When func (o *Output) Stop() error is called, transformation takes place in three steps.

First, new timeseries are Derived from existing ones. This includes creating new timeseries by changing the metric name (e.g. adding suffixes), or creating new trimeseries with information contained in another. This process can use one trimeseries to create serveral derived ones, but every derived trimeseries must come exactly from one timeseries output by k6. The derivation process walks through the metricStore sequentially and exactly once, so it is no possible to combine multiple k6 metrics, or derive derivates.

Next, some trimeseries are Removed. This is done by checking if the metric name is on a "remove list". This step helps to both remove unused metrics, but also to remove trimeseries that we cloned with a new name in the Derive step, effectively allowing us to rename a trimeseries.

Finally, labels are stripped from trimeseries. Some labels are removed from all timeseries, while others are removed from all except some. This process does not aggregate the resulting timeseries in any way, and it is technically possible to obtain duplicated timeseries as a result. This however should be very unlikely in practice as we remove labels that are known to not contribute to cardinality, just noise.

Each transformation step walks through the map once, with the Derivation process being the most expensive as new entries are added to it.

After the transformation is done, metrics are serialized into a prometheus text format and written to the output file.

Transformations

The following transformations are currently implemented, in the Derive-Remove-RemoveLabels process described above:

Rename http_reqs to http_requests_total
Rename http_req_failed to http_requests_failed_total
Create http_info gauge with info-like labels from http_reqs, with a value of 1.
- tls_version, proto labels are kept in http_info and removed from all other metrics.
Create http_ssl gauge with a value of 1 if http_reqs had a tls_version label, 0 otherwise.
Create http_got_expected_response gauge, which is 1 if http_reqs had the expected_response label set to true, and 0 otherwise.
- expected_response label is removed from all other metrics.
Create http_error_code gauge, whose value equals the value of the error_code label originally present.
- error_code label is removed from all other metrics.
- error label, containing a textual representation from the error, is removed from all metrics.
Create http_status_code gauge, whose value equals the value of the status label originally present.
Create http_version gauge, whose value equals the value of the HTTP protocol version parsed from the original proto label.
data_(sent|received) metrics are renamed to data_(sent|received)_bytes
http_req_duration is renamed and scaled to http_total_duration_seconds
iteration_duration is renamed and scaled to iteration_duration_seconds
http_req_(blocked|connecting|receiving|sending|handshaking|waiting) are converted to http_duration_secondswith aphase` label, with the value of this label being a slightly reworded version of the original (preserving pre-refactor behavior).
vus, vus_max, and iterations metrics are removed
group label is removed from all metrics
checks metric is renamed to check_success_rate
checks metric is emitted as logs
- time="2025-01-14T18:21:59+01:00" level=info msg="check result" check="is 200" count=3 metric=checks_total scenario=default value=0.3
checks is derived into a checks_total counter, with a result="(fail|success)" label is created from.

This were, to my understanding, all transformations done using the previous approach.

Opinions

This new approach is deliberately opinionated in a couple regards:

Everything is aggregated very early and the only value retained for Trends and Rates is the average. This second part is easy to change if the need arises, but so far it doesn't seem to me to be the case, as most of the times we will want to report the average only.
Readable, dumb code is used for transformations. It might be tempted to extract some to a data-like representation (e.g. a map from old name to new name), but I deliberately chose not to.
Explicitness and top-down readability are prioritized over performance (within reason). There is an assumption of the number of timeseries that need to be subject to this process (after aggregation) should be relatively small (<1k), for which dumb map operations are considered enough.

Open ends

What to do with the name label. It was previously not emitted, and now is. This seems to be the right thing to do according to k6 maintainers.
checks_total and the newly added check_success_rate now carry a check label containing the check name. Before this refactor this label was stripped on the argument of being high-cardinality, but given we're allowing URLs as labels I don't see this being a particularly strong argument. As we cannot perform per-label aggregations, we would need to drop this metric entirely if we want to get rid of the check label.

Not seen in practice with iterations=1, but part of the test suite

nadiamoe force-pushed the metric-processing-v2 branch 4 times, most recently from 89a17a4 to 980c3ba Compare January 15, 2025 12:32

nadiamoe mentioned this pull request Jan 15, 2025

chore: integration: properly parse output metrics and assert more things about them #44

Merged

nadiamoe force-pushed the metric-processing-v2 branch 2 times, most recently from cf080eb to eb85f9e Compare January 15, 2025 14:30

nadiamoe changed the base branch from main to integration-prometheus January 15, 2025 14:30

nadiamoe force-pushed the integration-prometheus branch from 7b04031 to 08dac7f Compare January 27, 2025 15:37

nadiamoe mentioned this pull request Feb 5, 2025

fix: In grpc test runs we don't have the expected http metrics #3

Closed

nadiamoe force-pushed the integration-prometheus branch 4 times, most recently from 214a896 to 7364a6b Compare February 7, 2025 17:39

Base automatically changed from integration-prometheus to main February 14, 2025 10:44

nadiamoe force-pushed the metric-processing-v2 branch 3 times, most recently from 46cf5b8 to c242885 Compare February 14, 2025 12:43

nadiamoe added 4 commits February 14, 2025 15:44

benchmark presence checking on both a map and a slice

7b9f783

refactor metric processing

3532fa3

remove iterations metric

8657079

Not seen in practice with iterations=1, but part of the test suite

create http_status_code metric

8a249cb

nadiamoe force-pushed the metric-processing-v2 branch from a9004ff to f446106 Compare February 14, 2025 14:44

nadiamoe added 6 commits February 14, 2025 17:15

keep status label only on http_requests_total metric

276ad0d

remove high-cardinality error label

2bfb7d7

create http_version metric

acf3250

remove most of the output unit tests

a3155d2

generate http_duration_seconds{phase="resolve"} hardcoded to zero

9f62763

use custom sm phase names for http_duration_seconds

ebe3731

nadiamoe force-pushed the metric-processing-v2 branch from f446106 to ebe3731 Compare February 14, 2025 16:15

nadiamoe marked this pull request as ready for review February 14, 2025 16:28

nadiamoe requested a review from a team as a code owner February 14, 2025 16:28

nadiamoe requested review from mem, ka3de and d0ugal and removed request for ka3de February 14, 2025 16:28

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor metrics processing #34

Refactor metrics processing #34

nadiamoe commented Jan 14, 2025 •

edited

Loading

Refactor metrics processing #34

Are you sure you want to change the base?

Refactor metrics processing #34

Conversation

nadiamoe commented Jan 14, 2025 • edited Loading

How the new code works

Transformations

Opinions

Open ends

nadiamoe commented Jan 14, 2025 •

edited

Loading