Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor metrics processing #34

Open
wants to merge 10 commits into
base: main
Choose a base branch
from
Open

Refactor metrics processing #34

wants to merge 10 commits into from

Conversation

nadiamoe
Copy link
Member

@nadiamoe nadiamoe commented Jan 14, 2025

Note

Reviewers: This PR changes the output.go file substantially, making the diff useless. I recommend to look a the new code only, and focus efforts on considering whether the tests are sufficient to prove backwards-compatibility.

This PR overhauls how metrics are processed in the SM output extension, deliberately taking a few tradeoffs against performance to make a code that is going to be chatty and verbose as easy to read as possible.

How the new code works

The new code works like this:

When samples arrive through func (o *Output) AddMetricSamples(containers []metrics.SampleContainer), samples are added to a map in the metricStore. This map indexes metrics by timeseries, with the value being the numeric value of the timeseries. If a sample for a given timeseries arrives more than once, it is immediately aggregated this early, depending on the metric type: Counters are added to its previous value, gauges replace the previous value, and Rates and Trends are averaged. At this point, metrics are aggregated but their labels and names are untouched.

When func (o *Output) Stop() error is called, transformation takes place in three steps.

First, new timeseries are Derived from existing ones. This includes creating new timeseries by changing the metric name (e.g. adding suffixes), or creating new trimeseries with information contained in another. This process can use one trimeseries to create serveral derived ones, but every derived trimeseries must come exactly from one timeseries output by k6. The derivation process walks through the metricStore sequentially and exactly once, so it is no possible to combine multiple k6 metrics, or derive derivates.

Next, some trimeseries are Removed. This is done by checking if the metric name is on a "remove list". This step helps to both remove unused metrics, but also to remove trimeseries that we cloned with a new name in the Derive step, effectively allowing us to rename a trimeseries.

Finally, labels are stripped from trimeseries. Some labels are removed from all timeseries, while others are removed from all except some. This process does not aggregate the resulting timeseries in any way, and it is technically possible to obtain duplicated timeseries as a result. This however should be very unlikely in practice as we remove labels that are known to not contribute to cardinality, just noise.

Each transformation step walks through the map once, with the Derivation process being the most expensive as new entries are added to it.

After the transformation is done, metrics are serialized into a prometheus text format and written to the output file.

Transformations

The following transformations are currently implemented, in the Derive-Remove-RemoveLabels process described above:

  • Rename http_reqs to http_requests_total
  • Rename http_req_failed to http_requests_failed_total
  • Create http_info gauge with info-like labels from http_reqs, with a value of 1.
    • tls_version, proto labels are kept in http_info and removed from all other metrics.
  • Create http_ssl gauge with a value of 1 if http_reqs had a tls_version label, 0 otherwise.
  • Create http_got_expected_response gauge, which is 1 if http_reqs had the expected_response label set to true, and 0 otherwise.
    • expected_response label is removed from all other metrics.
  • Create http_error_code gauge, whose value equals the value of the error_code label originally present.
    • error_code label is removed from all other metrics.
    • error label, containing a textual representation from the error, is removed from all metrics.
  • Create http_status_code gauge, whose value equals the value of the status label originally present.
  • Create http_version gauge, whose value equals the value of the HTTP protocol version parsed from the original proto label.
  • data_(sent|received) metrics are renamed to data_(sent|received)_bytes
  • http_req_duration is renamed and scaled to http_total_duration_seconds
  • iteration_duration is renamed and scaled to iteration_duration_seconds
  • http_req_(blocked|connecting|receiving|sending|handshaking|waiting) are converted to http_duration_secondswith aphase` label, with the value of this label being a slightly reworded version of the original (preserving pre-refactor behavior).
  • vus, vus_max, and iterations metrics are removed
  • group label is removed from all metrics
  • checks metric is renamed to check_success_rate
  • checks metric is emitted as logs
    • time="2025-01-14T18:21:59+01:00" level=info msg="check result" check="is 200" count=3 metric=checks_total scenario=default value=0.3
  • checks is derived into a checks_total counter, with a result="(fail|success)" label is created from.

This were, to my understanding, all transformations done using the previous approach.

Opinions

This new approach is deliberately opinionated in a couple regards:

  • Everything is aggregated very early and the only value retained for Trends and Rates is the average. This second part is easy to change if the need arises, but so far it doesn't seem to me to be the case, as most of the times we will want to report the average only.
  • Readable, dumb code is used for transformations. It might be tempted to extract some to a data-like representation (e.g. a map from old name to new name), but I deliberately chose not to.
  • Explicitness and top-down readability are prioritized over performance (within reason). There is an assumption of the number of timeseries that need to be subject to this process (after aggregation) should be relatively small (<1k), for which dumb map operations are considered enough.

Open ends

  1. What to do with the name label. It was previously not emitted, and now is. This seems to be the right thing to do according to k6 maintainers.
  2. checks_total and the newly added check_success_rate now carry a check label containing the check name. Before this refactor this label was stripped on the argument of being high-cardinality, but given we're allowing URLs as labels I don't see this being a particularly strong argument. As we cannot perform per-label aggregations, we would need to drop this metric entirely if we want to get rid of the check label.

@nadiamoe nadiamoe force-pushed the metric-processing-v2 branch 4 times, most recently from 89a17a4 to 980c3ba Compare January 15, 2025 12:32
@nadiamoe nadiamoe force-pushed the metric-processing-v2 branch 2 times, most recently from cf080eb to eb85f9e Compare January 15, 2025 14:30
@nadiamoe nadiamoe changed the base branch from main to integration-prometheus January 15, 2025 14:30
@nadiamoe nadiamoe force-pushed the integration-prometheus branch from 7b04031 to 08dac7f Compare January 27, 2025 15:37
@nadiamoe nadiamoe force-pushed the integration-prometheus branch 4 times, most recently from 214a896 to 7364a6b Compare February 7, 2025 17:39
Base automatically changed from integration-prometheus to main February 14, 2025 10:44
@nadiamoe nadiamoe force-pushed the metric-processing-v2 branch 3 times, most recently from 46cf5b8 to c242885 Compare February 14, 2025 12:43
@nadiamoe nadiamoe force-pushed the metric-processing-v2 branch from a9004ff to f446106 Compare February 14, 2025 14:44
@nadiamoe nadiamoe force-pushed the metric-processing-v2 branch from f446106 to ebe3731 Compare February 14, 2025 16:15
@nadiamoe nadiamoe marked this pull request as ready for review February 14, 2025 16:28
@nadiamoe nadiamoe requested a review from a team as a code owner February 14, 2025 16:28
@nadiamoe nadiamoe requested review from mem, ka3de and d0ugal and removed request for ka3de February 14, 2025 16:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant