storage: wal fsync latency histogram different between prometheus and db console #115825
Comments
I was poking around the Histogram code today to do some optimizing for allocations, and I noticed some weirdness that led me here. I believe #104088 broke the
When grabbing a window to calculate a quantile against, we are using the cumulative histogram. Shouldn't we just be using the current one?
cockroach/pkg/util/metric/metric.go Line 669 in 50a47aa
As of the linked PR, the |
Yep, I noticed those things too. Putting up a fix soon.
These histograms were incorrectly not setting `mwh.prev`. `ToPrometheusMetric` was also merging `cum` with `prev`. Instead, it should have been using `cur`, since both `Update()` and `Rotate()` already set `cur` to the correct current window. Informs cockroachdb#115825. Release note: None
This bug was introduced in 4a2d06a and then was exemplified by some of the changes in bae5045. There are two main problems (one from each):
In addition, both changes were missing tests for the windowing logic for manual histograms. I made a fix that stays within the confines of the current interface. I still strongly believe we should consider taking on #116584 to make this interface easier to use, so that such bugs are difficult to introduce. Below is a test I ran using the same workload @kvoli mentioned above. I will put up a PR with these fixes soon.
These histograms were missing the windowing logic we use in histograms to store them in our internal time series DB. The logic we follow there is that we keep 3 histograms. The first one is a cumulative one that we export to prometheus through the `status/vars` page. The second and third are `prev` and `cur`, which together cover the entire duration of the histogram, 60s by default. When the ticker rotates these, we swap out the prev with the current, and start a brand new current histogram. When calculating derivative statistics, we use the combined view of the two windows, which is what DB console displays. This is similar to the way Grafana handles these using its rate interval. Fixes cockroachdb#115825. Release note: None
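The three-histogram scheme described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the implementation in `pkg/util/metric`: the type `windowedHistogram`, its methods, and the helper `sum` are hypothetical names, locking is omitted, and the ticker is assumed to call `Rotate` once per half of the window duration.

```go
package main

import "sort"

// windowedHistogram is a hypothetical, simplified stand-in for the windowed
// histograms described above. It is not safe for concurrent use.
type windowedHistogram struct {
	bounds []float64 // ascending bucket upper bounds
	cum    []uint64  // cumulative counts, the view exported to Prometheus
	prev   []uint64  // previous window
	cur    []uint64  // current window
}

func newWindowedHistogram(bounds []float64) *windowedHistogram {
	n := len(bounds) + 1 // the extra slot is the overflow (+Inf) bucket
	return &windowedHistogram{
		bounds: bounds,
		cum:    make([]uint64, n),
		prev:   make([]uint64, n),
		cur:    make([]uint64, n),
	}
}

// Observe records v in both the cumulative histogram and the current window.
func (h *windowedHistogram) Observe(v float64) {
	i := sort.SearchFloat64s(h.bounds, v) // first bucket whose bound is >= v
	h.cum[i]++
	h.cur[i]++
}

// Rotate is what the ticker would call: cur becomes prev, and a brand new,
// empty cur is started. The cumulative histogram is left untouched.
func (h *windowedHistogram) Rotate() {
	h.prev = h.cur
	h.cur = make([]uint64, len(h.cum))
}

// Windowed returns the combined prev+cur counts, the view that derived
// statistics (such as quantiles shown in DB Console) would be computed from.
func (h *windowedHistogram) Windowed() []uint64 {
	out := make([]uint64, len(h.cur))
	for i := range out {
		out[i] = h.prev[i] + h.cur[i]
	}
	return out
}

// sum adds up all bucket counts.
func sum(counts []uint64) uint64 {
	var t uint64
	for _, c := range counts {
		t += c
	}
	return t
}
```

In this sketch an observation stays visible in `Windowed()` for between one and two rotations and then drops out, while `cum` keeps it forever. Forgetting to set `prev` in `Rotate`, or reading `cum` where `cur` was intended, would produce the kind of mismatch this issue reports.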
116895: metric: fix windowing for manual histograms r=abarganier,itsbilal a=aadityasondhi
These histograms were missing the windowing logic we use in histograms to store them in our internal time series DB. The logic we follow there is that we keep 3 histograms. The first one is a cumulative one that we export to prometheus through the `status/vars` page. The second and third are `prev` and `cur` that together cover the entire duration of the histogram, default being 60s. When the ticker rotates these, we swap out the prev with the current, and start a brand new current histogram. When calculating derivative statistics, we use the combined view of the windows which is used to display these metrics in DB console. This is similar to the way Grafana handles these using its rate interval.
Fixes #115825.
Release note: None
117097: span: Add strict mode r=miretskiy a=miretskiy
Two commits to help debug #116661
1. Log "slow span" messages both from the coordinator and from the aggregators.
2. span: Add strict mode. Add strict mode for span frontier implementations that requires the forwarded span to be a sub-span of the spans tracked by the frontier. The functionality is opt-in, and is enabled via `COCKROACH_SPAN_FRONTIER_STRICT_MODE_ENABLED` env var. Enroll cdc roachtests to use strict mode.
Informs #116661
Release Note: None
Co-authored-by: Aaditya Sondhi <[email protected]>
Co-authored-by: Yevgeniy Miretskiy <[email protected]>
Describe the problem
The `storage.wal.fsync.latency` metric is different when viewed in Grafana (querying Prometheus) vs. DB Console. The DB Console chart appears to be incorrect, as the values are overly smoothed.
To Reproduce
Expected behavior
DB Console and prometheus are comparable.
Additional data / screenshots
Environment:
Additional context
The fsync metric in DB Console is not usable. This is unfortunate, as it is a relied-upon signal for disk saturation.
Jira issue: CRDB-34231