Real time NDT/discard coincidence SLI #627
Comments
@nkinkade this may be possible now at 1min resolution due to your work with discov2. As a first try (may be improved), a recording rule like:
This could then be sum_over_time()'d over 1hr or 24hr to count all tests that were coincident with discards, and divided by the test count over the same period to get a percentage. This would give an even more conservative estimate than a 10sec measure, but it would be available far faster.
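To make the arithmetic concrete, here is a minimal Python sketch of the percentage described above. It assumes two aligned series of 1min samples (tests per minute and discards per minute); the function and argument names are hypothetical and do not correspond to any existing metric or recording rule.

```python
from typing import Sequence


def coincidence_percentage(
    tests_per_min: Sequence[int],
    discards_per_min: Sequence[int],
) -> float:
    """Percentage of tests that fell in a minute that also saw discards.

    Both inputs are hypothetical 1min samples covering the same window
    (e.g. 60 samples for 1hr, 1440 for 24hr).
    """
    if len(tests_per_min) != len(discards_per_min):
        raise ValueError("series must cover the same minutes")

    total_tests = sum(tests_per_min)
    if total_tests == 0:
        # No tests in the window, so nothing could be coincident.
        return 0.0

    # A test counts as coincident if its minute had any discards at all.
    coincident_tests = sum(
        tests
        for tests, discards in zip(tests_per_min, discards_per_min)
        if discards > 0
    )
    return 100.0 * coincident_tests / total_tests
```

Because a whole minute is flagged whenever any discard occurs in it, every test in that minute is counted, which is why this measure would overcount relative to a finer-grained one.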
A possibly more formalized version of that query?
It may be obvious, but to be 100% sure I understand, the notion of this query is to just give us a general feel for where things stand within a margin of error of ~2m? Nothing concrete, no data annotations, no alert, but just a panel on a dashboard to monitor? You mention that this may now be possible due to the work on DISCOv2, but wasn't this also possible with snmp_exporter? snmp_exporter gave us 1m counts for all the same metrics.
Depending on how well these metrics track with the BQ metrics, they could replace them, or give us earlier warnings when things are really bad. So, I imagine it's possible that it could replace the BQ metrics. But, we'd need to compare them before coming to that conclusion. If it's possible, it would give faster notice and be a much simpler configuration to maintain. I think you're right about snmp exporter...
Today the ndt/discard SLI is a function of e2e data in BQ. This takes about 36hrs to be available for the prior day.
We have near real time telemetry from the switch and the ndt-server. In principle we could find a way to count ndt tests that occur during the same intervals that discards occur.
It may be further possible to only count NDT tests under stricter criteria, e.g. `discards > N pps` or `ndt_s2c_bandwidth > N mbps`. This would require experimentation and possibly additional insight from tcpinfo instruments like % of packets retransmitted.
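As a rough illustration of the stricter criteria, the sketch below counts tests only in minutes where a threshold is exceeded. The `MinuteSample` type, its fields, and the thresholds are all hypothetical, and treating the two conditions as a logical "or" is just one reading of the idea; real values for N would have to come from experimentation.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class MinuteSample:
    """Hypothetical per-minute telemetry for one site."""
    tests: int          # NDT tests observed in this minute
    discard_pps: float  # switch discards, packets per second
    s2c_mbps: float     # aggregate NDT download rate, Mbit/s


def strict_criterion(
    min_discard_pps: float, min_s2c_mbps: float
) -> Callable[[MinuteSample], bool]:
    """Build a predicate: discards > N pps or s2c bandwidth > N mbps."""
    def check(sample: MinuteSample) -> bool:
        return (
            sample.discard_pps > min_discard_pps
            or sample.s2c_mbps > min_s2c_mbps
        )
    return check


def count_tests(
    samples: Iterable[MinuteSample],
    criterion: Callable[[MinuteSample], bool],
) -> int:
    """Sum the tests in every minute that satisfies the criterion."""
    return sum(s.tests for s in samples if criterion(s))
```

Passing the criterion in as a function keeps the counting logic fixed while different thresholds, or an additional signal such as a retransmission percentage, are tried out.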