Real time NDT/discard coincidence SLI #627

stephen-soltesz · 2020-01-30T20:48:28Z

Today the NDT/discard SLI is computed from e2e data in BQ, which takes about 36 hours to become available for the prior day.

We have near real time telemetry from the switch and the ndt-server. In principle we could find a way to count ndt tests that occur during the same intervals that discards occur.

It may be further possible to only count NDT tests under stricter criteria, e.g. discards > N pps or ndt_s2c_bandwidth > N mbps. This would require experimentation and possibly additional insight from tcpinfo instruments like % of packets retransmitted.

stephen-soltesz · 2020-12-09T00:58:16Z

@nkinkade this may be possible now at 1min resolution due to your work with discov2.

As a first try (may be improved), a recording rule like:

increase(ndt7_client_test_results_total[2m]) > 0
   and on(machine) (
      increase(ifOutDiscards{ifAlias="uplink"}[2m]) > 0)

The result could then be sum_over_time()'d over 1hr or 24hr to count all tests that were coincident with discards, and divided by the test count for the same period to get a percentage.

This would give an even more conservative estimate than a 10sec measure, but it would be available far sooner.
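
A rough sketch of how the percentage described above might be computed in a single query, without a separate recording rule. It assumes a Prometheus version that supports subqueries, and it uses a 2m subquery step so that the 2m windows do not overlap (overlapping windows would count the same test more than once). The 24h range and the `sum by (machine)` aggregation are illustrative choices, not something settled in this thread.

```promql
# Percentage of NDT tests per machine that fell in a 2m window
# that also saw uplink discards, over the last 24h.
# The [24h:2m] subquery evaluates the inner expression every 2m,
# so consecutive 2m windows are non-overlapping.
100 *
  sum by (machine) (
    sum_over_time(
      (
        increase(ndt7_client_test_results_total[2m]) > 0
          and on(machine)
        (increase(ifOutDiscards{ifAlias="uplink"}[2m]) > 0)
      )[24h:2m]
    )
  )
/
  sum by (machine) (
    increase(ndt7_client_test_results_total[24h])
  )
```

Two caveats: increase() extrapolates, so the counts are approximate rather than exact integers, and a machine with no coincident window returns no series at all instead of 0.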

nkinkade · 2020-12-11T19:54:17Z

A possibly more formalized version of that query?

increase(ndt7_client_test_results_total{result="okay-with-rate", direction="download"}[2m]) > 0 and 
  on(machine) max by (site) (increase(ifOutDiscards{ifAlias="uplink"}[2m]) > 0)

It may be obvious, but to be 100% sure I understand, the notion of this query is to just give us a general feel for where things stand within a margin of error of ~2m? Nothing concrete, no data annotations, no alert, but just a panel on a dashboard to monitor?

You mention that this may now be possible due to the work on DISCOv2, but wasn't this also possible with snmp_exporter? snmp_exporter gave us 1m counts for all the same metrics.

stephen-soltesz · 2020-12-11T20:03:50Z

Depending on how well these metrics track the BQ metrics, they could replace them, or at least give us earlier warnings when things are really bad. We'd need to compare the two before coming to that conclusion. If replacement is possible, it would give faster notice and be a much simpler configuration to maintain.

I think you're right about snmp exporter... max by(site) .. will lose the machine label. So, having per-machine metrics is a discov2 thing.
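
To illustrate the label issue: after `max by (site)`, the right-hand side carries only the `site` label, so `on(machine)` has nothing to match against. A site-level variant would match on `site` instead. This is only a sketch, and it assumes the ndt7 series also carry a `site` label, which is not confirmed in this thread.

```promql
# Site-level coincidence: the right-hand side has only the `site`
# label after aggregation, so the match must be on(site).
# Assumption: ndt7_client_test_results_total has a `site` label.
increase(ndt7_client_test_results_total{result="okay-with-rate", direction="download"}[2m]) > 0
  and on(site)
max by (site) (increase(ifOutDiscards{ifAlias="uplink"}[2m]) > 0)
```

This counts a test as coincident if any uplink at the same site saw discards in the window, which is coarser than the per-machine match.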

autolabel bot added the review/triage label Jan 30, 2020

stephen-soltesz added 0% todo and removed review/triage labels Jan 30, 2020
