Time-series architecture epic #291

Open
jimaek opened this issue Feb 20, 2023 · 0 comments

To support use-cases like continuously plotting performance data for an endpoint we need to design a time-series system.

This means that each probe must be able to run scheduled tests every minute or even every 30 seconds. The results will have to be processed and stored in a time-series DB. The API will then need new endpoints to read this data and output results ready to be charted on a frontend.

It sounds like the easiest solution would be to run a cron job at the API level and then send commands to all probes to run the scheduled tests. Summary:

  • We need a way for the admin to register the continuous tests. There could be many, but no more than 200 while it's an admin-only feature.
  • The tests could be HTTP, DNS, or PING commands with different parameters targeting different endpoints.
  • This means that a single probe could receive up to 200 different tests to run every 30 seconds, all without impacting the data or the quality of service. That sounds problematic for smaller probes if we consider that some endpoints could take 10+ seconds to respond, unless we build some kind of queue and de-duplication system. Need to discuss this.
  • The results will be returned to the API as normal, but in this case, instead of outputting them to a user, the API needs to process them and store them in a persistent DB.
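To make the probe-load concern concrete, here is a minimal sketch of de-duplicating a probe's test list and checking whether it can finish inside one cycle. `ScheduledTest`, `dedupe`, `fits_in_cycle`, and the concurrency numbers are all hypothetical names and values chosen for illustration, not part of the existing codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScheduledTest:
    """Hypothetical description of one registered continuous test."""
    kind: str           # "http", "dns" or "ping"
    target: str
    params: tuple = ()  # sorted (key, value) pairs, kept hashable


def dedupe(tests):
    """Collapse identical tests so each unique one runs once per cycle."""
    return list(dict.fromkeys(tests))  # dicts preserve insertion order


def fits_in_cycle(test_count, timeout_s, concurrency, cycle_s=30):
    """Worst case: every test takes the full timeout, run in parallel batches."""
    batches = -(-test_count // concurrency)  # ceiling division
    return batches * timeout_s <= cycle_s


tests = [
    ScheduledTest("ping", "example.com"),
    ScheduledTest("ping", "example.com"),  # duplicate, dropped by dedupe
    ScheduledTest("http", "example.com", (("path", "/"),)),
]
unique = dedupe(tests)
print(len(unique))
# 200 tests, 10 s worst-case response, 50 vs 100 concurrent tests per probe
print(fits_in_cycle(200, timeout_s=10, concurrency=50))
print(fits_in_cycle(200, timeout_s=10, concurrency=100))
```

Under these assumed numbers a probe limited to 50 concurrent tests would overrun a 30-second cycle in the worst case, which is the situation a queue would have to handle.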

Next we need to select the best possible DB: one that has good performance and is easy to use and operate. I am considering https://questdb.io/ or ClickHouse, but further research and benchmarking is needed.

The DB needs to be able to store data from (number of registered tests)*(number of probes) per (cycle). So if we have 1000 probes and 200 registered tests that run every 30 seconds, the API would have to accept and then store 400k data points per minute.
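The arithmetic behind that figure, as a small sketch. The function name is made up for illustration, and the per-day and two-year totals simply extend the same numbers, assuming every probe runs every test in every cycle.

```python
def data_points_per_minute(probes, tests, cycle_seconds):
    """(number of registered tests) * (number of probes) per cycle, scaled to a minute."""
    return probes * tests * 60 // cycle_seconds


per_minute = data_points_per_minute(probes=1000, tests=200, cycle_seconds=30)
per_day = per_minute * 60 * 24
per_two_years = per_day * 365 * 2

print(per_minute)     # 400000
print(per_day)        # 576000000
print(per_two_years)  # 420480000000
```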
The DB will then downsample the raw data into aggregated values using functions like average and median, but the raw data will remain and still be used. We also need to decide on the TTL of raw data and downsampled data. I would say there is no reason to store more than 2 years' worth of data of any kind.
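As an illustration of what the downsampling step computes, here is a minimal in-memory sketch that groups raw points into fixed time buckets and takes the average and median of each. In practice the chosen DB would most likely do this itself; the `downsample` function and the `(timestamp, value)` tuple layout are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean, median


def downsample(points, bucket_seconds):
    """Group (unix_ts, value) points into fixed buckets and aggregate each one."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % bucket_seconds].append(value)
    return {
        start: {"avg": mean(values), "median": median(values), "count": len(values)}
        for start, values in sorted(buckets.items())
    }


raw = [(0, 10.0), (30, 20.0), (60, 30.0), (90, 50.0), (100, 100.0)]
for start, agg in downsample(raw, bucket_seconds=60).items():
    print(start, agg)
```

The raw list is left untouched, matching the requirement that the raw data stays available next to the aggregates.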

Before storing the data in the DB we need to consider:

  • Filter out trash data, e.g. errors or impossible values like a query that took 0.0ms to run
  • Consider deduplication or some kind of pre-processing of data from the same ASN+City combo
  • Check the ASN of the target IP and exclude probes on the same ASN
  • Add all related metadata: location, resolver, all perf data...
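The checks above could be sketched roughly as follows. The `Result` fields are hypothetical, and collapsing each ASN+City group to its median is only one possible pre-processing choice, picked here for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass
class Result:
    """Hypothetical shape of one probe result; field names are assumptions."""
    probe_asn: int
    probe_city: str
    target_asn: int
    latency_ms: Optional[float]  # None when the test errored


def is_storable(r):
    """Drop errors, impossible timings, and probes on the target's own ASN."""
    if r.latency_ms is None or r.latency_ms <= 0:
        return False
    return r.probe_asn != r.target_asn


def preprocess(results):
    """Collapse the remaining results to one median value per (ASN, city)."""
    groups = defaultdict(list)
    for r in filter(is_storable, results):
        groups[(r.probe_asn, r.probe_city)].append(r.latency_ms)
    return {key: median(values) for key, values in groups.items()}


results = [
    Result(1, "Paris", 99, 10.0),
    Result(1, "Paris", 99, 20.0),
    Result(1, "Paris", 99, 0.0),    # impossible value, filtered out
    Result(99, "Berlin", 99, 5.0),  # same ASN as the target, excluded
    Result(2, "Berlin", 99, None),  # error, filtered out
]
print(preprocess(results))
```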

Next we need new API endpoints to read the results for a set date range, e.g. a month, a quarter, a year. This affects the DB's schema as well.
To consider:

  • A weights system. The idea is to view perf data per endpoint per location. But if you want to see the worldwide performance you will run into an issue where it is heavily skewed by the number of tests per location, e.g. if 90% of tests come from California, then to be the best in the world you would only need to be the best in California. That's why we need a system that would be fair.
  • We need to support different aggregations, e.g. average, median, 90th percentile, 95th percentile...
  • But even while providing aggregated data we still need a way to query and show the raw data that was used to make the aggregation
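
One possible way to make the worldwide figure fair is to aggregate each location first and then give every location equal weight. The sketch below shows that idea next to a nearest-rank percentile. Equal weighting is an assumption made for illustration (weights could also follow population or traffic), and all function names are hypothetical.

```python
import math
from statistics import mean, median


def percentile(values, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def worldwide(latencies_by_location, aggregate=median):
    """Aggregate per location first, then weight every location equally."""
    per_location = [aggregate(v) for v in latencies_by_location.values()]
    return mean(per_location)


# 90% of the samples come from one location, as in the California example.
samples = {"California": [10.0] * 9, "Frankfurt": [100.0]}
all_values = [v for values in samples.values() for v in values]

print(mean(all_values))    # naive global average, dominated by California
print(worldwide(samples))  # equal weight per location
print(percentile(list(range(1, 11)), 90))
print(percentile(list(range(1, 11)), 95))
```

Because `samples` still holds the raw values per location, the data behind each aggregate can be returned alongside it.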