Showing metrics deviation(p99, errors) while listing services/endpoints #1102
Comments
I think the requirement here is a column showing the change in latency (or errors) relative to the prior hour when the time-range dropdown is set to 1 hour. Currently, most attributes are calculated at ingestion time. Doing this at ingestion time would be complex: we would need data from the prior hour (a predefined window), and our view-gen is currently stateless. It would also be limited to a set of predefined comparison windows (say 15 or 30 minutes). So this seems better suited to query time. I think we would need to fire two queries, one per time window (the current hour and the prior hour), and calculate the value of the attribute from the two results. Where should we do this: in the query service or the gateway service? Do we also have to support order by on such a column? @jayesh, can you think of any other way to capture this requirement in the UI? @aaron-steinfeld, do you have any thoughts on this?
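To make the two-query idea concrete, here is a minimal sketch of computing the deviation at query time. It is only an illustration under stated assumptions: `fetch_metric` is a hypothetical callable standing in for whatever the query or gateway service would actually call, and the percent-change formula is one possible definition of "deviation", not something decided in this thread.

```python
from datetime import datetime, timedelta
from typing import Callable, Optional, Tuple

# Hypothetical signature: (service name, window start, window end) -> metric value.
FetchMetric = Callable[[str, datetime, datetime], float]


def prior_window(start: datetime, end: datetime) -> Tuple[datetime, datetime]:
    """Return the window of equal length that ends where the current one starts."""
    length = end - start
    return start - length, start


def percent_change(current: float, previous: float) -> Optional[float]:
    """Percent change from previous to current; None when there is no baseline."""
    if previous == 0:
        return None
    return (current - previous) / previous * 100.0


def metric_deviation(
    fetch_metric: FetchMetric, service: str, start: datetime, end: datetime
) -> Optional[float]:
    """Fire one query per window and compare the two results."""
    prev_start, prev_end = prior_window(start, end)
    current = fetch_metric(service, start, end)
    previous = fetch_metric(service, prev_start, prev_end)
    return percent_change(current, previous)
```

Because the comparison window is derived from the selected range, this would work for any dropdown value rather than a fixed set of windows, at the cost of a second query per listing request.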
Metrics are calculated at read time at a service (or any aggregate) level. Only individual span values are calculated at ingestion time. The tricky bit is basically what you said, that any delta (and I think there might be some work going on for deltas elsewhere, @jake-bassett - are you aware of any?), is defined by two time ranges, the current and the comparison. Sometimes the previous window makes sense, but that's really use case driven. For example, if I'm looking at the past hour and this issue has been happening for 2 hours, the prior hour is far less useful to me than the same hour yesterday. So new controls would likely be needed, which introduces more complexity - one of the reasons we've abandoned efforts like this in the past. As far as order by support - if we compute the delta client side, like I was assuming, we wouldn't have support for order by (we could probably hack it in for the current page of data, but I'd argue against the inconsistency). If we compute the delta server side, that's a more significant change, and I guess the answer there would be - depends on how we introduce that support.
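As a rough illustration of the two points above (a selectable comparison range, and why order by needs the full result set), here is a small sketch. All names are hypothetical and it assumes the per-service metric values for both ranges have already been fetched; it is not how the services currently work.

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import Dict, List, Tuple


class Comparison(Enum):
    """Hypothetical comparison modes a new UI control might expose."""
    PREVIOUS_PERIOD = "previous_period"
    SAME_PERIOD_YESTERDAY = "same_period_yesterday"


def comparison_window(
    start: datetime, end: datetime, mode: Comparison
) -> Tuple[datetime, datetime]:
    """Derive the comparison range from the current range and the chosen mode."""
    if mode is Comparison.PREVIOUS_PERIOD:
        return start - (end - start), start
    return start - timedelta(days=1), end - timedelta(days=1)


def page_by_delta(
    current: Dict[str, float],
    baseline: Dict[str, float],
    limit: int,
    offset: int = 0,
) -> List[Tuple[str, float]]:
    """Sort every service by delta first, then slice out the requested page."""
    deltas = [
        (name, value - baseline[name])
        for name, value in current.items()
        if name in baseline  # skip services with no data in the comparison range
    ]
    deltas.sort(key=lambda item: item[1], reverse=True)
    return deltas[offset:offset + limit]
```

Sorting before paginating is the part a client-side delta cannot do consistently: the client only sees one page, so it could reorder that page but not tell which services belong on it.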
Use Case
In the microservice world, when a customer reports an issue related to errors, degradation, or latency, we start debugging by asking the below questions
We can identify the deviation in errors or latency by going to the respective service/endpoint overview dashboard and checking the patterns in the error or latency graph. This workflow does not scale to a large number of services and dependencies.
Proposal
Add metric deviation (p99, errors) when listing services and endpoints.