-
Notifications
You must be signed in to change notification settings - Fork 125
feat(l1): add rpc error rates to metrics and panels #5335
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds success/error rate monitoring for RPC and Engine API calls by extracting RPC-specific metrics into a dedicated module and instrumenting request outcomes. The changes enable visibility into request success rates alongside existing latency metrics.
- Extracts RPC metrics logic from the block processing profiling module into a dedicated
rpc.rsmodule - Adds new
rpc_requests_totalcounter metric to track success/error outcomes per namespace and method - Adds Grafana dashboard panels displaying success/error rates for Engine and RPC APIs
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| metrics/provisioning/grafana/dashboards/common_dashboards/ethrex_l1_perf.json | Adds two new success/error rate panels and updates metric names from function_duration_seconds to rpc_request_duration_seconds |
| crates/networking/rpc/rpc.rs | Instruments RPC handler to record request outcomes (success/error) after request completion |
| crates/blockchain/metrics/rpc.rs | New module defining RPC-specific metrics including outcome counter and duration histogram |
| crates/blockchain/metrics/profiling.rs | Removes RPC-related utilities that were moved to the new rpc module |
| crates/blockchain/metrics/mod.rs | Exports new rpc module and adds gather_default_metrics helper function |
| crates/blockchain/metrics/api.rs | Updates to use gather_default_metrics instead of gather_profiling_metrics |
| cmd/ethrex/initializers.rs | Adds initialization call for RPC metrics alongside existing profiling metrics |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…5395) **Motivation** Previous success/error rate has too few information to tell what was failing. **Description** This PR add `error_kind` to the RPC/Engine outcome to be able to tell what failed. We moved both error rates to it's'own row making every other row clearer and added a new panel deagregated by method and error kind. <img width="2543" height="1145" alt="image" src="https://github.com/user-attachments/assets/8ab819fc-fc75-4c96-8b07-345e7afdad6a" /> Will close #5379 once in main
**Motivation** Add success/error rate panels for engine and rpc api **Description** This PR: - Extracted the RPC metric logic in its own module to avoid overload the previous block_processing only profiling module - Add the new error/success rates metrics and instrumentation along side the previous rpc instrumentation - Move shared logic for registering default metrics outside of profiling - Adds a whole new row for tracking both RPC and Engine errorr rates and deagregation by method and kind of error - Additional Engine pie chart to show proportion of calls by method - Dahsboard docs have been updated, see the changes [here](https://github.com/lambdaclass/ethrex/blob/rpc_error_rates/docs/developers/l1/dashboards.md#engine-api) and [here](https://github.com/lambdaclass/ethrex/blob/rpc_error_rates/docs/developers/l1/dashboards.md#engine-and-rpc-error-rates) <img width="2543" height="1145" alt="image" src="https://github.com/user-attachments/assets/19eb2383-7dd3-41b1-ad8b-d1580a98ebb6" /> **Next Steps** We still have some improvements that could be made to the block processing profiling: - [ ] As tracked by #5327 we want to do some work apart from this refactor - [x] A new issue has been created to rename the rest of the modules in metrics and remove the additional `metrics_` prefix #5378. - [x] Apart from this we may want to have more information related to errors to be able to deagregate them as tracked by #5379 **NOTE** Once this is merged and published in our shared grafana we'll need to update the servers with main for being able to see the rpc/engine latency panels given that the metric names changed in this PR due to the extraction from the previous block profiling module. Closes #5379 --------- Co-authored-by: Copilot <[email protected]>
…5408) **Motivation** When we added RPC metrics for success/error rates in #5335, we found that we were returning "Internal error" in the `engine_getPayloadBodiesByRange` method. This was related to changes done in #2430, which turned missing payloads into a DB error. The spec says we should return `null`s for any missing payloads, so this deviates from the spec. **Description** This PR fixes the spec deviation by changing the batch-GET methods from the Store API to return `None` on missing values instead of failing. Closes #5403 --------- Co-authored-by: Rodrigo Oliveri <[email protected]>
…5408) **Motivation** When we added RPC metrics for success/error rates in #5335, we found that we were returning "Internal error" in the `engine_getPayloadBodiesByRange` method. This was related to changes done in #2430, which turned missing payloads into a DB error. The spec says we should return `null`s for any missing payloads, so this deviates from the spec. **Description** This PR fixes the spec deviation by changing the batch-GET methods from the Store API to return `None` on missing values instead of failing. Closes #5403 --------- Co-authored-by: Rodrigo Oliveri <[email protected]>
Motivation
Add success/error rate panels for engine and rpc api
Description
This PR:
Next Steps
We still have some improvements that could be made to the block processing profiling:
FunctionProfilingLayer#5327 we want to do some work apart from this refactormetrics_prefix Metrics modules renaming #5378.NOTE
Once this is merged and published in our shared grafana we'll need to update the servers with main for being able to see the rpc/engine latency panels given that the metric names changed in this PR due to the extraction from the previous block profiling module.
Closes #5379