Skip to content

Conversation

@rodrigo-o
Copy link
Collaborator

@rodrigo-o rodrigo-o commented Nov 13, 2025

Motivation

Add success/error rate panels for engine and rpc api

Description

This PR:

  • Extracted the RPC metric logic in its own module to avoid overload the previous block_processing only profiling module
  • Add the new error/success rates metrics and instrumentation along side the previous rpc instrumentation
  • Move shared logic for registering default metrics outside of profiling
  • Adds a whole new row for tracking both RPC and Engine errorr rates and deagregation by method and kind of error
  • Additional Engine pie chart to show proportion of calls by method
  • Dahsboard docs have been updated, see the changes here and here
image

Next Steps

We still have some improvements that could be made to the block processing profiling:

NOTE
Once this is merged and published in our shared grafana we'll need to update the servers with main for being able to see the rpc/engine latency panels given that the metric names changed in this PR due to the extraction from the previous block profiling module.

Closes #5379

rodrigo-o and others added 29 commits November 7, 2025 19:01
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]>
@rodrigo-o rodrigo-o marked this pull request as ready for review November 17, 2025 23:09
@rodrigo-o rodrigo-o requested a review from a team as a code owner November 17, 2025 23:09
Copilot AI review requested due to automatic review settings November 17, 2025 23:09
@ethrex-project-sync ethrex-project-sync bot moved this to In Review in ethrex_l1 Nov 17, 2025
Copilot finished reviewing on behalf of rodrigo-o November 17, 2025 23:11
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds success/error rate monitoring for RPC and Engine API calls by extracting RPC-specific metrics into a dedicated module and instrumenting request outcomes. The changes enable visibility into request success rates alongside existing latency metrics.

  • Extracts RPC metrics logic from the block processing profiling module into a dedicated rpc.rs module
  • Adds new rpc_requests_total counter metric to track success/error outcomes per namespace and method
  • Adds Grafana dashboard panels displaying success/error rates for Engine and RPC APIs

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
metrics/provisioning/grafana/dashboards/common_dashboards/ethrex_l1_perf.json Adds two new success/error rate panels and updates metric names from function_duration_seconds to rpc_request_duration_seconds
crates/networking/rpc/rpc.rs Instruments RPC handler to record request outcomes (success/error) after request completion
crates/blockchain/metrics/rpc.rs New module defining RPC-specific metrics including outcome counter and duration histogram
crates/blockchain/metrics/profiling.rs Removes RPC-related utilities that were moved to the new rpc module
crates/blockchain/metrics/mod.rs Exports new rpc module and adds gather_default_metrics helper function
crates/blockchain/metrics/api.rs Updates to use gather_default_metrics instead of gather_profiling_metrics
cmd/ethrex/initializers.rs Adds initialization call for RPC metrics alongside existing profiling metrics

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

…5395)

**Motivation**

Previous success/error rate has too few information to tell what was
failing.

**Description**

This PR add `error_kind` to the RPC/Engine outcome to be able to tell
what failed. We moved both error rates to it's'own row making every
other row clearer and added a new panel deagregated by method and error
kind.

<img width="2543" height="1145" alt="image"
src="https://github.com/user-attachments/assets/8ab819fc-fc75-4c96-8b07-345e7afdad6a"
/>

Will close #5379 once in main
@rodrigo-o rodrigo-o enabled auto-merge November 24, 2025 14:00
@rodrigo-o rodrigo-o added this pull request to the merge queue Nov 24, 2025
Merged via the queue into main with commit fa3a298 Nov 24, 2025
43 checks passed
@rodrigo-o rodrigo-o deleted the rpc_error_rates branch November 24, 2025 14:56
@github-project-automation github-project-automation bot moved this from In Review to Done in ethrex_l1 Nov 24, 2025
xqft pushed a commit that referenced this pull request Nov 24, 2025
**Motivation**

Add success/error rate panels for engine and rpc api

**Description**

This PR:
- Extracted the RPC metric logic in its own module to avoid overload the
previous block_processing only profiling module
- Add the new error/success rates metrics and instrumentation along side
the previous rpc instrumentation
- Move shared logic for registering default metrics outside of profiling
- Adds a whole new row for tracking both RPC and Engine errorr rates and
deagregation by method and kind of error
- Additional Engine pie chart to show proportion of calls by method
- Dahsboard docs have been updated, see the changes
[here](https://github.com/lambdaclass/ethrex/blob/rpc_error_rates/docs/developers/l1/dashboards.md#engine-api)
and
[here](https://github.com/lambdaclass/ethrex/blob/rpc_error_rates/docs/developers/l1/dashboards.md#engine-and-rpc-error-rates)

<img width="2543" height="1145" alt="image"
src="https://github.com/user-attachments/assets/19eb2383-7dd3-41b1-ad8b-d1580a98ebb6"
/>

**Next Steps**

We still have some improvements that could be made to the block
processing profiling:
- [ ] As tracked by #5327 we want to do some work apart from this
refactor
- [x] A new issue has been created to rename the rest of the modules in
metrics and remove the additional `metrics_` prefix #5378.
- [x] Apart from this we may want to have more information related to
errors to be able to deagregate them as tracked by #5379

**NOTE**
Once this is merged and published in our shared grafana we'll need to
update the servers with main for being able to see the rpc/engine
latency panels given that the metric names changed in this PR due to the
extraction from the previous block profiling module.

Closes #5379

---------

Co-authored-by: Copilot <[email protected]>
github-merge-queue bot pushed a commit that referenced this pull request Nov 25, 2025
…5408)

**Motivation**

When we added RPC metrics for success/error rates in #5335, we found
that we were returning "Internal error" in the
`engine_getPayloadBodiesByRange` method. This was related to changes
done in #2430, which turned missing payloads into a DB error. The spec
says we should return `null`s for any missing payloads, so this deviates
from the spec.

**Description**

This PR fixes the spec deviation by changing the batch-GET methods from
the Store API to return `None` on missing values instead of failing.

Closes #5403

---------

Co-authored-by: Rodrigo Oliveri <[email protected]>
LeanSerra pushed a commit that referenced this pull request Nov 26, 2025
…5408)

**Motivation**

When we added RPC metrics for success/error rates in #5335, we found
that we were returning "Internal error" in the
`engine_getPayloadBodiesByRange` method. This was related to changes
done in #2430, which turned missing payloads into a DB error. The spec
says we should return `null`s for any missing payloads, so this deviates
from the spec.

**Description**

This PR fixes the spec deviation by changing the batch-GET methods from
the Store API to return `None` on missing values instead of failing.

Closes #5403

---------

Co-authored-by: Rodrigo Oliveri <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

L1 Ethereum client

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

Deagregate RPC/Engine error rate to actual methods

4 participants