
Commit fa3a298

Authored by rodrigo-o and Copilot
feat(l1): add rpc error rates to metrics and panels (#5335)
**Motivation**

Add success/error rate panels for the Engine and RPC APIs.

**Description**

This PR:
- Extracts the RPC metric logic into its own module, to avoid overloading the previous profiling module that only covered block processing.
- Adds the new error/success rate metrics and instrumentation alongside the existing RPC instrumentation.
- Moves the shared logic for gathering default metrics out of the profiling module.
- Adds a whole new dashboard row tracking both RPC and Engine error rates, with deaggregation by method and kind of error.
- Adds an Engine pie chart showing the proportion of calls by method.
- Updates the dashboard docs; see the changes [here](https://github.com/lambdaclass/ethrex/blob/rpc_error_rates/docs/developers/l1/dashboards.md#engine-api) and [here](https://github.com/lambdaclass/ethrex/blob/rpc_error_rates/docs/developers/l1/dashboards.md#engine-and-rpc-error-rates).

<img width="2543" height="1145" alt="image" src="https://github.com/user-attachments/assets/19eb2383-7dd3-41b1-ad8b-d1580a98ebb6" />

**Next Steps**

Some improvements to the block processing profiling remain:
- [ ] Follow-up work beyond this refactor, tracked in #5327.
- [x] A new issue was created to rename the remaining metrics modules and remove the extra `metrics_` prefix: #5378.
- [x] We may also want to capture more error information so errors can be deaggregated further, tracked in #5379.

**NOTE**: Once this is merged and published to our shared Grafana, the servers will need to be updated to main to see the RPC/Engine latency panels, since the metric names changed in this PR due to the extraction from the previous block profiling module.

Closes #5379

---------

Co-authored-by: Copilot <[email protected]>
1 parent f505c00 commit fa3a298

17 files changed: +1046 −385 lines

cmd/ethrex/initializers.rs

Lines changed: 2 additions & 0 deletions
```diff
@@ -11,6 +11,7 @@ use ethrex_common::types::Genesis;
 use ethrex_config::networks::Network;
 
 use ethrex_metrics::profiling::{FunctionProfilingLayer, initialize_block_processing_profile};
+use ethrex_metrics::rpc::initialize_rpc_metrics;
 use ethrex_p2p::rlpx::initiator::RLPxInitiator;
 use ethrex_p2p::{
     discv4::peer_table::PeerTable,
@@ -89,6 +90,7 @@ pub fn init_metrics(opts: &Options, tracker: TaskTracker) {
     );
 
     initialize_block_processing_profile();
+    initialize_rpc_metrics();
 
     tracker.spawn(metrics_api);
 }
```

crates/blockchain/metrics/api.rs

Lines changed: 4 additions & 5 deletions
```diff
@@ -1,9 +1,8 @@
 use axum::{Router, routing::get};
 
-use crate::profiling::gather_profiling_metrics;
-
 use crate::{
-    MetricsApiError, blocks::METRICS_BLOCKS, process::METRICS_PROCESS, transactions::METRICS_TX,
+    MetricsApiError, blocks::METRICS_BLOCKS, gather_default_metrics, process::METRICS_PROCESS,
+    transactions::METRICS_TX,
 };
 
 pub async fn start_prometheus_metrics_api(
@@ -32,10 +31,10 @@ pub(crate) async fn get_metrics() -> String {
     };
 
     ret_string.push('\n');
-    match gather_profiling_metrics() {
+    match gather_default_metrics() {
         Ok(string) => ret_string.push_str(&string),
         Err(_) => {
-            tracing::error!("Failed to register METRICS_PROFILING");
+            tracing::error!("Failed to gather default Prometheus metrics");
             return String::new();
         }
     };
```

crates/blockchain/metrics/mod.rs

Lines changed: 23 additions & 0 deletions
```diff
@@ -8,6 +8,8 @@ pub mod l2;
 pub mod process;
 #[cfg(feature = "api")]
 pub mod profiling;
+#[cfg(feature = "api")]
+pub mod rpc;
 #[cfg(any(feature = "api", feature = "transactions"))]
 pub mod transactions;
 
@@ -70,3 +72,24 @@ pub enum MetricsError {
     #[error("MetricsL2Error {0}")]
     FromUtf8Error(#[from] std::string::FromUtf8Error),
 }
+
+#[cfg(feature = "api")]
+/// Returns all metrics currently registered in Prometheus' default registry.
+///
+/// Both profiling and RPC metrics register with this default registry, and the
+/// metrics API surfaces them by calling this helper.
+pub fn gather_default_metrics() -> Result<String, MetricsError> {
+    use prometheus::{Encoder, TextEncoder};
+
+    let encoder = TextEncoder::new();
+    let metric_families = prometheus::gather();
+
+    let mut buffer = Vec::new();
+    encoder
+        .encode(&metric_families, &mut buffer)
+        .map_err(|e| MetricsError::PrometheusErr(e.to_string()))?;
+
+    let res = String::from_utf8(buffer)?;
+
+    Ok(res)
+}
```
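As a side note on how `gather_default_metrics` fits in: it text-encodes whatever lives in the `prometheus` crate's global default registry, which is where the `register_*` macros used by the profiling and RPC modules put their metrics. A minimal standalone sketch of that flow, using a hypothetical `example_requests_total` counter rather than any real ethrex metric:

```rust
use prometheus::{Encoder, TextEncoder, register_int_counter};

fn main() {
    // `register_int_counter!` adds the counter to the global default registry,
    // the same registry that `gather_default_metrics()` encodes.
    let counter = register_int_counter!(
        "example_requests_total",
        "Hypothetical counter, used only for this sketch"
    )
    .unwrap();
    counter.inc();

    // Equivalent to the body of `gather_default_metrics`: gather every metric
    // family from the default registry and text-encode it.
    let families = prometheus::gather();
    let mut buffer = Vec::new();
    TextEncoder::new().encode(&families, &mut buffer).unwrap();
    println!("{}", String::from_utf8(buffer).unwrap());
}
```

Running this prints the counter in the Prometheus text exposition format, which is what the metrics API appends to its response through `gather_default_metrics`.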

crates/blockchain/metrics/profiling.rs

Lines changed: 5 additions & 43 deletions
```diff
@@ -1,17 +1,18 @@
-use prometheus::{Encoder, HistogramTimer, HistogramVec, TextEncoder, register_histogram_vec};
-use std::{future::Future, sync::LazyLock};
+use prometheus::{HistogramTimer, HistogramVec, register_histogram_vec};
+use std::sync::LazyLock;
 use tracing::{
     Subscriber,
     field::{Field, Visit},
     span::{Attributes, Id},
 };
 use tracing_subscriber::{Layer, layer::Context, registry::LookupSpan};
 
-use crate::MetricsError;
-
 pub static METRICS_BLOCK_PROCESSING_PROFILE: LazyLock<HistogramVec> =
     LazyLock::new(initialize_histogram_vec);
 
+// Metrics defined in this module register into the Prometheus default registry.
+// The metrics API exposes them by calling `gather_default_metrics()`.
+
 fn initialize_histogram_vec() -> HistogramVec {
     register_histogram_vec!(
         "function_duration_seconds",
@@ -111,45 +112,6 @@ where
     }
 }
 
-/// Records the duration of an async operation in the function profiling histogram.
-///
-/// This provides a lightweight alternative to the `#[instrument]` attribute when you need
-/// manual control over timing instrumentation, such as in RPC handlers.
-///
-/// # Parameters
-/// * `namespace` - Category for the metric (e.g., "rpc", "engine", "block_execution")
-/// * `function_name` - Name identifier for the operation being timed
-/// * `future` - The async operation to time
-///
-/// Use this function when you need to instrument an async operation for duration metrics,
-/// but cannot or do not want to use the `#[instrument]` attribute (for example, in RPC handlers).
-pub async fn record_async_duration<Fut, T>(namespace: &str, function_name: &str, future: Fut) -> T
-where
-    Fut: Future<Output = T>,
-{
-    let timer = METRICS_BLOCK_PROCESSING_PROFILE
-        .with_label_values(&[namespace, function_name])
-        .start_timer();
-
-    let output = future.await;
-    timer.observe_duration();
-    output
-}
-
-pub fn gather_profiling_metrics() -> Result<String, MetricsError> {
-    let encoder = TextEncoder::new();
-    let metric_families = prometheus::gather();
-
-    let mut buffer = Vec::new();
-    encoder
-        .encode(&metric_families, &mut buffer)
-        .map_err(|e| MetricsError::PrometheusErr(e.to_string()))?;
-
-    let res = String::from_utf8(buffer)?;
-
-    Ok(res)
-}
-
 pub fn initialize_block_processing_profile() {
     METRICS_BLOCK_PROCESSING_PROFILE.reset();
 }
```

crates/blockchain/metrics/rpc.rs

Lines changed: 85 additions & 0 deletions
```diff
@@ -0,0 +1,85 @@
+use prometheus::{CounterVec, HistogramVec, register_counter_vec, register_histogram_vec};
+use std::{future::Future, sync::LazyLock};
+
+pub static METRICS_RPC_REQUEST_OUTCOMES: LazyLock<CounterVec> =
+    LazyLock::new(initialize_rpc_outcomes_counter);
+
+pub static METRICS_RPC_DURATION: LazyLock<HistogramVec> =
+    LazyLock::new(initialize_rpc_duration_histogram);
+
+// Metrics defined in this module register into the Prometheus default registry.
+// The metrics API exposes them by calling `gather_default_metrics()`.
+
+fn initialize_rpc_outcomes_counter() -> CounterVec {
+    register_counter_vec!(
+        "rpc_requests_total",
+        "Total number of RPC requests partitioned by namespace, method, and outcome",
+        &["namespace", "method", "outcome", "error_kind"],
+    )
+    .unwrap()
+}
+
+fn initialize_rpc_duration_histogram() -> HistogramVec {
+    register_histogram_vec!(
+        "rpc_request_duration_seconds",
+        "Histogram of RPC request handling duration partitioned by namespace and method",
+        &["namespace", "method"],
+    )
+    .unwrap()
+}
+
+/// Represents the outcome of an RPC request when recording metrics.
+#[derive(Clone)]
+pub enum RpcOutcome {
+    Success,
+    Error(&'static str),
+}
+
+impl RpcOutcome {
+    fn as_label(&self) -> &'static str {
+        match self {
+            RpcOutcome::Success => "success",
+            RpcOutcome::Error(_) => "error",
+        }
+    }
+
+    fn error_kind(&self) -> &str {
+        match self {
+            RpcOutcome::Success => "",
+            RpcOutcome::Error(kind) => kind,
+        }
+    }
+}
+
+pub fn record_rpc_outcome(namespace: &str, method: &str, outcome: RpcOutcome) {
+    METRICS_RPC_REQUEST_OUTCOMES
+        .with_label_values(&[namespace, method, outcome.as_label(), outcome.error_kind()])
+        .inc();
+}
+
+pub fn initialize_rpc_metrics() {
+    METRICS_RPC_REQUEST_OUTCOMES.reset();
+    METRICS_RPC_DURATION.reset();
+}
+
+/// Records the duration of an async operation in the RPC request duration histogram.
+///
+/// This provides a lightweight alternative to the `#[instrument]` attribute.
+///
+/// # Parameters
+/// * `namespace` - Category for the metric (e.g., "rpc", "engine", "block_execution")
+/// * `method` - Name identifier for the operation being timed
+/// * `future` - The async operation to time
+pub async fn record_async_duration<Fut, T>(namespace: &str, method: &str, future: Fut) -> T
+where
+    Fut: Future<Output = T>,
+{
+    let timer = METRICS_RPC_DURATION
+        .with_label_values(&[namespace, method])
+        .start_timer();
+
+    let output = future.await;
+    timer.observe_duration();
+    output
+}
```
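For a condensed picture of how the two helpers in this module compose (the real call site is in the `crates/networking/rpc/rpc.rs` diff below), here is a hedged sketch; the `eth_blockNumber` handler and the `"Internal"` error-kind string are illustrative stand-ins, not ethrex code:

```rust
use ethrex_metrics::rpc::{RpcOutcome, record_async_duration, record_rpc_outcome};

// Hypothetical handler standing in for a real RPC method implementation.
async fn handle_block_number() -> Result<u64, String> {
    Ok(42)
}

async fn instrumented_block_number() -> Result<u64, String> {
    // Time the handler under the ("rpc", "eth_blockNumber") label pair...
    let result = record_async_duration("rpc", "eth_blockNumber", handle_block_number()).await;

    // ...then bump rpc_requests_total with the matching outcome labels.
    let outcome = match &result {
        Ok(_) => RpcOutcome::Success,
        // In ethrex the kind comes from `get_error_kind(&RpcErr)`; a plain
        // static string is used here only to keep the sketch self-contained.
        Err(_) => RpcOutcome::Error("Internal"),
    };
    record_rpc_outcome("rpc", "eth_blockNumber", outcome);

    result
}
```

This pattern keeps the duration histogram labelled only by namespace and method, while the counter carries the `outcome` and `error_kind` labels that the error-rate panels query.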

crates/networking/rpc/rpc.rs

Lines changed: 37 additions & 5 deletions
```diff
@@ -55,7 +55,7 @@ use bytes::Bytes;
 use ethrex_blockchain::Blockchain;
 use ethrex_blockchain::error::ChainError;
 use ethrex_common::types::Block;
-use ethrex_metrics::profiling::record_async_duration;
+use ethrex_metrics::rpc::{RpcOutcome, record_async_duration, record_rpc_outcome};
 use ethrex_p2p::peer_handler::PeerHandler;
 use ethrex_p2p::sync_manager::SyncManager;
 use ethrex_p2p::types::Node;
@@ -196,16 +196,48 @@ pub trait RpcHandler: Sized {
             Ok(RpcNamespace::Engine) => "engine",
             _ => "rpc",
         };
+        let method = req.method.as_str();
+
+        let result =
+            record_async_duration(
+                namespace,
+                method,
+                async move { request.handle(context).await },
+            )
+            .await;
+
+        let outcome = match &result {
+            Ok(_) => RpcOutcome::Success,
+            Err(err) => RpcOutcome::Error(get_error_kind(err)),
+        };
+        record_rpc_outcome(namespace, method, outcome);
 
-        record_async_duration(namespace, req.method.as_str(), async move {
-            request.handle(context).await
-        })
-        .await
+        result
     }
 
     async fn handle(&self, context: RpcApiContext) -> Result<Value, RpcErr>;
 }
 
+fn get_error_kind(err: &RpcErr) -> &'static str {
+    match err {
+        RpcErr::MethodNotFound(_) => "MethodNotFound",
+        RpcErr::WrongParam(_) => "WrongParam",
+        RpcErr::BadParams(_) => "BadParams",
+        RpcErr::MissingParam(_) => "MissingParam",
+        RpcErr::TooLargeRequest => "TooLargeRequest",
+        RpcErr::BadHexFormat(_) => "BadHexFormat",
+        RpcErr::UnsuportedFork(_) => "UnsuportedFork",
+        RpcErr::Internal(_) => "Internal",
+        RpcErr::Vm(_) => "Vm",
+        RpcErr::Revert { .. } => "Revert",
+        RpcErr::Halt { .. } => "Halt",
+        RpcErr::AuthenticationError(_) => "AuthenticationError",
+        RpcErr::InvalidForkChoiceState(_) => "InvalidForkChoiceState",
+        RpcErr::InvalidPayloadAttributes(_) => "InvalidPayloadAttributes",
+        RpcErr::UnknownPayload(_) => "UnknownPayload",
+    }
+}
+
 pub const FILTER_DURATION: Duration = {
     if cfg!(test) {
         Duration::from_secs(1)
```

docs/developers/l1/dashboards.md

Lines changed: 32 additions & 5 deletions
```diff
@@ -94,16 +94,21 @@ Collapsed row that surfaces the `namespace="engine"` Prometheus timers so you ca
 
 ![Engine API row](img/engine_api_row.png)
 
-### Engine Request Rate by Method
-Shows how many Engine API calls per second we process, split by JSON-RPC method and averaged across the currently selected dashboard range.
+### Engine Total Time per Method
+Pie chart that shows where Engine time is spent across methods over the selected range. Quickly surfaces which endpoints dominate total processing time.
 
-![Engine Request Rate by Method](img/engine_request_rate_by_method.png)
+![Engine Total Time per Method](img/engine_total_time_per_method.png)
 
 ### Engine Latency by Methods (Avg Duration)
 Bar gauge of the historical average latency per Engine method over the selected time range.
 
 ![Engine Latency by Methods](img/engine_latency_by_methods.png)
 
+### Engine Request Rate by Method
+Shows how many Engine API calls per second we process, split by JSON-RPC method and averaged across the currently selected dashboard range.
+
+![Engine Request Rate by Method](img/engine_request_rate_by_method.png)
+
 ### Engine Latency by Method
 Live timeseries that tries to correlate to the per-block execution time by showing real-time latency per Engine method with an 18 s lookback window.
 
@@ -117,10 +122,10 @@ Another collapsed row focused on the public JSON-RPC surface (`namespace="rpc"`)
 
 ![RPC API row](img/rpc_api_row.png)
 
-### RPC Time per Method
+### RPC Total Time per Method
 Pie chart that shows where RPC time is spent across methods over the selected range. Quickly surfaces which endpoints dominate total processing time.
 
-![RPC Time per Method](img/rpc_time_per_method.png)
+![RPC Total Time per Method](img/rpc_total_time_per_method.png)
 
 ### Slowest RPC Methods
 Table listing the highest average-latency methods over the active dashboard range. Used to prioritise optimisation or caching efforts.
@@ -139,6 +144,28 @@ Live timeseries that tries to correlate to the per-block execution time by showi
 
 _**Limitations**: The RPC latency views inherit the same windowing caveats as the Engine charts: averages use the dashboard time range while the live chart relies on an 18 s window._
 
+## Engine and RPC Error rates
+
+Collapsed row showing error rates for both the Engine and RPC APIs side by side, plus a deaggregated panel by method and kind of error. Each panel repeats per instance so behaviour can be compared across nodes.
+
+![Engine and RPC Error rates row](img/engine_and_rpc_error_rates_row.png)
+
+### Engine Success/Error Rate
+Shows the rate of successful vs. failed Engine API requests per second.
+
+![Engine Success/Error Rate](img/engine_success_error_rate.png)
+
+### RPC Success/Error Rate
+Shows the rate of successful vs. failed RPC API requests per second.
+
+![RPC Success/Error Rate](img/rpc_success_error_rate.png)
+
+### Engine and RPC Errors % by Method and Kind
+
+Deaggregated view of error percentages split by method and error kind for both the Engine and RPC APIs. The percentages are calculated against the total requests for a particular method, so the individual error percentages for a method should sum to that method's overall error percentage.
+
+![Engine and RPC Errors % by Method and Kind](img/engine_and_rpc_errors_by_method_and_kind.png)
+
 ## Process and server info
 
 Row panels showing process-level and host-level metrics to help you monitor resource usage and spot potential issues.
```
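To make the arithmetic behind the errors-by-method-and-kind panel explicit (this is an interpretation of the description above, not a quote from the dashboard JSON): each (method, error kind) slice is computed against that method's total request count,

error%(method, kind) = errors(method, kind) / requests(method) × 100

so summing the slices of one method over every error kind recovers that method's overall error percentage, since each failed request carries exactly one `error_kind` label.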
(Image files included in this commit: 149 KB, 94.4 KB, and 58.1 KB — previews not rendered here.)
