-
Notifications
You must be signed in to change notification settings - Fork 3
feat(router): Subgraph Timeout Configuration #541
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Summary of ChangesHello @ardatan, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly improves the router's resilience and configurability by adding a robust timeout feature for subgraph requests. It allows operators to define timeouts either as fixed durations or dynamically via VRL expressions, applicable globally or on a per-subgraph basis. This enhancement is integrated into a refactored traffic shaping configuration, providing greater control over request execution behavior. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a valuable subgraph timeout feature, enhancing the router's resilience. The implementation is well-structured, using tokio::timeout and providing flexibility with dynamic timeouts through VRL expressions. The configuration refactoring to support per-subgraph settings is also a welcome improvement.
My review focuses on several key areas to further refine the implementation:
- Performance: I've identified a high-severity issue where new HTTP clients and connection pools are created unnecessarily. I've also pointed out a couple of medium-severity opportunities to reduce allocations on hot paths, aligning with the repository's performance-first ethos.
- Correctness & Readability: I've suggested improvements to the timeout expression evaluation to provide clearer, more accurate error messages and handle edge cases more robustly. Additionally, I've noted some documentation comments that could be clarified for better user understanding.
The proposed changes aim to enhance performance, improve error handling, and increase the overall clarity and maintainability of the new feature.
✅
|
|
🐋 This PR was built and pushed to the following Docker images: Image Names: Platforms: Image Tags: Docker metadata{
"buildx.build.ref": "builder-129e9f65-ee08-4627-875e-3da8e8b8be74/builder-129e9f65-ee08-4627-875e-3da8e8b8be740/r7r547s2pph1jq3p3xzs6727d",
"containerimage.descriptor": {
"mediaType": "application/vnd.oci.image.index.v1+json",
"digest": "sha256:023f5b5f8266ea36afabf41292c8d3d982fd4f48b5e3bee2d8a209242929f49c",
"size": 1609
},
"containerimage.digest": "sha256:023f5b5f8266ea36afabf41292c8d3d982fd4f48b5e3bee2d8a209242929f49c",
"image.name": "ghcr.io/graphql-hive/router:pr-541,ghcr.io/graphql-hive/router:sha-cc9fb63"
} |
0a61e05 to
e319e11
Compare
128b10a to
766bd0e
Compare
98ceed4 to
7cc1030
Compare
046cc07 to
8650809
Compare
d069f91 to
48c0b6b
Compare
48c0b6b to
7d062f6
Compare
The previous implementation had expression handling scattered across multiple locations: - `duration_or_prog.rs` handled duration expressions - `utils/expression.rs` had generic compilation logic - Type-specific conversions were ad-hoc This refactor centralizes all expression handling into a module with: 1. Generic traits for extensibility - `FromVrlValue`: Provides a way to convert VRL values to specific types - `CompileExpression`: Compiles VRL expression strings into executable programs - `ExecutableProgram`: Executes compiled VRL programs with proper context setup 2. Type-safe ValueOrProgram enum - A generic enum that can represent either a static value or a computed expression - Eliminates duplication (previously had DurationOrProgram only) - Enables easy extension for new types (StringOrProgram, HeaderValueOrProgram) 3. Proper error handling - Dedicated error types for compilation and execution failures - Type-specific conversion errors (DurationConversionError, StringConversionError, etc.)
|
🐋 This PR was built and pushed to the following Docker images: Image Names: Platforms: Image Tags: Docker metadata{
"buildx.build.ref": "builder-6f0af3b3-855f-4028-9afb-255e195911ad/builder-6f0af3b3-855f-4028-9afb-255e195911ad0/9as8usg22jkrzih3twzx4roqg",
"containerimage.descriptor": {
"mediaType": "application/vnd.oci.image.index.v1+json",
"digest": "sha256:39c9373476ce29ac7607a18ef2561a36a99b3f89dd1abe7d21f43ccd804a739b",
"size": 1609
},
"containerimage.digest": "sha256:39c9373476ce29ac7607a18ef2561a36a99b3f89dd1abe7d21f43ccd804a739b",
"image.name": "ghcr.io/graphql-hive/router:pr-541,ghcr.io/graphql-hive/router:sha-ca15f71"
} |
Rename timeout label from 'global' to 'all' in SubgraphExecutorMap.
Use concrete error types for parsing (FromUtf8Error and humantime::DurationError) and wrap them via DurationParseErrorSource to simplify conversion errors and From conversions. Re-export the moved ProgramResolutionError and streamline expression resolution to use ? for propagation instead of manual wrapping.
|
/gemini review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a powerful and flexible timeout configuration system for subgraphs, which is a great enhancement for router resilience. The implementation, including the refactoring of VRL expression handling into a dedicated expressions module, is well-structured and improves code clarity and reusability. I've identified a few issues: an incorrect unit in a timeout error message, a potential memory leak due to unbounded caching of dynamic subgraph executors, a missing case for float values when parsing durations, and a changeset file that doesn't fully align with the repository's contribution guidelines. Addressing these points will make this feature even more robust.
| SubgraphExecutorError::RequestTimeout( | ||
| self.endpoint.to_string(), | ||
| timeout_duration.as_secs(), | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error message for a request timeout is defined to report the duration in milliseconds, but the value passed here is in seconds (timeout_duration.as_secs()). This creates a misleading error message for operators and developers. For example, a 5-second timeout would be reported as "timed out after 5 milliseconds".
Please use as_millis() to provide the correct value.
| SubgraphExecutorError::RequestTimeout( | |
| self.endpoint.to_string(), | |
| timeout_duration.as_secs(), | |
| ) | |
| SubgraphExecutorError::RequestTimeout( | |
| self.endpoint.to_string(), | |
| timeout_duration.as_millis() as u64, | |
| ) |
| // Check if an executor for this endpoint already exists. | ||
| self.executors_by_subgraph | ||
| .get(subgraph_name) | ||
| .and_then(|endpoints| endpoints.get(endpoint_str.as_ref()).map(|e| e.clone())) | ||
| .map(Ok) | ||
| // If not, create and register a new one. | ||
| .unwrap_or_else(|| self.register_executor(subgraph_name, endpoint_str.as_ref())) | ||
| if let Some(executor) = self.get_executor_from_endpoint(subgraph_name, &endpoint_str) { | ||
| return Ok(executor); | ||
| } | ||
|
|
||
| // If not, create and register a new one. | ||
| self.register_executor(subgraph_name, &endpoint_str) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current implementation caches a new subgraph executor for each unique URL resolved from a VRL expression. If an expression generates a high-cardinality or unbounded number of URLs (e.g., by including a request-specific ID), the executors_by_subgraph map could grow indefinitely, leading to a memory leak and potential denial-of-service. This pattern violates the style guide's rule against unbounded growth.1
To mitigate this, consider adding a cap to the number of dynamic executors that can be created per subgraph. At a minimum, this risk should be clearly documented for users of the dynamic URL override feature.
Style Guide References
Footnotes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gemini-code-assist it's up to the user as a user controls it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, the user does control the VRL expression. However, the core concern remains that an ill-configured or malicious VRL expression could still lead to an unbounded number of unique endpoint URLs being generated. This would cause the executors_by_subgraph map to grow indefinitely, potentially consuming all available memory and leading to a denial-of-service for the router.
While the user has control, it's crucial that they are fully aware of this potential pitfall. Therefore, I still recommend adding clear documentation that highlights this risk and advises users on how to design their VRL expressions to prevent unbounded growth, or to implement external monitoring for the number of active executors if dynamic URLs are used extensively.
| --- | ||
| default: minor | ||
| --- | ||
|
|
||
| # Breaking | ||
|
|
||
| Removed `pool_idle_timeout_seconds` from `traffic_shaping`, instead use `pool_idle_timeout` with duration format. | ||
|
|
||
| ```diff | ||
| traffic_shaping: | ||
| - pool_idle_timeout_seconds: 30 | ||
| + pool_idle_timeout: 30s | ||
| ``` | ||
|
|
||
| #540 by @ardatan |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This changeset file appears to violate the repository's style guide in a couple of ways:1
- Target Package: It's marked as
default: minor, but since it modifiesTrafficShapingConfigwithin theconfigcrate, it should target therouterpackage (e.g.,router: minor). - Missing Feature Changeset: The main feature of this PR—configurable subgraph timeouts and the restructuring of
traffic_shaping—is a significant change toHiveRouterConfigbut is not described in a changeset. A new changeset file for this feature should be created, targeting therouterpackage and including a YAML example as required by the guide. - Breaking Change Level: The change from
pool_idle_timeout_secondstopool_idle_timeoutis a breaking change and should likely be marked asmajor, notminor.
Style Guide References
Footnotes
| fn from_vrl_value(value: VrlValue) -> Result<Self, Self::Error> { | ||
| match value { | ||
| VrlValue::Integer(i) => { | ||
| if i < 0 { | ||
| return Err(DurationConversionError::NegativeValue); | ||
| } | ||
| Ok(Duration::from_millis(i as u64)) | ||
| } | ||
| VrlValue::Bytes(b) => { | ||
| let s = std::str::from_utf8(&b)?; | ||
| Ok(parse_duration(s)?) | ||
| } | ||
| other => Err(DurationConversionError::UnexpectedType { | ||
| type_name: other.kind().to_string(), | ||
| }), | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The documentation for DurationOrExpression in lib/router-config/src/traffic_shaping.rs states that a VRL expression can return an integer or a float (representing milliseconds). However, this implementation of from_vrl_value for Duration only handles VrlValue::Integer and VrlValue::Bytes, but not VrlValue::Float.
This inconsistency could lead to unexpected errors at runtime. Please add support for float values.
fn from_vrl_value(value: VrlValue) -> Result<Self, Self::Error> {
match value {
VrlValue::Integer(i) => {
if i < 0 {
return Err(DurationConversionError::NegativeValue);
}
Ok(Duration::from_millis(i as u64))
}
VrlValue::Float(f) => {
if f < 0.0 {
return Err(DurationConversionError::NegativeValue);
}
Ok(Duration::from_secs_f64(f / 1000.0))
}
VrlValue::Bytes(b) => {
let s = std::str::from_utf8(&b)?;
Ok(parse_duration(s)?)
}
other => Err(DurationConversionError::UnexpectedType {
type_name: other.kind().to_string(),
}),
}
}
Implementation of Timeout in #317
Ref ROUTER-110
Ref ROUTER-151
This also adds subgraphs and all options to traffic_shaping as in Apollo Router. So subgraph specific configuration can be done with subgraphs;
Apollo Router -> https://www.apollographql.com/docs/graphos/routing/performance/traffic-shaping#configuration
Documentation -> graphql-hive/console#7214