feat(router): Subgraph Timeout Configuration #541

ardatan · 2025-10-31T10:55:39Z

Implementation of Timeout in #317
Ref ROUTER-110
Ref ROUTER-151
This also adds the subgraphs and all options to traffic_shaping, as in Apollo Router, so subgraph-specific configuration can be set under subgraphs.

Apollo Router -> https://www.apollographql.com/docs/graphos/routing/performance/traffic-shaping#configuration

traffic_shaping:
  all:
    request_timeout: 5s
  subgraphs:
    products:
      request_timeout:
        expression: |
          if (.request.operation.kind == "mutation") {
            "15s"
          } else {
            "5s"
          }

Documentation -> graphql-hive/console#7214

gemini-code-assist · 2025-10-31T10:56:05Z

Summary of Changes

Hello @ardatan, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly improves the router's resilience and configurability by adding a robust timeout feature for subgraph requests. It allows operators to define timeouts either as fixed durations or dynamically via VRL expressions, applicable globally or on a per-subgraph basis. This enhancement is integrated into a refactored traffic shaping configuration, providing greater control over request execution behavior.

Highlights

Subgraph Timeout Implementation: Introduced a timeout mechanism for subgraph requests, allowing for both fixed duration and dynamic VRL expression-based timeouts.
Flexible Traffic Shaping Configuration: Enhanced the traffic shaping configuration to support global settings (all) and specific overrides for individual subgraphs (subgraphs), mirroring Apollo Router's approach.
VRL Expression Integration: Centralized VRL expression handling into a new Expression primitive, simplifying its use across various configurations like subgraph URLs, headers, and now, timeouts.
Error Handling: Added specific error types for timeout failures and issues with VRL expression resolution for timeouts.
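
To illustrate the all/subgraphs layering described in the highlights, here is a minimal sketch. It assumes a per-subgraph entry takes precedence and that the value under all is the fallback; the `TrafficShaping` struct and `timeout_for` method are hypothetical names, and only fixed durations are modelled (no VRL expressions).

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Hypothetical, simplified view of the `traffic_shaping` section.
/// The real config type also accepts VRL expressions; this sketch
/// only models fixed durations.
pub struct TrafficShaping {
    /// Timeout from `traffic_shaping.all.request_timeout`.
    pub all: Duration,
    /// Overrides from `traffic_shaping.subgraphs.<name>.request_timeout`.
    pub subgraphs: HashMap<String, Duration>,
}

impl TrafficShaping {
    /// Returns the subgraph-specific timeout if one is configured,
    /// otherwise falls back to the value under `all` (assumed behavior).
    pub fn timeout_for(&self, subgraph: &str) -> Duration {
        self.subgraphs.get(subgraph).copied().unwrap_or(self.all)
    }
}
```

With `all` set to 5s and a 15s entry for products, `timeout_for("products")` yields 15s while any other subgraph name yields 5s.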

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces a valuable subgraph timeout feature, enhancing the router's resilience. The implementation is well-structured, using tokio::timeout and providing flexibility with dynamic timeouts through VRL expressions. The configuration refactoring to support per-subgraph settings is also a welcome improvement.

My review focuses on several key areas to further refine the implementation:

Performance: I've identified a high-severity issue where new HTTP clients and connection pools are created unnecessarily. I've also pointed out a couple of medium-severity opportunities to reduce allocations on hot paths, aligning with the repository's performance-first ethos.
Correctness & Readability: I've suggested improvements to the timeout expression evaluation to provide clearer, more accurate error messages and handle edge cases more robustly. Additionally, I've noted some documentation comments that could be clarified for better user understanding.

The proposed changes aim to enhance performance, improve error handling, and increase the overall clarity and maintainability of the new feature.

lib/executor/src/executors/map.rs

lib/executor/src/executors/http.rs

lib/router-config/src/traffic_shaping.rs

github-actions · 2025-10-31T11:01:33Z

✅ `k6-benchmark` results

     ✓ response code was 200
     ✓ no graphql errors
     ✓ valid response structure

     █ setup

     checks.........................: 100.00% ✓ 197865      ✗ 0    
     data_received..................: 5.8 GB  192 MB/s
     data_sent......................: 77 MB   2.6 MB/s
     http_req_blocked...............: avg=4.84µs   min=831ns   med=2.09µs  max=21.04ms  p(90)=3.02µs   p(95)=3.49µs  
     http_req_connecting............: avg=1.09µs   min=0s      med=0s      max=2.85ms   p(90)=0s       p(95)=0s      
     http_req_duration..............: avg=22.25ms  min=2.5ms   med=21.34ms max=114.5ms  p(90)=29.92ms  p(95)=33.35ms 
       { expected_response:true }...: avg=22.25ms  min=2.5ms   med=21.34ms max=114.5ms  p(90)=29.92ms  p(95)=33.35ms 
     http_req_failed................: 0.00%   ✓ 0           ✗ 65975
     http_req_receiving.............: avg=150.05µs min=27.24µs med=45.03µs max=59.77ms  p(90)=102.19µs p(95)=424.88µs
     http_req_sending...............: avg=25.65µs  min=6.09µs  med=11.49µs max=21.29ms  p(90)=17.69µs  p(95)=28.63µs 
     http_req_tls_handshaking.......: avg=0s       min=0s      med=0s      max=0s       p(90)=0s       p(95)=0s      
     http_req_waiting...............: avg=22.08ms  min=2.4ms   med=21.19ms max=71.75ms  p(90)=29.65ms  p(95)=33.02ms 
     http_reqs......................: 65975   2192.971364/s
     iteration_duration.............: avg=22.74ms  min=6.61ms  med=21.7ms  max=213.48ms p(90)=30.38ms  p(95)=33.89ms 
     iterations.....................: 65955   2192.306575/s
     vus............................: 50      min=50        max=50 
     vus_max........................: 50      min=50        max=50

github-actions · 2025-10-31T11:01:35Z

🐋 This PR was built and pushed to the following Docker images:

Image Names: ghcr.io/graphql-hive/router

Platforms: linux/amd64,linux/arm64

Image Tags: ghcr.io/graphql-hive/router:pr-541 ghcr.io/graphql-hive/router:sha-cc9fb63

Docker metadata

{
"buildx.build.ref": "builder-129e9f65-ee08-4627-875e-3da8e8b8be74/builder-129e9f65-ee08-4627-875e-3da8e8b8be740/r7r547s2pph1jq3p3xzs6727d",
"containerimage.descriptor": {
  "mediaType": "application/vnd.oci.image.index.v1+json",
  "digest": "sha256:023f5b5f8266ea36afabf41292c8d3d982fd4f48b5e3bee2d8a209242929f49c",
  "size": 1609
},
"containerimage.digest": "sha256:023f5b5f8266ea36afabf41292c8d3d982fd4f48b5e3bee2d8a209242929f49c",
"image.name": "ghcr.io/graphql-hive/router:pr-541,ghcr.io/graphql-hive/router:sha-cc9fb63"
}

The previous implementation had expression handling scattered across multiple locations:

- `duration_or_prog.rs` handled duration expressions
- `utils/expression.rs` had generic compilation logic
- Type-specific conversions were ad-hoc

This refactor centralizes all expression handling into a module with:

1. Generic traits for extensibility
   - `FromVrlValue`: Provides a way to convert VRL values to specific types
   - `CompileExpression`: Compiles VRL expression strings into executable programs
   - `ExecutableProgram`: Executes compiled VRL programs with proper context setup
2. Type-safe ValueOrProgram enum
   - A generic enum that can represent either a static value or a computed expression
   - Eliminates duplication (previously had DurationOrProgram only)
   - Enables easy extension for new types (StringOrProgram, HeaderValueOrProgram)
3. Proper error handling
   - Dedicated error types for compilation and execution failures
   - Type-specific conversion errors (DurationConversionError, StringConversionError, etc.)
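
The shape described in this commit message can be sketched roughly as follows. This is not the PR's code: the `Value` enum stands in for the VRL value type, a plain function pointer stands in for a compiled VRL program, and `resolve` and `timeout_program` are hypothetical names used only for illustration.

```rust
use std::time::Duration;

/// Stand-in for a VRL value; the real router uses the `vrl` crate's value type.
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Integer(i64),
    Bytes(Vec<u8>),
}

/// Converts an expression result into a concrete Rust type.
pub trait FromVrlValue: Sized {
    type Error;
    fn from_vrl_value(value: Value) -> Result<Self, Self::Error>;
}

/// Either a static value from config or a program evaluated per request.
/// The "program" is a function pointer here (assumption), standing in
/// for a compiled VRL program.
pub enum ValueOrProgram<T> {
    Value(T),
    Program(fn(&str) -> Value),
}

impl<T: FromVrlValue + Clone> ValueOrProgram<T> {
    /// Resolves to a `T`, running the program against the operation kind if needed.
    pub fn resolve(&self, operation_kind: &str) -> Result<T, T::Error> {
        match self {
            ValueOrProgram::Value(v) => Ok(v.clone()),
            ValueOrProgram::Program(run) => T::from_vrl_value(run(operation_kind)),
        }
    }
}

impl FromVrlValue for Duration {
    type Error = String;
    fn from_vrl_value(value: Value) -> Result<Self, Self::Error> {
        match value {
            // Integers are interpreted as milliseconds.
            Value::Integer(ms) if ms >= 0 => Ok(Duration::from_millis(ms as u64)),
            Value::Integer(_) => Err("negative duration".to_string()),
            other => Err(format!("unexpected value: {other:?}")),
        }
    }
}

/// Example program: mutations get 15s, everything else 5s.
pub fn timeout_program(operation_kind: &str) -> Value {
    if operation_kind == "mutation" {
        Value::Integer(15_000)
    } else {
        Value::Integer(5_000)
    }
}
```

Making the enum generic over `T` means a new configurable type only needs a `FromVrlValue` impl, which is the duplication the commit message says `DurationOrProgram` used to cause.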

github-actions · 2025-11-26T13:00:34Z

🐋 This PR was built and pushed to the following Docker images:

Image Names: ghcr.io/graphql-hive/router

Platforms: linux/amd64,linux/arm64

Image Tags: ghcr.io/graphql-hive/router:pr-541 ghcr.io/graphql-hive/router:sha-ca15f71

Docker metadata

{
"buildx.build.ref": "builder-6f0af3b3-855f-4028-9afb-255e195911ad/builder-6f0af3b3-855f-4028-9afb-255e195911ad0/9as8usg22jkrzih3twzx4roqg",
"containerimage.descriptor": {
  "mediaType": "application/vnd.oci.image.index.v1+json",
  "digest": "sha256:39c9373476ce29ac7607a18ef2561a36a99b3f89dd1abe7d21f43ccd804a739b",
  "size": 1609
},
"containerimage.digest": "sha256:39c9373476ce29ac7607a18ef2561a36a99b3f89dd1abe7d21f43ccd804a739b",
"image.name": "ghcr.io/graphql-hive/router:pr-541,ghcr.io/graphql-hive/router:sha-ca15f71"
}


Use concrete error types for parsing (FromUtf8Error and humantime::DurationError) and wrap them via DurationParseErrorSource to simplify conversion errors and From conversions. Re-export the moved ProgramResolutionError and streamline expression resolution to use ? for propagation instead of manual wrapping.

kamilkisiela · 2025-11-26T15:43:36Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces a powerful and flexible timeout configuration system for subgraphs, which is a great enhancement for router resilience. The implementation, including the refactoring of VRL expression handling into a dedicated expressions module, is well-structured and improves code clarity and reusability. I've identified a few issues: an incorrect unit in a timeout error message, a potential memory leak due to unbounded caching of dynamic subgraph executors, a missing case for float values when parsing durations, and a changeset file that doesn't fully align with the repository's contribution guidelines. Addressing these points will make this feature even more robust.

gemini-code-assist · 2025-11-26T15:45:59Z

lib/executor/src/executors/http.rs

+                    SubgraphExecutorError::RequestTimeout(
+                        self.endpoint.to_string(),
+                        timeout_duration.as_secs(),
+                    )


The error message for a request timeout is defined to report the duration in milliseconds, but the value passed here is in seconds (timeout_duration.as_secs()). This creates a misleading error message for operators and developers. For example, a 5-second timeout would be reported as "timed out after 5 milliseconds".

Please use as_millis() to provide the correct value.

Suggested change

-                    SubgraphExecutorError::RequestTimeout(
-                        self.endpoint.to_string(),
-                        timeout_duration.as_secs(),
-                    )
+                    SubgraphExecutorError::RequestTimeout(
+                        self.endpoint.to_string(),
+                        timeout_duration.as_millis() as u64,
+                    )
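
The unit mismatch is easy to reproduce in isolation. The sketch below is an assumption-laden stand-in: `SubgraphError`, its message text, and `timeout_error` are hypothetical and only mirror the idea of an error whose message reports milliseconds. It also converts the `u128` returned by `as_millis` with a saturating `try_from` rather than an `as u64` cast.

```rust
use std::fmt;
use std::time::Duration;

/// Hypothetical mirror of the timeout error: endpoint plus budget in milliseconds.
#[derive(Debug, PartialEq)]
pub enum SubgraphError {
    RequestTimeout(String, u64),
}

impl fmt::Display for SubgraphError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SubgraphError::RequestTimeout(endpoint, ms) => {
                write!(f, "request to {endpoint} timed out after {ms} milliseconds")
            }
        }
    }
}

/// Builds the error with the unit the message expects.
pub fn timeout_error(endpoint: &str, timeout: Duration) -> SubgraphError {
    // `as_millis` returns u128; saturate instead of truncating.
    let ms = u64::try_from(timeout.as_millis()).unwrap_or(u64::MAX);
    SubgraphError::RequestTimeout(endpoint.to_string(), ms)
}
```

For a 5-second timeout, `as_secs()` yields 5 while `as_millis()` yields 5000, so only the latter matches a message phrased in milliseconds.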

gemini-code-assist · 2025-11-26T15:45:59Z

lib/executor/src/executors/map.rs

        // Check if an executor for this endpoint already exists.
-        self.executors_by_subgraph
-            .get(subgraph_name)
-            .and_then(|endpoints| endpoints.get(endpoint_str.as_ref()).map(|e| e.clone()))
-            .map(Ok)
-            // If not, create and register a new one.
-            .unwrap_or_else(|| self.register_executor(subgraph_name, endpoint_str.as_ref()))
+        if let Some(executor) = self.get_executor_from_endpoint(subgraph_name, &endpoint_str) {
+            return Ok(executor);
+        }
+
+        // If not, create and register a new one.
+        self.register_executor(subgraph_name, &endpoint_str)


The current implementation caches a new subgraph executor for each unique URL resolved from a VRL expression. If an expression generates a high-cardinality or unbounded number of URLs (e.g., by including a request-specific ID), the executors_by_subgraph map could grow indefinitely, leading to a memory leak and potential denial-of-service. This pattern violates the style guide's rule against unbounded growth.

To mitigate this, consider adding a cap to the number of dynamic executors that can be created per subgraph. At a minimum, this risk should be clearly documented for users of the dynamic URL override feature.

kamilkisiela · 2025-11-26

@gemini-code-assist it's up to the user, since the user controls it.

gemini-code-assist · 2025-11-26

You're right, the user does control the VRL expression. However, the core concern remains that an ill-configured or malicious VRL expression could still lead to an unbounded number of unique endpoint URLs being generated. This would cause the executors_by_subgraph map to grow indefinitely, potentially consuming all available memory and leading to a denial-of-service for the router.

While the user has control, it's crucial that they are fully aware of this potential pitfall. Therefore, I still recommend adding clear documentation that highlights this risk and advises users on how to design their VRL expressions to prevent unbounded growth, or to implement external monitoring for the number of active executors if dynamic URLs are used extensively.
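
For reference, the per-subgraph cap suggested earlier in this thread could look roughly like the sketch below. It is not part of the PR: `BoundedExecutorMap`, `Executor`, `RegisterError`, and the `max_per_subgraph` limit are all hypothetical, and the real executor owns an HTTP client rather than just an endpoint string.

```rust
use std::collections::HashMap;

/// Stand-in for a subgraph executor; the real one owns an HTTP client.
#[derive(Debug, Clone, PartialEq)]
pub struct Executor {
    pub endpoint: String,
}

#[derive(Debug, PartialEq)]
pub enum RegisterError {
    /// The subgraph already has `limit` distinct endpoints cached.
    TooManyEndpoints { subgraph: String, limit: usize },
}

/// Executor cache keyed by subgraph name, then endpoint URL, with a per-subgraph cap.
pub struct BoundedExecutorMap {
    max_per_subgraph: usize,
    executors_by_subgraph: HashMap<String, HashMap<String, Executor>>,
}

impl BoundedExecutorMap {
    pub fn new(max_per_subgraph: usize) -> Self {
        Self { max_per_subgraph, executors_by_subgraph: HashMap::new() }
    }

    /// Returns the cached executor for `endpoint`, or registers a new one
    /// unless the subgraph has reached its cap.
    pub fn get_or_register(
        &mut self,
        subgraph: &str,
        endpoint: &str,
    ) -> Result<Executor, RegisterError> {
        let endpoints = self
            .executors_by_subgraph
            .entry(subgraph.to_string())
            .or_default();
        if let Some(existing) = endpoints.get(endpoint) {
            return Ok(existing.clone());
        }
        if endpoints.len() >= self.max_per_subgraph {
            return Err(RegisterError::TooManyEndpoints {
                subgraph: subgraph.to_string(),
                limit: self.max_per_subgraph,
            });
        }
        let executor = Executor { endpoint: endpoint.to_string() };
        endpoints.insert(endpoint.to_string(), executor.clone());
        Ok(executor)
    }

    /// Number of distinct endpoints currently cached for a subgraph.
    pub fn cached_endpoints(&self, subgraph: &str) -> usize {
        self.executors_by_subgraph.get(subgraph).map_or(0, |m| m.len())
    }
}
```

Rejecting new endpoints once the cap is hit is only one policy; evicting the least recently used executor would be an alternative, at the cost of extra bookkeeping on the lookup path.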

gemini-code-assist · 2025-11-26T15:45:59Z

.changeset/shared_utilities_to_handle_vrl_expressions.md

+---
+default: minor
+---
+
+# Breaking
+
+Removed `pool_idle_timeout_seconds` from `traffic_shaping`, instead use `pool_idle_timeout` with duration format.
+
+```diff
+traffic_shaping:
+-  pool_idle_timeout_seconds: 30
+  pool_idle_timeout: 30s
+```
+
+#540 by @ardatan


This changeset file appears to violate the repository's style guide in a couple of ways:

Target Package: It's marked as default: minor, but since it modifies TrafficShapingConfig within the config crate, it should target the router package (e.g., router: minor).

Missing Feature Changeset: The main feature of this PR, configurable subgraph timeouts and the restructuring of traffic_shaping, is a significant change to HiveRouterConfig but is not described in a changeset. A new changeset file for this feature should be created, targeting the router package and including a YAML example as required by the guide.

Breaking Change Level: The change from pool_idle_timeout_seconds to pool_idle_timeout is a breaking change and should likely be marked as major, not minor.

gemini-code-assist · 2025-11-26T15:45:59Z

lib/executor/src/expressions/values/duration.rs

+    fn from_vrl_value(value: VrlValue) -> Result<Self, Self::Error> {
+        match value {
+            VrlValue::Integer(i) => {
+                if i < 0 {
+                    return Err(DurationConversionError::NegativeValue);
+                }
+                Ok(Duration::from_millis(i as u64))
+            }
+            VrlValue::Bytes(b) => {
+                let s = std::str::from_utf8(&b)?;
+                Ok(parse_duration(s)?)
+            }
+            other => Err(DurationConversionError::UnexpectedType {
+                type_name: other.kind().to_string(),
+            }),
+        }
+    }
+}


The documentation for DurationOrExpression in lib/router-config/src/traffic_shaping.rs states that a VRL expression can return an integer or a float (representing milliseconds). However, this implementation of from_vrl_value for Duration only handles VrlValue::Integer and VrlValue::Bytes, but not VrlValue::Float.

This inconsistency could lead to unexpected errors at runtime. Please add support for float values.

    fn from_vrl_value(value: VrlValue) -> Result<Self, Self::Error> {
        match value {
            VrlValue::Integer(i) => {
                if i < 0 {
                    return Err(DurationConversionError::NegativeValue);
                }
                Ok(Duration::from_millis(i as u64))
            }
            VrlValue::Float(f) => {
                if f < 0.0 {
                    return Err(DurationConversionError::NegativeValue);
                }
                Ok(Duration::from_secs_f64(f / 1000.0))
            }
            VrlValue::Bytes(b) => {
                let s = std::str::from_utf8(&b)?;
                Ok(parse_duration(s)?)
            }
            other => Err(DurationConversionError::UnexpectedType {
                type_name: other.kind().to_string(),
            }),
        }
    }
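
A self-contained variant of this conversion is sketched below for experimentation outside the router. Everything here is a stand-in: `Value` replaces the VRL value type, `parse_simple_duration` replaces the real duration parser and only understands "<n>ms" and "<n>s", and the error variants are hypothetical. One deliberate difference from the suggestion above: it uses `Duration::try_from_secs_f64`, which returns an error for NaN, infinite, or out-of-range floats, whereas `Duration::from_secs_f64` panics on them.

```rust
use std::time::Duration;

/// Stand-in for the VRL value type (the real one comes from the `vrl` crate).
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Integer(i64),
    Float(f64),
    Bytes(Vec<u8>),
    Boolean(bool),
}

#[derive(Debug, PartialEq)]
pub enum DurationConversionError {
    NegativeValue,
    OutOfRange,
    InvalidUtf8,
    InvalidString(String),
    UnexpectedType,
}

/// Tiny stand-in parser that only understands "<n>ms" and "<n>s".
fn parse_simple_duration(s: &str) -> Result<Duration, DurationConversionError> {
    let invalid = || DurationConversionError::InvalidString(s.to_string());
    // Check "ms" first, since "5ms" also ends in 's'.
    if let Some(ms) = s.strip_suffix("ms") {
        return ms.parse::<u64>().map(Duration::from_millis).map_err(|_| invalid());
    }
    if let Some(secs) = s.strip_suffix('s') {
        return secs.parse::<u64>().map(Duration::from_secs).map_err(|_| invalid());
    }
    Err(invalid())
}

/// Converts an expression result to a duration; numbers are milliseconds.
pub fn duration_from_value(value: Value) -> Result<Duration, DurationConversionError> {
    match value {
        Value::Integer(i) if i < 0 => Err(DurationConversionError::NegativeValue),
        Value::Integer(i) => Ok(Duration::from_millis(i as u64)),
        Value::Float(f) if f < 0.0 => Err(DurationConversionError::NegativeValue),
        // NaN, infinity, and overflow are reported as errors instead of panicking.
        Value::Float(f) => Duration::try_from_secs_f64(f / 1000.0)
            .map_err(|_| DurationConversionError::OutOfRange),
        Value::Bytes(b) => {
            let s = std::str::from_utf8(&b)
                .map_err(|_| DurationConversionError::InvalidUtf8)?;
            parse_simple_duration(s)
        }
        _ => Err(DurationConversionError::UnexpectedType),
    }
}
```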

ardatan changed the base branch from main to primitive-expression October 31, 2025 10:55

gemini-code-assist bot reviewed Oct 31, 2025

lib/executor/src/executors/map.rs (outdated, resolved)

lib/executor/src/executors/http.rs (outdated, resolved)

lib/executor/src/executors/http.rs (outdated, resolved)

lib/router-config/src/traffic_shaping.rs (outdated, resolved)

ardatan changed the title ~~Subgraph timeout~~ feat(router): Subgraph Timeout Configuration Oct 31, 2025

ardatan force-pushed the subgraph_timeout branch 3 times, most recently from 0a61e05 to e319e11 Compare October 31, 2025 15:10

ardatan force-pushed the primitive-expression branch 2 times, most recently from 128b10a to 766bd0e Compare November 3, 2025 11:30

ardatan force-pushed the subgraph_timeout branch 4 times, most recently from 98ceed4 to 7cc1030 Compare November 3, 2025 13:31

ardatan mentioned this pull request Nov 4, 2025

docs(router): subgraph timeout configuration graphql-hive/console#7214

Open

ardatan force-pushed the primitive-expression branch from 046cc07 to 8650809 Compare November 5, 2025 23:25

Base automatically changed from primitive-expression to main November 21, 2025 15:38

kamilkisiela force-pushed the subgraph_timeout branch from d069f91 to 48c0b6b Compare November 25, 2025 09:14

ardatan and others added 8 commits November 25, 2025 10:17

feat(router): Subgraph Timeout Configuration

40f60ab

I guess?

09e1181

Make sure Duration in config is treated as String

2fe644d

e2e tests, fix config and adjust naming

a0aa887

Compile timeout expressions during config load

0e5bf38

Use a link to the vector.dev documentation for the VRL error messages

a91a23e

syntax

c776475

Introduce .default value as a fallback

7d062f6

kamilkisiela force-pushed the subgraph_timeout branch from 48c0b6b to 7d062f6 Compare November 26, 2025 12:38

kamilkisiela added 6 commits November 26, 2025 14:31

Use VRL formatter and make executor error transparent

04bbc67

Rename timeout label from 'global' to 'all' in SubgraphExecutorMap.

Actually, bring back the old logic as I forgot to do subgraph url overrides

4bb9b61

fmt

d49a0f3

Avoid client override when pool timeout unchanged

9ad96ab

Set tracing::instrument level to trace

e81e5af

gemini-code-assist bot reviewed Nov 26, 2025


kamilkisiela added 4 commits November 26, 2025 16:50

Fix changeset

e92a3eb

Fix RequestTimeout duration being s instead of ms

c6ec347

changeset

d79b363

Remove obsolete changeset about pool_idle_timeout

acb1c4d
