HDFS-17909. [ARR] AsyncRouterHandlerExecutors should use bounded queue#8448

Open
kokonguyen191 wants to merge 4 commits into apache:trunk from kokonguyen191:async-router-server-f

@kokonguyen191
Contributor

@kokonguyen191 kokonguyen191 commented Apr 21, 2026

Description of PR

When a lot of requests arrive at a router and there is a backlog due to downstream namespaces not being able to fully handle the requests, routers can go OOM if there is no limit to the queue used by the async handlers.

This situation can be simulated using the unit test attached with this patch.

public static void main(String[] args) throws Exception {
  // asyncRpcClient and ns0 are defined outside this snippet.
  setUpCluster();
  RemoteMethod method = new RemoteMethod("listOpenFiles",
      new Class<?>[] {long.class, EnumSet.class, String.class},
      0, EnumSet.of(OpenFilesIterator.OpenFilesType.BLOCKING_DECOMMISSION),
      new RemoteParam());
  UserGroupInformation ugi = RouterRpcServer.getRemoteUser();
  Class<?> protocol = method.getProtocol();
  String bigPath = "/veryBigOperation";
  Object[] params = new Object[] {
      0, EnumSet.of(OpenFilesIterator.OpenFilesType.BLOCKING_DECOMMISSION), bigPath};
  List<? extends FederationNamenodeContext> namenodes =
      asyncRpcClient.getOrderedNamenodes(ns0, true);
  // Submit calls in an endless loop so that requests pile up in the async handler queue.
  while (true) {
    asyncRpcClient.invokeMethod(ugi, namenodes, true, protocol, method.getMethod(), params);
    asyncReturn(Map.class);
  }
}

[Image: rising memory usage from the unbounded, growing queue]

How was this patch tested?

This patch has been used in our prod cluster for over a year and everything's working properly.

@hfutatzhanghb
Member

@kokonguyen191 Thanks for pushing this. Could you please add more detail to the description about how this situation is triggered? Pseudocode, or any other form that makes it clear, would help. Thanks a lot.

@kokonguyen191
Contributor Author

Hi @hfutatzhanghb, thanks for taking a look at this patch. I have updated the description as well as the unit test for easier issue repro.

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 34s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+0 🆗 xmllint 0m 0s xmllint was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ trunk Compile Tests _
+1 💚 mvninstall 41m 35s trunk passed
+1 💚 compile 1m 10s trunk passed with JDK Ubuntu-21.0.10+7-Ubuntu-124.04
+1 💚 compile 1m 33s trunk passed with JDK Ubuntu-17.0.18+8-Ubuntu-124.04.1
+1 💚 checkstyle 1m 4s trunk passed
+1 💚 mvnsite 1m 39s trunk passed
+1 💚 javadoc 1m 8s trunk passed with JDK Ubuntu-21.0.10+7-Ubuntu-124.04
+1 💚 javadoc 1m 1s trunk passed with JDK Ubuntu-17.0.18+8-Ubuntu-124.04.1
+1 💚 spotbugs 2m 24s trunk passed
+1 💚 shadedclient 30m 9s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 1m 0s the patch passed
+1 💚 compile 0m 39s the patch passed with JDK Ubuntu-21.0.10+7-Ubuntu-124.04
+1 💚 javac 0m 39s the patch passed
+1 💚 compile 0m 59s the patch passed with JDK Ubuntu-17.0.18+8-Ubuntu-124.04.1
+1 💚 javac 0m 59s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 0m 27s /results-checkstyle-hadoop-hdfs-project_hadoop-hdfs-rbf.txt hadoop-hdfs-project/hadoop-hdfs-rbf: The patch generated 3 new + 0 unchanged - 0 fixed = 3 total (was 0)
+1 💚 mvnsite 1m 5s the patch passed
+1 💚 javadoc 0m 34s the patch passed with JDK Ubuntu-21.0.10+7-Ubuntu-124.04
-1 ❌ javadoc 0m 34s /results-javadoc-javadoc-hadoop-hdfs-project_hadoop-hdfs-rbf-jdkUbuntu-17.0.18+8-Ubuntu-124.04.1.txt hadoop-hdfs-project_hadoop-hdfs-rbf-jdkUbuntu-17.0.18+8-Ubuntu-124.04.1 with JDK Ubuntu-17.0.18+8-Ubuntu-124.04.1 generated 1 new + 937 unchanged - 0 fixed = 938 total (was 937)
+1 💚 spotbugs 2m 5s the patch passed
+1 💚 shadedclient 28m 47s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 43m 52s hadoop-hdfs-rbf in the patch passed.
+1 💚 asflicense 0m 37s The patch does not generate ASF License warnings.
Total runtime: 164m 23s
Subsystem Report/Notes
Docker ClientAPI=1.54 ServerAPI=1.54 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8448/2/artifact/out/Dockerfile
GITHUB PR #8448
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets xmllint
uname Linux 2b8dcc5a7c7f 5.15.0-173-generic #183-Ubuntu SMP Fri Mar 6 13:29:34 UTC 2026 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 2c3f471
Default Java Ubuntu-17.0.18+8-Ubuntu-124.04.1
Multi-JDK versions /usr/lib/jvm/java-21-openjdk-amd64:Ubuntu-21.0.10+7-Ubuntu-124.04 /usr/lib/jvm/java-17-openjdk-amd64:Ubuntu-17.0.18+8-Ubuntu-124.04.1
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8448/2/testReport/
Max. process+thread count 4205 (vs. ulimit of 10000)
modules C: hadoop-hdfs-project/hadoop-hdfs-rbf U: hadoop-hdfs-project/hadoop-hdfs-rbf
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8448/2/console
versions git=2.43.0 maven=3.9.11 spotbugs=4.9.7
Powered by Apache Yetus 0.14.1 https://yetus.apache.org

This message was automatically generated.

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 53s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+0 🆗 xmllint 0m 0s xmllint was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ trunk Compile Tests _
+1 💚 mvninstall 48m 46s trunk passed
+1 💚 compile 1m 5s trunk passed with JDK Ubuntu-21.0.10+7-Ubuntu-124.04
+1 💚 compile 1m 33s trunk passed with JDK Ubuntu-17.0.18+8-Ubuntu-124.04.1
+1 💚 checkstyle 0m 57s trunk passed
+1 💚 mvnsite 1m 39s trunk passed
+1 💚 javadoc 1m 3s trunk passed with JDK Ubuntu-21.0.10+7-Ubuntu-124.04
+1 💚 javadoc 1m 0s trunk passed with JDK Ubuntu-17.0.18+8-Ubuntu-124.04.1
+1 💚 spotbugs 2m 34s trunk passed
+1 💚 shadedclient 35m 36s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 1m 7s the patch passed
+1 💚 compile 0m 39s the patch passed with JDK Ubuntu-21.0.10+7-Ubuntu-124.04
+1 💚 javac 0m 39s the patch passed
+1 💚 compile 1m 5s the patch passed with JDK Ubuntu-17.0.18+8-Ubuntu-124.04.1
+1 💚 javac 1m 5s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 0m 27s /results-checkstyle-hadoop-hdfs-project_hadoop-hdfs-rbf.txt hadoop-hdfs-project/hadoop-hdfs-rbf: The patch generated 3 new + 0 unchanged - 0 fixed = 3 total (was 0)
+1 💚 mvnsite 1m 12s the patch passed
+1 💚 javadoc 0m 32s the patch passed with JDK Ubuntu-21.0.10+7-Ubuntu-124.04
-1 ❌ javadoc 0m 33s /results-javadoc-javadoc-hadoop-hdfs-project_hadoop-hdfs-rbf-jdkUbuntu-17.0.18+8-Ubuntu-124.04.1.txt hadoop-hdfs-project_hadoop-hdfs-rbf-jdkUbuntu-17.0.18+8-Ubuntu-124.04.1 with JDK Ubuntu-17.0.18+8-Ubuntu-124.04.1 generated 1 new + 937 unchanged - 0 fixed = 938 total (was 937)
+1 💚 spotbugs 2m 16s the patch passed
+1 💚 shadedclient 34m 27s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 48m 56s hadoop-hdfs-rbf in the patch passed.
+1 💚 asflicense 0m 37s The patch does not generate ASF License warnings.
Total runtime: 188m 15s
Subsystem Report/Notes
Docker ClientAPI=1.54 ServerAPI=1.54 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8448/3/artifact/out/Dockerfile
GITHUB PR #8448
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets xmllint
uname Linux 0f998e8f35ff 5.15.0-173-generic #183-Ubuntu SMP Fri Mar 6 13:29:34 UTC 2026 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 2c3f471
Default Java Ubuntu-17.0.18+8-Ubuntu-124.04.1
Multi-JDK versions /usr/lib/jvm/java-21-openjdk-amd64:Ubuntu-21.0.10+7-Ubuntu-124.04 /usr/lib/jvm/java-17-openjdk-amd64:Ubuntu-17.0.18+8-Ubuntu-124.04.1
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8448/3/testReport/
Max. process+thread count 4093 (vs. ulimit of 10000)
modules C: hadoop-hdfs-project/hadoop-hdfs-rbf U: hadoop-hdfs-project/hadoop-hdfs-rbf
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8448/3/console
versions git=2.43.0 maven=3.9.11 spotbugs=4.9.7
Powered by Apache Yetus 0.14.1 https://yetus.apache.org

This message was automatically generated.

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 40s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 1s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+0 🆗 xmllint 0m 0s xmllint was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ trunk Compile Tests _
-1 ❌ mvninstall 3m 5s /branch-mvninstall-root.txt root in trunk failed.
-1 ❌ compile 0m 19s /branch-compile-hadoop-hdfs-project_hadoop-hdfs-rbf-jdkUbuntu-21.0.10+7-Ubuntu-124.04.txt hadoop-hdfs-rbf in trunk failed with JDK Ubuntu-21.0.10+7-Ubuntu-124.04.
-1 ❌ compile 0m 23s /branch-compile-hadoop-hdfs-project_hadoop-hdfs-rbf-jdkUbuntu-17.0.18+8-Ubuntu-124.04.1.txt hadoop-hdfs-rbf in trunk failed with JDK Ubuntu-17.0.18+8-Ubuntu-124.04.1.
-0 ⚠️ checkstyle 0m 21s /buildtool-branch-checkstyle-hadoop-hdfs-project_hadoop-hdfs-rbf.txt The patch fails to run checkstyle in hadoop-hdfs-rbf
-1 ❌ mvnsite 0m 25s /branch-mvnsite-hadoop-hdfs-project_hadoop-hdfs-rbf.txt hadoop-hdfs-rbf in trunk failed.
-1 ❌ javadoc 0m 17s /branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-rbf-jdkUbuntu-21.0.10+7-Ubuntu-124.04.txt hadoop-hdfs-rbf in trunk failed with JDK Ubuntu-21.0.10+7-Ubuntu-124.04.
-1 ❌ javadoc 0m 24s /branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-rbf-jdkUbuntu-17.0.18+8-Ubuntu-124.04.1.txt hadoop-hdfs-rbf in trunk failed with JDK Ubuntu-17.0.18+8-Ubuntu-124.04.1.
-1 ❌ spotbugs 0m 24s /branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-rbf.txt hadoop-hdfs-rbf in trunk failed.
-1 ❌ shadedclient 2m 55s branch has errors when building and testing our client artifacts.
_ Patch Compile Tests _
-1 ❌ mvninstall 0m 19s /patch-mvninstall-hadoop-hdfs-project_hadoop-hdfs-rbf.txt hadoop-hdfs-rbf in the patch failed.
-1 ❌ compile 0m 13s /patch-compile-hadoop-hdfs-project_hadoop-hdfs-rbf-jdkUbuntu-21.0.10+7-Ubuntu-124.04.txt hadoop-hdfs-rbf in the patch failed with JDK Ubuntu-21.0.10+7-Ubuntu-124.04.
-1 ❌ javac 0m 13s /patch-compile-hadoop-hdfs-project_hadoop-hdfs-rbf-jdkUbuntu-21.0.10+7-Ubuntu-124.04.txt hadoop-hdfs-rbf in the patch failed with JDK Ubuntu-21.0.10+7-Ubuntu-124.04.
-1 ❌ compile 0m 15s /patch-compile-hadoop-hdfs-project_hadoop-hdfs-rbf-jdkUbuntu-17.0.18+8-Ubuntu-124.04.1.txt hadoop-hdfs-rbf in the patch failed with JDK Ubuntu-17.0.18+8-Ubuntu-124.04.1.
-1 ❌ javac 0m 15s /patch-compile-hadoop-hdfs-project_hadoop-hdfs-rbf-jdkUbuntu-17.0.18+8-Ubuntu-124.04.1.txt hadoop-hdfs-rbf in the patch failed with JDK Ubuntu-17.0.18+8-Ubuntu-124.04.1.
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 0m 13s /buildtool-patch-checkstyle-hadoop-hdfs-project_hadoop-hdfs-rbf.txt The patch fails to run checkstyle in hadoop-hdfs-rbf
-1 ❌ mvnsite 0m 24s /patch-mvnsite-hadoop-hdfs-project_hadoop-hdfs-rbf.txt hadoop-hdfs-rbf in the patch failed.
-1 ❌ javadoc 0m 14s /patch-javadoc-hadoop-hdfs-project_hadoop-hdfs-rbf-jdkUbuntu-21.0.10+7-Ubuntu-124.04.txt hadoop-hdfs-rbf in the patch failed with JDK Ubuntu-21.0.10+7-Ubuntu-124.04.
-1 ❌ javadoc 0m 23s /patch-javadoc-hadoop-hdfs-project_hadoop-hdfs-rbf-jdkUbuntu-17.0.18+8-Ubuntu-124.04.1.txt hadoop-hdfs-rbf in the patch failed with JDK Ubuntu-17.0.18+8-Ubuntu-124.04.1.
-1 ❌ spotbugs 0m 23s /patch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-rbf.txt hadoop-hdfs-rbf in the patch failed.
+1 💚 shadedclient 4m 6s patch has no errors when building and testing our client artifacts.
_ Other Tests _
-1 ❌ unit 0m 24s /patch-unit-hadoop-hdfs-project_hadoop-hdfs-rbf.txt hadoop-hdfs-rbf in the patch failed.
+0 🆗 asflicense 0m 24s ASF License check generated no output?
Total runtime: 15m 11s
Subsystem Report/Notes
Docker ClientAPI=1.54 ServerAPI=1.54 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8448/4/artifact/out/Dockerfile
GITHUB PR #8448
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets xmllint
uname Linux be01ad7de18a 5.15.0-160-generic #170-Ubuntu SMP Wed Oct 1 10:06:56 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / c78afc8
Default Java Ubuntu-17.0.18+8-Ubuntu-124.04.1
Multi-JDK versions /usr/lib/jvm/java-21-openjdk-amd64:Ubuntu-21.0.10+7-Ubuntu-124.04 /usr/lib/jvm/java-17-openjdk-amd64:Ubuntu-17.0.18+8-Ubuntu-124.04.1
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8448/4/testReport/
Max. process+thread count 79 (vs. ulimit of 10000)
modules C: hadoop-hdfs-project/hadoop-hdfs-rbf U: hadoop-hdfs-project/hadoop-hdfs-rbf
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8448/4/console
versions git=2.43.0 maven=3.9.11
Powered by Apache Yetus 0.14.1 https://yetus.apache.org

This message was automatically generated.

Contributor

@ZanderXu ZanderXu left a comment

Thanks @kokonguyen191 for your contribution.

This PR makes sense to me. How about splitting it into two separate PRs?

  1. The first PR changes the queue to a bounded queue.
  2. The second PR adds metrics for the queue size.

@KeeProMise
Member

Hi @kokonguyen191 @hfutatzhanghb @ZanderXu I had the impression that DFS_ROUTER_ASYNC_RPC_MAX_ASYNCCALL_PERMIT_KEY could control the maximum number of requests. Is it possible to achieve the same effect using this?

@hfutatzhanghb
Member

I have the same question as @KeeProMise. @kokonguyen191, could you please clarify here?
IIUC, by the time acquirePermit is called, the Router has already put the RPC call into its call queue. So DFS_ROUTER_ASYNC_RPC_MAX_ASYNCCALL_PERMIT_KEY cannot control the maximum number of calls on the router side?
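
The concern above can be illustrated with a small, self-contained sketch. It uses no Hadoop classes: the `Semaphore` is a hypothetical stand-in for the permits limited by DFS_ROUTER_ASYNC_RPC_MAX_ASYNCCALL_PERMIT_KEY, and the class and method names are invented for illustration. It assumes, as this comment does, that a permit is acquired only after the call has already been queued.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.Semaphore;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PermitVsQueueSketch {

  /** Submits {@code calls} tasks while no permit is free and returns the queue depth. */
  static int queuedWhilePermitsExhausted(int calls) {
    // Hypothetical stand-in for the async call permits; none are available.
    Semaphore permits = new Semaphore(1);
    permits.acquireUninterruptibly();
    // One handler thread backed by an unbounded queue.
    ThreadPoolExecutor handlers = new ThreadPoolExecutor(
        1, 1, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
    try {
      for (int i = 0; i < calls; i++) {
        handlers.execute(() -> {
          // The permit is requested only once a handler thread picks the call up,
          // so it limits in-flight calls, not the number of queued calls.
          permits.acquireUninterruptibly();
          permits.release();
        });
      }
      return handlers.getQueue().size();
    } finally {
      permits.release();
      handlers.shutdownNow();
    }
  }

  public static void main(String[] args) {
    System.out.println("queued calls = " + queuedWhilePermitsExhausted(1_000));
  }
}
```

In this sketch the single handler thread blocks on the permit while every later call is still accepted into the queue, which is the behavior the question is asking about.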

image

@ZanderXu
Contributor

ZanderXu commented May 6, 2026

Thanks @KeeProMise @hfutatzhanghb for your review.

Let me try to explain my understanding of this issue.

  1. On the RBF side, there are two relevant thread groups: the 8888 handlers and the NS handlers inside the NS thread pool.
  2. The 8888 handlers receive client requests and submit them to the corresponding NS thread pool. These requests are put into the thread pool’s unbounded queue. The NS handlers then take requests from the queue and process them.
  3. This is basically a producer-consumer model. The 8888 handlers are the producers, and the NS handlers are the consumers. If the producers are faster than the consumers, the unbounded queue may keep growing and eventually exhaust memory. Once that happens, the whole RBF instance may become unavailable.
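
The producer-consumer model above can be sketched with plain `java.util.concurrent`, without any router code. This is only an illustration under stated assumptions: the `nsHandlers` pool stands in for an NS thread pool, the loop plays the role of the 8888 handlers, and a `CountDownLatch` that is never released during submission simulates a downstream NN that does not answer. All names are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class UnboundedQueueGrowthSketch {

  /** Submits {@code calls} tasks to a stalled pool and returns the backlog size. */
  static int backlogAfter(int calls) {
    // Stand-in for a downstream namenode that does not respond.
    CountDownLatch downstream = new CountDownLatch(1);
    // Two consumer threads with an unbounded work queue.
    ThreadPoolExecutor nsHandlers = new ThreadPoolExecutor(
        2, 2, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
    try {
      // Producer side: keeps enqueueing regardless of consumer progress.
      for (int i = 0; i < calls; i++) {
        nsHandlers.execute(() -> {
          try {
            downstream.await();
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        });
      }
      // Every call beyond the two running ones is held in memory.
      return nsHandlers.getQueue().size();
    } finally {
      downstream.countDown();
      nsHandlers.shutdownNow();
    }
  }

  public static void main(String[] args) {
    System.out.println("backlog = " + backlogAfter(10_000));
  }
}
```

The backlog grows linearly with the number of submitted calls, since nothing on the producer side ever pushes back.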

Unfortunately, there are several cases where the consumers can become slower than the producers:

  1. When NS permits are exhausted, NS handlers may have to wait for a permit, up to DFS_ROUTER_FAIRNESS_ACQUIRE_TIMEOUT.
  2. Even after an NS handler gets a permit, it may still get blocked while sending the request to the downstream NN. For example, the NN may be slow or unavailable because of GC, a large delete, a large rename, HA failover, or a machine failure.

In these cases, NS handlers cannot consume requests fast enough, but the 8888 handlers may continue accepting and enqueueing new requests. This can cause the unbounded queue to grow continuously and eventually trigger OOM.
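
For comparison, here is a minimal sketch of how a bounded queue caps the backlog. It is not the code in this patch: the capacity value, the use of `ArrayBlockingQueue`, and the `AbortPolicy` rejection handler are assumptions chosen for illustration, and all names are hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BoundedQueueSketch {

  /** Returns {accepted, rejected} counts after submitting {@code calls} tasks. */
  static int[] submit(int calls, int queueCapacity) {
    // Stand-in for a downstream namenode that does not respond.
    CountDownLatch downstream = new CountDownLatch(1);
    // Two handler threads; at most queueCapacity calls may wait in the queue.
    ThreadPoolExecutor nsHandlers = new ThreadPoolExecutor(
        2, 2, 0L, TimeUnit.MILLISECONDS,
        new ArrayBlockingQueue<>(queueCapacity),
        new ThreadPoolExecutor.AbortPolicy());
    Runnable call = () -> {
      try {
        downstream.await();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    };
    int accepted = 0;
    int rejected = 0;
    try {
      for (int i = 0; i < calls; i++) {
        try {
          nsHandlers.execute(call);
          accepted++;
        } catch (RejectedExecutionException e) {
          // Queue is full: the caller learns about overload immediately
          // instead of the backlog growing in memory.
          rejected++;
        }
      }
      return new int[] {accepted, rejected};
    } finally {
      downstream.countDown();
      nsHandlers.shutdownNow();
    }
  }

  public static void main(String[] args) {
    int[] result = submit(1_000, 100);
    System.out.println("accepted=" + result[0] + ", rejected=" + result[1]);
  }
}
```

With a bounded queue, memory held by waiting calls is limited by the capacity. What the producer should do on rejection (fail the client call, retry, or block) is a separate design decision that this sketch does not settle.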
