HDFS-17909. [ARR] AsyncRouterHandlerExecutors should use bounded queue#8448
kokonguyen191 wants to merge 4 commits into apache:trunk
@kokonguyen191 Thanks for pushing this. Could you please add more detail to the description about how this situation is triggered? Pseudocode or anything else that makes it clear would help. Thanks a lot.
Hi @hfutatzhanghb, thanks for taking a look at this patch. I have updated the description as well as the unit test for easier issue repro.
💔 -1 overall
This message was automatically generated.
ZanderXu left a comment
Thanks @kokonguyen191 for your contribution.
This PR makes sense to me. How about splitting this MR into two separate MRs?
- The first MR changes the queue to a bounded queue.
- The second MR adds metrics for the queue size.
Hi @kokonguyen191 @hfutatzhanghb @ZanderXu I had the impression that DFS_ROUTER_ASYNC_RPC_MAX_ASYNCCALL_PERMIT_KEY could control the maximum number of requests. Is it possible to achieve the same effect using this?
I have the same question as @KeeProMise. @kokonguyen191 Could you please clarify here?
Thanks @KeeProMise @hfutatzhanghb for your review. Let me try to explain my understanding of this issue.
Unfortunately, there are several cases where the consumers can become slower than the producers:
In these cases, NS handlers cannot consume requests fast enough, but the 8888 handlers may continue accepting and enqueueing new requests. This can cause the unbounded queue to grow continuously and eventually trigger OOM.
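To make the producer/consumer imbalance concrete, here is a minimal sketch using only `java.util.concurrent`. It is not the code from this patch: the class name `BoundedHandlerQueueSketch`, the method `submitBurst`, and the choice of `AbortPolicy` are assumptions made for illustration. A single worker is stalled on a latch (standing in for a slow namespace) while a producer keeps submitting, so a bounded queue starts rejecting once it is full instead of growing.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration; names are not taken from the Hadoop code base.
class BoundedHandlerQueueSketch {

  /** Submits {@code burst} tasks to one stalled worker; returns {accepted, rejected}. */
  static int[] submitBurst(int queueCapacity, int burst) {
    // Stand-in for a slow downstream namespace: every task blocks until released.
    CountDownLatch release = new CountDownLatch(1);
    ThreadPoolExecutor handlers = new ThreadPoolExecutor(
        1, 1, 0L, TimeUnit.MILLISECONDS,
        new ArrayBlockingQueue<>(queueCapacity),   // bounded backlog
        new ThreadPoolExecutor.AbortPolicy());     // reject when the backlog is full
    int accepted = 0;
    int rejected = 0;
    try {
      for (int i = 0; i < burst; i++) {
        try {
          handlers.execute(() -> {
            try {
              release.await();
            } catch (InterruptedException e) {
              Thread.currentThread().interrupt();
            }
          });
          accepted++;
        } catch (RejectedExecutionException e) {
          rejected++; // the caller can fail fast or apply back-pressure here
        }
      }
    } finally {
      // Unblock the worker and let the executor drain before returning.
      release.countDown();
      handlers.shutdown();
      try {
        handlers.awaitTermination(5, TimeUnit.SECONDS);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
    return new int[] {accepted, rejected};
  }

  public static void main(String[] args) {
    int[] r = submitBurst(4, 20);
    System.out.println("accepted=" + r[0] + " rejected=" + r[1]);
  }
}
```

With one worker and a capacity of 4, only the task held by the worker plus the four queued tasks are accepted; the rest are rejected, so the backlog cannot grow past the configured bound. Whether rejection, blocking, or some other back-pressure is the right policy for the router is a separate design choice.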

Description of PR
When many requests arrive at a router and downstream namespaces cannot keep up, a backlog builds. If the queue used by the async handlers has no limit, the router can eventually go OOM.
This situation can be simulated using the unit test attached with this patch.
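Independently of the attached unit test, the backlog growth itself can be sketched at the queue level. The snippet below is an assumption-laden toy, not the test from this patch: `QueueBacklogSimulation` and `backlogAfter` are hypothetical names, and "no consumer running" stands in for namespaces that have stopped draining requests.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical illustration of backlog growth; not the unit test from this patch.
class QueueBacklogSimulation {

  /** Offers {@code requests} payloads with no consumer running; returns the final queue depth. */
  static int backlogAfter(BlockingQueue<byte[]> queue, int requests, int payloadBytes) {
    for (int i = 0; i < requests; i++) {
      // offer() returns false without blocking when a bounded queue is full,
      // so the payload is dropped instead of being retained on the heap.
      queue.offer(new byte[payloadBytes]);
    }
    return queue.size();
  }

  public static void main(String[] args) {
    // LinkedBlockingQueue() defaults to Integer.MAX_VALUE capacity, i.e. effectively unbounded.
    int unbounded = backlogAfter(new LinkedBlockingQueue<>(), 10_000, 1024);
    int bounded = backlogAfter(new LinkedBlockingQueue<>(100), 10_000, 1024);
    System.out.println("unbounded depth=" + unbounded + ", bounded depth=" + bounded);
  }
}
```

The unbounded queue retains every payload, so its memory footprint scales with the number of pending requests, while the bounded queue stays at its configured capacity.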
[Figure: Rising memory usage from unbounded growing queue]

How was this patch tested?
This patch has been used in our prod cluster for over a year and everything's working properly.