Skip to content

Replace fixed sleep with flush in user_drv shutdown#10808

Open
jtdowney wants to merge 2 commits intoerlang:maintfrom
jtdowney:user-drv-flush-shutdown
Open

Replace fixed sleep with flush in user_drv shutdown#10808
jtdowney wants to merge 2 commits intoerlang:maintfrom
jtdowney:user-drv-flush-shutdown

Conversation

@jtdowney
Copy link

@jtdowney jtdowney commented Mar 6, 2026

When init:stop() is called, user_sup:terminate/2 needs to give pending I/O a chance to reach the terminal before it kills the user process. The current implementation does this with receive after 1000 -> ok end, an unconditional one-second sleep. This works, but it means every shutdown pays a full second of latency, whether or not there is actually output in flight. On an idle node, the delay is pure waste.

This change introduces user_drv:flush/0, which enqueues a synchronous put_chars_sync request carrying an empty binary into the existing I/O queue. Because the queue is processed in order, the call returns only after every preceding write has been acknowledged by the writer process. user_sup:terminate/2 now calls flush() instead of sleeping, so shutdown completes as soon as the queue drains rather than always waiting the full second. The flush call uses a one-second timeout and catches all failures, so it degrades gracefully if user_drv is already gone or unresponsive. The worst case is the same one-second delay as before.

@CLAassistant
Copy link

CLAassistant commented Mar 6, 2026

CLA assistant check
All committers have signed the CLA.

@github-actions
Copy link
Contributor

github-actions bot commented Mar 6, 2026

CT Test Results

    2 files     72 suites   1h 4m 22s ⏱️
1 687 tests 1 343 ✅ 344 💤 0 ❌
1 939 runs  1 537 ✅ 402 💤 0 ❌

Results for commit d59b46f.

♻️ This comment has been updated with latest results.

To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass.

See the TESTING and DEVELOPMENT HowTo guides for details about how to run test locally.

Artifacts

// Erlang/OTP Github Action Bot

@frazze-jobb
Copy link
Contributor

I ran this locally and run_erl_SUITE:heavy fails just like the other PR

@jtdowney
Copy link
Author

I've had some time to dig into this and figure out what I think is going on. I got suspicious that run_erl_SUITE:heavy would fail, but not run_erl_SUITE:heavier. I wasn't able to get it to consistently fail on macOS, but once I switched to a Linux server, it did fail. That also led me to think that this may actually be related to buffering and system issues, not necessarily the code under test. The baseline behaviour of 1s sleep gives the port enough time to receive the child's output and deliver it to the mailbox for counting. While this PR takes a more conservative approach than #10776, neither keeps the child around long enough for the parent process to deliver the buffer for counting, even though the output has been delivered to the operating system and is available to the parent. I added a 10ms wait to the test in de6f52a, which has made this test consistently pass on macOS and Linux for me. It, however, also let #10776 pass.

@garazdawi
Copy link
Contributor

I was surprised that the tests started failing when the sleep was removed as output should be synchronous. So I think then that the change that was made in #10776 is enough together with the testcase fix you did here.

@jtdowney jtdowney force-pushed the user-drv-flush-shutdown branch from de6f52a to 72d6da9 Compare March 16, 2026 14:16
@jtdowney
Copy link
Author

@garazdawi I agree. I pushed an updated tree with the commit from @tsloughter in #10776 cherry-picked along with the test update I made.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

team:VM Assigned to OTP team VM

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants