Background Worker QoS#403

Open
lightsighter wants to merge 3 commits into main from mbauer-bgwork-qos

@lightsighter
Contributor

Add support for a lightweight dynamic analysis that will detect when background work items become unresponsive and could impact Realm's QoS.

… background worker items become unresponsive which can impact Realm QoS
@lightsighter lightsighter self-assigned this Feb 26, 2026
@codecov

codecov bot commented Feb 26, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 29.63%. Comparing base (42f7484) to head (c78b1c2).
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #403      +/-   ##
==========================================
+ Coverage   29.13%   29.63%   +0.49%     
==========================================
  Files         194      195       +1     
  Lines       40153    40236      +83     
  Branches    14548    14566      +18     
==========================================
+ Hits        11698    11923     +225     
+ Misses      28039    27886     -153     
- Partials      416      427      +11     

@lightsighter
Contributor Author

Debating whether this is actually a good idea or not. Both UCX and GASNet seem to be egregiously bad at obeying their budgets:

UCX:

[0 - 7ffbb17c5c40]    0.424742 {4}{bgwork}: work item 'ucp-poll' (slot 3) exceeded overrun threshold: elapsed=44885200 ns (448x), budget=100000 ns, limit=10000000 ns (100x) - runtime may appear unresponsive
[0 - 7ffbaffbec40]    0.424744 {4}{bgwork}: work item 'ucp-poll' (slot 4) exceeded overrun threshold: elapsed=44886931 ns (448x), budget=100000 ns, limit=10000000 ns (100x) - runtime may appear unresponsive
[0 - 7ffbb17c5c40]    0.471550 {4}{bgwork}: work item 'ucp-poll' (slot 4) exceeded overrun threshold: elapsed=46709503 ns (467x), budget=100000 ns, limit=10000000 ns (100x) - runtime may appear unresponsive
[0 - 7ffbb15bdc40]    0.471552 {4}{bgwork}: work item 'ucp-poll' (slot 3) exceeded overrun threshold: elapsed=46706277 ns (467x), budget=100000 ns, limit=10000000 ns (100x) - runtime may appear unresponsive
[0 - 7ffbb17c5c40]    0.508146 {4}{bgwork}: work item 'activemsg handler' (slot 1) exceeded overrun threshold: elapsed=36593767 ns (365x), budget=100000 ns, limit=10000000 ns (100x) - runtime may appear unresponsive
[0 - 7ffbb15bdc40]    0.508163 {4}{bgwork}: work item 'ucp-poll' (slot 3) exceeded overrun threshold: elapsed=36501960 ns (365x), budget=100000 ns, limit=10000000 ns (100x) - runtime may appear unresponsive
[0 - 7ffbaffbec40]    0.508173 {4}{bgwork}: work item 'ucp-poll' (slot 4) exceeded overrun threshold: elapsed=36507922 ns (365x), budget=100000 ns, limit=10000000 ns (100x) - runtime may appear unresponsive
[0 - 7ffbb15bdc40]    0.580208 {4}{bgwork}: work item 'ucp-poll' (slot 4) exceeded overrun threshold: elapsed=12345439 ns (123x), budget=100000 ns, limit=10000000 ns (100x) - runtime may appear unresponsive
[1 - 7ffbb16c1c40]    0.047502 {4}{bgwork}: work item 'ucp-poll' (slot 4) exceeded overrun threshold: elapsed=46422388 ns (464x), budget=100000 ns, limit=10000000 ns (100x) - runtime may appear unresponsive
[1 - 7ffbb17c5c40]    0.167045 {4}{bgwork}: work item 'ucp-poll' (slot 3) exceeded overrun threshold: elapsed=166015510 ns (1660x), budget=100000 ns, limit=10000000 ns (100x) - runtime may appear unresponsive

GASNetEX:

[0 - 7ffde755cc80]   13.347686 {4}{bgwork}: work item 'gex-poll' (slot 0) exceeded overrun threshold: elapsed=178402244 ns (1784x), budget=100000 ns, limit=10000000 ns (100x) - runtime may appear unresponsive
[0 - 7ffde575cc80]   13.387510 {4}{bgwork}: work item 'gex-poll' (slot 0) exceeded overrun threshold: elapsed=35617292 ns (356x), budget=100000 ns, limit=10000000 ns (100x) - runtime may appear unresponsive
[1 - 7ffff0167c80]    0.040423 {4}{bgwork}: work item 'gex-poll' (slot 0) exceeded overrun threshold: elapsed=39511915 ns (395x), budget=100000 ns, limit=10000000 ns (100x) - runtime may appear unresponsive
[1 - 7ffff0167c80]    0.057339 {4}{bgwork}: work item 'activemsg handler' (slot 6) exceeded overrun threshold: elapsed=16894922 ns (168x), budget=100000 ns, limit=10000000 ns (100x) - runtime may appear unresponsive
[1 - 7ffde5658c80]    0.103671 {4}{bgwork}: work item 'gex-poll' (slot 0) exceeded overrun threshold: elapsed=63149503 ns (631x), budget=100000 ns, limit=10000000 ns (100x) - runtime may appear unresponsive
[1 - 7ffff0167c80]    0.171708 {4}{bgwork}: work item 'gex-poll' (slot 0) exceeded overrun threshold: elapsed=68027536 ns (680x), budget=100000 ns, limit=10000000 ns (100x) - runtime may appear unresponsive
[1 - 7ffde5658c80]    0.202669 {4}{bgwork}: work item 'gex-poll' (slot 0) exceeded overrun threshold: elapsed=30763192 ns (307x), budget=100000 ns, limit=10000000 ns (100x) - runtime may appear unresponsive
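
For readers decoding the log lines above, here is a minimal sketch of how the reported numbers appear to relate. It is reconstructed from the log output only, not taken from the patch: `OverrunConfig` and `check_overrun` are hypothetical names, and both the strict `>` comparison and the truncating integer division for the `(448x)` factor are assumptions that happen to be consistent with the logged values.

```cpp
#include <cstdio>
#include <string>

// Hypothetical illustration, not Realm code: a budget plus an allowed
// overrun multiplier gives the limit reported in the log lines.
struct OverrunConfig {
  long long budget_ns;   // per-invocation time budget (100000 ns in the logs)
  long long multiplier;  // allowed overrun factor (100x in the logs)
  long long limit_ns() const { return budget_ns * multiplier; }
};

// Returns true and fills 'msg' when 'elapsed_ns' exceeds the limit.
// Assumption: the comparison is strict and the "Nx" factor is elapsed/budget
// with integer truncation.
bool check_overrun(const OverrunConfig &cfg, const char *name, int slot,
                   long long elapsed_ns, std::string &msg)
{
  const long long limit = cfg.limit_ns();
  if(elapsed_ns <= limit)
    return false;
  char buf[256];
  std::snprintf(buf, sizeof(buf),
                "work item '%s' (slot %d) exceeded overrun threshold: "
                "elapsed=%lld ns (%lldx), budget=%lld ns, limit=%lld ns (%lldx)",
                name, slot, elapsed_ns, elapsed_ns / cfg.budget_ns,
                cfg.budget_ns, limit, cfg.multiplier);
  msg = buf;
  return true;
}
```

With a budget of 100000 ns and a multiplier of 100, the limit is 10000000 ns, so the first UCX entry (elapsed=44885200 ns) overruns its budget by a factor of 448 and the limit by roughly 4.5x.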

long long t_stop = Clock::current_time_in_nanoseconds(true /*absolute*/);
long long elapsed = t_stop - t_start;

// overrun detection (limit precomputed during configure)
Contributor

We need rate limiting on the warnings here. A consistently slow work item will severely spam the log.
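
One possible shape for the rate limiting suggested here, as a rough sketch only: `WarningRateLimiter` is a hypothetical helper, not something in Realm or in this patch. It allows at most one warning per interval and counts the ones it drops, so the next emitted warning can report how many were suppressed. It is not thread-safe as written; since the logs show several worker threads reporting the same work item, a real version would need atomics or a lock.

```cpp
// Hypothetical helper, not part of Realm: emit at most one warning per
// 'interval_ns' and count how many were suppressed in between.
class WarningRateLimiter {
public:
  explicit WarningRateLimiter(long long interval_ns)
    : interval_ns_(interval_ns) {}

  // Returns true if a warning may be emitted at time 'now_ns'.
  // On true, 'suppressed' receives the number of warnings dropped since
  // the previously emitted one; on false, 'suppressed' is left untouched.
  bool should_emit(long long now_ns, unsigned &suppressed)
  {
    if(has_emitted_ && (now_ns - last_emit_ns_) < interval_ns_) {
      suppressed_++;
      return false;
    }
    suppressed = suppressed_;
    suppressed_ = 0;
    last_emit_ns_ = now_ns;
    has_emitted_ = true;
    return true;
  }

private:
  long long interval_ns_;
  long long last_emit_ns_ = 0;
  unsigned suppressed_ = 0;
  bool has_emitted_ = false;
};
```

One instance per work item slot would keep a single slow item from hiding warnings about the others.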

Contributor Author

So I was hoping that this warning would almost never fire, and if it did, I would actually want it to be very loud and verbose, because the expectation is that it should never happen. Given how often I'm encountering this in Flash Cache, I think we might need to discuss a more robust solution to the liveness problem rather than just issuing warnings, which is why I'm not pushing to merge this pull request right away.

Contributor Author

I'll add that if we do end up merging it, I don't think it should have a rate limit. But we should only merge it if we can find a timeout that we agree is reasonable for all Realm programs to operate under, and that we expect all the different modules serviced by background worker threads to meet.

Contributor

Okay, please let me know when you need a review on this. Otherwise I don't know what the status of this PR is.


3 participants