… background worker items become unresponsive which can impact Realm QoS
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #403      +/-   ##
==========================================
+ Coverage   29.13%   29.63%   +0.49%
==========================================
  Files         194      195       +1
  Lines       40153    40236      +83
  Branches    14548    14566      +18
==========================================
+ Hits        11698    11923     +225
+ Misses      28039    27886     -153
- Partials      416      427      +11
Debating whether this is actually a good idea or not. Both UCX and GASNet seem to be egregiously bad at obeying their budgets: UCX: GASNetEX:
long long t_stop = Clock::current_time_in_nanoseconds(true /*absolute*/);
long long elapsed = t_stop - t_start;

// overrun detection (limit precomputed during configure)
We need rate limiting on the warnings here. A consistently slow work item will severely spam the log.
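To make the rate-limiting suggestion concrete, here is a minimal sketch of one way it could work. `WarningRateLimiter` and `should_emit` are hypothetical names, not Realm APIs, and the sketch assumes one limiter per worker thread (it is not thread-safe). It allows at most one warning per interval and reports how many were dropped in between, so a slow work item is still visible without flooding the log.

```cpp
// Hypothetical helper, not part of Realm: allows at most one warning per
// min_interval_ns and counts the warnings it suppresses in between.
// Not thread-safe; this sketch assumes one instance per worker thread.
class WarningRateLimiter {
public:
  explicit WarningRateLimiter(long long min_interval_ns)
    : min_interval_ns_(min_interval_ns) {}

  // Returns true if a warning may be emitted at time now_ns.
  // When it returns true, suppressed_out receives the number of warnings
  // dropped since the previously emitted one.
  bool should_emit(long long now_ns, unsigned long long &suppressed_out)
  {
    if(has_emitted_ && (now_ns - last_emit_ns_) < min_interval_ns_) {
      ++suppressed_;
      return false;
    }
    suppressed_out = suppressed_;
    suppressed_ = 0;
    last_emit_ns_ = now_ns;
    has_emitted_ = true;
    return true;
  }

private:
  long long min_interval_ns_;
  long long last_emit_ns_ = 0;
  bool has_emitted_ = false;
  unsigned long long suppressed_ = 0;
};
```

The caller would pass the same timestamp it already takes for the overrun check, and could append the suppressed count to the warning text.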
I was hoping that this warning would almost never fire, and that if it did, it would be very loud and verbose, because the expectation is that it should never happen. Given how often I'm encountering this in Flash Cache, I think we might need to discuss a more robust solution to the liveness problem rather than just issuing warnings, which is why I'm not pushing to merge this pull request right away.
I'll add: if we do end up merging it, I don't think it should have a rate limit. But we should only merge it if we can agree on a timeout that is reasonable for all Realm programs to operate under and that we expect all the modules serviced by background worker threads to meet.
Okay, please let me know when you need a review on this. Otherwise I don't know what the status of this PR is.
Add support for a lightweight dynamic analysis that will detect when background work items become unresponsive and could impact Realm's QoS.