Skip to content

[native] Track CPU & Memory overload in native worker. #24949

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

spershin
Copy link
Contributor

@spershin spershin commented Apr 21, 2025

Description

We introduce new set of config properties to setup cpu and memory use thresholds above which we will consider worker overloaded.
Default values are zero, meaning we are not tracking it.

Motivation and Context

We want to spin up a system where Presto is aware of the workers' load and acts accordingly.

Impact

The change is harmless - only tracking the overload (optionally).

== NO RELEASE NOTE ==

@spershin spershin requested a review from a team as a code owner April 21, 2025 23:36
@prestodb-ci prestodb-ci added the from:Meta PR from Meta label Apr 21, 2025
@spershin spershin force-pushed the TrackCpuMemNativeWorkerOverload branch from 788ce9d to fb795d3 Compare April 21, 2025 23:55
<< ", threshold: "
<< velox::succinctBytes(overloadedThresholdMemBytes);
}
RECORD_METRIC_VALUE(kCounterOverloadedMem, isMemOverloaded ? 100 : 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious why is 0 and 100, but not a range of values in between. or why not just 0 and 1.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amitkdutta
Because when the counters backend starts averaging the signal over some time period, say 60 seconds the 1 will turn into 0 and we won't see any signal.
100 however won't have such issue and will be easy to reason about how much % of the period the system was overloaded if the number is below 100.

**memoryInfo_.wlock() = std::move(memoryInfo);
}

void PrestoServer::checkOverload() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe offloading all these in a separate class?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tanjialiang
I would rather keep this logic here.
Next we will need Server to tell TaskManager that we are overloaded or no longer, so it can act on the Task queue.
Doing this in some another side class seems wrong.
Spinning up the class to just decide if we are overloaded based on a couple of metrics seems too much.

If that somehow grows over time, then we can refactor.

@steveburnett
Copy link
Contributor

New configuration properties should be documented. Suggest adding new doc in https://github.com/prestodb/presto/blob/master/presto-docs/src/main/sphinx/presto_cpp/properties.rst.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
from:Meta PR from Meta
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants