[native] Track CPU & Memory overload in native worker. #24949

Merged
merged 1 commit into from
May 1, 2025
27 changes: 27 additions & 0 deletions presto-docs/src/main/sphinx/presto_cpp/properties.rst
@@ -578,6 +578,33 @@ The default value of 60 gb is calculated based on available machine memory of 64
Specifies the amount of memory to shrink when the memory pushback is
triggered. This only applies if ``system-mem-pushback-enabled`` is ``true``.

``system-mem-pushback-abort-enabled``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* **Type:** ``boolean``
* **Default value:** ``false``

If true, memory pushback will abort queries with the largest memory usage under
low memory condition. This only applies if ``system-mem-pushback-enabled`` is ``true``.

``worker-overloaded-threshold-mem-gb``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* **Type:** ``integer``
* **Default value:** ``0``
Contributor:

It seems a bit odd to use 0 as the default threshold. Can we use something like -1 to indicate that it can be ignored?

Contributor Author:

@aditi-pandit
Thanks for reviewing this PR. It seems hard to find anyone brave enough to review a small change these days.

Using zero for 'no-op' is a practice we already use in the config.
See lines:
252 for kDriverCancelTasksWithStuckOperatorsThresholdMs
296 for kSystemMemLimitGb
365 for kAsyncCachePersistenceInterval
451 for kSharedArbitratorMaxMemoryArbitrationTime
532 for kSharedArbitratorGlobalArbitrationMemoryReclaimPct

It also makes a certain semantic sense: a threshold of 0 is meaningless for a quantity that cannot be negative, so 0 can serve as the 'magic number'.


Memory threshold in GB above which the worker is considered overloaded in terms of
memory use. Ignored if zero.

``worker-overloaded-threshold-cpu-pct``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* **Type:** ``integer``
* **Default value:** ``0``

CPU threshold in % above which the worker is considered overloaded in terms of
CPU use. Ignored if zero.
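The semantics of the two properties above can be sketched as follows. This is a minimal illustration, not code from the PR: the helper names `thresholdBytes`, `isMemOverloaded`, and `isCpuOverloaded` are hypothetical. It assumes, as the `PrestoServer.cpp` change in this PR does, that the memory threshold is converted with a factor of 1024 * 1024 * 1024 and that the comparison is strictly greater-than.

```cpp
#include <cassert>
#include <cstdint>

// Conversion factor used for the GB-valued property (1024^3 bytes).
constexpr uint64_t kBytesPerGb = 1024ULL * 1024 * 1024;

// Hypothetical helper: converts a GB threshold to bytes. Zero stays zero,
// which is the "ignored" value.
inline uint64_t thresholdBytes(uint64_t thresholdGb) {
  // Guard against overflow for absurdly large configured values.
  assert(thresholdGb <= UINT64_MAX / kBytesPerGb);
  return thresholdGb * kBytesPerGb;
}

// Hypothetical helper: true only if a threshold is configured (non-zero)
// and the used memory is strictly above it.
inline bool isMemOverloaded(uint64_t usedBytes, uint64_t thresholdGb) {
  const uint64_t limit = thresholdBytes(thresholdGb);
  return limit > 0 && usedBytes > limit;
}

// Hypothetical helper: same rule for the CPU percentage threshold.
inline bool isCpuOverloaded(double usedPct, uint32_t thresholdPct) {
  return thresholdPct > 0 && usedPct > thresholdPct;
}
```

With both properties left at their default of `0`, neither helper ever reports overload, which matches the "Ignored if zero" wording.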

Environment Variables As Values For Worker Properties
-----------------------------------------------------

47 changes: 47 additions & 0 deletions presto-native-execution/presto_cpp/main/PrestoServer.cpp
@@ -1426,9 +1426,56 @@ void PrestoServer::populateMemAndCPUInfo() {
  });
  RECORD_METRIC_VALUE(kCounterNumQueryContexts, numContexts);
  cpuMon_.update();
  checkOverload();
  **memoryInfo_.wlock() = std::move(memoryInfo);
}

void PrestoServer::checkOverload() {
Contributor:

Maybe offload all of this into a separate class?

Contributor Author:

@tanjialiang
I would rather keep this logic here.
Next, the Server will need to tell TaskManager when we become overloaded or stop being overloaded, so that it can act on the Task queue.
Doing this in some other side class seems wrong.
Spinning up a class just to decide whether we are overloaded based on a couple of metrics seems like too much.

If this grows over time, we can refactor.

Contributor:

We are adding more and more to PrestoServer now. It would be easier to read and maintain if we moved this logic to a separate class, maybe called SystemMonitor. We could start with this one and maybe move a few other things from the current PrestoServer there as well.

Contributor Author:

I disagree.
With many files, each containing a bit of logic, the code is harder to navigate and understand.
In my opinion, this code does not deserve its own class/file yet.

  auto systemConfig = SystemConfig::instance();

  const auto overloadedThresholdMemBytes =
      systemConfig->workerOverloadedThresholdMemGb() * 1024 * 1024 * 1024;
  if (overloadedThresholdMemBytes > 0) {
    const auto currentUsedMemoryBytes = (memoryChecker_ != nullptr)
        ? memoryChecker_->cachedSystemUsedMemoryBytes()
        : 0;
    const bool isMemOverloaded =
        (currentUsedMemoryBytes > overloadedThresholdMemBytes);
    if (isMemOverloaded) {
      LOG(WARNING) << "Server memory is overloaded. Currently used: "
                   << velox::succinctBytes(currentUsedMemoryBytes)
                   << ", threshold: "
                   << velox::succinctBytes(overloadedThresholdMemBytes);
    } else if (isMemOverloaded_) {
      LOG(INFO) << "Server memory is no longer overloaded. Currently used: "
                << velox::succinctBytes(currentUsedMemoryBytes)
                << ", threshold: "
                << velox::succinctBytes(overloadedThresholdMemBytes);
    }
    RECORD_METRIC_VALUE(kCounterOverloadedMem, isMemOverloaded ? 100 : 0);
Contributor:

Curious why the values are 0 and 100 rather than a range of values in between, or why not just 0 and 1.

Contributor Author:

@amitkdutta
Because when the counters backend averages the signal over some time period, say 60 seconds, a 1 will turn into 0 and we won't see any signal.
A value of 100 does not have this issue, and when the averaged number is below 100 it is easy to read it as the percentage of the period during which the system was overloaded.
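The averaging argument can be illustrated with a small sketch. It assumes the metrics backend averages with integer arithmetic, which is one plausible reading of "the 1 will turn into 0" but is not confirmed in this thread. The helpers `integerAverage` and `makeWindow` are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Integer average over a window of samples, as a backend with integer
// arithmetic might compute it (assumption, see the text above).
inline int64_t integerAverage(const std::vector<int64_t>& samples) {
  assert(!samples.empty());
  int64_t sum = 0;
  for (const auto sample : samples) {
    sum += sample;
  }
  return sum / static_cast<int64_t>(samples.size());
}

// Builds a window of 'total' samples in which the first 'overloadedCount'
// samples report 'onValue' and the rest report 0.
inline std::vector<int64_t> makeWindow(
    size_t overloadedCount,
    size_t total,
    int64_t onValue) {
  assert(overloadedCount <= total);
  std::vector<int64_t> window(total, 0);
  for (size_t i = 0; i < overloadedCount; ++i) {
    window[i] = onValue;
  }
  return window;
}
```

For a 60-sample window with 15 overloaded samples, `integerAverage(makeWindow(15, 60, 1))` yields 0, while `integerAverage(makeWindow(15, 60, 100))` yields 25, which reads directly as "overloaded for 25% of the period".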

    isMemOverloaded_ = isMemOverloaded;
  }

  const auto overloadedThresholdCpuPct =
      systemConfig->workerOverloadedThresholdCpuPct();
  if (overloadedThresholdCpuPct > 0) {
    const auto currentUsedCpuPct = cpuMon_.getCPULoadPct();
    const bool isCpuOverloaded =
        (currentUsedCpuPct > overloadedThresholdCpuPct);
    if (isCpuOverloaded) {
      LOG(WARNING) << "Server CPU is overloaded. Currently used: "
                   << currentUsedCpuPct
                   << "%, threshold: " << overloadedThresholdCpuPct << "%";
    } else if (isCpuOverloaded_) {
      LOG(INFO) << "Server CPU is no longer overloaded. Currently used: "
                << currentUsedCpuPct
                << "%, threshold: " << overloadedThresholdCpuPct << "%";
    }
    RECORD_METRIC_VALUE(kCounterOverloadedCpu, isCpuOverloaded ? 100 : 0);
    isCpuOverloaded_ = isCpuOverloaded;
  }
}
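The memory and CPU branches of `checkOverload()` share one pattern: compare against a threshold, warn on every overloaded sample, and log the recovery once when the previous sample was overloaded. A standalone sketch of that pattern, free of Presto and glog dependencies, might look like the following. `OverloadTracker` and `OverloadEvent` are hypothetical names introduced here and are not part of the PR.

```cpp
#include <cassert>
#include <cstdint>

// What a single check would log, mirroring the branches in checkOverload():
// kOverloaded ~ LOG(WARNING), kRecovered ~ LOG(INFO), kNone ~ no log line.
enum class OverloadEvent { kNone, kOverloaded, kRecovered };

// Hypothetical helper: tracks one resource against one threshold.
// A threshold of 0 disables the check.
class OverloadTracker {
 public:
  explicit OverloadTracker(uint64_t threshold) : threshold_(threshold) {}

  OverloadEvent update(uint64_t current) {
    if (threshold_ == 0) {
      return OverloadEvent::kNone;
    }
    const bool overloaded = current > threshold_;
    OverloadEvent event = OverloadEvent::kNone;
    if (overloaded) {
      event = OverloadEvent::kOverloaded;
    } else if (overloaded_) {
      event = OverloadEvent::kRecovered;
    }
    overloaded_ = overloaded;
    return event;
  }

  // Value that would be exported to the 0-or-100 counter. Unlike the PR,
  // which records nothing when the threshold is 0, this sketch returns 0.
  int metricValue() const {
    return overloaded_ ? 100 : 0;
  }

 private:
  const uint64_t threshold_;
  bool overloaded_{false};
};

// Walks one tracker through overload and recovery.
inline void overloadTrackerExample() {
  OverloadTracker cpu(90);
  assert(cpu.update(50) == OverloadEvent::kNone);
  assert(cpu.update(95) == OverloadEvent::kOverloaded);
  assert(cpu.update(97) == OverloadEvent::kOverloaded); // warned again
  assert(cpu.update(60) == OverloadEvent::kRecovered);
  assert(cpu.update(60) == OverloadEvent::kNone);
}
```

Note that the warning fires on every overloaded sample rather than only on the transition into overload, which is how the code in this PR behaves as well.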

static protocol::Duration getUptime(
std::chrono::steady_clock::time_point& start) {
auto seconds = std::chrono::duration_cast<std::chrono::seconds>(
4 changes: 4 additions & 0 deletions presto-native-execution/presto_cpp/main/PrestoServer.h
@@ -221,6 +221,8 @@ class PrestoServer {

std::unique_ptr<velox::cache::SsdCache> setupSsdCache();

void checkOverload();

const std::string configDirectoryPath_;

std::shared_ptr<CoordinatorDiscoverer> coordinatorDiscoverer_;
@@ -273,6 +275,8 @@ class PrestoServer {
std::unique_ptr<PeriodicTaskManager> periodicTaskManager_;
std::unique_ptr<PrestoServerOperations> prestoServerOperations_;
std::unique_ptr<PeriodicMemoryChecker> memoryChecker_;
bool isMemOverloaded_{false};
bool isCpuOverloaded_{false};

// We update these members asynchronously and return in http requests w/o
// delay.
10 changes: 10 additions & 0 deletions presto-native-execution/presto_cpp/main/common/Configs.cpp
@@ -183,6 +183,8 @@ SystemConfig::SystemConfig() {
NUM_PROP(kSystemMemShrinkGb, 8),
BOOL_PROP(kMallocMemHeapDumpEnabled, false),
BOOL_PROP(kSystemMemPushbackAbortEnabled, false),
NUM_PROP(kWorkerOverloadedThresholdMemGb, 0),
NUM_PROP(kWorkerOverloadedThresholdCpuPct, 0),
NUM_PROP(kMallocHeapDumpThresholdGb, 20),
NUM_PROP(kMallocMemMinHeapDumpInterval, 10),
NUM_PROP(kMallocMemMaxHeapDumpFiles, 5),
@@ -499,6 +501,14 @@ bool SystemConfig::systemMemPushBackAbortEnabled() const {
  return optionalProperty<bool>(kSystemMemPushbackAbortEnabled).value();
}

uint64_t SystemConfig::workerOverloadedThresholdMemGb() const {
  return optionalProperty<uint64_t>(kWorkerOverloadedThresholdMemGb).value();
}

uint32_t SystemConfig::workerOverloadedThresholdCpuPct() const {
  return optionalProperty<uint32_t>(kWorkerOverloadedThresholdCpuPct).value();
}

bool SystemConfig::mallocMemHeapDumpEnabled() const {
  return optionalProperty<bool>(kMallocMemHeapDumpEnabled).value();
}
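The getters above pair with the `NUM_PROP(..., 0)` registrations earlier in this file: a property has a registered default and a typed accessor. The following is a rough sketch of that pairing using only the standard library. `MiniConfig` is a hypothetical stand-in and does not reflect how `SystemConfig`, `NUM_PROP`, or `optionalProperty` are actually implemented.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for a config registry; not the real SystemConfig.
class MiniConfig {
 public:
  MiniConfig() {
    // Registered defaults, mirroring the two NUM_PROP(..., 0) entries.
    defaults_["worker-overloaded-threshold-mem-gb"] = "0";
    defaults_["worker-overloaded-threshold-cpu-pct"] = "0";
  }

  // Overrides a registered property, as a config file entry would.
  void set(const std::string& key, const std::string& value) {
    assert(defaults_.count(key) == 1 && "unknown property");
    overrides_[key] = value;
  }

  uint64_t workerOverloadedThresholdMemGb() const {
    return std::stoull(lookup("worker-overloaded-threshold-mem-gb"));
  }

  uint32_t workerOverloadedThresholdCpuPct() const {
    return static_cast<uint32_t>(
        std::stoul(lookup("worker-overloaded-threshold-cpu-pct")));
  }

 private:
  // Returns the override if present, otherwise the registered default.
  const std::string& lookup(const std::string& key) const {
    const auto it = overrides_.find(key);
    if (it != overrides_.end()) {
      return it->second;
    }
    return defaults_.at(key);
  }

  std::unordered_map<std::string, std::string> defaults_;
  std::unordered_map<std::string, std::string> overrides_;
};
```

A freshly constructed `MiniConfig` returns 0 from both getters, so overload tracking stays off until a value is set, for example `set("worker-overloaded-threshold-cpu-pct", "90")`.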
15 changes: 14 additions & 1 deletion presto-native-execution/presto_cpp/main/common/Configs.h
@@ -299,12 +299,21 @@ class SystemConfig : public ConfigBase {
/// get the server out of low memory condition. This only applies if
/// 'system-mem-pushback-enabled' is true.
static constexpr std::string_view kSystemMemShrinkGb{"system-mem-shrink-gb"};
-  /// If true, memory pushback will quickly abort queries with the most memory
+  /// If true, memory pushback will abort queries with the largest memory
/// usage under low memory condition. This only applies if
/// 'system-mem-pushback-enabled' is set.
static constexpr std::string_view kSystemMemPushbackAbortEnabled{
"system-mem-pushback-abort-enabled"};

/// Memory threshold in GB above which the worker is considered overloaded.
/// Ignored if zero. Default is zero.
static constexpr std::string_view kWorkerOverloadedThresholdMemGb{
"worker-overloaded-threshold-mem-gb"};
/// CPU threshold in % above which the worker is considered overloaded.
/// Ignored if zero. Default is zero.
static constexpr std::string_view kWorkerOverloadedThresholdCpuPct{
"worker-overloaded-threshold-cpu-pct"};

/// If true, memory allocated via malloc is periodically checked and a heap
/// profile is dumped if usage exceeds 'malloc-heap-dump-gb-threshold'.
static constexpr std::string_view kMallocMemHeapDumpEnabled{
@@ -828,6 +837,10 @@ class SystemConfig : public ConfigBase {

bool systemMemPushBackAbortEnabled() const;

uint64_t workerOverloadedThresholdMemGb() const;

uint32_t workerOverloadedThresholdCpuPct() const;

bool mallocMemHeapDumpEnabled() const;

uint32_t mallocHeapDumpThresholdGb() const;
2 changes: 2 additions & 0 deletions presto-native-execution/presto_cpp/main/common/Counters.cpp
@@ -84,6 +84,8 @@ void registerPrestoMetrics() {
kCounterNumBlockedWaitForConnectorDrivers,
facebook::velox::StatType::AVG);
DEFINE_METRIC(kCounterNumBlockedYieldDrivers, facebook::velox::StatType::AVG);
DEFINE_METRIC(kCounterOverloadedMem, facebook::velox::StatType::AVG);
DEFINE_METRIC(kCounterOverloadedCpu, facebook::velox::StatType::AVG);
DEFINE_METRIC(kCounterNumStuckDrivers, facebook::velox::StatType::AVG);
DEFINE_METRIC(
kCounterTotalPartitionedOutputBuffer, facebook::velox::StatType::AVG);
7 changes: 7 additions & 0 deletions presto-native-execution/presto_cpp/main/common/Counters.h
@@ -108,6 +108,13 @@ constexpr folly::StringPiece kCounterNumBlockedYieldDrivers{
constexpr folly::StringPiece kCounterNumStuckDrivers{
"presto_cpp.num_stuck_drivers"};

/// Worker exports 0 or 100 for this counter. 0 meaning not memory overloaded
/// and 100 meaning memory overloaded.
constexpr folly::StringPiece kCounterOverloadedMem{"presto_cpp.overloaded_mem"};
/// Worker exports 0 or 100 for this counter. 0 meaning not CPU overloaded
/// and 100 meaning CPU overloaded.
constexpr folly::StringPiece kCounterOverloadedCpu{"presto_cpp.overloaded_cpu"};

/// Number of total OutputBuffer managed by all
/// OutputBufferManager
constexpr folly::StringPiece kCounterTotalPartitionedOutputBuffer{