-
Notifications
You must be signed in to change notification settings - Fork 5.4k
[native] Track CPU & Memory overload in native worker. #24949
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1426,9 +1426,56 @@ void PrestoServer::populateMemAndCPUInfo() { | |
}); | ||
RECORD_METRIC_VALUE(kCounterNumQueryContexts, numContexts); | ||
cpuMon_.update(); | ||
checkOverload(); | ||
**memoryInfo_.wlock() = std::move(memoryInfo); | ||
} | ||
|
||
void PrestoServer::checkOverload() { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe offloading all these in a separate class? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @tanjialiang If that somehow grows over time, then we can refactor. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We are adding more and more stuff to PrestoServer now. It will be easier to read and maintain if we move these logic to a separate class maybe called SystemMonitor. We can start from this one and maybe move a couple of stuff from current PrestoServer there, as well. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I disagree. |
||
auto systemConfig = SystemConfig::instance(); | ||
|
||
const auto overloadedThresholdMemBytes = | ||
systemConfig->workerOverloadedThresholdMemGb() * 1024 * 1024 * 1024; | ||
if (overloadedThresholdMemBytes > 0) { | ||
const auto currentUsedMemoryBytes = (memoryChecker_ != nullptr) | ||
? memoryChecker_->cachedSystemUsedMemoryBytes() | ||
: 0; | ||
const bool isMemOverloaded = | ||
(currentUsedMemoryBytes > overloadedThresholdMemBytes); | ||
if (isMemOverloaded) { | ||
LOG(WARNING) << "Server memory is overloaded. Currently used: " | ||
<< velox::succinctBytes(currentUsedMemoryBytes) | ||
<< ", threshold: " | ||
<< velox::succinctBytes(overloadedThresholdMemBytes); | ||
} else if (isMemOverloaded_) { | ||
LOG(INFO) << "Server memory is no longer overloaded. Currently used: " | ||
<< velox::succinctBytes(currentUsedMemoryBytes) | ||
<< ", threshold: " | ||
<< velox::succinctBytes(overloadedThresholdMemBytes); | ||
} | ||
RECORD_METRIC_VALUE(kCounterOverloadedMem, isMemOverloaded ? 100 : 0); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Curious why is 0 and 100, but not a range of values in between. or why not just 0 and 1. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @amitkdutta |
||
isMemOverloaded_ = isMemOverloaded; | ||
} | ||
|
||
const auto overloadedThresholdCpuPct = | ||
systemConfig->workerOverloadedThresholdCpuPct(); | ||
if (overloadedThresholdCpuPct > 0) { | ||
const auto currentUsedCpuPct = cpuMon_.getCPULoadPct(); | ||
const bool isCpuOverloaded = | ||
(currentUsedCpuPct > overloadedThresholdCpuPct); | ||
if (isCpuOverloaded) { | ||
LOG(WARNING) << "Server CPU is overloaded. Currently used: " | ||
<< currentUsedCpuPct | ||
<< "%, threshold: " << overloadedThresholdCpuPct << "%"; | ||
} else if (isCpuOverloaded_) { | ||
LOG(INFO) << "Server CPU is no longer overloaded. Currently used: " | ||
<< currentUsedCpuPct | ||
<< "%, threshold: " << overloadedThresholdCpuPct << "%"; | ||
} | ||
RECORD_METRIC_VALUE(kCounterOverloadedCpu, isCpuOverloaded ? 100 : 0); | ||
isCpuOverloaded_ = isCpuOverloaded; | ||
} | ||
} | ||
|
||
static protocol::Duration getUptime( | ||
std::chrono::steady_clock::time_point& start) { | ||
auto seconds = std::chrono::duration_cast<std::chrono::seconds>( | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems a bit odd to use 0 for default threshold. Can we use something like -1 to indicate that it can be ignored ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aditi-pandit
Thanks for reviewing this PR, seem like hard to find anyone brave enough these days to review a small change.
Using zeros for 'no op' is the practice we use already in the config.
See lines:
252 for kDriverCancelTasksWithStuckOperatorsThresholdMs
296 for kSystemMemLimitGb
365 for kAsyncCachePersistenceInterval
451 for kSharedArbitratorMaxMemoryArbitrationTime
532 for kSharedArbitratorGlobalArbitrationMemoryReclaimPct
It actually makes a certain semantic sense - threshold 0 for something that cannot be negative does not make sense, so it can be used as the 'magic number'.