New mapped_private mode: avoid crash by flushing dirty pages when memory pressure gets high #24
base: main
Conversation
src/pinnable_mapped_file.cpp
Outdated
// we are in `copy_on_write` mode.
static time_t check_time = 0;
constexpr int check_interval = 60; // seconds
constexpr size_t one_gb = 1ull << 30;
constexpr int check_interval = 30; // seconds
I wonder if it is worth reducing the check_interval once oom_score is above some threshold? For example, check every 1s after it is above 900.
I don't think it is needed. When the `oom` threshold is reached, all state pages are flushed (in sync mode).
I changed the flush to synchronous so the memory is actually reclaimed before we recreate the mapping. The synchronous flush takes time (5-10s) when the state is heavily accessed (while syncing) and we have a 30s delay.
With a shorter delay, we likely will have fewer dirty pages, and the sync would be faster, but I'm concerned that we would spend all our time flushing pages.
Obviously this memory constrained mode is not a desirable mode to function in, at least while syncing, as it requires lots of disk i/o. I see it more as a way to either:
- avoid a crash altogether if there is a temporary lack of available memory on the system, for example because another program is running.
- delay a crash in order to give the user a chance to exit gracefully.
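The score being discussed is the one Linux exposes per process in `/proc/self/oom_score`. As a minimal sketch (the `read_oom_score` helper name is my own, not necessarily what the PR uses):

```cpp
#include <fstream>

// Read the kernel's OOM badness score for the current process.
// Returns -1 when /proc/self/oom_score is unavailable (e.g. non-Linux).
// Hypothetical helper; the PR's actual reading code may differ.
inline int read_oom_score() {
    std::ifstream f("/proc/self/oom_score");
    int score = -1;
    if (f >> score)
        return score;
    return -1;
}
```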
This needs some sort of unit test (either in chainbase or leap) that somehow exercises this code path.
src/pinnable_mapped_file.cpp
Outdated
std::cerr << "CHAINBASE: Writing \"" << _database_name << "\" database file, complete." << '\n';
} else if (mapped_writable_instance) {
   // we are saving while processing... recreate the copy_on_write mapping with clean pages.
   _file_mapped_region = bip::mapped_region(_file_mapping, bip::copy_on_write);
This needs to be both atomic and return the exact same address, otherwise bad things could happen. You probably will need to `mmap(MAP_FIXED)`.
I updated the PR to use `mmap()` with `MAP_FIXED`. Strangely enough, the previous code worked (the same address was returned).
if (*oom_score >= 980) {
   // linux returned a high out-of-memory (oom) score for the current process, indicating a high
   // probability that the process will be killed soon (the valid range is from 0 to 1000;
   // the higher the value, the higher the probability that the process will be killed).
afaict that's not quite how it's documented as behaving, but rather just that "A higher score means that the process is more likely to be selected by the OOM-killer." The absolute value doesn't matter, only its relative value compared to all the other processes' values. A process could still be killed due to system-wide memory pressure at values less than 1000.
While in one limited test 980 may have made sense, it's not clear that value is appropriate if there are multiple nodeos running on the same box, as a simple example. Or, what happens if a lot of memory is consumed in tmpfs? Or, how do containers factor in to this?
I'd like to see more research on appropriate ways to detect system-wide memory pressure (which is really what we want in this case). For example, https://www.kernel.org/doc/html/latest/accounting/psi.html kinda-sorta sounds more like what we're looking for, but I'm not entirely sure.
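For reference, the PSI interface linked above exposes files such as `/proc/pressure/memory`, whose lines look like `some avg10=0.31 avg60=0.12 avg300=0.02 total=44`. A rough sketch of reading the 10-second average (helper names are mine, and any threshold on the result would need tuning):

```cpp
#include <cstdlib>
#include <fstream>
#include <string>

// Extract the avg10 field (percent of wall time recently stalled on memory)
// from a PSI line such as "some avg10=0.31 avg60=0.12 avg300=0.02 total=44".
// Returns -1.0 if the field is missing. Illustrative helper, not PR code.
inline double psi_avg10(const std::string& line) {
    std::string::size_type pos = line.find("avg10=");
    if (pos == std::string::npos)
        return -1.0;
    return std::atof(line.c_str() + pos + 6);
}

// Read the first ("some") line of /proc/pressure/memory; empty string when
// PSI is unavailable (old kernel, non-Linux, restricted /proc).
inline std::string read_memory_pressure_line() {
    std::ifstream f("/proc/pressure/memory");
    std::string line;
    std::getline(f, line);
    return line;
}
```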
In the cases you mention (multiple nodeos, a lot of memory consumed in tmpfs), if users select `mapped_private`, there is a high likelihood that nodeos will be killed by the OOM-killer.
This change just allows nodeos to decrease its OOM score when it gets very high.
Do I claim this PR catches all cases? No.
Would some additional research allow us to do a better job at predicting when this memory relief is needed? Probably.
Is it better to catch at least some clear cases of OOM danger and avoid crashing if we can? Yes, IMHO.
This change just allows nodeos to decrease its OOM score when it gets very high.
I would be interested in doing more research, but my understanding right now is that a high absolute `oom_score` isn't really guaranteed to mean anything. It's only that score relative to other processes that means something. And even then, it only means that the process with the highest score will be first on the chopping block next time the OOM killer takes action (whenever that is). These are all important distinctions to what has been written in the comment.
Leaning on the absolute value of `oom_score` to determine when the process is getting close to being killed by OOM appears to be leaning on implementation-specific behavior. For example, consider if Linux changed in the next release so that the process using the most memory has a fixed `oom_score` of 1000, the process with the second most a fixed 999, etc. That would still be completely valid by the rules of: "A higher score means that the process is more likely to be selected by the OOM-killer. The basis for this score is the amount of memory used by the process"
Manually testing this more makes me more convinced `oom_score` isn't a good metric. For example, I created a 32GB VM and set `chain-state-db-size-mb = 24576`. Once nodeos gets to running, I would get a flush about every 85000 blocks. Granted, this is nearly 12 hours of real time (but maybe not really, since we don't know how speculative trxs dirty things), but it was every 60-90 seconds during a replay. When the flush was triggered, nodeos only had 15-16GB resident and,
chainbase: oom = 984, flushed 822054 state pages to disk (in 16065 ms) to decrease memory pressure
chainbase: oom score after flush: 675
chainbase: The system is running low in memory for running nodeos. If this is not a temporary condition, consider increasing the available RAM
meaning just ~3GB was dirty.
There was clearly no memory pressure at all: literally half of the VM's memory was completely unused. Yet, we're telling the user the system is running low on memory.
Interestingly, it was already the case before.
// --------------------------------------------------------------------------------------
void* old_address = _cow_address;
munmap(_cow_address, _database_size); // first clear old region so we don't overcommit
_cow_address = mmap(_cow_address, _database_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_FIXED,
There is a small period of time between the `munmap()` and `mmap(MAP_FIXED)` in which some other thread may be given a mapping at this location; in such a case the `mmap(MAP_FIXED)` will then silently blow away that data. You should just be able to skip the `munmap()` here to avoid that race.
I was concerned about doing a `mmap` (when we already have a high oom score) without first blowing away the old mapping. Maybe it is not an issue since we use `MAP_FIXED`?
I just saw the doc: "If the memory region specified by addr and length overlaps pages of any existing mapping(s), then the overlapped part of the existing mapping(s) will be discarded", so I feel better about implementing your suggestion.
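The mmap(2) behavior quoted here can be demonstrated in isolation. A small sketch, using an anonymous mapping to stand in for the file-backed `copy_on_write` region:

```cpp
#include <sys/mman.h>
#include <cstddef>

// Demonstrates the quoted mmap(2) guarantee: a MAP_FIXED mapping over an
// existing range atomically discards the overlapped pages, so no prior
// munmap() (and hence no race window) is needed. Anonymous memory stands in
// for the file-backed copy_on_write region; not the PR's actual code.
inline void* remap_fresh(void* addr, std::size_t len) {
    return mmap(addr, len, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
}
```

Because the replacement happens in one syscall, another thread cannot be handed the address range in between, which is exactly the race the review comment pointed out.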
done
Such as another thread `mmap`ing the same area of virtual memory.
test/test.cpp
Outdated
pmf.set_oom_delay(1);

size_t flush_count = 0;
for (int i=0; i<max_elems; ++i) {
fwiw there is a mismatched signed comparison warning here
Thanks, fixed.
I know why most pages of the state quickly get dirty when syncing. It is because every time an item gets inserted into (or deleted from) a boost intrusive [...] It is even worse for the [...]
test/test.cpp
Outdated
@@ -146,16 +147,20 @@ BOOST_AUTO_TEST_CASE( oom_flush_dirty_pages ) {
   db.create<book>( [i]( book& b ) { b.a = (int)i; b.b = (int)(i+1); } );
   if (i % 1000 == 0) {
      if (auto res = db.check_memory_and_flush_if_needed()) {
         // we need to wait some time after clearing the soft-dirty bits from the task's PTEs
         // for the next read to be accurate (see https://www.kernel.org/doc/Documentation/vm/soft-dirty.txt)
Are you saying clearing the bits is asynchronous? If that's the case the whole approach seems dangerous.
I was initially worried about that when doing the initial review, but when I looked through the kernel source it seemed like it was synchronous.
Or instead is the problem here in this test that the `oom_score` is only updated at some interval and we need to wait around for that update to occur? If that's the case, I don't think citing that `soft-dirty.txt` refers to the same issue.
It seems that when we don't wait, we see more pages as dirty than we should, so I don't think it is dangerous, except for the fact that we'd flush more pages to disk than strictly needed. But I agree that it is unnerving.
It's dangerous because if it's not synchronous you'll miss writes that occur on pages after writing to `clear_refs` but before the dirty bit is cleared asynchronously in the background.
After `save_database_file()` we should have completely fresh pages anyways though, right? So why are we seeing seemingly spurious dirty bits in brand-new pages? Is it possible the dirty bit is only valid based on some other criteria? (some other bit being set?)
It's dangerous because if it's not synchronous you'll miss writes that occur on pages after writing to clear_refs but before the dirty bit is cleared asynchronously in the background.
If it behaved like that the feature would be utterly useless.
Is it possible the dirty bit is only valid based on some other criteria? (some other bit being set?)
No I don't think so because all we need to do is wait a second or two and then we see the correct pages dirty.
It's dangerous because if it's not synchronous you'll miss writes that occur on pages after writing to `clear_refs` but before the dirty bit is cleared asynchronously in the background.

I have tested it and you don't miss writes. If you write to a page right after `clear_refs`, you see the dirty bit whether you check right after the `clear_refs` or two seconds later.
Why do the kernel unit tests not require waiting?
Hum, the test doesn't check exactly the same thing. The check has one page with the soft-dirty bit set and verifies that the bit is cleared immediately after `clear_refs`.
What I see in my test is that newly mapped pages, which were never written to, report being `soft-dirty` for a second or two after `clear_refs`, and then don't anymore. But if one of these pages is written to immediately after `clear_refs`, the `soft-dirty` bit will remain set.
@spoonincode you are correct, there was an issue in the test (now corrected) that made me think that `clear_refs` did not clear the `soft-dirty` bit immediately, but that was an error on my part. So my original comment was wrong: the clearing of the `soft-dirty` bits is synchronous.
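For reference, the mechanism being debated works by writing `4` to `/proc/self/clear_refs` and then reading bit 55 of each 64-bit `/proc/self/pagemap` entry, as described in soft-dirty.txt. A standalone sketch, assuming unrestricted `/proc` access (helper names are mine; this is not the chainbase code):

```cpp
#include <cstdint>
#include <cstdio>
#include <unistd.h>

// Query the soft-dirty flag (bit 55 of the pagemap entry) for one page.
// Returns 1 if soft-dirty, 0 if clean, -1 if /proc/self/pagemap is unavailable
// (non-Linux, or a restricted container environment).
inline int page_soft_dirty(const void* addr) {
    std::FILE* f = std::fopen("/proc/self/pagemap", "rb");
    if (!f) return -1;
    long page = sysconf(_SC_PAGESIZE);
    std::uint64_t entry = 0;
    int rc = -1;
    // each page has one 8-byte entry, indexed by virtual page number
    if (std::fseek(f, static_cast<long>((std::uintptr_t)addr / (std::uintptr_t)page * 8), SEEK_SET) == 0 &&
        std::fread(&entry, 8, 1, f) == 1)
        rc = static_cast<int>((entry >> 55) & 1);
    std::fclose(f);
    return rc;
}

// Clear all soft-dirty bits for this process; returns false if not permitted.
inline bool clear_soft_dirty() {
    std::FILE* f = std::fopen("/proc/self/clear_refs", "w");
    if (!f) return false;
    bool ok = std::fputs("4", f) >= 0;
    return std::fclose(f) == 0 && ok;
}
```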
It shows that updated pages are still seen as soft-dirty after the 2s wait.
Check that after the flush to disk, the new memory mapping of the disk db has all the previously inserted values.
Can you describe any real world testing that has been performed?
I ran an EOS node in a container limited to 6GB. Unsurprisingly, since the container's limit is not reflected in the oom_score, nodeos just ends up getting killed eventually.
I guess a VM or system-wide memory limitation would be required for testing.
@@ -127,4 +128,49 @@ BOOST_AUTO_TEST_CASE( open_and_create ) {
   BOOST_REQUIRE_EQUAL( new_book.b, copy_new_book.b );
}

BOOST_AUTO_TEST_CASE( oom_flush_dirty_pages ) {
Test should probably be #ifdef'ed for linux
I have made the test work even if `oom` is not available. It doesn't test `oom` in that case, but does exercise chainbase some.
*((char*)_file_mapped_region.get_address()+header_dirty_bit_offset) = dirty; // set dirty bit in our memory mapping

_segment_manager = reinterpret_cast<segment_manager*>((char*)_file_mapped_region.get_address()+header_size);
_cow_address = mmap(NULL, _database_size, PROT_READ | PROT_WRITE, MAP_PRIVATE, _file_mapping.get_mapping_handle().handle, 0);
I've noticed a couple of issues here when testing,
The size should be the size of the DB, not the configured size. These will differ when a user tries to shrink the DB (not supported). Just looking through the code some, I'm worried we might have broken this in other ways in #21 (such as when writing out heap mode), and it may require a separate fix for 5.0. I'm also curious why the grow_shrink test isn't failing; maybe the test needs to be improved, unsure.
If this fails, nodeos crashes, leaving the DB dirty. This is fairly easy for someone to stumble on: if I only have a box with 24GB of RAM and configure `chain-state-db-size-mb = 24576`, it will fail. We need to check the return value here and fail more gracefully.
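A sketch of the kind of graceful failure path suggested here: check the mmap result and surface an error rather than letting a later dereference of `MAP_FAILED` crash the process with the DB still marked dirty. The wrapper name is hypothetical, not the PR's actual fix:

```cpp
#include <sys/mman.h>
#include <cerrno>
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <string>

// Reserve a private region, throwing a descriptive error on failure instead
// of returning MAP_FAILED to be dereferenced later. Anonymous memory stands
// in for the file-backed mapping; hypothetical helper for illustration.
inline void* checked_private_map(std::size_t len) {
    void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        throw std::runtime_error(std::string("mmap failed: ") + std::strerror(errno));
    return p;
}
```

The caller can then catch the exception, leave the on-disk state consistent, and report the misconfiguration (e.g. a `chain-state-db-size-mb` larger than available RAM + swap) to the user.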
The size should be the size of the DB, not the configured size.

Hum, I'm not following. Here we use `_database_size`, which is the `shared_file_size` passed to the `database` and `pinnable_mapped_file` constructors.
I'm worried we might have broke this in other ways in #21 (such as when writing out heap mode)

Do you have any specific issue in mind? That is the main reason I added the `_database_size` member: to ensure we stored the requested database size, so we could precisely write this size, excluding extra space required for huge-pages alignment or such.
The requested database size isn't the actual database size in case of,
chainbase/src/pinnable_mapped_file.cpp
Lines 136 to 140 in 7615dda
else if(shared_file_size < existing_file_size) {
   std::cerr << "CHAINBASE: \"" << _database_name << "\" requested size of " << shared_file_size << " is less than "
                "existing size of " << existing_file_size << ". This database will not be shrunk and will "
                "remain at " << existing_file_size << '\n';
}
So, for example, if you first run with `chain-state-db-size-mb = 24576` and then restart with `chain-state-db-size-mb = 1024`, the DB size will remain 24576 but `_database_size` is 1024, and it results in a crash with `mapped_private`.
It also probably results in heap mode not being completely written out now too, for a similar reason: `_database_size` is the requested size, not the actual size. Basically any place we use `_database_size` is suspect currently.
Thanks! I fixed the issue in this PR, and if you agree with the change, I'll create a PR for fixing it in 5.0
Upon reflection, I don't think this feature is necessary, and maybe not even desirable, because:
- With `mapped_private`, you have to configure enough (RAM + swap) size for the full configured `chain-state-db-size-mb` (or the `mmap(MAP_PRIVATE)` fails and nodeos doesn't start). As a result we will be able to grow the state db and use up to that configured size without any issue.
- As long as the actual state size actually accessed by transactions is less than the RAM, there shouldn't be any swapping, but if in-use state grows beyond the available RAM, swapping should prevent nodeos being killed by the OOM killer.

So I believe that the solution as implemented in this PR is (1) not necessary, and (2) not provably safe and effective, therefore I am in favor of closing this PR. Please let me know if there is any disagreement regarding closing this PR, otherwise I'll close it as `not planned` in a few days.

Original PR comment
Part of resolution for Leap #1721.
The Linux kernel assigns and maintains an Out of Memory (OOM) score on a per-process basis. The valid range is from 0 (never kill) to 1000 (always kill); the lower the value, the lower the probability that the process will be killed.
We track this score for Leap, and when the score gets high, flush the dirty pages of our private mapping to disk, lowering the Leap process OOM score, and thus preventing the system from killing our process.
Currently, we trigger the flush() for an `oom` value of `980`, which is pretty hard to reach on a system which has more memory than the state database working set (about 20GB for EOS). I tried launching a bunch of programs gobbling memory while `nodeos` was running, but these programs were killed by the OS while `nodeos` continued working happily, even with a lower `oom_score` threshold (I tried `900` and `800`).
This PR makes leap's `mapped_private` mode function in a somewhat optimal mode, using as much memory as possible for its chainbase db in order to prevent constant writebacks to disk, while still doing occasional writebacks (to avoid crashing) on systems with limited RAM.
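The periodic check described above (a threshold test throttled by a `check_interval`, as in the diff near the top of the thread) can be sketched as a small pure helper; this is an illustration of the throttling logic only, not the PR's exact code:

```cpp
#include <ctime>

// Return true at most once every `interval` seconds, updating `last_check`
// when the check is allowed through. Mirrors the check_time/check_interval
// pattern in the diff; sketch only, not the PR's actual implementation.
inline bool should_check(std::time_t now, std::time_t& last_check, int interval = 30) {
    if (now - last_check < interval)
        return false;
    last_check = now;
    return true;
}
```

In the real code the `now` argument would come from `time(nullptr)` each block, and a passing check would trigger the `oom_score` read and, above the threshold, the synchronous flush.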