
New mapped_private mode: avoid crash by flushing dirty pages when memory pressure gets high #24

Open · wants to merge 30 commits into base: main
Conversation

greg7mdp (Contributor) commented Oct 5, 2023

Upon reflection, I don't think this feature is necessary, and maybe not even desirable, because:

  1. When using mapped_private, you have to configure enough (RAM + swap) space for the full configured chain-state-db-size-mb (otherwise the mmap(MAP_PRIVATE) fails and nodeos doesn't start). As a result we will be able to grow the state db up to that configured size without any issue.
    As long as the state size actually accessed by transactions is less than the available RAM, there shouldn't be any swapping, and if the in-use state grows beyond the available RAM, swapping should prevent nodeos from being killed by the OOM killer.
  2. Matt's testing and comments in this PR show that it is quite difficult to demonstrate that the solution implemented here will reliably prevent the process from being killed, and will not activate too early.

So I believe that the solution as implemented in this PR is (1) not necessary, and (2) not provably safe and effective; therefore I am in favor of closing this PR.

Please let me know if there is any disagreement regarding closing this PR, otherwise I'll close it as not planned in a few days.

Original PR comment

Part of resolution for Leap #1721.

The Linux kernel assigns and maintains an out-of-memory (OOM) score on a per-process basis. The valid range is from 0 (never kill) to 1000 (always kill); the lower the value, the lower the probability that the process will be killed.

We track this score for Leap, and when it gets high we flush the dirty pages of our private mapping to disk, lowering the Leap process's OOM score and thus preventing the system from killing our process.

Currently, we trigger the flush() at an oom_score of 980, which is pretty hard to reach on a system that has more memory than the state database working set (about 20GB for EOS). I tried launching a bunch of programs gobbling memory while nodeos was running, but these programs were killed by the OS while nodeos continued working happily, even with a lower oom_score threshold (I tried 900 and 800).
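For context, the score this PR keys off of is just a text integer exposed by the kernel. A minimal sketch of reading it (hypothetical helper names, not the PR's actual code):

```cpp
#include <fstream>
#include <optional>
#include <string>

// Parse the single integer the kernel writes into /proc/<pid>/oom_score.
inline std::optional<int> parse_oom_score(const std::string& s) {
    try { return std::stoi(s); } catch (...) { return std::nullopt; }
}

// Read the current process's score (0..1000); nullopt where /proc is absent.
inline std::optional<int> read_oom_score() {
    std::ifstream f("/proc/self/oom_score");
    std::string line;
    if (!f || !std::getline(f, line)) return std::nullopt;
    return parse_oom_score(line);
}
```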

This PR makes Leap's mapped_private mode function in a near-optimal way, using as much memory as possible for its chainbase db in order to prevent constant writebacks to disk, while still doing occasional writebacks (to avoid crashing) on systems with limited RAM.


// we are in `copy_on_write` mode.
static time_t check_time = 0;
constexpr size_t one_gb = 1ull << 30;
constexpr int check_interval = 30; // seconds (changed from 60)
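The snippet above throttles the check. A sketch of the pattern (illustrative helper, with the 30-second interval from this PR):

```cpp
#include <ctime>

// Return true at most once per `check_interval` seconds; the oom_score
// inspection and potential flush only happen when this returns true.
inline bool time_to_check(time_t now) {
    static time_t check_time = 0;
    constexpr int check_interval = 30; // seconds, as in this PR
    if (now - check_time < check_interval) return false;
    check_time = now;
    return true;
}
```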
Member:

I wonder if it is worth reducing the check_interval once oom_score is above some threshold? For example, check every 1s after it is above 900.

greg7mdp (Contributor Author) commented Oct 6, 2023:

I don't think it is needed. When the oom threshold is reached, all state pages are flushed (in sync mode).
I changed the flush to synchronous so the memory is actually reclaimed before we recreate the mapping. The synchronous flush takes time (5-10s) when the state is heavily accessed (while syncing) and we have a 30s delay.
With a shorter delay, we would likely have fewer dirty pages and the sync would be faster, but I'm concerned that we would spend all our time flushing pages.

Obviously this memory-constrained mode is not a desirable mode to operate in, at least while syncing, as it requires lots of disk i/o. I see it more as a way to either:

  • avoid a crash altogether if there is a temporary lack of available memory on the system, for example because another program is running.
  • delay a crash in order to give the user a chance to exit gracefully.

spoonincode (Member) left a comment:

This needs some sort of unit test (either in chainbase or leap) that somehow exercises this code path.

std::cerr << "CHAINBASE: Writing \"" << _database_name << "\" database file, complete." << '\n';
} else if (mapped_writable_instance) {
// we are saving while processing... recreate the copy_on_write mapping with clean pages.
_file_mapped_region = bip::mapped_region(_file_mapping, bip::copy_on_write);
Member:

This needs to be both atomic and return the exact same address; otherwise bad things could happen. You will probably need to mmap(MAP_FIXED).

Contributor Author:

I updated the PR to use mmap() with MAP_FIXED. Strangely enough, the previous code worked (same address was returned).

if (*oom_score >= 980) {
// linux returned a high out-of-memory (oom) score for the current process, indicating a high
// probability that the process will be killed soon (the valid range is from 0 to 1000;
// the higher the value, the higher the probability that the process will be killed).
Member:

afaict that's not quite how it's documented as behaving, but rather just that a "A higher score means that the process is more likely to be selected by the OOM-killer." The absolute value doesn't matter, only its relative value compared to all the other processes' value. A process could still be killed due to system wide memory pressure at values less than 1000.

While in one limited test 980 may have made sense, it's not clear that value is appropriate if there are multiple nodeos running on the same box, as a simple example. Or, what happens if a lot of memory is consumed in tmpfs? Or, how do containers factor in to this?

I'd like to see more research on appropriate ways to detect system wide memory pressure (which is really what we want in this case). For example,
https://www.kernel.org/doc/html/latest/accounting/psi.html
kinda-sorta sounds more like what we're looking for, but I'm not entirely sure.
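For reference, PSI (available since Linux 4.20) reports system-wide stall percentages in /proc/pressure/memory, with lines like `some avg10=1.23 avg60=0.45 avg300=0.10 total=12345678`. A hedged sketch of consuming it (illustrative helper names; chainbase does not currently do this):

```cpp
#include <fstream>
#include <optional>
#include <string>

// Extract the avg10 stall percentage from one PSI line ("some ..." / "full ...").
inline std::optional<double> psi_avg10(const std::string& line) {
    auto pos = line.find("avg10=");
    if (pos == std::string::npos) return std::nullopt;
    try { return std::stod(line.substr(pos + 6)); } catch (...) { return std::nullopt; }
}

// Percentage of time over the last 10s that *some* task stalled on memory;
// nullopt if PSI is unavailable (older kernel, or booted with psi=0).
inline std::optional<double> memory_pressure_some() {
    std::ifstream f("/proc/pressure/memory");
    for (std::string line; std::getline(f, line); )
        if (line.rfind("some", 0) == 0) return psi_avg10(line);
    return std::nullopt;
}
```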

Contributor Author:

In the cases you mention (multiple nodeos, a lot of memory consumed in tmpfs), if users select mapped_private, there is a high likelihood that nodeos will be killed by the OOM-killer.

This change just allows nodeos to decrease its OOM score when it gets very high.

Do I claim this PR catches all cases? No.

Would some additional research allow us to do a better job at predicting when this memory relief is needed? Probably.

Is it better to catch at least some clear cases of OOM danger and avoid crashing when we can? Yes, IMHO.

Member:

This change just allows nodeos to decrease its OOM score when it gets very high.

I would be interested in doing more research, but my understanding right now is that a high absolute oom_score isn't really guaranteed to mean anything. It's only that score relative to other processes that means something. And even then, it only means that the process with the highest score will be first on the chopping block the next time the OOM killer takes action (whenever that is). These are all important distinctions to what has been written in the comment.

Leaning on the absolute value of oom_score to determine when the process is getting close to being killed by OOM appears to be leaning on implementation-specific behavior. For example, consider if Linux changed in the next release so that the process using the most memory has a fixed oom_score of 1000, the process with the second most a fixed 999, etc. That would still be completely valid by the rules of: "A higher score means that the process is more likely to be selected by the OOM-killer. The basis for this score is the amount of memory used by the process"

Member:

Manually testing this more makes me more convinced oom_score isn't a good metric. For example, I created a 32GB VM and set chain-state-db-size-mb = 24576. Once nodeos got to running, I would get a flush about every 85000 blocks. Granted, that is nearly 12 hours of real time (though maybe not truly real time, since we don't know how speculative trxs dirty things), but it was every 60-90 seconds during a replay. When the flush was triggered, nodeos only had 15-16GB resident and,

chainbase: oom = 984, flushed 822054 state pages to disk (in 16065 ms) to decrease memory pressure
chainbase: oom score after flush: 675
chainbase: The system is running low in memory for running nodeos. If this is not a temporary condition, consider increasing the available RAM

meaning just ~3GB was dirty.

There was clearly no memory pressure at all: literally half of the VM's memory was completely unused. Yet, we're telling the user the system is running low on memory.

@greg7mdp greg7mdp requested a review from heifner October 6, 2023 16:30
// --------------------------------------------------------------------------------------
void* old_address = _cow_address;
munmap(_cow_address, _database_size); // first clear old region so we don't overcommit
_cow_address = mmap(_cow_address, _database_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_FIXED,
Member:

There is a small period of time between the munmap() and mmap(MAP_FIXED) that some other thread may be given a mapping at this location; in such a case the mmap(MAP_FIXED) will then silently blow away that data. Should just be able to skip the munmap() here to avoid that race.

Contributor Author:

I was concerned about doing a mmap (when we already have a high oom score) without first blowing away the old mapping. Maybe it is not an issue since we use MAP_FIXED?

Contributor Author:

I just saw the doc: "If the memory region specified by addr and length overlaps pages of any existing mapping(s), then the overlapped part of the existing mapping(s) will be discarded", so I feel better about implementing your suggestion.

Contributor Author:

done

Such as another thread `mmap`ing the same area of virtual memory.
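A minimal standalone demonstration of the point (assumptions: Linux, a file-backed MAP_PRIVATE mapping; the helper name and temp-file setup are illustrative, not chainbase code). Calling mmap(MAP_FIXED) directly over the live mapping atomically discards the overlapped copy-on-write pages, so there is no munmap() window for another thread to slip into:

```cpp
#include <sys/mman.h>
#include <unistd.h>
#include <cstdlib>
#include <cstddef>

// Atomically replace the private mapping at `addr` with a clean
// copy-on-write view of the same file; returns true when the kernel
// honored the requested (fixed) address.
inline bool remap_clean(void* addr, std::size_t sz, int fd) {
    void* re = mmap(addr, sz, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_FIXED, fd, 0);
    return re == addr; // MAP_FIXED either maps exactly here or fails
}
```

After the call, previously dirtied private pages read back the file's contents again, which is the "recreate the copy_on_write mapping with clean pages" step discussed in this PR.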
test/test.cpp Outdated
pmf.set_oom_delay(1);

size_t flush_count = 0;
for (int i=0; i<max_elems; ++i) {
Member:

fwiw there is a mismatched signed comparison warning here

Contributor Author:

Thanks, Fixed.

greg7mdp (Contributor Author):

I know why most pages of the state quickly get dirty when syncing. It is because every time an item gets inserted into (or deleted from) a boost intrusive rbtree, the whole tree is rebalanced (rebalance_after_insertion()), which involves updating the color member of many tree elements.

It is even worse for the id tree: since we store monotonically increasing ids, we would naturally create a perfectly unbalanced tree.
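To illustrate the second point: without rebalancing, monotonically increasing keys degenerate a binary search tree into a list, which is precisely why a red-black tree must recolor and rotate (and thereby dirty) many nodes to keep the depth logarithmic. A toy model, not chainbase code:

```cpp
#include <memory>

// Minimal unbalanced BST node.
struct Node {
    int key = 0;
    std::unique_ptr<Node> left, right;
};

// Insert `key` and return the depth at which it landed. With strictly
// increasing keys, every insert walks the entire right spine.
inline int insert_and_depth(std::unique_ptr<Node>& root, int key) {
    std::unique_ptr<Node>* cur = &root;
    int depth = 0;
    while (*cur) {
        cur = key < (*cur)->key ? &(*cur)->left : &(*cur)->right;
        ++depth;
    }
    *cur = std::make_unique<Node>();
    (*cur)->key = key;
    return depth;
}
```

A red-black tree avoids this linear depth by rebalancing, at the cost of writing the color (and parent/child) fields of O(log n) nodes per insert, which is what dirties so many pages during sync.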

test/test.cpp Outdated
@@ -146,16 +147,20 @@ BOOST_AUTO_TEST_CASE( oom_flush_dirty_pages ) {
db.create<book>( [i]( book& b ) { b.a = (int)i; b.b = (int)(i+1); } );
if (i % 1000 == 0) {
if (auto res = db.check_memory_and_flush_if_needed()) {
// we need to wait some time after clearing the soft-dirty bits from the task's PTEs
// for the next read to be accurate (see https://www.kernel.org/doc/Documentation/vm/soft-dirty.txt)
Member:

Are you saying clearing the bits is asynchronous? If that's the case the whole approach seems dangerous.

I was initially worried about that when doing the initial review, but when I looked through the kernel source it seemed like it was synchronous.

Or instead is the problem here in this test that the oom_score is only updated at some interval and we need to wait around for that update to occur? If that's the case, I don't think citing that soft-dirty.txt refers to the same issue.

Contributor Author:

The issue is not updating the oom_score (in the test we don't care about the value being updated, because we set an arbitrarily low threshold).
It seems that if you check the soft-dirty bits on the pages right after clearing them, you don't get correct results.

The reason I linked this url is:

[image: excerpt from the kernel's soft-dirty.txt documentation]

Contributor Author:

It seems that when we don't wait, we see more pages as dirty than we should, so I don't think it is dangerous, except for the fact that we'd flush more pages to disk than strictly needed. But I agree that it is unnerving.

Member:

It's dangerous because if it's not synchronous you'll miss writes that occur on pages after writing to clear_refs but before the dirty bit is cleared asynchronously in the background.

After save_database_file() we should have completely fresh pages anyway though, right? So why are we seeing seemingly spurious dirty bits in brand-new pages? Is it possible the dirty bit is only valid based on some other criterion? (some other bit being set?)

Contributor Author:

It's dangerous because if it's not synchronous you'll miss writes that occur on pages after writing to clear_refs but before the dirty bit is cleared asynchronously in the background.

If it behaved like that the feature would be utterly useless.

Is it possible the dirty bit is only valid based on some other criteria? (some other bit being set?)

No I don't think so because all we need to do is wait a second or two and then we see the correct pages dirty.

Contributor Author:

It's dangerous because if it's not synchronous you'll miss writes that occur on pages after writing to clear_refs but before the dirty bit is cleared asynchronously in the background.

I have tested it and you don't miss writes.

If you write to a page right after clear_refs, you see the dirty bit whether you check right after the clear_refs or two seconds later.

Member:

Why do the kernel unit tests not require waiting?

greg7mdp (Contributor Author) commented Oct 13, 2023:

Hmm, the kernel test doesn't check exactly the same thing. It has one page with the soft-dirty bit set and verifies that the bit is cleared immediately after clear_refs.

What I see in my test is that newly mapped pages, which were never written to, report being soft-dirty for a second or two after clear_refs, and then don't anymore. But if one of these pages is written to immediately after clear_refs, the soft-dirty bit will remain set.
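For readers following along, the machinery under discussion is Linux's soft-dirty tracking (Documentation/vm/soft-dirty.txt): writing "4" to /proc/<pid>/clear_refs clears the soft-dirty bits in the task's PTEs, and bit 55 of each 64-bit /proc/<pid>/pagemap entry then reports whether that page has been written since. A hedged sketch (helper names are illustrative, not chainbase's API):

```cpp
#include <cstdint>
#include <cstdio>

// Bit 55 of a pagemap entry is the soft-dirty flag.
constexpr bool soft_dirty_bit(std::uint64_t pagemap_entry) {
    return (pagemap_entry >> 55) & 1;
}

// Ask the kernel to clear all soft-dirty bits for the current process.
inline bool clear_soft_dirty() {
    std::FILE* f = std::fopen("/proc/self/clear_refs", "w");
    if (!f) return false;
    bool ok = std::fputs("4", f) != EOF;
    std::fclose(f);
    return ok;
}
```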

Contributor Author:

@spoonincode you are correct, there was an issue in the test (now corrected) that made me think that clear_refs did not clear the soft-dirty bit immediately, but that was an error on my part. So my original comment was wrong, the clearing of the soft-dirty bits is synchronous.

@greg7mdp greg7mdp requested review from spoonincode and heifner and removed request for heifner October 23, 2023 13:40
spoonincode (Member) left a comment:

Can you describe any real world testing that has been performed?

I ran an EOS node in a container limited to 6GB. Unsurprisingly, since the container's limit is not reflected in the oom_score, nodeos just ends up getting killed eventually.

I guess a VM or system-wide memory limitation would be required for testing.

@@ -127,4 +128,49 @@ BOOST_AUTO_TEST_CASE( open_and_create ) {
BOOST_REQUIRE_EQUAL( new_book.b, copy_new_book.b );
}

BOOST_AUTO_TEST_CASE( oom_flush_dirty_pages ) {
Member:

Test should probably be #ifdef'ed for linux

Contributor Author:

I have made the test work even if oom is not available. It doesn't test oom in that case, but does exercise chainbase some.

*((char*)_file_mapped_region.get_address()+header_dirty_bit_offset) = dirty; // set dirty bit in our memory mapping

_segment_manager = reinterpret_cast<segment_manager*>((char*)_file_mapped_region.get_address()+header_size);
_cow_address = mmap(NULL, _database_size, PROT_READ | PROT_WRITE, MAP_PRIVATE, _file_mapping.get_mapping_handle().handle, 0);
Member:

I've noticed a couple of issues here when testing,

The size should be the size of the DB, not the configured size. These will differ when a user tries to shrink the DB (not supported). Just looking through the code some, I'm worried we might have broken this in other ways in #21 (such as when writing out heap mode), and it may require a separate fix for 5.0. I'm also curious why the grow_shrink test isn't failing; maybe the test needs to be improved, unsure.

If this fails, nodeos crashes, leaving the DB dirty. This is fairly easy for someone to stumble on: if I only have a box with 24GB of RAM and configure chain-state-db-size-mb = 24576, it will fail. We need to check the return value here and fail more gracefully.
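A sketch of the graceful-failure check being requested (illustrative function name and error text, not chainbase's actual wording). Note that mmap() reports failure by returning MAP_FAILED, not nullptr, so that is the value to test:

```cpp
#include <sys/mman.h>
#include <cerrno>
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <string>

// Map the database file privately, or throw a descriptive error instead of
// letting a later dereference crash with the DB left marked dirty.
inline void* map_private_or_throw(std::size_t db_size, int fd) {
    void* p = mmap(nullptr, db_size, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED)
        throw std::runtime_error("chainbase: mmap(MAP_PRIVATE) of " +
                                 std::to_string(db_size) + " bytes failed: " +
                                 std::strerror(errno));
    return p;
}
```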

Contributor Author:

The size should be the size of the DB, not the configured size.

Hum, I'm not following. Here we use _database_size, which is the shared_file_size passed to the database and pinnable_mapped_file constructors.

Contributor Author:

I'm worried we might have broke this in other ways in #21 (such as when writing out heap mode)

Do you have any specific issue in mind? That is the main reason I added the _database_size member: to ensure we stored the requested database size, so we could write precisely this size, excluding extra space required for huge-page alignment and such.

Member:

The requested database size isn't the actual database size in case of,

else if(shared_file_size < existing_file_size) {
std::cerr << "CHAINBASE: \"" << _database_name << "\" requested size of " << shared_file_size << " is less than "
"existing size of " << existing_file_size << ". This database will not be shrunk and will "
"remain at " << existing_file_size << '\n';
}

So, for example, if you first set chain-state-db-size-mb = 24576 and then restart with chain-state-db-size-mb = 1024, the DB size will remain 24576 but _database_size is 1024, and it results in a crash with mapped_private.

It also probably results in heap mode not being completely written out now too, for a similar reason: _database_size is the requested size, not the actual size. Basically any place we use _database_size is currently suspect.
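A sketch of the sizing rule implied above (hypothetical helper, not the actual fix): since shrinking is unsupported, the size to map must be the larger of the requested size and the file's existing size:

```cpp
#include <algorithm>
#include <cstddef>

// The DB file keeps its existing size when a smaller size is requested
// (shrinking is unsupported), so the mapping must cover the maximum.
inline std::size_t effective_db_size(std::size_t requested,
                                     std::size_t existing_file_size) {
    return std::max(requested, existing_file_size);
}
```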

Contributor Author:

Thanks! I fixed the issue in this PR, and if you agree with the change I'll create a PR for fixing it in 5.0.
