
New mapped_private mode: avoid crash by flushing dirty pages when memory pressure gets high #24

Open · wants to merge 30 commits into base: main
Conversation

greg7mdp (Contributor) commented Oct 5, 2023

Upon reflection, I don't think this feature is necessary, and maybe not even desirable, because:

  1. When using mapped_private, you have to configure enough (RAM + swap) space for the full configured chain-state-db-size-mb (otherwise the mmap(MAP_PRIVATE) fails and nodeos doesn't start). As a result we will be able to grow the state db up to that configured size without any issue.
    As long as the state size actually accessed by transactions is less than the available RAM, there shouldn't be any swapping, and if the in-use state grows beyond the available RAM, swapping should prevent nodeos from being killed by the OOM killer.
  2. Matt's testing and comments in this PR show that it is quite difficult to demonstrate that the solution implemented here will reliably prevent the process from being killed, and will not activate too early.

So I believe that the solution as implemented in this PR is (1) not necessary, and (2) not provably safe and effective; therefore I am in favor of closing this PR.

Please let me know if there is any disagreement regarding closing this PR, otherwise I'll close it as not planned in a few days.

Original PR comment

Part of resolution for Leap #1721.

The Linux kernel assigns and maintains an out-of-memory (OOM) score on a per-process basis. The valid range is from 0 (never kill) to 1000 (always kill); the lower the value, the lower the probability that the process will be killed.

We track this score for Leap, and when it gets high we flush the dirty pages of our private mapping to disk, lowering the Leap process's OOM score and thus preventing the system from killing our process.

Currently, we trigger the flush() at an oom_score of 980, which is pretty hard to reach on a system that has more memory than the state database working set (about 20GB for EOS). I tried launching a bunch of programs gobbling memory while nodeos was running, but these programs were killed by the OS while nodeos continued working happily, even with a lower oom_score threshold (I tried 900 and 800).
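For context, the score this PR keys off of is just a text integer exposed by the kernel. A minimal sketch of reading it (hypothetical helper names, not the PR's actual code):

```cpp
#include <fstream>
#include <optional>
#include <string>

// Parse the single integer the kernel writes into /proc/<pid>/oom_score.
inline std::optional<int> parse_oom_score(const std::string& s) {
    try { return std::stoi(s); } catch (...) { return std::nullopt; }
}

// Read the current process's score (0..1000); nullopt where /proc is absent.
inline std::optional<int> read_oom_score() {
    std::ifstream f("/proc/self/oom_score");
    std::string line;
    if (!f || !std::getline(f, line)) return std::nullopt;
    return parse_oom_score(line);
}
```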

This PR makes Leap's mapped_private mode function in a near-optimal way, using as much memory as possible for its chainbase db in order to prevent constant writebacks to disk, while still doing occasional writebacks (to avoid crashing) on systems with limited RAM.


// we are in `copy_on_write` mode.
static time_t check_time = 0;
constexpr size_t one_gb = 1ull << 30;
constexpr int check_interval = 30; // seconds (changed from 60)
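The snippet above throttles the check. A sketch of the pattern (illustrative helper, with the 30-second interval from this PR):

```cpp
#include <ctime>

// Return true at most once per `check_interval` seconds; the oom_score
// inspection and potential flush only happen when this returns true.
inline bool time_to_check(time_t now) {
    static time_t check_time = 0;
    constexpr int check_interval = 30; // seconds, as in this PR
    if (now - check_time < check_interval) return false;
    check_time = now;
    return true;
}
```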
Member:

I wonder if it is worth reducing the check_interval once oom_score is above some threshold? For example, check every 1s after it is above 900.

greg7mdp (Contributor Author) commented Oct 6, 2023:

I don't think it is needed. When the oom threshold is reached, all state pages are flushed (in sync mode).
I changed the flush to synchronous so the memory is actually reclaimed before we recreate the mapping. The synchronous flush takes time (5-10s) when the state is heavily accessed (while syncing) and we have a 30s delay.
With a shorter delay, we would likely have fewer dirty pages and the sync would be faster, but I'm concerned that we would spend all our time flushing pages.

Obviously this memory-constrained mode is not a desirable mode to operate in, at least while syncing, as it requires lots of disk i/o. I see it more as a way to either:

  • avoid a crash altogether if there is a temporary lack of available memory on the system, for example because another program is running.
  • delay a crash in order to give the user a chance to exit gracefully.

spoonincode (Member) left a comment:

This needs some sort of unit test (either in chainbase or leap) that somehow exercises this code path.

std::cerr << "CHAINBASE: Writing \"" << _database_name << "\" database file, complete." << '\n';
} else if (mapped_writable_instance) {
// we are saving while processing... recreate the copy_on_write mapping with clean pages.
_file_mapped_region = bip::mapped_region(_file_mapping, bip::copy_on_write);
Member:

This needs to be both atomic and return the exact same address; otherwise bad things could happen. You will probably need to mmap(MAP_FIXED).

Contributor Author:

I updated the PR to use mmap() with MAP_FIXED. Strangely enough, the previous code worked (same address was returned).

if (*oom_score >= 980) {
// linux returned a high out-of-memory (oom) score for the current process, indicating a high
// probability that the process will be killed soon (the valid range is from 0 to 1000;
// the higher the value, the higher the probability that the process will be killed).
Member:

afaict that's not quite how it's documented as behaving, but rather just that a "A higher score means that the process is more likely to be selected by the OOM-killer." The absolute value doesn't matter, only its relative value compared to all the other processes' value. A process could still be killed due to system wide memory pressure at values less than 1000.

While in one limited test 980 may have made sense, it's not clear that value is appropriate if there are multiple nodeos running on the same box, as a simple example. Or, what happens if a lot of memory is consumed in tmpfs? Or, how do containers factor in to this?

I'd like to see more research on appropriate ways to detect system wide memory pressure (which is really what we want in this case). For example,
https://www.kernel.org/doc/html/latest/accounting/psi.html
kinda-sorta sounds more like what we're looking for, but I'm not entirely sure.
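For reference, PSI (available since Linux 4.20) reports system-wide stall percentages in /proc/pressure/memory, with lines like `some avg10=1.23 avg60=0.45 avg300=0.10 total=12345678`. A hedged sketch of consuming it (illustrative helper names; chainbase does not currently do this):

```cpp
#include <fstream>
#include <optional>
#include <string>

// Extract the avg10 stall percentage from one PSI line ("some ..." / "full ...").
inline std::optional<double> psi_avg10(const std::string& line) {
    auto pos = line.find("avg10=");
    if (pos == std::string::npos) return std::nullopt;
    try { return std::stod(line.substr(pos + 6)); } catch (...) { return std::nullopt; }
}

// Percentage of time over the last 10s that *some* task stalled on memory;
// nullopt if PSI is unavailable (older kernel, or booted with psi=0).
inline std::optional<double> memory_pressure_some() {
    std::ifstream f("/proc/pressure/memory");
    for (std::string line; std::getline(f, line); )
        if (line.rfind("some", 0) == 0) return psi_avg10(line);
    return std::nullopt;
}
```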

Contributor Author:

In the cases you mention (multiple nodeos, a lot of memory consumed in tmpfs), if users select mapped_private, there is a high likelihood that nodeos will be killed by the OOM-killer.

This change just allows nodeos to decrease its OOM score when it gets very high.

Do I claim this PR catches all cases? No.

Would some additional research allow us to do a better job at predicting when this memory relief is needed? Probably.

Is it better to catch at least some clear cases of OOM danger and avoid crashing when we can? Yes, IMHO.

Member:

This change just allows nodeos to decrease its OOM score when it gets very high.

I would be interested in doing more research, but my understanding right now is that a high absolute oom_score isn't really guaranteed to mean anything. It's only that score relative to other processes that means something. And even then, it only means that the process with the highest score will be first on the chopping block the next time the OOM killer takes action (whenever that is). These are all important distinctions to what has been written in the comment.

Leaning on the absolute value of oom_score to determine when the process is getting close to being killed by OOM appears to be leaning on implementation-specific behavior. For example, consider if Linux changed in the next release so that the process using the most memory has a fixed oom_score of 1000, the process with the second most a fixed 999, etc. That would still be completely valid by the rules of: "A higher score means that the process is more likely to be selected by the OOM-killer. The basis for this score is the amount of memory used by the process"

Member:

Manually testing this more makes me more convinced oom_score isn't a good metric. For example, I created a 32GB VM and set chain-state-db-size-mb = 24576. Once nodeos got to running, I would get a flush about every 85000 blocks. Granted, that is nearly 12 hours of real time (though maybe not truly real time, since we don't know how speculative trxs dirty things), but it was every 60-90 seconds during a replay. When the flush was triggered, nodeos only had 15-16GB resident and,

chainbase: oom = 984, flushed 822054 state pages to disk (in 16065 ms) to decrease memory pressure
chainbase: oom score after flush: 675
chainbase: The system is running low in memory for running nodeos. If this is not a temporary condition, consider increasing the available RAM

meaning just ~3GB was dirty.

There was clearly no memory pressure at all: literally half of the VM's memory was completely unused. Yet, we're telling the user the system is running low on memory.

@greg7mdp greg7mdp requested a review from heifner October 6, 2023 16:30
// --------------------------------------------------------------------------------------
void* old_address = _cow_address;
munmap(_cow_address, _database_size); // first clear old region so we don't overcommit
_cow_address = mmap(_cow_address, _database_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_FIXED,
Member:

There is a small period of time between the munmap() and mmap(MAP_FIXED) that some other thread may be given a mapping at this location; in such a case the mmap(MAP_FIXED) will then silently blow away that data. Should just be able to skip the munmap() here to avoid that race.

Contributor Author:

I was concerned about doing a mmap (when we already have a high oom score) without first blowing away the old mapping. Maybe it is not an issue since we use MAP_FIXED?

Contributor Author:

I just saw the doc: "If the memory region specified by addr and length overlaps pages of any existing mapping(s), then the overlapped part of the existing mapping(s) will be discarded", so I feel better about implementing your suggestion.

Contributor Author:

done

Such as another thread `mmap`ing the same area of virtual memory.
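A minimal standalone demonstration of the point (assumptions: Linux, a file-backed MAP_PRIVATE mapping; the helper name and temp-file setup are illustrative, not chainbase code). Calling mmap(MAP_FIXED) directly over the live mapping atomically discards the overlapped copy-on-write pages, so there is no munmap() window for another thread to slip into:

```cpp
#include <sys/mman.h>
#include <unistd.h>
#include <cstdlib>
#include <cstddef>

// Atomically replace the private mapping at `addr` with a clean
// copy-on-write view of the same file; returns true when the kernel
// honored the requested (fixed) address.
inline bool remap_clean(void* addr, std::size_t sz, int fd) {
    void* re = mmap(addr, sz, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_FIXED, fd, 0);
    return re == addr; // MAP_FIXED either maps exactly here or fails
}
```

After the call, previously dirtied private pages read back the file's contents again, which is the "recreate the copy_on_write mapping with clean pages" step discussed in this PR.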
test/test.cpp Outdated
pmf.set_oom_delay(1);

size_t flush_count = 0;
for (int i=0; i<max_elems; ++i) {
Member:

fwiw there is a mismatched signed comparison warning here

Contributor Author:

Thanks, Fixed.

greg7mdp (Contributor Author):

I know why most pages of the state quickly get dirty when syncing. It is because every time an item gets inserted into (or deleted from) a boost intrusive rbtree, the whole tree is rebalanced (rebalance_after_insertion()), which involves updating the color member of many tree elements.

It is even worse for the id tree: since we store monotonically increasing ids, we would naturally create a perfectly unbalanced tree.
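To illustrate the second point: without rebalancing, monotonically increasing keys degenerate a binary search tree into a list, which is precisely why a red-black tree must recolor and rotate (and thereby dirty) many nodes to keep the depth logarithmic. A toy model, not chainbase code:

```cpp
#include <memory>

// Minimal unbalanced BST node.
struct Node {
    int key = 0;
    std::unique_ptr<Node> left, right;
};

// Insert `key` and return the depth at which it landed. With strictly
// increasing keys, every insert walks the entire right spine.
inline int insert_and_depth(std::unique_ptr<Node>& root, int key) {
    std::unique_ptr<Node>* cur = &root;
    int depth = 0;
    while (*cur) {
        cur = key < (*cur)->key ? &(*cur)->left : &(*cur)->right;
        ++depth;
    }
    *cur = std::make_unique<Node>();
    (*cur)->key = key;
    return depth;
}
```

A red-black tree avoids this linear depth by rebalancing, at the cost of writing the color (and parent/child) fields of O(log n) nodes per insert, which is what dirties so many pages during sync.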

test/test.cpp Outdated
@@ -146,16 +147,20 @@ BOOST_AUTO_TEST_CASE( oom_flush_dirty_pages ) {
db.create<book>( [i]( book& b ) { b.a = (int)i; b.b = (int)(i+1); } );
if (i % 1000 == 0) {
if (auto res = db.check_memory_and_flush_if_needed()) {
// we need to wait some time after clearing the soft-dirty bits from the task's PTEs
// for the next read to be accurate (see https://www.kernel.org/doc/Documentation/vm/soft-dirty.txt)
Member:

Are you saying clearing the bits is asynchronous? If that's the case the whole approach seems dangerous.

I was initially worried about that when doing the initial review, but when I looked through the kernel source it seemed like it was synchronous.

Or instead is the problem here in this test that the oom_score is only updated at some interval and we need to wait around for that update to occur? If that's the case, I don't think citing that soft-dirty.txt refers to the same issue.

Contributor Author:

The issue is not updating the oom_score (in the test we don't care about the value being updated, because we set an arbitrarily low threshold).
It seems that if you check the soft-dirty bits on the pages right after clearing them, you don't get correct results.

The reason I linked this url is:

[image: excerpt from the kernel's soft-dirty.txt documentation]

Contributor Author:

It seems that when we don't wait, we see more pages as dirty than we should, so I don't think it is dangerous, except for the fact that we'd flush more pages to disk than strictly needed. But I agree that it is unnerving.

Member:

It's dangerous because if it's not synchronous you'll miss writes that occur on pages after writing to clear_refs but before the dirty bit is cleared asynchronously in the background.

After save_database_file() we should have completely fresh pages anyway though, right? So why are we seeing seemingly spurious dirty bits in brand-new pages? Is it possible the dirty bit is only valid based on some other criterion? (some other bit being set?)

Contributor Author:

It's dangerous because if it's not synchronous you'll miss writes that occur on pages after writing to clear_refs but before the dirty bit is cleared asynchronously in the background.

If it behaved like that the feature would be utterly useless.

Is it possible the dirty bit is only valid based on some other criteria? (some other bit being set?)

No I don't think so because all we need to do is wait a second or two and then we see the correct pages dirty.

Contributor Author:

It's dangerous because if it's not synchronous you'll miss writes that occur on pages after writing to clear_refs but before the dirty bit is cleared asynchronously in the background.

I have tested it and you don't miss writes.

If you write to a page right after clear_refs, you see the dirty bit whether you check right after the clear_refs or two seconds later.

Member:

Why do the kernel unit tests not require waiting?

greg7mdp (Contributor Author) commented Oct 13, 2023:

Hmm, the kernel test doesn't check exactly the same thing. It has one page with the soft-dirty bit set and verifies that the bit is cleared immediately after clear_refs.

What I see in my test is that newly mapped pages, which were never written to, report being soft-dirty for a second or two after clear_refs, and then don't anymore. But if one of these pages is written to immediately after clear_refs, the soft-dirty bit will remain set.
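For readers following along, the machinery under discussion is Linux's soft-dirty tracking (Documentation/vm/soft-dirty.txt): writing "4" to /proc/<pid>/clear_refs clears the soft-dirty bits in the task's PTEs, and bit 55 of each 64-bit /proc/<pid>/pagemap entry then reports whether that page has been written since. A hedged sketch (helper names are illustrative, not chainbase's API):

```cpp
#include <cstdint>
#include <cstdio>

// Bit 55 of a pagemap entry is the soft-dirty flag.
constexpr bool soft_dirty_bit(std::uint64_t pagemap_entry) {
    return (pagemap_entry >> 55) & 1;
}

// Ask the kernel to clear all soft-dirty bits for the current process.
inline bool clear_soft_dirty() {
    std::FILE* f = std::fopen("/proc/self/clear_refs", "w");
    if (!f) return false;
    bool ok = std::fputs("4", f) != EOF;
    std::fclose(f);
    return ok;
}
```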

Contributor Author:

@spoonincode you are correct, there was an issue in the test (now corrected) that made me think that clear_refs did not clear the soft-dirty bit immediately, but that was an error on my part. So my original comment was wrong, the clearing of the soft-dirty bits is synchronous.

@greg7mdp greg7mdp requested review from spoonincode and heifner and removed request for heifner October 23, 2023 13:40
spoonincode (Member) left a comment:

Can you describe any real world testing that has been performed?

I ran an EOS node in a container limited to 6GB. Unsurprisingly, since the container's limit is not reflected in the oom_score, nodeos just ends up getting killed eventually.

I guess a VM or system-wide memory limitation would be required for testing.

@@ -127,4 +128,49 @@ BOOST_AUTO_TEST_CASE( open_and_create ) {
BOOST_REQUIRE_EQUAL( new_book.b, copy_new_book.b );
}

BOOST_AUTO_TEST_CASE( oom_flush_dirty_pages ) {
Member:

Test should probably be #ifdef'ed for linux

Contributor Author:

I have made the test work even if oom is not available. It doesn't test oom in that case, but does exercise chainbase some.

*((char*)_file_mapped_region.get_address()+header_dirty_bit_offset) = dirty; // set dirty bit in our memory mapping

_segment_manager = reinterpret_cast<segment_manager*>((char*)_file_mapped_region.get_address()+header_size);
_cow_address = mmap(NULL, _database_size, PROT_READ | PROT_WRITE, MAP_PRIVATE, _file_mapping.get_mapping_handle().handle, 0);
Member:

I've noticed a couple of issues here when testing,

The size should be the size of the DB, not the configured size. These will differ when a user tries to shrink the DB (not supported). Just looking through the code some, I'm worried we might have broken this in other ways in #21 (such as when writing out heap mode), and it may require a separate fix for 5.0. I'm also curious why the grow_shrink test isn't failing; maybe the test needs to be improved, unsure.

If this fails, nodeos crashes, leaving the DB dirty. This is fairly easy for someone to stumble on: if I only have a box with 24GB of RAM and configure chain-state-db-size-mb = 24576, it will fail. We need to check the return value here and fail more gracefully.
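A sketch of the graceful-failure check being requested (illustrative function name and error text, not chainbase's actual wording). Note that mmap() reports failure by returning MAP_FAILED, not nullptr, so that is the value to test:

```cpp
#include <sys/mman.h>
#include <cerrno>
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <string>

// Map the database file privately, or throw a descriptive error instead of
// letting a later dereference crash with the DB left marked dirty.
inline void* map_private_or_throw(std::size_t db_size, int fd) {
    void* p = mmap(nullptr, db_size, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED)
        throw std::runtime_error("chainbase: mmap(MAP_PRIVATE) of " +
                                 std::to_string(db_size) + " bytes failed: " +
                                 std::strerror(errno));
    return p;
}
```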

Contributor Author:

The size should be the size of the DB, not the configured size.

Hum, I'm not following. Here we use _database_size, which is the shared_file_size passed to the database and pinnable_mapped_file constructors.

Contributor Author:

I'm worried we might have broke this in other ways in #21 (such as when writing out heap mode)

Do you have any specific issue in mind? That is the main reason I added the _database_size member: to ensure we stored the requested database size, so we could write precisely this size, excluding extra space required for huge-page alignment and such.

Member:

The requested database size isn't the actual database size in case of,

else if(shared_file_size < existing_file_size) {
std::cerr << "CHAINBASE: \"" << _database_name << "\" requested size of " << shared_file_size << " is less than "
"existing size of " << existing_file_size << ". This database will not be shrunk and will "
"remain at " << existing_file_size << '\n';
}

So, for example, if you first set chain-state-db-size-mb = 24576 and then restart with chain-state-db-size-mb = 1024, the DB size will remain 24576 but _database_size is 1024, and it results in a crash with mapped_private.

It also probably results in heap mode not being completely written out now too, for a similar reason: _database_size is the requested size, not the actual size. Basically any place we use _database_size is currently suspect.
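A sketch of the sizing rule implied above (hypothetical helper, not the actual fix): since shrinking is unsupported, the size to map must be the larger of the requested size and the file's existing size:

```cpp
#include <algorithm>
#include <cstddef>

// The DB file keeps its existing size when a smaller size is requested
// (shrinking is unsupported), so the mapping must cover the maximum.
inline std::size_t effective_db_size(std::size_t requested,
                                     std::size_t existing_file_size) {
    return std::max(requested, existing_file_size);
}
```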

Contributor Author:

Thanks! I fixed the issue in this PR, and if you agree with the change I'll create a PR for fixing it in 5.0.
