Closed
Description
Today's run of the std test suite in Miri failed on macOS:
```
test sync::rwlock::tests::frob ... error: Undefined Behavior: trying to retag from <42368726> for SharedReadOnly permission at alloc9742021[0x0], but that tag does not exist in the borrow stack for this location
   --> /home/runner/work/miri-test-libstd/miri-test-libstd/rust-src-patched/library/core/src/ptr/non_null.rs:401:18
    |
401 |         unsafe { &*self.as_ptr().cast_const() }
    |                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |                  |
    |                  trying to retag from <42368726> for SharedReadOnly permission at alloc9742021[0x0], but that tag does not exist in the borrow stack for this location
    |                  this error occurs as part of retag at alloc9742021[0x0..0x30]
    |
    = help: this indicates a potential bug in the program: it performed an invalid operation, but the Stacked Borrows rules it violated are still experimental
    = help: see https://github.com/rust-lang/unsafe-code-guidelines/blob/master/wip/stacked-borrows.md for further information
help: <42368726> was created by a SharedReadOnly retag at offsets [0x0..0x8]
   --> std_miri_test/../library/std/src/sys/locks/rwlock/queue.rs:348:32
    |
348 |         let mut next = ptr::from_ref(&node)
    |                        ^^^^^^^^^^^^^^^^^^^^
help: <42368726> was later invalidated at offsets [0x8..0x10] by a write access
   --> std_miri_test/../library/std/src/sys/locks/rwlock/queue.rs:347:17
    |
347 |         node.prev = AtomicLink::new(None);
    |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    = note: BACKTRACE (of the first span):
    = note: inside `core::ptr::NonNull::<sys::locks::rwlock::queue::Node>::as_ref::<'_>` at /home/runner/work/miri-test-libstd/miri-test-libstd/rust-src-patched/library/core/src/ptr/non_null.rs:401:18: 401:46
note: inside `sys::locks::rwlock::queue::add_backlinks_and_find_tail`
   --> std_miri_test/../library/std/src/sys/locks/rwlock/queue.rs:265:17
    |
265 |         next.as_ref().prev.set(Some(current));
    |         ^^^^^^^^^^^^^
note: inside `sys::locks::rwlock::queue::RwLock::unlock_queue`
   --> std_miri_test/../library/std/src/sys/locks/rwlock/queue.rs:487:33
    |
487 |         let tail = unsafe { add_backlinks_and_find_tail(to_node(state)) };
    |                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
note: inside `sys::locks::rwlock::queue::RwLock::lock_contended`
   --> std_miri_test/../library/std/src/sys/locks/rwlock/queue.rs:383:25
    |
383 |         self.unlock_queue(next);
    |         ^^^^^^^^^^^^^^^^^^^^^^^
note: inside `sys::locks::rwlock::queue::RwLock::write`
   --> std_miri_test/../library/std/src/sys/locks/rwlock/queue.rs:312:13
    |
312 |         self.lock_contended(true)
    |         ^^^^^^^^^^^^^^^^^^^^^^^^^
note: inside `sync::rwlock::RwLock::<()>::write`
   --> std_miri_test/../library/std/src/sync/rwlock.rs:361:13
    |
361 |         self.inner.write();
    |         ^^^^^^^^^^^^^^^^^^
note: inside closure
   --> std_miri_test/../library/std/src/sync/rwlock/tests.rs:37:26
    |
37  |         drop(r.write().unwrap());
    |                        ^^^^^^^^^
```
The diff to the last successful run is this, but since this is a concurrency test, there's a high chance that the bug already existed before and only surfaces randomly. So #110211 is the most likely culprit.
So far I haven't managed to reproduce this outside the context of the standard library test suite.
Activity
RalfJung commented on Feb 26, 2024
Looking at the code, what seems to be happening is that in one iteration through the loop we create a shared reference here:
rust/library/std/src/sys/locks/rwlock/queue.rs, lines 348–350 (at b58f647)
I am surprised that the diagnostic says "created by a SharedReadOnly retag at offsets [0x0..0x8]". This type does have interior mutability, so parts of the retag should be SharedReadWrite. This is the type in question:
rust/library/std/src/sys/locks/rwlock/queue.rs, lines 179–187 (at b58f647)
Everything except the bool here is an UnsafeCell. So how can there be 8 bytes of SharedReadOnly (i.e., outside an UnsafeCell)?
miri-test-libstd uses -Zrandomize-layout. Maybe that has something to do with it. Maybe that's why we couldn't reproduce it yet.
RalfJung commented on Feb 26, 2024
So with random layout, I guess the bool field comes first and then an 8-aligned field follows, meaning we have 8 bytes of "not interior mutable" memory at the start. So far, so good.
Then apparently later it got invalidated at offsets 8 to 16 here:
rust/library/std/src/sys/locks/rwlock/queue.rs, line 347 (at b58f647)
So that pointer must be the second field.
Then later we use the original shared reference again and 💥.
I think this is morally equivalent to the following:
This is rejected by Miri with Stacked Borrows but accepted with Tree Borrows.
The "blessed" way to do this is to never mix direct accesses to a local with accesses through derived pointers. Fixing that in queue.rs shouldn't be too hard.
Auto merge of #3327 - RalfJung:tree-interior_mut_reborrow, r=RalfJung
RalfJung commented on Feb 26, 2024
From what @joboet said, the actual bug is that anyone is even still using the old reference at all when we do the write that causes the invalidation.
According to the Miri log, we have these events in this order:
1. A reference gets created and turned into a raw pointer:
   rust/library/std/src/sys/locks/rwlock/queue.rs, lines 348–350 (at b58f647)
2. The memory that reference points to is written to via direct access to the local variable:
   rust/library/std/src/sys/locks/rwlock/queue.rs, line 347 (at b58f647)
3. Someone uses the pointer derived from the reference:
   rust/library/std/src/sys/locks/rwlock/queue.rs, line 265 (at b58f647)
We have a full backtrace only for the last event.
Backtrace
apiraino commented on Feb 26, 2024
AFAIK T-libs doesn't use the I-prioritize label, so probably fine to remove it.
@rustbot label -I-prioritize
RalfJung commented on Feb 26, 2024
I did manage to reproduce this on my machine once with that flag (with seed 2852 on Miri bbabee144ea12b210a7be9488f0eeed9efe3e24d), but when I run the testcase again with that seed, it doesn't happen again.
This is the test:
Full command I used:
So... it uses actual randomness, and thus of course it does not reproduce even with the same seed. :/ (Disabling isolation means we give the program access to host randomness, which will be used to initialize the thread_rng.)
RalfJung commented on Feb 26, 2024
The backtrace was a bit different this time:
So the entry point of the bad use of the pointer is read this time, not write. But from unlock_queue onwards it's the same.
RalfJung commented on Feb 26, 2024
I finally found some seeds. :) 2998, 5308, 7680. On Miri bbabee144ea12b210a7be9488f0eeed9efe3e24d:
RalfJung commented on Mar 2, 2024
Turns out this diff fixes all three seeds that I found above:
Alternatively, this diff also does it:
So I think something is wrong with the memory orderings used somewhere. I can't tell which of these state accesses should be strengthened, though. (And there might be more 1-line diffs that suffice; I didn't try all the Relaxed ones.)
I also found some other suspicious orderings, namely RMW operations where the success case has a weaker ordering than the failure case:
rust/library/std/src/sys/locks/rwlock/queue.rs, line 405 (at b58f647)
rust/library/std/src/sys/locks/rwlock/queue.rs, lines 492–497 (at b58f647)
rust/library/std/src/sys/locks/rwlock/queue.rs, line 537 (at b58f647)
IMO these justify a comment explaining why the orderings are so surprising, but they don't seem involved in this bug, at least not in the cases we have found so far.
Auto merge of #3338 - RalfJung:more-tracking-and-threads, r=RalfJung
RalfJung commented on Mar 3, 2024
Turns out I went down the wrong rabbit hole. Changing these orderings means the program took a different path somewhere earlier in the execution, so we didn't hit the bad interleaving any more. I found a seed that reproduces this issue even without weak memory emulation, so the orderings can't be the problem:
Miri 9272474b755dca7dfea4f96fd8344dd12e223ae8, with -Zmiri-seed=11408 -Zmiri-disable-weak-memory-emulation
What makes this particularly strange is that everything relevant seems to happen on the same thread... not sure what to make of that.
joboet commented on Mar 3, 2024
I found the bug! It's an ABA problem. When a new node is added, its thread basically executes:
In the failing executions, the thread gets interrupted for a long time, long enough that the thread at the head of the queue gets woken up (invalidating all shared references to it) and gets re-added to the queue (with a new shared reference). The CAS succeeds because the pointer has the same bit pattern, but once we try to perform an access through the pointer, we obviously incur UB.
This is very hard to fix, since the bug can occur anytime a Node allocation has the same address as a previous one. At the moment, the only solution I see is to use lossy provenance and make Miri guess the right borrow permissions.
RalfJung commented on Mar 3, 2024
Nice catch!
Closing in favor of #121950.