Fix race in conformance/interfaces/pthread_mutex_init/1-2.c #9

sbc100 · 2021-11-09T17:25:18Z

This test was running deadlk_issue several times and expecting the
returned global to be zero if pthread_mutex_lock has not returned,
however the resetting on returned to zero was happening only after the
sem_post. The main thread is calling sched_yeild which on most
platforms is enough to cover up this issue, but not with WebAssembly.

Some alternatives to landing this change:

We could just disable this test.
We could make sched_yield actualy do something rather than be
a no-op.

This test was running deadlk_issue several times and expecting the returned global to be zero if `pthread_mutex_lock` has not returned, however the resetting on `returned` to zero was happening only after the `sem_post`. The main thread is calling `sched_yeild` which on most platforms is enough to cover up this issue, but not with WebAssembly. Some alternatives to landing this change: 1. We could just disable this test. 2. We could make sched_yield actualy do something rather than be a no-op.

kripken · 2021-11-09T17:38:11Z

What would our options be in 2 - what could we make sched_yield do?

sbc100 · 2021-11-09T17:57:21Z

I was thinking one option might be emscripten_thread_sleep(0)... however, no matter what we do it doesn't eliminate the race here.. the race is real even if sched_yeild hides it most of the time.

kripken · 2021-11-09T19:42:57Z

I see, thanks.

I'm not too familiar with the pthreads API and race semantics etc., so I'd need to spend a significant amount of time to check this in depth. But unless you do want a detailed review from me, this lgtm to land, and is better than the other options.

sbc100 · 2021-11-09T19:46:13Z

Lets see if @kleisauke agrees with my assessment that this is bug in the test code.

I'm temporarily disabled this test as part of emscripten-core/emscripten#13006 because that change seem to make this test more flaky (although I don't know why exactly).

kleisauke

LGTM! Somehow I couldn't reproduce this race when adding -DVERBOSE=2 here (due to proxying to the main thread?) or when testing this with PR emscripten-core/emscripten#13007 (perhaps moving pthread_{cancel,join} from JS to musl resolved this?).

sbc100 · 2021-11-10T18:07:10Z

Yes, this race is very sensitive to things like printf and tracing or any kind of syscall made in the main thread.

sbc100 · 2021-11-10T18:08:19Z

Do you agree that there is an inadvertent race here? i.e. the setting of returned = 0; in this location was not done for some deliberate reason right?

kleisauke · 2021-11-10T19:36:51Z

I'm not sure if this is an inadvertent race, debugging these types of flaky races is difficult. I also tried to debug this on Alpine (which also uses musl), but with or without this change it always freezes on joining the first deadlk_issue thread.

See for details:

Details

$ cd tests/third_party/posixtestsuite
$ docker run --cap-add=SYS_PTRACE -v $(pwd):/app -w /app -it alpine:latest sh
/app # apk add build-base gdb
/app # make build-tests CFLAGS='-DVERBOSE=2' LDFLAGS='-pthread' POSIX_TARGET='conformance/interfaces/pthread_mutex_init'
/app # gdb ./conformance/interfaces/pthread_mutex_init/1-2.test
(gdb) run
Starting program: /app/conformance/interfaces/pthread_mutex_init/1-2.test 
warning: Error disabling address space randomization: Function not implemented
Test starting...
Data initialized...
Results for unlock issue #1:
 mutex 1 unlocking returned 0
 mutex 2 unlocking returned 0
Creating thread (unlock)...
[New LWP 174]
Locking in child...
Unlocking in parent...
Thread joined successfully...
Creating thread (unlock)...
[LWP 174 exited]
[New LWP 175]
Locking in child...
Unlocking in parent...
Thread joined successfully...
Results for unlock issue #2:
 mutex 1 returned 0
 mutex 2 returned 0
Creating thread (deadlk)...
[LWP 175 exited]
[New LWP 176]
Thread releases the semaphore...
Cancel thread...
Thread canceled...

Thread 4 "1-2.test" received signal SIG33, Real-time event 33.
[Switching to LWP 176]
0x00007ffb0db7f6a6 in ?? () from /lib/ld-musl-x86_64.so.1
(gdb) bt
#0  0x00007ffb0db7f6a6 in ?? () from /lib/ld-musl-x86_64.so.1
#1  0x0000000000000008 in ?? ()
#2  0x0000000000000000 in ?? ()
(gdb) handle SIG33 nostop noprint pass
Signal        Stop	Print	Pass to program	Description
SIG33         No	No	Yes		Real-time event 33
(gdb) run
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /app/conformance/interfaces/pthread_mutex_init/1-2.test 
warning: Error disabling address space randomization: Function not implemented
Test starting...
Data initialized...
Results for unlock issue #1:
 mutex 1 unlocking returned 0
 mutex 2 unlocking returned 0
Creating thread (unlock)...
[New LWP 178]
Locking in child...
Unlocking in parent...
Thread joined successfully...
Creating thread (unlock)...
[LWP 178 exited]
[New LWP 179]
Locking in child...
Unlocking in parent...
Thread joined successfully...
Results for unlock issue #2:
 mutex 1 returned 0
 mutex 2 returned 0
Creating thread (deadlk)...
[LWP 179 exited]
[New LWP 180]
Thread releases the semaphore...
Cancel thread...
Thread canceled...

# freeze
^Z
Thread 1 "1-2.test" received signal SIGTSTP, Stopped (user).
0x00007fa2688ff3ad in ?? () from /lib/ld-musl-x86_64.so.1
(gdb) bt
#0  0x00007fa2688ff3ad in ?? () from /lib/ld-musl-x86_64.so.1
#1  0x00007fa2688fc6c7 in ?? () from /lib/ld-musl-x86_64.so.1
#2  0x00007fa268940b84 in ?? () from /lib/ld-musl-x86_64.so.1
#3  0x0000000000000000 in ?? ()

sbc100 · 2021-11-10T19:43:41Z

I guess another way of putting it: Give what the test is measuring can you see any reason not to set returned to zero early on? To me it looked like it was intended that returned should be zero from the beginning and the fact that it was initialized to zero late is not intentional. returned being true is supposed to signify that the second pthread_mutex_lock has returned and it being set to 1 in a previous run was not supposed to effect the next run.

sbc100 · 2021-11-10T19:44:49Z

If you set VERBOSE=2 does that help remove the deadlock with alpine?

Great news that this change fixes the deadlock on alpine too.. I'm even more convinced now that this is the correct fix.

kleisauke · 2021-11-11T15:27:30Z

I guess another way of putting it: Give what the test is measuring can you see any reason not to set returned to zero early on? To me it looked like it was intended that returned should be zero from the beginning and the fact that it was initialized to zero late is not intentional. returned being true is supposed to signify that the second pthread_mutex_lock has returned and it being set to 1 in a previous run was not supposed to effect the next run.

Ah yes, you're right. It makes more sense to "reset" the returned var earlier to avoid any interference between the two threads.

If you set VERBOSE=2 does that help remove the deadlock with alpine?

I initially tested with -DVERBOSE=2, but it seems that it still freezes after the Thread canceled... print. Removing that definition doesn't work either.

Great news that this change fixes the deadlock on alpine too.. I'm even more convinced now that this is the correct fix.

Note that I was unable to resolve this deadlock on Alpine. I've tried this change-set, older versions, enabling/disabling verbose output to no avail. Perhaps it makes more sense to just disable these two tests?

We can also try to report this upstream in musl's mailing list, the segfault in posixtest.test_pthread_key_create_2_1 (see here) and posixtest.test_pthread_detach_1_2 (see reproducer instructions) should probably be reported as well.

sbc100 · 2021-11-11T15:35:45Z

For now these two tests are disabled as "flaky" in emscripten .. so maybe I'll just leave it at that. I'll leave this PR open for posterity.

sbc100 requested review from kleisauke and kripken November 9, 2021 17:25

kleisauke approved these changes Nov 10, 2021

View reviewed changes

Fix race in conformance/interfaces/pthread_mutex_init/1-2.c #9

Are you sure you want to change the base?

Fix race in conformance/interfaces/pthread_mutex_init/1-2.c #9

Uh oh!

Conversation

sbc100 commented Nov 9, 2021

Uh oh!

kripken commented Nov 9, 2021

Uh oh!

sbc100 commented Nov 9, 2021

Uh oh!

kripken commented Nov 9, 2021

Uh oh!

sbc100 commented Nov 9, 2021

Uh oh!

kleisauke left a comment

Choose a reason for hiding this comment

Uh oh!

sbc100 commented Nov 10, 2021

Uh oh!

sbc100 commented Nov 10, 2021

Uh oh!

kleisauke commented Nov 10, 2021

Uh oh!

sbc100 commented Nov 10, 2021

Uh oh!

sbc100 commented Nov 10, 2021

Uh oh!

kleisauke commented Nov 11, 2021

Uh oh!

sbc100 commented Nov 11, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

sbc100 commented Nov 11, 2021 •

edited

Loading