-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix race in conformance/interfaces/pthread_mutex_init/1-2.c #9
base: main
Are you sure you want to change the base?
Conversation
This test was running deadlk_issue several times and expecting the returned global to be zero if `pthread_mutex_lock` has not returned, however the resetting on `returned` to zero was happening only after the `sem_post`. The main thread is calling `sched_yeild` which on most platforms is enough to cover up this issue, but not with WebAssembly. Some alternatives to landing this change: 1. We could just disable this test. 2. We could make sched_yield actualy do something rather than be a no-op.
What would our options be in 2 - what could we make |
I was thinking one option might be |
I see, thanks. I'm not too familiar with the pthreads API and race semantics etc., so I'd need to spend a significant amount of time to check this in depth. But unless you do want a detailed review from me, this lgtm to land, and is better than the other options. |
Lets see if @kleisauke agrees with my assessment that this is bug in the test code. I'm temporarily disabled this test as part of emscripten-core/emscripten#13006 because that change seem to make this test more flaky (although I don't know why exactly). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Somehow I couldn't reproduce this race when adding -DVERBOSE=2
here (due to proxying to the main thread?) or when testing this with PR emscripten-core/emscripten#13007 (perhaps moving pthread_{cancel,join}
from JS to musl resolved this?).
Yes, this race is very sensitive to things like printf and tracing or any kind of syscall made in the main thread. |
Do you agree that there is an inadvertent race here? i.e. the setting of |
I'm not sure if this is an inadvertent race, debugging these types of flaky races is difficult. I also tried to debug this on Alpine (which also uses musl), but with or without this change it always freezes on joining the first See for details: Details$ cd tests/third_party/posixtestsuite
$ docker run --cap-add=SYS_PTRACE -v $(pwd):/app -w /app -it alpine:latest sh
/app # apk add build-base gdb
/app # make build-tests CFLAGS='-DVERBOSE=2' LDFLAGS='-pthread' POSIX_TARGET='conformance/interfaces/pthread_mutex_init'
/app # gdb ./conformance/interfaces/pthread_mutex_init/1-2.test
(gdb) run
Starting program: /app/conformance/interfaces/pthread_mutex_init/1-2.test
warning: Error disabling address space randomization: Function not implemented
Test starting...
Data initialized...
Results for unlock issue #1:
mutex 1 unlocking returned 0
mutex 2 unlocking returned 0
Creating thread (unlock)...
[New LWP 174]
Locking in child...
Unlocking in parent...
Thread joined successfully...
Creating thread (unlock)...
[LWP 174 exited]
[New LWP 175]
Locking in child...
Unlocking in parent...
Thread joined successfully...
Results for unlock issue #2:
mutex 1 returned 0
mutex 2 returned 0
Creating thread (deadlk)...
[LWP 175 exited]
[New LWP 176]
Thread releases the semaphore...
Cancel thread...
Thread canceled...
Thread 4 "1-2.test" received signal SIG33, Real-time event 33.
[Switching to LWP 176]
0x00007ffb0db7f6a6 in ?? () from /lib/ld-musl-x86_64.so.1
(gdb) bt
#0 0x00007ffb0db7f6a6 in ?? () from /lib/ld-musl-x86_64.so.1
#1 0x0000000000000008 in ?? ()
#2 0x0000000000000000 in ?? ()
(gdb) handle SIG33 nostop noprint pass
Signal Stop Print Pass to program Description
SIG33 No No Yes Real-time event 33
(gdb) run
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /app/conformance/interfaces/pthread_mutex_init/1-2.test
warning: Error disabling address space randomization: Function not implemented
Test starting...
Data initialized...
Results for unlock issue #1:
mutex 1 unlocking returned 0
mutex 2 unlocking returned 0
Creating thread (unlock)...
[New LWP 178]
Locking in child...
Unlocking in parent...
Thread joined successfully...
Creating thread (unlock)...
[LWP 178 exited]
[New LWP 179]
Locking in child...
Unlocking in parent...
Thread joined successfully...
Results for unlock issue #2:
mutex 1 returned 0
mutex 2 returned 0
Creating thread (deadlk)...
[LWP 179 exited]
[New LWP 180]
Thread releases the semaphore...
Cancel thread...
Thread canceled...
# freeze
^Z
Thread 1 "1-2.test" received signal SIGTSTP, Stopped (user).
0x00007fa2688ff3ad in ?? () from /lib/ld-musl-x86_64.so.1
(gdb) bt
#0 0x00007fa2688ff3ad in ?? () from /lib/ld-musl-x86_64.so.1
#1 0x00007fa2688fc6c7 in ?? () from /lib/ld-musl-x86_64.so.1
#2 0x00007fa268940b84 in ?? () from /lib/ld-musl-x86_64.so.1
#3 0x0000000000000000 in ?? () |
I guess another way of putting it: Give what the test is measuring can you see any reason not to set returned to zero early on? To me it looked like it was intended that returned should be zero from the beginning and the fact that it was initialized to zero late is not intentional. |
If you set VERBOSE=2 does that help remove the deadlock with alpine? Great news that this change fixes the deadlock on alpine too.. I'm even more convinced now that this is the correct fix. |
Ah yes, you're right. It makes more sense to "reset" the
I initially tested with
Note that I was unable to resolve this deadlock on Alpine. I've tried this change-set, older versions, enabling/disabling verbose output to no avail. Perhaps it makes more sense to just disable these two tests? We can also try to report this upstream in musl's mailing list, the segfault in |
For now these two tests are disabled as "flaky" in emscripten .. so maybe I'll just leave it at that. I'll leave this PR open for posterity. |
This test was running deadlk_issue several times and expecting the
returned global to be zero if
pthread_mutex_lock
has not returned,however the resetting on
returned
to zero was happening only after thesem_post
. The main thread is callingsched_yeild
which on mostplatforms is enough to cover up this issue, but not with WebAssembly.
Some alternatives to landing this change:
a no-op.