@xuezhulian

Problem:
The previous implementation used pthread_key_create(&terminationKey, onThreadExitCallback) to listen for thread exit and execute a cleanup callback.
The core issue was that this onThreadExitCallback function accessed a statically declared thread_local variable: THREAD_LOCAL_VARIABLE RuntimeState* runtimeState = kInvalidRuntime;.
Due to the order of operations during thread termination, the cleanup for our custom terminationKey ran after the thread-local variable (TLV) storage of the thread had been torn down. By the time our callback was called, the TLV backing runtimeState had already been destroyed.
Consequently, accessing runtimeState inside the callback triggered a fresh allocation of its TLV storage. Because the thread's TSD (thread-specific data) cleanup loop had already completed, nothing was left to free this newly allocated memory, so it leaked on every thread exit.
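
The problematic pattern described above can be sketched roughly as follows. This is a minimal illustration, not the actual runtime code: RuntimeState here is a hypothetical stand-in struct, and the tsd_sketch namespace, callbackRuns counter, threadBody, and runOneThread helper are all invented for the example. Only pthread_key_create, terminationKey, onThreadExitCallback, and runtimeState come from the description.

```cpp
#include <pthread.h>
#include <atomic>
#include <cassert>

namespace tsd_sketch {

// Hypothetical stand-in for the runtime's per-thread state.
struct RuntimeState { int id; };

// Plays the role of the thread_local runtimeState from the description.
thread_local RuntimeState* runtimeState = nullptr;

std::atomic<int> callbackRuns{0};  // illustration only: counts callback invocations
pthread_key_t terminationKey;
pthread_once_t keyOnce = PTHREAD_ONCE_INIT;

void onThreadExitCallback(void* value) {
    // This read is the risky part. On Apple platforms a thread_local access
    // is expected to go through _tlv_get_addr; if the TLV storage is already
    // gone, it gets re-allocated here and nothing frees it afterwards.
    RuntimeState* state = runtimeState;
    (void)state;
    delete static_cast<RuntimeState*>(value);
    callbackRuns.fetch_add(1);
}

void createKey() {
    // The destructor runs during TSD cleanup for threads with a non-null value.
    pthread_key_create(&terminationKey, onThreadExitCallback);
}

void* threadBody(void*) {
    pthread_once(&keyOnce, createKey);
    RuntimeState* state = new RuntimeState{1};
    runtimeState = state;
    pthread_setspecific(terminationKey, state);  // arms the exit callback
    return nullptr;
}

// Runs one thread to completion and reports how many callbacks have fired.
int runOneThread() {
    pthread_t thread;
    int rc = pthread_create(&thread, nullptr, threadBody, nullptr);
    assert(rc == 0);
    pthread_join(thread, nullptr);
    return callbackRuns.load();
}

}  // namespace tsd_sketch
```

The sketch runs without error on any POSIX platform; whether the read inside the callback re-allocates TLV storage depends on the platform's teardown order, which is the Apple-specific behavior this PR addresses.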

Solution:
The fix is to change the callback registration mechanism to align with the C++ runtime's intended process for thread_local cleanup.
Instead of creating a custom key, the new implementation uses _tlv_atexit(&onThreadExitCallback, destructorRecord) to register the thread exit callback.
This works because _tlv_atexit indirectly registers our callback with the main cleanup list managed by dyld. During process initialization, dyld creates a system-level key, _terminatorsKey, and associates it with a master cleanup function, finalizeListTLV. The _tlv_atexit function essentially adds our callback to the list that finalizeListTLV will process.
Crucially, the execution of finalizeListTLV is guaranteed to happen before the individual thread_local variables like runtimeState are destroyed.
As a result, when onThreadExitCallback now accesses runtimeState, the variable is still valid, which prevents the TLV reallocation and resolves the memory leak.
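
A rough sketch of the new registration, under stated assumptions: the _tlv_atexit prototype is declared by hand because it is assumed not to be available from public SDK headers, DestructorRecord is a hypothetical placeholder for whatever destructorRecord points to, and the non-Apple branch is a portable fallback added only so the example builds elsewhere (it is not part of this PR).

```cpp
#include <atomic>
#include <cassert>
#include <thread>

#if defined(__APPLE__)
// Assumed prototype of the dyld entry point named in the description.
extern "C" void _tlv_atexit(void (*termFunc)(void* objAddr), void* objAddr);
#endif

namespace tlv_sketch {

// Hypothetical payload handed to the exit callback.
struct DestructorRecord { int threadTag; };

std::atomic<int> callbackRuns{0};  // illustration only

void onThreadExitCallback(void* arg) {
    // With _tlv_atexit registration, thread_local state is expected to still
    // be valid at this point, so the callback could read it safely.
    delete static_cast<DestructorRecord*>(arg);
    callbackRuns.fetch_add(1);
}

void registerThreadExit(DestructorRecord* record) {
#if defined(__APPLE__)
    // Adds the callback to the per-thread list that dyld processes at exit.
    _tlv_atexit(&onThreadExitCallback, record);
#else
    // Fallback (assumption, not from the PR): a thread_local guard whose
    // destructor invokes the callback when the thread exits.
    struct Guard {
        explicit Guard(DestructorRecord* r) : record(r) {}
        ~Guard() { onThreadExitCallback(record); }
        DestructorRecord* record;
    };
    thread_local Guard guard(record);
    (void)guard;
#endif
}

// Runs one thread to completion and reports how many callbacks have fired.
int runOneThread() {
    std::thread worker([] { registerThreadExit(new DestructorRecord{1}); });
    worker.join();  // thread-exit callbacks complete before join returns
    return callbackRuns.load();
}

}  // namespace tlv_sketch
```

Compared with the pthread key approach, no key or per-thread value has to be managed; the callback and its argument are registered together in a single call.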

@xuezhulian xuezhulian requested a review from a team as a code owner November 24, 2025 12:02
@xuezhulian xuezhulian force-pushed the master branch 11 times, most recently from 10e75ee to 3dc5c7c Compare November 24, 2025 14:19
@xuezhulian
Author

The _tlv_get_addr function, which is called when a thread_local variable is accessed, operates on the __DATA,__thread_data section. Accessing this section may trigger a page fault, resulting in blocking disk I/O. This becomes critical during thread termination.
We have observed the following deadlock scenario in production:

  1. A Main GC event is triggered. As part of its process, the GC thread needs all Kotlin threads to reach a safepoint before it can proceed.
  2. At the same time, the main thread is already suspended, waiting for the GC to complete.
  3. Concurrently, another Kotlin thread is in the process of exiting and is executing __pthread_tsd_cleanup.
  4. Within this cleanup routine, our previous logic inadvertently accesses a thread_local variable. This access triggers a page fault, causing the exiting thread to be suspended by the kernel while it waits for disk I/O.
  5. This creates a deadlock:
    ○ The Main GC thread is blocked, waiting for the exiting Kotlin thread to reach a safepoint.
    ○ The exiting Kotlin thread is blocked by the kernel, waiting for a page fault to be resolved.
    ○ The main thread remains suspended, indirectly blocked by the exiting thread's stall.
    This entire sequence prevents the main thread from responding, ultimately leading to a watchdog timeout and termination of the application.
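
The dependency chain above can be modeled with a toy simulation. This is purely illustrative and not Kotlin/Native code: the safepoint_sketch namespace and simulateGcWait are invented names, a sleep stands in for the kernel-side page-fault stall, and a spin loop stands in for the GC waiting on a safepoint. It only shows that the waiting side cannot make progress for at least as long as the exiting thread is stalled.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

namespace safepoint_sketch {

// Returns how long the simulated "GC" had to wait for the exiting thread.
std::chrono::milliseconds simulateGcWait(std::chrono::milliseconds stall) {
    std::atomic<bool> atSafepoint{false};
    const auto start = std::chrono::steady_clock::now();

    std::thread exitingThread([&] {
        // Stand-in for a thread_local access during TSD cleanup that page
        // faults and blocks on disk I/O.
        std::this_thread::sleep_for(stall);
        // Only after the stall can the thread be seen as safe by the GC.
        atSafepoint.store(true, std::memory_order_release);
    });

    // Stand-in for the GC thread waiting for every Kotlin thread to reach a
    // safepoint; a suspended main thread would be stuck behind this wait too.
    while (!atSafepoint.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    const auto waited = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - start);

    exitingThread.join();
    return waited;
}

}  // namespace safepoint_sketch
```

In this model the wait always ends once the stall does; in production the stall is bounded only by how long the I/O takes, which can exceed the watchdog limit.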
