Skip to content

GH-49272: [C++][CI] Fix intermittent segfault in arrow-json-test on M…#49462

Open
vanshaj2023 wants to merge 1 commit intoapache:mainfrom
vanshaj2023:fix-mingw-json-test-segfault
Open

GH-49272: [C++][CI] Fix intermittent segfault in arrow-json-test on M…#49462
vanshaj2023 wants to merge 1 commit intoapache:mainfrom
vanshaj2023:fix-mingw-json-test-segfault

Conversation

@vanshaj2023
Copy link

@vanshaj2023 vanshaj2023 commented Mar 5, 2026

Rationale for this change

The arrow-json-test intermittently segfaults on AMD64 Windows MinGW CI (both CLANG64 and MINGW64 environments), causing false CI failures. The crash occurs at 0.03 seconds into test execution during ReaderTest.MultipleChunksParallel. See #49272.

Example failing runs:

The root cause is MinGW's __emutls implementation for C++ thread_local, which has known race conditions during thread creation. When ThreadPool::LaunchWorkersUnlocked creates a new worker thread, the thread immediately writes to the thread_local current_thread_pool_ variable. If __emutls hasn't finished initializing TLS for the new thread, this dereferences an invalid pointer, causing a segfault.

What changes are included in this PR?

Replace thread_local with native Win32 TLS API on MinGW (thread_pool.cc): Uses TlsAlloc/TlsGetValue/TlsSetValue via Arrow's windows_compatibility.h instead of thread_local on MinGW to bypass the buggy __emutls emulation. Non-MinGW platforms are unchanged. Includes error handling for TlsAlloc failure.

Add MultipleChunksParallelStress test (reader_test.cc): Runs parallel JSON reading 20 times to help expose intermittent threading races.

Are these changes tested?

Yes. A new stress test (ReaderTest.MultipleChunksParallelStress) is added that repeatedly exercises the parallel JSON reading path. The existing ReaderTest.MultipleChunksParallel test also covers the affected code path.

Are there any user-facing changes?

No.

This PR contains a "Critical Fix". The changes fix a bug that causes a crash (segfault) in arrow-json-test on MinGW Windows CI due to a race condition in MinGW's __emutls thread-local storage emulation during thread pool worker creation.

Comment on lines +29 to +31
#if defined(__MINGW32__) || defined(__MINGW64__)
# include "arrow/util/windows_compatibility.h"
#endif
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can always include windows_compatibility.h because it does nothing on non Windows.

Suggested change
#if defined(__MINGW32__) || defined(__MINGW64__)
# include "arrow/util/windows_compatibility.h"
#endif
#include "arrow/util/windows_compatibility.h"

static DWORD index = [] {
DWORD i = TlsAlloc();
if (i == TLS_OUT_OF_INDEXES) {
ARROW_LOG(FATAL) << "TlsAlloc failed for thread pool TLS";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we show error details?

Suggested change
ARROW_LOG(FATAL) << "TlsAlloc failed for thread pool TLS";
ARROW_LOG(FATAL) << "TlsAlloc failed for thread pool TLS: " << internal::WinErrorMessage(GetLastError());

Comment on lines +293 to +317
// Stress test for intermittent threading crashes on MinGW.
// See https://github.com/apache/arrow/issues/49272
TEST(ReaderTest, MultipleChunksParallelStress) {
constexpr int kTrials = 20;
for (int trial = 0; trial < kTrials; ++trial) {
int64_t count = 1 << 10;
ParseOptions parse_options;
parse_options.unexpected_field_behavior = UnexpectedFieldBehavior::InferType;
ReadOptions read_options;
read_options.block_size = static_cast<int>(count / 2);
read_options.use_threads = true;

std::string json;
for (int i = 0; i < count; ++i) {
json += "{\"a\":" + std::to_string(i) + "}\n";
}

std::shared_ptr<io::InputStream> input;
ASSERT_OK(MakeStream(json, &input));
ASSERT_OK_AND_ASSIGN(auto reader, TableReader::Make(default_memory_pool(), input,
read_options, parse_options));
ASSERT_OK_AND_ASSIGN(auto table, reader->Read());
ASSERT_EQ(table->num_rows(), count);
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a heavy test?

If so, we don't want to this test in our CI. Can we disable it by default something like the following?

TEST(...) {
  if (!std::getenv("ARROW_SLOW_TEST_ENABLE")) {
    GTEST_SKIP() << ...;
  }
  ...
}

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it is not too heavy, each iteration is identical to the existing MultipleChunksParallel test (1024 small JSON lines), just repeated 20 times to increase the chance of hitting the race condition. It finishes in a few seconds.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also we reduce the iteration count if you think that's better for CI.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Can we remove for (int trial = 0; trial < kTrials; ++trial) {? We can reproduce this by running this test multiple times manually, right? (e.g. while build/debug/debug/arrow-json-test ...; do :; done)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I will update the code

@kou kou added the CI: Extra: C++ Run extra C++ CI label Mar 6, 2026
@vanshaj2023
Copy link
Author

Thanks for the review @kou
Will push the updated changes shortly.

@github-actions github-actions bot added awaiting changes Awaiting changes awaiting committer review Awaiting committer review and removed CI: Extra: C++ Run extra C++ CI awaiting review Awaiting review labels Mar 6, 2026
…t on MinGW

MinGW's __emutls implementation for C++ thread_local has known race
conditions during thread creation. When ThreadPool::LaunchWorkersUnlocked
creates a new worker thread, the thread writes to the thread_local
current_thread_pool_ variable. If __emutls hasn't finished initializing
TLS for the new thread, this dereferences an invalid pointer, causing
a segfault.

Replace thread_local with native Win32 TLS API (TlsAlloc/TlsGetValue/
TlsSetValue) on MinGW to bypass the buggy __emutls emulation.
Non-MinGW platforms continue using standard thread_local.

Also add a stress test (MultipleChunksParallelStress) that runs parallel
JSON reading 20 times to help expose intermittent threading races.
@vanshaj2023 vanshaj2023 force-pushed the fix-mingw-json-test-segfault branch from 62da992 to 1e0fba3 Compare March 6, 2026 09:07
@github-actions github-actions bot added awaiting review Awaiting review and removed awaiting committer review Awaiting committer review awaiting changes Awaiting changes labels Mar 6, 2026
@vanshaj2023 vanshaj2023 requested a review from kou March 6, 2026 09:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants