fix: Restore direct Arrow thread pool control inside Parquet format library #284
Merged
…conf (alibaba#68)" This reverts commit d3bb3a9. The original commit moved arrow::SetCpuThreadPoolCapacity from libpaimon_parquet_file_format.so (direct call) to libpaimon.so (via paimon::SetArrowCpuThreadPoolCapacity wrapper). Since Arrow is statically linked into both .so files, each has its own CpuThreadPool singleton. Setting capacity through libpaimon.so never affects the singleton inside libpaimon_parquet_file_format.so, making thread control ineffective for Parquet reads.
Purpose
Linked issue: #68
Motivation
When Arrow is statically linked into multiple shared libraries (`libpaimon.so` and `libpaimon_parquet_file_format.so`), each `.so` gets its own copy of Arrow's `CpuThreadPool` singleton.

Commit d3bb3a9 introduced `paimon::SetArrowCpuThreadPoolCapacity()` in `libpaimon.so` as a wrapper around `arrow::SetCpuThreadPoolCapacity()`, and removed the direct `arrow::SetCpuThreadPoolCapacity()` call from `parquet_file_batch_reader.cpp` (which lives in `libpaimon_parquet_file_format.so`). As a result, setting the thread pool capacity through the wrapper only affects the singleton inside `libpaimon.so`, never the one inside `libpaimon_parquet_file_format.so`, making Parquet read thread control completely ineffective.

On a 96-core machine, this caused Arrow to spawn 96 CpuThreadPool workers + 8 IOThreadPool workers inside `libpaimon_parquet_file_format.so`, regardless of any capacity setting by the user.

Changes
This PR reverts commit d3bb3a9 to restore the original behavior where `parquet_file_batch_reader.cpp` directly calls `arrow::SetCpuThreadPoolCapacity()`, ensuring the call targets the correct singleton within `libpaimon_parquet_file_format.so`.

Tests
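For illustration only, the per-library singleton behavior described under Motivation can be modeled in a single translation unit. This is a simulation, not code from this PR, and it does not use Arrow: `FakeCpuThreadPool`, the `libpaimon_copy` and `libparquet_copy` namespaces, `paimon_sim`, and `ParquetReaderCapacity()` are all hypothetical names, with each namespace standing in for the separate Arrow copy statically linked into one `.so`. The default of 96 is an assumption that mirrors the 96-core machine mentioned above.

```cpp
// Stand-in for Arrow's CPU thread pool; NOT the real Arrow class.
class FakeCpuThreadPool {
 public:
  explicit FakeCpuThreadPool(int capacity) : capacity_(capacity) {}
  void SetCapacity(int n) { capacity_ = n; }
  int capacity() const { return capacity_; }

 private:
  int capacity_;
};

// Assumed default: one worker per core on a 96-core machine.
constexpr int kDefaultCapacity = 96;

// Models the Arrow copy statically linked into libpaimon.so.
namespace libpaimon_copy {
inline FakeCpuThreadPool& Pool() {
  static FakeCpuThreadPool pool(kDefaultCapacity);  // singleton #1
  return pool;
}
inline void SetCpuThreadPoolCapacity(int n) { Pool().SetCapacity(n); }
}  // namespace libpaimon_copy

// Models the Arrow copy statically linked into
// libpaimon_parquet_file_format.so.
namespace libparquet_copy {
inline FakeCpuThreadPool& Pool() {
  static FakeCpuThreadPool pool(kDefaultCapacity);  // singleton #2
  return pool;
}
inline void SetCpuThreadPoolCapacity(int n) { Pool().SetCapacity(n); }
}  // namespace libparquet_copy

// Models the wrapper from d3bb3a9: it lives in libpaimon.so, so it can
// only reach singleton #1.
namespace paimon_sim {
inline void SetArrowCpuThreadPoolCapacity(int n) {
  libpaimon_copy::SetCpuThreadPoolCapacity(n);
}
}  // namespace paimon_sim

// Capacity that a Parquet read inside the format library would observe.
inline int ParquetReaderCapacity() { return libparquet_copy::Pool().capacity(); }
```

In this model, calling `paimon_sim::SetArrowCpuThreadPoolCapacity(4)` changes only the `libpaimon_copy` pool, and `ParquetReaderCapacity()` still reports the default. Calling `libparquet_copy::SetCpuThreadPoolCapacity(4)` directly, which corresponds to the behavior the revert restores, is what changes the value the reader sees.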
API and Format
Documentation
Generative AI tooling