feat: Add global config API and optimize Parquet read thread conf#68
Conversation
…ation - Add global_config.h/cpp for managing global configurations - Change Parquet read thread configuration from thread count to boolean switch - Update related test cases to adapt to the new configuration
There was a problem hiding this comment.
Pull request overview
This pull request refactors the Parquet read thread configuration and introduces a global configuration API. The changes address the issue where thread configuration was modifying global Arrow settings at the reader level, which could cause problems in concurrent scenarios.
Changes:
- Replaced thread count configuration with a boolean toggle for using threads in Parquet reading
- Added new global_config.h/cpp module with API for managing Arrow CPU thread pool capacity
- Updated test cases to use the new configuration approach
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| src/paimon/format/parquet/parquet_format_defs.h | Changed Parquet read configuration from thread count to boolean use-threads flag |
| src/paimon/format/parquet/parquet_file_batch_reader.cpp | Simplified thread configuration logic to use boolean flag instead of setting thread pool capacity |
| src/paimon/format/parquet/parquet_file_batch_reader_test.cpp | Updated tests to use new configuration API and test boolean thread flag |
| src/paimon/common/global_config.cpp | Implemented wrapper functions for Arrow CPU thread pool configuration |
| include/paimon/global_config.h | Defined public API for managing global Arrow thread pool capacity |
| src/paimon/CMakeLists.txt | Added global_config.cpp to build system |
| include/paimon/api.h | Exported global_config.h as part of public API |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| #include "paimon/global_config.h" | ||
|
|
||
| #include "arrow/util/thread_pool.h" | ||
| #include "utils/arrow/status_utils.h" |
There was a problem hiding this comment.
The include path is incorrect. The correct path should be "paimon/common/utils/arrow/status_utils.h" (with "paimon/common" prefix) rather than "utils/arrow/status_utils.h". This inconsistency with other files in the codebase could lead to compilation errors.
| #include "utils/arrow/status_utils.h" | |
| #include "paimon/common/utils/arrow/status_utils.h" |
| /// Set the capacity of the arrow's global thread pool | ||
| /// This is a simple wrapper of arrow::SetCpuThreadPoolCapacity() | ||
| /// | ||
| /// Set the number of worker threads int the thread pool to which |
There was a problem hiding this comment.
Typographical error: "int" should be "in". The comment should read "Set the number of worker threads in the thread pool" rather than "int the thread pool".
| /// Set the number of worker threads int the thread pool to which | |
| /// Set the number of worker threads in the thread pool to which |
| /// Set the number of worker threads int the thread pool to which | ||
| /// Arrow dispatches various CPU-bound tasks. | ||
| /// | ||
| /// The current number is returned by GetCpuThreadPoolCapacity(). |
There was a problem hiding this comment.
The documentation at line 40 references "GetCpuThreadPoolCapacity()" but should reference "GetArrowCpuThreadPoolCapacity()" to match the actual function name in this API. This inconsistency could confuse users of the API.
| /// The current number is returned by GetCpuThreadPoolCapacity(). | |
| /// The current number is returned by GetArrowCpuThreadPoolCapacity(). |
| ASSERT_OK(SetArrowCpuThreadPoolCapacity(6)); | ||
| std::map<std::string, std::string> options = {{PARQUET_READ_USE_THREADS, "true"}}; | ||
| int32_t batch_size = 1024; | ||
| ASSERT_OK_AND_ASSIGN( | ||
| auto arrow_reader_properties, | ||
| ParquetFileBatchReader::CreateArrowReaderProperties(pool_, options, batch_size)); | ||
| ASSERT_EQ(arrow_reader_properties.use_threads(), true); | ||
| ASSERT_EQ(arrow::GetCpuThreadPoolCapacity(), 6); | ||
| ASSERT_EQ(GetArrowCpuThreadPoolCapacity(), 6); | ||
| } |
There was a problem hiding this comment.
The test modifies global state by calling SetArrowCpuThreadPoolCapacity(6) without restoring the original value afterward. This could affect other tests running in the same process. Consider saving the original capacity before modification and restoring it after the test, or in the TearDown method to avoid test interference.
…conf (alibaba#68)" This reverts commit d3bb3a9. The original commit moved arrow::SetCpuThreadPoolCapacity from libpaimon_parquet_file_format.so (direct call) to libpaimon.so (via paimon::SetArrowCpuThreadPoolCapacity wrapper). Since Arrow is statically linked into both .so files, each has its own CpuThreadPool singleton. Setting capacity through libpaimon.so never affects the singleton inside libpaimon_parquet_file_format.so, making thread control ineffective for Parquet reads.
Purpose
The current configuration approach modifies global settings at the reader level, which is not reasonable. It affects all code using the Arrow thread pool in the process and is not always effective in concurrent scenarios. Additionally, its default value is too small, impacting Parquet read performance under default settings.
Tests
Update related test cases to adapt to the new configuration
API and Format
Added a new global configuration module
Modified Parquet read configuration
Documentation
None