@Ashutosh0x

This PR fixes a critical Use-After-Free race condition in DirectBufferedInput and TableScan where AsyncThreadCtx tracking was initiated too late.

Problem

The AsyncThreadCtx counter was being incremented (in()) inside the background task after offloading to the executor. This allowed the main thread to complete its work and return from TableScan::close() (seeing a count of 0), leading to the destruction of the context while tasks were still in the queue.
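The window described above can be reproduced deterministically with a small sketch. Everything here is a hypothetical stand-in rather than Bolt code: `Ctx` plays the role of AsyncThreadCtx, and a manually drained queue plays the role of the executor, so that the "close() runs before the task starts" ordering is forced instead of left to thread timing.

```cpp
#include <atomic>
#include <cassert>
#include <deque>
#include <functional>

// Hypothetical stand-in for AsyncThreadCtx: counts in-flight async tasks.
struct Ctx {
  std::atomic<int> inFlight{0};
  void in() { inFlight.fetch_add(1); }
  void out() { inFlight.fetch_sub(1); }
};

// Late tracking: in() only runs once the queued task starts executing.
// Returns the count that a close() call would observe while the task
// is still sitting in the queue.
inline int countSeenByCloseWithLateTracking() {
  Ctx ctx;
  // Hypothetical executor: a queue drained by hand for determinism.
  std::deque<std::function<void()>> queue;

  queue.push_back([&ctx] {
    ctx.in();   // too late: the task is already queued by now
    /* ... load data ... */
    ctx.out();
  });

  // close() would run here, before the executor picks the task up.
  const int seen = ctx.inFlight.load();

  // Drain the queue so the sketch finishes cleanly.
  while (!queue.empty()) {
    auto task = std::move(queue.front());
    queue.pop_front();
    task();
  }
  assert(ctx.inFlight.load() == 0);
  return seen;
}
```

In this sketch the observed count is 0 even though one task is pending, which is the condition under which the context could be destroyed too early.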

Solution

  • Refactored AsyncLoadHolder in DirectBufferedInput.h to use RAII for AsyncThreadCtx tracking.
  • Ensured in() is called on the producer thread before task offloading.
  • Added RAII-based tracking (in()/out()) to the TableScan split preloader.
  • Removed redundant manual tracking from DirectBufferedInput.cpp.
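A minimal sketch of the approach in the list above, under stated assumptions: `Ctx` and its nested `Guard` are hypothetical stand-ins for AsyncThreadCtx and its guard (the real class layout may differ), and the executor is again a hand-drained queue. The guard calls in() on the producer side when it is constructed and out() when the last owner releases it, so the count stays non-zero for as long as the task is queued.

```cpp
#include <atomic>
#include <cassert>
#include <deque>
#include <functional>
#include <memory>
#include <utility>

// Hypothetical stand-in for AsyncThreadCtx with an RAII guard.
struct Ctx {
  std::atomic<int> inFlight{0};
  void in() { inFlight.fetch_add(1); }
  void out() { inFlight.fetch_sub(1); }

  class Guard {
   public:
    explicit Guard(Ctx* ctx) : ctx_(ctx) {
      if (ctx_ != nullptr) ctx_->in();
    }
    Guard(Guard&& other) noexcept
        : ctx_(std::exchange(other.ctx_, nullptr)) {}
    Guard(const Guard&) = delete;
    Guard& operator=(const Guard&) = delete;
    ~Guard() {
      if (ctx_ != nullptr) ctx_->out();
    }

   private:
    Ctx* ctx_;
  };
};

// Returns {count while the task is queued, count after it has run}.
inline std::pair<int, int> countsBeforeAndAfterDrain() {
  Ctx ctx;
  std::deque<std::function<void()>> queue;  // hypothetical executor

  // in() happens here, on the producer side, before offloading.
  // shared_ptr is used because std::function needs a copyable callable.
  auto guard = std::make_shared<Ctx::Guard>(&ctx);
  queue.push_back([guard] { /* ... load data ... */ });
  guard.reset();  // the queued task now holds the only reference

  const int before = ctx.inFlight.load();

  while (!queue.empty()) {
    auto task = std::move(queue.front());
    queue.pop_front();
    task();
  }  // each task is destroyed here, releasing its guard

  const int after = ctx.inFlight.load();
  return {before, after};
}
```

With tracking moved ahead of the offload, a close() that checks the count while the task is still queued would see 1 and wait, instead of seeing 0.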

cc @yangzhengguo @zachary-blanco @fzhedu

Refactor AsyncLoadHolder to use RAII for AsyncThreadCtx tracking. This ensures tracking increments happen on the producer thread before task offloading, preventing destruction of context while tasks are still in queue.
@CLAassistant

CLAassistant commented Jan 24, 2026

CLA assistant check
All committers have signed the CLA.

@Ashutosh0x
Author

Hello @yangzhengguo @zachary-blanco @fzhedu, I have submitted this PR to fix the Use-After-Free issue #147. Looking forward to your review!

@Ashutosh0x Ashutosh0x changed the title from "Fix Use-After-Free in Async IO tracking (#147)" to "Fix UAF in scan preloading and add OOM protection" Jan 24, 2026
@frankobe frankobe requested a review from fzhedu January 26, 2026 02:21
@frankobe
Collaborator

@Ashutosh0x Thanks for your first contribution to Bolt! I just triggered the CI workflow, and the format check is failing.

@fzhedu, please review and validate the fix when you get time.

@Ashutosh0x
Author

Ashutosh0x commented Jan 26, 2026

Hello @frankobe, I have fixed the formatting issues and corrected a typo in the AsyncLoadHolder constructor. I also refactored the AsyncThreadCtx tracking to use a more robust RAII-based Guard class to ensure consistent resource tracking even if task offloading fails. The format-check is now passing, and other tests are in progress. Looking forward to your review!
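To illustrate the "even if task offloading fails" point, here is a small sketch, not the code from this PR. `Ctx`, `Ctx::Guard`, and `RejectingExecutor` are hypothetical names; the executor always throws from add() to simulate a rejected task. Because the guard owns the increment, stack unwinding releases it and the count returns to zero without any manual out() call.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <memory>
#include <stdexcept>

// Hypothetical stand-in for AsyncThreadCtx with an RAII guard.
struct Ctx {
  std::atomic<int> inFlight{0};
  void in() { inFlight.fetch_add(1); }
  void out() { inFlight.fetch_sub(1); }

  class Guard {
   public:
    explicit Guard(Ctx* ctx) : ctx_(ctx) { ctx_->in(); }
    Guard(const Guard&) = delete;
    Guard& operator=(const Guard&) = delete;
    ~Guard() { ctx_->out(); }

   private:
    Ctx* ctx_;
  };
};

// Hypothetical executor that rejects every task, e.g. a full queue.
struct RejectingExecutor {
  void add(std::function<void()> /*task*/) {
    throw std::runtime_error("executor rejected the task");
  }
};

// Returns the in-flight count left behind after a failed offload.
inline int inFlightAfterFailedOffload() {
  Ctx ctx;
  RejectingExecutor executor;
  try {
    auto guard = std::make_shared<Ctx::Guard>(&ctx);  // in() on producer
    assert(ctx.inFlight.load() == 1);
    executor.add([guard] { /* ... load data ... */ });
  } catch (const std::runtime_error&) {
    // Unwinding destroyed both the local guard handle and the copy
    // captured by the rejected task, so out() has already run.
  }
  return ctx.inFlight.load();
}
```

With manual in()/out() calls, the same failure path would need an explicit out() in the error handler, which is easy to forget.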

int32_t prefetchMemoryPercent_{30};
connector::AsyncThreadCtx* asyncThreadCtx;
uint64_t preloadBytesLimit_{0};
connector::AsyncThreadCtx::Guard inGuard;
Collaborator

Suggested change
- connector::AsyncThreadCtx::Guard inGuard;
+ connector::AsyncThreadCtx::Guard inGuard_;

Collaborator

A trailing underscore is the convention for member variables. Please add one here for consistency.

Author

Thanks! Fixed the naming convention (added trailing underscore) and also fixed a potential memory tracking leak I noticed in the same class.

@yangzhg yangzhg added the good first issue Good for newcomers label Jan 26, 2026
@Ashutosh0x
Author

Thanks for the review @yangzhg! I have replaced std::lock_guard with std::scoped_lock in bolt/connectors/Connector.h as suggested. I also took the opportunity to replace a few unnecessary std::unique_lock usages with std::scoped_lock as well.
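For readers unfamiliar with the difference: std::scoped_lock (C++17) behaves like std::lock_guard for a single mutex, and can additionally lock several mutexes at once with built-in deadlock avoidance. The sketch below is illustrative only; `Counter`, `transfer`, and `demoTransfer` are hypothetical names and are not taken from bolt/connectors/Connector.h. Note that std::unique_lock is still required where the lock must be released and reacquired, for example with std::condition_variable::wait.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical mutex-protected counter.
struct Counter {
  std::mutex mutex;
  int value{0};

  void add(int delta) {
    // Single mutex: a drop-in replacement for std::lock_guard.
    std::scoped_lock lock(mutex);
    value += delta;
  }
};

// Locks both mutexes together; std::scoped_lock acquires them with a
// deadlock-avoidance algorithm regardless of argument order.
// The two counters must be distinct objects.
inline void transfer(Counter& from, Counter& to, int amount) {
  std::scoped_lock lock(from.mutex, to.mutex);
  from.value -= amount;
  to.value += amount;
}

// Moves 4 out of 10 from one counter to another and returns the
// destination value.
inline int demoTransfer() {
  Counter a;
  Counter b;
  a.add(10);
  transfer(a, b, 4);
  assert(a.value == 6);
  return b.value;
}
```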

@yangzhg
Collaborator

yangzhg commented Jan 26, 2026

The unit test failed:

[ RUN      ] TableScanTest.preloadingSplitClose
E20260126 07:52:15.627036 123368832440640 Exceptions.h:82] Line: /__w/bolt/bolt/bolt/exec/tests/utils/QueryAssertions.cpp:1435, Function:readCursor, Expression:  Failed to wait for task to complete after 5.00s, task: {Task test_cursor_175 (test_cursor_175)
Plan:
-- TableScan[table: hive_table] -> c0:BIGINT, c1:INTEGER, c2:SMALLINT, c3:REAL, c4:DOUBLE, c5:VARCHAR, c6:TINYINT

drivers:
{Driver.0.0: running {Operators: TableScan[0] 0, CallbackSink[N/A] 1}}
}, Source: RUNTIME, ErrorCode: INVALID_STATE
unknown file: Failure
C++ exception with description "Exception: BoltRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: Failed to wait for task to complete after 5.00s, task: {Task test_cursor_175 (test_cursor_175)
Plan:
-- TableScan[table: hive_table] -> c0:BIGINT, c1:INTEGER, c2:SMALLINT, c3:REAL, c4:DOUBLE, c5:VARCHAR, c6:TINYINT

drivers:
{Driver.0.0: running {Operators: TableScan[0] 0, CallbackSink[N/A] 1}}
}
Retriable: False
Function: readCursor
File: /__w/bolt/bolt/bolt/exec/tests/utils/QueryAssertions.cpp
Line: 1435
Stack trace:
# 0  _ZN9bytedance4bolt7process10StackTraceC1Ei
# 1  _ZN9bytedance4bolt13BoltExceptionC1EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bNS1_4TypeES7_
# 2  _ZN9bytedance4bolt6detail13boltCheckFailINS0_16BoltRuntimeErrorERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEEvRKNS1_17BoltCheckFailArgsET0_
# 3  _ZN9bytedance4bolt4exec4test10readCursorERKNS2_16CursorParametersESt8functionIFvPNS1_4TaskEEEm
# 4  _ZN9bytedance4bolt4exec4test18AssertQueryBuilder10readCursorEv
# 5  _ZN9bytedance4bolt4exec4test18AssertQueryBuilder13assertResultsERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKSt8optionalISt6vectorIjSaIjEEE
# 6  _ZN13TableScanTest11assertQueryERKSt10shared_ptrIKN9bytedance4bolt4core8PlanNodeEERKSt6vectorIS0_INS2_4exec4test12TempFilePathEESaISD_EERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEi
# 7  _ZN39TableScanTest_preloadingSplitClose_Test8TestBodyEv
# 8  _ZN7testing8internal35HandleExceptionsInMethodIfSupportedINS_4TestEvEET0_PT_MS4_FS3_vEPKc
# 9  _ZN7testing4Test3RunEv
# 10 _ZN7testing8TestInfo3RunEv
# 11 _ZN7testing9TestSuite3RunEv
# 12 _ZN7testing8internal12UnitTestImpl11RunAllTestsEv
# 13 _ZN7testing8UnitTest3RunEv
# 14 main
# 15 0x0000000000000000
# 16 __libc_start_main
# 17 _start
" thrown in the test body.

- Rename inGuard to inGuard_ for consistency
- Fix deadlock in TableScanTest.preloadingSplitClose by unblocking executor
- Fix memory tracking leak in AsyncLoadHolder
@Ashutosh0x
Author

I've investigated the failure in TableScanTest.preloadingSplitClose. It was a deadlock caused by TableScan::close() waiting for background preloads while the executor threads were blocked by the test itself. I've updated the test to safely unblock the executor in a background thread, ensuring it can finish successfully even with the new safety checks in place. The naming conventions for member variables have also been updated, and I fixed a minor memory tracking leak in AsyncLoadHolder.
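The shape of that deadlock and its workaround can be sketched as follows. This is not the actual test change: `Ctx`, `waitForIdle`, and the promise-based gate are hypothetical stand-ins for AsyncThreadCtx, the wait inside TableScan::close(), and the mechanism the test uses to hold executor threads blocked. The point is that once close() waits for in-flight work, whatever releases the blocked worker has to run on a different thread than the one calling close().

```cpp
#include <cassert>
#include <condition_variable>
#include <future>
#include <mutex>
#include <thread>

// Hypothetical stand-in for AsyncThreadCtx with a blocking wait.
struct Ctx {
  std::mutex mutex;
  std::condition_variable cv;
  int inFlight{0};

  void in() {
    std::scoped_lock lock(mutex);
    ++inFlight;
  }
  void out() {
    {
      std::scoped_lock lock(mutex);
      --inFlight;
    }
    cv.notify_all();
  }
  // Stands in for the wait performed by close().
  void waitForIdle() {
    std::unique_lock<std::mutex> lock(mutex);
    cv.wait(lock, [this] { return inFlight == 0; });
  }
};

inline bool closeCompletesOnceExecutorIsUnblocked() {
  Ctx ctx;
  std::promise<void> gate;  // the test keeps the worker blocked on this
  std::shared_future<void> gateOpen = gate.get_future().share();

  ctx.in();  // tracked on the producer side before offloading
  std::thread worker([&ctx, gateOpen] {
    gateOpen.wait();  // executor thread blocked by the test
    ctx.out();
  });

  // Opening the gate only after waitForIdle() returned would never
  // happen, because waitForIdle() needs the worker to finish first.
  // Opening it from a background thread breaks that cycle.
  std::thread unblocker([&gate] { gate.set_value(); });

  ctx.waitForIdle();
  unblocker.join();
  worker.join();
  return ctx.inFlight == 0;
}
```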

4 participants