Skip to content

Conversation

@ruochenj123
Copy link

@ruochenj123 ruochenj123 commented Jan 14, 2026

What problem does this PR solve?

Issue Number: #11

Type of Change

  • πŸ› Bug fix (non-breaking change which fixes an issue)
  • ✨ New feature (non-breaking change which adds functionality)
  • πŸš€ Performance improvement (optimization)
  • ⚠️ Breaking change (fix or feature that would cause existing functionality to change)
  • πŸ”¨ Refactoring (no logic changes)
  • πŸ”§ Build/CI or Infrastructure changes
  • πŸ“ Documentation only

Description

Currently HashJoin and Sort operations store data in row-based RowContainer, which incurs non-trivial layout conversion overhead. This PR introduces a hybrid storage design that keeps payload columns in their original columnar format while only storing keys in RowContainer, reducing this layout conversion overhead.

Main Changes

  1. HybridContainer

    • Separates key storage (in RowContainer) from payload storage (kept as RowVectorPtr).
    • Introduces encoded HybridRowId to reference payload rows.
    • HybridRowId encodes {containerId, rowId} to support multi-driver parallel execution in HashJoin.
  2. Multi-driver support in HashJoin

    • Each driver builds its own HybridContainer.
    • After table merge, the allContainers_ map enables cross-container payload extraction during the probe phase.
  3. Extraction optimizations

    • coalesceBatches() flattens multiple payload batches into a single contiguous batch to reduce TLB misses during extraction.
    • sortByContainerId() reorders rows by containerId before extraction to improve cache locality in multi-container scenarios.
    • Prefetching during extraction to hide memory latency and reduce data loading time.
    • isSingleContainer() provides a fast path that skips sorting overhead in the single-driver scenario.
  4. Configuration options

    • hybrid_join_enabled / hybrid_sort_enabled to opt in to hybrid execution.
    • hybrid_join_reorder_enabled to control row reordering (disabled in tests to preserve deterministic output).

Performance Impact

  • No Impact: This change does not affect the critical path (e.g., build system, doc, error handling).

  • Positive Impact: I have run benchmarks.

    Click to view Benchmark Results
    Paste your google-benchmark or TPC-H results here.
    Before: 10.5s
    After:   8.2s  (+20%)
    
  • Negative Impact: Explained below (e.g., trade-off for correctness).

Tested on bolt_tpch_benchmark with sf=10 iteration=5. The performances are almost the same.

+----------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
| Type     | q1   | q2   | q3   | q4   | q5   | q6   | q7   | q8   | q9   | q10  | q11  | q12  | q13  | q14  | q15  | q16  | q17  | q18  | q19  | q20  | q21  | q22  |
+----------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
| baseline | 856  | 287.4| 1096 | 2492 | 1374 | 453.6| 1090 | 1420 | 1990 | 1662 | 151  | 558.8| 1544 | 874  | 717.4| 353.6| 3816 | 2250 | 952  | 639.2| 2234 | 583.2|
| hybrid   | 882.8| 315  | 1100 | 2390 | 1370 | 450.4| 1078 | 1266 | 2072 | 1738 | 149.2| 631  | 1532 | 852.6| 698.6| 343.4| 4052 | 2062 | 918.4| 664.4| 2248 | 545.4|
+----------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+

Release Note

Please describe the changes in this PR

Release Note:

Release Note:
- Fixed a crash in `substr` when input is null.
- optimized `group by` performance by 20%.

Checklist (For Author)

  • I have added/updated unit tests (ctest).
  • I have verified the code with local build (Release/Debug).
  • I have run clang-format / linters.
  • (Optional) I have run Sanitizers (ASAN/TSAN) locally for complex C++ changes.
  • No need to test or manual test.

Breaking Changes

  • No

  • Yes (Description: ...)

    Click to view Breaking Changes
    Breaking Changes:
    - Description of the breaking change.
    - Possible solutions or workarounds.
    - Any other relevant information.
    

@CLAassistant
Copy link

CLAassistant commented Jan 14, 2026

CLA assistant check
All committers have signed the CLA.

@ruochenj123 ruochenj123 changed the title Hybrid layout design for HashJoin/Sort [WIP] Hybrid layout design for HashJoin/Sort Jan 14, 2026
@markjin1990 markjin1990 added the performance performance improvement needed label Jan 14, 2026
@markjin1990 markjin1990 requested a review from fzhedu January 14, 2026 21:35
@markjin1990 markjin1990 changed the title [WIP] Hybrid layout design for HashJoin/Sort Hybrid layout design for HashJoin/Sort Jan 20, 2026
<< ", spill enabled: " << spillEnabled()
<< ", maxHashTableSize = " << maxHashTableBucketCount_;
<< ", maxHashTableSize = " << maxHashTableBucketCount_
<< ", hybrid mode " << (hybridJoin_ ? "enabled" : "disbaled");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo disabled

if (hybridJoin_) {
BOLT_CHECK_LE(
driverId_,
255,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why hardcode limit to 255?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently we store a BIGINT (64 bits) of rowId where the top 8 bits represents the driverId and the remaining 56 bits represents the rowId for each driver. So the max # of driver it supports is 255. Maybe we can make it as a config.

const T* rawValues = flatChild->rawValues();
const uint64_t* rawNulls = flatChild->rawNulls();

constexpr vector_size_t kPrefetchDist = 16;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the magic number 16 fit for all the arch? e.g. x86 arm?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure. I tuned it on a x86_64 machine, k = 8-32 perform similarly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance performance improvement needed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants