feat(lumina): support null values in vector column during index building by lxy-9602 · Pull Request #310 · alibaba/paimon-cpp

lxy-9602 · 2026-05-28T01:47:54Z

Purpose

Support writing null vector values in LuminaGlobalIndex. Previously, any null in the vector column would cause AddBatch to fail. Now, null rows are automatically skipped during index building — their row IDs are not indexed, and they will never be recalled by vector search.

AddBatch: splits contiguous non-null rows into segments, skips null rows
Finish: returns empty metas when all rows are null
LuminaDataset: uses per-segment start IDs to generate correct non-contiguous vector IDs

Linked issue #5

Tests

LuminaGlobalIndexTest. TestWriteWithNullRows
LuminaGlobalIndexTest.TestWriteWithMultipleNullSegments
LuminaGlobalIndexTest.TestWriteWithAllNullRows
LuminaGlobalIndexTest.TestWriteWithNullAndFilter
GlobalIndexTest .TestWriteCommitScanReadIndexWithScore

API and Format

Documentation

Generative AI tooling

Generated-by: Claude-4.6-Opus

Copilot

Pull request overview

This PR adds support for null values in vector columns during Lumina global index building. Previously, any null entry in the list-typed vector column would cause AddBatch to fail. Now, the writer splits the input into contiguous non-null segments, skips nulls (so their row IDs are never indexed and thus never recalled), and propagates per-segment start IDs to LuminaDataset so that the resulting vector IDs match the original row positions even when there are gaps.

Changes:

LuminaIndexWriter::AddBatch now walks the list array and creates one sliced FloatArray per contiguous non-null run, recording the originating start ID for each segment; Finish short-circuits with empty metas when no rows were indexed.
LuminaDataset takes per-segment start_ids and seeds std::iota from start_ids_[cursor_] instead of a monotonically incremented internal id_, producing correct non-contiguous vector IDs.
Adds unit tests for the new behaviors (middle null, multiple null segments, all null, null + filter) and updates the integration test to include null rows and validate that null row IDs are not recalled.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File	Description
src/paimon/global_index/lumina/lumina_global_index.h	Adds `indexed_count_` and `array_start_ids_` members on `LuminaIndexWriter`.
src/paimon/global_index/lumina/lumina_global_index.cpp	Implements null-skipping segmentation in `AddBatch`, empty-meta short-circuit in `Finish`, and per-segment start IDs in `LuminaDataset`.
src/paimon/global_index/lumina/lumina_global_index_test.cpp	Exposes test fixture members as `protected`; adds four tests covering null handling.
test/inte/global_index_test.cpp	Extends the end-to-end test with null rows, updates ranges/scores, and asserts null IDs are absent from the search bitmap.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

jerry-024 · 2026-05-28T03:45:55Z

Nice work! The null-skipping approach is clean and compatible with the Java implementation.

One suggestion for test coverage:

Consider adding a test that exercises multiple AddBatch calls where both batches contain null rows. This would verify that count_ accumulation across batches produces correct global vector IDs. For example:

Batch 1: [vec0, null, vec2] → expected IDs: {0, 2}
Batch 2: [null, vec4, vec5] → expected IDs: {4, 5} (since count_=3 after batch 1)

The current tests cover single-batch null patterns well, but cross-batch ID correctness is the most likely place a subtle regression could appear in future refactoring.

jerry-024

+1

lxy-9602 and others added 2 commits May 28, 2026 09:41

feat(lumina): support null values in vector column during index building

c01f0df

Merge branch 'main' into lumina-skip-null-data

6d5a693

zjw1111 requested a review from Copilot May 28, 2026 01:50

Copilot started reviewing on behalf of zjw1111 May 28, 2026 01:50 View session

Copilot AI reviewed May 28, 2026

View reviewed changes

add multi-batch test

d2a2e7d

jerry-024 approved these changes May 28, 2026

View reviewed changes

lszskye approved these changes May 28, 2026

View reviewed changes

lxy-9602 added 2 commits May 28, 2026 17:59

Merge branch 'main' into lumina-skip-null-data

0264e59

Merge branch 'main' into lumina-skip-null-data

d071359

lxy-9602 merged commit 32e919c into alibaba:main May 29, 2026
9 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(lumina): support null values in vector column during index building#310

feat(lumina): support null values in vector column during index building#310
lxy-9602 merged 5 commits into
alibaba:mainfrom
lxy-9602:lumina-skip-null-data

lxy-9602 commented May 28, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

jerry-024 commented May 28, 2026

Uh oh!

jerry-024 left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

lxy-9602 commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose

Tests

API and Format

Documentation

Generative AI tooling

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

jerry-024 commented May 28, 2026

Uh oh!

jerry-024 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

lxy-9602 commented May 28, 2026 •

edited

Loading