Skip to content

feat(lumina): support null values in vector column during index building#310

Merged
lxy-9602 merged 5 commits into
alibaba:mainfrom
lxy-9602:lumina-skip-null-data
May 29, 2026
Merged

feat(lumina): support null values in vector column during index building#310
lxy-9602 merged 5 commits into
alibaba:mainfrom
lxy-9602:lumina-skip-null-data

Conversation

@lxy-9602
Copy link
Copy Markdown
Collaborator

@lxy-9602 lxy-9602 commented May 28, 2026

Purpose

Support writing null vector values in LuminaGlobalIndex. Previously, any null in the vector column would cause AddBatch to fail. Now, null rows are automatically skipped during index building — their row IDs are not indexed, and they will never be recalled by vector search.

  • AddBatch: splits contiguous non-null rows into segments, skips null rows
  • Finish: returns empty metas when all rows are null
  • LuminaDataset: uses per-segment start IDs to generate correct non-contiguous vector IDs

Linked issue #5

Tests

LuminaGlobalIndexTest. TestWriteWithNullRows
LuminaGlobalIndexTest.TestWriteWithMultipleNullSegments
LuminaGlobalIndexTest.TestWriteWithAllNullRows
LuminaGlobalIndexTest.TestWriteWithNullAndFilter
GlobalIndexTest .TestWriteCommitScanReadIndexWithScore

API and Format

Documentation

Generative AI tooling

Generated-by: Claude-4.6-Opus

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds support for null values in vector columns during Lumina global index building. Previously, any null entry in the list-typed vector column would cause AddBatch to fail. Now, the writer splits the input into contiguous non-null segments, skips nulls (so their row IDs are never indexed and thus never recalled), and propagates per-segment start IDs to LuminaDataset so that the resulting vector IDs match the original row positions even when there are gaps.

Changes:

  • LuminaIndexWriter::AddBatch now walks the list array and creates one sliced FloatArray per contiguous non-null run, recording the originating start ID for each segment; Finish short-circuits with empty metas when no rows were indexed.
  • LuminaDataset takes per-segment start_ids and seeds std::iota from start_ids_[cursor_] instead of a monotonically incremented internal id_, producing correct non-contiguous vector IDs.
  • Adds unit tests for the new behaviors (middle null, multiple null segments, all null, null + filter) and updates the integration test to include null rows and validate that null row IDs are not recalled.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File Description
src/paimon/global_index/lumina/lumina_global_index.h Adds indexed_count_ and array_start_ids_ members on LuminaIndexWriter.
src/paimon/global_index/lumina/lumina_global_index.cpp Implements null-skipping segmentation in AddBatch, empty-meta short-circuit in Finish, and per-segment start IDs in LuminaDataset.
src/paimon/global_index/lumina/lumina_global_index_test.cpp Exposes test fixture members as protected; adds four tests covering null handling.
test/inte/global_index_test.cpp Extends the end-to-end test with null rows, updates ranges/scores, and asserts null IDs are absent from the search bitmap.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@jerry-024
Copy link
Copy Markdown

Nice work! The null-skipping approach is clean and compatible with the Java implementation.

One suggestion for test coverage:

Consider adding a test that exercises multiple AddBatch calls where both batches contain null rows. This would verify that count_ accumulation across batches produces correct global vector IDs. For example:

  • Batch 1: [vec0, null, vec2] → expected IDs: {0, 2}
  • Batch 2: [null, vec4, vec5] → expected IDs: {4, 5} (since count_=3 after batch 1)

The current tests cover single-batch null patterns well, but cross-batch ID correctness is the most likely place a subtle regression could appear in future refactoring.

Copy link
Copy Markdown

@jerry-024 jerry-024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@lxy-9602 lxy-9602 merged commit 32e919c into alibaba:main May 29, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants