feat(lumina): support null values in vector column during index building #310
Pull request overview
This PR adds support for null values in vector columns during Lumina global index building. Previously, any null entry in the list-typed vector column would cause AddBatch to fail. Now, the writer splits the input into contiguous non-null segments, skips nulls (so their row IDs are never indexed and thus never recalled), and propagates per-segment start IDs to LuminaDataset so that the resulting vector IDs match the original row positions even when there are gaps.
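The segmentation step described above can be sketched as follows. This is a minimal illustration, not the PR's code: the `Segment` struct, the `SplitNonNullSegments` helper, and the use of a plain `std::vector<bool>` in place of the Arrow validity bitmap are all assumptions made to keep the example self-contained.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical segment descriptor; the PR's actual types are not shown in this page.
struct Segment {
    int64_t start_id;  // row ID of the first non-null row in the run
    int64_t length;    // number of consecutive non-null rows
};

// Splits a batch into contiguous non-null runs.
// is_valid[i] is true when row i holds a vector; base_id is the row ID of row 0.
std::vector<Segment> SplitNonNullSegments(const std::vector<bool>& is_valid,
                                          int64_t base_id) {
    assert(base_id >= 0);
    std::vector<Segment> segments;
    int64_t run_start = -1;  // -1 means "not currently inside a run"
    const int64_t n = static_cast<int64_t>(is_valid.size());
    for (int64_t i = 0; i < n; ++i) {
        if (is_valid[i]) {
            if (run_start < 0) run_start = i;  // a new run begins here
        } else if (run_start >= 0) {
            // A null row closes the current run; the null itself is skipped.
            segments.push_back({base_id + run_start, i - run_start});
            run_start = -1;
        }
    }
    // Close a run that extends to the end of the batch.
    if (run_start >= 0) segments.push_back({base_id + run_start, n - run_start});
    return segments;
}
```

An all-null batch yields no segments at all, which is the case where `Finish` is described as returning empty metas.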
Changes:
- LuminaIndexWriter::AddBatch now walks the list array and creates one sliced FloatArray per contiguous non-null run, recording the originating start ID for each segment; Finish short-circuits with empty metas when no rows were indexed.
- LuminaDataset takes per-segment start_ids and seeds std::iota from start_ids_[cursor_] instead of a monotonically incremented internal id_, producing correct non-contiguous vector IDs.
- Adds unit tests for the new behaviors (middle null, multiple null segments, all null, null + filter) and updates the integration test to include null rows and validate that null row IDs are not recalled.
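To illustrate why seeding `std::iota` from a per-segment start ID preserves the original row positions, here is a small sketch. `GenerateVectorIds` and its parameters are hypothetical names for illustration; the real `LuminaDataset` interface is not shown in this page and may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical helper: emits the vector IDs for a list of segments.
// start_ids[k] is the row ID of the first row in segment k, and
// lengths[k] is the number of rows in that segment.
std::vector<int64_t> GenerateVectorIds(const std::vector<int64_t>& start_ids,
                                       const std::vector<int64_t>& lengths) {
    assert(start_ids.size() == lengths.size());
    std::vector<int64_t> ids;
    for (std::size_t cursor = 0; cursor < start_ids.size(); ++cursor) {
        const std::size_t offset = ids.size();
        ids.resize(offset + static_cast<std::size_t>(lengths[cursor]));
        // Seed from the segment's own start ID rather than a running counter,
        // so gaps left by skipped null rows remain in the ID sequence.
        std::iota(ids.begin() + static_cast<std::ptrdiff_t>(offset), ids.end(),
                  start_ids[cursor]);
    }
    return ids;
}
```

With segments starting at rows 0, 3, and 6 and lengths 2, 1, and 2, the IDs produced are 0, 1, 3, 6, 7. A single monotonically incremented counter would instead produce 0 through 4 and misattribute every row after the first null.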
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/paimon/global_index/lumina/lumina_global_index.h | Adds indexed_count_ and array_start_ids_ members on LuminaIndexWriter. |
| src/paimon/global_index/lumina/lumina_global_index.cpp | Implements null-skipping segmentation in AddBatch, empty-meta short-circuit in Finish, and per-segment start IDs in LuminaDataset. |
| src/paimon/global_index/lumina/lumina_global_index_test.cpp | Exposes test fixture members as protected; adds four tests covering null handling. |
| test/inte/global_index_test.cpp | Extends the end-to-end test with null rows, updates ranges/scores, and asserts null IDs are absent from the search bitmap. |
Nice work! The null-skipping approach is clean and compatible with the Java implementation. One suggestion for test coverage: Consider adding a test that exercises multiple
The current tests cover single-batch null patterns well, but cross-batch ID correctness is the most likely place a subtle regression could appear in future refactoring.
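A rough sketch of the property such a cross-batch check would assert is shown below. `RowIdTracker` is a hypothetical stand-in, not the real `LuminaIndexWriter`, and it assumes that the row ID base advances by the full batch length, nulls included; that assumption should be confirmed against the actual writer before adapting this into a test.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a writer that records which row IDs get indexed.
class RowIdTracker {
public:
    // is_valid[i] is true when row i of the batch holds a non-null vector.
    void AddBatch(const std::vector<bool>& is_valid) {
        for (std::size_t i = 0; i < is_valid.size(); ++i) {
            if (is_valid[i]) {
                indexed_ids_.push_back(next_row_id_ + static_cast<int64_t>(i));
            }
        }
        // Assumption: null rows still consume a row ID, so the base
        // advances by the whole batch length, not by the indexed count.
        next_row_id_ += static_cast<int64_t>(is_valid.size());
    }

    const std::vector<int64_t>& indexed_ids() const { return indexed_ids_; }

private:
    int64_t next_row_id_ = 0;
    std::vector<int64_t> indexed_ids_;
};

// Two batches, each containing nulls, including one at a batch boundary.
void CheckCrossBatchIds() {
    RowIdTracker tracker;
    tracker.AddBatch({true, false, true});   // rows 0..2, row 1 is null
    tracker.AddBatch({false, true, true});   // rows 3..5, row 3 is null
    const std::vector<int64_t> expected = {0, 2, 4, 5};
    assert(tracker.indexed_ids() == expected);
}
```

The boundary case matters because a run that ends with one batch and a null that opens the next are handled by different calls, so an off-by-one in the carried base ID would only show up with more than one batch.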
Purpose
Support writing null vector values in LuminaGlobalIndex. Previously, any null in the vector column would cause AddBatch to fail. Now, null rows are automatically skipped during index building — their row IDs are not indexed, and they will never be recalled by vector search.
- AddBatch: splits contiguous non-null rows into segments, skips null rows
- Finish: returns empty metas when all rows are null
- LuminaDataset: uses per-segment start IDs to generate correct non-contiguous vector IDs

Linked issue #5
Tests
LuminaGlobalIndexTest.TestWriteWithNullRows
LuminaGlobalIndexTest.TestWriteWithMultipleNullSegments
LuminaGlobalIndexTest.TestWriteWithAllNullRows
LuminaGlobalIndexTest.TestWriteWithNullAndFilter
GlobalIndexTest.TestWriteCommitScanReadIndexWithScore
API and Format
Documentation
Generative AI tooling
Generated-by: Claude-4.6-Opus