@fenfeng9
Contributor

closes #3687

Changes:

  • Make array_contains(list_col, value) use the existing LABEL_LIST scalar index.
  • DataFusion treats array_contains as an alias of array_has (often wrapped in an alias expr), so we unwrap that and map it to the LabelList index query.
  • Add a Python test and update the LabelList docs to mention array_has / array_contains.
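A minimal sketch of the mapping described above, written as a toy Python model rather than the actual Rust code in `expression.rs`. All class and function names here (`Column`, `Literal`, `ScalarFunction`, `Alias`, `LabelListQuery`, `visit_node`) are hypothetical stand-ins for illustration; they are not DataFusion or Lance APIs. It assumes `array_contains` has already been resolved to `array_has` by the time the expression is visited.

```python
from dataclasses import dataclass
from typing import Optional


# Toy stand-ins for expression nodes; illustrative only, not real APIs.
@dataclass
class Column:
    name: str


@dataclass
class Literal:
    value: object  # None models a SQL NULL


@dataclass
class ScalarFunction:
    name: str
    args: list


@dataclass
class Alias:
    expr: object
    name: str


@dataclass
class LabelListQuery:
    column: str
    needle: object


def visit_node(expr, indexed_columns) -> Optional[LabelListQuery]:
    """Return an index query if the expression can be pushed down, else None."""
    # Unwrap alias wrappers so the underlying function call is inspected.
    if isinstance(expr, Alias):
        return visit_node(expr.expr, indexed_columns)
    if not isinstance(expr, ScalarFunction) or len(expr.args) != 2:
        return None
    # array_contains is assumed to arrive here already named array_has.
    if expr.name != "array_has":
        return None
    col, needle = expr.args
    if not isinstance(col, Column) or col.name not in indexed_columns:
        return None
    # Do not push down NULL needles; fall back to non-index execution.
    if not isinstance(needle, Literal) or needle.value is None:
        return None
    return LabelListQuery(col.name, needle.value)
```

Returning `None` here stands for "no index query", so the caller would evaluate the filter without the index.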

@github-actions github-actions bot added the enhancement (New feature or request) and python labels on Jan 10, 2026
@fenfeng9
Contributor Author

PTAL @westonpace.

@fenfeng9
Contributor Author

fenfeng9 commented Jan 10, 2026

Discovered a corner-case data correctness bug in the LABEL_LIST index and filed #5682 for follow-up.

@codecov

codecov bot commented Jan 12, 2026

Codecov Report

❌ Patch coverage is 12.50000% with 14 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance-index/src/scalar/expression.rs | 12.50% | 14 Missing ⚠️ |

Member

@westonpace westonpace left a comment

Looks good. Thanks for picking this up!

Will merge in a day or so in case you want to address any comments.

# Include lists with NULL items to ensure NULL needle behavior matches
# non-index execution.
tbl = pa.table(
    {"labels": [["foo", "bar"], ["bar"], ["baz"], ["qux", None], [None], []]}
Member

Maybe also include an entry where the entire list is None?

Contributor Author

Updated.

Comment on lines 499 to 502
// Do not push down NULL needles.
if scalar.is_null() {
    return None;
}
Member

Is this because of #5682?

Contributor Author

Actually, this is independent of #5682. #5682 is about the index missing rows where a valid element and a NULL coexist (e.g., searching for 'foo' misses ['foo', None]). I will provide more details in #5682 later.
I've added a comment to explain the check. array_has_any and array_has_all don't have this semantic mismatch, since they natively follow membership logic.
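
To illustrate the mismatch behind the NULL-needle check, here is a small pure-Python sketch. It assumes that non-index `array_has` yields NULL (modeled as `None`) when the list or the needle is NULL, which is what this thread implies but which is not verified here against the engine. The helper names (`scan_array_has`, `membership`, `filter_rows`) are hypothetical and exist only for this example.

```python
# Sample rows mirroring the test data, plus a row where the whole list is NULL.
rows = [["foo", "bar"], ["bar"], ["baz"], ["qux", None], [None], [], None]


def scan_array_has(labels, needle):
    """Assumed non-index semantics: a NULL list or NULL needle yields NULL."""
    if labels is None or needle is None:
        return None
    return needle in labels


def membership(labels, needle):
    """Plain membership, the logic an index lookup would naturally follow."""
    return labels is not None and needle in labels


def filter_rows(pred, needle):
    # A SQL filter keeps a row only when the predicate is exactly TRUE.
    return [i for i, labels in enumerate(rows) if pred(labels, needle) is True]


# Non-NULL needle: both approaches agree.
print(filter_rows(scan_array_has, "bar"), filter_rows(membership, "bar"))
# NULL needle: the scan keeps nothing, while plain membership matches rows
# that contain a NULL element, so the two would disagree if pushed down.
print(filter_rows(scan_array_has, None), filter_rows(membership, None))
```

Under these assumptions, skipping pushdown for a NULL needle keeps indexed and non-indexed results identical.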

}
match expr {
    Expr::Between(between) => Ok(visit_between(between, index_info)),
    Expr::Alias(alias) => visit_node(alias.expr.as_ref(), index_info, depth),
Member

Nice catch.

if args.len() != 2 {
    return None;
}
if func.name() == "array_has" {
Member

How does this work for array_contains? Is DataFusion mapping that to array_has already?

If so, can we add a comment here mentioning that this branch is also going to be hit for array_contains?

Contributor Author

Added a comment.

@westonpace
Member

The build-no-lock failure should have been fixed by #5690.

@fenfeng9 fenfeng9 force-pushed the feat/array-contains-index-support branch from 04d8744 to 33558d5 Compare January 12, 2026 15:44
@fenfeng9 fenfeng9 requested a review from westonpace January 13, 2026 14:23
@fenfeng9
Contributor Author

PTAL @westonpace.

Labels

enhancement (New feature or request), python

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support array_contains in LABEL_LIST index

2 participants