Skip to content

Conversation

@wjones127
Copy link
Contributor

@wjones127 wjones127 commented Jan 6, 2026

BREAKING CHANGE: create_index now returns IndexMetadata of the new index in Rust and Java. Previously it returned nothing. (Python is unchanged, as it returns the Dataset itself, and changing that would be too disruptive.)

Defines the behavior of default index names, particularly for ones with mixed case, non-alphanumeric, and nested fields. I tried to align it with how it was being done before as much as possible. For example, fields with dashes like my-column would get my-column_idx. This helps libraries that were relying on a predicable algorithm.

However, I did notice we don't handle name collisions. So I added behavior for that. If there's already an existing index with that name, the default name gets a _2 added.

Finally, because we are choosing the name in the function, I made it so we return the IndexMetadata when you create the index in case you want to get back the name that was chosen.

@github-actions github-actions bot added enhancement New feature or request python labels Jan 6, 2026
@codecov
Copy link

codecov bot commented Jan 7, 2026

Codecov Report

❌ Patch coverage is 94.25287% with 5 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
rust/lance/src/dataset/scanner.rs 50.00% 0 Missing and 3 partials ⚠️
rust/lance/src/index/create.rs 97.53% 2 Missing ⚠️

📢 Thoughts on this report? Let us know!

@wjones127 wjones127 changed the title feat: define default index name behavior for nested field and special characters feat: define default index name and return IndexMetadata after building index Jan 7, 2026
@github-actions github-actions bot added the java label Jan 7, 2026
@wjones127 wjones127 changed the title feat: define default index name and return IndexMetadata after building index feat!: define default index name and return IndexMetadata after building index Jan 7, 2026
@wjones127 wjones127 force-pushed the feat/index-name-normalization branch from 208984f to 4d93fc1 Compare January 7, 2026 23:22
@wjones127 wjones127 force-pushed the feat/index-name-normalization branch from 4d93fc1 to 8a46b24 Compare January 7, 2026 23:42
@wjones127 wjones127 marked this pull request as ready for review January 8, 2026 00:51
///
/// Joins field names with `.` to create the base index name.
/// For example: `["meta-data", "user-id"]` -> `"meta-data.user-id"`
fn default_index_name(fields: &[&str]) -> String {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That function is exactly what we don't want, particularly for kebab-case. For backwards compatibility, we want user-id to format to:

user-id_idx

But with that function you are pointing at, user-id formats to

`user-id`_idx

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. My thought process was that, if the library relies on the name, it is basically trying to parse the fields from the index name. But in that case, if we don't backquote, then it would be an issue for columns with dots. But I guess we don't have column with dot case in the past anyway, so what about this:

  1. if field names contain dot, use format_field_path
  2. otherwise, join with dot

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be an issue for columns with dots

Why is it an issue for columns with dots?

I think it's fine if nested.column_idx can either be a column literally called nested.column or a field column inside of a struct nested. It's easier to type and remember than

`nested.column`_idx

In the case of collisions, we have the logic in place to add the _2.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine if nested.column_idx can either be a column literally called nested.column or a field column inside of a struct nested. It's easier to type and remember

My thought process is that, we are basically doing these to ensure backwards compatibility for places that rely on the index name. And if some process relies on the name and cannot break, it is basically trying to parse the fields from the index name. And for this case, basically it can no longer derive the right column for the case above because the name is ambiguous.

.execute()
.await
.unwrap();
assert_eq!(idx2.name, "b_idx_2");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's possible arguments to make this "1" instead of "2" - it might be worth checking that whatever we're doing is stylistically in line with other parts of our interfaces.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have any other interfaces that do this right now. _2 felt natural given the original is implicitly 1 (assuming 1-based indexing). It feels natural enough to say.

Although I don't care too much, because I think it's better if users just name their indexes something sensible instead.

let column_path = default_index_name(&names);
let base_name = format!("{column_path}_idx");
let mut candidate = base_name.clone();
let mut counter = 2; // Start with no suffix, then use _2, _3, ...
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we start to allow multiple indexes on the same column, I think we should also have the index type in the index name for information purpose, like col_idx_btree (trying to keep the original string and only add things in the end)

Another thing is that this will start to create a situation that concurrent index creation on the same column and same name should fail. We currently treat that as a delta index, but now I think we need to update the transaction model to differentiate these 2 cases with a flag like is_delta_idex

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we start to allow multiple indexes on the same column, I think we should also have the index type in the index name for information purpose

Yeah, I think we should do that. Should also balance with need for backwards compat. Unfortunately there are places in LanceDB where we hard code the expectation that the index name is {column}_idx. Hopefully we can fix that soon and it won't be an issue.

Another thing is that this will start to create a situation that concurrent index creation on the same column and same name should fail.

Yeah that would be good to handle better. We shouldn't mix index types for the same index name.

@wjones127 wjones127 requested review from jackye1995 and wkalt January 13, 2026 17:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants