-
Notifications
You must be signed in to change notification settings - Fork 519
feat!: define default index name and return IndexMetadata after building index #5645
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat!: define default index name and return IndexMetadata after building index #5645
Conversation
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
208984f to
4d93fc1
Compare
🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
4d93fc1 to
8a46b24
Compare
| /// | ||
| /// Joins field names with `.` to create the base index name. | ||
| /// For example: `["meta-data", "user-id"]` -> `"meta-data.user-id"` | ||
| fn default_index_name(fields: &[&str]) -> String { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That function is exactly what we don't want, particularly for kebab-case. For backwards compatibility, we want user-id to format to:
user-id_idx
But with that function you are pointing at, user-id formats to
`user-id`_idx
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. My thought process was that, if the library relies on the name, it is basically trying to parse the fields from the index name. But in that case, if we don't backquote, then it would be an issue for columns with dots. But I guess we don't have column with dot case in the past anyway, so what about this:
- if field names contain dot, use format_field_path
- otherwise, join with dot
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it would be an issue for columns with dots
Why is it an issue for columns with dots?
I think it's fine if nested.column_idx can either be a column literally called nested.column or a field column inside of a struct nested. It's easier to type and remember than
`nested.column`_idx
In the case of collisions, we have the logic in place to add the _2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's fine if nested.column_idx can either be a column literally called nested.column or a field column inside of a struct nested. It's easier to type and remember
My thought process is that, we are basically doing these to ensure backwards compatibility for places that rely on the index name. And if some process relies on the name and cannot break, it is basically trying to parse the fields from the index name. And for this case, basically it can no longer derive the right column for the case above because the name is ambiguous.
| .execute() | ||
| .await | ||
| .unwrap(); | ||
| assert_eq!(idx2.name, "b_idx_2"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there's possible arguments to make this "1" instead of "2" - it might be worth checking that whatever we're doing is stylistically in line with other parts of our interfaces.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't have any other interfaces that do this right now. _2 felt natural given the original is implicitly 1 (assuming 1-based indexing). It feels natural enough to say.
Although I don't care too much, because I think it's better if users just name their indexes something sensible instead.
| let column_path = default_index_name(&names); | ||
| let base_name = format!("{column_path}_idx"); | ||
| let mut candidate = base_name.clone(); | ||
| let mut counter = 2; // Start with no suffix, then use _2, _3, ... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we start to allow multiple indexes on the same column, I think we should also have the index type in the index name for information purpose, like col_idx_btree (trying to keep the original string and only add things in the end)
Another thing is that this will start to create a situation that concurrent index creation on the same column and same name should fail. We currently treat that as a delta index, but now I think we need to update the transaction model to differentiate these 2 cases with a flag like is_delta_idex
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we start to allow multiple indexes on the same column, I think we should also have the index type in the index name for information purpose
Yeah, I think we should do that. Should also balance with need for backwards compat. Unfortunately there are places in LanceDB where we hard code the expectation that the index name is {column}_idx. Hopefully we can fix that soon and it won't be an issue.
Another thing is that this will start to create a situation that concurrent index creation on the same column and same name should fail.
Yeah that would be good to handle better. We shouldn't mix index types for the same index name.
BREAKING CHANGE:
create_indexnow returnsIndexMetadataof the new index in Rust and Java. Previously it returned nothing. (Python is unchanged, as it returns the Dataset itself, and changing that would be too disruptive.)Defines the behavior of default index names, particularly for ones with mixed case, non-alphanumeric, and nested fields. I tried to align it with how it was being done before as much as possible. For example, fields with dashes like
my-columnwould getmy-column_idx. This helps libraries that were relying on a predicable algorithm.However, I did notice we don't handle name collisions. So I added behavior for that. If there's already an existing index with that name, the default name gets a
_2added.Finally, because we are choosing the name in the function, I made it so we return the
IndexMetadatawhen you create the index in case you want to get back the name that was chosen.