-
-
Notifications
You must be signed in to change notification settings - Fork 19.3k
BUG: Fix multiindex factorize extension dtypes #62964
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
BUG: Fix multiindex factorize extension dtypes #62964
Conversation
mroeschke
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR but I think this is working around the core issue where algorithms.factorize is being called on self._values which is just a numpy array for a MultiIndex.
I think MultiIndex would need to override factorize and use a custom implementation if any level has an ExtentionDtype.
| klass=_shared_doc_kwargs["klass"], | ||
| optional_reindex=_shared_doc_kwargs["optional_reindex"], | ||
| ) | ||
| # error: Cannot determine type of 'reindex' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this comment represent?
| sort=sort, use_na_sentinel=use_na_sentinel | ||
| ) | ||
|
|
||
| def _factorize_with_extension_dtypes( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this only required for a MultiIndex? A base Index doesn't have the same requirement?
doc/source/whatsnew/v3.0.0.rstfileMultiIndex.factorize()was silently converting extension dtypes (Int64, boolean, string) to base dtypes, causing data corruption. This fix preserves extension dtypes by restoring them level-by-level after factorization.Before:
After:
Performance Increase:
Some MultiIndex operations ~10% faster due to better type consistency.
Benchmarks: