-
Notifications
You must be signed in to change notification settings - Fork 157
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
(Semi-)automatic conversion of nullable columns to the appropriate pandas arrays #1068
Comments
Totally. About the how, well: We never supported “object” columns, there just hasn’t been a string column type when AnnData switched to DataFrames. Initially AnnData used structured arrays which, while clumsy, had the advantage that you couldn’t put data in that wasn’t writable later. Ideally, AnnData would convert things on assignment so errors surface early. We didn’t do that so far, since subclassing DataFrames is iffy, but it might be the best solution. There could be a way to deactivate this for people who don’t care about writeability. |
I think this is doable, but would like to rely on a consistent set of rules defined upstream of us. At the moment, pandas has:
I don't think subclassing a pandas dataframes is something we can support. |
Probably not, but it would be optimal for users. There would be much fewer issues when people run into errors when trying to add a column. By contrast, running into an issue when writing is much less user friendly, as it means people have to solve all possible issues in bulk instead of running into an early exception. |
We can probably do a more aggressive inference on object types as a stop gap, but the future of string extension arrays is a little unclear. I've been told they want to just use arrow directly, but then I heard a lot of pushback on pandas having a direct dependency on pyarrow. So which of the string types should we be defaulting to? The one that adds a big dependency, or the one that is just a thin wrapper around |
Do you have any links for that? Would be nice to see if there’s been an update on the plans |
My understanding comes from discussion at euroscipy, not sure if there is a consolidated place for discussion. |
Would be great to get some at least semi-official writeup before making decision based on something that might never materialize. Maybe we can get hold of someone involved? |
|
Please describe your wishes and possible alternatives to achieve the desired result.
Since #504, AnnData supports nullable int and bool columns in
obs
. Support for strings is planned in #679.However, this only works if the nullable columns are represented as the appropriate pandas
Array
extension type.For instance this
fails with
TypeError: Can't implicitly convert non-string objects to strings
.After converting the columns to pandas arrays, the object can be saved:
Unfortunately, the pandas extension arrays are little known and
None
s might end up inadata.obs
for various reasons (for instance scverse/scirpy#434).I was wondering if such columns should be automatically converted to the appropriate pandas array, e.g. on save?
Or maybe there should be an equivalent to
AnnData.strings_to_categoricals
that can be called to sanitize such columns?The text was updated successfully, but these errors were encountered: