(Semi-)automatic conversion of nullable columns to the appropriate pandas arrays #1068

grst · 2023-07-23T18:12:23Z

Please describe your wishes and possible alternatives to achieve the desired result.

Since #504, AnnData supports nullable int and bool columns in obs. Support for strings is planned in #679.

However, this only works if the nullable columns are represented as the appropriate pandas Array extension type.

For instance, this snippet

import anndata
import numpy as np
import pandas as pd

adata = anndata.AnnData(
    X=None,
    obs=pd.DataFrame().assign(
        test_int=np.array([1, 2, None, 3]),
        test_bool=[True, False, None, False],
    ),
)
adata.write_h5ad("test.h5ad")

fails with TypeError: Can't implicitly convert non-string objects to strings.

After converting the columns to pandas arrays, the object can be saved:

for c in adata.obs.columns:
    adata.obs[c] = pd.array(adata.obs[c].values)
adata.write_h5ad("test.h5ad")

Unfortunately, the pandas extension arrays are little known, and None values can end up in adata.obs for various reasons (for instance scverse/scirpy#434).

I was wondering if such columns should be automatically converted to the appropriate pandas array, e.g. on save?
Or maybe there should be an equivalent to AnnData.strings_to_categoricals that can be called to sanitize such columns?
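A minimal sketch of what such a sanitizing helper could look like, using plain pandas only. The name `nullables_to_pandas_arrays` is hypothetical and not part of AnnData, and restricting the conversion to integer and boolean columns is an assumption based on the types mentioned above.

```python
import pandas as pd
from pandas.api.types import infer_dtype

# Hypothetical helper (not part of AnnData): convert object columns that hold
# only ints or only bools, plus missing values, to pandas nullable arrays.
_NULLABLE = {"integer": "Int64", "boolean": "boolean"}


def nullables_to_pandas_arrays(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    for name in out.columns:
        col = out[name]
        if col.dtype != object:
            continue  # already has a concrete dtype; leave untouched
        # skipna=True ignores None/NaN when classifying the remaining values
        kind = infer_dtype(col, skipna=True)
        if kind in _NULLABLE and col.isna().any():
            out[name] = pd.array(col.to_numpy(), dtype=_NULLABLE[kind])
    return out


obs = pd.DataFrame(
    {
        "test_int": pd.Series([1, 2, None, 3], dtype=object),
        "test_bool": pd.Series([True, False, None, False], dtype=object),
        "mixed": pd.Series([1, "a", None, 2.5], dtype=object),
    }
)
clean = nullables_to_pandas_arrays(obs)
print(clean.dtypes)
```

Columns with genuinely mixed content are left as they are in this sketch, so they would still need manual attention before writing.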

flying-sheep · 2023-07-28T09:25:19Z

Totally.

As for the how: we never supported “object” columns; there just wasn’t a string column type when AnnData switched to DataFrames. Initially AnnData used structured arrays, which, while clumsy, had the advantage that you couldn’t put in data that wasn’t writable later.

Ideally, AnnData would convert things on assignment so errors surface early. We didn’t do that so far, since subclassing DataFrames is iffy, but it might be the best solution.

There could be a way to deactivate this for people who don’t care about writeability.
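To illustrate the idea of surfacing errors at assignment time without subclassing DataFrame, here is a rough sketch. `assign_checked` is a hypothetical function, not an AnnData API, and the rule it applies (object columns must contain only strings) is a deliberately strict assumption.

```python
import pandas as pd
from pandas.api.types import infer_dtype


# Hypothetical helper, not part of AnnData: validate a column when it is
# assigned instead of when the object is written to disk.
def assign_checked(df: pd.DataFrame, name: str, values) -> None:
    col = pd.Series(values, index=df.index)
    if col.dtype == object:
        # skipna=False so that a stray None counts as a problem
        kind = infer_dtype(col, skipna=False)
        if kind != "string":
            raise TypeError(
                f"column {name!r} holds {kind!r} objects and may not be writable"
            )
    df[name] = col


obs = pd.DataFrame(index=["a", "b", "c"])
assign_checked(obs, "counts", [1, 2, 3])  # accepted: plain int64 column

try:
    assign_checked(obs, "flag", [True, None, False])  # object column with None
except TypeError as err:
    print(err)
```

The exception is raised at the line that introduces the problematic column, which is the early feedback described above.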

ivirshup · 2023-07-31T11:43:22Z

> I was wondering if such columns should be automatically converted to the appropriate pandas array, e.g. on save?

I think this is doable, but would like to rely on a consistent set of rules defined upstream of us. At the moment, pandas has:

Series.infer_objects, which would convert your example to floats
DataFrame.convert_dtypes, which converts everything to pandas dtypes, regardless of whether it's needed. This includes conversion to dtypes that we don't support, like nullable floating point.
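The difference between the two pandas options can be checked with a small experiment. This is only a sketch using `Series.infer_objects` and `DataFrame.convert_dtypes`; the column name `x` is made up for illustration.

```python
import pandas as pd

# Object column of ints with a missing value, as in the example above
ints = pd.Series([1, 2, None, 3], dtype=object)

as_float = ints.infer_objects()      # soft conversion: None becomes NaN
as_nullable = ints.convert_dtypes()  # nullable extension dtype instead

# convert_dtypes also touches columns that had no problem to begin with
floats = pd.DataFrame({"x": [0.5, 1.5]})
floats_converted = floats.convert_dtypes()

print(as_float.dtype, as_nullable.dtype, floats_converted["x"].dtype)
```

The last result shows the nullable floating-point conversion mentioned above, which happens even though the original float column contained no missing values.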

> We didn’t do that so far, since subclassing DataFrames is iffy, but it might be the best solution.

I don't think subclassing pandas DataFrames is something we can support.

flying-sheep · 2023-08-28T09:23:31Z

Probably not, but it would be optimal for users. There would be far fewer issues if people ran into errors right when trying to add a column. By contrast, running into an issue when writing is much less user-friendly, as it means people have to solve all possible issues in bulk instead of running into an early exception.

ivirshup · 2023-10-04T16:58:18Z

We can probably do a more aggressive inference on object types as a stopgap, but the future of string extension arrays is a little unclear.

I've been told they want to just use arrow directly, but then I heard a lot of pushback on pandas having a direct dependency on pyarrow. So which of the string types should we be defaulting to? The one that adds a big dependency, or the one that is just a thin wrapper around object that sounds like it'll be deprecated?

flying-sheep · 2023-10-06T09:18:17Z

Do you have any links for that? Would be nice to see if there’s been an update on the plans

ivirshup · 2023-10-06T14:28:57Z

My understanding comes from discussions at EuroSciPy; I'm not sure if there is a consolidated place for discussion.

flying-sheep · 2023-10-09T11:49:20Z

Would be great to get at least a semi-official writeup before making a decision based on something that might never materialize. Maybe we can get hold of someone involved?

ivirshup · 2023-10-09T22:06:53Z

Current pinned issue on pandas: FEEDBACK: PyArrow as a required dependency and PyArrow backed strings pandas-dev/pandas#54466
PDEP-10 PyArrow as a required dependency for default string inference implementation

flying-sheep · 2023-12-15T08:51:24Z

News from the pandas issue:

dtype=string will be arrow backed starting from 3.0 or when you activate the infer_string option
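Which backend a given installation uses for dtype="string" can be inspected directly. This is a small sketch, and the reported storage depends on the pandas version and on whether pyarrow is installed.

```python
import pandas as pd

# Nullable string array; the missing value is kept instead of becoming "None"
strings = pd.array(["a", None, "c"], dtype="string")

# .storage reports the backend of this StringDtype instance
print(type(strings.dtype).__name__, strings.dtype.storage)
print(strings.isna())
```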

grst added the enhancement label Jul 23, 2023

This was referenced Oct 4, 2023

Error while writing annData #1143

Closed

can't write missing values in obsm and varm to h5ad #1146

Closed

ivirshup added this to the 0.11.0 milestone Oct 4, 2023

This was referenced Dec 4, 2023

Can't implicitly convert non-string objects to strings. Bug about tl.filter_rank_genes_groups and tl.rank_genes_groups #1141

Closed

Option to ignore "nan" with sc.pl.rank_genes_groups() and error while writing data to .h5ad scverse/scanpy#1651

Closed

flying-sheep mentioned this issue Dec 15, 2023

Can't read or write h5ad files that contain booleans columns with nulls (None) #1258

Closed

3 tasks

flying-sheep mentioned this issue Jan 2, 2024

No support for mixed column type #726

Closed

ivirshup modified the milestones: 0.11.0, 0.12.0 Aug 8, 2024

grst mentioned this issue Nov 4, 2024

Can't save h5mu from Scirpy processed gex+bcr+tcr data if I copy airr into obs scverse/scirpy#434

Open
