TST (string dtype): fix sql xfails with using_infer_string #60255

Open

WillAyd wants to merge 3 commits into main from sql-string-xfail

Conversation

WillAyd (Member Author) commented Nov 8, 2024

No description provided.

-arr = maybe_infer_to_datetimelike(arr)
-if dtype_backend != "numpy" and arr.dtype == np.dtype("O"):
+convert_to_nullable_dtype = dtype_backend != "numpy"
+arr = maybe_infer_to_datetimelike(arr, convert_to_nullable_dtype)
WillAyd (Member Author):

I was surprised that maybe_infer_to_datetimelike would change an object dtype to a string dtype; this might be a cause of unexpected behavior in other places where it is used.
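A minimal sketch of the behavior being described, assuming the internal helper pandas.core.dtypes.cast.maybe_infer_to_datetimelike and the future.infer_string option; the exact result depends on the pandas version:

```python
# Hedged sketch: an object-dtype ndarray of strings may be inferred to a
# string dtype by this helper when the future string behavior is enabled.
import numpy as np
import pandas as pd
from pandas.core.dtypes.cast import maybe_infer_to_datetimelike

pd.set_option("future.infer_string", True)

arr = np.array(["a", "b", "c"], dtype=object)
out = maybe_infer_to_datetimelike(arr)
# Depending on the version, `out` may carry a string dtype instead of
# remaining an object ndarray.
print(type(out), getattr(out, "dtype", None))
```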

jorisvandenbossche (Member):

Yeah that's a very confusing name for what it does (by accident?) in practice nowadays ..

jorisvandenbossche (Member):

(Not necessarily for now, but long term I think we should directly use maybe_convert_objects here instead of maybe_infer_to_datetimelike, which is a small wrapper around it.)

WillAyd (Member Author):

Sounds good - I opened #60258 to keep track of this

@@ -1422,6 +1425,10 @@ def _get_dtype(self, sqltype):
             return date
         elif isinstance(sqltype, Boolean):
             return bool
+        elif isinstance(sqltype, String):
+            if using_string_dtype():
+                return StringDtype(na_value=np.nan)
WillAyd (Member Author):

I hard-coded the nan value here for backwards compatibility, but this might become a code smell in the future. Have we been trying to avoid this pattern @jorisvandenbossche, or is it expected for now?

jorisvandenbossche (Member):

Yes, this is the way I also did it elsewhere in code paths that need to construct the default string dtype (hard-code the nan value, and the storage gets determined automatically based on whether pyarrow is installed / based on the pd.options).
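A short sketch of that pattern, assuming a pandas build where StringDtype accepts the na_value argument; the storage argument is left out so it is resolved automatically:

```python
# Hedged sketch: hard-code the NaN sentinel for backwards compatibility and
# let the storage be chosen automatically (pyarrow if installed, otherwise
# the python storage / whatever pd.options dictates).
import numpy as np
import pandas as pd

dtype = pd.StringDtype(na_value=np.nan)
print(dtype.storage)   # "pyarrow" or "python", depending on the environment
print(dtype.na_value)  # nan
```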

jorisvandenbossche (Member):

But is this change required? It doesn't seem that this is actually used in the one place where _get_dtype gets called?

WillAyd (Member Author):

It is required - I somehow did not commit the part that uses it in the caller. Stay tuned...

jorisvandenbossche (Member) left a comment:

Thanks for looking into this!

@@ -2205,7 +2210,7 @@ def read_table(
         elif using_string_dtype():
             from pandas.io._util import arrow_string_types_mapper

-            arrow_string_types_mapper()
+            mapping = arrow_string_types_mapper()
jorisvandenbossche (Member):

whoops :)
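For context, a hedged sketch of why the missing assignment made the original call a no-op; the pyarrow Table tbl below is illustrative, but types_mapper is the standard way such a mapping is handed to pyarrow:

```python
# Hedged sketch: arrow_string_types_mapper() returns a callable that maps
# pyarrow string types to the pandas string dtype. Without binding the
# result to `mapping`, the call had no effect.
from pandas.io._util import arrow_string_types_mapper

mapping = arrow_string_types_mapper()

# Typical use (assuming `tbl` is a pyarrow.Table): pass the mapping so
# string columns come back with the pandas string dtype.
# df = tbl.to_pandas(types_mapper=mapping)
```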

WillAyd force-pushed the sql-string-xfail branch 6 times, most recently from 990bdc7 to 1abf86c on November 9, 2024 at 15:18
WillAyd (Member Author) commented Nov 9, 2024

I think the issue with the non-pyarrow future string build has to do with the tests running in parallel. @mroeschke maybe I am misunderstanding, but isn't each module supposed to run all of its tests on one worker? I wonder why this build in particular is the only one that seems to fail.

mroeschke (Member) commented Nov 9, 2024

but isn't each module supposed to run all of its tests on one worker?

I changed the execution method to worksteal a few months ago (the tests ran much faster that way).

I suspect that possibly not all the fixtures correctly create and clean up the tables they create, and since the ADBC and SQLAlchemy connections reference the same table names on the same databases, the importorskip("pyarrow") calls that were added may have disrupted the table state on the db.

(In an ideal world, each test should work with a unique table name which is cleaned up appropriately upon failure/success.)
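A hypothetical sketch of that idea; the fixture and the sqlite_engine name it relies on are illustrative, not what the test suite currently does:

```python
# Hypothetical sketch: give each test its own table name and always drop it
# on teardown, so parallel workers cannot race on a shared hard-coded name.
import uuid

import pytest
from sqlalchemy import text


@pytest.fixture
def unique_table_name(sqlite_engine):
    name = f"tmp_{uuid.uuid4().hex}"
    yield name
    # teardown runs whether the test passed or failed
    with sqlite_engine.connect() as conn:
        conn.execute(text(f'DROP TABLE IF EXISTS "{name}"'))
        conn.commit()
```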

WillAyd (Member Author) commented Nov 9, 2024

Ah OK interesting...I wonder how that hasn't affected the main CI jobs to date...

I suspect that possibly not all the fixtures correctly create and clean up the tables they create

They do, as long as they only operate within a single thread, but with multiple threads there is nothing that prevents the tests from hitting a race condition, since a lot of the test table names are hard-coded (although we could probably change them to a uuid).

I'm guessing the fix for now would be to add a pytest marker for single_cpu to the test_sql.py module(?)
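For reference, a sketch of that module-level marker (pandas' test suite defines a single_cpu mark); whether it is the right fix here is the open question:

```python
# Hedged sketch: mark the whole module so pytest-xdist keeps its tests on a
# single worker (placed at the top of pandas/tests/io/test_sql.py).
import pytest

pytestmark = pytest.mark.single_cpu
```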

jorisvandenbossche (Member) left a comment:

Looks good!

pandas/io/sql.py (outdated)
elif (
    using_string_dtype()
    and is_string_dtype(col_type)
    and is_object_dtype(self.frame[col_name])
jorisvandenbossche (Member):

Is the check for object dtype of the column needed? (Or might the string dtype already have been inferred when constructing the DataFrame?)

jorisvandenbossche (Member):

And what about, for example, a column of all missing values that might get inferred as float by the DataFrame constructor? In that case maybe use not is_string_dtype instead of is_object_dtype? (Or just pass copy=False to astype.)

WillAyd (Member Author):

It is needed, although admittedly the patterns used here are not very clear.

Without the is_object_dtype check, it is possible that a Series constructed as StringDtype with pd.NA as the na_value sentinel gets changed to using the np.nan sentinel, which fails a good deal of tests.

As an alternative, we might be able to infer the na_value to use from the executing environment rather than hard-coding np.nan on L1436?
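A hedged sketch of the concern in this thread: a column that already carries the pd.NA-sentinel string dtype should not silently be re-cast to the np.nan-sentinel variant, which is what the is_object_dtype guard avoids:

```python
# Hedged sketch: the two StringDtype variants differ in their missing-value
# sentinel, so an unconditional astype would change observable behavior.
import numpy as np
import pandas as pd

na_variant = pd.Series(["a", None], dtype=pd.StringDtype(na_value=pd.NA))
nan_variant = na_variant.astype(pd.StringDtype(na_value=np.nan))

print(na_variant.dtype == nan_variant.dtype)  # False: different sentinels
print(na_variant[1], nan_variant[1])          # <NA> vs nan
```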

jorisvandenbossche (Member):

Maybe it would be clearer then to use if dtype_backend == "numpy" in the if clause (instead of is_object_dtype), i.e. only do this cast to StringDtype in default backend mode?
That's also what the cases just above and below actually use (so maybe those could be combined within a single if dtype_backend == "numpy" block).

(this also makes me wonder if we should call _harmonize_columns at all in case of a pyarrow dtype_backend, because at that point the self.frame already has all ArrowDtype columns. But I suppose this method handles some additional things like the parse_dates keyword?)

jorisvandenbossche (Member) commented

Hmm, one more failing test now with test_insertion_method_on_conflict_do_nothing (not directly sure how that is caused by this code change, because the dtype that is wrong is the one for the float column).
