fix: avoid "Unable to determine type" warning with JSON columns in to_dataframe #1876

Open
wants to merge 1 commit into
base: tswast-refactor-cell-data
from

Conversation

tswast
Contributor

@tswast tswast commented Mar 27, 2024

TODO:

  • tests
  • maybe we don't want string columns for JSON?

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #1580 (TODO: need test case for empty result set)
🦕

@product-auto-label product-auto-label bot added size: xs Pull request size is extra small. api: bigquery Issues related to the googleapis/python-bigquery API. labels Mar 27, 2024
@tswast

tswast commented Mar 27, 2024

I might actually want to do something in db-dtypes so that, even though the column is stored as a string, the unboxed value is a parsed object, matching the behavior when the REST API is used.

@tswast

tswast commented Mar 27, 2024

Right now the behavior is inconsistent between the REST API and the BQ Storage API.

@tswast tswast added the do not merge Indicates a pull request not ready for merge, due to either quality or timing. label Mar 28, 2024
@tswast

tswast commented Mar 28, 2024

Marking as do not merge for now. This makes the JSON dtype consistent, but it always returns a string dtype, as the BQ Storage Read API code path does, which isn't ideal.

@product-auto-label product-auto-label bot added size: s Pull request size is small. and removed size: xs Pull request size is extra small. labels Mar 10, 2025
@tswast tswast marked this pull request as ready for review March 10, 2025 15:28
@tswast tswast requested review from a team as code owners March 10, 2025 15:28
@tswast tswast requested a review from shollyman March 10, 2025 15:28
@tswast tswast removed the do not merge Indicates a pull request not ready for merge, due to either quality or timing. label Mar 10, 2025
@tswast

tswast commented Mar 10, 2025

Actually, I think this needs a few more tests. I'm testing manually with pytest 'tests/system/test_to_gbq.py::test_dataframe_round_trip_with_table_schema[load_csv-json]' from googleapis/python-bigquery-pandas#893, but it's currently failing: we parse the JSON string in _row_iterator_page_columns, whereas we actually want to keep those values as strings so that we can use the json_ pyarrow type.
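To make the distinction in the comment above concrete, here is a minimal standard-library sketch of the two ways JSON cells could be handled. The helper names `parse_cells` and `keep_cells` are hypothetical and are not taken from python-bigquery; the sketch only illustrates why parsing early loses the string form that a string-backed JSON Arrow type would need.

```python
import json

# Example JSON column values as they might arrive from the API: JSON text or NULL.
raw_cells = ['{"a": 1}', "[1, 2, 3]", None]


def parse_cells(cells):
    """Hypothetical: eagerly parse each JSON string into a Python object."""
    return [None if cell is None else json.loads(cell) for cell in cells]


def keep_cells(cells):
    """Hypothetical: leave each cell as its original JSON string."""
    return list(cells)


parsed = parse_cells(raw_cells)  # dicts and lists: no longer strings
kept = keep_cells(raw_cells)     # still strings: usable as string storage
```

Under this assumption, only the `kept` values can be handed to a type whose storage is a string array without re-serializing them first.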

@tswast tswast added the do not merge Indicates a pull request not ready for merge, due to either quality or timing. label Mar 10, 2025
@product-auto-label product-auto-label bot added size: m Pull request size is medium. size: xl Pull request size is extra large. and removed size: s Pull request size is small. size: m Pull request size is medium. labels Mar 10, 2025
Comment on lines +66 to +74
# Prefer the JSON type built in to pyarrow (added in 19.0.0), if available.
# Otherwise, fall back to db-dtypes, where JSONArrowType was added in 1.4.0.
# Users might have an older db-dtypes, so use string as the fallback for that.
# TODO(https://github.com/pandas-dev/pandas/issues/60958): switch to
# pyarrow.json_(pyarrow.string()) if available and supported by pandas.
if hasattr(db_dtypes, "JSONArrowType"):
    json_arrow_type = db_dtypes.JSONArrowType()
else:
    json_arrow_type = pyarrow.string()

This is the key change. Mostly aligns with bigframes, but we've left off pyarrow.json_(pyarrow.string()) because of pandas-dev/pandas#60958.
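The feature-detection fallback in the diff excerpt above can be exercised without installing either library. This is a sketch under stated assumptions: `pick_json_arrow_type` is a hypothetical helper that takes the two modules as arguments, and the `SimpleNamespace` objects are stand-ins for db_dtypes and pyarrow, not the real packages.

```python
from types import SimpleNamespace


def pick_json_arrow_type(db_dtypes_module, pyarrow_module):
    """Hypothetical helper mirroring the hasattr check in the diff excerpt."""
    # Use the db-dtypes JSON extension type when this db-dtypes version has it.
    if hasattr(db_dtypes_module, "JSONArrowType"):
        return db_dtypes_module.JSONArrowType()
    # Older db-dtypes: fall back to a plain string type.
    return pyarrow_module.string()


# Stand-in modules (assumptions for illustration only).
fake_pyarrow = SimpleNamespace(string=lambda: "string")
old_db_dtypes = SimpleNamespace()  # no JSONArrowType attribute
new_db_dtypes = SimpleNamespace(JSONArrowType=lambda: "json")

old_choice = pick_json_arrow_type(old_db_dtypes, fake_pyarrow)
new_choice = pick_json_arrow_type(new_db_dtypes, fake_pyarrow)
```

Passing the modules in as parameters is a choice made for this sketch so both branches can be checked; the diff itself reads the imported modules directly.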

@tswast tswast removed the do not merge Indicates a pull request not ready for merge, due to either quality or timing. label Mar 11, 2025
@tswast tswast requested review from chalmerlowe and Linchin and removed request for shollyman March 11, 2025 21:17
@tswast tswast added the do not merge Indicates a pull request not ready for merge, due to either quality or timing. label Mar 11, 2025
@tswast

tswast commented Mar 11, 2025

Marking as do not merge again. I'll split out the refactor into a separate PR first.

Edit: Mailed #2144

@tswast tswast changed the base branch from main to tswast-refactor-cell-data March 11, 2025 21:59
Labels
api: bigquery Issues related to the googleapis/python-bigquery API. do not merge Indicates a pull request not ready for merge, due to either quality or timing. size: xl Pull request size is extra large.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ValueError encountered when to_dataframe returns empty resultset with JSON field
2 participants