GH-49158: [Python] Allow type conversion in JSON parser when explicit schema is provided #49177
Closed
shashbha14 wants to merge 6 commits into apache:main
Conversation
added 2 commits
January 19, 2026 17:32
…able function
- Add errors parameter to cast() function with 'raise' (default) and 'coerce' options
- errors='coerce' converts invalid values to null instead of raising errors
- Add errors parameter to Array.cast(), Scalar.cast(), and ChunkedArray.cast() instance methods
- Verify is_castable() function is properly exposed and working
- Add comprehensive tests including the exact example from issue apache#48972
- Update documentation with examples showing errors='coerce' usage
This addresses issue apache#48972 by providing pandas.to_numeric(errors='coerce') equivalent functionality in PyArrow.

…ma is provided
When reading JSON with explicit schema, the parser now attempts to convert values to match the schema type before erroring. This allows JSON files with inconsistent types (e.g., number and string for the same field) to be read successfully when an explicit schema is provided.
Changes:
- Store explicit_schema in HandlerBase for access during parsing
- Modified AppendScalar to check for conversion before erroring
- Added TryConvertAndAppend helper function to handle conversions
- Updated Bool handler to also support conversion
- Added tests for number->string and string->number conversions
Supported conversions:
- Number <-> String (when numeric)
- Boolean <-> String
- Boolean <-> Number
- Number -> Boolean (0=false, non-zero=true)
Fixes apache#49158
0e3b8af to d776ba5
Member
Thanks for jumping on this @shashbha14. However, please note that we might not want to change the JSON parser right now, depending on the outcome of the discussion on the issue.
rok
reviewed
Feb 9, 2026
Comment on lines +1939 to +1941
personal_data : bool, default None
    Whether the table/batch contains personal data. If True, adds
    b'ARROW:personal_data': b'true' to the metadata.
Member
Out of curiosity, why is personal_data needed?
Member
Since @shashbha14 seems inactive, we'd best close this and leave the issue unassigned for now.
Member
I agree with you. I've removed the assignee from the issue. Happy to close this PR.
Member
@shashbha14 closing this now for inactivity. Feel free to reopen this PR and engage us in review.
Contributor
Author
Member
Thanks for replying @shashbha14! Feel free to propose a PR when we have consensus in the discussion.
Fixes #49158
The issue: when you provide an explicit schema to the JSON parser, it errors if JSON types don't exactly match schema types, even when conversion is straightforward.
For example, if you have:
{"_id": "152934"}
{"_id": 152934}
And your schema says _id should be string, it fails on row 1 with "Column changed from string to number" instead of converting 152934 to "152934".
I fixed this by making the parser attempt type conversion when an explicit schema is provided. Before erroring on a type mismatch, it checks if we have an explicit schema and tries to convert the value to match the expected type.
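Changes:
- Store explicit_schema in HandlerBase for access during parsing
- Modified AppendScalar to check for conversion before erroring
- Added TryConvertAndAppend helper function to handle conversions
- Updated Bool handler to also support conversion
- Added tests for number->string and string->number conversions
Conversions that work now:
- Number <-> String (when numeric)
- Boolean <-> String
- Boolean <-> Number
- Number -> Boolean (0=false, non-zero=true)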
This only happens when explicit schema is provided, so it's backward compatible. All existing tests still pass.
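To make the intended behavior concrete, here is a minimal pure-Python sketch of the coercion rules this PR describes (number <-> string, boolean <-> string, boolean <-> number). It is not the Arrow C++ implementation: the names convert_to_schema_type and read_json_lines are hypothetical, the schema is modeled as a plain dict of type names, and the "true"/"false" spelling for booleans is an assumption.

```python
import json


def convert_to_schema_type(value, target):
    """Hypothetical helper: coerce one parsed JSON scalar to `target`
    ("string", "number" or "bool"), or raise ValueError."""
    if value is None:
        return None
    # bool is checked before int/float because bool is a subclass of int.
    if target == "string":
        if isinstance(value, bool):
            return "true" if value else "false"  # assumed spelling
        if isinstance(value, (int, float)):
            return json.dumps(value)
        if isinstance(value, str):
            return value
    elif target == "number":
        if isinstance(value, bool):
            return int(value)
        if isinstance(value, (int, float)):
            return value
        if isinstance(value, str):
            try:
                return int(value)
            except ValueError:
                return float(value)  # raises ValueError if not numeric
    elif target == "bool":
        if isinstance(value, bool):
            return value
        if isinstance(value, (int, float)):
            return value != 0  # 0 = false, non-zero = true
        if isinstance(value, str) and value in ("true", "false"):
            return value == "true"
    raise ValueError(f"cannot convert {value!r} to {target}")


def read_json_lines(text, schema):
    """Parse newline-delimited JSON, coercing each field to the type
    named in `schema` (a dict of field name -> type name)."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        rows.append(
            {name: convert_to_schema_type(record.get(name), typ)
             for name, typ in schema.items()}
        )
    return rows


# The example from the description: both rows end up with a string _id.
rows = read_json_lines('{"_id": "152934"}\n{"_id": 152934}\n', {"_id": "string"})
print(rows)
```

Values that cannot be converted (for example a non-numeric string under a number schema) still raise an error in this sketch, which mirrors the "try to convert before erroring" order described above.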