GH-49158: [Python] Allow type conversion in JSON parser when explicit schema is provided by shashbha14 · Pull Request #49177 · apache/arrow

shashbha14 · 2026-02-07T14:50:16Z

The issue: when you provide an explicit schema to the JSON parser, it errors if JSON types don't exactly match schema types, even when conversion is straightforward.

For example, if you have:
{"_id": "152934"}
{"_id": 152934}
And your schema says _id should be string, it fails on row 1 with "Column changed from string to number" instead of converting 152934 to "152934".

I fixed this by making the parser attempt type conversion when an explicit schema is provided. Before erroring on a type mismatch, it checks if we have an explicit schema and tries to convert the value to match the expected type.

Changes:

Store explicit_schema in HandlerBase so we can access it during parsing
Modified AppendScalar() to try conversion before erroring when explicit schema exists
Added TryConvertAndAppend() helper that handles the conversion logic
Updated Bool() handler to also support conversion
Added tests for number->string and string->number cases
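The control flow described in the changes above (append when the types match, try a conversion only when an explicit schema is present, otherwise raise the original error) can be sketched in plain Python. This is only an illustrative model, not the PR's C++ code: `ColumnBuilder`, `kind_of`, `number_to_string`, and the `converter` parameter are hypothetical names, and the error type is an assumption.

```python
from typing import Any, Callable, List, Optional

# Hypothetical converter signature: (value, target_kind) -> converted value or None.
Converter = Callable[[Any, str], Optional[Any]]


def kind_of(value: Any) -> str:
    """Classify a parsed JSON scalar as "bool", "number", or "string"."""
    # bool must be checked first because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "bool"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    raise TypeError(f"unsupported scalar: {value!r}")


class ColumnBuilder:
    """Hypothetical stand-in for the per-column handler state."""

    def __init__(self, expected_kind: str, converter: Optional[Converter] = None):
        self.expected_kind = expected_kind
        # Assumption: a converter is supplied only when an explicit schema exists.
        self.converter = converter
        self.values: List[Any] = []

    def append_scalar(self, value: Any) -> None:
        kind = kind_of(value)
        if kind == self.expected_kind:
            self.values.append(value)
            return
        # Type mismatch: attempt conversion before erroring, if allowed.
        if self.converter is not None:
            converted = self.converter(value, self.expected_kind)
            if converted is not None:
                self.values.append(converted)
                return
        raise ValueError(f"Column changed from {self.expected_kind} to {kind}")


def number_to_string(value: Any, target: str) -> Optional[Any]:
    """Minimal example converter covering only the number -> string case."""
    if target == "string" and kind_of(value) == "number":
        return str(value)
    return None


# With a converter (explicit schema), the mixed column is accepted.
lenient = ColumnBuilder("string", converter=number_to_string)
for raw in ["152934", 152934]:
    lenient.append_scalar(raw)
print(lenient.values)
```

Without a converter, the second `append_scalar` call raises the mismatch error, which models the backward-compatible path.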

Conversions that work now:

Number -> String (152934 -> "152934")
String -> Number (when the string is numeric)
Boolean conversions to/from string and number
Number -> Boolean (0 is false, non-zero is true)
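The conversion rules listed above can be sketched as a single plain-Python function. This is a rough model under stated assumptions, not the PR's implementation: `try_convert` and the kind tags "string", "number", and "bool" are hypothetical, and details the description leaves open (which strings count as booleans, how numbers are rendered as text) are guesses marked in the comments.

```python
from typing import Any, Optional


def try_convert(value: Any, target: str) -> Optional[Any]:
    """Convert a JSON scalar to `target`, or return None if no rule applies.

    Callers must test the result with `is None`, since False and 0 are
    valid converted values.
    """
    # bool is checked before int/float because bool is a subclass of int.
    if target == "string":
        if isinstance(value, bool):
            # Assumption: booleans are rendered as JSON literals.
            return "true" if value else "false"
        if isinstance(value, (int, float)):
            return str(value)  # 152934 -> "152934"
        if isinstance(value, str):
            return value
    elif target == "number":
        if isinstance(value, bool):
            return int(value)
        if isinstance(value, (int, float)):
            return value
        if isinstance(value, str):
            # String -> Number only when the string is numeric.
            try:
                return int(value)
            except ValueError:
                try:
                    return float(value)
                except ValueError:
                    return None
    elif target == "bool":
        if isinstance(value, bool):
            return value
        if isinstance(value, (int, float)):
            return value != 0  # 0 is false, non-zero is true
        if isinstance(value, str):
            # Assumption: only "true"/"false" (case-insensitive) are accepted.
            lowered = value.strip().lower()
            if lowered in ("true", "false"):
                return lowered == "true"
    return None


print(try_convert(152934, "string"))
print(try_convert("152934", "number"))
print(try_convert("not a number", "number"))
```

Note that Python's `int()` and `float()` are more permissive than a strict JSON number grammar (they accept surrounding whitespace, underscores, and "nan"), so a real parser would likely need a tighter check.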

This only happens when explicit schema is provided, so it's backward compatible. All existing tests still pass.

Fixes #49158

GitHub Issue: [Python] PyArrow cannot read from a newline-delimited JSON file with inconsistent column types, even if parse_options specifies a schema #49158


…able function
- Add errors parameter to cast() function with 'raise' (default) and 'coerce' options
- errors='coerce' converts invalid values to null instead of raising errors
- Add errors parameter to Array.cast(), Scalar.cast(), and ChunkedArray.cast() instance methods
- Verify is_castable() function is properly exposed and working
- Add comprehensive tests including the exact example from issue apache#48972
- Update documentation with examples showing errors='coerce' usage
This addresses issue apache#48972 by providing pandas.to_numeric(errors='coerce') equivalent functionality in PyArrow.

…ma is provided
When reading JSON with explicit schema, the parser now attempts to convert values to match the schema type before erroring. This allows JSON files with inconsistent types (e.g., number and string for the same field) to be read successfully when an explicit schema is provided.
Changes:
- Store explicit_schema in HandlerBase for access during parsing
- Modified AppendScalar to check for conversion before erroring
- Added TryConvertAndAppend helper function to handle conversions
- Updated Bool handler to also support conversion
- Added tests for number->string and string->number conversions
Supported conversions:
- Number <-> String (when numeric)
- Boolean <-> String
- Boolean <-> Number
- Number -> Boolean (0=false, non-zero=true)
Fixes apache#49158

rok · 2026-02-07T19:16:16Z

Thanks for jumping on this @shashbha14. However, please note we might not want to change the JSON parser right now, depending on the outcome of the discussion on the issue.

rok · 2026-02-09T22:42:16Z

python/pyarrow/table.pxi

+        personal_data : bool, default None
+            Whether the table/batch contains personal data. If True, adds
+            b'ARROW:personal_data': b'true' to the metadata.


Out of curiosity, why is personal_data needed?

raulcd

I've seen this ARROW:personal_data field on other PRs too like:

#48994

This PR seems to contain 6 commits, several of them from PRs unrelated to this one. Could you please remove the unnecessary commits?

rok · 2026-02-10T15:22:44Z

Since @shashbha14 seems inactive, we'd best close this and leave the issue unassigned for now.

raulcd · 2026-02-10T16:04:39Z

I agree with you. I've removed the assignee from the issue. Happy to close this PR.

rok · 2026-02-10T16:09:58Z

@shashbha14 closing this now for inactivity. Feel free to reopen this PR and engage us in review.

shashbha14 · 2026-02-11T10:26:28Z

 I had an initial PR (#49177) for this but it mixed in unrelated commits and the
 JSON parser behavior is still being discussed. I’m happy to revisit this once
 there’s consensus on how coercion with explicit schema should behave.

rok · 2026-02-11T10:38:38Z

Thanks for replying @shashbha14! Feel free to propose a PR when we have consensus in the discussion.

Shashwati added 2 commits January 19, 2026 17:32

apacheGH-48853: [Release] Fix bytes to string comparison in download_…

8ffedb4

…rc_binaries.py

shashbha14 requested review from AlenkaF, assignUser, jonkeane, kou, raulcd and rok as code owners February 7, 2026 14:50

github-actions bot added Component: C++ Component: Python awaiting review Awaiting review labels Feb 7, 2026

shashbha14 force-pushed the GH-49158-json-type-coercion branch from 0e3b8af to d776ba5 Compare February 7, 2026 16:22

Shashwati added 3 commits February 7, 2026 21:58

Fix: Add missing string header for std::string

3db1991

Improve tests to verify converted values

7fde1ae

Simplify tests to avoid value comparison issues

8197e83

kou changed the title ~~GH-49158: Allow type conversion in JSON parser when explicit schema is provided~~ GH-49158: [Python] Allow type conversion in JSON parser when explicit schema is provided Feb 8, 2026

rok reviewed Feb 9, 2026

View reviewed changes

github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels Feb 9, 2026

raulcd requested changes Feb 10, 2026

View reviewed changes

rok closed this Feb 10, 2026

shashbha14 wants to merge 6 commits into apache:main from shashbha14:GH-49158-json-type-coercion

3 participants
