GH-49158: [Python] Allow type conversion in JSON parser when explicit schema is provided #49177

Closed
shashbha14 wants to merge 6 commits into apache:main from shashbha14:GH-49158-json-type-coercion

Conversation

@shashbha14
Contributor

@shashbha14 shashbha14 commented Feb 7, 2026

Fixes #49158

The issue: when you provide an explicit schema to the JSON parser, it errors if JSON types don't exactly match schema types, even when conversion is straightforward.

For example, if you have:

{"_id": "152934"}
{"_id": 152934}

and your schema says _id should be a string, parsing fails on row 1 with "Column changed from string to number" instead of converting 152934 to "152934".
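
To make the failure concrete, here is a minimal standard-library sketch of a strict per-column check. It is not the Arrow parser: the helper names json_kind and check_against_schema, and the idea of passing the schema as a plain dict, are assumptions made for illustration. It only mimics the behaviour described above, where a value whose JSON type differs from the schema type is rejected rather than converted.

```python
import json


def json_kind(value):
    """Return a coarse JSON type name for a decoded value."""
    if isinstance(value, bool):  # bool is a subclass of int, so test it first
        return "bool"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if value is None:
        return "null"
    return "other"


def check_against_schema(lines, expected):
    """Raise ValueError at the first value whose kind differs from `expected`.

    `expected` maps column name -> kind (hypothetical stand-in for a schema).
    """
    for row, line in enumerate(lines):
        for column, value in json.loads(line).items():
            kind = json_kind(value)
            if column not in expected or kind == "null":
                continue
            if kind != expected[column]:
                raise ValueError(
                    f"Column {column} changed from {expected[column]} "
                    f"to {kind} in row {row}"
                )


rows = ['{"_id": "152934"}', '{"_id": 152934}']
try:
    check_against_schema(rows, {"_id": "string"})
    message = None
except ValueError as exc:
    message = str(exc)
print(message)
```

Row 0 matches the schema; row 1 holds a number where a string is expected, so the strict check stops there.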

I fixed this by making the parser attempt type conversion when an explicit schema is provided. Before erroring on a type mismatch, it checks if we have an explicit schema and tries to convert the value to match the expected type.

Changes:

  • Store explicit_schema in HandlerBase so we can access it during parsing
  • Modified AppendScalar() to try conversion before erroring when explicit schema exists
  • Added TryConvertAndAppend() helper that handles the conversion logic
  • Updated Bool() handler to also support conversion
  • Added tests for number->string and string->number cases

Conversions that work now:

  • Number -> String (152934 -> "152934")
  • String -> Number (when the string is numeric)
  • Boolean conversions to/from string and number
  • Number -> Boolean (0 is false, non-zero is true)
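
The conversion rules listed above could be sketched in plain Python as follows. This is an illustration only, not the C++ code in this PR: the function name coerce, the target names "string", "number" and "bool", and the "true"/"false" spelling for booleans as strings are all assumptions.

```python
def coerce(value, target):
    """Convert a decoded JSON scalar to `target`, or raise ValueError."""
    if target == "string":
        if isinstance(value, str):
            return value
        if isinstance(value, bool):  # test bool before int: bool subclasses int
            return "true" if value else "false"  # assumed spelling
        if isinstance(value, (int, float)):
            return str(value)  # number -> string
    elif target == "number":
        if isinstance(value, bool):
            return int(value)  # boolean -> number
        if isinstance(value, (int, float)):
            return value
        if isinstance(value, str):
            try:
                return int(value)  # string -> number, integers first
            except ValueError:
                return float(value)  # raises ValueError if not numeric
    elif target == "bool":
        if isinstance(value, bool):
            return value
        if isinstance(value, (int, float)):
            return value != 0  # 0 is false, non-zero is true
        if isinstance(value, str) and value in ("true", "false"):
            return value == "true"
    raise ValueError(f"cannot convert {value!r} to {target}")


print(coerce(152934, "string"))
print(coerce("152934", "number"))
print(coerce(0, "bool"))
```

Values that cannot be converted, such as a non-numeric string with a number target, still raise an error in this sketch, so a type mismatch is only hidden when a conversion actually succeeds.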

This only happens when explicit schema is provided, so it's backward compatible. All existing tests still pass.

Shashwati added 2 commits January 19, 2026 17:32
…able function

- Add errors parameter to cast() function with 'raise' (default) and 'coerce' options
- errors='coerce' converts invalid values to null instead of raising errors
- Add errors parameter to Array.cast(), Scalar.cast(), and ChunkedArray.cast() instance methods
- Verify is_castable() function is properly exposed and working
- Add comprehensive tests including the exact example from issue apache#48972
- Update documentation with examples showing errors='coerce' usage

This addresses issue apache#48972 by providing pandas.to_numeric(errors='coerce')
equivalent functionality in PyArrow.
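
The 'raise' versus 'coerce' semantics described in this commit message can be illustrated with a standard-library sketch. It deliberately does not call PyArrow, since the errors parameter is only what this commit proposes; the helper name cast_to_int is hypothetical.

```python
def cast_to_int(values, errors="raise"):
    """Cast each value to int; with errors='coerce', invalid values become None."""
    if errors not in ("raise", "coerce"):
        raise ValueError("errors must be 'raise' or 'coerce'")
    result = []
    for value in values:
        try:
            result.append(int(value))
        except (TypeError, ValueError):
            if errors == "raise":
                raise  # default: propagate the conversion error
            result.append(None)  # coerce: replace the invalid value with null
    return result


print(cast_to_int(["1", "2", "x"], errors="coerce"))
```

With errors='raise' the same input fails on "x", which matches the default behaviour the commit message describes.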
…ma is provided

When reading JSON with explicit schema, the parser now attempts to convert
values to match the schema type before erroring. This allows JSON files
with inconsistent types (e.g., number and string for the same field) to
be read successfully when an explicit schema is provided.

Changes:
- Store explicit_schema in HandlerBase for access during parsing
- Modified AppendScalar to check for conversion before erroring
- Added TryConvertAndAppend helper function to handle conversions
- Updated Bool handler to also support conversion
- Added tests for number->string and string->number conversions

Supported conversions:
- Number <-> String (when numeric)
- Boolean <-> String
- Boolean <-> Number
- Number -> Boolean (0=false, non-zero=true)

Fixes apache#49158
@shashbha14 shashbha14 force-pushed the GH-49158-json-type-coercion branch from 0e3b8af to d776ba5 Compare February 7, 2026 16:22
@rok
Member

rok commented Feb 7, 2026

Thanks for jumping on this @shashbha14. However, please note we might not want to change the JSON parser right now, depending on the outcome of the discussion on the issue.

@kou kou changed the title from "GH-49158: Allow type conversion in JSON parser when explicit schema is provided" to "GH-49158: [Python] Allow type conversion in JSON parser when explicit schema is provided" Feb 8, 2026
Comment on lines +1939 to +1941
personal_data : bool, default None
Whether the table/batch contains personal data. If True, adds
b'ARROW:personal_data': b'true' to the metadata.
Member

Out of curiosity, why is personal_data needed?

@github-actions github-actions bot added awaiting changes and removed awaiting review labels Feb 9, 2026
Member

@raulcd raulcd left a comment

I've seen this ARROW:personal_data field on other PRs too like:

This PR seems to contain 6 commits, several of them from PRs unrelated to this one. Could you please remove the unnecessary commits?

@rok
Member

rok commented Feb 10, 2026

Since @shashbha14 seems inactive, we'd best close this and leave the issue unassigned for now.

@raulcd
Member

raulcd commented Feb 10, 2026

I agree with you. I've removed the assignee from the issue. Happy to close this PR.

@rok
Member

rok commented Feb 10, 2026

@shashbha14 closing this now for inactivity. Feel free to reopen this PR and engage us in review.

@rok rok closed this Feb 10, 2026
@shashbha14
Contributor Author

I had an initial PR (#49177) for this, but it mixed in unrelated commits and the JSON parser behavior is still being discussed. I’m happy to revisit this once there’s consensus on how coercion with explicit schema should behave.

@rok
Member

rok commented Feb 11, 2026

Thanks for replying @shashbha14! Feel free to propose a PR when we have consensus in the discussion.

Development

Successfully merging this pull request may close these issues.

[Python] PyArrow cannot read from a newline-delimited JSON file with inconsistent column types, even if parse_options specifies a schema

3 participants