[BUG] Multi-source batched JSON reader: error due to reordered columns in partial tables constructed from each batch #17689

Open
shrshi opened this issue Jan 7, 2025 · 1 comment · May be fixed by #17708
Labels
bug Something isn't working

Comments

shrshi commented Jan 7, 2025

Describe the bug
Before we concatenate the partial tables generated from each batch, we error out if their schemas don't match. However, the column ordering of a partial table can change depending on nulls in the columns, so two batches holding the same columns can still fail this check. We should not error out in this case.
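
A minimal sketch of the failure mode, not libcudf code: each partial table's schema is modeled as a hypothetical list of (column name, dtype) pairs, and the column names and dtypes are made up. An order-sensitive check rejects two batches that an order-insensitive check would accept.

```python
# Hypothetical model of two partial tables: a list of (column name, dtype)
# pairs in table order. Names and dtypes are invented for illustration.
batch0_schema = [("a", "int64"), ("b", "float64")]
batch1_schema = [("b", "float64"), ("a", "int64")]  # same columns, different order


def schemas_match_strict(lhs, rhs):
    """Order-sensitive comparison: position i must agree in name and dtype."""
    return lhs == rhs


def schemas_match_unordered(lhs, rhs):
    """Order-insensitive comparison: same names, each with the same dtype."""
    return len(lhs) == len(rhs) and dict(lhs) == dict(rhs)


strict = schemas_match_strict(batch0_schema, batch1_schema)
unordered = schemas_match_unordered(batch0_schema, batch1_schema)
print(strict, unordered)
```

Here the strict comparison fails only because of column position, which is the case the issue argues should not be an error.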

Steps/Code to reproduce bug
Draft PR #17688
./build/latest/benchmarks/JSON_READER_NVBENCH --benchmark json_read_compressed_io --axis compression_type GZIP --axis data_size[pow2]=28 --axis num_sources=4 --device 0

Expected behavior
Enforce column ordering based on the partial table in the first batch in all later batches.
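
As a rough sketch of this behavior (an assumption about how it could look, not the reader's actual implementation), partial tables can be modeled as plain dicts mapping column name to a list of values. The helper names `reorder_like` and `concatenate` are hypothetical.

```python
def reorder_like(first, later):
    """Return `later` with its columns arranged in the column order of `first`.

    Both arguments are plain dicts mapping column name -> list of values,
    a hypothetical stand-in for a partial table.
    """
    if set(first) != set(later):
        raise ValueError("column sets differ; reordering alone is not enough")
    return {name: later[name] for name in first}


def concatenate(tables):
    """Concatenate partial tables row-wise, using the first table's order."""
    first = tables[0]
    aligned = [first] + [reorder_like(first, t) for t in tables[1:]]
    return {name: [v for t in aligned for v in t[name]] for name in first}


batch0 = {"a": [1, 2], "b": ["x", "y"]}
batch1 = {"b": ["z"], "a": [3]}  # same columns, reordered
result = concatenate([batch0, batch1])
print(result)
```

This sketch deliberately raises when the column sets differ, since reordering by itself cannot resolve that case.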

shrshi added the bug label Jan 7, 2025

shrshi commented Jan 7, 2025

> Enforce column ordering based on the partial table in the first batch in all later batches.

From offline discussions with @karthikeyann, pitfalls with the proposed solution:

  1. Data type mismatch for the same column between partial tables. For example, a column may be int8 in one partial table but be inferred as int16 or float in the next chunk.
  2. A column is present in the second batch but not in the first. In this case, we will prune that column, and the final table will be missing it. Note that the converse case, where a column present in the first batch is missing from a later batch, is handled by the JSON tree algorithms: the missing column is included and filled with nulls.
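
To make the two pitfalls concrete, here is a small sketch that classifies the columns of a later batch against the first batch's schema. It assumes schemas are plain dicts of column name to dtype string; `classify_columns` and the example schemas are hypothetical and do not reflect the JSON reader's internals.

```python
def classify_columns(first_schema, later_schema):
    """Compare a later batch's schema against the first batch's schema.

    Both schemas are dicts mapping column name -> dtype string.
    """
    report = {"ok": [], "dtype_mismatch": [], "null_filled": [], "pruned": []}
    for name, dtype in first_schema.items():
        if name not in later_schema:
            # Converse case: column missing later, so it would be null-filled.
            report["null_filled"].append(name)
        elif later_schema[name] != dtype:
            # Pitfall 1: same column, different inferred type.
            report["dtype_mismatch"].append((name, dtype, later_schema[name]))
        else:
            report["ok"].append(name)
    for name in later_schema:
        if name not in first_schema:
            # Pitfall 2: column unknown to the first batch, so it would be dropped.
            report["pruned"].append(name)
    return report


first_schema = {"a": "int8", "b": "str", "c": "bool"}
later_schema = {"d": "str", "b": "str", "a": "int16"}
report = classify_columns(first_schema, later_schema)
print(report)
```

In this example, `a` is flagged for a type mismatch, `c` would be null-filled, and `d` would be silently lost if ordering were enforced from the first batch alone.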
