Fix JSON Matrix tests on Databricks 14.3 #11533

Open
razajafri opened this issue Sep 27, 2024 · 2 comments · May be fixed by #11719

@razajafri
Collaborator

Build the plugin against the Databricks 14.3 cluster using #11467. Once it builds successfully, run the JSON matrix tests with `TESTS=json_matrix_test.py jenkins/databricks/test.sh`.

The following tests fail:

[gw3] [ 38%] FAILED ../../src/main/python/json_matrix_test.py::test_from_json_long_structs
[gw3] [ 38%] FAILED ../../src/main/python/json_matrix_test.py::test_scan_json_long_structs
@mythrocks
Collaborator

I have filed #11711 for the change in behaviour of from_json on Databricks 14.3. This will need version-specific handling.

@revans2
Collaborator

revans2 commented Nov 12, 2024

Generally, what I have been doing for the JSON matrix, and what I think is the proper course, is to split up the data for tests that are failing. That way we keep coverage for the parts that work, and we can mark the parts that fail with a clear error message so we know what is happening there.

In the case of #11711, I don't know what the priority is going to end up being, so splitting up the tests is probably the best way to get around the issue. I also don't think that #11711 is really all that much of a blocker. There are very few cases where a top-level null is going to be treated differently from a struct with two nulls in it.
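
As a rough illustration of the "split up the data" approach, the sketch below separates JSON-lines rows containing integers outside the signed 64-bit range from rows that do not. It is only a sketch under assumptions: the column names (`data`, `A`, `B`), the sample rows, and the helper names `has_long_overflow` and `split_rows` are hypothetical and are not taken from the actual test data or from the spark-rapids test suite.

```python
import json

# Range of a signed 64-bit integer (the range of Spark's LongType).
LONG_MIN = -(2**63)
LONG_MAX = 2**63 - 1


def has_long_overflow(value):
    """Return True if any integer nested in a parsed JSON value is outside the 64-bit range."""
    if isinstance(value, bool):
        return False  # bool is a subclass of int in Python; it cannot overflow
    if isinstance(value, int):
        return not (LONG_MIN <= value <= LONG_MAX)
    if isinstance(value, dict):
        return any(has_long_overflow(v) for v in value.values())
    if isinstance(value, list):
        return any(has_long_overflow(v) for v in value)
    return False


def split_rows(lines):
    """Split JSON-lines rows into (safe, overflowing).

    Rows that do not parse as JSON are kept with the safe set; that is an
    arbitrary choice made for this sketch.
    """
    safe, overflowing = [], []
    for line in lines:
        try:
            parsed = json.loads(line)
        except json.JSONDecodeError:
            safe.append(line)
            continue
        (overflowing if has_long_overflow(parsed) else safe).append(line)
    return safe, overflowing


# Hypothetical sample rows, not the contents of the real test file.
rows = [
    '{"data": {"A": 0, "B": 1}}',
    '{"data": {"A": 9223372036854775808, "B": 1}}',
    '{"data": {"A": "not a number"}}',
]
safe_rows, overflow_rows = split_rows(rows)
```

The safe rows could then stay in the file that every platform tests, while the overflowing rows move to a separate file whose test carries a failure reason pointing at #11711.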

mythrocks added a commit to mythrocks/spark-rapids that referenced this issue Nov 12, 2024
Fixes NVIDIA#11533.

This commit addresses the test failures reported in NVIDIA#11533, for the
following tests:
  - `json_matrix_test.py::test_from_json_long_structs()`
  - `json_matrix_test.py::test_scan_json_long_structs()`

These failures are a result of NVIDIA#11711.  When the JSON parser attempts to
read integral struct members from a JSON file, if the parsing leads to
an overflow, then the `STRUCT` column value is deemed null on Databricks
14.3 (i.e. *without* `spark-rapids` active).  This behaviour differs
from that exhibited by Apache Spark versions exceeding 3.4.1.

This commit breaks out the problematic JSON test rows into a separate
file, whose read is tested in an `xfail` for Databricks 14.3.  The
remaining rows are tested on all versions.

The true fix for NVIDIA#11711 will be addressed later.

Signed-off-by: MithunR <[email protected]>
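
To make the difference described in the commit message concrete, here is a toy model in plain Python. It is not Spark's or Databricks' JSON parser. It only contrasts two outcomes for an overflowing integral member: nulling the whole struct (what the commit message reports for Databricks 14.3) versus nulling only that member (my reading of the other behaviour, based on the "struct with two nulls" remark in the thread). The function name `parse_long_struct`, its flag, and the sample row are hypothetical, and the model assumes each line is a JSON object.

```python
import json

# Range of a signed 64-bit integer.
LONG_MIN = -(2**63)
LONG_MAX = 2**63 - 1


def parse_long_struct(line, fields, null_whole_struct_on_overflow):
    """Toy model of reading a struct of longs from one JSON object line.

    Illustration only; this does not reproduce any real parser.
    """
    raw = json.loads(line)
    result = {}
    for name in fields:
        value = raw.get(name)
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if is_int and LONG_MIN <= value <= LONG_MAX:
            result[name] = value  # value fits in a long
        elif is_int and null_whole_struct_on_overflow:
            return None  # overflow nulls the entire struct
        else:
            result[name] = None  # only this member becomes null
    return result


# Hypothetical row: "A" is 2**63, one past the largest 64-bit value.
line = '{"A": 9223372036854775808, "B": 7}'
member_null = parse_long_struct(line, ["A", "B"], null_whole_struct_on_overflow=False)
struct_null = parse_long_struct(line, ["A", "B"], null_whole_struct_on_overflow=True)
```

In this model `member_null` keeps the struct with `A` set to null, while `struct_null` is null at the top level, which is the kind of mismatch the split-out rows would exercise.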
mythrocks self-assigned this Nov 13, 2024