[BUG] Intermittent result discrepancy for NDS SF3K query86 on L40S #11835
Comments
Reran the repro with two more modifications:
- Disabled the async pool memory allocator: the diff still consistently reproduces.
- Ran under compute-sanitizer with the default memcheck tool: the default check does not catch any errors, and it seems to change concurrency in a way that stops the issue from reproducing.
Pursued a conjecture that the issue only reproduces due to forward compatibility, because we have no cubin sections for compute capability 8.9 (the L40S). However, a targeted compilation for 8.9 equally reproduced the issue.
Running the executors under compute-sanitizer's initcheck tool reveals the following issue classes:
- Uninitialized global memory: one instance looks intentional given its name, but the other does not.
- Unused memory warnings.
Thanks to @ttnghia for the C++ test repro: rapidsai/cudf#17757
Looking at the source code, it looks like we're reading uninitialized data here, as thrust::exclusive_scan() reads one element past where the preceding thrust::reduce_by_key() call writes. Preliminarily, though, it looks like this uninitialized read may not affect the parquet read; still looking into it.
I have a fix for this one locally, but initcheck is showing some others; continuing to try to hunt them down.
This PR fixes a couple of uninitialized reads in the parquet reader. However, it doesn't look like they would cause the original issue, as their data should be unused.
@gerashegalov can you try rerunning the repro case with the fixed PR to see if we are still seeing the issues? If so, we might need to do another round of debugging/fixes to find the actual root cause.
@revans2 I reran my pyspark repro with @pmattione-nvidia's branch. It has not reproduced the original query result diff. However, neither did a couple of runs against the jars that I had documented as reproducing. Checking what might have changed on the node: the driver version is still the same, but I see the node was rebooted after my last "successful" repro runs.
I ran this new code in a jar for 20 tests, and it failed for 20% of them. So this PR did not fix the issue.
Describe the bug
The NDS SF3K CI pipeline exhibits intermittent query result validation failures for various queries.
It is difficult to reproduce, but I was able to reduce the scenario to running q36 and q86 one after another, which fails in over 90% of runs. I dropped LIMIT 100 from q86 to reduce chances of nondeterminism, since LIMIT without a total ordering may return different rows from run to run.
The diff is large, but there is a single row in both the result and the diff with lochierarchy=2, so that is the row to focus on for tests.
This issue seems to have been introduced between build 33 and build 42. Given that the runs are not 100% reproducible, there is a chance the range is wider: I ran build 33 four times without reproducing the issue, while build 42 reproduces the failure quickly.
Steps/Code to reproduce bug
Open the notebook on a single node with an L40S GPU: https://github.com/gerashegalov/rapids-shell/blob/25ca477172f8ac45b71d0eed3452369299748284/src/jupyter/nds2-parquet-3k-snappy.ipynb
Expected behavior
Results must continue to match. These tests consistently pass on the same node when it is configured to use an H100 GPU instead of the L40S.
Environment details
Additional context