Correctness Checking #608

MrBurmark · 2025-12-19T18:11:41Z

Summary

Add more support for correctness checking.
Each kernel can now set its own checksum tolerance, and both the show-progress (-sp) screen output and the checksum output file report whether the checksum met that tolerance.
This mechanism is now also used by the test executable.

How concerned should we be that checksums grow over multiple passes?

This PR is a feature.
It does the following:
- Adds correctness checking at the request of Correctness/Robustness Checking #604

Example Output

Here is an example of -sp screen output.

Run kernel -- Basic_MULTI_REDUCE
        Running Base_Seq variant
                Running      default tuning -- 2.71738e-05 sec. x 50 rep. PASSED checksum
        Running Lambda_Seq variant
                Running      default tuning -- 2.71464e-05 sec. x 50 rep. PASSED checksum
        Running RAJA_Seq variant
                Running      default tuning -- 2.71234e-05 sec. x 50 rep. PASSED checksum
        Running Base_OpenMP variant
                Running      default tuning -- 0.000128336 sec. x 50 rep. PASSED checksum
        Running Lambda_OpenMP variant
                Running      default tuning -- 0.000129231 sec. x 50 rep. PASSED checksum
        Running RAJA_OpenMP variant
                Running      default tuning -- 1.70671e-05 sec. x 50 rep. FAILED checksum

Here is an example of checksum.txt file output.

Kernel                     
........................................................
Variants                   Result  Tolerance                   Average Checksum            Max Checksum Diff           Checksum Diff StdDev        
                                                                                           (vs. first variant listed)                              
----------------------------------------------------------------------------------------
Basic_MULTI_REDUCE         
........................................................
Base_Seq-default           PASSED  9.9999999999999995475e-08   5865.2700279413772306       0.0000000000000000000       0.0000000000000000000       
Lambda_Seq-default         PASSED  9.9999999999999995475e-08   5865.2700279413772306       0.0000000000000000000       0.0000000000000000000       
RAJA_Seq-default           PASSED  9.9999999999999995475e-08   5865.2700279413772306       0.0000000000000000000       0.0000000000000000000       
Base_OpenMP-default        PASSED  9.9999999999999995475e-08   5865.2700279413772306       0.0000000000000000000       0.0000000000000000000       
Lambda_OpenMP-default      PASSED  9.9999999999999995475e-08   5865.2700279413772306       0.0000000000000000000       0.0000000000000000000       
RAJA_OpenMP-default        FAILED  9.9999999999999995475e-08   2918.7998532503377369       2946.4701746910394893       2.2204460492503130808e-16 

-------------------------------------------------------


rhornung67

Either here or in another PR, it may be a good idea to note in the Dev Guide some things to think about when setting the tolerance for a kernel in the code. Maybe a brief explanation of how you determined the tolerance to set for a couple of representative kernels with different tolerances.

rhornung67 · 2025-12-19T21:11:19Z

We've tried to deal with growing checksums by adding a multiplier to keep their magnitude reasonable. We also use a Kahan sum approach for summing checksum values. We haven't looked at this in a while, at least I haven't. It may be worth re-investigating since problem sizes are getting larger.
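For readers unfamiliar with the Kahan sum approach mentioned above, here is a minimal sketch of compensated summation next to a plain loop. The function names `kahanSum` and `naiveSum` are illustrative only and are not taken from RAJAPerf or RAJA; the sketch also assumes the compiler is not reordering floating-point arithmetic (for example, no fast-math flags).

```cpp
#include <vector>

// Compensated (Kahan) summation: carries a correction term that recovers
// low-order bits lost when a small value is added to a large running sum.
// Illustrative name, not a RAJAPerf or RAJA function.
double kahanSum(const std::vector<double>& values)
{
  double sum = 0.0;
  double comp = 0.0;  // running compensation for lost low-order bits
  for (double v : values) {
    const double y = v - comp;  // apply the correction from the last step
    const double t = sum + y;   // low-order bits of y may be lost here
    comp = (t - sum) - y;       // recover what was lost (negated)
    sum = t;
  }
  return sum;
}

// Plain left-to-right summation, for comparison.
double naiveSum(const std::vector<double>& values)
{
  double sum = 0.0;
  for (double v : values) {
    sum += v;
  }
  return sum;
}
```

With one value of 1.0 followed by many values of 1e-16, the plain loop never moves off 1.0 because each addend is below half a unit in the last place, while the compensated loop keeps the small contributions.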

artv3 · 2025-12-23T18:58:01Z

Do you mean multiple passes over the suite, or the number of times the kernel is invoked?

MrBurmark · 2025-12-23T20:30:02Z

The checksums are added up over passes as well and the checksum scaling factor does not account for that effect.
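To make the effect concrete, here is a small sketch of a checksum that is summed across passes. It is a simplified model, not the RAJAPerf implementation: `accumulateOverPasses`, `absoluteDiff`, and the way `scale_factor` is applied are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical model: every pass adds scale_factor * per_pass_checksum to a
// running total. `scale_factor` stands in for the per-kernel checksum scaling
// factor discussed above; the real code may apply it differently.
double accumulateOverPasses(double per_pass_checksum,
                            double scale_factor,
                            int num_passes)
{
  assert(num_passes >= 0);
  double total = 0.0;
  for (int pass = 0; pass < num_passes; ++pass) {
    total += scale_factor * per_pass_checksum;
  }
  return total;
}

// Absolute difference between two accumulated checksums.
double absoluteDiff(double a, double b)
{
  return std::abs(a - b);
}
```

In this model, two variants whose per-pass checksums differ by a fixed amount show an absolute difference that grows linearly with the number of passes, while their relative difference stays the same. A per-pass scaling factor alone does not remove that growth.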


MrBurmark · 2025-12-26T18:48:33Z

I updated the documentation for checksum consistency and added some documentation for the checksum tolerance and scaling factor. @rhornung67

artv3 · 2025-12-26T19:17:58Z

docs/sphinx/dev_guide/kernel_class_impl.rst

+        ``ConsistentPerVariantTuning``. On the other hand, some kernels have
+        variant tunings that get different checksums on each run of that variant
+        tuning, for example due to the ordering of floating-point atomic add
+        operations, so the checksums are ``Inconsistent``.


Are Inconsistent kernels expected to agree within some tolerance? I suspect the inconsistency may grow with the number of reps.

It depends on the kernel. Most kernels overwrite their outputs on each rep instead of continuing to accumulate.
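The reason the ordering of floating-point atomic adds matters is that floating-point addition is not associative. The sketch below stands in for that effect without using threads: each input order represents one possible order in which atomic adds could land on a shared accumulator. The name `sumInOrder` is illustrative and not part of RAJAPerf.

```cpp
#include <vector>

// Adds the values in exactly the order given. Each ordering of the same
// values models one possible arrival order of atomic adds on one accumulator.
double sumInOrder(const std::vector<double>& values)
{
  double sum = 0.0;
  for (double v : values) {
    sum += v;
  }
  return sum;
}
```

Summing `{1.0e17, 1.0, -1.0e17}` loses the 1.0, because it is far below the spacing between representable doubles near 1.0e17, while summing `{1.0e17, -1.0e17, 1.0}` keeps it. The same three values therefore give two different results depending only on order.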

artv3 · 2025-12-26T19:19:10Z

src/algorithm/ATOMIC.cpp

  setFLOPsPerRep(getActualProblemSize());

  setChecksumConsistency(ChecksumConsistency::Inconsistent); // atomics
+  setChecksumTolerance(ChecksumTolerance::normal);


Are the ChecksumTolerance::normal values defined in the docs?

Nope, it's only in the code. It is 1e-7. It might not be a bad idea to output the tolerance in the checksum output file.

Sounds like a great idea to me! I would also encourage documentation on what we can expect from normal vs tight tolerance. I haven't dug into it myself, but I'm curious if the tolerances are relative and if so how they are calculated? It could be that each kernel could define its own relative tolerance definition -- what do folks think about that?

For reference, tight is 1e-12, but that number is arbitrary and I don't really know if it's a good number. All of the kernels using that tolerance have checksums that are identical on all of the platforms I've tested them on.
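As a point of reference for the relative-versus-absolute question, here is one way a tolerance check could be written. This is only a sketch: the thread does not say how RAJAPerf compares checksums, so the `sketch` namespace, the `checksumPasses` function, and the choice of a relative comparison with a floor of 1.0 are all assumptions. Only the values 1e-7 (normal) and 1e-12 (tight) come from the discussion above.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

namespace sketch {

// Tolerance values quoted in this thread; the constant names are hypothetical.
constexpr double normal_tolerance = 1e-7;
constexpr double tight_tolerance = 1e-12;

// Assumed comparison: the difference is divided by the magnitude of the
// reference checksum, floored at 1.0 so that references near zero fall back
// to an absolute comparison instead of dividing by a tiny number.
inline bool checksumPasses(double checksum, double reference, double tolerance)
{
  assert(tolerance >= 0.0);
  const double diff = std::abs(checksum - reference);
  const double scale = std::max(std::abs(reference), 1.0);
  return diff / scale <= tolerance;
}

}  // namespace sketch
```

Under this assumed check, a checksum identical to the reference passes at either level, a checksum that differs by about 1e-9 relative to the reference passes at the normal level but fails at the tight level, and a difference as large as the one in the RAJA_OpenMP row of the example output fails both.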


MrBurmark added 3 commits December 19, 2025 10:06

Add checksum tolerance to KernelBase

610c16c

Use checksum tolerance in outputs

e1fad9a

Merge branch 'develop' of github.com:LLNL/RAJAPerf into feature/burma…

cbdaf94

…rk1/correctness

MrBurmark requested review from a team and rhornung67 December 19, 2025 18:11

rhornung67 approved these changes Dec 19, 2025


MrBurmark added 6 commits December 26, 2025 10:14

Use a setter for checksum_tolerance

eaa7635

SetChecksumTolerance in all kernels for consistency.

Fix use of local checksum_scale_factor in EDGE3D

e45f8a9

Use setChecksumScaleFactor

9e13e4b

Hide checksum_scale_factor in KernelBase

5acfc23

Update checksum documentation

bfe597c

Unremove POLYBENCH_FLOYD_WARSHALL checksum scale factor

f837902

artv3 reviewed Dec 26, 2025


MrBurmark mentioned this pull request Dec 26, 2025

Add kahan sum reduce helper class. llnl/RAJA#1969

Open

MrBurmark added 2 commits December 26, 2025 12:50

Print checksum tolerance to checksum output file

e38359e

Only print pass/fail to -sp output

d018d0d

Do not print the whole checksum.

MrBurmark requested a review from rhornung67 December 26, 2025 21:00
