@MrBurmark MrBurmark commented Dec 19, 2025

Summary

Add more support for correctness checking.
Each kernel can now set its own checksum tolerance, and we print whether the checksum met that tolerance in both the show-progress screen output and the checksum output file.
This mechanism is now also used by the test executable.

How concerned should we be that checksums grow over multiple passes?

Example Output

Here is an example of -sp screen output.

Run kernel -- Basic_MULTI_REDUCE
        Running Base_Seq variant
                Running      default tuning -- 2.71738e-05 sec. x 50 rep. PASSED checksum
        Running Lambda_Seq variant
                Running      default tuning -- 2.71464e-05 sec. x 50 rep. PASSED checksum
        Running RAJA_Seq variant
                Running      default tuning -- 2.71234e-05 sec. x 50 rep. PASSED checksum
        Running Base_OpenMP variant
                Running      default tuning -- 0.000128336 sec. x 50 rep. PASSED checksum
        Running Lambda_OpenMP variant
                Running      default tuning -- 0.000129231 sec. x 50 rep. PASSED checksum
        Running RAJA_OpenMP variant
                Running      default tuning -- 1.70671e-05 sec. x 50 rep. FAILED checksum

Here is an example of checksum.txt file output.

Kernel                     
........................................................
Variants                   Result  Tolerance                   Average Checksum            Max Checksum Diff           Checksum Diff StdDev        
                                                                                           (vs. first variant listed)                              
----------------------------------------------------------------------------------------
Basic_MULTI_REDUCE         
........................................................
Base_Seq-default           PASSED  9.9999999999999995475e-08   5865.2700279413772306       0.0000000000000000000       0.0000000000000000000       
Lambda_Seq-default         PASSED  9.9999999999999995475e-08   5865.2700279413772306       0.0000000000000000000       0.0000000000000000000       
RAJA_Seq-default           PASSED  9.9999999999999995475e-08   5865.2700279413772306       0.0000000000000000000       0.0000000000000000000       
Base_OpenMP-default        PASSED  9.9999999999999995475e-08   5865.2700279413772306       0.0000000000000000000       0.0000000000000000000       
Lambda_OpenMP-default      PASSED  9.9999999999999995475e-08   5865.2700279413772306       0.0000000000000000000       0.0000000000000000000       
RAJA_OpenMP-default        FAILED  9.9999999999999995475e-08   2918.7998532503377369       2946.4701746910394893       2.2204460492503130808e-16 

-------------------------------------------------------

@MrBurmark MrBurmark requested review from a team and rhornung67 December 19, 2025 18:11

@rhornung67 rhornung67 left a comment

Either here or in another PR, it may be a good idea to note in the Dev Guide some things to think about when setting the tolerance for a kernel in the code. Maybe a brief explanation of how you determined the tolerance to set for a couple of representative kernels with different tolerances.

@rhornung67

We've tried to deal with growing checksums by adding a multiplier to keep their magnitude reasonable. We also use a Kahan sum approach for summing checksum values. We haven't looked at this in a while, at least I haven't. It may be worth re-investigating since problem sizes are getting larger.

artv3 commented Dec 23, 2025

Do you mean multiple passes over the suite, or the number of times the kernel is invoked?

MrBurmark commented Dec 23, 2025

The checksums are also added up over passes, and the checksum scaling factor does not account for that effect.

MrBurmark commented Dec 26, 2025

I updated the documentation for checksum consistency and added some documentation for the checksum tolerance and scaling factor. @rhornung67

``ConsistentPerVariantTuning``. On the other hand, some kernels have
variant tunings that get different checksums on each run of that variant
tuning, for example due to the ordering of floating-point atomic add
operations, so the checksums are ``Inconsistent``.
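
To illustrate why the ordering of floating-point additions can change a checksum, here is a small sketch that sums the same three values in two different orders. Floating-point addition is not associative, so the two orders round differently. The helper `sum_in_order` is hypothetical and stands in for one particular ordering of atomic adds; it is not code from the suite.

```cpp
#include <vector>

// Adds the values strictly left to right, modelling one possible order in
// which a sequence of atomic adds could land on a shared accumulator.
double sum_in_order(const std::vector<double>& values)
{
  double sum = 0.0;
  for (double v : values) {
    sum += v;
  }
  return sum;
}

// Same three values, two arrival orders.
// In the first order, 1.0 is absorbed by 1.0e17 and lost before the cancellation.
// In the second order, the large values cancel first and 1.0 survives.
const std::vector<double> small_added_first = {1.0e17, 1.0, -1.0e17};
const std::vector<double> small_added_last  = {1.0e17, -1.0e17, 1.0};
```

With IEEE double precision, `sum_in_order(small_added_first)` gives 0.0 while `sum_in_order(small_added_last)` gives 1.0, even though both vectors hold the same values.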
Member

Are Inconsistent kernels expected to agree within some tolerance? I suspect the inconsistency may grow with the number of reps.

Member Author

It depends on the kernel; most kernels overwrite their outputs on each rep instead of continuing to accumulate.

setFLOPsPerRep(getActualProblemSize());

setChecksumConsistency(ChecksumConsistency::Inconsistent); // atomics
setChecksumTolerance(ChecksumTolerance::normal);
Member

Are the ChecksumTolerance::normal values defined in the docs?

Member Author

@MrBurmark MrBurmark Dec 26, 2025

Nope, it's only in the code. It is 1e-7. It might not be a bad idea to output the tolerance in the checksum output file.

Member

Sounds like a great idea to me! I would also encourage documentation on what we can expect from normal vs tight tolerance. I haven't dug into it myself, but I'm curious if the tolerances are relative and if so how they are calculated? It could be that each kernel could define its own relative tolerance definition -- what do folks think about that?

Member Author

@MrBurmark MrBurmark Dec 26, 2025

For reference, tight is 1e-12, but that number is arbitrary and I don't really know if it's a good number. All of the kernels using that tolerance have checksums that are identical on all of the platforms I've tested them on.
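
On the question of whether the tolerances are relative: the thread does not say how the suite compares checksums, so the following is only a sketch of what a relative comparison could look like. The constant names and the function `checksum_within_tolerance` are hypothetical; only the values 1e-7 (normal) and 1e-12 (tight) come from the comments above.

```cpp
#include <algorithm>
#include <cmath>

// Tolerance values quoted in this thread; the names are hypothetical.
constexpr double kNormalTolerance = 1e-7;
constexpr double kTightTolerance  = 1e-12;

// Returns true if `checksum` agrees with `reference` to within `tolerance`,
// measured relative to the larger of the two magnitudes.
// Assumption: the suite may instead compare the absolute difference.
bool checksum_within_tolerance(double checksum, double reference,
                               double tolerance)
{
  const double diff  = std::fabs(checksum - reference);
  const double scale = std::max(std::fabs(checksum), std::fabs(reference));
  if (scale == 0.0) {
    return true;  // both values are exactly zero
  }
  return diff <= tolerance * scale;
}
```

Under this definition, the RAJA_OpenMP checksum from the example output (2918.79... against a reference of 5865.27...) fails the normal tolerance, while a pair of values that differ by about one part in 1e9 passes normal and fails tight. A relative test like this also stays meaningful when checksums grow over passes, which an absolute threshold would not.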

@MrBurmark MrBurmark requested a review from rhornung67 December 26, 2025 21:00