Correctness Checking #608
base: develop
rhornung67
left a comment
Either here or in another PR, it may be a good idea to note in the Dev Guide some things to think about when setting the tolerance for a kernel in the code. Maybe a brief explanation of how you determined the tolerance to set for a couple of representative kernels with different tolerances.
We've tried to deal with growing checksums by adding a multiplier to keep their magnitude reasonable. We also use a Kahan sum approach for summing checksum values. We haven't looked at this in a while, at least I haven't. Maybe we should re-investigate, since problem sizes are getting larger.
Do you mean multiple passes over the suite, or the number of times the kernel is invoked within a pass?
The checksums are added up over passes as well and the checksum scaling factor does not account for that effect. |
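For readers unfamiliar with the Kahan sum approach mentioned above, here is a minimal sketch of compensated summation next to a naive sum. The function names `naive_sum` and `kahan_sum` are made up for illustration and are not the suite's actual checksum routines; this only shows the general technique.

```cpp
#include <vector>

// Naive left-to-right summation. Small terms can be lost entirely
// when they are added to a much larger running total.
double naive_sum(const std::vector<double>& values)
{
  double sum = 0.0;
  for (double v : values) {
    sum += v;
  }
  return sum;
}

// Kahan (compensated) summation. A second variable carries the
// low-order bits that were rounded away in the previous addition.
// Note: aggressive floating-point optimizations (e.g. fast-math
// style flags) may remove the compensation step.
double kahan_sum(const std::vector<double>& values)
{
  double sum = 0.0;
  double comp = 0.0;  // running compensation for lost low-order bits
  for (double v : values) {
    double y = v - comp;    // apply the correction from the last step
    double t = sum + y;     // low-order bits of y may be lost here
    comp = (t - sum) - y;   // recover what was lost (negated)
    sum = t;
  }
  return sum;
}
```

With an input of 1.0 followed by several values of 1e-16, the naive sum stays at exactly 1.0 because each small term is below half the spacing between doubles near 1.0, while the compensated sum retains their combined contribution. Compensation reduces rounding error, but it does not stop the checksum magnitude itself from growing as more passes are accumulated.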
SetChecksumTolerance in all kernels for consistency.
I updated the documentation for checksum consistency and added some documentation for the checksum tolerance and scaling factor. @rhornung67
``ConsistentPerVariantTuning``. On the other hand, some kernels have
variant tunings that get different checksums on each run of that variant
tuning, for example due to the ordering of floating-point atomic add
operations, so the checksums are ``Inconsistent``.
Are Inconsistent kernels expected to agree within some tolerance? I suspect the inconsistency may grow with the number of reps.
It depends on the kernel; most kernels overwrite their outputs on each rep instead of continuing to accumulate.
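To illustrate why the ordering of floating-point additions can change a checksum at all, here is a small sketch that sums the same values in two different orders. It is serial code standing in for the effect of atomic add ordering, and the names `sum_forward` and `sum_reverse` are hypothetical, not part of the suite.

```cpp
#include <numeric>
#include <vector>

// Sum the values front to back.
double sum_forward(const std::vector<double>& v)
{
  return std::accumulate(v.begin(), v.end(), 0.0);
}

// Sum the same values back to front. Floating-point addition is not
// associative, so the result can differ from sum_forward in the last
// few bits even though the inputs are identical.
double sum_reverse(const std::vector<double>& v)
{
  return std::accumulate(v.rbegin(), v.rend(), 0.0);
}
```

For the input `{1.0, 1e-16, 1e-16}`, the forward sum drops both small terms one at a time, while the reverse sum combines them first and they survive the final addition. The two results differ by one unit in the last place, which is far below a tolerance such as 1e-7 but is enough to make a bitwise comparison fail.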
setFLOPsPerRep(getActualProblemSize());

setChecksumConsistency(ChecksumConsistency::Inconsistent); // atomics
setChecksumTolerance(ChecksumTolerance::normal);
Are the ChecksumTolerance::normal values defined in the docs?
Nope, it's only in the code. It is 1e-7. It might not be a bad idea to output the tolerance in the checksum output file.
Sounds like a great idea to me! I would also encourage documentation on what we can expect from normal vs tight tolerance. I haven't dug into it myself, but I'm curious if the tolerances are relative and if so how they are calculated? It could be that each kernel could define its own relative tolerance definition -- what do folks think about that?
For reference, tight is 1e-12, but that number is arbitrary and I don't really know if it's a good number. All of the kernels using that tolerance have checksums that are identical on all of the platforms I've tested them on.
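As a concrete reference point for the relative-versus-absolute question, here is one possible way a relative tolerance check could look. This is a sketch under assumptions: the thread does not say how the suite actually compares checksums, the function `checksums_agree` and the constant names are hypothetical, and only the values 1e-7 and 1e-12 come from the comments above.

```cpp
#include <algorithm>
#include <cmath>

// Tolerance values quoted in this thread; the constant names are
// illustrative and not the suite's actual definitions.
constexpr double kNormalTolerance = 1e-7;
constexpr double kTightTolerance = 1e-12;

// Assumed relative comparison: the difference is scaled by the larger
// magnitude of the two checksums. Two exact zeros are treated as equal.
bool checksums_agree(double reference, double value, double tolerance)
{
  const double diff = std::abs(reference - value);
  const double scale = std::max(std::abs(reference), std::abs(value));
  if (scale == 0.0) {
    return true;  // both checksums are exactly zero
  }
  return diff <= tolerance * scale;
}
```

Under this assumed definition, checksums of 1000.0 and 1000.00001 differ by about 1e-8 relative to their magnitude, so they would pass at the normal tolerance and fail at the tight one. A relative check like this is insensitive to checksum magnitude, which may matter if checksums grow with problem size or number of passes; an absolute check would behave differently.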
Do not print the whole checksum.
Summary
Add more support for correctness checking.
Each kernel can now set its own tolerance, and we print whether the checksum met that tolerance in the show-progress screen output and in the checksum output file.
This mechanism is now also used by the test executable.
How concerned should we be that checksums grow over multiple passes?
Example Output
Here is an example of -sp screen output.
Here is an example of checksum.txt file output.