@MrBurmark MrBurmark commented Dec 17, 2025

Summary

Add a count for the amount of memory touched. This is useful as an estimate of the memory used by a kernel and as a best case for the amount of cache a kernel needs.

This is implemented by counting bytes that are modified and written back (modify write) separately from bytes that are only read and bytes that are only written.

  • This PR is a feature
  • It does the following:
    • Adds a more accurate memory usage count, at the request of people trying to run problems at 4x the last-level cache size

Memory accesses are now split into read, write, modify write,
and atomic modify write categories. Each memory address touched
should appear in exactly one of these memory attributes
per loop nest/kernel launch.

Continue to output Bytes, which is the total memory traffic. This is
calculated via (read + write + 2*(modify write + atomic modify write)).

Also output BytesTouched, which is the amount of memory used. This is
calculated as the sum of the four categories of memory accesses.

These numbers are idealized and real memory traffic may be higher if
perfect caching is not achieved. They are also sometimes estimates
as some kernels have conditionals that rely on random numbers or
complex implementations.
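
The two totals described in this commit message can be sketched as follows. This is a minimal illustration, not the code in this PR: the `ByteCounts` class, its field names, and the DAXPY-like example are assumptions made for the sketch. Only the two formulas come from the commit message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ByteCounts:
    """Per-repetition byte counts (hypothetical container, not from the PR).

    Every touched byte is assumed to belong to exactly one field.
    """
    read: int = 0
    write: int = 0
    modify_write: int = 0
    atomic_modify_write: int = 0

    def bytes_traffic(self) -> int:
        # Modify-write bytes are both read and written, so they count twice.
        return self.read + self.write + 2 * (
            self.modify_write + self.atomic_modify_write
        )

    def bytes_touched(self) -> int:
        # Each touched byte counts once, whatever the kind of access.
        return (
            self.read + self.write + self.modify_write + self.atomic_modify_write
        )


# Illustrative DAXPY-like update y[i] += a * x[i] over n doubles:
# x is only read, y is read, modified, and written back.
n = 1_000_000
daxpy = ByteCounts(read=8 * n, modify_write=8 * n)
print(daxpy.bytes_traffic())  # 24000000
print(daxpy.bytes_touched())  # 16000000
```

In this example the traffic total is 1.5x the touched total, because the `y` array is counted once as memory used but twice as memory traffic.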
Add the number of bytes modify written per kernel.
Label which variable(s) are accessed for each line of the
bytes read, written, modify written, or atomic modify written.
@MrBurmark MrBurmark requested review from a team and rhornung67 December 17, 2025 19:15

@rhornung67 rhornung67 left a comment

Some text needs to be made clearer, maybe some details added, I think.

Undo the separation of the read modify write count from the read count
and write count, so the BytesRead/rep count includes both
"read only" bytes and "read, modified, and written" bytes.
@MrBurmark MrBurmark requested a review from rhornung67 December 18, 2025 20:16
FLOP count. So these numbers are rough estimates. For actual FLOP counts,
a performance analysis tool should be used.
* **BytesTouched/rep** -- Total number of bytes accessed for each repetition
of kernel. This is a best case scenario for the amount of cache needed to
Member

Bytes in DRAM or HBM, right?

Member Author

Yes, each byte is counted once even if it's accessed multiple times or in multiple ways. However, this doesn't count things we expect to be on the stack, such as what is captured in the lambda.
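
As a rough illustration of "each byte is counted once", the sketch below counts the distinct bytes covered by a list of accesses. It is not how the PR computes the number (the PR's counts are derived per kernel from its attributes): the function name and the `(start_address, size_in_bytes)` access format are assumptions made for this example, and stack data such as lambda captures is simply left out of the input.

```python
def count_touched_bytes(accesses):
    """Return the number of distinct bytes covered by the given accesses.

    `accesses` is an iterable of (start_address, size_in_bytes) pairs.
    Overlapping or repeated accesses contribute each byte only once.
    """
    touched = 0
    covered_end = None  # one past the highest byte counted so far
    for start, size in sorted(accesses):
        end = start + size
        if covered_end is None or start >= covered_end:
            # Disjoint from everything counted so far.
            touched += size
            covered_end = end
        elif end > covered_end:
            # Partial overlap: count only the bytes not yet covered.
            touched += end - covered_end
            covered_end = end
        # Otherwise the access is fully contained and adds nothing.
    return touched


# An 800-byte array accessed twice, plus an overlapping 800-byte access
# that starts halfway through it.
print(count_touched_bytes([(0, 800), (0, 800), (400, 800)]))  # 1200
```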

the value of zero is reported.

.. note:: The Bytes*/rep and FLOPs/rep counts are estimates for kernels
          involving randomness or difficult-to-count algorithms.
Member

Which kernels exhibit randomness? It may be nice to list the kernels that do not have the best estimates.

Member

@artv3 I think the plan is to have that sort of information appear in the output files. Putting kernel-specific information in the user documentation creates extra maintenance burden.

Member Author

There is not currently an attribute for that, but it can be added fairly easily.
