
Adding Benchmark instrument #157


Open · tharittk wants to merge 11 commits into main
Conversation

@tharittk (Collaborator) commented Jul 11, 2025

This PR introduces a benchmark decorator that enables users to place mark(label) calls anywhere inside a function to record execution time at specific code points. The timing marks are collected and (as of now) printed in a human-readable format after the function completes.

The design is inspired by C++-style instrumentation, where developers can inject performance markers directly into the code.
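A minimal usage sketch of what this looks like on the user side (the import path is an assumption, and whether the decorator is applied bare or with arguments may differ in the final API; only the benchmark and mark names come from this PR):

import numpy as np

from benchmark import benchmark, mark  # assumed import path for the new module


@benchmark
def run_bench(n=1_000_000):
    x = np.random.rand(n)
    mark("after allocation")
    y = np.fft.fft(x)
    mark("after FFT")
    return y


run_bench()  # the collected timing marks are reported after the call returns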

Changes made:

  • benchmark.py: a benchmark decorator for lightweight instrumentation

  • [Removed] kirchhoff_bench.py: a sample script demonstrating how to use the decorator -- this belongs in the other repo and has therefore been removed

Ongoing work:

  • New/updated documentation illustrating how to use benchmark

  • Formatted printing and logging

  • Take a closer look at the CuPy / NCCL synchronization issue pointed out in the comments

@mrava87 (Contributor) commented Jul 14, 2025

@tharittk I’ll take a look at this soon. Regarding the 2 points of ongoing work, I think these kinds of extensive benchmark tests do not belong in the main repo; they are better suited to the one we created for test codes… here we should add some minimal tests to verify the correctness of the benchmark decorator and the mark functionality
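A minimal test along those lines might look as follows (pytest assumed; the import path, the bare decorator form, and the printed output format are assumptions based on the PR description):

from benchmark import benchmark, mark  # assumed import path


def test_benchmark_records_marks(capsys):
    @benchmark
    def work():
        mark("step 1")
        mark("step 2")

    work()
    out = capsys.readouterr().out

    # the human-readable report should mention every mark label
    assert "step 1" in out
    assert "step 2" in out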

@mrava87 (Contributor) left a comment

@tharittk good start! I think we will probably need a few iterations here to get something that is functional and looks nice, and I am playing with it myself to get a better feeling 😄

Here are some initial comments as they come:

  • We should definitely not have benchmark scripts in the main repo, so Kirchhoff_bench ultimately has to be removed from here and put in a separate repo
  • We should try to add a new page to the documentation called Benchmarking, where we can explain this decorator, show how to use it, and give a small example (a bit like the GPU Support and Adding New Operators pages)
  • I tried to add a benchmark decorator to nccl_allreduce, but what I see is that the messages from benchmark all come before those from mark... is this expected/intended?
  • In general I am never in favour of re-inventing the wheel, so I have been wondering if we really need to write code to print/write to a file; after a few tests I think going the logging route is probably better in the long run (we get lots of things for free, including writing to a file). A little example:

In kirchhoff_bench:

import logging
logging.basicConfig(level=logging.INFO, force=True)
# or, to write to a file instead:
# logging.basicConfig(filename='kirch_log.log', level=logging.INFO, force=True)
numba_logger = logging.getLogger('numba')
numba_logger.setLevel(logging.WARNING)  # silence numba (or any other library whose logs we don't want to see)

and in benchmark

import logging
logger = logging.getLogger(__name__)

... 
print("".join(output)) -> logging.info("".join(output))

from mpi4py import MPI


# TODO (tharitt): later move to env file or something
@mrava87 (Contributor) commented Jul 15, 2025

This should be done for sure and documented in the new doc page I suggested 😄

@tharittk (Collaborator, Author) commented

Got it.
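For illustration, one lightweight version of that, assuming the setting is a simple switch read from the environment rather than an env file (the variable name is hypothetical):

import os

# Hypothetical: read the toggle from the environment instead of hard-coding it
BENCH_ENABLED = os.environ.get("BENCH_ENABLED", "1") != "0"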

# start-to-end time
elapsed = end_time - start_time

# TODO (tharitt): Both MPI + NCCL collective calls have implicit synchronization inside the stream
@mrava87 (Contributor) commented

Are you sure of this... I took your example and bumped par3 a lot: par3 = {"nx": 300, "nz": 500, "ns": 900, "nr": 1000, "use_cupy": True, "use_nccl": True} and decorated nccl_allreduce

I started to see 3 allreduce calls (it turns out two of them are on the shape of DistributedArray when we have a nccl_comm in

if self._allreduce(local_shape[self.axis]) != self.global_shape[self.axis] or \

(not sure this is really intended?), and the last one, which should be way larger, was not really taking longer, e.g.:

---NCCL ALLREDUCE 0.021379 seconds (rank = 0)

---NCCL ALLREDUCE 0.000181 seconds (rank = 0)

---NCCL ALLREDUCE 0.000204 seconds (rank = 0)

Then I started to think that maybe the nccl calls are asynchronous and they synchronize automatically the next time you use the array (but with our benchmark decorator we do not catch this)... so I went ahead and added print('send_buf.size', send_buf.size) in nccl_reduce and did a quick-and-dirty test in the benchmark around result = func(*args, **kwargs):

import cupy as cp
cp.cuda.runtime.deviceSynchronize()   # flush any GPU work queued before the timed call
result = func(*args, **kwargs)
cp.cuda.runtime.deviceSynchronize()   # wait for the (possibly asynchronous) NCCL call to complete
end_time = time.perf_counter()

and then what I get is much more realistic:

send_buf.size 1
send_buf.size 1
send_buf.size 150000
send_buf.size 1
---NCCL ALLREDUCE 0.022986 seconds (rank = 0)

send_buf.size 1
---NCCL ALLREDUCE 0.004465 seconds (rank = 0)

send_buf.size 150000
---NCCL ALLREDUCE 0.264804 seconds (rank = 0)

run_bench 22.169827 seconds (rank = 0)
[begin par["use_cupy"]=True, par["use_nccl"]=True ---> after prepare]: 11.534548s 
[after prepare ---> after adjoint]: 2.146873s 

send_buf.size 1
send_buf.size 1
send_buf.size 150000

Basically, one out of the 3 allreduce calls operates on a large vector (the others on scalars, as they come from the shape-check method I mentioned above), and that nccl_allreduce now takes way longer 🚀

Long story short, something here needs to change, but we can't just do what I did, so we may need to create some methods in _nccl to call the synchronizations only for the CuPy case... I suspect that for MPI things should already be fine, as there the allreduce call is definitely not asynchronous.
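One possible shape for such a helper (names are illustrative, not the final API; the idea is that it is a no-op on the plain MPI/NumPy path and only synchronizes when CuPy is in use):

import time


def _sync_device_if_cupy() -> None:
    # block the host until queued GPU work (including async NCCL calls) completes
    try:
        import cupy as cp
    except ImportError:
        return                              # CPU/MPI path: nothing to do
    cp.cuda.runtime.deviceSynchronize()


# How the benchmark wrapper could use it around the timed region:
def timed_call(func, *args, **kwargs):
    _sync_device_if_cupy()                  # drain kernels queued before the region
    start_time = time.perf_counter()
    result = func(*args, **kwargs)
    _sync_device_if_cupy()                  # ensure the asynchronous collective has finished
    end_time = time.perf_counter()
    return result, end_time - start_time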

@mrava87 (Contributor) left a comment

@tharittk well done, I like the new formatting with messages from nested benchmark appearing in the right order 😄

I left a few additional comments/thoughts for you... once you have addressed the remaining points that you have in the PR main comment, I'll do a final review

    else:
        handler = logging.StreamHandler(sys.stdout)

    handler.setLevel(logging.INFO)
@mrava87 (Contributor) commented

mmh I am not convinced about this... is it really needed?

Why not just let the user of the script that runs a benchmark create the logger and choose where to save the output, instead of doing it inside the benchmark method? What are the pros and cons of this approach?

@tharittk (Collaborator, Author) commented

I see. I like the idea of taking the logging setup out of the benchmark and letting the user pass it instead. If the user does not pass it, we can fall back to stdout. I think it is a better separation of concerns.

So the user's code could look like this:

import logging
import sys

save_file = True
file_path = "benchmark.log"


def make_logger():
    logger = logging.getLogger(__name__)
    logging.basicConfig(level=logging.INFO, force=True)
    logger.propagate = False
    for handler in logger.handlers[:]:
        logger.removeHandler(handler)
    if save_file:
        handler = logging.FileHandler(file_path, mode='w')
    else:
        handler = logging.StreamHandler(sys.stdout)

    handler.setLevel(logging.INFO)
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.addHandler(handler)

    return logger


logger = make_logger()


@benchmark(logger=logger)
def run_bench(par):
    <snip>

@tharittk (Collaborator, Author) commented

I guess the only benefit of the current implementation is that everything goes through the benchmark() argument, with no manual setup. But I like the approach of taking the logging logic out more; the benchmark code looks cleaner that way.
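For context, a rough sketch of the decorator side under the "user passes a logger" design, with a stdout fallback when none is given (names and signature are assumptions; the collection of mark() timings done by the real benchmark is omitted here):

import functools
import logging
import sys
import time

# Fallback logger used when the caller does not pass one: plain stdout at INFO level
_fallback = logging.getLogger("benchmark.fallback")
if not _fallback.handlers:
    _fallback.addHandler(logging.StreamHandler(sys.stdout))
    _fallback.setLevel(logging.INFO)


def benchmark(logger=None, description=None):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            log = logger or _fallback       # a user-supplied logger wins
            start = time.perf_counter()
            result = func(*args, **kwargs)
            elapsed = time.perf_counter() - start
            log.info(f"{description or func.__name__} {elapsed:.6f} seconds")
            return result
        return wrapper
    return decorator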

@mrava87 (Contributor) commented Jul 23, 2025

Very good 😄 yes, I like this approach: we don't lock a single logging configuration into our own code; instead we give the user freedom to do what they like, and we provide them with code traces to log timings.

When you write the documentation, I think you could probably have a first example with

import logging


def make_logger():
    logger = logging.getLogger(__name__)
    # basicConfig configures the root logger; messages propagate to it by default,
    # so no logger-specific handler (and no propagate=False) is needed in this simple case
    logging.basicConfig(filename=file_path if save_file else None, filemode='w', level=logging.INFO, force=True)

    return logger

and then say that we can also completely isolate the logging of our benchmark from that of the rest of the code using your make_logger with logger.propagate = False. If I understand correctly, when run_bench is just one part of a larger script that is logged, the logging messages from run_bench will be isolated - and I guess, especially when we log to a file, the file will only contain those messages?
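A small end-to-end illustration of that isolation (logger and file names are arbitrary):

import logging

# Logging of the larger script: root logger, console output
logging.basicConfig(level=logging.INFO, force=True)
logging.info("message from the rest of the script")    # console only

# Isolated benchmark logger: its own file handler, no propagation to the root logger
bench_logger = logging.getLogger("benchmark")
bench_logger.setLevel(logging.INFO)
bench_logger.propagate = False
handler = logging.FileHandler("benchmark.log", mode="w")
handler.setFormatter(logging.Formatter("%(message)s"))
bench_logger.addHandler(handler)

bench_logger.info("run_bench timing marks")            # goes to benchmark.log only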

end_time = time.perf_counter()

elapsed = end_time - start_time
_markers[header_index] = (f"[header]{description or func.__name__}", elapsed, level)
@mrava87 (Contributor) commented

What does [header] mean? Maybe it would make more sense to put something like [decorator]?

@tharittk (Collaborator, Author) commented

For lack of a better word, I use the "header" to signify that this is the function decorated by @benchmark. Other lines that measure time come from wrap().

I agree we can change it to [decorator] or [decorated]. It is more precise.

@mrava87 (Contributor) commented

Sounds good 😄

@mrava87 (Contributor) left a comment

Great work @tharittk ! I just left a few minor comments, but I think we have converged to a final solution 😄

Try to handle the two remaining tasks and then we should be ready to merge this!

def _parse_output_tree(markers: list[str]):
def _parse_output_tree(markers: List[str]):
"""This function parses the list of strings gathered during the benchmark call and output them
as one properly formatted string. The format of output string follows the hierachy of function calls
@mrava87 (Contributor) commented
hierachy -> hierarchy
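For reference, a simplified sketch of the idea behind such a tree formatter, assuming each marker carries a (label, elapsed, level) tuple as in the snippet quoted earlier (the real _parse_output_tree takes a list of strings and differs in detail):

from typing import List, Tuple


def format_marker_tree(markers: List[Tuple[str, float, int]]) -> str:
    # indent each timing line by its nesting level so nested @benchmark calls read as a tree
    lines = [f"{'    ' * level}{label}: {elapsed:.6f}s" for label, elapsed, level in markers]
    return "\n".join(lines)


print(format_marker_tree([("run_bench", 22.17, 0), ("after prepare", 11.53, 1), ("after adjoint", 2.15, 1)]))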
