Adding Benchmark instrument #157
base: main
Conversation
@tharittk I'll take a look at this soon. Regarding the 2 points of ongoing work, I think these kinds of extensive benchmark tests do not belong in the main repo; they are better suited for the one we created to put test codes in… here we should add some minimal tests to verify the correctness of the benchmark decorator and mark functionality.
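Something along these lines could already be enough (just a sketch: the exact import path and call signature are assumed and may differ from what ends up in the PR):

```python
# Minimal correctness test sketch for the benchmark decorator and mark().
# Assumes benchmark/mark are importable from pylops_mpi.utils.benchmark and
# that the decorator emits the mark labels through the provided logger.
import logging

from pylops_mpi.utils.benchmark import benchmark, mark


def test_benchmark_records_marks(caplog):
    logger = logging.getLogger("bench-test")

    @benchmark(logger=logger)
    def dummy():
        mark("start work")
        total = sum(range(1000))
        mark("end work")
        return total

    with caplog.at_level(logging.INFO, logger="bench-test"):
        assert dummy() == sum(range(1000))

    assert "start work" in caplog.text and "end work" in caplog.text
```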
@tharittk good start! I think we will probably need a few iterations here to get something that is functional and looks nice, and I am playing with it myself to get a better feeling 😄
Here are some initial comments as they come:
- definitely we should not have benchmark scripts in the main repo, so `Kirchhoff_bench` has to be ultimately removed from here and put in a separate repo
- we should try to add a new page to the documentation called `Benchmarking` where we can provide an explanation of this decorator, how to use it, and show a small example (a bit like we have in the `GPU Support` and `Adding New Operators` pages)
- I tried to add a benchmark decorator to `nccl_allreduce` but what I see is that the messages of benchmark come all before those of mark.. is this expected/intended?
- In general I am never in favour of re-inventing the wheel.. so I have been wondering if we really need to write code to print/write to file, and after a few tests I think going for the route of using `logging` is probably better in the long run (we can get lots of things for free, including writing to file). A little example:
In `kirchhoff_bench`:

```python
import logging

logging.basicConfig(level=logging.INFO, force=True)
# or: logging.basicConfig(filename='kirch_log.log', level=logging.INFO, force=True)

numba_logger = logging.getLogger('numba')
numba_logger.setLevel(logging.WARNING)  # silence numba (or any other library whose logs one doesn't want to see)
```

and in `benchmark`:

```python
import logging

logger = logging.getLogger(__name__)
...
# print("".join(output)) becomes
logger.info("".join(output))
```
```python
from mpi4py import MPI

# TODO (tharitt): later move to env file or something
```
This should be done for sure and documented in the new doc page I suggested 😄
Got it.
pylops_mpi/utils/benchmark.py (Outdated)
```python
# start-to-end time
elapsed = end_time - start_time

# TODO (tharitt): Both MPI + NCCL collective calls have implicit synchronization inside the stream
```
Are you sure of this? I took your example and bumped `par3` a lot, `par3 = {"nx": 300, "nz": 500, "ns": 900, "nr": 1000, "use_cupy": True, "use_nccl": True}`, and decorated `nccl_allreduce`. I started to see 3 allreduce calls (it turns out two of them are on the shape of `DistributedArray` when we have a `nccl_comm`, see pylops-mpi/pylops_mpi/DistributedArray.py, line 455 in 62927cd: `if self._allreduce(local_shape[self.axis]) != self.global_shape[self.axis] or \`):
```
---NCCL ALLREDUCE 0.021379 seconds (rank = 0)
---NCCL ALLREDUCE 0.000181 seconds (rank = 0)
---NCCL ALLREDUCE 0.000204 seconds (rank = 0)
```
Then I started to think that maybe the NCCL calls are asynchronous and they synchronize automatically the next time you use the array (but our benchmark decorator does not catch this)... so I went ahead, added `print('send_buf.size', send_buf.size)` in `nccl_reduce`, and did a quick-and-dirty test in the benchmark around `result = func(*args, **kwargs)`:
```python
import cupy as cp

cp.cuda.runtime.deviceSynchronize()
result = func(*args, **kwargs)
cp.cuda.runtime.deviceSynchronize()
end_time = time.perf_counter()
```
and then what I get is much more realistic:
```
send_buf.size 1
send_buf.size 1
send_buf.size 150000
send_buf.size 1
---NCCL ALLREDUCE 0.022986 seconds (rank = 0)
send_buf.size 1
---NCCL ALLREDUCE 0.004465 seconds (rank = 0)
send_buf.size 150000
---NCCL ALLREDUCE 0.264804 seconds (rank = 0)
run_bench 22.169827 seconds (rank = 0)
[begin par["use_cupy"]=True, par["use_nccl"]=True ---> after prepare]: 11.534548s
[after prepare ---> after adjoint]: 2.146873s
send_buf.size 1
send_buf.size 1
send_buf.size 150000
```
Basically one out of the 3 allreduce calls operates on a large vector (the others on scalars, as they come from the shape-check method I mentioned above) and one `nccl_allreduce` now takes way longer 🚀
Long story short, something here needs to change, but we can't just do what I did, so we may need to create some methods in `_nccl` to call the synchronizations only for the CuPy case.. I suspect for MPI things should be fine already, as there the allreduce call is for sure not async.
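Just to sketch the idea (the helper name and the CuPy check are made up here, not a proposal for the final implementation): a small utility that synchronizes the device only when CuPy is in play, called right before and right after the timed `func(*args, **kwargs)`.

```python
# Hypothetical helper: synchronize only in the CuPy/NCCL case so that timings
# include the asynchronous device work; MPI/NumPy runs are unaffected.
def _device_synchronize():
    try:
        import cupy as cp
        cp.cuda.runtime.deviceSynchronize()
    except ImportError:
        # pure MPI/NumPy path: collective calls block already, nothing to do
        pass


# inside the benchmark wrapper (sketch):
#     _device_synchronize()
#     start_time = time.perf_counter()
#     result = func(*args, **kwargs)
#     _device_synchronize()
#     end_time = time.perf_counter()
```

This way the MPI-only path stays untouched and only CuPy runs pay for the extra synchronization.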
@tharittk well done, I like the new formatting with messages from nested benchmark appearing in the right order 😄
I left a few additional comments/thoughts for you... once you have addressed the remaining points that you have in the PR main comment, I'll do a final review
pylops_mpi/utils/benchmark.py (Outdated)
```python
else:
    handler = logging.StreamHandler(sys.stdout)

handler.setLevel(logging.INFO)
```
mmh I am not convinced about this... is it really needed?
Why not just let the user of the script that runs a benchmark create the logger and choose where to save the output, instead of doing it inside the benchmark method? What are the pros and cons of this approach?
I see. I like the way of taking the logging setup out of the benchmark, and letting the user pass it instead. If the user does not pass it, we can fall back to stdout. I think it is a better separation of concerns.
So the user's code could look like:
```python
import logging
import sys

save_file = True
file_path = "benchmark.log"


def make_logger():
    logger = logging.getLogger(__name__)
    logging.basicConfig(level=logging.INFO, force=True)
    logger.propagate = False
    for handler in logger.handlers[:]:
        logger.removeHandler(handler)
    if save_file:
        handler = logging.FileHandler(file_path, mode='w')
    else:
        handler = logging.StreamHandler(sys.stdout)
    handler.setLevel(logging.INFO)
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.addHandler(handler)
    return logger


logger = make_logger()


@benchmark(logger=logger)
def run_bench(par):
    ...  # <snip>
```
I guess the only benefit of the current implementation is that everything goes through the `benchmark()` arguments and requires no manual setup. But I like the approach of taking the logging logic out more. The benchmark code looks cleaner that way.
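A possible shape for that fallback (just a sketch of the idea, not what is currently in the PR): if no logger is passed to `benchmark`, timings go through a module-level logger that writes to stdout.

```python
# Sketch of a stdout fallback inside benchmark.py (hypothetical names).
import logging
import sys

_default_logger = logging.getLogger("pylops_mpi.benchmark")
if not _default_logger.handlers:
    _handler = logging.StreamHandler(sys.stdout)
    _handler.setFormatter(logging.Formatter("%(message)s"))
    _default_logger.addHandler(_handler)
    _default_logger.setLevel(logging.INFO)
    _default_logger.propagate = False


def _emit(text, logger=None):
    # use the user-provided logger when given, otherwise fall back to stdout
    (logger or _default_logger).info(text)
```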
Very good 😄 yes, I like this way as we don't lock a single logging configuration into our own code; instead we give the user freedom to do what they like, and we provide them with code traces to log timings.
When you write the documentation, I think you could probably have a first example with:
```python
def make_logger():
    logger = logging.getLogger(__name__)
    logging.basicConfig(filename=file_path if save_file else None, filemode='w', level=logging.INFO, force=True)
    logger.propagate = False
    return logger
```
and then say that we can also completely isolate the logging of our benchmark from that of the rest of the code using your `make_logger` with `logger.propagate = False`. If I understand correctly, when `run_bench` is just a part of a larger script that is logged, the logging messages from `run_bench` will be isolated - and I guess, especially when we log to a file, the file will only contain those messages?
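A quick self-contained sketch of the behaviour I have in mind (logger name and file name are arbitrary):

```python
# Script-level logging goes to stdout; benchmark logging goes only to its own file.
import logging
import sys

logging.basicConfig(level=logging.INFO, stream=sys.stdout, force=True)

bench_logger = logging.getLogger("benchmark")
bench_logger.propagate = False  # keep benchmark records out of the root handlers
bench_handler = logging.FileHandler("benchmark.log", mode="w")
bench_handler.setFormatter(logging.Formatter("%(message)s"))
bench_logger.addHandler(bench_handler)

logging.info("script-level message")        # appears on stdout only
bench_logger.info("benchmark timing line")  # appears in benchmark.log only
```

If I read the `logging` behaviour correctly, with `propagate = False` the file indeed ends up containing only the benchmark messages, while the rest of the script keeps logging to stdout.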
pylops_mpi/utils/benchmark.py (Outdated)
```python
end_time = time.perf_counter()

elapsed = end_time - start_time
_markers[header_index] = (f"[header]{description or func.__name__}", elapsed, level)
```
What does `[header]` mean? Maybe it would make more sense to put something like `[decorator]`?
For lack of a better word, I use "header" to signify that this is the function decorated by `@benchmark`. Other lines that measure time come from `wrap()`.
I agree we can change it to `[decorator]` or `[decorated]`. It is more precise.
Sounds good 😄
Great work @tharittk ! I just left a few minor comments, but I think we have converged to a final solution 😄
Try to handle the two remaining tasks and then we should be ready to merge this!
```diff
-def _parse_output_tree(markers: list[str]):
+def _parse_output_tree(markers: List[str]):
     """This function parses the list of strings gathered during the benchmark call and output them
     as one properly formatted string. The format of output string follows the hierachy of function calls
```
hierachy -> hierarchy
This PR introduces a `benchmark` decorator that enables users to place `mark(label)` calls anywhere inside a function to record execution time at specific code points. The timing marks are collected and (as of now) printed in a human-readable format after the function completes. The design is inspired by C++-style instrumentation, where developers can inject performance markers directly into the code (a rough usage sketch is shown below the task lists).

Changes made:
- `benchmark.py`: a benchmark decorator for lightweight instrumentation
- [Removed] `kirchhoff_bench.py`: a sample script demonstrating how to use the decorator -- this belongs to the other repo and was thus removed

Ongoing work:
- New/updated documentation to illustrate how to use `benchmark`, formatted print, and logging
- Take a closer look at the CuPy / NCCL synchronization issue as pointed out in the comment
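For reference, the intended usage looks roughly like this (sketch only: the exact decorator signature and import path may still change while the remaining tasks are addressed):

```python
# Rough usage sketch of the benchmark/mark API described above.
# The import path and keyword arguments are assumed and may differ in the final version.
import numpy as np

from pylops_mpi.utils.benchmark import benchmark, mark


@benchmark(description="dot-product benchmark")
def run(n):
    x = np.random.rand(n)
    y = np.random.rand(n)
    mark("arrays created")   # record time up to this point
    res = x @ y
    mark("dot computed")     # record time for the dot product
    return res


run(1_000_000)
```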