Adding Benchmark instrument #157
base: main
Conversation
@tharittk I'll take a look at this soon. Regarding the 2 points of ongoing work, I think these kinds of extensive benchmark tests do not belong in the main repo; they are better suited for the one we created to put test codes in… here we should add some minimal tests to verify the correctness of the benchmark decorator and mark functionality.
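Something along these lines could already be enough (just a sketch: the exact import path and call signature are assumed and may differ from what ends up in the PR):

```python
# Minimal correctness test sketch for the benchmark decorator and mark().
# Assumes benchmark/mark are importable from pylops_mpi.utils.benchmark and
# that the decorator emits the mark labels through the provided logger.
import logging

from pylops_mpi.utils.benchmark import benchmark, mark


def test_benchmark_records_marks(caplog):
    logger = logging.getLogger("bench-test")

    @benchmark(logger=logger)
    def dummy():
        mark("start work")
        total = sum(range(1000))
        mark("end work")
        return total

    with caplog.at_level(logging.INFO, logger="bench-test"):
        assert dummy() == sum(range(1000))

    assert "start work" in caplog.text and "end work" in caplog.text
```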
@tharittk good start! I think we will probably need a few iterations here to get something that is functional and looks nice, and I am playing with it myself to get a better feeling 😄
Here are some initial comments as they come:
- definitely we should not have benchmark scripts in the main repo, so `Kirchhoff_bench` has to be ultimately removed from here and put in a separate repo
- we should try to add a new page to the documentation called `Benchmarking` where we can provide an explanation of this decorator, how to use it, and show a small example (a bit like we have in the `GPU Support` and `Adding New Operators` pages)
- I tried to add a benchmark decorator to `nccl_allreduce` but what I see is that the messages of benchmark come all before those of mark.. is this expected/intended?
- In general I am never in favour of re-inventing the wheel.. so I have been wondering if we really need to write code to print/write to file, and after a few tests I think going for the route of using `logging` is probably better in the long run (we can get lots of things for free, including writing to file). A little example:
In `kirchhoff_bench`:

```python
import logging

logging.basicConfig(level=logging.INFO, force=True)
# or: logging.basicConfig(filename='kirch_log.log', level=logging.INFO, force=True)

numba_logger = logging.getLogger('numba')
numba_logger.setLevel(logging.WARNING)  # silence numba (or any other library whose logs one doesn't want to see)
```

and in `benchmark`:

```python
import logging

logger = logging.getLogger(__name__)
...
# print("".join(output)) becomes
logger.info("".join(output))
```
```python
from mpi4py import MPI

# TODO (tharitt): later move to env file or something
```
This should be done for sure and documented in the new doc page I suggested 😄
Got it.
pylops_mpi/utils/benchmark.py (Outdated)
```python
# start-to-end time
elapsed = end_time - start_time

# TODO (tharitt): Both MPI + NCCL collective calls have implicit synchronization inside the stream
```
Are you sure of this? I took your example and bumped `par3` a lot, `par3 = {"nx": 300, "nz": 500, "ns": 900, "nr": 1000, "use_cupy": True, "use_nccl": True}`, and decorated `nccl_allreduce`. I started to see 3 allreduce calls (it turns out two of them are on the shape of `DistributedArray` when we have a `nccl_comm`, see pylops-mpi/pylops_mpi/DistributedArray.py, line 455 in 62927cd: `if self._allreduce(local_shape[self.axis]) != self.global_shape[self.axis] or \`):
```
---NCCL ALLREDUCE 0.021379 seconds (rank = 0)
---NCCL ALLREDUCE 0.000181 seconds (rank = 0)
---NCCL ALLREDUCE 0.000204 seconds (rank = 0)
```
Then I started to think that maybe the NCCL calls are asynchronous and they synchronize automatically the next time you use the array (but our benchmark decorator does not catch this)... so I went ahead, added `print('send_buf.size', send_buf.size)` in `nccl_reduce`, and did a quick-and-dirty test in the benchmark around `result = func(*args, **kwargs)`:
```python
import cupy as cp

cp.cuda.runtime.deviceSynchronize()
result = func(*args, **kwargs)
cp.cuda.runtime.deviceSynchronize()
end_time = time.perf_counter()
```
and then what I get is much more realistic:
```
send_buf.size 1
send_buf.size 1
send_buf.size 150000
send_buf.size 1
---NCCL ALLREDUCE 0.022986 seconds (rank = 0)
send_buf.size 1
---NCCL ALLREDUCE 0.004465 seconds (rank = 0)
send_buf.size 150000
---NCCL ALLREDUCE 0.264804 seconds (rank = 0)
run_bench 22.169827 seconds (rank = 0)
[begin par["use_cupy"]=True, par["use_nccl"]=True ---> after prepare]: 11.534548s
[after prepare ---> after adjoint]: 2.146873s
send_buf.size 1
send_buf.size 1
send_buf.size 150000
```
Basically one out of the 3 allreduce calls operates on a large vector (the others on scalars, as they come from the shape-check method I mentioned above) and one `nccl_allreduce` now takes way longer 🚀
Long story short, something here needs to change, but we can't just do what I did, so we may need to create some methods in `_nccl` to call the synchronizations only for the CuPy case.. I suspect for MPI things should be fine already, as there the allreduce call is for sure not async.
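Just to sketch the idea (the helper name and the CuPy check are made up here, not a proposal for the final implementation): a small utility that synchronizes the device only when CuPy is in play, called right before and right after the timed `func(*args, **kwargs)`.

```python
# Hypothetical helper: synchronize only in the CuPy/NCCL case so that timings
# include the asynchronous device work; MPI/NumPy runs are unaffected.
def _device_synchronize():
    try:
        import cupy as cp
        cp.cuda.runtime.deviceSynchronize()
    except ImportError:
        # pure MPI/NumPy path: collective calls block already, nothing to do
        pass


# inside the benchmark wrapper (sketch):
#     _device_synchronize()
#     start_time = time.perf_counter()
#     result = func(*args, **kwargs)
#     _device_synchronize()
#     end_time = time.perf_counter()
```

This way the MPI-only path stays untouched and only CuPy runs pay for the extra synchronization.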
@tharittk well done, I like the new formatting with messages from nested benchmark appearing in the right order 😄
I left a few additional comments/thoughts for you... once you have addressed the remaining points that you have in the PR main comment, I'll do a final review
pylops_mpi/utils/benchmark.py (Outdated)
```python
else:
    handler = logging.StreamHandler(sys.stdout)

handler.setLevel(logging.INFO)
```
mmh I am not convinced about this... is it really needed?
Why not just let the user of the script that runs a benchmark create the logger and choose where to save the output, instead of doing it inside the benchmark method? What are the pros and cons of this approach?
I see. I like the way of taking the logging setup out of the benchmark, and letting the user pass it instead. If the user does not pass it, we can fall back to stdout. I think it is a better separation of concerns.
So the user's code could look like:
```python
import logging
import sys

save_file = True
file_path = "benchmark.log"


def make_logger():
    logger = logging.getLogger(__name__)
    logging.basicConfig(level=logging.INFO, force=True)
    logger.propagate = False
    for handler in logger.handlers[:]:
        logger.removeHandler(handler)
    if save_file:
        handler = logging.FileHandler(file_path, mode='w')
    else:
        handler = logging.StreamHandler(sys.stdout)
    handler.setLevel(logging.INFO)
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.addHandler(handler)
    return logger


logger = make_logger()


@benchmark(logger=logger)
def run_bench(par):
    ...  # <snip>
```
I guess the only benefit of the current implementation is that everything goes through the `benchmark()` arguments and requires no manual setup. But I like the approach of taking the logging logic out more. The benchmark code looks cleaner that way.
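A possible shape for that fallback (just a sketch of the idea, not what is currently in the PR): if no logger is passed to `benchmark`, timings go through a module-level logger that writes to stdout.

```python
# Sketch of a stdout fallback inside benchmark.py (hypothetical names).
import logging
import sys

_default_logger = logging.getLogger("pylops_mpi.benchmark")
if not _default_logger.handlers:
    _handler = logging.StreamHandler(sys.stdout)
    _handler.setFormatter(logging.Formatter("%(message)s"))
    _default_logger.addHandler(_handler)
    _default_logger.setLevel(logging.INFO)
    _default_logger.propagate = False


def _emit(text, logger=None):
    # use the user-provided logger when given, otherwise fall back to stdout
    (logger or _default_logger).info(text)
```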
Very good 😄 yes, I like this way as we don't lock a single logging configuration into our own code; instead we give the user freedom to do what they like, and we provide them with code traces to log timings.
When you write the documentation, I think you could probably have a first example with:
```python
def make_logger():
    logger = logging.getLogger(__name__)
    logging.basicConfig(filename=file_path if save_file else None, filemode='w', level=logging.INFO, force=True)
    logger.propagate = False
    return logger
```
and then say that we can also completely isolate the logging of our benchmark from that of the rest of the code using your `make_logger` with `logger.propagate = False`. If I understand correctly, when `run_bench` is just a part of a larger script that is logged, the logging messages from `run_bench` will be isolated - and I guess, especially when we log to a file, the file will only contain those messages?
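A quick self-contained sketch of the behaviour I have in mind (logger name and file name are arbitrary):

```python
# Script-level logging goes to stdout; benchmark logging goes only to its own file.
import logging
import sys

logging.basicConfig(level=logging.INFO, stream=sys.stdout, force=True)

bench_logger = logging.getLogger("benchmark")
bench_logger.propagate = False  # keep benchmark records out of the root handlers
bench_handler = logging.FileHandler("benchmark.log", mode="w")
bench_handler.setFormatter(logging.Formatter("%(message)s"))
bench_logger.addHandler(bench_handler)

logging.info("script-level message")        # appears on stdout only
bench_logger.info("benchmark timing line")  # appears in benchmark.log only
```

If I read the `logging` behaviour correctly, with `propagate = False` the file indeed ends up containing only the benchmark messages, while the rest of the script keeps logging to stdout.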
pylops_mpi/utils/benchmark.py (Outdated)
```python
end_time = time.perf_counter()

elapsed = end_time - start_time
_markers[header_index] = (f"[header]{description or func.__name__}", elapsed, level)
```
What does `[header]` mean? Maybe it would make more sense to put something like `[decorator]`?
For lack of a better word, I use "header" to signify that this is the function decorated by `@benchmark`. Other lines that measure time come from `wrap()`.
I agree we can change it to `[decorator]` or `[decorated]`. It is more precise.
Sounds good 😄
Great work @tharittk ! I just left a few minor comments, but I think we have converged to a final solution 😄
Try to handle the two remaining tasks and then we should be ready to merge this!
```diff
-def _parse_output_tree(markers: list[str]):
+def _parse_output_tree(markers: List[str]):
     """This function parses the list of strings gathered during the benchmark call and output them
     as one properly formatted string. The format of output string follows the hierachy of function calls
```
hierachy -> hierarchy
This PR introduces a `benchmark` decorator that enables users to place `mark(label)` calls anywhere inside a function to record execution time at specific code points. The timing marks are collected and (as of now) printed in a human-readable format after the function completes. The design is inspired by C++-style instrumentation, where developers can inject performance markers directly into the code (a rough usage sketch is shown below the task lists).

Changes made:
- `benchmark.py`: a benchmark decorator for lightweight instrumentation
- [Removed] `kirchhoff_bench.py`: a sample script demonstrating how to use the decorator -- this belongs to the other repo and was thus removed

Ongoing work:
- New/updated documentation to illustrate how to use `benchmark`, formatted print, and logging
- Take a closer look at the CuPy / NCCL synchronization issue as pointed out in the comment
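For reference, the intended usage looks roughly like this (sketch only: the exact decorator signature and import path may still change while the remaining tasks are addressed):

```python
# Rough usage sketch of the benchmark/mark API described above.
# The import path and keyword arguments are assumed and may differ in the final version.
import numpy as np

from pylops_mpi.utils.benchmark import benchmark, mark


@benchmark(description="dot-product benchmark")
def run(n):
    x = np.random.rand(n)
    y = np.random.rand(n)
    mark("arrays created")   # record time up to this point
    res = x @ y
    mark("dot computed")     # record time for the dot product
    return res


run(1_000_000)
```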