Help needed to monitor / understand socket interconnect / UPI sync traffic with Intel pcm #919
Hi, could you please compute how many times per second the shared variable is incremented in your tests? This would give a ball-park number for the expected difference in the PCM output.
Hi @rdementi ! Thanks for the question! :-)
The number can be calculated from this table already posted above:
Each test runs for 10 seconds, and each loop atomically increments the shared variable 10 times. So the number of loops shown above is actually also the average number of atomic increments per second, e.g. 56 pids: 33,667,768 loops (in 10 seconds) * 10 increments per loop / 10 seconds = 33,667,768 average atomic increments per second. And those 33,667,768 average atomic increments per second are distributed across the 56 forked children.

Note: The tests with Intel pcm running in the background were all run for only 5 seconds in an attempt to keep the output shorter, and with 56 forked children in the hope that this would show up more clearly in Intel pcm than using e.g. 2 forked children. But in theory the average atomic increments per second should still be similar even if the duration of the test is halved.

So in the table above, the 10 second test for 56 forked children was run 3 times: without remote memory access the results were 38,347,394, 37,475,161, and 38,459,180 average atomic increments per second, and with remote memory access the results were 33,667,768, 29,745,196, and 30,219,790 average atomic increments per second.

Is it fair to say that for each atomic increment the interconnect would have to be used to sync the shared variable to the peer socket memory?
No, for an atomic increment we do not need to go over the interconnect every time. An atomic increment needs the cache line in exclusive ownership, and if we already have it we don't need to go over the interconnect. This is likely the reason why there is no strong correlation in your data. I think the traffic you are creating is noise-level compared to other stuff going on between the sockets.
You might want to read this white paper: https://halobates.de/xeon-lock-scaling-analysis-paper.pdf
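(Illustration only, not code from this thread: the "exclusive ownership" point can be made concrete with per-core counters, each padded to its own 64-byte cache line. As long as a line is only ever touched by one core it stays in that core's cache in Exclusive/Modified state and atomic increments never cross the UPI link; traffic is only forced when cores on different sockets hammer the same line, and even then one core can keep ownership across several consecutive increments.)

```c
/* Sketch: per-core counters on separate cache lines need no cross-socket
 * coherence traffic, unlike a single counter shared by both sockets.
 * Compile with: gcc -O2 padded.c */
#include <stdatomic.h>
#include <stdio.h>

struct padded_counter {
    _Atomic unsigned long value;
    char pad[64 - sizeof(_Atomic unsigned long)];  /* fill the cache line */
};

/* One slot per core; with per-core indexing each line stays resident in
 * its owner's cache, so the atomic RMW is satisfied locally. */
static struct padded_counter counters[64] __attribute__((aligned(64)));

static void local_increment(int core_id)
{
    atomic_fetch_add_explicit(&counters[core_id].value, 1,
                              memory_order_relaxed);
}

int main(void)
{
    for (int i = 0; i < 1000; i++)
        local_increment(0);
    printf("counter[0] = %lu\n", (unsigned long)counters[0].value);
    return 0;
}
```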
Thanks for the response, @rdementi .
I get that this would be the case if we only have one thread atomically incrementing, i.e. no contention on the cache line being used. And the results above are massively faster for this "exclusive ownership" / single-thread (no contention) case, as expected. However, if there are 2 or more threads spread across 2 sockets, and each thread is continuously atomically incrementing, presumably there cannot be "exclusive ownership"? And presumably the interconnect will get used every time? Or why not, or how should I view this differently?
Please note that my goal is not specifically to do with atomic instructions or locking! My goal is to find a way to detect the severity of interconnect usage / remote memory usage for a very large variety of non-NUMA-aware multi-threaded software running on dual-socket systems. I was hoping that Intel pcm might be a way to do that, and created the example C program using the atomic instructions to hopefully guarantee interconnect usage when running with threads on different sockets. However, even with the example C program using 100% host CPU and running up to 30% slower cross socket, it does not seem possible to detect the interconnect usage with Intel pcm, and so I created this issue.

How do you recommend modifying the example C program to generate enough interconnect traffic so that Intel pcm shows 100% interconnect / UPI traffic? And if 100% is not possible, which percentage is possible, and how? Thanks for your help so far! :-)
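For example, one modification I could imagine (just a sketch on my part, untested, assuming libnuma is installed; the buffer size and node numbers are arbitrary) would be to replace the cache-line ping-pong with streaming reads of a large socket-0 buffer from a thread running on socket 1, so that most reads have to cross the UPI link and generate sustained bandwidth rather than just coherence messages:

```c
/* Hypothetical sketch: allocate a buffer (larger than the LLC) on NUMA
 * node 0, then stream-read it from node 1 so reads cross the UPI link.
 * Build with: gcc -O2 remote_stream.c -lnuma */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma not available\n");
        return 1;
    }

    size_t size = 1UL << 30;                   /* 1 GiB, well above LLC size */
    char *buf = numa_alloc_onnode(size, 0);    /* memory on socket 0         */
    if (!buf) { perror("numa_alloc_onnode"); return 1; }
    memset(buf, 1, size);                      /* fault the pages in         */

    numa_run_on_node(1);                       /* run the read loop on socket 1 */

    volatile unsigned long sum = 0;
    for (int pass = 0; pass < 20; pass++)
        for (size_t i = 0; i < size; i += 64)  /* touch every cache line     */
            sum += buf[i];

    printf("sum=%lu\n", (unsigned long)sum);
    numa_free(buf, size);
    return 0;
}
```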
Update: In an effort to try and provoke the Intel pcm UPI percentages upwards, I tried running Intel pcm while running Intel mlc (AKA Memory Latency Checker), which on its own takes about 53 seconds to run:
And here with Intel pcm running in the background:
The output is mostly
The highest UPI percentage is only 3%, and it only occurred during 1 of the 53 seconds that Intel mlc ran for :-(

And how does that relate to the original numbers I posted using the atomic increment C program? Ignoring the last second of Intel pcm output, which shows mysteriously higher numbers (maybe due to processes finishing up?), the highest UPI numbers for the idling or single-socket runs were all under 100M, whereas the highest UPI numbers for the dual-socket runs got up to 270M+ to 303M:
So 270M to 303M is clearly higher than the previous Intel pcm numbers of < 100M. So presumably this difference shows clear extra usage of the interconnect, even though the reported UPI percentage stays very low. But why does Intel pcm only show the higher usage figures for every other second reported? The in-between seconds still only have < 100M values, yet we know the atomic increment C program is counting continuously... Any ideas?

Answer: There are actually 4 lines per second (not 2): 2 lines for UPI incoming traffic, and 2 lines for UPI outgoing traffic.
With the new info above, and assuming UPI traffic is always going to be reported this way (4 lines per second), here are the runs again.

Run Intel pcm WITHOUT the C program:
Run Intel pcm WITH the C program WITHOUT socket interconnect / UPI traffic:
Run Intel pcm WITH the C program WITH socket interconnect / UPI traffic:
Observation: We can now clearly see that both the incoming and outgoing UPI totals are inflated when the atomic increment C program is running WITH socket interconnect.

Question: Why is the UPI outgoing total inflated when the C program WITHOUT socket interconnect is running?

Question: What is the difference between UPI incoming and outgoing, and why are they not balanced?
Hello! I have an Intel dual socket Sapphire Rapids Xeon host running Debian with the latest Intel pcm compiled and installed. I'm interested in monitoring the socket interconnect / UPI sync traffic on the system in general. It looks like Intel pcm will allow me to do this.
To this end, I have created a simple C program which does the following:
Hopefully the program provokes socket interconnect / UPI sync traffic which presumably can then be detected by Intel pcm. However, my issue is that it's not obvious to me that Intel pcm is detecting the socket interconnect / UPI sync traffic, hence this ticket and these questions:
Steps to reproduce:
The simple C program:
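(The original listing was not captured here; below is a rough sketch reconstructing it from the description in this issue. The file name, command-line handling, and exact timing loop are assumptions on my part, but the shape matches what is described: fork N children, pin each one to a given core, and have every child run loops of 10 atomic increments on a single shared counter for 10 seconds, reporting its loop count at the end.)

```c
/* Hypothetical reconstruction of the test program described in this issue.
 * Build: gcc -O2 -o atomic_inc atomic_inc.c
 * Usage: ./atomic_inc <num_children> <core0> <core1> ... */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <time.h>

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(int argc, char **argv)
{
    int nchild = argc > 1 ? atoi(argv[1]) : 2;

    /* Single shared counter; assuming the parent runs on socket 0, the
     * first-touch below places the page in socket 0 memory. */
    uint64_t *shared = mmap(NULL, sizeof(uint64_t), PROT_READ | PROT_WRITE,
                            MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    *shared = 0;

    for (int i = 0; i < nchild; i++) {
        if (fork() == 0) {
            /* Pin this child to the core given on the command line,
             * e.g. 0 2 4 ... for socket 0 only, or 0 1 2 ... for both. */
            int core = argc > 2 + i ? atoi(argv[2 + i]) : i;
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(core, &set);
            sched_setaffinity(0, sizeof(set), &set);

            uint64_t loops = 0;
            double end = now() + 10.0;
            while (now() < end) {
                for (int k = 0; k < 10; k++)   /* 10 increments per loop */
                    __atomic_fetch_add(shared, 1, __ATOMIC_SEQ_CST);
                loops++;
            }
            printf("core %2d: %llu loops\n", core, (unsigned long long)loops);
            _exit(0);
        }
    }
    while (wait(NULL) > 0)
        ;
    return 0;
}
```

With a core enumeration where even core numbers sit on socket 0, passing cores `0 2 4 ...` keeps all the children on one socket, while `0 1 2 3 ...` spreads them across both sockets.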
Example runs:
With the last 2 examples the program forks 56 times each run. The 1st run pins the processes to cores 0, 2, 4, 6, etc., all on socket 0. The 2nd run pins the processes to cores 0, 1, 2, 3, 4, etc., on both socket 0 and socket 1. The memory being sync'd is always allocated from socket 0 memory, and in the 2nd run the atomic instructions on that memory are in theory issued from threads on both sockets, and therefore the memory needs to be sync'd between the sockets? Presumably this explains why the 2nd run is slower than the 1st run, because of the additional overhead of sync'ing the memory between the sockets?
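(Side note, not from the original issue: whether "0, 2, 4, ..." really means "all on socket 0" depends on how the kernel enumerates the cores, so it may be worth double-checking the core-to-socket mapping via sysfs with a small helper like this one:)

```c
/* Print which physical socket each of the first few cores belongs to,
 * using the standard Linux sysfs topology files. */
#include <stdio.h>

static int socket_of_core(int core)
{
    char path[128];
    int socket = -1;
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/topology/physical_package_id",
             core);
    FILE *f = fopen(path, "r");
    if (f) {
        if (fscanf(f, "%d", &socket) != 1)
            socket = -1;
        fclose(f);
    }
    return socket;
}

int main(void)
{
    for (int core = 0; core < 8; core++)
        printf("core %d -> socket %d\n", core, socket_of_core(core));
    return 0;
}
```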
I tried re-running the C program for many different fork counts and repeats to see how deterministic the results are, and this is what happened:
The results without using the interconnect sync ("loops S0") were always pretty consistent, at around 38M. However, when the interconnect sync is involved ("loops S0&S1") the results appear to vary depending upon the number of forks / the amount of contention on the atomic instruction, and even just upon re-running the same test.
Also, sometimes even with the interconnect sync, the test ran well under 10% slower. How can that happen? Any ideas?
Now to running the tests while monitoring either Intel pcm or Intel pcm-numa in the background:
Try to detect socket interconnect / UPI traffic via Intel pcm UPI report:
Run Intel pcm WITHOUT the C program:
Run Intel pcm WITH the C program WITHOUT socket interconnect / UPI traffic:
Run Intel pcm WITH the C program WITH socket interconnect / UPI traffic:
Try to detect socket interconnect / UPI traffic via Intel pcm DRAM Accesses report:
Run Intel pcm-numa WITHOUT the C program:
Run Intel pcm-numa WITH the C program WITHOUT socket interconnect / UPI traffic:
Run Intel pcm-numa WITH the C program WITH socket interconnect / UPI traffic:
Try to detect socket interconnect / UPI traffic via Intel pcm MEM LOCAL report:
Run Intel pcm WITHOUT the C program:
Run Intel pcm WITH the C program WITHOUT socket interconnect / UPI traffic:
Run Intel pcm WITH the C program WITH socket interconnect / UPI traffic:
I do feel like I was successful in creating a simple C program which runs faster if the socket interconnect sync is not used, and slower if it is used. But how can I make it run more deterministically in terms of performance? And why is it not running more deterministically at the moment?
And how to use Intel pcm or one of its binaries to monitor / detect when the socket interconnect is causing any program or the simple C program to run slower? Or how to interpret the above results while the simple C program is running?
Thanks in advance for any help and or comments! :-)
Update: For anybody interested in following along, I also posted a request for help here [1] :-)
[1] https://community.intel.com/t5/Processors/Need-help-with-Xeon-CPU-Interconnect-UPI-monitoring/m-p/1675113#M82517