Help needed to monitor / understand socket interconnect / UPI sync traffic with Intel pcm #919

Open
simonhf opened this issue Mar 10, 2025 · 7 comments
simonhf commented Mar 10, 2025

Hello! I have an Intel dual socket Sapphire Rapids Xeon host running Debian with the latest Intel pcm compiled and installed. I'm interested in monitoring the socket interconnect / UPI sync traffic on the system in general. It looks like Intel pcm will allow me to do this.

To this end, I have created a simple C program which does the following:

  • Parent mmap()s an anonymous shared memory page that multiple forked processes can access.
  • fork() x children and pin processes to specific HTs.
  • Start all forked processes at the same time and run them for y seconds.
  • All forked processes atomically increment the same shared memory location.
  • If some forked processes are running on the other socket, presumably the memory is forced to sync across the socket interconnect?
  • At the end the grand total number of increments is displayed.
  • Presumably a higher total means less / no socket interconnect syncing?

Hopefully the program provokes socket interconnect / UPI sync traffic which can presumably then be detected by Intel pcm. However, my issue is that it's not obvious to me that Intel pcm is detecting the socket interconnect / UPI sync traffic, hence this ticket and these questions:

  1. If the C program is partly or fully to blame, then why? And what would be a better C program for testing?
  2. If Intel pcm is working, then how to interpret the results below?
  3. If Intel pcm is not being used in the correct way, then how to use it correctly, or why can it not be used?
  4. Else, does Intel pcm have a bug?

Steps to reproduce:

The simple C program:

$ cat ayryd.c
#define _GNU_SOURCE
#include <assert.h>
#include <stdatomic.h>
#include <sched.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <string.h>
#include <errno.h>
#include <time.h> // for clock_gettime() and struct timespec
// gcc -O1 -o ayryd.exe ayryd.c && ./ayryd.exe 4 0 2 10
int cpu;
void get_cpu() {
    cpu = sched_getcpu();
    if (cpu == -1) { printf("- %u=pid ERROR: sched_getcpu() = %d // %d=errno = %s\n", getpid(), cpu, errno, strerror(errno)); exit(EXIT_FAILURE); }
}
int cpu_id;
void set_cpu() {
    pthread_t current_thread = pthread_self(); // Get the current thread ID
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);        // Clear the set
    CPU_SET(cpu_id, &cpuset); // Add the CPU to the set
    int result = pthread_setaffinity_np(current_thread, sizeof(cpu_set_t), &cpuset);
    if (result != 0) { printf("- %u=pid %3u=cpu ERROR: pthread_setaffinity_np(%d) = %d // %d=errno = %s\n", getpid(), cpu, cpu_id, result, errno, strerror(errno)); exit(EXIT_FAILURE); }
}
double get_time_in_seconds(void){
    struct timespec ts;
    assert(clock_gettime(CLOCK_MONOTONIC_RAW , &ts) == 0);
    return ts.tv_sec + ts.tv_nsec / 1000000000.0;
}
int main(int argc, char *argv[]) {
    if (argc != 5) { printf("- usage: %s <forks> <cpu_id> <cpu_inc> <seconds>\n", argv[0]); exit(EXIT_FAILURE); }
    int forks = atoi(argv[1]);
    int cpu_id_orig = atoi(argv[2]);
    int cpu_inc = atoi(argv[3]);
    int seconds = atoi(argv[4]);

    cpu_id = cpu_id_orig;
    set_cpu();
    get_cpu();

    uint64_t bytes = 4096;
    void *addr = mmap(NULL, bytes, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_SHARED, -1, 0);
    if (addr == MAP_FAILED) { printf("- %u=pid ERROR: mmap(%lu) = %p // %d=errno = %s\n", getpid(), bytes, addr, errno, strerror(errno)); exit(EXIT_FAILURE); }
    printf("- %u=pid %3u=cpu parent allocated %lu bytes of memory at address: %p\n", getpid(), cpu, bytes, addr);
    memset(addr, 0, bytes); // Zero out the memory
    uint64_t * array = addr;

    for (int i = 0; i < forks; i++) {
        pid_t pid = fork();
        if (pid < 0) {
            printf("- %u=pid %3u=cpu ERROR: fork() = %d // %d=errno = %s\n", getpid(), cpu, pid, errno, strerror(errno)); exit(EXIT_FAILURE);
        } else if (pid == 0) {
            //printf("- %u=pid %3u=cpu child %3d started\n", getpid(), cpu, i + 1);
            set_cpu();
            get_cpu();
            uint64_t old_value = atomic_fetch_add(&array[0], 1);
            printf("- %u=pid %3u=cpu child %3d started and pinned to CPU %d; %lu=old_value\n", getpid(), cpu, i + 1, cpu_id, old_value);
            while(atomic_fetch_add(&array[0], 0) < forks) {
            }
            //sleep(1);
            double t1 = get_time_in_seconds();
            uint64_t loops = 0;
            LOOP:;
            loops ++;
            atomic_fetch_add(&array[0], 1);
            atomic_fetch_add(&array[0], 1);
            atomic_fetch_add(&array[0], 1);
            atomic_fetch_add(&array[0], 1);
            atomic_fetch_add(&array[0], 1);
            atomic_fetch_add(&array[0], 1);
            atomic_fetch_add(&array[0], 1);
            atomic_fetch_add(&array[0], 1);
            atomic_fetch_add(&array[0], 1);
            atomic_fetch_add(&array[0], 1);
            double t2 = get_time_in_seconds();
            if((t2 - t1) < seconds) goto LOOP;
            printf("- %u=pid %3u=cpu child %3d started at %f, ended at %f, with %lu=loops\n", getpid(), cpu, i + 1, t1, t2, loops);
            array[i + 1] = loops;
            exit(0);  // Child process exits after printing
        } else {
            // Parent process: do nothing except loop and fork another child
        }
        cpu_id += cpu_inc;
    }
    // Parent process waits for all child processes to finish
    while (waitpid(-1, NULL, 0) > 0) {
        // Wait for all child processes
    }
    uint64_t loops = 0;
    for (int i = 0; i < forks; i++) {
        loops += array[i + 1];
    }
    printf("- %u=pid %3u=cpu parent exiting with %lu=loops_total for %s %d %d %d %d\n", getpid(), cpu, loops, argv[0], forks, cpu_id_orig, cpu_inc, seconds);
    return 0;
}

Example runs:

$ gcc -O1 -o ayryd.exe ayryd.c && ./ayryd.exe
- usage: ./ayryd.exe <forks> <cpu_id> <cpu_inc> <seconds>

$ gcc -O1 -o ayryd.exe ayryd.c && ./ayryd.exe 4 0 2 10
- 29406=pid   0=cpu parent allocated 4096 bytes of memory at address: 0x72b17e8b5000
- 29407=pid   0=cpu child   1 started and pinned to CPU 0; 0=old_value
- 29408=pid   2=cpu child   2 started and pinned to CPU 2; 1=old_value
- 29409=pid   4=cpu child   3 started and pinned to CPU 4; 2=old_value
- 29410=pid   6=cpu child   4 started and pinned to CPU 6; 3=old_value
- 29407=pid   0=cpu child   1 started at 2407.097803, ended at 2417.097803, with 10105625=loops
- 29408=pid   2=cpu child   2 started at 2407.097852, ended at 2417.097853, with 9640248=loops
- 29409=pid   4=cpu child   3 started at 2407.097897, ended at 2417.097897, with 10020840=loops
- 29410=pid   6=cpu child   4 started at 2407.097929, ended at 2417.097929, with 8907444=loops
- 29406=pid   0=cpu parent exiting with 38674157=loops_total for ./ayryd.exe 4 0 2 10

$ gcc -O1 -o ayryd.exe ayryd.c && ./ayryd.exe 4 0 1 10
- 30215=pid   0=cpu parent allocated 4096 bytes of memory at address: 0x7fef414e7000
- 30216=pid   0=cpu child   1 started and pinned to CPU 0; 0=old_value
- 30217=pid   1=cpu child   2 started and pinned to CPU 1; 1=old_value
- 30218=pid   2=cpu child   3 started and pinned to CPU 2; 2=old_value
- 30219=pid   3=cpu child   4 started and pinned to CPU 3; 3=old_value
- 30216=pid   0=cpu child   1 started at 2637.671938, ended at 2647.671938, with 5621779=loops
- 30217=pid   1=cpu child   2 started at 2637.671949, ended at 2647.671950, with 8936470=loops
- 30218=pid   2=cpu child   3 started at 2637.672008, ended at 2647.672008, with 4778614=loops
- 30219=pid   3=cpu child   4 started at 2637.672090, ended at 2647.672090, with 8242958=loops
- 30215=pid   0=cpu parent exiting with 27579821=loops_total for ./ayryd.exe 4 0 1 10

$ ./ayryd.exe 56 0 2 10 | egrep loops_total
- 31719=pid   0=cpu parent exiting with 39109796=loops_total for ./ayryd.exe 56 0 2 10
$ ./ayryd.exe 56 0 1 10 | egrep loops_total
- 31832=pid   0=cpu parent exiting with 30452668=loops_total for ./ayryd.exe 56 0 1 10

In the last 2 examples the program forks 56 times each run. The 1st run pins the processes to cores 0, 2, 4, 6, etc., all on socket 0. The 2nd run pins the processes to cores 0, 1, 2, 3, 4, etc., i.e. on both socket 0 and socket 1. The memory being sync'd is always allocated from socket 0 memory, and in the 2nd run the atomic instructions on that memory are in theory issued from threads on both sockets, so the memory needs to be sync'd between the sockets? Presumably this explains why the 2nd run is slower than the 1st run, because of the additional overhead of sync'ing the memory between the sockets?
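
(As a sanity check on the core-to-socket mapping assumed above, here is a minimal helper sketch that just reads the kernel's sysfs topology files; /sys/devices/system/cpu/cpuN/topology/physical_package_id is the usual Linux location, but I'm treating the path and the even/odd layout as assumptions:)

// socket_of_cpu.c: print which socket (physical package) each logical CPU belongs to
// gcc -O1 -o socket_of_cpu.exe socket_of_cpu.c && ./socket_of_cpu.exe
#include <stdio.h>
#include <unistd.h>
int main(void) {
    long ncpus = sysconf(_SC_NPROCESSORS_ONLN); // number of online logical CPUs
    for (long cpu = 0; cpu < ncpus; cpu++) {
        char path[128];
        snprintf(path, sizeof(path), "/sys/devices/system/cpu/cpu%ld/topology/physical_package_id", cpu);
        FILE *f = fopen(path, "r");
        if (f == NULL) continue; // CPU offline or topology file missing
        int socket = -1;
        if (fscanf(f, "%d", &socket) == 1) printf("- cpu %3ld -> socket %d\n", cpu, socket);
        fclose(f);
    }
    return 0;
}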

I tried re-running the C program for many different fork counts and repeats to see how deterministic the results are, and this is what happened:

$ perl -e '$pids = 1; while($pids <= 56){ foreach $r(1..3){ $r1 = `./ayryd.exe $pids 0 2 10 | egrep loops_total`; ($r1_lt) = $r1 =~ m~(\d+)=loops_total~; $r2 = `./ayryd.exe $pids 0 1 10 | egrep loops_total`; ($r2_lt) = $r2 =~ m~(\d+)=loops_total~; $percent = ($r1_lt - $r2_lt) / $r1_lt * 100; printf qq[- %2u pids: run %u: %9u loops S0 and %9u loops S0&S1 or %5.1f%% slower due to UPI %s\n], $pids, $r, $r1_lt, $r2_lt, $percent, q[-] x $percent; } if($pids < 8){ $pids ++; }else{ $pids += 8; } }'
-  1 pids: run 1: 114555751 loops S0 and 114603186 loops S0&S1 or  -0.0% slower due to UPI 
-  1 pids: run 2: 114591964 loops S0 and 114581289 loops S0&S1 or   0.0% slower due to UPI 
-  1 pids: run 3: 114582762 loops S0 and 114564176 loops S0&S1 or   0.0% slower due to UPI 
-  2 pids: run 1:  38452734 loops S0 and  32657498 loops S0&S1 or  15.1% slower due to UPI ---------------
-  2 pids: run 2:  37524113 loops S0 and  29501175 loops S0&S1 or  21.4% slower due to UPI ---------------------
-  2 pids: run 3:  43328319 loops S0 and  29339907 loops S0&S1 or  32.3% slower due to UPI --------------------------------
-  3 pids: run 1:  37370328 loops S0 and  30324123 loops S0&S1 or  18.9% slower due to UPI ------------------
-  3 pids: run 2:  36740727 loops S0 and  34635138 loops S0&S1 or   5.7% slower due to UPI -----
-  3 pids: run 3:  35481221 loops S0 and  33299024 loops S0&S1 or   6.2% slower due to UPI ------
-  4 pids: run 1:  38003914 loops S0 and  27978609 loops S0&S1 or  26.4% slower due to UPI --------------------------
-  4 pids: run 2:  39749760 loops S0 and  31812926 loops S0&S1 or  20.0% slower due to UPI -------------------
-  4 pids: run 3:  40686833 loops S0 and  29336901 loops S0&S1 or  27.9% slower due to UPI ---------------------------
-  5 pids: run 1:  38712731 loops S0 and  31820049 loops S0&S1 or  17.8% slower due to UPI -----------------
-  5 pids: run 2:  33638607 loops S0 and  29429758 loops S0&S1 or  12.5% slower due to UPI ------------
-  5 pids: run 3:  38350871 loops S0 and  28995574 loops S0&S1 or  24.4% slower due to UPI ------------------------
-  6 pids: run 1:  37041417 loops S0 and  29780327 loops S0&S1 or  19.6% slower due to UPI -------------------
-  6 pids: run 2:  37906656 loops S0 and  30495198 loops S0&S1 or  19.6% slower due to UPI -------------------
-  6 pids: run 3:  38055289 loops S0 and  30588076 loops S0&S1 or  19.6% slower due to UPI -------------------
-  7 pids: run 1:  36925357 loops S0 and  31385791 loops S0&S1 or  15.0% slower due to UPI ---------------
-  7 pids: run 2:  37373131 loops S0 and  30556297 loops S0&S1 or  18.2% slower due to UPI ------------------
-  7 pids: run 3:  37820241 loops S0 and  30075177 loops S0&S1 or  20.5% slower due to UPI --------------------
-  8 pids: run 1:  38169553 loops S0 and  31569056 loops S0&S1 or  17.3% slower due to UPI -----------------
-  8 pids: run 2:  38187755 loops S0 and  32539208 loops S0&S1 or  14.8% slower due to UPI --------------
-  8 pids: run 3:  38416403 loops S0 and  32359311 loops S0&S1 or  15.8% slower due to UPI ---------------
- 16 pids: run 1:  38144694 loops S0 and  33291006 loops S0&S1 or  12.7% slower due to UPI ------------
- 16 pids: run 2:  37747784 loops S0 and  33580184 loops S0&S1 or  11.0% slower due to UPI -----------
- 16 pids: run 3:  38199854 loops S0 and  31096685 loops S0&S1 or  18.6% slower due to UPI ------------------
- 24 pids: run 1:  38009594 loops S0 and  32031778 loops S0&S1 or  15.7% slower due to UPI ---------------
- 24 pids: run 2:  36986857 loops S0 and  30892232 loops S0&S1 or  16.5% slower due to UPI ----------------
- 24 pids: run 3:  38267385 loops S0 and  29711401 loops S0&S1 or  22.4% slower due to UPI ----------------------
- 32 pids: run 1:  38170344 loops S0 and  31764172 loops S0&S1 or  16.8% slower due to UPI ----------------
- 32 pids: run 2:  37383347 loops S0 and  34084189 loops S0&S1 or   8.8% slower due to UPI --------
- 32 pids: run 3:  38208849 loops S0 and  30218760 loops S0&S1 or  20.9% slower due to UPI --------------------
- 40 pids: run 1:  37109639 loops S0 and  32760827 loops S0&S1 or  11.7% slower due to UPI -----------
- 40 pids: run 2:  37211324 loops S0 and  33810445 loops S0&S1 or   9.1% slower due to UPI ---------
- 40 pids: run 3:  37846637 loops S0 and  32660840 loops S0&S1 or  13.7% slower due to UPI -------------
- 48 pids: run 1:  36503778 loops S0 and  30974917 loops S0&S1 or  15.1% slower due to UPI ---------------
- 48 pids: run 2:  38735659 loops S0 and  34118655 loops S0&S1 or  11.9% slower due to UPI -----------
- 48 pids: run 3:  38728020 loops S0 and  30993557 loops S0&S1 or  20.0% slower due to UPI -------------------
- 56 pids: run 1:  38347394 loops S0 and  33667768 loops S0&S1 or  12.2% slower due to UPI ------------
- 56 pids: run 2:  37475161 loops S0 and  29745196 loops S0&S1 or  20.6% slower due to UPI --------------------
- 56 pids: run 3:  38459180 loops S0 and  30219790 loops S0&S1 or  21.4% slower due to UPI ---------------------

The results without using the interconnect sync ("loops S0") were always pretty consistent at around 38M. However, when the interconnect sync is involved ("loops S0&S1") then the results appear to vary depending upon the number of forks / amount of contention on the atomic instruction, and even just re-running the test.

Also, sometimes even with the interconnect sync, the test ran well under 10% slower. How can that happen? Any ideas?

Now to running the tests while monitoring either Intel pcm or Intel pcm-numa in the background:

Try to detect socket interconnect / UPI traffic via Intel pcm UPI report:

Run Intel pcm WITHOUT the C program:

$ export RUN_FOR_SECONDS=5; sudo ~/pcm/build/bin/pcm --no-color -i=1 2>&1 | egrep UPI0 | head -1; (sudo ~/pcm/build/bin/pcm --no-color -i=$RUN_FOR_SECONDS  2>&1 | egrep --after-context=3 --line-buffered UPI0 | egrep -v --line-buffered "(\-\-|UPI0)")
               UPI0     UPI1     UPI2     UPI3    |  UPI0   UPI1   UPI2   UPI3  
 SKT    0       10 M     10 M     10 M      0     |    0%     0%     0%     0%   
 SKT    1       40 M     41 M     41 M      0     |    0%     0%     0%     0%   
 SKT    0       96 M     96 M     96 M      0     |    0%     0%     0%     0%   
 SKT    1       75 M     76 M     76 M      0     |    0%     0%     0%     0%   
 SKT    0     4736 K   4690 K   4917 K      0     |    0%     0%     0%     0%   
 SKT    1       19 M     19 M     19 M      0     |    0%     0%     0%     0%   
 SKT    0       49 M     49 M     50 M      0     |    0%     0%     0%     0%   
 SKT    1       39 M     39 M     39 M      0     |    0%     0%     0%     0%   
 SKT    0     3533 K   3503 K   3747 K      0     |    0%     0%     0%     0%   
 SKT    1       13 M     13 M     13 M      0     |    0%     0%     0%     0%   
 SKT    0       35 M     35 M     35 M      0     |    0%     0%     0%     0%   
 SKT    1       28 M     28 M     29 M      0     |    0%     0%     0%     0%   
 SKT    0     2732 K   2704 K   2954 K      0     |    0%     0%     0%     0%   
 SKT    1       14 M     14 M     14 M      0     |    0%     0%     0%     0%   
 SKT    0       34 M     34 M     34 M      0     |    0%     0%     0%     0%   
 SKT    1       26 M     26 M     27 M      0     |    0%     0%     0%     0%   
 SKT    0       75 M     81 M     76 M      0     |    0%     0%     0%     0%   
 SKT    1       93 M     85 M     86 M      0     |    0%     0%     0%     0%   
 SKT    0      260 M    281 M    264 M      0     |    0%     0%     0%     0%   
 SKT    1      273 M    254 M    258 M      0     |    0%     0%     0%     0%  

Run Intel pcm WITH the C program WITHOUT socket interconnect / UPI traffic:

$ export RUN_FOR_SECONDS=5; sudo ~/pcm/build/bin/pcm --no-color -i=1 2>&1 | egrep UPI0 | head -1; (sudo ~/pcm/build/bin/pcm --no-color -i=$RUN_FOR_SECONDS  2>&1 | egrep --after-context=3 --line-buffered UPI0 | egrep -v --line-buffered "(\-\-|UPI0)") & sleep 2; ./ayryd.exe 56 0 2 $RUN_FOR_SECONDS | egrep loops_total
               UPI0     UPI1     UPI2     UPI3    |  UPI0   UPI1   UPI2   UPI3  
 SKT    0     4347 K   4388 K   4518 K      0     |    0%     0%     0%     0%   
 SKT    1       18 M     18 M     18 M      0     |    0%     0%     0%     0%   
 SKT    0       44 M     44 M     44 M      0     |    0%     0%     0%     0%   
 SKT    1       35 M     35 M     35 M      0     |    0%     0%     0%     0%   
 SKT    0     8608 K   8562 K   8984 K      0     |    0%     0%     0%     0%   
 SKT    1       19 M     19 M     20 M      0     |    0%     0%     0%     0%   
 SKT    0       54 M     53 M     55 M      0     |    0%     0%     0%     0%   
 SKT    1       47 M     47 M     48 M      0     |    0%     0%     0%     0%   
 SKT    0     8088 K   8272 K   8646 K      0     |    0%     0%     0%     0%   
 SKT    1       25 M     25 M     26 M      0     |    0%     0%     0%     0%   
 SKT    0       65 M     65 M     67 M      0     |    0%     0%     0%     0%   
 SKT    1       54 M     54 M     55 M      0     |    0%     0%     0%     0%   
 SKT    0     7044 K   6995 K   7392 K      0     |    0%     0%     0%     0%   
 SKT    1       19 M     19 M     20 M      0     |    0%     0%     0%     0%   
 SKT    0       52 M     51 M     53 M      0     |    0%     0%     0%     0%   
 SKT    1       43 M     43 M     45 M      0     |    0%     0%     0%     0%   
 SKT    0       10 M     10 M     11 M      0     |    0%     0%     0%     0%   
 SKT    1       20 M     20 M     21 M      0     |    0%     0%     0%     0%   
 SKT    0       61 M     61 M     63 M      0     |    0%     0%     0%     0%   
 SKT    1       54 M     55 M     56 M      0     |    0%     0%     0%     0%   
- 204995=pid   0=cpu parent exiting with 18382022=loops_total for ./ayryd.exe 56 0 2 5

Run Intel pcm WITH the C program WITH socket interconnect / UPI traffic:

$ export RUN_FOR_SECONDS=5; sudo ~/pcm/build/bin/pcm --no-color -i=1 2>&1 | egrep UPI0 | head -1; (sudo ~/pcm/build/bin/pcm --no-color -i=$RUN_FOR_SECONDS  2>&1 | egrep --after-context=3 --line-buffered UPI0 | egrep -v --line-buffered "(\-\-|UPI0)") & sleep 2; ./ayryd.exe 56 0 1 $RUN_FOR_SECONDS | egrep loops_total
               UPI0     UPI1     UPI2     UPI3    |  UPI0   UPI1   UPI2   UPI3  
 SKT    0     4411 K   4366 K   4521 K      0     |    0%     0%     0%     0%   
 SKT    1       17 M     17 M     17 M      0     |    0%     0%     0%     0%   
 SKT    0       44 M     44 M     44 M      0     |    0%     0%     0%     0%   
 SKT    1       35 M     35 M     36 M      0     |    0%     0%     0%     0%   
 SKT    0     8227 K     80 M   8731 K      0     |    0%     0%     0%     0%   
 SKT    1       92 M     20 M     20 M      0     |    0%     0%     0%     0%   
 SKT    0       56 M    274 M     57 M      0     |    0%     0%     0%     0%   
 SKT    1      282 M     48 M     49 M      0     |    0%     0%     0%     0%   
 SKT    0     5954 K     89 M   6461 K      0     |    0%     0%     0%     0%   
 SKT    1       97 M     13 M     14 M      0     |    0%     0%     0%     0%   
 SKT    0       37 M    289 M     38 M      0     |    0%     0%     0%     0%   
 SKT    1      302 M     32 M     33 M      0     |    0%     0%     0%     0%   
 SKT    0     6157 K     88 M   6659 K      0     |    0%     0%     0%     0%   
 SKT    1       97 M     15 M     15 M      0     |    0%     0%     0%     0%   
 SKT    0       40 M    289 M     42 M      0     |    0%     0%     0%     0%   
 SKT    1      300 M     34 M     35 M      0     |    0%     0%     0%     0%   
 SKT    0     6743 K     88 M   7262 K      0     |    0%     0%     0%     0%   
 SKT    1       98 M     16 M     16 M      0     |    0%     0%     0%     0%   
 SKT    0       43 M    291 M     45 M      0     |    0%     0%     0%     0%   
 SKT    1      303 M     37 M     38 M      0     |    0%     0%     0%     0%   
- 207393=pid   0=cpu parent exiting with 16538520=loops_total for ./ayryd.exe 56 0 1 5

Try to detect socket interconnect / UPI traffic via Intel pcm DRAM Accesses report:

Run Intel pcm-numa WITHOUT the C program:

$ export RUN_FOR_SECONDS=5; sudo ~/pcm/build/bin/pcm-numa -i=1 2>&1 | egrep "Remote DRAM Accesses"; (sudo ~/pcm/build/bin/pcm-numa -i=$RUN_FOR_SECONDS 2>&1 | egrep --line-buffered "^(   1 |   2 )")
Core | IPC  | Instructions | Cycles  |  Local DRAM accesses | Remote DRAM Accesses 
   1   1.35       6500 K     4798 K      2641                1350                
   2   1.52        820 K      540 K       332                  18                
   1   1.58         12 M     7954 K      4421                3394                
   2   1.50       1177 K      786 K      1641                 377                
   1   1.50       5683 K     3782 K      1709                1061                
   2   1.89       1372 K      724 K       634                 128                
   1   1.59       5147 K     3242 K      1631                1502                
   2   1.51       4042 K     2681 K      1216                 790                
   1   1.15        103 M       90 M       365 K               147 K              
   2   2.31       4141 K     1793 K       661                 159     

Run Intel pcm-numa WITH the C program WITHOUT socket interconnect / UPI traffic:

$ export RUN_FOR_SECONDS=5; sudo ~/pcm/build/bin/pcm-numa -i=1 2>&1 | egrep "Remote DRAM Accesses"; (sudo ~/pcm/build/bin/pcm-numa -i=$RUN_FOR_SECONDS 2>&1 | egrep --line-buffered "^(   1 |   2 )") & sleep 2; ./ayryd.exe 56 0 2 $RUN_FOR_SECONDS | egrep loops_total
Core | IPC  | Instructions | Cycles  |  Local DRAM accesses | Remote DRAM Accesses 
   1   1.23       2979 K     2423 K      2542                1889                
   2   1.50       1412 K      944 K      1214                 357                
   1   1.73        337 M      195 M       292 K               140 K              
   2   0.01         22 M     2630 M      4689                 744                
   1   1.65        105 M       63 M        51 K                36 K              
   2   0.01         17 M     2996 M      3856                 191                
   1   1.81        108 M       59 M        42 K                26 K              
   2   0.01         22 M     2992 M      2170                 197                
   1   1.30        167 M      128 M      1174 K                49 K              
   2   0.01         22 M     2996 M      2186                 113                
- 195256=pid   0=cpu parent exiting with 18860454=loops_total for ./ayryd.exe 56 0 2 5

Run Intel pcm-numa WITH the C program WITH socket interconnect / UPI traffic:

$ export RUN_FOR_SECONDS=5; sudo ~/pcm/build/bin/pcm-numa -i=1 2>&1 | egrep "Remote DRAM Accesses"; (sudo ~/pcm/build/bin/pcm-numa -i=$RUN_FOR_SECONDS 2>&1 | egrep --line-buffered "^(   1 |   2 )") & sleep 2; ./ayryd.exe 56 0 1 $RUN_FOR_SECONDS | egrep loops_total
Core | IPC  | Instructions | Cycles  |  Local DRAM accesses | Remote DRAM Accesses 
   1   1.62         10 M     6208 K      7407                6999                
   2   1.45        335 K      231 K       252                  29                
   1   0.01         19 M     2610 M      4052                1630                
   2   0.01         18 M     2606 M      3263                 465                
   1   0.01         19 M     2997 M      4573                2088                
   2   0.01         17 M     2999 M      1199                  18                
   1   0.01         19 M     3002 M      3431                1282                
   2   0.01         17 M     3002 M      1645                  17                
   1   0.01         20 M     2990 M      4568                2846                
   2   0.01         17 M     2990 M      1166                  13                
- 197645=pid   0=cpu parent exiting with 16305856=loops_total for ./ayryd.exe 56 0 1 5

Try to detect socket interconnect / UPI traffic via Intel pcm MEM LOCAL report:

Run Intel pcm WITHOUT the C program:

$ export RUN_FOR_SECONDS=5; sudo ~/pcm/build/bin/pcm --no-color -i=1 2>&1 | egrep "\| LOCAL \|"; (sudo ~/pcm/build/bin/pcm --no-color -i=$RUN_FOR_SECONDS 2>&1 | egrep --after-context=3 --line-buffered "\| LOCAL \|" | egrep -v --line-buffered "(\-\-|LOCAL)")
MEM (GB)->|  READ |  WRITE | LOCAL | PMM RD | PMM WR | CPU energy | DIMM energy | LLCRDMISSLAT (ns)| UncFREQ (Ghz)|
 SKT   0     0.11     0.08   23 %      0.00      0.00     161.38      46.30         188.99             2.49
 SKT   1     0.13     0.10   76 %      0.00      0.00     164.65      46.29         163.75             2.49
 SKT   0     0.13     0.08   13 %      0.00      0.00     160.66      46.12         186.85             2.49
 SKT   1     0.14     0.09   91 %      0.00      0.00     165.12      46.31         164.28             2.49
 SKT   0     0.10     0.08   28 %      0.00      0.00     160.82      46.25         192.19             2.49
 SKT   1     0.14     0.10   81 %      0.00      0.00     164.45      46.27         164.48             2.49
 SKT   0     0.22     0.19   38 %      0.00      0.00     162.18      46.41         169.44             2.49
 SKT   1     0.17     0.12   72 %      0.00      0.00     164.75      46.27         168.50             2.49
 SKT   0     0.11     0.08   10 %      0.00      0.00     160.49      46.16         196.65             2.49
 SKT   1     0.12     0.09   92 %      0.00      0.00     164.36      46.25         162.19             2.49

Run Intel pcm WITH the C program WITHOUT socket interconnect / UPI traffic:

$ export RUN_FOR_SECONDS=5; sudo ~/pcm/build/bin/pcm --no-color -i=1 2>&1 | egrep "\| LOCAL \|"; (sudo ~/pcm/build/bin/pcm --no-color -i=$RUN_FOR_SECONDS 2>&1 | egrep --after-context=3 --line-buffered "\| LOCAL \|" | egrep -v --line-buffered "(\-\-|LOCAL)") & sleep 2; ./ayryd.exe 56 0 2 $RUN_FOR_SECONDS | egrep loops_total
MEM (GB)->|  READ |  WRITE | LOCAL | PMM RD | PMM WR | CPU energy | DIMM energy | LLCRDMISSLAT (ns)| UncFREQ (Ghz)|
 SKT   0     0.12     0.08   17 %      0.00      0.00     161.23      46.24         193.39             2.49
 SKT   1     0.12     0.09   93 %      0.00      0.00     164.78      46.32         163.08             2.49
 SKT   0     0.14     0.12   55 %      0.00      0.00     242.56      46.83         146.76             2.50
 SKT   1     0.12     0.09   69 %      0.00      0.00     166.50      46.84         164.75             2.50
 SKT   0     0.16     0.12   30 %      0.00      0.00     251.31      46.03         173.87             2.50
 SKT   1     0.13     0.09   89 %      0.00      0.00     164.27      46.10         159.70             2.49
 SKT   0     0.15     0.12   34 %      0.00      0.00     252.67      46.17         174.24             2.49
 SKT   1     0.13     0.09   86 %      0.00      0.00     164.76      46.31         160.51             2.49
 SKT   0     0.15     0.12   36 %      0.00      0.00     253.66      46.29         171.48             2.50
 SKT   1     0.12     0.10   83 %      0.00      0.00     165.11      46.41         160.71             2.50
- 179569=pid   0=cpu parent exiting with 18819483=loops_total for ./ayryd.exe 56 0 2 5

Run Intel pcm WITH the C program WITH socket interconnect / UPI traffic:

$ export RUN_FOR_SECONDS=5; sudo ~/pcm/build/bin/pcm --no-color -i=1 2>&1 | egrep "\| LOCAL \|"; (sudo ~/pcm/build/bin/pcm --no-color -i=$RUN_FOR_SECONDS 2>&1 | egrep --after-context=3 --line-buffered "\| LOCAL \|" | egrep -v --line-buffered "(\-\-|LOCAL)") & sleep 2; ./ayryd.exe 56 0 1 $RUN_FOR_SECONDS | egrep loops_total
MEM (GB)->|  READ |  WRITE | LOCAL | PMM RD | PMM WR | CPU energy | DIMM energy | LLCRDMISSLAT (ns)| UncFREQ (Ghz)|
 SKT   0     0.18     0.14   18 %      0.00      0.00     162.95      46.39         192.66             2.49
 SKT   1     0.24     0.15   84 %      0.00      0.00     166.19      46.42         163.99             2.49
 SKT   0     0.24     0.20   43 %      0.00      0.00     203.61      46.31         180.20             2.49
 SKT   1     0.14     0.11   88 %      0.00      0.00     203.77      46.19         162.52             2.49
 SKT   0     0.24     0.20   44 %      0.00      0.00     210.50      46.40         178.78             2.50
 SKT   1     0.12     0.09   83 %      0.00      0.00     210.37      46.30         165.00             2.49
 SKT   0     0.23     0.20   44 %      0.00      0.00     210.77      46.44         180.00             2.49
 SKT   1     0.12     0.09   84 %      0.00      0.00     210.72      46.37         162.63             2.49
 SKT   0     0.22     0.20   46 %      0.00      0.00     210.12      46.36         179.22             2.49
 SKT   1     0.11     0.09   81 %      0.00      0.00     210.19      46.21         163.51             2.49
- 181958=pid   0=cpu parent exiting with 15094041=loops_total for ./ayryd.exe 56 0 1 5

I do feel like I was successful in creating a simple C program which runs faster if the socket interconnect sync is not used, and slower if it is used. But how can I make it run more deterministically in terms of performance? And why is it not running more deterministically at the moment?

And how can Intel pcm or one of its binaries be used to monitor / detect when the socket interconnect is causing any program, or the simple C program, to run slower? Or how should the above results be interpreted while the simple C program is running?

Thanks in advance for any help and or comments! :-)

Update: For anybody interested in following along, I also posted a request for help here [1] too :-)

[1] https://community.intel.com/t5/Processors/Need-help-with-Xeon-CPU-Interconnect-UPI-monitoring/m-p/1675113#M82517

@rdementi
Contributor

Hi, could you please compute how many times per second the shared variable is incremented in your tests? This would give a ball-park number for the expected difference in the PCM output.

@simonhf
Author

simonhf commented Mar 18, 2025

Hi @rdementi ! Thanks for the question! :-)

how many times per second the shared variable is incremented in your tests

The number can be calculated from this table already posted above:

$ perl -e '$pids = 1; while($pids <= 56){ foreach $r(1..3){ $r1 = `./ayryd.exe $pids 0 2 10 | egrep loops_total`; ($r1_lt) = $r1 =~ m~(\d+)=loops_total~; $r2 = `./ayryd.exe $pids 0 1 10 | egrep loops_total`; ($r2_lt) = $r2 =~ m~(\d+)=loops_total~; $percent = ($r1_lt - $r2_lt) / $r1_lt * 100; printf qq[- %2u pids: run %u: %9u loops S0 and %9u loops S0&S1 or %5.1f%% slower due to UPI %s\n], $pids, $r, $r1_lt, $r2_lt, $percent, q[-] x $percent; } if($pids < 8){ $pids ++; }else{ $pids += 8; } }'
-  1 pids: run 1: 114555751 loops S0 and 114603186 loops S0&S1 or  -0.0% slower due to UPI 
-  1 pids: run 2: 114591964 loops S0 and 114581289 loops S0&S1 or   0.0% slower due to UPI 
-  1 pids: run 3: 114582762 loops S0 and 114564176 loops S0&S1 or   0.0% slower due to UPI 
-  2 pids: run 1:  38452734 loops S0 and  32657498 loops S0&S1 or  15.1% slower due to UPI ---------------
-  2 pids: run 2:  37524113 loops S0 and  29501175 loops S0&S1 or  21.4% slower due to UPI ---------------------
-  2 pids: run 3:  43328319 loops S0 and  29339907 loops S0&S1 or  32.3% slower due to UPI --------------------------------
-  3 pids: run 1:  37370328 loops S0 and  30324123 loops S0&S1 or  18.9% slower due to UPI ------------------
-  3 pids: run 2:  36740727 loops S0 and  34635138 loops S0&S1 or   5.7% slower due to UPI -----
-  3 pids: run 3:  35481221 loops S0 and  33299024 loops S0&S1 or   6.2% slower due to UPI ------
-  4 pids: run 1:  38003914 loops S0 and  27978609 loops S0&S1 or  26.4% slower due to UPI --------------------------
-  4 pids: run 2:  39749760 loops S0 and  31812926 loops S0&S1 or  20.0% slower due to UPI -------------------
-  4 pids: run 3:  40686833 loops S0 and  29336901 loops S0&S1 or  27.9% slower due to UPI ---------------------------
-  5 pids: run 1:  38712731 loops S0 and  31820049 loops S0&S1 or  17.8% slower due to UPI -----------------
-  5 pids: run 2:  33638607 loops S0 and  29429758 loops S0&S1 or  12.5% slower due to UPI ------------
-  5 pids: run 3:  38350871 loops S0 and  28995574 loops S0&S1 or  24.4% slower due to UPI ------------------------
-  6 pids: run 1:  37041417 loops S0 and  29780327 loops S0&S1 or  19.6% slower due to UPI -------------------
-  6 pids: run 2:  37906656 loops S0 and  30495198 loops S0&S1 or  19.6% slower due to UPI -------------------
-  6 pids: run 3:  38055289 loops S0 and  30588076 loops S0&S1 or  19.6% slower due to UPI -------------------
-  7 pids: run 1:  36925357 loops S0 and  31385791 loops S0&S1 or  15.0% slower due to UPI ---------------
-  7 pids: run 2:  37373131 loops S0 and  30556297 loops S0&S1 or  18.2% slower due to UPI ------------------
-  7 pids: run 3:  37820241 loops S0 and  30075177 loops S0&S1 or  20.5% slower due to UPI --------------------
-  8 pids: run 1:  38169553 loops S0 and  31569056 loops S0&S1 or  17.3% slower due to UPI -----------------
-  8 pids: run 2:  38187755 loops S0 and  32539208 loops S0&S1 or  14.8% slower due to UPI --------------
-  8 pids: run 3:  38416403 loops S0 and  32359311 loops S0&S1 or  15.8% slower due to UPI ---------------
- 16 pids: run 1:  38144694 loops S0 and  33291006 loops S0&S1 or  12.7% slower due to UPI ------------
- 16 pids: run 2:  37747784 loops S0 and  33580184 loops S0&S1 or  11.0% slower due to UPI -----------
- 16 pids: run 3:  38199854 loops S0 and  31096685 loops S0&S1 or  18.6% slower due to UPI ------------------
- 24 pids: run 1:  38009594 loops S0 and  32031778 loops S0&S1 or  15.7% slower due to UPI ---------------
- 24 pids: run 2:  36986857 loops S0 and  30892232 loops S0&S1 or  16.5% slower due to UPI ----------------
- 24 pids: run 3:  38267385 loops S0 and  29711401 loops S0&S1 or  22.4% slower due to UPI ----------------------
- 32 pids: run 1:  38170344 loops S0 and  31764172 loops S0&S1 or  16.8% slower due to UPI ----------------
- 32 pids: run 2:  37383347 loops S0 and  34084189 loops S0&S1 or   8.8% slower due to UPI --------
- 32 pids: run 3:  38208849 loops S0 and  30218760 loops S0&S1 or  20.9% slower due to UPI --------------------
- 40 pids: run 1:  37109639 loops S0 and  32760827 loops S0&S1 or  11.7% slower due to UPI -----------
- 40 pids: run 2:  37211324 loops S0 and  33810445 loops S0&S1 or   9.1% slower due to UPI ---------
- 40 pids: run 3:  37846637 loops S0 and  32660840 loops S0&S1 or  13.7% slower due to UPI -------------
- 48 pids: run 1:  36503778 loops S0 and  30974917 loops S0&S1 or  15.1% slower due to UPI ---------------
- 48 pids: run 2:  38735659 loops S0 and  34118655 loops S0&S1 or  11.9% slower due to UPI -----------
- 48 pids: run 3:  38728020 loops S0 and  30993557 loops S0&S1 or  20.0% slower due to UPI -------------------
- 56 pids: run 1:  38347394 loops S0 and  33667768 loops S0&S1 or  12.2% slower due to UPI ------------
- 56 pids: run 2:  37475161 loops S0 and  29745196 loops S0&S1 or  20.6% slower due to UPI --------------------
- 56 pids: run 3:  38459180 loops S0 and  30219790 loops S0&S1 or  21.4% slower due to UPI ---------------------

Each test runs for 10 seconds, but each loop also atomically increments the shared variable 10 times per loop. So the number of loops shown above is also the average number of atomic increments per second, e.g.:

56 pids: 33,667,768 loops (in 10 seconds) * 10 increments per loop / 10 seconds = 33,667,768 average atomic increments per second.

And those 33,667,768 average atomic increments per second will be distributed across the 56 forked children.

Note: The tests with Intel pcm running in the background were all run for only 5 seconds in an attempt to make the output shorter, and with 56 forked children (instead of e.g. 2) in the hope this might show up more in Intel pcm. But in theory the average number of atomic increments per second is still going to be similar even if the duration of the test is half the time.

So in the above table the 10 second test for 56 forked children was run 3 times: without remote memory access the results were 38,347,394, 37,475,161, and 38,459,180 average atomic increments per second, and with remote memory access the results were 33,667,768, 29,745,196, and 30,219,790 average atomic increments per second.

Is it fair to say that for each atomic increment the interconnect would have to be used to sync the shared variable to the peer socket memory?

@rdementi
Contributor

Is it fair to say that for each atomic increment the interconnect would have to be used to sync the shared variable to the peer socket memory?

no, for an atomic increment we do not need to go over the interconnect every time. For an atomic increment we need the cache line in exclusive ownership, and if we already have it we don't need to go over the interconnect. This is likely the reason why there is no strong correlation in your data. I think the traffic you are creating is noise-level compared to the other stuff going on between the sockets.

@rdementi
Contributor

you might want to read this white paper: https://halobates.de/xeon-lock-scaling-analysis-paper.pdf

@simonhf
Author

simonhf commented Mar 19, 2025

Thanks for the response, @rdementi .

no, for an atomic increment we do not need to go over the interconnect every time. For an atomic increment we need the cache line in the exclusive ownership and if we already have it we don't need to go over the interconnect.

I get that this would be the case if we only have one thread atomically incrementing, i.e. no contention on the cache line being used. And the results above are massively faster for this "exclusive ownership" / single-thread, no-contention case, as expected.

However, if there are 2 or more threads spread across 2 sockets, and each thread is continuously atomically incrementing, presumably there cannot be sustained "exclusive ownership"? And presumably the interconnect will get used every time? Or why not, or how should I view this differently?
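
(To test this understanding, one modification I could imagine — just an untested sketch — is to force strict turn-taking between exactly 2 processes pinned to CPU 0 and CPU 1, which going by the even/odd numbering above should be on different sockets, so the cache line has to change ownership on every single increment instead of one core being able to keep it for a burst of increments:)

// pingpong.c: two children take strict turns incrementing the same shared cache line, so
// ownership should have to migrate between the two CPUs (sockets) on every increment.
// Untested sketch; assumes CPU 0 and CPU 1 are on different sockets, as in the runs above.
// gcc -O1 -o pingpong.exe pingpong.c && ./pingpong.exe
#define _GNU_SOURCE
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>
static void pin(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    sched_setaffinity(0, sizeof(set), &set); // pin the calling process to one CPU
}
int main(void) {
    _Atomic uint64_t *turn = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_SHARED, -1, 0);
    if (turn == MAP_FAILED) { perror("mmap"); exit(EXIT_FAILURE); }
    const uint64_t iterations = 10 * 1000 * 1000;
    for (int me = 0; me < 2; me++) {
        if (fork() == 0) {
            pin(me); // child 0 on CPU 0, child 1 on CPU 1 (assumed to be on different sockets)
            for (uint64_t i = 0; i < iterations; i++) {
                while ((atomic_load(turn) & 1) != (uint64_t)me) { } // spin until it is my turn
                atomic_fetch_add(turn, 1);                          // increment and hand over the turn
            }
            exit(0);
        }
    }
    while (wait(NULL) > 0) { }
    printf("- final count = %lu (expected %lu)\n", (unsigned long)atomic_load(turn), (unsigned long)(2 * iterations));
    return 0;
}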

you might want to read this white paper: https://halobates.de/xeon-lock-scaling-analysis-paper.pdf

Please note that my goal is not specifically to do with atomic instructions or locking!

My goal is to find a way to detect the severity of interconnect usage / remote memory usage for a very large variety of non-NUMA-aware multi-threaded software running on dual-socket systems. I was hoping that Intel pcm might be a way to do that, and created the example C program, using the atomic instructions to hopefully guarantee interconnect usage when running with threads on different sockets. However, even with the example C program using 100% host CPU and running up to 30% slower cross socket, it does not seem possible to detect the interconnect usage with Intel pcm, and so I created this issue.

How do you recommend modifying the example C program to generate enough interconnect traffic so that Intel pcm shows 100% interconnect / UPI traffic? And if 100% is not possible, which percentage is possible, and how?
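
(One idea I could imagine for driving the UPI data counters much harder — an untested sketch roughly mimicking what mlc does, where the 2 GB buffer size and the CPU numbers are assumptions — is to first-touch a large buffer while pinned to a socket 0 CPU and then stream reads over it from a socket 1 CPU, so that most reads have to be served across the interconnect:)

// remote_stream.c: first-touch a large buffer from socket 0, then read it from socket 1, to
// generate sustained remote memory reads over the UPI links. Untested sketch; assumes CPU 0 is
// on socket 0 and CPU 1 is on socket 1, and that 2 GB is much larger than the LLC.
// gcc -O2 -o remote_stream.exe remote_stream.c && ./remote_stream.exe
#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
static void pin(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    sched_setaffinity(0, sizeof(set), &set); // pin the calling process to one CPU
}
int main(void) {
    size_t bytes = 2UL * 1024 * 1024 * 1024; // 2 GB buffer, far larger than the LLC
    pin(0); // first-touch the pages while on socket 0, so they are allocated in socket 0 memory
    uint8_t *buf = mmap(NULL, bytes, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }
    memset(buf, 1, bytes);
    pin(1); // now read the buffer from a CPU on the other socket
    uint64_t sum = 0;
    for (int pass = 0; pass < 10; pass++)      // stream over the whole buffer repeatedly
        for (size_t i = 0; i < bytes; i += 64) // touch one byte per 64-byte cache line
            sum += buf[i];
    printf("- sum = %lu (printed only so the reads are not optimized away)\n", (unsigned long)sum);
    return 0;
}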

Thanks for your help so far! :-)

@simonhf
Author

simonhf commented Mar 20, 2025

Update: In an effort to try and provoke the Intel pcm UPI percentages upwards, I tried running Intel pcm while running Intel mlc (AKA Memory Latency Checker), which on its own takes about 53 seconds to run:

$ time sudo ~/mlc/Linux/mlc --latency_matrix
Intel(R) Memory Latency Checker - v3.11b
Command line parameters: --latency_matrix 

Using buffer size of 2000.000MiB
Measuring idle latencies for sequential access (in ns)...
		Numa node
Numa node	    0	    1	
       0	116.8	194.8	
       1	205.4	116.5	

real	0m53.437s
user	0m0.004s
sys	0m0.003s

And here with Intel pcm running in the background:

$ cat pcm-while-mlc.txt | egrep UPI0 | head -1; cat pcm-while-mlc.txt | egrep --after-context=3 UPI0 | egrep SKT
               UPI0     UPI1     UPI2     UPI3    |  UPI0   UPI1   UPI2   UPI3  
 SKT    0      115 M    115 M    117 M      0     |    0%     0%     0%     0%   
 SKT    1       97 M     97 M     98 M      0     |    0%     0%     0%     0%   
 SKT    0      346 M    345 M    351 M      0     |    0%     0%     0%     0%   
 SKT    1      358 M    359 M    365 M      0     |    1%     1%     1%     0%   
 SKT    0      157 M    157 M    164 M      0     |    0%     0%     0%     0%   
 SKT    1      184 M    183 M    189 M      0     |    0%     0%     0%     0%   
 SKT    0      564 M    564 M    583 M      0     |    1%     1%     1%     0%   
 SKT    1      548 M    548 M    568 M      0     |    1%     1%     1%     0%   
 SKT    0       21 M     22 M     22 M      0     |    0%     0%     0%     0%   
 SKT    1       43 M     42 M     44 M      0     |    0%     0%     0%     0%   
 SKT    0      116 M    118 M    121 M      0     |    0%     0%     0%     0%   
 SKT    1      104 M    101 M    106 M      0     |    0%     0%     0%     0%   
 SKT    0     2915 K   3406 K   3017 K      0     |    0%     0%     0%     0%   
 SKT    1     6484 K   6006 K   6011 K      0     |    0%     0%     0%     0%   
...
[snipping ~100 lines of mostly 0%]
...

The output is mostly 0% but if we look at the lines which have at least one non-0%:

$ cat pcm-while-mlc.txt | egrep UPI0 | head -1; cat pcm-while-mlc.txt | egrep --after-context=3 UPI0 | egrep SKT | egrep -v ' 0%     0%     0%     0%'
               UPI0     UPI1     UPI2     UPI3    |  UPI0   UPI1   UPI2   UPI3  
 SKT    1      358 M    359 M    365 M      0     |    1%     1%     1%     0%   
 SKT    0      564 M    564 M    583 M      0     |    1%     1%     1%     0%   
 SKT    1      548 M    548 M    568 M      0     |    1%     1%     1%     0%   
 SKT    0      754 M    753 M    753 M      0     |    2%     2%     2%     0%   
 SKT    1      752 M    752 M    752 M      0     |    2%     2%     2%     0%   
 SKT    0      471 M    472 M    472 M      0     |    1%     1%     1%     0%   
 SKT    1      476 M    475 M    476 M      0     |    1%     1%     1%     0%   
 SKT    0     1339 M   1338 M   1340 M      0     |    3%     3%     3%     0%   
 SKT    1     1333 M   1332 M   1334 M      0     |    3%     3%     3%     0%   
 SKT    0      496 M    495 M    527 M      0     |    1%     1%     1%     0%   
 SKT    1      502 M    502 M    536 M      0     |    1%     1%     1%     0%   
 SKT    0      549 M    546 M    568 M      0     |    1%     1%     1%     0%   
 SKT    1      526 M    528 M    549 M      0     |    1%     1%     1%     0%   
 SKT    0      477 M    476 M    511 M      0     |    1%     1%     1%     0%   
 SKT    1      467 M    467 M    504 M      0     |    1%     1%     1%     0%   
 SKT    0      503 M    495 M    529 M      0     |    1%     1%     1%     0%   
 SKT    1      506 M    513 M    542 M      0     |    1%     1%     1%     0%   
 SKT    0      494 M    488 M    513 M      0     |    1%     1%     1%     0%   
 SKT    1      472 M    478 M    498 M      0     |    1%     1%     1%     0%   

The highest UPI percentage is only 3% and only occurred during 1 of the 53 seconds that Intel mlc ran for :-(
And that's when the absolute UPI counter gets up to 1.3 billion.
It seems like every 500M might be 1%. So 100% would be 50B?

And how does that relate to the original numbers I posted using the atomic increment C program?

Ignoring the last second of Intel pcm output, which can show mysteriously higher numbers (maybe due to processes finishing up?), the highest UPI numbers for the idling or single-socket runs were all under 100M.

Whereas the highest UPI numbers for the dual-socket runs got up to 274M to 303M:

$ export RUN_FOR_SECONDS=5; sudo ~/pcm/build/bin/pcm --no-color -i=1 2>&1 | egrep UPI0 | head -1; (sudo ~/pcm/build/bin/pcm --no-color -i=$RUN_FOR_SECONDS  2>&1 | egrep --after-context=3 --line-buffered UPI0 | egrep -v --line-buffered "(\-\-|UPI0)") & sleep 2; ./ayryd.exe 56 0 1 $RUN_FOR_SECONDS | egrep loops_total
               UPI0     UPI1     UPI2     UPI3    |  UPI0   UPI1   UPI2   UPI3  
 SKT    0     4411 K   4366 K   4521 K      0     |    0%     0%     0%     0%   
 SKT    1       17 M     17 M     17 M      0     |    0%     0%     0%     0%   
 SKT    0       44 M     44 M     44 M      0     |    0%     0%     0%     0%   
 SKT    1       35 M     35 M     36 M      0     |    0%     0%     0%     0%   
 SKT    0     8227 K     80 M   8731 K      0     |    0%     0%     0%     0%   
 SKT    1       92 M     20 M     20 M      0     |    0%     0%     0%     0%   
 SKT    0       56 M    274 M     57 M      0     |    0%     0%     0%     0%   <-- 274M
 SKT    1      282 M     48 M     49 M      0     |    0%     0%     0%     0%   <-- 282M
 SKT    0     5954 K     89 M   6461 K      0     |    0%     0%     0%     0%   
 SKT    1       97 M     13 M     14 M      0     |    0%     0%     0%     0%   
 SKT    0       37 M    289 M     38 M      0     |    0%     0%     0%     0%   <-- 289M
 SKT    1      302 M     32 M     33 M      0     |    0%     0%     0%     0%   <-- 302M
 SKT    0     6157 K     88 M   6659 K      0     |    0%     0%     0%     0%   
 SKT    1       97 M     15 M     15 M      0     |    0%     0%     0%     0%   
 SKT    0       40 M    289 M     42 M      0     |    0%     0%     0%     0%   <-- 289M
 SKT    1      300 M     34 M     35 M      0     |    0%     0%     0%     0%   <-- 300M
 SKT    0     6743 K     88 M   7262 K      0     |    0%     0%     0%     0%   
 SKT    1       98 M     16 M     16 M      0     |    0%     0%     0%     0%   
 SKT    0       43 M    291 M     45 M      0     |    0%     0%     0%     0%   <-- 291M
 SKT    1      303 M     37 M     38 M      0     |    0%     0%     0%     0%   <-- 303M
- 207393=pid   0=cpu parent exiting with 16538520=loops_total for ./ayryd.exe 56 0 1 5

So 270M to 303M is clearly higher than the previous Intel pcm numbers of < 100M. So presumably this difference shows a clear extra usage of the interconnect even though the percent is only 0%?

But why does Intel pcm only show the higher usage figures for every other second reported? The in-between seconds still only have < 100M values. But we know the atomic increment C program is continuously counting... Any ideas? Answer: There are actually 4 lines per second (not 2), 2 lines for UPI incoming traffic, and 2 lines for UPI outgoing traffic.

@simonhf
Author

simonhf commented Mar 20, 2025

With the new info above, and assuming the UPI percentage columns are always going to show 0%, try to detect socket interconnect / UPI traffic via the Intel pcm UPI report again, this time for 10 seconds (instead of 5 seconds), and using the "Total UPI" lines from Intel pcm:

Run Intel pcm WITHOUT the C program:

$ export RUN_FOR_SECONDS=10; (sudo ~/pcm/build/bin/pcm --no-color -i=$RUN_FOR_SECONDS  2>&1 | egrep --line-buffered "Total UPI" | perl -lane 'chomp; push @a, $_; $c ++; if($c == 2){ printf qq[%s\n], join(" ", @a); undef @a; undef $c; }')
Total UPI incoming data traffic:   20 M     UPI data traffic/Memory controller traffic: 0.07 Total UPI outgoing data and non-data traffic:   71 M
Total UPI incoming data traffic:  151 M     UPI data traffic/Memory controller traffic: 0.18 Total UPI outgoing data and non-data traffic:  602 M
Total UPI incoming data traffic:   54 M     UPI data traffic/Memory controller traffic: 0.12 Total UPI outgoing data and non-data traffic:  209 M
Total UPI incoming data traffic:   38 M     UPI data traffic/Memory controller traffic: 0.11 Total UPI outgoing data and non-data traffic:  140 M
Total UPI incoming data traffic:  126 M     UPI data traffic/Memory controller traffic: 0.23 Total UPI outgoing data and non-data traffic:  429 M
Total UPI incoming data traffic:   51 M     UPI data traffic/Memory controller traffic: 0.13 Total UPI outgoing data and non-data traffic:  196 M
Total UPI incoming data traffic:   33 M     UPI data traffic/Memory controller traffic: 0.10 Total UPI outgoing data and non-data traffic:  123 M
Total UPI incoming data traffic:   40 M     UPI data traffic/Memory controller traffic: 0.11 Total UPI outgoing data and non-data traffic:  152 M
Total UPI incoming data traffic:   57 M     UPI data traffic/Memory controller traffic: 0.14 Total UPI outgoing data and non-data traffic:  220 M
Total UPI incoming data traffic:   30 M     UPI data traffic/Memory controller traffic: 0.10 Total UPI outgoing data and non-data traffic:  108 M

Run Intel pcm WITH the C program WITHOUT socket interconnect / UPI traffic:

Total UPI incoming data traffic:   98 M     UPI data traffic/Memory controller traffic: 0.16 Total UPI outgoing data and non-data traffic:  394 M
Total UPI incoming data traffic:   61 M     UPI data traffic/Memory controller traffic: 0.15 Total UPI outgoing data and non-data traffic:  219 M
Total UPI incoming data traffic:   57 M     UPI data traffic/Memory controller traffic: 0.15 Total UPI outgoing data and non-data traffic:  199 M
Total UPI incoming data traffic:  142 M     UPI data traffic/Memory controller traffic: 0.23 Total UPI outgoing data and non-data traffic:  487 M
Total UPI incoming data traffic:  131 M     UPI data traffic/Memory controller traffic: 0.22 Total UPI outgoing data and non-data traffic:  448 M
Total UPI incoming data traffic:   96 M     UPI data traffic/Memory controller traffic: 0.19 Total UPI outgoing data and non-data traffic:  334 M
Total UPI incoming data traffic:   65 M     UPI data traffic/Memory controller traffic: 0.16 Total UPI outgoing data and non-data traffic:  225 M
Total UPI incoming data traffic:   89 M     UPI data traffic/Memory controller traffic: 0.19 Total UPI outgoing data and non-data traffic:  304 M
Total UPI incoming data traffic:   83 M     UPI data traffic/Memory controller traffic: 0.19 Total UPI outgoing data and non-data traffic:  281 M
Total UPI incoming data traffic:   91 M     UPI data traffic/Memory controller traffic: 0.20 Total UPI outgoing data and non-data traffic:  309 M
- 2450026=pid   0=cpu parent exiting with 37535921=loops_total for ./ayryd.exe 56 0 2 10

Run Intel pcm WITH the C program WITH socket interconnect / UPI traffic:

Total UPI incoming data traffic:   67 M     UPI data traffic/Memory controller traffic: 0.15 Total UPI outgoing data and non-data traffic:  237 M
Total UPI incoming data traffic:  504 M     UPI data traffic/Memory controller traffic: 0.46 Total UPI outgoing data and non-data traffic: 1655 M
Total UPI incoming data traffic:  310 M     UPI data traffic/Memory controller traffic: 0.47 Total UPI outgoing data and non-data traffic:  992 M
Total UPI incoming data traffic:  307 M     UPI data traffic/Memory controller traffic: 0.48 Total UPI outgoing data and non-data traffic:  984 M
Total UPI incoming data traffic:  404 M     UPI data traffic/Memory controller traffic: 0.45 Total UPI outgoing data and non-data traffic: 1304 M
Total UPI incoming data traffic:  314 M     UPI data traffic/Memory controller traffic: 0.45 Total UPI outgoing data and non-data traffic: 1017 M
Total UPI incoming data traffic:  338 M     UPI data traffic/Memory controller traffic: 0.46 Total UPI outgoing data and non-data traffic: 1093 M
Total UPI incoming data traffic:  335 M     UPI data traffic/Memory controller traffic: 0.47 Total UPI outgoing data and non-data traffic: 1081 M
Total UPI incoming data traffic:  300 M     UPI data traffic/Memory controller traffic: 0.47 Total UPI outgoing data and non-data traffic:  963 M
Total UPI incoming data traffic:  296 M     UPI data traffic/Memory controller traffic: 0.47 Total UPI outgoing data and non-data traffic:  951 M
- 2451303=pid   0=cpu parent exiting with 31194903=loops_total for ./ayryd.exe 56 0 1 10

Observation: We can now clearly see that both the incoming and outgoing UPI totals are inflated when the atomic increment C program is running WITH socket interconnect.

Question: Why is the UPI outgoing total inflated when the C program WITHOUT socket interconnect is running?

Question: What is the difference between UPI incoming and outgoing and why are they not balanced?
