Performance of L0 backend: Unable to see concurrent execution #5344

@TApplencourt (Contributor)

Description

Describe the bug

Some kernels cannot use the full GPU (as is the case for the kernel used in the reproducer below); the worst case is a kernel that uses only one EU. To get performance with such codes, the solution is to submit many kernels that can run concurrently.
This can be implemented in SYCL in two ways:

  • The straightforward way is to submit many kernels to an out-of-order queue and then synchronize. This is what I call the "async" mode in the reproducer.
  • Create one queue per kernel, submit all the kernels, and then wait on each individual queue. I call this mode "multiple_queue".

As you can deduce from the title of this issue, neither approach works on my hardware with the L0 runtime. I don't see any concurrent execution:

$ dpcpp -fsycl reproducer.cpp
tapplencourt@foo12:~/tmp/sycl> ./a.out async
Mode async
Time 1 chunk: 172878
Time 2 chunks: 345948
No concurrent execution...
tapplencourt@foo2:~/tmp/sycl> ./a.out multiple_queue
Mode multiple_queue
Time 1 chunk: 244641
Time 2 chunks: 410005
No concurrent execution...

The code is trivial: it just compares the runtime of one kernel with the runtime of two kernels (each kernel is only 512 work-items large). The times should be the same, as the kernels should be able to run in parallel on the GPU. As you can see, this is not the case.

Of course, nothing in the specification says that those kernels must run concurrently, but in my experience this is a required optimization to get good performance for a lot of codes.

To Reproduce

Be careful: for some strange reason, GitHub removed the \ in the macros.

#define MAD_4(x, y)                                                                                                    \
  x = y * x + y;                                                                                                       \
  y = x * y + x;                                                                                                       \
  x = y * x + y;                                                                                                       \
  y = x * y + x;
#define MAD_16(x, y)                                                                                                   \
  MAD_4(x, y);                                                                                                         \
  MAD_4(x, y);                                                                                                         \
  MAD_4(x, y);                                                                                                         \
  MAD_4(x, y);
#define MAD_64(x, y)                                                                                                   \
  MAD_16(x, y);                                                                                                        \
  MAD_16(x, y);                                                                                                        \
  MAD_16(x, y);                                                                                                        \
  MAD_16(x, y);

#include <chrono>
#include <sycl/sycl.hpp>

template <class T> T busy_wait(int N, T i) {
  T x = 1.3f;
  T y = (T)i;
  for (int j = 0; j < 1024 * 512; j++) {
    MAD_64(x, y);
  }
  return y;
}

template <class T> double async(const int num_chunks) {
  // one queue, `num_chunks` kernel submission, one wait at the end
  const int globalWIs{512};
  sycl::queue Q{sycl::gpu_selector()};
  T *ptr = sycl::malloc_device<T>(globalWIs, Q);
  const auto s = std::chrono::high_resolution_clock::now();
  for (int chunk = 0; chunk < num_chunks; chunk++) {
    Q.parallel_for(globalWIs, [=](sycl::item<1> i) {
      // Race condition on `ptr[i]`. We don't care
      ptr[i] = busy_wait(1024 * 512, (T)i);
    });
  }
  Q.wait();
  const auto e = std::chrono::high_resolution_clock::now();
  return std::chrono::duration_cast<std::chrono::microseconds>(e - s).count();
}

template <class T> double multiple_queue(const int num_chunks) {
  // `num_chunks` queue. One queue per kernel submission
  const int globalWIs{512};
  const sycl::device D{sycl::gpu_selector()};
  const sycl::context C(D);
  T *ptr = sycl::malloc_device<T>(globalWIs, D, C);
  std::vector<sycl::queue> Qs;
  for (int chunk = 0; chunk < num_chunks; chunk++)
    Qs.push_back(sycl::queue(C, D));

  const auto s = std::chrono::high_resolution_clock::now();
  for (auto &Q : Qs) {
    Q.parallel_for(globalWIs, [=](sycl::item<1> i) {
      // Race condition on `ptr[i]`. We don't care
      ptr[i] = busy_wait(1024 * 512, (T)i);
    });
  }
  for (auto &Q : Qs)
    Q.wait();
  const auto e = std::chrono::high_resolution_clock::now();
  return std::chrono::duration_cast<std::chrono::microseconds>(e - s).count();
}

int main(int argc, char *argv[]) {
  // By default try async, and N == 2
  const std::string mode = (argc < 2) ? "async" : std::string{argv[1]};
  const int N = (argc < 3) ? 2 : std::stoi(argv[2]);

  std::cout << "Mode " << mode << std::endl;

  double (*foo)(int);
  foo = mode == "async" ? &async<float> : &multiple_queue<float>;
  // Just to avoid JIT, and to warm up the GPU
  foo(1);
  // Each kernel runs on ~1 EU, so N kernels should take the same time as 1 kernel
  // for N <= EU_MAX.
  const double t1 = foo(1);
  std::cout << "Time 1 chunk: " << t1 << std::endl;
  const double tN = foo(N);
  std::cout << "Time " << N << " chunks: " << tN << std::endl;

  // Kernels should have run in parallel
  if (not(std::abs(tN - t1) <= (0.20 * tN))) {
    std::cerr << "No concurents execution..." << std::endl;
    return 1;
  } else {
    std::cerr << "Concurents execution!" << std::endl;
    return 0;
  }
}

Environment (please complete the following information):

  • OS: Linux
  • Target device and vendor: Intel GPU
  • DPC++ version: Intel(R) oneAPI DPC++/C++ Compiler 2022.1.0 (2022.x.0.20211025)
  • Dependencies version: L0 agama-prerelease-191

AlexeySachkov (Contributor) commented on Jan 25, 2022

Hi @TApplencourt, thanks for the report,

I was able to reproduce the problem using the open-source version of the compiler (20200120) and NEO 21.46.21636 (L0 version 1.2.21636).

Using ze_tracer I've collected some info: async-8.trace.json on pastebin and multiple-queues-8.trace.json on pastebin. To view them, just load the .json files into the chrome://tracing page in Chrome. As you can guess, I used 8 chunks for those tests.

So, multiple-queues-8.trace.json shows that even though you only call Q.wait() after all kernels are submitted, zeCommandQueueSynchronize is still called for each kernel right after it is submitted. Also, zeModuleCreate is called three times instead of just once. Tagging @smaslov-intel here for awareness.

[screenshot of the multiple-queues-8 trace timeline]

async-8.trace.json looks better, i.e. there is only one call to zeCommandQueueSynchronize, but all kernels are still serialized under the hood:

[screenshot of the async-8 trace timeline]

It seems to me that there is some limitation (perhaps?) in the low-level device runtime or driver which serializes kernels. Tagging @jchodor here for further comments.

TApplencourt (Author) commented on Jan 25, 2022

Hi Alexey,

I did the same study but didn't want to pollute the first post with technical details. :) I saw exactly what you described. By the way, we have the same issue with the OpenMP backend of icpx.

  • The first one that you describe is a limitation of the L0 backend of SYCL. One should probably prefer immediate command lists, to avoid the need to reset (and hence synchronize) the command list. My understanding is that reusing a command list is done to avoid the command-list creation overhead, but losing concurrency is far more problematic: a lot of HPC code relies on GPU kernel concurrency to get good performance.

Sadly, even if SYCL did this, it would not work. I did some tests writing directly in L0 using immediate command lists and I hit another bug (this time in the L0 driver) preventing concurrent execution.

  • The second one seems to be a limitation of the L0 driver as well, or at least a missed optimization.

Hope this helps. I can provide more data if needed.

MrSidims (Contributor) commented on Jan 26, 2022

UPD: please ignore these 'findings'
I agree with Alexey's serialization conclusion, though I believe 'kernel' shouldn't be plural: in the code above, if my understanding is correct, the DPC++ compiler creates a single kernel (by name) and enqueues it two times (I'm underlining this to differentiate kernels from kernel instances at runtime). I didn't dig into GPU runtime/PI traces, but I've managed to get:

Mode async
Time 1 chunk: 861940
Time 2 chunks: 862400
Concurrent execution!

by modifying the original code like this:

//  for (int chunk = 0; chunk < num_chunks; chunk++) {
    Q.parallel_for(globalWIs, [=](sycl::item<1> i) {
      // Race condition on `ptr[i]`. We don't care
      ptr[i] = busy_wait(1024 * 512, (T)i);
    });
    Q.parallel_for(globalWIs, [=](sycl::item<1> i) {
      // Race condition on `ptr[i]`. We don't care
      ptr[i] = busy_wait(1024 * 512, (T)i);
    });
//  }

(note that after the modification there are two kernels with different names), so to me it does look like the L0 runtime serializes instances of the same kernel, enforcing in-order execution. I can see a reason why this should be done for FPGA, where you have extra cross-kernel connections, but there is no reason for such serialization on a GPU.

TApplencourt (Author) commented on Jan 26, 2022

Oh, interesting find!
Sadly, I was not able to reproduce your behavior. Maybe I did something wrong in my updated reproducer? I put it below so more people can try to reproduce it if needed.

tapplencourt:~/tmp/sycl> cat test.cpp
#define MAD_4(x, y)                                                                                                    \
  x = y * x + y;                                                                                                       \
  y = x * y + x;                                                                                                       \
  x = y * x + y;                                                                                                       \
  y = x * y + x;
#define MAD_16(x, y)                                                                                                   \
  MAD_4(x, y);                                                                                                         \
  MAD_4(x, y);                                                                                                         \
  MAD_4(x, y);                                                                                                         \
  MAD_4(x, y);
#define MAD_64(x, y)                                                                                                   \
  MAD_16(x, y);                                                                                                        \
  MAD_16(x, y);                                                                                                        \
  MAD_16(x, y);                                                                                                        \
  MAD_16(x, y);

#include <chrono>
#include <sycl/sycl.hpp>

template <class T> T busy_wait(int N, T i) {
  T x = 1.3f;
  T y = (T)i;
  for (int j = 0; j < 1024 * 512; j++) {
    MAD_64(x, y);
  }
  return y;
}

template <class T> double async1(const int num_chunks) {
  // one queue, `num_chunks` kernel submission, one wait at the end
  const int globalWIs{512};
  sycl::queue Q{sycl::gpu_selector()};
  T *ptr = sycl::malloc_device<T>(globalWIs, Q);
  const auto s = std::chrono::high_resolution_clock::now();
  Q.parallel_for(globalWIs, [=](sycl::item<1> i) {
      // Race condition on `ptr[i]`. We don't care
      ptr[i] = busy_wait(1024 * 512, (T)i);
   });
  Q.wait();
  const auto e = std::chrono::high_resolution_clock::now();
  return std::chrono::duration_cast<std::chrono::microseconds>(e - s).count();
}

template <class T> double async2(const int num_chunks) {
  // one queue, `num_chunks` kernel submission, one wait at the end
  const int globalWIs{512};
  sycl::queue Q{sycl::gpu_selector()};
  T *ptr = sycl::malloc_device<T>(globalWIs, Q);
  const auto s = std::chrono::high_resolution_clock::now();
   Q.parallel_for(globalWIs, [=](sycl::item<1> i) {
      // Race condition on `ptr[i]`. We don't care
      ptr[i] = busy_wait(1024 * 512, (T)i);
   });
   Q.parallel_for(globalWIs, [=](sycl::item<1> i) {
      // Race condition on `ptr[i]`. We don't care
      ptr[i] = busy_wait(1024 * 512, (T)i);
   });
  Q.wait();
  const auto e = std::chrono::high_resolution_clock::now();
  return std::chrono::duration_cast<std::chrono::microseconds>(e - s).count();
}

int main(int argc, char *argv[]) {
  std::cout << "Mode async" << std::endl;
  // Just to avoid JIT, and to warm up the GPU
  async1<float>(1);
  const double t1 = async1<float>(1);
  std::cout << "Time 1 chunk: " << t1 << std::endl;
  const double t2 = async2<float>(2);
  std::cout << "Time " << 2 << " chunks: " << t2 << std::endl;

  // Kernels should have run in parallel
  if (not(std::abs(t2 - t1) <= (0.20 * t2))) {
    std::cerr << "No concurents execution..." << std::endl;
    return 1;
  } else {
    std::cerr << "Concurents execution!" << std::endl;
    return 0;
  }
}
tapplencourt:~/tmp/sycl> dpcpp test.cpp
tapplencourt:~/tmp/sycl> ./a.out
Mode async
Time 1 chunk: 273272
Time 2 chunks: 545479
No concurrent execution...

MrSidims (Contributor) commented on Jan 26, 2022

Maybe I did something wrong in my updated reproducer?

No. It's just me being stupid and forgetting to test single execution to compare results, so the results I was comparing are both for double execution. In other words, the proposed 'fix' doesn't help.

smaslov-intel (Contributor) commented on Jan 27, 2022

I did some tests writing directly in L0 using immediate command lists and I hit another bug (this time in the L0 driver) preventing concurrent execution.

@TApplencourt : could you share your tests?

TApplencourt (Author) commented on Jan 28, 2022

Of course. Please find them here.
Please note that I lied when I said that I wrote directly in L0: I used our Ruby bindings. The Ruby bindings are a by-product of THAPI (our take on ze_tracer). Thanks to Spack, THAPI should be "easy" to install; don't hesitate to ask if you have any questions.

Please find below a snippet of the code and the results, just to make the thread easier to read.

Code we use:

def bench_immediate(n)
   command_list = CONTEXT.command_list_create_immediate(DEVICE, mode: :ZE_COMMAND_QUEUE_MODE_ASYNCHRONOUS)
   event_pool = CONTEXT.event_pool_create(1, flags: [:ZE_EVENT_POOL_FLAG_HOST_VISIBLE])
   event = event_pool.event_create(0)

   t1 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
   n.times {
        command_list.append_launch_kernel(KERNEL, GROUP_COUNT)
   }
   command_list.append_barrier(signal_event: event)
   event.host_synchronize(timeout: ZE::UINT64_MAX)
   t2 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
   return t2-t1
end

We then run this code for n=1 and n=5; this is the output:

["bench_immediate", "n", 5, "t1", 0.049294119999103714, "tN", 0.24510650199954398, false]

As you can see, tN is 5x bigger than t1, showing no concurrent execution.
Hope this helps.
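
For reference, a rough native-L0 sketch of what bench_immediate above does might look like the following. This is an illustration only, not the actual test: it assumes a context, device, kernel, and group count have already been created elsewhere, and all error checking is omitted.

#include <level_zero/ze_api.h>
#include <chrono>
#include <cstdint>

// Sketch: submit `n` kernels on one immediate (asynchronous) command list,
// then wait on a single host-visible event signaled by a trailing barrier.
double bench_immediate(ze_context_handle_t ctx, ze_device_handle_t dev,
                       ze_kernel_handle_t kernel, ze_group_count_t groupCount, int n) {
  ze_command_queue_desc_t qDesc = {ZE_STRUCTURE_TYPE_COMMAND_QUEUE_DESC};
  qDesc.mode = ZE_COMMAND_QUEUE_MODE_ASYNCHRONOUS;
  ze_command_list_handle_t cl;
  zeCommandListCreateImmediate(ctx, dev, &qDesc, &cl);

  ze_event_pool_desc_t poolDesc = {ZE_STRUCTURE_TYPE_EVENT_POOL_DESC};
  poolDesc.flags = ZE_EVENT_POOL_FLAG_HOST_VISIBLE;
  poolDesc.count = 1;
  ze_event_pool_handle_t pool;
  zeEventPoolCreate(ctx, &poolDesc, 1, &dev, &pool);

  ze_event_desc_t evDesc = {ZE_STRUCTURE_TYPE_EVENT_DESC};
  evDesc.signal = ZE_EVENT_SCOPE_FLAG_HOST;
  ze_event_handle_t ev;
  zeEventCreate(pool, &evDesc, &ev);

  const auto t1 = std::chrono::steady_clock::now();
  for (int i = 0; i < n; i++)
    zeCommandListAppendLaunchKernel(cl, kernel, &groupCount, nullptr, 0, nullptr);
  zeCommandListAppendBarrier(cl, ev, 0, nullptr); // signals `ev` once the prior kernels complete
  zeEventHostSynchronize(ev, UINT64_MAX);
  const auto t2 = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(t2 - t1).count();
}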

TApplencourt (Author) commented on Feb 7, 2022

Just for the sake of people monitoring this thread: I was able to see concurrent execution of kernels using multiple command queues. One just needs to use multiple indices when creating the command queues.

For people who can read Ruby, I put the reproducer below.

def bench_multiple_queue_ordinal(n,same_index=false)
    #CUDA style: Multiple queue
    if (!same_index)
        queues = DEVICE.command_queue_group_properties.filter_map.with_index{ |d,ordinal| [ordinal,d[:numQueues]] if d[:flags].include?(:ZE_COMMAND_QUEUE_GROUP_PROPERTY_FLAG_COMPUTE)}
    else 
        # Only use one index / queue. 
        queues = [ [0,1] ]
    end
    # Maximum concurrency if `n=0`. Will dispatch one kernel per ordinal * num_queue / index
    n = queues.sum{ |_,index| index} if n.zero?

    commands = queues.cycle.first(n).map.with_index { |(ordinal,index_max),i|
            [CONTEXT.command_queue_create(DEVICE, ordinal: ordinal, index: i % index_max),
             CONTEXT.command_list_create(DEVICE, command_queue_group_ordinal: ordinal)] }

    t1 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    commands.each { |command_queue, command_list|
        command_list.append_launch_kernel(KERNEL,GROUP_COUNT)
        command_list.close
        command_queue.execute_command_lists(command_list)
    }
    commands.each { |command_queue, command_list|
        command_queue.synchronize
        command_list.reset
    }
    t2 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    return t2-t1
end
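
As a plain C++/L0 illustration of the same idea (not the code used above; it assumes the context and device are created elsewhere, and omits all error checking), the key point is to create each command queue with a compute-capable group ordinal and, within that group, a distinct index:

#include <level_zero/ze_api.h>
#include <vector>

// Sketch: create `n` command queues, spreading them over the compute queue
// groups (ordinals) and, within each group, over the available indices.
std::vector<ze_command_queue_handle_t>
make_compute_queues(ze_context_handle_t ctx, ze_device_handle_t dev, uint32_t n) {
  uint32_t numGroups = 0;
  zeDeviceGetCommandQueueGroupProperties(dev, &numGroups, nullptr);
  ze_command_queue_group_properties_t defaults = {ZE_STRUCTURE_TYPE_COMMAND_QUEUE_GROUP_PROPERTIES};
  std::vector<ze_command_queue_group_properties_t> props(numGroups, defaults);
  zeDeviceGetCommandQueueGroupProperties(dev, &numGroups, props.data());

  // Keep only the queue groups that support compute.
  std::vector<uint32_t> computeOrdinals;
  for (uint32_t o = 0; o < numGroups; o++)
    if (props[o].flags & ZE_COMMAND_QUEUE_GROUP_PROPERTY_FLAG_COMPUTE)
      computeOrdinals.push_back(o);

  std::vector<ze_command_queue_handle_t> queues;
  for (uint32_t k = 0; k < n && !computeOrdinals.empty(); k++) {
    const uint32_t ordinal = computeOrdinals[k % computeOrdinals.size()];
    ze_command_queue_desc_t desc = {ZE_STRUCTURE_TYPE_COMMAND_QUEUE_DESC};
    desc.ordinal = ordinal;
    // Cycle over the indices of the group so consecutive queues land on different engines.
    desc.index = (k / computeOrdinals.size()) % props[ordinal].numQueues;
    desc.mode = ZE_COMMAND_QUEUE_MODE_ASYNCHRONOUS;
    ze_command_queue_handle_t q;
    zeCommandQueueCreate(ctx, dev, &desc, &q);
    queues.push_back(q);
  }
  return queues;
}

Each queue then gets its own command list with a single kernel, mirroring the Ruby code above.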

jandres742 (Contributor) commented on Feb 14, 2022

@TApplencourt, could you provide the latest status on this issue? There are some things to keep in mind here:

I was able to see concurrent execution of kernels using multiple command queues. One just needs to use multiple indices when creating the command queues.

That is correct. You may increase concurrency by using different indices. If using the same index, you would get serialization.

I used our Ruby bindings. The Ruby bindings are a by-product of THAPI (our take on ze_tracer).

Depending on what THAPI is using, you may see serialization on some platforms. For instance, ze_tracer adds events, and if those have host scope, then we have to do some extra cache flushes; depending on the hardware generation, you would get some serialization. One way of confirming that would be to have a native L0 sample without any other dependency such as THAPI.
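
To illustrate what "host scope" means here (a hedged sketch, not code taken from THAPI or ze_tracer): an event whose signal scope includes the host forces the signal to be made visible to the host, which is what can require the extra cache flushes mentioned above, whereas a device- or subdevice-scope event does not carry that requirement.

#include <level_zero/ze_api.h>

// Sketch: two event descriptors differing only in signal/wait scope.
ze_event_desc_t host_scope_event_desc() {
  ze_event_desc_t desc = {ZE_STRUCTURE_TYPE_EVENT_DESC};
  desc.index = 0;
  desc.signal = ZE_EVENT_SCOPE_FLAG_HOST; // host scope: signal must be visible to the host
  desc.wait = ZE_EVENT_SCOPE_FLAG_HOST;
  return desc;
}

ze_event_desc_t device_scope_event_desc() {
  ze_event_desc_t desc = {ZE_STRUCTURE_TYPE_EVENT_DESC};
  desc.index = 0;
  desc.signal = ZE_EVENT_SCOPE_FLAG_SUBDEVICE; // no host-visibility requirement
  desc.wait = ZE_EVENT_SCOPE_FLAG_SUBDEVICE;
  return desc;
}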

TApplencourt (Author) commented on Feb 14, 2022

They are just FFI bindings to the L0 API. We don't modify any L0 command, and we don't insert events for profiling or anything of that kind.
But I will write a direct L0 reproducer for this particular test and post it here.

jandres742 (Contributor) commented on Feb 14, 2022

@TApplencourt, could you elaborate on this then?

The Ruby bindings are a by-product of THAPI (our take on ze_tracer).

What do you mean that THAPI is your take on ze_tracer? Is THAPI a tracing/profiling tool?

TApplencourt (Author) commented on Feb 14, 2022

What do you mean that THAPI is your take on ze_tracer? Is THAPI a tracing/profiling tool?

Yes, THAPI is a tracer. You use it the same way you use ze_tracer: it intercepts all the L0 calls and generates a trace (dumping the arguments and the timestamp of each L0 call). It may add new events if you ask for GPU profiling. It's a tracing/profiling tool.

The Ruby bindings are just that: bindings. They are a by-product of THAPI; some post-analysis tools shipped with THAPI use them. They are similar to some L0 Python bindings that I saw once.
They are just used to call the L0 functions directly from Ruby using FFI. They don't do anything magic under the hood and should not introduce any serialization of any kind. They just provide a little bit of syntactic sugar for struct creation, device creation, and so on. But that is it. The bindings are similar to https://github.com/Nanosim-LIG/opencl-ruby or https://documen.tician.de/pyopencl/

Both the tracer and the bindings live in the same project, but they are distinct.

jandres742 (Contributor) commented on Feb 14, 2022

Thanks, @TApplencourt. So this is what I was referring to:

It may add new events if you ask for GPU profiling. It's a tracing/profiling tool.

If the tool adds some extra events, then depending on how those events are created and the hardware generation on which you are running, you may see some serialization. That's why I think having a standalone L0 reproducer would help more here.

TApplencourt (Author) commented on Feb 14, 2022

But the Ruby bindings never add an event. And I use only the Ruby bindings: I don't use events in the code, just CPU timing. But I will write the L0 reproducer for the multiple-index case.
It will be simpler for everybody to share.

TApplencourt (Author) commented on Feb 15, 2022

Hi @jandres742,

I did write it directly in L0. Please find it here for your review: https://gist.github.com/TApplencourt/35d124cf1cf74d240d2c499cb070fbd8

As expected, the behavior is exactly the same as with the Ruby bindings. I was expecting 4x concurrency; I got only about 2x. This code uses multiple command queues, each targeting a different index.

$ ZE_AFFINITY_MASK=0.0 ./a.out
1 kernel 49105
4 kernels 109224
Slowdown 2.22429
Not enough parallelism

Hope this helps,

PS: I followed your coding style, so if you trace the code you will see that the stype fields are uninitialized, but I guess that's OK.

jandres742 (Contributor) commented on Feb 15, 2022

Thanks, @TApplencourt. I will dig deeper into the reproducer and report back.
