Performance of L0 backend: Unable to see concurrent execution #5344

@TApplencourt (Contributor)

Description

Describe the bug

Some kernels cannot use the full GPU (as is the case for the kernel used in the reproducer below); the worst case is a kernel that uses only one EU. To get performance with such codes, the solution is to submit many kernels that can run concurrently.
This can be implemented in SYCL in two ways:

  • The straightforward way is to submit many kernels to an out-of-order queue and then synchronize. This is what I call the "async" mode in the reproducer.
  • Create one queue per kernel, submit all the kernels, and then wait on each individual queue. I call this mode "multiple_queue".

As you can deduce from the title of this issue, neither approach works on my hardware with the L0 runtime. I don't see any concurrent execution:

$ dpcpp -fsycl reproducer.cpp
tapplencourt@foo12:~/tmp/sycl> ./a.out async
Mode async
Time 1 chunk: 172878
Time 2 chunks: 345948
No concurrent execution...
tapplencourt@foo2:~/tmp/sycl> ./a.out multiple_queue
Mode multiple_queue
Time 1 chunk: 244641
Time 2 chunks: 410005
No concurrent execution...

The code is trivial: it just compares the runtime of one kernel with the runtime of two kernels (each kernel is only 512 work-items large). The times should be the same, as the kernels should be able to run in parallel on the GPU. As you can see, this is not the case.

Of course, nothing in the specification says that those kernels must run concurrently, but in my experience this is a required optimization to get good performance for a lot of codes.

To Reproduce

Be careful: for some strange reason, GitHub removed the \ in the macros.

#define MAD_4(x, y)                                                                                                    \
  x = y * x + y;                                                                                                       \
  y = x * y + x;                                                                                                       \
  x = y * x + y;                                                                                                       \
  y = x * y + x;
#define MAD_16(x, y)                                                                                                   \
  MAD_4(x, y);                                                                                                         \
  MAD_4(x, y);                                                                                                         \
  MAD_4(x, y);                                                                                                         \
  MAD_4(x, y);
#define MAD_64(x, y)                                                                                                   \
  MAD_16(x, y);                                                                                                        \
  MAD_16(x, y);                                                                                                        \
  MAD_16(x, y);                                                                                                        \
  MAD_16(x, y);

#include <chrono>
#include <sycl/sycl.hpp>

template <class T> T busy_wait(int N, T i) {
  T x = 1.3f;
  T y = (T)i;
  for (int j = 0; j < 1024 * 512; j++) {
    MAD_64(x, y);
  }
  return y;
}

template <class T> double async(const int num_chunks) {
  // one queue, `num_chunks` kernel submission, one wait at the end
  const int globalWIs{512};
  sycl::queue Q{sycl::gpu_selector()};
  T *ptr = sycl::malloc_device<T>(globalWIs, Q);
  const auto s = std::chrono::high_resolution_clock::now();
  for (int chunk = 0; chunk < num_chunks; chunk++) {
    Q.parallel_for(globalWIs, [=](sycl::item<1> i) {
      // Race condition on `ptr[i]`. We don't care
      ptr[i] = busy_wait(1024 * 512, (T)i);
    });
  }
  Q.wait();
  const auto e = std::chrono::high_resolution_clock::now();
  return std::chrono::duration_cast<std::chrono::microseconds>(e - s).count();
}

template <class T> double multiple_queue(const int num_chunks) {
  // `num_chunks` queue. One queue per kernel submission
  const int globalWIs{512};
  const sycl::device D{sycl::gpu_selector()};
  const sycl::context C(D);
  T *ptr = sycl::malloc_device<T>(globalWIs, D, C);
  std::vector<sycl::queue> Qs;
  for (int chunk = 0; chunk < num_chunks; chunk++)
    Qs.push_back(sycl::queue(C, D));

  const auto s = std::chrono::high_resolution_clock::now();
  for (auto &Q : Qs) {
    Q.parallel_for(globalWIs, [=](sycl::item<1> i) {
      // Race condition on `ptr[i]`. We don't care
      ptr[i] = busy_wait(1024 * 512, (T)i);
    });
  }
  for (auto &Q : Qs)
    Q.wait();
  const auto e = std::chrono::high_resolution_clock::now();
  return std::chrono::duration_cast<std::chrono::microseconds>(e - s).count();
}

int main(int argc, char *argv[]) {
  // By default try async, and N == 2
  const std::string mode = (argc < 2) ? "async" : std::string{argv[1]};
  const int N = (argc < 3) ? 2 : std::stoi(argv[2]);

  std::cout << "Mode " << mode << std::endl;

  double (*foo)(int);
  foo = mode == "async" ? &async<float> : &multiple_queue<float>;
  // Just to avoid JIT, and to warm up the GPU
  foo(1);
  // Each kernel runs on ~1 EU, so N kernels should take the same time as 1 kernel
  // for N <= EU_MAX.
  const double t1 = foo(1);
  std::cout << "Time 1 chunk: " << t1 << std::endl;
  const double tN = foo(N);
  std::cout << "Time " << N << " chunks: " << tN << std::endl;

  // Kernels should have run in parallel
  if (not(std::abs(tN - t1) <= (0.20 * tN))) {
    std::cerr << "No concurents execution..." << std::endl;
    return 1;
  } else {
    std::cerr << "Concurents execution!" << std::endl;
    return 0;
  }
}

Environment (please complete the following information):

  • OS: Linux
  • Target device and vendor: Intel GPU
  • DPC++ version: Intel(R) oneAPI DPC++/C++ Compiler 2022.1.0 (2022.x.0.20211025)
  • Dependencies version: L0 agama-prerelease-191

AlexeySachkov (Contributor) commented on Jan 25, 2022

Hi @TApplencourt, thanks for the report,

I was able to reproduce the problem using the open-source version of the compiler (20200120) and NEO 21.46.21636 (L0 version 1.2.21636).

Using ze_tracer I've collected some info: async-8.trace.json on pastebin and multiple-queues-8.trace.json on pastebin. To view them, just load the .json files into the chrome://tracing page in Chrome. As you can guess, I used 8 chunks for those tests.

So, multiple-queues-8.trace.json shows that even though you only call Q.wait() after all kernels are submitted, zeCommandQueueSynchronize is still called for each kernel right after it is submitted. Also, zeModuleCreate is called three times instead of just once. Tagging @smaslov-intel here for awareness.

[screenshot of the multiple-queues-8 trace timeline]

async-8.trace.json looks better, i.e. there is only one call to zeCommandQueueSynchronize, but all kernels are still serialized under the hood:

[screenshot of the async-8 trace timeline]

It seems to me that there is some limitation (perhaps?) in the low-level device runtime or driver which serializes kernels. Tagging @jchodor here for further comments.

TApplencourt (Author) commented on Jan 25, 2022

Hi Alexey,

I did the same study but didn't want to pollute the first post with technical details. :) I saw exactly what you described. By the way, we have the same issue with the OpenMP backend of icpx.

  • The first one that you describe is a limitation of the L0 backend of SYCL. One should probably prefer immediate command lists, to avoid the need to reset (and hence synchronize) the command list. My understanding is that reusing a command list is done to avoid the command-list creation overhead, but losing concurrency is far more problematic: a lot of HPC code relies on GPU kernel concurrency to get good performance.

Sadly, even if SYCL did this, it would not work. I did some tests writing directly in L0 using immediate command lists and I hit another bug (this time in the L0 driver) preventing concurrent execution.

  • The second one seems to be a limitation of the L0 driver as well, or at least a missed optimization.

Hope this helps. I can provide more data if needed.

MrSidims (Contributor) commented on Jan 26, 2022

UPD: please ignore these 'findings'
I agree with Alexey's serialization conclusion, though I believe 'kernel' shouldn't be plural: in the code above, if my understanding is correct, the DPC++ compiler creates a single kernel (by name) and enqueues it two times (I'm underlining this to differentiate kernels from kernel instances at runtime). I didn't dig into GPU runtime/PI traces, but I've managed to get:

Mode async
Time 1 chunk: 861940
Time 2 chunks: 862400
Concurrent execution!

by modifying the original code like this:

//  for (int chunk = 0; chunk < num_chunks; chunk++) {
    Q.parallel_for(globalWIs, [=](sycl::item<1> i) {
      // Race condition on `ptr[i]`. We don't care
      ptr[i] = busy_wait(1024 * 512, (T)i);
    });
    Q.parallel_for(globalWIs, [=](sycl::item<1> i) {
      // Race condition on `ptr[i]`. We don't care
      ptr[i] = busy_wait(1024 * 512, (T)i);
    });
//  }

(note that after the modification there are two kernels with different names), so to me it does look like the L0 runtime serializes instances of the same kernel, enforcing in-order execution. I can see a reason why this should be done for FPGA, where you have extra cross-kernel connections, but there is no reason for such serialization on a GPU.

TApplencourt (Author) commented on Jan 26, 2022

Oh, interesting find!
Sadly, I was not able to reproduce your behavior. Maybe I did something wrong in my updated reproducer? I put it below so more people can try to reproduce it if needed.

tapplencourt:~/tmp/sycl> cat test.cpp
#define MAD_4(x, y)                                                                                                    \
  x = y * x + y;                                                                                                       \
  y = x * y + x;                                                                                                       \
  x = y * x + y;                                                                                                       \
  y = x * y + x;
#define MAD_16(x, y)                                                                                                   \
  MAD_4(x, y);                                                                                                         \
  MAD_4(x, y);                                                                                                         \
  MAD_4(x, y);                                                                                                         \
  MAD_4(x, y);
#define MAD_64(x, y)                                                                                                   \
  MAD_16(x, y);                                                                                                        \
  MAD_16(x, y);                                                                                                        \
  MAD_16(x, y);                                                                                                        \
  MAD_16(x, y);

#include <chrono>
#include <sycl/sycl.hpp>

template <class T> T busy_wait(int N, T i) {
  T x = 1.3f;
  T y = (T)i;
  for (int j = 0; j < 1024 * 512; j++) {
    MAD_64(x, y);
  }
  return y;
}

template <class T> double async1(const int num_chunks) {
  // one queue, `num_chunks` kernel submission, one wait at the end
  const int globalWIs{512};
  sycl::queue Q{sycl::gpu_selector()};
  T *ptr = sycl::malloc_device<T>(globalWIs, Q);
  const auto s = std::chrono::high_resolution_clock::now();
  Q.parallel_for(globalWIs, [=](sycl::item<1> i) {
      // Race condition on `ptr[i]`. We don't care
      ptr[i] = busy_wait(1024 * 512, (T)i);
   });
  Q.wait();
  const auto e = std::chrono::high_resolution_clock::now();
  return std::chrono::duration_cast<std::chrono::microseconds>(e - s).count();
}

template <class T> double async2(const int num_chunks) {
  // one queue, `num_chunks` kernel submission, one wait at the end
  const int globalWIs{512};
  sycl::queue Q{sycl::gpu_selector()};
  T *ptr = sycl::malloc_device<T>(globalWIs, Q);
  const auto s = std::chrono::high_resolution_clock::now();
   Q.parallel_for(globalWIs, [=](sycl::item<1> i) {
      // Race condition on `ptr[i]`. We don't care
      ptr[i] = busy_wait(1024 * 512, (T)i);
   });
   Q.parallel_for(globalWIs, [=](sycl::item<1> i) {
      // Race condition on `ptr[i]`. We don't care
      ptr[i] = busy_wait(1024 * 512, (T)i);
   });
  Q.wait();
  const auto e = std::chrono::high_resolution_clock::now();
  return std::chrono::duration_cast<std::chrono::microseconds>(e - s).count();
}

int main(int argc, char *argv[]) {
  std::cout << "Mode async" << std::endl;
  // Just to avoid JIT, and to warm up the GPU
  async1<float>(1);
  const double t1 = async1<float>(1);
  std::cout << "Time 1 chunk: " << t1 << std::endl;
  const double t2 = async2<float>(2);
  std::cout << "Time " << 2 << " chunks: " << t2 << std::endl;

  // Kernels should have run in parallel
  if (not(std::abs(t2 - t1) <= (0.20 * t2))) {
    std::cerr << "No concurents execution..." << std::endl;
    return 1;
  } else {
    std::cerr << "Concurents execution!" << std::endl;
    return 0;
  }
}
tapplencourt:~/tmp/sycl> dpcpp test.cpp
tapplencourt:~/tmp/sycl> ./a.out
Mode async
Time 1 chunk: 273272
Time 2 chunks: 545479
No concurrent execution...

MrSidims (Contributor) commented on Jan 26, 2022

Maybe I did something wrong in my updated reproducer?

No. It's just me being stupid and forgetting to test single execution to compare results, so the results I was comparing are both for double execution. In other words, the proposed 'fix' doesn't help.

smaslov-intel (Contributor) commented on Jan 27, 2022

I did some tests writing directly in L0 using immediate command lists and I hit another bug (this time in the L0 driver) preventing concurrent execution.

@TApplencourt : could you share your tests?

TApplencourt (Author) commented on Jan 28, 2022

Of course. Please find them here.
Please note that I lied when I said that I wrote directly in L0: I used our Ruby bindings. The Ruby bindings are a by-product of THAPI (our take on ze_tracer). Thanks to Spack, THAPI should be "easy" to install; don't hesitate to ask if you have any questions.

Please find below a snippet of the code and the results, just to make the thread easier to read.

Code we use:

def bench_immediate(n)
   command_list = CONTEXT.command_list_create_immediate(DEVICE, mode: :ZE_COMMAND_QUEUE_MODE_ASYNCHRONOUS)
   event_pool = CONTEXT.event_pool_create(1, flags: [:ZE_EVENT_POOL_FLAG_HOST_VISIBLE])
   event = event_pool.event_create(0)

   t1 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
   n.times {
        command_list.append_launch_kernel(KERNEL, GROUP_COUNT)
   }
   command_list.append_barrier(signal_event: event)
   event.host_synchronize(timeout: ZE::UINT64_MAX)
   t2 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
   return t2-t1
end

We then run this code for n=1 and n=5; this is the output:

["bench_immediate", "n", 5, "t1", 0.049294119999103714, "tN", 0.24510650199954398, false]

As you can see, tN is 5x bigger than t1, showing no concurrent execution.
Hope this helps.
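
For reference, a rough native-L0 sketch of what bench_immediate above does might look like the following. This is an illustration only, not the actual test: it assumes a context, device, kernel, and group count have already been created elsewhere, and all error checking is omitted.

#include <level_zero/ze_api.h>
#include <chrono>
#include <cstdint>

// Sketch: submit `n` kernels on one immediate (asynchronous) command list,
// then wait on a single host-visible event signaled by a trailing barrier.
double bench_immediate(ze_context_handle_t ctx, ze_device_handle_t dev,
                       ze_kernel_handle_t kernel, ze_group_count_t groupCount, int n) {
  ze_command_queue_desc_t qDesc = {ZE_STRUCTURE_TYPE_COMMAND_QUEUE_DESC};
  qDesc.mode = ZE_COMMAND_QUEUE_MODE_ASYNCHRONOUS;
  ze_command_list_handle_t cl;
  zeCommandListCreateImmediate(ctx, dev, &qDesc, &cl);

  ze_event_pool_desc_t poolDesc = {ZE_STRUCTURE_TYPE_EVENT_POOL_DESC};
  poolDesc.flags = ZE_EVENT_POOL_FLAG_HOST_VISIBLE;
  poolDesc.count = 1;
  ze_event_pool_handle_t pool;
  zeEventPoolCreate(ctx, &poolDesc, 1, &dev, &pool);

  ze_event_desc_t evDesc = {ZE_STRUCTURE_TYPE_EVENT_DESC};
  evDesc.signal = ZE_EVENT_SCOPE_FLAG_HOST;
  ze_event_handle_t ev;
  zeEventCreate(pool, &evDesc, &ev);

  const auto t1 = std::chrono::steady_clock::now();
  for (int i = 0; i < n; i++)
    zeCommandListAppendLaunchKernel(cl, kernel, &groupCount, nullptr, 0, nullptr);
  zeCommandListAppendBarrier(cl, ev, 0, nullptr); // signals `ev` once the prior kernels complete
  zeEventHostSynchronize(ev, UINT64_MAX);
  const auto t2 = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(t2 - t1).count();
}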

TApplencourt (Author) commented on Feb 7, 2022

Just for the sake of people monitoring this thread: I was able to see concurrent execution of kernels using multiple command queues. One just needs to use multiple indices when creating the command queues.

For people who can read Ruby, I put the reproducer below.

def bench_multiple_queue_ordinal(n,same_index=false)
    #CUDA style: Multiple queue
    if (!same_index)
        queues = DEVICE.command_queue_group_properties.filter_map.with_index{ |d,ordinal| [ordinal,d[:numQueues]] if d[:flags].include?(:ZE_COMMAND_QUEUE_GROUP_PROPERTY_FLAG_COMPUTE)}
    else 
        # Only use one index / queue. 
        queues = [ [0,1] ]
    end
    # Maximum concurrency if `n=0`. Will dispatch one kernel per ordinal * num_queue / index
    n = queues.sum{ |_,index| index} if n.zero?

    commands = queues.cycle.first(n).map.with_index { |(ordinal,index_max),i|
            [CONTEXT.command_queue_create(DEVICE, ordinal: ordinal, index: i % index_max),
             CONTEXT.command_list_create(DEVICE, command_queue_group_ordinal: ordinal)] }

    t1 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    commands.each { |command_queue, command_list|
        command_list.append_launch_kernel(KERNEL,GROUP_COUNT)
        command_list.close
        command_queue.execute_command_lists(command_list)
    }
    commands.each { |command_queue, command_list|
        command_queue.synchronize
        command_list.reset
    }
    t2 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    return t2-t1
end
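
As a plain C++/L0 illustration of the same idea (not the code used above; it assumes the context and device are created elsewhere, and omits all error checking), the key point is to create each command queue with a compute-capable group ordinal and, within that group, a distinct index:

#include <level_zero/ze_api.h>
#include <vector>

// Sketch: create `n` command queues, spreading them over the compute queue
// groups (ordinals) and, within each group, over the available indices.
std::vector<ze_command_queue_handle_t>
make_compute_queues(ze_context_handle_t ctx, ze_device_handle_t dev, uint32_t n) {
  uint32_t numGroups = 0;
  zeDeviceGetCommandQueueGroupProperties(dev, &numGroups, nullptr);
  ze_command_queue_group_properties_t defaults = {ZE_STRUCTURE_TYPE_COMMAND_QUEUE_GROUP_PROPERTIES};
  std::vector<ze_command_queue_group_properties_t> props(numGroups, defaults);
  zeDeviceGetCommandQueueGroupProperties(dev, &numGroups, props.data());

  // Keep only the queue groups that support compute.
  std::vector<uint32_t> computeOrdinals;
  for (uint32_t o = 0; o < numGroups; o++)
    if (props[o].flags & ZE_COMMAND_QUEUE_GROUP_PROPERTY_FLAG_COMPUTE)
      computeOrdinals.push_back(o);

  std::vector<ze_command_queue_handle_t> queues;
  for (uint32_t k = 0; k < n && !computeOrdinals.empty(); k++) {
    const uint32_t ordinal = computeOrdinals[k % computeOrdinals.size()];
    ze_command_queue_desc_t desc = {ZE_STRUCTURE_TYPE_COMMAND_QUEUE_DESC};
    desc.ordinal = ordinal;
    // Cycle over the indices of the group so consecutive queues land on different engines.
    desc.index = (k / computeOrdinals.size()) % props[ordinal].numQueues;
    desc.mode = ZE_COMMAND_QUEUE_MODE_ASYNCHRONOUS;
    ze_command_queue_handle_t q;
    zeCommandQueueCreate(ctx, dev, &desc, &q);
    queues.push_back(q);
  }
  return queues;
}

Each queue then gets its own command list with a single kernel, mirroring the Ruby code above.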

jandres742 (Contributor) commented on Feb 14, 2022

@TApplencourt, could you provide the latest status on this issue? There are some things to keep in mind here:

I was able to see concurrent execution of kernels using multiple command queues. One just needs to use multiple indices when creating the command queues.

That is correct. You may increase concurrency by using different indices. If using the same index, you would get serialization.

I used our Ruby bindings. The Ruby bindings are a by-product of THAPI (our take on ze_tracer).

Depending on what THAPI is using, you may see serialization on some platforms. For instance, ze_tracer adds events, and if those have host scope, then we have to do some extra cache flushes; depending on the hardware generation, you would get some serialization. One way of confirming that would be to have a native L0 sample without any other dependency such as THAPI.
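
To illustrate what "host scope" means here (a hedged sketch, not code taken from THAPI or ze_tracer): an event whose signal scope includes the host forces the signal to be made visible to the host, which is what can require the extra cache flushes mentioned above, whereas a device- or subdevice-scope event does not carry that requirement.

#include <level_zero/ze_api.h>

// Sketch: two event descriptors differing only in signal/wait scope.
ze_event_desc_t host_scope_event_desc() {
  ze_event_desc_t desc = {ZE_STRUCTURE_TYPE_EVENT_DESC};
  desc.index = 0;
  desc.signal = ZE_EVENT_SCOPE_FLAG_HOST; // host scope: signal must be visible to the host
  desc.wait = ZE_EVENT_SCOPE_FLAG_HOST;
  return desc;
}

ze_event_desc_t device_scope_event_desc() {
  ze_event_desc_t desc = {ZE_STRUCTURE_TYPE_EVENT_DESC};
  desc.index = 0;
  desc.signal = ZE_EVENT_SCOPE_FLAG_SUBDEVICE; // no host-visibility requirement
  desc.wait = ZE_EVENT_SCOPE_FLAG_SUBDEVICE;
  return desc;
}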

TApplencourt (Author) commented on Feb 14, 2022

They are just FFI bindings to the L0 API. We don't modify any L0 command, and we don't insert events for profiling or anything of that kind.
But I will write a direct L0 reproducer for this particular test and post it here.

jandres742 (Contributor) commented on Feb 14, 2022

@TApplencourt, could you elaborate on this then?

The Ruby bindings are a by-product of THAPI (our take on ze_tracer).

What do you mean that THAPI is your take on ze_tracer? Is THAPI a tracing/profiling tool?

TApplencourt (Author) commented on Feb 14, 2022

What do you mean that THAPI is your take on ze_tracer? Is THAPI a tracing/profiling tool?

Yes, THAPI is a tracer. You use it the same way you use ze_tracer: it intercepts all the L0 calls and generates a trace (dumping the arguments and the timestamp of each L0 call). It may add new events if you ask for GPU profiling. It's a tracing/profiling tool.

The Ruby bindings are just that: bindings. They are a by-product of THAPI; some post-analysis tools shipped with THAPI use them. They are similar to some L0 Python bindings that I saw once.
They are just used to call the L0 functions directly from Ruby using FFI. They don't do anything magic under the hood and should not introduce any serialization of any kind. They just provide a little bit of syntactic sugar for struct creation, device creation, and so on. But that is it. The bindings are similar to https://github.com/Nanosim-LIG/opencl-ruby or https://documen.tician.de/pyopencl/

Both the tracer and the bindings live in the same project, but they are distinct.

jandres742 (Contributor) commented on Feb 14, 2022

Thanks, @TApplencourt. So this is what I was referring to:

It may add new events if you ask for GPU profiling. It's a tracing/profiling tool.

If the tool adds some extra events, then depending on how those events are created and the hardware generation on which you are running, you may see some serialization. That's why I think having a standalone L0 reproducer would help more here.

TApplencourt (Author) commented on Feb 14, 2022

But the Ruby bindings never add an event. And I use only the Ruby bindings: I don't use events in the code, just CPU timing. But I will write the L0 reproducer for the multiple-index case.
It will be simpler for everybody to share.

TApplencourt (Author) commented on Feb 15, 2022

Hi @jandres742,

I did write it directly in L0. Please find it here for your review: https://gist.github.com/TApplencourt/35d124cf1cf74d240d2c499cb070fbd8

As expected, the behavior is exactly the same as with the Ruby bindings. I was expecting 4x concurrency; I got only about 2x. This code uses multiple command queues, each targeting a different index.

$ ZE_AFFINITY_MASK=0.0 ./a.out
1 kernel 49105
4 kernels 109224
Slowdown 2.22429
Not enough parallelism

Hope this helps,

PS: I followed your coding style, so if you trace the code you will see that the stype fields are uninitialized, but I guess that's OK.

jandres742 (Contributor) commented on Feb 15, 2022

Thanks, @TApplencourt. I will dig deeper into the reproducer and report back.
