Describe the bug
Some kernels cannot use the full GPU (this is the case for the kernel used in the reproducer below); the worst case is a kernel that uses only one EU. To get performance with such codes, the solution is to submit many kernels that can run concurrently.
This can be implemented in SYCL in two ways:
- The straightforward way is to submit many kernels to an out-of-order queue and then synchronize. This is what I call the "async" mode in the reproducer.
- Create one queue per kernel, submit all the kernels, and then wait on each individual queue. I call this mode "multiple_queue". (A minimal sketch of both patterns is shown right below.)
As you can deduce from the title of the issue, neither of those approaches works on my hardware with the L0 runtime. I don't see any concurrent execution:
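Here is a minimal sketch of the two patterns, simplified from the full reproducer in the "To Reproduce" section (the kernel body is elided):

```cpp
#include <sycl/sycl.hpp>
#include <vector>

// "async": one out-of-order queue, many submissions, one wait at the end.
void async_mode(int num_chunks) {
  sycl::queue Q{sycl::gpu_selector()}; // out-of-order by default in DPC++
  for (int chunk = 0; chunk < num_chunks; chunk++)
    Q.parallel_for(512, [=](sycl::item<1> i) { /* small kernel body */ });
  Q.wait();
}

// "multiple_queue": one queue per kernel, then wait on each queue.
void multiple_queue_mode(int num_chunks) {
  const sycl::device D{sycl::gpu_selector()};
  const sycl::context C(D);
  std::vector<sycl::queue> Qs;
  for (int chunk = 0; chunk < num_chunks; chunk++)
    Qs.push_back(sycl::queue(C, D));
  for (auto &Q : Qs)
    Q.parallel_for(512, [=](sycl::item<1> i) { /* small kernel body */ });
  for (auto &Q : Qs)
    Q.wait();
}
```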
$dpcpp -fsycl reproducer.cpp
tapplencourt@foo12:~/tmp/sycl> ./a.out async
Mode async
Time 1 chunk: 172878
Time 2 chunks: 345948
No concurrent execution...
tapplencourt@foo2:~/tmp/sycl> ./a.out multiple_queue
Mode multiple_queue
Time 1 chunk: 244641
Time 2 chunks: 410005
No concurrent execution...
The code is trivial: it just compares the runtime of one kernel with the runtime of two kernels (each kernel is only 512 work-items large). The times should be the same, as the kernels should be able to run in parallel on the GPU. As you can see, that is not the case.
Of course, nothing in the specification says that those kernels must run concurrently, but in my experience this optimization is required to get good performance for a lot of codes.
To Reproduce
Be careful: for some strange reason, GitHub removed the `\` in the macros.
#define MAD_4(x, y) \
x = y * x + y; \
y = x * y + x; \
x = y * x + y; \
y = x * y + x;
#define MAD_16(x, y) \
MAD_4(x, y); \
MAD_4(x, y); \
MAD_4(x, y); \
MAD_4(x, y);
#define MAD_64(x, y) \
MAD_16(x, y); \
MAD_16(x, y); \
MAD_16(x, y); \
MAD_16(x, y);
#include <chrono>
#include <cmath>
#include <iostream>
#include <string>
#include <vector>
#include <sycl/sycl.hpp>
template <class T> T busy_wait(int N, T i) {
T x = 1.3f;
T y = (T)i;
for (int j = 0; j < N; j++) { // use the N argument (the caller passes 1024 * 512)
MAD_64(x, y);
}
return y;
}
template <class T> double async(const int num_chunks) {
// one queue, `num_chunks` kernel submission, one wait at the end
const int globalWIs{512};
sycl::queue Q{sycl::gpu_selector()};
T *ptr = sycl::malloc_device<T>(globalWIs, Q);
const auto s = std::chrono::high_resolution_clock::now();
for (int chunk = 0; chunk < num_chunks; chunk++) {
Q.parallel_for(globalWIs, [=](sycl::item<1> i) {
// Race condition on `ptr[i]`. We don't care
ptr[i] = busy_wait(1024 * 512, (T)i);
});
}
Q.wait();
const auto e = std::chrono::high_resolution_clock::now();
return std::chrono::duration_cast<std::chrono::microseconds>(e - s).count();
}
template <class T> double multiple_queue(const int num_chunks) {
// `num_chunks` queue. One queue per kernel submission
const int globalWIs{512};
const sycl::device D{sycl::gpu_selector()};
const sycl::context C(D);
T *ptr = sycl::malloc_device<T>(globalWIs, D, C);
std::vector<sycl::queue> Qs;
for (int chunk = 0; chunk < num_chunks; chunk++)
Qs.push_back(sycl::queue(C, D));
const auto s = std::chrono::high_resolution_clock::now();
for (auto &Q : Qs) {
Q.parallel_for(globalWIs, [=](sycl::item<1> i) {
// Race condition on `ptr[i]`. We don't care
ptr[i] = busy_wait(1024 * 512, (T)i);
});
}
for (auto &Q : Qs)
Q.wait();
const auto e = std::chrono::high_resolution_clock::now();
return std::chrono::duration_cast<std::chrono::microseconds>(e - s).count();
}
int main(int argc, char *argv[]) {
// By default try async, and N == 2
const std::string mode = (argc < 2) ? "async" : std::string{argv[1]};
const int N = (argc < 3) ? 2 : std::stoi(argv[2]);
std::cout << "Mode " << mode << std::endl;
double (*foo)(int);
foo = mode == "async" ? &async<float> : &multiple_queue<float>;
// Just to avoid JIT, and to warm up the GPU
foo(1);
// Each kernel runs on ~1 EU, so N kernels should take the same time as 1 kernel
// for N <= EU_MAX.
const double t1 = foo(1);
std::cout << "Time 1 chunk: " << t1 << std::endl;
const double tN = foo(N);
std::cout << "Time " << N << " chunks: " << tN << std::endl;
// Kernels should have run in parallel
if (not(std::abs(tN - t1) <= (0.20 * tN))) {
std::cerr << "No concurrent execution..." << std::endl;
return 1;
} else {
std::cerr << "Concurrent execution!" << std::endl;
return 0;
}
}
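Side note (not part of the original reproducer): below is a hedged sketch of how one could check overlap directly with SYCL event profiling instead of wall-clock timing. It reuses `busy_wait` and the MAD macros from the reproducer above, and `check_overlap` is an illustrative helper name. Keep in mind (as discussed later in this thread) that adding profiling events may itself affect concurrency on some platforms.

```cpp
// Hedged sketch, not part of the original reproducer: check directly whether
// two kernel executions overlap on the device, using SYCL event profiling.
int check_overlap() {
  const int globalWIs{512};
  sycl::queue Q{sycl::gpu_selector(), sycl::property::queue::enable_profiling{}};
  float *ptr = sycl::malloc_device<float>(globalWIs, Q);
  sycl::event e0 = Q.parallel_for(globalWIs, [=](sycl::item<1> i) {
    ptr[i] = busy_wait(1024 * 512, (float)i);
  });
  sycl::event e1 = Q.parallel_for(globalWIs, [=](sycl::item<1> i) {
    ptr[i] = busy_wait(1024 * 512, (float)i);
  });
  Q.wait();
  // The kernels overlapped if the second one started before the first one ended.
  const auto start1 = e1.get_profiling_info<sycl::info::event_profiling::command_start>();
  const auto end0 = e0.get_profiling_info<sycl::info::event_profiling::command_end>();
  std::cout << (start1 < end0 ? "Kernels overlapped" : "Kernels were serialized") << std::endl;
  sycl::free(ptr, Q);
  return start1 < end0 ? 0 : 1;
}
```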
Environment (please complete the following information):
- OS: Linux
- Target device and vendor: Intel GPU
- DPC++ version: Intel(R) oneAPI DPC++/C++ Compiler 2022.1.0 (2022.x.0.20211025)
- Dependencies version: L0 agama-prerelease-191
Activity
AlexeySachkov commented on Jan 25, 2022
Hi @TApplencourt, thanks for the report.
I was able to reproduce the problem using the open-source version of the compiler (20200120) and NEO 21.46.21636 (1.2.21636 as the L0 version).
Using `ze_tracer` I've collected some info: `async-8.trace.json` on pastebin and `multiple-queues-8.trace.json` on pastebin. To view them, just load the `.json` files into the `chrome://tracing` page in Chrome. As you can guess, I used 8 chunks for those tests.
So, `multiple-queues-8.trace.json` shows that even though you only call `Q.wait()` after all kernels are submitted, `zeCommandQueueSynchronize` is still called for each kernel right after it is submitted. Also, we have called `zeModuleCreate` three times instead of just once. Tagging @smaslov-intel here for awareness.
`async-8.trace.json` looks better, i.e. there is only one call to `zeCommandQueueSynchronize`, but all kernels are still serialized under the hood. It seems to me that there is some limitation (perhaps?) of a low-level device runtime or driver which serializes kernels. Tagging @jchodor here for further comments.
TApplencourt commented on Jan 25, 2022
Hi Alexey,
I did the same study but didn't want to pollute the first post with too many technical details. :) I saw exactly what you described. By the way, we have the same issue with the OpenMP backend of `icpx`.
Sadly, even if SYCL did this correctly, it would not work. I did some tests writing directly in L0 using an immediate command list and I hit another bug (this time in the L0 driver) preventing concurrent execution.
Hope this helps; I can provide more data if needed.
MrSidims commented on Jan 26, 2022
UPD: please ignore these 'findings'
I agree with Alexey's serialization conclusion, though I believe 'kernels' shouldn't be plural: in the code above, if my understanding is correct, the DPCPP compiler creates a single kernel (by name) and enqueues it two times (I'm underlining this to differentiate kernels from kernel instances at runtime). I didn't dig into GPU runtime/PI traces, but I've managed to get:
by modifying the original code like this:
(note that after the modification there are two kernels with different names), so to me it does look like the L0 runtime serializes the created kernel, enforcing in-order execution. I can see a reason why this should be done for FPGA, where you have extra cross-kernel connections, but there is no reason for such serialization on GPU.
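(The modified code and its output are not reproduced in this thread. A hypothetical sketch of the kind of change described, giving the two submissions distinct kernel names so two separate kernels are generated, might look like the following; `ChunkA`, `ChunkB`, and `submit_two_named_kernels` are illustrative names, not the actual code.)

```cpp
// Hypothetical illustration only, not the actual modification from the comment:
// name the two submissions differently so the compiler emits two distinct kernels
// instead of two instances of the same one. Reuses busy_wait from the reproducer.
class ChunkA;
class ChunkB;

void submit_two_named_kernels(sycl::queue &Q, float *ptr, int globalWIs) {
  Q.parallel_for<ChunkA>(globalWIs, [=](sycl::item<1> i) {
    ptr[i] = busy_wait(1024 * 512, (float)i);
  });
  Q.parallel_for<ChunkB>(globalWIs, [=](sycl::item<1> i) {
    ptr[i] = busy_wait(1024 * 512, (float)i);
  });
  Q.wait();
}
```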
TApplencourt commented on Jan 26, 2022
Oh, interesting find!
Sadly, I was not able to reproduce your behavior. Maybe I did something wrong with my updated reproducer? I put it below for more people to try to reproduce if needed.
MrSidims commented on Jan 26, 2022
No. It's just me being stupid and forgetting to test the single execution to compare against - so the results that I was comparing are both for the double execution. In other words, the proposed 'fix' doesn't help.
smaslov-intel commented on Jan 27, 2022
@TApplencourt : could you share your tests?
TApplencourt commented on Jan 28, 2022
Ofc. Please find them here.
Please note that I lied when I said that I wrote directly in L0. I used our Ruby bindings. The Ruby bindings are a side product of THAPI (our take on ze_tracer). Thanks to Spack, THAPI should be "easy" to install; don't hesitate to ask if you have any questions.
Please find below some snippets of code and results, just to ease reading.
Code we use:
Then we run this code for `n=1` and `n=5`. As you can see in the output, tN is 5x bigger than t1, showing no concurrent execution.
Hope this helps.
TApplencourt commented on Feb 7, 2022
Just for the sake of people monitoring this thread: I was able to see concurrent execution of kernels using multiple command queues. One just needs to use a different index when creating each command queue.
For the people who can read Ruby, I put the reproducer below.
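(The Ruby reproducer itself is not included in this thread. For readers who prefer C/C++, here is a hedged sketch of the same idea in native Level Zero: create several command queues on the same compute command-queue group ordinal, each with a different index. The driver/device/context setup, error checking, and the query of the compute group ordinal and its queue count are omitted, and `create_queues` is an illustrative helper name.)

```cpp
#include <level_zero/ze_api.h>
#include <vector>

// Hedged sketch: one command queue per chunk, all on the same compute
// command-queue group ordinal, but each with a different `index`.
// `index` must stay below the group's numQueues (queried via
// zeDeviceGetCommandQueueGroupProperties, not shown here).
std::vector<ze_command_queue_handle_t>
create_queues(ze_context_handle_t ctx, ze_device_handle_t dev,
              uint32_t compute_ordinal, uint32_t num_queues) {
  std::vector<ze_command_queue_handle_t> queues(num_queues);
  for (uint32_t i = 0; i < num_queues; i++) {
    ze_command_queue_desc_t desc = {};
    desc.stype = ZE_STRUCTURE_TYPE_COMMAND_QUEUE_DESC;
    desc.ordinal = compute_ordinal; // compute command-queue group
    desc.index = i;                 // a different index per queue
    desc.mode = ZE_COMMAND_QUEUE_MODE_ASYNCHRONOUS;
    desc.priority = ZE_COMMAND_QUEUE_PRIORITY_NORMAL;
    zeCommandQueueCreate(ctx, dev, &desc, &queues[i]);
  }
  return queues;
}
```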
jandres742 commented on Feb 14, 2022
@TApplencourt, could you provide the latest status on this issue? There are some things to keep in mind here:
That is correct: you may increase concurrency by using different indices. If you use the same index, you get serialization.
Depending on what THAPI is doing, you may see serialization on some platforms. For instance, ze_tracer adds events, and if those have host scope we have to do some extra cache flushes; depending on the gen, you would get some serialization. One way of confirming this would be to have a native L0 sample without any other dependency such as THAPI.
TApplencourt commented on Feb 14, 2022
They are just FFI bindings to the L0 functions. We don't modify any L0 call, and we don't insert events for profiling or anything of that kind.
But I will write a direct L0 reproducer for this particular test and post it here.
jandres742 commented on Feb 14, 2022
@TApplencourt, could you elaborate on this then?
What do you mean by THAPI being your take on ze_tracer? Is THAPI then a tracing/profiling tool?
TApplencourt commented on Feb 14, 2022
Yes, THAPI is a tracer. You use it the same way you use ze_tracer: it intercepts all the L0 calls and generates a trace (dumping the arguments and the timestamp of each L0 call). It may add new events if you ask for GPU profiling. It's a tracing/profiling tool.
The Ruby bindings are just that: bindings. They are a by-product of THAPI. Some post-analysis tools shipped with THAPI use the Ruby bindings, which are similar to some L0 Python bindings I saw at one time.
They are just used to call the L0 functions directly from Ruby using FFI. They don't do anything magic under the hood and should not introduce any serialization of any kind. They just provide a little bit of syntactic sugar for struct creation, device creation, and so on. But that is it. The bindings are similar to https://github.com/Nanosim-LIG/opencl-ruby or https://documen.tician.de/pyopencl/
Both the tracer and the bindings live in the same project, but they are distinct.
jandres742 commented on Feb 14, 2022
Thanks, @TApplencourt. So this is what I was referring to:
If the tool adds some extra events, then depending on how those events are created and the gen on which you are running, you may see some serialization. That's why I think a standalone L0 reproducer would help more here.
TApplencourt commented on Feb 14, 2022
But the Ruby bindings never add an event, and I use only the Ruby bindings. I don't use events in the code, just CPU timing. But I will write the L0 reproducer for the multiple-index case.
It will be simpler for everybody to share.
TApplencourt commented on Feb 15, 2022
Hi @jandres742,
I did write it directly in L0. Please find it here for your review: https://gist.github.com/TApplencourt/35d124cf1cf74d240d2c499cb070fbd8
As expected, the behavior is exactly the same as with the Ruby bindings: I was expecting a 4x speedup but got only 2x. This code uses multiple command queues, each targeting a different index.
Hope this helps,
PS: I followed your coding style, so if you trace the code you will see that the `stype` fields are uninitialized, but I guess that's OK.
jandres742 commented on Feb 15, 2022
Thanks, @TApplencourt. Will dig deeper into the reproducer and report back.