Description
Describe the bug
When consecutive host tasks are submitted to an in-order queue without an explicit wait() after each submission, the execution time of each host task grows sharply as the number of submissions increases.
To reproduce
Reproducing code
// test.cpp
#include <sycl/sycl.hpp>
#include <chrono>
#include <iostream>
#include <string>
#include <thread>

int main(int argc, char *argv[]) {
    sycl::queue queue(sycl::property::queue::in_order{});
    std::cout << "Using device: " << queue.get_device().get_info<sycl::info::device::name>() << "\n";

    // Number of consecutive host task submissions (first CLI argument, default 10000).
    int repeat = 10000;
    if (argc > 1) {
        repeat = std::stoi(std::string(argv[1]));
    }

    int data = 0;
    std::cout << "Submitting " << repeat << " host tasks...\n";
    auto start_time = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < repeat; i++) {
        // Delay between submissions on the submitting thread.
        std::this_thread::sleep_for(std::chrono::microseconds(500));
        auto e = queue.submit([&](sycl::handler &cgh) {
            cgh.host_task([&]() {
                // Simulate some work on the host
                std::this_thread::sleep_for(std::chrono::milliseconds(1));
                data++;
            });
        });
#ifdef WAIT
        // Only in the -DWAIT build: block until this host task has finished.
        e.wait();
#endif
    }
    queue.wait();
    auto end_time = std::chrono::high_resolution_clock::now();
    std::cout << "Total execution time: " << std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time).count() << " ms\n";

    // Every host task increments data exactly once.
    if (data != repeat) {
        std::cerr << "Error: data mismatch! Expected " << repeat << ", got " << data << "\n";
        return 1;
    }
    return 0;
}
Compile
Compile the code with and without an explicit wait() after each submission.
clang++ -fsycl test.cpp -o nowait.out
clang++ -fsycl test.cpp -DWAIT -o wait.out
Run
Pass the number of consecutive submissions (repeat) as the first argument.
./nowait.out 3000
./wait.out 3000
Results for different values of repeat
Total time in ms
repeat | 10 | 100 | 1000 | 3000 | 10000 |
---|---|---|---|---|---|
wait.out | 16 | 162 | 1617 | 4853 | 16184 |
nowait.out | 11 | 106 | 1396 | 12996 | 519977 |
Average time per host task in ms (total time / repeat)
repeat | 10 | 100 | 1000 | 3000 | 10000 |
---|---|---|---|---|---|
wait.out | 1.6 | 1.62 | 1.617 | 1.618 | 1.6184 |
nowait.out | 1.1 | 1.06 | 1.396 | 4.332 | 51.9977 |
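To make the growth rate in the tables above easier to read, here is a minimal sketch that recomputes the per-task average and estimates a local scaling exponent between two measurements. The helper names average_ms and scaling_exponent are hypothetical, and the exponent assumes the total time behaves roughly like repeat^k between the two points, which is only a rough diagnostic and not a statement about the runtime's internals.

```cpp
#include <cassert>
#include <cmath>

// Average wall-clock time per host task in milliseconds (total / repeat).
double average_ms(double total_ms, int repeat) {
    assert(repeat > 0);
    return total_ms / repeat;
}

// Local scaling exponent k between two measurements, under the assumption
// that total time grows like repeat^k between them:
//   k = log(total_b / total_a) / log(repeat_b / repeat_a)
double scaling_exponent(int repeat_a, double total_a_ms,
                        int repeat_b, double total_b_ms) {
    assert(repeat_a > 0 && repeat_b > 0 && repeat_a != repeat_b);
    assert(total_a_ms > 0.0 && total_b_ms > 0.0);
    return std::log(total_b_ms / total_a_ms) /
           std::log(static_cast<double>(repeat_b) / repeat_a);
}
```

Applied to the 3000 and 10000 columns, scaling_exponent(3000, 4853, 10000, 16184) for wait.out is close to 1 (linear), while scaling_exponent(3000, 12996, 10000, 519977) for nowait.out is close to 3, so the total time without wait() grows far faster than linearly.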
Expected behavior
Even without an explicit wait() for each submission to an in-order queue, the average execution time of each host task should stay around 1 ms. The roughly 50x slowdown at repeat == 10000 (about 52 ms per task) is not expected.
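The "around 1 ms" expectation can be sketched with a simple timing model of the reproducer: a 0.5 ms sleep before each submission and a 1 ms host task. This is an idealized estimate that assumes zero runtime overhead and strictly in-order execution. The function names are hypothetical and nothing here is taken from the DPC++ implementation.

```cpp
#include <algorithm>
#include <cassert>

// Idealized total time (ms) when each submission is followed by e.wait():
// the submit-side sleep and the host task run strictly one after the other.
double expected_total_wait_ms(int repeat, double submit_gap_ms, double task_ms) {
    assert(repeat >= 0);
    return repeat * (submit_gap_ms + task_ms);
}

// Idealized total time (ms) without per-submission wait(): submission and
// execution overlap like a two-stage pipeline, so throughput is bounded by
// the slower of the two stages.
double expected_total_nowait_ms(int repeat, double submit_gap_ms, double task_ms) {
    assert(repeat >= 0);
    if (repeat == 0) {
        return 0.0;
    }
    // First submission happens after one gap, the last task adds one task
    // duration, and the slower stage paces everything in between.
    return submit_gap_ms + (repeat - 1) * std::max(submit_gap_ms, task_ms) + task_ms;
}
```

With submit_gap_ms = 0.5 and task_ms = 1.0 at repeat = 10000, this model gives 15000 ms with wait() (16184 ms measured) and 10000.5 ms without it (519977 ms measured), which is why about 1 ms per host task is the expected figure for nowait.out.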
Environment
- OS: Linux
- Target device and vendor: host
- DPC++ version: 7987a43
- Dependencies version: Not relevant
Additional context
No response