[SYCL][Host Task] Bad performance of consecutively submitted host tasks onto an in-order queue #18500

Open
@Nuullll

Describe the bug

When consecutive host tasks are submitted to an in-order queue without an explicit wait() after each submission, the average time per host task grows sharply as the number of submissions increases.

To reproduce

Reproducing code

// test.cpp
#include <sycl/sycl.hpp>
#include <chrono>
#include <iostream>
#include <string>  // std::stoi, std::string
#include <thread>

int main(int argc, char *argv[]) {
    sycl::queue queue(sycl::property::queue::in_order{});

    std::cout << "Using device: " << queue.get_device().get_info<sycl::info::device::name>() << "\n";

    int repeat = 10000;
    if (argc > 1) {
        repeat = std::stoi(std::string(argv[1]));
    }
    int data = 0;

    std::cout << "Submitting " << repeat << " host tasks...\n";
    auto start_time = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < repeat; i++) {
        std::this_thread::sleep_for(std::chrono::microseconds(500));
        auto e = queue.submit([&](sycl::handler &cgh) {
            cgh.host_task([&]() {
                // Simulate some work on the host
                std::this_thread::sleep_for(std::chrono::milliseconds(1));
                data++;
            });
        });
#ifdef WAIT
        e.wait();
#endif
    }

    queue.wait();
    auto end_time = std::chrono::high_resolution_clock::now();
    std::cout << "Total execution time: " << std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time).count() << " ms\n";
    if (data != repeat) {
        std::cerr << "Error: data mismatch! Expected " << repeat << ", got " << data << "\n";
        return 1;
    }

    return 0;
}

Compile

Compile the code w/ and w/o explicit wait for each submission.

clang++ -fsycl test.cpp -o nowait.out
clang++ -fsycl test.cpp -DWAIT -o wait.out

Run

Pass the number of consecutive submissions (repeat) as the first argument.

./nowait.out 3000
./wait.out 3000

Results for different repeat values

Total time in ms

| repeat     | 10 | 100 | 1000 | 3000  | 10000  |
|------------|----|-----|------|-------|--------|
| wait.out   | 16 | 162 | 1617 | 4853  | 16184  |
| nowait.out | 11 | 106 | 1396 | 12996 | 519977 |

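Average time per iteration in ms (total time / repeat)

| repeat     | 10  | 100  | 1000  | 3000  | 10000   |
|------------|-----|------|-------|-------|---------|
| wait.out   | 1.6 | 1.62 | 1.617 | 1.618 | 1.6184  |
| nowait.out | 1.1 | 1.06 | 1.396 | 4.332 | 51.9977 |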

Expected behavior

Even without an explicit wait() after each submission to an in-order queue, the average time per host task should stay around 1 ms. The roughly 50x slowdown at repeat == 10000 is not expected.

Environment

  • OS: Linux
  • Target device and vendor: host
  • DPC++ version: 7987a43
  • Dependencies version: Not relevant

Additional context

No response
