Summary
Nightly benchmark still has a separate GLA decode hang/timeout. This is independent of:
Nightly run: https://github.com/tile-ai/TileOPs/actions/runs/28125205861/job/83305954027
The benchmark job uploaded tileops_benchmarks.log, which ends with a timeout while profiling benchmarks/ops/bench_gla_recurrence.py::test_gla_decode_bench.
Failure Evidence
The timeout stack is:
  File "/home/ci-runner/runner/_work/TileOPs/TileOPs/benchmarks/ops/bench_gla_recurrence.py", line 120, in test_gla_decode_bench
    result = bm.profile(op, *inputs)
  File "/home/ci-runner/runner/_work/TileOPs/TileOPs/benchmarks/benchmark_base.py", line 315, in profile
    latency = bench_kernel(functor, args=inputs)
  File "/home/ci-runner/runner/_work/TileOPs/TileOPs/benchmarks/benchmark_base.py", line 199, in bench_kernel
    torch.cuda.synchronize()
  File "/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py", line 1108, in synchronize
    return torch._C._cuda_synchronize()
Captured stdout immediately before the timeout:
GLADecodeFP32Kernel initialized with config: {'num_stages': 2, 'threads': 128, 'k_tile': 16}
So the benchmark appears to launch work and then hang or run indefinitely until the job-level timeout catches it at torch.cuda.synchronize().
Affected Benchmark
File: benchmarks/ops/bench_gla_recurrence.py
Benchmark under test:
@GLADecodeBenchFixture
def test_gla_decode_bench(...):
    op = GLADecodeOp(batch, heads, dim_k, dim_v, scale=scale, dtype=dtype)
    result = bm.profile(op, *inputs)
The fixture covers single-token GLA decode (T=1) over:
- batch: 1, 8, 16, 32, 64
- heads: 32
- dim: 64 or 128
- dtype: float32, float16, bfloat16
The log captured only the active kernel class and config before the timeout, not the exact parametrized case, so the first debugging step is to rerun with verbose pytest ids or add per-case logging.
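A minimal sketch of what per-case logging could look like, using only the standard library. The helper name `log_case` and the `[gla-decode-bench]` prefix are hypothetical; the parameter names mirror the arguments passed to `GLADecodeOp` above. The line is flushed immediately so it is less likely to be lost in a buffer if the process is later killed by a timeout.

```python
import sys


def log_case(**params):
    """Print one flushed line describing the benchmark case about to run.

    Hypothetical helper: intended to be called right before
    bm.profile(op, *inputs) so the last line in the log names the case.
    """
    fields = " ".join(f"{name}={value}" for name, value in params.items())
    line = f"[gla-decode-bench] starting {fields}"
    # flush=True pushes the line out before any potentially hanging GPU work.
    print(line, file=sys.stdout, flush=True)
    return line
```

Inside the benchmark this might be called as `log_case(batch=batch, heads=heads, dim_k=dim_k, dim_v=dim_v, dtype=dtype)` on the line before `bm.profile(op, *inputs)`.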
Proposed Investigation
- Reproduce benchmarks/ops/bench_gla_recurrence.py::test_gla_decode_bench on H200 with -vv -s to identify the exact hanging parameter case.
- Start with fp32 cases because the captured kernel is GLADecodeFP32Kernel.
- Check whether the hang is in the TileOPs kernel launch itself or in benchmark warmup/repetition logic.
- Compare with the op correctness tests for GLADecodeOp to see whether they use smaller shapes or avoid this fp32 path.
- Add a defensive per-case timeout or xfail only after we know whether the kernel is incorrect/hanging vs just too slow for the benchmark budget.
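One way to make the reproduction and the per-case timeout concrete is to run each case in its own child process with a wall-clock limit. This matters because a process blocked inside torch.cuda.synchronize() may not respond to an in-process Python timeout. The sketch below uses only the standard library; the helper names `pytest_cmd` and `run_case_with_timeout` and any timeout value are assumptions for illustration, not part of TileOPs.

```python
import subprocess
import sys


def pytest_cmd(node_id):
    """Build a pytest command for one node id, with verbose ids and no capture."""
    return [sys.executable, "-m", "pytest", node_id, "-vv", "-s"]


def run_case_with_timeout(cmd, timeout_s):
    """Run one command in a child process and classify the outcome.

    Returns "ok" (exit code 0), "failed" (non-zero exit code), or
    "timeout" (the child was still running after timeout_s seconds).
    """
    try:
        # On timeout, subprocess.run kills the child and then raises.
        proc = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "timeout"
    return "ok" if proc.returncode == 0 else "failed"
```

For example, `run_case_with_timeout(pytest_cmd("benchmarks/ops/bench_gla_recurrence.py::test_gla_decode_bench"), timeout_s=300)` would run the whole benchmark under an assumed 300-second budget. Passing a single parametrized node id instead narrows a "timeout" result down to one case.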
Acceptance Criteria
- benchmarks/ops/bench_gla_recurrence.py::test_gla_decode_bench completes without hanging in the nightly benchmark.
- If a specific shape/dtype is unsupported or too slow, it is explicitly skipped/xfail-marked with a reason instead of timing out the whole benchmark job.
- Benchmark logs include enough parameter context to identify future hangs quickly.