[BUG][BENCHMARK][Linear-Attn] GLA decode benchmark times out during CUDA synchronize #1617

@superAngGao

Summary

The nightly benchmark still hits a separate GLA decode hang/timeout. This is independent of:

Nightly run: https://github.com/tile-ai/TileOPs/actions/runs/28125205861/job/83305954027

The benchmark job uploaded tileops_benchmarks.log, which ends with a timeout while profiling benchmarks/ops/bench_gla_recurrence.py::test_gla_decode_bench.

Failure Evidence

The timeout stack is:

File "/home/ci-runner/runner/_work/TileOPs/TileOPs/benchmarks/ops/bench_gla_recurrence.py", line 120, in test_gla_decode_bench
  result = bm.profile(op, *inputs)
File "/home/ci-runner/runner/_work/TileOPs/TileOPs/benchmarks/benchmark_base.py", line 315, in profile
  latency = bench_kernel(functor, args=inputs)
File "/home/ci-runner/runner/_work/TileOPs/TileOPs/benchmarks/benchmark_base.py", line 199, in bench_kernel
  torch.cuda.synchronize()
File "/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py", line 1108, in synchronize
  return torch._C._cuda_synchronize()

Captured stdout immediately before the timeout:

GLADecodeFP32Kernel initialized with config: {'num_stages': 2, 'threads': 128, 'k_tile': 16}

So the benchmark appears to launch work and then hang or run indefinitely until the job-level timeout catches it at torch.cuda.synchronize().

Affected Benchmark

File: benchmarks/ops/bench_gla_recurrence.py

Benchmark under test:

@GLADecodeBenchFixture
def test_gla_decode_bench(...):
    op = GLADecodeOp(batch, heads, dim_k, dim_v, scale=scale, dtype=dtype)
    result = bm.profile(op, *inputs)

The fixture covers single-token GLA decode (T=1) over:

  • batch: 1, 8, 16, 32, 64
  • heads: 32
  • dim: 64 or 128
  • dtype: float32, float16, bfloat16

The log captured only the active kernel class/config before the timeout, not the exact parametrized case, so the first debugging step is to rerun with verbose pytest ids or add per-case logging.
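As a rough illustration of the per-case logging idea, the sketch below enumerates the grid listed above and prints a readable id before any GPU work would start, so that a hang leaves the offending case as the last line in the log. The id format, the helper names, and the assumption that `dim_k == dim_v` are all hypothetical and not taken from `bench_gla_recurrence.py`.

```python
import itertools

# Parameter grid as listed in this issue.
# Assumption: dim_k and dim_v share one "dim" value; the real fixture may differ.
BATCHES = (1, 8, 16, 32, 64)
HEADS = (32,)
DIMS = (64, 128)
DTYPES = ("float32", "float16", "bfloat16")


def case_id(batch, heads, dim, dtype):
    """Build a readable case id (hypothetical naming scheme)."""
    return f"b{batch}-h{heads}-d{dim}-{dtype}"


def all_case_ids():
    """Return ids for the full cartesian product of the grid."""
    return [case_id(*case) for case in itertools.product(BATCHES, HEADS, DIMS, DTYPES)]


def log_case_start(batch, heads, dim, dtype):
    """Print the case id and flush stdout before profiling would begin."""
    cid = case_id(batch, heads, dim, dtype)
    # flush=True so the line reaches the log even if the process later hangs.
    print(f"[gla_decode_bench] starting case {cid}", flush=True)
    return cid
```

Under these assumptions the grid has 30 cases, 10 of which are fp32, which bounds the search if the fp32 path is the culprit.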

Proposed Investigation

  • Reproduce benchmarks/ops/bench_gla_recurrence.py::test_gla_decode_bench on H200 with -vv -s to identify the exact hanging parameter case.
  • Start with fp32 cases because the captured kernel is GLADecodeFP32Kernel.
  • Check whether the hang is in the TileOPs kernel launch itself or in benchmark warmup/repetition logic.
  • Compare with op correctness tests for GLADecodeOp to see whether correctness uses smaller shapes or avoids this fp32 path.
  • Add a defensive per-case timeout or xfail only after we know whether the kernel is incorrect/hanging vs just too slow for the benchmark budget.
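One possible shape for the defensive per-case timeout mentioned in the last bullet is to run each case in its own child process and bound it with a wall-clock limit. This is only a sketch: the helper name and the three status strings are hypothetical, and it assumes that killing the child process is enough to release a stuck GPU context, which should be verified on the CI runner.

```python
import subprocess
import sys


def run_case_with_timeout(cmd, timeout_s):
    """Run one command in a child process and classify the outcome.

    Returns "ok" on exit code 0, "failed" on a non-zero exit code,
    and "timeout" if the child exceeds timeout_s seconds.
    """
    try:
        # subprocess.run kills the child and waits for it when the timeout expires.
        proc = subprocess.run(cmd, timeout=timeout_s, capture_output=True, text=True)
    except subprocess.TimeoutExpired:
        return "timeout"
    return "ok" if proc.returncode == 0 else "failed"


if __name__ == "__main__":
    # Trivial demonstration with a child that exits immediately.
    print(run_case_with_timeout([sys.executable, "-c", "pass"], timeout_s=30))
```

A driver could then call it with something like `[sys.executable, "-m", "pytest", "benchmarks/ops/bench_gla_recurrence.py::test_gla_decode_bench", "-k", "<case expression>", "-vv", "-s"]`, where the `-k` expression depends on how the fixture names its cases. Isolating cases this way means one slow or hanging case is reported as a timeout without taking down the rest of the benchmark job.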

Acceptance Criteria

  • benchmarks/ops/bench_gla_recurrence.py::test_gla_decode_bench completes without hanging in the nightly benchmark.
  • If a specific shape/dtype is unsupported or too slow, it is explicitly skipped/xfail-marked with a reason instead of timing out the whole benchmark job.
  • Benchmark logs include enough parameter context to identify future hangs quickly.
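To make the second criterion concrete, a small registry keyed by case parameters could hold the reason for each explicit skip. This is a minimal sketch; the registry name, its key layout, and the helper are hypothetical. The registry is deliberately left empty here because no failing case has been identified yet.

```python
# Hypothetical registry mapping (batch, dim, dtype) to a human-readable reason.
# Left empty on purpose: entries should be added only once a case is confirmed
# to be unsupported or too slow for the benchmark budget.
KNOWN_UNSUPPORTED = {}


def skip_reason(batch, dim, dtype, registry=None):
    """Return the recorded reason for skipping a case, or None if it should run."""
    if registry is None:
        registry = KNOWN_UNSUPPORTED
    return registry.get((batch, dim, dtype))
```

Inside the benchmark, a non-`None` result could be passed to `pytest.skip(reason)`, or attached at collection time via `pytest.param(..., marks=pytest.mark.xfail(reason=reason))`, so the reason shows up in the report instead of a job-level timeout.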
