[BUG][BENCHMARK][Linear-Attn] GLA decode benchmark times out during CUDA synchronize

## Summary

Nightly benchmark still has a separate GLA decode hang/timeout. This is independent of:

- the TileLang 0.1.11 autotune missing-argument issue (#1613),
- the Triton cache permission issue (#1615), and
- the GQA sliding-window correctness issue (#1616).

Nightly run: https://github.com/tile-ai/TileOPs/actions/runs/28125205861/job/83305954027

The benchmark job uploaded `tileops_benchmarks.log`, which ends with a timeout while profiling `benchmarks/ops/bench_gla_recurrence.py::test_gla_decode_bench`.

## Failure Evidence

The timeout stack is:

```text
File "/home/ci-runner/runner/_work/TileOPs/TileOPs/benchmarks/ops/bench_gla_recurrence.py", line 120, in test_gla_decode_bench
  result = bm.profile(op, *inputs)
File "/home/ci-runner/runner/_work/TileOPs/TileOPs/benchmarks/benchmark_base.py", line 315, in profile
  latency = bench_kernel(functor, args=inputs)
File "/home/ci-runner/runner/_work/TileOPs/TileOPs/benchmarks/benchmark_base.py", line 199, in bench_kernel
  torch.cuda.synchronize()
File "/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py", line 1108, in synchronize
  return torch._C._cuda_synchronize()
```

Captured stdout immediately before the timeout:

```text
GLADecodeFP32Kernel initialized with config: {'num_stages': 2, 'threads': 128, 'k_tile': 16}
```

So the benchmark appears to launch work and then hang or run indefinitely until the job-level timeout catches it at `torch.cuda.synchronize()`.

## Affected Benchmark

File: `benchmarks/ops/bench_gla_recurrence.py`

Benchmark under test:

```python
@GLADecodeBenchFixture
def test_gla_decode_bench(...):
    op = GLADecodeOp(batch, heads, dim_k, dim_v, scale=scale, dtype=dtype)
    result = bm.profile(op, *inputs)
```

The fixture covers single-token GLA decode (`T=1`) over:

- batch: `1`, `8`, `16`, `32`, `64`
- heads: `32`
- dim: `64` or `128`
- dtype: `float32`, `float16`, `bfloat16`

The log only captured the active kernel class/config before timeout, not the exact parametrized case, so first debugging step is to rerun with verbose pytest ids or add per-case logging.

## Proposed Investigation

- Reproduce `benchmarks/ops/bench_gla_recurrence.py::test_gla_decode_bench` on H200 with `-vv -s` to identify the exact hanging parameter case.
- Start with fp32 cases because the captured kernel is `GLADecodeFP32Kernel`.
- Check whether the hang is in the TileOPs kernel launch itself or in benchmark warmup/repetition logic.
- Compare with op correctness tests for `GLADecodeOp` to see whether correctness uses smaller shapes or avoids this fp32 path.
- Add a defensive per-case timeout or xfail only after we know whether the kernel is incorrect/hanging vs just too slow for the benchmark budget.

## Acceptance Criteria

- `benchmarks/ops/bench_gla_recurrence.py::test_gla_decode_bench` completes without hanging in nightly benchmark.
- If a specific shape/dtype is unsupported or too slow, it is explicitly skipped/xfail-marked with a reason instead of timing out the whole benchmark job.
- Benchmark logs include enough parameter context to identify future hangs quickly.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[BUG][BENCHMARK][Linear-Attn] GLA decode benchmark times out during CUDA synchronize #1617

Summary

Failure Evidence

Affected Benchmark

Proposed Investigation

Acceptance Criteria

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

[BUG][BENCHMARK][Linear-Attn] GLA decode benchmark times out during CUDA synchronize #1617

Description

Summary

Failure Evidence

Affected Benchmark

Proposed Investigation

Acceptance Criteria

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions