[CI][CACHE] fix Triton cache permission errors in nightly op tests #1615

Description

@superAngGao

Summary

Nightly op tests hit a group of PermissionError failures while Triton/PyTorch Inductor tried to write temporary files under the shared CI Triton cache:

PermissionError: [Errno 13] Permission denied: '/ci-cache/triton/.../tmp.pid_...'

This looks like a CI cache ownership / writable-path issue rather than an operator correctness issue. The affected tests fail while third-party Triton kernels or PyTorch Inductor are compiling and caching artifacts, before the actual TileOPs correctness comparison can run.

Nightly run: https://github.com/tile-ai/TileOPs/actions/runs/28125205861/job/83305954027

Observed Failures

The op test artifact contains 21 failures in this category:

  • tests.ops.test_gla_chunkwise_bwd: 3 failures
  • tests.ops.test_gla_chunkwise_fwd: 3 failures
  • tests.ops.test_gla_recurrence: 3 failures
  • tests.ops.test_mamba: 6 failures, surfaced through torch._inductor.exc.InductorError wrapping the same PermissionError
  • tests.ops.test_moe_fused_moe: 6 failures

Representative examples:

FAILED tests/ops/test_gla_chunkwise_bwd.py::test_gla_bwd[2-64-2-64-64-64-dtype0-False]
PermissionError: [Errno 13] Permission denied: '/ci-cache/triton/.../tmp.pid_...'
FAILED tests/ops/test_mamba.py::test_mamba2_fwd_e2e[1-256-4-64-32-1-256-dtype0]
torch._inductor.exc.InductorError: PermissionError: [Errno 13] Permission denied: '/ci-cache/triton/.../tmp.pid_...'

Likely Cause

The nightly workflow mounts persistent cache directories into the Docker container. The mounted Triton cache appears to contain directories/files created by a different UID/GID or with restrictive permissions, so the current container user cannot create tmp.pid_* files in some hashed Triton cache subdirectories.

Because the error path is inside the cache (/ci-cache/triton/...) and not the checked-out workspace, the existing workspace ownership fix is not sufficient.
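
To confirm this diagnosis before changing the workflow, the cache tree can be scanned for directories the container user probably cannot write to. The following is a minimal sketch, not part of the existing CI: the function name find_suspect_cache_dirs is hypothetical, and the check is a conservative heuristic that flags any directory owned by another UID or missing owner write/search bits (it ignores group and other permissions, so it may over-report).

```python
import os
import stat


def find_suspect_cache_dirs(root, uid=None):
    """Return (path, owner_uid, mode) for directories `uid` likely cannot write to.

    Heuristic only: a directory is flagged when it is owned by a different UID
    or when the owner lacks the write or search (execute) bit.
    """
    uid = os.getuid() if uid is None else uid

    # Collect the root plus every subdirectory name that os.walk can list.
    # Children are taken from `dirnames` so that unreadable directories,
    # which os.walk silently skips descending into, are still inspected.
    candidates = [root]
    for dirpath, dirnames, _filenames in os.walk(root):
        candidates.extend(os.path.join(dirpath, name) for name in dirnames)

    suspects = []
    for path in candidates:
        try:
            st = os.stat(path)
        except OSError:
            # Missing path or no search permission on a parent directory.
            suspects.append((path, -1, "?"))
            continue
        owner_can_write = bool(st.st_mode & stat.S_IWUSR) and bool(st.st_mode & stat.S_IXUSR)
        if st.st_uid != uid or not owner_can_write:
            suspects.append((path, st.st_uid, stat.filemode(st.st_mode)))
    return suspects


if __name__ == "__main__":
    # Assumption: fall back to the in-container path seen in the error messages.
    cache_root = os.environ.get("TRITON_CACHE_DIR", "/ci-cache/triton")
    for path, owner, mode in find_suspect_cache_dirs(cache_root):
        print(f"{mode} uid={owner} {path}")
```

Running this inside the nightly container as the test user should list the hashed subdirectories that produce the PermissionError, along with the UID that owns them.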

Proposed Fix

This should be fixable at the workflow level.

Options:

  1. Ensure the host cache directories are writable before running tests, for example by adding a cache ownership/permission normalization step for the mounted cache roots.
  2. Run the container with a consistent UID/GID that owns the persistent cache directories.
  3. Use a per-run writable Triton/Inductor cache directory for tests, then optionally sync/reuse only known-safe cache contents.
  4. Explicitly set all relevant cache env vars to the intended writable mount inside the container, including TRITON_CACHE_DIR and any PyTorch Inductor cache env vars if needed.
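
Options 3 and 4 could be combined on the test side. The sketch below is an illustration under assumptions, not the project's existing setup: the helper name use_private_compile_cache and the directory prefix are hypothetical, and TORCHINDUCTOR_CACHE_DIR is assumed to be the relevant Inductor variable (the issue itself only names TRITON_CACHE_DIR).

```python
import os
import tempfile


def use_private_compile_cache(base_dir=None):
    """Point Triton and Inductor at a fresh, writable per-run cache directory.

    Returns a dict mapping each environment variable to the directory it now
    points at. `base_dir` is an optional parent directory; by default the
    system temp directory is used.
    """
    # mkdtemp creates a directory readable and writable only by the current
    # user, which sidesteps stale ownership on the shared cache mount.
    run_root = tempfile.mkdtemp(prefix="op-test-cache-", dir=base_dir)
    cache_dirs = {
        "TRITON_CACHE_DIR": os.path.join(run_root, "triton"),
        # Assumption: Inductor reads its cache location from this variable.
        "TORCHINDUCTOR_CACHE_DIR": os.path.join(run_root, "inductor"),
    }
    for env_var, path in cache_dirs.items():
        os.makedirs(path, exist_ok=True)
        os.environ[env_var] = path
    return cache_dirs
```

The helper would need to run before torch or triton is imported, for example at the top of a conftest.py, so that both libraries pick up the new locations. The trade-off is that every run starts with a cold cache and recompiles its kernels, which is why option 3 mentions syncing known-safe contents back afterwards.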

A minimal workflow-side fix could be to normalize the mounted cache permissions on the host before the "Run full op tests" step, e.g. for the cache roots used by the nightly container:

mkdir -p /data/ci-cache/triton /data/ci-cache/tilelang /data/ci-cache/pip /data/ci-cache/wheels
chown -R <runner-uid>:<runner-gid> /data/ci-cache/triton /data/ci-cache/tilelang /data/ci-cache/pip /data/ci-cache/wheels
chmod -R u+rwX /data/ci-cache/triton /data/ci-cache/tilelang /data/ci-cache/pip /data/ci-cache/wheels

or perform the equivalent inside a privileged/cache-maintenance container, depending on the runner setup.

Acceptance Criteria

  • Nightly op tests no longer fail with PermissionError under /ci-cache/triton or the mounted Triton cache path.
  • GLA, Mamba, and MoE tests progress past Triton/Inductor compilation/cache setup.
  • The fix does not require changing operator code.

Metadata

Labels

bench (Benchmark updates), ci (CI/CD pipeline changes)
