Summary
Nightly op tests hit a group of PermissionError failures while Triton/PyTorch Inductor tried to write temporary files under the shared CI Triton cache:
PermissionError: [Errno 13] Permission denied: '/ci-cache/triton/.../tmp.pid_...'
This looks like a CI cache ownership / writable-path issue rather than an operator correctness issue. The affected tests fail while third-party Triton kernels or PyTorch Inductor compile and cache their artifacts, before the actual TileOPs correctness comparison can complete.
Nightly run: https://github.com/tile-ai/TileOPs/actions/runs/28125205861/job/83305954027
Observed Failures
The op test artifact contains 21 failures in this category:
- tests.ops.test_gla_chunkwise_bwd: 3 failures
- tests.ops.test_gla_chunkwise_fwd: 3 failures
- tests.ops.test_gla_recurrence: 3 failures
- tests.ops.test_mamba: 6 failures, surfaced through torch._inductor.exc.InductorError wrapping the same PermissionError
- tests.ops.test_moe_fused_moe: 6 failures
Representative examples:
FAILED tests/ops/test_gla_chunkwise_bwd.py::test_gla_bwd[2-64-2-64-64-64-dtype0-False]
PermissionError: [Errno 13] Permission denied: '/ci-cache/triton/.../tmp.pid_...'
FAILED tests/ops/test_mamba.py::test_mamba2_fwd_e2e[1-256-4-64-32-1-256-dtype0]
torch._inductor.exc.InductorError: PermissionError: [Errno 13] Permission denied: '/ci-cache/triton/.../tmp.pid_...'
Likely Cause
The nightly workflow mounts persistent cache directories into the Docker container. The mounted Triton cache appears to contain directories/files created by a different UID/GID or with restrictive permissions, so the current container user cannot create tmp.pid_* files in some hashed Triton cache subdirectories.
Because the error path is inside the cache (/ci-cache/triton/...) and not the checked-out workspace, the existing workspace ownership fix is not sufficient.
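To confirm this diagnosis on the runner, a quick check is to list the cache directories the current user cannot write to. The following is a minimal sketch, assuming GNU find (for the `-writable` test) is available in the container; `find_unwritable_dirs` is a hypothetical helper name, not something that exists in the workflow today.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print every directory under the given root that the
# current user cannot write to. Triton creates its tmp.pid_* files inside
# these hashed subdirectories, so an unwritable directory is enough to
# trigger the PermissionError.
find_unwritable_dirs() {
  local root="$1"
  # -type d      : only directories
  # ! -writable  : GNU find test based on the current user's effective access
  find "$root" -type d ! -writable -print 2>/dev/null
}
```

Running `find_unwritable_dirs /ci-cache/triton` inside the nightly container, followed by `ls -ldn` on any reported path, should show whether the owner UID/GID or the mode bits are the problem. An empty result would point away from the ownership theory.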
Proposed Fix
This should be fixable at the workflow level.
Options:
- Ensure the host cache directories are writable before running tests, for example by adding a cache ownership/permission normalization step for the mounted cache roots.
- Run the container with a consistent UID/GID that owns the persistent cache directories.
- Use a per-run writable Triton/Inductor cache directory for tests, then optionally sync/reuse only known-safe cache contents.
- Explicitly set all relevant cache env vars to the intended writable mount inside the container, including TRITON_CACHE_DIR and any PyTorch Inductor cache env vars if needed.
A minimal workflow-side fix could be to normalize the mounted cache permissions before Run full op tests, e.g. for the cache roots used by the nightly container:
mkdir -p /data/ci-cache/triton /data/ci-cache/tilelang /data/ci-cache/pip /data/ci-cache/wheels
chown -R <runner-uid>:<runner-gid> /data/ci-cache/triton /data/ci-cache/tilelang /data/ci-cache/pip /data/ci-cache/wheels
chmod -R u+rwX /data/ci-cache/triton /data/ci-cache/tilelang /data/ci-cache/pip /data/ci-cache/wheels
or perform the equivalent inside a privileged/cache-maintenance container, depending on the runner setup.
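The per-run cache option could look roughly like the sketch below, run inside the container before the test step. It assumes TORCHINDUCTOR_CACHE_DIR is the relevant PyTorch Inductor variable for this setup; `RUN_CACHE_BASE` and the `op-test-cache` prefix are hypothetical names chosen for illustration.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical base location for per-run caches; defaults to the system temp dir.
RUN_CACHE_BASE="${RUN_CACHE_BASE:-${TMPDIR:-/tmp}}"

# mktemp -d creates a fresh directory owned by the current user, so stale
# ownership from earlier runs cannot block writes.
RUN_CACHE_DIR="$(mktemp -d "${RUN_CACHE_BASE%/}/op-test-cache.XXXXXX")"
mkdir -p "$RUN_CACHE_DIR/triton" "$RUN_CACHE_DIR/inductor"

# Point Triton and (assumed) Inductor at the per-run directories.
export TRITON_CACHE_DIR="$RUN_CACHE_DIR/triton"
export TORCHINDUCTOR_CACHE_DIR="$RUN_CACHE_DIR/inductor"

echo "TRITON_CACHE_DIR=$TRITON_CACHE_DIR"
echo "TORCHINDUCTOR_CACHE_DIR=$TORCHINDUCTOR_CACHE_DIR"
```

The trade-off is that every run starts with a cold cache, so compile time goes up unless known-safe contents are synced back afterwards. The ownership normalization above keeps the warm cache but needs sufficient privileges on the host.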
Acceptance Criteria
- Nightly op tests no longer fail with PermissionError under /ci-cache/triton or the mounted Triton cache path.
- GLA, Mamba, and MoE tests progress past Triton/Inductor compilation/cache setup.
- The fix does not require changing operator code.
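The first criterion can be checked mechanically against the test log. This is a sketch only: `check_no_cache_permission_errors` is a hypothetical helper, and it assumes the pytest output has been saved to a file and that the error text keeps the format shown in the representative examples.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeed only if the log has no PermissionError whose
# path starts with the given cache prefix (default: /ci-cache/triton).
check_no_cache_permission_errors() {
  local log_file="$1"
  local cache_prefix="${2:-/ci-cache/triton}"
  # -F treats the pattern as a fixed string, so the brackets are literal.
  if grep -F -q "PermissionError: [Errno 13] Permission denied: '${cache_prefix}" "$log_file"; then
    echo "cache PermissionError still present in $log_file" >&2
    return 1
  fi
  return 0
}
```

If the cache is mounted at a different path inside the container, pass that path as the second argument.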