Migration of sharktank external tests to torch_models

# EPIC: Complete migration of sharktank external tests to torch_models

The sharktank external test suite should be fully ported to the newer
`torch_models` test suite format so we can retire
`.github/workflows/pkgci_test_sharktank.yml` and stop maintaining duplicate
model test infrastructure.

A local inspection shows that `tests/external/iree-test-suites/sharktank_models`
still contains 33 JSON manifests, while `torch_models` covers only part of that
set. Ports exist for SDXL clip/vae/pUNet and Llama, but several sharktank tests
are still missing or have no equivalent torch config.
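
To make the manifest count above reproducible, a small script along these lines could be used. This is a minimal sketch, not part of the existing tooling: the helper name `count_manifests` is hypothetical, and it assumes every `*.json` file under the directory is a test manifest, which may overcount if other JSON files live there.

```python
from collections import Counter
from pathlib import Path


def count_manifests(root: Path) -> Counter:
    """Count *.json files under `root`, grouped by parent directory.

    Assumption: every JSON file under `root` is a test manifest.
    Keys are directory paths relative to `root`, using forward slashes.
    """
    counts: Counter = Counter()
    for path in sorted(root.rglob("*.json")):
        group = path.parent.relative_to(root).as_posix()
        counts[group] += 1
    return counts
```

Calling `count_manifests(Path("tests/external/iree-test-suites/sharktank_models"))` from the repository root and summing the values gives a total to compare against the 33 reported here; running it on `torch_models` as well shows which model areas have no counterpart yet.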

## Goals

- Port all remaining sharktank quality and benchmark coverage to
  `tests/external/iree-test-suites/torch_models`.
- Preserve intended coverage, thresholds, benchmark flags, golden times,
  markers, and target/device behavior.
- Resolve or explicitly document any intentional behavior changes between
  sharktank and torch configs.
- Remove sharktank CI only after torch coverage is equivalent.

## Breakdown

- [ ] Port SD3 quality tests.

  Missing work: add a `torch_models/sd3` area with module configs for CLIP,
  MMDiT, and VAE, then add CPU and ROCm quality configs preserving the
  original MLIR URLs, weights, input/output files, thresholds, compiler flags,
  and run functions.

  References:
  [`clip_cpu.json`](https://github.com/iree-org/iree/blob/main/tests/external/iree-test-suites/sharktank_models/quality_tests/sd3/clip_cpu.json),
  [`clip_rocm.json`](https://github.com/iree-org/iree/blob/main/tests/external/iree-test-suites/sharktank_models/quality_tests/sd3/clip_rocm.json),
  [`mmdit_cpu.json`](https://github.com/iree-org/iree/blob/main/tests/external/iree-test-suites/sharktank_models/quality_tests/sd3/mmdit_cpu.json),
  [`mmdit_rocm.json`](https://github.com/iree-org/iree/blob/main/tests/external/iree-test-suites/sharktank_models/quality_tests/sd3/mmdit_rocm.json),
  [`vae_cpu.json`](https://github.com/iree-org/iree/blob/main/tests/external/iree-test-suites/sharktank_models/quality_tests/sd3/vae_cpu.json),
  [`vae_rocm.json`](https://github.com/iree-org/iree/blob/main/tests/external/iree-test-suites/sharktank_models/quality_tests/sd3/vae_rocm.json).

- [ ] Port SDXL scheduler compile-only tests.

  Missing work: add torch-model equivalents for the scheduler compile-only
  coverage. These do not run model quality checks; they validate that the
  scheduler MLIR compiles for CPU and ROCm with the same target/device and
  preprocessing flags.

  References:
  [`scheduler_cpu.json`](https://github.com/iree-org/iree/blob/main/tests/external/iree-test-suites/sharktank_models/quality_tests/sdxl/scheduler_cpu.json),
  [`scheduler_rocm.json`](https://github.com/iree-org/iree/blob/main/tests/external/iree-test-suites/sharktank_models/quality_tests/sdxl/scheduler_rocm.json).

- [ ] Port SDXL UNet tests.

  Missing work: add torch-model module and test configs for the scheduled UNet
  quality cases, the 960x1024 UNet quality cases, and the ROCm benchmark. The
  scheduled UNet configs also reference a pipeline module, so the port needs to
  preserve that multi-module behavior instead of only compiling the standalone
  model.

  References:
  [`unet_fp16_cpu.json`](https://github.com/iree-org/iree/blob/main/tests/external/iree-test-suites/sharktank_models/quality_tests/sdxl/unet_fp16_cpu.json),
  [`unet_fp16_rocm.json`](https://github.com/iree-org/iree/blob/main/tests/external/iree-test-suites/sharktank_models/quality_tests/sdxl/unet_fp16_rocm.json),
  [`unet_fp16_960_1024_cpu.json`](https://github.com/iree-org/iree/blob/main/tests/external/iree-test-suites/sharktank_models/quality_tests/sdxl/unet_fp16_960_1024_cpu.json),
  [`unet_fp16_960_1024_rocm.json`](https://github.com/iree-org/iree/blob/main/tests/external/iree-test-suites/sharktank_models/quality_tests/sdxl/unet_fp16_960_1024_rocm.json),
  [`benchmarks/sdxl/unet_fp16_rocm.json`](https://github.com/iree-org/iree/blob/main/tests/external/iree-test-suites/sharktank_models/benchmarks/sdxl/unet_fp16_rocm.json).

- [ ] Port SDXL end-to-end benchmark.

  Missing work: add a torch-model benchmark for the full SDXL pipeline that
  compiles and runs the multi-module pipeline (`sdxl_clip`, `sdxl_unet_fp16`,
  `sdxl_vae`) via `tokens_to_image`. Preserve the pipeline MLIR, compile flags,
  benchmark flags, and per-SKU golden timing expectations.

  Reference:
  [`benchmarks/sdxl/e2e_rocm.json`](https://github.com/iree-org/iree/blob/main/tests/external/iree-test-suites/sharktank_models/benchmarks/sdxl/e2e_rocm.json).

- [ ] Resolve SDXL pUNet fp8 coverage.

  Missing work: port the fp8 pUNet quality and benchmark coverage, or document
  why it is intentionally replaced. The existing torch `punet_gfx942_v2` config
  does not appear equivalent: it uses different MLIR, different input arity,
  `run_forward` instead of `main`, and different compiler/preprocessing flags.

  References:
  [`punet_int8_fp8_rocm.json`](https://github.com/iree-org/iree/blob/main/tests/external/iree-test-suites/sharktank_models/quality_tests/sdxl/punet_int8_fp8_rocm.json),
  [`benchmarks/sdxl/punet_int8_fp8_rocm.json`](https://github.com/iree-org/iree/blob/main/tests/external/iree-test-suites/sharktank_models/benchmarks/sdxl/punet_int8_fp8_rocm.json),
  existing torch candidate
  [`punet_gfx942_v2.json`](https://github.com/iree-org/iree/blob/main/tests/external/iree-test-suites/torch_models/sdxl/modules/punet_gfx942_v2.json).

- [ ] Resolve Llama f16 data-tiling coverage.

  Missing work: add f16 data-tiling module/test configs for both quality and
  benchmark coverage. The existing torch data-tiling configs are under
  `llama_8b_fp8`, so they do not replace the sharktank f16 data-tiling cases.

  References:
  [`8b_f16_decode_data_tiling_rocm.json`](https://github.com/iree-org/iree/blob/main/tests/external/iree-test-suites/sharktank_models/quality_tests/llama/8b_f16_decode_data_tiling_rocm.json),
  [`8b_f16_prefill_data_tiling_rocm.json`](https://github.com/iree-org/iree/blob/main/tests/external/iree-test-suites/sharktank_models/quality_tests/llama/8b_f16_prefill_data_tiling_rocm.json),
  [`benchmarks/llama/8b_f16_decode_data_tiling_rocm.json`](https://github.com/iree-org/iree/blob/main/tests/external/iree-test-suites/sharktank_models/benchmarks/llama/8b_f16_decode_data_tiling_rocm.json),
  [`benchmarks/llama/8b_f16_prefill_data_tiling_rocm.json`](https://github.com/iree-org/iree/blob/main/tests/external/iree-test-suites/sharktank_models/benchmarks/llama/8b_f16_prefill_data_tiling_rocm.json).

- [ ] Resolve Llama per-function quality coverage.

  Missing work: decide whether the torch `test_greedy_decoder` quality test is
  intended to replace the sharktank per-function `decode_bs4` and `prefill_bs4`
  quality tests. If not, add separate torch quality configs using the original
  inputs and functions.

  References:
  [`8b_f16_decode_rocm.json`](https://github.com/iree-org/iree/blob/main/tests/external/iree-test-suites/sharktank_models/quality_tests/llama/8b_f16_decode_rocm.json),
  [`8b_f16_prefill_rocm.json`](https://github.com/iree-org/iree/blob/main/tests/external/iree-test-suites/sharktank_models/quality_tests/llama/8b_f16_prefill_rocm.json),
  existing torch quality config
  [`quality_gfx942.json`](https://github.com/iree-org/iree/blob/main/tests/external/iree-test-suites/torch_models/llama_8b_fp16/quality_gfx942.json).

- [ ] Audit apparent ports for mismatches.

  Missing work: for each apparent sharktank-to-torch port, either align the
  torch config with the sharktank behavior or document why the change is
  intentional.

  Known mismatches to resolve:
  - [ ] SDXL VAE CPU threshold changed from `--expected_f16_threshold=0.02f`
        to `0.4f`.
        References:
        [`sharktank vae_cpu.json`](https://github.com/iree-org/iree/blob/main/tests/external/iree-test-suites/sharktank_models/quality_tests/sdxl/vae_cpu.json),
        [`torch vae_quality_cpu.json`](https://github.com/iree-org/iree/blob/main/tests/external/iree-test-suites/torch_models/sdxl/vae_quality_cpu.json).
  - [ ] Llama decode input differs: sharktank uses `4x5xi64`, torch seq128
        uses `4x4xi64`.
        References:
        [`sharktank decode benchmark`](https://github.com/iree-org/iree/blob/main/tests/external/iree-test-suites/sharktank_models/benchmarks/llama/8b_f16_decode_rocm.json),
        [`torch decode seq128 benchmark`](https://github.com/iree-org/iree/blob/main/tests/external/iree-test-suites/torch_models/llama_8b_fp16/decode_benchmark_seq128_mi325.json).
  - [ ] Benchmark flags differ in several ports, especially
        `--device_allocator=caching`, `--hip_use_streams=true`, and
        `--hip_allow_inline_execution=true`.
  - [ ] Golden times differ between sharktank and torch ports; confirm whether
        each change is expected, update values if needed, and preserve
        tolerance semantics where the old sharktank config had per-SKU
        tolerances.

- [ ] CI cleanup.

  Missing work: after coverage parity is demonstrated, remove the sharktank CI
  workflow and route all migrated model coverage through `pkgci_test_torch.yml`.
  Also update path triggers and workflow-summary dependencies so sharktank is no
  longer scheduled independently.

  References:
  [`pkgci_test_sharktank.yml`](https://github.com/iree-org/iree/blob/main/.github/workflows/pkgci_test_sharktank.yml),
  [`pkgci_test_torch.yml`](https://github.com/iree-org/iree/blob/main/.github/workflows/pkgci_test_torch.yml),
  [`pkgci.yml`](https://github.com/iree-org/iree/blob/main/.github/workflows/pkgci.yml),
  [`configure_ci.py`](https://github.com/iree-org/iree/blob/main/build_tools/github_actions/configure_ci.py).

## Notes

There do not appear to be explicit `*mismatch*` config files under
`tests/external/iree-test-suites`, but there are behavioral mismatches in
existing apparent ports. These should be treated as migration blockers unless
they are intentionally accepted and documented.
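
The flag and threshold mismatches listed in the audit item could be surfaced mechanically. The following is a hedged sketch: `split_flag` and `diff_flags` are hypothetical helpers, and the sketch deliberately takes plain lists of flag strings, because the JSON keys that hold flags in the sharktank and torch configs are not described in this issue and would need to be looked up before wiring this to real files.

```python
from typing import Dict, List, Optional, Tuple


def split_flag(flag: str) -> Tuple[str, Optional[str]]:
    """Split '--name=value' into (name, value); value is None if there is no '='."""
    name, sep, value = flag.partition("=")
    return name, (value if sep else None)


def diff_flags(old_flags: List[str], new_flags: List[str]) -> Dict[str, object]:
    """Report flags that were removed, added, or changed between two flag lists.

    Assumption: each flag name appears at most once per list; if a name
    repeats, the last occurrence wins.
    """
    old = dict(split_flag(f) for f in old_flags)
    new = dict(split_flag(f) for f in new_flags)
    return {
        "removed": sorted(name for name in old if name not in new),
        "added": sorted(name for name in new if name not in old),
        "changed": {
            name: (old[name], new[name])
            for name in sorted(old)
            if name in new and old[name] != new[name]
        },
    }
```

For example, comparing `["--expected_f16_threshold=0.02f"]` with `["--expected_f16_threshold=0.4f"]` reports the threshold under `"changed"` with both values, which is the kind of entry that should be either fixed or documented before the sharktank config is removed.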

Migration of sharktank external tests to torch_models #24482
