Migration of sharktank external tests to torch_models #24482

@hanhanW

EPIC: Complete migration of sharktank external tests to torch_models

The sharktank external test suite should be fully ported to the newer
torch_models test suite format so we can retire
.github/workflows/pkgci_test_sharktank.yml and stop maintaining duplicate
model test infrastructure.

A local inspection shows that tests/external/iree-test-suites/sharktank_models
still has 33 JSON manifests, while torch_models has only partial coverage.
Ports exist for SDXL clip/vae/pUNet and Llama, but several sharktank tests are
still missing or are not equivalent.
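
The manifest count above can be re-checked with a short script. This is a
minimal sketch, assuming the manifests are the *.json files under the suite
directory and that grouping them by top-level subdirectory is a useful view.
The function name count_manifests is hypothetical and is not part of the
repository.

```python
from collections import Counter
from pathlib import Path


def count_manifests(root):
    """Count *.json files under root, grouped by top-level subdirectory.

    Files that sit directly in root are grouped under ".".
    """
    root = Path(root)
    counts = Counter()
    for path in root.rglob("*.json"):
        rel = path.relative_to(root)
        # rel.parts has more than one element only when the file is nested.
        group = rel.parts[0] if len(rel.parts) > 1 else "."
        counts[group] += 1
    return counts
```

Calling count_manifests("tests/external/iree-test-suites/sharktank_models")
from the repository root and summing the values should give the number of
manifests still to be accounted for; running it on torch_models gives a rough
view of what has already been ported.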

Goals

  • Port all remaining sharktank quality and benchmark coverage to
    tests/external/iree-test-suites/torch_models.
  • Preserve intended coverage, thresholds, benchmark flags, golden times,
    markers, and target/device behavior.
  • Resolve or explicitly document any intentional behavior changes between
    sharktank and torch configs.
  • Remove sharktank CI only after torch coverage is equivalent.

Breakdown

  • Port SD3 quality tests.

    Missing work: add a torch_models/sd3 area with module configs for CLIP,
    MMDiT, and VAE, then add CPU and ROCm quality configs preserving the
    original MLIR URLs, weights, input/output files, thresholds, compiler flags,
    and run functions.

    References:
    clip_cpu.json,
    clip_rocm.json,
    mmdit_cpu.json,
    mmdit_rocm.json,
    vae_cpu.json,
    vae_rocm.json.

  • Port SDXL scheduler compile-only tests.

    Missing work: add torch-model equivalents for the scheduler compile-only
    coverage. These do not run model quality checks; they validate that the
    scheduler MLIR compiles for CPU and ROCm with the same target/device and
    preprocessing flags.

    References:
    scheduler_cpu.json,
    scheduler_rocm.json.

  • Port SDXL UNet tests.

    Missing work: add torch-model module and test configs for the scheduled UNet
    quality cases, the 960x1024 UNet quality cases, and the ROCm benchmark. The
    scheduled UNet configs also reference a pipeline module, so the port needs to
    preserve that multi-module behavior instead of only compiling the standalone
    model.

    References:
    unet_fp16_cpu.json,
    unet_fp16_rocm.json,
    unet_fp16_960_1024_cpu.json,
    unet_fp16_960_1024_rocm.json,
    benchmarks/sdxl/unet_fp16_rocm.json.

  • Port SDXL end-to-end benchmark.

    Missing work: add a torch-model benchmark for the full SDXL pipeline that
    compiles and runs the multi-module pipeline (sdxl_clip, sdxl_unet_fp16,
    sdxl_vae) via tokens_to_image. Preserve the pipeline MLIR, compile flags,
    benchmark flags, and per-SKU golden timing expectations.

    Reference:
    benchmarks/sdxl/e2e_rocm.json.

  • Resolve SDXL pUNet fp8 coverage.

    Missing work: port the fp8 pUNet quality and benchmark coverage, or document
    why it is intentionally replaced. The existing torch punet_gfx942_v2 config
    does not appear equivalent: it uses different MLIR, different input arity,
    run_forward instead of main, and different compiler/preprocessing flags.

    References:
    punet_int8_fp8_rocm.json,
    benchmarks/sdxl/punet_int8_fp8_rocm.json,
    existing torch candidate
    punet_gfx942_v2.json.

  • Resolve Llama f16 data-tiling coverage.

    Missing work: add f16 data-tiling module/test configs for both quality and
    benchmark coverage. The existing torch data-tiling configs are under
    llama_8b_fp8, so they do not replace the sharktank f16 data-tiling cases.

    References:
    8b_f16_decode_data_tiling_rocm.json,
    8b_f16_prefill_data_tiling_rocm.json,
    benchmarks/llama/8b_f16_decode_data_tiling_rocm.json,
    benchmarks/llama/8b_f16_prefill_data_tiling_rocm.json.

  • Resolve Llama per-function quality coverage.

    Missing work: decide whether the torch test_greedy_decoder quality test is
    intended to replace the sharktank per-function decode_bs4 and prefill_bs4
    quality tests. If not, add separate torch quality configs using the original
    inputs and functions.

    References:
    8b_f16_decode_rocm.json,
    8b_f16_prefill_rocm.json,
    existing torch quality config
    quality_gfx942.json.

  • Audit apparent ports for mismatches.

    Missing work: for each apparent sharktank-to-torch port, either align the
    torch config with the sharktank behavior or document why the change is
    intentional.

    Known mismatches to resolve:

    • SDXL VAE CPU threshold changed from --expected_f16_threshold=0.02f
      to 0.4f.
      References:
      sharktank vae_cpu.json,
      torch vae_quality_cpu.json.
    • Llama decode input differs: sharktank uses 4x5xi64, torch seq128
      uses 4x4xi64.
      References:
      sharktank decode benchmark,
      torch decode seq128 benchmark.
    • Benchmark flags differ in several ports, especially
      --device_allocator=caching, --hip_use_streams=true, and
      --hip_allow_inline_execution=true.
    • Golden times differ between sharktank and torch ports; confirm whether
      each change is expected, update values if needed, and preserve
      tolerance semantics where the old sharktank config had per-SKU
      tolerances.

  • CI cleanup.

    Missing work: after coverage parity is demonstrated, remove the sharktank CI
    workflow and route all migrated model coverage through pkgci_test_torch.yml.
    Also update path triggers and workflow-summary dependencies so sharktank is no
    longer scheduled independently.

    References:
    pkgci_test_sharktank.yml,
    pkgci_test_torch.yml,
    pkgci.yml,
    configure_ci.py.
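
The flag and threshold mismatches listed under "Audit apparent ports for
mismatches" could be surfaced mechanically by diffing the flag lists of a
sharktank config and its torch counterpart. The sketch below assumes flags are
available as lists of "--name=value" strings. How they are stored in each
manifest is not specified here, so extracting them is left out. The helper
names parse_flags and diff_flags and the two sample lists are illustrative
only; the threshold values are the ones quoted for SDXL VAE CPU.

```python
def parse_flags(flags):
    """Map each '--name=value' string to name -> value (None for bare flags)."""
    parsed = {}
    for flag in flags:
        name, sep, value = flag.partition("=")
        parsed[name] = value if sep else None
    return parsed


def diff_flags(old_flags, new_flags):
    """Report flags removed, added, or changed between two flag lists."""
    old, new = parse_flags(old_flags), parse_flags(new_flags)
    return {
        "removed": sorted(set(old) - set(new)),
        "added": sorted(set(new) - set(old)),
        "changed": {
            name: (old[name], new[name])
            for name in sorted(set(old) & set(new))
            if old[name] != new[name]
        },
    }


# Illustrative inputs, not copied from the actual config files.
sharktank_flags = ["--expected_f16_threshold=0.02f", "--device_allocator=caching"]
torch_flags = ["--expected_f16_threshold=0.4f"]

report = diff_flags(sharktank_flags, torch_flags)
print(report)
```

Any non-empty "removed", "added", or "changed" entry is a candidate mismatch
that needs either a fix in the torch config or a documented reason.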

Notes

There do not appear to be explicit *mismatch* config files under
tests/external/iree-test-suites, but there are behavioral mismatches in
existing apparent ports. These should be treated as migration blockers unless
they are intentionally accepted and documented.
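
The blocker rule above can be stated as a small check. This is a sketch under
assumptions: the status table, its keys ("equivalent", "documented_reason"),
and the manifest names are all hypothetical placeholders, not data from the
repository.

```python
def find_blockers(port_status):
    """Return manifests that are neither equivalent nor documented as accepted."""
    blockers = []
    for manifest, status in sorted(port_status.items()):
        if status["equivalent"]:
            continue  # ported with matching behavior
        if status.get("documented_reason"):
            continue  # behavior change accepted and written down
        blockers.append(manifest)
    return blockers


# Placeholder entries for illustration only.
port_status = {
    "ported_ok.json": {"equivalent": True},
    "changed_documented.json": {
        "equivalent": False,
        "documented_reason": "behavior change reviewed and accepted",
    },
    "changed_undocumented.json": {"equivalent": False},
    "not_ported.json": {"equivalent": False, "documented_reason": ""},
}

blockers = find_blockers(port_status)
print(blockers)
```

Under this rule the sharktank workflow would only be removed once the list of
blockers is empty.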
