Skip to content

Commit 69c8bab

Browse files
authored
perf(inspect): skip redundant Auto* calls for text-only models (#746)
## Summary Two extra Strategy-2 gating fixes on top of #719. Profile on `cardiffnlp/twitter-roberta-base-sentiment-latest` (text-only) showed the inspect command was still taking ~16 s warm — most of it spent on Auto* calls that didn't need to run. Targeting **`release/v0.1.0`** since #717 / #718 / #719 are already on that release; this is the natural follow-up. ## Profile (warm cache, `cardiffnlp/twitter-roberta-base-sentiment-latest`) **Before this PR**: ``` [0] AutoConfig 0.74s (parent_hf_config — already deduped by #719) [1] AutoProcessor 4.27s returns RobertaTokenizerFast [7] AutoTokenizer 2.22s ← redundant, AutoProcessor already returned the tokenizer [11] AutoImageProcessor 1.39s FAIL (text model has no preprocessor_config.json) [12] AutoFeatureExtractor 0.64s FAIL (same) ───── ~16 s total ``` **After this PR**: ``` AutoConfig 5 calls (vs 8) AutoProcessor 1 call AutoTokenizer 1 call (vs 2 — the redundant standalone load is gone) AutoImageProcessor 0 calls (skipped — no preprocessor_config.json) AutoFeatureExtractor 0 calls (skipped — same) ───── ~12 s total (~25% faster) ``` ## Change ### 1. Detect when `AutoProcessor` returns a leaf class For single-modality models, `AutoProcessor.from_pretrained` returns the leaf class directly — e.g. RoBERTa → `RobertaTokenizerFast`. Such a return has no `.tokenizer` wrapper attribute, so the old code couldn't populate `tokenizer_class` and fell through to a redundant `AutoTokenizer.from_pretrained` (~2 s warm). Pattern-match the returned class name (`*Tokenizer` / `*TokenizerFast`, `*ImageProcessor` / `*ImageProcessorFast`, `*FeatureExtractor`) and populate the corresponding field. The `.tokenizer` / `.image_processor` / `.feature_extractor` attribute path still wins for genuine multimodal `ProcessorMixin` returns (CLIP, etc.) — see the `test_autoprocessor_with_wrapped_pieces_uses_attributes` regression test. ### 2. `preprocessor_config.json` absence is authoritative `_resolve_processor_from_hub_configs` already tries to download `preprocessor_config.json`. When the hub returns 404, the model has *no* image processor or feature extractor, period. Surface this as a `has_preprocessor_config` bool from the helper so the caller can skip the `AutoImageProcessor` / `AutoFeatureExtractor` round-trips (~2 s total wasted confirming 404s). ## Tests `tests/unit/inspect/test_resolve_processor_gating.py`: - `test_autoprocessor_returns_tokenizer_fills_tokenizer_class` — leaf-class detection populates `tokenizer_class` from class-name suffix and skips standalone `AutoTokenizer` - `test_autoprocessor_returns_image_processor_fills_image_class` — same for `*ImageProcessor` - `test_autoprocessor_returns_feature_extractor_fills_feature_class` — same for `*FeatureExtractor` - `test_autoprocessor_with_wrapped_pieces_uses_attributes` — multimodal `ProcessorMixin` with `.tokenizer` attribute wins over name suffix - `test_missing_preprocessor_config_skips_image_and_feature` — `has_preprocessor_config=False` skips `AutoImageProcessor` / `AutoFeatureExtractor` 55 targeted tests pass.
1 parent 90b9cec commit 69c8bab

2 files changed

Lines changed: 323 additions & 44 deletions

File tree

src/winml/modelkit/inspect/resolver.py

Lines changed: 98 additions & 39 deletions
Original file line numberDiff line numberDiff line change
@@ -10,7 +10,7 @@
1010
from __future__ import annotations
1111

1212
import logging
13-
from typing import TYPE_CHECKING
13+
from typing import TYPE_CHECKING, NamedTuple
1414

1515
from ..loader.task import (
1616
HF_TASK_DEFAULTS,
@@ -869,31 +869,37 @@ def resolve_processor(
869869
# This is fast and doesn't require downloading/instantiating processors
870870
# NOTE: These JSON keys (processor_class, image_processor_type, etc.) are
871871
# standard HuggingFace config conventions, not model-specific hardcoding.
872+
has_preprocessor_config = True
872873
try:
873-
hub_proc, hub_tok, hub_img, hub_fe = _resolve_processor_from_hub_configs(model_id)
874-
if hub_proc and processor_class is None:
875-
processor_class = hub_proc
874+
hub_result = _resolve_processor_from_hub_configs(model_id)
875+
if hub_result.processor_class and processor_class is None:
876+
processor_class = hub_result.processor_class
876877
processor_source = "hub_config"
877-
if hub_tok and tokenizer_class is None:
878-
tokenizer_class = hub_tok
878+
if hub_result.tokenizer_class and tokenizer_class is None:
879+
tokenizer_class = hub_result.tokenizer_class
879880
tokenizer_source = "hub_config"
880-
if hub_img and image_processor_class is None:
881-
image_processor_class = hub_img
881+
if hub_result.image_processor_class and image_processor_class is None:
882+
image_processor_class = hub_result.image_processor_class
882883
image_processor_source = "hub_config"
883-
if hub_fe and feature_extractor_class is None:
884-
feature_extractor_class = hub_fe
884+
if hub_result.feature_extractor_class and feature_extractor_class is None:
885+
feature_extractor_class = hub_result.feature_extractor_class
885886
feature_extractor_source = "hub_config"
887+
has_preprocessor_config = hub_result.has_preprocessor_config
886888
except Exception as e:
887889
logger.debug("Failed to resolve processors from hub configs: %s", e)
888890

889891
# Strategy 2: Use Auto classes to fill in any missing information.
890892
# Skip entirely when Strategies 0 + 1 already populated every field —
891893
# each Auto* instantiation does its own HF Hub I/O plus class init
892894
# (AutoProcessor and AutoFeatureExtractor are several seconds each).
895+
#
896+
# When ``preprocessor_config.json`` is missing on the hub, the model
897+
# has neither an image processor nor a feature extractor; skip those
898+
# two Auto* round-trips (they would each spend ~1s confirming a 404).
893899
need_processor = processor_class is None
894900
need_tokenizer = tokenizer_class is None
895-
need_image_processor = image_processor_class is None
896-
need_feature_extractor = feature_extractor_class is None
901+
need_image_processor = image_processor_class is None and has_preprocessor_config
902+
need_feature_extractor = feature_extractor_class is None and has_preprocessor_config
897903

898904
if need_processor or need_tokenizer or need_image_processor or need_feature_extractor:
899905
try:
@@ -938,9 +944,21 @@ def resolve_processor(
938944
)
939945

940946

941-
def _resolve_processor_from_hub_configs(
942-
model_id: str,
943-
) -> tuple[str | None, str | None, str | None, str | None]:
947+
class _HubConfigResult(NamedTuple):
948+
"""Result of ``_resolve_processor_from_hub_configs``.
949+
950+
A NamedTuple rather than a plain tuple so the trailing boolean cannot be
951+
silently swapped with the four ``str | None`` fields at the call site.
952+
"""
953+
954+
processor_class: str | None
955+
tokenizer_class: str | None
956+
image_processor_class: str | None
957+
feature_extractor_class: str | None
958+
has_preprocessor_config: bool
959+
960+
961+
def _resolve_processor_from_hub_configs(model_id: str) -> _HubConfigResult:
944962
"""Resolve processor classes by fetching config files from HuggingFace Hub.
945963
946964
This approach is fast because it only downloads small JSON config files,
@@ -950,7 +968,12 @@ def _resolve_processor_from_hub_configs(
950968
model_id: HuggingFace model identifier
951969
952970
Returns:
953-
Tuple of (processor_class, tokenizer_class, image_processor_class, feature_extractor_class)
971+
A ``_HubConfigResult`` whose ``has_preprocessor_config`` reports
972+
whether ``preprocessor_config.json`` actually exists on the hub —
973+
the authoritative signal that the model has no image processor or
974+
feature extractor, so the caller can skip the corresponding
975+
``AutoImageProcessor`` / ``AutoFeatureExtractor`` round-trips
976+
(which would each spend ~1s confirming a 404 on text-only models).
954977
"""
955978
import json
956979
from pathlib import Path
@@ -962,6 +985,7 @@ def _resolve_processor_from_hub_configs(
962985
tokenizer_class: str | None = None
963986
image_processor_class: str | None = None
964987
feature_extractor_class: str | None = None
988+
has_preprocessor_config = False
965989

966990
# Try to download and parse preprocessor_config.json
967991
# This file contains image_processor_type or processor_class
@@ -970,6 +994,11 @@ def _resolve_processor_from_hub_configs(
970994
repo_id=model_id,
971995
filename="preprocessor_config.json",
972996
)
997+
# Set the flag as soon as the file exists on the hub, *before* parsing.
998+
# A corrupt JSON is still proof that the model ships preprocessor
999+
# config — fall back to Auto* lookups rather than declaring the model
1000+
# text-only and silently dropping its image/feature processor.
1001+
has_preprocessor_config = True
9731002
with Path(preprocessor_config_path).open(encoding="utf-8") as f:
9741003
preprocessor_config = json.load(f)
9751004

@@ -1009,7 +1038,34 @@ def _resolve_processor_from_hub_configs(
10091038
except json.JSONDecodeError as e:
10101039
logger.debug("Failed to parse tokenizer_config.json for %s: %s", model_id, e)
10111040

1012-
return processor_class, tokenizer_class, image_processor_class, feature_extractor_class
1041+
return _HubConfigResult(
1042+
processor_class=processor_class,
1043+
tokenizer_class=tokenizer_class,
1044+
image_processor_class=image_processor_class,
1045+
feature_extractor_class=feature_extractor_class,
1046+
has_preprocessor_config=has_preprocessor_config,
1047+
)
1048+
1049+
1050+
def _is_tokenizer_class_name(name: str) -> bool:
1051+
"""Heuristic: does this transformers class name look like a tokenizer?
1052+
1053+
Tokenizer classes follow the ``*Tokenizer`` / ``*TokenizerFast`` naming
1054+
convention (e.g. ``RobertaTokenizer``, ``BertTokenizerFast``). Used to
1055+
detect when ``AutoProcessor.from_pretrained`` returned a leaf tokenizer
1056+
rather than a multimodal ``ProcessorMixin`` wrapper.
1057+
"""
1058+
return name.endswith(("Tokenizer", "TokenizerFast"))
1059+
1060+
1061+
def _is_image_processor_class_name(name: str) -> bool:
1062+
"""Heuristic: does this transformers class name look like an image processor?"""
1063+
return name.endswith(("ImageProcessor", "ImageProcessorFast"))
1064+
1065+
1066+
def _is_feature_extractor_class_name(name: str) -> bool:
1067+
"""Heuristic: does this transformers class name look like a feature extractor?"""
1068+
return name.endswith("FeatureExtractor")
10131069

10141070

10151071
def _resolve_processor_from_auto_classes(
@@ -1058,28 +1114,31 @@ def _resolve_processor_from_auto_classes(
10581114
processor = AutoProcessor.from_pretrained(model_id, use_fast=True)
10591115
processor_class = type(processor).__name__
10601116

1061-
# AutoProcessor may wrap tokenizer and image_processor
1062-
if (
1063-
try_tokenizer
1064-
and hasattr(processor, "tokenizer")
1065-
and processor.tokenizer is not None
1066-
):
1067-
tokenizer_class = type(processor.tokenizer).__name__
1068-
1069-
if (
1070-
try_image_processor
1071-
and hasattr(processor, "image_processor")
1072-
and processor.image_processor is not None
1073-
):
1074-
image_processor_class = type(processor.image_processor).__name__
1075-
1076-
# Some older models use feature_extractor instead of image_processor
1077-
if (
1078-
try_feature_extractor
1079-
and hasattr(processor, "feature_extractor")
1080-
and processor.feature_extractor is not None
1081-
):
1082-
feature_extractor_class = type(processor.feature_extractor).__name__
1117+
# AutoProcessor may wrap tokenizer / image_processor / feature_extractor
1118+
# as a multimodal `ProcessorMixin`. For single-modality models it
1119+
# often returns the leaf class directly (e.g. RoBERTa →
1120+
# `RobertaTokenizerFast`), which has none of those attributes.
1121+
# Pattern-match the returned class name so the standalone Auto*
1122+
# calls below can be skipped — otherwise we pay for a second,
1123+
# redundant load (~2s for AutoTokenizer on warm cache).
1124+
wrapped_tokenizer = getattr(processor, "tokenizer", None)
1125+
wrapped_image_processor = getattr(processor, "image_processor", None)
1126+
wrapped_feature_extractor = getattr(processor, "feature_extractor", None)
1127+
1128+
if try_tokenizer and wrapped_tokenizer is not None:
1129+
tokenizer_class = type(wrapped_tokenizer).__name__
1130+
elif try_tokenizer and _is_tokenizer_class_name(processor_class):
1131+
tokenizer_class = processor_class
1132+
1133+
if try_image_processor and wrapped_image_processor is not None:
1134+
image_processor_class = type(wrapped_image_processor).__name__
1135+
elif try_image_processor and _is_image_processor_class_name(processor_class):
1136+
image_processor_class = processor_class
1137+
1138+
if try_feature_extractor and wrapped_feature_extractor is not None:
1139+
feature_extractor_class = type(wrapped_feature_extractor).__name__
1140+
elif try_feature_extractor and _is_feature_extractor_class_name(processor_class):
1141+
feature_extractor_class = processor_class
10831142

10841143
except Exception as e:
10851144
logger.debug("AutoProcessor failed for %s: %s", model_id, e)

0 commit comments

Comments
 (0)