
Conversation

@ajcasagrande (Contributor) commented Oct 24, 2025

This PR attempts to automatically detect the --custom-dataset-type based on the file data provided when using --input-file. This more closely matches what GenAI-Perf did, while still supporting the ability to manually override.

This is done in a decoupled manner by having each CustomLoader provide a function that determines whether it can load the given file.

Also, CustomLoaders can now describe their preferred --dataset-sampling-strategy, making the code even more decoupled!
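
For reference, a minimal sketch of what these two loader hooks could look like; the class below is hypothetical, and the real signatures live in src/aiperf/common/protocols.py:

import json
from pathlib import Path
from typing import Any


class MyTraceLoader:
    """Hypothetical loader illustrating the two capability hooks."""

    @classmethod
    def can_load(
        cls, data: dict[str, Any] | None = None, filename: str | Path | None = None
    ) -> bool:
        # Structural sniff of a sample record; loaders may also inspect the path.
        return data is not None and "input_length" in data

    @classmethod
    def get_preferred_sampling_strategy(cls) -> str:
        # The real loaders return a DatasetSamplingStrategy enum member.
        return "sequential"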

Summary by CodeRabbit

  • New Features

    • Added automatic detection and classification of custom dataset formats for improved ease of use.
    • Extended dataset support to include video media alongside text, images, and audio.
  • Bug Fixes

    • Improved dataset sampling strategy initialization and configuration handling.
  • Tests

    • Added comprehensive test coverage for dataset loader detection and type inference.

codecov bot commented Oct 24, 2025

Codecov Report

❌ Patch coverage is 92.72727% with 8 lines in your changes missing coverage. Please review.

Files with missing lines                 Patch %   Lines
src/aiperf/dataset/composer/custom.py    86.00%    6 Missing and 1 partial ⚠️
src/aiperf/dataset/dataset_manager.py    0.00%     0 Missing and 1 partial ⚠️


coderabbitai bot commented Oct 24, 2025

Walkthrough

This PR refactors the dataset loader architecture. It relocates CustomDatasetLoaderProtocol from aiperf.dataset.loader.protocol to aiperf.common.protocols with extended capabilities, removes automatic dataset_sampling_strategy defaulting from InputConfig validation, adds loader capability detection via can_load and get_preferred_sampling_strategy methods across all dataset loaders, implements automatic dataset type inference and sampling-strategy auto-configuration in CustomDatasetComposer, and introduces video media support across loaders. Comprehensive tests validate the new loader detection and type inference mechanisms.

Changes

Cohort / File(s) Summary
Protocol Architecture
src/aiperf/common/protocols.py, src/aiperf/dataset/loader/protocol.py
Moved and redefined CustomDatasetLoaderProtocol from loader.protocol to common.protocols with new methods (can_load, get_preferred_sampling_strategy); removed original protocol definition from loader module.
Package Exports
src/aiperf/dataset/__init__.py, src/aiperf/dataset/loader/__init__.py, src/aiperf/dataset/composer/__init__.py
Removed CustomDatasetLoaderProtocol exports from dataset and loader packages; updated __ignore__ list in composer package to include "logger".
Loader Implementations
src/aiperf/dataset/loader/mooncake_trace.py, src/aiperf/dataset/loader/multi_turn.py, src/aiperf/dataset/loader/random_pool.py, src/aiperf/dataset/loader/single_turn.py
Added can_load and get_preferred_sampling_strategy classmethods to all loaders; extended Turn model with videos field support in conversion logic across multi_turn, random_pool, and single_turn loaders.
Composer and Dataset Logic
src/aiperf/dataset/composer/custom.py, src/aiperf/dataset/composer/synthetic.py, src/aiperf/dataset/dataset_manager.py
Added dataset type inference (_infer_dataset_type, _infer_type) and sampling strategy auto-configuration (_set_sampling_strategy) in custom composer; added fallback sampling strategy assignment in synthetic composer; extended dataset manager to trigger CUSTOM composer when input file is provided.
Configuration
src/aiperf/common/config/input_config.py
Removed validate_dataset_sampling_strategy model validator that previously defaulted sampling strategy based on custom dataset type.
Tests
tests/composers/test_custom_composer.py, tests/loaders/test_can_load.py
Updated patch decorators in custom composer tests; added TestSamplingStrategy tests for sampling strategy initialization; introduced comprehensive test suite validating can_load detection logic, type inference, and priority rules across all loaders.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

  • Key areas requiring attention:
    • Protocol relocation and interface changes (src/aiperf/common/protocols.py) — validate correct method signatures and TYPE_CHECKING imports
    • Type inference logic in custom.py (_infer_dataset_type, _infer_type) — examine fallback chains and error handling for edge cases
    • Video media support across loaders — verify consistent Turn construction and media handling in all loader implementations (multi_turn, random_pool, single_turn)
    • Sampling strategy precedence — ensure explicit strategies are preserved and inferred defaults correctly apply via _set_sampling_strategy (see the sketch after this list)
    • Removal of InputConfig validator — confirm no orphaned downstream logic depends on automatic defaulting behavior
    • New comprehensive test suite — validate test coverage for loader detection priority and type inference pathways
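
A minimal sketch of the precedence rule from the sampling-strategy bullet above (attribute names follow this PR; the standalone helper is illustrative):

def set_sampling_strategy(config, loader_cls) -> None:
    # An explicitly configured strategy always wins; only a None value
    # is replaced by the loader's preferred default.
    if config.input.dataset_sampling_strategy is not None:
        return
    config.input.dataset_sampling_strategy = loader_cls.get_preferred_sampling_strategy()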

Poem

🐰 Protocols dance in common ground,
Loaders leap with type-detect sound,
Videos spin, strategies inferred,
Dataset paths and tricks preferred,
Tests bloom bright, ambiguity deferred! 🎯

Pre-merge checks

✅ Passed checks (3 passed)
Check name Status Explanation
Title Check ✅ Passed The title "feat: auto detect custom dataset type based on file info" directly addresses the primary objective of the pull request, which is to add automatic detection of the --custom-dataset-type based on file data provided via --input-file. The raw summary confirms this is the main focus, with multiple changes to implement type inference logic in the custom composer and the addition of can_load methods across dataset loaders to enable format detection. The title is concise, clear, and specific enough for a teammate scanning the commit history to understand that this PR implements auto-detection capability for custom dataset types. While the PR also addresses sampling strategy configuration, the title appropriately focuses on the primary feature without attempting to cover all details.
Docstring Coverage ✅ Passed Docstring coverage is 82.14% which is sufficient. The required threshold is 80.00%.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.


coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/aiperf/dataset/__init__.py (1)

60-97: Update import statement in aiperf/common/factories.py to use the new protocol location.

The removal of CustomDatasetLoaderProtocol from the dataset module exports is incomplete. The import at line 66 in src/aiperf/common/factories.py still references the old location and will fail:

# Current (broken):
from aiperf.dataset import (
    CustomDatasetLoaderProtocol,
)

# Should be:
from aiperf.common.protocols import (
    CustomDatasetLoaderProtocol,
)

The protocol exists only in aiperf.common.protocols.py but factories.py attempts to import it from aiperf.dataset, which no longer exports it.

🧹 Nitpick comments (12)
tests/composers/test_custom_composer.py (1)

136-178: Sampling strategy auto-configuration tests—LGTM; add test for type persistence

Great coverage for strategy selection and non-override. Please add a test to assert that auto-inferred dataset_type is written back to config.input.custom_dataset_type, as downstream logic (e.g., get_effective_request_count, fixed schedule) depends on it.

Example:

@patch("aiperf.dataset.composer.custom.check_file_exists")
@patch("builtins.open", mock_open(read_data='{"input_length": 10, "output_length": 1}\n')))
def test_inferred_type_persists_in_config(mock_check_file, trace_config, mock_tokenizer):
    trace_config.input.custom_dataset_type = None
    composer = CustomDatasetComposer(trace_config, mock_tokenizer)
    composer.create_dataset()
    assert trace_config.input.custom_dataset_type == CustomDatasetType.MOONCAKE_TRACE
src/aiperf/dataset/composer/custom.py (2)

57-89: Make JSONL inference more robust (optional)

Consider scanning the first N non-empty lines and specifying UTF-8 to tolerate BOMs and stray lines; fall back to path-only detection if none parse.

-            with open(file_path) as f:
-                for line in f:
+            with open(file_path, encoding="utf-8") as f:
+                checked = 0
+                for line in f:
                     if not (line := line.strip()):
                         continue
-                    data = load_json_str(line)
-                    return CustomDatasetComposer._infer_type(
-                        data=data, filename=file_path
-                    )
+                    try:
+                        data = load_json_str(line)
+                        inferred = CustomDatasetComposer._infer_type(data=data, filename=file_path)
+                        if inferred is not None:
+                            return inferred
+                    except Exception:
+                        # Ignore parse errors and keep probing a few lines
+                        pass
+                    checked += 1
+                    if checked >= 10:
+                        break
+            # Fall back to filename-only structural detection
+            return CustomDatasetComposer._infer_type(data=None, filename=file_path)

107-118: Avoid referencing data['type'] in except and narrow exception

Tiny logging hardening: only catch ValueError here and avoid possible KeyError in the log.

-        if data is not None and "type" in data:
-            try:
-                # Try to convert the type string to enum
-                explicit_type = CustomDatasetType(data["type"])
-                logger.info(f"Using explicit type field: {explicit_type}")
-                return explicit_type
-            except (ValueError, KeyError):
-                logger.info(
-                    f"Invalid type field value: {data['type']}, falling back to structural detection"
-                )
+        if data is not None and "type" in data:
+            try:
+                explicit_type = CustomDatasetType(data["type"])
+                logger.info(f"Using explicit type field: {explicit_type}")
+                return explicit_type
+            except ValueError:
+                logger.info(
+                    f"Invalid type field value: {repr(data.get('type'))}, falling back to structural detection"
+                )
src/aiperf/dataset/loader/random_pool.py (4)

76-93: Silence Ruff ARG003 for unused data while preserving the protocol signature.

data is intentionally unused (path-based detection only). Reference it to satisfy lint without breaking keyword callers.

 @classmethod
 def can_load(
     cls, data: dict[str, Any] | None = None, filename: str | Path | None = None
 ) -> bool:
     """Check if this loader can handle the given data format.
@@
-        if filename is not None:
+        # Intentionally unused: detection here is path-based only; keep param name for keyword callers.
+        del data
+        if filename is not None:
             path = Path(filename) if isinstance(filename, str) else filename
             if path.is_dir():
                 return True

203-217: Broaden _merge_turns input type to match zip tuple and improve type clarity.

zip yields a tuple, but the signature requires list[Turn]. Accepting Iterable[Turn] avoids mismatches.

-from typing import Any, TypeAlias
+from collections.abc import Iterable
+from typing import Any, TypeAlias
@@
-    def _merge_turns(self, turns: list[Turn]) -> Turn:
+    def _merge_turns(self, turns: Iterable[Turn]) -> Turn:

129-137: Specify file encoding for deterministic I/O.

Use UTF-8 to avoid platform-dependent defaults and decoding issues.

-        with open(file_path) as f:
+        with open(file_path, encoding="utf-8") as f:
             for line in f:
                 if (line := line.strip()) == "":
                     continue  # Skip empty lines

150-157: Handle empty directories explicitly.

If no files are discovered, convert_to_conversations returns empty-turn conversations. Consider failing fast with a clear error.

Example guard (in _load_dataset_from_dir or before conversion):

         for file_path in dir_path.iterdir():
             if file_path.is_file():
                 dataset_pool = self._load_dataset_from_file(file_path)
                 data[file_path.name].extend(dataset_pool)

-        return data
+        if not data:
+            raise ValueError(f"No dataset files found in directory: {dir_path}")
+        return data
src/aiperf/dataset/loader/multi_turn.py (3)

98-110: Silence Ruff ARG003 for unused filename while preserving the signature.

Reference filename to satisfy lint without behavioral change.

     def can_load(
         cls, data: dict[str, Any] | None = None, filename: str | Path | None = None
     ) -> bool:
         """Check if this loader can handle the given data format.
 
         MultiTurn format has a "turns" field containing a list of turns.
         """
-        if data is None:
+        # `filename` is unused here; keep param for protocol compatibility.
+        del filename
+        if data is None:
             return False
 
         return "turns" in data and isinstance(data.get("turns"), list)

104-110: Optionally validate element shape for stricter structural detection.

Guard that turns is a list of dict-like items to reduce false positives early.

-        return "turns" in data and isinstance(data.get("turns"), list)
+        turns = data.get("turns")
+        return isinstance(turns, list) and all(isinstance(t, dict) for t in turns)

127-136: Specify file encoding for deterministic I/O.

Use UTF-8 for consistent behavior across platforms.

-        with open(self.filename) as f:
+        with open(self.filename, encoding="utf-8") as f:
             for line in f:
                 if (line := line.strip()) == "":
                     continue  # Skip empty lines
tests/loaders/test_can_load.py (2)

56-72: Add a negative case: turns list containing non-dict elements.

Verifies stricter shape if you adopt the optional check in MultiTurn.

         [
             param({"turns": [{"text": "Turn 1"}, {"text": "Turn 2"}]}, True, id="turns_list"),
+            param({"turns": ["not-a-dict"]}, False, id="turns_list_with_non_dict"),

186-188: Nit: docstring method name.

Docstring says infer_dataset_type() but the called API is _infer_dataset_type(). Consider aligning.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 282f61e and 84266da.

📒 Files selected for processing (15)
  • src/aiperf/common/config/input_config.py (0 hunks)
  • src/aiperf/common/protocols.py (3 hunks)
  • src/aiperf/dataset/__init__.py (1 hunks)
  • src/aiperf/dataset/composer/__init__.py (1 hunks)
  • src/aiperf/dataset/composer/custom.py (2 hunks)
  • src/aiperf/dataset/composer/synthetic.py (2 hunks)
  • src/aiperf/dataset/dataset_manager.py (1 hunks)
  • src/aiperf/dataset/loader/__init__.py (0 hunks)
  • src/aiperf/dataset/loader/mooncake_trace.py (2 hunks)
  • src/aiperf/dataset/loader/multi_turn.py (3 hunks)
  • src/aiperf/dataset/loader/protocol.py (0 hunks)
  • src/aiperf/dataset/loader/random_pool.py (4 hunks)
  • src/aiperf/dataset/loader/single_turn.py (3 hunks)
  • tests/composers/test_custom_composer.py (6 hunks)
  • tests/loaders/test_can_load.py (1 hunks)
💤 Files with no reviewable changes (3)
  • src/aiperf/common/config/input_config.py
  • src/aiperf/dataset/loader/__init__.py
  • src/aiperf/dataset/loader/protocol.py
🧰 Additional context used
🧬 Code graph analysis (9)
src/aiperf/dataset/loader/single_turn.py (5)
src/aiperf/common/enums/dataset_enums.py (2)
  • CustomDatasetType (17-21)
  • DatasetSamplingStrategy (50-59)
src/aiperf/common/protocols.py (2)
  • can_load (328-341)
  • get_preferred_sampling_strategy (344-350)
src/aiperf/dataset/loader/mooncake_trace.py (2)
  • can_load (57-68)
  • get_preferred_sampling_strategy (71-73)
src/aiperf/dataset/loader/multi_turn.py (2)
  • can_load (99-109)
  • get_preferred_sampling_strategy (112-114)
src/aiperf/dataset/loader/random_pool.py (2)
  • can_load (77-93)
  • get_preferred_sampling_strategy (96-98)
src/aiperf/dataset/composer/custom.py (8)
src/aiperf/common/enums/dataset_enums.py (1)
  • CustomDatasetType (17-21)
src/aiperf/common/factories.py (2)
  • get_all_classes_and_types (249-253)
  • get_class_from_type (216-232)
src/aiperf/common/protocols.py (6)
  • info (69-69)
  • load_dataset (352-352)
  • convert_to_conversations (354-356)
  • exception (74-74)
  • can_load (328-341)
  • get_preferred_sampling_strategy (344-350)
src/aiperf/common/utils.py (1)
  • load_json_str (79-98)
src/aiperf/dataset/loader/mooncake_trace.py (4)
  • load_dataset (75-108)
  • convert_to_conversations (115-149)
  • can_load (57-68)
  • get_preferred_sampling_strategy (71-73)
src/aiperf/dataset/loader/multi_turn.py (4)
  • load_dataset (116-136)
  • convert_to_conversations (138-169)
  • can_load (99-109)
  • get_preferred_sampling_strategy (112-114)
src/aiperf/dataset/loader/random_pool.py (4)
  • load_dataset (100-116)
  • convert_to_conversations (159-201)
  • can_load (77-93)
  • get_preferred_sampling_strategy (96-98)
src/aiperf/dataset/loader/single_turn.py (4)
  • load_dataset (103-123)
  • convert_to_conversations (125-153)
  • can_load (73-96)
  • get_preferred_sampling_strategy (99-101)
src/aiperf/dataset/loader/mooncake_trace.py (5)
src/aiperf/common/enums/dataset_enums.py (2)
  • CustomDatasetType (17-21)
  • DatasetSamplingStrategy (50-59)
src/aiperf/common/protocols.py (2)
  • can_load (328-341)
  • get_preferred_sampling_strategy (344-350)
src/aiperf/dataset/loader/multi_turn.py (2)
  • can_load (99-109)
  • get_preferred_sampling_strategy (112-114)
src/aiperf/dataset/loader/random_pool.py (2)
  • can_load (77-93)
  • get_preferred_sampling_strategy (96-98)
src/aiperf/dataset/loader/single_turn.py (2)
  • can_load (73-96)
  • get_preferred_sampling_strategy (99-101)
tests/loaders/test_can_load.py (7)
src/aiperf/common/enums/dataset_enums.py (1)
  • CustomDatasetType (17-21)
src/aiperf/dataset/composer/custom.py (3)
  • CustomDatasetComposer (24-166)
  • _infer_type (91-128)
  • _infer_dataset_type (58-88)
src/aiperf/dataset/loader/mooncake_trace.py (2)
  • MooncakeTraceDatasetLoader (19-149)
  • can_load (57-68)
src/aiperf/dataset/loader/multi_turn.py (2)
  • MultiTurnDatasetLoader (17-169)
  • can_load (99-109)
src/aiperf/dataset/loader/random_pool.py (2)
  • RandomPoolDatasetLoader (21-218)
  • can_load (77-93)
src/aiperf/dataset/loader/single_turn.py (2)
  • SingleTurnDatasetLoader (17-153)
  • can_load (73-96)
tests/loaders/conftest.py (1)
  • create_jsonl_file (11-27)
src/aiperf/dataset/loader/random_pool.py (3)
src/aiperf/common/enums/dataset_enums.py (1)
  • DatasetSamplingStrategy (50-59)
src/aiperf/common/protocols.py (2)
  • can_load (328-341)
  • get_preferred_sampling_strategy (344-350)
src/aiperf/dataset/loader/single_turn.py (2)
  • can_load (73-96)
  • get_preferred_sampling_strategy (99-101)
tests/composers/test_custom_composer.py (3)
src/aiperf/common/enums/dataset_enums.py (2)
  • CustomDatasetType (17-21)
  • DatasetSamplingStrategy (50-59)
tests/composers/conftest.py (2)
  • custom_config (163-173)
  • mock_tokenizer (26-30)
src/aiperf/dataset/composer/custom.py (2)
  • CustomDatasetComposer (24-166)
  • _set_sampling_strategy (130-145)
src/aiperf/dataset/loader/multi_turn.py (5)
src/aiperf/common/enums/dataset_enums.py (2)
  • CustomDatasetType (17-21)
  • DatasetSamplingStrategy (50-59)
src/aiperf/common/protocols.py (2)
  • can_load (328-341)
  • get_preferred_sampling_strategy (344-350)
src/aiperf/dataset/loader/mooncake_trace.py (2)
  • can_load (57-68)
  • get_preferred_sampling_strategy (71-73)
src/aiperf/dataset/loader/random_pool.py (2)
  • can_load (77-93)
  • get_preferred_sampling_strategy (96-98)
src/aiperf/dataset/loader/single_turn.py (2)
  • can_load (73-96)
  • get_preferred_sampling_strategy (99-101)
src/aiperf/dataset/composer/synthetic.py (2)
src/aiperf/common/config/config_defaults.py (1)
  • InputDefaults (44-58)
src/aiperf/common/protocols.py (1)
  • info (69-69)
src/aiperf/common/protocols.py (6)
src/aiperf/common/enums/dataset_enums.py (1)
  • DatasetSamplingStrategy (50-59)
src/aiperf/dataset/loader/mooncake_trace.py (4)
  • can_load (57-68)
  • get_preferred_sampling_strategy (71-73)
  • load_dataset (75-108)
  • convert_to_conversations (115-149)
src/aiperf/dataset/loader/multi_turn.py (4)
  • can_load (99-109)
  • get_preferred_sampling_strategy (112-114)
  • load_dataset (116-136)
  • convert_to_conversations (138-169)
src/aiperf/dataset/loader/random_pool.py (4)
  • can_load (77-93)
  • get_preferred_sampling_strategy (96-98)
  • load_dataset (100-116)
  • convert_to_conversations (159-201)
src/aiperf/dataset/loader/single_turn.py (4)
  • can_load (73-96)
  • get_preferred_sampling_strategy (99-101)
  • load_dataset (103-123)
  • convert_to_conversations (125-153)
src/aiperf/dataset/loader/base_public_dataset.py (2)
  • load_dataset (53-65)
  • convert_to_conversations (67-88)
🪛 Ruff (0.14.1)
src/aiperf/dataset/loader/single_turn.py

74-74: Unused class method argument: filename

(ARG003)

src/aiperf/dataset/composer/custom.py

42-45: Avoid specifying long messages outside the exception class

(TRY003)


113-113: Consider moving this statement to an else block

(TRY300)

src/aiperf/dataset/loader/mooncake_trace.py

58-58: Unused class method argument: filename

(ARG003)

src/aiperf/dataset/loader/random_pool.py

78-78: Unused class method argument: data

(ARG003)

src/aiperf/dataset/loader/multi_turn.py

100-100: Unused class method argument: filename

(ARG003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
  • GitHub Check: integration-tests (ubuntu-latest, 3.12)
  • GitHub Check: integration-tests (ubuntu-latest, 3.11)
  • GitHub Check: integration-tests (ubuntu-latest, 3.10)
  • GitHub Check: build (ubuntu-latest, 3.11)
  • GitHub Check: build (ubuntu-latest, 3.12)
  • GitHub Check: build (ubuntu-latest, 3.10)
  • GitHub Check: build (macos-latest, 3.11)
  • GitHub Check: build (macos-latest, 3.12)
🔇 Additional comments (20)
src/aiperf/dataset/composer/__init__.py (1)

6-6: LGTM! Consistent mkinit configuration.

The addition of "logger" to the __ignore__ list aligns with the mkinit-generated init pattern used across the codebase.

src/aiperf/dataset/__init__.py (1)

3-10: LGTM! Standard mkinit auto-generation markers.

The added mkinit headers and __ignore__ configuration are consistent with the auto-generated file pattern and align with similar changes across the codebase.

src/aiperf/dataset/composer/synthetic.py (2)

7-7: LGTM!

The import is necessary for accessing the default sampling strategy value.


21-28: Config mutation pattern is intentional and safe. No changes needed.

The mutation of self.config.input.dataset_sampling_strategy in the composer's constructor follows an intentional design pattern that is both tested and necessary. The initialization sequence guarantees the default is set before the value is consumed: the composer is instantiated first (dataset_manager.py lines 205-209), which applies the default if needed, and only then is the value accessed by DatasetSamplingStrategyFactory.create_instance() (line 215). Since the config is passed by reference, the mutation is visible to the caller. Both SyntheticDatasetComposer and CustomDatasetComposer follow this same pattern consistently, and unit tests explicitly validate this behavior (test_custom_composer.py). There is no code path that accesses dataset_sampling_strategy before a composer is instantiated.
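
Condensed, illustrative view of that ordering (class and factory names are taken from this review; import paths are assumptions):

from aiperf.common.factories import DatasetSamplingStrategyFactory  # path assumed
from aiperf.dataset.composer.synthetic import SyntheticDatasetComposer  # path assumed


def build_dataset(config, tokenizer):
    # Composer construction runs first and may default
    # config.input.dataset_sampling_strategy when it is None.
    composer = SyntheticDatasetComposer(config, tokenizer)
    dataset = composer.create_dataset()
    # The strategy is read only afterwards; config is shared by reference,
    # so the factory sees the defaulted value.
    sampler = DatasetSamplingStrategyFactory.create_instance(
        config.input.dataset_sampling_strategy
    )
    return dataset, sampler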

src/aiperf/common/protocols.py (3)

33-49: LGTM: type-only imports behind TYPE_CHECKING are correct.

Importing Path, DatasetSamplingStrategy, and CustomDatasetT for typing only avoids runtime deps. Good pattern here.


323-357: The protocol is correctly typed; Filename TypeAlias is fully compatible.

Filename is defined as a TypeAlias to str (not a NewType), so there is no type mismatch between the protocol's dict[str, ...] and the implementation's dict[Filename, ...]. Structural type checking treats them as equivalent.

Conversation is already imported at runtime from aiperf.common.models, so the suggested forward-reference is optional and not necessary. The code as written is correct.


12-12: Review suggests adding DatasetSamplingStrategy to TYPE_CHECKING, but it's already there (line 41).

The refactor concept is sound—Conversation is used only in type annotation (line 356), so moving it to TYPE_CHECKING reduces runtime import overhead. However:

  1. DatasetSamplingStrategy is already imported in the TYPE_CHECKING block, so that part of the diff is unnecessary.
  2. The return annotation must be updated to a string forward reference: -> list["Conversation"]: (not -> list[Conversation]:), since from __future__ import annotations is not present.
  3. No actual circular dependency currently exists; this is a preventative measure.

Verify the final implementation matches: remove Conversation from the runtime import, add it to TYPE_CHECKING (without adding duplicate DatasetSamplingStrategy), and quote the return type annotation.
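
For reference, the general shape of the pattern being discussed (a generic sketch, not the PR's actual file):

from typing import TYPE_CHECKING, Protocol

if TYPE_CHECKING:
    # Type-checker-only import: no runtime cost, no circular-import risk.
    from aiperf.common.models import Conversation


class LoaderProtocolSketch(Protocol):
    def convert_to_conversations(self) -> list["Conversation"]:
        # Quoted forward reference, required because
        # `from __future__ import annotations` is not in effect.
        ...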

src/aiperf/dataset/loader/single_turn.py (3)

72-97: Silence Ruff ARG003 and clarify intent for path-based detection

The filename arg isn’t used (by design for SingleTurn). Rename to _filename to quiet ARG003 and document that SingleTurn does not support path-only detection.

-    def can_load(
-        cls, data: dict[str, Any] | None = None, filename: str | Path | None = None
-    ) -> bool:
+    def can_load(
+        cls, data: dict[str, Any] | None = None, _filename: str | Path | None = None
+    ) -> bool:
         """Check if this loader can handle the given data format.
 
         SingleTurn format has modality fields (text/texts, image/images, etc.)
         but does NOT have a "turns" field.
         """

98-101: Preferred sampling strategy exposure looks good

Returning SEQUENTIAL here aligns with composer auto-configuration.


142-147: Video modality wiring added—LGTM

Passing videos=media[MediaType.VIDEO] completes parity with other modalities. Ensure Turn supports the videos field (it appears consistent with other loaders in this PR).

tests/composers/test_custom_composer.py (5)

8-8: Enum import change is correct

Switching to aiperf.common.enums is consistent with source changes.


66-77: Trace dataset flow test—LGTM

Patch targets and assertions align with the new composer behavior.


78-95: Max tokens via config—LGTM

Good use of patch to make output deterministic.


96-112: Mooncake max_tokens from file—LGTM

Mocks and assertions look correct; ensures conversion respects output_length.


117-134: Empty loader result case—LGTM

Validates empty dataset handling without errors.

src/aiperf/dataset/loader/mooncake_trace.py (1)

70-74: Preferred strategy exposure—LGTM

SEQUENTIAL is appropriate for trace replay.

src/aiperf/dataset/composer/custom.py (1)

130-146: Sampling strategy auto-selection—LGTM

Respects explicit user config and defers to loader preference when None.

src/aiperf/dataset/dataset_manager.py (1)

191-198: Review comment is incorrect; both composers handle None before factory call

The original concern misses that both SyntheticDatasetComposer and CustomDatasetComposer handle None sampling strategy before returning from create_dataset():

  • SyntheticDatasetComposer (lines 18–25): Sets dataset_sampling_strategy to InputDefaults.DATASET_SAMPLING_STRATEGY in __init__ if None.
  • CustomDatasetComposer (line 49): Calls _set_sampling_strategy() during create_dataset(), which queries the loader's preferred strategy if None.

By line 215 in dataset_manager.py, the strategy is guaranteed to be set. The suggested diff is unnecessary.

Likely an incorrect or invalid review comment.

tests/loaders/test_can_load.py (1)

24-72: Great coverage on SingleTurn and MultiTurn structural detection.

LGTM. The matrix captures modalities, explicit type handling, and negative cases well.

src/aiperf/dataset/loader/random_pool.py (1)

197-199: The review comment is incorrect.

The project requires Python 3.10+ according to pyproject.toml. Since the strict parameter in zip() was introduced in Python 3.10, using strict=False is safe and valid for all supported versions. The review's concern about Python 3.9 and below compatibility does not apply to this codebase.

Likely an incorrect or invalid review comment.

Comment on lines +35 to +47
        check_file_exists(self.config.input.file)

        # Auto-infer dataset type if not provided
        dataset_type = self.config.input.custom_dataset_type
        if dataset_type is None:
            dataset_type = self._infer_dataset_type(self.config.input.file)
            if dataset_type is None:
                raise ValueError(
                    f"Could not infer dataset type from file: {self.config.input.file}. "
                    "Please specify --custom-dataset-type explicitly."
                )
            self.info(f"Auto-detected dataset type: {dataset_type}")


⚠️ Potential issue | 🔴 Critical

Persist inferred type to config and guard missing file

Two fixes:

  • If file is None, raise a clear error before calling check_file_exists.
  • After auto-detect, write dataset_type back to config so downstream modules relying on config.input.custom_dataset_type behave correctly (e.g., exact trace replay).
@@
-        # TODO: (future) for K8s, we need to transfer file data from SC (across node)
-        check_file_exists(self.config.input.file)
+        # TODO: (future) for K8s, we need to transfer file data from SC (across node)
+        if not self.config.input.file:
+            raise ValueError(
+                "Custom dataset requires --input-file. Provide --input-file or disable custom dataset."
+            )
+        check_file_exists(self.config.input.file)
@@
         if dataset_type is None:
             dataset_type = self._infer_dataset_type(self.config.input.file)
             if dataset_type is None:
                 raise ValueError(
                     f"Could not infer dataset type from file: {self.config.input.file}. "
                     "Please specify --custom-dataset-type explicitly."
                 )
             self.info(f"Auto-detected dataset type: {dataset_type}")
+        # Persist inferred type for downstream logic (request count, fixed schedule, etc.)
+        self.config.input.custom_dataset_type = dataset_type

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 Ruff (0.14.1)

42-45: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
In src/aiperf/dataset/composer/custom.py around lines 35 to 47, the code calls
check_file_exists without first ensuring self.config.input.file is present and
also doesn't persist the auto-detected dataset_type back into the config; first,
if self.config.input.file is None raise a clear ValueError before calling
check_file_exists, and second, after auto-detecting dataset_type assign it back
to self.config.input.custom_dataset_type (so downstream code sees the inferred
value) and log the detection as before.

Comment on lines +56 to +69
    @classmethod
    def can_load(
        cls, data: dict[str, Any] | None = None, filename: str | Path | None = None
    ) -> bool:
        """Check if this loader can handle the given data format.

        MooncakeTrace format has "input_length" or "text_input" fields,
        and optionally "hash_ids".
        """
        if data is None:
            return False

        return "input_length" in data or ("text_input" in data and "hash_ids" in data)


⚠️ Potential issue | 🟠 Major

can_load too strict; accept text_input without requiring hash_ids, and silence ARG003

Docstring says hash_ids is optional. Current predicate requires both and will fail valid traces, breaking auto-detect. Also rename filename to _filename to avoid Ruff ARG003.

-    def can_load(
-        cls, data: dict[str, Any] | None = None, filename: str | Path | None = None
-    ) -> bool:
+    def can_load(
+        cls, data: dict[str, Any] | None = None, _filename: str | Path | None = None
+    ) -> bool:
@@
-        return "input_length" in data or ("text_input" in data and "hash_ids" in data)
+        return ("input_length" in data) or ("text_input" in data)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Before:

    @classmethod
    def can_load(
        cls, data: dict[str, Any] | None = None, filename: str | Path | None = None
    ) -> bool:
        """Check if this loader can handle the given data format.

        MooncakeTrace format has "input_length" or "text_input" fields,
        and optionally "hash_ids".
        """
        if data is None:
            return False

        return "input_length" in data or ("text_input" in data and "hash_ids" in data)

After:

    @classmethod
    def can_load(
        cls, data: dict[str, Any] | None = None, _filename: str | Path | None = None
    ) -> bool:
        """Check if this loader can handle the given data format.

        MooncakeTrace format has "input_length" or "text_input" fields,
        and optionally "hash_ids".
        """
        if data is None:
            return False

        return ("input_length" in data) or ("text_input" in data)
🧰 Tools
🪛 Ruff (0.14.1)

58-58: Unused class method argument: filename

(ARG003)

🤖 Prompt for AI Agents
In src/aiperf/dataset/loader/mooncake_trace.py around lines 56 to 69, the
can_load method is too strict and also triggers Ruff ARG003; update the
predicate to return True if data contains "input_length" or "text_input"
(without requiring "hash_ids") to match the docstring and accept valid traces,
and rename the filename parameter to _filename (and any internal usages) to
silence ARG003; keep the early None check and the docstring unchanged.
