feat: auto detect custom dataset type based on file info #399

base: main

Conversation
Walkthrough

This PR refactors the dataset loader architecture by relocating the custom dataset loader protocol into aiperf.common.protocols, giving each loader a structural can_load() check and a preferred sampling strategy, and teaching the custom composer to auto-infer the dataset type when none is specified.
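A minimal sketch of the relocated protocol's surface, reconstructed from the method names and line ranges cited throughout this review; the real definition lives in src/aiperf/common/protocols.py and may differ in detail:

```python
from pathlib import Path
from typing import Any, Protocol

class CustomDatasetLoaderProtocol(Protocol):
    """Sketch only: assembled from the methods this review references."""

    @classmethod
    def can_load(
        cls, data: dict[str, Any] | None = None, filename: str | Path | None = None
    ) -> bool:
        """Return True if this loader structurally recognizes the record or path."""
        ...

    @classmethod
    def get_preferred_sampling_strategy(cls) -> "DatasetSamplingStrategy":
        """Strategy the composer should adopt when the user sets none."""
        ...

    def load_dataset(self) -> Any: ...
    def convert_to_conversations(self, dataset: Any) -> "list[Conversation]": ...
```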
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/aiperf/dataset/__init__.py (1)
60-97: Update the import in src/aiperf/common/factories.py to use the new protocol location.

The removal of CustomDatasetLoaderProtocol from the dataset module exports is incomplete. The import at line 66 in src/aiperf/common/factories.py still references the old location and will fail:

```python
# Current (broken):
from aiperf.dataset import (
    CustomDatasetLoaderProtocol,
)

# Should be:
from aiperf.common.protocols import (
    CustomDatasetLoaderProtocol,
)
```

The protocol now exists only in aiperf/common/protocols.py, but factories.py attempts to import it from aiperf.dataset, which no longer exports it.
🧹 Nitpick comments (12)
tests/composers/test_custom_composer.py (1)
136-178: Sampling strategy auto-configuration tests: LGTM; add a test for type persistence.

Good coverage for strategy selection and non-override. Please add a test asserting that the auto-inferred dataset type is written back to config.input.custom_dataset_type, since downstream logic (e.g., get_effective_request_count, fixed schedule) depends on it.

Example:

```python
@patch("aiperf.dataset.composer.custom.check_file_exists")
@patch("builtins.open", mock_open(read_data='{"input_length": 10, "output_length": 1}\n'))
def test_inferred_type_persists_in_config(mock_check_file, trace_config, mock_tokenizer):
    trace_config.input.custom_dataset_type = None
    composer = CustomDatasetComposer(trace_config, mock_tokenizer)
    composer.create_dataset()
    assert trace_config.input.custom_dataset_type == CustomDatasetType.MOONCAKE_TRACE
```

src/aiperf/dataset/composer/custom.py (2)
57-89: Make JSONL inference more robust (optional).

Consider scanning the first N non-empty lines with an explicit UTF-8 encoding to tolerate BOMs and stray unparseable lines, and fall back to path-only detection if none parse:

```diff
-        with open(file_path) as f:
-            for line in f:
+        with open(file_path, encoding="utf-8") as f:
+            checked = 0
+            for line in f:
                 if not (line := line.strip()):
                     continue
-                data = load_json_str(line)
-                return CustomDatasetComposer._infer_type(
-                    data=data, filename=file_path
-                )
+                try:
+                    data = load_json_str(line)
+                    inferred = CustomDatasetComposer._infer_type(data=data, filename=file_path)
+                    if inferred is not None:
+                        return inferred
+                except Exception:
+                    # Ignore parse errors and keep probing a few lines
+                    pass
+                checked += 1
+                if checked >= 10:
+                    break
+        # Fall back to filename-only structural detection
+        return CustomDatasetComposer._infer_type(data=None, filename=file_path)
```
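Probing a bounded number of lines keeps detection cheap on arbitrarily large trace files while still tolerating blank or malformed leading lines; the cap of 10 above is an illustrative bound, not a value taken from the codebase.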
107-118: Avoid referencing data["type"] in the except branch and narrow the exception.

Minor logging hardening: the enum conversion can only raise ValueError here, and using data.get("type") avoids a possible KeyError inside the log call.

```diff
-        if data is not None and "type" in data:
-            try:
-                # Try to convert the type string to enum
-                explicit_type = CustomDatasetType(data["type"])
-                logger.info(f"Using explicit type field: {explicit_type}")
-                return explicit_type
-            except (ValueError, KeyError):
-                logger.info(
-                    f"Invalid type field value: {data['type']}, falling back to structural detection"
-                )
+        if data is not None and "type" in data:
+            try:
+                explicit_type = CustomDatasetType(data["type"])
+                logger.info(f"Using explicit type field: {explicit_type}")
+                return explicit_type
+            except ValueError:
+                logger.info(
+                    f"Invalid type field value: {repr(data.get('type'))}, falling back to structural detection"
+                )
```
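For context, a quick illustration of why ValueError is the only exception the enum value lookup can raise (using a simplified stand-in enum; the real one lives in aiperf.common.enums):

```python
from enum import Enum

class CustomDatasetType(Enum):  # simplified stand-in for the real enum
    MOONCAKE_TRACE = "mooncake_trace"
    SINGLE_TURN = "single_turn"

# Enum value lookup raises ValueError on unknown values, never KeyError:
try:
    CustomDatasetType("not-a-real-type")
except ValueError as exc:
    print(exc)  # 'not-a-real-type' is not a valid CustomDatasetType
```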
src/aiperf/dataset/loader/random_pool.py (4)

76-93: Silence Ruff ARG003 for the unused data argument while preserving the protocol signature.

data is intentionally unused here (detection is path-based only). Reference it to satisfy lint without breaking keyword callers:

```diff
     @classmethod
     def can_load(
         cls, data: dict[str, Any] | None = None, filename: str | Path | None = None
     ) -> bool:
         """Check if this loader can handle the given data format.
@@
+        # Intentionally unused: detection here is path-based only; keep the
+        # parameter name for keyword callers.
+        del data
         if filename is not None:
             path = Path(filename) if isinstance(filename, str) else filename
             if path.is_dir():
                 return True
```
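An illustrative call (the path below is hypothetical and must exist as a directory for is_dir() to return True) showing the path-based detection this comment refers to:

```python
from pathlib import Path

# A directory input is claimed by the random-pool loader even when there is
# no record to inspect.
RandomPoolDatasetLoader.can_load(data=None, filename=Path("datasets/pool_dir"))
```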
203-217: Broaden the _merge_turns input type to match zip output.

zip yields tuples, but the signature requires list[Turn]; accepting Iterable[Turn] avoids the mismatch.

```diff
-from typing import Any, TypeAlias
+from typing import Any, Iterable, TypeAlias
@@
-    def _merge_turns(self, turns: list[Turn]) -> Turn:
+    def _merge_turns(self, turns: Iterable[Turn]) -> Turn:
```
129-137: Specify the file encoding for deterministic I/O.

Use UTF-8 to avoid platform-dependent defaults and decoding issues:

```diff
-        with open(file_path) as f:
+        with open(file_path, encoding="utf-8") as f:
             for line in f:
                 if (line := line.strip()) == "":
                     continue  # Skip empty lines
```
150-157: Handle empty directories explicitly.

If no files are discovered, convert_to_conversations returns empty-turn conversations. Consider failing fast with a clear error. Example guard (in _load_dataset_from_dir or before conversion):

```diff
         for file_path in dir_path.iterdir():
             if file_path.is_file():
                 dataset_pool = self._load_dataset_from_file(file_path)
                 data[file_path.name].extend(dataset_pool)
-        return data
+        if not data:
+            raise ValueError(f"No dataset files found in directory: {dir_path}")
+        return data
```

src/aiperf/dataset/loader/multi_turn.py (3)
98-110: Silence Ruff ARG003 for the unused filename argument while preserving the signature.

Reference filename to satisfy lint without any behavioral change:

```diff
     def can_load(
         cls, data: dict[str, Any] | None = None, filename: str | Path | None = None
     ) -> bool:
         """Check if this loader can handle the given data format.

         MultiTurn format has a "turns" field containing a list of turns.
         """
+        # `filename` is unused here; keep the parameter for protocol compatibility.
+        del filename
         if data is None:
             return False
         return "turns" in data and isinstance(data.get("turns"), list)
```
104-110: Optionally validate element shape for stricter structural detection.

Guard that turns is a list of dict-like items to reduce false positives early:

```diff
-        return "turns" in data and isinstance(data.get("turns"), list)
+        turns = data.get("turns")
+        return isinstance(turns, list) and all(isinstance(t, dict) for t in turns)
```
127-136: Specify the file encoding for deterministic I/O.

Use UTF-8 for consistent behavior across platforms:

```diff
-        with open(self.filename) as f:
+        with open(self.filename, encoding="utf-8") as f:
             for line in f:
                 if (line := line.strip()) == "":
                     continue  # Skip empty lines
```

tests/loaders/test_can_load.py (2)
56-72: Add a negative case: a turns list containing non-dict elements.

This verifies the stricter shape check if you adopt the optional change in MultiTurn:

```diff
     [
         param({"turns": [{"text": "Turn 1"}, {"text": "Turn 2"}]}, True, id="turns_list"),
+        param({"turns": ["not-a-dict"]}, False, id="turns_list_with_non_dict"),
```
186-188: Nit: docstring method name.

The docstring says infer_dataset_type() but the called API is _infer_dataset_type(). Consider aligning them.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (15)
- src/aiperf/common/config/input_config.py (0 hunks)
- src/aiperf/common/protocols.py (3 hunks)
- src/aiperf/dataset/__init__.py (1 hunks)
- src/aiperf/dataset/composer/__init__.py (1 hunks)
- src/aiperf/dataset/composer/custom.py (2 hunks)
- src/aiperf/dataset/composer/synthetic.py (2 hunks)
- src/aiperf/dataset/dataset_manager.py (1 hunks)
- src/aiperf/dataset/loader/__init__.py (0 hunks)
- src/aiperf/dataset/loader/mooncake_trace.py (2 hunks)
- src/aiperf/dataset/loader/multi_turn.py (3 hunks)
- src/aiperf/dataset/loader/protocol.py (0 hunks)
- src/aiperf/dataset/loader/random_pool.py (4 hunks)
- src/aiperf/dataset/loader/single_turn.py (3 hunks)
- tests/composers/test_custom_composer.py (6 hunks)
- tests/loaders/test_can_load.py (1 hunks)
💤 Files with no reviewable changes (3)
- src/aiperf/common/config/input_config.py
- src/aiperf/dataset/loader/__init__.py
- src/aiperf/dataset/loader/protocol.py
🧰 Additional context used
🧬 Code graph analysis (9)
src/aiperf/dataset/loader/single_turn.py (5)
- src/aiperf/common/enums/dataset_enums.py (2): CustomDatasetType (17-21), DatasetSamplingStrategy (50-59)
- src/aiperf/common/protocols.py (2): can_load (328-341), get_preferred_sampling_strategy (344-350)
- src/aiperf/dataset/loader/mooncake_trace.py (2): can_load (57-68), get_preferred_sampling_strategy (71-73)
- src/aiperf/dataset/loader/multi_turn.py (2): can_load (99-109), get_preferred_sampling_strategy (112-114)
- src/aiperf/dataset/loader/random_pool.py (2): can_load (77-93), get_preferred_sampling_strategy (96-98)

src/aiperf/dataset/composer/custom.py (8)
- src/aiperf/common/enums/dataset_enums.py (1): CustomDatasetType (17-21)
- src/aiperf/common/factories.py (2): get_all_classes_and_types (249-253), get_class_from_type (216-232)
- src/aiperf/common/protocols.py (6): info (69-69), load_dataset (352-352), convert_to_conversations (354-356), exception (74-74), can_load (328-341), get_preferred_sampling_strategy (344-350)
- src/aiperf/common/utils.py (1): load_json_str (79-98)
- src/aiperf/dataset/loader/mooncake_trace.py (4): load_dataset (75-108), convert_to_conversations (115-149), can_load (57-68), get_preferred_sampling_strategy (71-73)
- src/aiperf/dataset/loader/multi_turn.py (4): load_dataset (116-136), convert_to_conversations (138-169), can_load (99-109), get_preferred_sampling_strategy (112-114)
- src/aiperf/dataset/loader/random_pool.py (4): load_dataset (100-116), convert_to_conversations (159-201), can_load (77-93), get_preferred_sampling_strategy (96-98)
- src/aiperf/dataset/loader/single_turn.py (4): load_dataset (103-123), convert_to_conversations (125-153), can_load (73-96), get_preferred_sampling_strategy (99-101)

src/aiperf/dataset/loader/mooncake_trace.py (5)
- src/aiperf/common/enums/dataset_enums.py (2): CustomDatasetType (17-21), DatasetSamplingStrategy (50-59)
- src/aiperf/common/protocols.py (2): can_load (328-341), get_preferred_sampling_strategy (344-350)
- src/aiperf/dataset/loader/multi_turn.py (2): can_load (99-109), get_preferred_sampling_strategy (112-114)
- src/aiperf/dataset/loader/random_pool.py (2): can_load (77-93), get_preferred_sampling_strategy (96-98)
- src/aiperf/dataset/loader/single_turn.py (2): can_load (73-96), get_preferred_sampling_strategy (99-101)

tests/loaders/test_can_load.py (7)
- src/aiperf/common/enums/dataset_enums.py (1): CustomDatasetType (17-21)
- src/aiperf/dataset/composer/custom.py (3): CustomDatasetComposer (24-166), _infer_type (91-128), _infer_dataset_type (58-88)
- src/aiperf/dataset/loader/mooncake_trace.py (2): MooncakeTraceDatasetLoader (19-149), can_load (57-68)
- src/aiperf/dataset/loader/multi_turn.py (2): MultiTurnDatasetLoader (17-169), can_load (99-109)
- src/aiperf/dataset/loader/random_pool.py (2): RandomPoolDatasetLoader (21-218), can_load (77-93)
- src/aiperf/dataset/loader/single_turn.py (2): SingleTurnDatasetLoader (17-153), can_load (73-96)
- tests/loaders/conftest.py (1): create_jsonl_file (11-27)

src/aiperf/dataset/loader/random_pool.py (3)
- src/aiperf/common/enums/dataset_enums.py (1): DatasetSamplingStrategy (50-59)
- src/aiperf/common/protocols.py (2): can_load (328-341), get_preferred_sampling_strategy (344-350)
- src/aiperf/dataset/loader/single_turn.py (2): can_load (73-96), get_preferred_sampling_strategy (99-101)

tests/composers/test_custom_composer.py (3)
- src/aiperf/common/enums/dataset_enums.py (2): CustomDatasetType (17-21), DatasetSamplingStrategy (50-59)
- tests/composers/conftest.py (2): custom_config (163-173), mock_tokenizer (26-30)
- src/aiperf/dataset/composer/custom.py (2): CustomDatasetComposer (24-166), _set_sampling_strategy (130-145)

src/aiperf/dataset/loader/multi_turn.py (5)
- src/aiperf/common/enums/dataset_enums.py (2): CustomDatasetType (17-21), DatasetSamplingStrategy (50-59)
- src/aiperf/common/protocols.py (2): can_load (328-341), get_preferred_sampling_strategy (344-350)
- src/aiperf/dataset/loader/mooncake_trace.py (2): can_load (57-68), get_preferred_sampling_strategy (71-73)
- src/aiperf/dataset/loader/random_pool.py (2): can_load (77-93), get_preferred_sampling_strategy (96-98)
- src/aiperf/dataset/loader/single_turn.py (2): can_load (73-96), get_preferred_sampling_strategy (99-101)

src/aiperf/dataset/composer/synthetic.py (2)
- src/aiperf/common/config/config_defaults.py (1): InputDefaults (44-58)
- src/aiperf/common/protocols.py (1): info (69-69)

src/aiperf/common/protocols.py (6)
- src/aiperf/common/enums/dataset_enums.py (1): DatasetSamplingStrategy (50-59)
- src/aiperf/dataset/loader/mooncake_trace.py (4): can_load (57-68), get_preferred_sampling_strategy (71-73), load_dataset (75-108), convert_to_conversations (115-149)
- src/aiperf/dataset/loader/multi_turn.py (4): can_load (99-109), get_preferred_sampling_strategy (112-114), load_dataset (116-136), convert_to_conversations (138-169)
- src/aiperf/dataset/loader/random_pool.py (4): can_load (77-93), get_preferred_sampling_strategy (96-98), load_dataset (100-116), convert_to_conversations (159-201)
- src/aiperf/dataset/loader/single_turn.py (4): can_load (73-96), get_preferred_sampling_strategy (99-101), load_dataset (103-123), convert_to_conversations (125-153)
- src/aiperf/dataset/loader/base_public_dataset.py (2): load_dataset (53-65), convert_to_conversations (67-88)
🪛 Ruff (0.14.1)
src/aiperf/dataset/loader/single_turn.py
74-74: Unused class method argument: filename
(ARG003)
src/aiperf/dataset/composer/custom.py
42-45: Avoid specifying long messages outside the exception class
(TRY003)
113-113: Consider moving this statement to an else block
(TRY300)
src/aiperf/dataset/loader/mooncake_trace.py
58-58: Unused class method argument: filename
(ARG003)
src/aiperf/dataset/loader/random_pool.py
78-78: Unused class method argument: data
(ARG003)
src/aiperf/dataset/loader/multi_turn.py
100-100: Unused class method argument: filename
(ARG003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
- GitHub Check: integration-tests (ubuntu-latest, 3.12)
- GitHub Check: integration-tests (ubuntu-latest, 3.11)
- GitHub Check: integration-tests (ubuntu-latest, 3.10)
- GitHub Check: build (ubuntu-latest, 3.11)
- GitHub Check: build (ubuntu-latest, 3.12)
- GitHub Check: build (ubuntu-latest, 3.10)
- GitHub Check: build (macos-latest, 3.11)
- GitHub Check: build (macos-latest, 3.12)
🔇 Additional comments (20)
src/aiperf/dataset/composer/__init__.py (1)
6-6: LGTM! Consistent mkinit configuration.

The addition of "logger" to the __ignore__ list aligns with the mkinit-generated init pattern used across the codebase.

src/aiperf/dataset/__init__.py (1)
3-10: LGTM! Standard mkinit auto-generation markers.

The added mkinit headers and __ignore__ configuration are consistent with the auto-generated file pattern and align with similar changes across the codebase.

src/aiperf/dataset/composer/synthetic.py (2)
7-7: LGTM!

The import is necessary for accessing the default sampling strategy value.
21-28: Config mutation pattern is intentional and safe. No changes needed.

The mutation of self.config.input.dataset_sampling_strategy in the composer's constructor follows an intentional design pattern that is both tested and necessary. The initialization sequence guarantees the default is set before the value is consumed: the composer is instantiated first (dataset_manager.py lines 205-209), which applies the default if needed, and only then is the value accessed by DatasetSamplingStrategyFactory.create_instance() (line 215). Since the config is passed by reference, the mutation is visible to the caller. Both SyntheticDatasetComposer and CustomDatasetComposer follow this pattern consistently, and unit tests explicitly validate the behavior (test_custom_composer.py). There is no code path that accesses dataset_sampling_strategy before a composer is instantiated.
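A condensed sketch of the ordering argument above (names match the review; the call shape is approximate, not the literal dataset_manager.py code):

```python
# 1. Composer construction fills in the default when the user set nothing.
composer = SyntheticDatasetComposer(config, tokenizer)   # strategy now non-None

# 2. Only afterwards does the factory read the (guaranteed-set) value.
strategy = DatasetSamplingStrategyFactory.create_instance(
    config.input.dataset_sampling_strategy
)
```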
src/aiperf/common/protocols.py (3)

33-49: LGTM: type-only imports behind TYPE_CHECKING are correct.

Importing Path, DatasetSamplingStrategy, and CustomDatasetT for typing only avoids runtime dependencies. Good pattern here.
323-357: The protocol is correctly typed; the Filename TypeAlias is fully compatible.

Filename is defined as a TypeAlias to str (not a NewType), so there is no type mismatch between the protocol's dict[str, ...] and the implementation's dict[Filename, ...]; structural type checking treats them as equivalent.

Conversation is already imported at runtime from aiperf.common.models, so the suggested forward reference is optional and not necessary. The code as written is correct.
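A minimal, self-contained illustration of that equivalence (toy function names):

```python
from typing import TypeAlias

Filename: TypeAlias = str  # an alias, not a distinct type

def accepts_str_keys(d: dict[str, int]) -> None: ...

def returns_filename_keys() -> dict[Filename, int]:
    return {"trace.jsonl": 1}

# Type-checks cleanly: dict[Filename, int] and dict[str, int] are the same type.
accepts_str_keys(returns_filename_keys())
```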
12-12: The review suggests adding DatasetSamplingStrategy to TYPE_CHECKING, but it is already there (line 41).

The refactor concept is sound: Conversation is used only in a type annotation (line 356), so moving it into TYPE_CHECKING reduces runtime import overhead. However:

- DatasetSamplingStrategy is already imported in the TYPE_CHECKING block, so that part of the diff is unnecessary.
- The return annotation must become a string forward reference, -> list["Conversation"]: rather than -> list[Conversation]:, since from __future__ import annotations is not present.
- No actual circular dependency currently exists; this is a preventative measure.

Verify the final implementation matches: remove Conversation from the runtime import, add it to TYPE_CHECKING (without duplicating DatasetSamplingStrategy), and quote the return type annotation.
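A small sketch of the quoting requirement (module path as cited in the review; the surrounding class is a hypothetical stand-in):

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Type-checker-only import; absent at runtime.
    from aiperf.common.models import Conversation

class DatasetLoaderSketch:
    def convert_to_conversations(self, dataset) -> "list[Conversation]":
        # The annotation must be quoted: without `from __future__ import
        # annotations`, an unquoted name is evaluated at definition time and
        # raises NameError, since the import above never runs at runtime.
        ...
```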
src/aiperf/dataset/loader/single_turn.py (3)
72-97: Silence Ruff ARG003 and clarify the intent of path-based detection.

The filename argument is unused by design for SingleTurn. Rename it to _filename to quiet ARG003 and document that SingleTurn does not support path-only detection:

```diff
-    def can_load(
-        cls, data: dict[str, Any] | None = None, filename: str | Path | None = None
-    ) -> bool:
+    def can_load(
+        cls, data: dict[str, Any] | None = None, _filename: str | Path | None = None
+    ) -> bool:
         """Check if this loader can handle the given data format.

         SingleTurn format has modality fields (text/texts, image/images, etc.)
         but does NOT have a "turns" field.
         """
```
98-101: Preferred sampling strategy exposure looks good.

Returning SEQUENTIAL here aligns with the composer's auto-configuration.
142-147: Video modality wiring added: LGTM.

Passing videos=media[MediaType.VIDEO] completes parity with the other modalities. Ensure Turn supports the videos field (it appears consistent with the other loaders in this PR).
tests/composers/test_custom_composer.py (5)
8-8: Enum import change is correct.

Switching to aiperf.common.enums is consistent with the source changes.
66-77: Trace dataset flow test: LGTM.

Patch targets and assertions align with the new composer behavior.
78-95: Max tokens via config: LGTM.

Good use of patch to make the output deterministic.
96-112: Mooncake max_tokens from file: LGTM.

Mocks and assertions look correct; this ensures conversion respects output_length.
117-134: Empty loader result case: LGTM.

Validates empty dataset handling without errors.
src/aiperf/dataset/loader/mooncake_trace.py (1)
70-74: Preferred strategy exposure: LGTM.

SEQUENTIAL is appropriate for trace replay.
src/aiperf/dataset/composer/custom.py (1)
130-146: Sampling strategy auto-selection: LGTM.

Respects explicit user config and defers to the loader's preference when None.
src/aiperf/dataset/dataset_manager.py (1)
191-198: Review comment is incorrect; both composers handle None before the factory call.

The original concern misses that both SyntheticDatasetComposer and CustomDatasetComposer handle a None sampling strategy before returning from create_dataset():

- SyntheticDatasetComposer (lines 18-25): sets dataset_sampling_strategy to InputDefaults.DATASET_SAMPLING_STRATEGY in __init__ if None.
- CustomDatasetComposer (line 49): calls _set_sampling_strategy() during create_dataset(), which queries the loader's preferred strategy if None.

By line 215 in dataset_manager.py, the strategy is guaranteed to be set. The suggested diff is unnecessary.

Likely an incorrect or invalid review comment.
tests/loaders/test_can_load.py (1)
24-72: Great coverage of SingleTurn and MultiTurn structural detection.

LGTM. The matrix captures modalities, explicit type handling, and negative cases well.
src/aiperf/dataset/loader/random_pool.py (1)
197-199: The review comment is incorrect.

The project requires Python 3.10+ according to pyproject.toml. Since the strict parameter to zip() was introduced in Python 3.10, using strict=False is safe and valid for all supported versions. The concern about Python 3.9 and below compatibility does not apply to this codebase.

Likely an incorrect or invalid review comment.
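For reference, a quick illustration of what strict= changes (standard-library behavior, Python 3.10+):

```python
list(zip([1, 2, 3], ["a", "b"]))                # [(1, 'a'), (2, 'b')] - silently truncates
list(zip([1, 2, 3], ["a", "b"], strict=False))  # same truncating behavior, made explicit

try:
    list(zip([1, 2, 3], ["a", "b"], strict=True))
except ValueError as exc:
    print(exc)  # zip() argument 2 is shorter than argument 1
```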
In src/aiperf/dataset/composer/custom.py (around lines 35-47):

```python
        check_file_exists(self.config.input.file)

        # Auto-infer dataset type if not provided
        dataset_type = self.config.input.custom_dataset_type
        if dataset_type is None:
            dataset_type = self._infer_dataset_type(self.config.input.file)
            if dataset_type is None:
                raise ValueError(
                    f"Could not infer dataset type from file: {self.config.input.file}. "
                    "Please specify --custom-dataset-type explicitly."
                )
            self.info(f"Auto-detected dataset type: {dataset_type}")
```
Persist inferred type to config and guard missing file
Two fixes:
- If file is None, raise a clear error before calling check_file_exists.
- After auto-detect, write dataset_type back to config so downstream modules relying on config.input.custom_dataset_type behave correctly (e.g., exact trace replay).
```diff
@@
-    # TODO: (future) for K8s, we need to transfer file data from SC (across node)
-    check_file_exists(self.config.input.file)
+    # TODO: (future) for K8s, we need to transfer file data from SC (across node)
+    if not self.config.input.file:
+        raise ValueError(
+            "Custom dataset requires --input-file. Provide --input-file or disable custom dataset."
+        )
+    check_file_exists(self.config.input.file)
@@
     if dataset_type is None:
         dataset_type = self._infer_dataset_type(self.config.input.file)
         if dataset_type is None:
             raise ValueError(
                 f"Could not infer dataset type from file: {self.config.input.file}. "
                 "Please specify --custom-dataset-type explicitly."
             )
         self.info(f"Auto-detected dataset type: {dataset_type}")
+        # Persist inferred type for downstream logic (request count, fixed schedule, etc.)
+        self.config.input.custom_dataset_type = dataset_type
```

Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 Ruff (0.14.1)
42-45: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
In src/aiperf/dataset/composer/custom.py around lines 35 to 47, the code calls
check_file_exists without first ensuring self.config.input.file is present and
also doesn't persist the auto-detected dataset_type back into the config; first,
if self.config.input.file is None raise a clear ValueError before calling
check_file_exists, and second, after auto-detecting dataset_type assign it back
to self.config.input.custom_dataset_type (so downstream code sees the inferred
value) and log the detection as before.
In src/aiperf/dataset/loader/mooncake_trace.py (around lines 56-69):

```python
    @classmethod
    def can_load(
        cls, data: dict[str, Any] | None = None, filename: str | Path | None = None
    ) -> bool:
        """Check if this loader can handle the given data format.

        MooncakeTrace format has "input_length" or "text_input" fields,
        and optionally "hash_ids".
        """
        if data is None:
            return False

        return "input_length" in data or ("text_input" in data and "hash_ids" in data)
```
can_load is too strict; accept text_input without requiring hash_ids, and silence ARG003.

The docstring says hash_ids is optional, but the current predicate requires both fields and will reject valid traces, breaking auto-detect. Also rename filename to _filename to avoid Ruff ARG003.
```diff
-    def can_load(
-        cls, data: dict[str, Any] | None = None, filename: str | Path | None = None
-    ) -> bool:
+    def can_load(
+        cls, data: dict[str, Any] | None = None, _filename: str | Path | None = None
+    ) -> bool:
@@
-        return "input_length" in data or ("text_input" in data and "hash_ids" in data)
+        return ("input_length" in data) or ("text_input" in data)
```
+ return ("input_length" in data) or ("text_input" in data)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Suggested replacement:

```python
    @classmethod
    def can_load(
        cls, data: dict[str, Any] | None = None, _filename: str | Path | None = None
    ) -> bool:
        """Check if this loader can handle the given data format.

        MooncakeTrace format has "input_length" or "text_input" fields,
        and optionally "hash_ids".
        """
        if data is None:
            return False
        return ("input_length" in data) or ("text_input" in data)
```
🧰 Tools
🪛 Ruff (0.14.1)
58-58: Unused class method argument: filename
(ARG003)
🤖 Prompt for AI Agents
In src/aiperf/dataset/loader/mooncake_trace.py around lines 56 to 69, the
can_load method is too strict and also triggers Ruff ARG003; update the
predicate to return True if data contains "input_length" or "text_input"
(without requiring "hash_ids") to match the docstring and accept valid traces,
and rename the filename parameter to _filename (and any internal usages) to
silence ARG003; keep the early None check and the docstring unchanged.
Attempts to automatically detect the --custom-dataset-type based on the file data provided when using --input-file. This matches more closely what GenAI-Perf did, while still supporting a manual override.

This is done in a decoupled manner by having each CustomLoader provide a function defining whether it can load the given file or not; a sketch of the resulting flow follows below.
Also, CustomLoaders can now describe their preferred --dataset-sampling-strategy, making the code even more decoupled!
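A hedged sketch of the detection flow described above; the factory name and iteration API are assumptions based on the get_all_classes_and_types reference elsewhere in this review, not the literal implementation:

```python
def infer_dataset_type(first_record: dict | None, file_path: str):
    # Ask every registered loader whether it recognizes the first record or
    # the path itself; the first loader that claims it wins.
    for loader_cls, dataset_type in CustomDatasetLoaderFactory.get_all_classes_and_types():
        if loader_cls.can_load(data=first_record, filename=file_path):
            return dataset_type
    return None  # caller raises and asks the user for --custom-dataset-type
```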
New Features
Bug Fixes
Tests