
Conversation

@divyanshk

This diff introduces common dataloader args that are supported by StatefulDataLoader (and torch.utils.data DataLoader). Users should be able to use them in their config files.

I considered introducing a catch-all kwargs to make it easier to specify args, but that can easily complicate things (validation checks, duplication, clashes with named args already defined in function signatures, etc.).

@meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) on Dec 2, 2025
@divyanshk force-pushed the divyanshk/dataloader_args branch from 6763cc0 to 990d654 on December 2, 2025 01:04
@divyanshk marked this pull request as ready for review December 3, 2025 17:09
@wwwjn (Contributor) left a comment


Thank you! I'm slightly leaning towards using kwargs instead of adding these parameters one by one, because StatefulDataLoader() supports a lot of fields and it's hard to say which of them are "common" across different use cases.

Can you explain more on "but that can easily complicate things"? We could just pass all the kwargs to StatefulDataLoader and let it check correctness. wdyt @tianyu-l
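The "let the loader check correctness" idea can be sketched with a stand-in class; `StatefulDataLoaderStub` and `build_dataloader` below are hypothetical names, not the real API:

```python
# Stand-in for StatefulDataLoader, for illustration only.
class StatefulDataLoaderStub:
    def __init__(self, dataset, num_workers=0, pin_memory=False):
        self.dataset = dataset
        self.num_workers = num_workers
        self.pin_memory = pin_memory


def build_dataloader(dataset, **kwargs):
    # Forward everything; an unsupported key raises TypeError in the
    # constructor itself, so no separate validation layer is needed.
    return StatefulDataLoaderStub(dataset, **kwargs)


dl = build_dataloader([1, 2, 3], num_workers=2)
```

The upside is zero duplication; the downside (raised earlier in the thread) is that errors surface inside the inner library rather than at torchtitan's boundary.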

Comment on lines 89 to 92:

```python
num_workers=num_workers,
persistent_workers=persistent_workers,
prefetch_factor=prefetch_factor,
pin_memory=pin_memory,
```
Contributor

> I was thinking about introducing a catch all kwargs to make it easier to specify args but that can easily complicate things (validation checks, duplication, existing defined named args in function definitions etc).

These are valid concerns. For now I'm leaning towards keeping things simple by passing **kwargs around.

Does it make sense if we only make these args explicit when calling the actual __init__ of StatefulDataLoader, and don't pass all **kwargs from the input of ParallelAwareDataloader? The point is to not accidentally hit an error inside StatefulDataLoader.
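A minimal sketch of this suggestion, with stand-in class names (not the actual torchtitan code): the wrapper exposes named parameters at its own boundary and forwards them explicitly, so an unsupported option fails before ever reaching the inner loader.

```python
# All names below are illustrative stand-ins.
class InnerLoader:  # stand-in for StatefulDataLoader
    def __init__(self, dataset, num_workers=0, persistent_workers=False,
                 prefetch_factor=None, pin_memory=False):
        self.num_workers = num_workers
        self.pin_memory = pin_memory


class ParallelAwareDataloaderSketch:
    def __init__(self, dataset, num_workers=0, persistent_workers=False,
                 prefetch_factor=None, pin_memory=False):
        # Only these named fields are forwarded; a bad option is rejected
        # here, at the wrapper's signature, not deep inside InnerLoader.
        self.inner = InnerLoader(
            dataset,
            num_workers=num_workers,
            persistent_workers=persistent_workers,
            prefetch_factor=prefetch_factor,
            pin_memory=pin_memory,
        )


dl = ParallelAwareDataloaderSketch([1, 2, 3], num_workers=2, pin_memory=True)
```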

```python
self,
dataset: IterableDataset,
dp_rank: int,
dp_world_size: int,
```
Contributor

Could you help change this: let's keep at most one positional arg (dataset) and make the others keyword-only.
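The requested signature can be sketched with Python's bare `*` marker, which makes everything after `dataset` keyword-only (the class name below is illustrative):

```python
# Illustrative sketch of a keyword-only signature.
class LoaderSketch:
    def __init__(self, dataset, *, dp_rank: int, dp_world_size: int):
        # `dataset` may be passed positionally; dp_rank and dp_world_size
        # must be named at the call site, so calls stay self-documenting.
        self.dataset = dataset
        self.dp_rank = dp_rank
        self.dp_world_size = dp_world_size


ok = LoaderSketch([0, 1], dp_rank=0, dp_world_size=2)
```

Calling `LoaderSketch([0, 1], 0, 2)` would raise a TypeError, since the extra positional arguments are no longer accepted.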

@divyanshk
Author

Thanks @tianyu-l @wwwjn! Updated the PR with the kwargs-based approach. I initially didn't do this to avoid any confusion on the user's part, because we provide batch_size and collate_fn (in mm_datasets) internally. I resolved that by making explicitly defined internal args take precedence, and added a warning for users in config.py, so that should help. The error from wrong kwargs (if any) will be thrown in torchtitan itself; it won't go down to StatefulDataLoader.
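The precedence rule described here (internally defined args win over user kwargs, with a warning) might look roughly like the following; the helper name and exact message are hypothetical:

```python
# Hypothetical helper illustrating the precedence rule, not torchtitan's code.
import warnings


def merge_dataloader_kwargs(user_kwargs, internal_kwargs):
    merged = dict(user_kwargs)
    for key, value in internal_kwargs.items():
        if key in merged and merged[key] != value:
            # Make the override visible instead of silently dropping input.
            warnings.warn(
                f"'{key}' is set internally by torchtitan; "
                f"ignoring user-provided value {merged[key]!r}"
            )
        merged[key] = value
    return merged


merged = merge_dataloader_kwargs(
    {"num_workers": 2, "batch_size": 64},   # from the user's config
    {"batch_size": 8, "collate_fn": None},  # set internally
)
```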

@tianyu-l (Contributor) left a comment


Looks good in general.

The CPU unit test in CI didn't run. Could you double-check?

Also, please add a GPU integration test; see inline comments.

```
- batch_size: Determined by training.local_batch_size
- collate_fn: Set by the dataset-specific collator
Example (TOML config file):
```
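The TOML example itself is truncated in this view. As a hedged sketch only, based on the kwargs used elsewhere in this thread, such a section might look like the following (the exact table and key names are assumptions):

```toml
# Hypothetical sketch; actual table/key names in torchtitan may differ.
[training.dataloader.kwargs]
num_workers = 2
pin_memory = true
prefetch_factor = 2
```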
Contributor

Could you add a dedicated test for the dataloader with kwargs passed through?
https://github.com/pytorch/torchtitan/blob/main/tests/integration_tests/features.py

Author

Added a GPU integration test. To be able to use the CLI to pass in the kwargs, I added a tyro rule. I am not super familiar with tyro, so please have a look.

Also, shout-out to the integration test setup. Love that we can do a quick mini-GPU run as part of feature testing.

```python
OverrideDefinitions(
    [
        [
            '--training.dataloader.kwargs \'{"num_workers": 2, "pin_memory": true, "prefetch_factor": 2}\'',
```
Contributor

Instead of letting the CLI accept a dict, can we just do

```
--training.dataloader.kwargs.num_workers 2 --training.dataloader.kwargs.pin_memory true ...
```
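The dotted form implies that each `--a.b.c value` pair builds one nested key. A toy parser sketching that mapping (tyro's actual handling is far more sophisticated; this is illustration only):

```python
# Toy illustration of dotted CLI flags -> nested dict; not tyro's behavior.
def parse_dotted_flags(argv):
    out = {}
    for flag, raw in zip(argv[::2], argv[1::2]):
        *parents, leaf = flag.lstrip("-").split(".")
        node = out
        for part in parents:
            node = node.setdefault(part, {})
        # Crude literal conversion, enough for the illustration.
        if raw in ("true", "false"):
            value = raw == "true"
        else:
            try:
                value = int(raw)
            except ValueError:
                value = raw
        node[leaf] = value
    return out


flags = parse_dotted_flags(
    ["--training.dataloader.kwargs.num_workers", "2",
     "--training.dataloader.kwargs.pin_memory", "true"]
)
```

This shape keeps each option shell-friendly and individually overridable, at the cost of typing the full dotted path per option.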
