Question about how Dataset.weight controls data mixing ratio in SFT preprocessing #78

@HelloWorld686

Hello,

I'm trying to reproduce the SFT training described in the paper using the nano3/stage1_sft recipe. During data preparation, I need to mix multiple datasets, and I set a different weight value for each dataset in the data blend, hoping to control their proportions in the final training data.

Could you please clarify:

1. Does the current data preprocessing pipeline (data_prep.py) actually use weight to perform weighted data mixing, or is weight only passed as metadata to the downstream training framework?

2. If weighted mixing is supported, what is the correct configuration? Is there an example available?

3. If it is not supported in the current version, what is the recommended way to mix multiple datasets (e.g., should it be handled dynamically by Megatron-Bridge during training)?
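
To make the first question concrete, here is a minimal sketch of the behavior I mean by "weighted mixing". It is not taken from data_prep.py or Megatron-Bridge; the dataset names and both helper functions are hypothetical and only illustrate the expectation that each training sample's source dataset is drawn with probability proportional to its weight.

```python
import random
from collections import Counter


def normalize_weights(weights):
    """Turn raw per-dataset weights into sampling probabilities that sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return {name: w / total for name, w in weights.items()}


def sample_sources(weights, num_samples, seed=0):
    """Pick a source dataset for each sample, proportionally to its weight."""
    probs = normalize_weights(weights)
    names = list(probs)
    rng = random.Random(seed)  # fixed seed so the blend is reproducible
    return rng.choices(names, weights=[probs[n] for n in names], k=num_samples)


# Hypothetical blend: dataset_a should make up roughly 75% of the samples.
blend = {"dataset_a": 3.0, "dataset_b": 1.0}
picks = sample_sources(blend, 10_000)
counts = Counter(picks)
print({name: counts[name] / len(picks) for name in blend})
```

If weight is only carried along as metadata, the observed proportions after preprocessing would instead follow the raw dataset sizes rather than the configured weights.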

Thank you!

Labels: enhancement
