[QEff Finetune] Adding dataset padding changes #478

Open
wants to merge 2 commits into base: main

Conversation

quic-swatia (Contributor):

No description provided.

Signed-off-by: Swati Allabadi <[email protected]>
@quic-meetkuma (Contributor) left a comment:

Please generate perplexity (ppl) numbers across different DDP device counts and gradient accumulation steps to make this change concrete.

@@ -64,19 +64,35 @@ def get_dataloader_kwargs(train_config, dataset, dataset_processer, split):

def get_dataloader(tokenizer, dataset_config, train_config, split: str = "train"):
dataset = get_preprocessed_dataset(tokenizer, dataset_config, split, context_length=train_config.context_length)
dl_kwargs = get_dataloader_kwargs(train_config, dataset, tokenizer, split)
dataset = dataset.select(range(0, 10))
Contributor:

Why is the dataset being sliced to pick only the first 10 samples?

Contributor Author:

This was added for local experiments. It has been removed in the PR.

dl_kwargs = get_dataloader_kwargs(train_config, dataset, tokenizer, split)
dataset = dataset.select(range(0, 10))
dataset = dataset.map(lambda x: {"input_length": len(x["input_ids"])})
dataset = dataset.sort("input_length")
Contributor:

Why sorting?

Contributor Author:

The sorting is done here on the non-padded dataset, instead of in sampler.py on the padded dataset, to keep the dummy samples at the end.
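
For reference, a toy illustration of the ordering argument (the lengths below are made up): sorting the real samples before the dummy rows are appended keeps the dummies at the tail, whereas sorting the already-padded dataset would interleave them with real samples.

```python
# Toy token lengths; dummy rows are copies of a real row, so they have a length too.
real_lengths = [57, 12, 34]
dummy_lengths = [20, 20]

# Sort the real samples first, then append the dummies: dummies stay at the tail.
print(sorted(real_lengths) + dummy_lengths)   # [12, 34, 57, 20, 20]

# Sorting after padding would interleave the dummies with real samples.
print(sorted(real_lengths + dummy_lengths))   # [12, 20, 20, 34, 57]
```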

dummy_row["labels"] = [-100] * len(dummy_row["labels"])
padding_size = 0
num_replicas = dist.get_world_size()
if len(dataset) % num_replicas > 0:
Contributor:

Batch size > 1 is not considered here.

Contributor Author:

I had skipped it since we are not supporting bs > 1 as of now. I have made this change for the sake of completeness.
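
For reference, a minimal sketch of a padding-size computation that also accounts for a per-device batch size greater than 1 (the function name and signature are illustrative, not the PR's actual code):

```python
import torch.distributed as dist

def compute_padding_size(num_samples: int, batch_size: int = 1) -> int:
    """Number of dummy rows needed to reach a multiple of the global batch."""
    # Global batch = per-device batch size * number of DDP replicas.
    global_batch = dist.get_world_size() * batch_size
    remainder = num_samples % global_batch
    return (global_batch - remainder) % global_batch
```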

if len(dataset) % num_replicas > 0:
padding_size = num_replicas - len(dataset) % num_replicas

dummy_data = [dummy_row.copy() for _ in range(padding_size)]
Contributor:

L78 to L80 can be refactored.

Contributor Author (@quic-swatia, Jun 24, 2025):

I found this way cleaner. Please suggest if you have a better idea.

@@ -192,6 +192,9 @@ def train(
) as verifier:
model_outputs = model(**batch)
loss = model_outputs.loss # Forward call
if (batch["labels"] != -100).sum() == 0:
loss = loss.nan_to_num(nan=0.0)
Contributor:

The loss is zeroed for dummy samples, but the total loss is still averaged across all samples, including the dummy ones. Correct it.

Contributor Author:

Done.
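
For reference, a minimal sketch of the correction being asked for (names are illustrative, not the PR's exact code): count only the batches that contain real labels, and average the accumulated loss over that count rather than over all batches.

```python
import torch

def step_loss_and_count(loss: torch.Tensor, batch: dict):
    """Return (loss to use, 1 if the batch has real labels else 0)."""
    if (batch["labels"] != -100).sum() == 0:
        # Dummy batch: zero its NaN loss and keep it out of the average.
        return loss.nan_to_num(nan=0.0), 0
    return loss, 1

# Sketch of use inside the loop:
#   loss, is_real = step_loss_and_count(model_outputs.loss, batch)
#   total_loss += loss.detach()
#   num_real_batches += is_real
# and at the end of the epoch:
#   epoch_loss = total_loss / max(num_real_batches, 1)  # dummy batches excluded
```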

@@ -237,6 +242,9 @@ def train(
step_metric_val = float(torch.exp(loss.detach().float()))
train_step_metric.append(step_metric_val)

# Accumulate gradients
loss = loss / train_config.gradient_accumulation_steps
Contributor:

This should change. E.g. 100 samples, global batch size 30:

For the first 30 samples, loss = loss / 30
For the next 30 samples, loss = loss / 30
For the next 30 samples, loss = loss / 30
For the last 10 samples, loss = loss / 10, not loss = loss / 30

Contributor Author:

Done.
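
For reference, a minimal sketch of the divisor the reviewer describes (illustrative helper, not the PR's code): each micro-batch loss is divided by the number of steps actually in its accumulation window, which is smaller for the final partial window.

```python
def accumulation_divisor(step: int, total_steps: int, grad_accum_steps: int) -> int:
    """Number of micro-batches in the accumulation window containing `step` (0-indexed)."""
    window_start = (step // grad_accum_steps) * grad_accum_steps
    return min(grad_accum_steps, total_steps - window_start)

# Example: 100 steps, grad_accum_steps = 30
#   steps 0-29  -> divisor 30
#   steps 30-59 -> divisor 30
#   steps 60-89 -> divisor 30
#   steps 90-99 -> divisor 10
# Usage sketch: loss = loss / accumulation_divisor(step, total_steps, grad_accum_steps)
```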

@@ -439,6 +447,9 @@ def evaluation_helper(model, train_config, eval_dataloader, device):
outputs = model(**batch)
loss = outputs.loss

if (batch["labels"] != -100).sum() == 0:
loss = loss.nan_to_num(nan=0.0)
Contributor:

Same comment as above

Contributor Author:

Done.


dummy_data = [dummy_row.copy() for _ in range(padding_size)]
dummy_dataset = datasets.Dataset.from_list(dummy_data)
combined_dataset = datasets.concatenate_datasets([dataset, dummy_dataset])
Contributor:

Try to enclose this padding logic in a separate function.

Contributor Author:

Done.
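
For reference, a minimal sketch of how the padding logic could be enclosed in its own helper (function name and signature are illustrative; the PR's actual refactor may differ):

```python
import datasets
import torch.distributed as dist

def pad_dataset_to_world_size(dataset, dummy_row: dict, batch_size: int = 1):
    """Append dummy rows so the dataset length is a multiple of world_size * batch_size."""
    global_batch = dist.get_world_size() * batch_size
    padding_size = (global_batch - len(dataset) % global_batch) % global_batch
    if padding_size == 0:
        return dataset
    dummy_data = [dummy_row.copy() for _ in range(padding_size)]
    dummy_dataset = datasets.Dataset.from_list(dummy_data)
    return datasets.concatenate_datasets([dataset, dummy_dataset])
```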

Signed-off-by: Swati Allabadi <[email protected]>