Skip to content

LabelSmoother ignores num_items_in_batch, inflating gradients by ~gradient_accumulation_steps with label smoothing #46731

@Incheonkirin

Description

@Incheonkirin

System Info

  • transformers 5.13.0.dev0 (main, 9fd7b67; also reproduced on 36193bf)
  • torch 2.11.0, Python 3.13.5, macOS 15.6 arm64 (forced CPU via use_cpu=True; not device-specific — also reproduces on the default MPS device)

Who can help?

@muellerzr @SunMarc

Reproduction

import torch, datasets
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling, TrainerCallback)

name = "trl-internal-testing/tiny-Qwen2ForCausalLM-2.5"
tok = AutoTokenizer.from_pretrained(name); tok.pad_token = tok.eos_token
ds = datasets.load_dataset("Salesforce/wikitext", "wikitext-2-raw-v1", split="train[:200]")
ds = ds.filter(lambda e: len(e["text"].strip()) > 0)
ds = ds.map(lambda e: tok(e["text"], max_length=16, padding="max_length", truncation=True), batched=True)
coll = DataCollatorForLanguageModeling(tokenizer=tok, mlm=False)

def run(bs, gas):
    out = []
    class CB(TrainerCallback):
        def on_log(self, a, s, c, logs=None, **k):
            if logs and "grad_norm" in logs: out.append(logs["grad_norm"])
    m = AutoModelForCausalLM.from_pretrained(name, dtype=torch.float32)
    args = TrainingArguments("/tmp/.ls_repro", per_device_train_batch_size=bs,
        gradient_accumulation_steps=gas, label_smoothing_factor=0.1, max_steps=3,
        logging_steps=1, learning_rate=1e-4, max_grad_norm=0.0, report_to=[], use_cpu=True)
    Trainer(m, args, train_dataset=ds, data_collator=coll, callbacks=[CB()]).train()
    return out

base = run(8, 1)   # effective batch 8, no accumulation
gas  = run(1, 8)   # same effective batch 8, via 8 accumulation steps
print("base (bs=8, gas=1):", [round(x, 4) for x in base])
print("gas  (bs=1, gas=8):", [round(x, 4) for x in gas])
print("ratio gas/base    :", [round(g / b, 2) for b, g in zip(base, gas) if b])

Output on current main:

base (bs=8, gas=1): [0.3281, 0.3209, 0.3021]
gas  (bs=1, gas=8): [2.9545, 3.0083, 2.6038]
ratio gas/base    : [9.01, 9.37, 8.62]

Both runs have the same effective batch (8), so the grad norms should match. With label_smoothing_factor=0.1 the accumulation run is ~gradient_accumulation_steps× larger. Set label_smoothing_factor=0.0 and the ratio is ~1.0.

Expected behavior

Grad norms and logged loss should be independent of how the effective batch is split into accumulation steps — as they already are without label smoothing.

Root cause

With label smoothing the loss comes from LabelSmoother, not the model:

  • _get_num_items_in_batch returns a non-None count because model_accepts_loss_kwargs is True, so training_step skips the 1 / gradient_accumulation_steps normalization (the if (not self.model_accepts_loss_kwargs or num_items_in_batch is None) and self.compute_loss_func is None: guard).
  • But compute_loss calls self.label_smoother(outputs, labels, ...) without num_items_in_batch, and LabelSmoother.__call__ mean-reduces over the current micro-batch's active tokens (trainer_pt_utils.py).
  • So neither path normalizes by the effective batch: the accumulated gradient becomes sum_i(S_i / N_i) instead of sum_i(S_i) / sum_i(N_i).

The grad-accum change that added num_items_in_batch (#34198) only touched trainer.py, not trainer_pt_utils.py, so LabelSmoother was never wired up. The same num_items_in_batch is already passed into the compute_loss_func branch right next to it.

Fix

Passing num_items_in_batch into LabelSmoother and using it as the denominator (falling back to the active-token count when None) restores the ratio:

gas (bs=1, gas=8): [0.3281, 0.3209, 0.3021]   # ratio 1.0

This shifts the loss/grad magnitude by ~gradient_accumulation_steps for existing label-smoothing + accumulation runs (the old values were over-scaled). Single-device and default DDP token averaging become exact; legacy DataParallel keeps the same n_gpu approximation the compute_loss_func path already has.

I have the fix and a red→green regression test ready — happy to open a PR.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions