LTX2TextEncoder silently falls back to wrong tokenizer → every prompt produces identical output #26

@smstosky

Summary

When generating with prince-canuma/LTX-2.3-distilled (model-repo) plus Lightricks/LTX-2 (text-encoder-repo), LTX2TextEncoder.load() sile$

Repro is trivial. This affects both distilled and dev pipelines in T2V and I2V modes — effectively all generation is prompt-agnostic on th$

Version

  • mlx-video at 9ab4826d20e39286af13a26615c33b403d48be72 (current main, installed via `uv pip install git+https://github.com/Blaizzy/mlx-$
  • Python 3.11.15, MLX 0.31.1, transformers 5.5.0
  • Model repo: prince-canuma/LTX-2.3-distilled
  • Text encoder repo: Lightricks/LTX-2
  • Platform: macOS 15 (Apple Silicon, MLX/Metal)

Reproduction

Generate two videos with radically different prompts, same seed, T2V mode:

export HF_HOME=/path/to/hf/cache

python -m mlx_video.models.ltx_2.generate \
  --prompt "a cat sitting on a red couch in a cozy living room, warm lighting" \
  --model-repo prince-canuma/LTX-2.3-distilled \
  --text-encoder-repo Lightricks/LTX-2 \
  --pipeline distilled \
  --num-frames 25 --fps 24 --height 512 --width 512 --seed 1337 \
  --tiling none \
  --output-path /tmp/cat.mp4

python -m mlx_video.models.ltx_2.generate \
  --prompt "a rocket launching from a desert with smoke and fire, bright sunny sky" \
  --model-repo prince-canuma/LTX-2.3-distilled \
  --text-encoder-repo Lightricks/LTX-2 \
  --pipeline distilled \
  --num-frames 25 --fps 24 --height 512 --width 512 --seed 1337 \
  --tiling none \
  --output-path /tmp/rocket.mp4

md5 /tmp/cat.mp4 /tmp/rocket.mp4

Expected: two visibly different videos with different MD5s.

Actual: two bit-identical files.

MD5 (/tmp/cat.mp4)    = 566cc11c069b40656311c6846365484f
MD5 (/tmp/rocket.mp4) = 566cc11c069b40656311c6846365484f   

The same behavior occurs with --pipeline dev (tested with --steps 15 --cfg-scale 3.0) and in I2V mode with --image /path/to/image.png.
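
For anyone repeating the comparison on a platform without the macOS `md5` command, a minimal Python sketch of the same check follows. The helper names `file_digest` and `outputs_identical` are hypothetical and not part of mlx-video; the sketch only hashes whatever files it is given.

```python
import hashlib
from pathlib import Path


def file_digest(path, algo="md5", chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in chunks."""
    h = hashlib.new(algo)
    with Path(path).open("rb") as f:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def outputs_identical(*paths):
    """True if every given file hashes to the same digest."""
    return len({file_digest(p) for p in paths}) == 1
```

Calling `outputs_identical("/tmp/cat.mp4", "/tmp/rocket.mp4")` after the two runs above should return `True` while the bug is present and `False` once prompts actually influence the output.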

Root cause

In mlx_video/models/ltx_2/text_encoder.py around lines 877–897, the tokenizer load has a three-step fallback chain:

tokenizer_path = model_path / "tokenizer"
if tokenizer_path.exists():
    self.processor = AutoTokenizer.from_pretrained(
        str(tokenizer_path), trust_remote_code=True
    )
else:
    try:
        self.processor = AutoTokenizer.from_pretrained(
            text_encoder_path, trust_remote_code=True
        )
    except Exception:
        self.processor = AutoTokenizer.from_pretrained( 
            "google/gemma-3-12b-it", trust_remote_code=True
        )
self.processor.padding_side = "left"

With prince-canuma/LTX-2.3-distilled as model_path and Lightricks/LTX-2 as text_encoder_path:

  1. Step 1 fails: prince-canuma/LTX-2.3-distilled does not ship a top-level tokenizer/ subdirectory.
  2. Step 2 fails: text_encoder_path points at the root of the Lightricks/LTX-2 repo, but the tokenizer files in that repo live in `$
  3. Step 3 succeeds and silently loads google/gemma-3-12b-it — which is a valid Gemma tokenizer, but its vocab does not match the LTX-fin$

Downstream effect, verified by adding prints to LTX2TextEncoder.encode():

Prompt: 'a cat on a couch'
  input_ids[0]: [0, 0, 0, 0, 0, ..., 0, 3]
  attention_mask sum: 1

Prompt: 'a rocket launching'
  input_ids[0]: [0, 0, 0, 0, 0, ..., 0, 3]
  attention_mask sum: 1

Every prompt → 1023 pad tokens + one token id 3. num_valid = 1 at every call site that checks the attention mask. The V2 feature extracto$

For comparison, loading the tokenizer from the correct path Lightricks/LTX-2/tokenizer/ produces:

input_ids:       [0, 0, ..., 0, 2, 236746, 5866, 580, 496, 29919]
attention_mask:  [0, 0, ..., 0, 1, 1,      1,    1,   1,   1]

Real tokens, real attention mask, real embedding differences across prompts. Video output then differs across prompts as expected.
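
The difference between the two outputs above suggests a cheap sanity check that could run right after tokenization. The sketch below is an assumption about how such a guard might look, not code from mlx-video: `count_valid_tokens` and `assert_prompt_tokenized` are hypothetical names, and the threshold of two valid tokens assumes a working tokenizer emits at least a BOS token plus one prompt token for any non-empty prompt.

```python
def count_valid_tokens(attention_mask):
    """Number of non-padding positions in a single attention-mask row."""
    return sum(int(m) for m in attention_mask)


def assert_prompt_tokenized(attention_mask, min_valid=2):
    """Raise if a prompt collapsed to (almost) nothing but padding.

    min_valid=2 is an assumed threshold: BOS plus at least one real token.
    """
    n_valid = count_valid_tokens(attention_mask)
    if n_valid < min_valid:
        raise ValueError(
            f"Tokenizer produced only {n_valid} valid token(s); "
            "the wrong tokenizer may have been loaded."
        )
    return n_valid


def prompts_distinguishable(ids_a, ids_b):
    """True if two prompts produced different token id sequences."""
    return list(ids_a) != list(ids_b)
```

With the degenerate output shown earlier (attention mask sum of 1), `assert_prompt_tokenized` raises, and `prompts_distinguishable` returns `False` for the cat and rocket prompts, which is exactly the symptom reported here.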

Suggested fix

The simplest fix is to try text_encoder_path / "tokenizer" as an additional candidate before the exception handler, and remove (or at least $

tokenizer_candidates = [
    model_path / "tokenizer",
    Path(str(text_encoder_path)) / "tokenizer",
    Path(str(text_encoder_path)),
]

# Initialise so the error message below works even if every candidate
# is skipped because it does not exist on disk.
last_err = None
for candidate in tokenizer_candidates:
    # Every candidate is a Path, so a plain existence check is enough.
    if not candidate.exists():
        continue
    try:
        self.processor = AutoTokenizer.from_pretrained(
            str(candidate), trust_remote_code=True
        )
        break
    except Exception as e:
        last_err = e
else:
    # No candidate loaded: fail loudly instead of silently falling back.
    raise RuntimeError(
        f"Could not load a tokenizer from model_path={model_path} or "
        f"text_encoder_path={text_encoder_path}. Last error: {last_err}"
    )

Removing the google/gemma-3-12b-it fallback entirely is probably the right call — it's almost always the wrong tokenizer for the LTX-fine-tu$

Workaround (for users hitting this now)

Monkey-patch LTX2TextEncoder.load() at runtime to override self.processor with the correct tokenizer from the `text_encoder_path / "tokeni$
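
A minimal sketch of such a monkey-patch follows. It rests on several assumptions: that `load()` is an ordinary instance method (its real signature is not shown in this issue, so the wrapper forwards `*args, **kwargs` untouched), that overriding `self.processor` after `load()` returns is sufficient, and that the helper name `patch_load_with_tokenizer` is hypothetical. The commented usage lines assume the class is importable from `mlx_video.models.ltx_2.text_encoder` and that the tokenizer lives in the `tokenizer` subfolder of Lightricks/LTX-2.

```python
import functools


def patch_load_with_tokenizer(encoder_cls, make_tokenizer):
    """Wrap encoder_cls.load so self.processor is replaced afterwards.

    make_tokenizer is a zero-argument callable returning the tokenizer
    to use. Returns the original load so the patch can be undone.
    """
    original_load = encoder_cls.load

    @functools.wraps(original_load)
    def patched_load(self, *args, **kwargs):
        result = original_load(self, *args, **kwargs)
        tokenizer = make_tokenizer()
        # Mirror the padding side set by the original load().
        tokenizer.padding_side = "left"
        self.processor = tokenizer
        return result

    encoder_cls.load = patched_load
    return original_load


# Hypothetical usage (import path and subfolder are assumptions):
#
# from transformers import AutoTokenizer
# from mlx_video.models.ltx_2.text_encoder import LTX2TextEncoder
#
# patch_load_with_tokenizer(
#     LTX2TextEncoder,
#     lambda: AutoTokenizer.from_pretrained(
#         "Lightricks/LTX-2", subfolder="tokenizer"
#     ),
# )
```

The patch has to be applied before the generate entry point constructs the encoder, for example from a small wrapper script that applies it and then invokes the pipeline.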

Impact

Prior to discovering this, I spent several hours running seed sweeps, sigma-schedule patches, and resolution experiments trying to "fix prompt$

Happy to submit a PR if that would help.
