Summary
When generating with prince-canuma/LTX-2.3-distilled (model-repo) plus Lightricks/LTX-2 (text-encoder-repo), LTX2TextEncoder.load() sile$
Repro is trivial. This affects both distilled and dev pipelines in T2V and I2V modes — effectively all generation is prompt-agnostic on th$
Version
- mlx-video at 9ab4826d20e39286af13a26615c33b403d48be72 (current main, installed via `uv pip install git+https://github.com/Blaizzy/mlx-$
- Python 3.11.15, MLX 0.31.1, transformers 5.5.0
- Model repo: prince-canuma/LTX-2.3-distilled
- Text encoder repo: Lightricks/LTX-2
- Platform: macOS 15 (Apple Silicon, MLX/Metal)
Reproduction
Generate two videos with radically different prompts, same seed, T2V mode:
export HF_HOME=/path/to/hf/cache

python -m mlx_video.models.ltx_2.generate \
    --prompt "a cat sitting on a red couch in a cozy living room, warm lighting" \
    --model-repo prince-canuma/LTX-2.3-distilled \
    --text-encoder-repo Lightricks/LTX-2 \
    --pipeline distilled \
    --num-frames 25 --fps 24 --height 512 --width 512 --seed 1337 \
    --tiling none \
    --output-path /tmp/cat.mp4

python -m mlx_video.models.ltx_2.generate \
    --prompt "a rocket launching from a desert with smoke and fire, bright sunny sky" \
    --model-repo prince-canuma/LTX-2.3-distilled \
    --text-encoder-repo Lightricks/LTX-2 \
    --pipeline distilled \
    --num-frames 25 --fps 24 --height 512 --width 512 --seed 1337 \
    --tiling none \
    --output-path /tmp/rocket.mp4

md5 /tmp/cat.mp4 /tmp/rocket.mp4
Expected: two visibly different videos with different MD5s.
Actual: two bit-identical files.
MD5 (/tmp/cat.mp4) = 566cc11c069b40656311c6846365484f
MD5 (/tmp/rocket.mp4) = 566cc11c069b40656311c6846365484f
The same behavior occurs with --pipeline dev (tested with --steps 15 --cfg-scale 3.0) and in I2V mode with --image /path/to/image.png.
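For anyone who wants to script the comparison instead of running md5 by hand (the md5 command above is macOS-specific), here is a minimal sketch using only the Python standard library. The helper names `file_digest` and `outputs_identical` are made up for this example and are not part of mlx-video.

```python
import hashlib
from pathlib import Path


def file_digest(path, algo="md5", chunk_size=1 << 20):
    """Return the hex digest of a file, reading in chunks to bound memory use."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def outputs_identical(paths, algo="md5"):
    """True if every file in `paths` has the same digest.

    For videos generated from different prompts with the same seed,
    True suggests the prompt had no effect on the output.
    """
    digests = {file_digest(Path(p), algo) for p in paths}
    return len(digests) == 1
```

With the two files from the reproduction, `outputs_identical(["/tmp/cat.mp4", "/tmp/rocket.mp4"])` returning True corresponds to the bug described here.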
Root cause
In mlx_video/models/ltx_2/text_encoder.py around lines 877–897, the tokenizer load has a three-step fallback chain:
tokenizer_path = model_path / "tokenizer"
if tokenizer_path.exists():
    self.processor = AutoTokenizer.from_pretrained(
        str(tokenizer_path), trust_remote_code=True
    )
else:
    try:
        self.processor = AutoTokenizer.from_pretrained(
            text_encoder_path, trust_remote_code=True
        )
    except Exception:
        self.processor = AutoTokenizer.from_pretrained(
            "google/gemma-3-12b-it", trust_remote_code=True
        )
self.processor.padding_side = "left"
With prince-canuma/LTX-2.3-distilled as model_path and Lightricks/LTX-2 as text_encoder_path:
- Step 1 fails: prince-canuma/LTX-2.3-distilled does not ship a top-level tokenizer/ subdirectory.
- Step 2 fails: text_encoder_path points at the root of the Lightricks/LTX-2 repo, but the tokenizer files in that repo live in `$
- Step 3 succeeds and silently loads google/gemma-3-12b-it, which is a valid Gemma tokenizer, but its vocab does not match the LTX-fin$
Downstream effect, verified by adding prints to LTX2TextEncoder.encode():
Prompt: 'a cat on a couch'
input_ids[0]: [0, 0, 0, 0, 0, ..., 0, 3]
attention_mask sum: 1
Prompt: 'a rocket launching'
input_ids[0]: [0, 0, 0, 0, 0, ..., 0, 3]
attention_mask sum: 1
Every prompt → 1023 pad tokens + one token id 3. num_valid = 1 at every call site that checks the attention mask. The V2 feature extracto$
For comparison, loading the tokenizer from the correct path Lightricks/LTX-2/tokenizer/ produces:
input_ids: [0, 0, ..., 0, 2, 236746, 5866, 580, 496, 29919]
attention_mask: [0, 0, ..., 0, 1, 1, 1, 1, 1, 1]
Real tokens, real attention mask, real embedding differences across prompts. Video output then differs across prompts as expected.
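A cheap guard against this failure mode is to check the attention mask right after tokenization. The sketch below works on plain Python lists so it runs without any model download. The function names and the `min_valid` threshold are assumptions for illustration and do not come from mlx-video.

```python
def count_valid_tokens(attention_mask):
    """Number of positions the attention mask marks as real (non-pad) tokens."""
    return sum(1 for m in attention_mask if m)


def looks_degenerate(input_ids, attention_mask, min_valid=2):
    """Heuristic: flag an encoding with fewer than `min_valid` real tokens.

    A non-empty prompt normally yields a BOS token plus at least one
    content token, so a single valid position is suspicious.
    """
    if len(input_ids) != len(attention_mask):
        raise ValueError("input_ids and attention_mask must have equal length")
    return count_valid_tokens(attention_mask) < min_valid


# Encoding observed with the fallback tokenizer: 1023 pads + token id 3.
bad_ids = [0] * 1023 + [3]
bad_mask = [0] * 1023 + [1]

# Encoding from the correct tokenizer; the padding run is shortened here,
# since the original report elides it.
good_ids = [0, 0, 2, 236746, 5866, 580, 496, 29919]
good_mask = [0, 0, 1, 1, 1, 1, 1, 1]

print(looks_degenerate(bad_ids, bad_mask))    # True
print(looks_degenerate(good_ids, good_mask))  # False
```

A check along these lines in `encode()` would have turned the silent failure into a visible one.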
Suggested fix
The simplest fix is to try text_encoder_path / "tokenizer" as an additional candidate before the exception handler, and remove (or at least $
# Requires `from pathlib import Path` at module level.
tokenizer_candidates = [
    model_path / "tokenizer",
    Path(str(text_encoder_path)) / "tokenizer",
    Path(str(text_encoder_path)),
]
last_err = None  # stays None if every candidate is skipped as missing
for candidate in tokenizer_candidates:
    if isinstance(candidate, Path) and not candidate.exists():
        continue
    try:
        self.processor = AutoTokenizer.from_pretrained(
            str(candidate), trust_remote_code=True
        )
        break
    except Exception as e:
        last_err = e
        continue
else:
    # Runs only if the loop finished without `break`, i.e. nothing loaded.
    raise RuntimeError(
        f"Could not load a tokenizer from model_path={model_path} or "
        f"text_encoder_path={text_encoder_path}. Last error: {last_err}"
    )
Removing the google/gemma-3-12b-it fallback entirely is probably the right call — it's almost always the wrong tokenizer for the LTX-fine-tu$
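To make the fail-loudly behavior easy to unit-test without downloading a tokenizer, the same candidate chain can be factored into a small helper that takes the loader as a parameter. This is only a sketch: `load_first_available` is a hypothetical name, and in mlx-video the loader would presumably wrap `AutoTokenizer.from_pretrained`.

```python
from pathlib import Path
from typing import Callable, Iterable, Optional, TypeVar

T = TypeVar("T")


def load_first_available(
    candidates: Iterable[Path], loader: Callable[[str], T]
) -> T:
    """Return loader(path) for the first existing candidate that loads.

    Raises RuntimeError instead of falling back to an unrelated default,
    so a wrong or missing tokenizer directory cannot go unnoticed.
    """
    last_err: Optional[Exception] = None
    tried = []
    for candidate in candidates:
        tried.append(str(candidate))
        if not candidate.exists():
            continue  # skip paths that are not on disk
        try:
            return loader(str(candidate))
        except Exception as e:  # remember the failure, try the next one
            last_err = e
    raise RuntimeError(
        f"Could not load a tokenizer from any of {tried}. "
        f"Last error: {last_err!r}"
    )


# Intended use inside load() (assumed, not verified against the repo):
#   self.processor = load_first_available(
#       [model_path / "tokenizer",
#        Path(text_encoder_path) / "tokenizer",
#        Path(text_encoder_path)],
#       lambda p: AutoTokenizer.from_pretrained(p, trust_remote_code=True),
#   )
```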
Workaround (for users hitting this now)
Monkey-patch LTX2TextEncoder.load() at runtime to override self.processor with the correct tokenizer from the `text_encoder_path / "tokeni$
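A rough sketch of such a patch follows. It assumes `load()` is an instance method that sets `self.processor`, as the snippet in the root-cause section suggests. The helper name `patch_load_with_tokenizer`, the import location of `LTX2TextEncoder`, and the local tokenizer path in the usage comment are assumptions, so adjust them to your checkout and cache layout.

```python
import functools


def patch_load_with_tokenizer(encoder_cls, make_tokenizer):
    """Wrap encoder_cls.load so self.processor is replaced after loading.

    `make_tokenizer` is a zero-argument callable returning the tokenizer
    to use. Returns the original load function so the patch can be undone.
    """
    original_load = encoder_cls.load

    @functools.wraps(original_load)
    def patched_load(self, *args, **kwargs):
        result = original_load(self, *args, **kwargs)
        tokenizer = make_tokenizer()
        # Mirror the original code, which switches to left padding.
        tokenizer.padding_side = "left"
        self.processor = tokenizer
        return result

    encoder_cls.load = patched_load
    return original_load


# Usage (paths and import location are assumptions):
#   from transformers import AutoTokenizer
#   from mlx_video.models.ltx_2.text_encoder import LTX2TextEncoder
#   patch_load_with_tokenizer(
#       LTX2TextEncoder,
#       lambda: AutoTokenizer.from_pretrained("/path/to/Lightricks/LTX-2/tokenizer"),
#   )
```

Apply the patch before the pipeline constructs and loads the text encoder. Otherwise the instance keeps the fallback tokenizer.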
Impact
Prior to discovering this, I spent several hours running seed sweeps, sigma-schedule patches, and resolution experiments trying to "fix prompt$
Happy to submit a PR if that would help.