Fix MultiTalk + SCAIL pose compatibility, expose audio_scale on SilentEmbeds by Li1177 · Pull Request #1931 · kijai/ComfyUI-WanVideoWrapper

Li1177 · 2026-02-08T04:12:53Z

Summary

Fix audio_cross_attn shape mismatch when used with SCAIL pose conditioning: SCAIL's ref_latent modifies grid_sizes temporal dimension and may append extra tokens to x, causing audio_cross_attn to fail with shape errors. The fix detects when audio temporal frames differ from grid_sizes or x token count, and corrects the shape/slice before calling audio_cross_attn.
Expose audio_scale parameter on MultiTalkSilentEmbeds: Previously hardcoded to 1.0. Users can now adjust silence conditioning strength (e.g. 0.3-0.5 for less color shift while still suppressing unwanted mouth movements). Parameter range/step matches MultiTalkWav2VecEmbeds for consistency.

Backward compatibility

The model.py fix only activates when grid_sizes[0][0] != audio_N_t or x_normed.shape[1] != audio_tokens. In all existing workflows (MultiTalk, InfiniteTalk, LongCat, etc.) these values match, so the else branch is taken — identical to the original code path. Zero impact on non-SCAIL workflows.
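As a rough illustration of the guard described above, the following sketch works on shapes only. It is not the code from model.py: the function name `plan_audio_alignment` is hypothetical, and it assumes that any extra SCAIL tokens sit at the end of the sequence, so that slicing off the tail and zero-padding it back afterward is enough.

```python
def plan_audio_alignment(grid_sizes, x_tokens, audio_n_t):
    """Return (aligned_grid, keep_tokens, pad_tokens) for the audio attention call.

    grid_sizes : (T, H, W) token grid reported for the video latents
    x_tokens   : actual sequence length of x (may include extra tokens)
    audio_n_t  : number of temporal frames in the audio embedding
    """
    t, h, w = grid_sizes
    audio_tokens = audio_n_t * h * w

    # Original code path: everything already agrees, nothing to correct.
    if t == audio_n_t and x_tokens == audio_tokens:
        return (t, h, w), x_tokens, 0

    if x_tokens < audio_tokens:
        raise ValueError("x has fewer tokens than the audio embedding expects")

    # Corrected path: take the temporal size from the audio embedding,
    # keep only the tokens it covers, and remember how many zeros to
    # pad back so the output matches the original sequence length.
    # Assumption: extra tokens are trailing.
    return (audio_n_t, h, w), audio_tokens, x_tokens - audio_tokens


# Non-SCAIL workflow: values match, nothing changes.
print(plan_audio_alignment((21, 4, 5), 420, 21))
# ref_latent bumped T to 22 but added no tokens to x.
print(plan_audio_alignment((22, 4, 5), 420, 21))
# ref_latent bumped T and 5 extra pose tokens were appended.
print(plan_audio_alignment((22, 4, 5), 425, 21))
```

The first call models the unchanged path, which is why existing workflows should see no difference.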

The audio_scale parameter defaults to 1.0, preserving existing behavior.
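For readers unfamiliar with how a ComfyUI node exposes a float input, here is a minimal sketch. The class `SilentEmbedsSketch` and the helper `apply_audio_residual` are hypothetical, the min/max/step values are placeholders rather than the ones used by MultiTalkWav2VecEmbeds, and modelling audio_scale as a weight on the audio branch is an assumption about how the value is consumed.

```python
class SilentEmbedsSketch:
    """Hypothetical stand-in for a node exposing audio_scale; not the wrapper's class."""

    @classmethod
    def INPUT_TYPES(cls):
        return {
            "required": {
                # Placeholder bounds; only the default of 1.0 comes from the PR text.
                "audio_scale": ("FLOAT", {"default": 1.0, "min": 0.0, "max": 10.0, "step": 0.01}),
            }
        }


def apply_audio_residual(x, audio_out, audio_scale=1.0):
    """Assumed usage: audio_scale weights the audio branch before it is added to x."""
    return [xi + audio_scale * ai for xi, ai in zip(x, audio_out)]


x = [1.0, 2.0]
audio_out = [2.0, 4.0]
print(apply_audio_residual(x, audio_out))       # default 1.0, previous behavior
print(apply_audio_residual(x, audio_out, 0.5))  # weaker silence conditioning
```

Under this assumption, leaving audio_scale at its default reproduces the previously hardcoded behavior, and lower values reduce the contribution of the silent embedding.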

Test plan

SCAIL pose + MultiTalk SilentEmbeds: generates successfully, no shape errors
Tested with various audio_scale values (0.3, 0.5, 1.0)
Non-SCAIL MultiTalk workflows unaffected

…ning When SCAIL pose tokens are appended to the sequence and SCAIL ref_latent adds +1 to grid_sizes temporal dimension, audio_cross_attn fails to reshape tensors correctly. This fix derives the correct temporal dimension from the audio embedding itself and strips extra tokens before audio attention, padding zeros back afterward. Only activates when extra tokens are detected; normal workflows are unaffected.

Allow users to adjust the strength of silence conditioning for controlling mouth-closing effect. Default remains 1.0 for backward compatibility.

…ers from audio frames V2 patch only handled extra tokens (SCAIL pose), but missed the case where SCAIL ref_latent increases grid_sizes temporal dim without adding extra tokens to x. Now always checks audio temporal alignment.

- Add comment explaining audio temporal alignment fix
- Use grid_sizes[0].clone() instead of torch.tensor() to avoid repeated allocation
- Match audio_scale parameter range/step with MultiTalkWav2VecEmbeds

stanislavdrca

I updated the model.py file in the wanvideo/modules folder, but I keep getting an error:
"RuntimeError: shape '[1, 4775, 21, 40, 128]' is invalid for input of size 513433600"

When I disconnect/bypass the multitalk_embeds, the workflow works fine.

In the attachment you can find the log file.

log.txt

Any thoughts? Thanks! <3

Li1177 · 2026-02-25T01:27:31Z

Hi @stanislavdrca,

After looking at your log file carefully, the error is not related to our PR at all.

The actual culprit is the Enhance-A-Video (FETA) node in your workflow.

Here's what's happening:

SCAIL adds extra conditioning tokens to the sequence
The get_feta_scores() function in enhance_a_video/enhance.py assumes the total sequence length is always perfectly divisible by num_frames — but those extra SCAIL tokens break that assumption
Result: 100280 ÷ 21 = 4775 remainder 5 → reshape fails
Quick fix: Remove or bypass the Enhance-A-Video node from your workflow. MultiTalk + SCAIL should work fine without it.

This is a separate incompatibility between SCAIL and the Enhance feature, and is not something our PR addresses.
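The arithmetic behind that diagnosis can be checked with a few lines of plain Python. This is only a sanity check on the numbers in the error message, not code from enhance.py; the helper name `check_frame_reshape` is hypothetical, and it assumes the trailing 40 and 128 in the reported shape are fixed per-token dimensions.

```python
def check_frame_reshape(numel, num_frames, trailing_dims):
    """Split a flat element count into (seq_len, tokens_per_frame, leftover)."""
    per_token = 1
    for d in trailing_dims:
        per_token *= d
    seq_len, rem = divmod(numel, per_token)
    if rem:
        raise ValueError("element count is not a whole number of tokens")
    tokens_per_frame, leftover = divmod(seq_len, num_frames)
    return seq_len, tokens_per_frame, leftover


# Numbers taken from the reported error:
# shape '[1, 4775, 21, 40, 128]' is invalid for input of size 513433600
seq_len, per_frame, leftover = check_frame_reshape(513433600, 21, (40, 128))
print(seq_len, per_frame, leftover)  # 100280 4775 5
```

A non-zero leftover means the sequence cannot be split evenly into 21 frames, which is why the reshape raises.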

stanislavdrca · 2026-02-25T13:39:16Z

Hi @Li1177 ,

thank you so much for the quick reply! I bypassed/deleted the Enhance-A-Video (FETA) node, but unfortunately, at around 50%, I still get an error in the Wan Video Sampler V2: "shape '[22, 32, 2, 40, 128]' is invalid for input of size 6881280"
Am I missing something?

I am attaching the new log and the workflow.

Any suggestions? Thank you! <3

log2.txt
Wanv2.1_14B_SCAIL.json

Li1177 · 2026-02-25T15:12:20Z

The error occurs because you seem to be using an outdated version of this fix (only the first commit). The line grid_tokens = grid_sizes[0].prod().item() was specifically updated in a later commit to handle the exact case where ref_latent (22) and audio frames (21) don't match.
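The 22-versus-21 mismatch can be read directly off the numbers in the error message. The snippet below is a small sanity check rather than anything from the patch; `implied_leading_dim` is a hypothetical helper, and treating the leading dimension of the requested shape as the frame count is an interpretation.

```python
from math import prod


def implied_leading_dim(numel, trailing_dims):
    """Return the leading dimension that numel actually supports."""
    lead, rem = divmod(numel, prod(trailing_dims))
    if rem:
        raise ValueError("element count does not fit the trailing dimensions")
    return lead


# Numbers taken from the reported error:
# shape '[22, 32, 2, 40, 128]' is invalid for input of size 6881280
requested = [22, 32, 2, 40, 128]
actual = implied_leading_dim(6881280, requested[1:])
print(requested[0], actual)  # 22 21
```

The tensor holds 21 frames of data while the reshape asks for 22, matching the ref_latent and audio frame counts mentioned above.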

Since this PR is still waiting to be merged by the original author, the most reliable way to get it working right now is to pull the complete branch directly from my fork: Li1177:main. The fix for your specific issue is in commit 3289178.

Also, please double-check your model loading. I've attached a screenshot below showing the MultiTalk model I typically use for this workflow, as yours appears to be different from what's expected.

Hope this helps!

stanislavdrca · 2026-02-26T10:19:54Z

Hey @Li1177,

thank you again for the quick reply. I followed your instructions and got it working! <3
You're the king!

Li1177 · 2026-02-27T06:35:49Z

Glad to hear it's working now.

Just out of curiosity, what exactly did the trick for you in the end? Did pulling the rest of the commits from the Li1177:main branch fix the shape mismatch, or was it switching to the correct model (or both)?

For some context on why I wrote this patch: my main use case is generating 2D anime characters. In those workflows, it's notoriously difficult to stop the characters from making involuntary mouth movements or random lip-syncing artifacts.

Since the MultiTalkSilentEmbeds node was previously crashing in this pipeline, there was no good way to control those micro-expressions. Now that it plays nicely with SCAIL, you can adjust the audio_scale parameter to significantly suppress that unwanted talking behavior.

stanislavdrca · 2026-02-27T11:59:25Z

I did both: I updated the model.py file and switched to the model you referred to.
The project I am currently working on involves photorealistic historical figures that need to come alive. I use video references with SCAIL for the movement and InfiniteTalk for the dialogue. Unfortunately, for dialogue I get much better results from the wavespeed.ai online service than from generating locally in Comfy, but for consistent facial expressions, the MultiTalk silent embeds do a great job. On top of that, I use ReActor for face boosting and detailing.

Hope Kijai implements your fix in the official Wrapper update!

Thanks again! <3

Li1177 added 4 commits February 7, 2026 14:47

Expose audio_scale parameter on MultiTalkSilentEmbeds node

db81a06

Polish model.py patch and align audio_scale parameter style

1f09a3e

stanislavdrca reviewed Feb 24, 2026

Li1177 wants to merge 4 commits into kijai:main from Li1177:main

2 participants
