Fix MultiTalk + SCAIL pose compatibility, expose audio_scale on SilentEmbeds #1931
Li1177 wants to merge 4 commits
…ning
When SCAIL pose tokens are appended to the sequence and SCAIL ref_latent adds +1 to the grid_sizes temporal dimension, audio_cross_attn fails to reshape tensors correctly. This fix derives the correct temporal dimension from the audio embedding itself and strips extra tokens before audio attention, padding zeros back afterward. It only activates when extra tokens are detected; normal workflows are unaffected.
Allow users to adjust the strength of silence conditioning to control the mouth-closing effect. The default remains 1.0 for backward compatibility.
…ers from audio frames
The V2 patch only handled extra tokens (SCAIL pose), but missed the case where SCAIL ref_latent increases the grid_sizes temporal dim without adding extra tokens to x. The fix now always checks audio temporal alignment.
- Add comment explaining audio temporal alignment fix
- Use grid_sizes[0].clone() instead of torch.tensor() to avoid repeated allocation
- Match audio_scale parameter range/step with MultiTalkWav2VecEmbeds
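The commit messages above describe the fix in words: take the temporal size from the audio embedding, pass only the matching tokens to audio attention, and pad zeros back afterward. The following is a minimal sketch of that shape bookkeeping only, not the code in model.py. The names `AudioAttnPlan` and `plan_audio_attn` are hypothetical, and it assumes the audio-aligned tokens come first in the sequence, which the commit messages do not state.

```python
from typing import NamedTuple, Tuple


class AudioAttnPlan(NamedTuple):
    frames: int       # temporal size used when reshaping for audio attention
    keep_tokens: int  # leading tokens of x handed to audio attention
    pad_tokens: int   # zero tokens appended to the attention output afterward


def plan_audio_attn(
    grid_size: Tuple[int, int, int], x_tokens: int, audio_frames: int
) -> AudioAttnPlan:
    """Hypothetical helper: derive the audio-attention shape from the audio itself.

    grid_size is (frames, height, width) in latent tokens. The grid frame count
    may be one larger than audio_frames when a reference latent is present, and
    x_tokens may include extra appended tokens (e.g. pose tokens).
    """
    _grid_frames, height, width = grid_size
    # Trust the audio embedding's frame count rather than the grid's.
    frames = audio_frames
    keep_tokens = frames * height * width
    if keep_tokens > x_tokens:
        raise ValueError("audio embedding covers more tokens than x provides")
    # Everything beyond the audio-aligned tokens is stripped, then zero-padded back.
    return AudioAttnPlan(frames, keep_tokens, x_tokens - keep_tokens)


# Normal workflow: grid and audio agree, nothing is stripped or padded.
normal = plan_audio_attn((21, 4, 4), x_tokens=21 * 4 * 4, audio_frames=21)

# SCAIL-like case (made-up sizes): one extra grid frame plus 10 appended tokens.
scail = plan_audio_attn((22, 4, 4), x_tokens=22 * 4 * 4 + 10, audio_frames=21)
```

In the normal case `pad_tokens` is 0, which corresponds to the unchanged code path that the backward-compatibility note refers to.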
stanislavdrca left a comment
I updated the model.py file in the wanvideo/modules folder, but I keep getting an error:
"RuntimeError: shape '[1, 4775, 21, 40, 128]' is invalid for input of size 513433600"
When I disconnect/bypass the multitalk_embeds, the workflow works fine.
The log file is attached.
Any thoughts? Thanks! <3
Hi @stanislavdrca, after looking at your log file carefully, the error is not related to our PR at all. The actual culprit is the Enhance-A-Video (FETA) node in your workflow. Here's what's happening: SCAIL adds extra conditioning tokens to the sequence. This is a separate incompatibility between SCAIL and the Enhance feature, and is not something our PR addresses.
Hi @Li1177, thank you so much for the quick reply! I bypassed/deleted the Enhance-A-Video (FETA) node, but unfortunately, at around 50%, I still get an error in the Wan Video Sampler V2: "shape '[22, 32, 2, 40, 128]' is invalid for input of size 6881280". I am attaching the new log and the workflow. Any suggestions? Thank you! <3
The error occurs because you seem to be using an outdated version of this fix (only the first commit). The line `grid_tokens = grid_sizes[0].prod().item()` was specifically updated in a later commit to handle the exact case where `ref_latent` (22) and audio frames (21) don't match. Since this PR is still pending merge by the original author, the most reliable way to get it working right now is to pull the complete branch from my fork directly: Li1177:main. The fix for your specific issue is in commit 3289178. Also, please double-check your model loading. I've attached a screenshot below showing the MultiTalk model I typically use for this workflow, as yours appears to be different from what's expected. Hope this helps!
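The 22-versus-21 mismatch can be checked directly against the numbers in the reported error message. The snippet below is a small arithmetic check, not code from the PR. Reading the first dimension of the requested shape as a frame count is an assumption, though it is consistent with the explanation above.

```python
from math import prod

# Values copied from the reported error message.
requested_shape = (22, 32, 2, 40, 128)
input_size = 6_881_280

# Elements the reshape asked for versus elements actually present.
requested_size = prod(requested_shape)    # 7_208_960
per_frame = prod(requested_shape[1:])     # 327_680 elements per leading index

# The input divides evenly into 21 such slices, one fewer than the 22 requested.
frames_available = input_size // per_frame
remainder = input_size % per_frame
print(requested_size, frames_available, remainder)
```

The tensor holds exactly 21 slices while the reshape expects 22, so the failure is a pure off-by-one in the leading dimension rather than a mismatch in the spatial or head dimensions.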
Hey @Li1177, thank you again for the quick reply. I followed your instructions and got it working! <3
Glad to hear it's working now. Just out of curiosity, what exactly did the trick for you in the end? Did pulling the rest of the commits from the Li1177:main branch fix the shape mismatch, or was it switching to the correct model (or both)? For some context on why I wrote this patch: my main use case is generating 2D anime characters. In those workflows, it's notoriously difficult to stop the characters from making involuntary mouth movements or random lip-syncing artifacts. Since the MultiTalkSilentEmbeds node was previously crashing in this pipeline, there was no good way to control those micro-expressions. Now that it plays nicely with SCAIL, you can adjust the audio_scale parameter to significantly suppress that unwanted talking behavior.
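As a rough illustration of what lowering audio_scale means, here is a minimal sketch that treats it as a multiplier on the audio-attention output before that output is added back to the hidden states. This is an assumption about where the scale is applied, not a description of the wrapper's code, and `apply_audio_residual` is a hypothetical name.

```python
from typing import List


def apply_audio_residual(
    hidden: List[float], audio_out: List[float], audio_scale: float = 1.0
) -> List[float]:
    """Hypothetical sketch: add a scaled audio-conditioning term to hidden states.

    audio_scale=1.0 applies the full conditioning (the default behavior);
    smaller values weaken it, and 0.0 removes it entirely.
    """
    return [h + audio_scale * a for h, a in zip(hidden, audio_out)]


hidden = [1.0, 2.0]
silence_out = [0.5, -1.0]  # made-up values standing in for silence conditioning

full = apply_audio_residual(hidden, silence_out)                    # default 1.0
half = apply_audio_residual(hidden, silence_out, audio_scale=0.5)   # weaker
off = apply_audio_residual(hidden, silence_out, audio_scale=0.0)    # disabled
```

Under this reading, values below 1.0 trade some of the mouth-closing effect for a smaller change to the rest of the image.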
I did both: I updated the model.py file and switched to the model you referred to. I hope Kijai implements your fix in the official Wrapper update! Thanks again! <3

Summary
- … `grid_sizes` temporal dimension and may append extra tokens to `x`, causing `audio_cross_attn` to fail with shape errors. The fix detects when the audio temporal frame count differs from `grid_sizes` or the `x` token count, and corrects the shape/slice before calling `audio_cross_attn`.
- `audio_scale` parameter on `MultiTalkSilentEmbeds`: Previously hardcoded to 1.0. Users can now adjust silence conditioning strength (e.g. 0.3-0.5 for less color shift while still suppressing unwanted mouth movements). Parameter range/step matches `MultiTalkWav2VecEmbeds` for consistency.

Backward compatibility
The model.py fix only activates when `grid_sizes[0][0] != audio_N_t` or `x_normed.shape[1] != audio_tokens`. In all existing workflows (MultiTalk, InfiniteTalk, LongCat, etc.) these values match, so the else branch is taken, identical to the original code path. Zero impact on non-SCAIL workflows.
The `audio_scale` parameter defaults to 1.0, preserving existing behavior.

Test plan