Fix MultiTalk + SCAIL pose compatibility, expose audio_scale on SilentEmbeds #1931

Open
Li1177 wants to merge 4 commits into kijai:main from Li1177:main

Conversation

@Li1177

Li1177 commented Feb 8, 2026

Summary

  • Fix audio_cross_attn shape mismatch when used with SCAIL pose conditioning: SCAIL's ref_latent modifies grid_sizes temporal dimension and may append extra tokens to x, causing audio_cross_attn to fail with shape errors. The fix detects when audio temporal frames differ from grid_sizes or x token count, and corrects the shape/slice before calling audio_cross_attn.
  • Expose audio_scale parameter on MultiTalkSilentEmbeds: Previously hardcoded to 1.0. Users can now adjust silence conditioning strength (e.g. 0.3-0.5 for less color shift while still suppressing unwanted mouth movements). Parameter range/step matches MultiTalkWav2VecEmbeds for consistency.

Backward compatibility

The model.py fix only activates when grid_sizes[0][0] != audio_N_t or x_normed.shape[1] != audio_tokens. In all existing workflows (MultiTalk, InfiniteTalk, LongCat, etc.) these values match, so the else branch is taken — identical to the original code path. Zero impact on non-SCAIL workflows.
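The activation condition and the strip-then-pad idea can be sketched in plain Python as shape bookkeeping only. This is a rough illustration, not the code in model.py: the function name `plan_audio_alignment`, its arguments, and the assumption that any extra tokens sit at the end of the sequence are all hypothetical.

```python
def plan_audio_alignment(grid_sizes, seq_len, audio_frames):
    """Return (audio_grid, keep, extra) for the audio cross-attention call.

    grid_sizes   : (frames, height, width) of the latent token grid
    seq_len      : number of tokens actually present in x
    audio_frames : temporal length of the audio embedding

    Hypothetical helper: it only models the shape arithmetic, no tensors.
    """
    frames, height, width = grid_sizes
    audio_tokens = audio_frames * height * width

    if frames == audio_frames and seq_len == audio_tokens:
        # Everything already matches: keep the original code path.
        return (frames, height, width), seq_len, 0

    # Mismatch: take the temporal length from the audio embedding,
    # attend only over the tokens that fit that grid, and remember how
    # many trailing tokens must be zero-padded back afterwards.
    keep = min(seq_len, audio_tokens)
    extra = seq_len - keep
    return (audio_frames, height, width), keep, extra


# Matching case: nothing changes.
print(plan_audio_alignment((4, 2, 3), 24, 4))   # ((4, 2, 3), 24, 0)
# Grid reports one extra frame and 6 extra tokens were appended to x.
print(plan_audio_alignment((5, 2, 3), 30, 4))   # ((4, 2, 3), 24, 6)
```

In the matching case the helper returns its inputs unchanged, which mirrors the claim that existing workflows follow the original path.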

The audio_scale parameter defaults to 1.0, preserving existing behavior.
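For readers unfamiliar with how a ComfyUI node exposes a numeric input, a minimal sketch follows. The class name `SilentEmbedsSketch`, the `process` method, and the `min`/`max`/`step` values are placeholders chosen for illustration; only the default of 1.0 comes from this PR, and the real range and step are whatever MultiTalkWav2VecEmbeds uses.

```python
class SilentEmbedsSketch:
    """Hypothetical stand-in for a ComfyUI node, not the wrapper's class."""

    @classmethod
    def INPUT_TYPES(cls):
        return {
            "required": {
                # default=1.0 is from the PR text; min/max/step are
                # placeholder values, not the wrapper's actual limits.
                "audio_scale": (
                    "FLOAT",
                    {"default": 1.0, "min": 0.0, "max": 10.0, "step": 0.01},
                ),
            }
        }

    FUNCTION = "process"

    def process(self, audio_scale=1.0):
        # A real node would build the silent embeddings here; this sketch
        # only passes the chosen scale through so it can be inspected.
        return ({"audio_scale": audio_scale},)


spec = SilentEmbedsSketch.INPUT_TYPES()["required"]["audio_scale"]
print(spec[1]["default"])                          # 1.0
print(SilentEmbedsSketch().process(0.3)[0])        # {'audio_scale': 0.3}
```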

Test plan

  • SCAIL pose + MultiTalk SilentEmbeds: generates successfully, no shape errors
  • Tested with various audio_scale values (0.3, 0.5, 1.0)
  • Non-SCAIL MultiTalk workflows unaffected

…ning

When SCAIL pose tokens are appended to the sequence and SCAIL ref_latent
adds +1 to grid_sizes temporal dimension, audio_cross_attn fails to reshape
tensors correctly. This fix derives the correct temporal dimension from the
audio embedding itself and strips extra tokens before audio attention,
padding zeros back afterward. Only activates when extra tokens are detected;
normal workflows are unaffected.

Allow users to adjust the strength of silence conditioning for
controlling mouth-closing effect. Default remains 1.0 for backward
compatibility.
…ers from audio frames

V2 patch only handled extra tokens (SCAIL pose), but missed the case
where SCAIL ref_latent increases grid_sizes temporal dim without adding
extra tokens to x. Now always checks audio temporal alignment.

- Add comment explaining audio temporal alignment fix
- Use grid_sizes[0].clone() instead of torch.tensor() to avoid repeated allocation
- Match audio_scale parameter range/step with MultiTalkWav2VecEmbeds

@stanislavdrca stanislavdrca left a comment

I updated the model.py file in the wanvideo/modules folder, but I keep getting an error:
"RuntimeError: shape '[1, 4775, 21, 40, 128]' is invalid for input of size 513433600"

When I disconnect/bypass the multitalk_embeds input, the workflow works fine.

In the attachment you can find the log file.

log.txt

Any thoughts? Thanks! <3

@Li1177

Li1177 commented Feb 25, 2026

Hi @stanislavdrca,

After looking at your log file carefully, the error is not related to our PR at all.

The actual culprit is the Enhance-A-Video (FETA) node in your workflow.

Here's what's happening:

  • SCAIL adds extra conditioning tokens to the sequence.
  • The get_feta_scores() function in enhance_a_video/enhance.py assumes the total sequence length is always perfectly divisible by num_frames, but those extra SCAIL tokens break that assumption.
  • Result: 100280 ÷ 21 = 4775 remainder 5, so the reshape fails.

Quick fix: Remove or bypass the Enhance-A-Video node from your workflow. MultiTalk + SCAIL should work fine without it.

This is a separate incompatibility between SCAIL and the Enhance feature, and is not something our PR addresses.
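The figures in the reported error can be checked with a few lines of arithmetic. This is only a sanity check on the numbers quoted above; reading the last two dimensions of the shape as attention heads and head size is an assumption, not something the log states.

```python
# Values copied from the error message:
# shape '[1, 4775, 21, 40, 128]' is invalid for input of size 513433600
input_size = 513_433_600
num_frames = 21
num_heads, head_dim = 40, 128   # assumed meaning of the last two dims

# Tokens in the sequence that reached the reshape.
seq_len, leftover = divmod(input_size, num_heads * head_dim)

# The reshape needs seq_len to split evenly into num_frames groups.
tokens_per_frame, remainder = divmod(seq_len, num_frames)

print(seq_len, leftover)             # 100280 0
print(tokens_per_frame, remainder)   # 4775 5
```

A non-zero remainder means the sequence holds tokens that do not belong to any full frame, which is what makes the reshape invalid.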

@stanislavdrca

Hi @Li1177,

Thank you so much for the quick reply! I bypassed/deleted the Enhance-A-Video (FETA) node, but unfortunately, at around 50%, I still get an error in the Wan Video Sampler V2: "shape '[22, 32, 2, 40, 128]' is invalid for input of size 6881280"
Am I missing something?

I am attaching the new log and the workflow.

Any suggestions? Thank you! <3

log2.txt
Wanv2.1_14B_SCAIL.json

@Li1177

Li1177 commented Feb 25, 2026

The error occurs because you seem to be using an outdated version of this fix (only the first commit). The line grid_tokens = grid_sizes[0].prod().item() was specifically updated in a later commit to handle the exact case where ref_latent (22) and audio frames (21) don't match.
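The numbers in the second error message are consistent with a 22-versus-21 frame mismatch, as the quick check below shows. It only does arithmetic on the quoted shape and size; treating the leading dimension as the frame count is an assumption.

```python
from math import prod

# Values copied from the error message:
# shape '[22, 32, 2, 40, 128]' is invalid for input of size 6881280
target_shape = (22, 32, 2, 40, 128)
input_size = 6_881_280

requested = prod(target_shape)        # elements the reshape asks for
per_slice = prod(target_shape[1:])    # elements per leading-dim slice

# How many leading-dim slices the actual tensor can fill.
fitting, remainder = divmod(input_size, per_slice)

print(requested)            # 7208960
print(fitting, remainder)   # 21 0
```

The tensor holds exactly 21 slices, while the reshape asks for 22.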

Since this PR has not yet been merged upstream, the most reliable way to get it working right now is to pull the complete branch from my fork directly: Li1177:main. The fix for your specific issue is in commit 3289178.

Also, please double-check your model loading. I've attached a screenshot below showing the MultiTalk model I typically use for this workflow, as yours appears to be different from what's expected.

Hope this helps!

[Screenshot: the MultiTalk model referenced above]

@stanislavdrca

Hey @Li1177,

Thank you again for the quick reply. I followed your instructions and got it working! <3
You're the king!

@Li1177

Li1177 commented Feb 27, 2026

Glad to hear it's working now.

Just out of curiosity, what exactly did the trick for you in the end? Did pulling the rest of the commits from the Li1177:main branch fix the shape mismatch, or was it switching to the correct model (or both)?

For some context on why I wrote this patch: my main use case is generating 2D anime characters. In those workflows, it's notoriously difficult to stop the characters from making involuntary mouth movements or random lip-syncing artifacts.

Since the MultiTalkSilentEmbeds node was previously crashing in this pipeline, there was no good way to control those micro-expressions. Now that it plays nicely with SCAIL, you can adjust the audio_scale parameter to significantly suppress that unwanted talking behavior.

@stanislavdrca

I did both: I updated the model.py file and switched to the model you referred to.
The project I am currently working on involves photorealistic historical figures that need to come alive. I use video references with SCAIL for the movement and InfiniteTalk for the dialogue. Unfortunately, for dialogue I get much better results from the wavespeed.ai online service than from generating locally in ComfyUI, but for consistent facial expressions, MultiTalk SilentEmbeds does a great job. On top of that, I use ReActor for face boosting and detailing.

Hope Kijai implements your fix in the official Wrapper update!

Thanks again! <3
