Added EverAnimate Support #2023
I tested the EverAnimate PR locally and found two temporal alignment issues in the long-generation loop when using the default EverAnimate setting `num_video_anchor_latents = 4`. I had Codex (GPT 5.5high) fix them for me.

The PR adds extra latent anchor slots for EverAnimate, but pose and face conditioning were still built for the original WanAnimate window length.

Issue 1: pose latent mismatch

Error: `RuntimeError: The size of tensor a (24) must match the size of tensor b (21) at non-singleton dimension 2`

Location: `wanvideo/modules/model.py`

Cause: With N=4 anchors, `noise.shape[1]` becomes `latent_window_size + 4`. The model applies pose embeddings to `x_[:, :, 1:]`, so it expects `noise.shape[1] - 1` pose latent positions. For `frame_window_size=81`, the `pose_input_slice` length is 21.

Fix used locally: `pose_input_slice = vae.encode([pose_image_slice], device, tiled=tiled_vae, pbar=False).to(dtype)`

Issue 2: face adapter temporal mismatch

After fixing pose, face conditioning failed with: `einops.EinopsError:`

Location: `wanvideo/modules/wananimate/face_blocks.py`

Cause: The latent sequence had 25 temporal groups, but the face encoder produced 22. Again, this is short by N - 1 latent groups. Since the face encoder temporal stride is 4 RGB frames per latent group, N=4 needs 12 blank RGB frames prepended.

Fix used locally:

`if wananim_face_pixels is None and wananim_ref_masks is not None:`

`if everanim_mode and face_images_in is not None:`

I also found a small callback bug in the same loop:

`callback_latent = (latent_model_input.to(device) - noise_pred.to(device) * t.to(device) / 1000)`

`t` is not the loop timestep variable in this branch. I changed it to:

`callback_latent = (latent_model_input.to(device) - noise_pred.to(device) * timestep.to(device) / 1000)`

With these patches, the long EverAnimate loop is temporally consistent for N=4.
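The numbers quoted in the two issues can be reproduced with a small arithmetic sketch. The helper names below are hypothetical and do not come from the PR. The `(frames - 1) // 4 + 1` latent-length rule and the `+ 1` face pad are assumptions inferred from the figures in this comment (81 frames giving 21 pose latents and 22 face groups), not from the model code.

```python
# Hypothetical helpers for checking the temporal bookkeeping described above.
# None of these names exist in the PR; they only mirror the quoted numbers.

def latent_len(num_frames: int, stride: int = 4) -> int:
    """Temporal latent count for an RGB clip (assumed rule: 81 frames -> 21 latents)."""
    return (num_frames - 1) // stride + 1


def pose_front_pad(num_anchors: int) -> int:
    """Blank pose latents to prepend: the model skips one slot, anchors add N."""
    return max(num_anchors - 1, 0)


def face_blank_frames(num_anchors: int, stride: int = 4) -> int:
    """Blank RGB face frames to prepend: N - 1 latent groups at `stride` frames each."""
    return max(num_anchors - 1, 0) * stride


def face_groups(num_face_frames: int, stride: int = 4) -> int:
    """Temporal groups seen by the face adapter (assumed: latent count plus one pad)."""
    return latent_len(num_face_frames, stride) + 1


frame_window_size = 81
num_anchors = 4

window_latents = latent_len(frame_window_size)   # pose latents before padding
noise_len = window_latents + num_anchors         # temporal length of the noise tensor
expected_pose = noise_len - 1                    # model applies pose to x_[:, :, 1:]
padded_pose = window_latents + pose_front_pad(num_anchors)
padded_face = face_groups(frame_window_size + face_blank_frames(num_anchors))

print(window_latents, noise_len, expected_pose, padded_pose, padded_face)
```

With `num_anchors = 1` both pads are zero and the counts already agree, which is consistent with the original WanAnimate window working without padding.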
Aligns the EverAnimate looping path with the canonical diffsynth reference (wan_video_svi.py) so the N-anchor + M-motion streaming scheme matches what the rank-32 LoRA was trained with.

nodes_sampler.py:
- Pose latents: front-pad by N-1 in everanim mode. Kijai's model adds pose to x_[:, :, 1:] (skip 1), but EverAnimate prepends N anchors, so the model expects noise.shape[1]-1 pose positions. Padding lands the real pose on x_[N:], reproducing the canonical after_patch_embedding (x[:, :, N:] += pose). Fixes RuntimeError on pose embedding.
- Face pixels: prepend (N-1)*4 blank frames in everanim mode so the face encoder yields noise.shape[1] temporal groups (4x compression + the model's +1 pad), matching canonical pad_face=N. Fixes einops mismatch in FaceBlock.
- Window stepping: derive refert_num from the motion carry M, not the anchor count N. The old (N-1)*4+1 caused 4*(N-M) duplicated RGB frames at every window boundary; now step == kept content (contiguous output).
- Motion mask: keep motion mask at 0 (soft context) instead of 1 on continuation, matching canonical (only anchors get mask=1). mask=1 was out-of-distribution for the LoRA.
- Callback: use 'timestep' (this loop's variable) instead of undefined 't', which raised NameError when a preview/callback was attached.

everanimate/nodes.py:
- frame_window_size default 81 -> 77 to match EverAnimate's trained clip length (frames_per_clip=77 -> 20 content latents per window).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
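The window-stepping and clip-length figures in the commit message can be checked with a short sketch. The function names are hypothetical. The replacement formula `(M-1)*4+1` is an assumption: the commit only states the old `(N-1)*4+1` and the resulting `4*(N-M)` duplicates, and this is the simplest form consistent with both. `M = 2` is an illustrative value, not a documented default.

```python
# Hypothetical bookkeeping sketch; names and the M-based formula are assumptions.

def content_latents(frames_per_clip: int, stride: int = 4) -> int:
    """Content latents per window for an RGB clip of the given length."""
    return (frames_per_clip - 1) // stride + 1


def refert_frames(latent_count: int, stride: int = 4) -> int:
    """RGB frames covered by `latent_count` carried latents: (count - 1) * stride + 1."""
    if latent_count < 1:
        raise ValueError("latent_count must be at least 1")
    return (latent_count - 1) * stride + 1


def duplicated_frames(num_anchors: int, motion_carry: int, stride: int = 4) -> int:
    """Extra RGB frames repeated per window boundary when stepping by N instead of M."""
    return refert_frames(num_anchors, stride) - refert_frames(motion_carry, stride)


num_anchors = 4   # N, from the default discussed in this PR
motion_carry = 2  # M, illustrative value only

old_refert = refert_frames(num_anchors)    # old rule: (N-1)*4+1
new_refert = refert_frames(motion_carry)   # assumed new rule: (M-1)*4+1
overlap = duplicated_frames(num_anchors, motion_carry)

print(old_refert, new_refert, overlap)
print(content_latents(81), content_latents(77))
```

Under these assumptions the overlap equals `4 * (N - M)` for any `M <= N`, and the 77-frame default gives the 20 content latents per window quoted in the commit message.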
No description provided.