Added EverAnimate Support #2023

Open
0xBeycan wants to merge 2 commits into kijai:main from 0xBeycan:everanimate-integration

Conversation

@0xBeycan

No description provided.

@younestft

younestft commented Jun 1, 2026

I tested the EverAnimate PR locally and found two temporal alignment issues in the long-generation loop with the default EverAnimate settings. I had Codex (GPT 5.5high) fix them for me. The settings I used:

num_video_anchor_latents = 4
num_motion_latents = 1
frame_window_size = 81
pose_images connected
face_images connected

The PR adds extra latent anchor slots for EverAnimate, but pose and face conditioning were still built for the original WanAnimate window length.

Issue 1: pose latent mismatch

Error:

RuntimeError: The size of tensor a (24) must match the size of tensor b (21) at non-singleton dimension 2

Location:

wanvideo/modules/model.py
wananimate_pose_embedding()
x_[:, :, 1:].add_(pose_latents_, alpha=strength)

Cause:

With N=4 anchors, noise.shape[1] becomes latent_window_size + 4. The model applies pose embeddings to x_[:, :, 1:], so it expects noise.shape[1] - 1 pose latent positions.

For frame_window_size=81:

pose_input_slice length = 21
expected pose length = 24
missing = 3 = N - 1
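
The numbers above can be reproduced with a short sketch. It assumes the usual Wan VAE temporal compression (one latent for the first frame, then one per 4 frames), which is consistent with 81 frames encoding to 21 latents as reported. The helper names are hypothetical and are not part of the PR.

```python
def latent_frames(num_frames: int, temporal_stride: int = 4) -> int:
    """Latent length of a clip, assuming 1 + (frames - 1) // stride."""
    return (num_frames - 1) // temporal_stride + 1


def pose_padding_needed(frame_window_size: int, num_anchors: int) -> int:
    """Number of pose latent positions missing in EverAnimate mode."""
    window_latents = latent_frames(frame_window_size)  # encoded pose length
    noise_len = window_latents + num_anchors           # noise.shape[1], per the report
    expected_pose_len = noise_len - 1                   # pose is added to x_[:, :, 1:]
    return expected_pose_len - window_latents


print(latent_frames(81))           # 21
print(pose_padding_needed(81, 4))  # 3, i.e. N - 1
```

Under these assumptions the shortfall is always N - 1, independent of the window size, which is why padding by a fixed N - 1 positions is enough.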

Fix used locally:

pose_input_slice = vae.encode([pose_image_slice], device, tiled=tiled_vae, pbar=False).to(dtype)
if everanim_mode:
    expected_pose_len = noise.shape[1] - 1
    current_pose_len = pose_input_slice.shape[2]
    if current_pose_len < expected_pose_len:
        pad_len = expected_pose_len - current_pose_len
        pose_pad = torch.zeros(
            pose_input_slice.shape[0],
            pose_input_slice.shape[1],
            pad_len,
            pose_input_slice.shape[3],
            pose_input_slice.shape[4],
            device=pose_input_slice.device,
            dtype=pose_input_slice.dtype,
        )
        pose_input_slice = torch.cat([pose_pad, pose_input_slice], dim=2)
    elif current_pose_len > expected_pose_len:
        pose_input_slice = pose_input_slice[:, :, :expected_pose_len]


Issue 2: face adapter temporal mismatch

After fixing pose, face conditioning failed with:

einops.EinopsError:
Error while processing rearrange-reduction pattern "B (L S) H D -> (B L) S H D".
Input tensor shape: torch.Size([1, 39000, 40, 128]).
Additional info: {'L': 22}.
Shape mismatch, can't divide axis of length 39000 in chunks of 22

Location:

wanvideo/modules/wananimate/face_blocks.py
FaceBlock.forward()
q = rearrange(q, "B (L S) H D -> (B L) S H D", L=T)

Cause:

The latent sequence had 25 temporal groups, but the face encoder produced 22. Again, this is short by N - 1 latent groups. Since the face encoder temporal stride is 4 RGB frames per latent group, N=4 needs 12 blank RGB frames prepended.
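
As a sanity check on this arithmetic, here is a minimal sketch. The face-group formula (stride-4 compression plus one extra padded group) is an assumption inferred from the reported numbers, not taken from the face encoder's code, and the function names are hypothetical.

```python
def face_groups(num_face_frames: int, stride: int = 4) -> int:
    """Temporal groups from the face encoder.

    Assumed: stride-4 temporal compression plus one extra padded group,
    which reproduces the reported 22 groups for an 81-frame window.
    """
    return (num_face_frames - 1) // stride + 1 + 1


def extra_face_frames(num_anchors: int, stride: int = 4) -> int:
    """Blank RGB frames to prepend so the face groups match the latents."""
    return max(0, (num_anchors - 1) * stride)


frames, n_anchors, seq_len = 81, 4, 39000

print(face_groups(frames))                                 # 22
print(extra_face_frames(n_anchors))                        # 12
print(face_groups(frames + extra_face_frames(n_anchors)))  # 25

# The rearrange "B (L S) H D -> (B L) S H D" needs seq_len divisible by L.
print(seq_len % 22 == 0)  # False: the reported einops failure
print(seq_len % 25 == 0)  # True: 1560 tokens per temporal group
```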

Fix used locally:

if wananim_face_pixels is None and wananim_ref_masks is not None:
    face_images_in = torch.zeros(1, 3, frame_window_size, 512, 512, device=device, dtype=torch.float32)
elif wananim_face_pixels is not None:
    face_images_in = face_images[:, :, start:end].to(device, torch.float32) if face_images is not None else None

if everanim_mode and face_images_in is not None:
    extra_face_frames = max(0, (everanim_N - 1) * 4)
    if extra_face_frames > 0:
        face_pad = torch.full(
            (
                face_images_in.shape[0],
                face_images_in.shape[1],
                extra_face_frames,
                face_images_in.shape[3],
                face_images_in.shape[4],
            ),
            -1.0,
            device=face_images_in.device,
            dtype=face_images_in.dtype,
        )
        face_images_in = torch.cat([face_pad, face_images_in], dim=2)


I also found a small callback bug in the same loop:

callback_latent = (latent_model_input.to(device) - noise_pred.to(device) * t.to(device) / 1000)

"t" is not the loop timestep variable in this branch. I changed it to:

callback_latent = (latent_model_input.to(device) - noise_pred.to(device) * timestep.to(device) / 1000)

With these patches, the long EverAnimate loop is temporally consistent for N=4.

Aligns the EverAnimate looping path with the canonical diffsynth
reference (wan_video_svi.py) so the N-anchor + M-motion streaming scheme
matches what the rank-32 LoRA was trained with.

nodes_sampler.py:
- Pose latents: front-pad by N-1 in everanim mode. Kijai's model adds
  pose to x_[:, :, 1:] (skip 1), but EverAnimate prepends N anchors, so
  the model expects noise.shape[1]-1 pose positions. Padding lands the
  real pose on x_[:, :, N:], reproducing the canonical
  after_patch_embedding (x[:, :, N:] += pose). Fixes RuntimeError on
  pose embedding.
- Face pixels: prepend (N-1)*4 blank frames in everanim mode so the face
  encoder yields noise.shape[1] temporal groups (4x compression + the
  model's +1 pad), matching canonical pad_face=N. Fixes einops mismatch
  in FaceBlock.
- Window stepping: derive refert_num from the motion carry M, not the
  anchor count N. The old (N-1)*4+1 caused 4*(N-M) duplicated RGB frames
  at every window boundary; now step == kept content (contiguous output).
- Motion mask: keep motion mask at 0 (soft context) instead of 1 on
  continuation, matching canonical (only anchors get mask=1). mask=1 was
  out-of-distribution for the LoRA.
- Callback: use 'timestep' (this loop's variable) instead of undefined
  't', which raised NameError when a preview/callback was attached.

everanimate/nodes.py:
- frame_window_size default 81 -> 77 to match EverAnimate's trained clip
  length (frames_per_clip=77 -> 20 content latents per window).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
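
The window-stepping and clip-length figures in the commit message can be checked with a small sketch. The old formula (N-1)*4+1 is quoted from the message. The replacement (M-1)*4+1 is an assumption inferred from the stated 4*(N-M) duplication, since the message does not spell it out. Function names are hypothetical.

```python
def refert_num_from(count: int, stride: int = 4) -> int:
    """RGB frames carried into the next window for `count` latents."""
    return (count - 1) * stride + 1


def duplicated_frames(num_anchors: int, num_motion: int) -> int:
    """Extra RGB frames repeated per boundary when stepping by N instead of M."""
    return refert_num_from(num_anchors) - refert_num_from(num_motion)


def content_latents(frames_per_clip: int, stride: int = 4) -> int:
    """Latents per window, assuming 1 + (frames - 1) // stride."""
    return (frames_per_clip - 1) // stride + 1


print(refert_num_from(4))       # 13 (old, from N=4)
print(refert_num_from(1))       # 1  (assumed new, from M=1)
print(duplicated_frames(4, 1))  # 12 = 4 * (N - M)
print(content_latents(77))      # 20
```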