
VAE produces grid-like tile artifacts on flat regions (constant-green diagnostic) #202

@AmitMY

Description

VAE produces grid-like tile artifacts on flat regions

We've been fine-tuning the LTX-2.3 VAE for a downstream tokenizer and noticed a regular grid pattern in the decoder output, most visible on flat regions (uniform backgrounds). It's small per-pixel but very structured, and we believe the same pattern contaminates natural images — it's just easier to isolate against a constant input.

The test

Encode → decode a constant (0, 128, 0) clip and look at the recon and its FFT. A constant input has zero spatial structure, so any structure in the output is the VAE's contribution.

Script (~20 lines)
import numpy as np
import torch
from src.ltx_vae import load_ltx_vae

enc, dec = load_ltx_vae("/path/to/vae.safetensors", device="cuda")
enc.eval(); dec.eval()

bg = torch.tensor([0, 128, 0], dtype=torch.float32) / 255.0
src = (bg * 2 - 1).view(1, 3, 1, 1, 1).expand(1, 3, 49, 256, 256).contiguous().cuda()
with torch.no_grad():
    recon = dec(enc(src)).clamp(-1, 1).cpu().numpy()

# FFT of the green channel of the middle frame, centre-cropped to 128x128
# (this log-magnitude is what the FFT figures below show).
g = recon[0, 1, 24, 64:192, 64:192]
mag = np.log10(np.abs(np.fft.fftshift(np.fft.fft2(g))) + 1e-9)

# Quantitative measure: fraction of FFT energy that's NOT at DC.
f = np.fft.fftshift(np.fft.fft2(recon[0, 1, 24]))
cy, cx = f.shape[0] // 2, f.shape[1] // 2
f_no_dc = f.copy(); f_no_dc[cy-2:cy+3, cx-2:cx+3] = 0
print(f"off-DC fraction: {np.abs(f_no_dc).sum() / np.abs(f).sum():.4f}")

What we see

LTX-2.3:

  - max |Δ| (G channel): 0.124 (input normalised to [0,1])
  - mean |Δ| (G channel): 0.0026
  - FFT off-DC fraction (G channel): 0.98

98% of the recon's frequency energy sits away from DC, i.e. in spatial structure, and almost all of it lies on a regular grid. The grid period matches the VAE's patch_size=4 upsample stride and its harmonics.
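The link between a period-4 grid and its FFT signature is easy to confirm synthetically. The snippet below is a stand-in, not our measured recon: it plants a tiny period-4 impulse grid on a flat background (illustrative amplitudes) and checks that every off-DC peak lands on the N/4 lattice, exactly the harmonic pattern we see.

```python
import numpy as np

N = 128
yy, xx = np.mgrid[0:N, 0:N]
# Synthetic stand-in for the recon: tiny period-4 impulse grid on a flat background
img = 0.004 + 0.01 * ((xx % 4 == 0) & (yy % 4 == 0))

f = np.abs(np.fft.fft2(img))
f[0, 0] = 0                          # drop DC
peaks = np.argwhere(f > 0.5 * f.max())

# Every remaining peak sits on the N/4 = 32 lattice: period-4 harmonics.
assert np.all(peaks % (N // 4) == 0)
print(f"{peaks.shape[0]} off-DC peaks, all on the {N // 4}-lattice")
```

Running the same peak-lattice check on a real recon crop is a quick way to confirm the artifact period tracks the upsample stride.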

Source frame | recon | abs(diff)×50:

[image]

FFT log-magnitude:

[image]

FFT log-magnitude, source | recon, 128×128 centre crop:

[image]

We also see strong artifacts at the frame boundaries (presumably from conv-stack padding); the centre crop above isolates the periodic structure cleanly.

Why we think this matters for real video

Natural-image variance hides the grid visually, but the FFT energy is still being deposited there; it's just masked by content. For downstream tasks that care about high-frequency fidelity (compression, sharp-region reconstruction, anti-aliasing under camera pans), this floor limits what any fine-tune on top can achieve. We've observed this empirically: even after substantial domain-adaptation training with perceptual + segmentation-weighted losses, the per-frame FFT continues to show the same harmonics in foreground regions.

Suggested directions if you're iterating on the VAE architecture

This is the textbook checkerboard pattern from stride-s transposed convs (Odena, Dumoulin, Olah, "Deconvolution and Checkerboard Artifacts," Distill 2016 — https://distill.pub/2016/deconv-checkerboard/). In rough order of disruption:

  1. Resize-then-conv upsampling — replace ConvTranspose(stride=s) with Upsample(scale=s) + Conv(stride=1). One-line change per stage; eliminates uneven kernel overlap. (Odena et al. 2016, link above.)
  2. PixelShuffle with ICNR initialisation — fast on GPU, checkerboard-free if initialised so the sub-pixel conv equals nearest-neighbour upsample at start. Shi et al., "Real-Time Single Image and Video Super-Resolution…", CVPR 2016, https://arxiv.org/abs/1609.05158 ; Aitken et al., "Checkerboard artifact free sub-pixel convolution," 2017, https://arxiv.org/abs/1707.02937.
  3. BlurPool downsampling in the encoder — Zhang, "Making Convolutional Networks Shift-Invariant Again," ICML 2019, https://arxiv.org/abs/1904.11486. Stops aliased frequencies from being baked into the latent on the way in.
  4. Alias-free design throughout — the principled fix; sinc-windowed up/downsample respecting the Nyquist limit. Karras et al., "Alias-Free Generative Adversarial Networks" (StyleGAN3), NeurIPS 2021, https://arxiv.org/abs/2106.12423.
  5. Patch-based ViT tokenizer (no conv stride at all) — recent video tokenizers go this route and avoid the grid by construction. Yu et al., "Language Model Beats Diffusion — Tokenizer is Key to Visual Generation" (MAGVIT-v2), ICLR 2024, https://arxiv.org/abs/2310.05737 ; Yu et al., "An Image is Worth 32 Tokens for Reconstruction and Generation" (TiTok), NeurIPS 2024, https://arxiv.org/abs/2406.07550.
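For intuition on why option 1 removes the pattern, here is a dependency-free 1-D toy (NumPy only, ones kernels, not the VAE's actual layers): a stride-2 transposed conv with a size-3 kernel gives interior outputs that alternate between two and one overlapping kernel taps (since 3 % 2 ≠ 0), while nearest-upsample followed by a stride-1 conv gives a constant interior on constant input.

```python
import numpy as np

def conv_transpose1d(x, k, stride):
    # Naive 1-D transposed convolution, no padding: each input scatters
    # a copy of the kernel at stride-spaced positions.
    out = np.zeros(len(x) * stride + len(k) - stride)
    for i, v in enumerate(x):
        out[i * stride : i * stride + len(k)] += v * k
    return out

def upsample_conv1d(x, k, scale):
    # Nearest-neighbour upsample, then a stride-1 convolution.
    return np.convolve(np.repeat(x, scale), k, mode="valid")

x = np.ones(8)
k = np.ones(3)                        # kernel size 3, stride 2: 3 % 2 != 0
ct = conv_transpose1d(x, k, stride=2)
uc = upsample_conv1d(x, k, scale=2)

print(ct[2:-2])   # interior alternates 2, 1, 2, 1, ... -> checkerboard
print(uc[2:-2])   # interior is constant -> no grid
```

The alternating overlap count is exactly the uneven-contribution mechanism Odena et al. describe; in 2-D with two upsample stages the periods multiply, producing the stride-and-harmonics grid above.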

Happy to share the full diagnostic + numbers from our domain. The script above is fast (a few seconds) and the off-DC fraction on a constant input would make a useful CI regression test for any future VAE.
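A CI version of that check could be a small helper plus two assertions. The sketch below substitutes synthetic arrays for the real decoder output (a flat value near the constant-green level, and the same flat value contaminated by a low-amplitude period-4 grating); the 0.5 threshold is illustrative, and in CI the `grid` array would be replaced by the actual recon of a constant clip.

```python
import numpy as np

def off_dc_fraction(img, dc_halfwidth=2):
    """Fraction of 2-D FFT magnitude outside a small window around DC."""
    f = np.fft.fftshift(np.fft.fft2(img))
    total = np.abs(f).sum()
    cy, cx = f.shape[0] // 2, f.shape[1] // 2
    f = f.copy()
    f[cy - dc_halfwidth : cy + dc_halfwidth + 1,
      cx - dc_halfwidth : cx + dc_halfwidth + 1] = 0
    return np.abs(f).sum() / total

# Synthetic stand-ins for the decoder output on a constant input:
flat = np.full((128, 128), 0.004)                 # ideal recon: flat
xx = np.arange(128)[None, :]
grid = flat + 0.01 * np.cos(2 * np.pi * xx / 4)   # period-4 contamination

assert off_dc_fraction(flat) < 1e-6   # clean recon: essentially all DC
assert off_dc_fraction(grid) > 0.5    # grid contamination dominates off-DC
```

In a real regression test, the second assertion would flip: the recon of a constant clip should have an off-DC fraction *below* some tolerance.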
