# VAE produces grid-like tile artifacts on flat regions
We've been fine-tuning the LTX-2.3 VAE for a downstream tokenizer and noticed a regular grid pattern in the decoder output, most visible on flat regions (uniform backgrounds). It's small per-pixel but very structured, and we believe the same pattern contaminates natural images — it's just easier to isolate against a constant input.
## The test

Encode → decode a constant `(0, 128, 0)` clip and look at the recon and its FFT. A constant input has zero spatial structure, so any structure in the output is the VAE's contribution.
## The script

```python
import numpy as np
import torch

from src.ltx_vae import load_ltx_vae

enc, dec = load_ltx_vae("/path/to/vae.safetensors", device="cuda")
enc.eval(); dec.eval()

# Constant (0, 128, 0) clip mapped to [-1, 1]: batch 1, 3 channels, 49 frames, 256x256.
bg = torch.tensor([0, 128, 0], dtype=torch.float32) / 255.0
src = (bg * 2 - 1).view(1, 3, 1, 1, 1).expand(1, 3, 49, 256, 256).contiguous().cuda()

with torch.no_grad():
    recon = dec(enc(src)).clamp(-1, 1).cpu().numpy()

# FFT of the green channel of the middle frame, centre-cropped to 128x128.
g = recon[0, 1, 24, 64:192, 64:192]
mag = np.log10(np.abs(np.fft.fftshift(np.fft.fft2(g))) + 1e-9)

# Quantitative measure: fraction of FFT energy that's NOT at DC.
f = np.fft.fftshift(np.fft.fft2(recon[0, 1, 24]))
cy, cx = f.shape[0] // 2, f.shape[1] // 2
f_no_dc = f.copy(); f_no_dc[cy-2:cy+3, cx-2:cx+3] = 0
print(f"off-DC fraction: {np.abs(f_no_dc).sum() / np.abs(f).sum():.4f}")
```
## What we see
| Metric (G channel) | LTX-2.3 |
| --- | --- |
| max \|Δ\| | 0.124 (on a [0, 1]-normalised input) |
| mean \|Δ\| | 0.0026 |
| FFT off-DC fraction | 0.98 |
98% of the recon's frequency energy is not at DC — i.e. it sits in spatial structure, almost all of it on a regular grid. The grid period matches the VAE's `patch_size=4` upsample stride and its harmonics.
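For reading the period off the spectrum: in an N-pixel frame, a period-p grid puts its fundamental FFT peak at distance N/p from DC, with harmonics on a lattice of multiples. A minimal synthetic sanity check (the random 4×4 tile and the peak threshold are illustrative stand-ins, not LTX output):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 128

# Synthetic stand-in for a stride-4 upsampling artifact: one random
# 4x4 tile repeated across the frame.
tile = rng.normal(size=(4, 4))
img = np.tile(tile, (N // 4, N // 4))

f = np.abs(np.fft.fftshift(np.fft.fft2(img)))
cy = cx = N // 2
f[cy, cx] = 0.0  # drop DC

# A periodic tiling concentrates energy on a harmonic lattice; the peak
# closest to DC is the fundamental, and its distance gives the period.
ky, kx = np.nonzero(f > 1e-6 * f.max())
period = N / np.hypot(ky - cy, kx - cx).min()
print(period)  # → 4.0
```

On the real recon the peaks are broader, but the nearest-to-DC peak location gives the same readout.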
Source frame | recon | abs(diff)×50:

FFT log-magnitude, source | recon, 128×128 centre crop:
We also see strong artifacts at the frame boundaries (presumably from conv-stack padding); the centre crop above isolates the periodic structure cleanly.
## Why we think this matters for real video
Natural-image variance hides the grid visually, but the FFT energy is still being deposited there — it's just masked by content. For downstream tasks that care about high-frequency fidelity (compression, sharp-region reconstruction, anti-aliasing under camera pans) this floor is the limit of what any fine-tune on top can achieve. We've observed this empirically: even after substantial domain-adaptation training with perceptual + segmentation-weighted losses, the per-frame FFT continues to show the same harmonics in foreground regions.
## Suggested directions if you're iterating on the VAE architecture
This is the textbook checkerboard pattern from stride-s transposed convs (Odena, Dumoulin, Olah, "Deconvolution and Checkerboard Artifacts," Distill 2016 — https://distill.pub/2016/deconv-checkerboard/). In rough order of disruption:
- Resize-then-conv upsampling — replace `ConvTranspose(stride=s)` with `Upsample(scale=s) + Conv(stride=1)`. One-line change per stage; eliminates uneven kernel overlap. (Odena et al. 2016, link above.)
- PixelShuffle with ICNR initialisation — fast on GPU, checkerboard-free if initialised so the sub-pixel conv equals nearest-neighbour upsample at start. Shi et al., "Real-Time Single Image and Video Super-Resolution…", CVPR 2016, https://arxiv.org/abs/1609.05158 ; Aitken et al., "Checkerboard artifact free sub-pixel convolution," 2017, https://arxiv.org/abs/1707.02937.
- BlurPool downsampling in the encoder — Zhang, "Making Convolutional Networks Shift-Invariant Again," ICML 2019, https://arxiv.org/abs/1904.11486. Stops aliased frequencies from being baked into the latent on the way in.
- Alias-free design throughout — the principled fix; sinc-windowed up/downsample respecting the Nyquist limit. Karras et al., "Alias-Free Generative Adversarial Networks" (StyleGAN3), NeurIPS 2021, https://arxiv.org/abs/2106.12423.
- Patch-based ViT tokenizer (no conv stride at all) — recent video tokenizers go this route and avoid the grid by construction. Yu et al., "Language Model Beats Diffusion — Tokenizer is Key to Visual Generation" (MAGVIT-v2), ICLR 2024, https://arxiv.org/abs/2310.05737 ; Yu et al., "An Image is Worth 32 Tokens for Reconstruction and Generation" (TiTok), NeurIPS 2024, https://arxiv.org/abs/2406.07550.
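The first option above is the cheapest to try. A minimal sketch of a resize-then-conv stage (module name and hyperparameters are ours, not LTX's; the point is that every output pixel sees the same kernel-overlap pattern, per Odena et al. 2016):

```python
import torch
import torch.nn as nn

class ResizeConvUpsample(nn.Module):
    """Hypothetical drop-in for a stride-s ConvTranspose2d stage:
    nearest-neighbour upsample first, then a stride-1 conv, so the
    kernel overlap is uniform and no checkerboard can form."""

    def __init__(self, c_in: int, c_out: int, scale: int = 4, kernel_size: int = 3):
        super().__init__()
        self.up = nn.Upsample(scale_factor=scale, mode="nearest")
        self.conv = nn.Conv2d(c_in, c_out, kernel_size,
                              stride=1, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv(self.up(x))

x = torch.randn(1, 8, 16, 16)
y = ResizeConvUpsample(8, 3, scale=4)(x)
print(y.shape)  # torch.Size([1, 3, 64, 64])
```

Per-stage cost is comparable to the transposed conv it replaces; the conv just runs at the upsampled resolution.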
Happy to share the full diagnostic + numbers from our domain. The script above is fast (a few seconds) and the off-DC fraction on a constant input would make a useful CI regression test for any future VAE.
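For the CI idea, the measurement factors into a tiny pure function on a decoded frame (the function name and thresholds below are our suggestion, not anything in the codebase):

```python
import numpy as np

def off_dc_fraction(frame: np.ndarray, dc_halfwidth: int = 2) -> float:
    """Fraction of 2-D FFT magnitude outside a small window around DC.

    Near 0 for the reconstruction of a constant frame; approaches 1
    when the energy sits in spatial structure such as a stride grid.
    """
    f = np.fft.fftshift(np.fft.fft2(frame))
    cy, cx = f.shape[0] // 2, f.shape[1] // 2
    total = np.abs(f).sum()
    f = f.copy()
    f[cy - dc_halfwidth:cy + dc_halfwidth + 1,
      cx - dc_halfwidth:cx + dc_halfwidth + 1] = 0
    return float(np.abs(f).sum() / total)

# Constant frame: essentially all energy at DC.
print(round(off_dc_fraction(np.full((128, 128), 0.5)), 6))  # → 0.0
```

A CI test would then decode the constant clip, call this on the middle frame, and assert the value stays below some budget (e.g. 0.05) so regressions in the upsampling path fail fast.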