LTX 2.3 Audio and Video Generation – Latest WebUI Portable Version (High-Quality Audio-Video Generation Tool)
LTX-2 is the first DiT (Diffusion Transformer) architecture-based audio-video foundation model developed by Lightricks. Unlike previous approaches that handle video and audio separately, LTX-2 deeply integrates both into a single unified model, enabling truly synchronized audio-video generation with high-quality output.
Patreon: https://www.patreon.com/posts/ltx-2-3-webui-156971047
Quark Drive (夸克网盘): https://pan.quark.cn/s/41e4da892a11
YouTube: https://www.youtube.com/watch?v=Pt_8HhYHozs
Seven Core Features Explained
- Two-Stage HD Generation
Best for: Final renders where maximum image quality is the priority.
How it works: The dev main model first generates a low-resolution draft, then the 2× spatial upscaler doubles the resolution — balancing content quality with fine detail clarity.
Required models: ltx-2.3-22b-dev + spatial-upscaler-x2 + distilled-lora + Gemma
Steps:
Switch to the "Two-Stage HD Generation" tab
Enter your prompt and set resolution and frame count under "Prompt & Basic Parameters"
Adjust "Distilled LoRA Strength" (default 1.0, range 0–2; too high may over-sharpen)
Click "Start Generation"
Notes:
Generation takes longer — best for final output, not quick previews
Recommended inference steps: 20–40
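To make the two-stage flow above concrete, here is a toy Python sketch: `stage_one_draft` and `stage_two_upscale_2x` are illustrative stand-ins (random frames and nearest-neighbor upsampling), not the actual LTX-2 pipeline.

```python
import numpy as np

# Toy sketch of the two-stage idea: stage one produces a low-resolution
# "draft" (random frames here), stage two doubles the spatial resolution
# (nearest-neighbor here; the real 2x spatial upscaler is a learned model).
def stage_one_draft(num_frames=33, h=512, w=768):
    return np.random.rand(num_frames, h, w, 3).astype(np.float32)

def stage_two_upscale_2x(frames):
    # Repeat each pixel 2x along height and width: (F, H, W, C) -> (F, 2H, 2W, C)
    return frames.repeat(2, axis=1).repeat(2, axis=2)

draft = stage_one_draft()
final = stage_two_upscale_2x(draft)
print(draft.shape, "->", final.shape)  # (33, 512, 768, 3) -> (33, 1024, 1536, 3)
```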
- Distilled Model Fast Generation (Recommended)
Best for: Speed-critical scenarios, or environments with limited VRAM.
How it works: Uses a knowledge-distilled model that generates video in just 8 fixed-sigma inference steps, then passes the result through the 2× spatial upscaler.
Required models: ltx-2.3-22b-distilled + spatial-upscaler-x2 + Gemma
Steps:
Switch to the "Distilled Fast Generation" tab
Enter your prompt and configure parameters
Click "Start Generation"
Notes:
Inference steps are fixed at 8; adjusting the "Inference Steps" parameter has no effect in this mode
Fastest speed, but slightly lower quality and detail richness than two-stage HD
This mode does not use distilled LoRA — no need to set "Distilled LoRA Strength"
- Image/Video-to-Video
Best for: Generating new videos with consistent style and controlled motion based on reference images or videos (IC-LoRA).
Required models: ltx-2.3-22b-distilled + spatial-upscaler-x2 + Gemma
Tab-specific parameters:
| Parameter | Description |
| --- | --- |
| Reference Video File | Upload one or more reference videos as conditioning guidance |
| Reference Video Strength | Influence strength of each reference video (0–1+), comma-separated (e.g. 0.8,0.6) |
| Skip Second-Stage Upscaling | Check to skip the high-res stage; faster, but no resolution doubling |
| Attention Strength | Controls how much the reference video influences attention (0.0–1.0); higher = closer to the reference |
| Mask Video (optional) | Upload a mask video; white areas are influenced by the reference conditions, black areas generate freely |
Steps:
Upload reference video(s) (multiple supported)
Set strength for each video, e.g. 1.0 or 0.8,0.6
Optionally upload a reference image in the "Image Conditions" accordion
Enter a prompt describing the target video content
Click "Start Generation"
Notes:
Number of reference videos must match the number of strength values; if fewer values are provided, the last value is used to fill in the rest (see the sketch after these notes)
Mask video dimensions are automatically scaled to half the generation size
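As a sketch of the strength-padding rule above (hypothetical helper code mirroring the documented behavior, not the WebUI's own implementation; the empty-field fallback of 1.0 is an assumption):

```python
def parse_strengths(raw: str, num_videos: int) -> list[float]:
    """Parse a comma-separated strength string such as "0.8,0.6".

    If fewer values than reference videos are given, the last value
    is repeated to fill the rest, as described in the notes above.
    """
    values = [float(v) for v in raw.split(",") if v.strip()]
    if not values:
        values = [1.0]  # assumed fallback when the field is empty
    if len(values) < num_videos:
        values += [values[-1]] * (num_videos - len(values))
    return values[:num_videos]

# Three reference videos, only two strengths provided:
print(parse_strengths("0.8,0.6", 3))  # [0.8, 0.6, 0.6]
```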
- Keyframe Interpolation
Best for: Generating smooth transition video clips between a set of keyframe images.
Required models: ltx-2.3-22b-dev + spatial-upscaler-x2 + distilled-lora + Gemma
Steps:
Switch to the "Keyframe Interpolation" tab
Expand the "Image Conditions (Optional)" accordion below
Upload multiple keyframe images
In "Frame Index," enter the frame number for each image, e.g. 0,16,32 (frame numbers start at 0; spacing indicates interpolated frames)
In "Strength," enter the influence strength for each keyframe, e.g. 1.0,1.0,1.0
Enter a prompt describing the overall motion/scene
Make sure "Frame Count" ≥ maximum frame index + 1
Click "Start Generation"
Notes:
Keyframe count, frame index count, and strength value count must all match (a validation sketch follows these notes)
First frame index is typically set to 0; last frame index is set to num_frames - 1
Distilled LoRA Strength affects interpolation smoothness — recommended to keep default value of 1.0
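A quick validation sketch of the matching rules above, assuming they work exactly as stated; `check_keyframes` is a hypothetical helper, not the WebUI's own validation code.

```python
def check_keyframes(frame_indices: list[int], strengths: list[float],
                    num_images: int, num_frames: int) -> None:
    # Keyframe count, frame index count, and strength count must match.
    assert len(frame_indices) == len(strengths) == num_images, \
        "image / frame-index / strength counts must match"
    # Frame Count must cover the largest keyframe index.
    assert num_frames >= max(frame_indices) + 1, \
        "Frame Count must be >= max frame index + 1"

# Typical setup: keyframes at 0, 16, and num_frames - 1 = 32.
check_keyframes([0, 16, 32], [1.0, 1.0, 1.0], num_images=3, num_frames=33)
```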
- Audio-Driven Video Generation
Best for: Generating video content synchronized to the rhythm of music or speech.
Required models: ltx-2.3-22b-dev + spatial-upscaler-x2 + distilled-lora + Gemma
Tab-specific parameters:
| Parameter | Description |
| --- | --- |
| Audio File | Upload a WAV, MP3, or other supported audio file |
| Audio Start Time (seconds) | Start position within the audio file (default 0) |
| Max Duration (seconds) | Length of the audio clip to use (0 = auto, matched to the video frame count) |
Steps:
Switch to the "Audio-Driven Video Generation" tab
Upload your audio file
Set start time and max duration (usually leave as default)
Enter a prompt describing the visual content of the video
Set "Frame Count" and "Frame Rate" so video duration matches the audio duration
Click "Start Generation"
Notes:
Audio file is required — generation will error without it
Video duration = Frame Count ÷ Frame Rate; keep this consistent with your audio clip length (an arithmetic sketch follows these notes)
You can upload a reference image under "Image Conditions" to influence the visual style
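The arithmetic sketch referenced above: choose a frame count whose duration (Frame Count ÷ Frame Rate) matches the audio clip. `frames_for_audio` is a hypothetical helper, not part of the WebUI.

```python
def frames_for_audio(audio_seconds: float, fps: int = 24) -> int:
    # Video duration = Frame Count / Frame Rate, so to match the audio:
    return round(audio_seconds * fps)

audio_len = 4.0                                # seconds of audio to use
num_frames = frames_for_audio(audio_len, 24)   # 96 frames
print(num_frames, num_frames / 24)             # 96 4.0
```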
- Video Segment Regeneration
Best for: Locally regenerating an unsatisfactory segment of an existing video while keeping the rest unchanged.
Required models: ltx-2.3-22b-distilled + Gemma
Tab-specific parameters:
| Parameter | Description |
| --- | --- |
| Source Video File | Upload the original video to be partially modified |
| Start Time (seconds) | Start point of the segment to regenerate |
| End Time (seconds) | End point of the segment to regenerate |
| Regenerate Video Track | Check to regenerate the video frames in the selected time range |
| Regenerate Audio Track | Check to regenerate the audio in the selected time range |
| Use Distilled Model | Check for fast distilled inference; uncheck for full inference (requires manual guidance parameter setup) |
Steps:
Switch to the "Video Segment Regeneration" tab
Upload your source video
Set start and end times (in seconds)
Choose whether to regenerate the video track and/or audio track
Enter a prompt describing the target content for the regenerated segment
Click "Start Generation"
Notes:
Source video file is required — generation will error without it
Portions outside the time range remain unchanged
When using the distilled model, guidance parameters are automatically set to preset values; manual adjustments have no effect
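For orientation, a rough sketch of how a start/end time in seconds maps to frame indices at a given frame rate; whether the WebUI rounds exactly this way internally is an assumption.

```python
fps = 24
start_time, end_time = 2.0, 4.5       # segment to regenerate, in seconds
start_frame = int(start_time * fps)   # 48
end_frame = int(end_time * fps)       # 108
print(start_frame, end_frame)         # frames 48..108 are regenerated
```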
- HDR Video Generation
Best for: Professional film and post-production workflows requiring high dynamic range (HDR) footage for color grading, tone mapping, and compositing.
Required models: ltx-2.3-22b-distilled + spatial-upscaler-x2 + HDR IC-LoRA
Tab-specific parameters:
| Parameter | Description |
| --- | --- |
| Reference Video File | Upload an SDR reference video as the basis for HDR conversion |
| Reference Video Strength | Conditioning strength for each reference video (comma-separated) |
| Spatial Tile Size | Tile size used during upscaling (default 1280); affects VRAM usage |
| EXR Output Only | Check to save only the EXR sequence without generating an MP4 preview |
| EXR Half Precision | Save EXR in float16; smaller files, slightly reduced precision |
| High Quality Mode | Enables a more refined HDR processing pipeline (slower) |
Steps:
Switch to the "HDR Video Generation" tab
Upload your reference SDR video
Click "Start Generation"
Output:
Output is an EXR frame sequence (scene-linear light data encoded as LogC3), saved to the output/hdr_XXXXXX_exr/ directory
By default, an MP4 preview file is also generated (check "EXR Output Only" to skip this)
EXR files require tone mapping in professional software such as DaVinci Resolve or Nuke before they display correctly
Notes:
Larger tile sizes increase VRAM usage; reduce if you encounter OOM (out of memory) errors
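If you want to inspect the EXR frames outside a grading tool, the sketch below converts LogC3 values back to scene-linear light using the commonly published ARRI LogC3 (EI 800) constants; DaVinci Resolve and Nuke apply an equivalent input transform for you.

```python
import numpy as np

def logc3_to_linear(t: np.ndarray) -> np.ndarray:
    # ARRI LogC3 (EI 800) decode constants, as commonly published.
    cut, a, b = 0.010591, 5.555556, 0.052272
    c, d = 0.247190, 0.385537
    e, f = 5.367655, 0.092809
    # Logarithmic segment above the cut, linear segment below it.
    return np.where(t > e * cut + f,
                    (np.power(10.0, (t - d) / c) - b) / a,
                    (t - f) / e)

# 18% gray encodes to roughly 0.391 in LogC3 (EI 800):
print(logc3_to_linear(np.array([0.391])))  # ~[0.18]
```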
General Parameters Reference
Prompt & Basic Parameters
| Parameter | Default | Description |
| --- | --- | --- |
| Prompt | (empty) | Describes the video content; detailed descriptions of motion, scene, camera, and lighting are recommended (see Prompt Writing Tips below) |
| Negative Prompt | (empty) | Describes content to avoid, e.g. blurry, low quality |
| Random Seed | -1 | -1 for random; a fixed value reproduces identical results |
| Height / Width (px) | 512 / 768 | Output resolution |
| Frame Count | 33 | Total frames to generate; video duration = Frame Count ÷ Frame Rate |
| Frame Rate (fps) | 24 | Output video frame rate |
| Inference Steps | 8 | Diffusion denoising steps; more steps improve quality but run slower (fixed at 8 in distilled mode) |
| Max Batch Size | 1 | Number of chunks processed in parallel; higher values are faster but require more VRAM |
| Auto-Enhance Prompt | Off | When enabled, uses Gemma to automatically expand your prompt; useful for short prompts |
| Distilled LoRA Strength | 1.0 | For two-stage / keyframe / audio-driven modes; affects detail sharpness in the second stage |
Image Conditions (Optional)
Upload reference images to provide visual anchors for the generated video.
| Parameter | Description |
| --- | --- |
| Condition Image File | Upload one or more images (required in Keyframe Interpolation mode) |
| Frame Index | Which frame in the video each image corresponds to (0-indexed), comma-separated |
| Strength | How strongly each image influences the generated content, comma-separated |
| CRF | Image compression quality (lower = higher quality; the default of 33 is usually fine) |
Runtime Parameters
| Parameter | Description |
| --- | --- |
| VRAM Offload Mode | none: keep everything in VRAM; cpu: offload part to system RAM; disk: offload to disk (lowest VRAM usage, but slowest) |
| Quantization Mode | none: full precision; fp8-cast: dynamic FP8 quantization (recommended for 40/50-series GPUs); fp8-scaled-mm: Hopper GPUs only |
| Torch Compile Acceleration | First-time compilation takes a few minutes; subsequent generations are noticeably faster |
| Additional LoRA | One per line, in the format /path/to/lora.safetensors,0.8 |
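A small sketch of the Additional LoRA field's one-entry-per-line format; `parse_lora_lines` is a hypothetical helper, not the WebUI's actual parser.

```python
def parse_lora_lines(text: str) -> list[tuple[str, float]]:
    # Each non-empty line is "path,strength"; split on the last comma
    # so commas inside the path do not break parsing.
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        path, _, strength = line.rpartition(",")
        entries.append((path, float(strength)))
    return entries

print(parse_lora_lines("/path/to/lora.safetensors,0.8"))
# [('/path/to/lora.safetensors', 0.8)]
```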
Guidance Parameters (Advanced)
Controls diffusion guidance strength — generally no adjustment needed.
| Parameter | Suggested Range | Description |
| --- | --- | --- |
| cfg_scale | 2–7 | Classifier-free guidance strength; higher = stronger prompt adherence but may oversaturate |
| stg_scale | 0–2 | Skip-step guidance strength |
| rescale_scale | 0.5–0.9 | Guidance rescaling compensation to prevent oversaturation |
| modality_scale | 1–5 | Multimodal (audio-video) alignment strength |
| skip_step | 0 | Number of initial steps to skip |
| stg_blocks | 28 | Transformer block index where skip-step guidance is applied |
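As a starting point, a hedged example of mid-range values drawn from the table above; the key names simply mirror the parameter names and are not a confirmed configuration schema.

```python
guidance = {
    "cfg_scale": 4.0,       # prompt adherence vs. oversaturation
    "stg_scale": 1.0,       # skip-step guidance strength
    "rescale_scale": 0.7,   # compensates for oversaturation
    "modality_scale": 3.0,  # audio-video alignment strength
    "skip_step": 0,         # no initial steps skipped
    "stg_blocks": 28,       # block index for skip-step guidance
}
```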
Prompt Writing Tips
LTX-2 uses Gemma for deep semantic understanding and supports detailed natural language descriptions. Keep descriptions precise and specific — think like a film storyboard. Recommended length: under 200 words.
Output & Settings
Output Files
Generated videos are saved to the output/ folder in the project root, with filenames in the format:
output/{feature_name}_{datetime}.mp4
HDR mode additionally generates:
output/hdr_{date_time}_exr/frame00000.exr
output/hdr_{date_time}_exr/frame00001.exr
...
Settings Saving
Manual save: Click the "Save Settings" button
Auto-save: All current parameters are automatically saved each time you click "Start Generation"
Settings file path: {project root}/settings.json
All parameters are automatically restored from settings.json on next launch
FAQ
Q: How much storage space is needed for a full setup?
A: Downloading all models requires approximately 100 GB or more (dev model 44 GB, distilled model 44 GB, Gemma ~22.7 GB, upscaler, etc.). If you only use specific features, you only need to download the corresponding models.
Q: What is the minimum VRAM requirement?
A: For lower-VRAM setups, use "Quantization Mode" (fp8-cast; do not enable it on RTX 30-series or older GPUs) combined with "VRAM Offload Mode" (cpu or disk). The less VRAM your NVIDIA GPU has, the slower generation will be; 12 GB of VRAM or more is recommended for reasonable speeds.
Q: My output doesn't match the prompt — what can I do?
A:
Increase cfg_scale (e.g. from 3 to 5–7)
Make your prompt more specific and detailed
Enable "Auto-Enhance Prompt"
Increase "Inference Steps"