Wan2.2 S2V in ComfyUI Workflow | Audio to Talking Video
Turns your audio clip into lifelike, synced video from one image
Who it's for: creators who want this pipeline in ComfyUI without assembling nodes from scratch. Not for: one-click results with zero tuning — you still choose inputs, prompts, and settings.
Open preloaded workflow on RunComfy
Why RunComfy first
- Fewer missing-node surprises — run the graph in a managed environment before you mirror it locally.
- Quick GPU tryout — useful if your local VRAM or install time is the bottleneck.
- Matches the published JSON — the zip follows the same runnable workflow you can open on RunComfy.
When downloading for local ComfyUI makes sense: you want full control over models on disk, batch scripting, or offline runs.
How to use (local ComfyUI)
1. Load your inputs (the reference image and the audio clip) in the marked loader nodes.
2. Set prompts, resolution, and seeds; start with a short test run.
3. Export from the Save / Write nodes shown in the graph.
Expectations: the first run may download large model weights, and cloud runs may require a free RunComfy account.
Overview
This workflow generates video from an audio clip and a single reference image, so both speech-driven and music-driven visuals are possible. You provide voice or music plus the image, and it produces a matching clip: talking avatars, music loops, or expressive shots without manual animation. It preserves the fidelity of the reference image while syncing lips and expressions to the audio. The graph comes pre-assembled, so setup is mostly a matter of choosing inputs, prompts, and settings.
Key nodes in the ComfyUI Wan2.2 S2V workflow
WanSoundImageToVideo (#55)
Drives audio-synchronized motion from a single image. Set ref_image to the portrait or scene you want animated, connect audio_encoder_output from the encoder, and provide a length in frames. Increase length for longer clips or reduce for snappier previews. If you change FPS elsewhere, update the frames value accordingly so timing stays in sync.
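The relationship between length and FPS described above is plain arithmetic, and a small helper can make it explicit. This is a minimal sketch, not part of the workflow: the function names are hypothetical, the FPS values are only examples, and the node itself may restrict length to particular step sizes, so check the widget after entering the value.

```python
import math


def frames_for_audio(duration_s: float, fps: float) -> int:
    """Return the number of frames needed to cover duration_s seconds at fps.

    Hypothetical helper: rounds up so the video is never shorter than the audio.
    """
    if duration_s <= 0 or fps <= 0:
        raise ValueError("duration_s and fps must be positive")
    return math.ceil(duration_s * fps)


def clip_seconds(frames: int, fps: float) -> float:
    """Return the playback duration in seconds of `frames` frames at `fps`."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return frames / fps


if __name__ == "__main__":
    # Example values only: a 5-second clip at two different frame rates.
    for fps in (16, 24):
        n = frames_for_audio(5.0, fps)
        print(f"{fps} fps -> length={n} frames ({clip_seconds(n, fps):.2f} s)")
```

If you change the frame rate, recompute length with the same audio duration; otherwise the clip will end early or run past the audio.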
AudioEncoderLoader (#57) and AudioEncoderEncode (#56)
Load and run the Wav2Vec2-based encoder that turns speech or music into features Wan can follow. Use clean speech for lip sync, or percussive/beat-heavy audio for rhythmic motion. If your input language or domain differs, swap in a compatible Wav2Vec2 checkpoint to improve alignment.
CLIPTextEncode (#6) and CLIPTextEncode (#7)
Positive and negative prompt encoders for UMT5/CLIP conditioning. Keep positive prompts concise, focusing on subject, style, and shot terms; use negative prompts to avoid unwanted artifacts. Overly forceful prompts can fight the audio, so prefer light guidance and let Wan2.2 S2V handle motion.
KSampler (#3)
Samples the latent sequence produced by the Wan2.2 S2V node. Adjust sampler type and steps to trade speed for fidelity; keep a fixed seed when you want reproducible timing with the same audio. If motion feels too rigid or noisy, small changes here can noticeably improve temporal stability.
VHS_VideoCombine (#66)
Creates the final video and attaches the audio. Set frame_rate to match your intended FPS and confirm that the clip duration agrees with the length (in frames) set on WanSoundImageToVideo. The container, pixel format, and quality controls are exposed for quick exports; use higher quality when you plan to post-process in an editor.
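For batch scripting on a local install, the settings discussed for these nodes can be changed together in one place. The sketch below assumes the workflow has been exported in ComfyUI's API format (a JSON object keyed by node id, each entry holding class_type and inputs) and that the ids match the numbers listed here (#55, #3, #66); both are assumptions to verify against your own export. The function name patch_workflow is hypothetical.

```python
import copy


def patch_workflow(workflow: dict, *, length: int, frame_rate: float, seed: int) -> dict:
    """Return a copy of an API-format workflow with timing and seed updated.

    Assumed node ids (taken from the node list above, verify in your export):
      "55" = WanSoundImageToVideo, "3" = KSampler, "66" = VHS_VideoCombine.
    """
    patched = copy.deepcopy(workflow)  # leave the caller's dict untouched
    patched["55"]["inputs"]["length"] = length
    patched["66"]["inputs"]["frame_rate"] = frame_rate
    patched["3"]["inputs"]["seed"] = seed
    return patched


if __name__ == "__main__":
    # Tiny stand-in for a real export; a real file has many more nodes and inputs.
    demo = {
        "3": {"class_type": "KSampler", "inputs": {"seed": 0}},
        "55": {"class_type": "WanSoundImageToVideo", "inputs": {"length": 1}},
        "66": {"class_type": "VHS_VideoCombine", "inputs": {"frame_rate": 1}},
    }
    out = patch_workflow(demo, length=80, frame_rate=16, seed=1234)
    print(out["55"]["inputs"], out["66"]["inputs"], out["3"]["inputs"])
```

Keeping length, frame_rate, and seed in a single call makes it harder for one of them to drift out of step with the others between runs.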
Notes
See the RunComfy page for this workflow for the latest node requirements.

