InfiniteTalk TTS BGM Foley US IP
VibeVoice → ACE-Step → MMAudio → WAN Video Generation Workflow
(Image-to-Video Pipeline for ComfyUI)
This workflow transforms a single finished image into a short, cinematic clip with realistic motion, adaptive background music, and contextual sound design.
It’s built for creators who already have a rendered character and want to bring it to life through expressive movement and ambient depth.
Core Stages
VibeVoice (Speech & Expression): Generates spoken dialogue or monologue synced with emotional tone, allowing characters to deliver lines naturally within the scene.
ACE-Step (Background Music): Generates BGM to match emotional intent and tempo.
MMAudio (Foley & Ambience): Layers in realistic room tone and sound cues for immersion. In this workflow, foley is generated from a text description rather than from the video input.
WAN 2.1 I2V, 480p or 720p (Motion & Tone): Adds lifelike motion and camera behavior through natural-language tone prompts.
Upscaling: The workflow includes a 1× detail upscaler pass (ideal for skin texture and edge refinement), but you can substitute any preferred upscaler.
Frame Interpolation: Integrated interpolation smooths motion between generated frames for cleaner playback and more natural character movement.
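As a rough illustration of what the interpolation stage does to clip length, the sketch below counts frames before and after interpolation. It assumes the interpolator inserts new frames only between existing neighbors, and the values used (81 source frames at 16 fps, a 2× factor) are example numbers, not settings read from this workflow. The function names are hypothetical.

```python
def interpolated_frame_count(source_frames: int, factor: int) -> int:
    """Frames after inserting (factor - 1) new frames between each adjacent pair."""
    if source_frames < 1 or factor < 1:
        raise ValueError("source_frames and factor must be positive")
    return (source_frames - 1) * factor + 1


def clip_seconds(frame_count: int, fps: float) -> float:
    """Playback duration of a clip at the given frame rate."""
    return frame_count / fps


# Assumed example values, not settings taken from the workflow.
SOURCE_FRAMES = 81
SOURCE_FPS = 16
FACTOR = 2

out_frames = interpolated_frame_count(SOURCE_FRAMES, FACTOR)
out_fps = SOURCE_FPS * FACTOR
print(out_frames, out_fps, round(clip_seconds(out_frames, out_fps), 2))
```

Under these assumptions the clip stays about five seconds long while the frame rate doubles, which is why interpolation smooths playback without changing the pacing of the generated audio.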
User Note
Audio generation inside ComfyUI can be tricky to configure.
This workflow includes inline notes listing required dependencies and node packs, but users should expect some environment troubleshooting.
Once configured, the chain runs end-to-end from a still image to a complete audiovisual scene with motion, music, foley, and interpolation.
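The description does not say how the speech, music, and foley tracks are balanced against each other, so the following is only a minimal sketch of one common approach: a gain-weighted sum of mono tracks with hard clipping. The function name `mix_tracks`, the sample values, and the gains are all hypothetical and are not taken from the workflow's nodes.

```python
def mix_tracks(tracks, gains):
    """Sum mono tracks sample by sample, scaling each by its gain.

    Shorter tracks are treated as silent once they end, and the result
    is hard-clipped to the [-1.0, 1.0] range.
    """
    if len(tracks) != len(gains):
        raise ValueError("need one gain per track")
    length = max((len(t) for t in tracks), default=0)
    mixed = [0.0] * length
    for track, gain in zip(tracks, gains):
        for i, sample in enumerate(track):
            mixed[i] += sample * gain
    return [max(-1.0, min(1.0, s)) for s in mixed]


# Tiny made-up tracks: speech at full level, music and foley at half level.
speech = [0.5, 0.5, 0.5, 0.5]
music = [0.2, -0.2, 0.2]
foley = [0.8, 0.0]
mixed = mix_tracks([speech, music, foley], [1.0, 0.5, 0.5])
print(len(mixed))
```

Keeping speech at the highest gain is a design choice made here so dialogue stays intelligible over the music bed; real material would usually need per-clip adjustment.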
The default settings were tuned on a machine with an RTX 5090, so they may need adjusting on GPUs with less VRAM.
