Wan.HuMo Music Video Automation Workflow
🎬 AI Music Video Workflow (ComfyUI)
Turn your favorite tracks into fully AI-generated cinematic music videos, automatically, right inside ComfyUI. NO POST-EDITING NEEDED.
This workflow takes a reference image and an audio file, then generates a lip-synced video that matches the lyrics, mood, and scene dynamics. The process is roughly 95% automated.
For some reason the example videos are not showing up for anyone, so you can find them all here.
A high-level walkthrough can be found here.
Need help or have questions? Please reach out through Discord.
✨ What It Does
🎭 Keeps your reference image as the main performer across all scenes.
🎶 Splits audio into lyric-synced snippets for perfect timing.
🖋️ Uses a custom Prompt Creator node that sends detailed instructions to an LLM node to build cinematic prompts from the lyrics and your style choices.
🎥 Generates scene-by-scene visuals, then combines them into a seamless final video.
The samples I provided were all created inside ComfyUI with NO post edits.
On an RTX 5090, the full song took around 2 hours.
More examples can be found here, and more will be added as I make them.
🔧 Key Features
Reference Image Control – Import your character photo (headshot recommended); the workflow auto-removes the background and resizes it for clean framing.
Audio Handling – Automatic vocal/instrument separation, Whisper V3 transcription, advanced settings for lyric overlap, and fallback options (see the sketch at the end of this section for the general splitting idea).
Prompt Creator – Flexible scene builder with fields for style, theme, lighting, camera motion, outfits, and more, so you can dial in a custom look.
Auto Queueing – Handles multi-run videos seamlessly for long audio files.
Final Render Automation – Collects all video chunks, merges them, and saves your finished video as FINAL_VIDEO.mp4.
This workflow uses the Native Gemini LLM API node by default, which receives detailed instructions generated by the Prompt Creator node. You can swap Gemini out for another LLM if you prefer, but the instruction sets are fairly complex, and most local models struggle to follow them reliably. If you'd rather not use an LLM at all, you can enter prompts manually instead; just reach out on Discord for extra guidance and tips. For context, I've spent only $5 so far, which has powered 50+ videos, and I still have credit left, so it's been very cost-effective.
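If you're curious what the audio-handling stage boils down to, here is a minimal stand-alone sketch of lyric-timed splitting driven by Whisper timestamps. It is not the custom node's actual code: the file paths, the model name, and the faster-whisper/pydub libraries are assumptions for illustration, and the real node adds overlap handling and fallback logic on top of this.

```python
# Minimal sketch: transcribe the vocal track with Whisper, then cut the audio
# into lyric-timed snippets. Illustrative only -- not the custom node's code.
# Assumes: pip install faster-whisper pydub (and ffmpeg available on PATH).
import os

from faster_whisper import WhisperModel
from pydub import AudioSegment

VOCALS_PATH = "vocals.wav"   # hypothetical path to the separated vocal track
OUT_DIR = "snippets"
os.makedirs(OUT_DIR, exist_ok=True)

model = WhisperModel("large-v3")                  # Whisper V3, as in the workflow
segments, _info = model.transcribe(VOCALS_PATH)   # segments carry start/end times

audio = AudioSegment.from_file(VOCALS_PATH)
for i, seg in enumerate(segments):
    start_ms, end_ms = int(seg.start * 1000), int(seg.end * 1000)
    audio[start_ms:end_ms].export(f"{OUT_DIR}/chunk_{i:03d}.wav", format="wav")
    print(f"{i:03d}  {seg.start:6.2f}-{seg.end:6.2f}s  {seg.text.strip()}")
```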
🚀 Quick Start
Upload your reference image.
Load your audio file.
Set your folder name (e.g., the song title).
Fill in Prompt Creator fields (style, mood, shots, etc.).
Hit Run — everything else is automated.
The workflow will auto-queue the middle runs for long audio files (a rough illustration of the queueing mechanism appears after these steps).
For the final pass, it will tell you which groups to mute.
Simply follow the on-screen instructions, hit Run again, and the workflow finishes the process automatically. (You do not have to wait for the queued runs to finish; just mute the indicated groups and hit Run once more.)
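For the curious, auto-queueing in ComfyUI ultimately amounts to re-submitting a workflow to the local /prompt API once per chunk. The snippet below is a rough, hypothetical illustration of that mechanism, not the actual node's code; the port, workflow file, node ID, and audio_path field are all assumptions.

```python
# Rough illustration of auto-queueing: re-submit the same workflow (exported in
# ComfyUI's API format) once per audio chunk via the local /prompt endpoint.
# Hypothetical sketch -- not the actual custom node's code.
import json
import urllib.request

COMFY_URL = "http://127.0.0.1:8188/prompt"   # default local ComfyUI address
WORKFLOW_FILE = "music_video_api.json"       # hypothetical API-format export
NUM_CHUNKS = 8                               # e.g. one run per lyric snippet

with open(WORKFLOW_FILE) as f:
    workflow = json.load(f)

for i in range(NUM_CHUNKS):
    # Hypothetical override: point one node's input at the current chunk.
    # The real workflow tracks this through its folder/metadata nodes instead.
    workflow["12"]["inputs"]["audio_path"] = f"snippets/chunk_{i:03d}.wav"

    payload = json.dumps({"prompt": workflow}).encode("utf-8")
    req = urllib.request.Request(COMFY_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        print(f"queued chunk {i}: HTTP {resp.status}")
```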
🎵 Creative Workflow Tip
Just like a real music video shoot, you don't have to stick to one pass. You can run the same audio file multiple times with different reference images or styles, for example:
One pass with the lead singer as the performer.
Another pass featuring a band member or supporting character.
Additional passes experimenting with different themes, outfits, or camera styles.
Later, you can edit these separate video runs together, cutting between performances or blending visual moods — exactly how professional music videos are produced with multiple takes.
📦 Required Custom Nodes
This workflow relies on a set of custom nodes I built specifically for it.
You'll need to install them before running the workflow:
👉 ComfyUI-VRGameDevGirl Custom Nodes (GitHub)
They can also be installed via the ComfyUI Manager.
These nodes handle:
Audio splitting, transcription, and auto-queueing.
Smart folder management and metadata tracking.
Popup instructions for multi-run projects.
Scene sync and frame adjustments for HuMo compatibility.
Video combining and more (a conceptual sketch of the final combine step follows this list).
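For context, the final combine step is conceptually equivalent to an ffmpeg concat over the rendered chunks; inside ComfyUI the nodes do this for you automatically. A minimal sketch, assuming ffmpeg is on your PATH and using made-up folder and file names:

```python
# Conceptual sketch of the final combine: concatenate rendered chunks into
# FINAL_VIDEO.mp4 with ffmpeg's concat demuxer. Illustrative only -- the
# workflow's own nodes handle this step for you.
import glob
import subprocess

chunks = sorted(glob.glob("my_song/chunks/*.mp4"))   # hypothetical chunk folder

# The concat demuxer reads a text file listing the inputs in playback order.
with open("concat_list.txt", "w") as f:
    for path in chunks:
        f.write(f"file '{path}'\n")

subprocess.run(
    ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
     "-i", "concat_list.txt", "-c", "copy", "FINAL_VIDEO.mp4"],
    check=True,
)
```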
👉 Join the Discord community for support, tips, and tricks.
✅ In Summary
This workflow is designed for creators, musicians, and visual storytellers who want to merge AI visuals with music. With automatic transcription, smart prompt handling, and seamless video assembly, you can focus on creative direction while the workflow handles the heavy lifting.
