LTX 2.3 Single-Person Digital Human OmniNFT + Relay Audio-Driven Workflow
Watch the full video first if you want to understand how this LTX 2.3 single-person digital human workflow works in practice. The video shows how one character image and one audio file can be turned into an audio-driven talking video, how the staged rendering structure improves stability, and how to launch the workflow online without rebuilding the full ComfyUI environment locally.
This ComfyUI workflow is designed for LTX 2.3 single-person digital human video generation. It uses OmniNFT, Relay-style prompt control, and the distilled 1.1 model route to create an audio-driven talking character video from a still image. The goal is to make single-character digital human production easier and more repeatable for everyday creator use.
The workflow starts with one character image. The image is resized and prepared before entering the LTX video pipeline. This image becomes the main identity reference for the digital human, controlling the face, clothing, framing, visual style, and overall composition. The workflow then uses LTXVImgToVideoConditionOnly to inject the image into the video latent process, allowing the model to preserve the original subject while generating motion.
On the audio side, the workflow includes Audio Duration detection and a SimpleMath frame calculation. The audio length is read automatically and converted into an LTX-compatible frame count, which reduces manual frame-count mistakes and keeps the generated video length close to the input audio. The workflow also uses LTXVEmptyLatentAudio, LTXVConcatAVLatent, and LTXVSeparateAVLatent to connect the audio latent with video latent generation.
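The duration-to-frame conversion can be sketched in a few lines of Python. This is only an illustration, not the graph's actual SimpleMath expression: it assumes the commonly cited LTX constraint that frame counts take the form 8k + 1, and it assumes a frame rate of 25 fps. The function name ltx_frame_count and its parameters are hypothetical.

```python
import math


def ltx_frame_count(duration_s: float, fps: float = 25.0, block: int = 8) -> int:
    """Return the smallest frame count of the form block * k + 1 that covers the audio.

    Assumptions (not taken from the workflow itself): the frame count must be
    a multiple of `block` plus one, and the video runs at `fps` frames per second.
    """
    if duration_s <= 0:
        raise ValueError("audio duration must be positive")
    # Frames needed to cover the whole audio clip.
    raw_frames = math.ceil(duration_s * fps)
    # Round up to the next value of the form block * k + 1, with k >= 1.
    k = max(1, math.ceil((raw_frames - 1) / block))
    return block * k + 1


if __name__ == "__main__":
    for seconds in (1.0, 4.0, 10.0):
        print(seconds, "s ->", ltx_frame_count(seconds), "frames")
```

Rounding up rather than down means the video never ends before the audio does. The few extra frames at the tail can be trimmed when the final video is assembled.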
The core generation process is divided into three stages. The first stage establishes the character motion, composition, and basic video structure. The second stage performs latent-space refinement and continuation. The third stage applies high-resolution refinement after latent upscaling. This three-stage structure is more stable than a simple one-pass render because each stage has a clearer purpose: build motion, improve structure, then polish quality.
The workflow also includes LTX2_NAG and a universal negative prompt structure. These are used to reduce common digital human problems such as identity drift, face distortion, broken mouth shapes, unstable lip movement, flicker, frame jitter, unwanted subtitles, watermarks, bad hands, or sudden scene changes. For digital human videos, this is especially important because small facial errors are very noticeable.
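One way to keep a reusable negative prompt organized is to group the failure modes listed above and join them into a single string. The sketch below is a hypothetical helper, not the prompt shipped in the workflow: the grouping, the names NEGATIVE_TERMS and build_negative_prompt, and the comma-separated format are all assumptions.

```python
# Failure modes named in the description, grouped by theme (grouping is illustrative).
NEGATIVE_TERMS = {
    "identity": ["identity drift", "face distortion"],
    "mouth": ["broken mouth shapes", "unstable lip movement"],
    "temporal": ["flicker", "frame jitter", "sudden scene changes"],
    "overlays": ["subtitles", "watermarks"],
    "anatomy": ["bad hands"],
}


def build_negative_prompt(groups=None, skip=()):
    """Join all terms into one comma-separated string, omitting the groups named in `skip`."""
    groups = NEGATIVE_TERMS if groups is None else groups
    terms = []
    for name, items in groups.items():
        if name in skip:
            continue
        for term in items:
            if term not in terms:  # avoid duplicates across groups
                terms.append(term)
    return ", ".join(terms)


if __name__ == "__main__":
    # Example: a close-up shot with no visible hands can drop the anatomy group.
    print(build_negative_prompt(skip=("anatomy",)))
```

Keeping the terms grouped makes it easy to drop categories that do not apply to a given shot, while the shared base stays the same between renders.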
Compared with ordinary image-to-video workflows, this graph is better suited to talking character production. A basic I2V workflow can create motion, but it may not properly handle audio duration, frame alignment, staged refinement, or identity preservation. This workflow combines image guidance, audio-aware frame logic, negative guidance, latent upscaling, tiled decoding, and final video assembly into a single, more practical pipeline.
This workflow is suitable for AI presenters, single-person digital humans, virtual hosts, narration avatars, character dialogue clips, product explanation videos, short-form talking videos, Bilibili demonstrations, YouTube content, RunningHub showcases, and Civitai workflow examples.
Main features:
LTX 2.3 single-person digital human workflow
One image + one audio input
OmniNFT + Relay-style prompt control
Distilled 1.1 model route
Audio duration detection
Automatic LTX-compatible frame calculation
LTXVImgToVideoConditionOnly image guidance
LTXVEmptyLatentAudio audio latent structure
LTXVConcatAVLatent and LTXVSeparateAVLatent
LTX2_NAG negative guidance support
Three-stage rendering pipeline
LTXVLatentUpsampler high-resolution refinement
Tiled VAE decoding and CreateVideo output
Suggested workflow:
Prepare a clean single-person character image first. The face should be clear, the mouth area should not be blocked, and the subject should not be too small in the frame. Then prepare a clean audio file with stable volume and limited background noise. Load the image and audio into the workflow, then write a prompt describing the character, camera framing, lighting, expression, and speaking style. Start with a short test to check identity stability, mouth movement, and motion quality. If the face changes too much, reduce aggressive motion language and strengthen image guidance. If the result is too static, add subtle head movement or natural speaking motion to the prompt. Once the base motion is stable, continue through latent upscaling and final high-resolution output.
⚙️ RunningHub Workflow
Try the workflow online right now — no installation required.
👉 Workflow: https://www.runninghub.ai/post/2058190932108992514?inviteCode=rh-v1111
If the results meet your expectations, you can later deploy it locally for customization.
🎁 Fan Benefits: Register to get 1000 points, plus 100 points for each daily login, and enjoy 4090 performance with 48 GB.
📺 Bilibili Updates (Mainland China & Asia-Pacific)
If you’re in the Asia-Pacific region, you can watch the video below to see the workflow demonstration and creative breakdown.
📺 Bilibili Video: https://www.bilibili.com/video/BV1yRGj6XEaM/
☕ Support Me on Ko-fi
If you find my content helpful and want to support future creations, you can buy me a coffee ☕.
Every bit of support helps me keep creating — just like a spark that can ignite a blazing flame.
👉 Ko-fi: https://ko-fi.com/aiksk
💼 Business Contact
For collaboration or inquiries, please contact aiksk95 on WeChat.
I also keep model resources updated on Quark Netdisk:
👉 https://pan.quark.cn/s/20c6f6f8d87b
These resources are mainly intended for local users, to support creation and learning.

