HuMo for Wan

HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning

✨ Key Features

HuMo is a unified, human-centric video generation framework designed to produce high-quality, fine-grained, and controllable human videos from multimodal inputs, including text, images, and audio. It supports strong text prompt following, consistent subject preservation, and synchronized audio-driven motion.

VideoGen from Text-Image - Customize character appearance, clothing, makeup, props, and scenes using text prompts combined with reference images.
VideoGen from Text-Audio - Generate audio-synchronized videos solely from text and audio inputs, removing the need for image references and enabling greater creative freedom.
VideoGen from Text-Image-Audio - Achieve the highest level of customization and control by combining text, image, and audio guidance.
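
As a rough illustration of how the three modes relate to the inputs you supply, here is a minimal Python sketch. The `Inputs` class, the `select_mode` function, and the returned mode strings are hypothetical names chosen for this example; they are not part of HuMo's actual API or configuration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Inputs:
    """Hypothetical container for conditioning inputs (not HuMo's API)."""
    prompt: str
    image_path: Optional[str] = None  # reference image, if any
    audio_path: Optional[str] = None  # driving audio clip, if any


def select_mode(inputs: Inputs) -> str:
    """Map the supplied inputs to one of the three modes listed above."""
    # Every mode listed above includes a text prompt.
    if not inputs.prompt:
        raise ValueError("a text prompt is required in every mode")
    if inputs.image_path and inputs.audio_path:
        return "text-image-audio"
    if inputs.image_path:
        return "text-image"
    if inputs.audio_path:
        return "text-audio"
    raise ValueError("provide a reference image, an audio clip, or both")
```

For example, `select_mode(Inputs("a singer on stage", audio_path="song.wav"))` picks the text-audio mode, since no reference image is given.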

Examples and models are reuploaded here for your convenience from the following sources:
https://huggingface.co/bytedance-research/HuMo
https://github.com/Phantom-video/HuMo

Compatible with both 480P and 720P resolutions. 720P inference produces much better quality.

Model type	Checkpoint
Base model	Wan Video 14B t2v
Published	2025-09-13
