Qwen3-TTS Ultimate Pack (Cloning + Design + Low VRAM)

# 🚨 UPDATE V1.5 (Jan 24, 2026) - CRITICAL FIX

Please update to this version immediately!

The previous version (v1.0) may crash because of a recent breaking change in the ComfyUI-Qwen3-TTS custom nodes.

✅ Fixes in v1.5:

* Fixed Crash: Resolved the "Unsupported speakers: fixed" error.

* Plug & Play: Removed all personal file paths (no more "File Not Found" errors on first run).

* Potato Mode Guide: Added a visual guide inside the workflow for 0.6B model switching.

---

# 🎧 Qwen3-TTS Ultimate Pack (Voice Design & Cloning)

This is a beginner-friendly workflow for the newly released Qwen3-TTS model. It is optimized to run on consumer hardware with as little as 6GB VRAM (tested and working on a GTX 1060).

💡 POTATO PC MODE (<4GB VRAM):

If the workflow crashes, change the repo_id in the loader to: Qwen/Qwen3-TTS-12Hz-0.6B-Base (it is faster and uses half the memory, but the output is slightly less expressive).
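As a rough illustration of the Potato Mode rule, here is a minimal Python sketch. The helper `pick_repo_id` and its constants are hypothetical and are not part of ComfyUI or the Qwen3-TTS nodes; the 4GB threshold and the 0.6B repo_id come from the note above, and `current_repo_id` stands for whatever value is already set in your loader.

```python
# Hypothetical helper for illustration only; the real switch is made by
# editing the repo_id field in the loader node inside ComfyUI.
POTATO_REPO_ID = "Qwen/Qwen3-TTS-12Hz-0.6B-Base"  # from the Potato Mode note
POTATO_VRAM_LIMIT_GB = 4.0  # "<4GB VRAM" threshold from the note


def pick_repo_id(vram_gb: float, current_repo_id: str) -> str:
    """Return the repo_id to enter in the loader for a given amount of VRAM."""
    if vram_gb < 0:
        raise ValueError("vram_gb must be non-negative")
    if vram_gb < POTATO_VRAM_LIMIT_GB:
        # Low-memory card: fall back to the smaller 0.6B model.
        return POTATO_REPO_ID
    # Enough memory: keep whatever the loader already uses.
    return current_repo_id


if __name__ == "__main__":
    # "your-current-repo-id" is a placeholder, not a real model name.
    print(pick_repo_id(3.0, "your-current-repo-id"))
```

If you still crash with more than 4GB of VRAM, switching to the 0.6B model by hand is the same fix.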

I created this because the new nodes can be confusing for beginners. This download includes two separate groups in one workflow, toggled by a "Fast Switcher" (the Fast Groups Bypasser node).

## 🚀 What's Included?

### Workflow 1: Voice Design (Text-to-Speech)

* Best for: Narrators, Movie Trailers, Assistant Voices.

* Uses the VoiceDesign model for high-quality, directed acting.

* Includes the "Instruct" field setup so you can direct the emotion (e.g., "Sad whisper", "Angry shout").

### Workflow 2: Voice Cloning (Audio-to-Speech)

* Best for: Cloning specific voices (yourself, friends, characters).

* Uses the Base model + Reference Audio.

* Pro Tip: I've set it up to accept ref_text (the transcript of the reference audio), which improves accuracy significantly.

## ⚙️ Requirements

1. ComfyUI Manager installed.

2. Qwen3 Nodes: You need ComfyUI-Qwen3-TTS (Author: DarioFT / ID: 3172 in Manager).

3. Utility Nodes: You need rgthree-comfy (via Manager) for the mode switcher to work.

* (Note: If you don't want to install rgthree, you can switch the groups manually: select the nodes in a group and press Ctrl+B to bypass them, or Ctrl+M to mute them).

## 📝 How to Use (New Easy Mode)

I have cleaned up the workflow into two distinct Color-Coded Groups. You don't need to wire anything manually!

The Control Switch: Look for the "Fast Groups Bypasser" node on the left.

1. For Text-to-Speech: Set Enable Voice Design to "yes" and Cloning to "no".

2. For Cloning: Set Enable Voice Cloning to "yes" and Design to "no".

* Note: Only enable ONE at a time to save VRAM, especially on low-memory cards like the GTX 1060.
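The "only one at a time" rule can be sketched as a small check. This is a hypothetical Python helper for illustration, not something the workflow or the Fast Groups Bypasser node runs; the two flags mirror the "yes"/"no" toggles described above.

```python
def active_group(enable_voice_design: bool, enable_voice_cloning: bool) -> str:
    """Return the name of the single enabled group, or raise if the
    toggles are both on or both off."""
    if enable_voice_design == enable_voice_cloning:
        # Both "yes" wastes VRAM; both "no" produces no audio.
        raise ValueError("Enable exactly one group: Voice Design or Voice Cloning")
    return "Voice Design" if enable_voice_design else "Voice Cloning"


if __name__ == "__main__":
    print(active_group(True, False))   # Text-to-Speech setup
    print(active_group(False, True))   # Cloning setup
```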

Visual Guide:

🟦 **Pale Blue Group (Top)** = Voice Design.

🟦 **Cyan Group (Bottom)** = Voice Cloning.

* Visual Cue: If the nodes inside a group turn darker/muted, it means that group is Bypassed (OFF).

## 💡 Performance Note

* VRAM Usage: ~3.5GB to 5GB (depending on model choice).

* Speed: Fast generation even on older cards (GTX 10xx series).

Enjoy making your AI speak! And please give it a thumbs up if this saved your day! ⭐
