Ryuko Matoi - Wan2.1-T2V-14B
Description
"To Hell With Your Opinion. I'll Take My Own Path No Matter What Anyone Else Says." - Ryuko Matoi
Usage
The trigger word is "Ryuko-chan".
There is no specific prompt format I use (still experimenting), but I usually start all prompts with "High quality 2D art animation." For example, "High quality 2d art animation. Ryuko-chan, with long red hair and a black trench coat, stands on top of a ruined skyscraper. The city burns in the distance, smoke rising into the dark sky. The wind makes her coat billow as she crosses her arms. The camera suddenly zooms in on her determined face. She smirks, tilts her head slightly, and winks."
For inference, I use Kijai's wrapper. Unfortunately, Civitai does not seem to parse metadata from it, but the workflow should be embedded in each clip. Just in case, here's an example workflow in JSON format. It's still a work in progress, so the right side is a bit messy, but overall fully functional.
All clips I post are the raw output of the LoRA. I do not use upscaling or frame interpolation, because they misrepresent the LoRA's true capabilities. The following parameters are constant across all clips:
Sampler: unipc
Steps: 20
Cfg: 6
Shift: 7
If you check my workflow, you'll see that I use all possible optimizations introduced since Wan2.1 came out. I really want to express my appreciation to the software developers and ML engineers (Comfyanonymous, Kijai, the TeaCache team, and many others) who made it possible to run Kling-level video models on a consumer GPU with satisfying speed and (almost) no compromise in quality.
I started with "pure" FP8 + SDPA, and below is a breakdown of the speed improvements I achieved with each optimization technique.
For reference, my setup is: RTX 3090, 64 GB RAM, Win 11, Python 3.11.6, Torch 2.7.0.dev20250311+cu126, Sage Attention 2.1.1, Triton 3.2.0, NVidia driver 572.16, ComfyUI portable v0.3.26.
The times I mention below are not for rendering the showcase clips in the gallery. Most of those are 640x480x81, which adds about 2 minutes to the total time. That's fine, since I usually launch 50-60 prompts at once and check back a few hours later to collect the results.
The times that matter most to me are for running Wan-14B T2V while testing and comparing different LoRA versions. During this phase I use 512x512x65 dimensions. 99% of these test clips go to the recycle bin immediately after being generated, but they provide a clear assessment of LoRA quality, so I need test-time inference to be as fast as possible.
For a 512x512x65 clip, 20 steps, UniPC sampler:
(Only the DiT inference phase is considered, as the optimizations do not affect the time required for encoding prompts with UMT5 or decoding latents with the VAE. That time is negligible anyway, only 2-3 seconds.)
fp8_e4m3fn + sdpa -> 09:24 (28.24s/it)
fp8_e4m3fn + sageattention 2 -> 06:53 (20.70s/it) +36% (1.36x)
fp8_e5m2 + torch.compile + sageattention 2 -> 06:21 (19.08s/it) +48% (1.48x)
fp8_e5m2 + torch.compile + sageattention 2 + teacache (0.250/6) -> 04:29 (13.50s/it) +109% (2.09x)
fp8_e5m2 + torch.compile + sageattention 2 + teacache + fp16_fast -> 03:23 (10.19s/it) +177% (2.77x)
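The speedup factors in this list can be reproduced from the wall-clock times alone. The snippet below is a minimal sketch that assumes each factor is the baseline time divided by the run's time; the short labels are shorthand for the configurations above, not names used by any tool. The per-iteration values it derives (total time / 20 steps) can differ slightly from the s/it reported by the progress bar.

```python
# Recompute per-step time and speedup from the mm:ss totals listed above.
STEPS = 20

def to_seconds(mmss):
    """Convert an 'mm:ss' string to a number of seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)

# (shorthand label, total DiT inference time)
runs = [
    ("baseline (sdpa)", "09:24"),
    ("+ sageattention 2", "06:53"),
    ("+ torch.compile", "06:21"),
    ("+ teacache", "04:29"),
    ("+ fp16_fast", "03:23"),
]

baseline = to_seconds(runs[0][1])
results = []
for label, total in runs:
    seconds = to_seconds(total)
    # speedup = how many times faster than the baseline run
    results.append((label, seconds / STEPS, baseline / seconds))

for label, sec_per_it, speedup in results:
    print(f"{label:20s} {sec_per_it:6.2f} s/it  {speedup:.2f}x")
```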
So, that is almost a 3x speedup with minimal quality loss (well, for my use case; your experience may differ). I compared clips using the same seed, rendered with and without TeaCache/Sage Attention 2, and honestly, I couldn't see a clear drop in quality. Maybe there's a slight difference in very complex prompts with a lot of motion, but those tend to struggle even in "raw" mode. If it matters, I also make extensive use of Enhance-A-Video and SLG (layer 9, 0.2-0.8). They seem to have a positive impact on clips, improving quality and mitigating motion artifacts.
The speed boost was one of the major factors in my decision to switch from HV (HunyuanVideo) to Wan. While Wan provides better quality, I was used to fast rendering in HV, so getting similar speeds with Wan made the transition worthwhile.
Training
I used 215 images (most of them were manually selected from 18461 screencaps of all 25 episodes of Kill la Kill, plus a few official artworks). They were captioned with THUDM/cogvlm2-llama3-chat-19B, using the following prompt:
"Describe this artwork in detail, focusing on the visual style, setting, atmosphere, and artistic techniques. When a female character with dark hair is present, refer to her as Ryuko-chan. Mention her clothing, accessories, and pose, but do not describe her facial features, body proportions, or physical attributes. Include specific art style terminology (e.g., cel-shaded, painterly, watercolor, digital illustration) and visual elements that define the aesthetic. Describe lighting, color palette, composition, and any notable artistic influences."
My idea was to caption in a way that would allow me to freely alter Ryuko's clothing and hair color, while keeping all other aspects of her appearance (facial features, body, etc.) authentic and recognizable. This actually worked; WanVideo did an amazing job on its part (as I hope you can see from the examples I posted). Unfortunately, the gear-shaped pupils were not memorized, probably because I did not have enough close-up shots in the dataset.
I didn't caption anything by name besides Ryuko, so while the model likely learned visual aspects of Senketsu and the Scissor Blade, explicitly calling them out may not work (though I haven't tested it much). But this was intentional - I wanted to teach the model only about Ryuko and her physical appearance.
If her outfit and hair color are not explicitly described, they will most likely default to Ryuko's original clothing. In everyday scenes, she will wear her regular uniform, while in battle and intense scenes, she tends to prefer Senketsu.
(Initially I planned to create a character-only LoRA, not a style one, so I could also draw Ryuko in a realistic manner, but the dataset bias was too strong, so it can only render in the style of the original series, which actually isn't a bad thing. That said, it's probably more precise to refer to it as a mixed character/style LoRA.)
The first version of the LoRA was trained for the 1.3B model with diffusion-pipe, but I didn't like the result: the 1.3B model (imho) is too small and can't really compete in quality with HV (its only advantages are speed and lower hardware requirements). The second version was trained with ai-toolkit, but that LoRA also didn't turn out as well as I had hoped (I attribute this to the fact that ai-toolkit doesn't yet support training with captions).
Finally, the current (successful) version was trained for the 14B model with musubi-tuner. Below are the commands I used (I typically create .bat files for launching musubi-tuner pipelines, hence the format of the train command). In brief, almost all important parameters are left at their defaults, except the learning rate, which was changed to 7e-5. I trained for about 40 epochs (17630 steps at 430 steps per epoch, which works out to 41 epochs) and, after three days of testing, selected the checkpoint at step 15050 (epoch 35) as the most successful. I could have trained for more steps, because the model still seemed to be improving and the training loss was trending in a promising direction.
cache_latents (vae):
python wan_cache_latents.py --dataset_config G:/samples/musubi-tuner/_ryuko_matoi_wan14b_config.toml --vae G:/samples/musubi-tuner/wan14b/vae/wan_2.1_vae.safetensors
cache_prompts (t_enc):
python wan_cache_text_encoder_outputs.py --dataset_config G:/samples/musubi-tuner/_ryuko_matoi_wan14b_config.toml --t5 G:/samples/musubi-tuner/wan14b/tenc/models_t5_umt5-xxl-enc-bf16.pth --batch_size 16
dataset config (I did not write it by hand. I made a small script that scans a folder of image files and calculates a scaled resolution for each image so that the larger dimension of the scaled image does not exceed 768px, the value that appears in the resolutions below, while preserving the approximate aspect ratio and keeping both dimensions divisible by 16. It does not physically resize anything; it only prepares ahead for musubi-tuner's bucketing mechanism. It then groups images with the same dimensions into subfolders and generates a TOML file with metadata for musubi-tuner):
[general]
caption_extension = ".txt"
batch_size = 1
enable_bucket = true
bucket_no_upscale = false

[[datasets]]
resolution = [528, 768]
image_directory = "H:/datasets/ryuko_matoi_wan_video/1057x1516x1"
cache_directory = "H:/datasets/ryuko_matoi_wan_video/1057x1516x1/cache"
num_repeats = 2

[[datasets]]
resolution = [768, 432]
image_directory = "H:/datasets/ryuko_matoi_wan_video/1280x720x1"
cache_directory = "H:/datasets/ryuko_matoi_wan_video/1280x720x1/cache"
num_repeats = 2

[[datasets]]
resolution = [592, 768]
image_directory = "H:/datasets/ryuko_matoi_wan_video/1600x2033x1"
cache_directory = "H:/datasets/ryuko_matoi_wan_video/1600x2033x1/cache"
num_repeats = 2

[[datasets]]
resolution = [400, 768]
image_directory = "H:/datasets/ryuko_matoi_wan_video/1727x3264x1"
cache_directory = "H:/datasets/ryuko_matoi_wan_video/1727x3264x1/cache"
num_repeats = 2

[[datasets]]
resolution = [720, 768]
image_directory = "H:/datasets/ryuko_matoi_wan_video/1917x2002x1"
cache_directory = "H:/datasets/ryuko_matoi_wan_video/1917x2002x1/cache"
num_repeats = 2

[[datasets]]
resolution = [768, 432]
image_directory = "H:/datasets/ryuko_matoi_wan_video/1920x1080x1"
cache_directory = "H:/datasets/ryuko_matoi_wan_video/1920x1080x1/cache"
num_repeats = 2

[[datasets]]
resolution = [736, 768]
image_directory = "H:/datasets/ryuko_matoi_wan_video/1920x1963x1"
cache_directory = "H:/datasets/ryuko_matoi_wan_video/1920x1963x1/cache"
num_repeats = 2

[[datasets]]
resolution = [480, 768]
image_directory = "H:/datasets/ryuko_matoi_wan_video/1920x3038x1"
cache_directory = "H:/datasets/ryuko_matoi_wan_video/1920x3038x1/cache"
num_repeats = 2

[[datasets]]
resolution = [768, 576]
image_directory = "H:/datasets/ryuko_matoi_wan_video/2363x1813x1"
cache_directory = "H:/datasets/ryuko_matoi_wan_video/2363x1813x1/cache"
num_repeats = 2

[[datasets]]
resolution = [768, 624]
image_directory = "H:/datasets/ryuko_matoi_wan_video/3877x3208x1"
cache_directory = "H:/datasets/ryuko_matoi_wan_video/3877x3208x1/cache"
num_repeats = 2

[[datasets]]
resolution = [640, 768]
image_directory = "H:/datasets/ryuko_matoi_wan_video/690x820x1"
cache_directory = "H:/datasets/ryuko_matoi_wan_video/690x820x1/cache"
num_repeats = 2

[[datasets]]
resolution = [576, 768]
image_directory = "H:/datasets/ryuko_matoi_wan_video/690x920x1"
cache_directory = "H:/datasets/ryuko_matoi_wan_video/690x920x1/cache"
num_repeats = 2

[[datasets]]
resolution = [512, 768]
image_directory = "H:/datasets/ryuko_matoi_wan_video/800x1195x1"
cache_directory = "H:/datasets/ryuko_matoi_wan_video/800x1195x1/cache"
num_repeats = 2

[[datasets]]
resolution = [768, 496]
image_directory = "H:/datasets/ryuko_matoi_wan_video/935x608x1"
cache_directory = "H:/datasets/ryuko_matoi_wan_video/935x608x1/cache"
num_repeats = 2
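The resolution values in this config are consistent with scaling the longer side of each source image to 768 px and rounding the other side down to a multiple of 16. The sketch below reproduces that rule; it is inferred from the numbers (source sizes are taken from the folder names), not the author's actual script, and `scaled_resolution` and `dataset_block` are hypothetical helper names.

```python
# Hypothetical reconstruction of the resolution rule behind the config above.
# Integer arithmetic avoids floating-point rounding on the longer side.

def scaled_resolution(width, height, max_side=768, multiple=16):
    """Scale so the longer side equals max_side, then floor to a multiple."""
    longest = max(width, height)

    def snap(value):
        scaled = value * max_side // longest  # proportional integer scaling
        return scaled // multiple * multiple  # round down to the grid

    return snap(width), snap(height)


def dataset_block(image_dir, width, height, num_repeats=2):
    """Render one [[datasets]] entry in the same layout as the config."""
    w, h = scaled_resolution(width, height)
    return "\n".join([
        "[[datasets]]",
        f"resolution = [{w}, {h}]",
        f'image_directory = "{image_dir}"',
        f'cache_directory = "{image_dir}/cache"',
        f"num_repeats = {num_repeats}",
    ])


if __name__ == "__main__":
    print(dataset_block("H:/datasets/ryuko_matoi_wan_video/1280x720x1", 1280, 720))
```

Note that small sources such as 690x820 are scaled up under this rule, which fits with bucket_no_upscale = false in the [general] section.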
sampling file:
# prompt 1
Ryuko-chan with blonde hair is walking on the beach, camera zoom out. --w 384 --h 384 --f 45 --d 7 --s 20
# prompt 2
Ryuko-chan dancing in the bar. --w 384 --h 384 --f 45 --d 7 --s 20
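To make the inline options in the sampling file easier to read, here is a small parser sketch. It assumes the options are written with a double hyphen (`--w`, `--h`, `--f`, `--d`, `--s`) and that they mean width, height, frame count, seed, and steps, as I understand musubi-tuner's sample prompt format; the function name and the mapping are illustrative, and only the integer options shown above are handled.

```python
# Illustrative parser for one line of the sampling file; not part of musubi-tuner.
# Assumed meaning of the short option names (see the lead-in).
OPTION_NAMES = {"w": "width", "h": "height", "f": "frames", "d": "seed", "s": "steps"}


def parse_sample_line(line):
    """Split 'prompt text --w 384 --h 384 ...' into (prompt, options)."""
    tokens = line.split()
    prompt_words, options = [], {}
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token.startswith("--") and i + 1 < len(tokens):
            key = OPTION_NAMES.get(token[2:], token[2:])
            options[key] = int(tokens[i + 1])  # only integer options are handled
            i += 2
        else:
            prompt_words.append(token)
            i += 1
    return " ".join(prompt_words), options


if __name__ == "__main__":
    example = "Ryuko-chan dancing in the bar. --w 384 --h 384 --f 45 --d 7 --s 20"
    prompt, options = parse_sample_line(example)
    print(prompt)
    print(options)
```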
train command:
accelerate launch --num_cpu_threads_per_process 1 --mixed_precision bf16 wan_train_network.py ^
--task t2v-14B ^
--dit G:/samples/musubi-tuner/wan14b/dit/wan2.1_t2v_14B_bf16.safetensors ^
--vae G:/samples/musubi-tuner/wan14b/vae/wan_2.1_vae.safetensors ^
--t5 G:/samples/musubi-tuner/wan14b/tenc/models_t5_umt5-xxl-enc-bf16.pth ^
--dataset_config G:/samples/musubi-tuner/_ryuko_matoi_wan14b_config.toml ^
--sdpa ^
--mixed_precision bf16 ^
--fp8_base ^
--fp8_t5 ^
--optimizer_type adamw8bit ^
--learning_rate 7e-5 ^
--gradient_checkpointing ^
--max_data_loader_n_workers 2 ^
--persistent_data_loader_workers ^
--network_module networks.lora_wan ^
--network_dim 32 ^
--network_alpha 32 ^
--timestep_sampling shift ^
--discrete_flow_shift 3.0 ^
--max_train_epochs 50 ^
--save_every_n_epochs 1 ^
--seed 42 ^
--output_dir G:/samples/musubi-tuner/output ^
--output_name ryuko_matoi_wan14b ^
--log_config ^
--log_with tensorboard ^
--logging_dir G:/samples/musubi-tuner/logs ^
--sample_prompts G:/samples/musubi-tuner/_ryuko_matoi_wan14b_sampling.txt ^
--save_state ^
--sample_every_n_epochs 1
During training the speed was around 4 s/it, and VRAM usage was around 21GB. I didn't apply any specific optimizations aside from the --fp8_base and --fp8_t5 flags.
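The step counts quoted earlier can be cross-checked against the dataset size and the training speed. This is a back-of-the-envelope sketch that assumes all 215 images are used with num_repeats = 2, batch size 1, and no gradient accumulation; the time estimate uses the rough 4 s/it figure and ignores the per-epoch sampling and checkpoint saving.

```python
# Cross-check of steps per epoch, epochs trained, and approximate training time.
NUM_IMAGES = 215          # dataset size stated in the Training section
NUM_REPEATS = 2           # num_repeats in the dataset config
BATCH_SIZE = 1            # batch_size in the dataset config
TOTAL_STEPS = 17630       # steps trained in total
SELECTED_STEP = 15050     # checkpoint chosen after testing
SECONDS_PER_STEP = 4.0    # approximate training speed

steps_per_epoch = NUM_IMAGES * NUM_REPEATS // BATCH_SIZE
epochs_trained = TOTAL_STEPS / steps_per_epoch
selected_epoch = SELECTED_STEP / steps_per_epoch
training_hours = TOTAL_STEPS * SECONDS_PER_STEP / 3600

print(f"steps per epoch:  {steps_per_epoch}")
print(f"epochs trained:   {epochs_trained:g}")
print(f"selected epoch:   {selected_epoch:g}")
print(f"pure step time:   {training_hours:.1f} h")
```

Under these assumptions 17630 steps correspond to 41 full epochs, the selected checkpoint falls at the end of epoch 35, and the step time alone adds up to a little under 20 hours.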
Compatibility testing with other LoRAs has not been conducted (and is not planned).
I also did not test it with I2V models (in fact, I haven't even downloaded any of them yet).
(Oh, and I also published the dataset alongside the LoRA, but there is nothing notable in it.)
Conclusion
This is my first (successful) LoRA for Wan Video 2.1-T2V 14B, and I can say I feel very excited about this model. It has been a pleasure to train, and it grasps concepts and styles exceptionally well (based on my current, limited experience). I can't wait to train all the LoRAs I have planned! Until now, I have only trained generative AI models for style, but now I feel very enthusiastic about training models for not only style but also VFX, concepts, movements, etc.
Returning to this LoRA, I can't say it's perfect, but roughly 1 out of every 3 clips I generate is successful (i.e., it adheres to the prompt and doesn't have undesirable artifacts), which I consider a good result. I attribute this to the model itself. Although I trained the LoRA on images only (for upcoming LoRAs, I will also use video clips), the model learned (or rather, extrapolated) a lot about animation techniques and visual features of the original series. This includes, for example, the excessive (and not always appropriate xD) use of lens flare in outdoor scenes, exaggerated facial animations, etc.
I also think this model benefits from diverse datasets (at least 100 images) and low learning rates (1e-4 is too high). For my next LoRA, I will use no more than 5e-5. It might require more steps, but in this case, it learns all the details better without overfitting.