Look Back 🎥 Wan2.1-T2V-14B
Model description
About
Adapted from the celebrated one-shot manga by Tatsuki Fujimoto, Look Back (2024) by Studio DURIAN tells the poignant story of two young aspiring artists. It follows the relationship between the confident, outgoing Fujino and the reclusive, shut-in Kyomoto, who connect from afar through their shared passion for drawing manga. What begins as a tale of youthful rivalry and admiration slowly blossoms into a complex and deeply moving exploration of friendship, passion, and the quiet passage of time.
The greatness of Look Back lies in its profound emotional honesty and its relatable depiction of the creative spirit. The film is a masterful and heart-wrenching meditation on talent, jealousy, regret, and the unexpected tragedies that shape our lives. With stunningly expressive animation that captures every subtle emotion, it serves as a beautiful, melancholic tribute to the power of art to process grief and the indelible impact we have on one another, making it a truly unforgettable and essential cinematic experience.
This animated film unexpectedly had a deep influence on me, and its artistic direction fascinated me enough that it was only a matter of time before I attempted to steal its art identity. Although I did not fully succeed in replicating the style (for instance, the "line boil" effect is missing, the linework looks far too sharp, etc.), and the high-motion scenes are disappointing, I decided that in some respects the result is not bad and worth sharing (as my last LoRA for Wan2.1).
Usage
I use WanVideoWrapper. Each clip I post contains an embedded workflow. An example workflow in JSON format is here.
All videos were created using the base WanVideo2.1-14B-T2V model.
For acceleration, I use a lightx2v self-forcing LoRA set at 0.9 strength.
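For context, the strength value is simply a multiplier on the low-rank update that gets merged into each base weight. Below is a minimal sketch of the general LoRA math, not WanVideoWrapper's actual code:

```python
import torch

def merge_lora(base_weight, lora_down, lora_up, alpha, strength=0.9):
    # base_weight: (out, in); lora_down: (rank, in); lora_up: (out, rank).
    # The low-rank update is scaled by alpha/rank and then by the user-facing strength.
    rank = lora_down.shape[0]
    return base_weight + strength * (alpha / rank) * (lora_up @ lora_down)
```

When several LoRAs are active at once (e.g., this style LoRA plus the lightx2v one), each update is scaled by its own strength and added on top of the base weights.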
Dataset
The dataset was extracted from the original animated film, which was split using PySceneDetect. It contains 135 videos (source resolution 1920x960) and 170 images (static frames, same resolution). I also had a validation dataset consisting of 20 images and 20 videos.
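If you want to reproduce the splitting step, PySceneDetect's high-level API is enough. A minimal sketch (the file path and detector threshold below are illustrative, not the exact values used):

```python
from scenedetect import detect, ContentDetector, split_video_ffmpeg

video_path = "look_back.mkv"  # illustrative path
# Detect cuts from frame-to-frame content changes; the threshold may need tuning.
scene_list = detect(video_path, ContentDetector(threshold=27.0))
# Write one clip per detected scene (requires ffmpeg on PATH).
split_video_ffmpeg(video_path, scene_list)
```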
For captioning, I used Kwai-Keye/Keye-VL-8B-Preview. There were a few cases where it hallucinated and came up with completely irrelevant captions (I attribute this to the "preview" status of the model). But when it didn't, it captioned with very impressive precision and attention to detail (probably the best among all open-weight video captioners I've tried so far).
The training configuration included 3 video sections and 1 image section. All video sections were based on the same set of 135 videos, but trained with different settings:
"high-res" dataset, trained at [960x480], frame_extraction = "head", target_frames = [1, 9]
"medium-res" dataset, trained at [480x240], frame_extraction = "uniform", target_frames = [17], frame_sample = 3
"low-res" dataset, trained at [240x128], frame_extraction = "full", max_frame = 81
For all three video datasets, I used different caption files.
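To make the combined setup easier to picture, here is the same configuration written out schematically as Python data (purely illustrative: the trainer takes its own config format, and the directory names and caption extensions below are made up):

```python
# All three video sections reuse the same 135 clips; only resolution,
# frame extraction and caption files differ. Paths/extensions are placeholders.
video_dir = "dataset/clips"
image_dir = "dataset/frames"

dataset_sections = [
    {"type": "video", "resolution": (960, 480), "frame_extraction": "head",
     "target_frames": [1, 9], "captions": ".high.txt"},
    {"type": "video", "resolution": (480, 240), "frame_extraction": "uniform",
     "target_frames": [17], "frame_sample": 3, "captions": ".medium.txt"},
    {"type": "video", "resolution": (240, 128), "frame_extraction": "full",
     "max_frame": 81, "captions": ".low.txt"},
    {"type": "image", "resolution": (1280, 640), "captions": ".image.txt"},
]
```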
- For captioning the high-res section (maximum caption complexity), the prompt was:
Your task is to generate a detailed, multi-faceted description that deconstructs the video's content and technical execution. The output should be a comprehensive paragraph.
Analyze and describe all of the following elements:
Subject and Scene: Describe with rich, specific details about appearance, clothing, objects, background, and foreground.
Motion: Describe the characteristics of the movement, including its speed, direction, and quality.
Camera Language: Identify specific camera work, including shot types (e.g., long shot, bird's-eye view), camera angles (e.g., low angle), and camera movements (e.g., dolly, pan, orbit).
Atmosphere: Describe the mood of the scene with evocative language.
Strictly follow these rules:
Do not describe or name any artistic styles (e.g., 'cyberpunk', 'wasteland', 'masterpiece').
Focus only on describing the literal visual elements.
Always start description with phrase 'Lookback style.'
Examples:
Lookback style. A bird's-eye view tracking shot follows a lone figure walking through a desolate landscape of cracked earth and ruined structures. The camera moves smoothly, emphasizing the vast emptiness and creating a powerful atmosphere of loneliness and melancholy.
Lookback style. The camera pans slowly across a rainy street illuminated by bright neon signs. A figure in a long trench coat walks away from the camera, their reflection shimmering in the puddles on the asphalt. The dense, layered visuals and vibrant yet somber colors create a mysterious and immersive atmosphere.
Lookback style. An extreme close-up shot focuses on a honeybee covered in yellow pollen, moving methodically on a sunflower. The camera is static, capturing the intricate details of the bee's wings and the flower's texture with high clarity. The bright, natural lighting and vibrant colors evoke a sense of wonder and the beauty of nature.
Lookback style. Using a shaky, first-person-view drone shot, the video speeds through a dense forest, dodging trees and branches at high velocity. The rapid motion and low angle create a thrilling and disorienting experience, filled with tension and anxiety.
- For captioning the medium-res section (moderate caption complexity), the prompt was:
Your task is to generate a descriptive sentence that captures the video's main activity, key visual characteristics, and general mood.
You must identify and combine the following elements into a fluid sentence:
Subject with Description: The main subject plus one or two clear adjectives (e.g., 'a tall man in a hat').
Scene with Description: The environment with some context (e.g., 'in a busy city park').
Motion: The primary action, described with more detail (e.g., 'walking briskly').
Atmosphere: A word or short phrase describing the overall mood if it is obvious (e.g., 'peaceful', 'energetic').
Strictly follow these rules:
Do not describe or name any artistic styles.
You may mention dominant camera work if it is a central feature, for example, 'slow motion'.
Always start description with phrase 'Lookback style.'
Examples:
Lookback style. A young woman in a yellow coat is running through a park with falling leaves, creating a lonely mood.
Lookback style. A classic red convertible drives along a sunny coastal highway, giving off an energetic vibe.
Lookback style. A chef quickly chops vegetables on a wooden board in a bright kitchen.
Lookback style. In slow motion, a drop of water falls into a still pool, creating ripples.
Lookback style. A large brown bear catches a fish from a rushing river in a display of natural power.
- For captioning the low-res section (very brief captions), the prompt was:
Your task is to generate a single, concise sentence that describes only the most essential action in the video.
Analyze the video to identify three core elements:
Subject: The main person, animal, or object. Use a general term like 'a person' or 'a car'.
Scene: The general environment. Use a simple description like 'outside' or 'in a room'.
Motion: The primary action. Use a simple verb like 'walking' or 'moving'.
Strictly follow these rules:
Do not describe colors, textures, facial expressions, or small background objects.
Do not identify camera work like 'close-up' or 'pan'.
Do not interpret the mood, atmosphere, or any artistic style.
Always start description with phrase 'Lookback style.'
Examples:
Lookback style. A person walks across a field.
Lookback style. A car moves on a road.
Lookback style. An animal runs through a forest.
Lookback style. Two figures stand in a room.
Lookback style. An object falls from the sky.
Lastly, the image dataset (static frames, sourced from the videos), trained at 1280x640 resolution, was captioned with the prompt:
Your task is to generate a detailed description of a single, static image. Focus on the composition, literal visual elements, and implied state at one specific moment in time.
Crucial Rule: Do not describe movement or actions that happen over time. Instead of 'a person is walking', describe their state as 'a person is captured mid-stride'.
Analyze and describe the following elements present in the frame:
Subject and Scene: Describe all visual elements, including subjects, objects, and the environment, in high detail.
Camera Language: Identify the static camera properties, such as the shot type (e.g., close-up, medium shot), camera angle (e.g., low angle), and lens effects (e.g., fisheye).
Atmosphere: Analyze the mood contained within the single image.
Implied Action: You may describe the posture or state of the subject that suggests potential action (e.g., 'poised to jump', 'frozen in a dance').
Strictly follow these rules:
Do not describe or name any artistic styles.
Describe only what is literally visible.
Always start description with phrase 'Lookback style.'
Examples:
Lookback style. A medium shot of a refined woman in a long beige coat standing still in front of a famous painting. The soft lighting highlights the texture of her coat and her serene facial expression. The composition creates a calm and contemplative atmosphere.
Lookback style. A low-angle shot captures a basketball player frozen in mid-air, about to dunk the ball. The harsh arena lights create strong highlights and shadows on his straining muscles, conveying a powerful sense of energy and peak action.
Lookback style. A wide-angle lens captures a majestic mountain range at sunrise. The foreground shows a calm, reflective lake, and the sky is painted with vibrant orange and pink hues. The image has a grand and serene atmosphere.
Lookback style. This frame is an extreme close-up of a single, tear-streaked eye, rendered in high-contrast black and white. The focus is razor-sharp on the eyelashes and the reflection in the pupil, creating an intensely intimate and melancholic mood.
Overall, I think this tiered captioning scheme (caption complexity increasing with resolution) improves the model's comprehension.
Out of all my LoRAs for Wan2.1, this one probably provides the best prompt following and prompt comprehension.
But (as usual) I'll provide no numeric evidence 🙃
Training
The LoRA was trained using Takenoko on Windows 10, with 64 GB RAM and an RTX 3090. Training params include:
rank/alpha: 32/32
optimizer: AdamW8bit
learning rate: 5.0e-5
weight decay: 0.01
lr_scheduler: Polynomial
lr_warmup_steps: 200
lr_scheduler_power: 1.0
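For clarity, a polynomial scheduler with power = 1.0 is simply a linear warmup followed by a linear decay. A self-contained PyTorch equivalent (not the trainer's actual code) looks like this:

```python
import torch

def polynomial_lr(optimizer, warmup_steps=200, total_steps=30_000, power=1.0):
    # Linear warmup to the base LR, then polynomial decay toward zero.
    # With power = 1.0 the decay is plain linear.
    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
        return (1.0 - progress) ** power
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```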
For this LoRA I used a variable flow shift strategy: for the first 4900 steps, I trained with discrete_flow_shift = 5, and for the remaining steps (up to 30000), I used discrete_flow_shift = 3.
Here is what I tried to achieve with this.
During training, shift controls which noise levels the model learns to denoise. Adjusting this shift allows training to focus either on structure (at high noise levels) or on fine details (at low noise levels). This visualization helped me understand (or at least believe I understand) this concept.
The idea was to focus early training on learning structure and motion - features best represented in the high-noise segment - by using a higher shift value. This shifts the center of the sigmoid-based timestep sampling distribution during the reverse denoising process toward higher timesteps (i.e., higher noise levels). Then, in the later phase of training, reducing the shift value makes the model focus more on learning finer details, such as facial expressions, fabric textures, and other high-frequency features, which are better learned at lower noise levels.
In other words, a higher shift value pushes the probability of sigmoid-sampled timesteps toward the high-noise region (with "y" values near 1.0, corresponding to early steps in the denoising process). In this segment, the model sees heavily noised inputs and thus learns structural, low-frequency patterns like layout and global motion.
Reducing the shift value concentrates the timestep distribution more toward lower noise levels, increasing the density of training samples in later denoising steps, during which the model encounters cleaner inputs, which helps it focus on high-frequency details like textures, facial features, and microstructure.
(This also explained to me why increasing shift is recommended when using fewer inference steps: with fewer steps, the model must start from a higher noise level to have enough room to form a coherent structure. If the sigmoid is centered too far left, toward low-noise regions, the model won't receive enough high-noise steps to establish consistent structural understanding.)
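To make this concrete, here is a small sketch of the shift formulation as I understand it (the standard SD3/Flux-style mapping that Wan-family trainers use: a sigmoid-sampled timestep warped by the shift value; the exact implementation in Takenoko may differ):

```python
import torch

def sample_timesteps(batch_size, shift):
    # Logit-normal base sample in (0, 1): sigmoid of a standard normal draw.
    u = torch.sigmoid(torch.randn(batch_size))
    # Discrete flow shift: t' = s*u / (1 + (s - 1)*u).
    # For shift > 1 this pushes probability mass toward t ~ 1 (high noise).
    return shift * u / (1.0 + (shift - 1.0) * u)

for s in (3.0, 5.0):
    t = sample_timesteps(1_000_000, s)
    print(f"shift={s}: {(t > 0.8).float().mean().item():.0%} of samples above t=0.8")
# Roughly 39% of samples land above t=0.8 with shift=3, versus roughly 59% with shift=5,
# consistent with the structure-first, details-later intent described above.
```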
I'm not yet sure how well this worked; it needs more experiments. I now think it might have been better to reduce the shift further, down to around 1, at roughly 13k steps to emphasize fine-detail learning even more. One thing I noticed is that the model's responsiveness to camera movement prompts has improved; it now follows them more readily. Although apparently, with the 2.2 release, this is no longer relevant.
Anyway, I trained up to 30K steps. The training flow was somewhat strange because both average loss and validation loss became almost flat starting around 11K steps, but the samples kept getting better and better until somewhere around 22K. After testing on multiple prompts, I stopped at the checkpoint from step 20580.
Coda
Looking back at these 5 months, Wan 2.1 is a great model that has already earned its place in history (just like its talented developers who created this impressive work). But now it's time to move on to 2.2, which seems like a very worthy successor that fixes the main flaw of 2.1 - poor motion coherence. I'm sure we'll have a lot of fun playing with it!🔥
