all the way through x-ray (anima)
Model description
The model is hard to train and generalizes poorly because it is overly influenced by a handful of images. It fails to connect different objects with one another, which makes it difficult to combine elements or shift perspectives.
As a result, I continue to look for ways to synthesize data and control LoRA convergence, while figuring out how to connect different combinations. Without this, the model won't truly understand word meanings, and its ability to generalize will be terrible.
Training the concept of 'all the way through x-ray' is highly challenging.
It requires finding a way to train numerous combinations and getting the model to link them together. Both SDXL and the current Anima fail to grasp the concept properly, whether during training or when trying to generalize.
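One rough way to see which combinations a dataset still lacks is to count tag co-occurrences in the captions. The sketch below assumes captions are comma-separated tag strings; the helper name `missing_combinations`, the tag names, and the attribute groups are all hypothetical illustrations, not part of any training tool or of the actual dataset.

```python
from collections import Counter
from itertools import product


def missing_combinations(captions, concept, attribute_groups):
    """Return attribute combinations that never co-occur with `concept`.

    captions: list of comma-separated tag strings (assumed format).
    attribute_groups: dict mapping a group name to a list of tags;
        one tag is taken from each group to form a combination.
    """
    all_combos = list(product(*attribute_groups.values()))
    seen = Counter()
    for caption in captions:
        tags = {t.strip() for t in caption.split(",")}
        if concept not in tags:
            continue  # only captions containing the concept count
        for combo in all_combos:
            if set(combo) <= tags:
                seen[combo] += 1
    return [combo for combo in all_combos if seen[combo] == 0]


# Hypothetical example data
captions = [
    "x-ray, from side, 1girl",
    "x-ray, from behind, 1boy",
    "from side, 1boy",  # ignored: concept tag is absent
]
groups = {
    "view": ["from side", "from behind"],
    "subject": ["1girl", "1boy"],
}
gaps = missing_combinations(captions, "x-ray", groups)
print(gaps)
```

The combinations reported as missing are candidates for synthesis. Coverage alone does not guarantee the model will link them, but it does show where the data gives it no chance to.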
The usage can be quite unstable, and the output quality isn't ideal.
Even though I’ve done my best to eliminate poor samples and have repeatedly fixed the synthesis, the model still struggles badly to understand deeply complex, combined concepts.
The reason there are so many prompt words is that they were meant to help adjust the outputs, but that failed because the model just can't understand combinations of these words...
Please, everyone, put effort into synthesizing enough high-quality data and run more experiments to make this concept truly shine, in the hope that it can become part of a base model one day.
As for me, I will keep diving deeper into synthetic data and iterations. But it’s going to take a lot longer, and I'll only try to update with a new version when there’s a noticeable improvement.
_________
I strongly suspect that Anima's architecture leads to significantly poorer compositional generalization for novel concepts.
It relies almost purely on memorization, and the concepts it learns are prone to being mutually exclusive and easily overwhelm one another.
Right now, Anima's strong performance seems to be purely a result of the staggeringly large and robust training data of Cosmos itself.
Fine-tuning subsequently inherits this existing knowledge base, which leads to good results.
However, if I add completely out-of-distribution (OOD) novel data, the results could be much worse.
On the flip side, because its foundation is sufficiently well-trained, robust, and powerful, poorer-quality training data does not easily spread through the model and degrade it.
That kind of degradation is much more severe in z-image and other models; however, their ability to connect concepts is undoubtedly much better.
Anima's alignment and connection capabilities seem significantly weaker, and the influence of text conditioning is not strong enough.
Without a sufficiently massive volume of training data, this architecture would perform much worse.
Although I will continue researching to improve it, it remains an uphill battle.
The anime performance of z-image and Flux2 is too weak.
Fortunately, Cosmos 3 has now been released.
Its architecture is undoubtedly a major improvement over, and a unification of, current single-stream and dual-stream architectures, and its comprehension and physics capabilities appear to be vastly superior.
I look forward to using the future 4B model as a foundation to build a strong Anima 2, which I believe should be far more capable than what we have now.
The current state is quite different: the generation quality is far below that of the iteratively improved training data, so outputs look like a forced patchwork that lacks true understanding.






