Yarat - Yet Another Realistic Anima (Fine-)Tune

What is this?

Personal checkpoint / full-matrix LoKr that adds better photorealistic generation capabilities to Anima and further trains it on high-resolution images.
Basically I had a few goals in mind with this project:

  • make photography generations more stable and "real" looking than what Anima is capable of out of the box

  • increase native high-res (>2 MP) consistency, proportions, and fine-detail coherence

  • do both of these things while leaving the model knowledge as untouched and intact as possible

  • personal learning efforts ¯\_(ツ)_/¯ I've only done simpler LoRAs so far; this is my first larger-scale project

Although it's not at ZIT / Klein level on most dimensions of realism (simply due to model architecture), I'm pretty satisfied with the results at this point and hope someone else can find value in it as well.

If you'd prefer to use Yarat as the LoKr it was originally trained as instead of a full-blown checkpoint, see here.

Usage

tl;dr

Prompt exactly like you would prompt Anima and use @photograph as an artist style. Adding something like "a photograph of..." to NL prompts also strengthens the style and is highly recommended.


Resolutions: up to 4.2 MP (2048x2048, 1664x2432, etc.) work with good consistency
Sampler / scheduler: I've tested most extensively with Euler / Simple
Steps: 30-60+ - the higher the resolution, the more a gen typically benefits from more steps
CFG: 4.5 - same settings as Anima Base recommended

That's pretty much the basic gist of it.
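
For scripted workflows, the tl;dr above can be written down in a few lines. This is only a minimal sketch: `BASE_SETTINGS`, its key names, and `build_prompt` are hypothetical and not part of Anima or any UI; the comma-separated layout is an assumption based on Booru-style tagging.

```python
from typing import Iterable

# Illustrative settings only; key names are hypothetical and not tied to any UI or API.
BASE_SETTINGS = {
    "sampler": "euler",
    "scheduler": "simple",
    "steps": 30,   # lower end of the 30-60+ range
    "cfg": 4.5,    # same as Anima Base
}


def build_prompt(subject: str, tags: Iterable[str] = ()) -> str:
    """Prefix the @photograph artist tag and phrase the subject as 'a photograph of ...'."""
    parts = ["@photograph", *tags, f"a photograph of {subject}"]
    return ", ".join(parts)


print(build_prompt("an old fishing boat at dawn", tags=["outdoors"]))
```

Map the dictionary keys to whatever your frontend or backend actually expects.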

Photorealism and General Prompting Advice

  • @photograph artist tag

    • New "artist" style that does exactly what it says on the tin and was the cornerstone of training photography knowledge into the model.
      Works completely fine on its own, but feel free to use NL captions like "a photograph of..." in conjunction, or some of the adjacent tags covered below. Depending on what you're trying to generate, this can help a lot with anchoring photorealism.

  • high detail quality tag

    • A new quality tag I trained a small subsection of images on, in particular ones that a) were extraordinarily high resolution without glaring visual flaws and/or b) include fine detail (mostly) found in photographs and not illustration content - think skin pores, liquid texture, etc.
      From my anecdotal testing, it also helps to pin down details in more complex compositions (though it's definitely a bit snake oil-y at times). I also tried my best to keep it unbiased, but by nature of its purpose it almost certainly leans slightly towards elements more typically seen in photographs than in illustrations.

Some training quirks worth noting:

  • Style / Mood descriptions in natural language captions

    • Captions often contained a short list of keywords describing the image's overall mood and style, using these two prefixes. It's not a hard requirement by any means, but it probably helps the model "get" what you're going for. Example: Mood: casual, relaxed. Style: amateur photography, smartphone photo.

  • pre-existing realism / photography tags

    • Some fine-tuning has also been done on the photo (medium), cosplay photo and real life tags, as well as certain combinations of photorealistic and realistic and/or 3d, all in combination with @photograph.
      The idea here is to strengthen the model's idea of @photograph as just another artist's style that also happens to be photorealistic, 3d, uses photo (medium) for their images, etc. - just like it works with any other artist tag. It also improved a bit on what I would expect photorealistic to look like, which Base kind of sucks at IMO. Once again, @photograph works perfectly fine on its own, but feel free to try out combinations.

Aside from these cliff notes, use captions, workflows, etc. just like with Anima. This fine-tune was trained with heavy shuffling across all kinds of combinations of tags-only, NL-only, mixtures, short and long prompts, and everything in between, to keep behaviour as similar to Anima as possible while also trying to avoid any kind of bias towards specific phrasings.

If you are trying to prompt for different styles of photography, use NL; there are no special tags for "candid photo", "amateur photo", "DSLR", "8k", etc.
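
As an illustration of the Mood / Style convention described above, the following sketch appends the two prefixed keyword lists to an NL caption. `with_mood_and_style` is a hypothetical helper, and placing the keywords at the end of the caption is an assumption; the source only states that captions contained them.

```python
from typing import Sequence


def with_mood_and_style(caption: str, mood: Sequence[str] = (), style: Sequence[str] = ()) -> str:
    """Append 'Mood: ...' and 'Style: ...' keyword lists to a natural-language caption."""
    parts = [caption.strip()]
    if mood:
        parts.append("Mood: " + ", ".join(mood) + ".")
    if style:
        parts.append("Style: " + ", ".join(style) + ".")
    return " ".join(parts)


print(with_mood_and_style(
    "A photograph of two friends waiting at a bus stop.",
    mood=["casual", "relaxed"],
    style=["amateur photography", "smartphone photo"],
))
```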

A small note on negative prompts: while nothing new was trained into the model for them, I did make extensive use of tags like jpeg artifacts, cropped, upscaled, adversarial noise, etc. to describe image imperfections. My own negative prompt usually includes these as well as some other quality-indicating Booru tags, which is also my personal recommendation for prompting.
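
A negative prompt along these lines can be assembled as follows. This is a sketch under assumptions: `build_negative` and its `wanted` parameter are hypothetical, and only the four imperfection tags come from the text above. The `wanted` argument drops tags you are deliberately prompting for in the positive prompt.

```python
from typing import Iterable

# Imperfection tags named in the note above.
IMPERFECTION_TAGS = ["jpeg artifacts", "cropped", "upscaled", "adversarial noise"]


def build_negative(extra: Iterable[str] = (), wanted: Iterable[str] = ()) -> str:
    """Join imperfection tags plus extras, skipping duplicates and anything in `wanted`."""
    wanted_keys = {tag.strip().lower() for tag in wanted}
    seen = set()
    result = []
    for tag in [*IMPERFECTION_TAGS, *extra]:
        key = tag.strip().lower()
        if key and key not in seen and key not in wanted_keys:
            seen.add(key)
            result.append(tag.strip())
    return ", ".join(result)


print(build_negative(extra=["lowres"]))
```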

Native High-Res Generation

Training was mostly done at three resolutions up to roughly 2 MP (1536x1536 pixels), as well as a curated second fine-detail pass up to 4.2 MP (2048x2048 pixels or, more practically speaking, sizes like 1664x2432 / 2432x1664). In my own testing, resolutions up to 4.2 MP work a bit more consistently than on base Anima, even for non-photographic images. Resolutions beyond that can also work out of the box, albeit much less consistently.
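
To check where a given resolution sits relative to these figures, the pixel counts are quick to compute. A small sketch, using decimal megapixels (1 MP = 1,000,000 pixels); `megapixels` is just an illustrative helper name:

```python
def megapixels(width: int, height: int) -> float:
    """Pixel count in decimal megapixels (1 MP = 1,000,000 pixels)."""
    return width * height / 1_000_000


for width, height in [(1536, 1536), (2048, 2048), (1664, 2432), (2432, 1664)]:
    print(f"{width}x{height}: {megapixels(width, height):.2f} MP")
# 1536x1536: 2.36 MP
# 2048x2048: 4.19 MP
# 1664x2432: 4.05 MP
# 2432x1664: 4.05 MP
```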

For fine detail in photography and clean anatomy (hands, etc.), I also strongly recommend aiming decently above 1 MP for production-targeted gens. Another important knob here is steps: you do want to use significantly more than 30 steps at higher resolutions to let the model clean up artefacts, especially in the >=4.2 MP range. My personal recommendation is to iterate at low steps to find a seed and composition you like, then do a proper longer pass at many steps.
How many steps? Personally I've used up to 120 so far and still seen improvements, especially to details, but usually something around 60 to 90 tends to yield mostly converged results.
CFG behaves similarly to Anima, but I suggest also trying higher values at higher resolutions, depending on overall composition.
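
The "iterate cheap, then finish with more steps" workflow could be encoded like this. Treat it as a rough sketch: `suggest_steps` is a hypothetical helper, and the megapixel thresholds and the exact step counts per tier are my own assumptions chosen to fall inside the ranges mentioned above, not tested recommendations.

```python
def suggest_steps(width: int, height: int, final_pass: bool) -> int:
    """Pick a step count: low for seed hunting, higher for the final pass at high resolution."""
    if not final_pass:
        return 30  # quick drafts while searching for a seed and composition
    mp = width * height / 1_000_000
    if mp >= 4.0:   # assumed threshold for the ~4.2 MP range
        return 90
    if mp >= 2.0:   # assumed threshold for mid-range resolutions
        return 60
    return 40       # assumed value for smaller images


# Draft first, then rerun the chosen seed with more steps.
print(suggest_steps(2048, 2048, final_pass=False))
print(suggest_steps(2048, 2048, final_pass=True))
```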

Other differences vs Anima Base

Base Knowledge

Character and concept knowledge from the base model should be almost fully intact; at least I have not encountered any cases of forgetting so far. The model even translates anime-only concepts pretty damn well in most cases, though characters tend to have a cosplay-like look by default. Text was explicitly part of the captions and shouldn't have broken either. Of course nothing is perfect, and there are some small caveats and likely some more things I simply didn't catch; see the Caveats section at the end.
"Prompt just like Anima" also means: use phrasing similar to that of the corresponding Booru tags, prompt characters directly (e.g. Frieren instead of a woman cosplaying as Frieren - unless you explicitly want a different character to cosplay as Frieren, of course), and so on. Don't try to prompt this like base SD 1.5 or similar; it'll likely look like garbage compared with a proper prompt. Read, or at least skim through, CircleStone Labs' prompting guide.

Meta Tags

Quality tags, resolution tags, Pony scores, safety tags, etc. were part of the trained captions, can be freely used, and should not bias the model towards a drawn style or whatever. The only exception is image age tags, which weren't trained at all, including in the regularisation data - see the Caveats section at the end for why.

Keep in mind what these quality tags do and when to avoid them. lowres, for example, is going to be a positive prompt tag if you want to go for e.g. grainy 2010s smartphone-style images.
Different styles of photography work okay just by prompting them in NL (as well as the limited things covered by Booru tags, of course), though you'll usually have to push the model a bit to get these to work without sacrificing things like overall composition quality.

The non-Pony quality tags have likely changed somewhat in meaning vs the base model, as the exact process by which Anima was trained on these is, to my knowledge, not published, so I had to guess a bit.
The tl;dr of the pretty long-winded process by which I determined these ratings for my data set: a simple predictive model trained against Booru scores vs a load of image quality metrics and aesthetic classifiers from the community, normalised for the most egregious bias factors like safety ratings (surprise, certain safety ratings have on average a lot more votes than others) and tags.
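
To illustrate just the normalisation idea (not the actual pipeline, which isn't detailed here), one simple option is to z-score Booru scores within each safety rating so that ratings with more votes on average don't dominate. Everything in this sketch is an assumption: the function name, the `rating` / `score` field names, and the choice of z-scores.

```python
from collections import defaultdict
from statistics import mean, pstdev


def normalise_by_rating(posts: list) -> list:
    """Return per-post z-scores of 'score', computed separately within each 'rating' group."""
    groups = defaultdict(list)
    for post in posts:
        groups[post["rating"]].append(post["score"])
    stats = {rating: (mean(scores), pstdev(scores)) for rating, scores in groups.items()}
    normalised = []
    for post in posts:
        mu, sigma = stats[post["rating"]]
        # A group with identical scores has zero spread; treat those posts as average.
        normalised.append(0.0 if sigma == 0 else (post["score"] - mu) / sigma)
    return normalised


# Hypothetical toy data: the "e" group has far higher raw scores than the "g" group.
posts = [
    {"rating": "g", "score": 10},
    {"rating": "g", "score": 20},
    {"rating": "e", "score": 100},
    {"rating": "e", "score": 200},
]
print(normalise_by_rating(posts))
```

After normalisation, the top post of each group ends up with the same value even though their raw scores differ by an order of magnitude.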

Caveats

  1. I completely left out image age tags during training, so there's likely a bias introduced if you e.g. want to prompt Araki's art style specifically from JoJo Part 3 or something like that. To cut a long explanation short: a) I did not have the corresponding data available for my data set, b) they are not as intuitive to apply to photography as to an artist's art style, and c) I'm not a big fan of these tags anyway, as they overlap somewhat with existing Booru tags (e.g. old).

  2. You will sometimes get cosplay-esque / parody-like looks for characters. Putting cosplay in the negative prompt, or prompting a character's look more specifically, mitigates this very well.

  3. Character proportions at higher resolutions can look a bit stretched or anime-like, depending on your Gacha luck when picking a seed. I encounter this myself very rarely, so I don't consider it a huge issue.

  4. Anthropomorphic characters are... interesting. By default, the model tends to have a human-only bias. What I mean by that: if you just prompt for bowser, you might get a guy who is essentially a human reimagining of Bowser instead of a realistic depiction of the King of Koopas. What usually works is to include tags like furry, reptile boy, or whatever fits the description and is typically used on the Boorus in conjunction with the character. Keep in mind, in general, that Anima isn't trained on e621 and is probably still weak for these kinds of characters even in base, especially beyond super popular ones.

  5. Some anime concepts simply look silly or outright uncanny when depicted in real life. The model tends to translate this very well from my own testing, but I haven't checked every single Booru tag there is of course.

  6. Prompting specific styles of photography can be difficult. If you're looking for the best replication of polaroid photography, this might not be the best model at it. The model does understand many of these concepts (not least because Cosmos2, the base model of Anima, probably does) and they were almost always also captioned, but there was no specific effort taken by me to balance that out beyond what I've described above.

  7. I have not tested compatibility with other LoRAs at all. I would assume they'll work well since the fine-tune doesn't really touch any Anima concepts, but no guarantees. If you have any feedback on this, please let me know and I can gladly add it here as well.

License

This model inherits the CircleStone Labs Non-Commercial License from Anima.

Credits

  • tdrussell and CircleStone Labs for creating the amazing Anima model

  • the Laxhar and OneTrainer Discord servers, which are amazing places to gather knowledge

  • anons at /ldg/

  • and of course the local gen community at large
