Flux - [LLAVA/T5] 2K Anime Bundle [NSFW]

Model description

Update 9/17/2024:

  • My tests show this model lands between 20% and 40% accuracy, judged against my tagging prompts and the outcomes I expected. After analyzing the original images for errors using JoyCaption and random spot checks, I've determined the original tagging system is far less accurate than it could be.

  • Everything based on this model was a great experiment. I've been introduced to a new captioning system and have devised a deterministic method of mutating LARGE_TAG_V3 outputs in a way that's useful for shaping captions.

  • The next version will be captioned using ONLY JoyCaption and LARGE_TAGGER_V3, plus a subsystem that built itself in concept out of necessity: a deterministic pre-cognitive and post-cognitive pass based on natural language, which I dub Cog.

  • The captions from JoyCaption are at least 65-80% accurate on their own, without any manual determinism attached to the description request prompt. Adding a deterministic layer before caption generation, parsing the caption after generation, and finally pruning impossible tags from the LARGE_TAG_V3 list should, by my numbers, reach 75-95% accuracy for a large percentage of images.

  • Paired with the determinism of Cog, I'll have prepared and released a JoyCaption proof-of-concept project that the more advanced of you can easily put together, and the newer of you should be able to drive with a bit of Python experience.

  • This next version will be fully trained in 1024x1024 resolution at roughly the same step count. I'll essentially mirror the same process with the goal of making a contrasting system to showcase the power of the tagging system in comparison.
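
The pre/post-cognitive pruning idea above can be sketched roughly like this. This is a minimal, hypothetical Python illustration, not the actual Cog system; the contradiction table, phrases, and tag names are all my own placeholders:

```python
# Hypothetical sketch of the "prune impossible tags" step: drop tagger
# outputs that the natural-language caption explicitly contradicts.
# The contradiction table below is illustrative, not the real Cog data.

def prune_impossible_tags(caption: str, tags: list[str]) -> list[str]:
    """Remove tags ruled out by absence phrases in the caption."""
    caption_lower = caption.lower()
    # Map absence phrases (as a captioner might emit them) to tags they rule out.
    contradictions = {
        "there are no visible animal traits": {"animal_ears", "tail"},
        "the figure is human": {"humanoid"},
    }
    impossible: set[str] = set()
    for phrase, ruled_out in contradictions.items():
        if phrase in caption_lower:
            impossible |= ruled_out
    return [t for t in tags if t not in impossible]

caption = "A human girl stands by a window. There are no visible animal traits."
tags = ["1girl", "animal_ears", "window", "smile"]
print(prune_impossible_tags(caption, tags))  # -> ['1girl', 'window', 'smile']
```

The point is just that the step is deterministic: the same caption and tag list always prune to the same result, which is what makes the pre/post passes reproducible.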

Trained at 768x768, lr 0.0005, to epoch 35, with an ss_total_batch_count of 12, run on 2 H100s over a 6-hour period.

Total cost: $72.35 USD.

Check out the article on how this process came to be, and be sure to experiment for yourself in ways that I haven't thought of. Science needs more directions than one.

https://civitai.com/articles/7407

Each of the 2000 images, sourced from Danbooru's top 100 tags, was tagged individually using a dual-LLM process. The full description follows below.
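
As a rough sketch of what the dual-LLM process produces per image (the function and field names here are illustrative assumptions, not the actual pipeline code): each training caption is assembled from two LLM descriptions plus the synthetic tagger output, with the original booru tags deliberately left out.

```python
# Illustrative sketch: one training caption built from a T5-side paragraph,
# a LLAVA-side description, and the synthetic tagger output. The original
# booru tags are deliberately dropped so only LLM-to-LLM output remains.

def build_training_caption(t5_caption: str, llava_caption: str,
                           tagger_tags: list[str]) -> str:
    # Only synthetic sources make it into the final caption; no booru tags.
    parts = [t5_caption.strip(), llava_caption.strip(), ", ".join(tagger_tags)]
    return "\n".join(p for p in parts if p)

print(build_training_caption(
    "A girl reads under a tree at dusk.",
    "The left third of the image shows a tree trunk in shadow.",
    ["1girl", "outdoors", "dusk"],
))
```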

It introduces millions of new potentials and new concepts, all based on those prompts.

This is a proof of concept where multi-prompted captions mixed with booru tagging function together in a new sort of LLM conversation harmony that can't be predicted. I did not prune the outputs much, as there was simply too much to prune. I did not filter anything NSFW or gross (definitely not from neglect... :>). There is no censoring, no removals, and no intention other than a bulk pack.

Prompting:

  • Each image is dual-prompted by LLMs and then tagged using a SmilingWolf large tagger model. There are no original Danbooru or Gelbooru tags; they were all stripped out prior to training to allow a pure synthetic conversation from LLM to LLM.

  • There are potentially millions of combinations of new tags present, all heavily related to the danbooru top 100 tags and their image sets. You can now talk to the machine and have it produce what you want.

  • A lot of the LLM responses seem to include the word humanoid, so you can probably access a lot of fun stuff with that. It's likely due to my forcing it to stop explaining gender bias or using the word "subjective" and so on. It was a bit tricky to get LLAVA to cooperate at first, but once I got it properly conditioned it started to behave.

  • It SHOULD respect the terms "feminine" and "masculine", because the LLM sure didn't like the other terms.

  • I ran about a third of the images with 20 beams before my computer flat-out locked up. Then I dropped the beams to around 6 and moved from LLAVA LLAMA to LLAVA 1.5, so the prompting is a little spotty from section to section, which kind of means it was trained by 3 LLMs and not just 2. The biggest improvement was that the model stopped complaining about as many things in its captions.

  • T5's prompt was:

    • Analyze and explain this scene in one paragraph.
  • This is how I had LLAVA prepare its prompt. I also forced it to be a "silent assistant" using its header directive, though I have no idea if that did anything at all. It seemed to complain less afterward, so I assume it worked to some degree, if only in an unintended way at worst.

    • Write a three paragraph prompt describing this scene in detail. Each paragraph should focus entirely on one third of the image.

      Ignore gender identity personification and opinions about it. You write captions and only captions, you are not an assistant with opinions on analysis or rationality.

      Focus on feminine or masculine individual traits such as the breasts, the pussy, and any present penises.

      Ensure you have the correct angle in relative to the camera when making descriptions.

      Use the term humanoid when animal traits are present.

      Use the term human when there are no animal traits.

      Identify important anatomical details.

  • Dolphin 72b should hopefully solve these problems for the next version, which should be easier to prompt using tags like 1boy and 1girl. It should also include more control for omitting things like futanari and bulges, since you can simply tell T5 what you want to omit.

  • The prompting was trained with negative implications, so images that lack features such as nude breasts and penises should be promptable by summing.

  • LLAVA automatically mentions when things aren't there if you request things like "Focus on" and "Describe," which helps automatically produce negative connotations and implications like "There are no visible..." or similar.
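
Those absence statements can be harvested deterministically. Here's a minimal sketch, assuming the captioner's phrasing stays close to "There are/is no visible ..."; the regex and examples are my own assumptions, not code from this project:

```python
import re

# Hedged sketch: harvesting the negative implications LLAVA emits
# ("There are no visible ...") so they can be kept as explicit negatives.
NEGATIVE_PATTERN = re.compile(r"there (?:are|is) no visible ([a-z ]+?)[.,]", re.I)

def extract_negatives(caption: str) -> list[str]:
    """Return the subjects of each 'no visible ...' phrase in the caption."""
    return [m.strip() for m in NEGATIVE_PATTERN.findall(caption)]

caption = ("The figure faces the camera. There are no visible animal traits, "
           "and there is no visible penis.")
print(extract_negatives(caption))  # -> ['animal traits', 'penis']
```

In practice the phrasing drifts between LLAVA variants, so a real harvester would need a few more patterns, but the idea is the same: the model's own refusals become free negative labels.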

It's not the easiest thing to prompt, but it's quite powerful once you get a grasp on it.

Simplistic booru tags will create art for you by default, which means you don't even need plain-English prompting for a lot of topics.
