bigASP 🧪 v2.5
⚠️This is not a normal SDXL model and will not work by default.⚠️
A highly experimental model trained on over 13 MILLION images for 150 million training samples. It is based roughly on the SDXL architecture, but with Flow Matching for improved quality and dynamic range.
⚠️⚠️⚠️ WARNING ⚠️⚠️⚠️
You are entering THE ASP LAB
bigASP v2.5 is purely an experimental model, not meant for general use.
It is inherently difficult to use.
If you wish to persist in your quest to use this model, follow the usage guide below.
Usage

Currently, this model only works in ComfyUI. An example workflow is included in the image above, which you should be able to drop into your ComfyUI workspace to load. If that does not work for some reason, you can build a workflow manually:
- Start with a basic SDXL workflow, but add a ModelSamplingSD3 node onto the model, i.e. Load Checkpoint -> ModelSamplingSD3 -> KSampler.
- Everything else is the usual SDXL workflow: two CLIP Text Encode nodes (one for the positive prompt and one for the negative), an empty latent, a VAE Decode node after the sampler, etc.
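For anyone driving ComfyUI from a script, the manual workflow above can be sketched in ComfyUI's API (JSON) format, where each node is an entry with a `class_type` and `inputs`, and a link is written as `[source_node_id, output_index]`. This is only a sketch: the checkpoint filename, the `build_workflow` helper, the example prompt, and the default CFG of 4.0 are illustrative assumptions, not part of the release. The other defaults are taken from the recommendations further down this page.

```python
import json

def build_workflow(ckpt_name, positive, negative="low quality",
                   width=832, height=1216, shift=1.0, steps=40, cfg=4.0,
                   sampler_name="euler", scheduler="beta", seed=0):
    """Return a ComfyUI API-format graph: node id -> {class_type, inputs}."""
    return {
        # Outputs of the checkpoint loader: 0 = MODEL, 1 = CLIP, 2 = VAE.
        "1": {"class_type": "CheckpointLoaderSimple",
              "inputs": {"ckpt_name": ckpt_name}},
        # The one non-standard step: patch the model before it reaches the sampler.
        "2": {"class_type": "ModelSamplingSD3",
              "inputs": {"model": ["1", 0], "shift": shift}},
        "3": {"class_type": "CLIPTextEncode",
              "inputs": {"clip": ["1", 1], "text": positive}},
        "4": {"class_type": "CLIPTextEncode",
              "inputs": {"clip": ["1", 1], "text": negative}},
        "5": {"class_type": "EmptyLatentImage",
              "inputs": {"width": width, "height": height, "batch_size": 1}},
        # KSampler takes the patched model from node 2, not the raw checkpoint.
        "6": {"class_type": "KSampler",
              "inputs": {"model": ["2", 0], "positive": ["3", 0],
                         "negative": ["4", 0], "latent_image": ["5", 0],
                         "seed": seed, "steps": steps, "cfg": cfg,
                         "sampler_name": sampler_name, "scheduler": scheduler,
                         "denoise": 1.0}},
        "7": {"class_type": "VAEDecode",
              "inputs": {"samples": ["6", 0], "vae": ["1", 2]}},
        "8": {"class_type": "SaveImage",
              "inputs": {"images": ["7", 0], "filename_prefix": "bigasp"}},
    }

# Hypothetical checkpoint filename; use whatever name the file has in your models folder.
workflow = build_workflow("bigasp_v25.safetensors",
                          "A high quality photograph of a lighthouse at dusk")
print(json.dumps(workflow["2"], indent=2))
```

A graph like this can be submitted to a running ComfyUI instance by POSTing `{"prompt": workflow}` to its `/prompt` endpoint. The important detail is that KSampler reads its model from the ModelSamplingSD3 node rather than directly from the checkpoint loader.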
Resolution
Supported resolutions are listed below, sorted roughly from best to worst supported. Any resolutions below or above these are very unlikely to work.
832x1216
1216x832
832x1152
1152x896
896x1152
1344x768
768x1344
1024x1024
1152x832
1280x768
768x1280
896x1088
1344x704
704x1344
704x1472
960x1088
1088x896
1472x704
960x1024
1088x960
1536x640
1024x960
704x1408
1408x704
1600x640
1728x576
1664x576
640x1536
640x1600
576x1664
576x1728
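If you are picking a size programmatically, a small helper can snap an arbitrary aspect ratio to the nearest entry in the list above. This is a convenience sketch, not something shipped with the model: `closest_supported` is a hypothetical helper, and comparing aspect ratios on a log scale is just one reasonable choice.

```python
import math

# Supported (width, height) pairs, in the order listed above (best supported first).
SUPPORTED_RESOLUTIONS = [
    (832, 1216), (1216, 832), (832, 1152), (1152, 896), (896, 1152),
    (1344, 768), (768, 1344), (1024, 1024), (1152, 832), (1280, 768),
    (768, 1280), (896, 1088), (1344, 704), (704, 1344), (704, 1472),
    (960, 1088), (1088, 896), (1472, 704), (960, 1024), (1088, 960),
    (1536, 640), (1024, 960), (704, 1408), (1408, 704), (1600, 640),
    (1728, 576), (1664, 576), (640, 1536), (640, 1600), (576, 1664),
    (576, 1728),
]

def closest_supported(width, height):
    """Return the supported resolution whose aspect ratio is closest to width:height."""
    target = math.log(width / height)
    # min() keeps the first minimum, so ties go to the better-supported entry.
    return min(SUPPORTED_RESOLUTIONS,
               key=lambda wh: abs(math.log(wh[0] / wh[1]) - target))

print(closest_supported(1920, 1080))  # 16:9 request -> (1344, 768)
print(closest_supported(2, 3))        # 2:3 portrait -> (832, 1216)
```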
Sampler Config
First, unlike normal SDXL generation, you now have another parameter: shift.
Shift is a parameter on the ModelSamplingSD3 node, and it bends the noise schedule. Set to 1.0, it does nothing. When set higher than 1.0 it makes the sampler spend more time in the high noise part of the schedule. That makes the sampler spend more effort on the structure of the image and less on the details.
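To make the effect of shift concrete, here is a small sketch of the time-shift remapping commonly used by SD3-style flow matching samplers. I am assuming ModelSamplingSD3 applies this same formula; treat `shift_sigma` as an illustration rather than the node's actual code.

```python
def shift_sigma(t, shift):
    """Remap a noise level t in [0, 1] (1 = pure noise) by the shift factor."""
    return shift * t / (1.0 + (shift - 1.0) * t)

# Five evenly spaced noise levels, from pure noise down to a clean image.
uniform = [1.0, 0.75, 0.5, 0.25, 0.0]

for shift in (1.0, 3.0, 6.0):
    shifted = [round(shift_sigma(t, shift), 3) for t in uniform]
    print(f"shift={shift}: {shifted}")
```

With shift=1.0 the levels come back unchanged. With shift=3.0 the midpoint 0.5 is pushed up to 0.75, and at 6.0 it lands around 0.857, so more of the sampler's steps fall in the high noise region where image structure is decided.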
This model is very sensitive to sampler and scheduler, and it benefits greatly from spending at least a little extra time on the high noise part of the schedule. This is unlike normal SDXL, where most schedules are designed to spend less time there.
I have, so far, found the following setups to work best for me. You should tweak and experiment on your own, but expect many failures.
- Shift=1, Sampler=Euler, Schedule=Beta
- Shift=6, Sampler=Euler, Schedule=Normal
- Shift=3, Sampler=Euler, Schedule=Normal
I have not had much success with samplers other than Euler. UniPC does work, but generally does not perform as well. Most of the others fail or are worse. But my testing is very limited so far. It's possible the other samplers could work, but are misinterpreting this model (since it's a freak).
Beta schedule is the best general purpose option and doesn't need the shift parameter tweaked. Beta schedule forms an "S" in the noise schedule, with the first half spending more time than usual in high noise, and the latter half spending more time in low noise. This provides a good balance between the quality of image structure and the quality of details.
Normal schedule generally requires shift to be greater than 1, with values between 3 and 6 working best. I found no benefit from going higher than 6. This setup results in the sampler spending most of its time on the image structure, which means image details can suffer.
Which setup you use can vary depending on your preferences and the specific generation you're going for. When I say "image structure quality" I mean things like ensuring the overall shape of objects is correct, and placement of objects is correct. When image structure is not properly formed, you'll tend to see lots of mangled messes, extra limbs, etc. If you're doing close-ups, structure is less important and you should be able to tweak your setup so it spends more time on details. If you're doing medium shots or anything further out, structure becomes increasingly important.
CFG and PAG
In my very limited testing I've found values for CFG between 3.0 and 6.0 to work best. As always, CFG is trading off between quality and diversity, so lower CFGs produce a greater variety of images but lower quality, and vice versa. That said, the quality at 2.0 and below tends to be so low as to be unusable.
I highly recommend using a PerturbedAttentionGuidance node, which should be placed after ModelSamplingSD3 and before KSampler. This has a scale parameter which you can adjust. I tend to keep it around 2.0. When using PAG, you'll usually want to decrease CFG. When I have PAG enabled I'll keep my CFG between 2.0 and 5.0.
The exact values for CFG and PAG can vary depending on personal preference and what you're trying to generate. If you're not overly familiar with them, set them to the middle of the recommended ranges and then adjust up and down to get a feel for how they behave in your setup.
PAG can help considerably with image quality and reliability. However, it can also tend to make images punchier and higher in contrast, which you may not want depending on what you're going for. Like many things it's a balancing act, and it can be disabled if it's overcooking your gens.
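For intuition on why CFG usually comes down when PAG is on, here is a toy sketch of how the two guidance terms stack. The CFG line is the standard classifier-free guidance formula. The PAG line follows the form in the Perturbed-Attention Guidance paper as I understand it, and how the ComfyUI node combines things internally is an assumption here. The numbers are made up, and real predictions are tensors rather than short lists.

```python
def guided_prediction(cond, uncond, cfg, perturbed=None, pag_scale=0.0):
    """Combine model predictions using CFG and, optionally, PAG.

    cond      : prediction with the positive prompt
    uncond    : prediction with the negative prompt
    perturbed : positive-prompt prediction with self-attention perturbed (PAG)
    """
    # Classifier-free guidance: push away from uncond, toward cond.
    out = [u + cfg * (c - u) for c, u in zip(cond, uncond)]
    if perturbed is not None:
        # PAG adds a second push, away from the perturbed prediction.
        out = [o + pag_scale * (c - p) for o, c, p in zip(out, cond, perturbed)]
    return out

cond, uncond, perturbed = [1.0, 2.0], [0.0, 1.0], [0.5, 2.0]
print(guided_prediction(cond, uncond, cfg=4.0))
print(guided_prediction(cond, uncond, cfg=4.0, perturbed=perturbed, pag_scale=2.0))
```

Both terms push the result further from the plain conditional prediction, which lines up with the advice above to lower CFG a bit once PAG is enabled.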
Steps
I dunno, 28 - 50? I usually hover around 40, but I'm a weirdo.
Negative Prompt
So far the best negative I've found is simply "low quality". A blank negative works as well, as do more complicated negatives. But "low quality" alone provides a significant boost in generation quality, and other things like "deformed", "lowres", etc. didn't seem to help much for me.
Positive Prompt
I do not have many recommendations here, since I have not played with the model enough to know for sure how best to prompt it. At the very least you should know that the model was trained with the following quality keywords:
- worst quality 
- low quality 
- normal quality 
- high quality 
- best quality 
- masterpiece quality 
These were injected into the tag strings and captions during training. The model generally shouldn't care where in the prompt you put the quality keyword, but closer to the beginning will have the greatest effect. You do not need to include multiple quality keywords, just the one instance should be fine. I also haven't found the need to weight the keyword.
You do not have to include a quality keyword in your prompt; it is totally optional.
I do not recommend using "masterpiece quality" as it causes the model to tend toward producing illustrations/drawings instead of photos. I've found "high quality" to be sufficient for most uses, and I just start most of my prompts with "A high quality photograph of" blah blah.
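If you generate prompts from a script, the pattern above is easy to wrap in a helper. This is a hypothetical convenience function (`build_prompt` and its `medium` parameter are my own naming), not something the model requires.

```python
# The six quality keywords the model was trained with.
QUALITY_KEYWORDS = (
    "worst quality", "low quality", "normal quality",
    "high quality", "best quality", "masterpiece quality",
)

def build_prompt(subject, quality="high quality", medium="photograph"):
    """Prefix a natural-language subject with a single quality keyword.

    Pass quality=None to leave the keyword out, since it is optional.
    """
    if quality is None:
        return subject
    if quality not in QUALITY_KEYWORDS:
        raise ValueError(f"unknown quality keyword: {quality!r}")
    # The keyword goes near the start, where it has the greatest effect.
    return f"A {quality} {medium} of {subject}"

print(build_prompt("a red fox standing in fresh snow"))
# -> A high quality photograph of a red fox standing in fresh snow
```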
The model was trained with a variety of captioning styles, thanks to JoyCaption Beta One, along with tag strings. This should, in theory, enable you to use any prompting style you like. However in my limited testing so far I've found natural language captions to perform best overall, occasionally with tag string puke thrown at the end to tweak things. Your favorite chatbot can help write these, or you can use my custom prompt enhancer/writer (https://huggingface.co/spaces/fancyfeast/llama-bigasp-prompt-enhancer).
If you're prompting for mature subjects, I would advise using the kind of neutral wording that chatbots like to use for various body parts and activities. The model should understand slang, but so far in my attempts it makes the gens a little worse.
The model was trained on a wide variety of images, so concept coverage should be fairly good, but not yet on the level of internet-wide models like Flux.
The Lab (What's Different About v2.5)
This model was trained as a side experiment to prepare myself for v3. It included a grab bag of weird things I wanted to try and learn from.
Compared to v2:
- Captioning - The dataset was captioned using JoyCaption Beta One, instead of the earlier releases of JoyCaption. 
- More Data - From 6M images in v2 to 13M images. 
- Anime - A large chunk of anime/furry/etc type images were included in the dataset. 
- Flow Matching Objective - I took a crowbar to SDXL and then duct taped on flow matching. 
- More Training - From 40M training samples to 150M. 
- Frozen Text Encoders - Both text encoders were kept completely frozen. 
So ... why?
The captioning was just because I had finished Beta One by the time I was doing data prep, so I figured I should swap over. My hope is that the increased performance and variety of Beta One would imbue this model with more flexibility in prompting. However, since the text encoders were frozen I'm not sure there will be any meaningful impact here.
More data is more gooder. Most importantly, I dumped a good chunk of images in that had pre-existing alt text, and were carefully balanced to have as wide a variety of concepts as possible. This was meant to help expand the variety of images and captioning styles the model sees during training.
The inclusion of anime was done for two reasons. One is that it'd be nice to have a unified model, rather than an ecosystem split between photoreal and non-photoreal. The big gorilla models (like GPT-4o) can certainly do both modalities equally, so it's at least possible. The second reason is that I want the photoreal half to inherit concepts from the anime/furry side. The latter tends to have a much larger range of content and concepts. Photoreal datasets are more limited by comparison, and that makes it difficult for models trained on them to be creative.
Flow Matching is the objective used by most modern models, like Flux. It comes with the benefit of higher quality generations, and also a noise schedule that isn't broken. SDXL's noise schedule is broken, which results in a variety of issues, the most pronounced being worse structure generation. That's the main source of SDXL based models tending towards duplicated limbs, melted objects, and "little buddies." It also prevents SDXL from generating images with good dynamic range, or dark images, or light images. Switching to Flow Matching fixes all of that.
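For readers who haven't seen the objective written out, here is a minimal sketch of the rectified-flow style of Flow Matching used by SD3-style models. This page doesn't say which exact variant v2.5 was trained with, so treat this form as an assumption. `flow_matching_pair` is a hypothetical name and the values are toy numbers.

```python
def flow_matching_pair(x0, noise, t):
    """Build one training example at time t in [0, 1].

    x0    : clean latent
    noise : Gaussian noise of the same shape
    Returns the noisy input x_t and the velocity the model is trained to predict.
    """
    # Straight-line interpolation between the clean latent and pure noise.
    x_t = [(1.0 - t) * a + t * n for a, n in zip(x0, noise)]
    # The regression target is the constant velocity along that line.
    velocity = [n - a for a, n in zip(x0, noise)]
    return x_t, velocity

x0, noise = [0.2, -0.5], [1.0, 0.3]
for t in (0.0, 0.5, 1.0):
    print(t, flow_matching_pair(x0, noise, t))
```

At t=1.0 the input is exactly the noise, with nothing of the image left in it. SDXL's original schedule does not reach that point, which is commonly blamed for its trouble with very dark or very bright images.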
More training is more bester. One of the biggest issues with v2 (and v1) is how often it produces failed gens. After much experimenting, I determined that this was down to two main things: SDXL's broken noise schedule and, more importantly, not enough training. Models like PonyXL were trained for much longer than v2. By increasing the training time from 40M samples to 150M samples, v2.5 is now in the ballpark of PonyXL's training.
As for the frozen text encoders, I didn't actually intend that to be a feature of this model. It was basically a result of me trying to deal with training instabilities.
What Worked and What Didn't
The biggest change was Flow Matching. People have transitioned SDXL over to other objectives before, like v-pred, but I don't think anyone has tried switching it to Flow Matching. Well ... it works. And I think this was a success. It's hard to know if Flow Matching helps SDXL's output quality, since the results are confounded by the extra training, but it definitely helped the dynamic range of images, which I'm greatly enjoying. And, as noted above, the repaired noise schedule is likely a big part of why v2.5 has fewer mangled generations compared to v2.
More training almost certainly helped the model as well. Combined, the two have drastically lowered v2.5's failed gen rate.
More data and adding anime/etc data does seem to have expanded the model's concepts and creativity. I'm able to generate a much wider array of artsy and creative, yet still photorealistic, images.
However, v2.5 did not gain any real ability to actually generate non-photoreal content. My attempts to use it for anime-style gens have been wholly unsuccessful. Quite odd.
Freezing the text encoders is both a blessing and a curse. By keeping the text encoders frozen, they retain all of their knowledge and robustness gained from the scale of their original training. That's quite useful and beneficial, especially for experiments like mine that do not have that kind of scale.
The downside is that without tuning them, prompt adherence suffers greatly. v2.5 suffers from lots of confusion like seeing "beads of sweat" and then generating a string of beads instead of sweat.
So it's a trade off. And with the text encoders frozen, v2.5 probably doesn't benefit much from JoyCaption Beta One's improvements.
Training LoRAs and Merging
Frankly, I have no clue if this model can be merged. It was trained for a different objective than stock SDXL, so merging would be weird. Though apparently someone merged this model with DMD2 and it worked? So who knows.
As for using existing LoRAs, the situation is going to be the same: probably not going to work, but who knows for sure.
Training LoRAs is similarly not going to work, due to the different training objective of this model.
I certainly don't like this, but this model was meant to be just an experiment, so support for LoRAs and such was not a priority. v3 will either be based off a model with existing tooling, or I'll release it with tooling if I somehow get forced into going down the custom architecture route again.
Support
If you want to help support dumb experiments like this, JoyCaption, and (hopefully) v3: https://ko-fi.com/fpgaminer




















