PixArt-Sigma-1024px_512px-animetune

512px_0.7 1024px_v04 1024px_v0.31 1024px_v0.2 1024px_v0.1 1024px_v0.0 512px_0.6 512px_v0.5 512px_v0.4 512px_v0.3 512px_v0.2 512px_v0.1 512px_v0.0

About this version

Pruned Model fp16 (1.15 GB): inference model, 200 epochs.

Pruned Model bf16 (9.13 GB): Diffusers model for fine-tuning, plus OneTrainer config data.

Training Data (86.05 KB): ComfyUI workflow.

● I trained with a dataset of 400,000 images.

I feel like the stability has improved compared to before.

Also, please note that aside from the first two images, the sample images this time were generated using SD1.5 with i2i.

Lately, I’ve been enjoying the fusion of Pixart’s composition and SD1.5’s style — it’s a lot of fun.

I also created an SD1.5 merge model for i2i, so feel free to try it if you’re interested.

The following is the same description as before.

I’ve also uploaded a few workflows for reference.

The sample images have embedded workflows viewable in ComfyUI, but recently they’ve been converted to JPG to save space, so some may not load. Installing the extension below will allow you to check them.

https://github.com/Goktug/comfyui-saveimage-plus

There are workflows for automatic prompt generation using TIPO, experimental workflows for general quality prompts, and simplified workflows. There's no single correct approach to inference, so experimenting with different methods can be interesting.

While tags have many constraints, natural language allows more freedom for instructions. It might be worth exploring your ideal quality prompts.

The 1024px model might ultimately provide superior inference, but reaching that point involves many failures and takes too much time. Continuously generating 1024px images without knowing the outcome is frustrating. This model is designed to support that process.

This model has several potential uses:

● Since it has fast inference speed and tag compatibility, it can be used for prompt testing before running inference on a 1024px model.

● Use the 512px model to create a good composition, then upscale with the 1024px model.

● Merge the 512px model with the 1024px model to generate around 768px, balancing speed and detail. (Practicality is uncertain, as it may not work reliably.)

The standard size for this model is 512px

An aspect ratio like 512x768, as with SD1.5, is suitable.

However, with a long side of 768px, it slightly breaks down. If stability is important, it’s better to base it on 512x512px and adjust the aspect ratio, such as 384x640.

The model has not been trained at 768px or 1024px, so results at those sizes will be disastrous.
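The sizing advice above (keep roughly the pixel count of 512x512 and change only the aspect ratio) can be sketched as a small helper. This is only an illustration: the function name, the rounding to multiples of 64, and the default base size are assumptions made for the example, not settings taken from the model or from ComfyUI.

```python
import math


def resolution_for_aspect(ratio_w: int, ratio_h: int,
                          base: int = 512, multiple: int = 64) -> tuple[int, int]:
    """Return (width, height) with roughly base*base pixels at the given ratio.

    Snapping to a multiple of 64 is an assumption made for this sketch.
    """
    ratio = ratio_w / ratio_h
    width = base * math.sqrt(ratio)    # width * height stays close to base**2
    height = base / math.sqrt(ratio)

    def snap(value: float) -> int:
        # Round to the nearest multiple, never below one multiple.
        return max(multiple, round(value / multiple) * multiple)

    return snap(width), snap(height)


if __name__ == "__main__":
    for rw, rh in [(1, 1), (3, 5), (5, 3)]:
        print(f"{rw}:{rh} -> {resolution_for_aspect(rw, rh)}")
```

Under these assumptions a 3:5 ratio comes out as 384x640, the example size mentioned above.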

The base model is very high quality even at 512px!

Usually, models in the middle of pre-training or lite versions lack sufficient learning or aesthetic appeal, but this base model is different. It is the most aesthetically pleasing I have seen so far.

●If you can't come up with a prompt, try using the prompt auto-generation below.

https://huggingface.co/spaces/KBlueLeaf/TIPO-DEMO

●Also, the model has not been trained on quality tags or negative prompts.

It has not been trained on images with potentially harmful effects, like sketches or monochrome images.

However, all 400,000 images are of high quality, so there is a possibility that any tag could improve quality. The more tags, the better.

●There are likely tags that can be used as negative prompts.

Tags related to styles, like the ones listed below, can be included in negative prompts to switch to a different style.

1990s (style), 00s, 10s, simple background, anime screencap, realistic, figure, etc.

●If you find it troublesome to come up with prompts that produce stable quality, using prompts like the ones below might help stabilize the output. Ironically, tags like these end up becoming quality tags. lol

" nikke, azur lane, blue archive, kancolle, virtual youtuber, arknights, girls' frontline"

Model description

4/7 1024px model update! 1024px_v0.4 Please check the details in the 1024px_v0.4 tab.

Compared to the 512px model, it's less stable and more prone to artifacts, but it can offer more compositional freedom. While the newer version has learned more concepts, v0.2 or earlier may be better for aesthetic results.

3/5 512px model update! 512px_v0.7 Please check the details in the 512px_v0.7 tab.

Personally, I recommend the 512px model. The 512px model has learned significantly more concepts. I like the workflow of using the 512px model for trial-and-error inference to generate good images, then either upscaling them via i2i with the 1024px model or SD1.5, or trying the same prompt with the 1024px model.

2/11 1024px & 512px workflow update! I have also added the TIPO workflow and SD1.5 i2i. TIPO reduces the effort of crafting prompts and makes it easy to generate images, so I highly recommend it. The SD1.5 i2i workflow is useful for improving details and changing styles. There's joy in choosing a model. It leverages the strengths of both Pixart and SD1.5. The "TinyBreaker" in Suggested Resources is a perfect example, further refined by exploring its potential. Be sure to check it out as well.

I also experimentally merged an SD1.5 model for i2i, so feel free to check it out if you're interested.

/model/1246353

A method to combine PixArt with SDXL has also been discovered.

https://github.com/kantsche/ComfyUI-MixMod

/model/1565538/a-pile-of-junk-mixmod-workflow

■This is an experimental fine tuning.

Attention: this fine-tuned model is very difficult to use!

The quality is not good!! Don't expect too much!

If you are interested in PixArt-Sigma for the first time, we recommend that you check out the workflow that allows you to infer the original model... Even if my model is not great, try using other people's amazing fine-tuning models!

I think the "Comfy Sigma Portable" can be used even by those who have never used ComfyUI before. There's no need for a difficult installation. Just download and try it out!

Merging can be done with ComfyUI. The "Tool to easily merge models" is also simple and good.

●Forge also has the following extensions available. Inference is also possible with SDNext.

It's not the smartest solution, but I've prepared a guide on using fine-tuned models in Forge. Feel free to use it as a reference. 2/16: With a recent update, my model can now be added and used for inference. I appreciate the developer for creating such a highly functional and user-friendly extension.

https://github.com/DenOfEquity/PixArt-Sigma-for-webUI

https://civitai.com/articles/11612

The 'anime sigma1024px' in Suggested Resources is a flexible and aesthetically pleasing anime model. Give it a try.

I would be happy if you became interested in Pixart, even a little. Pixart has potential.

My hope is for more people to discover basemodels with potential and to see their possibilities grow even further. I would be happy if I could help make that happen.

PixArt-Sigma is simple, highly lightweight, and capable of training with 300 tokens. Few models meet these conditions, making it a rare model with minimal training limitations. Since its hardware requirements are nearly the same as SD1.5, anyone can participate in training, and even individuals can conduct large-scale experiments with minimal burden. You can benefit from 300 tokens even during inference, and the small model size makes merge experiments easier. This is like an SD1.5 model with support for 1024px, DiT, T5, SDXL VAE, and improved contrast handling. I was looking for a model like this, and PixArt met that standard.

■I trained using onetrainer.

Fine-tuning is performed on a dataset of 70,000 or 400,000 images (no use AI image) that mainly contains anime images, but also some realistic and AI images. All training uses booru tags. The training resolution is 512px or 1024px. Pixart is high quality but has low requirements, making it suitable for training. 12GB of VRAM is enough. Detailed information about the training is written at the bottom of the page, so please refer to it. I have also uploaded the OneTrainer configuration data.

■Please be careful as sexual images are also generated.

■Here are my recent favorite inference settings. This will be updated as needed.

This is not the optimal solution. Please try various things!

Both booru tags and natural language are available for use.

●Using SD1.5 i2i could be a good idea. This approach frees Pixart from its limitations.

Pixart has good compositional strength, but details like hands can often be challenging. Combining it with SD1.5 through i2i improves the details, allowing you to benefit from the strengths of both models.

Additionally, by switching the SD1.5 model, you can flexibly shift to any style—realistic, 2.5D, or anime. If you have the resources, combining it with SDXL is also an excellent option.

●The sample images have embedded workflows viewable in ComfyUI, but recently they’ve been converted to JPG to save space, so some may not load. Installing the extension below will allow you to check them.

https://github.com/Goktug/comfyui-saveimage-plus

●Sampler: "SDE" with cfg 2.5-6 and 12-20 steps, or "Euler cfg_pp" / "Euler A cfg_pp" with cfg 1.5-2.5 and 30-50 steps

Scheduler: "GITS" or "simple"

●Euler, Euler_CFG_PP, DEIS: Sharp with excellent composition, enjoying the aesthetics of collapse.

Euler_A: The most stable, ideal for poses and unique concepts, but less surprising.

DPM++_SDE: A middle ground—dynamic yet stable.

●GITS provides rich textures, Simple ensures stable generation quality, SDE stays true to the dataset, Euler is sharp, and Euler A offers stability.

I generally prefer GITS + "Euler," "Euler cfg_pp," or "SDE."

"GITS + Euler" or "Euler cfg_pp" is very sharp.

"GITS + SDE" is dynamic.

"simple + Euler A or SDE" feels stable and seems to improve fidelity, though it may have high contrast.

●GITS can produce amazing detail, but it sometimes seems prone to breakdowns or not following prompts. I prefer it when I want to focus on atmosphere using natural language. Simple, on the other hand, is stable and follows prompts well, making it more suited for character work.

●Resolutions slightly outside of 512x512 and 1024x1024 are acceptable. Resolutions like 512x768 or 1024x1536 may have minor issues but remain practical. For more stability, it’s best to stick to resolutions like 832x1216 that are closer to standard.

I prefer larger resolutions over stability, so I tend to choose non-standard resolutions.

●If you can't come up with a prompt, try using the prompt auto-generation below.

https://huggingface.co/spaces/KBlueLeaf/TIPO-DEMO

Command R+ does not censor or reject prompts, making it ideal for explicit natural language prompts. You can try it for free by creating an account on the official website.

●If a certain tag's effect is too strong, try lowering its weight or increasing the weight of other tags. It may not be non-functional but rather overly dominant, and this can help resolve the issue.

Be cautious with unique tags for characters, as they can be very dominant.

Character tags might even alter the style, so depending on the situation, placing character tags at the end and supplementing the character's traits with general tags like "1girl, green hair, school uniform" may provide more flexibility.
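The two tips above (lower the weight of a dominant tag, and put character tags last) can be sketched as a small prompt builder. The `(tag:weight)` notation is the common webUI/ComfyUI convention; whether the text-encoder node used for PixArt honors it is an assumption here, and `build_prompt` and the placeholder tag "example character" are hypothetical names for illustration only.

```python
def build_prompt(general_tags, character_tags=(), weights=None):
    """Join tags with general tags first and character tags last.

    `weights` maps a tag to a weight; weighted tags are written as
    "(tag:weight)". This notation is assumed, not confirmed for PixArt.
    """
    weights = weights or {}

    def fmt(tag):
        w = weights.get(tag)
        # Leave unweighted tags (or weight 1.0) untouched.
        return tag if w is None or w == 1.0 else f"({tag}:{w:g})"

    ordered = list(general_tags) + list(character_tags)
    return ", ".join(fmt(t) for t in ordered)


prompt = build_prompt(
    ["1girl", "green hair", "school uniform"],
    character_tags=["example character"],  # placeholder, not a real tag
    weights={"example character": 0.7},
)
print(prompt)
```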

●Negative prompts are not trained. Please try various prompts!

As described in the dataset contents on the page below, if you don't like realistic textures, you might want to include terms like "realistic, figure".

Adding 'anime screencap' to the negative prompt helps reduce flatness.

I don't like restrictions and prioritize diversity, so I keep the negative prompts to a minimum.

Lately, I've been favoring a workflow where I disable negative prompts in the early steps and only apply them starting from the later steps. This approach results in fewer compositional issues in the early stages, and since I can freely adjust the style in the later stages, the overall quality is improved.
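The idea of switching negative prompts on only in the later steps can be sketched as a per-step choice. This is a plain-Python illustration, not a ComfyUI API: `negative_for_step` and the `start_fraction` default are hypothetical, and the point at which the negative prompt should start is not specified above.

```python
def negative_for_step(step, total_steps, negative_prompt, start_fraction=0.5):
    """Return the negative prompt to use at a given 0-based step.

    Before `start_fraction` of the schedule has elapsed, an empty negative
    prompt is returned so that early composition is not constrained.
    """
    if not 0.0 <= start_fraction <= 1.0:
        raise ValueError("start_fraction must be between 0 and 1")
    start_step = int(total_steps * start_fraction)
    return negative_prompt if step >= start_step else ""


# Example: 10 steps, negative prompt active from step 4 onward.
schedule = [negative_for_step(s, 10, "realistic, figure", 0.4) for s in range(10)]
for step, neg in enumerate(schedule):
    print(step, repr(neg))
```

In a node-based workflow the same effect would come from restricting the negative conditioning to a range of steps rather than from a Python function.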

However, my way of thinking is unconventional. You don't have to follow it! You might get better results with many negative prompts, so give it a try!

I feel that with fewer steps, the composition doesn't turn out as well.

●It might be better to have at least 20 steps. Recently, I've been sticking to 50 steps.

For previews, I stop around 15-25 steps to check the progress.

Once I find a good seed, I refine it with 50 or 100 steps, adjusting the CFG as needed.

Since there is little change in the later steps, I can predict the outcome. This way, I balance both efficiency and quality.

However, with a higher number of steps, breakdowns may decrease, but it might end up overcooked. A setting like 30 steps might provide a better balance in terms of contrast.

By the way, I haven't trained with tags for work titles, but sometimes character tags include the work title. This tendency is especially strong with mobile games. When I randomly added a work title, there was a change in the style, so it’s possible that it may have some effect.

Uni-pc may be faster as it achieves good results in about 20 steps. If i2i is the basis, I think it's also a good idea to finish in half the steps using methods like splitsigmas and then perform i2i.

If you find it troublesome to come up with prompts that produce stable quality, using prompts like the ones below might help stabilize the output. Ironically, tags like these end up becoming quality tags. lol

" nikke, azur lane, blue archive, kancolle, virtual youtuber, arknights, girls' frontline"

●I'll also share the natural language prompt I use for quality improvement. Try adding it to the end of your prompt. It's already included in my workflow. I think adding the game title tags on the last line would be a good idea.

■Consistently high quality

A highly detailed character with smooth, glowing skin and vibrant, natural colors, A dynamic, expressive pose with natural proportions and accurate composition. Soft, balanced lighting enhances depth and warmth, while surrounding light subtly interacts with the character, blending tones and creating a harmonious connection with the environment. Rich facial expressions convey emotion and presence, and soft highlights accentuate the character’s curves and details, adding depth and a natural, luminous glow.

■Dynamic composition.

A highly detailed anime-style character with smooth, radiant skin and vibrant, balanced colors, depicted in a dynamic and expressive pose with flawless anatomy and natural proportions. The composition is visually compelling, with intricate textures and exquisite detailing in the character's design. Soft, nuanced lighting enhances depth and warmth, interacting harmoniously with the surroundings to create a cohesive, immersive atmosphere. The background is richly detailed and dynamic, filled with captivating elements that complement the scene without overwhelming the character. Subtle highlights and shadows accentuate the character's curves, clothing, and features, adding realism and a luminous glow. The overall image captures a perfect balance between artistic stylization and a convincingly grounded presence.

●This massive, chaotic negative prompt might actually be effective, though I just copied it from other models without any guarantees. Still, it seems to have some effect.

If you feel that the composition or anatomy looks strange, try removing the negative prompt. I've noticed several times that it can have a negative impact.

■amputated, bad anatomy, bad proportions, blurry, dated, deformed, extra limbs, fused fingers, low quality, malformed limbs, missing limbs, mutated, ugly, overexposed, underexposed, flat colors, low detail,

■512px model.

The standard size for this model is 512px

An aspect ratio like 512x768, as with SD1.5, is suitable.

The model has not been trained at 768px or 1024px, so results at those sizes will be disastrous.

The base model is very high quality even at 512px!

Usually, models in the middle of pre-training or lite versions lack sufficient learning or aesthetic appeal, but this model is different. It is the most aesthetically pleasing I have seen so far.

Due to its low requirements for training and inference specs and its fast speed, I feel that it has the potential to become the successor to SD1.5 that I've been looking for. I love this model.

Honestly, for creating images focused on 2D characters, there’s little difference between 512px and 1024px. Unless it’s a concept that clearly requires high resolution, 512px should be sufficient.

■ 1024px model.

If you don’t want to waste time, it might be a good idea to use the 512px model first to practice which prompts are effective.

Merging might also be interesting.

Merging with a realistic model can sometimes improve anatomy.

An example of an interesting merging experiment:

Simply merge the 1024px and 512px models at a 0.5 ratio. This will allow you to generate at a 768px scale. Try resolutions like 768x768, 576x960, or even 640x1024. 768x1024 may sometimes break down, but it can succeed occasionally.

If the preview shows no block noise or line noise, then it’s fine. If these appear and strange artifacts start to show in the generated image, that’s the resolution limit.

This approach balances speed and detail, but I’m not entirely confident the merge is stable—it may have some issues. Still, it’s worth trying for an interesting experiment.
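For readers curious what a 0.5-ratio merge does numerically, here is a minimal sketch of a linear weight merge. It assumes both checkpoints expose the same parameter names and shapes; `merge_weights` is a hypothetical helper, and the toy dictionaries stand in for real checkpoints, which would normally be merged with ComfyUI or a merge tool.

```python
def merge_weights(model_a, model_b, ratio=0.5):
    """Linear merge: (1 - ratio) * A + ratio * B for every parameter.

    `model_a` and `model_b` map parameter names to values. Values may be
    floats, NumPy arrays, or tensors, as long as they support + and *.
    """
    if model_a.keys() != model_b.keys():
        raise ValueError("models must have identical parameter names")
    return {k: (1.0 - ratio) * model_a[k] + ratio * model_b[k] for k in model_a}


# Toy stand-ins for the 512px and 1024px checkpoints (hypothetical values).
m512 = {"block.weight": 1.0, "block.bias": 0.0}
m1024 = {"block.weight": 3.0, "block.bias": 2.0}

merged = merge_weights(m512, m1024, ratio=0.5)
print(merged)
```

A ratio of 0.5 weights both models equally; moving it toward 0 or 1 leans the result toward one of the two resolutions.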

※By the way, I don't think the older versions are inferior.

As the training progresses, the model learns more concepts but gradually deviates from PixArt's aesthetics.

Therefore, earlier versions might have a better balance in some cases.

It's a matter of personal preference, so I think you should use the version you like best.

Personally, there are sample images from older versions that I really like. I'm not confident I could replicate them with the latest version, lol.

■I am training with Danbooru tags.

The model only learns general tags such as 1girl; artist tags and anime work tags are not trained.

Prompts with only a few tags will produce a disastrous result.

Popular tags tend to be of higher quality.

Examples: looking at viewer, upper body, shiny skin, anime screencap, etc.

If the effect is too strong, it might be a good idea to lower the weight.

It would be interesting to generate various tags using something that can automatically generate tags.

This is an experiment to see how much the tags can learn.

My training quality is poor, but it's learning better than expected.

In some cases, it may be able to express things that are difficult to do with other models.

It seems possible to add some new concepts even without fine-tuning the T5.

The base model is not excessively censored; like Cascade, it can handle high-exposure outfits without issues and sometimes even generate nudity.

It's interesting because it feels different from other models.

Due to the small size of the dataset, we are not yet able to recognize all tags.

It seems that natural language still works as well. There might be an interesting aspect that is different from the base model.

It's quite fun. I give themes to ChatGPT to create natural language prompts.

■There are cases where the look of something realistic or AI comes out strongly.

It might be a good idea to add "realistic" to the negative prompt.

On the other hand, it might be fun to try something other than anime.

New discoveries are made in areas that were not originally intended.

It's okay not to expect perfection too much.

This model is still immature. The broken results are more interesting!

■There is no consistency in style. The quality is poor and there are no fixed settings or prompts.

●It has no advantage over existing models and has a narrower dataset.

●It's an incomplete and very difficult model, but if you're interested, please give it a try.

●If the human body breaks down, it's not due to censorship but rather because my fine-tuning is poor, so please bear with me! lol

I will continue to refine it to make it better in the future!

●Merging is no problem. If you have any interesting results, please share!

I think the 512px model can be merged into the 1024px model using differential merging. If the proportion is too large, it might break down, but it could be useful for enhancing concepts and styles.
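"Differential merging" is read here as the usual add-difference merge, where only what the fine-tuned 512px model learned relative to its base is added to the 1024px model. That reading is an assumption, and the helper name, the `alpha` value, and the toy numbers below are hypothetical.

```python
def add_difference(target, tuned, base, alpha=0.3):
    """Return target + alpha * (tuned - base) for each parameter.

    Parameters missing from `tuned` or `base` are copied from `target`
    unchanged. A large `alpha` corresponds to a large merge proportion.
    """
    merged = {}
    for name, value in target.items():
        if name in tuned and name in base:
            merged[name] = value + alpha * (tuned[name] - base[name])
        else:
            merged[name] = value
    return merged


# Toy values standing in for real checkpoints.
target_1024 = {"w": 4.0, "only_in_target": 1.0}
tuned_512 = {"w": 2.0}
base_512 = {"w": 1.0}

result = add_difference(target_1024, tuned_512, base_512, alpha=0.5)
print(result)
```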

■Dataset Notes:

●"realistic, figure, anime screencap"

These are the only three tags that I intentionally trained for style, and using them will enforce a particular style.

"anime screencap" will result in a TV anime style.

●Putting "realistic, figure" in the negative prompts will enforce an anime style.

However, other 2D styles lack consistency and the style will change based on the keywords...

●From what I understand, sexual content tends to adopt a visual novel game style, and natural language tends to lean towards AI or 2.5D.

Tags like "looking at viewer, upper body, shiny skin" are tagged in many images, so the quality might be higher. I feel they tend to be closer to the AI image style.

"blush" is also widely used and tends to be the flat style of visual novel games and Japanese 2D artists.

●The contents of my dataset include visual novel games, real people, figures, 2.5D, anime screencaps, and AI images.

Because I trained on such a wide range, styles are linked to tags, which might make control a bit difficult...

●If there are no background tags, the image may end up with a white background.

This happens because elements outside the given prompt are less likely to bleed into the image.

With a short prompt, the result may be vague and blurry. Try adding key keywords that describe the type of image you want to generate.

●It's best to include tags for the type of scenery you have in mind, like the examples below.

Additionally, based on those tags, consider what elements should be present in the background and add them accordingly—such as plants in a room or cars in a city.

If the background becomes the main focus and the character appears small, using tags like "solo focus" can help emphasize the character as the main subject. The "landscape" tag tends to make the background the main focus. If the character is the main subject, it might be better not to use it.

"outdoors, scenery, landscape, indoors, bedroom, building, car, crowd, forest, beach, city, street, day, night, from above, from below"

■For reference, I will also share my simple ComfyUI workflow and OneTrainer training setting data.

If you want to use ComfyUI for inference, you need to install the "ExtraModels" plugin. I will also share the URLs of the "vae" and "T5" that I use.

I don't know if it can be used with other WebUI.

Other people have shared their workflows, so it might be a good idea to refer to them.

■ExtraModels

https://github.com/city96/ComfyUI_ExtraModels?tab=readme-ov-file#installation

■vae

https://huggingface.co/madebyollin/sdxl-vae-fp16-fix/blob/main/diffusion_pytorch_model.safetensors

■T5

https://huggingface.co/theunlikely/t5-v1_1-xxl-fp16/tree/main

It's the same as the T5 on sd3, so you can probably use the 8bit T5 on sd3 as well. That should load faster.

■Base model Please download when you want to try other resolutions.

https://huggingface.co/PixArt-alpha/PixArt-Sigma/tree/main

■1024px diffuser model is required during training. Please specify this as the base model and train.

https://huggingface.co/PixArt-alpha/PixArt-Sigma-XL-2-1024-MS

■ 512px Model.

https://huggingface.co/PixArt-alpha/PixArt-Sigma-XL-2-512-MS

Compared to the 1024px model, it has lower hardware requirements and training speed is about 4 times faster, making it accessible for more people to train. Apart from the transformer, it uses the same data as the 1024px model, so please transfer the data from the URL above.

■If you have room in your GPU, loading T5 on the GPU will make inference faster and less stressful.

By converting T5 to 4-bit, inference is possible even with lower specifications.

A 12GB GPU should be fine. If you convert it to 4-bit, you might be able to load it on an 8GB GPU... If that doesn't work, don't worry: you can load it into your system RAM!

If an error occurs even after installing ExtraModels with ComfyUI Manager,

follow the instructions in the ExtraModels URL,

activate VENV, and re-enter the requirements.

When I tried to convert T5 to 4-bit, an error occurred with bitsandbytes, but re-entering the requirements solved the problem.

I don't know much about it either, so it may be difficult for me to provide support for installation...

■I'm new to civitai, so if you have any opinions, I'd appreciate it if you could let me know.

I'm not good at training, but I would be happy if I could share the potential of pixart with as many people as possible.

PixArt-Sigma has potential.

My dream is to see more Pixart models. I'd love to see the models you've trained as well!

The training requirements are low, 12GB is fine!

The total number of downloads has exceeded 1000. Thank you for your interest in my immature model! Thank you very much for your many likes. m(＿＿)m

Thank you for the buzz as well!

This fine-tuning itself isn't particularly exceptional, but I hope the information about my training can help someone interested in Pixart!

■Below I will list the GPU and training time I used for my training. Please use it as a reference for your training!

If you want to know the exact settings, please download the onetrainer data.

GPU: RTX 4060 Ti 16GB

■512px

Batch size: 48

70,000 / 48 ≈ 1,458 steps (roughly 1,500)

1 epoch: 5 hours

15 epochs: 75 hours

GPU usage: 13GB

With this batch size and epoch time, I think the speed isn't much different from SD1.5. It's fast.

I feel the 512px model is like a successor to SD1.5.

■1024px (testing)

Batch size: 12

70,000 / 12 = 5,833 steps

1 epoch: 30 hours

5 epochs: 150 hours

GPU usage: 15GB

It doesn't take exactly four times longer because of the difference in batch size.
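The step counts and timings listed above can be checked with a few lines of arithmetic. The dataset size, batch sizes, and epoch times come from the figures above; rounding the last partial batch up with `ceil` is an assumption about how batches are counted.

```python
import math


def steps_per_epoch(num_images: int, batch_size: int) -> int:
    """Number of optimizer steps per epoch, counting a final partial batch."""
    return math.ceil(num_images / batch_size)


def seconds_per_image(epoch_hours: float, num_images: int) -> float:
    """Average wall-clock seconds spent per training image in one epoch."""
    return epoch_hours * 3600 / num_images


NUM_IMAGES = 70_000
runs = {
    "512px": {"batch": 48, "epoch_hours": 5},
    "1024px": {"batch": 12, "epoch_hours": 30},
}

for name, run in runs.items():
    steps = steps_per_epoch(NUM_IMAGES, run["batch"])
    per_image = seconds_per_image(run["epoch_hours"], NUM_IMAGES)
    print(f"{name}: {steps} steps/epoch, {per_image:.2f} s/image")

# Ratio of epoch times between the two resolutions.
slowdown = runs["1024px"]["epoch_hours"] / runs["512px"]["epoch_hours"]
print(f"1024px epoch takes {slowdown:.1f}x as long as 512px")
```

With these figures a 1024px epoch takes six times as long as a 512px epoch, rather than the four times that the pixel count alone would suggest.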

In my environment, I felt it was impossible to train a 1024px SDXL model, so I haven't tried it and don't know if it's fast or slow. But I think the batch size is good!

■Full fine-tuning: with 12GB, 1024px training is not a problem.

I have 16GB, so my batch size is slightly larger.

If you lower the batch size, the VRAM usage decreases significantly.

With a batch size of 1 or 2, it might be fine even with 8GB.

I use CAME as the optimizer, which slightly increases GPU usage. I liked it because the quality was good.

With Adafactor or AdamW8bit, VRAM usage is significantly reduced.

Since the text encoder is T5 and very large, it might be difficult for now because training requires a lot of VRAM...

With the advent of SD3, this discussion will progress and training methods will be established. Until then, a large amount of VRAM might be necessary...

If you want guidelines for full fine-tuning settings, you can use these as a reference.

However, it may sometimes lead to overfitting or be challenging due to your PC specifications.

While referring to these, try to find settings that work best for you.

I was able to achieve the same settings by switching to BF16 training to reduce GPU usage, so that's what I use.

https://github.com/PixArt-alpha/PixArt-sigma/blob/master/configs/pixart_sigma_config/PixArt_sigma_xl2_img512_internalms.py

https://github.com/PixArt-alpha/PixArt-sigma/blob/master/configs/pixart_sigma_config/PixArt_sigma_xl2_img1024_internalms.py

Note!

■When training with Onetrainer, the number of tokens may be limited to 120.

For tag training, the impact should be minimal since tag shuffling is performed.

Honestly, I have never had any issues with 120 tokens for tags.

However, for natural language, the length of the caption is important, so unintended truncation might occur.

■Relevant part: "max_token_length=120" This value is the token limit.

https://github.com/Nerogar/OneTrainer/blob/23006f0c2543e52a9376b0557e7a78016d489acc/modules/dataLoader/PixArtAlphaBaseDataLoader.py#L244
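To see whether natural-language captions would be cut off by a 120-token limit, the captions can be counted before training. The sketch below takes the tokenizer as a parameter, since the real counts depend on the T5 tokenizer; the whitespace split used in the example is only a stand-in and will undercount, and `find_truncated` is a hypothetical helper, not part of OneTrainer.

```python
from typing import Callable, Iterable


def find_truncated(captions: Iterable[str],
                   tokenize: Callable[[str], list],
                   max_tokens: int = 120) -> list:
    """Return (caption, token_count) pairs that exceed `max_tokens`."""
    over = []
    for caption in captions:
        count = len(tokenize(caption))
        if count > max_tokens:
            over.append((caption, count))
    return over


# Whitespace splitting is a placeholder for the real T5 tokenizer.
captions = [
    "1girl, green hair, school uniform",
    "word " * 130,  # artificially long caption
]
flagged = find_truncated(captions, str.split, max_tokens=120)
for caption, count in flagged:
    print(f"{count} tokens: {caption[:40]}...")
```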

■In the case of xformers, errors occurred beyond 256 tokens. With sdp, there were no issues up to 300 tokens, but at 512 tokens, the generated images broke down.

It seems that more tokens do not necessarily mean better results.

Due to the increase in cache size, if the cost-effectiveness is not promising, 120 tokens might be sufficient.

There is no guarantee of quality improvement, but it might be worth investigating.

Since there is no certainty, please let me know if there are any mistakes!

If you have any questions, please feel free to ask!

Questions in Japanese are also welcome, so please feel free to reach out!

Model Type	Checkpoint
Base Model	PixArt E
Published	3/5/2025