zer0int's Long CLIP_L-Registers-Gated_MLP-ViT-L-14


Model description

100% of the credit goes to zer0int on Hugging Face. I'm only uploading this here so I can mark it as a resource for image generation. Hugging Face link: https://huggingface.co/zer0int/LongCLIP-Registers-Gated_MLP-ViT-L-14 - I will remove this if requested by zer0int.

NOTE: This is not recommended for SDXL. From a quick read of the issues, it appears a compatible CLIP_G has not been released yet, so it may not work very well for SDXL. If you need CLIP_G, it's probably best not to use this.

The main difference between "Long-CLIP_L" and "CLIP_L" is token length.

CLIP_L = 77-token maximum prompt length.

Long-CLIP_L = 248-token maximum prompt length.
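
For a concrete sense of what that difference means, here is a minimal sketch using the stock openai/clip-vit-large-patch14 tokenizer from transformers (the Long-CLIP repo's own tokenizer, if it ships one, may differ): the same long prompt gets cut off at 77 tokens but fits comfortably in a 248-token window.

```python
# Minimal sketch: the same prompt under a 77-token vs a 248-token limit.
# Uses the stock CLIP-L tokenizer purely for illustration.
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

prompt = "a very long, highly detailed prompt " * 20  # ~140 tokens of text

# Standard CLIP_L: everything past 77 tokens is truncated away.
short_ids = tokenizer(prompt, truncation=True, max_length=77)["input_ids"]
# Long-CLIP_L: the same prompt fits inside the 248-token window.
long_ids = tokenizer(prompt, truncation=True, max_length=248)["input_ids"]

print(len(short_ids), len(long_ids))  # e.g. 77 vs ~140
```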

Since I do pretty much only Flux generations and sometimes use LLMs to write prompts, having a larger token length helps. I mean, why limit yourself to 77 tokens when you can have 248!
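
For anyone using diffusers rather than a node UI, here is a rough sketch of how the swap might look. The repo id comes from the Hugging Face link above, but whether it loads directly via CLIPTextModel.from_pretrained (and whether the pipeline's 77-token CLIP limit is lifted automatically) is an assumption, so treat this as illustrative rather than a confirmed recipe.

```python
# Rough sketch: dropping a Long-CLIP text encoder into a Flux pipeline.
# Loading the repo this way is an assumption about its layout, not confirmed.
import torch
from diffusers import FluxPipeline
from transformers import CLIPTextModel

long_clip = CLIPTextModel.from_pretrained(
    "zer0int/LongCLIP-Registers-Gated_MLP-ViT-L-14",
    torch_dtype=torch.bfloat16,
)

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    text_encoder=long_clip,  # Flux's first text encoder is CLIP-L
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe("a long, highly detailed prompt ...", num_inference_steps=28).images[0]
image.save("long_clip_flux.png")
```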

TE Only = Text Encoder only; this is really all you need most of the time.

Full Model = The whole thing (text encoder plus vision encoder), for if you want to do more than just text-to-image.
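
In transformers terms, the difference roughly maps to which class you load. Assuming the repo follows the usual Hugging Face CLIP layout (an assumption, not confirmed), it might look like this:

```python
# Rough sketch of "TE Only" vs "Full Model", assuming a standard Hugging Face
# CLIP layout in the repo (an assumption, not confirmed).
from transformers import CLIPModel, CLIPTextModel

REPO = "zer0int/LongCLIP-Registers-Gated_MLP-ViT-L-14"

# TE Only: just the text encoder -- all a text-to-image pipeline needs
# for prompt conditioning.
text_encoder = CLIPTextModel.from_pretrained(REPO)

# Full Model: text encoder plus vision encoder, for tasks like
# image-text similarity scoring or retrieval.
full_clip = CLIPModel.from_pretrained(REPO)
```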

This particular Long-CLIP_L is the Registers-Gated version, which is a fine-tune. zer0int has provided a nice little chart showing the difference this fine-tune makes. If I'm not mistaken, it's basically saying that the text-to-text and image-to-text results line up more closely with each other and have far fewer instances of erroneous data. TL;DR - the shorter and wider the better!
