AstolfoMix-SD2

An exotic merge of 15 models (12 UNET + 4 CLIP). See this article for the description. Go to the HuggingFace model page for a sneak peek before the "kind of official release". Content and theories already covered by the SD1 version will not be repeated; the content below is exclusive to SD2.

Abstract

I present AstolfoMix-SD2, a merge model focused on "making a useful SD2.1 model", after the disastrous history of the base model and its finetuned variants (especially WD1.5). Currently it is in anime style. She is not Astolfo, but still so cute!

Introduction

SD2.1 is the best proving ground for an original idea. The whole history of SD2 and its finetunes (especially WD1.5) is a total tragedy. If my mix looks reasonable after merging totally unusable models, I will discover a lot more exclusive findings and get closer to the ground truth.

Related Works

Starting from WD1.5B2, I believe the problem lies in the trainer along with A1111's runtime. Most models of that era cannot generate any reasonable image, except with the prompts shown in the demo cases. As of 231231, I can't reproduce the same images from 2303. With the exception of Replicant-V3.0 and its variants, I cannot make any reasonable image beyond the demo prompts with any model, even PonyDiffusion (most users aren't aware of this because they are the targeted users). Therefore there must be something to improve.

Although there were some tryhard merges that shared a similar thought, they act strangely (later I found some essential prompts to include), and they do show some actual improvement over the parent models. Something really bad must have been included in those mixes, making the resulting model poor at understanding concepts.

Methodology

Unlike SD1, which "works even if you pick models randomly", (careful) model selection is an NP-hard ( O(N!) ) problem. However, with pattern recognition by visual inspection (since the total number is still only 23), it reduces to a core concept of "Replicant-V3 UNET + WD1.5B3 CLIP". With some further matching, 10 UNETs and 4 CLIPs /TEs are picked for this version of AstolfoMix.
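To get a feel for why brute force is not an option, here is a rough count for the 23 candidates mentioned above. This is only a back-of-the-envelope sketch: the function names are made up for illustration, and the pairing count assumes every candidate contributes exactly one UNET and one CLIP, which the text does not state.

```python
import math


def ordering_count(n_models: int) -> int:
    """Number of distinct orderings of n models, i.e. the O(N!) figure."""
    return math.factorial(n_models)


def unet_clip_pairings(n_unets: int, n_clips: int) -> int:
    """Number of single-UNET + single-CLIP combinations to inspect."""
    return n_unets * n_clips


N = 23  # total number of candidate models mentioned in the text

# Orderings explode far beyond anything that can be rendered and compared.
print("orderings of all candidates:", ordering_count(N))

# Pairwise UNET/CLIP checks stay small enough for visual inspection
# (assumption: one UNET and one CLIP per candidate).
print("UNET x CLIP pairings:", unet_clip_pairings(N, N))
```

The pairing count is what makes the visual-inspection pass feasible, while the factorial figure shows why an exhaustive search over merge orders is not.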

This time stable-diffusion-webui-model-toolkit will be used frequently, because I need to keep extracting and importing UNETs /CLIPs together. It also saves the model as FP16 safetensors, a format which is commonly used, reliable and takes less disk space. So far I've generated 80+ models, which is huge ( O(N) for space).
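The extract-and-import step can be pictured as splitting a checkpoint by key prefix and recombining the parts. The sketch below is not the toolkit's implementation; it uses plain dicts with placeholder strings instead of tensors, and it assumes the key prefixes commonly seen in LDM-format SD checkpoints ("model.diffusion_model." for the UNET, "cond_stage_model." for the text encoder). The function names are hypothetical.

```python
# Assumed key prefixes of an LDM-format checkpoint.
UNET_PREFIX = "model.diffusion_model."
CLIP_PREFIX = "cond_stage_model."


def split_components(state_dict):
    """Split a checkpoint-like dict into (unet, clip, rest) by key prefix."""
    unet, clip, rest = {}, {}, {}
    for key, value in state_dict.items():
        if key.startswith(UNET_PREFIX):
            unet[key] = value
        elif key.startswith(CLIP_PREFIX):
            clip[key] = value
        else:
            rest[key] = value  # e.g. the VAE under "first_stage_model."
    return unet, clip, rest


def combine(unet_donor, clip_donor):
    """Take the UNET (and everything else) from one model, the CLIP from another."""
    unet, _, rest = split_components(unet_donor)
    _, clip, _ = split_components(clip_donor)
    return {**rest, **unet, **clip}


# Toy checkpoints: strings stand in for weight tensors.
model_a = {
    "model.diffusion_model.out.2.weight": "A-unet",
    "cond_stage_model.model.ln_final.weight": "A-clip",
    "first_stage_model.decoder.conv_in.weight": "A-vae",
}
model_b = {
    "model.diffusion_model.out.2.weight": "B-unet",
    "cond_stage_model.model.ln_final.weight": "B-clip",
    "first_stage_model.decoder.conv_in.weight": "B-vae",
}

hybrid = combine(unet_donor=model_a, clip_donor=model_b)
print(hybrid)
```

With real checkpoints the values would be tensors loaded from a safetensors file, and each such hybrid is one more full-size file on disk, which is where the 80+ generated models come from.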

(Images and steps are omitted, see my full-length Github page.) I made a few passes of global comparison across CLIPs and UNETs, then slowly reduced the model range, and finally picked only half of the discovered SD2.1 models.

Experiments

Same as the SD1 version. The only difference is that I must adjust my prompt, since most models are tagged with anime /real, and the AI chose not to blend the styles like SD1 did. Due to differences in the actual model structures, I've got runtime errors, the exact same score across the whole merge, and consistently glitched images. Multiple workarounds have been applied, even switching model selections.

Discussion

Same as the SD1 version. v-pred is useful. The tagging is still solvable (I oppose quality /style tagging, as it may link to so many irrelevant objects and make the whole (base) model inflexible). Solve the problem with LoRA /embeddings (badhands /badprompt) instead.

Conclusion

Just try my model! AstolfoMix only represents my personal "model selection" and "feature extraction"; everybody can use the "uniform merge" to make their own base model before a great SDXL model comes. Also, do not waste your time on risky MBW /finetuning if you don't have abundant resources.
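For readers who want to try it, a "uniform merge" can be sketched as an equal-weight average of every shared parameter across the selected models. This reading of the term is an assumption on my part here, and the code below is a toy: plain lists of floats stand in for weight tensors, and `uniform_merge` is a hypothetical helper, not a function of any merging tool.

```python
def uniform_merge(state_dicts):
    """Average each shared key across all models with equal weight 1/N."""
    if not state_dicts:
        raise ValueError("need at least one model to merge")

    # Only keys present in every model can be averaged.
    shared = set(state_dicts[0])
    for sd in state_dicts[1:]:
        shared &= set(sd)

    n = len(state_dicts)
    merged = {}
    for key in state_dicts[0]:
        if key not in shared:
            continue
        # zip walks the same position of this parameter in every model.
        columns = zip(*(sd[key] for sd in state_dicts))
        merged[key] = [sum(col) / n for col in columns]
    return merged


# Toy "models": one parameter each, lists of floats instead of tensors.
models = [
    {"w": [1.0, 2.0]},
    {"w": [3.0, 4.0]},
    {"w": [5.0, 6.0]},
]
print(uniform_merge(models))
```

No per-block weights need tuning in this scheme, which is the contrast with MBW: the only decision left is which models go into the list.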