
This LoRA enables a from-below view with a foot in focus in Flux.dev.
The training images included different variants of feet (barefoot, socks, boots, stockings, ...). However, due to how fast Flux learns and the balance of these concepts in the dataset, only barefoot and boots are really reliable.
Main trigger: POV stepped on
Additional tags which might have an influence (ordered by occurrence):
- barefoot
- view straight up (if the view is more or less from straight below rather than angled)
- large foot (tagged when the foot covered the majority of the image)
- dirty foot
- heels / boots / socks / stockings (your mileage will vary here)
Suggested LoRA weight: 0.8 – 1.2.
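If you generate with diffusers, loading the LoRA looks roughly like this (a minimal sketch; the file path, prompt, and sampler settings are placeholders, not my exact workflow):

```python
import torch
from diffusers import FluxPipeline

# Load Flux.1-dev (gated on Hugging Face, so a login/token is required).
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# Attach the LoRA and set its weight (suggested range: 0.8 - 1.2).
pipe.load_lora_weights("DI_pov_stepped_on_flux.d.safetensors", adapter_name="pov")
pipe.set_adapters(["pov"], adapter_weights=[1.0])

image = pipe(
    "POV stepped on, barefoot, view straight up",
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("pov_stepped_on.png")
```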
I am not really happy with the results, but I think it is better to publish it than to let it rot on my storage. It can be wonky with the number of toes, or even produce disfigured feet. It also sometimes puts the wrong foot on the wrong side (for example, a left foot on the right leg), and it really likes toes, so when trying to get stockings, socks, or similar, you will often end up with some unrealistically tight fabric (or none at all).
And with these weaknesses out of the way, we can talk a bit about the
Training
The dataset is about 200 images, automatically captioned with JoyCaption Alpha Two, then manually refined a little and tagged with the tags mentioned above.
Overall, the training itself was quite unyielding. In total, I think I trained 5 or 6 versions to a high step count, testing out different parameters and trainers.
In the beginning I started with OneTrainer, but quickly switched to Ostris’s ai-toolkit. There I completed multiple full trainings; none of them were bad, but none were great either, so I changed the parameters a few times and started again, hoping for better results. Between attempts:
- I increased the dataset size from 30 to 200, adding more detailed captions (which did give more control and diversity in the outputs)
- I reduced the rank from 16 to 8 and 4 (and also played around with alpha)
- I tried different batch sizes
- …
However, they all still suffered from similar problems (occasional disfigured generations, imperfect control, …), so in the end I concluded that I had spent enough time on this and now had to work with what I had.
The two best candidates were trained on the most recent variation of the dataset: variant A with dim 4, alpha 8, and variant B with dim 16, alpha 16. Each gave superior results depending on the generation and on what you were going for (variant A, for example, was slightly less likely to generate disfigurements), but the difference was minuscule. So I played around with merging them (or rather, weighting and concatenating them, due to the different alphas and dims), but that only degraded (or did not affect) the results.
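For reference, the weighted concat merge works roughly like this (a sketch, assuming the usual Kohya-style key layout of lora_down.weight / lora_up.weight / alpha and that both files cover the same modules; folding each adapter's alpha/rank scale into lora_down is what makes different dims and alphas compatible):

```python
import torch
from safetensors.torch import load_file, save_file

def concat_merge(path_a, path_b, w_a=0.5, w_b=0.5, out="merged_concat.safetensors"):
    """Weighted concat of two LoRAs with different dims/alphas (same modules)."""
    a, b = load_file(path_a), load_file(path_b)
    merged = {}
    for key in a:
        if not key.endswith("lora_down.weight"):
            continue
        base = key[: -len("lora_down.weight")]
        up_k, alpha_k = base + "lora_up.weight", base + "alpha"
        # Fold scale = alpha / rank and the merge weight into lora_down,
        # so both adapters agree on an effective scale of 1.
        down_a = a[key] * (a[alpha_k].item() / a[key].shape[0]) * w_a
        down_b = b[key] * (b[alpha_k].item() / b[key].shape[0]) * w_b
        merged[key] = torch.cat([down_a, down_b], dim=0)     # (r_a + r_b, in)
        merged[up_k] = torch.cat([a[up_k], b[up_k]], dim=1)  # (out, r_a + r_b)
        # alpha == new rank keeps the folded-in scale at exactly 1.
        merged[alpha_k] = torch.tensor(float(merged[key].shape[0]))
    save_file(merged, out)
```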
Because of that, I finally settled on an SVD merge at rank 128, as previous experiments (outside of this LoRA) had shown that it keeps more of the base model's "ground truth". As expected, this did reduce the disfigurements.
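The SVD merge is conceptually simple: sum the full-rank deltas of both adapters, then truncate back down with an SVD. A per-layer sketch (inputs are hypothetical; scale here means alpha/rank):

```python
import torch

def svd_merge_layer(up_a, down_a, scale_a, up_b, down_b, scale_b, new_rank=128):
    # Combined full-rank delta: sum of scale * up @ down for both LoRAs.
    delta = scale_a * (up_a.float() @ down_a.float()) \
          + scale_b * (up_b.float() @ down_b.float())
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    U, S, Vh = U[:, :new_rank], S[:new_rank], Vh[:new_rank, :]
    # Split the singular values evenly between both factors.
    new_up = U * S.sqrt()                  # (out, new_rank)
    new_down = S.sqrt().unsqueeze(1) * Vh  # (new_rank, in)
    return new_up, new_down  # save with alpha == new_rank (scale 1)
```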
Training Settings (A | B):
- Alpha, Dim: 8, 4 | 16, 16
- Total steps: 9000
- Caption dropout: 0.05
- Resolutions: 512, 768, 1024
- Batch size: 2 | 1
- Noise scheduler: flowmatch
- Learning rate: 2.5e-5
- Linear timesteps
- Quantized (with gradient checkpointing)
Training took about 14.5 (A) and 8.2 (B) hours on an RTX 4090.
After training, the safetensors keys were converted to be compatible with Kohya, the SVD merge was done (as said before, at rank 128), and the merged LoRA was then resized to rank 32 (sv_fro with 0.985), at least if I remember correctly.
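The sv_fro criterion picks, per layer, the smallest rank whose retained singular values cover a target fraction of the delta's Frobenius norm. A minimal sketch of the idea (how I understand the resize behaves, not a copy of any trainer's code):

```python
import torch

def sv_fro_rank(delta, target=0.985, max_rank=32):
    """Smallest rank r with ||S[:r]|| / ||S|| >= target (Frobenius norm)."""
    S = torch.linalg.svdvals(delta.float())
    ratios = torch.cumsum(S**2, dim=0) / (S**2).sum()
    rank = int(torch.searchsorted(ratios, torch.tensor(target**2))) + 1
    return min(rank, max_rank)
```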
As a quick, tangentially related last note: I also tried running the same dataset through SD3, and after a similar training time I was only rewarded with disfigurement nightmare fuel, so I stopped that short side experiment.
Trained words: POV stepped on, barefoot, view straight up
Name: DI_pov_stepped_on_flux.d.safetensors
Size (KB): 95260
Type: Model