Running Stable Diffusion Efficiently: Forge + Flux.1 GGUF on Low-Power GPUs

Introduction

Stable Diffusion has revolutionized AI-generated art, but running it effectively on low-power GPUs can be challenging. Enter Forge, a framework designed to streamline Stable Diffusion image generation, and the Flux.1 GGUF model, an optimized solution for lower-resource setups. Together, they make it possible to generate stunning visuals without breaking the bank on hardware upgrades.

This article will guide you through setting up and using Forge with the Flux.1 GGUF model for a smooth experience on low-power GPUs.


What is Forge?

Forge, officially known as Stable Diffusion WebUI Forge, is a streamlined interface designed for generating high-quality images with Stable Diffusion while optimizing for user control and hardware efficiency. Unlike modular, node-based editors such as ComfyUI, Forge focuses on offering simplicity and direct functionality for those who want powerful results without overly complex workflows.

Key Features of Forge:

  • Lightweight and Fast: Specifically designed to minimize overhead, making it suitable for low-power GPUs and older systems.
  • Clean Interface: Provides a straightforward UI for text-to-image and image-to-image generation with intuitive controls.
  • Model Optimization: Easily integrates GGUF quantized models like Flux.1 to maximize performance on limited hardware.
  • Advanced Sampling Options: Supports various samplers and enables fine-tuning for balance between speed and quality.

Whether you’re new to Stable Diffusion or an experienced user looking for a more efficient solution, Forge offers a balance of usability and performance.


Introducing the Flux.1 GGUF Model

The Flux.1 GGUF model is an innovation in Stable Diffusion optimization, designed specifically for low-power GPUs. The GGUF format reduces memory and processing demands while preserving visual fidelity.

Why Flux.1 GGUF?

  • Efficiency: Requires less VRAM than standard, unquantized checkpoints (a rough estimate follows this list).
  • Quality: Delivers good results without sacrificing too much detail.
  • Versatility: Compatible with tools like Forge or ComfyUI.
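
As a back-of-envelope check, you can estimate how much memory the transformer weights alone need at each quantization level from its bits per weight. The sketch below assumes the commonly cited figure of roughly 12 billion parameters for Flux.1-dev and approximate bits-per-weight values; the text encoders, the VAE, and activations add to the totals you will actually see.

```python
# Back-of-envelope size estimate for the Flux.1-dev transformer at
# different GGUF quantization levels. The ~12B parameter count and the
# bits-per-weight figures are approximations for illustration only.
PARAMS = 12e9  # assumed parameter count for Flux.1-dev

APPROX_BITS_PER_WEIGHT = {
    "Q2_K": 2.6,
    "Q3_K_S": 3.4,
    "Q4_0": 4.5,
    "Q5_0": 5.5,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
}

for name, bpw in APPROX_BITS_PER_WEIGHT.items():
    size_gib = PARAMS * bpw / 8 / 2**30
    print(f"{name:>7}: ~{size_gib:.1f} GiB of weights")
```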

Step 1: Setting Up Your Environment

Hardware Requirements

  • GPU: A low-power or older GPU with around 8GB of VRAM or more (e.g., Nvidia RTX 2080 Ti, RTX 3060, or similar); a quick way to check your card's VRAM is sketched after this list.
  • RAM: At least 16 GB for smooth performance.
  • Disk space: At least 25GB of free space.
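
If you are not sure how much VRAM your card reports, a quick check with PyTorch looks like the sketch below. It assumes a Python environment with a CUDA-enabled PyTorch build; the Forge package ships its own embedded environment, so your system Python may need PyTorch installed separately.

```python
# Optional: print the name and total VRAM of the first CUDA GPU so you
# can pick a matching quantization level in Step 1.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 2**30:.1f} GiB VRAM")
else:
    print("No CUDA-capable GPU detected.")
```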

Software Installation

  1. Forge Installation
    • Download Forge from its official repository; the linked Windows package bundles CUDA 12.1 and PyTorch 2.3.2.
    • Extract the .7z file to a directory.
    • Inside you will find two batch files: update.bat and run.bat.
    • Double-click update.bat to update Forge.
    • Double-click run.bat to launch Forge. You may need to run it a few times before all dependencies finish installing.
  2. Flux.1 GGUF Model
    • Download the Flux.1 GGUF model from here. You only need one model from this repo.
      • For GPUs with 8GB VRAM or less: Download the flux1-dev-Q2_K.gguf model.
      • For GPUs with 8GB to 10GB VRAM: Choose the Q3 or Q4 models.
      • For GPUs with 10GB to 12GB VRAM: Opt for the Q5 model.
      • For GPUs with 12GB or more VRAM: Download the Q6 or Q8 models.
    • Place the model file in Forge’s designated model folder, webui\models\Stable-diffusion.
    • If you’re curious about the suffixes in the model names, such as _0, _1, or _K, please refer to the appendix.
  3. VAE and Text Encoders
    • Download the VAE ae.safetensors from here.
    • Put ae.safetensors in the webui\models\VAE directory.
    • Download the text encoders clip_l.safetensors and t5xxl_fp8_e4m3fn.safetensors from here.
    • Place the two text encoders in the webui\models\text_encoder directory.

Once you have all the files downloaded, restart Forge by closing the run.bat window and running run.bat again.
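
If you want to double-check that everything landed in the right place before restarting, a small sketch like the following can list what Forge will find. It assumes the default webui folder layout described above; adjust the root path if yours differs.

```python
# Sanity-check that the downloads from Step 1 are where Forge expects
# them. Run from the extracted Forge directory.
from pathlib import Path

WEBUI_ROOT = Path("webui")  # assumed location of the webui folder

checks = {
    "Flux GGUF model (models/Stable-diffusion)": (WEBUI_ROOT / "models" / "Stable-diffusion").glob("*.gguf"),
    "VAE ae.safetensors (models/VAE)": (WEBUI_ROOT / "models" / "VAE").glob("ae.safetensors"),
    "clip_l.safetensors (models/text_encoder)": (WEBUI_ROOT / "models" / "text_encoder").glob("clip_l.safetensors"),
    "t5xxl_fp8_e4m3fn.safetensors (models/text_encoder)": (WEBUI_ROOT / "models" / "text_encoder").glob("t5xxl_fp8_e4m3fn.safetensors"),
}

for label, matches in checks.items():
    found = [p.name for p in matches]
    print(f"{label}: {', '.join(found) if found else 'MISSING'}")
```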


Step 2: Configuring Forge for Flux

  1. Click on the Txt2img tab.
  2. Select flux in the UI section.
  3. Pick the Flux GGUF model file you downloaded earlier.
  4. In the VAE / Text Encoder section, select the VAE and the two text encoders you downloaded.
  5. If you want to use LoRAs, you need to select Automatic (fp16 LoRA) under Diffusion in Low Bits. If not, just leave it at Automatic.
  6. Enter the prompt and click on the Generate button to generate images.

Step 3: Tips for Generating with the Flux.1 Model

  • There is no need for negative prompts.
  • Keep the distilled CFG between 1 and 4.
  • Reduce the image dimensions if you run into out-of-memory errors.
  • I usually use the Euler sampling method with 20 steps.
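
If you prefer scripting over clicking, Forge is built on the original Stable Diffusion WebUI, which exposes an HTTP API when launched with the --api flag. Assuming your Forge build keeps that API and listens on the default local address, a minimal text-to-image request using the tips above might look like the sketch below; field names follow the upstream WebUI API and may differ between versions.

```python
# Minimal sketch of a text-to-image request against the WebUI API.
# Assumes Forge was launched with --api and is reachable at the default
# http://127.0.0.1:7860 address.
import base64
import requests

payload = {
    "prompt": "a cozy cabin in a snowy forest, warm light in the windows",
    "negative_prompt": "",   # Flux does not need negative prompts
    "sampler_name": "Euler",
    "steps": 20,
    "cfg_scale": 1,          # classic CFG is commonly left at 1 for Flux
    "width": 896,
    "height": 1152,
    # Forge-specific settings such as the distilled CFG are set in the UI
    # in this guide; exposing them over the API may require extra fields.
}

resp = requests.post("http://127.0.0.1:7860/sdapi/v1/txt2img", json=payload, timeout=600)
resp.raise_for_status()

# The API returns base64-encoded PNGs in the "images" list.
for i, img_b64 in enumerate(resp.json()["images"]):
    with open(f"flux_sample_{i}.png", "wb") as f:
        f.write(base64.b64decode(img_b64))
```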

Samples

All images were generated at the default size of 896 x 1152. The GPU used was an Nvidia RTX 2080 Ti with 11GB VRAM.

Q5_0 (VRAM usage: 10.5GB)

Q4_0 (VRAM usage: 9.7GB)

Q3_K_S (VRAM usage: 8.6GB)

Q2_K (VRAM usage: 7.3GB)


Conclusion

By combining Forge’s flexibility with the Flux.1 GGUF model’s efficiency, you can unlock the full potential of Stable Diffusion on low-power GPUs. Whether you’re a hobbyist or an advanced user, this setup offers an accessible way to create high-quality AI art.

Try it out and let us know your thoughts!

Appendix: GGUF Model Name Suffixes

The suffixes _0, _1, _K_S, and _K in GGUF model names indicate different quantization formats, which affect the model’s size, VRAM usage, and output quality. Here’s what each typically signifies:


1. _0 and _1

These are the older “legacy” GGUF quantization formats. The number after the Q (e.g., Q4, Q5, Q8) is the bit width of the weights; the _0/_1 suffix describes how each block of weights is encoded:

  • _0:
    • Stores a single scale factor per block of weights.
    • The smaller and simpler of the two variants at a given bit width.
  • _1:
    • Stores a scale plus an additional offset per block.
    • Slightly larger than the matching _0 variant and usually slightly more accurate.

In practice, the bit width (Q2 vs. Q4 vs. Q8) has a far bigger impact on VRAM usage and quality than the _0/_1 suffix.

2. _K_S and _K

These mark the newer “K-quant” family of formats:

  • _K_S:
    • A K-quant in its “small” size variant; the trailing letter (S, M, or L) indicates how much of the model is kept at higher precision.
    • Smaller variants trade a little quality for lower VRAM usage.
  • _K:
    • Marks the K-quant formats in general: weights are grouped into super-blocks with shared, quantized scales.
    • Generally delivers better quality than a legacy _0/_1 quant of similar size, making it a good general-purpose choice.

When to Use Each

  • Low-power GPUs: Pick a low bit width such as Q2_K or Q3_K_S to keep VRAM usage down.
  • Mid-range GPUs: The Q4 and Q5 variants offer a good balance between quality and resource use; prefer a _K variant over _0/_1 when both are available at the same bit width.
  • High-end GPUs: Go with Q6_K or Q8_0 for the highest fidelity if resources are not a concern.

The sketch below shows where the size difference between the legacy _0 and _1 formats comes from.
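
As a rough illustration (based on the block layouts used by llama.cpp-style GGUF files; exact layouts can vary between versions), the effective bits per weight of the legacy formats can be derived from how each block is stored:

```python
# Effective bits per weight for the legacy GGUF quant formats, derived
# from their per-block layout: 32 weights per block, an fp16 scale, and
# for the _1 variants an extra fp16 offset. Figures are approximate.
BLOCK_SIZE = 32  # weights per quantization block

def bits_per_weight(quant_bits: int, has_offset: bool) -> float:
    scale_bits = 16                        # fp16 scale per block
    offset_bits = 16 if has_offset else 0  # _1 variants add an fp16 offset
    return quant_bits + (scale_bits + offset_bits) / BLOCK_SIZE

for name, bits, offset in [
    ("Q4_0", 4, False),
    ("Q4_1", 4, True),
    ("Q5_0", 5, False),
    ("Q8_0", 8, False),
]:
    print(f"{name}: {bits_per_weight(bits, offset):.2f} bits per weight")
```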

This post may contain affiliate links. When you click on a link and purchase a product, we receive a small commission to keep us running. Thanks.

