How I generate unlimited, stunning images for free - A Step-by-step guide!
Image generation is becoming increasingly commonplace, yet most of us still resort to subscription services. Their free quota is quite limited, and by the time we get to the result we actually want, we may already have hit the daily threshold. So why not generate images locally, or on Google Colab, with roughly 50 lines of Python?
Step 0 — GPU Availability
First, let's make sure we have a GPU at our disposal for this work. It could either be a basic 8GB GPU on your desktop or a much larger one on Google Colab.
Anyone with a Gmail account can run a Colab notebook with a GPU these days. To enable it, open Colab, click “New Notebook”, then under Edit → Notebook settings choose the T4 GPU and you are sorted.
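Before moving on, you can quickly confirm that the GPU is actually visible. The snippet below is just a sanity check (PyTorch comes pre-installed on Colab):
import torch

# Print the GPU name if one is visible, otherwise warn that the runtime
# still needs to be switched to a GPU
if torch.cuda.is_available():
    print("GPU found:", torch.cuda.get_device_name(0))
else:
    print("No GPU found - switch the runtime to a GPU before continuing.")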
Step 1 — Install Python packages
We will install transformers, accelerate, bitsandbytes, protobuf, diffusers and huggingface_hub with pip as shown below:
!pip install --upgrade -q transformers accelerate bitsandbytes
!pip install -q git+https://github.com/huggingface/diffusers
!pip install protobuf
# to login to HF ecosystem
!pip install --upgrade huggingface_hub
You don’t need prior experience with these libraries. They are the standard tooling for loading and playing with LLMs and Large Vision Models (LVMs) these days.
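If you want to make sure the installs went through cleanly, a quick (optional) version check like the one below should run without errors:
# Optional sanity check: import the freshly installed libraries and print their versions
import transformers, diffusers, accelerate, bitsandbytes

print("transformers :", transformers.__version__)
print("diffusers    :", diffusers.__version__)
print("accelerate   :", accelerate.__version__)
print("bitsandbytes :", bitsandbytes.__version__)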
Step 2 — Hugging Face Access
We also need access to Hugging Face to load the model. So head over there, navigate to “Access Tokens” in your account settings and generate yourself a new access token. Note that FLUX.1-dev is a gated model, so you may also need to accept its license on the model page before the weights will download. Then log in with the access token using the code below:
from huggingface_hub import login
login(token = "<your token here>")
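If you would rather not paste the token directly into a cell, huggingface_hub also provides an interactive prompt that works well in notebooks:
# Alternative: interactive login, so the token never appears in the notebook itself
from huggingface_hub import notebook_login

notebook_login()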
Step 3 — Handy functions
As we are playing with really large models and have limited memory (RAM and GPU) at our disposal, we will have to clear GPU memory as and when needed. So let's write a small helper function for that:
import torch

def clean_gpu():
    # Release cached blocks and reset PyTorch's memory statistics
    torch.cuda.empty_cache()
    torch.cuda.reset_max_memory_allocated()
    torch.cuda.reset_peak_memory_stats()

clean_gpu()
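If you are curious how much GPU memory is actually in use at any point, you can print PyTorch's counters before and after calling clean_gpu(). This is purely illustrative; the numbers will differ per machine:
# Purely illustrative: inspect current GPU memory usage
print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"Reserved:  {torch.cuda.memory_reserved() / 1e9:.2f} GB")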
Step 4 — Encode the prompt
Time to write our prompt. We also need to encode it with the text encoder using the code below. We will be using the FLUX.1-dev model; more specifically, flux.1-dev-nf4-pkg, a 4-bit quantized variant that can run on devices with relatively little memory, such as 8 GB of RAM or an 8 GB GPU!
from transformers import T5EncoderModel
from diffusers import FluxPipeline, FluxTransformer2DModel

ckpt_4bit = "Resleeve/flux.1-dev-nf4-pkg"
text_encoder_2_4bit = T5EncoderModel.from_pretrained(
    ckpt_4bit,
    subfolder="text_encoder_2",
)

ckpt_id = "black-forest-labs/FLUX.1-dev"
pipeline = FluxPipeline.from_pretrained(
    ckpt_id,
    text_encoder_2=text_encoder_2_4bit,
    transformer=None,
    vae=None,
    torch_dtype=torch.float16,
)
pipeline.enable_model_cpu_offload()

prompt = "a cute model on London bridge photoshoot"
with torch.no_grad():
    print("Encoding prompts.")
    prompt_embeds, pooled_prompt_embeds, text_ids = pipeline.encode_prompt(
        prompt=prompt, prompt_2=None, max_sequence_length=256
    )

pipeline = pipeline.to("cpu")
del pipeline
# now that we have encoded the prompt, we no longer need it taking space on
# the GPU. Just clean it
clean_gpu()
Note the important clean_gpu() step above. We have used the GPU only to encode the prompt, storing the result in prompt_embeds and pooled_prompt_embeds, and have then freed the GPU memory so it is immediately available for the generation step, as we will see next.
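As a quick sanity check, the encoded prompt is now just a pair of tensors, completely independent of the text encoder pipeline we deleted above:
# Optional: confirm the prompt embeddings survived the cleanup
print(prompt_embeds.shape, prompt_embeds.dtype)
print(pooled_prompt_embeds.shape, pooled_prompt_embeds.dtype)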
Step 5 — Define the Image generation pipeline
Let's define the Flux Transformer 2D Model pipeline we will be using for the image generation. Note that this time we set text_encoder=None, text_encoder_2=None, tokenizer=None and tokenizer_2=None, since the prompt is already encoded. Note the last line too, where we call .to("cuda:0") to move everything to the GPU.
transformer_4bit = FluxTransformer2DModel.from_pretrained(ckpt_4bit, subfolder="transformer")
pipeline = FluxPipeline.from_pretrained(
    ckpt_id,
    text_encoder=None,
    text_encoder_2=None,
    tokenizer=None,
    tokenizer_2=None,
    transformer=transformer_4bit,
    torch_dtype=torch.float16,
).to("cuda:0")
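If your GPU still runs out of memory at this point, one option worth trying (instead of the .to("cuda:0") call above, and not part of the original flow) is to let diffusers offload sub-modules to the CPU and only move them to the GPU when needed:
# Alternative for very tight GPUs: replace .to("cuda:0") with CPU offloading,
# which shuttles sub-modules to the GPU only when they are needed
pipeline.enable_model_cpu_offload()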
Step 6 — Generate Images… hurray
We are now just one step away from generating stunning, realistic images. This step is called denoising: the model takes in pure noise (just random jumbled numbers) and gradually converts it into a visually appealing image that follows the input prompt. Here we go,
print("Denoising ...")
height, width = 512,512
pipe = pipeline(
prompt_embeds=prompt_embeds,
pooled_prompt_embeds=pooled_prompt_embeds,
num_inference_steps=50,
guidance_scale=5.5,
height=height,
width=width,
output_type="pil",
)
image_sample = pipe.images[0]
image_sample.save("result.png")
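Diffusion models are stochastic, so every run produces a different image. If you want reproducible results, a small tweak (not in the original run above) is to pass a seeded generator to the same call:
# Hypothetical variation: a fixed seed makes the run reproducible
generator = torch.Generator(device="cuda").manual_seed(42)
seeded = pipeline(
    prompt_embeds=prompt_embeds,
    pooled_prompt_embeds=pooled_prompt_embeds,
    num_inference_steps=50,
    guidance_scale=5.5,
    height=height,
    width=width,
    generator=generator,
    output_type="pil",
)
seeded.images[0].save("result_seeded.png")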
Visualize the image
Let's not forget to visualize the result of our hard work by plotting the generated image.
from PIL import Image
Image.open("result.png")
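If the bare PIL call does not render inline in your environment, matplotlib (pre-installed on Colab) is a handy fallback:
# Fallback: display the image inline with matplotlib
import matplotlib.pyplot as plt
from PIL import Image

plt.imshow(Image.open("result.png"))
plt.axis("off")
plt.show()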
Conclusion
Flux is a state-of-the-art generative model as of today. We are fortunate to have access to a quantized dev version of the model for free. But we are still scratching the surface of what is possible with generative AI these days.
We have treated these libraries as black boxes without really understanding what is going on under the hood. Please leave a comment if you would like me to dive deeper into the working mechanics of any of them: transformers, bitsandbytes, accelerate, etc.
See you in my next...