Maochuan Lu - CS180 Project 5

Part A: Overview

In Project 5 Part A, I experiment with a pretrained diffusion model, implement diffusion sampling loops, and use them for tasks such as inpainting and creating optical illusions.

Downloading Precomputed Text Embeddings

I use seed = 180 and num_inference_steps = 10, 20, and 40 for the pictures below. Increasing num_inference_steps generally enhances the detail and clarity of the generated images; all of them match their prompts, but the level of detail differs. At num_inference_steps = 10, images tend to be rough and less detailed, with features like the hat poorly defined; the man with the hat looks more like an oil painting than a real photo. With 20 steps there is a clear improvement in clarity and detail, and the features in the generated images become more recognizable. At 40 steps, images are at their most refined, with detailed textures, sharper edges, and natural shading, making features like the hat look realistic.

num_inference_steps = 10

Image with 10 Inference Steps

num_inference_steps = 20

Image with 20 Inference Steps

num_inference_steps = 40

Image with 40 Inference Steps

1.1 Implementing the Forward Process

I generate noisy versions of test_im at several noise levels, specified by timesteps t=250, t=500, and t=750. The forward function adds Gaussian noise to the image according to each timestep's cumulative alpha value.
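
A minimal sketch of the forward process, assuming alphas_cumprod is the scheduler's precomputed cumulative-product tensor (the name and call signature here are illustrative):

    import torch

    def forward(im, t, alphas_cumprod):
        # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps,  eps ~ N(0, I)
        alpha_bar = alphas_cumprod[t]
        eps = torch.randn_like(im)
        return alpha_bar.sqrt() * im + (1 - alpha_bar).sqrt() * eps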

1.1

1.2 Classical Denoising

I generate noisy images of the Campanile at different timesteps (t=250, t=500, t=750), then apply Gaussian blur to each in an attempt to remove the noise classically.
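
The classical baseline is just a low-pass filter; a sketch using torchvision (the kernel size and sigma are illustrative, not necessarily the exact values I used):

    import torchvision.transforms.functional as TF

    def blur_denoise(noisy_im, kernel_size=5, sigma=2.0):
        # Classical "denoising": Gaussian blur removes high-frequency noise,
        # but also destroys high-frequency image detail.
        return TF.gaussian_blur(noisy_im, kernel_size=kernel_size, sigma=sigma)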

Original campanile

original
1.2

1.3 One-Step Denoising

In this part, I generate noisy images at t=250, t=500, and t=750. For each noisy image, I estimate the noise using a U-Net model with the prompt embedding "a high quality photo" and perform one-step denoising to approximate the clean image.
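
A sketch of the one-step estimate, where eps_pred is the U-Net's noise prediction; this simply inverts the forward-process equation for x_0:

    def one_step_denoise(x_t, t, eps_pred, alphas_cumprod):
        # x_0 ≈ (x_t - sqrt(1 - alpha_bar_t) * eps_pred) / sqrt(alpha_bar_t)
        alpha_bar = alphas_cumprod[t]
        return (x_t - (1 - alpha_bar).sqrt() * eps_pred) / alpha_bar.sqrt()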

1.3

1.4 Implementing Iterative Denoising

I implement iterative denoising on a noisy version of test_im using a U-Net model with a prompt embedding. I start by adding noise to test_im and then perform iterative denoising over several timesteps. Specifically, each iteration of the iterative_denoise function computes a progressively less noisy version of the image by estimating and removing noise. For each timestep t, I compute alpha and beta values from the cumulative noise schedule using the provided formula, then use the U-Net model to predict the noise in the current image. This predicted noise is subtracted to estimate x0, and a weighted average of this estimate and the current image gives pred_prev_image for the next timestep.
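
A minimal sketch of the loop, assuming a strided, decreasing list of timesteps and an illustrative unet(x, t, embeds) call that returns the predicted noise (the added-variance term is omitted for brevity):

    def iterative_denoise(x, unet, prompt_embeds, timesteps, alphas_cumprod):
        # timesteps is strided and decreasing, e.g. [990, 960, ..., 30, 0]
        for i in range(len(timesteps) - 1):
            t, t_prev = timesteps[i], timesteps[i + 1]
            a_bar, a_bar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
            alpha = a_bar / a_bar_prev            # per-step alpha
            beta = 1 - alpha
            eps = unet(x, t, prompt_embeds)       # predicted noise at step t
            # one-step estimate of the clean image
            x0 = (x - (1 - a_bar).sqrt() * eps) / a_bar.sqrt()
            # weighted average of the clean estimate and the current image
            x = (a_bar_prev.sqrt() * beta / (1 - a_bar)) * x0 \
                + (alpha.sqrt() * (1 - a_bar_prev) / (1 - a_bar)) * x
        return x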

1.4

1.5 Diffusion Model Sampling

I generate five random noisy images and iteratively denoise each one using a U-Net model with the text prompt "a high quality photo". For each noisy image, I use the iterative_denoise function to progressively reduce noise over a series of timesteps, producing a clean sample from scratch.
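
Sampling is then just denoising pure noise; a usage snippet assuming the names from the sketch above (shapes are illustrative):

    import torch

    noise = torch.randn(5, 3, 64, 64)   # five random noise images
    samples = [iterative_denoise(noise[i:i+1], unet, prompt_embeds,
                                 timesteps, alphas_cumprod)
               for i in range(5)]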

1.5

1.6 Classifier Free Guidance

The image quality in the previous part is not very good, so I use classifier-free guidance (CFG) to improve it. The procedure is very similar to iterative denoising: I denoise five randomly generated noisy images, refining each toward a cleaner version guided by a specific prompt. I first define conditional and unconditional embeddings for CFG. Then, at each timestep, I combine the noise estimates from the conditional and unconditional models using the CFG formula, which steers the denoising process toward the desired prompt ("a high quality photo").
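
The CFG combination itself is one line; a sketch, where gamma = 7.0 is an illustrative guidance scale:

    def cfg_noise_estimate(unet, x, t, cond_embeds, uncond_embeds, gamma=7.0):
        # Classifier-free guidance: extrapolate from the unconditional estimate
        # toward the conditional one; gamma > 1 strengthens the prompt.
        eps_uncond = unet(x, t, uncond_embeds)
        eps_cond = unet(x, t, cond_embeds)
        return eps_uncond + gamma * (eps_cond - eps_uncond)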

1.6

1.7 Image-to-image Translation

I use SDEdit with CFG to progressively refine a noisy image toward a clean version, guided by the prompt "a high quality photo". I first add noise to the original image at different starting indices (i_start = 1, 3, 5, 7, 10, and 20) to create versions with varying amounts of noise. For each noisy image, I use the iterative_denoise_cfg function, conditioned on the prompt embedding, to gradually remove the noise and reconstruct a cleaner image.
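
A sketch of the SDEdit wrapper, reusing the forward and CFG-denoising functions sketched earlier (signatures are illustrative):

    def sdedit(im, i_start, unet, prompt_embeds, timesteps, alphas_cumprod):
        # Noise the input up to timesteps[i_start], then denoise back down with CFG;
        # a larger i_start adds less noise and so keeps more of the original image.
        x = forward(im, timesteps[i_start], alphas_cumprod)
        return iterative_denoise_cfg(x, unet, prompt_embeds,
                                     timesteps[i_start:], alphas_cumprod)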

Campanile

1.71

California Hall

1.72

VLSB

1.73

1.7.1 Editing Hand-Drawn and Web Images

In this part, I use one image from the web and two hand-drawn images. Using the provided tools, I apply SDEdit with CFG to denoise progressively noisier versions of each web or hand-drawn image using the prompt "a high quality photo". For each starting noise level (i_start = 1, 3, 5, 7, 10, and 20), I add noise to the image, then use iterative_denoise_cfg to gradually remove the noise and produce a clean image.

Original Web Image

1.711
1.712

Original Hand-Drawn Image

1.713
1.714

Original Hand-Drawn Image

1.715
1.716

1.7.2 Inpainting

In this part, I perform inpainting on three images by filling in masked regions with a guided denoising process. The inpaint function starts from a noisy image generated from the input image and applies CFG to denoise it over multiple timesteps, using the prompt "a high quality photo". At each timestep, just as before, I compute noise estimates from the conditional and unconditional U-Net models and combine them with CFG to refine the noise estimate. Then, using the provided formula, the updated image is retained inside the masked region, while outside the mask the original image is preserved.
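
The per-step mask update, as a sketch (mask is 1 in the region to inpaint; orig_im is the clean input):

    # inside the denoising loop, after computing the CFG-updated image x at step t:
    # keep the generated content where mask == 1, and force everything else back
    # to a re-noised copy of the original so the unmasked region is preserved
    x = mask * x + (1 - mask) * forward(orig_im, t, alphas_cumprod)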

Campanile

1.7222
1.721

Space Needle

1.7222
1.723

Lake

1.724
1.725

1.7.3 Text-Conditional Image-to-image Translation

In this part, I denoise a noisy image using the prompt "a rocket ship" and CFG. For each noise level (i_start = 1, 3, 5, 7, 10, and 20), I add noise to the input image, then apply iterative_denoise_cfg to refine it toward the prompt.

Campanile

campanile
1.731

Space Needle

needle
1.732

Hoover Tower

stf
1.733

1.8 Visual Anagrams

In this part, I create a "flip illusion": an image that depicts one thing upright and another when flipped upside down. The make_flip_illusion function denoises the image over a series of timesteps using two different prompt embeddings, one for the upright image and one for its flipped version. For each timestep, it applies CFG as before to obtain a noise estimate for the upright image under the first prompt and for the flipped image under the second prompt, flipping that estimate back afterward. It then uses the provided formula to average the two noise estimates. The rest proceeds as before: the image is gradually refined under the combined effect of the two prompts, producing the flip illusion in the final denoised image.
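
A sketch of the combined noise estimate (the bare unet call is illustrative; in practice each estimate also goes through CFG as above):

    import torch

    def flip_illusion_noise(unet, x, t, embeds_a, embeds_b):
        # eps_a steers the upright image toward prompt A; eps_b steers the
        # flipped image toward prompt B and is flipped back before averaging.
        eps_a = unet(x, t, embeds_a)
        eps_b = torch.flip(unet(torch.flip(x, dims=[-2]), t, embeds_b), dims=[-2])
        return (eps_a + eps_b) / 2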

1.8.1
1.8.2
1.8.3
1.8.4
1.8.5
1.8.6

1.9 Hybrid Images

This is essentially the same procedure as 1.8, except that instead of working with a flipped copy of the image we use a second prompt, and the formula from 1.8 is replaced by one that combines a low-pass of one noise estimate with a high-pass of the other. To get a promising result, I sometimes needed to run the sampler several times.
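
A sketch of the modified noise estimate, using a Gaussian blur as the low-pass filter (kernel_size = 33 and sigma = 2.0 are illustrative values):

    import torchvision.transforms.functional as TF

    def hybrid_noise(unet, x, t, embeds_low, embeds_high, kernel_size=33, sigma=2.0):
        # low frequencies follow one prompt, high frequencies follow the other:
        # eps = lowpass(eps_1) + highpass(eps_2)
        eps1 = unet(x, t, embeds_low)
        eps2 = unet(x, t, embeds_high)
        low = TF.gaussian_blur(eps1, kernel_size=kernel_size, sigma=sigma)
        high = eps2 - TF.gaussian_blur(eps2, kernel_size=kernel_size, sigma=sigma)
        return low + high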

1.9.1
1.9.2
1.9.3

Part B: Overview

In Project 5 Part B, I train my own diffusion models on MNIST using several strategies.

Part 1: Training a Single-Step Denoising UNet

Part 1.1: Implementing the UNet

In this part, I implement Conv, DownConv, UpConv, Flatten, Unflatten, ConvBlock, DownBlock, and UpBlock in the Colab notebook, so there is no deliverable here.
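
As an example of the style of these blocks, a sketch of what a ConvBlock along these lines looks like (the layer choices here are assumptions, not necessarily my exact implementation):

    import torch.nn as nn

    class ConvBlock(nn.Module):
        # two Conv -> BatchNorm -> GELU stages; channels change on the first conv
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.GELU(),
                nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.GELU(),
            )

        def forward(self, x):
            return self.net(x)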

Part 1.2: Using the UNet to Train a Denoiser

I show the effect of adding Gaussian noise with varying standard deviations to MNIST images, using 5 examples and 7 noise levels σ = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0].

1.1b
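
The noising operation used for the figure above is a one-liner; a sketch, assuming images are tensors in [0, 1]:

    import torch

    def add_noise(x, sigma):
        # z = x + sigma * eps, eps ~ N(0, I)
        return x + sigma * torch.randn_like(x)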

Part 1.2.1 Training

I train an unconditional U-Net model on the MNIST dataset. The model is trained to reconstruct clean images from noisy inputs created by adding Gaussian noise. Here is the training loss curve.

1.4b

I also display sample results after the 1st and 5th epoch.

1.2b
1.3b
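
A minimal sketch of the training loop described above; sigma = 0.5 and the Adam settings are illustrative:

    import torch
    import torch.nn.functional as F

    def train_denoiser(unet, loader, sigma=0.5, epochs=5, lr=1e-4, device="cuda"):
        # L2 denoising objective: predict the clean image from its noisy version
        opt = torch.optim.Adam(unet.parameters(), lr=lr)
        for _ in range(epochs):
            for x, _ in loader:                  # labels are unused here
                x = x.to(device)
                noisy = x + sigma * torch.randn_like(x)
                loss = F.mse_loss(unet(noisy), x)
                opt.zero_grad()
                loss.backward()
                opt.step()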

Part 1.2.2 Out-of-Distribution Testing

I evaluate my trained U-Net denoiser by adding Gaussian noise with varying levels of 𝜎 to the MNIST test set. Here is the result:

1.5b

Part 2: Training a Diffusion Model

Part 2.1 Adding Time Conditioning to UNet

I add a time embedding to the previous architecture, following the instructions on the project website.
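
A sketch of the conditioning block and where it enters the network; the FCBlock name follows the spec, but the injection points shown are illustrative:

    import torch.nn as nn

    class FCBlock(nn.Module):
        # maps a normalized timestep t in [0, 1] to a feature vector that is
        # added (broadcast) onto intermediate U-Net activations
        def __init__(self, in_dim, out_dim):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(in_dim, out_dim),
                nn.GELU(),
                nn.Linear(out_dim, out_dim),
            )

        def forward(self, t):
            return self.net(t)

    # inside the U-Net forward pass (illustrative):
    #   unflat = self.unflatten(bottleneck) + self.t_embed1(t)[:, :, None, None]
    #   up1 = self.up1(...) + self.t_embed2(t)[:, :, None, None]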

Part 2.2 Training the UNet

I first implement ddpm_schedule and ddpm_forward in the Colab notebook. Then I train the model using the hyperparameters from the project website. Here is the training curve.

1.7b
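
A minimal sketch of the two functions, assuming the standard linear beta schedule (beta1 = 1e-4, beta2 = 0.02, and T = 300 are illustrative defaults):

    import torch
    import torch.nn.functional as F

    def ddpm_schedule(beta1=1e-4, beta2=0.02, T=300):
        # linear beta schedule plus the cumulative products the forward process needs
        betas = torch.linspace(beta1, beta2, T + 1)
        alphas = 1 - betas
        alpha_bars = torch.cumprod(alphas, dim=0)
        return {"betas": betas, "alphas": alphas, "alpha_bars": alpha_bars}

    def ddpm_forward(unet, x0, schedule, T=300):
        # pick a random timestep per image, noise it, train the U-Net to predict eps
        t = torch.randint(1, T + 1, (x0.shape[0],), device=x0.device)
        a_bar = schedule["alpha_bars"].to(x0.device)[t][:, None, None, None]
        eps = torch.randn_like(x0)
        x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps
        return F.mse_loss(unet(x_t, t.float() / T), eps)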

Part 2.3 Sampling from the UNet

I then implement ddpm_sample; here is the sampling result after 5 epochs.

1.6b

Here is the sample result after 20 epochs.

1.8b
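
For reference, a sketch of the ancestral sampling loop behind both results (the shape, device, and unet(x, t) call are illustrative):

    import torch

    @torch.no_grad()
    def ddpm_sample(unet, schedule, T=300, shape=(40, 1, 28, 28), device="cuda"):
        # start from pure noise and step t = T..1 with the standard DDPM update
        x = torch.randn(shape, device=device)
        betas = schedule["betas"].to(device)
        alphas = schedule["alphas"].to(device)
        alpha_bars = schedule["alpha_bars"].to(device)
        for t in range(T, 0, -1):
            z = torch.randn_like(x) if t > 1 else torch.zeros_like(x)
            t_in = torch.full((shape[0],), t / T, device=device)
            eps = unet(x, t_in)
            x = (x - (1 - alphas[t]) / (1 - alpha_bars[t]).sqrt() * eps) \
                / alphas[t].sqrt() + betas[t].sqrt() * z
        return x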

Part 2.4 Adding Class-Conditioning to UNet

Building on the time-conditioned UNet, I add a class embedding following the instructions on the project website. I train this UNet with learning rate 1e-4, batch size 128, and hidden_dim = 128. Here is the training curve.

1.10b
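
A sketch of the class-conditioning input: the label is one-hot encoded and randomly dropped to the null class so the model also learns an unconditional estimate for CFG at sampling time (p_uncond = 0.1 is an assumption):

    import torch
    import torch.nn.functional as F

    def class_condition(c, num_classes=10, p_uncond=0.1):
        # one-hot labels, zeroed out with probability p_uncond for CFG training
        c_onehot = F.one_hot(c, num_classes).float()
        drop = torch.rand(c.shape[0], device=c.device) < p_uncond
        c_onehot[drop] = 0.0
        return c_onehot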

Part 2.5 Sampling from the Class-Conditioned UNet

I then implement ddpm_sample for the class-conditioned UNet. Here is the sampling result after 5 epochs.

1.9b

Here is the sample result after 20 epochs.

1.11b