# A Simple Creature Learning Generative Diffusion Model

Picture this: a creature exists that can only perceive images made up of a grand total of two pixels, each in one of sixteen glorious shades of gray. Despite what you might think, this creature is sophisticated. Not all two-pixel masterpieces are created equal in its eyes. It’s like us humans, really — certain combinations of those two pixels, each with its possible sixteen states, are like fine art to them. (We humans, on the other hand, usually need a gazillion more pixels to be impressed, but who’s counting?)

The creature has a unique hobby — it loves to take a rough, noisy image and tweak it bit by bit, transforming it into a stunning picture. Why? Well, it’s all about that social media clout! This creature, with a keen eye for aesthetics (and perhaps a hint of entrepreneurial spirit), offers to spruce up images for a modest fee on every social media platform it graces.

So it asks humans for help.

As mundane humans, we perceive a two-pixel image with 16 gray scales as merely two integers ranging from 0 to 15. For instance, an image might be represented simply as (0, 15) or (12, 13). Some of these combinations the creature deems appealing, while others it dismisses as noise, or simply bad.

While humans may be bland, they possess an almighty power: they happen to be able to discern which of the 16 x 16 = 256 possible combinations are 'good' pictures for the creature. So, how can we, in our humanly wisdom, help the creature generate good pictures starting from any noisy image?

# Connect the dots or connect the noise?

We need some connection from the good pictures to the noisy pictures. How do we do that?

We can gradually add noise to a known good picture, distorting it and recording the distortion process to generate noisy pictures. This process can be quite simple for humans.

For any image represented by (x, y), we throw a four-sided die labeled "V", "W", "X", and "Y". If we roll an "X", we generate an image (x+1, y) or (x-1, y); if we roll a "Y", we create a new image (x, y+1) or (x, y-1), regardless of whether the new image is noisy. If we roll a "V" or "W", we do nothing and wait for the next opportunity to generate another noisy picture. (Both x and y take values from 0 to 15. When x or y becomes bigger than 15, we wrap it around to zero; when it becomes less than zero, we wrap it to 15.)
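The die-rolling step above can be written down directly. Here is a minimal Python sketch (the function name `noise_step` is my own choice, not from the notebook):

```python
import random

def noise_step(x, y):
    """One forward-distortion step on a two-pixel image (x, y).

    Roll a four-sided die labeled V, W, X, Y:
    - "X": move x by +1 or -1 (chosen at random)
    - "Y": move y by +1 or -1
    - "V" or "W": leave the image unchanged
    Values wrap around: 16 becomes 0 and -1 becomes 15.
    """
    roll = random.choice("VWXY")
    if roll == "X":
        x = (x + random.choice((-1, 1))) % 16
    elif roll == "Y":
        y = (y + random.choice((-1, 1))) % 16
    return x, y

# distort a "good" image step by step
img = (12, 13)
for _ in range(10):
    img = noise_step(*img)
```

The `% 16` does the wrap-around in one stroke, so the two-pixel images live on a small torus rather than falling off the edges.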

After several such distortion steps, it is conceivable that we will end up with many noisy images. However, since humans are intelligent and have access to computers, they can record every step from one image to the next. Now humans can play a trick: given that the noisy two-pixel images are generated from good images, if humans can use a computer to reverse time, maybe they can help the creature generate the good images.

Here are some pictures showing how humans can do it.

Since there are only 16 x 16 = 256 possible images, humans can draw a 2D heat map to visualize the whole probability distribution over images. At time zero, we start with only good images, so the distribution is sharp: only the squares representing the good images have non-zero probability. By taking the good images as starting points and repeating the random distortion process, we obtain the probability distribution at every subsequent time step. The distribution becomes more and more blurred as noisy two-pixel images accumulate.
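These heat maps can also be computed exactly, rather than by sampling, by pushing the whole 16 x 16 probability grid through one distortion step at a time. A minimal numpy sketch, where the two "good" images at (3, 4) and (12, 13) are made-up examples:

```python
import numpy as np

def evolve(p):
    """Apply one distortion step to a 16x16 probability grid.

    Rolling V or W (total prob 1/2) keeps the image; rolling X or Y
    and then choosing +1 or -1 (prob 1/8 for each of the four moves)
    shifts one coordinate, with wrap-around handled by np.roll.
    """
    return (0.5 * p
            + 0.125 * (np.roll(p, 1, axis=0) + np.roll(p, -1, axis=0))
            + 0.125 * (np.roll(p, 1, axis=1) + np.roll(p, -1, axis=1)))

# t = 0: a sharp distribution concentrated on the "good" images
p = np.zeros((16, 16))
p[3, 4] = p[12, 13] = 0.5
for t in range(20):
    p = evolve(p)  # the heat map blurs a little more each step
```

Each `p` along the way is one frame of the heat-map animation: total probability stays at 1 while the peaks spread out toward the uniform distribution.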

In the process, we record how many times we see a transition (x, y) -> (x', y') from t to t+1. With that, we can count, for each (x', y') at time t+1, how often it was generated from each of the five possibilities (x=x'-1, y=y'), (x=x'+1, y=y'), (x=x', y=y'-1), (x=x', y=y'+1), and (x=x', y=y'). This lets us reverse the distortion (~diffusion) process and "denoise" the images, namely, move an image toward the better images of the previous time step.

Starting from an arbitrary two-pixel image (x', y') at time t, we can iteratively move it backward in time to find a path from (x', y') to some good image. We can represent the backward-in-time probability flow at each (x', y') as a vector, to see where the "two-pixel" images are moving. (In mathematical jargon, the equation that governs how such "particles", the two-pixel images, move is called the Langevin equation.)

# Learning to move backward in time

By tracing the probabilistic flow backward, we can reconstruct the "good" images starting from noisy ones. Meanwhile, since there are only 16 x 16 = 256 possible images in total, we can use a simple multilayer perceptron to learn how to recover the initial probability distribution at t=0 starting from a uniform distribution.

# Images that appeal to humans

For two-pixel images, this is easy enough. Not only can we reconstruct good images from noisy ones just by recording the diffusion simulation process, but we can also learn to reconstruct the whole probability distribution. We can please the simple creature that is only interested in two-pixel images.

For humans, two-pixel images are not interesting, and that is a problem for our naïve approach. Suppose we are interested in the MNIST dataset (https://en.wikipedia.org/wiki/MNIST_database): 70,000 images of 28 x 28 pixels. Even if we only consider binary pixels (on and off), recording the "diffusion process" as we did for the two-pixel images would require on the order of 2^(28*28) = 1.017 x 10²³⁶ numbers. That is infeasible.
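That count is easy to check; Python's arbitrary-precision integers give the exact number:

```python
# number of distinct binary 28x28 images, i.e. the table size the
# record-everything approach would need for MNIST-sized pictures
n_states = 2 ** (28 * 28)
print(f"{n_states:.3e}")  # about 1.017e+236
```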

Meanwhile, the images that we know to be "good" occupy only a very tiny portion of the possible space. If we have a smarter way to generate the reverse probability flow without tracing all possible images, the strategy of diffusing and re-tracing to generate good images may still be feasible. In a seminal paper published in 2015 (http://proceedings.mlr.press/v37/sohl-dickstein15.pdf), the authors demonstrated that using an artificial neural network to approximate the reverse probability flow/state transitions can yield good results. With internet-scale image collections and large-scale model training, such image generation for human consumption has become routine through commercial services like Midjourney and OpenAI's DALL-E.

# What can one learn from such a two-pixel image toy model? (and a summary for you from ChatGPT)

I studied physics in graduate school. One way we learned physics was by reducing complicated phenomena to simpler ones and generalizing from there. In theoretical physics, we construct toy models and study them, gaining an understanding of the details and trying to use such models to explain observed phenomena or reconcile them with experimental data. This piece comes from a simple exercise of mine: doing some hands-on work to understand the diffusion model beyond just reading papers, without the need for larger-scale computation. I personally learned quite a bit from the process, and I hope it helps you too. Meanwhile, I asked ChatGPT the question, "What can one learn from such a two-pixel image toy model?" Here's what it said:

The two-pixel image model, as simple as it seems, offers profound insights into the workings of more complex generative models. This tiny universe of 256 possible images serves as a microcosm, illustrating the principles of generative diffusion models in a comprehensible way. Here’s what we can learn from this:

- **Understanding Diffusion Processes**: The process of adding noise and then reversing it, as seen with our two-pixel creature, mirrors the broader concept of diffusion models in image generation. This small-scale example helps demystify the larger, more complex processes.
- **The Role of Probability Distributions**: Observing how probabilities blur and then refocus during the reverse process gives an intuitive grasp of how generative models handle probability distributions to create coherent images from noise.
- **Simplification of Complex Concepts**: The two-pixel scenario simplifies the understanding of algorithms like Langevin dynamics, which are fundamental in explaining how generative models evolve images from one state to another.
- **Feasibility and Scalability Issues**: While the two-pixel model is manageable, scaling this to something like the MNIST dataset highlights the exponential growth in complexity. This underscores the challenges faced in real-world generative modeling and the necessity of efficient algorithms and powerful computing resources.
- **Innovations in Neural Networks**: The model hints at the potential of neural networks in approximating complex probabilistic flows. The success of models in approximating these flows for larger datasets, as mentioned in the seminal 2015 paper, showcases the versatility and power of neural networks in image generation.
- **Applications in Image Generation Services**: The principles illustrated by the two-pixel model are directly applicable to services like Midjourney and OpenAI's DALL-E. These services use similar concepts, albeit at a much larger scale, to generate high-quality images from textual descriptions.
- **Educational Value**: Lastly, this model serves as an excellent educational tool. It breaks down high-level concepts into a format that is easier to visualize and understand, making it a valuable resource for teaching the fundamentals of generative models.

In conclusion, while the two-pixel image model is a simplified representation, it provides valuable insights into the workings of more complex image generation processes. It’s a testament to the fact that sometimes, the simplest models can offer the most profound understanding of complex systems.

Not bad! (And this may save you some time running this short piece of writing through ChatGPT yourself. 😀)

The accompanying notebook for these examples can be found at https://gist.github.com/cschin/81151e755319714e7bd61580ea1a20ea