Sohl-Dickstein used the principles of diffusion to develop an algorithm for generative modeling. The idea is simple: the algorithm first turns the complex images in the training data into simple noise, much as a drop of ink spreads into diffuse blue water, and then teaches the system how to reverse the process, turning noise back into images.
Here’s how it works: First, the algorithm takes an image from the training set. As before, assume that each of the image’s million pixels has some value, so the image can be represented as a point in million-dimensional space. At each time step, the algorithm adds a little noise to every pixel, which is equivalent to letting the ink diffuse for one more small time step. As this process continues, the pixel values bear less and less relation to their values in the original image, and they look more and more like a simple distribution of noise. (At each time step, the algorithm also nudges each pixel value slightly toward the origin, the zero value on all of those axes. This nudge keeps the pixel values from growing too large for computers to work with easily.)
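To make that forward process concrete, here is a minimal sketch in Python (using PyTorch). The step size `beta`, the number of steps, and the specific update rule (scale by the square root of 1 − beta, then add scaled Gaussian noise) are illustrative assumptions, not the exact choices from the original paper.

```python
import torch

def forward_diffusion(x0, num_steps=1000, beta=0.02):
    """Gradually turn an image into noise, one small step at a time.

    x0: a flattened image, one value per pixel.
    At each step the values are shrunk slightly toward the origin and a
    little Gaussian noise is added -- the analogue of letting the ink
    diffuse for one more instant.
    """
    x = x0.clone()
    for _ in range(num_steps):
        noise = torch.randn_like(x)                      # fresh Gaussian noise
        x = (1 - beta) ** 0.5 * x + beta ** 0.5 * noise  # shrink, then add noise
    return x  # after many steps, x is close to a sample of pure noise

# Example: a "million-pixel image" drawn at random, purely for illustration.
image = torch.rand(1_000_000)
noisy = forward_diffusion(image)
```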
Do this for every image in the data set, and an initial, complex distribution of points in million-dimensional space (which cannot be described and sampled from easily) turns into a simple, normal distribution of points around the origin.
“The sequence of transformations very slowly turns your data distribution into just a big noise ball,” Sohl-Dickstein said. This “forward process” leaves you with a distribution you can sample from with ease.
Next comes the machine learning part: Feed the neural network the noisy images produced by the forward process and train it to predict the slightly less noisy images that came one step earlier. It will make mistakes at first, so you tweak the network’s parameters to make it do better. Eventually, the neural network can reliably turn a noisy image, which represents a sample from the simple distribution, into an image representing a sample from the complex distribution.
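As a rough illustration of that training loop, here is a sketch in PyTorch with toy sizes and hypothetical names (`denoiser`, `noise_step`, the placeholder `data_loader`). It follows the simple recipe described above, predicting the image one step earlier, rather than the exact objective used in practice.

```python
import torch
import torch.nn as nn

# Toy stand-ins: real models use far larger images and networks.
NUM_PIXELS, NUM_STEPS, BETA = 1000, 50, 0.02
data_loader = [torch.rand(64, NUM_PIXELS) for _ in range(10)]  # placeholder training images

# The denoising network: given a noisy image, guess the slightly less noisy one.
denoiser = nn.Sequential(nn.Linear(NUM_PIXELS, NUM_PIXELS), nn.ReLU(),
                         nn.Linear(NUM_PIXELS, NUM_PIXELS))
optimizer = torch.optim.Adam(denoiser.parameters(), lr=1e-4)

def noise_step(x):
    """One forward-diffusion step: shrink toward the origin and add noise."""
    return (1 - BETA) ** 0.5 * x + BETA ** 0.5 * torch.randn_like(x)

for images in data_loader:
    t = torch.randint(1, NUM_STEPS + 1, (1,)).item()  # random depth in the noising chain
    x_prev = images
    for _ in range(t - 1):                    # noise the images for t - 1 steps...
        x_prev = noise_step(x_prev)
    x_t = noise_step(x_prev)                  # ...and one more step to get the noisier version

    prediction = denoiser(x_t)                # the network's guess at the less noisy image
    loss = ((prediction - x_prev) ** 2).mean()  # how wrong was the guess?

    optimizer.zero_grad()
    loss.backward()                           # tweak the network's parameters...
    optimizer.step()                          # ...so it does a little better next time
```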
The trained network is a complete generative model. Now you don’t even need an original image to run the forward process: you have a full mathematical description of the simple distribution, so you can sample from it directly. The neural network can turn this sample, which is essentially just static, into a final image that resembles the images in the training data set.
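Generation is then just the reverse walk, starting from pure noise. A minimal sketch, reusing a hypothetical `denoiser` like the one in the training sketch above:

```python
import torch
import torch.nn as nn

NUM_PIXELS, NUM_STEPS = 1000, 50

# Stand-in for the trained denoising network from the previous sketch.
denoiser = nn.Sequential(nn.Linear(NUM_PIXELS, NUM_PIXELS), nn.ReLU(),
                         nn.Linear(NUM_PIXELS, NUM_PIXELS))

@torch.no_grad()
def generate(denoiser):
    """Turn a sample of pure static into an image, one denoising step at a time."""
    x = torch.randn(NUM_PIXELS)   # sample the simple distribution directly
    for _ in range(NUM_STEPS):
        x = denoiser(x)           # each pass strips away a little of the noise
    return x                      # an image resembling the training data

sample = generate(denoiser)
```

Practical samplers also inject a small amount of fresh noise at every reverse step rather than applying the network alone; this stripped-down version only shows the overall shape of the procedure.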
Sohl-Dickstein recalled the first results produced by his diffusion model. “You would squint and say, ‘I think that colored blob looks like a truck,’” he said. “I’d spent so many months of my life looking at different patterns of pixels and trying to see structure that I thought, ‘This is a lot more structured than anything I’d gotten before.’ I was very excited.”
Seeing into the future
Sohl-Dickstein published his diffusion model algorithm in 2015, but it still fell far short of what GANs could do. While diffusion models could sample from the entire distribution and never got stuck spitting out only a subset of images, the images looked worse and the process was much too slow. “I don’t think it was considered exciting at the time,” Sohl-Dickstein said.
It took two students, neither of whom knew Sohl-Dickstein or each other, to connect the dots from this initial work to modern diffusion models such as DALL·E 2. The first was Song, then a graduate student at Stanford University. In 2019 he and his adviser published a novel method for building generative models that did not estimate the probability distribution of the data (the high-dimensional surface). Instead, it estimated the gradient of the distribution (think of it as the slope of that high-dimensional surface).
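One standard way to estimate such a gradient, often called the score, is denoising score matching: perturb each data point with a little noise and train a network whose output points from the perturbed point back toward the clean data. The sketch below, with toy sizes and hypothetical names, illustrates that general idea; it is not Song’s exact formulation.

```python
import torch
import torch.nn as nn

NUM_PIXELS = 1000
SIGMA = 0.1  # illustrative noise level
data_loader = [torch.rand(64, NUM_PIXELS) for _ in range(10)]  # placeholder data

# A toy network whose output is interpreted as the gradient (score) of the
# data distribution at the input point: the direction of steepest increase
# in probability.
score_net = nn.Sequential(nn.Linear(NUM_PIXELS, NUM_PIXELS), nn.ReLU(),
                          nn.Linear(NUM_PIXELS, NUM_PIXELS))
optimizer = torch.optim.Adam(score_net.parameters(), lr=1e-4)

for x in data_loader:
    noise = torch.randn_like(x) * SIGMA
    x_noisy = x + noise
    # Denoising score matching: the ideal score at x_noisy points back
    # toward the clean data point, i.e. -noise / SIGMA**2.
    target = -noise / SIGMA**2
    loss = ((score_net(x_noisy) - target) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Once the score is estimated, new samples can be drawn by repeatedly nudging a random starting point along that gradient while adding a bit of noise, a procedure known as Langevin dynamics.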