* Some of the images above are being generated live in the browser, and are unique to you and only you
Making a machine generate (dream, hallucinate, ...) natural images is a lot of fun, and quite complicated.
About two months ago, we decided to try something that's been at the back of our heads since Ian Goodfellow first presented (the awesome) Generative Adversarial Networks at NIPS last year. Ian's work came close to generating natural-ish images, but still fell short in terms of image quality. The paper excited us, and we discussed it at length at the conference and afterwards.
Adversarial networks are fun things.
You pit a generative (G) machine against a discriminative (D) machine and make them fight. The G machine tries to produce fake data that looks so real that the D machine can't tell it's fake, while the D machine learns not to get fooled by the G machine. After the two have played out their minimax game, you hope to end up with a good-quality generator that can produce as many real-looking samples as you want.
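To make the game concrete, here is a minimal sketch of that minimax training loop. This is illustrative PyTorch on a made-up toy dataset, not the code we actually used (our training code is in Torch, linked at the bottom); all model sizes and hyperparameters here are arbitrary.

```python
# Minimal GAN training loop sketch (illustrative only).
import torch
import torch.nn as nn

noise_dim, data_dim = 16, 2
G = nn.Sequential(nn.Linear(noise_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def real_batch(n=64):
    # Stand-in for real data: noisy points on a circle.
    theta = torch.rand(n, 1) * 6.2832
    return torch.cat([theta.cos(), theta.sin()], dim=1) + 0.05 * torch.randn(n, 2)

for step in range(1000):
    # D step: learn to tell real samples from G's fakes.
    real = real_batch()
    fake = G(torch.randn(real.size(0), noise_dim)).detach()
    d_loss = bce(D(real), torch.ones(real.size(0), 1)) + \
             bce(D(fake), torch.zeros(fake.size(0), 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # G step: produce fakes that D labels as real.
    fake = G(torch.randn(64, noise_dim))
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```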
To keep this section brief: this did not work. It didn't work at all. The adversarial training was unstable, and when we did manage to get a stable model, it did not improve image sample quality.
The key idea that got us going was to train a Laplacian pyramid of networks, with each network predicting the next finer scale. In simpler terms, the network at a particular scale is given a coarsened version of the image and generates the refinement (the missing high-frequency detail) needed to reconstruct the original image at that scale.
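Here is a rough sketch of what one pyramid level works with. Again, this is illustrative PyTorch rather than our training code; the `downsample`/`upsample` helpers below are simple stand-ins for the coarsening and upsampling steps of a Laplacian pyramid.

```python
# Sketch of the quantities involved at one scale of the pyramid (illustrative).
import torch
import torch.nn.functional as F

def downsample(x):                     # coarsen, e.g. 32x32 -> 16x16
    return F.avg_pool2d(x, kernel_size=2)

def upsample(x):                       # back up to the fine resolution
    return F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)

fine = torch.rand(1, 3, 32, 32)        # a training image at this scale
coarse = upsample(downsample(fine))    # its coarsened (blurry) version
residual = fine - coarse               # the high-frequency detail to be generated

# During training, the generator at this scale sees `coarse` (plus noise) and is
# pushed to produce something indistinguishable from `residual`; at sampling
# time the reconstruction is simply coarse + generated_residual.
```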
The multi-scale generation worked extremely well. It got us to a point where we had samples that looked much more real than previous attempts at generating such samples. However, there was one problem.
The images always looked object-like, but had no actual object.
Looking at the image at a glance would totally wow you, but looking closer, you'd be scratching your head. This wasn't going as well as we'd hoped; we needed another advance to get closer to generating objects that actually were objects.
Because our neural networks are convolutional in nature, we can recursively apply a particular generator over and over again to get images as large as we want, although image statistics vary quite a bit across scales, especially in the lower octaves. You can also chain together different generators trained on different datasets: for example, train a generator on a cats dataset and chain it at the next scale with a generator trained on horses. It would be fun to try that.
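At sampling time, that recursive chaining looks roughly like the sketch below. This is purely illustrative: `generators` would be a list of trained per-scale models, and the stand-in "generator" here is just a random conv layer so the snippet runs.

```python
# Chaining per-scale generators to grow an image (illustrative sketch).
import torch
import torch.nn.functional as F

def sample(generators, start_size=8, channels=3):
    # Coarsest level: a tiny image (a stand-in here; the real coarsest image
    # would itself come from a generator fed with noise alone).
    img = torch.rand(1, channels, start_size, start_size)
    for G in generators:
        # Upsample the current image, then let this scale's generator
        # add the missing detail (its predicted residual).
        img = F.interpolate(img, scale_factor=2, mode="bilinear", align_corners=False)
        noise = torch.randn(1, 1, img.size(2), img.size(3))
        img = img + G(torch.cat([img, noise], dim=1))
    return img

dummy_G = torch.nn.Conv2d(4, 3, kernel_size=3, padding=1)  # stand-in for a trained net
big_image = sample([dummy_G, dummy_G, dummy_G])            # 8 -> 16 -> 32 -> 64 pixels
```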
Our last idea, which was a fairly natural extension for enforcing explicit objectness, was to pass the G networks not only random noise as input, but also a one-hot coding of the class label we want to generate.
This idea finally nailed it in terms of the objectness of our generated output. We trained it on the CIFAR-10 dataset to produce cool-looking horses, cars, airplanes, etc.
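The conditioning itself is simple: the one-hot label vector is appended to the noise the generator sees. A tiny illustrative sketch (the sizes are made up, and this only shows the vector form of the idea):

```python
# Class-conditional generator input: noise concatenated with a one-hot label.
import torch
import torch.nn.functional as F

num_classes, noise_dim = 10, 100       # e.g. the ten CIFAR-10 classes
label = torch.tensor([7])              # the class we want, say "horse"
one_hot = F.one_hot(label, num_classes).float()
noise = torch.randn(1, noise_dim)

g_input = torch.cat([noise, one_hot], dim=1)   # shape (1, 110), fed to G
```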
Another simplification we made when training on the LSUN dataset (because it is large enough, and because we were running up against a conference deadline) was to train a separate network for each class. That also produced nice towers, bedrooms, churches, and conference rooms.
One of the easiest things a generative network can do to produce realistic samples is to simply copy the training images. That would not be a true generative model that generalizes to natural images, but one that has overfit heavily to the training set. While training our networks, we meticulously checked that they were not simply copying the training samples. We did this in a couple of ways.
For each of our generated samples, we found its nearest neighbor in the training set using an L2 distance in pixel space. For a large sheet of samples showing the nearest neighbors in pixel space, click here.
For each of our generated samples, we found its nearest neighbor in the training set using an L2 distance in feature space, computed with this well-performing Network-in-Network CIFAR model. For a large sheet of samples showing the nearest neighbors in feature space, click here.
Both experiments comfortably confirmed that our networks are generating images that are not just trivial manipulations of copied training images.
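The check itself is straightforward. A sketch of the pixel-space version is below (numpy, with made-up array shapes); the feature-space version is identical except that images are first passed through the trained CIFAR model.

```python
# Nearest-neighbor overfitting check in pixel space (illustrative sketch).
import numpy as np

def nearest_neighbors(samples, train):
    # samples: (n, d) generated images flattened to vectors
    # train:   (m, d) training images flattened to vectors
    # Returns, for each sample, the index of the closest training image (L2).
    d2 = ((samples[:, None, :] - train[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

samples = np.random.rand(4, 32 * 32 * 3)    # stand-ins for generated images
train = np.random.rand(100, 32 * 32 * 3)    # stand-ins for the training set
print(nearest_neighbors(samples, train))    # pairs to inspect side by side
```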
Drawing multiple samples conditioned on the same initial image lets us check whether the variability in the samples is trivial (copying) or non-trivial. We ran a few such experiments on the LSUN dataset and found that the samples differ substantially from draw to draw: the network changes the building architecture, places and removes windows, adds extra tunnels and hulls (and sometimes extra towers), and creates details that are different every time a sample is drawn.
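Conceptually the experiment is just: fix the conditioning image, redraw the noise, and compare the outputs. A toy sketch (illustrative PyTorch, with a random stand-in generator so it runs):

```python
# Several draws conditioned on the same coarse image; only the noise changes.
import torch

G = torch.nn.Conv2d(4, 3, kernel_size=3, padding=1)   # stand-in generator
coarse = torch.rand(1, 3, 32, 32)                      # one fixed conditioning image

draws = []
for _ in range(4):
    noise = torch.randn(1, 1, 32, 32)                  # fresh noise each draw
    draws.append(coarse + G(torch.cat([coarse, noise], dim=1)))
# If the model were merely copying, these draws would be near-identical;
# non-trivial variability shows up as different details in each one.
```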
Each of our generative and discriminative models was a standard convolutional neural network with between 3 and 6 layers. For the complete architectural details, go to this link for CIFAR-10 and this link for LSUN/ImageNet.
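To give a flavor of what "standard convolutional networks with 3 to 6 layers" means here, a hypothetical sketch is below. The layer counts, channel widths, and input layouts are made up for illustration; the real configurations are in the linked files.

```python
# Hypothetical per-scale networks, for flavor only (not the actual architectures).
import torch.nn as nn

G = nn.Sequential(                      # input: upsampled coarse image + a noise plane
    nn.Conv2d(4, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 3, 3, padding=1),     # output: predicted residual image
)

D = nn.Sequential(                      # input: residual (real or generated) + coarse image
    nn.Conv2d(6, 64, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(), nn.LazyLinear(1),     # single real/fake score
)
```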
There are so many things to explore as follow-up work to this paper. A list of simple ideas would be:
The training code (in Torch) is located here: https://github.com/facebook/eyescream