Image-to-Image Translation: U-Nets and cGANs

Machine learning is everywhere in translation tasks. After all, the Chinese word ‘mao’ and the English word ‘cat’ carry the same meaning, though in different forms, and machine learning is great at learning underlying patterns and representations. This isn’t much different from taking a picture of a cat and then sketching the photo: they are two different representations of the same idea. “Image-to-Image Translation with Conditional Adversarial Networks” [1] by Isola, Zhu, Zhou, and Efros of Berkeley tackles this idea and produces fantastic results. In this post I seek to explain their two major contributions, the conditional generative adversarial network (cGAN) and the patch discriminator, as well as their use of U-Nets [2]. To stop this post from getting too long, I assume some prior knowledge of deep learning. If you’d like me to expand on anything, feel free to contact me and ask for a blog post on it.


A GAN is made up of two parts, a generator network and a discriminator network. The generator G takes a noise vector z as input and tries to create an image. The discriminator D then takes an image as its input and outputs the probability of it being a real image. At the beginning of this process, D is not very good at discriminating between real and fake images, and G is not very good at generating realistic images. As D sees more real images and generated images, it becomes better at telling what is real and what is not, and as D improves, G needs to produce better images to trick D. Therefore, over time, G learns to create images similar to the real ones that D has seen during training. This gives us G : z → y, where y is the generated image.
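The tug-of-war described above can be written as two small loss functions. This is a minimal NumPy sketch rather than anything from the paper: the function names are made up for illustration, `d_real` and `d_fake` are assumed to be arrays of probabilities already produced by D, and the generator loss uses the common "non-saturating" form, which is an assumption on my part.

```python
import numpy as np

def discriminator_loss(d_real, d_fake, eps=1e-8):
    """D is rewarded for scoring real images near 1 and generated images near 0."""
    return float(-np.mean(np.log(d_real + eps) + np.log(1.0 - d_fake + eps)))

def generator_loss(d_fake, eps=1e-8):
    """G is rewarded when D scores its generated images near 1."""
    return float(-np.mean(np.log(d_fake + eps)))

# A confident, correct D has a lower loss than an undecided one.
confident = discriminator_loss(np.array([0.9]), np.array([0.1]))
undecided = discriminator_loss(np.array([0.5]), np.array([0.5]))
```

As D gets better (its loss falls), the values in `d_fake` drop, which pushes the generator loss up and forces G to improve.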

cGANs are a little different from this. Rather than starting from noise alone, the generator is given some input x (e.g. a black and white image) along with the noise z, and produces an output image y (which in this case could be a colored image). Thus we have G : {x,z} → y. This is why it is called a conditional GAN: it doesn’t just create realistic images from noise, it creates an image conditioned on some input image. Three examples they use in the paper are grayscale images to color images, sketches to photos, and aerial photos to maps.

When training a cGAN, D is shown a pair of images, either x and the real y, or x and the generated G(x). It is then tasked with learning which pairs are real. In effect, this gives us a powerful learned loss function, one that tests whether the images are visually plausible and whether the output is a likely translation of the input. This is hard to achieve with L1 or L2 loss alone, as they treat every pixel independently and just seek to minimize a pixel-by-pixel error. The discriminator loss looks more at what textures are likely to be in the image and at what position. In the paper, Isola et al. actually use a mix of discriminator loss and L1 loss. In my experiments, adding a weighted L1 term to the loss made it significantly easier for the generator to train. I hope to upload my TensorFlow code to my GitHub within the next week.
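To make the mix of discriminator loss and L1 loss concrete, here is a rough NumPy sketch. It is not the paper's code: `make_pair` and `generator_objective` are hypothetical names, feeding D the pair by concatenating along the channel axis is an assumption, and the default weight `lam=100.0` is only an illustrative choice for the L1 weighting.

```python
import numpy as np

def make_pair(x, y):
    """Stack the input x and an output y along channels so D can judge them as a pair."""
    return np.concatenate([x, y], axis=-1)

def generator_objective(d_fake, y_real, y_fake, lam=100.0, eps=1e-8):
    """Adversarial term plus a weighted pixel-wise L1 term (lam is an assumed weight)."""
    adversarial = -np.mean(np.log(d_fake + eps))  # low when D is fooled
    l1 = np.mean(np.abs(y_real - y_fake))         # low when pixels match the target
    return float(adversarial + lam * l1)

x = np.zeros((1, 4, 4, 1))  # e.g. a grayscale input
y = np.ones((1, 4, 4, 3))   # e.g. its color target
pair = make_pair(x, y)      # what D would be shown: shape (1, 4, 4, 4)
```

The L1 term gives the generator a direct, stable signal about where the output should end up, while the adversarial term pushes the output towards realistic textures.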


For how simple U-Nets are, they are remarkably powerful in image translation. U-Nets are very similar to autoencoders; they start with an input x, reduce the dimensionality over several fully connected or convolutional layers to create a code z, and then decode back to y. However, U-Nets differ from autoencoders in their use of “skip connections”. These skip connections connect the corresponding layers in the encoder and decoder. If we label the layers in the encoder as e1, e2, …, en, and the layers in the decoder as d1, d2, …, dn, then there are connections between ei and d(n-i) for i = 1, …, n-1 (the last encoder layer en feeds straight into d1).

The skip connections in a U-Net allow the network to learn what it wants from each layer in the encoder network. If it is important for the network to know of specific pixel values and simple lines and textures, it can draw from the information gathered in the earlier layers of the encoder. If it is useful to have a higher level of abstraction, such as what it is the image contains, it can draw from the later layers. This is very useful in image translation, as the different translations frequently share the outlines and important details between the inputs and the outputs.

The skip connections in the paper appear to be implemented by concatenating the corresponding layers together. This can be achieved easily in TensorFlow with the following:

tf.concat([e_i, d_j], axis=3)

Here e_i and d_j stand for the outputs of layers ei and d(n-i). We use axis=3 because we concatenate the layers along the channels (or filter number) dimension, since we have tensors of shape [batch, height, width, channels]. (Versions of TensorFlow before 1.0 took the axis as the first argument instead.)
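To see how the shapes line up, here is a toy NumPy sketch of the encoder/decoder pairing. It is only an illustration under simplifying assumptions: `downsample`, `upsample`, and `unet_shapes` are made-up helpers, average pooling and pixel repetition stand in for the learned convolution and deconvolution layers, and nothing is trained.

```python
import numpy as np

def downsample(t):
    """Halve height and width with 2x2 average pooling (stand-in for a strided conv)."""
    b, h, w, c = t.shape
    return t.reshape(b, h // 2, 2, w // 2, 2, c).mean(axis=(2, 4))

def upsample(t):
    """Double height and width by repeating pixels (stand-in for a transposed conv)."""
    return t.repeat(2, axis=1).repeat(2, axis=2)

def unet_shapes(x, n=3):
    encoder = []
    t = x
    for _ in range(n):
        t = downsample(t)
        encoder.append(t)            # e_1 ... e_n
    d = encoder[-1]                  # e_n feeds the decoder directly
    for i in range(n - 1, 0, -1):
        d = upsample(d)              # output of d_(n-i), same height/width as e_i
        d = np.concatenate([encoder[i - 1], d], axis=3)  # skip connection on channels
    return upsample(d)               # d_n restores the input resolution

out = unet_shapes(np.zeros((1, 16, 16, 3)))
```

The channel count grows at every skip connection only because this sketch has no learned filters to mix the concatenated channels back down; in a real network the next layer would do that.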

Patch Discriminator

Generally, people have used discriminators to look at an entire image and decide whether it is a real or a fake image. Here, the authors of the paper break the image into patches, and then ask the discriminator whether each patch is real or fake. Afterwards, they average the results of discriminating each patch to get the overall probability of the image being real. This is incredibly useful, as it makes it much easier for the discriminator to work on images of varying sizes. The discriminator doesn’t need to be designed to fit just one specified dimension for images.

It appears tf.extract_image_patches doesn’t have a gradient defined yet, which means that back-prop won’t work if it is in your computation graph. However, there is a simple workaround using TensorFlow operations which do have gradients defined:

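import numpy as np
import tensorflow as tf

def get_patches(images, patch_dim):
    # images has shape [batch, height, width, channels]
    in_filters = images.get_shape().as_list()[3]
    out_filters = patch_dim**2 * in_filters
    # Identity kernel: each output channel copies one (row, col, channel) position of a patch
    kernel = tf.constant(np.eye(out_filters).reshape(patch_dim, patch_dim, in_filters, out_filters), tf.float32)
    # A stride of patch_dim gives non-overlapping patches
    return tf.nn.conv2d(images, kernel, [1, patch_dim, patch_dim, 1], "VALID")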

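The result of this function is a tensor of shape [batch, rows, cols, patch_dim * patch_dim * channels], where rows and cols are the number of patches along the height and width of the image. It can then be reshaped, with num_patches = rows * cols and patch_height = patch_width = patch_dim: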

tf.reshape(patches, [-1, num_patches, patch_height, patch_width, channels])
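If you want to check what the patch tensor should look like without building a graph, the same non-overlapping patches can be cut out with plain NumPy reshapes. This is a sketch for sanity-checking only, assuming channels-last images; `get_patches_np` is a name I made up, and I believe (but have not verified against every TensorFlow version) that it matches the layout of get_patches followed by the reshape above.

```python
import numpy as np

def get_patches_np(images, patch_dim):
    """Split [batch, height, width, channels] into [batch, num_patches, patch_dim, patch_dim, channels]."""
    b, h, w, c = images.shape
    rows, cols = h // patch_dim, w // patch_dim
    # Drop any leftover border, as "VALID" padding would.
    images = images[:, :rows * patch_dim, :cols * patch_dim, :]
    t = images.reshape(b, rows, patch_dim, cols, patch_dim, c)
    t = t.transpose(0, 1, 3, 2, 4, 5)  # -> [batch, rows, cols, patch_dim, patch_dim, channels]
    return t.reshape(b, rows * cols, patch_dim, patch_dim, c)

image = np.arange(16).reshape(1, 4, 4, 1)  # pixel value = 4 * row + col
patches = get_patches_np(image, 2)
top_left = patches[0, 0, :, :, 0]          # rows 0-1, cols 0-1 of the image
```

Patches are ordered row by row, so patch 0 is the top-left corner and the last patch is the bottom-right corner.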

After the reshape you can run tf.nn.conv3d over the patches. With a filter depth of 1 along the num_patches dimension, this convolves over each patch separately but with the same filters. After several conv3d operations, you can reduce the dimensions to [batch, num_patches, 1, 1, 1], and then take the mean of the probabilities from each patch, and the mean across the batch for the total loss.
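The final averaging step can be sketched in NumPy as follows. The function names are hypothetical, the input is assumed to already hold per-patch probabilities in the [batch, num_patches, 1, 1, 1] layout described above, and the cross-entropy form of the loss is my assumption rather than something taken from the paper.

```python
import numpy as np

def image_probability(patch_probs):
    """Average per-patch probabilities into one probability per image."""
    batch = patch_probs.shape[0]
    return patch_probs.reshape(batch, -1).mean(axis=1)

def real_loss(patch_probs, eps=1e-8):
    """Cross-entropy for images that should be judged real, averaged over the batch."""
    return float(-np.mean(np.log(image_probability(patch_probs) + eps)))

# Two images with four patches each.
probs = np.array([[0.2, 0.4, 0.6, 0.8],
                  [0.9, 0.9, 0.9, 0.9]]).reshape(2, 4, 1, 1, 1)
per_image = image_probability(probs)
```

Because the average runs over however many patches there are, the same code works for images of different sizes.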


While the three ideas presented in this post are incremental improvements on already existing ideas, they are a great step towards designing better and more descriptive loss functions. If you want to implement these ideas yourself, please check out the paper. As I mentioned earlier, I hope to post my code on my GitHub soon. On my GitHub I also have an image sketcher which will run a sketch filter over a large number of images at once, so you could create your own sketch-photo pair dataset with your own photos, or take advantage of the many great datasets out there.

Also, this is my first proper blog post on machine learning, and I would love to hear your feedback if you read it! I’m still building the website and getting used to presenting machine learning ideas. If you have any suggestions, comments, or requests, please let me know. You can contact me at my email address on the “contact” page.

Thank you for reading, and I hope this helped!