Computer Vision: Generative Models and Conditional Image Synthesis
Author: Antonio Nieto García, Data Scientist
Our Computer Vision blog series has already covered an introduction to Computer Vision and some current use cases, the future of Computer Vision, how to create an Object Detection model without any code, and how to train and deploy your own Image Segmentation model. This blog post will explore one of the hottest topics in the Computer Vision field, Generative Models, focusing on one of its most popular branches, Conditional Image Synthesis.
In this blog, we will start by presenting our brand new Fashion Generator demo, and then we will go deeper into the underlying technologies behind it: Conditional Generative Models.
As part of Datatonic Labs, our dedicated R&D hub where we push the boundaries of cloud technologies, data engineering, and AI, we have been able to explore cutting-edge Conditional Image Synthesis models. To showcase the potential of these models, we have developed a demo oriented to the Fashion sector.
Datatonic’s Fashion Generator Demo
Our Fashion Generator model generates new clothing styles: given an input image of people wearing clothes, it removes the original garments and replaces them with newly generated, stylish ones.
The generative part of the demo is based on a Variational Autoencoder model. Having previously explored different alternatives such as Conditional GAN (Generative Adversarial Networks) or Diffusion models, we finally decided to use Variational Autoencoders to enable higher diversity of outputs.
We used data from Kaggle to train our model, as there is a large set of images and clothes masks available that fit our use case. We used Vertex AI Workbench notebooks, taking advantage of its flexibility in terms of available resources, reduced setup time, and scalability functions that allowed us to easily access the GPU resources for the training stage.
What is a Generative Model?
You may have heard about Deep Learning models like Deep Fakes or those able to estimate how your face would look in 25 years. Indeed, deep generative models are behind these impressive use cases, but these models go far beyond purely trivial purposes and offer great possibilities for industries such as game design, cinematography, and content generation, among others.
These models are based on unsupervised learning algorithms capable of approximating complex, high-dimensional probability distributions from data and generating new samples from these underlying distributions. These algorithms may be applied to many types of data, including audio, image, and video data.
In the last five years, there has been huge progress in the field of generative models from both academia and industry. There are two specific projects worth noting: StyleGAN from NVIDIA, presenting a model capable of generating human faces, and the GPT-2 language model from OpenAI, which can generate original text based on an introductory piece of text. However, evaluating the performance of these models has been difficult given the subjective aspect of measuring the quality of the output.
Types of Generative Model
It is quite common to classify generative models into two main groups:
1. Likelihood-based models
Models in this group try to explicitly learn the likelihood distribution (for example, P(X|Y) in the conditional setting) through their loss functions. They are trained to infer a probability distribution that is as similar as possible to the original input data distribution. VAE models fall into this group.
2. Implicit models
Unlike likelihood-based models, implicit models are not explicitly trained to learn the likelihood distribution; instead, they are trained to generate outputs similar to the input data. In the case of GANs, the generator learns to produce images that look real enough to fool the discriminator into accepting them as genuine.
Evaluating Generative Models
As we mentioned previously, evaluating generative models’ performance is challenging. In the case of likelihood-based models, we may use the likelihood values to measure how good a model is, but this does not take into account the output of the models. While outputs can be observed visually, at least for image outputs, we need an empirical metric to objectively measure the model quality and compare it against other models.
Here are some of the most commonly used metrics to evaluate generative models:
1. Kullback-Leibler Divergence (KL Divergence)
The KL Divergence measures how different one probability distribution is from another. Minimising the KL Divergence between the data distribution and the model's distribution is closely related to standard maximum likelihood optimisation: instead of maximising the likelihood, we minimise this divergence.
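As a minimal sketch, here is how the KL Divergence between two discrete distributions could be computed with NumPy; the distributions and their values are purely illustrative placeholders:

```python
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """Compute D_KL(P || Q) for two discrete probability distributions."""
    p = p / p.sum()
    q = q / q.sum()
    # A small epsilon avoids log(0) / division by zero for empty bins.
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Hypothetical example: a "data" distribution vs. a model's distribution.
p_data = np.array([0.1, 0.4, 0.5])
p_model = np.array([0.2, 0.3, 0.5])
print(kl_divergence(p_data, p_model))  # 0.0 only if the two distributions match exactly
```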
2. Inception Score (IS)
This metric evaluates the quality of the generative model output using the InceptionV3 network pre-trained on ImageNet. The IS value combines two aspects: (1) whether the InceptionV3 network can confidently identify a single type of object in each generated image; and (2) the variety of outputs, i.e. whether the generative model can generate a large set of different objects.
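Implementations usually split the generated set into groups and average the score, but as a rough sketch of the core computation (with hypothetical probabilities standing in for real InceptionV3 outputs), it looks like this:

```python
import numpy as np

def inception_score(probs: np.ndarray, eps: float = 1e-12) -> float:
    """probs: (num_images, num_classes) softmax outputs from a classifier such as InceptionV3."""
    # Marginal class distribution p(y) across all generated images.
    p_y = probs.mean(axis=0, keepdims=True)
    # Per-image KL(p(y|x) || p(y)): high when each image is confidently classified
    # and the set of images covers many different classes.
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))

# Hypothetical class probabilities for four generated images over three classes.
probs = np.array([[0.9, 0.05, 0.05],
                  [0.05, 0.9, 0.05],
                  [0.05, 0.05, 0.9],
                  [0.9, 0.05, 0.05]])
print(inception_score(probs))
```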
3. Fréchet Inception Distance (FID)
This metric uses the InceptionV3 network slightly differently from the IS metric. It measures the difference between real and generated images by comparing the responses of the network's penultimate layer when fed real images and when fed generated ones. A lower FID means the generated images are similar to the real ones, so the model is performing well.
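Concretely, FID compares the mean and covariance of those layer activations for real and generated images. A minimal sketch, assuming the activations have already been extracted (the random arrays below are placeholders, not real Inception features):

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_inception_distance(act_real: np.ndarray, act_gen: np.ndarray) -> float:
    """act_*: (num_images, feature_dim) Inception activations for real / generated images."""
    mu_r, mu_g = act_real.mean(axis=0), act_gen.mean(axis=0)
    cov_r = np.cov(act_real, rowvar=False)
    cov_g = np.cov(act_gen, rowvar=False)
    # Matrix square root of the covariance product; discard tiny imaginary parts
    # that can appear due to numerical error.
    cov_mean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(cov_mean):
        cov_mean = cov_mean.real
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(cov_r + cov_g - 2 * cov_mean))

# Hypothetical activations (in practice these come from InceptionV3's pooling layer).
print(frechet_inception_distance(np.random.randn(64, 8), np.random.randn(64, 8)))
```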
It is important to highlight that these metrics have some limitations to keep in mind when using them to evaluate the models:
- Their values can only be compared in the same context (dataset, image size, etc.).
- There may be different implementations for the same metric.
Conditional Image Synthesis
While generative models have been a milestone in the Computer Vision field, these technical advances need to be adapted to real business problems. In many cases, we may want to control the output in some way, whether to explore output variations or to generate content in a specific direction. This is where Conditional Image Synthesis models come in.
This kind of model is given an additional input that influences the output, allowing us to control content generation. This input can be very diverse, from text to segmentation masks. A popular example of these models is DALL-E, an impressive model capable of generating realistic images from text descriptions; in this case, the model output is conditioned through text.
Recent advances in the Conditional Image Synthesis field have been based on the original generative model architectures, such as GAN and VAE, and applying some modifications to allow the model to be given additional input to control the output. Nevertheless, there are new architectures, like Diffusion Models, with a different approach to generating synthetic images.
Conditional GAN
Before we start explaining the Conditional GAN model, it is worth giving a brief introduction to the GAN architecture. This model architecture was first introduced in 2014, and is composed of two sub-models: a generator and a discriminator. The former tries to generate realistic images similar to the ones present in the training data and the latter tries to discriminate between real images and the images generated by the generator. The generator learns from the output of the discriminator and is trained to create images that look ‘real’ to the discriminator.
This is great, but we cannot control the generator output, as it is randomly generated based on the ‘knowledge’ gathered from the training phase. This characteristic differentiates GAN from Conditional GAN, as the latter allows control of the generator output. The Conditional GAN architecture includes an additional control vector that feeds both the generator and discriminator, controlling the model’s behaviour in the provided direction.
This control vector can be in multiple formats such as text labels, images, and segmentation masks, among others. An example of this type of model is GauGAN, by NVIDIA, which takes a segmentation mask as a conditional input.
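To make the idea of a control vector concrete, below is a heavily simplified PyTorch-style sketch of a conditional generator that concatenates a noise vector with a one-hot class label. The layer sizes, label format, and output shape are illustrative assumptions and bear no relation to GauGAN's architecture; the discriminator would receive the same label alongside the image.

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    """Toy conditional generator: noise + class label -> flattened 28x28 image."""
    def __init__(self, noise_dim: int = 64, num_classes: int = 10, img_pixels: int = 28 * 28):
        super().__init__()
        self.num_classes = num_classes
        self.net = nn.Sequential(
            nn.Linear(noise_dim + num_classes, 256),
            nn.ReLU(),
            nn.Linear(256, img_pixels),
            nn.Tanh(),  # pixel values in [-1, 1]
        )

    def forward(self, noise: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # The conditioning signal is simply concatenated to the noise vector.
        cond = nn.functional.one_hot(labels, num_classes=self.num_classes).float()
        return self.net(torch.cat([noise, cond], dim=1))

generator = ConditionalGenerator()
fake = generator(torch.randn(8, 64), torch.randint(0, 10, (8,)))
print(fake.shape)  # torch.Size([8, 784])
```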
The GAN architecture has a lot of potential applications, but it is important to consider which use cases it may perform well on. GANs can generate high-resolution images but are unable to capture the entire data distribution, so they suffer from a lack of diversity. Generally speaking, GAN models are suitable for tasks where the required images have sparse spatial detail (like human faces) or need good textural detail, such as landscapes.
Variational Autoencoders (VAE)
To understand Variational Autoencoders, it helps to first understand what autoencoders are, as they are the foundation upon which VAEs rest. The autoencoder architecture is composed of an encoder, which compresses the input image into a numerical vector where each dimension encodes a feature as a single fixed value, and a decoder, which takes that vector and tries to reconstruct the original image from the information encoded in it.
The encoder is trained to capture the most representative features of every image and compress that information into a numerical vector, while the decoder is trained to reproduce the original image from that vector. For solving real problems, this is fairly limited, as the decoder will always generate the same output for a given encoded vector; there is no margin for diversity.
Variational Autoencoders solve this issue in a very simple way. Instead of compressing the input image into a vector of fixed values, the encoder describes the image attributes in probabilistic terms, producing a probability distribution for each attribute rather than a single number. This representation is commonly called the latent space, and the decoder can randomly sample from each attribute's distribution to infer a new image, providing more diverse outputs.
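A minimal sketch of this idea in PyTorch could look like the following, assuming a simple fully connected encoder over flattened images (not the architecture behind our demo). The key difference from a plain autoencoder is that the encoder outputs a mean and a log-variance per latent dimension, and the latent vector is sampled from that distribution:

```python
import torch
import torch.nn as nn

class VAEEncoder(nn.Module):
    """Maps a flattened image to a sample from a Gaussian latent code."""
    def __init__(self, input_dim: int = 28 * 28, latent_dim: int = 16):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        self.to_mean = nn.Linear(256, latent_dim)
        self.to_log_var = nn.Linear(256, latent_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.hidden(x)
        mean, log_var = self.to_mean(h), self.to_log_var(h)
        # Reparameterisation trick: z = mean + std * noise keeps the sampling step
        # differentiable while making the output vary between draws.
        std = torch.exp(0.5 * log_var)
        z = mean + std * torch.randn_like(std)
        return z  # a decoder would reconstruct the image from z

encoder = VAEEncoder()
z = encoder(torch.rand(4, 28 * 28))
print(z.shape)  # torch.Size([4, 16])
```

At training time, a KL Divergence term in the loss keeps these per-attribute distributions close to a standard Gaussian, which is what makes sampling new, diverse images from the latent space possible.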
In terms of limitations, VAE models are good at estimating the latent distribution and providing more diverse outputs than GAN, but often the results are blurry compared to the high-quality images GAN may provide. So, VAE is usually recommended when image resolution requirements are not so high, and diversity is a plus.
Diffusion Models
The Palette Model is an Image-to-Image model that can perform various tasks, like Inpainting, Image Refinement, Colorization, and more. The Inpainting task is exciting as the examples in the paper show excellent results when given an image with a blanked-out area. The results are images with no blank areas, and the filled area looks natural.
Diffusion models are inspired by nonequilibrium thermodynamics, and the process is explained in the paper that first explored the area (see the Further Reading section below). The model is based on a Markov chain of images derived from an original image: noise is added step by step, to the whole image or to a specific area, until the last image in the chain (or the selected area) is just random noise. The model then learns to reconstruct the image by reversing this process, denoising it step by step. The result is an image without noise and a natural-looking fill.
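To make the forward (noising) half of that chain concrete, here is a minimal NumPy sketch that progressively mixes Gaussian noise into an image under a simple linear schedule. The schedule, step count, and image shape are illustrative assumptions; the learned reverse (denoising) network that Palette actually trains is not shown.

```python
import numpy as np

def forward_diffusion(image: np.ndarray, num_steps: int = 10, beta_max: float = 0.3):
    """Return the Markov chain of progressively noisier versions of `image`."""
    chain = [image]
    x = image.astype(np.float64)
    # Linearly increasing noise schedule (an illustrative choice, not Palette's schedule).
    betas = np.linspace(1e-4, beta_max, num_steps)
    for beta in betas:
        noise = np.random.randn(*x.shape)
        # Each step keeps a fraction of the previous image and adds Gaussian noise,
        # so the last element of the chain approaches pure noise.
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * noise
        chain.append(x.copy())
    return chain

chain = forward_diffusion(np.random.rand(28, 28))
print(len(chain))  # the original image plus `num_steps` noisier versions
```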
The Palette Diffusion Model does have a limitation when it comes to inpainting. We cannot condition the model to generate predefined information. If the blanked-out area on the input image were a dog, the dog would most likely not appear in the resulting image. The diffusion process creates natural-looking photos but with no control of the outcome.
Diffusion models are fascinating when it comes to potential use cases. The Palette model explored four tasks: colourization, inpainting, uncropping, and JPEG restoration. These functions make it an exceptional media editing tool. Taking pictures at tourist attractions, for example, can be a pain when people get in the way of the photo you want. With the inpainting task, the model can remove the people if they are blanked out in the input; the removed information is unlikely to reappear in the output image.
Another use case is restoring the resolution of images with the restoration task. Storing or sending high-resolution media can be data-heavy; lowering the image's resolution could solve this problem if we could restore it on the receiving machine, which is possible with a diffusion model.
Summary
In this blog post, we have looked at Conditional Image Synthesis models and highlighted their importance and potential future impact. The main takeaways we’d like to highlight for anyone interested in using Conditional Image Synthesis are the following:
- Understand the challenges, limitations, and possibilities associated with each of the models discussed.
- Identify which one may fit a particular use case better.
- Understand the fundamentals underlying the latest generative models use cases.
Expectations are high in the Computer Vision field for upcoming advances in this area, as interest, investment, and research around generative models all keep increasing. Progress is fast, and there is still a lot to look forward to from these models.
Further Reading
For more information about some of the topics discussed in this blog, take a look at these resources:
- Multimodal Conditional Image Synthesis: https://arxiv.org/abs/2112.05130
- Multiple latent spaces VAE: https://arxiv.org/pdf/2106.13416
- Vector Quantized Variational Autoencoder (VQ-VAE): https://arxiv.org/pdf/1906.00446v1
- Diffusion Models: https://lilianweng.github.io/posts/diffusion-models/
- Deep Unsupervised Learning using Nonequilibrium Thermodynamics: https://arxiv.org/abs/1503.03585
Check out our other Computer Vision blogs in this series:
Part 1: Computer Vision: Insights from Datatonic’s Experts
Part 2: Computer Vision: Emerging Trends and Google Cloud Technology
Part 3: Computer Vision: Object Detection and No-Code AI with AutoML
Part 4: Computer Vision: Deploying Image Segmentation Models on Vertex AI
Part 5: Computer Vision: Generative Models and Conditional Image Synthesis