Today, I decided to pen down a quick review of an algorithm that lies at the heart of all the beautiful Van Gogh and Picasso versions of your profile picture. Once this algorithm was unleashed onto the world, it went viral and resulted in all sorts of apps like Prisma and Vinci, and has since been extended to video as well. More recently, attempts have been made to apply something similar to audio data.
This algorithm uses a class of neural networks called Convolutional Neural Networks (CNNs), which have proven to be very effective in various computer vision tasks such as image classification, segmentation, and object detection. In this blog post, I assume that the reader has a basic understanding of how CNNs work. If you need a quick recap, I direct you to this page, and once you’re done, you can hop right back in here.
The algorithm for style transfer was introduced by Gatys, Ecker and Bethge in this paper on ‘Image Style Transfer Using Convolutional Neural Networks‘, which was published at CVPR 2016 (it does, however, have an earlier 2015 version on arXiv). This is an important paper in terms of what it resulted in and how it attracted the attention of people outside of the deep learning community. More importantly, it showed that CNNs that were traditionally used for image classification could, in fact, be used for something entirely different through their powerful image representations.
Key idea: They introduce an algorithm that is capable of separating the content of an image from its style and recombining them to produce interesting results.
For example, consider an artwork $latex a$ and a photograph $latex p$. We want to apply the style of the artwork over the contents of the photograph to produce a new stylized image.
The technique involved here is an iterative optimization approach: we want to minimize a linear combination of two losses – a content loss and a style loss.
For encoding the content representation, they directly make use of the feature maps at a particular layer. Feature maps (or activation maps) are the neuron responses of a given layer. Consider a layer $latex l$ in the network having $latex N_l$ feature maps, each of size $latex M_l$ (height times width of the feature map). From this we build a feature map matrix $latex F^l \in \mathcal{R}^{N_l \times M_l}$. In essence, what we are doing is taking each feature map, stretching it out into a single row, and then stacking multiple such feature maps one below the other.
So the encoding is done as follows:
- Layer $latex l$ having $latex N_l$ feature maps, each of size $latex M_l$
- Feature: $latex F^l \in \mathcal{R}^{N_l \times M_l}$, where $latex F^l_{ij}$ is the activation of the $latex i^{th}$ filter at position $latex j$ in layer $latex l$
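The “stretch each map into a row and stack” construction above is just a reshape. A minimal NumPy sketch (the sizes are made up purely for illustration):

```python
import numpy as np

# A layer with N_l = 2 feature maps, each 3x4 (so M_l = 12)
fmaps = np.arange(24).reshape(2, 3, 4)

# Stretch each map into a single row and stack them: F^l is (N_l, M_l)
F = fmaps.reshape(2, -1)
print(F.shape)   # (2, 12)
print(F[0])      # the first feature map, flattened row by row
```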
During optimization, we perform gradient descent on a white noise image ($latex x$) and incrementally update it to match the feature representation at layer $latex l$, so as to find another image that matches the content representation of the original photograph ($latex p$).
We then use the squared L2 difference between the two representations to compute the loss. The loss is given by the following equation:

$latex \mathcal{L}_{content}(p, x, l) = \frac{1}{2}\sum_{i,j}\left(F^l_{ij} - P^l_{ij}\right)^2 $

where $latex P^l$ and $latex F^l$ are the feature representations of the photograph and the generated image, respectively, at layer $latex l$.
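As a sanity check on these definitions, here is a minimal, self-contained NumPy sketch. Note the hedge: the “network” below is just a fixed random linear map standing in for a pre-trained VGG layer, which keeps the content loss a simple convex quadratic in the image:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "network": a fixed random linear map from pixels to features.
# (In the paper this would be a layer of a pre-trained VGG network.)
W = rng.standard_normal((6, 16))

def features(x):
    return W @ x

def content_loss(F, P):
    return 0.5 * np.sum((F - P) ** 2)

p = rng.random(16)   # flattened 4x4 "photograph"
P = features(p)      # its content representation

# Gradient descent starting from a white-noise image
x = rng.standard_normal(16)
loss_init = content_loss(features(x), P)
lr = 0.01
for _ in range(200):
    grad = W.T @ (features(x) - P)   # d L_content / d x
    x -= lr * grad

print(content_loss(features(x), P) < loss_init)   # True: the loss decreased
```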
The content representation was pretty straightforward. What’s a little less trivial, and more nuanced in my opinion, is the style loss. The paper uses the correlations between the feature maps of a layer to build the style representation. To do this, they compute the Gram matrix of the feature maps. Mathematically, a Gram matrix is nothing but a matrix multiplied by its transpose. So we take the feature map matrix $latex F^l$ that I described earlier and multiply it by its transpose to get the Gram matrix $latex G^l$ at layer $latex l$. They compute these matrices at multiple layers in the network.
So the encoding is given by the correlations between feature maps of a layer and is computed as follows:
- Feature: Gram matrix $latex G^l \in \mathcal{R}^{N_l \times N_l}$, where $latex G^l_{ij}$ is the inner product between the vectorised feature maps $latex i$ and $latex j$ in layer $latex l$: $latex G^l_{ij} = \sum_k F^l_{ik} F^l_{jk}$
- Computed at different layers to get a multi-scale representation
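Putting the two bullet points into code, a NumPy sketch (the shapes here are made up for illustration):

```python
import numpy as np

def gram_matrix(fmaps):
    """fmaps: one layer's activations, shape (N_l, H, W).
    Vectorise each map into a row of F, then G^l = F F^T."""
    N = fmaps.shape[0]
    F = fmaps.reshape(N, -1)   # (N_l, M_l) with M_l = H * W
    return F @ F.T             # (N_l, N_l)

maps = np.random.rand(8, 4, 4)   # 8 feature maps of size 4x4
G = gram_matrix(maps)
print(G.shape)                   # (8, 8)
# G[i, j] is the inner product of vectorised maps i and j,
# so G is symmetric and G[i, i] = ||map i||^2
```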
During optimization, like earlier, we perform gradient descent on a white noise image ($latex x$) and minimize the L2 difference between the two Gram matrices at a layer, so that the image comes to resemble the style of the artwork ($latex a$). The contribution of layer $latex l$ to the style loss is:

$latex E_l = \frac{1}{4 N_l^2 M_l^2} \sum_{i,j}\left(G^l_{ij} - A^l_{ij}\right)^2 $

where $latex A^l$ and $latex G^l$ are the Gram matrices of the artwork and the generated image, respectively.
We get the total style loss by summing up the losses at different layers:

$latex \mathcal{L}_{style}(a, x) = \sum_{l} w_l E_l $

where $latex w_l$ are weighting factors controlling each layer’s contribution.
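The per-layer loss and the weighted sum translate directly into code. A NumPy sketch with toy feature matrices (layer shapes and weights are invented for illustration):

```python
import numpy as np

def gram(F):
    """Gram matrix of a feature matrix F of shape (N_l, M_l)."""
    return F @ F.T

def layer_style_loss(F_x, F_a):
    """E_l with the paper's 1/(4 N_l^2 M_l^2) normalisation."""
    N, M = F_x.shape
    return np.sum((gram(F_x) - gram(F_a)) ** 2) / (4.0 * N**2 * M**2)

def style_loss(feats_x, feats_a, weights):
    """Weighted sum over layers: sum_l w_l * E_l."""
    return sum(w * layer_style_loss(fx, fa)
               for fx, fa, w in zip(feats_x, feats_a, weights))

# Toy example: two "layers" with random features
rng = np.random.default_rng(0)
feats_a = [rng.random((4, 9)), rng.random((8, 4))]
feats_x = [rng.random((4, 9)), rng.random((8, 4))]
print(style_loss(feats_x, feats_a, weights=[0.5, 0.5]))   # some positive number
print(style_loss(feats_a, feats_a, weights=[0.5, 0.5]))   # 0.0 for identical features
```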
Bonus: Why Gram matrices for style?
Having read the original paper a couple of times to gain a good understanding of the algorithm, I still couldn’t find an explanation of why Gram matrices encode the style of an image. This made me dig deeper into the realms of the internet for a satisfactory answer. Section 2.2 of this nice paper has a good explanation, presented below.
Intuitively, these matrices compute feature statistics that have two important properties – spatial averaging and local coherence.
Starting with the first of the two properties: what does spatial averaging mean? It means we want to quantify how important a filter is for the layer, without bothering about the input locations at which this filter gets activated. This is a crucial property for representing style, because the style of an image is a static property: it is supposed to be invariant to position in the image. For instance, the brush strokes the artist uses for painting the sun are similar to the ones used for painting the house. This spatial averaging is taken care of by the diagonal terms of the Gram matrix.
The non-diagonal terms are where the second property, local coherence, comes in. The non-diagonal elements tell us how correlated two filters are across spatial locations. A term of 0, for instance, tells us that the two filters in question are not supposed to be activated at the same locations, so they enforce some kind of separation or coherence between filters. This is the most basic argument; there are further explanations involving Gabor-like filters and Fourier transforms, but I haven’t looked into those.
This makes sure that the style representation remains blind to the global arrangement of objects while still retaining the local correlation between features.
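This position-invariance claim is easy to verify numerically: shuffling the spatial positions of the feature maps leaves the Gram matrix unchanged. A quick NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
F = rng.random((8, 16))            # 8 feature maps, 16 spatial positions each

perm = rng.permutation(16)         # globally rearrange the spatial positions
F_shuffled = F[:, perm]

G = F @ F.T
G_shuffled = F_shuffled @ F_shuffled.T
print(np.allclose(G, G_shuffled))  # True: the Gram matrix ignores position
```

Since each entry $latex G^l_{ij}$ sums over all positions, reordering those positions cannot change the sum.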
The Network Model: Combining the content and style losses
The final loss to be optimized is a linear combination of both the content and the style losses given by:

$latex \mathcal{L}_{total}(p, a, x) = \alpha \, \mathcal{L}_{content}(p, x) + \beta \, \mathcal{L}_{style}(a, x) $
where $latex \alpha$ and $latex \beta$ are constants which control how much of the content and style we want in the final image. A schematic representation of the entire architecture is shown in the below figure.
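In code, the combination is a one-liner; the interesting part is choosing the $latex \alpha / \beta$ ratio. The numbers below are purely illustrative, not the paper’s settings:

```python
def total_loss(l_content, l_style, alpha, beta):
    """L_total = alpha * L_content + beta * L_style."""
    return alpha * l_content + beta * l_style

# A small alpha/beta ratio weights style more heavily
print(total_loss(2.0, 0.5, alpha=1.0, beta=8.0))   # 6.0
```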
From a number of experiments, the authors present the following points:
- Use higher layers for content → capture high level image representations such as objects and their arrangement in the input image, and discard the detailed pixel-level information.
- Use multiple layers for style → Lower layers are more grainy. Higher layers are able to capture local structures pretty well.
One major takeaway from this paper is that CNNs are capable of encoding powerful and well-separable feature representations for the content and style of images. In a way, the authors show us that we can completely ignore the final outputs of the network and focus purely on the intermediate feature representations, which are powerful in themselves, to perform completely new tasks in the domain of computer vision.