This is the webpage for the WPI CS 539 (Machine Learning) final project.
Group Members:
Jiaming Nie, Ruojun Li, Yu Li, Yang Tao, Guangda Li
Style transfer is the technique of recomposing an image in the style of another image. Three images are involved: a content image, a style image, and the output image.
A deep convolutional neural network is a powerful tool for extracting the features that represent style. Style can include the structure of a painting, the way the artist draws, and the colors the artist uses. A representation of the content image can also be extracted with the same network.
A style transfer demo is illustrated in the image below:
A deep convolutional neural network outputs activation maps, which are features representing the content or style of an image. In this project, an ImageNet-pretrained VGG16 network is used to output features at different layers.
VGG16 is a deep CNN with 13 convolutional layers (organized into 5 blocks), 2 fully connected layers, and 1 softmax output layer. In this project, only the convolutional layers are used, since they produce the activation maps.
The architecture of VGG16 is shown below:
VGG16 takes 224×224 images as input, and all the convolutional layers act as hidden layers.
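As a minimal sketch of how the activation maps might be pulled out of a pretrained VGG16 at different depths (assuming PyTorch/torchvision; the layer indices and 224×224 preprocessing are illustrative choices, not the project's exact code):

import torch
from PIL import Image
from torchvision import models, transforms

# ImageNet-pretrained VGG16; only the convolutional part is kept.
vgg = models.vgg16(pretrained=True).features.eval()

# Standard ImageNet preprocessing (224x224 input, ImageNet mean/std).
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def extract_features(image_path, layer_ids=(3, 8, 15, 22, 29)):
    """Return the activation maps at the chosen layer indices."""
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    feats = {}
    with torch.no_grad():
        for i, layer in enumerate(vgg):
            x = layer(x)
            if i in layer_ids:
                feats[i] = x  # shape: (1, channels, height, width)
    return feats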
Layer 2 of the CNN: the left image shows what the CNN has learned, and the right image shows patches of actual images.
As the network goes deeper, the features become more high-level and abstract, and harder for humans to recognize. The pretrained VGG16 is able to pick out the objects in an image.
The structure of the content image can be extracted easily from the output of the last convolutional layer of VGG16.
Similarly, the style of an image can be extracted using the same method.
The activation map is a stack of matrices whose entries encode high-level features.
In style transfer, these activation maps need to be reshaped for the training and testing procedure.
Each row of the feature matrix represents a different feature map, and each column corresponds to a pixel of the activation map.
The feature matrix contains all the information of the style image.
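A small sketch of that reshaping step, assuming the activation map comes out as a (channels, height, width) array:

import numpy as np

def to_feature_matrix(activation):
    # Flatten a (channels, H, W) activation map into a matrix with one
    # row per feature map and one column per pixel.
    c, h, w = activation.shape
    return activation.reshape(c, h * w)

# Example: 64 feature maps of size 112x112 become a 64 x 12544 matrix.
F = to_feature_matrix(np.random.rand(64, 112, 112))
print(F.shape)  # (64, 12544)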
Another piece of information that is needed is the relationship between different features. The inner product of the feature matrix describes this relationship in a clear manner. The resulting matrix is called the Gram matrix.
In Python, the Gram matrix can be computed easily (F is the feature matrix, with one row per feature map):
import numpy as np

def Gram(F):  # F: one row per feature map, one column per pixel
    return np.dot(F, F.T) / F.size
If an element of the Gram matrix is large, the correlation between the two corresponding features is strong. For example, if a cell's row corresponds to a tree feature and its column to a flower feature, a large value means the artist tends to combine trees and flowers.
The size of the Gram matrix is determined only by the number of feature maps, so in style transfer the input image size is not strictly constrained.
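A quick check of that property, reusing Gram and to_feature_matrix from above: two activation maps with different spatial sizes but the same number of feature maps yield Gram matrices of the same shape.

import numpy as np

small = to_feature_matrix(np.random.rand(64, 56, 56))    # 64 x 3136
large = to_feature_matrix(np.random.rand(64, 224, 224))  # 64 x 50176

print(Gram(small).shape)  # (64, 64)
print(Gram(large).shape)  # (64, 64) -- only the channel count matters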
The whole process is shown below.
In the forward pass, the style image and the content image are fed into the network and their features are extracted.
For each content image, the content loss is computed, and the total content loss is the sum over all training images.
The cost function is the weighted sum of the style loss and the content loss.
In backpropagation, compute the gradient of the total loss with respect to the output image (initialized from the content image), and apply gradient descent until the total loss converges.
The content loss is computed against fixed targets defined by the training images.
Both the style loss and the content loss are computed with the mean squared error (MSE) equation, so the loss used in training is an MSE.
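A hedged sketch of the iterative, optimization-based variant of this loop (assuming PyTorch; the layer choices, learning rate, and α/β defaults are illustrative assumptions, and content/style are expected as preprocessed (1, 3, H, W) tensors):

import torch
import torch.nn.functional as nnF
from torchvision import models

vgg = models.vgg16(pretrained=True).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

CONTENT_LAYER = 22              # illustrative choice of a deep conv layer
STYLE_LAYERS = (3, 8, 15, 22)   # illustrative choice of several layers

def get_features(x, layer_ids):
    feats, out = {}, x
    for i, layer in enumerate(vgg):
        out = layer(out)
        if i in layer_ids:
            feats[i] = out
    return feats

def gram(feat):
    _, c, h, w = feat.shape
    f = feat.view(c, h * w)
    return f @ f.t() / (c * h * w)

def style_transfer(content, style, alpha=1.0, beta=1e3, steps=300):
    # Target features and Gram matrices are fixed; only the output image changes.
    with torch.no_grad():
        content_target = get_features(content, {CONTENT_LAYER})[CONTENT_LAYER]
        style_targets = {i: gram(f)
                         for i, f in get_features(style, set(STYLE_LAYERS)).items()}

    output = content.clone().requires_grad_(True)  # initialize from the content image
    opt = torch.optim.Adam([output], lr=0.05)

    for _ in range(steps):
        opt.zero_grad()
        feats = get_features(output, set(STYLE_LAYERS) | {CONTENT_LAYER})
        content_loss = nnF.mse_loss(feats[CONTENT_LAYER], content_target)
        style_loss = sum(nnF.mse_loss(gram(feats[i]), style_targets[i])
                         for i in STYLE_LAYERS)
        total = alpha * content_loss + beta * style_loss  # weighted sum of MSE losses
        total.backward()
        opt.step()
    return output.detach()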
VGG16 was used to extract the features and reconstruct the images.
Training the network takes a very long time; a lighter CNN could be used to reduce training time.
Tuning the parameters α and β changes the balance of content and style in the output.
The real-time demo video is here:
https://www.youtube.com/watch?v=5WqvDl81Cj0s
Or see the GIF:
The training data comes from the Microsoft COCO 2014 dataset.
The dataset is about 14 GB and contains about 80,000 images in total.
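As an illustrative sketch (not the project's exact loader) of how the COCO images could be fed in for training, assuming the unzipped train2014 folder sits under ./data/coco/:

from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Resize and crop every COCO image to a fixed size before batching.
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(256),
    transforms.ToTensor(),
])
dataset = datasets.ImageFolder("./data/coco", transform=transform)  # e.g. ./data/coco/train2014/*.jpg
loader = DataLoader(dataset, batch_size=4, shuffle=True)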
https://github.com/jmnie/CS539-ML-18F-Project