Summary of paper ‘AN IMAGE IS WORTH 16X16 WORDS’
We have always studied and seen an image as a collection of pixels, and nowadays we are used to seeing images at high resolutions like 1080p. So how can an image be worth 16 × 16 words? Is this question troubling you? Let's discuss the answer.
To begin with, we should first introduce ourselves to Transformers. The Transformer is an architecture built on the concept of attention. It consists of encoders and decoders and is most commonly used for NLP (Natural Language Processing). Transformers came as a substitute for the LSTM and GRU models that were state of the art earlier.
You must be thinking: how are Transformers useful for images, and how can we make an image worth 16 × 16 words? Earlier, we used CNNs for image tasks such as detection and classification. A CNN builds in a lot of assumptions about images, but a Transformer has nothing like that. For NLP tasks we pass in a sequence of tokens, whereas for image tasks we split an image into small patches of 16 × 16, and the sequence of linear embeddings of these patches is the input to the Transformer.
Because Transformers lack these inductive biases, they don't do well on small datasets, but on large-scale data they outperform all the previous state-of-the-art models. They also need much less computation to train than ResNet and other models. So how does the Vision Transformer (ViT) work?
ViT works like the basic Transformer, with a few changes to the input. The image is embedded as patches (patch embedding), and a position embedding is added for positional information. Since ViT is mainly designed for classification problems, an extra learnable 'classification token' is prepended to the sequence, and a classification head is attached to the Transformer encoder; this head is implemented by an MLP with one hidden layer at pre-training time and by a single linear layer at fine-tuning time.
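To make this concrete, here is a minimal sketch (in PyTorch, with illustrative names and sizes such as `patch_dim` and `embed_dim`; this is not the authors' code) of how the patch projection, classification token, and position embedding could be combined before the encoder:

```python
import torch
import torch.nn as nn

class ViTEmbedding(nn.Module):
    """Sketch of the ViT input pipeline: patch projection + [class] token + position embedding."""
    def __init__(self, num_patches=196, patch_dim=16 * 16 * 3, embed_dim=768):
        super().__init__()
        self.proj = nn.Linear(patch_dim, embed_dim)                 # linear patch embedding
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))  # learnable [class] token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))  # learnable positions

    def forward(self, patches):                  # patches: (B, N, P*P*C)
        x = self.proj(patches)                   # (B, N, D)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)           # prepend [class] token -> (B, N+1, D)
        return x + self.pos_embed                # add positional information
```

The encoder's output at the position of the classification token is then passed to the classification head described above.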
To handle (obviously 2D) images, the image x of shape H × W × C is reshaped into a sequence of flattened 2D patches x_p of shape N × (P²·C), where (H, W) is the resolution of the original image, C is the number of channels, (P, P) is the resolution of each image patch, and N = HW/P² is the resulting number of patches, which also serves as the effective input sequence length for the Transformer.
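As a quick sanity check on these shapes (a sketch, not the authors' code), a 224 × 224 RGB image with P = 16 gives N = 50176/256 = 196 patches, each flattened to 16·16·3 = 768 values:

```python
import torch

H = W = 224; C = 3; P = 16
img = torch.randn(1, C, H, W)                   # one image, channels-first

# (B, C, H, W) -> (B, N, P*P*C) with N = H*W / P^2
patches = img.unfold(2, P, P).unfold(3, P, P)   # (B, C, H/P, W/P, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5)     # (B, H/P, W/P, C, P, P)
patches = patches.reshape(1, (H // P) * (W // P), C * P * P)

print(patches.shape)                            # torch.Size([1, 196, 768])
```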
The Transformer encoder consists of alternating layers of multi-headed self-attention and MLP (multi-layer perceptron) blocks. Layer normalization (LN), which normalizes the activations, is applied before every block, and residual connections are applied after every block. The MLP contains two layers with a GELU (Gaussian Error Linear Unit) non-linearity as the activation function.
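A minimal sketch of one such pre-norm encoder block in PyTorch (the sizes here, 768-dim embeddings, 12 heads, and a 3072-dim MLP, match the ViT-Base configuration but are otherwise illustrative):

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One ViT encoder layer: LN -> multi-head self-attention -> residual, then LN -> MLP (GELU) -> residual."""
    def __init__(self, embed_dim=768, num_heads=12, mlp_dim=3072):
        super().__init__()
        self.ln1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, mlp_dim),
            nn.GELU(),                       # GELU non-linearity between the two MLP layers
            nn.Linear(mlp_dim, embed_dim),
        )

    def forward(self, x):                    # x: (B, N+1, D)
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]   # LN before the block, residual after
        x = x + self.mlp(self.ln2(x))
        return x
```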
Datasets
The model is pre-trained on large-scale datasets such as:
- ILSVRC-2012 ImageNet dataset with 1k classes and 1.3M images
- ImageNet-21k with 21k classes and 14M images
- JFT with 18k classes and 303M high-resolution images
Benchmarks:
We can see that ViT performs better with much less computation. One more thing to notice: as we decrease the patch size from 16 to 14, the computational cost goes up, because smaller patches mean more patches per image and hence a longer input sequence.
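The reason is sequence length: the number of patches N = HW/P² grows as the patch size shrinks, and self-attention cost scales roughly with N². A quick back-of-the-envelope check for a 224 × 224 image:

```python
H = W = 224
for P in (16, 14):
    N = (H * W) // (P * P)   # number of patches = effective sequence length
    print(f"P={P}: N={N} patches, ~{N * N:,} pairwise attention interactions")
# P=16 -> N=196; P=14 -> N=256: smaller patches mean a longer sequence and more compute
```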
Performance on Various Datasets
As mentioned earlier, ViT doesn't work well with small datasets; the graph shows its performance across the various pre-training datasets. On small datasets it does not perform well, but as we move to larger ones its performance keeps improving. Thus, ViT depends heavily on the scale of the training data.
One more great feature of ViT is that its learned filters can be inspected easily and directly, which is a little harder with a CNN: the paper looks at the top principal components (PCA) of the learned patch-embedding filters, which resemble plausible basis functions. ViT also learns to encode distance within the image in the similarity of position embeddings, i.e. closer patches tend to have more similar position embeddings. Self-attention allows ViT to integrate information across the entire image even in the lowest layers.
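As an illustration of the position-embedding observation (a sketch only; the `pos_embed` tensor here is a random placeholder standing in for a trained model's learned position embeddings), one can compute the cosine similarity between every pair of patch positions:

```python
import torch
import torch.nn.functional as F

# Placeholder for a trained ViT's position embeddings (1 [class] token + 14*14 patches, 768 dims);
# in practice these would be loaded from a pre-trained checkpoint.
pos_embed = torch.randn(1 + 14 * 14, 768)

patch_pos = pos_embed[1:]                                   # drop the [class] token
sim = F.cosine_similarity(patch_pos.unsqueeze(1),           # (N, 1, D)
                          patch_pos.unsqueeze(0), dim=-1)   # (1, N, D) -> (N, N)

# sim[i] reshaped to the 14x14 grid shows how similar patch i's embedding is to every other
# position; in a trained ViT, nearby patches (same row/column) come out most similar.
row_of_patch_0 = sim[0].reshape(14, 14)
print(row_of_patch_0.shape)
```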
Hybrid Architecture
The paper also mentions a hybrid architecture that combines both models, a CNN and a Transformer, to use the capabilities of each: the input sequence is formed from CNN feature maps instead of raw image patches. The result obtained was unexpected and surprising.
For small model sizes the hybrid performs better than the pure Transformer, but for larger models the performance is about the same as ViT. This is surprising, since one would expect the CNN features to improve performance at every model size, but that wasn't the case. The pure Transformer architecture (ViT) is more efficient and scalable than traditional CNNs at both smaller and larger compute scales.
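A rough sketch of the hybrid idea, assuming a torchvision ResNet-50 as the CNN backbone (illustrative only, not the paper's exact setup): the CNN feature map is flattened into a token sequence and projected, replacing the raw-patch embedding.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class HybridEmbedding(nn.Module):
    """Hybrid ViT input: tokens come from CNN feature maps instead of raw image patches."""
    def __init__(self, embed_dim=768):
        super().__init__()
        cnn = resnet50(weights=None)                                 # no pre-trained weights needed here
        self.backbone = nn.Sequential(*list(cnn.children())[:-2])    # keep conv stages, drop pool/fc
        self.proj = nn.Linear(2048, embed_dim)                        # project CNN channels to embed_dim

    def forward(self, img):                        # img: (B, 3, 224, 224)
        fmap = self.backbone(img)                  # (B, 2048, 7, 7) for a 224x224 input
        tokens = fmap.flatten(2).transpose(1, 2)   # (B, 49, 2048): each spatial position is a token
        return self.proj(tokens)                   # (B, 49, D); [class] token + position embedding follow as before
```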
There are many more details and explanations in the paper; this was just the crux, to understand and walk through the concept of using a Transformer for image classification.
Transformers have since been applied to other image tasks such as detection, panoptic segmentation, image completion, and more, in models like DETR and Image GPT.
This article is based on AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE written by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby, and is still under review.
Here is a link to a Colab notebook where CIFAR-100 classification is done with both a CNN and ViT: https://colab.research.google.com/drive/1Yxbn2yR8hLD07y8sLMllylJrN8HT0VcM?usp=sharing