ViT [Eng]
Dosovitskiy et al. / An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale / ICLR 2021
Transformers, which are built on the self-attention architecture, are widely used in NLP due to their efficiency and scalability. In computer vision, however, convolutional models still dominate, because the quadratic cost of self-attention makes it impractical to apply a Transformer directly to individual pixels. This paper consists of two parts: a way to apply the self-attention architecture to computer vision, and a performance comparison with CNN-based models on specific datasets and settings. The problem definition is therefore: can we build a self-attention-based model suited for vision tasks (e.g., image classification) that performs competitively with SotA CNN models?
Transformers (Vaswani et al. / Attention Is All You Need / NIPS 2017)
The Transformer is formulated as an encoder-decoder architecture for sequence-to-sequence modeling.
Encoder
The encoder is a stack of N identical layers: the initial input passes through the first layer, its output passes through the next layer, and so on.
Each layer consists of two sub-layers: a multi-head self-attention layer and a position-wise fully connected feed-forward network.
Residual connections and layer normalization are used in each sub-layer, so that the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer; a small code sketch of this pattern is given after the decoder description below. To enable the residual connections, all sub-layers and the embedding layers produce outputs of the same dimension.
Decoder
Like the encoder, the decoder consists of N identical layers.
A third sub-layer is added, which performs multi-head attention over the output of the encoder. Residual connections and layer normalization are used in the decoder in the same way.
Unlike the encoder, the decoder alters the self-attention layer to prevent each position from using information of subsequent positions, in other words, making sure the prediction for the i-th position can only depend on previous outputs of positions less than i. This technique is commonly referred to as masking.
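As a rough illustration of the residual + LayerNorm pattern and the decoder's causal masking described above, here is a minimal PyTorch sketch (the names `PostLNSublayer` and `causal_mask` are ours, not from the paper):

```python
import torch
import torch.nn as nn

class PostLNSublayer(nn.Module):
    """Post-LN residual wrapper: output = LayerNorm(x + Sublayer(x))."""
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        return self.norm(x + self.sublayer(x))

def causal_mask(seq_len):
    """Boolean mask where True entries are blocked: position i may only attend to positions <= i."""
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

# Example: wrap the position-wise feed-forward network of one encoder layer
d_model = 512
ffn = nn.Sequential(nn.Linear(d_model, 2048), nn.ReLU(), nn.Linear(2048, d_model))
layer = PostLNSublayer(d_model, ffn)
x = torch.randn(2, 10, d_model)   # (batch, sequence length, d_model)
y = layer(x)                      # same shape as x
mask = causal_mask(10)            # usable as attn_mask in nn.MultiheadAttention
```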
For more detailed explanations on recurrent neural networks and the self-attention architecture itself, please refer to
Sherstinsky, Alex / Fundamentals of Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) Network / Elsevier 2021
as well as the original paper above.
To keep the input sequence short enough for self-attention to be tractable, the authors propose to split images into fixed-size patches; each patch is linearly embedded, position embeddings are added, and the resulting sequence of vectors is fed to a standard Transformer encoder.
Embedding
In order to process a 2D image into a 1D sequence of token embeddings, the original input image x of shape H × W × C is reshaped into a sequence of N flattened 2D patches, each of dimension P² · C.
(P, P): resolution of each image patch
N: number of patches (N = HW / P²)
D: fixed latent dimension used in all sub-layers of the Transformer
The flattened patches are then mapped to D dimensions with a trainable linear projection E. The output of this projection is referred to as the patch embeddings.
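A minimal sketch of the patch-splitting and linear projection step, assuming a PyTorch-style implementation (the official code is in JAX/Flax; the variable names here are ours):

```python
import torch
import torch.nn as nn

# Example: 224x224 RGB image, 16x16 patches -> N = (224/16)^2 = 196 patches
B, C, H, W, P, D = 1, 3, 224, 224, 16, 768
x = torch.randn(B, C, H, W)

# Split the image into non-overlapping P x P patches and flatten each one
patches = x.unfold(2, P, P).unfold(3, P, P)                             # (B, C, H/P, W/P, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * P * P)   # (B, N, P^2 * C)

# Trainable linear projection E maps each flattened patch to D dimensions
E = nn.Linear(C * P * P, D)
patch_embeddings = E(patches)                                           # (B, N, D) = (1, 196, 768)
```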
Class Token
Similar to the [class] token in BERT, a learnable embedding is prepended to the sequence of patch embeddings (z^0_0 = x_class).
z^0_L, the state of this token at the output of the encoder, serves as the image representation y.
A classification head is attached to z^0_L for both pre-training and fine-tuning.
Pre-training: 2-layered MLP
Fine-tuning: 1 linear layer
Positional Embedding
To give positional information to each patch, position embeddings are added to the patch embeddings.
1D position embeddings are used instead of 2D-aware position embeddings, as no significant performance gain was observed from the latter.
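The class token and 1D position embeddings can be sketched as follows (again a hedged PyTorch illustration, not the official implementation):

```python
import torch
import torch.nn as nn

B, N, D = 1, 196, 768
patch_embeddings = torch.randn(B, N, D)

# Learnable [class] token, prepended to the patch embeddings
cls_token = nn.Parameter(torch.zeros(1, 1, D))
tokens = torch.cat([cls_token.expand(B, -1, -1), patch_embeddings], dim=1)  # (B, N+1, D)

# Learnable 1D position embeddings, added to every token
pos_embedding = nn.Parameter(torch.zeros(1, N + 1, D))
z0 = tokens + pos_embedding   # input sequence z_0 to the Transformer encoder
```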
Transformer Encoder
The original Transformer encoder consists of N identical blocks of multi-headed self-attention layers and feed-forward networks, as discussed in Related Works.
In ViT, layer normalization (LN) is applied before every block and residual connections after every block.
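A minimal sketch of one such pre-LN encoder block, written in PyTorch for illustration (ViT-Base-like dimensions are assumed: D = 768, 12 heads, MLP size 3072):

```python
import torch
import torch.nn as nn

class ViTBlock(nn.Module):
    """One encoder block: LayerNorm before MSA / MLP, residual connection after each."""
    def __init__(self, d_model=768, n_heads=12, mlp_dim=3072):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, d_model)
        )

    def forward(self, z):
        h = self.norm1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]  # pre-LN multi-head self-attention
        z = z + self.mlp(self.norm2(z))                    # pre-LN MLP
        return z

z = torch.randn(1, 197, 768)   # [class] token + 196 patch tokens
print(ViTBlock()(z).shape)     # torch.Size([1, 197, 768])
```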
Inductive bias
Unlike CNNs, which assume locality, a 2D neighborhood structure, and translation equivariance, ViT has much less image-specific inductive bias.
MLP Layer: Locality & Equivariance assumed.
Self-attention Layer: Global
The 2D neighborhood structure is used only sparingly: when cutting the image into patches and at fine-tuning time, when the position embeddings are adjusted for images of different resolution. The position embeddings at initialization carry no information about the 2D positions of the patches or their spatial relations.
Hybrid Architecture
Instead of raw image patches, the input sequence can be formed from the feature maps of a CNN.
In this case, the patches are taken from a CNN feature map, flattened, and projected to D dimensions just as in the original architecture.
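As an illustration only, a hybrid input could be built roughly like this (a plain torchvision ResNet-50 is used as a stand-in for the paper's modified ResNet/BiT backbone):

```python
import torch
import torch.nn as nn
import torchvision

# Hypothetical hybrid stem: a ResNet backbone provides the "patches"
backbone = torchvision.models.resnet50()
stem = nn.Sequential(*list(backbone.children())[:-2])   # drop the average pool and fc head

x = torch.randn(1, 3, 224, 224)
feat = stem(x)                             # (1, 2048, 7, 7) feature map
tokens = feat.flatten(2).transpose(1, 2)   # (1, 49, 2048): each spatial location acts as a "patch"
proj = nn.Linear(2048, 768)
tokens = proj(tokens)                      # (1, 49, 768), fed to the Transformer encoder as before
```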
Dataset
ImageNet-1k (1k classes and 1.3M images)
ImageNet-21k (21k classes and 14M images)
JFT (18k classes and 303M high-resolution images)
Model Structure
ViT-Base, ViT-Large, ViT-Huge; e.g., ViT-Large/16 denotes the "Large" model size with a 16 × 16 input patch size
Baseline (CNNs)
ResNet with the Batch Normalization layers replaced by Group Normalization and with standardized convolutions (BiT)
Training setup
Hyper-parameter
Optimizer: Adam (β1 = 0.9, β2 = 0.999)
Batch size: 4096
Weight Decay: 0.1
Fine-tuning
Hyper-parameter
Optimizer: SGD
Batch size: 512
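A hedged sketch of the corresponding optimizer setup (the learning-rate values below are illustrative placeholders; only the optimizer types, betas, weight decay, and batch sizes come from the paper):

```python
import torch

model = torch.nn.Linear(768, 1000)   # placeholder for a ViT model

# Pre-training: Adam, beta1 = 0.9, beta2 = 0.999, weight decay 0.1 (batch size 4096)
pretrain_opt = torch.optim.Adam(model.parameters(), lr=1e-3,
                                betas=(0.9, 0.999), weight_decay=0.1)

# Fine-tuning: SGD with momentum (batch size 512); lr is chosen per dataset
finetune_opt = torch.optim.SGD(model.parameters(), lr=0.03, momentum=0.9)
```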
Evaluation metric
Fine-tuning accuracy
Capture the performance of each model after fine-tuning
Few-shot accuracy
Obtained by solving a regularized least-squares regression problem (used only when fine-tuning would be too costly)
The frozen representation of each image is mapped to a {−1, 1}^K target vector; this regression has an exact closed-form solution, from which the accuracy is then computed.
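A minimal NumPy sketch of such a closed-form regularized least-squares (ridge) evaluation, under our own simplified reading of the procedure:

```python
import numpy as np

def few_shot_ridge(X_train, Y_train, X_test, lam=1.0):
    """Closed-form regularized least squares: map frozen features to {-1, 1}^K targets.
    X_*: (n, d) frozen image representations, Y_train: (n, K) targets in {-1, 1}."""
    d = X_train.shape[1]
    W = np.linalg.solve(X_train.T @ X_train + lam * np.eye(d), X_train.T @ Y_train)  # (d, K)
    return (X_test @ W).argmax(axis=1)   # predicted class indices

# Toy usage with random features (for illustration only)
rng = np.random.default_rng(0)
X_tr, X_te = rng.normal(size=(50, 128)), rng.normal(size=(10, 128))
labels = rng.integers(0, 5, size=50)
Y_tr = -np.ones((50, 5))
Y_tr[np.arange(50), labels] = 1.0
preds = few_shot_ridge(X_tr, Y_tr, X_te)
```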
Comparison to SotA Baseline model
Noisy Student is the SotA on ImageNet-1k, and BiT-L on all the other datasets (CIFAR-10, CIFAR-100, VTAB, etc.)
This can also be checked in the figure of performance versus pre-training compute for each model. The x-axis denotes the pre-training cost (computation time), while the y-axis denotes classification accuracy. ViTs generally outperform ResNets for the same computational budget. For smaller model sizes, hybrids perform better than pure Transformers, but the gap closes as the model size grows.
Left Figure: Filters of the initial linear embedding of RGB values of ViT-L/32
Recall the first layer of ViT linearly projects the flattened patches to a fixed low-dimensional space.
The principal components of the learned embedded filters perform similar tasks to the CNN filters.
Middle Figure: Similarity between position embeddings
Recall that ViT adds position embeddings to the sequence of projected patch embeddings.
Patches that are closer to each other show higher similarity, implying that spatial information is properly learned.
Right Figure: Size of attended area by head and network depth
Self-attention allows ViT to integrate information across the entire image.
Attention distance is computed as the average distance in image space across which information is integrated, weighted by the attention weights (a small sketch of this computation follows this list).
The model attends to regions that are semantically relevant for classification.
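A rough sketch of how such a mean attention distance can be computed from an attention matrix (our own simplified version; the paper additionally averages over heads and many images, and reports distances in pixels):

```python
import numpy as np

def mean_attention_distance(attn, grid_size):
    """attn: (T, T) attention weights over T = grid_size**2 patch tokens (class token excluded).
    Returns the attention-weighted average query-key distance in patch-grid units
    (multiply by the patch size to get pixels)."""
    coords = np.stack(np.meshgrid(np.arange(grid_size), np.arange(grid_size), indexing="ij"),
                      axis=-1).reshape(-1, 2)                                  # (T, 2) patch coords
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)   # (T, T) distances
    return (attn * dists).sum(axis=1).mean()   # weighted distance per query, averaged over queries

# Toy usage: uniform attention over a 14 x 14 patch grid
T = 14 * 14
uniform_attn = np.full((T, T), 1.0 / T)
print(mean_attention_distance(uniform_attn, 14))
```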
The seemingly simple idea of applying the Transformer architecture to computer vision tasks had a massive impact on the research community, thanks to its strong performance on benchmarks such as ImageNet and CIFAR-100. The key difference between ViT and CNNs is that ViT builds far fewer inductive biases into the architecture. On small datasets CNNs usually outperform ViT, but on large datasets with many classes, the largely bias-free architecture and the generalization power of self-attention shine. Since its publication, many encoder-decoder and generative models have adopted ViT backbones, which shows how much influence ViT has had across multiple fields.
An image is truly worth 16 × 16 words, which was striking for most of us. How about videos? :)
윤여동 (Yeo Dong Youn)
KAIST AI
Mainly working on generative models, specifically representation learning
E-mail: yeodong@kaist.ac.kr
Dosovitskiy et al. / An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale / ICLR 2021
Official GitHub repository (Link)
Devlin et al. / BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding / NAACL 2019
Vaswani et al. / Attention Is All You Need / NIPS 2017
Table: ViT-H/14 & ViT-L/16 vs. BiT-L & Noisy Student. The models are pre-trained on either JFT or ImageNet-21k and evaluated on benchmark datasets for classification accuracy.
Figure: Inspection of Vision Transformers. Some heads already attend to most of the image in the lowest layers, showing that ViT integrates information globally rather than only locally, in contrast to CNNs.