ViT [Eng]

Dosovitskiy et al. / An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale / ICLR 2021


1. Problem definition

Transformers, which are built on the self-attention architecture, are widely used in NLP because of their efficiency and scalability. In computer vision, however, convolutional models still dominate, since applying a Transformer directly to individual pixels is computationally prohibitive. This paper consists of two parts: a way to apply the self-attention architecture to computer vision, and a performance comparison with CNN-based models on specific datasets and training setups. The problem definition is therefore: can we build a self-attention-based model suited to vision tasks (e.g., image classification) that is competitive with state-of-the-art CNN models?
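As a rough back-of-the-envelope illustration (the numbers below are for a standard 224 × 224 RGB image and are not taken from the paper itself), treating every pixel as a token makes the quadratic cost of self-attention explode, whereas 16 × 16 patches keep the sequence short:

```python
pixels  = 224 * 224               # 50,176 tokens if every pixel becomes a token
patches = (224 // 16) ** 2        # 196 tokens with 16x16 patches
ratio = pixels**2 / patches**2    # self-attention cost grows with sequence length squared
print(ratio)                      # ~65,536x more pairwise interactions at the pixel level
```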

2. Motivation

Related work

  • Transformers (Vaswani et al. / Attention Is All You Need / NIPS 2017)

    • The Transformer architecture follows an encoder-decoder structure, similar to other sequence-to-sequence models.

    • Encoder

      • The encoder is a stack of N identical layers: the initial input passes through the first layer, its output passes to the next layer, and so on.

      • Each layer consists of two sub-layers: a multi-head self-attention layer and a position-wise fully connected feed-forward network.

      • A residual connection and layer normalization are applied around each sub-layer, so the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by that sub-layer (see the sketch after this list). To make the residual connections possible, all sub-layers and embedding layers share the same dimension.

    • Decoder

      • Like the encoder, the decoder consists of N identical layers.

      • A third sub-layer is added that performs multi-head attention over the output of the encoder. Residual connections and layer normalization are used in the decoder in the same way as in the encoder.

      • Unlike the encoder, the decoder modifies the self-attention layer to prevent each position from attending to subsequent positions; in other words, the prediction for the i-th position can depend only on outputs at positions less than i. This technique is commonly referred to as masking.

    • For more detailed explanations on recurrent neural networks and the self-attention architecture itself, please refer to

      Sherstinsky, Alex / Fundamentals of Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) Network / Elsevier 2021

      as well as the original paper above.
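To make the LayerNorm(x + Sublayer(x)) pattern above concrete, here is a minimal PyTorch sketch of a single post-norm encoder layer. This is an illustrative reimplementation, not code from the paper; the dimensions and the class name (EncoderLayer) are chosen for the example.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One post-norm Transformer encoder layer: output = LayerNorm(x + Sublayer(x))."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads,
                                               dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(                      # position-wise feed-forward network
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                              # x: (batch, seq_len, d_model)
        attn_out, _ = self.self_attn(x, x, x)          # sub-layer 1: multi-head self-attention
        x = self.norm1(x + attn_out)                   # LayerNorm(x + Sublayer(x))
        x = self.norm2(x + self.ffn(x))                # sub-layer 2: same pattern with the FFN
        return x

out = EncoderLayer()(torch.randn(2, 10, 512))          # (2, 10, 512)
```

A decoder layer would additionally pass a causal mask (attn_mask) to the self-attention call and add a third sub-layer attending to the encoder output, as described above.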

Idea

To keep the input sequence short enough for self-attention, the authors propose to split each image into fixed-size patches, linearly embed each patch, add positional information, and feed the resulting sequence of vectors to a standard Transformer encoder. A minimal patch-embedding sketch follows.
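As a concrete illustration of this step, the sketch below turns a batch of images into the sequence of flattened patches and projects them with a trainable linear layer. It assumes square images whose side is divisible by the patch size; the helper name patchify is made up for this example.

```python
import torch
import torch.nn as nn

def patchify(images, patch_size=16):
    """(B, C, H, W) -> (B, N, P*P*C) with N = (H/P) * (W/P)."""
    B, C, H, W = images.shape
    P = patch_size
    x = images.reshape(B, C, H // P, P, W // P, P)
    x = x.permute(0, 2, 4, 3, 5, 1)                 # (B, H/P, W/P, P, P, C)
    return x.reshape(B, (H // P) * (W // P), P * P * C)

images = torch.randn(8, 3, 224, 224)                # dummy batch
patches = patchify(images)                          # (8, 196, 768): N = 196 flattened patches
proj = nn.Linear(16 * 16 * 3, 768)                  # trainable linear projection to D = 768
patch_embeddings = proj(patches)                    # (8, 196, 768)
```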

3. Method

Vision Transformer (ViT)

  • Embedding

    • In order to process 2D images into 1D token embeddings, the original input image $x \in \mathbb{R}^{H \times W \times C}$ is split into a sequence of flattened 2D patches $x_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$.

      • $(P, P)$: resolution of each image patch

      • $N$: number of patches ($N = HW/P^2$)

      • $D$: fixed dimension size used by all sub-layers

    • The flattened patches are then mapped to $D$ dimensions with a trainable linear projection $E$. The output of this projection is referred to as the patch embeddings: $z_0 = [x_{class}; x^1_p E; x^2_p E; \cdots; x^N_p E] + E_{pos}$, where $E \in \mathbb{R}^{(P^2 \cdot C) \times D}$ and $E_{pos} \in \mathbb{R}^{(N+1) \times D}$.

  • Class Token

    • Similar to BERT's [class] token, a learnable embedding is prepended to the sequence of patch embeddings ($z^0_0 = x_{class}$).

    • $z^0_L$, the state of this token at the output of the encoder, serves as the image representation $y$: $y = LN(z^0_L)$.

    • A classification head is attached to $z^0_L$ for pre-training and fine-tuning:

      • Pre-training: 2-layered MLP

      • Fine-tuning: 1 linear layer

  • Positional Embedding

    • To give positional information to each patch, positional embeddings are added to the patch embeddings.

    • Standard learnable 1D position embeddings are used instead of 2D-aware ones, as no significant performance gain was observed from the latter.

  • Transformer Encoder

    • The Transformer encoder consists of alternating blocks of multi-headed self-attention (MSA) layers and feed-forward (MLP) layers, as discussed in Related work.

      • $z'_l = \mathrm{MSA}(\mathrm{LN}(z_{l-1})) + z_{l-1}, \quad l = 1 \ldots L$

      • $z_l = \mathrm{MLP}(\mathrm{LN}(z'_l)) + z'_l, \quad l = 1 \ldots L$

    • In ViT, layer normalization (LN) is applied before every block and residual connections are added after every block (a small end-to-end sketch follows this list).

  • Inductive bias

    • Unlike CNNs, which assume locality, a 2D neighborhood structure, and translation equivariance, ViT has much less image-specific inductive bias.

      • MLP Layer: Locality & Equivariance assumed.

      • Self-attention Layer: Global

    • The 2D neighborhood structure is used only when cutting the image into patches and at fine-tuning time (when the position embeddings are adjusted for images of a different resolution). At initialization, the position embeddings carry no information about the 2D positions of the patches or their spatial relations.

  • Hybrid Architecture

    • Instead of using image patches, we can use the feature maps of a CNN architecture.

    • In this case, the image patches are replaced with the (flattened) feature map of a pre-trained CNN, which is then projected to $D$ dimensions as in the original architecture.
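Putting the pieces together, the following is a minimal sketch of the whole forward pass: patch embedding, class token, position embeddings, pre-norm encoder blocks, and the classification head. Class names and hyper-parameters are illustrative; this is not the official implementation (the strided convolution is simply an equivalent way of writing the flatten-and-project step).

```python
import torch
import torch.nn as nn

class ViTBlock(nn.Module):
    """Pre-norm block: z' = MSA(LN(z)) + z ;  z = MLP(LN(z')) + z'."""
    def __init__(self, dim=768, n_heads=12, mlp_dim=3072):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_dim), nn.GELU(),
                                 nn.Linear(mlp_dim, dim))

    def forward(self, z):
        h = self.norm1(z)
        z = z + self.attn(h, h, h)[0]            # z'_l = MSA(LN(z_{l-1})) + z_{l-1}
        z = z + self.mlp(self.norm2(z))          # z_l  = MLP(LN(z'_l)) + z'_l
        return z

class TinyViT(nn.Module):
    def __init__(self, image_size=224, patch=16, dim=768, depth=12, n_classes=1000):
        super().__init__()
        n_patches = (image_size // patch) ** 2
        # Patch embedding: a strided conv equals "flatten patches + linear projection E".
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))            # x_class
        self.pos_emb = nn.Parameter(torch.zeros(1, n_patches + 1, dim))  # E_pos
        self.blocks = nn.Sequential(*[ViTBlock(dim) for _ in range(depth)])
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, n_classes)                            # classification head

    def forward(self, img):                                # img: (B, 3, H, W)
        x = self.proj(img).flatten(2).transpose(1, 2)      # (B, N, dim) patch embeddings
        cls = self.cls_token.expand(x.size(0), -1, -1)
        z = torch.cat([cls, x], dim=1) + self.pos_emb      # z_0
        z = self.blocks(z)
        y = self.norm(z[:, 0])                             # y = LN(z^0_L)
        return self.head(y)

logits = TinyViT(depth=2)(torch.randn(2, 3, 224, 224))     # (2, 1000)
```

For the hybrid architecture, self.proj would instead be the feature extractor of a pre-trained CNN followed by a projection to dim.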

4. Experiment & Result

  • Dataset

    • ImageNet-1k (1k classes and 1.3M images)

    • ImageNet-21k (21k classes and 14M images)

    • JFT (18k classes and 303M high-resolution images)

  • Model Structure

    • ViT

      • ViT-Base, ViT-Large, ViT-Huge (e.g., ViT-Large/16: the "Large" parameter configuration with 16 × 16 input patches)

    • Baseline (CNNs)

      • ResNet, with the Batch Normalization layers replaced by Group Normalization and with standardized convolutions (i.e., BiT)

  • Training setup

    • Hyper-parameter

      • Optimizer: Adam ($\beta_1 = 0.9$, $\beta_2 = 0.999$)

      • Batch size: 4096

      • Weight Decay: 0.1

  • Fine-tuning

    • Hyper-parameter

      • Optimizer: SGD

      • Batch size: 512

  • Evaluation metric

    • Fine-tuning accuracy

      • Capture the performance of each model after fine-tuning

    • Few-shot accuracy

      • Obtained by solving a regularized least-squares (ridge) regression problem in closed form (used only when fine-tuning would be too costly); a small sketch follows this list.

      • Map the representation $y$ of each image to $\{-1, 1\}^K$ target vectors, then calculate the accuracy from the closed-form solution.
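A minimal sketch of such a closed-form few-shot probe is shown below, assuming frozen image representations and integer labels are already available; the function name, the $\{-1, 1\}^K$ target construction, and the regularization strength are illustrative choices, not the paper's exact recipe.

```python
import torch

def few_shot_accuracy(feats_tr, labels_tr, feats_te, labels_te, n_classes, lam=1e-3):
    """Regularized least-squares (ridge) probe on frozen features, solved in closed form."""
    # Targets in {-1, 1}^K: +1 for the true class, -1 elsewhere.
    Y = -torch.ones(feats_tr.size(0), n_classes)
    Y[torch.arange(feats_tr.size(0)), labels_tr] = 1.0
    X = feats_tr                                              # (n_train, d)
    d = X.size(1)
    # Closed form: W = (X^T X + lam * I)^-1 X^T Y
    W = torch.linalg.solve(X.T @ X + lam * torch.eye(d), X.T @ Y)
    preds = (feats_te @ W).argmax(dim=1)                      # highest score wins
    return (preds == labels_te).float().mean().item()
```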

Result

  • Comparison to SotA Baseline model

      • Among the baselines, Noisy Student is SotA on ImageNet-1k and BiT-L on all other datasets (CIFAR-10, CIFAR-100, VTAB, etc.). ViT-H/14 and ViT-L/16, pre-trained on JFT, match or outperform these baselines on the benchmark datasets while requiring substantially less compute to pre-train.

      • This can also be seen in the figure of performance versus pre-training compute for each model. The x-axis denotes the pre-training cost (computation time) and the y-axis the classification accuracy. ViTs generally outperform ResNets with the same computational budget. For smaller model sizes, hybrids perform better than pure Transformers, but the gap closes as the model size grows.

    • Left Figure: Filters of the initial linear embedding of RGB values of ViT-L/32

      • Recall the first layer of ViT linearly projects the flattened patches to a fixed low-dimensional space.

      • The principal components of the learned embedding filters resemble plausible basis functions for the fine structure within each patch, serving a role similar to early CNN filters.

    • Mid Figure: Similarity between position embeddings

      • Recall that ViT adds position embeddings to the sequence of projected patch embeddings.

      • Patches that are closer to each other show higher position-embedding similarity, implying that spatial information is properly learned.

    • Right Figure: Size of attended area by head and network depth

      • Self-attention allows ViT to integrate information across the entire image.

      • Attention distance is computed as the average distance in image space across which information is integrated, weighted by the attention weights (see the sketch after this list).

    • The model attends to regions that are semantically relevant for classification.
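For reference, one possible way to compute this mean attention distance for a single layer is sketched below, assuming the softmax attention weights over the patch tokens have already been extracted (the class token is omitted and all names are illustrative).

```python
import torch

def mean_attention_distance(attn, grid_size, patch_size=16):
    """attn: (n_heads, N, N) attention weights over N = grid_size**2 patch tokens."""
    # 2D pixel coordinates of each patch centre on the grid.
    ys, xs = torch.meshgrid(torch.arange(grid_size), torch.arange(grid_size), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=1).float() * patch_size  # (N, 2)
    dist = torch.cdist(coords, coords)               # (N, N) pairwise distances in pixels
    # Attention-weighted distance per query, then averaged over queries, per head.
    return (attn * dist).sum(dim=-1).mean(dim=-1)    # (n_heads,)

attn = torch.softmax(torch.randn(12, 196, 196), dim=-1)   # dummy weights, 14x14 patch grid
print(mean_attention_distance(attn, grid_size=14))
```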

5. Conclusion

The seemingly simple idea of applying the Transformer architecture to computer vision tasks had a massive impact on the research community because of its strong performance on benchmarks such as ImageNet, CIFAR-100, and VTAB. The key difference between ViT and CNNs is that ViT does not build image-specific inductive biases into the architecture. On small datasets CNNs usually outperform ViT, but on large datasets with many classes, the bias-free architecture and generalization power of the self-attention model shine. Since its publication, many models that use an encoder-decoder structure, including generative ones, have adopted ViT, showing how much influence it has had across multiple fields.

Take home message (오늘의 교훈)

An image is truly worth 16 × 16 words, which was striking for most of us. How about videos? :)

Author

윤여동 (Yeo Dong Youn)

  • KAIST AI

  • Mainly working on generative models, specifically representation learning

  • E-mail: yeodong@kaist.ac.kr

Reviewer

  1. Korean name (English name): Affiliation / Contact information

  2. Korean name (English name): Affiliation / Contact information

  3. ...

Reference & Additional materials

  1. Citation of this paper

     @article{DBLP:journals/corr/abs-2010-11929,
       author  = {Alexey Dosovitskiy et al.},
       title   = {An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
       journal = {CoRR},
       volume  = {abs/2010.11929},
       year    = {2020}
     }

  2. Citation of related work

     BERT

     @article{DBLP:journals/corr/abs-1810-04805,
       author  = {Jacob Devlin et al.},
       title   = {{BERT:} Pre-training of Deep Bidirectional Transformers for Language Understanding},
       journal = {CoRR},
       volume  = {abs/1810.04805},
       year    = {2018}
     }

     Attention Is All You Need

     @article{DBLP:journals/corr/VaswaniSPUJGKP17,
       author  = {Ashish Vaswani et al.},
       title   = {Attention Is All You Need},
       journal = {CoRR},
       volume  = {abs/1706.03762},
       year    = {2017}
     }

Figures referenced above (images not reproduced in this text version):

  • Figure 1: Overview of the Transformer Architecture
  • Figure 2: Masking
  • Figure 3: Vision Transformer
  • Figure 4: Model Setup
  • Figure 5: Experiment Result (ViT-H/14 & ViT-L/16 vs. BiT-L & Noisy Student; models pre-trained on either JFT or ImageNet-21k and tested on benchmark datasets for classification accuracy)
  • Figure 6: Performance Comparison
  • Inspection of Vision Transformers: some heads attend to most of the image even in the lowest layers, showing that ViT integrates information globally rather than locally, unlike a CNN
  • Official GitHub repository (Link)