Scene Text Telescope: Text-focused Scene Image Super-Resolution [Eng]

Chen et al. / Scene Text Telescope - Text-focused Scene Image Super-Resolution / CVPR 2021


To read the review written in Korean, click ****.

1. Problem definition

Scene Text Recognition (STR): the task of recognizing text in scene images.

(Example applications: car license plate extraction, ID card reading, etc.)

  • Despite active ongoing research on STR tasks, recognition on low-resolution (LR) images still shows subpar performance.

  • This needs to be solved since LR text images arise in many situations, for example, when a photo is taken with a low-focal-length camera or when a document image is compressed to reduce disk usage.

    → To address this problem, this paper proposes a text-focused super-resolution framework, called Scene Text Telescope.

2. Motivation

Related work

  • Works on Scene Text Recognition

    • Shi, Baoguang, Xiang Bai, and Cong Yao. "An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition." IEEE transactions on pattern analysis and machine intelligence 39.11 (2016): 2298-2304.

      : combines CNN and RNN to obtain sequential features of text images and utilizes CTC decoder [1] to maximize the probability of paths that can reach the ground truth

    • Shi, Baoguang, et al. "Aster: An attentional scene text recognizer with flexible rectification." IEEE transactions on pattern analysis and machine intelligence 41.9 (2018): 2035-2048.

      : employs a Spatial Transformer Network to rectify text images and utilizes an attention mechanism to focus on a specific character at each time step

      → Not suitable for tackling curved texts!

  • Works on Text Image Super-Resolution

    • Mou, Yongqiang, et al. "Plugnet: Degradation aware scene text recognition supervised by a pluggable super-resolution unit." Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XV 16. Springer International Publishing, 2020.

      : considers text-specific properties by designing a multi-task framework to recognize and upsample text images

    • Wang, Wenjia, et al. "Scene text image super-resolution in the wild." European Conference on Computer Vision. Springer, Cham, 2020.

      : captures sequential information of text images

      → Can suffer from background disturbance, which degrades the quality of the upsampled text!

(* Note that these related works and their limitations are mentioned by the paper)

Idea

This paper proposes a text-focused super-resolution framework called Scene Text Telescope:

  1. To deal with texts in arbitrary orientations,

    → Utilize a novel backbone, named TBSRN (Transformer-Based Super-Resolution Network) to capture sequential information

  2. To solve background disturbance problem,

    → Add a Position-Aware Module and a Content-Aware Module to focus on the position and content of each character

  3. To deal with confusable characters in low-resolution images,

    → Employ a weighted cross-entropy loss in Content-Aware Module

  • Works that are utilized in the model and evaluation

    • Luo, Canjie, Lianwen Jin, and Zenghui Sun. "Moran: A multi-object rectified attention network for scene text recognition." Pattern Recognition 90 (2019): 109-118.

    • Shi, Baoguang, Xiang Bai, and Cong Yao. "An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition." IEEE transactions on pattern analysis and machine intelligence 39.11 (2016): 2298-2304.

    • Shi, Baoguang, et al. "Aster: An attentional scene text recognizer with flexible rectification." IEEE transactions on pattern analysis and machine intelligence 41.9 (2018): 2035-2048.

    • Wang, Wenjia, et al. "Scene text image super-resolution in the wild." European Conference on Computer Vision. Springer, Cham, 2020.

3. Method

The overall architecture is composed of...

Pixel-Wise Supervision Module + Position-Aware Module + Content-Aware Module

  • Pixel-Wise Supervision Module

    1. The LR (Low-Resolution) image is rectified by an STN (Spatial Transformer Network) to solve the misalignment problem [2].

    2. Then, the rectified image goes through the TBSRN (Transformer-based Super-Resolution Network).

      TBSRN (Transformer-based Super-Resolution Network)

      • Two CNNs : to extract feature map

      • Self-Attention Module : to capture sequential information

      • 2-D Positional Encoding : to consider spatial positional information

    3. Finally, the image is upsampled to an SR (Super-Resolution) image through pixel-shuffling.

  • Position-Aware Module

    1. Pretrain a Transformer-based recognition model using synthetic text datasets (including Syn90k [3] and SynthText [4])

    2. Leverage its attending regions at each time-step as positional clues

    3. Employ an L1 loss to supervise the two sets of attention maps (from the HR and SR images)

  • Content-Aware Module

    1. Train a VAE (Variational Autoencoder) using EMNIST [5] to obtain each character's 2D latent representation

      → Positions of similar characters are usually close in the latent space

  • Overall Loss Function

    (Here, the lambdas are hyperparameters that balance the three loss terms; a sketch of the combined objective is given below.)
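The equation image for the overall objective is not reproduced on this page. A plausible form, consistent with the three modules described above (the symbols $\lambda_{pos}$ and $\lambda_{cont}$ are this review's notation, not necessarily the paper's), is:

```latex
% Hedged sketch of the overall objective: a pixel-wise reconstruction term
% plus the two text-focused terms, balanced by the lambda hyperparameters.
\mathcal{L}_{total} = \mathcal{L}_{pixel}
                    + \lambda_{pos}  \, \mathcal{L}_{pos}
                    + \lambda_{cont} \, \mathcal{L}_{cont}
```

The pixel-wise term comes from the Pixel-Wise Supervision Module, while the other two terms come from the Position-Aware and Content-Aware Modules, respectively.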

4. Experiment & Result

Experimental setup

  • Dataset

    TextZoom [2] : 17,367 LR-HR pairs for training + 4,373 pairs for testing (1,619 for easy subset / 1,411 for medium / 1,343 for hard)

    LR images : 16 × 64 / HR images : 32 × 128

  • Evaluation metric

    For SR images,

    • PSNR (Peak Signal-to-Noise Ratio)

    • SSIM (Structural Similarity Index Measure)

    The paper also proposes metrics that focus on text regions (a sketch of a masked-PSNR computation appears after this setup):

    • TR-PSNR (Text Region PSNR)

    • TR-SSIM (Text Region SSIM)

    → Only the pixels in the text region are taken into account (the text-region mask is obtained by utilizing SynthText [4] and a U-Net [6])

  • Implementation Details

    Hyperparameters

    • Optimizer : Adam

    • Batch size : 80

    • Learning Rate : 0.0001

    GPU details : NVIDIA TITAN Xp GPUs (12GB × 4)
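As a concrete illustration of the text-region metrics above, the sketch below computes a masked PSNR in the spirit of TR-PSNR. It is a minimal sketch, assuming the binary text-region mask is already available (the paper obtains it via SynthText [4] and a U-Net [6]); the function and variable names are this review's own, not the authors' code.

```python
import numpy as np

def masked_psnr(sr: np.ndarray, hr: np.ndarray, mask: np.ndarray,
                max_val: float = 1.0) -> float:
    """PSNR computed only over pixels where mask == 1 (the text region).

    sr, hr : float arrays in [0, max_val] with the same shape, e.g. (H, W, C).
    mask   : binary array of the same shape, 1 inside the text region.
    """
    mask = mask.astype(bool)
    diff = (sr - hr)[mask]              # keep only text-region pixels
    mse = float(np.mean(diff ** 2))
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

# Toy usage with TextZoom-sized HR images (32 x 128) and a hypothetical text box.
hr = np.random.rand(32, 128, 3)
sr = np.clip(hr + 0.05 * np.random.randn(32, 128, 3), 0.0, 1.0)
mask = np.zeros((32, 128, 3))
mask[8:24, 16:112, :] = 1.0
print(masked_psnr(sr, hr, mask))
```

A TR-SSIM analogue would similarly restrict the SSIM computation to the masked text region.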

Result

  • Ablation Study

    • The paper evaluates the effectiveness of each component: the backbone, the Position-Aware Module, the Content-Aware Module, etc.

    • Dataset : TextZoom [2]

      +) Recognition accuracy is computed using the pre-trained CRNN [7].

  • Results on TextZoom [2]

    • The model is compared with other SR models using three recognition models (CRNN [7], ASTER [8], and MORAN [9]).

    • As the result tables show, the recognition accuracy when utilizing TBSRN is relatively higher than with the other SR methods.

    • Visualized Examples

  • Failure Cases

    • Long & Small texts

    • Complicated background / Occlusion

    • Artistic fonts / Handwritten texts

    • Images whose labels have not appeared in the training set

5. Conclusion

  • To summarize, this paper

    • Proposed a Text-focused Super Resolution Model (Scene Text Telescope)

      • Used TBSRN as a backbone, which utilizes a self-attention mechanism to handle irregular text images

      • Used a weighted cross-entropy loss to handle confusable characters

Take home message

  • A text-focused SR technique can be much more effective at handling LR text images than generic SR techniques.

  • An ablation study and an explanation of failure cases can make a paper look fancy!

Author

박나현 (Park Na Hyeon)

  • NSS Lab, KAIST EE

  • julia19@kaist.ac.kr

Reviewer

  1. Korean name (English name): Affiliation / Contact information

  2. Korean name (English name): Affiliation / Contact information

  3. ...

Reference & Additional materials

  1. Graves, Alex, et al. "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks." Proceedings of the 23rd international conference on Machine learning. 2006.

  2. Wang, Wenjia, et al. "Scene text image super-resolution in the wild." European Conference on Computer Vision. Springer, Cham, 2020.

  3. Jaderberg, Max, et al. "Reading text in the wild with convolutional neural networks." International journal of computer vision 116.1 (2016): 1-20.

  4. Gupta, Ankush, Andrea Vedaldi, and Andrew Zisserman. "Synthetic data for text localisation in natural images." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.

  5. Cohen, Gregory, et al. "EMNIST: Extending MNIST to handwritten letters." 2017 International Joint Conference on Neural Networks (IJCNN). IEEE, 2017.

  6. Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. "U-net: Convolutional networks for biomedical image segmentation." International Conference on Medical image computing and computer-assisted intervention. Springer, Cham, 2015.

  7. Shi, Baoguang, Xiang Bai, and Cong Yao. "An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition." IEEE transactions on pattern analysis and machine intelligence 39.11 (2016): 2298-2304.

  8. Shi, Baoguang, et al. "Aster: An attentional scene text recognizer with flexible rectification." IEEE transactions on pattern analysis and machine intelligence 41.9 (2018): 2035-2048.

  9. Luo, Canjie, Lianwen Jin, and Zenghui Sun. "Moran: A multi-object rectified attention network for scene text recognition." Pattern Recognition 90 (2019): 109-118.

+) In this module, $I_{LR}$, $I_{SR}$, and $I_{HR}$ are the images of each resolution.

Given an HR image, the Transformer outputs a list of attention maps $\{A^{HR}_1, \dots, A^{HR}_l\}$ (where $A^{HR}_i$ = attention map at the $i$-th time-step and $l$ = length of its text label).

The generated SR image is also fed into the Transformer to obtain $\{A^{SR}_1, \dots, A^{SR}_l\}$; the position-aware loss supervises these two sets of attention maps with an L1 distance.

Assume that at each time-step $t$, the pre-trained Transformer generates an output vector $y_t$. The content loss over all time-steps is computed as a weighted cross-entropy between $y_t$ and $\hat{y}_t$ (= ground truth at the $t$-th step).
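To make these two supervision signals concrete, below is a minimal PyTorch-style sketch of how they might be implemented. It assumes the attention maps are stacked as tensors of shape (batch, steps, H, W) and that a per-character weight vector has already been derived from the EMNIST VAE latent space; all function and parameter names, as well as the averaging choices, are assumptions of this review rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def position_aware_loss(attn_sr: torch.Tensor, attn_hr: torch.Tensor) -> torch.Tensor:
    """L1 distance between SR and HR attention maps, averaged over all entries.

    attn_sr, attn_hr: (batch, steps, H, W) attention maps produced by the
    frozen, pre-trained Transformer recognizer.
    """
    return F.l1_loss(attn_sr, attn_hr)

def content_aware_loss(logits: torch.Tensor, targets: torch.Tensor,
                       char_weights: torch.Tensor) -> torch.Tensor:
    """Weighted cross-entropy over all time-steps.

    logits       : (batch, steps, num_chars) recognizer outputs on the SR image.
    targets      : (batch, steps) ground-truth character indices.
    char_weights : (num_chars,) per-class weights, larger for confusable
                   characters (derived from distances in the EMNIST VAE
                   latent space).
    """
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),   # (batch * steps, num_chars)
        targets.reshape(-1),                   # (batch * steps,)
        weight=char_weights,
    )
```

At training time, these two terms would be combined with the pixel-wise loss using the lambda weights from the overall objective in Section 3.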
