Scene Text Telescope: Text-focused Scene Image Super-Resolution [Eng]

Chen et al. / Scene Text Telescope - Text-focused Scene Image Super-Resolution / CVPR 2021

To read the Korean version of this review, click **here**.

1. Problem definition

Scene Text Recognition (STR) : the task of recognizing text in scene images.

(Example applications : extracting car license plates, reading ID cards, etc.)

  • Despite active ongoing research on STR tasks, recognition on low-resolution (LR) images still shows subpar performance.

  • This needs to be solved since LR text images occur in many situations, for example, when a photo is taken with a low-focal-length camera or when a document image is compressed to reduce disk usage.

    → To address this problem, this paper proposes a text-focused super-resolution framework, called Scene Text Telescope.

2. Motivation

  • Works on Scene Text Recognition

    • Shi, Baoguang, Xiang Bai, and Cong Yao. "An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition." IEEE transactions on pattern analysis and machine intelligence 39.11 (2016): 2298-2304.

      : combines CNN and RNN to obtain sequential features of text images and utilizes CTC decoder [1] to maximize the probability of paths that can reach the ground truth

    • Shi, Baoguang, et al. "Aster: An attentional scene text recognizer with flexible rectification." IEEE transactions on pattern analysis and machine intelligence 41.9 (2018): 2035-2048.

      : employs a Spatial Transformer Network to rectify text images and uses an attention mechanism to focus on a specific character at each time step

      → Not suitable for tackling curved texts!

  • Works on Text Image Super-Resolution

    • Mou, Yongqiang, et al. "Plugnet: Degradation aware scene text recognition supervised by a pluggable super-resolution unit." Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XV 16. Springer International Publishing, 2020.

      : considers text-specific properties by designing multi-task framework to recognize and upsample text images

    • Wang, Wenjia, et al. "Scene text image super-resolution in the wild." European Conference on Computer Vision. Springer, Cham, 2020.

      : captures sequential information of text images

      → Can suffer from background disturbances, which can degrade the performance of upsampling on text regions!

(* Note that these related works and their limitations are mentioned by the paper)

Idea

This paper proposes a text-focused super-resolution framework called Scene Text Telescope.

  1. To deal with texts in arbitrary orientations,

    Utilize a novel backbone, named TBSRN (Transformer-Based Super-Resolution Network) to capture sequential information

  2. To solve background disturbance problem,

    Add a Position-Aware Module and a Content-Aware Module to focus on the position and the content of each character

  3. To deal with confusable characters in low-resolution images,

    Employ a weighted cross-entropy loss in Content-Aware Module

  • Works that are utilized on model and evaluation

    • Luo, Canjie, Lianwen Jin, and Zenghui Sun. "Moran: A multi-object rectified attention network for scene text recognition." Pattern Recognition 90 (2019): 109-118.

    • Shi, Baoguang, Xiang Bai, and Cong Yao. "An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition." IEEE transactions on pattern analysis and machine intelligence 39.11 (2016): 2298-2304.

    • Shi, Baoguang, et al. "Aster: An attentional scene text recognizer with flexible rectification." IEEE transactions on pattern analysis and machine intelligence 41.9 (2018): 2035-2048.

    • Wang, Wenjia, et al. "Scene text image super-resolution in the wild." European Conference on Computer Vision. Springer, Cham, 2020.

3. Method

The overall architecture is composed of...

Pixel-Wise Supervision Module + Position-Aware Module + Content-Aware Module

  • Pixel-Wise Supervision Module

    1. LR (Low-Resolution) image is rectified by a STN (Spatial Transformer Network) to solve misalignment problem [2].

    2. Then, rectified image goes through TBSRN (Transformer-based Super-Resolution Networks).

      TBSRN (Transformer-based Super-Resolution Networks)

      • Two CNNs : to extract feature map

      • Self-Attention Module : to capture sequential information

      • 2-D Positional Encoding : to consider spatial positional information

    3. Finally, the image gets upsampled to SR (Super-Resolution) image through pixel-shuffling.
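The pixel-shuffling (sub-pixel) upsampling in step 3 can be sketched in NumPy. The shapes below follow the TextZoom setting (16 × 64 LR → 32 × 128 SR), but the channel count is illustrative, not taken from the paper:

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a (C*r^2, H, W) feature map into a (C, H*r, W*r) image,
    as done by sub-pixel (pixel-shuffle) upsampling."""
    c2, h, w = x.shape
    c = c2 // (r * r)
    x = x.reshape(c, r, r, h, w)      # split channels into (C, r, r)
    x = x.transpose(0, 3, 1, 4, 2)    # interleave: (C, H, r, W, r)
    return x.reshape(c, h * r, w * r)

# a 16x64 feature map with 3*2^2 channels upsamples to a 3x32x128 SR image
feat = np.random.rand(3 * 4, 16, 64)
sr = pixel_shuffle(feat, 2)
print(sr.shape)  # (3, 32, 128)
```

This matches the usual sub-pixel convolution layout, where output pixel (h·r + i, w·r + j) of channel c comes from input channel c·r² + i·r + j.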

  • Position-Aware Module

    1. Pretrain a Transformer-based recognition model using synthetic text datasets (including Syn90k [3] and SynthText [4])

    2. Leverage its attending regions at each time-step as positional clues

    3. Employ L1 loss to supervise two attention maps
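The L1 supervision in step 3 compares the recognizer's attention maps on the SR output against those on the HR ground truth. A minimal sketch (shapes and names are illustrative, assuming one attention map per decoding time step):

```python
import numpy as np

def attention_l1_loss(attn_sr, attn_hr):
    """Mean absolute difference between the attention maps the frozen
    recognizer produces on the SR image and on the HR ground truth.
    Shapes: (T, H, W) -- one spatial map per decoding time step."""
    return np.abs(attn_sr - attn_hr).mean()

# toy example: T=5 decoding steps over an 8x25 attention grid
rng = np.random.default_rng(0)
attn_sr = rng.random((5, 8, 25))
attn_hr = rng.random((5, 8, 25))
loss = attention_l1_loss(attn_sr, attn_hr)
```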

  • Content-Aware Module

    1. Train a VAE (Variational Autoencoder) on EMNIST [5] to obtain a 2-D latent representation of each character

      → Latent representations of similar-looking characters usually lie close together in the latent space

    2. Use the latent-space distances to weight the cross-entropy loss, so that easily confusable characters are penalized more heavily
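One way such latent-space distances could drive a weighted cross-entropy loss is sketched below; the paper's exact weighting scheme may differ, and all names and values here are illustrative:

```python
import numpy as np

def weighted_ce(logits, target, latents):
    """Cross-entropy on one character, weighted by how confusable the
    target is: the closer its 2-D VAE latent lies to another character's
    latent, the larger the weight.  (Sketch; the paper's exact scheme
    may differ.)"""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    d = np.linalg.norm(latents - latents[target], axis=1)
    d[target] = np.inf                 # ignore distance to itself
    w = 1.0 / (1.0 + d.min())          # nearer neighbour -> larger weight
    return -w * np.log(p[target])

# toy alphabet with 2-D latents; the first two characters (think 'O'
# and '0') are nearly identical, the other two are far apart
lat = np.array([[0.0, 0.0], [0.1, 0.0], [3.0, 3.0], [-3.0, 2.0]])
logits = np.array([2.0, 1.0, 0.1, 0.1])
loss_confusable = weighted_ce(logits, 0, lat)            # heavily weighted
loss_distinct = weighted_ce(np.roll(logits, 2), 2, lat)  # lightly weighted
```

With identical predicted probabilities, the confusable character yields a larger loss, pushing the model to discriminate confusable pairs harder.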

  • Overall Loss Function

    (Here, lambdas are hyperparameters to balance three terms)
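The equation itself is not reproduced here; a sketch consistent with the description above, where the pixel loss comes from the Pixel-Wise Supervision Module, the position loss from the Position-Aware Module, and the content loss from the Content-Aware Module (symbol names are mine, not the paper's):

```latex
\mathcal{L} = \mathcal{L}_{\text{pixel}} + \lambda_{1}\,\mathcal{L}_{\text{pos}} + \lambda_{2}\,\mathcal{L}_{\text{cont}}
```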


4. Experiment & Result

Experimental setup

  • Dataset

    TextZoom [2] : 17,367 LR-HR pairs for training + 4,373 pairs for testing (1,619 for easy subset / 1,411 for medium / 1,343 for hard)

    LR images : 16 × 64 / HR images : 32 × 128

  • Evaluation metric

    For SR images,

    • PSNR (Peak Signal-to-Noise Ratio)

    • SSIM (Structural Similarity Index Measure)

    The paper also proposes metrics that focus on text regions:

    • TR-PSNR (Text Region PSNR)

    • TR-SSIM (Text Region SSIM)

    → Only the pixels in the text region are considered (the text mask is obtained using SynthText [4] and U-Net [6])
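A minimal NumPy sketch of PSNR and its text-region variant; the mask here is synthetic for illustration, whereas the paper obtains it with SynthText [4] and a U-Net [6]:

```python
import numpy as np

def psnr(sr, hr, mask=None):
    """PSNR in dB for images with values in [0, 1].  If a binary
    text-region mask is given, only masked pixels contribute
    (a TR-PSNR-style evaluation)."""
    if mask is not None:
        sr, hr = sr[mask], hr[mask]
    mse = np.mean((sr - hr) ** 2)
    return 10 * np.log10(1.0 / mse)

# toy 32x128 HR image with a slightly degraded SR estimate
hr = np.random.rand(32, 128)
sr = np.clip(hr + 0.05 * np.random.randn(32, 128), 0, 1)
mask = np.zeros((32, 128), dtype=bool)
mask[8:24, 16:112] = True          # pretend this is the text region
full_psnr = psnr(sr, hr)
tr_psnr = psnr(sr, hr, mask)
```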

  • Implementation Details

    HyperParameters

    • Optimizer : Adam

    • Batch size : 80

    • Learning Rate : 0.0001

    GPU details : NVIDIA TITAN Xp GPUs (12GB × 4)

Result

  • Ablation Study

    • This paper evaluated the effectiveness of each component: the backbone, the Position-Aware Module, the Content-Aware Module, etc.

    • Dataset : TextZoom [2]

      +) Recognition Accuracy is computed by the pre-trained CRNN [7].

  • Results on TextZoom [2]

    • Compared the model with other SR models on three recognition models (CRNN [7], ASTER [8], and MORAN [9])

    • As the tables below show, recognition accuracy when utilizing TBSRN is consistently higher than with the other SR models.

    • Visualized Examples

  • Failure Cases

    • Long & Small texts

    • Complicated background / Occlusion

    • Artistic fonts / Handwritten texts

    • Images whose labels have not appeared in the training set

5. Conclusion

  • To summarize, this paper

    • Proposed a Text-focused Super Resolution Model (Scene Text Telescope)

      • Used TBSRN as a backbone which utilizes self-attention mechanism to handle irregular text images

      • Used weighted cross-entropy loss to handle confusable characters

Take home message

  • Text-focused SR techniques can be much more effective at handling LR text images than generic SR techniques.

  • An ablation study and an explanation of failure cases can make a paper look fancy!

Author

박나현 (Park Na Hyeon)

  • NSS Lab, KAIST EE

  • julia19@kaist.ac.kr


Reference & Additional materials

  1. Graves, Alex, et al. "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks." Proceedings of the 23rd international conference on Machine learning. 2006.

  2. Wang, Wenjia, et al. "Scene text image super-resolution in the wild." European Conference on Computer Vision. Springer, Cham, 2020.

  3. Jaderberg, Max, et al. "Reading text in the wild with convolutional neural networks." International journal of computer vision 116.1 (2016): 1-20.

  4. Gupta, Ankush, Andrea Vedaldi, and Andrew Zisserman. "Synthetic data for text localisation in natural images." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.

  5. Cohen, Gregory, et al. "EMNIST: Extending MNIST to handwritten letters." 2017 International Joint Conference on Neural Networks (IJCNN). IEEE, 2017.

  6. Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. "U-net: Convolutional networks for biomedical image segmentation." International Conference on Medical image computing and computer-assisted intervention. Springer, Cham, 2015.

  7. Shi, Baoguang, Xiang Bai, and Cong Yao. "An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition." IEEE transactions on pattern analysis and machine intelligence 39.11 (2016): 2298-2304.

  8. Shi, Baoguang, et al. "Aster: An attentional scene text recognizer with flexible rectification." IEEE transactions on pattern analysis and machine intelligence 41.9 (2018): 2035-2048.

  9. Luo, Canjie, Lianwen Jin, and Zenghui Sun. "Moran: A multi-object rectified attention network for scene text recognition." Pattern Recognition 90 (2019): 109-118.
