Scene Text Telescope: Text-focused Scene Image Super-Resolution [Eng]
Chen et al. / Scene Text Telescope - Text-focused Scene Image Super-Resolution / CVPR 2021
Last updated
To read this review in Korean, please click **here**.
Scene Text Recognition (STR) : the task of recognizing text in scene images.
(Example applications : license-plate extraction, ID-card reading, etc.)
Despite ongoing active research on STR, recognition on low-resolution (LR) images still performs poorly.
This problem needs to be solved because LR text images arise in many situations, for example, when a photo is taken with a low-focal-length camera or when a document image is compressed to reduce disk usage.
→ To address this problem, this paper proposes a text-focused super-resolution framework, called Scene Text Telescope.
Works on Scene Text Recognition
Shi, Baoguang, Xiang Bai, and Cong Yao. "An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition." IEEE transactions on pattern analysis and machine intelligence 39.11 (2016): 2298-2304.
: combines a CNN and an RNN to obtain sequential features of text images and utilizes a CTC decoder [1] to maximize the probability of the paths that can reach the ground truth
Shi, Baoguang, et al. "Aster: An attentional scene text recognizer with flexible rectification." IEEE transactions on pattern analysis and machine intelligence 41.9 (2018): 2035-2048.
: employs a Spatial Transformer Network to rectify text images and utilizes an attention mechanism to focus on a specific character at each time-step
→ Not suitable for tackling curved texts!
Works on Text Image Super-Resolution
Mou, Yongqiang, et al. "Plugnet: Degradation aware scene text recognition supervised by a pluggable super-resolution unit." Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XV 16. Springer International Publishing, 2020.
: considers text-specific properties by designing a multi-task framework that recognizes and upsamples text images
Wang, Wenjia, et al. "Scene text image super-resolution in the wild." European Conference on Computer Vision. Springer, Cham, 2020.
: captures sequential information of text images
→ Can suffer from background disturbances, which degrade the upsampling performance on text regions!
(* Note that these related works and their limitations are the ones discussed in the paper)
This paper proposes a text-focused super-resolution framework, called Scene Text Telescope
To deal with texts in arbitrary orientations,
→ Utilize a novel backbone, named TBSRN (Transformer-Based Super-Resolution Network) to capture sequential information
To solve the background-disturbance problem,
→ Add a Position-Aware Module and a Content-Aware Module to focus on the position and content of each character
To deal with confusable characters in low-resolution images,
→ Employ a weighted cross-entropy loss in the Content-Aware Module
Works utilized in the model and evaluation
Luo, Canjie, Lianwen Jin, and Zenghui Sun. "Moran: A multi-object rectified attention network for scene text recognition." Pattern Recognition 90 (2019): 109-118.
Shi, Baoguang, Xiang Bai, and Cong Yao. "An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition." IEEE transactions on pattern analysis and machine intelligence 39.11 (2016): 2298-2304.
Shi, Baoguang, et al. "Aster: An attentional scene text recognizer with flexible rectification." IEEE transactions on pattern analysis and machine intelligence 41.9 (2018): 2035-2048.
Wang, Wenjia, et al. "Scene text image super-resolution in the wild." European Conference on Computer Vision. Springer, Cham, 2020.
The overall architecture is composed of...
Pixel-Wise Supervision Module + Position-Aware Module + Content-Aware Module
Pixel-Wise Supervision Module
The LR (Low-Resolution) image is first rectified by an STN (Spatial Transformer Network) to solve the misalignment problem [2].
Then, the rectified image goes through the TBSRN (Transformer-Based Super-Resolution Network) described below.
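As a minimal illustration of the rectification step, here is an affine-STN sketch in PyTorch. This is an illustrative simplification, not the paper's module; the rectification network used for text SR, following [2], is more elaborate.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleSTN(nn.Module):
    """Minimal affine STN: a localization network predicts 6 affine
    parameters, which are used to warp (rectify) the input image."""
    def __init__(self):
        super().__init__()
        self.loc = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 8)), nn.Flatten(),
            nn.Linear(16 * 4 * 8, 6),
        )
        # Start from the identity transform so early training is stable.
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(
            torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        theta = self.loc(x).view(-1, 2, 3)                   # affine parameters
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)   # rectified image
```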
TBSRN (Transformer-Based Super-Resolution Network)
Two CNNs : to extract feature maps
Self-Attention Module : to capture sequential information
2-D Positional Encoding : to consider spatial positional information
Finally, the feature map is upsampled to the SR (Super-Resolution) image through pixel-shuffling.
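To make this concrete, below is a minimal PyTorch sketch of a TBSRN-style block and the pixel-shuffle upsampler. The layer sizes, head count, residual wiring, and the exact form of the 2-D positional encoding are assumptions; the paper's implementation differs in detail.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def positional_encoding_2d(c: int, h: int, w: int) -> torch.Tensor:
    """Sinusoidal 2-D positional encoding of shape (c, h, w): the first half
    of the channels encodes the y-coordinate, the second half the x."""
    half = c // 2
    div = torch.exp(torch.arange(0, half, 2).float() * (-math.log(10000.0) / half))
    y = torch.arange(h).float().unsqueeze(1)   # (h, 1)
    x = torch.arange(w).float().unsqueeze(1)   # (w, 1)
    pe = torch.zeros(c, h, w)
    pe_y = torch.zeros(half, h)
    pe_y[0::2], pe_y[1::2] = torch.sin(y * div).t(), torch.cos(y * div).t()
    pe_x = torch.zeros(half, w)
    pe_x[0::2], pe_x[1::2] = torch.sin(x * div).t(), torch.cos(x * div).t()
    pe[:half] = pe_y.unsqueeze(-1).expand(-1, -1, w)
    pe[half:] = pe_x.unsqueeze(1).expand(-1, h, -1)
    return pe

class TBSRNBlock(nn.Module):
    """One TBSRN-style block (a sketch): CNN features plus self-attention
    over the flattened feature map, with the 2-D positional encoding added."""
    def __init__(self, channels: int = 64, heads: int = 4):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        b, c, h, w = feat.shape
        x = F.relu(self.conv1(feat))
        x = x + positional_encoding_2d(c, h, w).to(x.device)  # spatial position info
        seq = x.flatten(2).transpose(1, 2)          # (b, h*w, c): pixels as a sequence
        out, _ = self.attn(seq, seq, seq)           # capture sequential information
        out = out.transpose(1, 2).reshape(b, c, h, w)
        return feat + F.relu(self.conv2(out))       # residual connection (assumed)

class PixelShuffleUpsampler(nn.Module):
    """Upsample features to the 3-channel SR image via pixel-shuffling
    (scale 2 matches the 16x64 -> 32x128 setting of TextZoom [2])."""
    def __init__(self, channels: int = 64, scale: int = 2):
        super().__init__()
        self.conv = nn.Conv2d(channels, 3 * scale ** 2, 3, padding=1)
        self.shuffle = nn.PixelShuffle(scale)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        return self.shuffle(self.conv(feat))        # (b, 3, 2h, 2w)
```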
Position-Aware Module
Pretrain a Transformer-based recognition model on synthetic text datasets (including Syn90k [3] and SynthText [4])
Leverage its attended regions at each time-step as positional clues: given an HR image, the Transformer outputs a list of attention maps $\{A_1^{HR}, \dots, A_l^{HR}\}$, where $A_i^{HR}$ is the attention map at the $i$-th time-step and $l$ is the length of the text label; the generated SR image is fed into the same Transformer to obtain $\{A_1^{SR}, \dots, A_l^{SR}\}$
Employ an L1 loss to supervise the two sets of attention maps: $\mathcal{L}_{pos} = \frac{1}{l}\sum_{i=1}^{l} \lVert A_i^{HR} - A_i^{SR} \rVert_1$
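A minimal sketch of this supervision, assuming the attention maps for a text of length l are stacked into tensors of shape (l, H, W):

```python
import torch

def position_aware_loss(attn_hr: torch.Tensor, attn_sr: torch.Tensor) -> torch.Tensor:
    """L1 distance between the pre-trained recognizer's attention maps on the
    HR and SR images, averaged over the l time-steps (both: (l, H, W))."""
    return (attn_hr - attn_sr).abs().mean()
```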
Content-Aware Module
Train a VAE (Variational Autoencoder) on EMNIST [5] to obtain each character's 2-D latent representation
→ Similar characters usually lie close together in the latent space, so confusable characters can be identified and weighted accordingly
At each time-step $t$, the pre-trained Transformer generates an output vector $G_t$; the content loss over all time-steps is a weighted cross-entropy $\mathcal{L}_{con} = -\frac{1}{l}\sum_{t=1}^{l} w_{y_t}\,\log p(y_t \mid G_t)$, where $y_t$ is the ground truth at the $t$-th step and the weight $w_{y_t}$ is larger for easily confused characters
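A sketch of this weighted cross-entropy. The `char_weights` vector is a stand-in: in the paper the weights are derived from distances in the EMNIST-VAE latent space, and the exact weighting scheme is not reproduced here.

```python
import torch
import torch.nn.functional as F

def content_aware_loss(logits: torch.Tensor, target: torch.Tensor,
                       char_weights: torch.Tensor) -> torch.Tensor:
    """Weighted cross-entropy over time-steps.
    logits: (l, num_classes) output vectors G_t from the pre-trained Transformer
    target: (l,) ground-truth character indices y_t
    char_weights: (num_classes,) per-character weights; confusable characters
    (those with close latent codes) get larger weights (assumed scheme)."""
    log_p = F.log_softmax(logits, dim=-1)
    nll = -log_p.gather(1, target.unsqueeze(1)).squeeze(1)  # -log p(y_t | G_t)
    return (char_weights[target] * nll).mean()
```

Up to the normalization constant, this matches `F.cross_entropy(logits, target, weight=char_weights)`.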
Overall Loss Function
(Here, the lambdas are hyperparameters that balance the three terms)
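A form consistent with the three-term description above (the exact placement of the lambdas is an assumption) is:

$$\mathcal{L} = \mathcal{L}_{pixel} + \lambda_{1}\,\mathcal{L}_{pos} + \lambda_{2}\,\mathcal{L}_{con}$$

where $\mathcal{L}_{pixel}$ is the pixel-wise supervision, $\mathcal{L}_{pos}$ the position-aware loss, and $\mathcal{L}_{con}$ the content-aware loss.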
Dataset
TextZoom [2] : 17,367 LR-HR pairs for training + 4,373 pairs for testing (1,619 for easy subset / 1,411 for medium / 1,343 for hard)
LR images : 16 × 64 / HR images : 32 × 128
Evaluation metric
For SR images,
PSNR (Peak Signal-to-Noise Ratio)
SSIM (Structural Similarity Index Measure)
The paper also proposes metrics that focus on text regions
TR-PSNR (Text Region PSNR)
TR-SSIM (Text Region SSIM)
→ Only the pixels in the text region are taken into account (text-region masks are obtained with a U-Net [6] trained on SynthText [4])
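A minimal sketch of such a text-region metric (TR-PSNR shown; the mask is assumed to be the binary text mask produced by the U-Net above):

```python
import torch

def tr_psnr(sr: torch.Tensor, hr: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """PSNR restricted to text-region pixels.
    sr, hr: (3, H, W) images in [0, 1]; mask: (H, W) binary text-region mask."""
    mse = ((sr - hr) ** 2)[:, mask.bool()].mean()
    return 10 * torch.log10(1.0 / mse)
```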
Implementation Details
Hyperparameters
Optimizer : Adam
Batch size : 80
Learning Rate : 0.0001
GPU details : NVIDIA TITAN Xp GPUs (12GB × 4)
Ablation Study
The paper evaluates the effectiveness of each component: the backbone, the Position-Aware Module, the Content-Aware Module, etc.
Dataset : TextZoom [2]
+) Recognition Accuracy is computed by the pre-trained CRNN [7].
Results on TextZoom [2]
The model is compared with other SR models using three recognition models (CRNN [7], ASTER [8], and MORAN [9]).
As the tables below show, the recognition accuracy when utilizing TBSRN is consistently higher than that of the other SR methods.
Visualized Examples
Failure Cases
Long & Small texts
Complicated background / Occlusion
Artistic fonts / Handwritten texts
Images whose labels have not appeared in the training set
To summarize, this paper
Proposed a Text-focused Super Resolution Model (Scene Text Telescope)
Used TBSRN as a backbone, which utilizes a self-attention mechanism to handle irregular text images
Used weighted cross-entropy loss to handle confusable characters
Text-focused SR techniques can be far more effective than generic SR techniques in handling LR text images.
The ablation study and the analysis of failure cases make the paper more convincing!
박나현 (Park Na Hyeon)
NSS Lab, KAIST EE
julia19@kaist.ac.kr
[1] Graves, Alex, et al. "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks." Proceedings of the 23rd International Conference on Machine Learning. 2006.
[2] Wang, Wenjia, et al. "Scene text image super-resolution in the wild." European Conference on Computer Vision. Springer, Cham, 2020.
[3] Jaderberg, Max, et al. "Reading text in the wild with convolutional neural networks." International Journal of Computer Vision 116.1 (2016): 1-20.
[4] Gupta, Ankush, Andrea Vedaldi, and Andrew Zisserman. "Synthetic data for text localisation in natural images." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
[5] Cohen, Gregory, et al. "EMNIST: Extending MNIST to handwritten letters." 2017 International Joint Conference on Neural Networks (IJCNN). IEEE, 2017.
[6] Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. "U-Net: Convolutional networks for biomedical image segmentation." International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, Cham, 2015.
[7] Shi, Baoguang, Xiang Bai, and Cong Yao. "An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition." IEEE Transactions on Pattern Analysis and Machine Intelligence 39.11 (2016): 2298-2304.
[8] Shi, Baoguang, et al. "ASTER: An attentional scene text recognizer with flexible rectification." IEEE Transactions on Pattern Analysis and Machine Intelligence 41.9 (2018): 2035-2048.
[9] Luo, Canjie, Lianwen Jin, and Zenghui Sun. "MORAN: A multi-object rectified attention network for scene text recognition." Pattern Recognition 90 (2019): 109-118.