Scene Text Telescope: Text-focused Scene Image Super-Resolution

Chen et al. / Scene Text Telescope - Text-focused Scene Image Super-Resolution / CVPR 2021


1. Problem definition

Scene Text Recognition (STR) is the task of recognizing text in everyday scene images.

(Example applications: reading the text on a driver's license, recognizing characters on an ID card, etc.)

Figure 1
  • Although STR has been an active research area recently, performance on low-resolution (LR) images still leaves much room for improvement.

  • Nevertheless, LR text images are quite common in real life, for example images taken with an out-of-focus camera, or text images that were inevitably compressed to save storage.

    → To address this problem, this paper proposes a text-focused Super-Resolution (SR) framework.

2. Motivation

  • Scene Text Recognition

    • Shi, Baoguang, Xiang Bai, and Cong Yao. "An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition." IEEE transactions on pattern analysis and machine intelligence 39.11 (2016): 2298-2304.

      : ์ด ๋…ผ๋ฌธ์—์„œ๋Š” CNN๊ณผ RNN์„ ๊ฒฐํ•ฉํ•˜์—ฌ ํ…์ŠคํŠธ ์ด๋ฏธ์ง€์—์„œ sequentialํ•œ ํŠน์ง•์„ ๊ตฌํ–ˆ์œผ๋ฉฐ, CTC decoder [1]๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ground truth์— ๊ฐ€์žฅ ๊ฐ€๊น๊ฒŒ ์ ‘๊ทผํ•  ์ˆ˜ ์žˆ๋Š” path๋ฅผ ์„ ํƒํ•  ํ™•๋ฅ ์„ ์ตœ๋Œ€ํ™”ํ–ˆ๋‹ค๊ณ  ํ•ฉ๋‹ˆ๋‹ค.

    • Shi, Baoguang, et al. "Aster: An attentional scene text recognizer with flexible rectification." IEEE transactions on pattern analysis and machine intelligence 41.9 (2018): 2035-2048.

      : ์ด ๋…ผ๋ฌธ์—์„œ๋Š” Spatial Transformer Network๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ํ…์ŠคํŠธ ์ด๋ฏธ์ง€๋ฅผ ์–ด๋А์ •๋„ rectifyํ•˜๊ณ  attention mechanism์„ ํ™œ์šฉํ•˜์—ฌ ๊ฐ ํƒ€์ž„์Šคํ…๋งˆ๋‹ค ํŠน์ • ๋ฌธ์ž์— ์ดˆ์ ์„ ๋‘์—ˆ๋‹ค๊ณ  ํ•ฉ๋‹ˆ๋‹ค.

      โ†’ ํ•˜์ง€๋งŒ ์œ„์˜ ๋…ผ๋ฌธ๋“ค ๊ฒฝ์šฐ ์ด๋ฏธ์ง€์—์„œ ํœ˜์–ด์žˆ๋Š”(curved) ํ…์ŠคํŠธ๋“ค์„ ์ฒ˜๋ฆฌํ•˜๊ธฐ์—๋Š” ์ ํ•ฉํ•˜์ง€ ์•Š๋‹ค๊ณ  ํ•ฉ๋‹ˆ๋‹ค.

  • Text Image Super-Resolution

    • Mou, Yongqiang, et al. "Plugnet: Degradation aware scene text recognition supervised by a pluggable super-resolution unit." Computer Visionโ€“ECCV 2020: 16th European Conference, Glasgow, UK, August 23โ€“28, 2020, Proceedings, Part XV 16. Springer International Publishing, 2020.

      : ์ด ๋…ผ๋ฌธ์—์„œ๋Š” multi-task ํ”„๋ ˆ์ž„์›Œํฌ๋ฅผ ๊ณ ์•ˆํ•˜์—ฌ text-specificํ•œ ํŠน์ง•๋“ค์„ ๊ณ ๋ คํ•˜์˜€๋‹ค๊ณ  ํ•ฉ๋‹ˆ๋‹ค.

    • Wang, Wenjia, et al. "Scene text image super-resolution in the wild." European Conference on Computer Vision. Springer, Cham, 2020.

      : ์ด ๋…ผ๋ฌธ์˜ ๊ฒฝ์šฐ์—๋Š” text SR ๋ฐ์ดํ„ฐ์…‹์ธ _TextZoom_์„ ์ œ์•ˆํ•˜๊ณ , _TSRN_์ด๋ผ๋Š” SR๋„คํŠธ์›Œํฌ๋ฅผ ์ œ์•ˆํ–ˆ์Šต๋‹ˆ๋‹ค.

      โ†’ ํ•˜์ง€๋งŒ, ์ด ๋‘๊ฐ€์ง€ ๋…ผ๋ฌธ์˜ ๊ฒฝ์šฐ ์ด๋ฏธ์ง€์˜ ๋ชจ๋“  ํ”ฝ์…€์„ ๊ณ ๋ คํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๋ฐฐ๊ฒฝ์œผ๋กœ ์ธํ•œ disturbance ๋ฌธ์ œ๊ฐ€ ์ƒ๊ธธ ์ˆ˜ ์žˆ์œผ๋ฉฐ, ์ด๋Š” ํ…์ŠคํŠธ๋ฅผ upsamplingํ–ˆ์„ ๋•Œ ์„ฑ๋Šฅ ๋ฌธ์ œ๋ฅผ ์•ผ๊ธฐํ•  ์ˆ˜ ์žˆ๋‹ค๊ณ  ํ•ฉ๋‹ˆ๋‹ค.

Idea

๊ธฐ๋ณธ์ ์œผ๋กœ, ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” Scene Text Telescope (ํ…์ŠคํŠธ์— ์ดˆ์ ์„ ๋งž์ถ˜ SR ํ”„๋ ˆ์ž„์›Œํฌ)๋ฅผ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค.

  • ๋จผ์ €, ์ž„์˜์˜ ๋ฐฉํ–ฅ์œผ๋กœ ํšŒ์ „๋˜์–ด์žˆ๋Š” ํ…์ŠคํŠธ๋ฅผ ์ฒ˜๋ฆฌํ•˜๊ธฐ ์œ„ํ•ด, TBSRN (Transformer-Based Super-Resolution Network) ์„ ๊ณ ์•ˆํ•˜์—ฌ ํ…์ŠคํŠธ์˜ sequentialํ•œ information์„ ๊ณ ๋ คํ–ˆ์Šต๋‹ˆ๋‹ค

  • ๋˜ํ•œ, ์œ„์—์„œ ์–ธ๊ธ‰ํ–ˆ๋˜ ์ด๋ฏธ์ง€ ๋ฐฐ๊ฒฝ์œผ๋กœ ์ธํ•œ disturbance๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด, SR์„ ์ด๋ฏธ์ง€ ์ „์ฒด์— ์ง‘์ค‘ํ•˜์—ฌ ํ•˜๊ธฐ๋ณด๋‹ค๋Š” ํ…์ŠคํŠธ์— ์ดˆ์ ์„ ๋‘์—ˆ์Šต๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ, ํ…์ŠคํŠธ ๊ฐ ๋ฌธ์ž์˜ position๊ณผ content๋ฅผ ๊ณ ๋ คํ•˜๋Š” Position-Aware Module ๊ณผ Content-Aware Module ์„ ๋‘์—ˆ์Šต๋‹ˆ๋‹ค.

  • ๋‚˜์•„๊ฐ€, LR ์ด๋ฏธ์ง€์—์„œ ํ—ท๊ฐˆ๋ฆด ์ˆ˜ ์žˆ๋Š” ๋ฌธ์ž๋“ค์„ ๊ณ ๋ คํ•˜์—ฌ Content-Aware Module ์—์„œ weighted cross-entropy loss ๋ฅผ ์‚ฌ์šฉํ–ˆ์Šต๋‹ˆ๋‹ค.

  • ์ถ”๊ฐ€์ ์œผ๋กœ, ์•„๋ž˜์˜ ๋…ผ๋ฌธ๋“ค์€ ๋ณธ ๋…ผ๋ฌธ์˜ Model ๊ณผ Evaluation์—์„œ ์ฐธ๊ณ ๋œ ๋…ผ๋ฌธ๋“ค์ž…๋‹ˆ๋‹ค.

    • Luo, Canjie, Lianwen Jin, and Zenghui Sun. "Moran: A multi-object rectified attention network for scene text recognition." Pattern Recognition 90 (2019): 109-118.

    • Shi, Baoguang, Xiang Bai, and Cong Yao. "An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition." IEEE transactions on pattern analysis and machine intelligence 39.11 (2016): 2298-2304.

    • Shi, Baoguang, et al. "Aster: An attentional scene text recognizer with flexible rectification." IEEE transactions on pattern analysis and machine intelligence 41.9 (2018): 2035-2048.

    • Wang, Wenjia, et al. "Scene text image super-resolution in the wild." European Conference on Computer Vision. Springer, Cham, 2020.

3. Method

Scene Text Telescope consists of the following three modules:

โ†’ Pixel-Wise Supervision Module + Position-Aware Module + Content-Aware Module

Figure 2
  • Pixel-Wise Supervision Module

    1. First, the LR image passes through an STN (Spatial Transformer Network) to resolve the misalignment problem mentioned in [2].

    2. The rectified image then passes through TBSRN, whose structure is shown in the figure below.

      TBSRN (Transformer-Based Super-Resolution Network)

      Figure 3

      • CNN × 2 : extracts feature maps

      • Self-Attention Module : accounts for sequential information

      • 2-D Positional Encoding : injects spatial / positional information

    3. Finally, the image is upsampled to SR via pixel shuffling.

      +) In this module, the loss is an L2 loss between the two images, $\mathcal{L}_{pixel} = \lVert I_{HR} - I_{SR} \rVert_2^2$, where $I_{HR}$ and $I_{SR}$ denote the HR and SR images, respectively.

  • Position-Aware Module

    1. The Position-Aware Module first pre-trains a transformer-based recognition model on synthetic text datasets (Syn90k [3], SynthText [4], etc.).

    2. The attending region at each time step is then used as a positional clue.

      • Given the HR image, the transformer outputs a list of attention maps $\{A^{HR}_1, A^{HR}_2, \dots, A^{HR}_L\}$, where $A^{HR}_i$ is the attention map at the i-th time step and $L$ is the length of the text label.

      • The SR image is also passed through the transformer to obtain $\{A^{SR}_1, A^{SR}_2, \dots, A^{SR}_L\}$.

    3. An L1 loss is computed between the attention maps obtained above:

    $\mathcal{L}_{pos} = \sum_{i=1}^{L} \lVert A^{HR}_i - A^{SR}_i \rVert_1$
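As a sketch (our own notation, not the official implementation), the position-aware loss simply accumulates the L1 distance between the two lists of attention maps:

```python
import numpy as np

def position_aware_loss(att_hr, att_sr):
    """Sum of L1 distances between HR and SR attention maps over time steps.

    att_hr, att_sr: lists of (H, W) attention maps, one per decoding step.
    """
    assert len(att_hr) == len(att_sr)
    return float(sum(np.abs(a - b).sum() for a, b in zip(att_hr, att_sr)))
```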

  • Content-Aware Module

    1. This module first trains a VAE (Variational Autoencoder) on EMNIST [5] to obtain a 2-D latent representation of each character.

      Figure 4

    2. At each time step, the output distribution $p_t$ of the pre-trained transformer is compared with the ground-truth label.

      The content loss $\mathcal{L}_{con}$ can thus be computed as:

      → $\mathcal{L}_{con} = -\sum_{t=1}^{L} w_{G_t} \log p_t(G_t)$, where $G_t$ is the ground truth at the t-th step and $w_{G_t}$ is the weight assigned to that character.

  • Overall Loss Function

    $\mathcal{L} = \mathcal{L}_{pixel} + \lambda_{pos} \mathcal{L}_{pos} + \lambda_{con} \mathcal{L}_{con}$

    (In the equation above, the $\lambda$'s are hyperparameters that balance the loss terms.)

4. Experiment & Result

Experimental setup

  • Dataset

    TextZoom [2] : 17,367 LR-HR image pairs for training + 4,373 pairs for testing (easy subset: 1,619 / medium: 1,411 / hard: 1,343)

    +) LR image resolution: 16 × 64 / HR image resolution: 32 × 128

    Figure 6

  • Evaluation metric

    For the SR images, the following two metrics are used:

    • PSNR (Peak Signal-to-Noise Ratio)

    • SSIM (Structural Similarity Index Measure)

    In addition, two text-focused metrics are used; both are proposed in this paper. They consider only the text region of the image, obtained with the methods of SynthText [4] and U-Net [6].

    • TR-PSNR (Text Region PSNR)

    • TR-SSIM (Text Region SSIM)
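TR-PSNR can be sketched as ordinary PSNR restricted to a boolean text mask (a simplified illustration; the paper obtains the mask via SynthText-style data and a U-Net, which we simply assume is given):

```python
import numpy as np

def psnr(a, b, peak=1.0, mask=None):
    """PSNR between images a and b with values in [0, peak].

    With a boolean text-region mask this becomes TR-PSNR: the error is
    measured only over text pixels, ignoring the background.
    """
    if mask is not None:
        a, b = a[mask], b[mask]
    mse = np.mean((a - b) ** 2)
    return 10 * np.log10(peak ** 2 / mse)
```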

  • Implementation Details

    Hyperparameters

    • Optimizer : Adam

    • Batch size : 80

    • Learning rate : 0.0001

    GPUs used : NVIDIA TITAN Xp (12GB × 4)

Result

  • Ablation Study

    • ๋‚˜์•„๊ฐ€, ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ๊ฐ ๋ชจ๋“ˆ ๋ฐ ์š”์†Œ (backbone, Position-Aware Module, Content-Aware Module, etc.) ๋“ค์˜ ํšจ๊ณผ๋ฅผ ๊ฒ€์ฆํ•˜๊ธฐ ์œ„ํ•ด ablation study๋ฅผ ์ง„ํ–‰ํ–ˆ์Šต๋‹ˆ๋‹ค.

    • ๋ฐ์ดํ„ฐ์…‹ : TextZoom [2]

      +) ์•„๋ž˜์˜ ํ‘œ๋“ค์—์„œ Recognition ์ •ํ™•๋„๋Š” pre-train๋œ CRNN [7]์„ ๊ธฐ๋ฐ˜์œผ๋กœ ๊ณ„์‚ฐ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

      Table

  • Results on TextZoom [2]

    • ๊ฐ๊ฐ ๋‹ค๋ฅธ backbone์„ ๊ธฐ๋ฐ˜์œผ๋กœ ์„ธ ๊ฐ€์ง€ ๋ชจ๋ธ (CRNN [7], ASTER [8], MORAN [9]) ์—์„œ์˜ ์ •ํ™•๋„๋ฅผ ๋น„๊ตํ–ˆ์œผ๋ฉฐ, ๊ฒฐ๊ณผ๋Š” ์•„๋ž˜ ํ‘œ์™€ ๊ฐ™์Šต๋‹ˆ๋‹ค.

    • ๋ณธ ๋…ผ๋ฌธ์˜ TBSRN ๋ฅผ backbone์œผ๋กœ ์‚ฌ์šฉํ–ˆ์„ ๋•Œ์˜ ์ •ํ™•๋„๊ฐ€ ์ƒ๋Œ€์ ์œผ๋กœ ๋†’์Œ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

    Table5

    • Visualized Examples

      Figure8

  • Failure Cases

    Figure 10

    The paper also examines cases where the SR output still cannot be recognized correctly; these cases are as follows:

    • Long or small text

    • Complex backgrounds or occlusion

    • Artistic fonts or handwriting

    • Images whose labels do not appear in the training set

5. Conclusion

  • ์š”์•ฝํ•˜์ž๋ฉด, ๋ณธ ๋…ผ๋ฌธ์€

    • ๋ถˆ๊ทœ์น™ํ•œ ํ…์ŠคํŠธ ์ด๋ฏธ์ง€๋“ค์„ ์ฒ˜๋ฆฌํ•˜๊ธฐ ์œ„ํ•ด self-attention mechanism์„ ์‚ฌ์šฉํ•œ TBSRN ์„ backbone์œผ๋กœ ์‚ฌ์šฉํ–ˆ๊ณ ,

    • ํ—ท๊ฐˆ๋ฆด๋งŒํ•œ, ์ฆ‰, ์ธ์‹์ด ๊นŒ๋‹ค๋กœ์šด ๋ฌธ์ž๋“ค์„ ๊ณ ๋ คํ•ด weighted cross-entropy loss๋ฅผ ์‚ฌ์šฉํ•˜๊ณ ,

    • ํ…์ŠคํŠธ์— ์ดˆ์ ์„ ๋‘” ์—ฌ๋Ÿฌ๊ฐ€์ง€ module๋กœ ๊ตฌ์„ฑ๋œ,

    • Super-Resolution ๋ชจ๋ธ (Scene Text Telescope) ์„ ์ œ์•ˆํ•œ ๋…ผ๋ฌธ์ž…๋‹ˆ๋‹ค.

Take home message

  • SR technique์„ ํ…์ŠคํŠธ์— ์ดˆ์ ์„ ๋‘์–ด ์‚ฌ์šฉํ•˜๋ฉด ์„ฑ๋Šฅ์ด ํ–ฅ์ƒ๋  ์ˆ˜ ์žˆ๋‹ค.

  • Ablation study๋‚˜ Failure case ์„ค๋ช… ๋“ฑ์ด ์ž˜ ๋˜์–ด ์žˆ๋Š” ๋…ผ๋ฌธ์€ fancyํ•˜๋‹ค!

Author

Park Na Hyeon (๋ฐ•๋‚˜ํ˜„)

  • NSS Lab, KAIST EE

  • julia19@kaist.ac.kr


Reference & Additional materials

  1. Graves, Alex, et al. "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks." Proceedings of the 23rd international conference on Machine learning. 2006.

  2. Wang, Wenjia, et al. "Scene text image super-resolution in the wild." European Conference on Computer Vision. Springer, Cham, 2020.

  3. Jaderberg, Max, et al. "Reading text in the wild with convolutional neural networks." International journal of computer vision 116.1 (2016): 1-20.

  4. Gupta, Ankush, Andrea Vedaldi, and Andrew Zisserman. "Synthetic data for text localisation in natural images." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.

  5. Cohen, Gregory, et al. "EMNIST: Extending MNIST to handwritten letters." 2017 International Joint Conference on Neural Networks (IJCNN). IEEE, 2017.

  6. Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. "U-net: Convolutional networks for biomedical image segmentation." International Conference on Medical image computing and computer-assisted intervention. Springer, Cham, 2015.

  7. Shi, Baoguang, Xiang Bai, and Cong Yao. "An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition." IEEE transactions on pattern analysis and machine intelligence 39.11 (2016): 2298-2304.

  8. Shi, Baoguang, et al. "Aster: An attentional scene text recognizer with flexible rectification." IEEE transactions on pattern analysis and machine intelligence 41.9 (2018): 2035-2048.

  9. Luo, Canjie, Lianwen Jin, and Zenghui Sun. "Moran: A multi-object rectified attention network for scene text recognition." Pattern Recognition 90 (2019): 109-118.
