ScrabbleGAN [Kor]

Fogel et al. / ScrabbleGAN: Semi-Supervised Varying Length Handwritten Text Generation / CVPR 2020

Before getting into the paper: throughout this review, text in the regular typeface explains the content, while italicized and underlined text like this carries the writer's own thoughts.

The ScrabbleGAN paper was published at CVPR 2020 and deals with handwritten text generation. It proposes a generative model that combines a fully convolutional GAN with a Handwritten Text Recognition (HTR) model and can produce realistic handwritten text in a variety of styles, and it uses the generated samples to improve the performance of existing HTR models. Before starting the review, let's first look at a picture that shows the overall behavior and results; it should help with the big picture.

The official GitHub repository of the ScrabbleGAN paper shows the process of generating the word "meet" and various styles of "Supercalifragilisticexpialidocious", which is said to be the longest word.

1. Problem definition

So, which existing problems does this paper try to solve?

1. Moving from an RNN architecture to a CNN architecture

First, existing handwritten text generation models are RNN-based, whereas this paper proposes a CNN-based architecture. Earlier work had little choice but to use RNNs (more precisely, it seems to use CRNN or LSTM architectures), and the reason becomes clear once you look at the data. Unlike image datasets whose samples share the same or a similar size, handwritten text images vary widely in size depending on the text. Resizing every input to one fixed size is therefore not appropriate.

That is why earlier work uses RNNs, which allow a many(input)-to-many(output) structure with no constraint on the output length. But if you ask whether the very first character influences the last one, the answer is most likely no. The paper points to this dependency as non-trivial. So instead of an RNN, this paper proposes a CNN architecture.

To express continuity and naturalness between characters, the model also uses overlapped receptive fields. Because each character shares its receptive field with the characters on both sides, the design lets a CNN, rather than an RNN, use the sequential information before and after a character locally.

Various results from Figure 3 of the ScrabbleGAN paper; the datasets likewise consist of words of varying lengths. From the right: retrouvailles, écriture, les, étoile, feuilles, soleil, péripatéticien and chaussettes

2. Semi-supervised learning with a GAN

Second, the handwritten text generation task had so far been carried out only on precisely labeled datasets, which makes the results heavily dependent on the dataset. The paper instead uses a GAN, in which the interplay between Generator and Discriminator needs no labels, to enable semi-supervised learning, and proposes this as a way to raise performance in handwritten text generation.

3. Overcoming the limits of existing datasets

Finally, the paper tries to overcome the dataset limitation mentioned above by using handwritten text generation to obtain additional data. The main contributions are: proposing a Fully Convolutional Neural Network instead of the RNN-based models used so far, since the output size of handwriting is not fixed; attempting semi-supervised learning on unlabeled data; and improving existing Handwritten Text Recognition (HTR) models by adding generated samples to existing datasets and thereby increasing their diversity.

2. Motivation

The difference between Online and Offline approaches

Before looking at related work, you need to know that handwritten text comes in Online and Offline forms in order to follow the concepts of the papers introduced here. Online data carries information about how the handwriting was written, through strokes that sample the writing process. Offline data has no information about the process; only the final result is visible. Whether a paper is online or offline therefore strongly shapes its concept. For example, online handwritten text generation can mean generating a sequence of points in order, whereas offline generation means producing a single image. According to the paper, online data is hard to collect because it needs a tool that records strokes, and online methods cannot be applied to offline data at all, while offline methods are general enough to be applied to online data as well. For these reasons the paper focuses on the offline setting.

Online data as described in the DeepWriting paper: there is the notion of a stroke, an ordered sequence of points sampled over time.

In this chapter I briefly summarize the papers introduced as related work that I consider important. In fact, if you have followed all of this related work, you can grasp the concept of this paper right away.

[Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.]

The first is Generating sequences with recurrent neural networks by Alex Graves of the University of Toronto, a highly influential paper with some 3,500 citations. As the title says, it discusses sequence generation with RNNs. It uses the IAM online handwriting dataset, which contains the strokes mentioned above, and an LSTM predicts where the next point of the writing will be and keeps going to produce handwriting.

The Generating sequences with recurrent neural networks paper visualizes the prediction process as follows. <u>It can be seen as generating handwriting, but it also looks like nothing more than predicting the next step.</u>

[Bo Ji and Tianyi Chen. Generative adversarial network for handwritten text. arXiv preprint arXiv:1907.11845, 2019]

This paper proposes handwriting generation with a GAN. Probably because handwriting differs in length from word to word, it handles the sequential data with a CNN-LSTM discriminator: an LSTM serves as the generator and a CNN-LSTM as the discriminator, and the two are trained as a GAN. This paper also appears to use the IAM online handwriting dataset. I thought GANs were already well established for producing realistic images, so it is later than I expected that a GAN paper for handwriting generation appeared only in 2019.

[Eloi Alonso, Bastien Moysset, and Ronaldo Messina. Adversarial generation of handwritten text images conditioned on sequences. arXiv preprint arXiv:1903.00277, 2019.]

This is the model that ScrabbleGAN mainly compares against in its Result part, because its overall structure is very similar to ScrabbleGAN. Rather than the plain GAN setup mentioned just above (adversarial training of a generator and a discriminator), it adds an auxiliary network for text recognition. It also treats the task as image generation rather than working on online data.

However, this paper has clear limitations. First, it cannot generate words beyond a certain length. The rho block in the picture below consists of bidirectional LSTM recurrent layers that read the characters in order and return a single embedding vector for the whole word. Naturally, longer words lose more information, and I think the problem gets even worse because the size of the final output is fixed.

Second, it does not express writing style well; the paper itself also mentions that it cannot control style.

Network architecture proposed in Adversarial generation of handwritten text images conditioned on sequences

Idea


To overcome these limitations, the paper presents the following ideas. In particular, the attempt to overcome the limitations of Adversarial generation of handwritten text images conditioned on sequences can be seen as the main idea of ScrabbleGAN.

  1. Remove the embedding network built from bidirectional LSTMs and instead use a filter bank, that is, one embedding per character, so that each character is generated independently.

  2. Apply overlapped receptive fields so that characters interact and adjacent characters join into natural handwriting. The Discriminator and the Recognizer also look at each character together with its overlapped receptive field and judge whether it is real or fake and whether it can be recognized.

These two points are how ScrabbleGAN generates text. Let's look at them in more detail in the Method part.

3. Method

Generator Part
  • Model architecture

Let's start with the generator. The authors explain why they use a CNN rather than an RNN. An RNN uses every state from the start up to the current step, and they point out that this dependency is non-trivial and undesirable for generating characters. With a CNN, each character is generated in relation to only the characters on either side, which they say solves the problem. The overlapped receptive field proposed in the paper lets neighboring characters interact and produces smooth transitions.

The paper uses the word "meet" as an example. As in the picture above, each character selects its entry in the filter bank, which gives four filters, for m, e, e and t. Each is multiplied by a noise vector z that represents style, producing the input from which the characters are generated. As mentioned above, the network that takes these filters as input generates each character while sharing overlapped receptive fields with its neighbors. The authors say that this approach has no length constraint and keeps the style consistent across the whole word. They also say that because the overlap is only a small part of each filter's region, the target character is still generated clearly, while the overlap lets a character vary depending on its neighbors, which adds diversity.
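To make the "only the neighbors matter" claim concrete, here is a minimal sketch that counts which characters fall inside the receptive field of each column of the 4n-wide feature map. The kernel width, the number of stacked layers and the function name `influencing_chars` are hypothetical choices for illustration, not values taken from the paper, and upsampling is ignored.

```python
def influencing_chars(n_chars, cols_per_char=4, kernel=3, layers=3):
    """For every column of the 4n-wide feature map, list the characters
    whose columns fall inside its receptive field after `layers` stacked
    stride-1 convolutions of width `kernel` (hypothetical values)."""
    reach = layers * (kernel // 2)          # columns seen on each side
    width = n_chars * cols_per_char
    result = []
    for col in range(width):
        lo = max(0, col - reach)
        hi = min(width - 1, col + reach)
        # Map every visible column back to the character that owns it.
        result.append(sorted({c // cols_per_char for c in range(lo, hi + 1)}))
    return result


word = "meet"
fields = influencing_chars(len(word))
for col in (0, 5, 15):
    print(col, [word[i] for i in fields[col]])
```

Under these assumed sizes no column ever sees further than its immediate neighbors, which is the local behavior the paragraph describes; a larger kernel or more layers would widen the overlap.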

Discriminator and Recognizer Part: the whole network is trained with the losses from these two networks.

Next, the Discriminator. As mentioned earlier, its role is to push the generator toward realistic images, and here it is also said to discern style. It scores the patch produced from each filter (including the overlap) one by one and averages the scores, so training is unaffected by changes in the length of the final output. Finally, the Recognizer contributes to making the text readable. It is easier to see that this is a different job if you think of the Discriminator as controlling how handwriting-like the image is. The Recognizer can only be trained on labeled real samples.
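The length independence of the averaged patch scores can be sketched as follows. This is only a toy illustration: `toy_patch_score` stands in for the real convolutional discriminator, and the overlap of 8 columns is an assumed value, not one given in the paper.

```python
def patch_windows(width, char_width=16, overlap=8):
    """Yield one (start, stop) column range per character, widened by
    `overlap` columns on each side and clipped to the image."""
    n_chars = width // char_width
    for i in range(n_chars):
        start = max(0, i * char_width - overlap)
        stop = min(width, (i + 1) * char_width + overlap)
        yield start, stop


def toy_patch_score(image, start, stop):
    """Stand-in for the discriminator: mean intensity of the patch."""
    values = [row[c] for row in image for c in range(start, stop)]
    return sum(values) / len(values)


def discriminator_score(image):
    """Average the per-patch scores into a single scalar."""
    width = len(image[0])
    scores = [toy_patch_score(image, s, e) for s, e in patch_windows(width)]
    return sum(scores) / len(scores)


def blank_word_image(n_chars, value=0.5, height=32, char_width=16):
    """Constant-valued image of height 32 and width 16 * n_chars."""
    return [[value] * (char_width * n_chars) for _ in range(height)]


short_score = discriminator_score(blank_word_image(4))
long_score = discriminator_score(blank_word_image(12))
print(short_score, long_score)
```

Whatever the word length, the output is one scalar on the same scale, so the same loss can be used for words of 4 or 12 characters.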

The Recognizer, a Handwritten Text Recognition (HTR) network, also uses a CNN architecture. Many models choose a bidirectional LSTM, which can see context in both directions, but the paper points out that such a model can guess the right answer from context even when the handwriting itself is not written properly, much as we can read a common three-letter word even when its middle letter looks odd. To address this, the paper uses a convolutional backbone for the Recognizer so that every single character has to be recognized properly.

*Handwritten Text Recognition is, literally, the field of recognizing handwriting. Its role can be confused with the Discriminator's, but the Discriminator only judges whether an image looks like handwriting; it does not decide which letter it is. It is not an exact analogy, but the Discriminator asks whether the image looks like a person wrote it by hand (realistic), whereas the Recognizer asks whether the written text matches the label: if the word is "meet", is it read as "m", "e", "e" and "t"?

  • Loss Function

Next, let's look at the training details.

Total loss: *lambda here and alpha in the equation below should be read as the same symbol.
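The equation this caption refers to was an image and is not reproduced in this text. A reconstruction from the surrounding description, assuming the symbols \ell_D and \ell_R for the Discriminator and Recognizer losses, would be:

```latex
\ell = \ell_D + \lambda \cdot \ell_R
```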

As the overall architecture suggests, training uses a loss R from the Recognizer and a loss D from the Discriminator. To balance the two, the paper matches the standard deviation of the gradient of loss R to that of the gradient of loss D. lambda can be seen as controlling the relative scale of loss_D and loss_R; in the equation below it is written as alpha.

The equation below gives more detail. The gradient from the Recognizer, gradient R, is rescaled to match the standard deviation of gradient D and then multiplied by a constant alpha, which, like lambda above, adjusts the scale so that loss_D and loss_R are both learned properly.

Unlike the Adversarial generation of handwritten text images conditioned on sequences paper mentioned above, the mean is not shifted to match that of the gradient of loss D. The paper says this is to avoid the gradient changing sign when the mean is shifted. However, not shifting might leave the means of the two gradients mismatched, and the paper says little about the advantages and disadvantages of matching only the standard deviation.

Standard-deviation scaling of gradient R: *lambda in the equation above and alpha here should be read as the same symbol.
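The scaling described above can be sketched in a few lines. This is a minimal illustration on plain lists of gradient values, assuming the update grad_R <- alpha * (sigma_D / sigma_R) * grad_R; the function names and the sample numbers are made up, and the mean-shift variant is included only to illustrate the sign-flip concern, as my reading of the contrast with Alonso et al.

```python
import statistics


def scale_std_only(grad_r, grad_d, alpha=1.0, eps=1e-12):
    """grad_R <- alpha * (sigma_D / sigma_R) * grad_R (no mean shift)."""
    sigma_r = statistics.pstdev(grad_r)
    sigma_d = statistics.pstdev(grad_d)
    scale = alpha * sigma_d / (sigma_r + eps)
    return [scale * g for g in grad_r]


def scale_with_mean_shift(grad_r, grad_d, alpha=1.0, eps=1e-12):
    """Variant that also moves the mean of grad_R onto the mean of grad_D."""
    mu_r, mu_d = statistics.fmean(grad_r), statistics.fmean(grad_d)
    sigma_r = statistics.pstdev(grad_r)
    sigma_d = statistics.pstdev(grad_d)
    ratio = sigma_d / (sigma_r + eps)
    return [alpha * (ratio * (g - mu_r) + mu_d) for g in grad_r]


# Made-up gradient values: every Recognizer gradient is positive.
grad_r = [0.5, 1.0, 1.5, 2.0]
grad_d = [-0.2, -0.1, 0.0, 0.1]

std_only = scale_std_only(grad_r, grad_d)
shifted = scale_with_mean_shift(grad_r, grad_d)
print(std_only)  # all entries keep their positive sign
print(shifted)   # some entries have turned negative
```

In this toy case the standard-deviation-only version keeps every sign, while the mean-shifted version turns some positive gradients negative, which is the behavior the paper is said to avoid.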

4. Experiment & Result

Experimental setup

  • Dataset and Evaluation metric

    The datasets used are RIMES, IAM and CVL. Two evaluation metrics are used. The first is word error rate (WER), which, as the name says, measures how many of all the words were misread. The second is normalized edit-distance (NED), which measures the edit distance between the ground truth and the prediction.

    Equation for word error rate (WER). For example, it quantifies the operations, such as substitutions and deletions, needed to turn word A into word B.

    Equation for normalized edit-distance (NED). Here A_i and B_i denote the character at position i of each word. For example, abc and acb are compared in the order a-a, b-c, c-b.
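The two metrics above can be sketched as follows. This is a minimal version under stated assumptions: WER is taken as the fraction of words that are not read exactly right, and NED uses the Levenshtein distance (unit-cost insert, delete, substitute) divided by the length of the true word; the exact normalization in the paper's formula is not reproduced here, so treat that choice as an assumption. The sample words are made up.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def word_error_rate(truths, preds):
    """Fraction of words whose prediction differs from the truth."""
    wrong = sum(t != p for t, p in zip(truths, preds))
    return wrong / len(truths)


def normalized_edit_distance(truths, preds):
    """Mean edit distance, each divided by the true word length."""
    total = sum(edit_distance(t, p) / len(t) for t, p in zip(truths, preds))
    return total / len(truths)


truths = ["meet", "abc", "les"]
preds = ["meet", "acb", "lee"]
wer = word_error_rate(truths, preds)
ned = normalized_edit_distance(truths, preds)
print(wer, ned)
```

WER counts "acb" and "lee" as equally wrong, while NED gives partial credit for characters that are right, which is why the two are reported together.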

  • Training setup

    First, the paper fixes the generated image of one character to a height of 32 and a width of 16 pixels. The filter bank entry fed in as input has size 32x8192 and is multiplied by a 32-dimensional noise vector z. Generating n characters then gives an n x 8192 matrix, which can be understood as concatenating n products of z and a filter ((1x32) * (32x8192)).

    Next, a reshape turns this into 512x4x4n (8192 = 512x4x4), so each character has a 4x4 spatial size. The tensor then passes through three residual blocks, which upsample it and create the overlapping regions, producing a final image of size 32x16n.

    The Discriminator architecture is borrowed from BigGAN: four residual blocks followed by a single fc layer. As described earlier, it is fully convolutional, and the average over the patches (characters) is the final prediction.
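The generator input shapes in the training setup above can be traced with a small shape-bookkeeping sketch. The filter values here are random rather than learned, the helper names are hypothetical, and the memory layout used in the reshape (channel, then row, then column) is an assumption; only the sizes 32, 8192, 512x4x4n and 32x16n come from the text.

```python
import random

NOISE_DIM = 32           # length of the style vector z
BANK_DIM = 8192          # 512 channels * 4 * 4 spatial positions
CHANNELS, BASE = 512, 4


def make_filter_bank(alphabet, rng):
    """One NOISE_DIM x BANK_DIM matrix per character (random stand-ins)."""
    return {ch: [[rng.gauss(0.0, 1.0) for _ in range(BANK_DIM)]
                 for _ in range(NOISE_DIM)]
            for ch in alphabet}


def char_vector(z, bank):
    """(1 x 32) times (32 x 8192) gives 8192 values for one character."""
    return [sum(z[k] * bank[k][j] for k in range(NOISE_DIM))
            for j in range(BANK_DIM)]


def generator_input(word, z, banks):
    """Stack one vector per character (n x 8192), then place the
    characters side by side as a 512 x 4 x 4n tensor."""
    vectors = [char_vector(z, banks[ch]) for ch in word]
    tensor = [[[0.0] * (BASE * len(word)) for _ in range(BASE)]
              for _ in range(CHANNELS)]
    for i, vec in enumerate(vectors):
        for c in range(CHANNELS):
            for h in range(BASE):
                for w in range(BASE):
                    tensor[c][h][i * BASE + w] = vec[(c * BASE + h) * BASE + w]
    return vectors, tensor


rng = random.Random(0)
word = "meet"
z = [rng.gauss(0.0, 1.0) for _ in range(NOISE_DIM)]
banks = make_filter_bank(sorted(set(word)), rng)
vectors, tensor = generator_input(word, z, banks)
image_size = (32, 16 * len(word))   # final size stated in the text
print(len(vectors), len(vectors[0]))
print(len(tensor), len(tensor[0]), len(tensor[0][0]), image_size)
```

Because the same z is used for every character, the two "e" vectors are identical at this stage; any difference between them in the final image comes from the overlap with their different neighbors.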

Result

  • Comparison to Alonso et al.

    The comparison is with the network proposed in Adversarial generation of handwritten text images conditioned on sequences, labeled "Alonso et al. [2]" in the table and picture below. Looking at the picture first, we can confirm qualitatively that ScrabbleGAN produces good handwriting even for words the earlier model could not generate well. The table below it compares the two models using Fréchet Inception Distance (FID) and geometric score (GS), showing that ScrabbleGAN also produces better images quantitatively.

Comparison_with_[2]
Comparison_with_[2]2
  • Generating different styles

    Next, the paper shows that it can generate a variety of styles. As in the image below, the same word is generated in different styles, showing that text in each style is generated well. The authors also say that because the same style vector z is multiplied into each character separately, and the overlapped receptive fields make adjacent characters interact well, the handwriting comes out natural with a consistent style.

Generating different styles
  • Boosting HTR performance

    Next, the paper describes how adding the dataset generated by the proposed network improves existing HTR performance. As one might expect, it reports that the results are better when the dataset is extended with the proposed method. According to the table, training with images generated by ScrabbleGAN performs better than training on the existing data with augmentation alone. In other words, ScrabbleGAN's outputs help secure diversity in the data.

Boosting HTR performance

5. Conclusion

Rather than treating the generation of a whole word as a single problem with an RNN, the paper says it cuts the task up into local problems. This lets it generate images well without constraints on length or style, and the overlapped receptive fields add naturalness between adjacent characters.

As future directions, the paper suggests few-shot learning, making style and handwriting attributes (stroke thickness, cursiveness) controllable, and finally applying a different receptive field to each character. This last point is a limitation I also noticed while reading: even when the style differs, the width taken up by one character is fixed, so there is no diversity in that respect, and the authors point this out as well.

My opinion: the big contribution of this paper is replacing the existing RNN architecture with a CNN. It applies divide and conquer to the process of generating a whole word, turning it into the problem of generating one character together with its neighbors on either side. It shows that this performs better than using an RNN, and that generating a character by referring only to its neighbors is a well-founded assumption.

However, as the paper itself says, there is a clear limitation: words with the same number n of characters always have the same width, whether they consist of 100 i's or 100 m's. Also, although it shows results in diverse styles, it does not show controllable style.

Take home message (today's lesson)

Some problems that have been solved with RNNs can also be solved with CNNs, as long as the problem is defined well.

Text generation, with its two targets of being recognizable and realistic, feels like an interesting field that is different again from image generation.

Author / Reviewer information

Author

김기훈 (GiHoon Kim)

  • KAIST GSCT, Visual Media Lab

  • gihoon@kaist.ac.kr

Reviewer

  1. 권다희 (Kwon Dahee): KAIST / -

  2. 백정엽 (Baek Jeongyeop): KAIST / -

  3. 한정민 (Han Jungmin): KAIST / -

Reference & Additional materials

  1. Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.

  2. Bo Ji and Tianyi Chen. Generative adversarial network for handwritten text. arXiv preprint arXiv:1907.11845, 2019

  3. Eloi Alonso, Bastien Moysset, and Ronaldo Messina. Adversarial generation of handwritten text images conditioned on sequences. arXiv preprint arXiv:1903.00277, 2019.

  4. Emre Aksan, Fabrizio Pece, and Otmar Hilliges. Deepwriting: Making digital ink editable via deep generative modeling. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pages 1–14, 2018.

  5. Official GitHub repository: https://github.com/amzn/convolutional-handwriting-gan

  6. Author's Video: https://www.youtube.com/watch?v=jGG5Q8S1Rus
