CUT [Kor]

Park et al. / Contrastive Learning for Unpaired Image-to-Image Translation / ECCV 2020

Figure: Basic structure of the CUT model

1. Problem definition

In the image-to-image translation task, an input image $x_A$ from source domain A is translated to target domain B while preserving the source content and adopting the target style.

We therefore need to learn a mapping function $G_{A \mapsto B}$ that generates an image $x_{AB} \in B$ that is hard to distinguish from a target-domain image $x_B \in B$.

The image-to-image translation task can be written formally as follows:

$$x_A \in \text{domain } A, \quad x_B \in \text{domain } B$$

Given an arbitrary image $x_A$ from domain A and an arbitrary image $x_B$ from domain B,

$$x_{AB} \in B : x_{AB} = G_{A \mapsto B}(x_A)$$

the output $x_{AB}$ obtained by feeding $x_A$ into the generator $G_{A \mapsto B}$ must be an element of domain B.

2. Motivation

Image translation

Image-to-image translation means mapping an image in domain A to an image in domain B. Put simply, turning a brown horse (domain A) into a zebra (domain B), or turning a black-and-white photo (domain A) into a color photo (domain B), is image translation. In doing so, however, the body and shape of the brown horse must stay as they are, and only the coat color should change. Likewise, in a black-and-white photo the buildings and background must not change; only their colors should. Because the shape has to be preserved while only certain colors or features change, the challenge is to separate these two aspects of a single image. This is called the 'disentanglement problem', and it is one of the most important problems in the image-to-image translation task. From here on, we call the body and shape of the horse the 'content' and the coat color of the horse the 'style'.

  • Pix2Pix

    Figure: Example results of pix2pix

    Pix2Pix is a representative image-to-image translation model that uses a paired dataset.

    A paired dataset is one in which the images of the two domains always exist as a "pair". For example, as in the figure above, the model requires a pair of images as input, such as (sketch, shoe photo) or (label map, building photo).

    To overcome the limitation that a plain GAN generates images with too few constraints, Pix2Pix adds an L1 loss. That is, the model is trained to reduce the difference between the generated image and the ground-truth image.

    However, Pix2Pix has the following drawbacks:

    1. Paired datasets are not easy to obtain, and building one is not easy either.

    2. It depends too heavily on the L1 loss. On some datasets, using only the L1 loss even gave better performance.

    CycleGAN was proposed to address these drawbacks.

  • CycleGAN

    CycleGAN no longer needs a paired dataset. It only needs an image dataset for each domain. This setting is called an 'unpaired dataset'. For example, 1,000 horse photos and 800 zebra photos are enough, with no pairing between individual images.

    Figure: CycleGAN architecture

    The CycleGAN architecture is explained here with an example based on the figure above.

    First, let $X$ be horse images and $Y$ be zebra images.

    Feeding a horse image $X$ into the generator $G$ produces a zebra image $G(X)$. The discriminator $D_Y$ compares it with the real zebra images $Y$ and judges whether it looks real or fake.

    Next, the generated fake zebra image $G(X)$ is fed into a second generator $F$, which turns it back into a horse image $F(G(X))$. This reconstructed horse image, having passed through both generators, is compared with the original horse image $X$ so that the two stay close. The discriminator $D_X$ plays the mirror role of $D_Y$: it judges whether horse images produced by $F$ look real or fake.

    The same process is repeated in the opposite direction. That is, a zebra image $Y$ goes around the cycle to produce a reconstructed zebra image $G(F(Y))$.

    The loss obtained this way, which penalizes the difference between each image and its reconstruction, is called the 'cycle consistency loss'. It is used in place of the L1 loss of pix2pix.

    As a result, the final output image looks like a real zebra (style) while keeping the original shape of the horse image (content).
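
    For reference, the cycle consistency loss described above is usually written as below. This is the standard form from the CycleGAN paper, stated here with the symbols $G$, $F$, $X$, $Y$ used in this section; the lowercase $x$, $y$ denote individual images sampled from each domain.

    $$\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x \sim X}\big[\lVert F(G(x)) - x \rVert_1\big] + \mathbb{E}_{y \sim Y}\big[\lVert G(F(y)) - y \rVert_1\big]$$

    The first term covers the horse → zebra → horse cycle and the second term covers the zebra → horse → zebra cycle.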

    However, this architecture also has problems.

    1. It needs an inverse function. That is, $F$, the inverse of $G$, is additionally required. Because two models are used, it can consume a lot of memory and be slow.

    2. The relationship between the two domains must be a one-to-one correspondence. This is too restrictive.

      Figure: A property of CycleGAN: bijection (one-to-one correspondence)

      Let's look at this point in more detail.

      Suppose a brown horse is translated into the zebra domain. The model discards the coat-color information of the brown horse, keeps only its shape, and applies zebra stripes. Therefore, when going back from zebra to horse, what matters is the shape; whether the original horse was brown, white, or spotted is not an important factor, so that information is lost.

      CycleGAN does well on the horse→zebra task but not on zebra→horse. Zebras have a less diverse style than horses (black-and-white stripes are enough), which made that direction easier. Conversely, turning a zebra into a brown, spotted, or white horse is not an easy task. In other words, CycleGAN offers relatively low diversity and is restrictive in this respect.

Idea

Figure: Relationship between patches within a horse image

To address these limitations of CycleGAN, this paper proposes a new model, "CUT (Contrastive learning for Unpaired image-to-image Translation)". (The authors of this paper are also authors of CycleGAN.)

The idea is that, when a horse image and a zebra image are cut into patches, translation becomes more intuitive if corresponding parts stay associated: the horse's head with the zebra's head, the horse's legs with the zebra's legs, and background with background. This is implemented through a contrastive loss.

The contrastive loss lets the encoder learn an embedding that:

  1. keeps shared properties such as body shape and structure unchanged (invariant), and

  2. lets differing properties such as the horse's coat color change flexibly (sensitive).

(The contrastive loss is explained in more detail in the Method section below.)

In addition, unlike CycleGAN, CUT does not need an inverse network, so it is simpler and its training time is shorter.

3. Method

InfoNCE Loss

To explain the contrastive loss mentioned above, we first introduce a simple concept from information theory.

Figure: Definition of mutual information

Given a source vector $c$ and a target vector $x$, mutual information is the amount of mutually dependent information between them, that is, roughly how much information the two vectors share.

It can be computed with the formula in the figure above. For simplicity, let a function $f(x_t, c_t)$ that is proportional to the density ratio $\frac{p(x|c)}{p(x)}$ stand in for the mutual information. (The index $k$ in the formula can safely be ignored here; it is not relevant to this discussion.)
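
For readers without the figure, the definition can be written out as below. This follows the form used in the contrastive predictive coding paper cited in this section, with the index $k$ dropped as suggested above; treat the exact notation as an assumption of this sketch.

$$I(x; c) = \sum_{x, c} p(x, c) \log \frac{p(x \mid c)}{p(x)}, \qquad f(x_t, c_t) \propto \frac{p(x_t \mid c_t)}{p(x_t)}$$

The ratio inside the logarithm is large when knowing $c$ makes $x$ much more likely than it is on its own, which is why a score $f$ proportional to it is a convenient proxy.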

So how can we define a loss that maximizes the mutual information between two vectors, that is, a loss that makes the information shared by the two vectors as large as possible?

The paper "Representation learning with contrastive predictive coding" (2018) proposes a loss called InfoNCE.

Figure: Definition of the InfoNCE loss

The InfoNCE loss is shown in the figure above.

Vectors other than the target vector $x$ are randomly sampled from the vector space. This is called negative sampling, and the sampled vectors are called negative samples.

The loss expresses the probability of picking the positive sample out of N+1 vectors: N negative samples and one target vector (= positive sample).

Maximizing this probability means raising the positive term in the numerator while shrinking the negative terms in the denominator. In other words, the mutual information with the target vector is maximized while the mutual information with the negative samples is reduced. Because the loss carries a minus sign, this is the same as minimizing the loss.
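
Written out with the N+1 vectors described above, the loss takes roughly the following form. This is a sketch in this section's notation: $x$ is the positive sample, $x_j^-$ are the N negative samples (a symbol introduced here for illustration), and $f$ is the score function from the previous paragraphs.

$$\mathcal{L}_{\text{InfoNCE}} = -\,\mathbb{E}\left[\log \frac{f(x, c)}{f(x, c) + \sum_{j=1}^{N} f(x_j^-, c)}\right]$$

The fraction is a softmax-style probability of picking the positive, so the loss is an (N+1)-way cross-entropy with the positive as the correct class.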

Figure: Definition of the contrastive loss

This paper uses the InfoNCE loss as follows.

  1. Query v (or source v): the feature of a patch from the output image.

  2. Positive v+: the feature of a patch from the input image, taken at the same location as the query v.

  3. Negative v-: the features of the remaining patches of the input image, excluding the positive v+ patch.

The mutual information between these features is expressed through cosine similarity, which measures how similar two features are.

In other words, minimizing this loss maximizes the similarity between the query and the positive and minimizes the similarity between the query and the negatives.
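
The query / positive / negative setup above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names, the toy 2-D feature vectors, and the temperature value `tau=0.07` are assumptions made for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(query, positive, negatives, tau=0.07):
    """(N+1)-way cross-entropy in which the positive is the correct class.

    tau is a temperature that scales the similarities (assumed value).
    """
    logits = [cosine_similarity(query, positive) / tau]
    logits += [cosine_similarity(query, n) / tau for n in negatives]
    # Numerically stable log-sum-exp over the positive and all negatives.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[0]  # equals -log(softmax(logits)[0])


# Toy features: the positive points almost the same way as the query.
query = [1.0, 0.0]
positive = [0.9, 0.1]
negatives = [[0.0, 1.0], [-1.0, 0.0]]

good = contrastive_loss(query, positive, negatives)
# Swap roles: a dissimilar patch is (wrongly) used as the positive.
bad = contrastive_loss(query, negatives[0], [positive, negatives[1]])
print(good < bad)
```

The loss is small when the query is closest to its positive and grows when some negative is more similar to the query than the positive is.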

Note that the discussion here is about the 'feature of a patch'; the paper defines this as the Patchwise Contrastive Loss.

The details follow below.

Multilayer, patchwise contrastive learning

Figure: Illustration of the patchwise contrastive loss

Relating the previous discussion to the figure above:

  • The navy patch in the zebra image = Query v

  • The blue patch in the horse image = Positive v+

  • The yellow patches in the horse image = Negative v-

As mentioned in the Motivation section, the paper wants correspondence to hold not only for the image as a whole but also when the image is broken down into patches.

That is, the head of the generated output zebra should be more closely associated with the head of the input horse than with the legs of the input horse.

The same holds when the idea is taken down to the pixel level. The color of the zebra's body should be more closely associated with the color of the horse's body, and less with the grassy background.

When the input is passed through the encoder part of the generator $G$, feature maps of various sizes are produced, and the paper uses these to compute the loss.

The feature map from the $l$-th layer of the encoder is fed into an MLP network $H_l$, which maps it to an embedding space.

Then $S_l$ patches are drawn from the mapped features, and the contrastive loss is computed on those patches.

This is repeated for $L$ layers. Repeating it over feature maps of various sizes makes it possible to cover everything from global characteristics of the image down to fine details.
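
The multilayer procedure above can be sketched as follows. This is a simplified illustration under stated assumptions: the features are taken to be already projected by the MLPs $H_l$ (which are not implemented here), each layer is represented as a flat list of per-location vectors, and all function names and the `tau=0.07` temperature are hypothetical choices for this example.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def nce(query, positive, negatives, tau=0.07):
    """Cross-entropy of picking the positive among positive + negatives."""
    logits = [cosine(query, positive) / tau]
    logits += [cosine(query, n) / tau for n in negatives]
    m = max(logits)
    return m + math.log(sum(math.exp(z - m) for z in logits)) - logits[0]


def patch_nce(feats_out, feats_in, num_patches, rng):
    """Sum the patch loss over sampled locations and over layers.

    feats_out / feats_in: one entry per layer; each entry is a list of
    per-location embedding vectors of the output and input image.
    """
    total = 0.0
    for layer_out, layer_in in zip(feats_out, feats_in):
        # Sample up to num_patches locations, shared by output and input.
        k = min(num_patches, len(layer_in))
        locations = rng.sample(range(len(layer_in)), k)
        for s in locations:
            # Negatives: input features at the other sampled locations.
            negatives = [layer_in[j] for j in locations if j != s]
            total += nce(layer_out[s], layer_in[s], negatives)
    return total


# Two toy "layers" with 3 and 2 spatial locations.
layer1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
layer2 = [[1.0, 0.0], [0.0, 1.0]]
feats_in = [layer1, layer2]
# Output features whose locations are shifted relative to the input.
rotated = [layer[1:] + layer[:1] for layer in feats_in]

aligned = patch_nce(feats_in, feats_in, num_patches=3, rng=random.Random(0))
misaligned = patch_nce(rotated, feats_in, num_patches=3, rng=random.Random(0))
print(aligned < misaligned)
```

When each output location matches the input at the same location the total is close to zero; when the locations are shifted, every query matches a negative better than its positive and the total becomes large.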

Figure: The PatchNCE loss

The authors named this loss the PatchNCE loss.

The formula works as follows.

  1. Among the $S_l$ patches, each patch takes a turn as the query patch; the contrastive loss is computed for it, and the results are summed.

  2. This is then repeated for each of the $L$ layers, and the results are summed again.
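
The two steps above correspond to the double sum in the PatchNCE formula, which can be written as below. The notation follows my reading of the paper and should be treated as an assumption here: $\hat{z}_l^s$ is the output feature at location $s$ of layer $l$, $z_l^s$ is the input feature at the same location, $z_l^{S \setminus s}$ are the input features at the other locations, and $\ell$ is the contrastive loss for one query.

$$\mathcal{L}_{\text{PatchNCE}}(G, H, X) = \mathbb{E}_{x \sim X} \sum_{l=1}^{L} \sum_{s=1}^{S_l} \ell\big(\hat{z}_l^s,\; z_l^s,\; z_l^{S \setminus s}\big)$$

The inner sum is step 1 (every patch takes a turn as the query) and the outer sum is step 2 (repeat over layers).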

※ Note: the MLP network H follows the structure used in the SimCLR (2020) paper. It consists of two linear layers with a ReLU non-linearity. The SimCLR paper demonstrates experimentally why this structure is used, so it is worth consulting.

At this point, one might raise the following question.

Here the negative samples are drawn from within the input image (internal patches). Couldn't they be taken from entirely different images instead (external patches)?

The authors ran this experiment and found that sampling within the image gave better results.

The results are detailed in the Ablation study part of Section 4 (Experiment & Result).

Figure: Internal patches vs. external patches

The authors suggest the following reasons.

  1. With internal patches, the encoder does not need to model intra-class variation. Whether a patch comes from a white horse or a brown horse does not matter for generating a zebra, so this variation need not be taken into account.

  2. External patches are too easy to tell apart, and they can also be false positives. As the figure above shows, when sampling from a different horse image, a horse-head patch that is actually related to the query patch may happen to be drawn as a negative sample. Such a patch cannot serve its purpose as a negative sample. This is what is meant by a false positive here (in contrastive-learning terminology this case is more often called a false negative, since a matching patch is wrongly treated as a negative).

  3. Methods that use internal patches have already proven effective in fields such as texture synthesis and super resolution.

Final loss

Figure: The final loss

The final loss is expressed as follows.

It consists of the standard GAN loss, the PatchNCE loss, and an identity loss.

The identity loss is the PatchNCE loss applied in the same way to domain Y. It is used to prevent the generator from changing the image unnecessarily. Its role is almost the same as that of the identity loss used in CycleGAN. That is, it is a loss that makes the generator (written $G_Y$ here) map an input from Y, rather than from X, to Y itself and not to some other image.

The default CUT model uses $\lambda_X = 1, \lambda_Y = 1$,

and the lighter variant, called FastCUT, uses $\lambda_X = 10, \lambda_Y = 0$. In other words, it is a lighter version because it does not use the identity loss.
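
Putting the three terms and the two weights together, the final objective can be written as below. This is the form I understand the paper to use, in this section's symbols; $D$ denotes the discriminator and $H$ the MLP projection networks.

$$\mathcal{L} = \mathcal{L}_{\text{GAN}}(G, D, X, Y) + \lambda_X \, \mathcal{L}_{\text{PatchNCE}}(G, H, X) + \lambda_Y \, \mathcal{L}_{\text{PatchNCE}}(G, H, Y)$$

The last term is the identity loss. With $\lambda_X = 1, \lambda_Y = 1$ all three terms are active (CUT); with $\lambda_X = 10, \lambda_Y = 0$ the identity term vanishes (FastCUT).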

4. Experiment & Result

Experimental Setup

  • Dataset:

    • Cat→Dog contains 5,000 training and 500 val images from AFHQ Dataset

    • Horse→Zebra contains 2,403 training and 260 zebra images from ImageNet

    • Cityscapes contains street scenes from German cities, with 2,975 training and 500 validation images.

  • Baselines

    • CycleGAN

    • MUNIT

    • DRIT

    • Distance

    • SelfDistance

    • GCGAN

  • Evaluation Metric

    • FID (Fréchet Inception Distance): a metric that measures the divergence between the distribution of real images and the distribution of generated images. Lower is better.

    • The Cityscapes dataset has ground-truth labels, so the segmentation metrics mAP, pixel-wise accuracy, and average class accuracy are used.

    • sec/iter, Mem (GB): measures of speed and memory usage

  • Training details:

    • Generator architecture: ResNet-based generator

    • Discriminator architecture: PatchGAN discriminator

    • The LSGAN loss is used as the GAN loss.

    • The encoder is the first half of the generator.

    • Features are extracted from layers 0, 4, 8, 12, and 16 of the encoder.
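
For reference, the FID metric listed above is commonly computed by fitting a Gaussian to Inception features of the real images and of the generated images and comparing the two. The symbols below are introduced here for illustration: $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the feature means and covariances of the real and generated sets.

$$\text{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\big(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\big)$$

Both terms are zero when the two feature distributions coincide, which is why a lower FID indicates generated images that are statistically closer to real ones.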

Results

Figure: Qualitative comparison with other models

These are the qualitative results. In some cases even FastCUT, the light version, gives better results than the other baselines.

On the horse-to-zebra task in particular, CUT gives better results than the other models, including CycleGAN.

The last two rows, however, show failure cases. When the horse is in an unfamiliar pose, the model paints stripes onto the background, and when turning a cat into a dog it creates a tongue that was not there.

Figure: Quantitative comparison with other models

Next are the quantitative results. Not only is the FID the lowest, but speed and memory usage are also very economical compared with the other models.

Ablation study

Figure: Ablation study results

The ablation study varies several options:

  1. Whether the identity loss is used

  2. The number of negative samples

  3. Multi-layer learning versus using only the last layer of the encoder

  4. Internal patches vs. external patches

(In the plot on the right, points toward the upper right indicate better performance, and points toward the lower left indicate worse performance.)

First, performance is poor when external patches are used, and it is also not very good when only the last layer is used.

Without the identity loss, however, the results cluster relatively toward the lower left of the plot.

According to the table on the left, on horse-to-zebra the FID actually decreased (performance ↑), whereas on Cityscapes the FID increased (performance ↓).

The authors found this difference in behavior odd and examined how the loss evolved during training.

Figure: Loss curves during training. Left: experiment on the horse-to-zebra dataset. Right: experiment on the Cityscapes dataset.

They found that on Cityscapes, training was very unstable when the identity loss was not used.

So without the identity loss the final FID may still turn out well, but training can be unstable. In other words, the identity loss helps training proceed more stably.

Visualizing the learned similarity by encoder

Figure: Analysis of the role of the encoder network

Finally, the authors visualized the encoder network to check what it learns.

In image (a) above, the blue dot marks a query patch, and the images in the corresponding blue boxes (c) visualize the similarity computed between the query patch of the output zebra and the patches of the input horse. (The same applies to the red dot.) In the similarity images, whiter regions are more similar and darker regions are less similar.

For the blue dot, the zebra's body is similar to the horse's body and dissimilar to the rest of the background.

Likewise, for the red dot, the foliage in the background is similar to the grassy background of the input image and dissimilar to the horse's body.

The images on the right show the principal components obtained by running PCA on the patch features. Similar colors can be read as coming from similar positions in feature space.

Overall, this shows that the body regions of the zebra and the horse are associated with each other, and that the remaining background regions are associated with each other.

5. Conclusion

The main contributions of CUT can be summarized as follows.

  1. It overcomes the limitations of CycleGAN and handles the image translation task in a more straightforward way.

  2. It is the first work (according to the authors) to use a contrastive loss for an image synthesis task.

  3. Instead of a predefined similarity function tied to ImageNet, such as a perceptual loss, it learns a cross-domain similarity function.

  4. It no longer needs an inverse network and does not rely on cycle-consistency.

However, this paper also has limitations.

Its weakness is that it does well only on certain domains. As seen in the experiments above, it performs especially well on horse-to-zebra, but on Cityscapes or cat-to-dog its results are hard to call striking. Perhaps the model does not capture the gap between domains well.

The more recent paper Dual Contrastive Learning for Unsupervised Image-to-Image Translation (CVPRW, 2021) points out exactly this issue and argues that it arises because CUT uses only a single encoder for the two domains. That paper reportedly resolves the limitation by using multiple embeddings, so readers curious about what comes next may want to read it.

More than anything, what I felt most while reading this paper was envy of the authors who wrote a paper as famous as CycleGAN. Yet the authors did not stop at that fame; they applied contrastive learning, a recent methodology, to their earlier work and developed it into their next study. As a researcher, I felt this is an attitude I should learn from.

Take home message

  1. Contrastive learning is a very suitable method for learning embeddings between features.

  2. The cycle consistency of CycleGAN cannot produce diverse images and is somewhat restrictive.

  3. A contrastive representation can be formed from a single image alone, so it can be used when training on only a single image.

Thank you for reading this long post. 😊

Author / Reviewer information

Author

박여정 (Yeojeong Park)
