Neural Discrete Representation Learning

Aaron van den Oord et al. / Neural Discrete Representation Learning / NIPS 2017


1. Problem definition

์˜ค๋Š˜๋‚  Generative Model์€ image, audio, video ๋“ฑ ๋งŽ์€ ๋ถ„์•ผ์—์„œ ์ธ์ƒ์ ์ธ ์„ฑ๊ณผ๋ฅผ ๋‚ด๊ณ  ์žˆ๋‹ค. Generative model์˜ ๋Œ€ํ‘œ ๋ชจ๋ธ ์ค‘ ํ•˜๋‚˜๋ผ๊ณ  ํ•  ์ˆ˜ ์žˆ๋Š” ๊ฒƒ์ด ๋ฐ”๋กœ Variational Auto Encoder(VAE)[1]์ด๋‹ค. VAE๋Š” data๋ฅผ ์–ด๋– ํ•œ latent space์— ๋งคํ•‘ํ•˜๊ณ , ๋งคํ•‘๋œ latent vector๋ฅผ ํ†ตํ•ด ์ƒˆ๋กœ์šด ๋ฐ์ดํ„ฐ๋ฅผ ์ƒ์„ฑํ•˜๋Š” ๋ชจ๋ธ์ด๋‹ค. ๊ธฐ์กด VAE๋Š” latent vector๊ฐ€ Gaussian distribution์„ ๋”ฐ๋ฅด๋„๋ก ํ•˜๊ณ , ํ•ด๋‹น distribution์˜ ํ‰๊ท ๊ณผ ๋ถ„์‚ฐ์„ ์˜ˆ์ธกํ•จ์œผ๋กœ์จ laten space๋ฅผ ์˜ˆ์ธกํ•˜๊ฒŒ๋œ๋‹ค. ์ด๋ ‡๊ฒŒ ๊ตฌ์„ฑ๋œ latent space๋กœ๋ถ€ํ„ฐ, ์šฐ๋ฆฌ๋Š” ์กด์žฌํ•˜์ง€ ์•Š์•˜๋˜ ์ƒˆ๋กœ์šด ๋ฐ์ดํ„ฐ๋ฅผ ์ƒ์„ฑํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค.

2. Motivation

VAE์˜ latent vector๋“ค์€ continousํ•œ ๊ฐ’์„ ๊ฐ€์ง„๋‹ค. ๋‹ค์‹œ ๋งํ•ด ์ƒ˜ํ”Œ๋ง ํ•  ์ˆ˜ ์žˆ๋Š” latent ๋ฒกํ„ฐ์˜ ๊ฒฝ์šฐ์˜ ์ˆ˜๊ฐ€ ๋ฌดํ•œํ•˜๋‹ค๋Š” ๊ฒƒ์ด๋‹ค. ์ด๋ ‡๊ฒŒ ๋ฌดํ•œํ•œ ํฌ๊ธฐ์˜ latent space์— ๋Œ€ํ•ด ๋ชจ๋ธ์„ ํ•™์Šตํ•˜๋Š”๊ฒƒ์€ ๋งค์šฐ ์–ด๋ ต๊ณ  ๋น„ํšจ์œจ์ ์ด๋‹ค. ๋ฐ์ดํ„ฐ๊ฐ€ ๋ฌดํ•œํ•œ space์˜ ์–ด๋–ป๊ฒŒ ๋งคํ•‘๋  ์ง€ ์˜ˆ์ธกํ•˜๊ธฐ ์–ด๋ ค์šธ ๋ฟ๋”๋Ÿฌ ํŠนํžˆ ๋งคํ•‘๋œ vector๋“ค์˜ ํ‰๊ท ๊ณผ ๋ถ„์‚ฐ์„ ์ œ์–ดํ•˜๊ธฐ ์–ด๋ ต๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. VQ-VAE์˜ ๋ชจํ‹ฐ๋ฒ ์ด์…˜์€ ๋ฐ”๋กœ ์—ฌ๊ธฐ์—์„œ ์ถœ๋ฐœํ•œ๋‹ค. ๋งŒ์•ฝ ๋ชจ๋ธ์„ ๋ฌดํ•œํ•œ ๊ณต๊ฐ„์˜ latent space๊ฐ€ ์•„๋‹Œ, ์ œํ•œ๋œ ํฌ๊ธฐ์˜(discrete) latent space์— ๋Œ€ํ•ด ํ•™์Šต์‹œํ‚จ๋‹ค๋ฉด, ๋ฐ์ดํ„ฐ๋ฅผ ๋” ์‰ฝ๊ณ  ํšจ์œจ์ ์œผ๋กœ ํ•™์Šตํ•  ์ˆ˜ ์žˆ์ง€ ์•Š์„๊นŒ? latent vector๊ฐ€ ๊ฐ€์งˆ ์ˆ˜ ์žˆ๋Š” ๊ฒฝ์šฐ์˜ ์ˆ˜์™€ ๊ทธ ๊ฐ’์„ ์ œํ•œํ•˜๋ฉด, ์ฆ‰, ๋‹ค์‹œ๋งํ•ด discrete latent space๋ฅผ ํ•™์Šต์‹œํ‚ค๊ณ  ๊ทธ๋กœ๋ถ€ํ„ฐ ๋ฐ์ดํ„ฐ๋ฅผ ์ƒ์„ฑํ•ด๋ณด๋ฉด ์–ด๋–จ๊นŒ?

์ด์ „์—๋„ Discrete latent VAE ๋ฅผ ํ•™์Šตํ•˜๊ณ ์ž ํ•˜๋Š” ์‹œ๋„๋Š” ๋ช‡ ์žˆ์—ˆ๋‹ค.

e.g. NVIL estimator[2], VIMCO[3]

ํ•˜์ง€๋งŒ ์œ„ ๋ฐฉ๋ฒ•๋“ค์€ ๋ชจ๋‘ ๊ธฐ์กด Gaussian ๋ถ„ํฌ๋ฅผ ๋”ฐ๋ฅด๋Š” continuous latent VAE ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์„ ๋”ฐ๋ผ์žก์ง€ ๋ชปํ–ˆ๋‹ค. ๊ฒŒ๋‹ค๊ฐ€ ์œ„ ๋ฐฉ๋ฒ•๋“ค์€ MNIST์™€ ๊ฐ™์€ ๋งค์šฐ ์ž‘์€ ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ ์‹คํ—˜๋˜๊ฑฐ๋‚˜ ํ‰๊ฐ€๋˜์—ˆ์œผ๋ฉฐ, ๊ทธ ๋ชจ๋ธ์˜ ๊นŠ์ด ์—ญ์‹œ ๋งค์šฐ ์ž‘์•˜๋‹ค.

Idea

VQ-VAE์˜ ์•„์ด๋””์–ด๋Š” encoder์—์„œ ๊ณ„์‚ฐํ•œ latent vector๋ฅผ ์น˜ํ™˜๋  ์ˆ˜ ์žˆ๋Š” ์œ ํ•œํ•œ ๊ฐœ์ˆ˜์˜ ๋ฒกํ„ฐ๋“ค์„ ํ•™์Šต ์‹œํ‚ด์œผ๋กœ์จ discrete latent space๋ฅผ ๊ตฌ์„ฑํ•˜๊ฒ ๋‹ค๋Š” ๊ฒƒ์ด๋‹ค. ์ข€ ๋” ์‰ฝ๊ฒŒ ๋งํ•˜์ž๋ฉด 1. ์šฐ์„  ๋žœ๋คํ•œ discrete latent space๋ฅผ ์ƒ์„ฑํ•œ๋‹ค. 2. ๊ทธ๋ฆฌ๊ณ  ์ด๊ฒƒ์„ encoder output ๊ณผ ์ž˜ ๋งตํ•‘๋˜๊ฒŒ๋” ํ•™์Šต์‹œํ‚จ๋‹ค.

ํ•œ๋งˆ๋””๋กœ, ๋ฌดํ•œํ•œ ๊ณต๊ฐ„์˜ distribution์„ ์œ ํ•œํ•œ ๊ณต๊ฐ„์˜ distribution์œผ๋กœ ๋งตํ•‘ํ•˜๋Š” non-linear layer๋ฅผ VAE์˜ encoder์™€ decoder ์‚ฌ์ด์— ์ถ”๊ฐ€ํ•˜๊ฒ ๋‹ค๋Š” ๊ฒƒ์ด VQ-VAE์˜ ์•„์ด๋””์–ด์ด๋‹ค.

3. Method

Discrete Latent Space

๊ทธ๋ ‡๋‹ค๋ฉด discrete latent space๋Š” ์–ด๋–ป๊ฒŒ ๊ตฌ์„ฑํ•ด์•ผ ํ• ๊นŒ? ๋…ผ๋ฌธ์€ (K, D) ์ฐจ์›์˜ embedding space๋ฅผ ๋žœ๋ค์œผ๋กœ ์ƒ์„ฑํ•˜๊ณ  ์ด space๊ฐ€ encoder์˜ output์„ ์ž˜ ๋ฐ˜์˜ํ•˜๊ฒŒ๋” ํ•™์Šต์‹œํ‚ด์œผ๋กœ์จ discrete latent space๋ฅผ ๊ตฌ์„ฑํ–ˆ๋‹ค. ์ด์ œ encoder์˜ output ์€ embedding space ์ค‘ ์ž์‹ ๊ณผ ๊ฐ€์žฅ ๊ฐ€๊นŒ์šด ๋ฒกํ„ฐ๋กœ ๋Œ€์ฒด๋˜์–ด decoder์— ์ „๋‹ฌ๋œ๋‹ค. encoder ์•„์›ƒํ’‹์ธ posterior distribution์€ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ •์˜๋œ๋‹ค.

์œ„ ์‹์—์„œ z_e(x) ๋Š” encoder output์„, e๋Š” embedding space๋ฅผ ์˜๋ฏธํ•œ๋‹ค.

์—ฌ๊ธฐ์„œ ํ•œ๊ฐ€์ง€ ์•Œ์•„๋‘˜ ๊ฒƒ์€, ์ด๋กœ ์ธํ•ด posterior ๋ถ„ํฌ๊ฐ€ deterministic ํ•ด์ง„๋‹ค๋Š” ๊ฒƒ์ด๋‹ค(latent z์— ๋Œ€ํ•ด uniform prior๋ฅผ ์ •์˜ํ–ˆ์œผ๋ฏ€๋กœ).

decoder์— ์ „๋‹ฌ๋  z_e(x) ๋Š” ์ตœ์ข…์ ์œผ๋กœ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋Œ€์ฒด๋œ๋‹ค.

Model Architecture

Forward

  1. Obtain a latent vector from the input data through the encoder.

  2. D ์ฐจ์›์„ ๊ฐ€์ง„ K๊ฐœ์˜ latent embedding vector ์ค‘ ์–ป์€ latent vector์™€ ๊ฐ€์žฅ ์œ ์‚ฌํ•œ ๋ฒกํ„ฐ๋ฅผ ๊ณ ๋ฅธ๋‹ค.

    ์ด๋ ‡๊ฒŒ ํ•˜๋Š” ์ด์œ ๋Š”?(Reminder)

    • ๊ธฐ์กด continuous ํ•˜๋˜ latent vector๋Š” ๋‚˜์˜ฌ ์ˆ˜ ์žˆ๋Š” ๊ฒฐ๊ณผ๊ฐ€ ๋ฌดํ•œํ–ˆ์ง€๋งŒ, ์œ„์™€ ๊ฐ™์ด ์„ค๊ณ„ํ•˜๋ฉด ๊ฐ€์งˆ ์ˆ˜ ์žˆ๋Š” latent vector์˜ ๊ฒฝ์šฐ์˜ ์ˆ˜๊ฐ€ discrete, ์œ ํ•œํ•ด์ง„๋‹ค. ๋”ฐ๋ผ์„œ ํ•™์Šต์ด ๋ณด๋‹ค controllable ํ•ด์งˆ ๊ฒƒ์ด๋ผ๋Š” ๊ฒŒ ์ €์ž์˜ ๊ฐ€์„ค์ด๋‹ค.

    • ์ด ์Šคํ…์ด ์ถ”๊ฐ€๋จ์— ๋”ฐ๋ผ ํ•™์Šต๋˜์–ด์•ผ ํ•˜๋Š” ํŒŒ๋ผ๋ฏธํ„ฐ๋“ค์€ encoder, decoder, ๊ทธ๋ฆฌ๊ณ  embedding space E ๊ฐ€ ๋œ๋‹ค .

  3. Decoder๋ฅผ ํ†ตํ•ด ๊ณ ๋ฅธ discrete latent vector๋กœ๋ถ€ํ„ฐ data๋ฅผ ์ƒ์„ฑํ•œ๋‹ค.


Backward

  • Loss = reconstruction loss + Vector Quantisation (VQ) loss + commitment loss (see the sketch at the end of this section)

    • reconstruction loss: the same reconstruction loss as in the standard VAE; it trains the encoder and decoder.

    • VQ loss (= codebook loss): the loss for learning the discrete latent space, i.e., the embedding space. It is the L2 error between the embedding vectors and the encoder outputs; through it, the embeddings are trained to move closer to the encoder outputs.

    • commitment loss: unlike the encoder/decoder, the embedding space is dimensionless, so it can train at a markedly different speed from the encoder/decoder parameters. To prevent the encoder output from growing or diverging as it chases the embedding space, the authors added the commitment loss as a third term.

  • ์œ„ Loss ๊ฐ’์„ backpropagation ํ•  ๋•Œ, Forward step 2๋ฒˆ '์–ป์€ latent vector์™€ ๊ฐ€์žฅ ๊ฐ€๊นŒ์šด vector๋ฅผ embedding space๋กœ ๋ถ€ํ„ฐ ๊ณ ๋ฅธ๋‹ค' ๋ผ๋Š” ์—ฐ์‚ฐ์— ๋Œ€ํ•ด์„œ๋Š” gradient ๊ฐ’์„ ํ˜๋ฆด ์ˆ˜ ์—†๊ธฐ ๋•Œ๋ฌธ์—(๋ฏธ๋ถ„ ๋ถˆ๊ฐ€๋Šฅ), ์ €์ž๋Š” decoder input์˜ gradient์˜ ๊ฐ’์„ encoder output ๊ฐ’์œผ๋กœ ๋‹จ์ˆœ ๋ณต์‚ฌํ•˜๋Š” ํŠธ๋ฆญ์„ ์‚ฌ์šฉํ–ˆ๋‹ค.

    we approximate the gradient similar to the straight-through estimator and just copy gradients from decoder input z_q(x) to encoder output z_e(x).
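
Putting the three terms together, the paper's training objective is

$$
L = \log p(x \mid z_q(x)) + \lVert \operatorname{sg}[z_e(x)] - e \rVert_2^2 + \beta \lVert z_e(x) - \operatorname{sg}[e] \rVert_2^2
$$

where sg[·] is the stop-gradient operator. Below is a hedged PyTorch sketch of one training step, reusing the hypothetical `quantize` helper from the Forward section; `beta = 0.25` follows the value reported in the paper, and `encoder`, `decoder`, and `codebook` are assumed to be defined elsewhere:

```python
import torch.nn.functional as F

def vq_vae_training_step(x, encoder, decoder, codebook, beta=0.25):
    z_e = encoder(x)               # encoder output z_e(x), assumed (N, D)
    z_q = quantize(z_e, codebook)  # nearest codebook entries z_q(x)

    # Straight-through estimator: the decoder sees z_q, but the gradient
    # at the decoder input is copied unchanged to the encoder output z_e.
    z_q_st = z_e + (z_q - z_e).detach()
    x_recon = decoder(z_q_st)

    recon_loss = F.mse_loss(x_recon, x)          # reconstruction term
    vq_loss = F.mse_loss(z_q, z_e.detach())      # ||sg[z_e(x)] - e||^2, trains the codebook
    commit_loss = F.mse_loss(z_e, z_q.detach())  # ||z_e(x) - sg[e]||^2, commits the encoder
    return recon_loss + vq_loss + beta * commit_loss
```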

4. Experiment & Result

Experimental setup

๋…ผ๋ฌธ์€ ๊ฐ€์žฅ ๋จผ์ € ๊ธฐ์กด continuous VAE ๋ชจ๋ธ, VIMCO ๋ชจ๋ธ, ๊ทธ๋ฆฌ๊ณ  ๋…ผ๋ฌธ์˜ VQ-VAE ๋ชจ๋ธ์„ ๋น„๊ตํ•˜๊ธฐ ์œ„ํ•ด, architecture๋Š” ๊ธฐ์กด standard VAE ๋ชจ๋ธ์˜ ๊ตฌ์กฐ๋ฅผ ๊ทธ๋Œ€๋กœ ๊ฐ€์ ธ๊ฐ€๋˜, latent capacity ๋งŒ์„ ๋ณ€๊ฒฝํ•˜์—ฌ ํ•™์Šต์„ ์ง„ํ–‰ํ–ˆ๋‹ค.

  • encoder๋Š” 2๊ฐœ์˜ Conv2d(kerner_size=(4, 4), stride=2) ๋กœ ์ด๋ฃจ์–ด์ ธ ์žˆ์œผ๋ฉฐ ๋‘๊ฐœ์˜ residual connection์„ ๊ฐ€์ง„๋‹ค. ๋ชจ๋“  ๋ ˆ์ด์–ด๋Š” 256 ์ฐจ์›์„ ๊ฐ€์ง„๋‹ค.

  • decoder๋Š” transposed CNN์„ ์‚ฌ์šฉํ•œ๋‹ค๋Š” ๊ฒƒ ์ œ์™ธ encoder์™€ ๋ชจ๋‘ ๋™์ผํ•œ ๊ตฌ์กฐ์ด๋‹ค.

Result

  • ์•ž์„  ์‹คํ—˜์„ ํ†ตํ•ด ์ €์ž๋Š” VAE, VQ-VAE ๋ฐ VIMCO ๋ชจ๋ธ์€ ๊ฐ๊ฐ 4.51bits/dim, 4.67bits/dim ๊ทธ๋ฆฌ๊ณ  5.14 bits/dim ์˜ ๊ฒฐ๊ณผ๋ฅผ ์–ป์—ˆ๋‹ค. (bits per dimension, ์ด๋Š” NLL loss๋ฅผ dimension์œผ๋กœ ๋‚˜๋ˆ„์–ด์ค€ ๊ฐ’์ด๋‹ค.)

    Compute the negative log likelihood in base e, apply change of base for converting log base e to log base 2, then divide by the number of pixels (e.g. 3072 pixels for a 32x32 rgb image).

  • ๋น„๋ก VIMCO ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์— ๊นŒ์ง€๋Š” ๋ฏธ์น˜์ง€ ๋ชปํ–ˆ์ง€๋งŒ, ๊ทธ๋ž˜๋„ discrete latent VAE ๋ชจ๋ธ๋กœ์„œ๋Š” ์ฒ˜์Œ์œผ๋กœ continuous VAE ๋ชจ๋ธ๊ณผ ์œ ์‚ฌํ•œ ์„ฑ๋Šฅ์„ ์–ป์–ด๋ƒˆ๋‹ค๋Š” ์ ์—์„œ ์ด ๋…ผ๋ฌธ์— novelty๊ฐ€ ์žˆ๋‹ค.

  • ๋…ผ๋ฌธ์€ Images, Audio, Video 3๊ฐ€์ง€์˜ ๋„๋ฉ”์ธ์— ๋Œ€ํ•ด ๋ชจ๋ธ์„ ์‹คํ—˜ํ•˜๊ณ  ๊ฒฐ๊ณผ๋ฅผ ๋ณด์˜€๋‹ค. ๋ชจ๋‘ ๊ธฐ์กด VAE๋งŒํผ ์ข‹์€ ์„ฑ๊ณผ๋ฅผ ๋ณด์˜€์œผ๋‚˜, ๊ทธ ์ค‘ Audio ๋„๋ฉ”์ธ์— ๋Œ€ํ•œ ๊ฒฐ๊ณผ๊ฐ€ ํฅ๋ฏธ๋กญ๋‹ค. ์ €์ž๋Š” Audio input์— ๋Œ€ํ•ด์„œ latent vector์˜ ์ฐจ์›์ด 64๋ฐฐ๋‚˜ ์ค„์–ด๋“œ๋ฏ€๋กœ reconstruction ๊ณผ์ •์ด ์ƒ๋Œ€์ ์œผ๋กœ ํž˜๋“ค ๊ฒƒ์ด๋ผ ์˜ˆ์ƒํ•˜์˜€๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ๊ฒฐ๊ณผ์ ์œผ๋กœ๋Š” ๋ชจ๋ธ์ด audio์˜ content(๋ฐœ์„ฑํ•œ ๋‹จ์–ด๋‚˜ ๊ทธ ์˜๋ฏธ ๋“ฑ e.g. 'ํ…Œ์ŠคํŠธ', '์•ˆ๋…•')๋“ค์„ ์•„์ฃผ ์ž˜ reconstruction ํ•˜๋Š” ๋Œ€์‹ , audio์˜ feature(๋ฐœ์„ฑ, ์Œ์ƒ‰, ์Œ์—ญ๋Œ€ ๋“ฑ e.g. ๋ชฉ์†Œ๋ฆฌ ํ†ค, ์†Œ๋ฆฌ ๋†’๋‚ฎ์ด)๋“ค๋งŒ ์กฐ๊ธˆ์”ฉ ๋ณ€ํ˜•์‹œํ‚ค๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค. ์ด๋กœ ์ธํ•ด ์ €์ž๋Š” VQ-VAE ๊ฐ€ ์†Œ๋ฆฌ์˜ ๊ณ ์ฐจ์›์  ํŠน์ง•๊ณผ ์ €์ฐจ์›์  ํŠน์ง•์„ ์ž˜ ๋ถ„๋ฆฌํ•˜์—ฌ ํ•™์Šตํ•  ๋ฟ ์•„๋‹ˆ๋ผ ๊ทธ ์ค‘ ๊ณ ์ฐจ์›์  ํŠน์ง•์„ ์ฃผ๋กœ encoding ํ•˜๊ณ  ์žˆ๋‹ค๊ณ  ์ฃผ์žฅํ•œ๋‹ค.

    The VQ-VAE has learned a high-level abstract space that is invariant to low-level features and only encodes the content of the speech.
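
As a concrete reading of the bits/dim recipe quoted earlier, a small illustrative helper:

```python
import math

def bits_per_dim(nll_nats: float, num_dims: int = 3072) -> float:
    """Convert a per-image negative log likelihood in nats to bits/dim.

    num_dims = 3072 corresponds to a 32x32 RGB image (32 * 32 * 3).
    """
    return nll_nats / (math.log(2) * num_dims)
```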

5. Conclusion

VQ-VAE ๋Š” VAE์— discrete latent space๋ฅผ ์ ์šฉํ•œ ๋Œ€ํ‘œ์ ์ธ ๋ชจ๋ธ ์ค‘ ํ•˜๋‚˜๋‹ค. ๊ธฐ์กด VAE์™€ ๊ฑฐ์˜ ๋™์ผํ•œ ์„ฑ๋Šฅ์„ ์–ป์—ˆ๋‹ค๋Š” ์ ์—์„œ ์ฃผ๋ชฉํ•  ๋งŒํ•˜๊ธฐ๋„ ํ•˜๊ณ , ๋ฌด์—‡๋ณด๋‹ค ๊ฐœ์ธ์ ์ธ ์˜๊ฒฌ์œผ๋กœ๋Š” ํ•™์Šต ํšจ์œจ์ด๋‚˜ ์•ˆ์ •์„ฑ ์ธก๋ฉด์—์„œ ์‹ค์ œ ํ…Œ์Šคํฌ์— ์‚ฌ์šฉ๋  ๋•Œ ๋ฉ”๋ฆฌํŠธ๊ฐ€ ์žˆ์„ ๊ฒƒ์ด๋ผ ์ƒ๊ฐํ•œ๋‹ค.

์š”์•ฝ

  • Discrete latent space ๋ฅผ Codebook(=Embedding space) ์„ ํ†ตํ•ด ๊ตฌํ˜„ํ•˜์˜€๊ณ  VAE์— ์ ์šฉ.

  • Achieved, for the first time with a discrete latent VAE model, results close to the performance of continuous VAEs.

  • VQ-VAE๋ฅผ ๋‹ค์–‘ํ•œ ๋„๋ฉ”์ธ์— ์‹ค์ œ ์ ์šฉํ•ด๋ณด๊ณ  ๋šœ๋ ทํ•œ ์„ฑ์ทจ๋ฅผ ํ™•์ธ. ํŠนํžˆ ์˜ค๋””์˜ค ๋„๋ฉ”์ธ์—์„œ high-level audio feature๋“ค๋งŒ ์ธ์ฝ”๋”ฉ๋˜๋Š” ๊ฒฐ๊ณผ๋ฅผ ์–ป์–ด๋ƒ„.

Take home message

VAE ๋ชจ๋ธ์˜ Latent Space๋ฅผ discrete ํ•œ ๋’ค ๊ทธ๊ฒƒ์„ ์ž˜ ํ•™์Šตํ•œ๋‹ค๋ฉด ์ถฉ๋ถ„ํ•œ ์„ฑ๋Šฅ์„ ์–ป์„ ์ˆ˜ ์žˆ์œผ๋ฉฐ, ๋™์‹œ์— ํ•™์Šต ํšจ์œจ์  ์ธก๋ฉด์—์„œ ์ด์ ์„ ์–ป์„ ์ˆ˜ ์žˆ๋‹ค.

Author / Reviewer information

Author

์ •์œค์ง„ (Yoonjin Chung)

  • Master Student, Graduate School of AI, KAIST

Reviewer

  • ์œค๊ฐ•ํ›ˆ

  • ์žฅํƒœ์˜

  • ์ดํ˜„์ง€

Reference & Additional materials

[1] Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. In the 2nd International Conference on Learning Representations (ICLR), 2014.

[2] Andriy Mnih and Karol Gregor. Neural Variational Inference and Learning in Belief Networks. arXiv preprint arXiv:1402.0030, 2014.

[3] Andriy Mnih and Danilo Jimenez Rezende. Variational Inference for Monte Carlo Objectives. arXiv preprint arXiv:1602.06725, 2016.
