Video Frame Interpolation via Adaptive Convolution [Kor]

Niklaus et al. / Video Frame Interpolation via Adaptive Convolution / CVPR 2017

1. Problem definition


Video frame interpolation์€ ๊ธฐ์กด์˜ ํ”„๋ ˆ์ž„๋“ค์„ ์ด์šฉํ•˜์—ฌ ์—ฐ์†๋˜๋Š” ํ”„๋ ˆ์ž„ ์‚ฌ์ด ์ค‘๊ฐ„ ํ”„๋ ˆ์ž„์„ ์ƒˆ๋กœ ์ƒ์„ฑํ•จ์œผ๋กœ์จ ๋น„๋””์˜ค ํ”„๋ ˆ์ž„์œจ์„ ๋†’์ด๋Š” task์ž…๋‹ˆ๋‹ค. 1์ดˆ์— ๋ช‡๊ฐœ์˜ ํ”„๋ ˆ์ž„์ด ์žฌ์ƒ์ด ๋˜๋Š”์ง€๋ฅผ ๋‚˜ํƒ€๋‚ด๋Š” ํ”„๋ ˆ์ž„์œจ์ด ์ž‘์œผ๋ฉด ์˜์ƒ์ด ์—ฐ์†์ ์ด์ง€ ์•Š์•„ ๋‚ฎ์€ ํ€„๋ฆฌํ‹ฐ๋ฅผ ๋ณด์ด๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. ์ด๋•Œ video frame interpolation์„ ์ด์šฉํ•˜์—ฌ ์ค‘๊ฐ„ ํ”„๋ ˆ์ž„๋“ค์„ ์ƒˆ๋กญ๊ฒŒ ์ƒ์„ฑํ•ด๋ƒ„์œผ๋กœ์จ ์˜์ƒ์„ ๋”์šฑ ์—ฐ์†์ ์ด๊ฒŒ ๋ณด์ด๊ฒŒํ•˜์—ฌ ๋†’์€ ํ€„๋ฆฌํ‹ฐ๋ฅผ ๊ฐ€์ง€๋„๋ก ๋งŒ๋“ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

ํ•˜๋‚˜์˜ ๋น„๋””์˜ค์— 5๊ฐœ์˜ ์—ฐ์†๋œ ํ”„๋ ˆ์ž„์ด ์žˆ๋‹ค๊ณ  ๊ฐ€์ •ํ•˜์˜€์„ ๋•Œ, video frame interpolation์„ ํ†ตํ•ด ์—ฐ์†๋˜๋Š” ํ”„๋ ˆ์ž„ ์‚ฌ์ด์— ํ•˜๋‚˜์˜ ํ”„๋ ˆ์ž„์„ ์ƒˆ๋กญ๊ฒŒ ๋งŒ๋“ค์–ด๋ƒ„์œผ๋กœ์จ ์ด 9๊ฐœ์˜ ํ”„๋ ˆ์ž„์„ ๊ฐ€์ง„ ๋น„๋””์˜ค๋ฅผ ๋งŒ๋“ค์–ด ๋‚ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋กœ์จ ๋ฌผ์ฒด์˜ ์›€์ง์ž„์ด ๋”์šฑ ์—ฐ์†์ ์œผ๋กœ, ์ž์—ฐ์Šค๋Ÿฝ๊ฒŒ ๋ณด์ผ ์ˆ˜ ์žˆ๋„๋ก ๋งŒ๋“œ๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

FRUC.png

Figure 1: Converting a low frame rate to a high frame rate
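
To make the counting concrete, the following minimal Python sketch shows 2x frame-rate upsampling; `interpolate(a, b)` is a hypothetical stand-in for the method reviewed below.

```python
def double_frame_rate(frames, interpolate):
    """[f0, f1, ..., fN] -> [f0, i01, f1, i12, ..., fN] (N+1 -> 2N+1 frames)."""
    out = []
    for a, b in zip(frames, frames[1:]):
        out.append(a)                   # keep the original frame
        out.append(interpolate(a, b))   # synthesize the in-between frame
    out.append(frames[-1])
    return out

# e.g. 5 input frames -> 4 interpolated frames -> 9 frames in total
```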

2. Motivation


๋ณดํ†ต์˜ video frame interpolation ๊ธฐ๋ฒ•์€ ํ”„๋ ˆ์ž„๋“ค ๊ฐ„์˜ ์›€์ง์ž„์„ ์ถ”์ •ํ•˜๊ณ  ์ด๋ฅผ ๋ฐ”ํƒ•์œผ๋กœ ๊ธฐ์กด์˜ ํ”„๋ ˆ์ž„๋“ค์˜ ํ”ฝ์…€ ๊ฐ’์„ ํ•ฉ์„ฑํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. ์ด๋•Œ interpolation ๊ฒฐ๊ณผ๋Š” ํ”„๋ ˆ์ž„ ์‚ฌ์ด์˜ ์›€์ง์ž„์ด ์–ผ๋งˆ๋‚˜ ์ •ํ™•ํ•˜๊ฒŒ ์ถ”์ •์ด ๋˜๋Š”์ง€์— ๋”ฐ๋ผ ๋‹ฌ๋ผ์ง€๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. ํ•ด๋‹น ๋…ผ๋ฌธ์€ ์›€์ง์ž„ ์ถ”์ •๊ณผ ํ”ฝ์…€ ํ•ฉ์„ฑ์˜ ๋‘ ๋‹จ๊ณ„ ๊ณผ์ •์„ ํ•œ ๋‹จ๊ณ„๋กœ ํ•ฉ์นจ์œผ๋กœ์จ ๊ฐ•์ธํ•œ video frame interpolation ๊ธฐ๋ฒ•์„ ์ œ์•ˆํ•˜์˜€์Šต๋‹ˆ๋‹ค.

  • ๊ธฐ์กด์˜ frame interpolation ๊ธฐ๋ฒ•

    • Many earlier frame interpolation methods, such as those proposed by Werlberger et al., Yu et al., and Baker et al., use optical flow or stereo matching to estimate the motion between two consecutive frames and then interpolate one or more frames between them based on that motion.

    • Unlike these motion-estimation approaches, Meyer et al. compute the phase difference between the input frames and propagate this phase information across the levels of a multi-scale pyramid to obtain better video frame interpolation results.

  • ๋”ฅ๋Ÿฌ๋‹ ๊ธฐ๋ฐ˜์˜ frame interpolation ๊ธฐ๋ฒ•

    • Zhou et al.์˜ ๋…ผ๋ฌธ์—์„œ๋Š” ๋™์ผํ•œ ๋ฌผ์ฒด๋ฅผ ์—ฌ๋Ÿฌ ๋‹ค๋ฅธ ์‹œ๊ฐ์œผ๋กœ ๋ฐ”๋ผ๋ณธ ๊ฒƒ๋“ค์€ ์„œ๋กœ ์—ฐ๊ด€์„ฑ์ด ๋†’๋‹ค๋Š” ์ ์„ ์ด์šฉํ•˜์—ฌ ์ƒˆ๋กœ์šด frame interpolation์„ ์ œ์•ˆํ•˜์˜€์Šต๋‹ˆ๋‹ค. ์—ฌ๋Ÿฌ input view๋“ค์„ ํ๋ฆ„์— ๋”ฐ๋ผ warping ์‹œํ‚ค๊ณ , ๊ทธ๊ฒƒ๋“ค์„ ํ•ฉ์นจ์œผ๋กœ์จ ์ƒˆ๋กœ์šด view ํ•ฉ์„ฑ์„ ์œ„ํ•œ ์ ๋‹นํ•œ ํ”ฝ์…€์„ ๊ณ ๋ฅด๋Š” ๋ฐฉ๋ฒ•์„ ์ œ์•ˆํ•˜์˜€์Šต๋‹ˆ๋‹ค.

    • Flynn et al. project the input images onto multiple depth planes and combine the colors on each depth plane to synthesize a new image.

Idea

ํ•ด๋‹น video frame interpolation ๊ธฐ๋ฒ•์€ ๊ธฐ์กด์— ๋ถ„๋ฆฌ๋˜์–ด ์ง„ํ–‰๋˜๋˜ ๋ชจ์…˜ ์ถ”์ •๊ณผ ํ”ฝ์…€ ํ•ฉ์„ฑ์„ ํ•˜๋‚˜์˜ ๊ณผ์ •์œผ๋กœ ํ•ฉ์ณค์Šต๋‹ˆ๋‹ค. ํ”„๋ ˆ์ž„ ์‚ฌ์ด์˜ ์›€์ง์ž„์— ๋Œ€ํ•œ ์ •๋ณด๋ฅผ ์ด์šฉํ•˜์—ฌ ์–ด๋–ค ํ”ฝ์…€๋“ค์ด ํ•ฉ์„ฑ์— ์ด์šฉ๋  ๊ฒƒ์ธ์ง€, ๊ทธ๋ฆฌ๊ณ  ์ด๋“ค ์ค‘ ์–ด๋–ค ํ”ฝ์…€์— ๋” ๋งŽ์€ weight๋ฅผ ์ค„ ๊ฒƒ์ธ์ง€๋ฅผ ๋‚˜ํƒ€๋‚ด์ฃผ๋Š” interpolation coefficient๊ฐ€ ํ‘œํ˜„๋˜์–ด ์žˆ๋Š” convolution kernel์„ ์˜ˆ์ธกํ•˜๊ณ ์ž ํ•œ ๊ฒƒ ์ž…๋‹ˆ๋‹ค. ์ด๋ ‡๊ฒŒ ์˜ˆ์ธก๋œ kernel์„ input ์ด๋ฏธ์ง€์™€ ๊ฒฐํ•ฉ์‹œํ‚ด์œผ๋กœ์จ ์ตœ์ข… ์ค‘๊ฐ„ ํ”„๋ ˆ์ž„์„ ์–ป์„ ์ˆ˜ ์žˆ๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

์ด๋•Œ, ์ œ์•ˆํ•œ ๊ธฐ๋ฒ•์€ ๋ณ„๋„๋กœ optical flow๋‚˜ multiple depth plane์„ ์ด์šฉํ•˜์—ฌ input ์ด๋ฏธ์ง€๋ฅผ warping ์‹œํ‚ค๋Š” ๊ณผ์ •์„ ๊ฑฐ์น˜์ง€ ์•Š์•„๋„ ๋˜๊ธฐ ๋•Œ๋ฌธ์— ์—ฐ์‚ฐ๋Ÿ‰์ด ๊ฐ์†Œํ•˜๊ณ , occlusion๊ณผ ๊ฐ™์ด ํ•ฉ์„ฑ์ด ์–ด๋ ค์šด ๊ฒฝ์šฐ์—๋„ ์ข‹์€ ๊ฒฐ๊ณผ๋ฅผ ๋‚ด๋ณด๋‚ผ ์ˆ˜ ์žˆ๋‹ค๋Š” ์žฅ์ ์ด ์žˆ์Šต๋‹ˆ๋‹ค.

3. Method


์ œ์•ˆํ•˜๋Š” video frame interpolation ๊ธฐ๋ฒ•์€ ๋‘ ๊ฐœ์˜ input frame I1, I2๊ฐ€ ์žˆ์„ ๋•Œ ๋‘ ํ”„๋ ˆ์ž„์˜ ์ค‘๊ฐ„์— ์žˆ๋Š”, ์ƒˆ๋กœ์šด ํ”„๋ ˆ์ž„ equ4.png ์„ interpolate ํ•˜๋Š” ๊ฒƒ์„ ๋ชฉํ‘œ๋กœ ํ•ฉ๋‹ˆ๋‹ค.

Overall method

Approach.PNG

Figure 2: Interpolation by convolution. (a) previous work; (b) proposed method

Figure 2 (a)์—์„œ ๋ณผ ์ˆ˜ ์žˆ๋“ฏ์ด, ๊ธฐ์กด์˜ video frame interpolation ๊ธฐ๋ฒ•์€ ๋ชจ์…˜ ์ถ”์ •์„ ํ†ตํ•ด equ4.png ์˜ ํ”ฝ์…€ (x, y)์— ์ƒ์‘ํ•˜๋Š” I1, I2์—์„œ์˜ ํ”ฝ์…€๋“ค์„ ๊ตฌํ•˜๊ณ  ์ด๋“ค์„ weighted sum์„ ํ•˜์—ฌ ์ตœ์ข… interpolate frame๋ฅผ ๊ตฌํ•˜์˜€์Šต๋‹ˆ๋‹ค. ๋ฐ˜๋ฉด Figure 2 (b)์˜ ์ œ์•ˆํ•˜๋Š” ๋ฐฉ๋ฒ•์€ ๋ชจ์…˜ ์ถ”์ •๊ณผ ํ”ฝ์…€ ํ•ฉ์„ฑ์„ ํ•˜๋‚˜์˜ ๊ณผ์ •์œผ๋กœ ํ•ฉ์น˜๊ธฐ์œ„ํ•ด interpolation์— ๋Œ€ํ•œ ์ •๋ณด๊ฐ€ ๋“ค์–ด์žˆ๋Š” kernel์„ ์˜ˆ์ธกํ•˜๊ณ ,์ž…๋ ฅ ํ”„๋ ˆ์ž„๋“ค์˜ patch์ธ P1,P2์™€ kernel์˜ local convolution์„ ์ˆ˜ํ–‰ํ•จ์œผ๋กœ interpolation์„ ์ง„ํ–‰ํ•˜์˜€์Šต๋‹ˆ๋‹ค.

Architecture.PNG

Figure 3: Overall process of proposed method

Figure 3๋Š” ์ œ์•ˆํ•˜๋Š” ๋ฐฉ๋ฒ•์˜ ์ „๋ฐ˜์ ์ธ ๊ณผ์ •์„ ๋ณด์—ฌ์ฃผ๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. equ4.png ์—์„œ ์–ป๊ณ ์žํ•˜๋Š” ํ”ฝ์…€์˜ ์œ„์น˜๋ฅผ (x, y) ๋ผ๊ณ  ํ–ˆ์„ ๋•Œ, ๊ฐ๊ฐ I1, I2์—์„œ (x, y)๋ฅผ ์ค‘์‹ฌ์œผ๋กœ ํ•˜๋Š” receptive field patch R1, R2๊ฐ€ fully convolutional neural network(Convnet)์˜ input์œผ๋กœ ๋“ค์–ด๊ฐ€๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. ์ด๋•Œ Convnet์€ input ํ”„๋ ˆ์ž„์˜ ์ •๋ณด๋“ค์„ ์ด์šฉํ•˜์—ฌ ํ”„๋ ˆ์ž„๋“ค ์‚ฌ์ด์˜ ๋ชจ์…˜์„ ์ถ”์ •ํ•จ์œผ๋กœ์จ input์˜ ์–ด๋–ค ํ”ฝ์…€๋“ค์„ interpolation์— ์ด์šฉํ• ์ง€, ๊ทธ ์ค‘ ์–ด๋А ํ”ฝ์…€์— ๋น„์ค‘์„ ๋‘์–ด ํ•ฉ์„ฑํ•  ์ง€์— ๋Œ€ํ•œ ์ •๋ณด๊ฐ€ ๋‹ด๊ธด kernel์„ output์œผ๋กœ ๋‚ด๋ณด๋‚ด๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

์ด๋ ‡๊ฒŒ ์–ป์€ kernel์€ input frame patch P1, P2 ์™€ convolve ๋ฉ๋‹ˆ๋‹ค. ์ด๋•Œ P1, P2๋Š” ์•ž์„œ Convnet์˜ input R1, R2 ๋ณด๋‹ค๋Š” ์ž‘์€ ์‚ฌ์ด์ฆˆ์ด์ง€๋งŒ, (x, y)๋ฅผ center๋กœ ํ•˜๋Š” input patch๋ฅผ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค. ์ฆ‰, kernel K๋ฅผ ์ด์šฉํ•˜์—ฌ P1, P2์™€์˜ convolution์„ ์ง„ํ–‰ํ•จ์œผ๋กœ์จ ์ตœ์ข… interpolated frame์˜ (x, y)์— ํ•ด๋‹นํ•˜๋Š” ์œ„์น˜์˜ pixel ๊ฐ’์„ ์–ป์„ ์ˆ˜ ์žˆ๋Š” ๊ฒƒ์ด๋‹ค.

$$\hat{I}(x, y) = K(x, y) * P(x, y)$$

where $P(x, y)$ denotes the patches $P_1$ and $P_2$ from the two input frames, centered at $(x, y)$.

์ด ๊ณผ์ •์„ equ4.png ์˜ ๋ชจ๋“  ํ”ฝ์…€์— ๋Œ€ํ•ด ๋ฐ˜๋ณตํ•จ์œผ๋กœ์จ, equ4.png์˜ ๋ชจ๋“  ํ”ฝ์…€๊ฐ’์„ ์–ป์–ด ์ตœ์ข… interpolated๋œ frame์„ ์–ป์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Convolution kernel estimation

Convnet.PNG

Table 1: Architecture of Convnet

Table 1์€ receptive field patch R1, R2๋ฅผ input์œผ๋กœ ํ•˜์—ฌ kernel K๋ฅผ output์œผ๋กœ ๋‚ด๋ณด๋‚ด๋Š” Convnet์˜ ๊ตฌ์กฐ๋ฅผ ๋‚˜ํƒ€๋‚ด๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. Input์œผ๋กœ๋Š” 79 * 79์˜ spatial size์™€ RGB 3๊ฐœ์˜ ์ฑ„๋„์„ ๊ฐ€์ง€๋Š” R1, R2๊ฐ€ concat๋˜์–ด ๋“ค์–ด๊ฐ€๊ณ , ์ด input์€ ์—ฌ๋Ÿฌ๊ฐœ์˜ convolutional layer๋“ค์„ ๊ฑฐ์น˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. ๋งˆ์ง€๋ง‰ feature map์€ spatial softmax๋ฅผ ๊ฑฐ์ณ ๋ชจ๋“  weight์˜ ํ•ฉ์ด 1์ด ๋˜๋„๋ก ํ•ด์ฃผ๊ณ , reshape ํ•จ์ˆ˜๋ฅผ ์ด์šฉํ•œ ์ด๋ฏธ์ง€ size ์กฐ์ •์„ ํ†ตํ•ด output์œผ๋กœ ๋‚ด๋ณด๋‚ด๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. ์ด๋•Œ output์˜ ํฌ๊ธฐ๋Š” 41 * (41+41)์˜ ํ˜•ํƒœ๋กœ, 41 * 41์˜ ํฌ๊ธฐ๋ฅผ ๊ฐ€์ง€๋Š” input patch P1, P2 ์™€ local convolution์ด ์ˆ˜ํ–‰๋ฉ๋‹ˆ๋‹ค.

์ด๋•Œ, ๋‘ Convnet์˜ input์ธ R1, R2๋Š” channel ์ถ•์œผ๋กœ, output์ธ P1, P2๋Š” width ์ถ•์œผ๋กœ concatenate๊ฐ€ ๋ฉ๋‹ˆ๋‹ค. R1, R2์„ width ์ถ•์œผ๋กœ concatenate๋ฅผ ํ•˜์—ฌ convnet์˜ input์œผ๋กœ ๋งŒ๋“ค์–ด๋ฒ„๋ฆฌ๋ฉด concat๋œ ์ด๋ฏธ์ง€๊ฐ€ ํ•˜๋‚˜์˜ ์ด๋ฏธ์ง€๋กœ ์ธ์‹์ด ๋˜์–ด convolution ์—ฐ์‚ฐ์ด ๊ฐ™์ด ์ง„ํ–‰๋˜๊ธฐ ๋•Œ๋ฌธ์— ๋‘ ์ด๋ฏธ์ง€๊ฐ€ spatial dimension์—์„œ ์„ž์ธ์ฑ„๋กœ feature map์ด ๋งŒ๋“ค์–ด์ง€๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. ์ฆ‰, ๋‘ receptive field๊ฐ€ spatial information์„ ์žƒ์–ด๋ฒ„๋ฆฌ๊ฒŒ ๋˜๊ธฐ ๋•Œ๋ฌธ์— receptive field๋Š” channel ์ถ•์œผ๋กœ concatenate๊ฐ€ ์ด๋ฃจ์–ด์ง€๊ฒŒ ๋˜๋Š”๊ฒƒ์ž…๋‹ˆ๋‹ค. ๋˜ํ•œ kernel๊ณผ input patch์™€์˜ ๊ณฑ์…ˆ์—์„œ๋Š” P1, P2๊ฐ€ channel์ถ•์œผ๋กœ concatenate๋œ ํ˜•ํƒœ๋กœ ๋‚˜์˜ค๊ฒŒ ๋˜๋”๋ผ๋„ kernel๋„ ๊ฐ๊ฐ์˜ patch์— ๋งž๊ฒŒ ๊ณฑํ•ด์งˆ ์ˆ˜ ์žˆ๋Š” ํ˜•ํƒœ๋กœ ๋‚˜์˜ค๊ฒŒ ๋œ๋‹ค๋ฉด, ๋ฌธ์ œ๊ฐ€ ์—†์„๊ฒƒ์ด๋ผ๊ณ  ์˜ˆ์ƒ์ด ๋ฉ๋‹ˆ๋‹ค.

Loss function

๋จผ์ €, ์ œ์•ˆํ•˜๋Š” ๋ฐฉ๋ฒ•์€ ๋‘๊ฐ€์ง€ loss ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ์ฒซ๋ฒˆ์งธ๋กœ, Color loss๋Š” L1 loss๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ interpolated pixel color์™€ ground-truth color ์‚ฌ์ด์˜ ์ฐจ๋ฅผ ๊ตฌํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. ์ด๋•Œ ๋‹จ์ˆœํžˆ color loss๋งŒ ์‚ฌ์šฉํ–ˆ์„ ๋•Œ ๋ฐœ์ƒํ•˜๋Š” ๋ธ”๋Ÿฌ ๋ฌธ์ œ๋Š” gradient loss๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์™„ํ™”์‹œ์ผœ์ฃผ๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. Gradient loss๋Š” input patch์˜ gradient๋ฅผ convnet์˜ ์ž…๋ ฅ์œผ๋กœ ํ–ˆ์„ ๋•Œ์˜ output๊ณผ ground-truth gradient ์‚ฌ์ด์˜ L1 loss๋ฅผ ํ†ตํ•ด ๊ตฌํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋•Œ gradient๋Š” ์ค‘์‹ฌ ํ”ฝ์…€์„ ๊ธฐ์ค€์œผ๋กœ 8๊ฐœ์˜ neighboring pixel๊ณผ ์ค‘์‹ฌ ํ”ฝ์…€์˜ ์ฐจ์ด๋ฅผ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค.

$$E_c = \sum_i \big\| K_i * P_i - \tilde{C}_i \big\|_1, \qquad E_g = \sum_i \sum_{k=1}^{8} \big\| K_i * G_i^k - \tilde{G}_i^k \big\|_1, \qquad E = E_c + E_g$$

where, for each training pixel $i$, $P_i$ is the input color patch, $G_i^k$ is its finite-difference gradient in the $k$-th of the eight neighbor directions, and $\tilde{C}_i$, $\tilde{G}_i^k$ are the corresponding ground truths.
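
The sketch below computes this combined loss, assuming the per-pixel kernels and patches have already been gathered into batched tensors; for brevity a single finite-difference gradient image stands in for the eight neighbor differences above.

```python
import torch
import torch.nn.functional as F

def interpolation_loss(kernel, P, G, c_gt, g_gt):
    """E = E_c + E_g for a batch of training pixels.

    kernel: (B, 41, 82) predicted per-pixel kernels
    P:      (B, 3, 41, 82) color patches [P1 | P2] from the input frames
    G:      (B, 3, 41, 82) finite-difference gradients of those patches
    c_gt:   (B, 3) ground-truth colors; g_gt: (B, 3) ground-truth gradients
    """
    k = kernel.unsqueeze(1)              # broadcast kernel over RGB channels
    c_hat = (k * P).sum(dim=(2, 3))      # K * P -> (B, 3) interpolated colors
    g_hat = (k * G).sum(dim=(2, 3))      # K * G -> (B, 3) gradients
    return F.l1_loss(c_hat, c_gt) + F.l1_loss(g_hat, g_gt)
```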

4. Experiment & Result


Experimental setup

4.1. Training dataset

ํ•ด๋‹น ๋…ผ๋ฌธ์˜ dataset์€ optical flow์™€ ๊ฐ™์€ ๋ณ„๋„์˜ ground-truth๊ฐ€ ํ•„์š” ์—†๊ธฐ ๋•Œ๋ฌธ์— ์ธํ„ฐ๋„ท์˜ ๋ชจ๋“  ๋น„๋””์˜ค๋ฅผ ์‚ฌ์šฉ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ ํ•ด๋‹น ๋…ผ๋ฌธ์—์„œ๋Š” Flickr with a Creative Commons license๋กœ๋ถ€ํ„ฐ "driving", "dancing", "surfing", "riding", ๊ทธ๋ฆฌ๊ณ  "skiing"์˜ ํ‚ค์›Œ๋“œ๊ฐ€ ๋‹ด๊ธด 3000๊ฐœ์˜ ๋น„๋””์˜ค๋ฅผ ์–ป์—ˆ์Šต๋‹ˆ๋‹ค. ์ด ์ค‘์—์„œ ์ €ํ™”์งˆ์˜ ๋น„๋””์˜ค๋Š” ์ œ๊ฑฐํ•˜๊ณ  1280 * 720์˜ ํ•ด์ƒ๋„๋กœ scaling์„ ํ•œ ํ›„, ์—ฐ์†์ ์ธ ์„ธ๊ฐœ์˜ ํ”„๋ ˆ์ž„์”ฉ ๋ฌถ์–ด triple-frame group์„ ํ˜•์„ฑํ•˜์˜€์Šต๋‹ˆ๋‹ค. ์ด๋“ค ์ค‘ ๋ชจ์…˜์ด ์ž‘์€๊ฒƒ๋“ค์€ ์ตœ๋Œ€ํ•œ ํ”ผํ•˜๊ธฐ ์œ„ํ•ด ํ”„๋ ˆ์ž„๋“ค ์‚ฌ์ด์˜ optical flow์™€ ์—”ํŠธ๋กœํ”ผ๊ฐ€ ๋†’์€ 250,000๊ฐœ์˜ triple-patch ๊ทธ๋ฃน์„ ์„ ๋ณ„ํ•จ์œผ๋กœ์จ ๋น„๊ต์  ๋†’์€ ๋ชจ์…˜์„ ๊ฐ€์ง„ frame์œผ๋กœ ์ด๋ฃจ์–ด์ง„ dataset์„ ๊ตฌ์„ฑํ•˜์˜€์Šต๋‹ˆ๋‹ค.

4.2. Hyper-parameter selection

Deep neural network๋ฅผ ์œ„ํ•ด ์„ค์ •ํ•ด์•ผํ•  ๋‘๊ฐ€์ง€ ์ค‘์š”ํ•œ hyper-parameter๋Š” convolution kernel size์™€ receptive field path size์ž…๋‹ˆ๋‹ค. ๋ชจ์…˜ ์˜ˆ์ธก์„ ์ž˜ํ•˜๊ธฐ ์œ„ํ•ด์„œ kernel์˜ size๋Š” training data์—์„œ ํ”„๋ ˆ์ž„๊ฐ„์˜ ์ตœ๋Œ€ motion ํฌ๊ธฐ์˜€๋˜ 38 pixel ๋ณด๋‹ค ํฐ 41 pixel, ์ฆ‰ 41 * 41๋กœ ์ •ํ•˜์˜€์Šต๋‹ˆ๋‹ค. ๋˜ํ•œ receptive field patch์˜ size๋Š” convolution kernel size๋ณด๋‹ค ํฌ์ง€๋งŒ ๋„ˆ๋ฌด ๋งŽ์€ ์—ฐ์‚ฐ๋Ÿ‰์„ ์ฐจ์ง€ํ•˜์ง€ ์•Š๋„๋ก 79 * 79๋กœ ์ •ํ•˜์˜€์Šต๋‹ˆ๋‹ค.

4.3. Training setup

-Parameter initialization: Xavier initialization

-Optimizer: AdaMax with $\beta_1 = 0.9$, $\beta_2 = 0.999$

-Learning rate: 0.001

-Batch size: 128

-Inference time: 9.1 seconds for a 1280 × 720 frame
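
As a minimal sketch of this configuration (reusing the KernelPredictionNet sketch above, with PyTorch's built-in Adamax standing in for AdaMax):

```python
import torch

net = KernelPredictionNet()  # sketch defined in Section 3
optimizer = torch.optim.Adamax(net.parameters(), lr=0.001, betas=(0.9, 0.999))
```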

Result

Quantitative result

quan_result.PNG

Table 2: Evaluation on the Middlebury testing set (average interpolation error)

Table 2์—์„œ real-world scene์˜ ๋„ค๊ฐ€์ง€ ์˜ˆ์‹œ(Backy, Baske, Dumpt, Everg)์— ๋Œ€ํ•ด์„œ๋Š” ๊ฐ€์žฅ ๋‚ฎ์€ interpolation error, ์ฆ‰ ๊ฐ€์žฅ ๋†’์€ ์„ฑ๋Šฅ์„ ๋ณด์˜€์Šต๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ syntheticํ•œ frame ์ด๊ฑฐ๋‚˜ lab scene์˜ ๋„ค๊ฐ€์ง€ ์˜ˆ์‹œ(Mequ., Schef., Urban, Teddy)์— ๋Œ€ํ•ด์„œ๋Š” ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์ด์ง€ ์•Š๋Š”๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๊ทธ ์ด์œ  ์ค‘ ํ•˜๋‚˜๋กœ, training dataset์˜ ์ฐจ์ด๋ฅผ ๋“ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์•ž์„œ ์–ธ๊ธ‰ํ•œ ๊ฒƒ์ฒ˜๋Ÿผ ์ œ์•ˆํ•˜๋Š” ๋„คํŠธ์›Œํฌ๋Š” ์œ ํŠœ๋ธŒ์™€ ๊ฐ™์ด ์ธํ„ฐ๋„ท์—์„œ ๊ตฌํ•  ์ˆ˜ ์žˆ๋Š” ์‹ค์ œ ์˜์ƒ, real-world scene์˜ frame๋“ค์„ dataset์œผ๋กœ ์‚ฌ์šฉํ•˜์˜€์Šต๋‹ˆ๋‹ค. ํ•ฉ์„ฑ์ด ๋œ frame๋“ค๊ณผ real-world์˜ frame์˜ ํŠน์„ฑ์ด ๋‹ค๋ฅด๊ธฐ ๋•Œ๋ฌธ์— ํ•ฉ์„ฑ์œผ๋กœ ๋งŒ๋“ค์–ด์ง„ frame์— ๋Œ€ํ•ด์„œ๋Š” ์„ฑ๋Šฅ์ด ๋น„๊ต์  ์ข‹์ง€ ์•Š๊ฒŒ ๋˜๋Š” ๊ฒƒ ์ž…๋‹ˆ๋‹ค.

Qualitative result

-Blur

qual_blur.PNG

Figure 4: Qualitative evaluation on blurry videos

Figure 4์—์„œ๋Š” ์นด๋ฉ”๋ผ์˜ ์›€์ง์ž„, ํ”ผ์‚ฌ์ฒด์˜ ์›€์ง์ž„ ๋“ฑ์œผ๋กœ ์ธํ•˜์—ฌ ๋ธ”๋Ÿฌ๊ฐ€ ์žˆ๋Š” ๋น„๋””์˜ค์— ๋Œ€ํ•œ video frame interpolation ๊ฒฐ๊ณผ์ž…๋‹ˆ๋‹ค. ์ œ์•ˆํ•œ ๋ฐฉ๋ฒ•๊ณผ Meyer et al์—์„œ์˜ ๋ฐฉ๋ฒ•์ด ๋‹ค๋ฅธ ๋ฐฉ๋ฒ•๋“ค์— ๋น„ํ•ด artifact๊ฐ€ ๊ฑฐ์˜ ์—†๊ณ  sharpํ•œ ์ด๋ฏธ์ง€๋ฅผ ๋‚ธ๋‹ค๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

-Abrupt brightness change

qual_brightness.PNG

Figure 5: Qualitative evaluation in video with abrupt brightness change

Figure 5์—์„œ๋Š” input frame๋“ค ์‚ฌ์ด์˜ ๊ฐ‘์ž‘์Šค๋Ÿฌ์šด ๋ฐ๊ธฐ ๋ณ€ํ™”๋กœ ์ธํ•ด brightness consistency์— ๋Œ€ํ•œ ๊ฐ€์ •์ด ์นจํ•ด๋œ ๊ฒฝ์šฐ์— ๋Œ€ํ•œ video frame interpolation ๊ฒฐ๊ณผ๋ฅผ ๋ณด์—ฌ์ฃผ๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ์ด ๊ฒฝ์šฐ์—๋„ ์ œ์•ˆํ•˜๋Š” ๋ฐฉ๋ฒ•๊ณผ Meyer et al์—์„œ ์ œ์•ˆํ•œ ๋ฐฉ๋ฒ•์ด artifact๊ฐ€ ๊ฑฐ์˜ ์—†๋Š” ๊ฒฐ๊ณผ๊ฐ€ ๋‚˜์™”์Šต๋‹ˆ๋‹ค. ๊ทธ ์ค‘์—์„œ๋„ ํŠนํžˆ, ์ด ๋…ผ๋ฌธ์—์„œ ์ œ์•ˆํ•˜๋Š” ๋ฐฉ๋ฒ•์ด ํ๋ฆฟํ•จ ์—†์ด ๊ฐ€์žฅ ์ข‹์€ ๊ฒฐ๊ณผ๊ฐ€ ๋‚˜์™”๋‹ค๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

-Occlusion

qual_occl.PNG

Figure 6: Qualitative evaluation with respect to occlusion

Figure 6์—์„œ๋Š” ์ƒˆ๋กœ์šด ํ”ผ์‚ฌ์ฒด์˜ ์œ ์ž… ๋“ฑ์œผ๋กœ occlusion์ด ๋ฐœ์ƒํ•  ๋•Œ์˜ video frame interpolation ๊ฒฐ๊ณผ๋ฅผ ํ™•์ธ ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. Artifact๊ฐ€ ์ƒ๊ธฐ๋Š” ๋‹ค๋ฅธ ๋ฐฉ๋ฒ•๋“ค์— ๋น„ํ•ด์„œ ์ œ์•ˆํ•˜๋Š” ๋ฐฉ๋ฒ•์—์„œ๋Š” ์„ ๋ช…ํ•˜๊ฒŒ, ์ž˜ ํ•ฉ์„ฑ๋œ ๊ฒฐ๊ณผ๊ฐ€ ๋‚˜์˜ค๋Š” ๊ฒƒ์„ ํ™•์ธํ•จ์œผ๋กœ์จ ์ œ์•ˆํ•˜๋Š” ๋ฐฉ๋ฒ•์ด occlusion๊ณผ ๊ฐ™์€ ์–ด๋ ค์šด ๊ฒฝ์šฐ์—๋„ frame interpolation์„ ์ž˜ ํ•˜๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์ฆ‰, ์ด๋Ÿฌํ•œ ๊ฒฐ๊ณผ๋ฅผ ํ†ตํ•ด ์ œ์•ˆํ•˜๋Š” ๋ฐฉ๋ฒ•์ด ๊ธฐ์กด์˜ video frame interpolation์œผ๋กœ ํ•ด๊ฒฐํ•˜๊ธฐ ์–ด๋ ค์šด blur, abrupt brightness change, occlusion ๊ณผ ๊ฐ™์€ ์ƒํ™ฉ์—์„œ๋„ ์ข‹์€ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์ธ๋‹ค๋Š” ๊ฒƒ์„ ํ™•์ธ ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

5. Conclusion


์ €์ž๋Š” ๋ชจ์…˜ ์ถ”์ •๊ณผ ํ”ฝ์…€ ํ•ฉ์„ฑ์˜ ๋‘๊ฐ€์ง€ ๊ณผ์ •์„ ํ•˜๋‚˜์˜ ๊ณผ์ •์œผ๋กœ ํ•ฉ์นจ์œผ๋กœ์จ ๋”์šฑ ๋” ๊ฐ•์ธํ•œ video frame interpolation ๊ธฐ๋ฒ•์„ ์ œ์•ˆํ•˜์˜€์Šต๋‹ˆ๋‹ค. ๊ฐ ํ”ฝ์…€๋งˆ๋‹ค ๋ชจ์…˜๊ณผ ํ•ฉ์„ฑ์— ๋Œ€ํ•œ ์ •๋ณด๊ฐ€ ๋‹ด๊ธด ์ƒˆ๋กœ์šด kernel์„ ๋งŒ๋“ค์–ด interpolation์„ ์ˆ˜ํ–‰ํ•จ์œผ๋กœ์จ occlusion๊ณผ ๊ฐ™์ด video frame interpolation์„ ํ•˜๊ธฐ ์–ด๋ ค์šด ์ƒํ™ฉ์—์„œ๋„ ์ข‹์€ ์„ฑ๋Šฅ์„ ์ด๋Œ์–ด ๋ƒˆ์Šต๋‹ˆ๋‹ค.

ํ•˜์ง€๋งŒ ๊ฐ pixel๋งˆ๋‹ค ํฐ ํฌ๊ธฐ์˜ kernel์„ ์ƒ์„ฑํ•ด๋‚ด์•ผ ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๋„ˆ๋ฌด ๋งŽ์€ ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ์‚ฌ์šฉ๋˜๊ณ  ์—ฐ์‚ฐ๋Ÿ‰์ด ๋งŽ๋‹ค๋Š” ๋‹จ์ ์ด ์žˆ์Šต๋‹ˆ๋‹ค.

Take home message

๊ผญ optical flow๊ณผ ๊ฐ™์€ motion estimation์„ ์œ„ํ•œ ์ถ”๊ฐ€์ ์ธ ๋ฐฉ๋ฒ•์„ ์‚ฌ์šฉํ•˜์ง€ ์•Š๋”๋ผ๋„ ์ข‹์€ ์„ฑ๋Šฅ์˜ video frame interpolation์„ ์ˆ˜ํ–‰ ํ•  ์ˆ˜ ์žˆ๋‹ค

๊ฐ ํ”ฝ์…€์„ ์œ„ํ•œ kernel์„ ์˜ˆ์ธกํ•ด ๋ƒ„์œผ๋กœ์จ ๊ฐ ํ”ฝ์…€์˜ ์ƒํ™ฉ์— ๋งž๊ฒŒ ํ”ฝ์…€ ํ•ฉ์„ฑ์„ ํ•  ์ˆ˜ ์žˆ๊ณ , ์ด๊ฒƒ์ด ๋”์šฑ ๊ฒฐ๊ณผ๋ฅผ ์ข‹๊ฒŒ ํ•  ์ˆ˜ ์žˆ๋‹ค.

Author / Reviewer information

Author

์ด์œ ์ง„ (Yujin Lee)

  • KAIST

  • dldbwls0505@kaist.ac.kr

Reviewer

  1. Korean name (English name): Affiliation / Contact information

  2. Korean name (English name): Affiliation / Contact information

  3. โ€ฆ

Reference & Additional materials

  • S. Niklaus, L. Mai, and F. Liu. Video frame interpolation via adaptive convolution. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.

  • S. Baker, D. Scharstein, J. P. Lewis, S. Roth, M. J. Black, and R. Szeliski. A database and evaluation methodology for optical flow. International Journal of Computer Vision, 92(1):1–31, 2011.

  • M. Werlberger, T. Pock, M. Unger, and H. Bischof. Optical flow guided TV-L1 video interpolation and restoration. In Energy Minimization Methods in Computer Vision and Pattern Recognition, volume 6819, pages 273–286, 2011.

  • Z. Yu, H. Li, Z. Wang, Z. Hu, and C. W. Chen. Multi-level video frame interpolation: Exploiting the interaction among different levels. IEEE Transactions on Circuits and Systems for Video Technology, 23(7):1235–1248, 2013.

  • S. Meyer, O. Wang, H. Zimmer, M. Grosse, and A. Sorkine-Hornung. Phase-based frame interpolation for video. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1410–1418, 2015.

  • J. Flynn, I. Neulander, J. Philbin, and N. Snavely. DeepStereo: Learning to predict new views from the world's imagery. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5515–5524, 2016.

  • T. Zhou, S. Tulsiani, W. Sun, J. Malik, and A. A. Efros. View synthesis by appearance flow. In ECCV, volume 9908, pages 286–301, 2016.
