Multi-Task Neural Processes [Eng]

(Description) Kim et al. / Multi-Task Neural Processes / ICLR 2022

Multi-Task Neural Processes [Eng]

1. Problem definition

Neural Processes (NPs) are a family of meta-learning methods that model distributions over functions (i.e., stochastic processes). NPs treat a function realized from an underlying stochastic process as a single task and can adapt to unseen tasks through inference over functions. Thanks to this property, they have been applied to a variety of domains such as image regression, image classification, and time-series regression. In this paper, the authors extend neural processes to the multi-task setting, where the tasks are correlated functions realized from multiple stochastic processes. This setting matters because much real-world data expresses multiple correlated functions: medical or weather data, for instance, consists of several correlated attributes describing a patient or a region. Moreover, existing neural process methods neither model a set of multiple functions jointly nor capture the correlations among them, which makes the extension of neural processes to multi-task learning a meaningful contribution.

2. Motivation

Stochastic processes for multi-task learning. A representative stochastic-process-based model for multi-task learning is the family of Multi-Output Gaussian Processes (MOGPs), which extend Gaussian processes to infer multiple tasks jointly and can exploit incomplete data. Their limitation is that accurate prediction requires many observations. More recent methods combine Gaussian processes with meta-learning, but they do not consider the multi-task setting. Conditional Neural Adaptive Processes (CNAPs) propose a general classification model over diverse sets of classes but, like NPs, can only perform independent inference for each task and cannot explicitly exploit cross-task correlations at inference time.

Hierarchical models in the neural process family. Attentive Neural Processes (ANPs) integrate an attention mechanism into the deterministic path so that each target example can gather additional context information, which improves performance and prevents underfitting. In a similar vein, other work extends the NP graphical model into a hierarchical structure by introducing local latent variables that incorporate example-specific stochasticity.

Idea

One challenge in the scenario of jointly learning a set of functions, together with the correlations among tasks, is that observations may be incomplete. For example, when multi-modal signals are collected from multiple sensors, the sensors may have asynchronous sampling rates; in other words, the functions may not share common sample locations. To maximize the utility of such incomplete data, the authors argue that an ideal model should be able to learn by relating multiple functions that can be observed at different inputs. Existing multi-output Gaussian process methods can infer multiple functions from incomplete observations in this way, but their complexity generally grows with the size of the data, so additional approximation methods are needed to reduce it. (Their performance also tends to depend heavily on whether a suitable kernel can be chosen.)

To address this, the authors propose Multi-Task Neural Processes (MTNPs), which can jointly model multiple tasks from incomplete data. First, they design a multi-task function space that handles incomplete data and supports joint inference over the functions, and define a latent variable model that theoretically induces a stochastic process on this unified function space. To exploit correlations between tasks, the latent variable model is structured hierarchically, consisting of 1) a global latent variable that captures information across all tasks, and 2) task-specific latent variables that capture information focused on each individual task. The proposed model also retains the advantages of existing neural processes: flexible adaptation, scalable inference, and uncertainty-aware prediction.

3. Method

A straightforward way to apply neural processes to multiple tasks is to assume independence between tasks and define independent NPs over the function spaces $(\mathcal{Y}^1)^\mathcal{X}, ..., (\mathcal{Y}^T)^\mathcal{X}$. The authors call this Single-Task Neural Processes (STNPs, Figure 1 (a)). Each of the independent latent variables $v^1, v^2, ..., v^T$ represents a task $f^t$.

$$p(Y_D^{1:T}|X_D, C)=\prod_{t=1}^{T} \int p(Y^t_D|X_D, v^t)\,p(v^t|C^t)\,dv^t.$$

By conditioning on the task-specific context data $C^t$, STNP can handle incomplete observations (contexts). However, it can only model the marginal distribution of each task, ignoring the complex correlations between tasks that exist in their joint distribution.
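As a concrete illustration, the factorized STNP predictive can be estimated per task with independent Monte Carlo samples. The sketch below is ours, not the paper's implementation; the likelihood and latent-sampling callables are stand-ins for a trained decoder and conditional prior:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_mean_exp(logs):
    """Numerically stable log of the mean of exponentials."""
    m = logs.max()
    return m + np.log(np.mean(np.exp(logs - m)))

def stnp_log_predictive(task_log_liks, task_latent_samplers, n_samples=100):
    """Monte Carlo estimate of the STNP log-predictive:
    log prod_t E_{v^t ~ p(v^t|C^t)} [ p(Y_D^t | X_D, v^t) ].
    Each task uses only its own latent; no information is shared."""
    total = 0.0
    for log_lik, sample_v in zip(task_log_liks, task_latent_samplers):
        logs = np.array([log_lik(sample_v(rng)) for _ in range(n_samples)])
        total += log_mean_exp(logs)  # per-task marginal, summed in log space
    return total
```

Because the sum decomposes over tasks, nothing constrains the tasks to agree, which is exactly the marginal-only limitation described above.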

An alternative is to combine the output spaces into the product space $\mathcal{Y}^{1:T} = \prod_{t\in\tau}\mathcal{Y}^t$ and define a single NP over the function space $(\mathcal{Y}^{1:T})^\mathcal{X}$. In this case a single latent variable $z$ jointly covers all $T$ tasks; the authors call this Joint-Task Neural Processes (JTNPs).

$$p(Y_D^{1:T}|X_D, C)= \int p(Y^{1:T}_D|X_D, z)\,p(z|C)\,dz.$$

Through the latent variable $z$, JTNP can capture correlation information across all tasks. The problem, however, is that it necessarily requires complete context and target observations during both training and inference.

Multi-Task Neural Processes

์œ„์—์„œ ์–ธ๊ธ‰๋œ ๋ฌธ์ œ (์™„์ „ํ•œ ๋ฐ์ดํ„ฐ๋ฅผ ํ•„์š”๋กœ ํ•˜๋Š”)๋ฅผ ๊ทน๋ณตํ•˜๊ธฐ ์œ„ํ•ด์„œ ์ €์ž๋“ค์€ ๊ธฐ์กด์˜ JTNP์˜ ํ˜•ํƒœ๋ฅผ ์žฌ๊ณต์‹ํ™” ํ•˜์—ฌ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ํ‘œํ˜„ํ•œ๋‹ค: $h: \mathcal{X} \times \mathcal{\tau} \rightarrow \bigcup_{t\in\tau}\mathcal{Y}^t$. ์ด๋Ÿฌํ•œ union form์„ ์‚ฌ์šฉํ•จ์œผ๋กœ์จ ์–ด๋–ค ๋ถ€๋ถ„์ ์ธ ์ถœ๋ ฅ ๊ฐ’์˜ set๋„ ${y_i^t}_{t\in\tau}$ ๋‹ค๋ฅธ ์ž…๋ ฅ ํฌ์ธํŠธ $(x_i, t),t\in\tau_i$์—์„œ ํƒ€๋‹นํ•œ ๊ฐ’์ด ๋˜๊ธฐ ๋•Œ๋ฌธ์— ๋ถˆ์ถฉ๋ถ„ํ•œ ๋ฐ์ดํ„ฐ๋ฅผ ํšจ๊ณผ์ ์œผ๋กœ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋œ๋‹ค.

As shown in Figure 1 (c), a hierarchical latent variable model is defined: a global latent variable $z$ uses the entire context $C$ to capture the stochastic factors shared across the tasks, while the task-specific stochastic factors are captured by task-specific latent variables $v^t$ using $C^t$ and $z$, as follows.

$$p(Y_D^{1:T}|X_D, C)= \int\!\!\int \Big[\prod_{t=1}^T p(Y^{t}_D|X_D^t, v^t)\,p(v^t|z, C^t)\Big]\,p(z|C)\,dv^{1:T}\,dz.$$

Here $v^{1:T}:= (v^1, ..., v^T)$, and conditional independence is assumed for $p(Y_D^t|X_D^t, v^t)$.

To summarize: by sharing $z$ across all of $v^{1:T}$, the model captures cross-task correlation information and can exploit it efficiently. In addition, the global latent variable $z$ allows the model to make full use of incomplete data, because 1) it is inferred from the entire context data $\bigcup_{t\in\tau}C^t$, and 2) each task-specific latent variable $v^t$ is inferred conditioned on $z$, so each function $f^t$ induced from $v^t$ can leverage not only its own task's context $C^t$ but also the observations from other tasks' contexts $C^{t'}$.
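The two-level sampling path can be sketched as follows. The pooling encoder and fixed weights below are toy stand-ins for the paper's learned networks, chosen only to show how a task with an empty context still receives an informative latent through $z$:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # latent dimensionality shared by z and each v^t

def toy_encoder(points, conditioning=None):
    """Permutation-invariant stand-in encoder: mean-pool the context set,
    optionally append a conditioning vector, and map to the mean and scale
    of a diagonal Gaussian with fixed (untrained) weights."""
    pooled = points.mean(axis=0) if len(points) else np.zeros(points.shape[1])
    h = pooled if conditioning is None else np.concatenate([pooled, conditioning])
    W = 0.1 * np.ones((DIM, h.size))   # stand-in for learned weights
    return W @ h, np.full(DIM, 0.5)

# contexts: task 0 has three (x, y) observations, task 1 has none at all
C = [rng.normal(size=(3, 2)), np.empty((0, 2))]

# the global latent z is inferred from the union of all task contexts
mu_z, sig_z = toy_encoder(np.concatenate([c for c in C if len(c)]))
z = mu_z + sig_z * rng.normal(size=DIM)

# task-specific latents condition on z and the task's own context, so even
# the fully unobserved task 1 still receives information through z
v = []
for C_t in C:
    mu_v, sig_v = toy_encoder(C_t, conditioning=z)
    v.append(mu_v + sig_v * rng.normal(size=DIM))
```

The key design choice is visible in the loop: $v^t$ is never sampled from $C^t$ alone but always jointly with $z$, which is how observations from other tasks reach $f^t$.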

For training and inference, the authors approximate the conditional priors and the generative model with an encoder $q_\phi$ and a decoder $p_\theta$. Since the predictive distribution above is intractable, training proceeds through a variational lower bound.

$$\log p_\theta(Y_D^{1:T}|X_D^{1:T}, C) \geq \mathbb{E}_{q_{\phi}(z|D)}\Big[\sum_{t=1}^T \mathbb{E}_{q_{\phi}(v^t|z,D^t)}\big[\log p_{\theta}(Y_D^t|X_D^t, v^t)\big] - D_{KL}\big(q_\phi(v^t|z, D^t)\,\|\,q_\phi(v^t|z, C^t)\big)\Big] - D_{KL}\big(q_\phi(z|D)\,\|\,q_\phi(z|C)\big)$$
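For diagonal Gaussian encoders both KL terms have the standard closed form, so a single-sample estimate of this bound is straightforward to compute. A minimal sketch under that diagonal-Gaussian assumption (the function names are ours, not the paper's):

```python
import numpy as np

def gaussian_kl(mu_q, sig_q, mu_p, sig_p):
    """KL( N(mu_q, diag sig_q^2) || N(mu_p, diag sig_p^2) ), summed over dims."""
    return float(np.sum(np.log(sig_p / sig_q)
                        + (sig_q**2 + (mu_q - mu_p)**2) / (2.0 * sig_p**2)
                        - 0.5))

def mtnp_elbo(recon_loglik, z_post, z_prior, v_posts, v_priors):
    """Single-sample estimate of the MTNP variational lower bound.

    recon_loglik     : sum_t log p(Y_D^t | X_D^t, v^t) under sampled latents
    z_post / z_prior : (mu, sigma) of q(z|D) and q(z|C)
    v_posts/v_priors : per-task (mu, sigma) of q(v^t|z,D^t) and q(v^t|z,C^t)
    """
    kl = gaussian_kl(*z_post, *z_prior)  # global-latent KL term
    kl += sum(gaussian_kl(*qp, *pp)      # per-task KL terms
              for qp, pp in zip(v_posts, v_priors))
    return recon_loglik - kl
```

As in other NP-family models, the KL terms regularize the target-conditioned posteriors toward the context-conditioned priors, so the bound is tight when contexts already explain the targets.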

๋…ผ๋ฌธ์—์„œ๋Š” ๊ธฐ์กด์˜ Attention Neural Process (ANP) ๋ชจ๋ธ ๊ตฌ์กฐ๋ฅผ ํ™œ์šฉํ•˜์—ฌ implementation์„ ์ง„ํ–‰ํ•˜์˜€๊ณ  ๋ชจ๋ธ์˜ ๊ตฌ์กฐ๋Š” ์œ„์˜ Figure 2์™€ ๊ฐ™๋‹ค.

4. Experiment & Result

Experimental setup

๋ฐ์ดํ„ฐ์…‹ ์ €์ž๋“ค์€ ์ด ์„ธ๊ฐœ์˜ ๋ฐ์ดํ„ฐ ์…‹ (synthetic & real-world ๋ฐ์ดํ„ฐ์…‹)์œผ๋กœ MTNP๋ฅผ ๊ฒ€์ฆํ•˜์˜€๊ณ  ๋ชจ๋“  ์‹คํ—˜์—์„œ context ๋ฐ์ดํ„ฐ๋Š” ๋ถˆ์ถฉ๋ถ„ํ•˜๊ฒŒ ๊ตฌ์„ฑํ•œ ํ›„ ์‹คํ—˜์„ ์ง„ํ–‰ํ•˜์˜€๋‹ค.

๋ฒ ์ด์Šค๋ผ์ธ ๋ชจ๋ธ๊ณผ ํ•™์Šต ํ™˜๊ฒฝ MTNP ๋ชจ๋ธ์˜ ๋น„๊ต๊ตฐ์œผ๋กœ ์ €์ž๋“ค์ด ๋ฐฉ๋ฒ•๋ก ์—์„œ ์–ธ๊ธ‰ํ•œ STNP์™€ JTNP ๋ชจ๋ธ์„ ANP๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ์„ค๊ณ„ํ•˜์—ฌ ๊ตฌ์„ฑํ•˜์˜€๋‹ค. JTNP ๋ชจ๋ธ์€ ๋ถˆ์™„์ •ํ•œ ๋ฐ์ดํ„ฐ๋ฅผ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์—†๊ธฐ์— missing label์€ STNP๋ฅผ ํ†ตํ•ด imputation์„ ์ง„ํ–‰ํ•˜์˜€๋‹ค. 1D regreesion task์—์„œ๋Š” ์ถ”๊ฐ€์ ์œผ๋กœ ๋‘ ๊ฐœ์˜ Multi-output Gaussian processes ๋ฒ ์ด์Šค ๋ผ์ธ ๋ชจ๋ธ (CSM, MOSM)๊ณผ ๋‘ ๊ฐœ์˜ ๋ฉ”ํƒ€ ํ•™์Šต ๋ฒ ์ด์Šค๋ผ์ธ ๋ชจ๋ธ (MAML, Reptile)๊ณผ ์„ฑ๋Šฅ์„ ๋น„๊ตํ•˜์˜€๋‹ค.

Evaluation metrics. Regression tasks are evaluated with mean squared error (MSE); for the image completion tasks, the error between pseudo-labels and predictions is measured with MSE and mIoU.

Result

์ด ์„ธ ๊ฐœ์˜ ๋ฐ์ดํ„ฐ ์…‹์œผ๋กœ ์ฃผ์š” ์‹คํ—˜๊ณผ ablation ์‹คํ—˜์„ ์ง„ํ–‰ํ•˜์˜€๊ณ  ๋Œ€ํ‘œ์ ์œผ๋กœ ๋‚ ์”จ ๋ฐ์ดํ„ฐ๋ฅผ ํ™œ์šฉํ•œ 1D ์‹œ๊ณ„์—ด regression ํƒœ์Šคํฌ ๊ฒฐ๊ณผ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ์„ค๋ช…์„ ์ง„ํ–‰ํ•˜๊ฒ ๋‹ค.

The dataset for this experiment consists of weather records collected over 258 days in 266 cities, with a total of 12 weather-related attributes (max temperature, min temperature, humidity, cloud cover, etc.). Table 2 in the figure above shows the quantitative results: the proposed MTNP model outperforms the baseline models in both accuracy and uncertainty estimation, demonstrating that it generalizes effectively on real data. Figure 4 further shows that MTNP performs effective knowledge transfer between tasks in the incomplete-data setting: in panel (a), when observations are few, uncertainty is high and the NLL is large, but as additional observations (Cloud) are gradually provided, knowledge transfer takes effect and prediction performance improves.

5. Conclusion

์ œ์‹œ๋œ ํ™•๋ฅ  ํ”„๋กœ์„ธ์Šค ๊ธฐ๋ฐ˜์˜ MTNP์€ ๋ถˆ์ถฉ๋ถ„ํ•œ ๋ฐ์ดํ„ฐ ํ™˜๊ฒฝ์—์„œ ๋‹ค์ค‘ ํ•จ์ˆ˜๋ฅผ ํšจ๊ณผ์ ์œผ๋กœ ์ถ”๋ก ํ•  ์ˆ˜ ์žˆ๊ฒŒ ๊ณ ์•ˆ๋˜์—ˆ๊ณ  ๋‹ค์–‘ํ•˜๊ฒŒ ๋””์ž์ธ๋œ ์‹คํ—˜์„ ํ†ตํ•ด ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์„ ์ž…์ฆํ•˜์˜€๋‹ค. Large scale ๋ฐ์ดํ„ฐ ์…‹ ํ™˜๊ฒฝ์—์„œ ์„ฑ๋Šฅ์„ ๊ฒ€์ฆ์ด ์ข‹์€ ์—ฐ๊ตฌ ๋ฐฉํ–ฅ์ด ๋  ๊ฒƒ์ด๋ผ ์ƒ๊ฐ๋˜๊ณ  ๊ด€์ฐฐ๋˜์ง€ ์•Š์€ ๊ณต๊ฐ„์— ๋Œ€ํ•ด ์ผ๋ฐ˜ํ™”๋ฅผ ์ง„ํ–‰ํ•˜๋Š” ๋ฐฉํ–ฅ๋„ ๋ชจ๋ธ์˜ ๋ฒ”์šฉ์„ฑ์„ ํ–ฅ์ƒ ์‹œํ‚ค๋Š”๋ฐ ๋„์›€์ด ๋  ๊ฒƒ์ด๋ผ ์ƒ๊ฐ๋œ๋‹ค.

Take home message

Neural Processes (NPs) are great.

์—ฐ๊ตฌ์ž๋‹˜๋“ค ์ˆ˜๊ณ ํ•˜์…จ์Šต๋‹ˆ๋‹ค.

Author / Reviewer information

Author

  • ํ—ˆ์ž์šฑ

  • School of Computing

  • jayheo@kaist.ac.kr

Reviewer

  1. Korean name (English name): Affiliation / Contact information

  2. Korean name (English name): Affiliation / Contact information

  3. ...

Reference & Additional materials

  1. Kim, Donggyun, et al. "Multi-Task Processes." arXiv preprint arXiv:2110.14953 (2021).

  2. Caruana, Rich. "Multitask Learning." Machine Learning 28.1 (1997): 41-75.

  3. Fortuin, Vincent, Heiko Strathmann, and Gunnar Rätsch. "Meta-Learning Mean Functions for Gaussian Processes." arXiv preprint arXiv:1901.08098 (2019).

  4. Bateni, Peyman, Raghav Goyal, Vaden Masrani, Frank Wood, and Leonid Sigal. "Improved Few-Shot Visual Classification." In CVPR, 2020.

  5. Finn, Chelsea, Pieter Abbeel, and Sergey Levine. "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks." In ICML, 2017.

  6. Garnelo, Marta, Dan Rosenbaum, Christopher Maddison, Tiago Ramalho, David Saxton, Murray Shanahan, Yee Whye Teh, Danilo Rezende, and SM Ali Eslami. "Conditional Neural Processes." In ICML, 2018.

  7. Itô, Kiyosi, et al. An Introduction to Probability Theory. Cambridge University Press, 1984.
