Separation of Hand Motion and Pose

Liu et al. / Decoupled Representation Learning for Skeleton-Based Gesture Recognition / CVPR 2020

1. Problem definition

์œ ์ €์˜ ์†์„ ์ •ํ™•ํ•˜๊ฒŒ ์ธ์‹ํ•˜๋ ค๋Š” ์—ฐ๊ตฌ๋Š” ์†์„ ์ด์šฉํ•œ ์กฐ์ž‘๋ฐฉ๋ฒ•์ด ์ฃผ๋Š” ํฐ ์ด์  ๋•Œ๋ฌธ์— ์ด์ „๋ถ€ํ„ฐ ๋งŽ์€ ์—ฐ๊ตฌ๊ฐ€ ์ด๋Ÿฌ์–ด์ ธ์™”๋‹ค. ๊ทธ ์ค‘์— ๋Œ€ํ‘œ์ ์ธ ๋‘ ๊ฐ€์ง€๊ฐ€ ์†์˜ ํฌ์ฆˆ(pose)๋ฅผ ์ธ์‹ํ•˜๋Š” hand pose recognition/estimation๊ณผ ์† ๋ชจ์–‘์˜ ์˜๋ฏธ๋ฅผ ์ธ์‹ํ•˜๋ ค ํ•˜๋Š” hand gesture recognition์ด๋‹ค.

hand pose estimation์€ ์†์˜ RGB ํ˜น์€ RGB-D ์ด๋ฏธ์ง€๋ฅผ ๋ฐ›์•„์„œ ๊ทธ feature๋ฅผ ๋ถ„์„ํ•ด ์†์˜ joint์ด ์–ด๋–ค ๋ชจ์–‘์„ ํ•˜๊ณ  ์žˆ๋Š”์ง€๋ฅผ ์•Œ์•„๋‚ด๊ณ ์ž ํ•˜๋Š” task ์ด๊ณ  ๋Œ€๋ถ€๋ถ„์ด ๋‹จ์ผํ•œ ์† ์ด๋ฏธ์ง€๋ฅผ ์ธํ’‹์œผ๋กœ ๋ฐ›๋Š”๋‹ค. ํ•˜์ง€๋งŒ, hand gesture recognition ๊ฐ™์€ ๊ฒฝ์šฐ, ๊ทธ ์ œ์Šค์ฒ˜๊ฐ€ ์ •์ง€ํ•ด์žˆ๋Š” ์ œ์Šค์ฒ˜ -์ˆซ์ž๋ฅผ ๋‚˜ํƒ€๋‚ด๋Š” ์† ๋ชจ์–‘ ๋“ฑ- ์ด ์•„๋‹Œ ์ด์ƒ ์—ฐ๊ตฌ์˜ ๊ด€์‹ฌ์‚ฌ๋Š” ์ œ์Šค์ฒ˜๊ฐ€ ์‹œ์ž‘ํ•ด์„œ ๋๋‚˜๊ธฐ๊นŒ์ง€์˜ ์ผ๋ จ์˜ ์† ๋™์ž‘์„ ๋ถ„์„ํ•˜๋Š” ๊ฒƒ์„ ๋ชฉ์ ์œผ๋กœ ํ•œ๋‹ค. ๊ทธ๋ ‡๊ธฐ์— ์ด๋Ÿฌํ•œ hand gesture recognition ๋ชจ๋ธ์€ ํ•˜๋‚˜์˜ ์ด๋ฏธ์ง€๊ฐ€ ์•„๋‹Œ ๋ณต์ˆ˜์˜ ์ด๋ฏธ์ง€ ์‹œํ€€์Šค๋ฅผ ์ธํ’‹์œผ๋กœ ๋ฐ›์•„์„œ ๊ทธ ์‹œํ€€์Šค๋“ค์ด ์–ด๋–ค ์˜๋ฏธ๋ฅผ ๋‚˜ํƒ€๋‚ด๋Š” ์ œ์Šค์ฒ˜์ธ์ง€๋ฅผ ์ถœ๋ ฅํ•ด๋‚ด์•ผ๋งŒ ํ•œ๋‹ค(Fig 1.).

Figure 1: Hand gesture recognition[2]

์ด ๋…ผ๋ฌธ์—์„œ๋„ ์‚ฌ์šฉ๋œ SHREC'17 Track ๋ฐ์ดํ„ฐ์…‹์„ ์˜ˆ๋กœ ๋“ค์ž๋ฉด, ํ•ด๋‹น ๋ฐ์ดํ„ฐ์…‹์€ 14๊ฐ€์ง€์˜ ์ œ์Šค์ฒ˜์— ๋Œ€ํ•œ ์† ๋ชจ์–‘ ์˜์ƒ๋“ค๋กœ ์ด๋ฃจ์–ด์ ธ์žˆ๋‹ค. ๊ทธ ์ค‘ n๋ฒˆ์งธ ์ œ์Šค์ฒ˜ Gn={Stnโˆฃt=1,2,...,Tn}G_n = \{S^n_t|t=1,2,..., T_n\} (StnS^n_t: t ๋ฒˆ์งธ ์‹œํ€€์Šค์˜ ์† joint๋“ค์˜ ์œ„์น˜, TnT_n: ์‹œํ€€์Šค ๊ธธ์ด) ๋ฅผ ๋ชจ๋ธ์— ์ž…๋ ฅํ–ˆ์„ ๋•Œ, ๋ฐ์ดํ„ฐ์…‹ ๋‚ด์˜ ์ œ์Šค์ฒ˜์˜ ์ธ๋ฑ์Šค์ธ yโˆˆ{1,2,..,14}y \in \{1, 2, .., 14\} ๊ฐ€ ์ถœ๋ ฅ๋œ๋‹ค. ์ด๋ ‡๋“ฏ gesture recognition network๋Š” ์†์— ๋Œ€ํ•œ feature(pose, depth, optical flow ๋“ฑ)๋“ค์˜ ์‹œํ€€์Šค๋ฅผ ๋ฐ›์•„ ์ œ์Šค์ฒ˜๋ฅผ ํŠน์ •ํ•˜๋Š” ๋„คํŠธ์›Œํฌ๋‹ค.

2. Motivation

์ด์ „์—๋„ ๋”ฅ๋Ÿฌ๋‹์„ gesture์™€ action recognition์„ ์œ„ํ•ด ์‚ฌ์šฉํ•˜๋Š” ์—ฐ๊ตฌ๋Š” ๋งŽ์ด ์žˆ์—ˆ๋‹ค. CNN(Convolutional Neural Network)[3], RNN(Recurrent Neural Network)[4], LSTM(Long Shorth-term Memory)[5] ๊ทธ๋ฆฌ๊ณ  attention mechanism[6] ์ด๋‚˜ mannifold learning[7], GCN(Graph Convolutional Networks)[8] ๋˜ํ•œ ์ œ์Šค์ฒ˜ ์ธ์‹ ์—ฐ๊ตฌ๋ฅผ ์œ„ํ•ด ์ด์šฉ๋˜์–ด์™”๋‹ค. ํ•˜์ง€๋งŒ, ์œ„์˜ ๋ฐฉ๋ฒ•๋“ค์„ ์ด์šฉํ•œ ์—ฐ๊ตฌ๋“ค์€ ์† joint์˜ ์‹œํ€€์Šค๋“ค์„ ๊ณ ์ •๋œ ๊ตฌ์กฐ๋กœ ์ด์šฉํ•˜๋ฉฐ, ๊ฐ ๊ด€์ ˆ์ด ์„œ๋กœ ์—ฐ๊ฒฐ๋˜์–ด ์žˆ๊ณ  ํ•œ ๊ด€์ ˆ์˜ ์›€์ง์ž„์ด ๋‹ค๋ฅธ ๊ด€์ ˆ์˜ ์œ„์น˜์—๋„ ์˜ํ–ฅ์„ ๋ผ์นœ๋‹ค๋Š” ์ ์„ ๊ณ ๋ คํ•˜์ง€ ์•Š์•˜๋‹ค. ์ฆ‰, ๊ฐ ์‹œํ€€์Šค์˜ joint๋“ค์˜ ์œ„์น˜๋ฅผ ๊ทธ์ € ํ•˜๋‚˜์˜ ํ†ต์งธ ์ด๋ฏธ์ง€๋กœ์„œ ํ›ˆ๋ จ์„ ํ–ˆ๊ณ  ๊ทธ๋ ‡๊ธฐ์— ์ธ์ ‘ํ•ด์„œ ์„œ๋กœ ์˜ํ–ฅ์„ ์ฃผ๋Š” joint๋“ค์˜ local feature๋ฅผ ํฌ์ฐฉํ•ด๋‚ด์ง€ ๋ชป ํ–ˆ๋‹ค๋Š” ๊ฒƒ์ด๋‹ค. ์ด ๋…ผ๋ฌธ์—์„œ๋Š” ์ด๋Ÿฌํ•œ ์ ์„ ๊ทน๋ณตํ•˜๊ณ ์ž ์†์˜ joint์„ spatial and temporal volume modeling๋ฅผ ์ด์šฉํ–ˆ๋‹ค. spatial and temporal volume modeling์€ method ๋ถ€๋ถ„์—์„œ๋„ ๋‚˜์˜ค๊ฒ ์ง€๋งŒ ๋‹จ์ˆœํ•˜๊ฒŒ ๋ชจ๋“  ์‹œํ€€์Šค์˜ ๊ฐ joint์˜ ์œ„์น˜๋ฅผ ํ•˜๋‚˜์˜ 3D tensor๋กœ ๋งŒ๋“ ๋‹ค๋Š” ๊ฒƒ์„ ์˜๋ฏธํ•œ๋‹ค.

๊ทธ๋ฆฌ๊ณ  ์ด์ „์˜ ๋”ฅ๋Ÿฌ๋‹์„ ์ด์šฉํ•œ ์ œ์Šค์ฒ˜ ์ธ์‹๊ฐ™์€ ๊ฒฝ์šฐ๋Š” ์† ๋ชจ์–‘์˜ ๋ณ€ํ™”์™€ ์†์˜ ์›€์ง์ž„์„ ์ „๋ถ€ ํ•˜๋‚˜์˜ ๋„คํŠธ์›Œํฌ์—์„œ ํ•™์Šต์„ ์ง„ํ–‰์„ ํ–ˆ๋‹ค. ํ•˜์ง€๋งŒ, ์† ๋ชจ์–‘์˜ ๋ณ€ํ™”๋Š” ๊ฐ ์†๊ฐ€๋ฝ joint๋“ค์˜ ์œ„์น˜ ๋ณ€ํ™”์— ๋Œ€ํ•ด ํ•™์Šต์ด ์ด๋ฃจ์–ด์ ธ์•ผํ•˜๋ฉฐ, ์† ์ž์ฒด์˜ ์›€์ง์ž„์€ ์†๊ฐ€๋ฝ๊ณผ๋Š” ํฌ๊ฒŒ ๊ด€๊ณ„์—†์ด ํ•œ ๋ฉ์–ด๋ฆฌ๋กœ์„œ์˜ ์† ๊ทธ ์ž์ฒด์˜ ์œ„์น˜๊ฐ€ ์–ด๋–ป๊ฒŒ ๋ณ€ํ™”ํ•˜์˜€๋Š”์ง€์— ๋Œ€ํ•œ ํ•™์Šต์ด ์ด๋ฃจ์–ด์ ธ์•ผํ•œ๋‹ค. ์ด๋ ‡๋“ฏ ์† ๋ชจ์–‘์˜ ๋ณ€ํ™”(hand posture variations)์™€ ์†์˜ ์›€์ง์ž„(hand movements)๋ผ๋Š” ์ด ๋‘ ๊ฐ€์ง€ feature๋Š” ๋ชจ๋‘ ์ œ์Šค์ฒ˜ ์ธ์‹์„ ์œ„ํ•ด ์‚ฌ์šฉ๋˜์ง€๋งŒ, ์† ๋ชจ์–‘์˜ ๋ณ€ํ™”๋Š” ์† ๊ด€์ ˆ๋“ค์˜ localํ•œ ์ •๋ณด๋ฅผ ์ด์šฉํ•ด์•ผํ•˜๊ณ , ์†์˜ ์›€์ง์ž„์€ globalํ•œ ์ •๋ณด๋งŒ์„ ํ•„์š”๋กœ ํ•˜๊ธฐ์— ๊ทธ ์„ฑ์งˆ์ด ํฌ๊ฒŒ ๋‹ค๋ฅด๋‹ค. ์ด๊ฒƒ๋“ค์„ ํ•œ ๋„คํŠธ์›Œํฌ์—์„œ ํ›ˆ๋ จํ•˜๋Š” ๊ฒƒ์€ ๋น„ํšจ์œจ์ ์ผ ์ˆ˜ ์žˆ๋‹ค. ๊ทธ๋ ‡๊ธฐ์— ์ด ๋…ผ๋ฌธ์—์„œ๋Š” ์ด๋Ÿฌํ•œ ๋‘ feature์— ๋Œ€ํ•ด ๋”ฐ๋กœ ํ•™์Šต์„ ์ง„ํ–‰ํ•œ ํ›„์— ๊ฐ๊ฐ์˜ prediction ๊ฒฐ๊ณผ๋ฅผ ํ‰๊ท ํ•˜์—ฌ ์ตœ์ข… prediction ๊ฒฐ๊ณผ๋ฅผ ์–ป์œผ๋ คํ•œ๋‹ค(Fig 2.).

Figure 2: Decoupling of hand posture and movement

A two-stream network for action recognition was also used in [9], but the authors point out a key difference: [9] used shape evolution maps and motion evolution maps, 2D mappings of shape and body motion, whereas this work represents hand posture variations and hand movements as 3D volumes. In addition, the body actions targeted by [9] differ substantially in nature from the hand gestures targeted here: compared to the full body, the hand has a more complex structure, occlusion occurs more frequently, and the impact of that occlusion is larger. These differences, the authors say, motivated this work.

3. Method

Figure 3: Overall model architecture

์ด ๋ชจ๋ธ์€ ๋จผ์ € ์†์˜ joint ์ •๋ณด(hand skeleton data)๋ฅผ ๊ฐ๊ฐ hand posture variation๊ณผ hand movements๋กœ ๋‚˜๋ˆ„์–ด์„œ ํ•™์Šต์„ ํ•œ๋‹ค.

Hand posture variation์˜ ๊ฒฝ์šฐ, ๋ชจ๋“  ์‹œํ€€์Šค์˜ joint ๋ฐ์ดํ„ฐ๋กœ ํ•˜๋‚˜์˜ tensor์ธ HPEV(hand posture evolution volume)๋ฅผ ์ƒ์„ฑํ•œ ํ›„, ์ด HPEV๋ฅผ 3D CNN์„ ๋ฒ ์ด์Šค๋กœ ํ•œ HPEV-Net์—์„œ ํ•™์Šต์‹œํ‚จ๋‹ค. ์ถ”๊ฐ€๋กœ, ์„ฌ์„ธํ•œ ์†๊ฐ€๋ฝ์˜ ์›€์ง์ž„๋„ ์ธ์‹ํ•˜๊ธฐ ์œ„ํ•ด ์—„์ง€ ์†๊ฐ€๋ฝ์„ ๊ธฐ์ค€์œผ๋กœ ํ•œ ๊ฐ ์†๊ฐ€๋ฝ์˜ ์ƒ๋Œ€์  ์œ„์น˜์ธ FRPV(finger relative position vector) ๋˜ํ•œ HPEV-Net์—์„œ ์ถœ๋ ฅ๋œ feature vector์— ์ถ”๊ฐ€ํ•ด์ค€๋‹ค.

Hand movements are mapped into an HMM (hand movement map) and learned by the CNN-based HMM-Net. The feature vector from each network passes through a fully connected layer and softmax to produce a prediction, and the two predictions are combined into the final gesture prediction.

Hand Posture Volume

์ œ์Šค์ฒ˜์˜ feature๋ฅผ network๋ฅผ ํ†ตํ•ด ํ•™์Šตํ•˜๊ธฐ ์œ„ํ•ด์„  ํ•ด๋‹น ์ •๋ณด๋“ค์„ ๋ฐ์ดํ„ฐํ™” ์‹œํ‚ฌ ํ•„์š”๊ฐ€ ์žˆ๋‹ค. ์ด ํŒŒํŠธ์—์„œ๋Š”๋Š” ์ œ์Šค์ฒ˜์˜ ํ•œ ์‹œํ€€์Šค์— ํ•ด๋‹นํ•˜๋Š” ์† ๊ด€์ ˆ๋“ค์˜ ์œ„์น˜์ •๋ณด๋ฅผ 3D tensor๋กœ ๋งŒ๋“œ๋Š” ๊ณผ์ •์„ ๋ณด์—ฌ์ฃผ๊ณ  ์žˆ๋‹ค.

Let the set of frames of the n-th gesture be $G_n = \{S^n_t \mid t = 1, 2, ..., T_n\}$ ($T_n$: sequence length). The 3D hand joint positions of the t-th frame can then be written as $S^n_t = \{\mathbf{p}^n_{i,t} \mid \mathbf{p}^n_{i,t} = (x^n_{i,t}, y^n_{i,t}, z^n_{i,t}),\ i = 1, 2, ..., J\}$ ($J$: number of joints, $\mathbf{p}^n_{i,t}$: 3D position of hand joint i in frame t).

Because the sequence length $T_n$ differs per gesture, the input size must be unified. Let this input size be $T$: when $T_n > T$, it suffices to select $T$ frames at uniform intervals; when $T_n < T$, some frames are repeated until the sequence length reaches $T$. Here $T$ is a hyperparameter; the authors ran their experiments with a default of 60. After this sampling, the gesture $G_n$ becomes $G^T_n$ of length $T$:

G^T_n = \{S^n_{\tau} \mid \tau = \lceil \frac{T_n}{T} \times t \rceil,\ t = 1, 2, ..., T \}.
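The sampling rule above can be sketched in a few lines of numpy; the `(T_n, J, 3)` array layout is my own assumption, not from the paper:

```python
import numpy as np

def sample_sequence(seq, T=60):
    """Uniformly resample a gesture sequence to a fixed length T.

    seq: array of shape (T_n, J, 3), per-frame 3D joint positions.
    Frames are picked at tau = ceil(T_n / T * t) for t = 1..T, which
    subsamples when T_n > T and repeats frames when T_n < T.
    """
    T_n = seq.shape[0]
    # ceil(T_n / T * t) yields 1-based frame indices; shift to 0-based.
    idx = np.ceil(T_n / T * np.arange(1, T + 1)).astype(int) - 1
    return seq[idx]
```

Note that the same formula covers both cases: for $T_n < T$ consecutive values of $t$ can map to the same $\tau$, which is exactly the frame repetition described above.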

๊ฐ ๊ด€์ ˆ์˜ 3d ์ขŒํ‘œ๋ฅผ volume์œผ๋กœ ๋‚˜ํƒ€๋‚ด๊ธฐ ์ „์— ๋จผ์ € ๊ฐ ๊ด€์ ˆ์˜ 3d ์ขŒํ‘œ ๊ฐ’์„ [โˆ’1,1][-1, 1]์— normalize ํ•ด์•ผํ•œ๋‹ค. normalize๋ฅผ ์œ„ํ•ด์„  ์†์˜ maximum bounding box๊ฐ€ ํ•„์š”ํ•˜๋‹ค. ํ•œ ํŠน์ • ํ”„๋ ˆ์ž„ t์˜ ํŠน์ • joint i์˜ bounding box๋Š” ์ดํ•˜์™€ ๊ฐ™์€ ๋ฐฉ๋ฒ•์œผ๋กœ ๊ตฌํ•  ์ˆ˜ ์žˆ๋‹ค. bounding box์˜ ๊ฐ ๋ณ€์˜ ๊ธธ์ด๋ฅผ ฮ”xtn,ฮ”ytn,ฮ”ztn\Delta x^n_t, \Delta y^n_t, \Delta z^n_t๋ผ๊ณ  ํ–ˆ์„ ๋•Œ,

\begin{cases} \Delta x^n_t = \max(x^n_{i,t}) - \min(x^n_{i,t}) \\ \Delta y^n_t = \max(y^n_{i,t}) - \min(y^n_{i,t}) \quad \quad i = 1, 2, ..., J \\ \Delta z^n_t = \max(z^n_{i,t}) - \min(z^n_{i,t}) \end{cases}

๊ฐ€ ๋œ๋‹ค. ๋˜ํ•œ, ์ด๋Ÿฌํ•œ bounding box์˜ ์ตœ๋Œ€ ๊ธธ์ด ฮ”xmax,ฮ”ymax,ฮ”zmax\Delta x_{max}, \Delta y_{max}, \Delta z_{max}๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๊ตฌํ•  ์ˆ˜ ์žˆ๋‹ค.

\begin{cases} \Delta x_{max} = \max(\Delta x^n_t), \\ \Delta y_{max} = \max(\Delta y^n_t), \quad\quad t = 1, 2, ..., T,\ n = 1, 2, ..., N \\ \Delta z_{max} = \max(\Delta z^n_t) \end{cases}

์ด๋Ÿฌํ•œ ๊ฐ’์„ ์ด์šฉํ•ด์„œ joint์˜ ์œ„์น˜๋ฅผ normalize๋ฅผ ํ•˜๊ฒŒ ๋˜๋ฉด ์ดํ•˜์™€ ๊ฐ™์ด ๋œ๋‹ค. xnormx_{norm} ์ด normalizeํ•œ joint์˜ x ๊ฐ’, xmin,xmaxx_{min}, x_{max}๊ฐ€ ๊ฐ๊ฐ ํ•ด๋‹น joint์˜ ์ตœ์†Œ/์ตœ๋Œ€ x ๊ฐ’์ด๋‹ค.

x_{norm} = \frac{x - \frac{x_{min} + x_{max}}{2}}{\Delta x_{max}} \times 2.

$y_{norm}, z_{norm}$ are obtained for $y, z$ in the same way. Normalizing the joint positions in this way centers the hand at $(0, 0, 0)$.
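A minimal numpy sketch of this normalization, under the assumption that $x_{min}, x_{max}$ are taken per frame over that frame's joints and that the dataset-wide maximum edge lengths are already known:

```python
import numpy as np

def normalize_joints(joints, d_max):
    """Center each frame's joints and scale coordinates toward [-1, 1].

    joints: (T, J, 3) array of raw 3D joint positions.
    d_max:  (3,) array of the maximum bounding-box edge lengths
            (Delta x_max, Delta y_max, Delta z_max) over the dataset.
    """
    lo = joints.min(axis=1, keepdims=True)   # per-frame min over joints
    hi = joints.max(axis=1, keepdims=True)   # per-frame max over joints
    center = (lo + hi) / 2                   # (x_min + x_max) / 2, etc.
    return (joints - center) / d_max * 2
```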

์ด ๋‹ค์Œ ์†์˜ ๊ด€์ ˆ์„ Rร—Rร—RR \times R \times R์˜ cube volume์œผ๋กœ ํ•˜๊ธฐ ์œ„ํ•ด์„œ xnormx_{norm}์„ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์‹์„ ํ†ตํ•ด xvโˆˆ{1,2,...,R}x_v \in \{1,2,...,R\} ๋กœ ๋ณ€ํ™˜์‹œํ‚ฌ ์ˆ˜ ์žˆ๋‹ค.

x_v = round\left((x_{norm} + 1) \times \frac{R}{2}\right).

Given $x_v$ (and likewise $y_v, z_v$), each voxel of the $(R \times R \times R)$ volume is assigned 1 if a joint falls into it and 0 otherwise, representing the joint positions in volume form. For example, for $x_v = 3, y_v = 3, z_v = 3$, the corresponding voxel is set to 1. Collecting all volumized joints, a gesture can be written in volume coordinates as $G_v = \{S_{v,t} \mid t = 1, 2, ..., T\}$. To use such a volume as input, it is converted into an $(R, R, R)$ tensor by:

V(i, j, k) = \begin{cases} 1, & \text{if } (i, j, k) \in S_{v,t} \\ 0, & \text{otherwise.} \end{cases}
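Quantization and binary-volume assignment for one frame can be sketched as follows (using 0-based numpy indices rather than the paper's 1-based $\{1, ..., R\}$, and clipping boundary values, both my own choices):

```python
import numpy as np

def voxelize_frame(joints_norm, R=32):
    """Map normalized joints of one frame into an R x R x R binary volume.

    joints_norm: (J, 3) array with coordinates in [-1, 1].
    A voxel is set to 1 if any joint falls into it.
    """
    V = np.zeros((R, R, R), dtype=np.uint8)
    # x_v = round((x_norm + 1) * R / 2), clipped into the valid index range.
    idx = np.rint((joints_norm + 1) * R / 2).astype(int)
    idx = np.clip(idx, 0, R - 1)
    V[idx[:, 0], idx[:, 1], idx[:, 2]] = 1
    return V
```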

Hand Posture Evolution Volume(HPEV)

Volumizing the joint positions packs the spatial information into a single tensor. This section explains how the tensors of all frames are concatenated at a fixed interval, combining the spatial information (joint positions) and the temporal information (sequence progression) into one tensor.

V_{HPEV}(i + \theta(t-1), j, k) = \begin{cases} 1, & \text{if } (i, j, k) \in S_{v,t} \\ 0, & \text{otherwise.} \end{cases}

์œ„ ์‹์—์„œ ์•Œ ์ˆ˜ ์žˆ๋“ฏ์ด ๋ชจ๋“  ์‹œํ€€์Šค์˜ ์† ๊ด€์ ˆ ์œ„์น˜์ •๋ณด๋ฅผ ๋‚˜ํƒ€๋‚ด๋Š”(R+ฮธ(Tโˆ’1),R,R)(R+\theta(T-1), R, R) tensor์ธ VHPEVV_{HPEV}๋Š” ์œ„์˜ ์„น์…˜์—์„œ ๊ตฌํ•œ V๋ฅผ x ์ถ•์„ ฮธ\theta๋งŒํผ ๊ฐ„๊ฒฉ์„ ๋‘๊ณ  ํ•ฉ์น˜๋Š” ๊ฒƒ์œผ๋กœ ๊ตฌํ•  ์ˆ˜ ์žˆ๋‹ค. ์ด๋Ÿฌํ•œ ๊ณผ์ •์„ ๊ฑฐ์น˜๊ฒŒ ๋˜๋ฉด ํ•˜๋‚˜์˜ ์ œ์Šค์ฒ˜์— ๋Œ€ํ•œ ๋ชจ๋“  ์‹œํ€€์Šค๋“ค์€ ํ•˜๋‚˜์˜ tensor๋กœ ํ•ฉ์ณ์ง€๊ฒŒ ๋˜๊ณ  ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๊ทธ๋ฆผ ์ฒ˜๋Ÿผ ๋  ๊ฒƒ์ด๋‹ค.

Figure 4: HPEV

Fingertip Relative Position Vector(FRPV)

Turning the joint positions into a single volume makes the spatial information easy to represent, but because the volume's resolution is limited, it cannot capture fine finger motions. The paper therefore uses the FRPV, the relative positions of the fingertips, as an auxiliary input. The position vector of each fingertip relative to the thumb tip at frame $t$ is obtained as follows.

\mathbf{v}_t = (\mathbf{p}_{I, t}, \mathbf{p}_{M, t}, \mathbf{p}_{R, t}, \mathbf{p}_{L, t}) - (\mathbf{p}_{0, t}, \mathbf{p}_{0, t}, \mathbf{p}_{0, t}, \mathbf{p}_{0, t})

$\mathbf{p}_{0, t}$: thumb tip position vector at frame t; $\mathbf{p}_{I, t}, \mathbf{p}_{M, t}, \mathbf{p}_{R, t}, \mathbf{p}_{L, t}$: position vectors of the index, middle, ring, and little fingertips at frame t.

Finally, $\mathbf{V}_{FRPV}$ is obtained by concatenating the vectors of all frames:

\mathbf{V}_{FRPV} = (\mathbf{v}_1, \mathbf{v}_2, ..., \mathbf{v}_t, ..., \mathbf{v}_T)
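The two FRPV equations above amount to one subtraction and one flatten; the `(T, 5, 3)` fingertip array and its (thumb, index, middle, ring, little) ordering are assumed conventions, not specified by the paper:

```python
import numpy as np

def build_frpv(tips):
    """Concatenate per-frame fingertip positions relative to the thumb tip.

    tips: (T, 5, 3) array of fingertip positions, ordered
          (thumb, index, middle, ring, little).
    Returns a flat vector of length T * 4 * 3.
    """
    rel = tips[:, 1:] - tips[:, :1]   # subtract the thumb tip from the rest
    return rel.reshape(-1)
```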

Hand Movement Map(HMM)

๋‹ค์Œ์€ ์†์˜ ์›€์ง์ž„์„ ๋‚˜ํƒ€๋‚ด๋Š” HMM์„ ์ƒ์„ฑํ•ด๋‚ด๋Š” ๋ฐฉ๋ฒ•์ด๋‹ค. ์†์˜ ์›€์ง์ž„์€ ์† ์ค‘์‹ฌ์˜ ์›€์ง์ž„๊ณผ ๊ฐ ์†๊ฐ€๋ฝ ๋์˜ ์›€์ง์ž„์œผ๋กœ ๋‚˜ํƒ€๋‚ผ ์ˆ˜ ์žˆ์„ ๊ฒƒ์ด๋‹ค.

A gesture $G$ can be written as $G = \{S_t \mid t = 1, 2, .., T\}$, and, as with the HPEV, the hand joints as $S_t = \{\mathbf{p}_{i,t} \mid \mathbf{p}_{i,t} = (x_{i,t}, y_{i,t}, z_{i,t}),\ i = 1, 2, ..., J\}$. The center of all joints is then

C_t = \frac{1}{J} \sum_{i=1}^{J} \mathbf{p}_{i,t}

์ด ๋  ๊ฒƒ์ด๊ณ , ์ด ์ค‘์‹ฌ์˜ ์›€์ง์ž„์€ ๋งจ ์ฒ˜์Œ ์‹œํ€€์Šค์˜ ์œ„์น˜์™€ ํ˜„์žฌ ์‹œํ€€์Šค์˜ ์œ„์น˜์˜ ์ฐจ์ด๋กœ ๊ตฌํ•  ์ˆ˜ ์žˆ๋‹ค.

M_H = \{C_t - C_1 \mid t = 1, 2, ..., T \}

Likewise, the fingertip movements are obtained in the same way:

M_{F,j} = \{\mathbf{p}_{j,t} - \mathbf{p}_{j,1} \mid t = 1, 2, ..., T \}

j๋Š” J๊ฐœ์˜ ์ „์ฒด ๊ด€์ ˆ ์ค‘์—์„œ 5๊ฐœ์˜ ์† ๋ ๊ด€์ ˆ์˜ ์ธ๋ฑ์Šค๋ฅผ ๋‚˜ํƒ€๋‚ธ๋‹ค. ์ด๋ ‡๊ฒŒ ๊ตฌํ•œ MH,MF,jM_H, M_{F,j} ๋ฅผ ํ–‰์œผ๋กœ ํ”„๋ ˆ์ž„์„ ์—ด๋กœ ๋งตํ•‘ํ•˜๊ฒŒ ๋˜๋ฉด ๊ฐ xyz ์„ธ ๊ฐœ์˜ ์œ„์น˜ ์ •๋ณด๋ฅผ ์ฑ„๋„๋กœ ํ•˜๋Š” Hand Movement Map์„ ์ƒ์„ฑํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋œ๋‹ค.

HPEV-Net and HMM-Net

Figure 5: Network details

The HPEV (Hand Posture Evolution Volume) and HMM (Hand Movement Map) obtained so far are fed to HPEV-Net and HMM-Net respectively to extract features, which are then used for the final gesture prediction.

HPEV-Net

  1. ๋งจ ์ฒ˜์Œ์—” ์ปค๋„ ์‚ฌ์ด์ฆˆ๊ฐ€ 7x3x3์ธ 3D convolution layer๋ฅผ ํ†ตํ•ด low-level features๋ฅผ ์ถ”์ถœ

  2. high-level feature๋ฅผ ํ•™์Šตํ•˜๊ธฐ ์œ„ํ•ด ์—ฌ๋Ÿฌ๊ฐœ์˜ bottleneck module์„ ์‚ฌ์šฉ

  3. ๊ฐ bottleneck modul์˜ output channel์€ 128, 128, 256 ๊ทธ๋ฆฌ๊ณ  512

  4. output features์˜ ํฌ๊ธฐ๋ฅผ ์ค„์ด๊ธฐ ์œ„ํ•œ 4x2x2 max pooling layer๋Š” ๋งจ ์ฒ˜์Œ convolution layer ์™€ ์ค‘๊ฐ„์˜ ๋‘ ๊ฐœ์˜ bottleneck modul์—์„œ๋งŒ ์‚ฌ์šฉ

  5. ์ฒ˜์Œ์˜ 3D convolution layer ์ดํ›„์— Batch Normalization ๊ณผ ReLu๊ฐ€ ์‚ฌ์šฉ

  6. ๋งˆ์ง€๋ง‰ bottleneck module ์ดํ›„์˜ output features๋Š” global average pooling์„ ์ด์šฉํ•ด ์ตœ์ข… feature vector๊ฐ€ ์ถœ๋ ฅ

HMM-Net

  1. An HCN (Hierarchical Co-occurrence Network) [10] module extracts features

  2. As in HPEV-Net, four bottleneck modules learn high-level features

  3. Global average pooling produces the feature vector

  4. A fully connected layer and softmax produce the gesture classification result

Finally, the classification result obtained from the feature vector of HPEV-Net concatenated with the FRPV (passed through a fully connected layer) and the classification result obtained from HMM-Net's feature vector are averaged to produce the final gesture classification.
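The fusion step is just an average of the two streams' softmax outputs; a minimal sketch, with hypothetical per-class logits as inputs:

```python
import numpy as np

def fuse_predictions(logits_hpev, logits_hmm):
    """Average the softmax outputs of the two streams and pick a class.

    logits_hpev / logits_hmm: (C,) class scores from each network.
    Returns the fused probability vector and the predicted class index.
    """
    def softmax(z):
        e = np.exp(z - z.max())   # shift for numerical stability
        return e / e.sum()
    probs = (softmax(logits_hpev) + softmax(logits_hmm)) / 2
    return probs, int(np.argmax(probs))
```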

Figure 6: Input-output example

4. Experiment & Result

Experimental setup

Dataset

Training setup

  • Optimizer: Adam

  • Loss function: Cross-entropy

  • Batch size for training: 40

  • Initial learning rate: 3e-4

  • Learning rate decay: 1/10 once learning stagnates

  • Final learning rate: 3e-8

  • Hyperparameters: $T = 60, \theta = 3, R = 32$
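As a quick sanity check (my own arithmetic, not stated in the paper), these defaults fix the HPEV input tensor size via the $(R + \theta(T-1), R, R)$ formula:

```python
# HPEV input size implied by the default hyperparameters
T, theta, R = 60, 3, 32
hpev_shape = (R + theta * (T - 1), R, R)
print(hpev_shape)  # (209, 32, 32)
```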

Result

Different input combinations

Figure 7: Ablation study on each dataset

SHREC'17 Track ๋ฐ์ดํ„ฐ์…‹๊ณผ FPHA ๋ฐ์ดํ„ฐ์…‹์˜ ๊ฒฐ๊ณผ์ด๋‹ค. ์†์˜ ์›€์ง์ž„์— ๋Œ€ํ•œ ์ธํ’‹์ธ HMM๋งŒ์„ ์ธํ’‹์œผ๋กœ ํ–ˆ์„๋•Œ SHREC'17 ๋ฐ์ดํ„ฐ์…‹์—์„œ๋งŒ HPEV๋งŒ์„ ์‚ฌ์šฉํ–ˆ์„๋•Œ๋ณด๋‹ค ์„ฑ๋Šฅ์ด ์˜ฌ๋ผ๊ฐ€๊ณ  FPHA ๋ฐ์ดํ„ฐ์…‹์—๋Š” ์˜คํžˆ๋ ค ์„ฑ๋Šฅ์ด ์ค„์–ด๋“ค์—ˆ๋‹ค. SHREC'17 ๋ฐ์ดํ„ฐ์…‹์ด FPHA ๋ฐ์ดํ„ฐ์…‹๋ณด๋‹ค ์† ์›€์ง์ž„์ด ๋งŽ์€ ์ œ์Šค์ฒ˜๊ฐ€ ๋งŽ์•„์„œ ๊ทธ๋Ÿฐ ๊ฒƒ์œผ๋กœ ๋ณด์ธ๋‹ค๊ณ  ํ•œ๋‹ค. ๊ทธ๋ฆฌ๊ณ  FPHA ๋ฐ์ดํ„ฐ์…‹์—์„œ FRPV ์ธํ’‹์„ ์‚ฌ์šฉํ•˜์ž ์„ฑ๋Šฅ์ด 8% ๋‚˜ ์ฆ๊ฐ€ํ–ˆ๋Š”๋ฐ ์ด๊ฒƒ์€ FPHA๊ฐ€ ์„ฌ์„ธํ•œ ์†๊ฐ€๋ฝ ์›€์ง์ด ํฌํ•จ๋œ ์ œ์Šค์ฒ˜๊ฐ€ ๋งŽ๊ธฐ ๋•Œ๋ฌธ์ด๋ผ๊ณ  ํ•œ๋‹ค.

Comparison with the state-of-the-art

Figure 8: Comparison on the SHREC'17 Track dataset
Figure 9: Comparison on the DHG-14/28 dataset
Figure 10: Comparison on the FPHA dataset

FPHA ๋ฐ์ดํ„ฐ์…‹ ๊ฒฐ๊ณผ์—์„œ ST-TS-HGR-NET ์˜ ๊ฒฐ๊ณผ๊ฐ€ ์ด ๋…ผ๋ฌธ์˜ ๊ฒฐ๊ณผ๋ณด๋‹ค ๋” ์ข‹์€ ์„ฑ๋Šฅ์„ ๋‚˜ํƒ€๋ƒˆ๋‹ค. ์ €์ž๋Š” FPHA ๋ฐ์ดํ„ฐ์…‹์˜ ํฌ๊ธฐ๊ฐ€ ์ž‘๊ธฐ ๋•Œ๋ฌธ์— _ST-TS-HGR-NET_์˜ ๊ฒฐ๊ณผ๊ฐ€ ๋” ์ข‹๊ฒŒ ๋‚˜์˜จ ๊ฒƒ์ด๊ณ , SHREC'17 Track, DHG-14/28 ๋ฐ์ดํ„ฐ์…‹๊ณผ ๊ฐ™์€ ํฌ๊ธฐ๊ฐ€ ํฐ ๋ฐ์ดํ„ฐ์…‹์—์„œ ๋ณธ ์—ฐ๊ตฌ์˜ ๊ฒฐ๊ณผ๊ฐ€ ๋” ์ข‹์•˜๊ธฐ์— ๋ณธ ๋ฐฉ๋ฒ•์€ ํฐ ๋ฐ์ดํ„ฐ์…‹์–ด์„œ ๊ทธ ์„ฑ๋Šฅ์„ ๋ฐœํœ˜ํ•˜๋Š” ๋ฐฉ๋ฒ•์ด๋ผ๊ณ  ์ฃผ์žฅํ•œ๋‹ค.

5. Conclusion

์ด ๋…ผ๋ฌธ์—์„œ๋Š” ์ œ์Šค์ฒ˜์ธ์‹์„ ํ•  ๋•Œ์— ์†์˜ joint ๋ณ€ํ™”์™€ ์†์˜ ์ „์ฒด ์›€์ง์ž„ ๋ณ€ํ™”๋ฅผ ๋”ฐ๋กœ ๋‘ ๊ฐœ์˜ ๋„คํŠธ์›Œํฌ์—์„œ ํ›ˆ๋ จํ•œ ํ›„์— ๊ทธ ๊ฒฐ๊ณผ๋ฅผ ๋‹ค์‹œ ํ•ฉ์ณ์„œ ์ œ์Šค์ฒ˜๋ฅผ ์ธ์‹ํ•˜๋Š” ๋ฐฉ์‹์„ ์ œ์‹œํ–ˆ๋‹ค. ํ™•์‹คํžˆ ์ด์ œ๊นŒ์ง€์˜ ์ œ์Šค์ฒ˜ ์ธ์‹๊ฐ™์€ ๊ฒฝ์šฐ๋Š” ๊ฐ ํ”„๋ ˆ์ž„์˜ ์†์„ ํ•˜๋‚˜์˜ ํ†ต์งธ ์ด๋ฏธ์ง€๋กœ๋งŒ ๋ณด๊ณ , ๊ทธ ๊ณณ์—์„œ ์ถ”์ถœํ•œ feature์˜ ๋ณ€ํ™”๋งŒ์„ ๊ฐ€์ง€๊ณ  ์ œ์Šค์ฒ˜๋ฅผ ์ธ์‹ํ•ด์™”๊ธฐ์— ์ด ๋…ผ๋ฌธ๊ณผ ๊ฐ™์ด ์†์˜ ๋ชจ์–‘๊ณผ ์›€์ง์ž„์„ ๋”ฐ๋กœ ๋ถ„๋ฆฌํ•ด์„œ ํ›ˆ๋ จ์‹œํ‚จ ๋‹ค๋Š” ์•„์ด๋””์–ด๋Š” ๊ต‰์žฅํžˆ ๊ฐ„๋‹จํ•˜๋ฉด์„œ๋„ ํšจ๊ณผ์ ์ธ ์•„์ด๋””์–ด๋กœ ๋ณด์ธ๋‹ค.

ํ•˜์ง€๋งŒ, ์ด ๋…ผ๋ฌธ์—์„œ๋Š” ์†์˜ joint ํฌ์ฆˆ ์ •๋ณด๋ฅผ ์™„์ „ํžˆ ์•Œ๊ณ ์žˆ๋‹ค๋Š” ์ „์ œํ•˜์—์„œ ๊ทธ joint์˜ ๋ณ€ํ™”๋ฅผ ์ธํ’‹์œผ๋กœ ์ด์šฉํ–ˆ๋‹ค. ์ด๋Ÿฌํ•œ ์  ๋•Œ๋ฌธ์— ์‹ค์ œ ํ™˜๊ฒฝ์—์„œ ์ด ๋ฐฉ๋ฒ•์„ ์ด์šฉํ•  ๋•Œ์— ์–ด๋–ป๊ฒŒ ์ •ํ™•ํ•œ ์†์˜ joint ํฌ์ฆˆ๋ฅผ ์–ป์–ด๋‚ผ ๊ฒƒ์ธ๊ฐ€ ํ•˜๋Š” ๋ฌธ์ œ๊ฐ€ ์ผ์–ด๋‚œ๋‹ค. ์ผ๋ฐ˜์ ์ธ RGB ํ˜น์€ RGB-D ์ด๋ฏธ์ง€์—์„œ ๋ชจ๋“  ์†๊ณผ ์†๊ฐ€๋ฝ joint์˜ ๊ณต๊ฐ„ ์œ„์น˜ ์ •๋ณด๋ฅผ ์–ป์–ด์˜ค๊ธฐ ์œ„ํ•ด์„œ๋Š” hand pose estimation ๊ณผ์ •์ด ํ•„์š”ํ•œ๋ฐ, ์ด๊ฒƒ์„ real-time์œผ๋กœ ์ด๋ฃจ์–ด๋‚ด๊ธฐ ์œ„ํ•ด์„  ์ด ๋…ผ๋ฌธ๊ณผ ๋งˆ์ฐฌ๊ฐ€์ง€์˜ ํ˜น์€ ๋” ํฐ ๋ชจ๋ธ์˜ ํ•™์Šต์„ ํ•„์š”๋กœ ํ•˜๊ณ  ๊ณ„์‚ฐ๊ณผ์ •์˜ ๊ฑธ๋ฆฌ๋Š” ์‹œ๊ฐ„๊ณผ ์ž์›์ด ๋˜ ๋“ค์–ด๊ฐ€๊ฒŒ ๋œ๋‹ค. ๊ทธ๋ ‡๋‹ค๋Š” ๊ฒƒ์€ ์ด ๋…ผ๋ฌธ์˜ ์ œ์Šค์ฒ˜ ์ธ์‹ ๊ณผ์ •์„ ์‹ค์ œ ํ™˜๊ฒฝ์—์„œ ์“ฐ๊ฒŒ๋œ๋‹ค๋ฉด joint์„ ์ฐพ๋Š” ๊ณผ์ • + ์ œ์Šค์ฒ˜ ์ธ์‹ ๊ณผ์ •์ด ๋”ํ•ด์ ธ์„œ ํ•œ ๋™์ž‘์˜ ์ œ์Šค์ฒ˜๋ฅผ ์ธ์‹ํ•˜๋Š” ๋ฐ๋งŒ ์‹œ๊ฐ„์˜ ์ง€์—ฐ์ด ๋งŽ์ด ๋ฐœ์ƒํ•˜๊ฒŒ ๋  ๊ฒƒ์ด๋‹ค. ๋˜, ์ด ๋…ผ๋ฌธ์—์„œ๋Š” ์† joint์˜ ์œ„์น˜ ์ •๋ณด๋ฅผ ์•Œ๊ณ  ์žˆ๊ธฐ์— ์† ๋์€ ๋Š˜ ํฌ์ฐฉ์ด ๊ฐ€๋Šฅํ•œ ๋ถ€๋ถ„์ด์—ˆ๊ณ  ๊ทธ๋ ‡๊ธฐ์— FRPV ์ธํ’‹์ด ๊ทธ ์„ฑ๋Šฅ์„ ๋ฐœํœ˜ํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค. ํ•˜์ง€๋งŒ, ์‹ค์ œ ์† ์ œ์Šค์ฒ˜์—์„œ๋Š” ์†๊ฐ€๋ฝ์ด ์†์— ์˜ํ•ด์„œ ๊ฐ€๋ ค์ง€๋Š” occlusion ์ƒํ™ฉ์ด ๋นˆ๋ฒˆํ•˜๊ฒŒ ๋ฐœ์ƒํ•˜๊ฒŒ ๋˜๊ณ  occlusion ๋ฌธ์ œ๋Š” ์ œ์Šค์ฒ˜ ์ธ์‹ ๋ถ„์•ผ์—์„œ ๊ต‰์žฅํžˆ ์ค‘์š”ํ•˜๊ฒŒ ๋‹ค๋ฃจ๊ณ  ์žˆ๋‹ค. ํ•˜์ง€๋งŒ ์ด ๋…ผ๋ฌธ์— ์ด๋Ÿฌํ•œ occlusion ์ƒํ™ฉ์— ๋Œ€ํ•œ ๊ณ ์ฐฐ์„ ์ „ํ˜€ ํ•˜๊ณ  ์žˆ์ง€ ์•Š๋‹ค. ์ด๋ ‡๋“ฏ ์†์˜ joint ์ •๋ณด๋ฅผ ์–ด๋–ป๊ฒŒ ์–ป์„ ๊ฒƒ์ด๊ฐ€ ํ•˜๋Š” ๋ฌธ์ œ์™€ occlusion ๋ฌธ์ œ, ์ด ๋‘ ๊ฐ€์ง€์˜ ํ•ต์‹ฌ์ ์ธ ๋ฌธ์ œ์— ๋Œ€ํ•ด์„œ ๊ณ ์ฐฐ์ด ์—†๋‹ค๋Š” ์ ์ด ์ด ๋…ผ๋ฌธ์— ์•„์‰ฌ์šด ์ ์ด๋ผ๊ณ  ํ•  ์ˆ˜ ์žˆ๋‹ค.

Take home message

Simple is the best!

Consider the diverse real-world problems!

Author / Reviewer information

Author

ํ•˜ํƒœ์šฑ (HA TAEWOOK)

  • KAIST CT

  • hatw95@kaist.ac.kr

Reviewer

  1. Korean name (English name): Affiliation / Contact information

  2. Korean name (English name): Affiliation / Contact information

  3. ...

Reference & Additional materials

  1. Liu, Jianbo, et al. "Decoupled representation learning for skeleton-based gesture recognition." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.

  2. Google Mediapipe (Official Github repository)

  3. Devineau, Guillaume, et al. "Deep learning for hand gesture recognition on skeletal data." 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018). IEEE, 2018.

  4. Du, Yong, Wei Wang, and Liang Wang. "Hierarchical recurrent neural network for skeleton based action recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.

  5. Liu, Jun, et al. "Global context-aware attention lstm networks for 3d action recognition." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.

  6. Weng, Junwu, et al. "Deformable pose traversal convolution for 3d action and gesture recognition." Proceedings of the European conference on computer vision (ECCV). 2018.

  7. Nguyen, Xuan Son, et al. "A neural network based on SPD manifold learning for skeleton-based hand gesture recognition." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.

  8. Chen, Yuxiao, et al. "Construct dynamic graphs for hand gesture recognition via spatial-temporal attention." arXiv preprint arXiv:1907.08871 (2019).

  9. Liu, Hong, et al. "Learning explicit shape and motion evolution maps for skeleton-based human action recognition." 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018.

  10. Li, Chao, et al. "Co-occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation." arXiv preprint arXiv:1804.06055 (2018).
