GPS-Net [Kor]

Description

  • Xin Lin et al. / GPS-Net: Graph Property Sensing Network for Scene Graph Generation / CVPR 2020


1. Problem definition

Scene Graph Generation (SGG) is the task of taking an image as input and converting it into a graph.

[Figure 1]

Figure 1 shows the overall SGG pipeline: given an image containing a person and a horse, the model generates a graph.

The graph G that we want to generate has four components: V, E, R, and O.

V is the set of nodes, built from the object detector's proposals, and E is the set of edges, which connect related objects.

SGG also performs classification: it decides which class each node and each edge belongs to.

R denotes the relation class of each edge, and O denotes the class of each object.

The final graph is therefore a combination of triplets of the form <subject, predicate, object>, such as (person, feeding, horse).

Generating G from an image I can then be factorized as P(G | I) = P(V | I) P(E | V, I) P(R, O | V, E, I), where

P(V | I ) - object detector

P(E | V, I ) - relation proposal network

P(R, O | V, E, I ) - classification models for entities and predicates.

By modeling these three terms, we can define the problem of generating a scene graph.
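
To make the four components concrete, here is a minimal sketch of how a scene graph could be stored and turned into triplets. The function name `to_triplets`, the variable names, and the box coordinates are made up for illustration and do not come from the paper.

```python
def to_triplets(object_labels, edges, relation_labels):
    """Turn (O, E, R) into (subject, predicate, object) class-name triplets.

    object_labels:   O, one class name per node
    edges:           E, (subject_index, object_index) pairs
    relation_labels: R, one predicate name per edge
    """
    return [
        (object_labels[s], r, object_labels[o])
        for (s, o), r in zip(edges, relation_labels)
    ]


# Toy example with made-up box coordinates (x1, y1, x2, y2).
boxes = [(10, 20, 110, 220), (90, 40, 300, 240)]  # V: one proposal per node
object_labels = ["person", "horse"]               # O
edges = [(0, 1)]                                  # E: person -> horse
relation_labels = ["feeding"]                     # R

triplets = to_triplets(object_labels, edges, relation_labels)
print(triplets)
```

The edge is stored as an ordered pair, so swapping the two indices would describe a different triplet. This is the direction information that the rest of the review is concerned with.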

2. Motivation

Before looking at GPS-Net, we should ask which models have been used for Scene Graph Generation so far and what their problems were.

Here we briefly summarize previous work and then look at the authors' idea.

Knowledge Graph Embedding

Models such as VTransE and DTransE [1], [2] use knowledge graph embedding methods to map the subject, predicate, and object into a shared latent space or into separate ones. They then apply the similarity between these hidden representations to Scene Graph Generation. However, these models ignore the surrounding context and generate the graph from the embedding of each object alone, so they do not fully exploit the information present in the image.

Scene Graph Generation

Neural Motifs [3] uses the features of surrounding entities as context when predicting the relation between entity A (subject) and entity B (object). To do so, it uses a sequence model such as a bi-directional RNN. Graph R-CNN [4] was proposed to model this context more efficiently on the graph itself: it uses a GNN to aggregate the surrounding context and generates the scene graph from the result. However, Graph R-CNN is not an optimal framework for SGG either. The reason is explained in the next section together with the idea behind GPS-Net.

Idea

[Figure 2]

Figure 2 clearly illustrates the motivation of the GPS-Net authors, who point out three important facts.

First, the model must be aware of direction. (b)

Applying a conventional Graph Neural Network (GNN) as-is does not capture the direction of a triplet. The authors therefore propose Direction-aware Message Passing (DMP), which takes direction into account.

Second, nodes with a high degree are important. (c)

SGG produces a graph from an image by passing it through several modules. If a hub node (a node with a high degree) is misclassified along the way, it spreads a lot of wrong information when the GNN updates its neighbors. The authors therefore propose the Node Priority Sensitive Loss, which focuses training on nodes with a higher degree.

Third, SGG is an imbalanced classification problem.

When predicting the predicate between a subject and an object, the predicate classes follow a long-tail distribution. Put simply, predicates such as 'on' and 'has' appear very frequently, whereas detailed actions such as 'standing in' and 'feeding' are comparatively rare label classes. A model that mostly predicts 'on' and 'has' can therefore still record high performance. Our ultimate goal, however, is not a scene graph full of 'on' and 'has' but one that carries high-quality information.

3. Method

GPS-Net adopts the Faster R-CNN architecture as its object detector. The pretrained detector generates object proposals, and from each box we extract the visual feature, the class logits, and the box location. We refer to these features of box i collectively as x_i.

In addition, unlike Graph R-CNN, GPS-Net also extracts a union feature u_ij from the union of two boxes.

Once the features x_1, ..., x_n and u_12, ..., u_ij, ... are obtained, they are passed through the GPS-Net modules introduced above.

1. Direction-aware Message Passing

[Figure 3]

Figure 3 shows the structure of existing message passing networks, (a) and (b), next to the proposed DMP structure (c). Here x_i is the target to be updated, x_j is the feature vector of a neighbor used to update the target, and u_ij is the feature of the union box of bounding boxes i and j. The key question in a message passing network is how the message is built.

In (a), the message is the concatenation of the target and neighbor features multiplied by a weight. This message is passed through the Transformer layer and then combined with the node's own feature to produce the update.

In (b), the message is built from the neighbor's feature alone. It is likewise passed through the Transformer layer and then combined with the node's own feature.

In the SGG setting, however, this is a problem. SGG uses a GNN to aggregate the features of surrounding objects, but the graph fed to the GNN is noisy rather than clean. In other words, the connections between the object detector's proposal boxes are set arbitrarily, so even the direction of each edge is ambiguous.

To handle this, (c) builds messages that consider both directions, and it differs from (a) and (b) in the following two ways.

Difference 1: the edge feature u_ij is introduced into the MPNN layer.

As mentioned above, u_ij is the visual feature extracted from the union box. It is an additional feature that Graph R-CNN did not use, and it gives relation prediction a wider receptive field. Moreover, because a GNN propagates information further as more layers are stacked, the context of the image is also reflected better in the prediction. For example, when predicting the relation between a person (object) and a horse (object), the visual feature of the region where the person's hand and the horse overlap should help. This is the role of the union box.
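
As a small illustration of the region that u_ij is taken from, the sketch below computes a union box. It assumes the usual convention that the union box is the smallest axis-aligned box enclosing both inputs, with boxes given as (x1, y1, x2, y2). The coordinates are made up, and extracting the visual feature from that region (for example with RoI pooling) is not shown.

```python
def union_box(box_i, box_j):
    """Smallest axis-aligned box that covers both input boxes (x1, y1, x2, y2)."""
    return (
        min(box_i[0], box_j[0]),
        min(box_i[1], box_j[1]),
        max(box_i[2], box_j[2]),
        max(box_i[3], box_j[3]),
    )


person = (10, 20, 110, 220)  # made-up coordinates
horse = (90, 40, 300, 240)   # made-up coordinates

print(union_box(person, horse))  # (10, 20, 300, 240)
```

The union box is the same for both orderings of the pair, so the union feature alone carries no direction. Direction has to come from how x_i, x_j, and u_ij are combined.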

Difference 2: the element-wise product in the MPNN layer is replaced with a Kronecker product.

Structurally, (a) simply concatenates x_i and x_j, and (b) uses only the information of the neighbor node x_j for message passing. In contrast, the proposed DMP computes an attention weight from (x_i, x_j, u_ij) and multiplies it with the neighbor's feature before passing the message. In (c), the attention weight therefore depends on the direction from which a feature arrives, and changing the direction changes how much x_j contributes to the update. The authors implement this with a Kronecker product and state that this is what makes the MPNN structure direction-aware.
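
The following is a rough sketch of the idea only, not the paper's exact formulation. It assumes that the attention score for a pair is a linear function of the Kronecker product of x_i, x_j, and u_ij, and it leaves out the learned projections and the Transformer layer. The scoring vector `w`, the toy features, and the function names are all hypothetical.

```python
import math


def kron(a, b):
    """Kronecker product of two vectors stored as lists."""
    return [x * y for x in a for y in b]


def attention_logit(w, x_i, x_j, u_ij):
    """Unnormalized score of neighbor j for target i."""
    joint = kron(kron(x_i, x_j), u_ij)
    return sum(w_k * v for w_k, v in zip(w, joint))


def dmp_update(i, x, u, w):
    """Return the updated feature of node i and its attention weights."""
    neighbors = [j for j in range(len(x)) if j != i]
    logits = [attention_logit(w, x[i], x[j], u[(i, j)]) for j in neighbors]
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    alphas = [e / sum(exps) for e in exps]
    # Message: attention-weighted sum of the neighbor features.
    message = [
        sum(a * x[j][d] for a, j in zip(alphas, neighbors))
        for d in range(len(x[i]))
    ]
    return [x_d + m_d for x_d, m_d in zip(x[i], message)], alphas


# Toy data: three nodes with 2-d features; both directions share one union feature.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
u = {(i, j): [1.0, 0.5] for i in range(3) for j in range(3) if i != j}
w = [0.1 * k for k in range(8)]  # hypothetical scoring vector, length 2 * 2 * 2

score_01 = attention_logit(w, x[0], x[1], u[(0, 1)])
score_10 = attention_logit(w, x[1], x[0], u[(1, 0)])
new_x0, alphas = dmp_update(0, x, u, w)
print(score_01, score_10)
```

Because the Kronecker product depends on the order of its arguments, the score for the pair (0, 1) differs from the score for (1, 0) even though the union feature is identical. That order sensitivity is what makes the attention direction-aware in this sketch.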

2. Node Priority Sensitive Loss

The authors argue that nodes should be updated differently depending on their priority. SGG runs many tasks in sequence (Faster R-CNN detection, graph generation, object classification, edge classification), and with this many tasks and such a large model, a wrong prediction can occur somewhere in the middle.

Suppose, for example, that Faster R-CNN mistakes a dog for a cat. Every subsequent MPNN layer then propagates the wrong node feature. If that node is a hub node with a high degree, the wrong information spreads even further. The Node Priority Sensitive Loss appears to be proposed to control this situation.

[Figure 4]

Figure 4 shows the formula of the proposed loss.

Theta denotes the priority: the number of triplets that pass through the node, as a fraction of the total number of triplets. A node that many triplets pass through has a high priority; below we read this as a node with a high degree.

Next, the focusing factor is computed from the priority. Since theta lies between 0 and 1, the larger theta is, the smaller the focusing factor becomes.

The last part is the focal loss, whose gamma value changes from node to node. When gamma is 0, the loss reduces to the ordinary cross-entropy loss. When gamma is large, the loss is small, so the gradient update for that node is relatively small. Conversely, when gamma is small, the loss is relatively large and the node receives a larger update.

In short: high degree -> small focusing factor (gamma) -> large loss -> larger update; low degree -> large focusing factor (gamma) -> small loss -> smaller update.

As a result, training concentrates on nodes with a high degree.
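
The chain above can be traced numerically with the sketch below. The formula for the focusing factor is an assumption: it uses the form min(cap, -(1 - theta)^mu * log(theta)), which should be checked against Figure 4 and the paper. The defaults `mu=4.0` and `cap=2.0`, the triplet counts, and the function names are hypothetical.

```python
import math


def priority(num_node_triplets, num_all_triplets):
    """theta: share of all triplets that include the node."""
    return num_node_triplets / num_all_triplets


def focusing_factor(theta, mu=4.0, cap=2.0):
    """gamma, under the ASSUMED form min(cap, -(1 - theta)**mu * log(theta))."""
    if theta <= 0.0:
        return cap  # a node in no triplet gets the largest gamma
    return min(cap, -((1.0 - theta) ** mu) * math.log(theta))


def focal_loss(p_true, gamma):
    """Focal loss for the probability assigned to the correct class."""
    return -((1.0 - p_true) ** gamma) * math.log(p_true)


hub_gamma = focusing_factor(priority(10, 20))  # node in half of the triplets
leaf_gamma = focusing_factor(priority(1, 20))  # node in a single triplet

print(hub_gamma, leaf_gamma)
print(focal_loss(0.6, hub_gamma), focal_loss(0.6, leaf_gamma))
```

With the same predicted probability of 0.6 for the correct class, the hub node ends up with the smaller gamma and therefore the larger loss, which matches the chain described above. Setting gamma to 0 recovers the plain cross-entropy term -log(p).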

3. Adaptive Reasoning Module

Finally, the authors add mechanisms that adapt the final relation prediction to the SGG setting.

These are Frequency Softening and Bias Adaptation.

[Figure 5]

The formulas in Figure 5 make the idea easy to follow.

Bias Adaptation injects the pattern of the label distribution in the training data as a bias.

This idea comes from Neural Motifs [3]: if a particular triplet pattern appears often, a bias is added that steers the model toward predicting it.

The first (fusion) term in Bias Adaptation predicts the class from the features obtained through DMP. The d*p term added after it can be seen as the Frequency Softening part. Given the union feature u_ij, d is computed by judging whether the triplet is a frequent one, and it is multiplied by p, which reflects the distribution of the training data. This can be viewed as adding a suitable bias for frequently occurring patterns.

Here, however, the prior is not used as-is: Frequency Softening modifies it slightly, because the Visual Genome dataset used for SGG has a long-tailed class distribution. To account for this, GPS-Net softens the frequencies with a log-softmax function, which leaves some room for rare labels to be predicted as well.
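
A minimal sketch of how the two pieces could fit together is given below. It assumes the prediction has the form softmax(fused scores + d * log-softmax(frequency prior)). In the paper d is computed from the union feature u_ij, whereas here it is simply passed in. All numbers and names are made up for illustration.

```python
import math


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(scores):
    return [math.log(p) for p in softmax(scores)]


def adaptive_reasoning(fused_scores, frequency_prior, gate):
    """Combine classifier scores with a softened, gated frequency prior."""
    softened = log_softmax(frequency_prior)  # frequency softening
    biased = [s + d * p for s, d, p in zip(fused_scores, gate, softened)]
    return softmax(biased)                   # predicate distribution


predicates = ["on", "has", "feeding"]
frequency_prior = [0.70, 0.28, 0.02]  # made-up training frequencies
fused_scores = [1.0, 0.5, 2.0]        # made-up classifier scores
gate = [0.5, 0.5, 0.5]                # made-up gate values d in (0, 1)

softened = log_softmax(frequency_prior)
probs = adaptive_reasoning(fused_scores, frequency_prior, gate)
print(dict(zip(predicates, probs)))
```

In this toy example the gap between the most and least frequent predicate shrinks from log(0.70 / 0.02) in plain log-frequency terms to 0.68 after the log-softmax, so the rare predicate 'feeding' is penalized far less by the bias.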

4. Experiment & Result

Experimental setup

In SGG, using the Visual Genome dataset is the standard setup. The evaluation metric is Recall@K, and three tasks are compared: SGDET, SGCLS, and PREDCLS.

SGDET - Image -> object detection / object classification / predicate classification.

This is the full task of generating a graph from an image. It is the hardest of the three, since the model literally learns the mapping from an image to a graph. It therefore tests everything at once: the object detector, graph edge prediction, and the object and relation classifiers.

SGCLS - Ground-truth boxes -> object classification / predicate classification.

Given the image and the ground-truth bounding boxes, the model builds the scene graph. Because it does not depend on the object detector, it is somewhat easier than SGDET. It measures only the performance of the object and predicate classifiers.

PREDCLS - Ground-truth boxes, object categories -> predicate classification.

Finally, given the image, the ground-truth bounding boxes, and the object classes, the model builds the scene graph. Because it does not depend on the object detector and the object classes are already known, it is the easiest task. It measures only the performance of the predicate classifier.
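
For reference, here is a simplified sketch of Recall@K for a single image. It matches triplets by their labels only. Real SGDET evaluation also requires the predicted boxes to overlap the ground-truth boxes, which is omitted here. The function name, scores, and triplets are made up.

```python
def recall_at_k(scored_triplets, ground_truth, k):
    """Fraction of ground-truth triplets found among the k highest-scoring predictions."""
    ranked = sorted(scored_triplets, key=lambda item: item[0], reverse=True)
    top_k = {triplet for _, triplet in ranked[:k]}
    hits = sum(1 for triplet in ground_truth if triplet in top_k)
    return hits / len(ground_truth)


predictions = [
    (0.9, ("person", "on", "horse")),
    (0.8, ("person", "has", "hand")),
    (0.3, ("person", "feeding", "horse")),
]
ground_truth = [("person", "feeding", "horse"), ("person", "has", "hand")]

print(recall_at_k(predictions, ground_truth, k=2))  # 0.5
print(recall_at_k(predictions, ground_truth, k=3))  # 1.0
```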

Result

[Table 1]

Table 1 compares Recall@K for K = 20, 50, and 100 on each task. The symbols next to the model names group the models that use the same object detector. As the table shows, GPS-Net outperforms the existing models on every task regardless of which object detector is used.

[Table 2]

Table 2 compares mR@K, which computes Recall@K separately for each class and then averages over all classes. It also compares how much performance gain was obtained for each class. mR@K clearly increases, and the plot on the right shows that performance improves on the long-tail classes.
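
The difference between plain recall and mean recall can be seen in the toy sketch below. It is simplified to a single image and to label-only matching. The function names and the triplets are made up for illustration.

```python
from collections import defaultdict


def recall(retrieved, ground_truth):
    """Fraction of ground-truth triplets that were retrieved."""
    return sum(1 for t in ground_truth if t in retrieved) / len(ground_truth)


def mean_recall(retrieved, ground_truth):
    """Average of per-predicate recall over the predicate classes in the ground truth."""
    hits, totals = defaultdict(int), defaultdict(int)
    for triplet in ground_truth:
        predicate = triplet[1]
        totals[predicate] += 1
        if triplet in retrieved:
            hits[predicate] += 1
    return sum(hits[p] / totals[p] for p in totals) / len(totals)


ground_truth = [
    ("cup", "on", "table"),
    ("book", "on", "shelf"),
    ("hat", "on", "head"),
    ("plate", "on", "table"),
    ("person", "feeding", "horse"),
]
# A model that finds every frequent 'on' triplet but misses the rare one.
retrieved = set(ground_truth[:4])

print(recall(retrieved, ground_truth))       # 0.8
print(mean_recall(retrieved, ground_truth))  # 0.5
```

A model that only gets the frequent predicate right still scores 0.8 on plain recall, but its mean recall drops to 0.5 because the rare predicate counts as much as the frequent one.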

[Table 3]

Table 3 (a) and (b) show the ablation study on the model components. In Table 3 (a), adding DMP to the baseline gives the largest performance gain on the SGDET and SGCLS tasks, which shows how important direction is in SGG. NPS and ARM each contribute a smaller gain; although not as much as DMP, they improve performance over the baseline overall. On the PREDCLS task, in contrast, ARM gives the largest improvement. Since the part that ARM targets is most closely related to PREDCLS, it improves performance more than DMP on this task.

Table 3 (b) compares stacked DMP layers with the existing MP modules, and also compares performance across values of mu, the hyperparameter that controls how strongly NPS focuses on nodes.

Overall, the comparison with baselines and the ablation study of the proposed model are carried out thoroughly, so the paper is easy to follow and its hypotheses are well verified by experiments.

5. Conclusion

GPS-Net raises new issues that Scene Graph Generation has to deal with. It proposes a model that is aware of the direction between objects and of the fact that nodes differ in importance, and it shows through appropriate experiments that these issues are addressed. On the experimental side, however, it would have been better if the appendix had clarified the role of the feature u_ij, because as it stands we cannot tell whether the performance gain comes from the model structure or from adding that feature. Looking at the recall values, image-to-graph tasks still seem far too inaccurate for real-world use.

Take home message

This paper puts an enormous amount of effort and experimentation into verifying simple hypotheses that can arise in the SGG problem. Coming up with an easy hypothesis takes a moment, but proving it takes a great deal of effort and time, and that is something worth learning from.

Author

윤강훈 (Kanghoon Yoon)

  • Affiliation: KAIST Industrial Engineering Department

  • Ph.D. student in DSAIL

Reference & Additional materials

  1. Visual translation embedding network for visual relation detection

  2. Representation learning for scene graph completion via jointly structural and visual embedding

  3. Neural Motifs: Scene Graph Parsing with Global Context

  4. Graph R-CNN for Scene Graph Generation

  5. GPS-Net: Graph Property Sensing Network for Scene Graph Generation
