BGNN [Kor]

Li et al. / Bipartite Graph Network With Adaptive Message Passing For Unbiased Scene Graph Generation / CVPR 2021

1. Problem definition

Paper Topic

Scene Graph Generation(SGG) in Computer Vision

Scene Graph Generation (SGG) is the task of converting a given image into a scene graph. For example, given the image on the left, the goal is to produce a graph like the one on the right. Each node of the graph represents an entity (e.g., person, rock), and each edge between two nodes represents the predicate that relates the two entities. For the sentence "a person is on a rock", the nodes are "person" and "rock", and the edge (predicate) is "standing on".

Figure source: [CVPR 21] Energy-Based Learning For Scene Graph Generation

2. Motivation

One of the main challenges in Scene Graph Generation (SGG) is that the distribution of predicates (e.g., standing on, has) is long-tailed. One of the benchmark datasets for SGG is Visual Genome (VG), and the predicates that appear in VG images follow the distribution shown below.

Figure source: [CVPR 20] Unbiased Scene Graph Generation from Biased Training

If a model is trained without accounting for this long-tailed distribution, it mostly learns the predicates that appear frequently in training, such as "on" and "has". At test time, even when a tail predicate is the correct answer, the model tends to predict a head predicate with a similar meaning. For example, even when "standing on" appears in the test data, the model predicts the similar but more frequent "on". As a result, the generated scene graph is not accurate.

Scene Graph Generation

The three papers below build the scene graph under the assumption that every pair of entities is connected by a predicate, that is, they use a fully connected graph rather than a sparse one. This covers every possible entity pair, but generating predicates for meaningless entity pairs can introduce noise.

Scene graph generation by iterative message passing(CVPR 17)  
Scene Graph Generation from Objects, Phrases and Region Captions(ICCV 17)  
Gps-net: Graph property sensing network for scene graph generation(CVPR 20)  

Long-Tailed

Several prior efforts address the biased predictions caused by the long-tailed distribution, using a variety of techniques. The papers below tackle the problem by designing new loss functions.

Graph Density-Aware Losses for Novel Compositions in Scene Graph Generation(BMVC 20)  
Pcpl: Predicate-correlation perception learning for unbiased scene graph generation(MM 20)

Idea

As noted in the introduction, this paper addresses the long-tailed distribution of predicates. Typical SGG methods assume that a predicate exists between every pair of nodes and therefore operate on a fully connected graph. The paper argues, however, that a meaningless predicate between two nodes has a negative effect on the scene graph. It therefore introduces a confidence module that filters out meaningless predicates between node pairs. Using the figure below, the paper claims that removing the boxed relations yields a more accurate graph.

In addition to the confidence module, the paper uses a sampling technique called 'Bi-Level Sampling', which is its main tool for handling the long-tailed problem.

3. Method

Before describing the model, note that SGG pipelines use an object detection module such as Faster R-CNN, which provides bounding boxes and an object class distribution for each box. Each detected bounding box can be thought of as one node of the graph.

1. Bipartite Graph Construction

Given an image, the scene is represented as a bipartite graph, as in the figure below: one group of nodes contains the entities and the other group contains the predicates. Although the introduction argued that meaningless predicates between nodes introduce noise, graph construction starts from the assumption that a predicate exists between every pair of nodes. (Whether each predicate actually exists is modeled later.)

The bipartite graph is directed so that message passing can be handled differently in the Entity->Predicate and Predicate->Entity directions.

Because the initial graph is fully connected, a predicate proposal exists for every pair of nodes, and each predicate proposal is connected to the two nodes of its pair, as in the right-hand side of the figure above. (A proposal is a hypothesized predicate: a candidate relation that is not necessarily a ground-truth predicate.)
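The construction step can be sketched in a few lines of Python. This is only an illustration under simple assumptions: entities are integer indices, one predicate proposal is created per ordered (subject, object) pair, and the function name `build_bipartite_graph` and the role labels are hypothetical, not taken from the paper's code.

```python
from itertools import permutations

def build_bipartite_graph(num_entities):
    """Return predicate proposals and the directed edges of the bipartite graph."""
    # One predicate proposal node per ordered pair i -> j with i != j.
    proposals = list(permutations(range(num_entities), 2))
    # Entity -> Predicate edges: (entity index, proposal index, role).
    entity_to_predicate = []
    for k, (i, j) in enumerate(proposals):
        entity_to_predicate.append((i, k, "subject"))
        entity_to_predicate.append((j, k, "object"))
    # Predicate -> Entity edges are the same links in the reverse direction.
    predicate_to_entity = [(k, e, role) for (e, k, role) in entity_to_predicate]
    return proposals, entity_to_predicate, predicate_to_entity

proposals, e2p, p2e = build_bipartite_graph(3)
print(len(proposals))  # 3 * 2 = 6 ordered pairs
```

Keeping the two edge lists separate mirrors the point made above: the two directions can then be given different message passing rules.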


2. Relation Confidence Estimation(RCE) + Confidence Message Passing(CMP)

  • RCE

This corresponds to the "RCE" branch in the figure above. This module can be thought of as the branch that checks whether the predicate between two nodes is meaningful.

$$s_{i \to j}^m = g_x(r_{i \to j} \oplus p_i \oplus p_j) \in \mathbb{R}^{|C_p|}$$

$r_{i \to j}$: union feature

$p_i$: class probability of the bounding box

The class-specific confidence above indicates how much confidence the predicate proposal has for each predicate class. If the proposal carries no meaningful information, its confidence is low across the predicate classes, and the overall confidence of the proposal is low as well.

$$s_{i \to j}^b = \sigma(w_b^T s_{i \to j}^m), \quad w_b \in \mathbb{R}^{|C_p|}$$

The equation above gives a global confidence score (a scalar) for whether a meaningful predicate exists between the two nodes. A high score suggests that a meaningful predicate exists between them, and a low score suggests that the proposal is meaningless.
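The two RCE scores can be sketched in plain Python as follows. The sketch makes several assumptions that are not in the text: $g_x$ is replaced by a single linear layer, the concatenation operator is implemented as list concatenation, and the weights `W_x` and `w_b` and all input values are made-up toy numbers.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def linear(weights, x):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def relation_confidence(r_ij, p_i, p_j, W_x, w_b):
    """Return (s_m, s_b) for one predicate proposal i -> j."""
    features = r_ij + p_i + p_j  # list concatenation stands in for the concatenation operator
    s_m = linear(W_x, features)  # class-specific scores, one per predicate class
    s_b = sigmoid(sum(w * s for w, s in zip(w_b, s_m)))  # global confidence in (0, 1)
    return s_m, s_b

# Toy inputs: 2-d union feature, 2 entity classes, 3 predicate classes.
r_ij, p_i, p_j = [0.5, -0.2], [0.9, 0.1], [0.3, 0.7]
W_x = [[0.1] * 6, [0.2] * 6, [-0.1] * 6]  # hypothetical weights
w_b = [1.0, 1.0, 1.0]                      # hypothetical weights
s_m, s_b = relation_confidence(r_ij, p_i, p_j, W_x, w_b)
```

Here `s_m` has one entry per predicate class and `s_b` is the single scalar that is later used to gate Predicate->Entity messages.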

  • CMP

The CMP branch performs message passing using the confidence scores obtained from the RCE branch. As explained in the graph construction step, Entity->Predicate and Predicate->Entity messages must be propagated differently.

$$r_{i \to j}^{l+1} = r_{i \to j}^{l} + \phi(d_s W_r^T e_i^l + d_o W_r^T e_j^l)$$

$$d_s = \sigma(w_s^T [r_{i \to j}^l \oplus e_i^l]), \quad d_o = \sigma(w_o^T [r_{i \to j}^l \oplus e_j^l])$$

$e_i^l$: entity feature at layer $l$

The equation above describes Entity->Predicate message passing, and its interpretation is simple. When passing a message from an entity to a predicate, the gates $d_s$ and $d_o$ look at the entity and the relationship proposal together to decide how much of the message to pass.
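A minimal sketch of this Entity->Predicate update is given below. It assumes that $\phi$ is a ReLU, that the concatenation inside the gates is list concatenation, and that all weights (`W_r_T`, `w_s`, `w_o`) are toy values chosen for illustration; none of these choices is confirmed by the text.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def linear(weights, x):
    return [dot(row, x) for row in weights]

def entity_to_predicate(r_ij, e_i, e_j, W_r_T, w_s, w_o):
    """One update of the proposal feature r_ij from its subject e_i and object e_j."""
    d_s = sigmoid(dot(w_s, r_ij + e_i))  # gate for the subject message
    d_o = sigmoid(dot(w_o, r_ij + e_j))  # gate for the object message
    m_s = linear(W_r_T, e_i)             # W_r^T e_i
    m_o = linear(W_r_T, e_j)             # W_r^T e_j
    fused = [max(0.0, d_s * a + d_o * b) for a, b in zip(m_s, m_o)]  # phi assumed to be ReLU
    return [r + f for r, f in zip(r_ij, fused)]  # residual update

r_ij, e_i, e_j = [0.1, -0.4], [1.0, 0.0, 2.0], [0.0, 1.0, 1.0]
W_r_T = [[0.5, 0.0, 0.0], [0.0, -1.0, 0.0]]  # hypothetical 2x3 projection
w_s = w_o = [0.0] * 5                         # zero gate weights give d_s = d_o = 0.5
r_next = entity_to_predicate(r_ij, e_i, e_j, W_r_T, w_s, w_o)
```

With zero gate weights both gates equal 0.5, so each entity contributes half of its projected feature before the ReLU and the residual addition.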

In the Predicate->Entity direction, each predicate is still only a hypothesis, so it carries a lot of noise: it is not yet known whether the predicate is meaningful. Therefore, when messages are passed from predicates to entities, the global confidence score from RCE is used to suppress this noise.

The global confidence score is applied as hard control through a gating function: values above 1 are clipped to 1 so that they grow no further, and values below 0 are clipped to 0 so that they shrink no further.

A message is thus filtered twice: once by the learned gate score, as in the Entity->Predicate case, and once more by the global confidence score. In other words, the Predicate->Entity update is similar to the equation above, except that each message is additionally multiplied by the global confidence score.
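The hard gate and its effect on Predicate->Entity messages can be sketched as below. The piecewise-linear form of the gate and the parameters `alpha` and `beta` (with their default values) are assumptions made for illustration; the text only states that the output is clipped to the range from 0 to 1. The aggregation is also simplified to a plain average of gated messages.

```python
def hard_gate(score, alpha=2.0, beta=0.25):
    """Clip a linear function of the confidence score to the range [0, 1]."""
    return min(1.0, max(0.0, alpha * (score - beta)))

def gated_predicate_messages(messages, confidences, alpha=2.0, beta=0.25):
    """Average predicate messages, each scaled by its gated global confidence."""
    gates = [hard_gate(s, alpha, beta) for s in confidences]
    dim = len(messages[0])
    return [
        sum(g * m[k] for g, m in zip(gates, messages)) / len(messages)
        for k in range(dim)
    ]

# Two proposals send a message to the same entity; the first has low confidence.
messages = [[1.0, 2.0], [4.0, 4.0]]
confidences = [0.1, 0.9]
aggregated = gated_predicate_messages(messages, confidences)
```

In this toy case the low-confidence proposal is gated to exactly 0, so only the second message reaches the entity.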


3. Bi-Level Resampling

During training, images are not drawn uniformly at random; instead, they are selected according to the predicate distribution. Sampling proceeds in two stages: first, image-level sampling replicates images, and second, instance-level sampling drops predicates within an image with a certain probability.

  • Image-Level Over-Sampling

As in the second panel of the figure above, each image is oversampled according to the largest repeat factor among the predicates it contains.

$$r^c = \max(1, \sqrt{t / f^c})$$

$f^c$: frequency of predicate $c$ over the whole dataset

$t$: hyperparameter that controls the amount of oversampling

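Each image is then oversampled at a rate determined by the frequencies of its predicates: the image-level repeat factor is $r_i = \max_{c \in i} r^c$, so images that contain rare predicates are repeated more often.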
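As a small illustration of the repeat factor, the sketch below computes $r^c$ per predicate and takes the maximum over the predicates in an image. The frequencies and the value of `t` are made-up numbers, and the helper names are hypothetical.

```python
import math

def repeat_factor(freq, t):
    """r^c = max(1, sqrt(t / f^c)) for one predicate category."""
    return max(1.0, math.sqrt(t / freq))

def image_repeat_factor(predicates, freq_by_predicate, t):
    """r_i: the largest repeat factor among the predicates in the image."""
    return max(repeat_factor(freq_by_predicate[c], t) for c in predicates)

freq_by_predicate = {"on": 0.4, "has": 0.3, "standing on": 0.01}  # made-up frequencies
t = 0.04                                                           # made-up threshold
r_head = image_repeat_factor(["on", "has"], freq_by_predicate, t)
r_tail = image_repeat_factor(["on", "standing on"], freq_by_predicate, t)
```

An image with only head predicates keeps a repeat factor of 1, while an image that contains the rare "standing on" gets a repeat factor of about 2 under these numbers.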

  • Instance-level Under-Sampling

As in the third panel of the figure above, this step decides whether or not to drop each predicate instance within an image. If a predicate belongs to the head of the distribution and appears very frequently overall, its drop probability is increased.

$$d_i^c = \min\left(\frac{r_i - r^c}{r_i} \cdot \gamma_d,\ 1.0\right)$$

$\gamma_d$: hyperparameter that controls the drop-out rate

The drop-out rate is determined by the equation above. A worked example follows.

Suppose the image-level repeat factor is $r_i = 0.5$, the maximum over the image, the category $c$ is "dog" with $r^c = 0.2$, and the hyperparameter $\gamma_d$ is 1. Then $d_i^c = (0.5 - 0.2) / 0.5 \times 1 = 0.6$, which means that "dog" is dropped with probability 0.6.
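The worked example can be checked with a short sketch. It assumes that the drop-out rate is capped at 1.0 (so that it stays a valid probability, which is what the value 0.6 in the example implies), and the function names, the triples, and the repeat factors in the second half are hypothetical.

```python
import random

def drop_rate(r_i, r_c, gamma_d):
    """Drop-out probability for category c in image i, capped at 1.0 (assumption)."""
    return min((r_i - r_c) / r_i * gamma_d, 1.0)

def undersample(triples, r_i, repeat_by_predicate, gamma_d, rng):
    """Keep each (subject, predicate, object) triple with probability 1 - drop_rate."""
    kept = []
    for subj, pred, obj in triples:
        if rng.random() >= drop_rate(r_i, repeat_by_predicate[pred], gamma_d):
            kept.append((subj, pred, obj))
    return kept

# Numbers from the worked example: r_i = 0.5, r^c = 0.2, gamma_d = 1.
d = drop_rate(0.5, 0.2, 1.0)

# Hypothetical image with one head and one tail predicate.
repeat_by_predicate = {"on": 1.0, "standing on": 2.0}
triples = [("person", "on", "grass"), ("person", "standing on", "rock")]
kept = undersample(triples, 2.0, repeat_by_predicate, 1.0, random.Random(0))
```

The tail predicate whose repeat factor equals $r_i$ has a drop-out rate of 0 and is always kept, while the head predicate "on" is dropped with probability 0.5 under these numbers.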

With bi-level sampling at both the image level and the instance level, training samples are drawn in a way that accounts for the long-tailed distribution.

4. Experiment & Result

Experimental Setup

  • Dataset: Visual Genome is the most widely used benchmark dataset for Scene Graph Generation. The authors also use Open Images and compare against a variety of baselines.

  • Baseline

  1. PCPL: Predicate-correlation perception learning for unbiased scene graph generation(MM 20), SOTA

  2. Neural motifs: Scene graph parsing with global context(CVPR 18)

  3. Graph r-cnn for scene graph generation(ECCV 18)

  4. Learning to compose dynamic tree structures for visual contexts(CVPR 19)

  5. Graphical Contrastive Losses for Scene Graph Generation(CVPR 19)

  6. Knowledge-embedded routing network for scene graph generation(CVPR 19)

  7. Gps-net: Graph property sensing network for scene graph generation(CVPR 20)

  8. Unbiased scene graph generation from biased training(CVPR 20)

  • Training Setup: ResNet-101 is used as the backbone to extract convolutional features, and Faster R-CNN is used for object detection. These parameters are kept frozen during training; that is, the backbone and the detector use pretrained parameters.

  • Evaluation Metric: Recall@K and mean Recall@K, reported on three sub-tasks.

    PredCls: for each subject-predicate-object triple in an image, recall is computed on whether the predicate alone is predicted correctly.

    SGCls: for each subject-predicate-object triple in an image, recall is computed on whether all three elements of the triple are predicted correctly.

    SGGen: the SGCls condition plus object detection; a detected object counts as correct when its IoU with the ground-truth bounding box is at least 0.5.
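The difference between Recall@K and mean Recall@K can be illustrated with a simplified, single-image sketch. It assumes that triples are matched by exact label equality (bounding boxes and IoU are ignored) and that predictions are already sorted by score; the function names and the toy triples are hypothetical.

```python
from collections import defaultdict

def recall_at_k(gt_triples, ranked_predictions, k):
    """Fraction of ground-truth triples found among the top-k predictions."""
    top_k = set(ranked_predictions[:k])
    return sum(1 for t in gt_triples if t in top_k) / len(gt_triples)

def mean_recall_at_k(gt_triples, ranked_predictions, k):
    """Recall computed per predicate class, then averaged over the classes."""
    top_k = set(ranked_predictions[:k])
    hits, totals = defaultdict(int), defaultdict(int)
    for t in gt_triples:
        totals[t[1]] += 1
        hits[t[1]] += int(t in top_k)
    return sum(hits[p] / totals[p] for p in totals) / len(totals)

gt = [
    ("person", "on", "rock"),
    ("person", "on", "grass"),
    ("dog", "standing on", "rock"),
]
# A head-biased model that only ever predicts "on".
predictions = [
    ("person", "on", "rock"),
    ("person", "on", "grass"),
    ("dog", "on", "rock"),
]
r = recall_at_k(gt, predictions, 3)
mr = mean_recall_at_k(gt, predictions, 3)
```

In this toy case the head-biased model reaches a Recall@3 of 2/3 but a mean Recall@3 of only 0.5, because the tail predicate "standing on" is never recovered. This is why mean Recall is the more informative metric for long-tailed predicates.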

Result

Because the paper targets the long-tailed distribution, it reports results on the tail predicates. In addition, its overall Recall compares well with the baseline models.

GPS-Net and the Unbias model (Unbiased scene graph generation from biased training) also address the long-tailed distribution. Even so, the proposed model captures the tail predicates better than both.

The Recall results compared against the other baselines are shown below.

The SOTA model is PCPL: the proposed method is lower on mean Recall, but higher on Recall.

5. Conclusion

The long-tailed problem is especially severe in SGG. This paper adds a confidence module that first estimates whether the predicate between two nodes is meaningful, uses this estimate to suppress noise, and then performs message passing. It also uses bi-level resampling to sample training data in a way that accounts for the long-tailed distribution.

Take home message

The paper shows that accounting for the long-tailed problem is important in Scene Graph Generation.

A paper that focuses on the long-tailed problem needs to be compared against recent papers that address the same problem.

Rather than pruning entity pairs in a naive way, the confidence module learns during training whether a meaningful predicate exists; this approach could be applied in other settings.

Author / Reviewer information

Author

김기범 (Kibum Kim)

  • KAIST ISysE (Department of Industrial and Systems Engineering), M.S. student

  • Research Topic : Recommendation, Graph Neural Network

Reference & Additional materials

  1. Tang, K., Niu, Y., Huang, J., Shi, J., & Zhang, H. (2020). Unbiased scene graph generation from biased training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 3716-3725)

  2. Yan, S., Shen, C., Jin, Z., Huang, J., Jiang, R., Chen, Y., & Hua, X. S. (2020, October). Pcpl: Predicate-correlation perception learning for unbiased scene graph generation. In Proceedings of the 28th ACM International Conference on Multimedia (pp. 265-273).

  3. Zellers, R., Yatskar, M., Thomson, S., & Choi, Y. (2018). Neural motifs: Scene graph parsing with global context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 5831-5840).

  4. Yang, J., Lu, J., Lee, S., Batra, D., & Parikh, D. (2018). Graph r-cnn for scene graph generation. In Proceedings of the European conference on computer vision (ECCV) (pp. 670-685).

  5. Tang, K., Zhang, H., Wu, B., Luo, W., & Liu, W. (2019). Learning to compose dynamic tree structures for visual contexts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 6619-6628).

  6. Zhang, J., Shih, K. J., Elgammal, A., Tao, A., & Catanzaro, B. (2019). Graphical Contrastive Losses for Scene Graph Generation.

  7. Chen, T., Yu, W., Chen, R., & Lin, L. (2019). Knowledge-embedded routing network for scene graph generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 6163-6171).

  8. Suhail, M., Mittal, A., Siddiquie, B., Broaddus, C., Eledath, J., Medioni, G., & Sigal, L. (2021). Energy-Based Learning for Scene Graph Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 13936-13945).

  9. Lin, X., Ding, C., Zeng, J., & Tao, D. (2020). Gps-net: Graph property sensing network for scene graph generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 3746-3753). .....
