CapsNet [Kor]

Gu et al. / Capsule Network is Not More Robust than Convolutional Network / CVPR 2021


1. Viewpoint Equivariance

๋”ฅ๋Ÿฌ๋‹์€ ์ˆ˜ ๋…„๊ฐ„ ์„ฑ๋Šฅ์„ ๋†’์—ฌ ์™”๊ณ  ๋งŽ์€ ์˜์—ญ์—์„œ ์ธ๊ฐ„์„ ์ถ”์›”ํ–ˆ์ง€๋งŒ, ๋ถˆํ–‰ํžˆ๋„ ๊ทผ๋ณธ์ ์ธ ๋ถ€๋ถ„์—์„œ์˜ ๋ฐœ์ „์€ ๋”๋””๋‹ค. ํ˜„์žฌ์˜ ๋”ฅ๋Ÿฌ๋‹์ด hard AI๊ฐ€ ๋˜๊ธฐ ์œ„ํ•ด์„œ๋Š” ๊ทน๋ณตํ•ด์•ผ ํ•  ๋งŽ์€ ๊ณผ์ œ ์ค‘, Capsule network๊ฐ€ ์ฃผ๋ชฉํ•˜๋Š” ๊ฒƒ์€ viewpoint equivariance์— ๋Œ€ํ•œ ๊ฒƒ์ด๋‹ค.

The table above shows how Average Precision (AP) changes when a well-trained object detection model is given photographs of a sofa taken from different angles. A human recognizes a sofa from any angle, but the model's performance varies anywhere from 0.1 to 1.0.

๊ทธ ์ด์œ ๋ฅผ ๋”ฅ๋Ÿฌ๋‹์˜ ์–ธ์–ด๋กœ ์„ค๋ช…ํ•˜์ž๋ฉด training data(์ด ๊ฒฝ์šฐ PASCAL VOC)์— bias๊ฐ€ ์žˆ์—ˆ๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. ์‚ฌ๋žŒ์€ ์†ŒํŒŒ๋ฅผ ๋ฌด์ž‘์œ„ elevation๊ณผ azimuth์—์„œ ๊ท ์ผํ•˜๊ฒŒ ์ดฌ์˜ํ•˜์ง€ ์•Š๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. ๊ทธ๋Ÿฌ๋ฉด ๋ชจ๋“  ๋ฐฉํ–ฅ์—์„œ ์ดฌ์˜ํ•œ ์†ŒํŒŒ๋ฅผ ๋ฐ์ดํ„ฐ์— ํฌํ•จ์‹œํ‚ค๋ฉด ๋ฌธ์ œ๊ฐ€ ํ•ด๊ฒฐ๋ ๊นŒ? ์ด๋ก ์ƒ ๊ทธ๋ ‡๋‹ค. ํ•˜์ง€๋งŒ ๋ชจ๋“  object๋ฅผ ๋ชจ๋“  ๊ฐ๋„์—์„œ ์ดฌ์˜ํ•œ ๋ฐ์ดํ„ฐ๋ฅผ ๋งŒ๋“œ๋Š” ๊ฒƒ์€ ํ˜„์‹ค์ ์œผ๋กœ ๋ถˆ๊ฐ€๋Šฅํ•˜๊ณ , ๋ฐ”๋žŒ์งํ•˜์ง€๋„ ์•Š๋‹ค.

"The set of real world images is infinitely large and so it is hard for any dataset, no matter how big, to be representative of the complexity of the real world." [2]

This is clearly a hard problem, but one hesitates to declare it fundamentally unsolvable, because a human does not need tens of thousands of images of hundreds of sofas observed from every direction in order to recognize the object "sofa". Geoffrey Hinton believed this is because humans can grasp an object's part-whole hierarchy, and he worked for a long time on realizing this in deep learning.

Part-Whole Hierarchy

์ด ๊ฐœ๋…์€ CapsNet[3] ๋…ผ๋ฌธ์„ ํ†ตํ•ด ๋„๋ฆฌ ์•Œ๋ ค์กŒ์ง€๋งŒ, ๊ทธ ๊ณ ๋ฏผ์€ ํ›จ์”ฌ ์ด์ „์˜ ๋…ผ๋ฌธ๋“ค์—์„œ๋ถ€ํ„ฐ[4] [5] ์ฐพ์„ ์ˆ˜ ์žˆ๋‹ค. ์š”์•ฝํ•˜์ž๋ฉด, ์‚ฌ๋žŒ์€ ์†ŒํŒŒ๋ฅผ ์–ด๋–ค ํŠน์ •ํ•œ ๊ฐ๋„์—์„œ์˜ ์ด๋ฏธ์ง€๋กœ์จ ๊ธฐ์–ตํ•˜๋Š” ๋Œ€์‹  '๊ธด ๊น”๊ฐœ ๋’ค์— ๋“ฑ๋ฐ›์ด๊ฐ€ ์žˆ๊ณ  ์–‘ ์˜†์— ํŒ”๊ฑธ์ด๊ฐ€ ์žˆ๋Š” ๊ฒƒ'์œผ๋กœ ์ธ์ง€ํ•˜๊ธฐ ๋•Œ๋ฌธ์— viewpoint equivariance๋ฅผ ์ž์—ฐ์Šค๋Ÿฝ๊ฒŒ ๋‹ฌ์„ฑํ•œ๋‹ค๋Š” ๊ฒƒ์ด๋‹ค.

"There is strong psychological evidence that people parse visual scenes into part-whole hierarchies and model the viewpoint-invariant spatial relationship between a part and a whole as the coordinate transformation between intrinsic coordinate frames that they assign to the part and the whole."[6]

์‚ฌ๋žŒ์€ ์‹ค์ œ๋กœ ๋‡Œ ์†์—์„œ 3d ์ขŒํ‘œ๊ณ„๋ฅผ ๋งŒ๋“ค์–ด ๊ทธ ์†์—์„œ object์˜ ํ˜•ํƒœ๋ฅผ ์ธ์‹ํ•œ๋‹ค. ์ด๋Š” psychological evidence๋กœ ๋’ท๋ฐ›์นจ๋˜๊ณ , ์šฐ๋ฆฌ์˜ ์ƒ์‹์—๋„ ๋ถ€ํ•ฉํ•œ๋‹ค. Part-whole hierarchy๋ฅผ ์ธ์ง€ํ•  ์ˆ˜ ์žˆ๋‹ค๋ฉด viewpoint equivariance๋Š” ์ž์—ฐ์Šค๋Ÿฝ๊ฒŒ ๋‹ฌ์„ฑ๋  ๊ฒƒ์ด๊ณ , ๋ฐ˜๋Œ€๋กœ part-whole hierarchy ์—†์ด viewpoint equivariance๋ฅผ ๋‹ฌ์„ฑํ•˜๋Š” ๊ฒƒ ์—ญ์‹œ ์ƒ์ƒํ•˜๊ธด ํž˜๋“ค๋‹ค.

There have of course been attempts to achieve this in other ways, but approaches such as kernels invariant to specific transforms or reliance on data augmentation are far from achieving viewpoint equivariance naturally [7,8,9]. Moreover, since the transformation matrices possible in the real world are infinitely many, approaches that pursue invariance face a clear limit.

What remains, then, is the question of how to implement a part-whole hierarchy in deep learning. In bottom-up deep learning we cannot directly tell the network what an "armrest" or a "backrest" is, but we can at least build its structure top-down so that it is capable of representing and learning such concepts. The first monumental work in that direction was the "Capsule Network" [3]. The paper caused a sensation in the deep-learning community and spawned hundreds of follow-up studies.

2. Capsule Network

There are many complicated ways to explain the capsule network, but conceptually it is simple. To begin with, expressing the presence or absence of an armrest with a single scalar value is asking too much. Beyond the armrest's color and texture, a part-whole hierarchy also requires keeping information about the angle at which the armrest is attached: if the backrest is attached horizontally to the seat, it is no longer a sofa. To hold this information, feature vectors are extracted with a simple CNN, and groups of 8 to 16 features are bundled into a single capsule.
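
To make the grouping concrete, here is a minimal numpy sketch. The 8-dimensional capsule size and the `squash` nonlinearity follow the original paper [3]; the feature-map shapes are invented for illustration:

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """The capsule nonlinearity from [3]: shrinks the vector's norm into (0, 1)
    while preserving its direction, so the norm can act as a probability."""
    norm2 = np.sum(s ** 2, axis=axis, keepdims=True)
    return (norm2 / (1.0 + norm2)) * s / np.sqrt(norm2 + eps)

# Pretend this is the output of a small CNN: 32 channels on a 6x6 grid.
features = np.random.randn(32, 6, 6)

# Bundle groups of 8 feature maps into capsules: 4 capsule types, 8-dim each,
# giving 4 * 6 * 6 = 144 primary capsules in total.
capsules = features.reshape(4, 8, 6, 6).transpose(0, 2, 3, 1).reshape(-1, 8)
capsules = squash(capsules)

print(capsules.shape)  # (144, 8)
print(np.linalg.norm(capsules, axis=1).max() < 1.0)  # norms stay below 1
```

The point of `squash` is that a capsule's length can be read as "how likely is this part present", while its direction encodes pose attributes such as angle.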

Since deep-learning operations are fundamentally linear, merely bundling a few features together accomplishes nothing by itself. The method chosen to make each capsule function as a unit representing an object is the "routing algorithm", which can be understood as Hebbian learning [10] operating at the level of capsules. Initially, all capsules are connected with equal strength. Once the values of the high-level capsules are computed, the connections to the low-level capsules whose activations align well with them are strengthened. The high-level capsule values are then updated, and the process repeats.
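
As a hedged sketch, the routing-by-agreement loop of [3] fits in a few lines of numpy. The prediction tensor `u_hat` below is random placeholder data; a real CapsNet computes it from the low-level capsules via learned transform matrices:

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    norm2 = np.sum(s ** 2, axis=axis, keepdims=True)
    return (norm2 / (1.0 + norm2)) * s / np.sqrt(norm2 + eps)

def dynamic_routing(u_hat, n_iters=3):
    """u_hat: the prediction each low-level capsule makes for each high-level
    capsule, shape (n_low, n_high, dim). Returns the high-level capsules."""
    n_low, n_high, _ = u_hat.shape
    b = np.zeros((n_low, n_high))  # routing logits: start with uniform coupling
    for _ in range(n_iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # coupling coeffs
        s = (c[..., None] * u_hat).sum(axis=0)  # weighted vote per high capsule
        v = squash(s)                           # candidate high-level capsules
        b = b + (u_hat * v[None]).sum(axis=-1)  # reward agreeing predictions
    return v

v = dynamic_routing(np.random.randn(144, 10, 16))
print(v.shape)  # (10, 16)
```

Note there is no gradient step inside the loop: the coupling coefficients are recomputed at inference time from agreement alone, which is why it reads like Hebbian learning rather than backpropagation.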

์—ฌ๊ธฐ์„œ ๊ฐ‘์ž๊ธฐ Hebbian learning์ด ์™œ ๋“ฑ์žฅํ–ˆ๋Š”์ง€, ์–ด๋–ป๊ฒŒ ์ด๋Ÿฐ ๊ณผ์ •์ด (3d object์˜) part-whole hierarchy๋ฅผ ๊ตฌ์ถ•ํ•˜๊ณ  viewpoint equivariance๋ฅผ ๋‹ฌ์„ฑํ•  ์ˆ˜ ์žˆ๋Š”์ง€ ์ดํ•ดํ•˜์ง€ ๋ชปํ–ˆ๋‹ค๋ฉด, ๋‹น์‹ ์˜ ์ดํ•ด๋ ฅ์ด ๋ถ€์กฑํ•œ ๊ฒƒ์ด ์•„๋‹ˆ๋ผ ์ง€๊ทนํžˆ ์ •์ƒ์ ์ธ ์‚ฌ๊ณ ๋ฅผ ํ•œ ๊ฒƒ์ด๋‹ˆ ์•ˆ์‹ฌํ•ด๋„ ์ข‹๋‹ค. Routing algorithm์ด ์–ด๋–ค ๋ฉ”์ปค๋‹ˆ์ฆ˜์œผ๋กœ viewpoint equivariance๋ฅผ ๋‹ฌ์„ฑํ•˜๋Š”์ง€์— ๋Œ€ํ•œ ์„ค๋ช…์€ ๋…ผ๋ฌธ ์–ด๋””์—๋„ ์ฐพ์•„๋ณผ ์ˆ˜ ์—†์œผ๋ฉฐ, ์ž ์‹œ ํ›„์— ๋ณด๊ฒ ์ง€๋งŒ ์‹ค์ œ๋กœ ๊ทธ๋Ÿฐ ๋ฉ”์ปค๋‹ˆ์ฆ˜์€ ์กด์žฌํ•˜์ง€ ์•Š๊ธฐ ๋•Œ๋ฌธ์— ์„ค๋ช…ํ•  ๋ฐฉ๋ฒ•๋„ ์—†๋‹ค.

Let us try a simple thought experiment. As a sub-goal far simpler than 3D viewpoint equivariance, consider achieving rotational equivariance in 2D. Suppose we have trained a capsule network well enough to recognize the letter 'T'. The first capsule learned the long vertical line in the middle, the second capsule was trained to recognize the short horizontal line, and assume the higher-level capsule somehow learned that "a short horizontal line on top of a long vertical line is the letter 'T'".

๊ทธ๋Ÿฐ๋ฐ ๊ฐ‘์ž๊ธฐ, ํ•™์Šต ๋ฐ์ดํ„ฐ์—๋Š” ์—†๋˜ ๊ธฐ์šธ์–ด์ง„ 'T'๊ฐ€ input์œผ๋กœ ๋“ค์–ด์˜จ๋‹ค. ์•ฝ๊ฐ„๋งŒ ๊ธฐ์šธ์–ด์กŒ๋‹ค๊ณ  ์ƒ๊ฐํ•ด๋„ ์ข‹๊ณ  ์™„์ „ํžˆ ์ˆ˜ํ‰์œผ๋กœ ๋ˆ„์› ๋‹ค๊ณ  ์ƒ๊ฐํ•ด๋„ ์ข‹๋‹ค. ์ด์ œ capsule์€ '๊ธด horizontal line ์šฐ์ธก์— ์งง์€ vertical line์ด ์žˆ๋‹ค๋ฉด ์ด๊ฒƒ์ธ letter 'T'์ด๋‹ค' ๋ผ๋Š” ๊ฒƒ์„ ์ธ์ง€ํ•ด์•ผ๋งŒ ํ•œ๋‹ค. ์ฒซ ๋ฒˆ์งธ capsule์€ ๋ณธ๋ž˜ vertical line์„ ํ•™์Šตํ–ˆ์ง€๋งŒ ๋Œ์—ฐ horizontal line์— activate ๋˜์–ด์•ผ ํ•˜๊ณ , ๋‘ ๋ฒˆ์งธ ์บก์А์€ ๋ฐฉํ–ฅ์„ ๋ฐ”๊พธ๋Š” ๊ฒƒ๋ฟ๋งŒ์ด ์•„๋‹ˆ๋ผ ์•„์˜ˆ ์œ„์น˜๊ฐ€ ๋ณ€ํ™”ํ•˜๋ฉฐ, ์ƒ์œ„ ์บก์А์€ ์ฒซ ๋ฒˆ์งธ ์บก์А์ด horizontal line์— activate๋˜์—ˆ๋‹ค๋Š” ๊ฒƒ์œผ๋กœ๋ถ€ํ„ฐ T์˜ ์œ—๋ฉด์„ ์šฐ์ธก์—์„œ ์ฐพ์•„์•ผ ํ•œ๋‹ค๋Š” ๊ฒƒ์„ ์‚ฌ๊ณ ํ•ด์•ผ๋งŒ ํ•œ๋‹ค. ์–ด๋–ป๊ฒŒ ์ด๊ฒŒ ๊ฐ€๋Šฅํ• ๊นŒ? Capsule ์‚ฌ์ด์— ์ ์šฉ๋˜๋Š” Hebbian learning์ด ์–ด๋–ป๊ฒŒ ์ด๋ฅผ ๋‹ฌ์„ฑํ•  ์ˆ˜ ์žˆ์„๊นŒ? ์• ์ดˆ์— ์•ฝ๊ฐ„์ด๋ผ๋„ ๋„์›€์ด ๋˜๊ธฐ๋Š” ํ• ๊นŒ?

According to the original capsule papers [3,11], the answer is "yes". They showed experimentally that with the routing algorithm the network, while not perfectly viewpoint-equivariant, is at least more robust than a conventional CNN. That is why so many people were struck by the capsule network. But this was established purely by experiment, with no mathematical proof and not even a sufficiently convincing explanation. If those experiments were refuted, we would have to reconsider what the Capsule Network is actually doing.

3. Capsule Network is Not More Robust than Convolutional Network [12]

์ด ๋…ผ๋ฌธ์€ CapsNet์˜ ์„ฑ๋Šฅ์ด ๊ธฐ๋Œ€ ์ดํ•˜๋ผ๋Š” ๊ฒฐ๊ณผ๋ฅผ ๋‹ด์€ ์ฒซ ๋ฒˆ์งธ ์—ฐ๊ตฌ๊ฐ€ ์•„๋‹ˆ๋‹ค. ๋‹จ์ˆœํ•œ ์„ฑ๋Šฅ ๋น„๊ต์—์„œ ์ด์ ์„ ์ฐพ์ง€ ๋ชปํ•œ ์—ฐ๊ตฌ๋„ ์žˆ์—ˆ๊ณ [13,14], SmallNorb๋‚˜ Rotational MNIST ๋“ฑ์˜ ๋ฐ์ดํ„ฐ๋ฅผ ํ†ตํ•ด CapsNet์ด ์ผ๋ฐ˜ CNN๋ณด๋‹ค ๋”ฑํžˆ viewpoint change์— robustํ•˜์ง€ ์•Š๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์ธ ์—ฐ๊ตฌ๋“ค๋„ ์žˆ์—ˆ๋‹ค[15,16,17].

Now there is a problem. The same experiments on the same datasets with the same networks produced different conclusions. Does that mean one side is lying? That Geoffrey Hinton fabricated experimental results and published them at NIPS and ICLR? No, not necessarily. This paper lays out in detail how such a "misunderstanding" came about.

The Baseline Problem

CapsNet ๋…ผ๋ฌธ์—์„œ, ์ €์ž๋Š” Capsule network๊ฐ€ CNN๋ณด๋‹ค general performance๊ฐ€ ๋” ๋†’์€ ๊ฒƒ์€ ๋ฌผ๋ก ์ด๊ณ  viewpoint change์— ๋Œ€ํ•ด ๋” robustํ•˜๋‹ค๊ณ  ์‹คํ—˜์„ ํ†ตํ•ด ๋ฐํ˜”๋‹ค. ํ•˜์ง€๋งŒ 'CNN'์€ ๋‹จ์ผํ•œ ๋ชจ๋ธ์„ ์ง€์นญํ•˜์ง€ ์•Š๋Š”๋‹ค. AlexNet์ด๋‚˜ VGG๋„ ์žˆ๊ณ , ResNet, SENet, MobileNet, EfficientNet ๋“ฑ ์ˆ˜๋งŽ์€ architecture๊ฐ€ ์กด์žฌํ•œ๋‹ค. ๊ทธ๋Ÿฌ๋ฉด ์ด๋“ค ์ค‘ ๊ฐ€์žฅ ์ข‹์€ SOTA ๋ชจ๋ธ๊ณผ ๋น„๊ตํ•˜๋ฉด ๋ ๊นŒ? ์ด๋Š” ๊ณต์ •ํ•œ ๋น„๊ต์ด๊ธฐ๋Š” ํ•˜์ง€๋งŒ, ๋‹จ 4๊ฐœ์˜ layer๋ฅผ ๊ฐ€์ง„ CapsNet์ด ์ˆ˜๋ฐฑ ๊ฐœ์˜ layer์™€ ์ˆ˜์ฒœ๋งŒ ๊ฐœ์˜ parameter๋ฅผ ๊ฐ€์ง„ ๋ชจ๋ธ์„ ์ด๊ฒจ์•ผ๋งŒ ๊ฐ€์น˜๋ฅผ ์ธ์ •๋ฐ›์„ ์ˆ˜ ์žˆ๋‹ค๋ฉด ์ด๋Š” ์ง€๋‚˜์น˜๊ฒŒ ๊ฐ€ํ˜นํ•œ ์ฒ˜์‚ฌ๊ฐ€ ๋  ๊ฒƒ์ด๋‹ค.

๊ทธ๋ž˜์„œ ์ €์ž๋“ค์ด ์ฑ„ํƒํ•œ ๋ฐฉ์‹์€ CapsNet๋ณด๋‹ค ๋” ํฐ, layer ์ˆ˜๋Š” ๋น„์Šทํ•˜๋˜ ๋” ๋งŽ์€ parameter๋ฅผ ๊ฐ€์ง„ ๋ชจ๋ธ์„ ๊ฐ€์ ธ์™€ ๋น„๊ตํ•œ ๊ฒƒ์ด๋‹ค. ์–ธ๋œป ์ด๋Š” ๊ณต์ •ํ•ด ๋ณด์ธ๋‹ค. ๋ฌด๋ ค parameter ๊ฐœ์ˆ˜๊ฐ€ ๋‘ ๋ฐฐ๋‚˜ ๋” ๋งŽ์€ CNN์„ ์ƒ๋Œ€๋กœ ์Šน๋ฆฌํ–ˆ๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. ์—„๋ฐ€ํ•˜๊ฒŒ ๋งํ•˜์ž๋ฉด CapsNet์€ '๋น„์Šทํ•œ ํฌ๊ธฐ๋ฅผ ๊ฐ€์ง„ ๋ชจ๋“  ๊ฐ€๋Šฅํ•œ CNN๋“ค์˜ ์„ฑ๋Šฅ์˜ upper bound'๋ฅผ ๋„˜์–ด์•ผ ํ•œ๋‹ค. ํ•˜์ง€๋งŒ ์ด๊ฒƒ์ด ๋ถˆ๊ฐ€๋Šฅํ•œ ๊ฒƒ์„ ์•Œ๊ธฐ์— ์ž์‹ ๋ณด๋‹ค ๋” ํฐ CNN์„ ์ ๋‹นํžˆ ํ•˜๋‚˜ ๋งŒ๋“ค๊ณ , ์ด๊ฒƒ์ด ์ ์ ˆํ•œ upper bound๊ฐ€ ๋˜๋ฆฌ๋ผ ๋ฏฟ์—ˆ๋‹ค. ์„ค๋งˆ ๊ทธ๋Ÿฐ ์‚ฌ์†Œํ•œ ์„ธ๋ถ€์‚ฌํ•ญ๋“ค๋กœ ์ธํ•ด parameter 2๋ฐฐ์˜ ์ฐจ์ด๊ฐ€ ๋’ค์ง‘ํžˆ์ง„ ์•Š์„ ๊ฒƒ์ด๋‹ค. ๊ทธ๋Ÿฌ๋ฏ€๋กœ CapsNet์€ routing algorithm์ด ์ž˜ ์ž‘๋™ํ•ด์„œ ์ข‹์€ ์„ฑ๋Šฅ์„ ๋‚ด๋Š” ๊ฒƒ์ด ํ‹€๋ฆผ์—†๋‹ค. ๊ทธ๋Ÿฐ๊ฐ€?

ํฌ๊ฒŒ ๋ณด์ž๋ฉด, Capsule network์—์„œ routing algorithm์„ ์ œ๊ฑฐํ•˜๋ฉด ์ผ๋ฐ˜ CNN์ด ๋œ๋‹ค. ๋” ๊ตฌ์ฒด์ ์œผ๋กœ๋Š”, shared transform matrix์™€ ์ƒ์†Œํ•œ activation function(squash), ๊ทธ๋ฆฌ๊ณ  reconstruction์œผ๋กœ auxiliary loss๋ฅผ ์ฃผ๊ณ  MarginLoss๋กœ ํ•™์Šตํ•˜๋Š” CNN์ด ๋œ๋‹ค. ๊ทธ๋ ‡๋‹ค๋ฉด ์ด๋Ÿฐ ์ฐจ์ด๋“ค์„ ํ•˜๋‚˜์”ฉ on/offํ•ด ๊ฐ€๋ฉด์„œ ์‹คํ—˜์„ ํ•ด ๋ณธ๋‹ค๋ฉด CapsNet์˜ ์–ด๋–ค ์š”์†Œ๊ฐ€ ์‹ค์ œ๋กœ ์„ฑ๋Šฅ์— ์˜ํ–ฅ์„ ๋ฏธ์ณค๋Š”์ง€๋ฅผ ์•Œ ์ˆ˜ ์žˆ์„ ๊ฒƒ์ด๋‹ค.

Experiment

Model์˜ (viewpoint) transformation์— ๋Œ€ํ•œ robustness๋ฅผ ์ง์ ‘์ ์œผ๋กœ ์ธก์ •ํ•˜๊ธฐ ์œ„ํ•ด AffNIST dataset[3,17]์ด ์ฃผ๋กœ ์‚ฌ์šฉ๋œ๋‹ค. Training ์‹œ์—๋Š” ์ •์ƒ์ ์ธ MNIST ๋ฐ์ดํ„ฐ๋งŒ ๋ณด์—ฌ์ฃผ๊ณ , ์—ฌ๊ธฐ์— ๊ฐ์ข… affine transform์„ ๊ฐ€ํ•œ ์ด๋ฏธ์ง€๋กœ evaluateํ•˜์—ฌ generalization power๋ฅผ ์ธก์ •ํ•œ๋‹ค.

๋งŒ์•ฝ์— CapsNet์˜ robustness๊ฐ€ capsule ๊ตฌ์กฐ์— ์˜ํ•œ ๊ฒƒ์ด๋ผ๋ฉด routing algorithm์„ ์‚ฌ์šฉํ–ˆ์„ ๋•Œ ๊ฐ€์žฅ ํฐ ์„ฑ๋Šฅ์˜ ํ–ฅ์ƒ์ด ์žˆ์„ ๊ฒƒ์ด๋‹ค. ํ•˜์ง€๋งŒ routing algorithm์€ robustness์— ๋„์›€์ด ๋˜์ง€ ์•Š๊ณ  ์˜คํžˆ๋ ค ์„ฑ๋Šฅ์ด ์†Œํญ ๊ฐ์†Œํ–ˆ์œผ๋ฉฐ, ์ด ๊ฒฐ๊ณผ๋Š” ๋‹ค๋ฅธ ์—ฐ๊ตฌ์—์„œ ๋ณด๊ณ ํ•œ ๊ฒƒ[16,17]๊ณผ ๊ฐ™๋‹ค. ์˜คํžˆ๋ ค squash function ๋“ฑ ๋ถ€๊ฐ€์ ์ธ ์š”์†Œ๊ฐ€ ์„ฑ๋Šฅ์„ ๋Œ์–ด์˜ฌ๋ฆฐ ์š”์ธ์ธ ๊ฒƒ์œผ๋กœ ๋ณด์ด๋ฉฐ, ์ด ์™ธ์— ์ €์ž๋“ค์€ kernel size๊ฐ€ AffNIST์—์„œ์˜ ์„ฑ๋Šฅ์— ๊ฒฐ์ •์ ์ธ ์˜ํ–ฅ์„ ๋ฏธ์นœ๋‹ค๋Š” ์‚ฌ์‹ค์„ ์•Œ์•„๋ƒˆ๋‹ค.

๋„คํŠธ์›Œํฌ ๊ตฌ์กฐ์— ๊ด€๊ณ„์—†์ด kernel size๊ฐ€ ์ปค์งˆ์ˆ˜๋ก robustness๊ฐ€ ์ปค์ง์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค. CapsNet์€ (9,9) kernel์„ ์‚ฌ์šฉํ–ˆ๊ณ  ์› ๋…ผ๋ฌธ์—์„œ baseline CNN์€ (5,5)๋ฅผ ์‚ฌ์šฉํ–ˆ๋‹ค. ์ด๋Š” ์˜๋„ํ–ˆ๊ฑด ์˜๋„์น˜ ์•Š์•˜๊ฑด CapsNet์—๊ฒŒ ์œ ๋ฆฌํ•œ ์‹คํ—˜ ์„ค๊ณ„์˜€๋˜ ๊ฒƒ์œผ๋กœ ๋ณด์ด๋ฉฐ, ์ €์ž๋“ค์€ ์œ„์™€ ๊ฐ™์€ ์‹คํ—˜ ๊ฒฐ๊ณผ๋ฅผ ํ† ๋Œ€๋กœ (9,9) kernel๊ณผ average pooling์„ ์‚ฌ์šฉํ•œ ๊ฐ„๋‹จํ•œ 3-layer network(5.3M parameter)๋กœ AffNIST์—์„œ CapsNet์„ ํฌ๊ฒŒ ์ƒํšŒํ•˜๋Š” ์„ฑ๋Šฅ์„ ์–ป์„ ์ˆ˜ ์žˆ์—ˆ๋‹ค.

์› ๋…ผ๋ฌธ์—์„œ CapsNet์ด 35M๊ฐœ์˜ parameter๋ฅผ ๊ฐ€์ง„ CNN๋ณด๋‹ค ์„ฑ๋Šฅ์ด ์ข‹์•˜๋‹ค๊ณ  ๋ณด๊ณ ํ•œ ๊ฒƒ์„ ์ƒ๊ฐํ•˜๋ฉด ๋„คํŠธ์›Œํฌ์˜ ๊ตฌ์กฐ์— ๋”ฐ๋ผ transform์— ๋Œ€ํ•œ robustness์— ํฐ ์ฐจ์ด๊ฐ€ ๋ฐœ์ƒํ•จ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค. ๊ทธ๋ฆฌ๊ณ  CapsNet์˜ ์ €์ž๋“ค์€ ์ด๋ฅผ ์˜ˆ์ƒ์น˜ ๋ชปํ•˜๊ณ  ๋„ˆ๋ฌด ๋‚˜์œ baseline์„ ์„ค์ •ํ•˜์—ฌ ์ž˜๋ชป๋œ ๊ฒฐ๋ก ์„ ๋„์ถœํ•œ ๊ฒƒ์œผ๋กœ ๋ณด์ธ๋‹ค. ์ด๋Š” ์šฐ์—ฐ์ผ ์ˆ˜๋„ ์žˆ๊ณ , ์›ํ•˜๋Š” ๊ฒฐ๊ณผ๊ฐ€ ๋‚˜์˜ฌ ๋•Œ๊นŒ์ง€ ์‹คํ—˜์„ ๋ฐ˜๋ณตํ–ˆ๊ธฐ ๋•Œ๋ฌธ์ผ ์ˆ˜๋„ ์žˆ๋‹ค.

4. Discussion

๋”ฅ๋Ÿฌ๋‹ ์—ฐ๊ตฌ๋Š” noise์— ํŠนํžˆ ์ทจ์•ฝํ•˜๋‹ค. ์ ์ ˆํ•œ Baseline์„ ์žก๋Š” ๊ฒƒ์€ ์–ธ์ œ๋‚˜ ์–ด๋ ค์šฐ๋ฉฐ, ๋˜‘๊ฐ™์€ ์‹คํ—˜์„ ์ˆ˜ํ–‰ํ•ด๋„ ๋งค๋ฒˆ ๋‹ค๋ฅธ ๊ฒฐ๊ณผ๊ฐ€ ๋‚˜์˜ค๊ณ , ์•ฝ๊ฐ„์˜ ์ฐจ์ด๋กœ ์™„์ „ํžˆ ์ž˜๋ชป๋œ ๊ฒฐ๊ณผ๋ฅผ ์–ป์„ ์ˆ˜๋„ ์žˆ๊ธฐ์— ์‹คํ—˜ ๊ฒฐ๊ณผ์˜ ์žฌ์—ฐ๋„ ์–ด๋ ต๋‹ค. ๋•Œ๋ฌธ์— ์ž˜๋ชป๋œ ๋…ผ๋ฌธ์ด ๋‚˜์™”์„ ๋•Œ ์ด๋ฅผ ๊ฒ€์ฆํ•˜๋Š” ๊ฒƒ๋„ ์‰ฝ์ง€ ์•Š๋‹ค. ๊ฒฐ๊ณผ๊ฐ€ ์žฌ์—ฐ๋˜์ง€ ์•Š๋Š” ๋Œ€๋ถ€๋ถ„์˜ ๋…ผ๋ฌธ๋“ค์€ ๋Œ€๋ถ€๋ถ„ ์กฐ์šฉํžˆ ๋ฌปํžˆ์ง€๋งŒ Capsule network์˜ ๊ฒฝ์šฐ์—๋Š” ์•„์ง๊นŒ์ง€๋„ ์ž˜๋ชป๋œ ๊ฐ€์ •์— ๊ธฐ์ดˆํ•œ ํ›„์† ์—ฐ๊ตฌ๋“ค์ด ๊พธ์ค€ํžˆ ๋‚˜์˜ค๊ณ  ์žˆ๋‹ค.

์ด ์‚ฌ๊ฑด์€ ๋…ผ๋ฌธ์„ ์“ธ ๋•Œ ์ ์ ˆํ•œ baseline์„ ๊ฐ€์ง€๊ณ  ๊ฐ€์„ค์„ ์ง์ ‘ ๊ฒ€์ฆํ•˜๋Š” ๊ฒƒ์˜ ์ค‘์š”์„ฑ์„ remindํ•ด ์ค€๋‹ค. ๊ทธ๋ฆฌ๊ณ  ๊ทธ๋Ÿฐ ์›์น™์„ ์ œ๋Œ€๋กœ ์ง€ํ‚ค์ง€ ์•Š์€ ๋…ผ๋ฌธ์— ๋Œ€ํ•ด์„œ๋Š” ํ•œ ๋ฒˆ ๋” ์˜์‹ฌํ•˜๊ณ  ๊ฒ€์ฆํ•ด์•ผ ํ•  ๊ฒƒ์ด๋‹ค.

CapsNet ์ž์ฒด๋Š” ์„ฑ๊ณต์ ์ด์ง€ ์•Š์•˜์ง€๋งŒ ๊ทธ ๊ณผ์ •์— ์ด๋ฅด๋Š” ๋…ผ๋ฆฌ๋Š” ์—ฌ์ „ํžˆ ์ฃผ๋ชฉํ•  ๋งŒ ํ•˜๋ฉฐ, part-whole hierarchy๋ฅผ ์œ„ํ•ด capsule ๊ตฌ์กฐ๊ฐ€ ํ•„์š”ํ•˜๋‹ค๋Š” ๋…ผ๋ฆฌ๋Š” ์—ฌ์ „ํžˆ ์œ ํšจํ•  ์ˆ˜ ์žˆ๋‹ค. CapsNet์€ ์„ฑ๊ณต์ ์ด์ง€ ๋ชปํ–ˆ์ง€๋งŒ capsule์ด๋ผ๋Š” ๊ฐœ๋…์˜ ์กด์žฌ ๊ฐ€์น˜๊ฐ€ ๋ถ€์ •๋˜์—ˆ๋‹ค๊ธฐ๋ณด๋‹ค๋Š” ๊ฐ capsule์— ์˜๋ฏธ๋ฅผ ๋ถ€์—ฌํ•˜๋Š” routing algorithm์ด ์ œ๋Œ€๋กœ ์ž‘๋™ํ•˜์ง€ ์•Š๋Š”๋‹ค๊ณ  ํ•ด์„ํ•˜๋Š” ๊ฒƒ์ด ๋” ์ •ํ™•ํ•  ๊ฒƒ์ด๋‹ค. ํŠนํžˆ ์ฒ˜์Œ ์ œ์•ˆ๋œ ๋‘ routing algorithm์€ ์ˆ˜ํ•™์ ์œผ๋กœ stableํ•˜์ง€ ๋ชปํ•œ ๊ฒƒ์œผ๋กœ ๋ณด์ธ๋‹ค.[15]

Geoffrey Hinton์€ ์ดํ›„์˜ ๋…ผ๋ฌธ์—์„œ ๊ฐ patch์— ํ•˜๋‚˜์˜ capsule์„ ํ• ๋‹นํ•˜๊ณ  ์ด๋“ค์ด ๊ณ„์ธต ๊ฐ„์— ์ƒํ˜ธ์ž‘์šฉํ•˜๋Š” ์ƒˆ๋กœ์šด ๊ตฌ์กฐ๋ฅผ ์ œ์•ˆํ•˜๋ฉด์„œ CapsNet์ด ์„ฑ๊ณตํ•˜์ง€ ๋ชปํ•œ ์ด์œ ๋ฅผ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋ถ„์„ํ–ˆ๋‹ค.

"The fundamental weakness of capsules is that they use a mixture to model the set of possible parts. This forces a hard decision about whether a car headlight and an eye are really different parts. If they are modeled by the same capsule, the capsule cannot predict the identity of the whole. If they are modeled by different capsules the similarity in their relationship to their whole cannot be captured."[18]

Whether or not this example is apt, most of us sympathize with the claim that a capsule-like structure is needed; we simply have not yet found a way to train a capsule network to behave the way we want. Some point to a lack of data: grasping an object's structure and building a part-whole hierarchy is far harder than image classification. There is no reason to carry out, by a harder method, a task that stacked linear transforms already handle well, and there is not enough information to make it worthwhile. Viewpoint equivariance may therefore have to wait for an innovation in training methods (such as unsupervised learning). Still, some people believe that capsules will be there at the end of that road, and I partly share that belief.

[1] Qiu, Weichao, and Alan Yuille. "Unrealcv: Connecting computer vision to unreal engine." European Conference on Computer Vision. Springer, Cham, 2016.

[2] Yuille, Alan L., and Chenxi Liu. "Deep nets: What have they ever done for vision?." International Journal of Computer Vision 129.3 (2021): 781-802.

[3] Sabour, Sara, Nicholas Frosst, and Geoffrey E. Hinton. "Dynamic routing between capsules." arXiv preprint arXiv:1710.09829 (2017).

[4] Hinton, Geoffrey E., Alex Krizhevsky, and Sida D. Wang. "Transforming auto-encoders." International conference on artificial neural networks. Springer, Berlin, Heidelberg, 2011.

[5] Hinton, Geoffrey E. "Mapping part-whole hierarchies into connectionist networks." Artificial Intelligence 46.1-2 (1990): 47-75.

[6] Hinton, Geoffrey. "Some demonstrations of the effects of structural descriptions in mental imagery." Cognitive Science 3.3 (1979): 231-250.

[7] Esteves, Carlos, et al. "Equivariant multi-view networks." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019.

[8] Kim, Jinpyo, et al. "CyCNN: a rotation invariant CNN using polar mapping and cylindrical convolution layers." arXiv preprint arXiv:2007.10588 (2020).

[9] Marcos, Diego, Michele Volpi, and Devis Tuia. "Learning rotation invariant convolutional filters for texture classification." 2016 23rd International Conference on Pattern Recognition (ICPR). IEEE, 2016.

[10] โ€œHebbian theoryโ€ Wikipedia, Wikimedia Foundation, https://en.wikipedia.org/wiki/Hebbian_theory

[11] Hinton, Geoffrey E., Sara Sabour, and Nicholas Frosst. "Matrix capsules with EM routing." International conference on learning representations. 2018.

[12] Gu, Jindong, Volker Tresp, and Han Hu. "Capsule Network is Not More Robust than Convolutional Network." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.

[13] Andersen, Per-Arne. "Deep reinforcement learning using capsules in advanced game environments." arXiv preprint arXiv:1801.09597 (2018).

[14] Xi, Edgar, Selina Bing, and Yang Jin. "Capsule network performance on complex data." arXiv preprint arXiv:1712.03480 (2017).

[15] Paik, Inyoung, Taeyeong Kwak, and Injung Kim. "Capsule networks need an improved routing algorithm." Asian Conference on Machine Learning. PMLR, 2019.

[16] Mukhometzianov, Rinat, and Juan Carrillo. "CapsNet comparative performance evaluation for image classification." arXiv preprint arXiv:1805.11195 (2018).

[17] Gu, Jindong, and Volker Tresp. "Improving the robustness of capsule networks to image affine transformations." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.

[18] Hinton, Geoffrey. "How to represent part-whole hierarchies in a neural network." arXiv preprint arXiv:2102.12627 (2021).