M-CAM [Eng]

Kim et al. / M-CAM - Visual Explanation of Challenging Conditioned Dataset with Bias-reducing Memory / BMVC 2021


1. Problem definition

Class activation map for visual explanation

Given a pre-trained feature encoder F of the target network, the spatial feature representation $$f_x$$ of an input image is extracted, where $$f_x \in \mathbb{R}^{w \times h \times c}$$ and $$f_{x_i} \in \mathbb{R}^{w \times h}$$ is the activation at the $$i$$-th channel. An importance weight $$w_i$$ is assigned to each spatial feature map $$f_{x_i}$$ according to its relevance to the target network's decision for the target class $$\hat{c}$$; different CAM methods differ in how this weight is assigned. By taking the weighted sum of the $$f_{x_i}$$ with the set of importance weights $$w = \{w_1, w_2, ..., w_c\}$$ over the $$c$$ channels, a class activation map is generated as the visual explanation.
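As a concrete illustration, here is a minimal sketch (assuming generic PyTorch tensors; the names `features` and `weights` are hypothetical, not the authors' code) of how a class activation map is formed as a weighted sum over channels:

```python
import torch
import torch.nn.functional as F

def class_activation_map(features: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
    """Weighted sum of channel activations, as in the generic CAM formulation.

    features: spatial feature representation f_x with shape (c, w, h)
    weights:  per-channel importance weights w = {w_1, ..., w_c}, shape (c,)
    """
    # Weight each channel map f_{x_i} by its importance w_i and sum over channels.
    cam = (weights.view(-1, 1, 1) * features).sum(dim=0)
    # Keep positively contributing regions and normalize to [0, 1] for display.
    cam = F.relu(cam)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)

# Example usage with random tensors standing in for real activations/weights.
features = torch.randn(512, 7, 7)   # f_x from the last conv layer
weights = torch.randn(512)          # importance weights for the target class
cam = class_activation_map(features, weights)  # shape (7, 7)
```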

2. Motivation

A class activation map (CAM) for a specific class shows the discriminative image regions a CNN uses to classify images into that class. Datasets with challenging conditions, such as an imbalanced class distribution or frequent co-occurrence of multiple objects in a multi-label classification training set, can introduce unwanted bias into the internal components of the target deep network. CAM methods rely on exactly these internal components, such as gradients and feature maps, to generate visual explanations. Consequently, for such challenging datasets there is no guarantee that these internal components are reliable, which degrades the credibility of the generated visual explanations. To tackle this problem, the authors propose a Bias-reducing memory module that provides quality visual explanations even on datasets with challenging conditions.

Related work

Class Activation Map (CAM)

CAM was first introduced in [2], where the weighted sum of the features at the last convolutional layer is computed using the weights of the output layer that follows global average pooling of that layer. Grad-CAM [3] later generalized the concept, verifying that generating an activation map no longer requires a specific CNN structure (a global average pooling layer at the end of the network). Grad-CAM++ [4] further generalizes Grad-CAM, enhancing its visual explanations by taking advantage of higher-order derivatives for importance weight assignment. These are the most popular CAM methods, but several other variants have been published as well, each with specialized advantages.

Key-Value Structure Memory Network

In [5], to open up possibilities for open-domain question answering, a dynamic neural network architecture called the key-value structure memory network is introduced: diverse patterns of feature representations of the training images are stored in the key memory and used to retrieve the class-related semantic information stored in the value memory. This trained correlation between key and value memory is then used to classify the input image.

To mitigate the effect of biases in the target network caused by the inherently challenging conditions of the training dataset, the proposed key-value structure memory module learns the distribution of spatial feature representations from the target deep network and discretely organizes the distributions into separate memory slots. To further exploit the memory network to its fullest extent, the authors use the sparse dictionary learning concept from [6], whereby diverse information can be stored sparsely over different memory slots.

Idea

An imbalanced dataset such as the one below (many samples for dogs but only a few for whales) might cause the model to tune its weights to fit the dog data more closely, because the loss on the dog data contributes more to the total loss than the whale data. To avoid this issue, a memory module is used to store the features of each distinct class in a different slot, and the trained module is then used at inference time. In this way, the classification no longer depends on the biased parameters of the model.

Multiple objects may also co-occur in a single training image. For example, there could be many images containing a person and a horse at the same time. If there are far fewer training images in which the horse appears by itself, the network may come to rely on the presence of the person to classify the horse and fail to recognize the horse when the person is absent. To mitigate this issue, the memory module learns to disentangle horse features from person features and to store them in different slots, even when they appear together.

3. Method

Architecture

The figure above describes the overall flow of how the proposed Bias-reducing memory module learns the desired information from the target network. For training, the memory module takes the spatial feature representation $$f \in \mathbb{R}^{w \times h \times c}$$ extracted from the pre-trained feature encoder F, a query feature representation $$q \in \mathbb{R}^{c}$$, and a value feature representation $$v' \in \mathbb{R}^{c}$$. The semantic information encoder G is designed to map the one-hot encoded ground-truth label vector y to the same dimensionality as $$q$$. $$f$$ and $$v'$$ are not used in the inference step. The memory module outputs the read value feature $$v \in \mathbb{R}^{c}$$, and the classifier takes the concatenation of $$q$$ and $$v$$ as input and outputs the classification score $$z$$ in both the training and the inference step.
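A rough sketch of this flow, assuming hypothetical module names (`encoder_F`, `semantic_encoder_G`, `classifier`) and that the query $$q$$ is obtained by pooling $$f$$ (an assumption for illustration, not the authors' implementation), might look like this; it is only meant to make the inputs and outputs of each component explicit:

```python
import torch
import torch.nn as nn

# Hypothetical components standing in for the paper's modules.
encoder_F = nn.Sequential(nn.Conv2d(3, 256, 3, padding=1), nn.AdaptiveAvgPool2d(7))
semantic_encoder_G = nn.Linear(80, 256)        # maps one-hot label y to the dim of q
classifier = nn.Linear(256 + 256, 80)          # takes cat(q, v), outputs scores z

x = torch.randn(1, 3, 224, 224)                # input image
y = torch.zeros(1, 80); y[0, 3] = 1.0          # one-hot ground-truth label

f = encoder_F(x)                               # spatial features (1, c, w, h) -- training only
q = f.mean(dim=(2, 3))                         # query feature q in R^c (assumed pooled from f)
v_prime = semantic_encoder_G(y)                # value feature v' in R^c -- training only

# memory(q) would return the read value v in R^c (see "Memory value reading" below).
v = torch.randn(1, 256)                        # placeholder for the memory read value
z = classifier(torch.cat([q, v], dim=1))       # classification score z
```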

Memory value reading

Before going into the training part, it is useful to discuss how a value is read from the memory, since this operation is used throughout training. Applying a key-value memory involves two major steps: key addressing and value reading. Given an embedded query $$q \in \mathbb{R}^{c}$$, with $$c$$ the number of channels of the resulting spatial features, the similarity between $$q$$ and each slot of the key memory $$K_i \in \mathbb{R}^{c}$$ is measured. An address vector $$p \in \mathbb{R}^{1 \times N}$$ is obtained for a key memory $$K$$ with $$N$$ slots, where each scalar entry of $$p$$ represents the similarity between the query and the corresponding memory slot:

$$p_i = \mathrm{Softmax}\left(\frac{q \cdot K_i}{\|q\|\,\|K_i\|}\right) \qquad (1)$$

where $$i = 1, 2, ..., N$$ and $$\mathrm{Softmax}(z_i) = e^{z_i} / \sum_{j=1}^{N} e^{z_j}$$.

In the value reading step, the value memory is accessed using the key address vector p as a set of relative importance weights for each slot. The read value $$v \in \mathbb{R}^{c}$$ is obtained as $$v = pV$$, where $$V \in \mathbb{R}^{N \times c}$$ is the trained value memory with $$N$$ slots. In this way, the key-value memory structure allows flexible access to the information stored in the value memory for different query values.
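A minimal sketch of these two steps, assuming the key and value memories are simple matrices (the names `K` and `V` follow the text; everything else is hypothetical):

```python
import torch
import torch.nn.functional as F

def read_memory(q: torch.Tensor, K: torch.Tensor, V: torch.Tensor):
    """Key addressing followed by value reading, as in Eq. (1).

    q: query feature, shape (c,)
    K: key memory with N slots, shape (N, c)
    V: value memory with N slots, shape (N, c)
    Returns the address vector p (N,) and the read value v (c,).
    """
    # Key addressing: softmax over cosine similarities between q and each key slot.
    sim = F.cosine_similarity(q.unsqueeze(0), K, dim=1)   # (N,)
    p = F.softmax(sim, dim=0)                             # address vector p
    # Value reading: weighted sum of value slots, v = pV.
    v = p @ V                                             # (c,)
    return p, v

# Example usage with random memories.
N, c = 20, 256
q = torch.randn(c)
K, V = torch.randn(N, c), torch.randn(N, c)
p, v = read_memory(q, K, V)
```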

Training

The memory module is trained so that corresponding information is stored at the same slot index. In other words, if the second slot of V turns out to contain semantic information related to the dog class, the second slot of S is guided to learn the corresponding distribution of spatial feature representations of the dog class. To effectively guide the Bias-reducing memory module to learn the distribution of spatial feature representations together with the corresponding semantic information distilled from the target network, three objective functions are designed: $$L_{classifier}$$, $$L_{sparse}$$, and $$L_{address}$$.

As shown in the architecture figure, a new classifier has to be trained from scratch in order to train the memory module. $$L_{classifier}$$ is defined as $$L_{classifier} = BCE(fc(cat(v_t, f)), Y) + BCE(fc(cat(v, f)), Y)$$, where BCE is the binary cross-entropy loss, $$fc(\cdot)$$ is a fully connected classifier, and $$cat(\cdot)$$ denotes concatenation of two vectors. $$v$$ is the value reading obtained with formula (1) from the memory reading section above. $$v_t$$ is also a value reading obtained with formula (1), but using the value memory instead of the key memory and the value feature representation $$v'$$ instead of the query feature representation $$q$$. The first term uses $$v_t$$, which is influenced by the ground-truth labels, and trains the value memory to contain ground-truth information. The second term contains $$v$$, which is influenced by the query features, and trains the key memory.
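A sketch of this loss, assuming a multi-label setting with `BCEWithLogitsLoss`, that the feature `f` has been pooled to a vector before concatenation, and that the read values `v` and `v_t` were already obtained with the memory-reading sketch above:

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()            # BCE over multi-hot targets Y

def classifier_loss(fc: nn.Module, f_vec: torch.Tensor, v: torch.Tensor,
                    v_t: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
    """L_classifier = BCE(fc(cat(v_t, f)), Y) + BCE(fc(cat(v, f)), Y).

    f_vec: feature representation (assumed pooled to a vector here), shape (B, c)
    v:     value read with the key memory and query q, shape (B, c)
    v_t:   value read with the value memory and value feature v', shape (B, c)
    Y:     multi-hot ground-truth labels, shape (B, num_classes)
    """
    term_gt = bce(fc(torch.cat([v_t, f_vec], dim=1)), Y)   # trains the value memory
    term_q = bce(fc(torch.cat([v, f_vec], dim=1)), Y)      # trains the key memory
    return term_gt + term_q
```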

We want the memory module to arrange the features into slots without leaving blank slots, so that the memory space is used efficiently. $$L_{sparse}$$ is used to achieve this; it is an L2 distance between the two read value features $$v_t$$ and $$v$$: $$L_{sparse} = \frac{1}{N}\sum_{i=1}^{N} (v_i - v_{t_i})^2$$

We also need the same slot index in S, K, and V to store information related to each other. An address matching objective $$L_{address}$$ guides the spatial feature representation dictionary and the key memory to output address vectors $$p_s$$ and $$p$$ that are similar to the value address vector $$p'$$: $$L_{address} = KL(p' \parallel p_s) + KL(p' \parallel p)$$, where $$KL(p' \parallel p_s) = \sum_{i=1}^{N} p'_i \cdot \log(p'_i / p_{s_i})$$ is the Kullback-Leibler divergence.

Taking all of these losses into account, the total loss is $$L = L_{classifier} + L_{sparse} + L_{address}$$.
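Putting the remaining two terms together with the classifier loss, a sketch of one training objective computation might look as follows (again with hypothetical tensor names; `classifier_loss` is the helper sketched above):

```python
import torch
import torch.nn.functional as F

def sparse_loss(v: torch.Tensor, v_t: torch.Tensor) -> torch.Tensor:
    # L_sparse: mean squared distance between the two read values.
    return F.mse_loss(v, v_t)

def address_loss(p_prime: torch.Tensor, p_s: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
    # L_address = KL(p' || p_s) + KL(p' || p). F.kl_div takes log-probabilities as
    # input and probabilities as target, and computes KL(target || input).
    eps = 1e-8
    kl1 = F.kl_div((p_s + eps).log(), p_prime, reduction="sum")
    kl2 = F.kl_div((p + eps).log(), p_prime, reduction="sum")
    return kl1 + kl2

# total = classifier_loss(fc, f_vec, v, v_t, Y) + sparse_loss(v, v_t) \
#         + address_loss(p_prime, p_s, p)
```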

Generating Visual Explanation

The key-value structure memory module learns the distribution of spatial feature representations from the target deep network and discretely organizes these distributions into separate memory slots. After training, we have a spatial feature representation dictionary S constructed from the training images. Given the query feature representation $$q_x$$ of an input image x, a target class $$\hat{c}$$, and the original prediction score $$z$$ of x, we want to find the slot $$n_{\hat{c}}$$ of the trained memory module S that contains the information most closely related to the target class $$\hat{c}$$. This slot is found by perturbing each slot with random noise and selecting the slot whose perturbation causes the largest drop in the prediction score. The algorithm below returns the slot index $$n_{\hat{c}}$$ that holds the most closely related information for the target class $$\hat{c}$$ in the trained memory module.
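A sketch of this slot search, assuming a hypothetical `predict_score(q, S)` helper that returns the target-class score when classifying with a (possibly perturbed) dictionary S:

```python
import torch

def find_class_slot(S: torch.Tensor, q_x: torch.Tensor,
                    predict_score, noise_std: float = 1.0) -> int:
    """Return the slot index n_c_hat whose perturbation hurts the prediction most.

    S: trained spatial feature representation dictionary, shape (N, c)
    predict_score(q, S) -> score of the target class c_hat (hypothetical helper)
    """
    base_score = float(predict_score(q_x, S))
    drops = []
    for n in range(S.shape[0]):
        S_perturbed = S.clone()
        # Perturb only slot n with random noise.
        S_perturbed[n] = S_perturbed[n] + noise_std * torch.randn_like(S_perturbed[n])
        drops.append(base_score - float(predict_score(q_x, S_perturbed)))
    # The slot most relevant to c_hat is the one whose perturbation causes the largest drop.
    return drops.index(max(drops))
```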

The trained model refers to the spatial feature representation dictionary S when classifying images, and we want to know which parts of the image are weighted most heavily in the model's decision making. The weights of the memory slots are adjusted to reduce the importance of spatial feature representations that are irrelevant to the target class $$\hat{c}$$, while emphasizing those similar to the retrieved feature distribution $$S_{n_{\hat{c}}}$$. An exponential function is applied to the cosine similarity $$\tau_i$$ to map its range [-1, 1] to the positive range $$[e^{-1}, e]$$, which puts more emphasis on similarity values close to 1. The class activation map M-CAM is then constructed as the weighted sum of $$f_{x_i}$$ with the set of importance weights $$w = \{w_1, w_2, ..., w_c\}$$ over the $$c$$ channels.
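A heavily hedged sketch of this final map construction, assuming the per-channel cosine similarities `tau` (shape `(c,)`, computed against the retrieved slot as described in the paper) are already available:

```python
import torch
import torch.nn.functional as F

def m_cam(features: torch.Tensor, tau: torch.Tensor) -> torch.Tensor:
    """features: f_x with shape (c, w, h); tau: per-channel cosine similarities in [-1, 1]."""
    # exp maps [-1, 1] to [e^-1, e], emphasizing channels most similar to S_{n_c_hat}.
    weights = torch.exp(tau)
    # Weighted sum over channels, as in the CAM formulation from Section 1.
    cam = F.relu((weights.view(-1, 1, 1) * features).sum(dim=0))
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
```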

4. Experiment & Result

Experimental setup

The datasets used are those with the challenging conditions introduced at the beginning. MS COCO is chosen because it contains many images with multiple co-occurring objects. Additionally, four medical datasets are chosen: NIH Chest X-ray 14 and VinDr-CXR for their multiple-object co-occurrence, and retinal optical coherence tomography (OCT) and the EndoTect Challenge dataset for their class imbalance.

Several visual explanation frameworks are chosen as baselines: Grad-CAM, Grad-CAM++, Eigen-CAM, EigenGrad-CAM, and Ablation-CAM.

M-CAM is verified qualitatively by plotting the visual explanations and quantitatively with several evaluation metrics: Average Drop Percentage, Percentage Increase in Confidence, Infidelity, and Sensitivity.

Result

Qualitative Result

On the MS COCO dataset, while the other visual explanation frameworks fail to emphasize only the intended object in an image where a person and a skateboard co-occur, M-CAM succeeds in highlighting the person and the skateboard separately and accurately. M-CAM also performs better than the other frameworks on a dataset with class imbalance (EndoTect).

Quantitative Result

5. Conclusion

M-CAM makes three contributions: a key-value structure bias-reducing memory and its training scheme, a novel CAM-based visual explanation method built on that memory module, and verification of the proposed method on MS COCO and four medical datasets. The memory module may require considerable memory depending on how many object classes there are, but it will not grow without bound for a closed dataset, since the size of the vocabulary is limited.

Take home message (오늘의 교훈)

A dataset with class imbalance and multi-object co-occurrence can bias the network, so visual explanation frameworks that rely on network parameters may not be dependable.

A key-value memory based visual explanation can tackle this issue by learning to match feature representations to their objects and storing each association at a different slot index.

Author / Reviewer information

Author

김성엽 (Kim Seongyeop)

• Affiliation: KAIST EE

• Contact information: seongyeop@kaist.ac.kr

Reviewer

Reference & Additional materials

[1] Seongyeop Kim, Yong Man Ro. M-CAM: Visual Explanation of Challenging Conditioned Dataset with Bias-reducing Memory. In British Machine Vision Conference (BMVC), 2021.

[2] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 2921–2929, 2016.

[3] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In International Conference on Computer Vision (ICCV), pages 618–626, 2017.

[4] Aditya Chattopadhay, Anirban Sarkar, Prantik Howlader, and Vineeth N Balasubramanian. Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks. In Winter Conference on Applications of Computer Vision (WACV), pages 839–847. IEEE, 2018.

[5] Alexander Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason Weston. Key-value memory networks for directly reading documents. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1400–1409, Austin, Texas, November 2016. Association for Computational Linguistics. doi: 10.18653/v1/D16-1147.

[6] Kenneth Kreutz-Delgado, Joseph F Murray, Bhaskar D Rao, Kjersti Engan, Te-Won Lee, and Terrence J Sejnowski. Dictionary learning algorithms for sparse representation. Neural Computation, 15(2):349–396, 2003.

