M-CAM [Eng]
Kim et al. / M-CAM - Visual Explanation of Challenging Conditioned Dataset with Bias-reducing Memory / BMVC 2021
Given a pre-trained feature encoder F of the target network, the spatial feature representation $$f \in \mathbb{R}^{w \times h \times c}$$ of an input image is extracted, where $$f_k \in \mathbb{R}^{w \times h}$$ is the activation at the $$k$$-th channel. An importance weight $$w_k$$ is assigned to each spatial feature map $$f_k$$ according to its relevance to the target network's decision for the target class; different CAM methods differ in how this weight assignment is done. By taking the weighted sum of the $$f_k$$ with the set of importance weights $$w = \{w_1, w_2, \dots, w_c\}$$ over the channels, a class activation map is generated as the visual explanation.
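A minimal sketch of this weighted-sum step, with random arrays standing in for real network outputs; the array shapes follow the $$f \in \mathbb{R}^{w \times h \times c}$$ convention above, and the ReLU plus normalization at the end are common display conventions rather than part of the formulation itself:

```python
import numpy as np

def class_activation_map(f: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Weighted sum of channel activations -> (w, h) saliency map."""
    assert f.shape[-1] == weights.shape[0]
    cam = np.tensordot(f, weights, axes=([-1], [0]))  # sum_k w_k * f_k
    cam = np.maximum(cam, 0)                          # keep positive evidence (ReLU)
    return cam / (cam.max() + 1e-8)                   # normalize to [0, 1] for display

# Example: random features with 512 channels on a 7x7 grid.
f = np.random.rand(7, 7, 512)
w = np.random.rand(512)
cam = class_activation_map(f, w)  # (7, 7) map, upsampled to image size in practice
```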
A class activation map (CAM) for a specific class shows the discriminative image regions a CNN uses to classify images into that class. Datasets with challenging conditions, such as an imbalanced class distribution or frequent co-occurrence of multiple objects in a multi-label classification training set, can introduce unwanted bias into the internal components of the target deep network. CAM methods rely on exactly these internal components, such as gradients and feature maps, to generate visual explanations. Consequently, on a challenging conditioned dataset there is no guarantee that the internal components are reliable, which degrades the credibility of the generated visual explanations. To tackle this problem, the authors propose a Bias-reducing memory module that provides quality visual explanations even on datasets with challenging conditions.
CAM was first introduced in [2], where the weighted sum of the features at the last convolutional layer is computed using the weights of the output layer that follows a global average pooling of that layer. Grad-CAM [3] then generalized the concept, showing that generating an activation map no longer requires a specific CNN structure (a global average pooling layer at the end of the network). Grad-CAM++ [4] further generalizes Grad-CAM, enhancing its visual explanations by exploiting higher-order derivatives for importance weight assignment. These are the most popular CAM methods, but several other variations have been published as well, each with its own specialized advantages.
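To make the Grad-CAM generalization concrete, here is a small sketch of how it assigns importance weights [3]: global-average-pool the gradient of the class score with respect to the last convolutional feature maps. The feature tensor and score below are placeholders wired together only so that autograd has a path; in practice both come from a forward pass of a real CNN:

```python
import torch
import torch.nn.functional as F

# Stand-ins for the last conv feature maps A^k and the target-class logit y^c.
features = torch.rand(1, 512, 7, 7, requires_grad=True)
score = features.mean() * 3.0

grads = torch.autograd.grad(score, features)[0]   # dy^c / dA^k, shape (1, 512, 7, 7)
weights = grads.mean(dim=(2, 3))                  # GAP over spatial dims -> (1, 512)
cam = F.relu((weights[:, :, None, None] * features).sum(dim=1))  # (1, 7, 7)
```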
In [5], in order to open up possibilities for the open-domain Question Answering task, a new dynamic neural network architecture called the key-value structure memory network is introduced. It stores diverse patterns of feature representations of the training data in the key memory and uses them to infer the related semantic information stored in the value memory. The trained correlation between key and value memory is then used to classify the input.
To mitigate the effect of biases in the target network caused by the inherent challenging conditions of the training dataset, the proposed key-value structure memory module learns the distribution of spatial feature representations from the target deep network and discretely organizes these distributions into separate memory slots. To exploit the memory network to its fullest extent, the authors adopt the sparse dictionary learning concept from [6], so that diverse information can be stored sparsely over different memory slots.
An imbalanced dataset like the one shown below (many samples of dogs but only a few of whales) may cause the model to tune its weights to fit the dog data more closely, because the loss on dog samples contributes more to the total loss than the loss on whale samples. To avoid this issue, the memory module stores features of distinct classes in different slots, and the trained module is used for inference. This way, classification no longer depends on the biased parameters of the model.
Multiple objects may co-occur in a single training image. For example, many images may contain a person and a horse at the same time. If there are far fewer training images where a horse appears by itself, the network may rely on the presence of a person to classify the horse and fail to recognize a horse when no person is present. To mitigate this issue, the memory module learns to disentangle horse features from person features and stores them in different slots, even when the two objects appear together.
The figure above describes the overall flow of how the proposed Bias-reducing memory module learns the desired information from the target network. For training, the memory module takes the spatial feature representation $$f \in \mathbb{R}^{w \times h \times c}$$ extracted from the pre-trained feature encoder F, a query feature representation $$q \in \mathbb{R}^{c}$$, and a value feature representation $$v' \in \mathbb{R}^{c}$$. A semantic information encoder G is designed to map the one-hot encoded ground truth label vector y to the same dimensionality as $$q$$. $$f$$ and $$v'$$ are not used in the inference step. The memory module outputs a read value feature $$v \in \mathbb{R}^{c}$$, and the classifier takes the concatenation of $$q$$ and $$v$$ as input and outputs a classification score $$z$$ in both the training and inference steps.
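A sketch of the interfaces in this flow, assuming the reading operation described in the memory-reading section below. All concrete choices here (linear layers for G and the classifier, the dimensions C, N, NUM_CLASSES) are illustrative assumptions, not the paper's exact layers:

```python
import torch
import torch.nn as nn

C, N, NUM_CLASSES = 512, 100, 80            # channels, memory slots, classes

G = nn.Linear(NUM_CLASSES, C)               # semantic encoder G: one-hot y -> v' in R^C
classifier = nn.Linear(2 * C, NUM_CLASSES)  # takes concat(q, v) -> score z
K = torch.randn(N, C)                       # key memory
V = torch.randn(N, C)                       # value memory

def read(memory: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
    """Address the given memory with a query, then read from the value memory V."""
    p = torch.softmax(query @ memory.t(), dim=-1)  # address vector over N slots
    return p @ V                                   # weighted sum of value slots

q = torch.randn(1, C)                       # query feature from the encoder
y = torch.zeros(1, NUM_CLASSES)
y[0, 3] = 1.0                               # one-hot ground truth label
v_prime = G(y)                              # value feature v' (training only)

v = read(K, q)                              # read value, used at train and test time
v_t = read(V, v_prime)                      # value-addressed reading (training only)
z = classifier(torch.cat([q, v], dim=-1))   # classification score z
```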
The datasets used are those with the challenging conditions introduced at the beginning. MS COCO is chosen because it contains many images in which multiple objects co-occur. Additionally, four medical datasets are chosen: NIH Chest X-ray 14 and VinDr-CXR for their multiple-object co-occurrence, and retinal optical coherence tomography (OCT) and the EndoTect Challenge dataset for their class imbalance.
Several visual explanation frameworks are chosen as baselines: Grad-CAM, Grad-CAM++, Eigen-CAM, EigenGrad-CAM, and Ablation-CAM.
M-CAM is verified qualitatively by plotting the visual explanations, and quantitatively using several evaluation metrics: Average Drop Percentage, Percentage Increase in Confidence, Infidelity, and Sensitivity.
On the MS COCO dataset, while the other visual explanation frameworks fail to emphasize only the single intended object in an image where a person and a skateboard co-occur, M-CAM accurately highlights the person and the skateboard separately. M-CAM also outperforms the other frameworks on a dataset with class imbalance (EndoTect).
M-CAM makes three contributions: a key-value structure bias-reducing memory and its training scheme, a novel CAM-based visual explanation method built on the memory module, and verification of the proposed method on MS COCO and four medical datasets. The memory module may consume considerable memory depending on how many objects there are, but it will not grow without bound for a closed dataset, since the vocabulary of classes is limited.
Datasets with class imbalance and multi-object co-occurrence can bias a network, so visual explanation frameworks based on network parameters may not be dependable.
Key-value memory based visual explanation can tackle this issue by learning to match each representation feature to its object and storing each association in a separate memory slot.
김성엽 (Kim Seongyeop)
Affiliation: KAIST EE
Contact: seongyeop@kaist.ac.kr
[1] Seongyeop Kim and Yong Man Ro. M-CAM: Visual explanation of challenging conditioned dataset with bias-reducing memory. In British Machine Vision Conference (BMVC), 2021.
[2] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 2921–2929, 2016.
[3] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In International Conference on Computer Vision (ICCV), pages 618–626, 2017.
[4] Aditya Chattopadhay, Anirban Sarkar, Prantik Howlader, and Vineeth N. Balasubramanian. Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks. In Winter Conference on Applications of Computer Vision (WACV), pages 839–847. IEEE, 2018.
[5] Alexander Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason Weston. Key-value memory networks for directly reading documents. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1400–1409, Austin, Texas, November 2016. Association for Computational Linguistics. doi: 10.18653/v1/D16-1147.
[6] Kenneth Kreutz-Delgado, Joseph F. Murray, Bhaskar D. Rao, Kjersti Engan, Te-Won Lee, and Terrence J. Sejnowski. Dictionary learning algorithms for sparse representation. Neural Computation, 15(2):349–396, 2003.
Before going into the training part, it is useful to describe how a value is read from the memory, since this reading operation is used in the training steps. Applying a key-value memory involves two major steps: key addressing and value reading. Given an embedded query $$q \in \mathbb{R}^{c}$$, with c the number of channels of the resulting spatial features, the similarity between q and each slot of the key memory is measured. An address vector $$p \in \mathbb{R}^{N}$$ is obtained for a key memory $$K \in \mathbb{R}^{N \times c}$$ with N slots, where each scalar value $$p_i$$ represents the similarity between the query and one memory slot:

$$p_i = \mathrm{Softmax}(q^{\top} K_i) \tag{1}$$

where $$i = 1, 2, \dots, N$$ and $$\mathrm{Softmax}(a_i) = e^{a_i} / \sum_{j=1}^{N} e^{a_j}$$.
In the value reading step, the value memory is accessed with the key address vector p as a set of relative importance weights over the slots. The read value $$v \in \mathbb{R}^{c}$$ is obtained as $$v = \sum_{i=1}^{N} p_i V_i$$, where $$V \in \mathbb{R}^{N \times c}$$ is the trained value memory with N slots. This key-value structure allows flexible access to the desired information stored in the value memory for different query values.
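A small numeric sketch of formula (1) and the value reading, with tiny random memories standing in for trained ones:

```python
import numpy as np

N, c = 4, 8
rng = np.random.default_rng(0)
K = rng.standard_normal((N, c))        # key memory with N slots
V = rng.standard_normal((N, c))        # value memory with N slots
q = rng.standard_normal(c)             # embedded query

sims = K @ q                           # similarity of q to each key slot
p = np.exp(sims) / np.exp(sims).sum()  # softmax -> address vector, sums to 1
v = p @ V                              # value reading: weighted sum of slots

print(p.round(3), p.sum())             # address weights over the 4 slots
```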
The memory module is trained to store corresponding information at the same slot index across memories. In other words, if the second slot of V turns out to contain semantic information related to the dog class, the second slot of S is guided to learn the corresponding distribution of spatial feature representations of the dog class. To effectively guide the Bias-reducing memory module to learn the distribution of spatial feature representations together with the corresponding semantic information distilled from the target network, three objective functions are designed.
As shown in the architecture figure, a new classifier has to be trained from scratch in order to train the memory module. The classification objective $$\mathcal{L}_{c}$$ is devised as:

$$\mathcal{L}_{c} = \mathrm{BCE}(h(q \oplus v_t), y) + \mathrm{BCE}(h(q \oplus v), y)$$

where BCE is the Binary Cross Entropy loss function, $$h$$ is a fully connected layer classifier, and $$\oplus$$ represents concatenation of two vectors. $$v$$ is the value reading obtained with formula (1) from the memory reading section above. $$v_t$$ is also a value reading obtained with formula (1), but addressing with the value memory V instead of the key memory and with the value feature representation v' instead of the query feature representation q. The first term uses $$v_t$$, which is influenced by the ground truth labels, and trains the value memory to contain ground-truth semantic information. The second term contains $$v$$, which is influenced by the query features, and trains the key memory.
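A sketch of this objective, assuming the reconstructed form above; BCEWithLogitsLoss stands in for BCE on raw multi-label classifier scores, and the classifier h and all dimensions are illustrative assumptions:

```python
import torch
import torch.nn as nn

C, NUM_CLASSES = 512, 80
h = nn.Linear(2 * C, NUM_CLASSES)   # fully connected classifier on concat(q, v)
bce = nn.BCEWithLogitsLoss()        # multi-label binary cross entropy

def l_cls(q, v, v_t, y):
    # First term (v_t path) trains the value memory toward ground truth;
    # second term (v path) trains the key memory through the query.
    return (bce(h(torch.cat([q, v_t], dim=-1)), y)
            + bce(h(torch.cat([q, v], dim=-1)), y))

q, v, v_t = torch.randn(1, C), torch.randn(1, C), torch.randn(1, C)
y = torch.zeros(1, NUM_CLASSES)
y[0, 3] = 1.0
loss = l_cls(q, v, v_t, y)
```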
We want the memory module to arrange features across the slots without leaving any slot blank, so that the memory space is used efficiently. A memory objective $$\mathcal{L}_{m}$$ is utilized to achieve this: the L2 norm between the two read value features $$v_t$$ and $$v$$:

$$\mathcal{L}_{m} = \lVert v_t - v \rVert_2$$
We need the same index of memory slots in S, K, and V to store information related to each other. An address matching objective $$\mathcal{L}_{a}$$ is used to guide the spatial feature representation dictionary and the key memory to output address vectors $$p_s$$ and $$p_k$$ similar to the value address vector $$p_v$$:

$$\mathcal{L}_{a} = D_{KL}(p_v \,\|\, p_s) + D_{KL}(p_v \,\|\, p_k)$$

where $$D_{KL}$$ is the Kullback-Leibler divergence.
Taking all these losses into account, the total loss is then:

$$\mathcal{L}_{total} = \mathcal{L}_{c} + \mathcal{L}_{m} + \mathcal{L}_{a}$$
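A sketch of the remaining two objectives and the total loss, under the reconstructed forms above; the unweighted sum is an assumption, and note that PyTorch's F.kl_div takes log-probabilities as its first argument, so kl_div(p_s.log(), p_v) computes $$D_{KL}(p_v \,\|\, p_s)$$:

```python
import torch
import torch.nn.functional as F

def l_mem(v_t, v):
    return torch.norm(v_t - v, p=2)   # L2 norm between the two read values

def l_addr(p_v, p_s, p_k):
    # KL(p_v || p_s) + KL(p_v || p_k)
    return (F.kl_div(p_s.log(), p_v, reduction="batchmean")
            + F.kl_div(p_k.log(), p_v, reduction="batchmean"))

# Toy address vectors over N = 4 slots (each row sums to 1).
p_v = torch.softmax(torch.randn(1, 4), dim=-1)
p_s = torch.softmax(torch.randn(1, 4), dim=-1)
p_k = torch.softmax(torch.randn(1, 4), dim=-1)
v_t, v = torch.randn(1, 8), torch.randn(1, 8)

l_c = torch.tensor(0.7)               # stand-in for the classification loss
total = l_c + l_mem(v_t, v) + l_addr(p_v, p_s, p_k)
```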
The key-value structure memory module learns the distribution of spatial feature representations from the target deep network and discretely organizes the distributions into separate memory slots. After training is completed, we have a Spatial Feature Representation Dictionary S constructed from the training images. Given the query feature representation q of an input image x, a target class t, and the original prediction score of x, we want to find the slot of the trained memory module S that contains the information most closely related to the target class t. This slot can be found by perturbing each slot with random noise and selecting the slot that suffers the highest prediction score decrease. The algorithm below returns the index of the slot that contains the most closely related information for the target class in the trained memory module.
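A minimal sketch of this perturbation search. Here predict is a hypothetical function wrapping the trained model: it returns the score for target class t when the dictionary contents are S; the noise scale sigma is likewise an assumption:

```python
import torch

def find_relevant_slot(S, q, t, predict, sigma=0.1):
    base = predict(S, q, t)                               # original prediction score
    drops = []
    for i in range(S.shape[0]):
        S_pert = S.clone()
        S_pert[i] += sigma * torch.randn_like(S_pert[i])  # perturb slot i only
        drops.append(base - predict(S_pert, q, t))        # score decrease for slot i
    return int(torch.stack(drops).argmax())               # slot with the highest drop

# Toy check: a fake score that depends only on slot 2 makes slot 2 win.
S = torch.zeros(5, 16)
predict = lambda S, q, t: -(S[2] ** 2).sum()
print(find_relevant_slot(S, None, 0, predict))            # -> 2
```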
At inference, the trained model refers to the Spatial Feature Representation Dictionary S when classifying images, and we want to know which parts of the image are taken into consideration the most in the model's decision making. A weight adjustment over the memory slots reduces the importance of spatial feature representations that are irrelevant to the target class, while giving more emphasis to the ones similar to the retrieved feature distribution. An exponential function is applied to the cosine similarity $$\tau_i$$ between each spatial feature map and the retrieved feature distribution, mapping the output range of the cosine similarity, [-1, 1], to the positive range $$[e^{-1}, e]$$ and thereby emphasizing cosine similarity values close to 1. The class activation map M-CAM is then constructed by taking the weighted sum of the feature maps $$f_k$$ with the set of importance weights $$w = \{w_1, w_2, \dots, w_c\}$$ over the c channels.
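A sketch of this final construction, assuming the retrieved slot holds a flattened spatial pattern of the same size as one channel map (a shape assumption for illustration; random tensors stand in for real features):

```python
import torch
import torch.nn.functional as F

def m_cam(f: torch.Tensor, s_star: torch.Tensor) -> torch.Tensor:
    """f: (c, h, w) spatial features; s_star: retrieved dictionary slot, (h*w,)."""
    c, h, w = f.shape
    flat = f.reshape(c, h * w)                            # one vector per channel
    target = s_star.reshape(1, -1).expand(c, -1)          # compare slot to each channel
    tau = F.cosine_similarity(flat, target, dim=1)        # tau_i in [-1, 1]
    weights = torch.exp(tau)                              # mapped to [e^-1, e]
    cam = (weights[:, None, None] * f).sum(dim=0)         # weighted sum over channels
    return cam / (cam.abs().max() + 1e-8)                 # normalize for display

f = torch.randn(512, 7, 7)
s_star = torch.randn(49)      # retrieved slot, assumed to hold a 7x7 spatial pattern
cam = m_cam(f, s_star)        # (7, 7) map, upsampled to the image size in practice
```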