MetaAugment [Eng]

Rajendran et al. / Meta-Learning Requires Meta-Augmentation / NeurIPS 2020


1. Problem definition

A standard supervised machine learning problem considers a set of training data $(x^i, y^i)$ indexed by $i$ and sampled from a task $\mathcal{T}$, where the goal is to learn a function $x \mapsto \widehat{y}$. Following [1], we rely mostly on meta-learning centric nomenclature but borrow the terms "support", "query", and "episode" from the few-shot learning literature. In meta-learning, we have a set of tasks $\{\mathcal{T}^i\}$, where each task $\mathcal{T}^i$ is made of a support set $\mathcal{D}^s_i$ containing $(x_s, y_s)$ samples and a query set $\mathcal{D}^q_i$ containing $(x_q, y_q)$ samples. The grouped support and query sets are referred to as an episode. The training and test sets of examples are replaced by meta-training and meta-test sets of tasks, each of which consists of episodes. The goal is to learn a base learner that first observes support data $(x_s, y_s)$ for a new task, then outputs a model which yields correct predictions $\widehat{y}_q$ for $x_q$. The model learned for the $i$-th task is parameterized by $\phi_i$, and the well-generalized model that is adapted to each $i$-th task, i.e., the base learner, is $\theta_0$. The generalization of the adapted model is measured on the query set $\mathcal{D}^q_i$ and in turn used to optimize the base learner $\theta_0$ during meta-training. Within the scope of this review, to keep things simple, we only consider classification tasks; this is commonly described as k-shot, N-way classification, indicating $k$ examples in the support set, with class labels $y_s, y_q \in [1, N]$. Let $\mathcal{L}$ and $\mu$ denote the loss function and the inner-loop learning rate, respectively. The above process is formulated as the following optimization problem

$$\theta^*_0 := \underset{\theta_0}{\min}\; \mathbb{E}_{\mathcal{T}^i \sim p(\mathcal{T})}\big[\mathcal{L}(f_{\phi_i}(\mathbf{X}^q_i), \mathbf{Y}^q_i)\big], \quad \text{s.t.}\;\; \phi_i = \theta_0 - \mu\nabla_{\theta_0}\mathcal{L}(f_{\theta_0}(\mathbf{X}^s_i), \mathbf{Y}^s_i),$$

where $(\mathbf{X}^{s(q)}_i, \mathbf{Y}^{s(q)}_i)$ denote the collection of samples and their corresponding labels for the support (query) set, respectively. In the meta-testing phase, to solve a new task $\mathcal{T}^*$, the optimal $\theta^*_0$ is fine-tuned on its support set $\mathcal{D}^{s*}$ to obtain the task-specific parameters $\theta_*$.
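To make the bi-level optimization above concrete, the following is a minimal first-order (FOMAML-style) sketch of the inner/outer loop, written against a toy linear model; the task format, loss, and learning rates are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def loss_and_grad(theta, X, Y):
    """MSE loss and its gradient for a linear model y_hat = X @ theta."""
    err = X @ theta - Y
    return np.mean(err ** 2), 2.0 * X.T @ err / len(Y)

def maml_step(theta0, tasks, inner_lr=0.01, outer_lr=0.001):
    """One outer update of the base learner theta0 over a batch of tasks.

    Each task is a dict with support arrays (Xs, Ys) and query arrays (Xq, Yq).
    The inner loop adapts theta0 to phi_i on the support set; the outer loss
    is the query loss under phi_i (first-order approximation of MAML).
    """
    outer_grad = np.zeros_like(theta0)
    for task in tasks:
        # Inner loop: phi_i = theta0 - mu * grad of the support loss
        _, g_s = loss_and_grad(theta0, task["Xs"], task["Ys"])
        phi_i = theta0 - inner_lr * g_s
        # Outer objective: query loss evaluated with the adapted parameters
        _, g_q = loss_and_grad(phi_i, task["Xq"], task["Yq"])
        outer_grad += g_q  # first-order: ignore d(phi_i)/d(theta0)
    return theta0 - outer_lr * outer_grad / len(tasks)
```

Looping `maml_step` over sampled batches of episodes corresponds to meta-training $\theta^*_0$; meta-testing then runs only the inner-loop adaptation on $\mathcal{D}^{s*}$.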

2. Motivation

There are two forms of overfitting: (1) memorization overfitting, in which the model is able to overfit to the training set without relying on the learner and (2) learner overfitting, in which the learner overfits to the training set and does not generalize to the test set. Both types of overfitting hurt the generalization from meta-training to meta-testing tasks.

This paper introduces an information-theoretic framework of meta-augmentation, whereby adding randomness discourages the base learner and model from learning trivial solutions that do not generalize to new tasks. Specifically, the authors propose to create new tasks by augmenting existing ones.

Related work

Data augmentation has been applied to several domains with strong results, including image classification, speech recognition, reinforcement learning, and language learning. Within meta-learning, augmentation has been applied in several ways. Mehrotra and Dukkipati [3] train a generator to generate new examples for one-shot learning problems. Santoro et al. [4] augmented Omniglot using random translations and rotations to generate more examples within a task. Liu et al. [5] applied similar transforms, but treated them as task augmentation by defining each rotation as a new task. These augmentations add more data and tasks, but do not turn non-mutually-exclusive problems into mutually-exclusive ones, since the pairing between $x_s, y_s$ is still consistent across meta-learning episodes, leaving open the possibility of memorization overfitting. Antoniou and Storkey [6] and Khodadadeh et al. [7] generate tasks by randomly selecting $x_s$ from an unsupervised dataset, using data augmentation on $x_s$ to generate more examples for the random task. The authors instead create a mutually-exclusive task setting by modifying $y_s$ to create more tasks with shared $x_s$. The large interest in the field has spurred the creation of meta-learning benchmarks, investigations into tuning few-shot models, and analysis of what these models learn. For overfitting in MAML in particular, regularization has been done by encouraging the model's output to be uniform before the base learner updates the model [8], limiting the updateable parameters [9], or regularizing the gradient update based on cosine similarity or dropout [10, 11]. Yin et al. [2] propose using a Variational Information Bottleneck to regularize the model by restricting the information flow between $x_q$ and $y_q$.

Idea

Correctly balancing regularization to prevent overfitting or underfitting can be challenging, as the relationship between constraints on model capacity and generalization is hard to predict. For example, overparameterized networks have been empirically shown to have lower generalization error [12]. Rather than crippling the model by limiting its access to $x_q$, the authors instead use data augmentation to encourage the model to pay more attention to $(x_s, y_s)$.

We define an augmentation to be CE-preserving (conditional entropy preserving) if the conditional entropy $H(y'|x') = H(y|x)$ is conserved; for instance, the rotation augmentation is CE-preserving because rotations of $x'$ do not affect the predictiveness of the original or rotated image to the class label. CE-preserving augmentations are commonly used in image-based problems. Conversely, an augmentation is CE-increasing if it increases conditional entropy, $H(y'|x') > H(y|x)$. For example, if $Y$ is continuous and $\epsilon \sim U[-1, 1]$, then $f(x, y, \epsilon) = (x, y + \epsilon)$ is CE-increasing, since $(x', y')$ will have two examples $(x, y_1), (x, y_2)$ with shared $x$ and different $y$, increasing $H(y'|x')$.

The authors propose to augment in the same way as classical machine learning methods, by applying a CE-preserving augmentation to each task. However, the overfitting problem in meta-learning requires a different augmentation. They wish to couple $(x_s, y_s), (x_q, y_q)$ together such that the model cannot minimize the training loss using $x_q$ alone. This can be done through CE-increasing augmentation. Labels $y_s, y_q$ are encrypted to $y'_s, y'_q$ with the same random key $\epsilon$, in such a way that the base learner can only recover $\epsilon$ by associating $x_s \rightarrow y'_s$, and doing so is necessary to associate $x_q \rightarrow y'_q$.
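As an illustration of this coupling, the sketch below applies one shared random permutation $\epsilon$ to both the support and query labels of an episode; the episode dictionary format is a hypothetical stand-in, not the paper's code.

```python
import numpy as np

def ce_increasing_shuffle(episode, num_classes, rng):
    """Re-label one episode with a shared random permutation (the 'key' eps).

    episode: dict with integer label arrays 'ys' and 'yq' in [0, num_classes).
    Because the same permutation maps support and query labels, the query
    labels cannot be predicted from x_q alone; the learner must first infer
    eps from the (x_s, y'_s) pairs.
    """
    eps = rng.permutation(num_classes)   # one random key per episode
    episode = dict(episode)              # shallow copy, leave the input intact
    episode["ys"] = eps[episode["ys"]]
    episode["yq"] = eps[episode["yq"]]
    return episode

rng = np.random.default_rng(0)
episode = {"ys": np.array([0, 1, 2, 3, 4]), "yq": np.array([2, 2, 0, 4, 1])}
print(ce_increasing_shuffle(episode, num_classes=5, rng=rng))
```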

3. Method

Theorem 1. Let $\epsilon$ be a noise variable independent from $X, Y$, and $g: \epsilon, Y \rightarrow Y$ be the augmentation function. Let $y' = g(\epsilon, y)$, and assume that $(\epsilon, x, y) \mapsto (x, y')$ is a one-to-one function. Then $H(Y'|X) = H(Y|X) + H(\epsilon)$.
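A short proof sketch, reconstructed here from the theorem statement rather than copied from the paper, follows from the chain rule for conditional entropy and the independence of $\epsilon$:

$$
\begin{aligned}
H(Y' \mid X) &= H(Y', \epsilon \mid X) && (\epsilon \text{ is determined by } (X, Y') \text{ via the one-to-one map}) \\
&= H(\epsilon \mid X) + H(Y' \mid X, \epsilon) && \text{(chain rule)} \\
&= H(\epsilon) + H(Y \mid X, \epsilon) && (\epsilon \perp X; \text{ given } \epsilon,\ Y' \text{ and } Y \text{ determine each other}) \\
&= H(\epsilon) + H(Y \mid X) && (\epsilon \perp (X, Y)).
\end{aligned}
$$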

In order to lower $H(Y'_q|X_q)$ to a level where the task can be solved, the learner must extract at least $H(\epsilon)$ bits from $(x_s, y'_s)$. This reduces memorization overfitting, since it guarantees $(x_s, y'_s)$ has some information required to predict $y'_q$, even given the model and $x_q$.

By adding new and different varieties of tasks to the meta-train set, CE-increasing augmentations also help avoid learner overfitting and help the base learner generalize to test set tasks. This effect is similar to the effect that data augmentation has in classical machine learning to avoid overfitting.

Few-shot classification benchmarks such as Mini-ImageNet [13] have meta-augmentation in them by default. New tasks created by shuffling the class index $y$ of previous tasks are added to the training set. Here $y' = g(\epsilon, y)$, where $\epsilon$ is a permutation sampled uniformly from all permutations. The $\epsilon$ can be viewed as an encryption key, which $g$ applies to $y$ to get $y'$. This augmentation is CE-increasing, since given an initial label distribution $Y$, augmenting with $Y' = g(\epsilon, Y)$ gives a uniform $Y'|X$. Therefore, this makes the task setting mutually-exclusive, thereby reducing memorization overfitting. This is accompanied by the creation of new tasks through combining classes from different tasks, adding more variation to the meta-train set. These added tasks help avoid learner overfitting.
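The hypothetical sketch below contrasts the episode-construction settings compared later in Table 1; the `dataset` and `class_pool` structures and the mode names are assumptions made for illustration, not the authors' code.

```python
import numpy as np

def make_episode(dataset, class_pool, n_way, k_shot, mode, rng):
    """Build one N-way, k-shot episode under three settings (illustrative only).

    dataset:    dict mapping class_id -> list of examples.
    class_pool: a fixed, non-overlapping group of N classes (used by the
                non-mutually-exclusive and intrashuffle settings).
    mode: 'non_mutually_exclusive' keeps both the classes and the label order
          fixed; 'intrashuffle' permutes the label order within the fixed
          classes; 'intershuffle' re-draws classes from the whole dataset.
    """
    if mode == "intershuffle":
        classes = list(rng.choice(list(dataset), size=n_way, replace=False))
    else:
        classes = list(class_pool)
        if mode == "intrashuffle":
            classes = list(rng.permutation(classes))
    support, query = [], []
    for label, c in enumerate(classes):          # episode-local label index
        order = rng.permutation(len(dataset[c]))
        chosen = [dataset[c][i] for i in order]
        support += [(x, label) for x in chosen[:k_shot]]
        query += [(x, label) for x in chosen[k_shot:2 * k_shot]]
    return support, query
```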

For multivariate regression tasks where the support set contains a regression target, the dimensions of $y_s, y_q$ can be treated as class logits to be permuted. This reduces to a setup identical to the classification case. For scalar meta-learning regression tasks and situations where output dimensions cannot be permuted, the authors show that the CE-increasing augmentation of adding uniform noise to the regression targets, $y'_s = y_s + \epsilon,\ y'_q = y_q + \epsilon$, generates enough new tasks to help reduce overfitting.
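A minimal sketch of this regression-target augmentation, mirroring $y'_s = y_s + \epsilon,\ y'_q = y_q + \epsilon$ (the helper and its noise scale are illustrative assumptions):

```python
import numpy as np

def augment_regression_episode(ys, yq, rng, noise_scale=1.0):
    """CE-increasing augmentation for scalar regression targets.

    A single eps ~ U[-noise_scale, noise_scale] is shared by the support and
    query targets of one episode, so the learner must infer eps from
    (x_s, y'_s) before it can predict y'_q correctly.
    """
    eps = rng.uniform(-noise_scale, noise_scale)   # one key per episode
    return ys + eps, yq + eps

rng = np.random.default_rng(0)
ys, yq = np.array([0.1, -0.3]), np.array([0.5])
print(augment_regression_episode(ys, yq, rng))
```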

4. Experiment & Result

4.1. Few-shot image classification (Omniglot, Mini-ImageNet, D’Claw)

Experimental setup

  • General settings

    • 1-shot, 5-way

    • Turn default mutually-exclusive benchmarks into non-mutually-exclusive versions of themselves by partitioning the classes into groups of N classes without overlap. These groups form the meta-train tasks, and over all of training, class order is never changed.

  • Dataset

    • Omniglot: 1623 different handwritten characters from different alphabets

    • Mini-ImageNet: a more complex few-shot dataset, based on the ILSVRC object classification dataset; there are 100 classes in total, with 600 samples each.

    • D'Claw: a small robotics-inspired image classification dataset that the authors collected.

  • Classification models: MAML, Prototypical Networks, Matching Networks

  • Baselines: non-mutually-exclusive settings

  • Evaluation metric: classification accuracy

Result

Common few-shot image classification benchmarks, like Omniglot and Mini-ImageNet, are already mutually-exclusive by default through meta-augmentation. In order to study the effect of meta-augmentation using task shuffling on various datasets, the authors turn these mutually-exclusive benchmarks into non-mutually-exclusive versions of themselves by partitioning the classes into groups of N classes without overlap. These groups form the meta-train tasks, and over all of training, the class order is never changed.

Table 1: Few-shot image classification test set results. Results use MAML unless otherwise stated. All results are in 1-shot 5-way classification, except for D’Claw which is 1-shot 2-way. (unit: %)

| Problem setting | Non-mutually-exclusive accuracy | Intrashuffle accuracy | Intershuffle accuracy |
| --- | --- | --- | --- |
| Omniglot | 98.1 | 98.5 | 98.7 |
| Mini-ImageNet (MAML) | 30.2 | 42.7 | 46 |
| Mini-ImageNet (Prototypical) | 32.5 | 32.5 | 37.2 |
| Mini-ImageNet (Matching) | 33.8 | 33.8 | 39.8 |
| D'Claw | 72.5 | 79.8 | 83.1 |

4.2. Regression tasks (Sinusoid, Pascal3D Pose Regression)

Experimental setup

  • General settings

    • Scalar output

  • Dataset

    • Sinusoid: Synthesized 1D sine wave regression problem

    • Pascal3D pose regression: each task is to take a 128x128 grayscale image of an object from the Pascal 3D dataset and predict its angular orientation $y_q$ (normalized between 0 and 10) about the Z-axis, with respect to some unobserved canonical pose specific to each object.

  • Baselines: MAML, MR-MAML, CNP, MR-CNP

  • Evaluation metric: prediction mean squared error (MSE) and standard deviation

Result

Table 2: Pascal3D pose prediction error (MSE) means and standard deviations. Removing weight decay (WD) improves the MAML baseline and augmentation improves the MAML, MR-MAML, CNP, MR-CNP results. Bracketed numbers copied from Yin et al. [2].

| Method | MAML (WD=1e-3) | MAML (WD=0) | MR-MAML (β=0.001) | MR-MAML (β=0) | CNP | MR-CNP |
| --- | --- | --- | --- | --- | --- | --- |
| No Aug | [5.39 ± 1.31] | 3.74 ± .64 | 2.41 ± .04 | 2.8 ± .73 | [8.48 ± .12] | [2.89 ± .18] |
| Aug | 4.99 ± 1.22 | **2.34 ± .66** | **1.71 ± .16** | **1.61 ± .06** | **2.51 ± .17** | 2.51 ± .20 |

5. Conclusion

Memorization overfitting is just a classical machine learning problem in disguise: a function approximator pays too much attention to one input $x_q$, and not enough to the other input $(x_s, y_s)$, when the former is sufficient to solve the task at training time. The two inputs could take many forms, such as different subsets of pixels within the same image. By a similar analogy, learner overfitting corresponds to correct function approximation on input examples $((x_s, y_s), x_q)$ from the training set, and a systematic failure to generalize from those to the test set. Although meta-augmentation is helpful, it still has its limitations. The distribution mismatch between train-time tasks and test-time tasks can be lessened through augmentation, but augmentation may not entirely remove it. Creating new tasks by shuffling the class index of previous tasks is a simple procedure; it could be explored further by devising a mathematical framework that augments novel tasks systematically according to a specific rule.

Take home message (오늘의 교훈)

Two forms of meta-overfitting: memorization overfitting and learner overfitting.

Meta-augmentation avoids both by shuffling classes in classification and adding a shared random variable to $y$ in regression.

Author / Reviewer information

Author

Nguyen Ngoc Quang

  • KAIST AI

  • https://github.com/quangbk2010

Reviewer

  1. Korean name (English name): Affiliation / Contact information

  2. Korean name (English name): Affiliation / Contact information

  3. ...

Reference & Additional materials

  1. Eleni Triantafillou, et al. Meta-dataset: A dataset of datasets for learning to learn from few examples. In ICLR, 2020

  2. Mingzhang Yin, George Tucker, Mingyuan Zhou, Sergey Levine, and Chelsea Finn. Meta-learning without memorization. In ICLR, 2020

  3. Akshay Mehrotra and Ambedkar Dukkipati. Generative adversarial residual pairwise networks for one shot learning. arXiv preprint arXiv:1703.08033, 2017

  4. Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Meta-learning with memory-augmented neural networks. In International conference on machine learning, pages 1842–1850, 2016

  5. Jialin Liu, Fei Chao, and Chih-Min Lin. Task augmentation by rotating for meta-learning. arXiv preprint arXiv:2003.00804, 2020

  6. Antreas Antoniou and Amos Storkey. Assume, augment and learn: Unsupervised few-shot meta-learning via random labels and data augmentation. arXiv preprint arXiv:1902.09884, 2019

  7. Siavash Khodadadeh, Ladislau Boloni, and Mubarak Shah. Unsupervised meta-learning for few-shot image classification. In Advances in Neural Information Processing Systems, pages 10132–10142, 2019

  8. Muhammad Abdullah Jamal and Guo-Jun Qi. Task agnostic meta-learning for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11719–11727, 2019

  9. Luisa Zintgraf, Kyriacos Shiarli, Vitaly Kurin, Katja Hofmann, and Shimon Whiteson. Fast context adaptation via meta-learning. In International Conference on Machine Learning, pages 7693–7702, 2019

  10. Simon Guiroy, Vikas Verma, and Christopher Pal. Towards understanding generalization in gradient-based meta-learning. arXiv preprint arXiv:1907.07287, 2019

  11. Hung-Yu Tseng, Yi-Wen Chen, Yi-Hsuan Tsai, Sifei Liu, Yen-Yu Lin, and Ming-Hsuan Yang. Regularizing meta-learning via gradient dropout, 2020

  12. Roman Novak, Yasaman Bahri, Daniel A Abolafia, Jeffrey Pennington, and Jascha SohlDickstein. Sensitivity and generalization in neural networks: an empirical study. arXiv preprint arXiv:1802.08760, 2018

  13. Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Advances in neural information processing systems, pages 3630–3638, 2016