MetaAugment [Eng]
Rajendran et al. / Meta-Learning Requires Meta-Augmentation / NeurIPS 2020
A standard supervised machine learning problem considers a set of training data $(x_i, y_i)$ indexed by $i$ and sampled from a single task $\mathcal{T}$, where the goal is to learn a function $x \mapsto \hat{y}$. Following [1], we rely mostly on meta-learning centric nomenclature but borrow the terms "support", "query", and "episode" from the few-shot learning literature. In meta-learning, we have a set of tasks $\{\mathcal{T}_i\}$, where each task $\mathcal{T}_i$ is made of a support set $\mathcal{D}_i^s$ containing $(x_s, y_s)$ samples and a query set $\mathcal{D}_i^q$ containing $(x_q, y_q)$ samples. The grouped support and query sets are referred to as an episode. The training and test sets of examples are replaced by meta-training and meta-test sets of tasks, each of which consists of episodes.

The goal is to learn a base learner that first observes support data $(x_s, y_s)$ for a new task, then outputs a model that yields correct predictions $\hat{y}_q$ for $x_q$. The model learned for the $i$-th task is parameterized by $\phi_i$, and the well-generalized model that is adapted to each task, i.e., the base learner, is parameterized by $\theta$. The generalization of the adapted model is measured on the query set $\mathcal{D}_i^q$ and in turn used to optimize the base learner during meta-training. For simplicity, this formulation considers only classification tasks. This setting is commonly described as k-shot, N-way classification, indicating k examples per class in the support set, with class labels $y_s, y_q \in [1, N]$. Let $\mathcal{L}$ and $\mu$ denote the loss function and the inner-loop learning rate, respectively. The above process is formulated as the following optimization problem:

$$\theta^* = \arg\min_{\theta} \frac{1}{|\{\mathcal{T}_i\}|} \sum_{i} \mathcal{L}\big(f_{\phi_i}(X_i^q), Y_i^q\big), \quad \text{s.t.} \quad \phi_i = \theta - \mu \nabla_{\theta} \mathcal{L}\big(f_{\theta}(X_i^s), Y_i^s\big),$$

where $X_i^s, X_i^q$ and $Y_i^s, Y_i^q$ represent the collections of samples and their corresponding labels for the support and query sets, respectively. In the meta-testing phase, to solve a new task $\mathcal{T}_t$, the optimal $\theta^*$ is fine-tuned on its support set $\mathcal{D}_t^s$, resulting in the task-specific parameters $\phi_t$.
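The inner-loop adaptation and the outer-loop update on the query loss can be illustrated with a minimal sketch. It assumes a linear model with mean squared error, for which the gradient through the inner step has a closed form; the toy task generator `make_task`, the step sizes, and all function names are hypothetical choices for illustration and are not taken from the paper.

```python
import numpy as np

def mse_grad(theta, X, Y):
    """Gradient of the mean squared error of the linear model X @ theta."""
    return 2.0 * X.T @ (X @ theta - Y) / len(Y)

def inner_update(theta, X_s, Y_s, mu):
    """Task-specific parameters: phi = theta - mu * grad of the support loss."""
    return theta - mu * mse_grad(theta, X_s, Y_s)

def query_loss(theta, task, mu):
    """Loss of the adapted model on the query set of one task."""
    X_s, Y_s, X_q, Y_q = task
    phi = inner_update(theta, X_s, Y_s, mu)
    return float(np.mean((X_q @ phi - Y_q) ** 2))

def meta_gradient(theta, task, mu):
    """Exact gradient of the query loss with respect to theta (linear model only).

    By the chain rule, d phi / d theta = I - mu * H_s, where
    H_s = 2 X_s^T X_s / n is the constant Hessian of the support loss.
    """
    X_s, Y_s, X_q, Y_q = task
    phi = inner_update(theta, X_s, Y_s, mu)
    H_s = 2.0 * X_s.T @ X_s / len(Y_s)
    return (np.eye(len(theta)) - mu * H_s) @ mse_grad(phi, X_q, Y_q)

def make_task(rng, dim=3, k=5):
    """Hypothetical toy task: linear regression with a task-specific weight vector."""
    w = rng.normal(size=dim)
    X_s, X_q = rng.normal(size=(k, dim)), rng.normal(size=(k, dim))
    return X_s, X_s @ w, X_q, X_q @ w

rng = np.random.default_rng(0)
tasks = [make_task(rng) for _ in range(8)]
theta, mu, meta_lr = np.zeros(3), 0.05, 0.05

def meta_loss(theta):
    """Average query loss after adaptation, over all meta-training tasks."""
    return float(np.mean([query_loss(theta, t, mu) for t in tasks]))

loss_before = meta_loss(theta)
for _ in range(100):
    grads = [meta_gradient(theta, t, mu) for t in tasks]
    theta = theta - meta_lr * np.mean(grads, axis=0)
loss_after = meta_loss(theta)
```

At meta-test time, the same `inner_update` would be applied once to the support set of the new task to obtain its task-specific parameters.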
There are two forms of overfitting: (1) memorization overfitting, in which the model is able to overfit to the training set without relying on the learner and (2) learner overfitting, in which the learner overfits to the training set and does not generalize to the test set. Both types of overfitting hurt the generalization from meta-training to meta-testing tasks.
This paper introduces an information-theoretic framework of meta-augmentation, whereby adding randomness discourages the base learner and model from learning trivial solutions that do not generalize to new tasks. Specifically, the authors propose to create new tasks by augmenting existing ones.
Data augmentation has been applied to several domains with strong results, including image classification, speech recognition, reinforcement learning, and language learning. Within meta-learning, augmentation has been applied in several ways. Mehrotra and Dukkipati [3] train a generator to generate new examples for one-shot learning problems. Santoro et al. [4] augment Omniglot using random translations and rotations to generate more examples within a task. Liu et al. [5] apply similar transforms but treat them as task augmentation by defining each rotation as a new task. These augmentations add more data and tasks, but do not turn non-mutually-exclusive problems into mutually-exclusive ones, since the pairing between $x_s$ and $y_s$ is still consistent across meta-learning episodes, leaving open the possibility of memorization overfitting. Antoniou and Storkey [6] and Khodadadeh et al. [7] generate tasks by randomly selecting $x_s$ from an unsupervised dataset, using data augmentation on $x_s$ to generate more examples for the random task. The authors instead create a mutually-exclusive task setting by modifying $y_s$ to create more tasks with shared $x_s$. The large interest in the field has spurred the creation of meta-learning benchmarks, investigations into tuning few-shot models, and analysis of what these models learn. For overfitting in MAML in particular, regularization has been done by encouraging the model's output to be uniform before the base learner updates the model [8], limiting the updateable parameters [9], or regularizing the gradient update based on cosine similarity or dropout [10, 11]. Yin et al. [2] propose using a Variational Information Bottleneck to regularize the model by restricting the information flow between $x_q$ and $y_q$.
Correctly balancing regularization to prevent overfitting or underfitting can be challenging, as the relationship between constraints on model capacity and generalization is hard to predict. For example, overparameterized networks have been empirically shown to have lower generalization error [12]. Rather than crippling the model by limiting its access to $x_q$, the authors instead use data augmentation to encourage the model to pay more attention to $(x_s, y_s)$.
We define an augmentation to be CE-preserving (conditional entropy preserving) if conditional entropy is conserved, $H(Y' \mid X') = H(Y \mid X)$, where $(X', Y')$ denotes the augmented data. For instance, the rotation augmentation is CE-preserving because rotations in $X'$ do not affect the predictiveness of the original or rotated image to the class label. CE-preserving augmentations are commonly used in image-based problems. Conversely, an augmentation is CE-increasing if it increases conditional entropy, $H(Y' \mid X') > H(Y \mid X)$. For example, if $Y$ is continuous and $\epsilon \sim U[-1, 1]$, then $f(\epsilon, x, y) = (x, y + \epsilon)$ is CE-increasing, since $(X', Y')$ will have two examples $(x, y_1), (x, y_2)$ with shared $x$ and different $y$, increasing $H(Y' \mid X')$.
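The distinction between the two kinds of augmentation can be checked numerically on a toy dataset. The sketch below is only an illustration under stated assumptions: it uses discrete labels, a tagged copy of each input as a stand-in for an image rotation, and a label flip as a discrete analogue of adding noise to $y$. The helper `conditional_entropy` is a hypothetical name, not something from the paper.

```python
import math
from collections import Counter

def conditional_entropy(pairs):
    """Empirical conditional entropy H(Y|X) in bits from a list of (x, y) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    marginal_x = Counter(x for x, _ in pairs)
    return -sum(c / n * math.log2(c / marginal_x[x]) for (x, _), c in joint.items())

# Toy dataset: the label is a deterministic function of x, so H(Y|X) = 0.
data = [(x, x % 2) for x in range(8)]

# CE-preserving: transform x only (a tagged copy stands in for a rotated image).
# Every transformed input still determines its label.
preserved = data + [(("rot", x), y) for x, y in data]

# CE-increasing: each x now appears with both its label and the flipped label,
# so x alone no longer determines y.
increased = [(x, (y + eps) % 2) for x, y in data for eps in (0, 1)]

h_original = conditional_entropy(data)
h_preserved = conditional_entropy(preserved)
h_increased = conditional_entropy(increased)
```

In this toy case the first two entropies are zero, while the third is one bit, matching the entropy of the uniform binary noise that was added.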
One option is to augment in the same way as classical machine learning methods, by applying a CE-preserving augmentation to each task. However, the overfitting problem in meta-learning requires a different kind of augmentation. The authors wish to couple $(x_s, y_s)$ and $x_q$ together such that the model cannot minimize the training loss using $x_q$ alone. This can be done through CE-increasing augmentation. Labels $y_s, y_q$ are encrypted to $y_s', y_q'$ with the same random key $\epsilon$, in such a way that the base learner can only recover $\epsilon$ by associating $x_s \to y_s'$, and doing so is necessary to associate $x_q \to y_q'$.
Theorem 1. Let $\epsilon$ be a noise variable independent from $X, Y$, and let $g: \epsilon, Y \to Y$ be the augmentation function. Let $y' = g(\epsilon, y)$, and assume that $(\epsilon, x, y) \mapsto (x, y')$ is a one-to-one function. Then $H(Y' \mid X) \geq H(Y \mid X) + H(\epsilon)$.
In order to lower $H(Y_q' \mid X_q)$ to a level where the task can be solved, the learner must extract at least $H(\epsilon)$ bits from $(x_s, y_s')$. This reduces memorization overfitting, since it guarantees that $(x_s, y_s')$ has some information required to predict $y_q'$, even given the model and $x_q$.
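A small numerical check can make the entropy bound concrete. The sketch below assumes a toy setting that is not from the paper: labels are a deterministic function of $x$, and the key is a cyclic shift of the $N$ class indices drawn uniformly and independently of the data, so the original $(\epsilon, x, y)$ can be recovered from $(x, y')$. It compares the conditional entropy of the augmented labels with the original conditional entropy plus the entropy of the key.

```python
import math
from collections import defaultdict
from itertools import product

def cond_entropy(joint):
    """H(Y|X) in bits for a joint distribution given as {(x, y): probability}."""
    p_x = defaultdict(float)
    for (x, _), p in joint.items():
        p_x[x] += p
    return -sum(p * math.log2(p / p_x[x]) for (x, _), p in joint.items() if p > 0)

N = 3            # number of classes
xs = range(6)    # toy inputs, uniformly distributed

def label(x):
    """Hypothetical deterministic labelling, so that H(Y|X) = 0."""
    return x % N

p_xy = {(x, label(x)): 1 / len(xs) for x in xs}

# Key eps: uniform over the N cyclic shifts, independent of (X, Y).
# The augmented label is y' = (y + eps) mod N.
p_aug = defaultdict(float)
for (x, y), eps in product(p_xy, range(N)):
    p_aug[(x, (y + eps) % N)] += p_xy[(x, y)] / N

h_y_given_x = cond_entropy(p_xy)    # entropy before augmentation
h_eps = math.log2(N)                # entropy of the uniform key
h_aug = cond_entropy(p_aug)         # entropy after augmentation
```

In this setting the bound holds with equality: a model that sees only $x$ faces $\log_2 N$ bits of uncertainty about the augmented label, which can only be resolved from examples that reveal the key.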
By adding new and different varieties of tasks to the meta-train set, CE-increasing augmentations also help avoid learner overfitting and help the base learner generalize to test set tasks. This effect is similar to the effect that data augmentation has in classical machine learning to avoid overfitting.
Few-shot classification benchmarks such as Mini-ImageNet [13] have meta-augmentation in them by default. New tasks created by shuffling the class index $y$ of previous tasks are added to the training set. Here $y' = g(\epsilon, y)$, where $\epsilon$ is a permutation sampled uniformly from all permutations of the $N$ class indices. The permutation $\epsilon$ can be viewed as an encryption key, which $g$ applies to $y$ to get $y'$. This augmentation is CE-increasing, since given an initial label distribution $Y$, augmenting with $Y' = g(\epsilon, Y)$ gives a uniform $Y' \mid x$. This makes the task setting mutually-exclusive, thereby reducing memorization overfitting. It is accompanied by the creation of new tasks through combining classes from different tasks, which adds more variation to the meta-train set. These added tasks help avoid learner overfitting.
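The class-index shuffling described above can be sketched in a few lines. This is a minimal illustration, assuming integer labels in `0..N-1`; the function name `shuffle_task_labels` and the example labels are hypothetical. The key point is that one permutation is drawn per episode and applied to both the support and the query labels.

```python
import numpy as np

def shuffle_task_labels(y_support, y_query, n_way, rng):
    """Apply one random permutation of the class indices to a whole episode."""
    perm = rng.permutation(n_way)  # the key, shared by support and query
    return perm[y_support], perm[y_query], perm

rng = np.random.default_rng(0)
y_s = np.array([0, 1, 2, 3, 4])        # 1-shot, 5-way support labels
y_q = np.array([4, 4, 0, 2, 1, 3])     # query labels from the same classes
y_s_new, y_q_new, perm = shuffle_task_labels(y_s, y_q, n_way=5, rng=rng)
```

Because the permutation changes from episode to episode, the new query labels cannot be predicted from the query inputs alone; the learner has to read the mapping off the support set.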
For multivariate regression tasks where the support set contains a regression target, the dimensions of $y_s, y_q$ can be treated as class logits to be permuted. This reduces to a setup identical to the classification case. For scalar meta-learning regression tasks and situations where output dimensions cannot be permuted, the authors show that the CE-increasing augmentation of adding uniform noise to the regression targets generates enough new tasks to help reduce overfitting.
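For the scalar regression case, a minimal sketch of the noise augmentation might look as follows. It assumes that a single offset is drawn per task and added to both the support and the query targets, mirroring the shared key used for labels; the noise range and the name `add_task_noise` are hypothetical and are not the settings used in the paper.

```python
import numpy as np

def add_task_noise(y_support, y_query, rng, low=-1.0, high=1.0):
    """Shift all regression targets of one task by the same random offset."""
    eps = rng.uniform(low, high)  # one key per task (range is an assumed value)
    return y_support + eps, y_query + eps, eps

rng = np.random.default_rng(0)
y_s = np.array([0.5, 2.0, 3.5])
y_q = np.array([1.0, 4.0])
y_s_new, y_q_new, eps = add_task_noise(y_s, y_q, rng)
```

Since the offset is unknown to the model, the query targets can only be predicted correctly by inferring the offset from the support pairs.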
General settings
1-shot, 5-way
Turn default mutually-exclusive benchmarks into non-mutually-exclusive versions of themselves by partitioning the classes into groups of N classes without overlap. These groups form the meta-train tasks, and over all of training, class order is never changed.
Dataset
Omniglot: 1623 different handwritten characters from different alphabets
Mini-ImageNet: a more complex few-shot dataset based on the ILSVRC object classification dataset. There are 100 classes in total, with 600 samples each.
D'Claw: a small robotics-inspired image classification dataset collected by the authors.
Classification models: MAML, Prototypical, Matching
Baselines: non-mutually-exclusive settings
Evaluation metric: classification accuracy
Common few-shot image classification benchmarks, like Omniglot and Mini-ImageNet, are already mutually-exclusive by default through meta-augmentation. In order to study the effect of meta-augmentation using task shuffling on various datasets, the authors turn these mutually-exclusive benchmarks into non-mutually-exclusive versions of themselves by partitioning the classes into groups of N classes without overlap. These groups form the meta-train tasks, and over all of training, class order is never changed.
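The construction of the non-mutually-exclusive variant can be sketched as below. This is an illustrative sketch only: the class ids, the group size, and the helper names `partition_classes` and `sample_episode_labels` are hypothetical. With `shuffle=False` the class-to-label mapping of a task is identical in every episode, which is what allows a model to memorize it; with `shuffle=True` the mapping changes between episodes.

```python
import random

def partition_classes(class_ids, n_way):
    """Split classes into disjoint groups of n_way; each group is one fixed task."""
    usable = len(class_ids) - len(class_ids) % n_way  # drop any leftover classes
    return [tuple(class_ids[i:i + n_way]) for i in range(0, usable, n_way)]

def sample_episode_labels(task, shuffle, rng):
    """Return a mapping from class id to label index for one episode."""
    order = list(task)
    if shuffle:
        rng.shuffle(order)  # meta-augmentation: new class order per episode
    return {c: i for i, c in enumerate(order)}

tasks = partition_classes(list(range(20)), n_way=5)
rng = random.Random(0)

# Non-mutually-exclusive: the same mapping in every episode.
fixed = [sample_episode_labels(tasks[0], False, rng) for _ in range(3)]

# With shuffling: the mapping varies across episodes.
shuffled = [sample_episode_labels(tasks[0], True, rng) for _ in range(50)]
```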
Table 1: Few-shot image classification test set results. Results use MAML unless otherwise stated. All results are in 1-shot 5-way classification, except for D’Claw which is 1-shot 2-way. (unit: %)
| Problem | | | |
|---|---|---|---|
| Omniglot | 98.1 | 98.5 | 98.7 |
| Mini-ImageNet (MAML) | 30.2 | 42.7 | 46 |
| Mini-ImageNet (Prototypical) | 32.5 | 32.5 | 37.2 |
| Mini-ImageNet (Matching) | 33.8 | 33.8 | 39.8 |
| D'Claw | 72.5 | 79.8 | 83.1 |
General settings
Scalar output
Dataset
Sinusoid: Synthesized 1D sine wave regression problem
Pascal3D pose regression: each task is to take a 128x128 grayscale image of an object from the Pascal 3D dataset and predict its angular orientation (normalized between 0 and 10) about the Z-axis, with respect to some unobserved canonical pose specific to each object.
Baselines: MAML, MR-MAML, CNP, MR-CNP
Evaluation metric: Prediction mean square error and standard deviations
Table 2: Pascal3D pose prediction error (MSE) means and standard deviations. Removing weight decay (WD) improves the MAML baseline and augmentation improves the MAML, MR-MAML, CNP, MR-CNP results. Bracketed numbers copied from Yin et al. [2].
| Method | MAML (WD=1e-3) | MAML (WD=0) | MR-MAML (β=0.001) | MR-MAML (β=0) | CNP | MR-CNP |
|---|---|---|---|---|---|---|
| No Aug | [5.39 ± 1.31] | 3.74 ± .64 | 2.41 ± .04 | 2.8 ± .73 | [8.48 ± .12] | [2.89 ± .18] |
| Aug | 4.99 ± 1.22 | **2.34 ± .66** | **1.71 ± .16** | **1.61 ± .06** | **2.51 ± .17** | 2.51 ± .20 |
Memorization overfitting is just a classical machine learning problem in disguise: a function approximator pays too much attention to one input, $x_q$, and not enough to the other input, $(x_s, y_s)$, when the former is sufficient to solve the task at training time. The two inputs could take many forms, such as different subsets of pixels within the same image. By a similar analogy, learner overfitting corresponds to correct function approximation on input examples from the training set and a systematic failure to generalize from those to the test set. Although meta-augmentation is helpful, it still has its limitations. Distribution mismatch between train-time tasks and test-time tasks can be lessened through augmentation, but augmentation may not entirely remove it. Creating new tasks by shuffling the class index of previous tasks is a simple procedure. A natural extension would be a mathematical framework for generating novel tasks that systematically follow a specific rule.
There are two forms of meta-overfitting: memorization overfitting and learner overfitting.
Meta-augmentation mitigates both by shuffling class labels in classification and by adding a random variable to y in regression.
Nguyen Ngoc Quang
KAIST AI
https://github.com/quangbk2010
[1] Eleni Triantafillou et al. Meta-Dataset: A dataset of datasets for learning to learn from few examples. In ICLR, 2020
[2] Mingzhang Yin, George Tucker, Mingyuan Zhou, Sergey Levine, and Chelsea Finn. Meta-learning without memorization. In ICLR, 2020
[3] Akshay Mehrotra and Ambedkar Dukkipati. Generative adversarial residual pairwise networks for one shot learning. arXiv preprint arXiv:1703.08033, 2017
[4] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning, pages 1842–1850, 2016
[5] Jialin Liu, Fei Chao, and Chih-Min Lin. Task augmentation by rotating for meta-learning. arXiv preprint arXiv:2003.00804, 2020
[6] Antreas Antoniou and Amos Storkey. Assume, augment and learn: Unsupervised few-shot meta-learning via random labels and data augmentation. arXiv preprint arXiv:1902.09884, 2019
[7] Siavash Khodadadeh, Ladislau Boloni, and Mubarak Shah. Unsupervised meta-learning for few-shot image classification. In Advances in Neural Information Processing Systems, pages 10132–10142, 2019
[8] Muhammad Abdullah Jamal and Guo-Jun Qi. Task agnostic meta-learning for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11719–11727, 2019
[9] Luisa Zintgraf, Kyriacos Shiarlis, Vitaly Kurin, Katja Hofmann, and Shimon Whiteson. Fast context adaptation via meta-learning. In International Conference on Machine Learning, pages 7693–7702, 2019
[10] Simon Guiroy, Vikas Verma, and Christopher Pal. Towards understanding generalization in gradient-based meta-learning. arXiv preprint arXiv:1907.07287, 2019
[11] Hung-Yu Tseng, Yi-Wen Chen, Yi-Hsuan Tsai, Sifei Liu, Yen-Yu Lin, and Ming-Hsuan Yang. Regularizing meta-learning via gradient dropout, 2020
[12] Roman Novak, Yasaman Bahri, Daniel A. Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein. Sensitivity and generalization in neural networks: an empirical study. arXiv preprint arXiv:1802.08760, 2018
[13] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638, 2016