


Chapter 2
Meta-Learning

Joaquin Vanschoren

J. Vanschoren, Department of Mathematics and Computer Science, TU Eindhoven, Eindhoven, North Brabant, The Netherlands. e-mail: [email protected]

© The Author(s) 2019. F. Hutter et al. (eds.), Automated Machine Learning, The Springer Series on Challenges in Machine Learning, https://doi.org/10.1007/978-3-030-05318-5_2

Abstract Meta-learning, or learning to learn, is the science of systematically observing how different machine learning approaches perform on a wide range of learning tasks, and then learning from this experience, or meta-data, to learn new tasks much faster than otherwise possible. Not only does this dramatically speed up and improve the design of machine learning pipelines or neural architectures, it also allows us to replace hand-engineered algorithms with novel approaches learned in a data-driven way. In this chapter, we provide an overview of the state of the art in this fascinating and continuously evolving field.

2.1 Introduction

When we learn new skills, we rarely – if ever – start from scratch. We start from skills learned earlier in related tasks, reuse approaches that worked well before, and focus on what is likely worth trying based on experience [82]. With every skill learned, learning new skills becomes easier, requiring fewer examples and less trial-and-error. In short, we learn how to learn across tasks. Likewise, when building machine learning models for a specific task, we often build on experience with related tasks, or use our (often implicit) understanding of the behavior of machine learning techniques to help make the right choices.

The challenge in meta-learning is to learn from prior experience in a systematic, data-driven way. First, we need to collect meta-data that describe prior learning tasks and previously learned models. They comprise the exact algorithm configurations used to train the models, including hyperparameter settings, pipeline compositions and/or network architectures, the resulting model evaluations, such as accuracy and training time, the learned model parameters, such as the trained weights of a neural net, as well as measurable properties of the task itself, also known as meta-features. Second, we need to learn from this prior meta-data, to extract and transfer knowledge that guides the search for optimal models for new tasks. This chapter presents a concise overview of different meta-learning approaches to do this effectively.

The term meta-learning covers any type of learning based on prior experience with other tasks. The more similar those previous tasks are, the more types of meta-data we can leverage, and defining task similarity will be a key overarching challenge. Perhaps needless to say, there is no free lunch [57, 188]. When a new task represents completely unrelated phenomena, or random noise, leveraging prior experience will not be effective. Luckily, in real-world tasks, there are plenty of opportunities to learn from prior experience.

In the remainder of this chapter, we categorize meta-learning techniques based on the type of meta-data they leverage, from the most general to the most task-specific. First, in Sect. 2.2, we discuss how to learn purely from model evaluations. These techniques can be used to recommend generally useful configurations and configuration search spaces, as well as transfer knowledge from empirically similar tasks. In Sect. 2.3, we discuss how we can characterize tasks to more explicitly express task similarity and build meta-models that learn the relationships between data characteristics and learning performance. Finally, Sect. 2.4 covers how we can transfer trained model parameters between tasks that are inherently similar, e.g. sharing the same input features, which enables transfer learning [111] and few-shot learning [126] among others.

Note that while multi-task learning [25] (learning multiple related tasks simultaneously) and ensemble learning [35] (building multiple models on the same task) can often be meaningfully combined with meta-learning systems, they do not in themselves involve learning from prior experience on other tasks.

This chapter is based on a very recent survey article [176].

2.2 Learning from Model Evaluations

Consider that we have access to prior tasks tj ∈ T, the set of all known tasks, as well as a set of learning algorithms, fully defined by their configurations θi ∈ Θ; here Θ represents a discrete, continuous, or mixed configuration space which can cover hyperparameter settings, pipeline components and/or network architecture components. P is the set of all prior scalar evaluations Pi,j = P(θi, tj) of configuration θi on task tj, according to a predefined evaluation measure, e.g. accuracy, and model evaluation technique, e.g. cross-validation. Pnew is the set of known evaluations Pi,new on a new task tnew. We now want to train a meta-learner L that predicts recommended configurations Θ∗new for a new task tnew. The meta-learner is trained on meta-data P ∪ Pnew. P is usually gathered beforehand, or extracted from meta-data repositories [174, 177]. Pnew is learned by the meta-learning technique itself in an iterative fashion, sometimes warm-started with an initial P′new generated by another method.
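To make this notation concrete, the following minimal sketch (in Python, with purely illustrative names, configuration ids, and scores) shows one way the meta-data P and Pnew could be stored in practice; it is an assumption about representation, not something prescribed by the chapter.

```python
from collections import defaultdict

# Hypothetical representation of the notation above: P holds prior scalar
# evaluations P[i,j] of configuration theta_i on task t_j, and P_new holds
# the evaluations gathered so far on the new task t_new.
prior_evals = {             # P: (config_id, task_id) -> score, e.g. CV accuracy
    (0, "task_a"): 0.81,
    (1, "task_a"): 0.84,
    (0, "task_b"): 0.77,
}
new_task_evals = {}         # P_new: config_id -> score on t_new (initially empty)

configs = {                 # Theta, discretized into concrete configurations
    0: {"learning_rate": 0.10, "max_depth": 3},
    1: {"learning_rate": 0.01, "max_depth": 8},
}

def evals_per_task(prior_evals):
    """Group P by task; most of the meta-learners discussed below operate on this view."""
    by_task = defaultdict(dict)
    for (config_id, task_id), score in prior_evals.items():
        by_task[task_id][config_id] = score
    return dict(by_task)
```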


2.2.1 Task-Independent Recommendations

First, imagine not having access to any evaluations on tnew, hence Pnew = ∅. We can then still learn a function f : Θ × T → {θ∗k}, k = 1..K, yielding a set of recommended configurations independent of tnew. These θ∗k can then be evaluated on tnew to select the best one, or to warm-start further optimization approaches, such as those discussed in Sect. 2.2.3.

Such approaches often produce a ranking, i.e. an ordered set θ∗k. This is typically done by discretizing Θ into a set of candidate configurations θi, also called a portfolio, evaluated on a large number of tasks tj. We can then build a ranking per task, for instance using success rates, AUC, or significant wins [21, 34, 85]. However, it is often desirable that equally good but faster algorithms are ranked higher, and multiple methods have been proposed to trade off accuracy and training time [21, 134]. Next, we can aggregate these single-task rankings into a global ranking, for instance by computing the average rank [1, 91] across all tasks. When there is insufficient data to build a global ranking, one can recommend subsets of configurations based on the best known configurations for each prior task [70, 173], or return quasi-linear rankings [30].

To find the best θ∗ for a task tnew, never before seen, a simple anytime method is to select the top-K configurations [21], going down the list and evaluating each configuration on tnew in turn. This evaluation can be halted after a predefined value for K, a time budget, or when a sufficiently accurate model is found. In time-constrained settings, it has been shown that multi-objective rankings (including training time) converge to near-optimal models much faster [1, 134], and provide a strong baseline for algorithm comparisons [1, 85].
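As an illustration of average-rank aggregation and this top-K anytime strategy, here is a minimal sketch. It assumes a complete matrix of prior evaluations (higher scores being better) and a hypothetical evaluate_on_new_task callback that trains and scores one configuration on tnew; both are assumptions, not part of the cited methods.

```python
import numpy as np

def global_ranking(scores):
    """Aggregate per-task rankings into a global ranking by average rank.

    scores: array of shape (n_tasks, n_configs), one evaluation per
    configuration and task; higher is better. Returns configuration
    indices sorted from best to worst average rank.
    """
    # rank 1 = best configuration on that task
    ranks = np.argsort(np.argsort(-scores, axis=1), axis=1) + 1
    return np.argsort(ranks.mean(axis=0))

def top_k_anytime(scores, k, evaluate_on_new_task):
    """Evaluate the top-k globally ranked configurations on t_new in turn
    and return the best one found (a simple anytime strategy)."""
    best_score, best_config = -np.inf, None
    for config_id in global_ranking(scores)[:k]:
        score = evaluate_on_new_task(config_id)
        if score > best_score:
            best_score, best_config = score, config_id
    return best_config, best_score
```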

A very different approach to the one above is to first fit a differentiable function fj(θi) = Pi,j on all prior evaluations of a specific task tj, and then use gradient descent to find an optimized configuration θ∗j per prior task [186]. Assuming that some of the tasks tj will be similar to tnew, those θ∗j will be useful for warm-starting Bayesian optimization approaches.

2.2.2 Configuration Space Design

Prior evaluations can also be used to learn a better configuration space Θ∗. While again independent from tnew, this can radically speed up the search for optimal models, since only the more relevant regions of the configuration space are explored. This is critical when computational resources are limited, and has proven to be an important factor in practical comparisons of AutoML systems [33].

First, in the functional ANOVA [67] approach, hyperparameters are deemed important if they explain most of the variance in algorithm performance on a given task. In [136], this was explored using 250,000 OpenML experiments with 3 algorithms across 100 datasets.


An alternative approach is to first learn an optimal hyperparameter default setting, and then define hyperparameter importance as the performance gain that can be achieved by tuning the hyperparameter instead of leaving it at that default value. Indeed, even though a hyperparameter may cause a lot of variance, it may also have one specific setting that always results in good performance. In [120], this was done using about 500,000 OpenML experiments on 6 algorithms and 38 datasets. Default values are learned jointly for all hyperparameters of an algorithm by first training surrogate models for that algorithm for a large number of tasks. Next, many configurations are sampled, and the configuration that minimizes the average risk across all tasks is the recommended default configuration. Finally, the importance (or tunability) of each hyperparameter is estimated by observing how much improvement can still be gained by tuning it.
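The sketch below illustrates learned defaults and tunability in a deliberately simplified form; it is not the exact procedure of [120]. It assumes the per-task surrogates are available as plain callables mapping a configuration (a dict of hyperparameter values) to a predicted risk, and that a list of sampled configurations over the same hyperparameters is given.

```python
import numpy as np

def learn_default(sampled_configs, surrogates):
    """Pick the sampled configuration with the lowest average predicted risk
    across all task surrogates, as a shared default configuration."""
    avg_risk = [np.mean([s(cfg) for s in surrogates]) for cfg in sampled_configs]
    return sampled_configs[int(np.argmin(avg_risk))]

def tunability(hp_name, default, sampled_configs, surrogates):
    """Estimate how much average risk can still be reduced by tuning only
    `hp_name` while keeping all other hyperparameters at their default values."""
    default_risk = np.mean([s(default) for s in surrogates])
    best_risk = default_risk
    for cfg in sampled_configs:
        candidate = dict(default, **{hp_name: cfg[hp_name]})
        best_risk = min(best_risk, np.mean([s(candidate) for s in surrogates]))
    return default_risk - best_risk
```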

In [183], defaults are learned independently from other hyperparameters, and defined as the configurations that occur most frequently in the top-K configurations for every task. In the case that the optimal default value depends on meta-features (e.g. the number of training instances or features), simple functions are learned that include these meta-features. Next, a statistical test defines whether a hyperparameter can be safely left at this default, based on the performance loss observed when not tuning a hyperparameter (or a set of hyperparameters), while all other parameters are tuned. This was evaluated using 118,000 OpenML experiments with 2 algorithms (SVMs and Random Forests) across 59 datasets.

2.2.3 Configuration Transfer

If we want to provide recommendations for a specific task tnew, we need additional information on how similar tnew is to prior tasks tj. One way to do this is to evaluate a number of recommended (or potentially random) configurations on tnew, yielding new evidence Pnew. If we then observe that the evaluations Pi,new are similar to Pi,j, then tj and tnew can be considered intrinsically similar, based on empirical evidence. We can include this knowledge to train a meta-learner that predicts a recommended set of configurations Θ∗new for tnew. Moreover, every selected θ∗new can be evaluated and included in Pnew, repeating the cycle and collecting more empirical evidence to learn which tasks are similar to each other.

2.2.3.1 Relative Landmarks

A first measure for task similarity considers the relative (pairwise) performance differences, also called relative landmarks, RLa,b,j = Pa,j − Pb,j, between two configurations θa and θb on a particular task tj [53]. Active testing [85] leverages these as follows: it warm-starts with the globally best configuration (see Sect. 2.2.1), calls it θbest, and proceeds in a tournament-style fashion. In each round, it selects the ‘competitor’ θc that most convincingly outperforms θbest on similar tasks. It deems tasks to be similar if the relative landmarks of all evaluated configurations are similar, i.e., if the configurations perform similarly on both tj and tnew then the tasks are deemed similar. Next, it evaluates the competitor θc, yielding Pc,new, updates the task similarities, and repeats. A limitation of this method is that it can only consider configurations θi that were evaluated on many prior tasks.
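A simplified sketch of relative landmarks and the resulting task-similarity signal (not the full active testing procedure): it assumes score vectors over the same set of configurations evaluated on both tasks, and measures how often the pairwise orderings agree.

```python
import numpy as np

def relative_landmarks(scores):
    """Pairwise performance differences RL[a, b] = P_a,j - P_b,j for one task,
    given a vector of scores indexed by configuration."""
    scores = np.asarray(scores, dtype=float)
    return scores[:, None] - scores[None, :]

def task_similarity(scores_j, scores_new):
    """Tasks are deemed similar if their relative landmarks (restricted to the
    configurations already evaluated on t_new) agree; here measured as the
    fraction of configuration pairs ordered the same way on both tasks."""
    rl_j = np.sign(relative_landmarks(scores_j))
    rl_new = np.sign(relative_landmarks(scores_new))
    mask = ~np.eye(len(scores_j), dtype=bool)      # ignore the diagonal
    return float(np.mean(rl_j[mask] == rl_new[mask]))
```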

2.2.3.2 Surrogate Models

A more flexible way to transfer information is to build surrogate models sj(θi) = Pi,j for all prior tasks tj, trained using all available P. One can then define task similarity in terms of the error between sj(θi) and Pi,new: if the surrogate model for tj can generate accurate predictions for tnew, then those tasks are intrinsically similar. This is usually done in combination with Bayesian optimization (see Chap. 1) to determine the next θi.

Wistuba et al. [187] train surrogate models based on Gaussian Processes (GPs) for every prior task, plus one for tnew, and combine them into a weighted, normalized sum, with the (new) predicted mean μ defined as the weighted sum of the individual μj’s (obtained from prior tasks tj). The weights of the μj’s are computed using the Nadaraya-Watson kernel-weighted average, where each task is represented as a vector of relative landmarks, and the Epanechnikov quadratic kernel [104] is used to measure the similarity between the relative landmark vectors of tj and tnew. The more similar tj is to tnew, the larger the weight sj, increasing the influence of the surrogate model for tj.
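The following sketch outlines this weighting scheme in simplified form, omitting the surrogate for tnew itself and the predictive variances. The prior surrogates are assumed to be fitted models exposing a predict method, and the relative-landmark vectors and bandwidth are hypothetical inputs.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov quadratic kernel, zero outside |u| <= 1."""
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)

def combined_mean(theta, prior_surrogates, rl_vectors, rl_new, bandwidth=1.0):
    """Weighted, normalized sum of the per-task surrogate means, with weights
    from a Nadaraya-Watson kernel average over relative-landmark vectors.

    prior_surrogates: fitted models with .predict(theta) -> predicted mean
    rl_vectors:       one relative-landmark vector per prior task
    rl_new:           relative-landmark vector of the new task
    """
    dists = np.array([np.linalg.norm(rl - rl_new) for rl in rl_vectors])
    weights = epanechnikov(dists / bandwidth)
    if weights.sum() == 0:                 # no prior task is close enough
        weights = np.ones_like(weights)
    weights = weights / weights.sum()
    means = np.array([np.asarray(s.predict(theta)).item() for s in prior_surrogates])
    return float(np.dot(weights, means))
```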

Feurer et al. [45] propose to combine the predictive distributions of the individual Gaussian processes, which makes the combined model a Gaussian process again. The weights are computed following the agnostic Bayesian ensemble of Lacoste et al. [81], which weights predictors according to an estimate of their generalization performance.

Meta-data can also be transferred in the acquisition function rather than the surrogate model [187]. The surrogate model is only trained on Pi,new, but the next θi to evaluate is provided by an acquisition function which is the weighted average of the expected improvement [69] on Pi,new and the predicted improvements on all prior Pi,j. The weights of the prior tasks can again be defined via the accuracy of the surrogate model or via relative landmarks. The weight of the expected improvement component is gradually increased with every iteration as more evidence Pi,new is collected.

2.2.3.3 Warm-Started Multi-task Learning

Another approach to relate prior tasks tj is to learn a joint task representation using P prior evaluations. In [114], task-specific Bayesian linear regression [20] surrogate models sj(θzi) are trained in a novel configuration θz learned by a feedforward Neural Network NN(θi), which learns a suitable basis expansion θz of the original configuration θ in which linear surrogate models can accurately predict Pi,new. The surrogate models are pre-trained on OpenML meta-data to provide a warm-start for optimizing NN(θi) in a multi-task learning setting. Earlier work on multi-task learning [166] assumed that we already have a set of ‘similar’ source tasks tj. It transfers information between these tj and tnew by building a joint GP model for Bayesian optimization that learns and exploits the exact relationship between the tasks. Learning a joint GP tends to be less scalable than building one GP per task, though. Springenberg et al. [161] also assume that the tasks are related and similar, but learn the relationship between tasks during the optimization process using Bayesian Neural Networks. As such, their method is somewhat of a hybrid of the previous two approaches. Golovin et al. [58] assume a sequence order (e.g., time) across tasks. They build a stack of GP regressors, one per task, training each GP on the residuals relative to the regressor below it. Hence, each task uses the tasks before it to define its priors.

2.2.3.4 Other Techniques

Multi-armed bandits [139] provide yet another approach to find the source tasks tj most related to tnew [125]. In this analogy, each tj is one arm, and the (stochastic) reward for selecting (pulling) a particular prior task (arm) is defined in terms of the error in the predictions of a GP-based Bayesian optimizer that models the prior evaluations of tj as noisy measurements and combines them with the existing evaluations on tnew. The cubic scaling of the GP makes this approach less scalable, though.

Another way to define task similarity is to take the existing evaluations Pi,j, use Thompson Sampling [167] to obtain the optima distribution ρjmax, and then measure the KL-divergence [80] between ρjmax and ρnewmax [124]. These distributions are then merged into a mixture distribution based on the similarities and used to build an acquisition function that predicts the next most promising configuration to evaluate. It is so far only evaluated to tune 2 SVM hyperparameters using 5 tasks.

Finally, a complementary way to leverage P is to recommend which configurations should not be used. After training surrogate models per task, we can look up which tj are most similar to tnew, and then use sj(θi) to discover regions of Θ where performance is predicted to be poor. Excluding these regions can speed up the search for better-performing ones. Wistuba et al. [185] do this using a task similarity measure based on the Kendall tau rank correlation coefficient [73] between the ranks obtained by ranking configurations θi using Pi,j and Pi,new, respectively.
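A small sketch of this idea, assuming that some configurations have been evaluated on both tasks (for the Kendall tau part) and that a fitted surrogate exposing a predict method is available for the most similar prior task (for region pruning); the quantile threshold and all names are illustrative.

```python
import numpy as np
from scipy.stats import kendalltau

def kendall_similarity(scores_j, scores_new):
    """Task similarity as the Kendall tau rank correlation between the rankings
    that P_i,j and P_i,new induce over the shared configurations."""
    tau, _ = kendalltau(scores_j, scores_new)
    return tau

def poor_region_mask(surrogate, candidate_configs, quantile=0.25):
    """Flag candidate configurations whose predicted performance on the most
    similar prior task falls in the lowest quantile; these can be skipped."""
    preds = np.array([surrogate.predict(c) for c in candidate_configs])
    return preds < np.quantile(preds, quantile)
```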


2.2.4 Learning Curves

We can also extract meta-data about the training process itself, such as how fast model performance improves as more training data is added. If we divide the training in steps st, usually adding a fixed number of training examples every step, we can measure the performance P(θi, tj, st) = Pi,j,t of configuration θi on task tj after step st, yielding a learning curve across the time steps st. As discussed in Chap. 1, learning curves are also used to speed up hyperparameter optimization on a given task. In meta-learning, learning curve information is transferred across tasks.

While evaluating a configuration on a new task tnew, we can halt the training after a certain number of iterations r < t, and use the partially observed learning curve to predict how well the configuration will perform on the full dataset based on prior experience with other tasks, and decide whether to continue the training or not. This can significantly speed up the search for good configurations.

One approach is to assume that similar tasks yield similar learning curves. First, define a distance between tasks based on how similar the partial learning curves are: dist(ta, tb) = f(Pi,a,t, Pi,b,t) with t = 1, . . . , r. Next, find the k most similar tasks t1..k and use their complete learning curves to predict how well the configuration will perform on the new complete dataset. Task similarity can be measured by comparing the shapes of the partial curves across all configurations tried, and the prediction is made by adapting the ‘nearest’ complete curve(s) to the new partial curve [83, 84]. This approach was also successful in combination with active testing [86], and can be sped up further by using multi-objective evaluation measures that include training time [134].
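A minimal sketch of this nearest-curve idea (not the exact method of [83, 84]): the task distance is taken as the mean squared difference between partial curves over their shared steps, and the final performance on tnew is estimated from the k nearest complete curves, shifted to match the partial curve at its last observed step. Curves are plain lists of scores, and the prior curves are assumed to be at least as long as the partial one.

```python
import numpy as np

def curve_distance(curve_a, curve_b):
    """Distance between two learning curves over their first r shared steps,
    a simple stand-in for f(P_i,a,t, P_i,b,t)."""
    r = min(len(curve_a), len(curve_b))
    return float(np.mean((np.asarray(curve_a[:r]) - np.asarray(curve_b[:r])) ** 2))

def predict_final_performance(partial_new, prior_curves, k=3):
    """Predict end-of-curve performance on t_new from the k prior tasks whose
    (partial) curves are most similar, adapting each neighbour by the offset
    observed at the last shared step."""
    nearest = sorted(prior_curves,
                     key=lambda curve: curve_distance(partial_new, curve))[:k]
    r = len(partial_new)
    estimates = [curve[-1] + (partial_new[-1] - curve[r - 1]) for curve in nearest]
    return float(np.mean(estimates))
```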

Interestingly, while several methods aim to predict learning curves during neural architecture search (see Chap. 3), as of yet none of this work leverages learning curves previously observed on other tasks.

2.3 Learning from Task Properties

Another rich source of meta-data are characterizations (meta-features) of the task at hand. Each task tj ∈ T is described with a vector m(tj) = (mj,1, . . . , mj,K) of K meta-features mj,k ∈ M, the set of all known meta-features. This can be used to define a task similarity measure based on, for instance, the Euclidean distance between m(ti) and m(tj), so that we can transfer information from the most similar tasks to the new task tnew. Moreover, together with prior evaluations P, we can train a meta-learner L to predict the performance Pi,new of configurations θi on a new task tnew.


2.3.1 Meta-Features

Table 2.1 provides a concise overview of the most commonly used meta-features, together with a short rationale for why they are indicative of model performance. Where possible, we also show the formulas to compute them. More complete surveys can be found in the literature [26, 98, 130, 138, 175].

To build a meta-feature vector m(tj), one needs to select and further process these meta-features. Studies on OpenML meta-data have shown that the optimal set of meta-features depends on the application [17]. Many meta-features are computed on single features, or combinations of features, and need to be aggregated by summary statistics (min, max, μ, σ, quartiles q1...q4) or histograms [72]. One needs to systematically extract and aggregate them [117]. When computing task similarity, it is also important to normalize all meta-features [9], perform feature selection [172], or employ dimensionality reduction techniques (e.g. PCA) [17]. When learning meta-models, one can also use relational meta-learners [173] or case-based reasoning methods [63, 71, 92].
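As a small illustration, the sketch below computes a handful of the simple and statistical meta-features from Table 2.1 for a numeric dataset and aggregates the per-feature statistics by mean and standard deviation; the particular selection of meta-features and aggregations is an illustrative assumption, not a prescription.

```python
import numpy as np
from scipy.stats import skew, kurtosis

def meta_features(X, y):
    """Compute a small meta-feature vector m(t_j) for a numeric dataset X with
    class labels y, aggregating per-feature statistics by mean and std."""
    n, p = X.shape
    classes, counts = np.unique(y, return_counts=True)
    sk = skew(X, axis=0)
    kt = kurtosis(X, axis=0)
    return {
        "n_instances": n,
        "n_features": p,
        "n_classes": len(classes),
        "class_imbalance": counts.min() / counts.max(),
        "skewness_mean": float(np.mean(sk)),
        "skewness_std": float(np.std(sk)),
        "kurtosis_mean": float(np.mean(kt)),
        "kurtosis_std": float(np.std(kt)),
    }
```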

Beyond these general-purpose meta-features, many more specific ones were formulated. For streaming data one can use streaming landmarks [135, 137], for time series data one can compute autocorrelation coefficients or the slope of regression models [7, 121, 147], and for unsupervised problems one can cluster the data in different ways and extract properties of these clusters [159]. In many applications, domain-specific information can be leveraged as well [109, 156].

2.3.2 Learning Meta-Features

Instead of manually defining meta-features, we can also learn a joint representation for groups of tasks. One approach is to build meta-models that generate a landmark-like meta-feature representation M′ given other task meta-features M and trained on performance meta-data P, or f : M → M′. Sun and Pfahringer [165] do this by evaluating a predefined set of configurations θi on all prior tasks tj, and generating a binary metafeature mj,a,b ∈ M′ for every pairwise combination of configurations θa and θb, indicating whether θa outperformed θb or not, thus m′(tj) = (mj,a,b, mj,a,c, mj,b,c, . . .). To compute mnew,a,b, meta-rules are learned for every pairwise combination (a, b), each predicting whether θa will outperform θb on task tj, given its other meta-features m(tj).

We can also learn a joint representation based entirely on the available P meta-data, i.e. f : P × Θ → M′. We previously discussed how to do this with feed-forward neural nets [114] in Sect. 2.2.3. If the tasks share the same input space, e.g., they are images of the same resolution, one can also use deep metric learning to learn a meta-feature representation, for instance, using Siamese networks [75]. These are trained by feeding the data of two different tasks to two twin networks, and using the differences between the predicted and observed performance Pi,new as the error signal. Since the model parameters between both networks are tied in a Siamese network, two very similar tasks are mapped to the same regions in the latent meta-feature space. They can be used for warm starting Bayesian hyperparameter optimization [75] and neural architecture search [2].


Table 2.1 Overview of commonly used meta-features. Groups from top to bottom: simple, statistical, information-theoretic, complexity, model-based, and landmarkers. Continuous features X and target Y have mean μX, stdev σX, variance σ²X. Categorical features X and class C have categorical values πi, conditional probabilities πi|j, joint probabilities πi,j, marginal probabilities πi+ = Σj πij, and entropy H(X) = −Σi πi+ log2(πi+).

Name | Formula | Rationale | Variants
Nr instances | n | Speed, scalability [99] | p/n, log(n), log(n/p)
Nr features | p | Curse of dimensionality [99] | log(p), % categorical
Nr classes | c | Complexity, imbalance [99] | ratio min/maj class
Nr missing values | m | Imputation effects [70] | % missing
Nr outliers | o | Data noisiness [141] | o/n
Skewness | E(X−μX)³/σX³ | Feature normality [99] | min, max, μ, σ, q1, q3
Kurtosis | E(X−μX)⁴/σX⁴ | Feature normality [99] | min, max, μ, σ, q1, q3
Correlation | ρX1X2 | Feature interdependence [99] | min, max, μ, σ, ρXY [158]
Covariance | covX1X2 | Feature interdependence [99] | min, max, μ, σ, covXY
Concentration | τX1X2 | Feature interdependence [72] | min, max, μ, σ, τXY
Sparsity | sparsity(X) | Degree of discreteness [143] | min, max, μ, σ
Gravity | gravity(X) | Inter-class dispersion [5] | –
ANOVA p-value | pvalX1X2 | Feature redundancy [70] | pvalXY [158]
Coeff. of variation | σY/μY | Variation in target [158] | –
PCA ρλ1 | √λ1/(1+λ1) | Variance in first PC [99] | λ1/Σi λi [99]
PCA skewness | – | Skewness of first PC [48] | PCA kurtosis [48]
PCA 95% | dim95%var/p | Intrinsic dimensionality [9] | –
Class probability | P(C) | Class distribution [99] | min, max, μ, σ
Class entropy | H(C) | Class imbalance [99] | –
Norm. entropy | H(X)/log2(n) | Feature informativeness [26] | min, max, μ, σ
Mutual inform. | MI(C,X) | Feature importance [99] | min, max, μ, σ
Uncertainty coeff. | MI(C,X)/H(C) | Feature importance [3] | min, max, μ, σ
Equiv. nr. feats | H(C)/MI(C,X) | Intrinsic dimensionality [99] | –
Noise-signal ratio | (H(X)−MI(C,X))/MI(C,X) | Noisiness of data [99] | –
Fisher’s discrimin. | (μc1−μc2)²/(σ²c1−σ²c2) | Separability classes c1, c2 [64] | See [64]
Volume of overlap | – | Class distribution overlap [64] | See [64]
Concept variation | – | Task complexity [180] | See [179, 180]
Data consistency | – | Data quality [76] | See [76]
Nr nodes, leaves | |η|, |ψ| | Concept complexity [113] | Tree depth
Branch length | – | Concept complexity [113] | min, max, μ, σ
Nodes per feature | |ηX| | Feature importance [113] | min, max, μ, σ
Leaves per class | |ψc|/|ψ| | Class complexity [49] | min, max, μ, σ
Leaves agreement | nψi/n | Class separability [16] | min, max, μ, σ
Information gain | – | Feature importance [16] | min, max, μ, σ, gini
Landmarker(1NN) | P(θ1NN, tj) | Data sparsity [115] | Elite 1NN [115]
Landmarker(Tree) | P(θTree, tj) | Data separability [115] | Stump, RandomTree
Landmarker(Lin) | P(θLin, tj) | Linear separability [115] | Lin. Discriminant
Landmarker(NB) | P(θNB, tj) | Feature independence [115] | More models [14, 88]
Relative LM | Pa,j − Pb,j | Probing performance [53] | –
Subsample LM | P(θi, tj, st) | Probing performance [160] | –

2.3.3 Warm-Starting Optimization from Similar Tasks

Meta-features are a very natural way to estimate task similarity and initialize optimization procedures based on promising configurations on similar tasks. This is akin to how human experts start a manual search for good models, given experience on related tasks.

First, starting a genetic search algorithm in regions of the search space with promising solutions can significantly speed up convergence to a good solution. Gomes et al. [59] recommend initial configurations by finding the k most similar prior tasks tj based on the L1 distance between vectors m(tj) and m(tnew), where each m(tj) includes 17 simple and statistical meta-features. For each of the k most similar tasks, the best configuration is evaluated on tnew, and used to initialize a genetic search algorithm (Particle Swarm Optimization), as well as Tabu Search. Reif et al. [129] follow a very similar approach, using 15 simple, statistical, and landmarking meta-features. They use a forward selection technique to find the most useful meta-features, and warm-start a standard genetic algorithm (GAlib) with a modified Gaussian mutation operation. Variants of active testing (see Sect. 2.2.3) that use meta-features were also tried [85, 100], but did not perform better than the approaches based on relative landmarks.
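A minimal sketch of this kind of meta-feature-based warm-starting: given the meta-feature vector of tnew, it returns the best known configurations of the k most similar prior tasks under the L1 distance, after a simple min-max normalization of the meta-features (the normalization choice and all names are assumptions for illustration, not prescribed by the cited works).

```python
import numpy as np

def warm_start_configs(m_new, prior_meta_features, best_config_per_task, k=3):
    """Return the best known configurations of the k prior tasks most similar
    to t_new, using the L1 distance between normalized meta-feature vectors."""
    names = list(prior_meta_features)
    M = np.array([prior_meta_features[t] for t in names], dtype=float)
    # min-max normalize each meta-feature so no single one dominates the distance
    lo, hi = M.min(axis=0), M.max(axis=0)
    scale = np.where(hi > lo, hi - lo, 1.0)
    M_norm = (M - lo) / scale
    m_norm = (np.asarray(m_new, dtype=float) - lo) / scale
    dists = np.abs(M_norm - m_norm).sum(axis=1)
    nearest = [names[i] for i in np.argsort(dists)[:k]]
    return [best_config_per_task[t] for t in nearest]
```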

Also model-based optimization approaches can benefit greatly from an initial set of promising configurations. SCoT [9] trains a single surrogate ranking model f : M × Θ → R, predicting the rank of θi on task tj. M contains 4 meta-features (3 simple ones and one based on PCA). The surrogate model is trained on all the rankings, including those on tnew. Ranking is used because the scale of evaluation values can differ greatly between tasks. A GP regression converts the ranks to probabilities to do Bayesian optimization, and each new Pi,new is used to retrain the surrogate model after every step.

Schilling et al. [148] use a modified multilayer perceptron as a surrogate model, of the form sj(θi, m(tj), b(tj)) = Pi,j, where m(tj) are the meta-features and b(tj) is a vector of j binary indications which are 1 if the meta-instance is from tj and 0 otherwise. The multi-layer perceptron uses a modified activation function based on factorization machines [132] in the first layer, aimed at learning a latent representation for each task to model task similarities. Since this model cannot represent uncertainties, an ensemble of 100 multilayer perceptrons is trained to get predictive means and simulate variances.

Training a single surrogate model on all prior meta-data is often less scalable. Yogatama and Mann [190] also build a single Bayesian surrogate model, but only include tasks similar to tnew, where task similarity is defined as the Euclidean distance between meta-feature vectors consisting of 3 simple meta-features. The Pi,j values are standardized to overcome the problem of different scales for each tj. The surrogate model learns a Gaussian process with a specific kernel combination on all instances.

Feurer et al. [48] offer a simpler, more scalable method that warm-starts Bayesian optimization by sorting all prior tasks tj similar to [59], but including 46 simple, statistical, and landmarking meta-features, as well as H(C). The t best configurations on the d most similar tasks are used to warm-start the surrogate model. They search over many more hyperparameters than earlier work, including preprocessing steps. This warm-starting approach was also used in later work [46], which is discussed in detail in Chap. 6.

Finally, one can also use collaborative filtering to recommend promising configurations [162]. By analogy, the tasks tj (users) provide ratings (Pi,j) for the configurations θi (items), and matrix factorization techniques are used to predict unknown Pi,j values and recommend the best configurations for any task. An important issue here is the cold start problem, since the matrix factorization requires at least some evaluations on tnew. Yang et al. [189] use a D-optimal experiment design to sample an initial set of evaluations Pi,new. They predict both the predictive performance and runtime, to recommend a set of warm-start configurations that are both accurate and fast. Misir and Sebag [102, 103] leverage meta-features to solve the cold start problem. Fusi et al. [54] also use meta-features, following the same procedure as [46], and use a probabilistic matrix factorization approach that allows them to perform Bayesian optimization to further optimize their pipeline configurations θi. This approach yields useful latent embeddings of both the tasks and configurations, in which the Bayesian optimization can be performed more efficiently.
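The sketch below shows the basic collaborative-filtering idea with a plain SGD matrix factorization over the task × configuration performance matrix; it is a simplified stand-in, not the probabilistic matrix factorization of [54] or the D-optimal design of [189]. Unknown entries are marked with NaN, and all hyperparameters are illustrative.

```python
import numpy as np

def factorize(P, rank=5, steps=2000, lr=0.01, reg=0.1, seed=0):
    """Complete a sparse task x configuration performance matrix P (np.nan for
    unknown entries) with a simple SGD matrix factorization, so that unknown
    P_i,j can be predicted and promising configurations recommended per task."""
    rng = np.random.default_rng(seed)
    n_tasks, n_configs = P.shape
    U = 0.1 * rng.standard_normal((n_tasks, rank))     # latent task factors
    V = 0.1 * rng.standard_normal((n_configs, rank))   # latent config factors
    known = np.argwhere(~np.isnan(P))                  # observed (task, config) pairs
    for _ in range(steps):
        i, j = known[rng.integers(len(known))]
        u_i = U[i].copy()
        err = P[i, j] - u_i @ V[j]
        U[i] += lr * (err * V[j] - reg * u_i)
        V[j] += lr * (err * u_i - reg * V[j])
    return U @ V.T   # predicted performance for every (task, configuration) pair
```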


2.3.4 Meta-Models

We can also learn the complex relationship between a task’s meta-features and the utility of specific configurations by building a meta-model L that recommends the most useful configurations Θ∗new given the meta-features M of the new task tnew. There exists a rich body of earlier work [22, 56, 87, 94] on building meta-models for algorithm selection [15, 19, 70, 115] and hyperparameter recommendation [4, 79, 108, 158]. Experiments showed that boosted and bagged trees often yielded the best predictions, although much depends on the exact meta-features used [72, 76].

2.3.4.1 Ranking

Meta-models can also generate a ranking of the top-K most promising configurations. One approach is to build a k-nearest neighbor (kNN) meta-model to predict which tasks are similar, and then rank the best configurations on these similar tasks [23, 147]. This is similar to the work discussed in Sect. 2.3.3, but without ties to a follow-up optimization approach. Meta-models specifically meant for ranking, such as predictive clustering trees [171] and label ranking trees [29], were also shown to work well. Approximate Ranking Tree Forests (ART Forests) [165], ensembles of fast ranking trees, prove to be especially effective, since they have ‘built-in’ meta-feature selection, work well even if few prior tasks are available, and the ensembling makes the method more robust. autoBagging [116] ranks Bagging workflows including four different Bagging hyperparameters, using an XGBoost-based ranker, trained on 140 OpenML datasets and 146 meta-features. Lorena et al. [93] recommend SVM configurations for regression problems using a kNN meta-model and a new set of meta-features based on data complexity.

2.3.4.2 Performance Prediction

Meta-models can also directly predict the performance, e.g. accuracy or training time, of a configuration on a given task, given its meta-features. This allows us to estimate whether a configuration will be interesting enough to evaluate in any optimization procedure. Early work used linear regression or rule-based regressors to predict the performance of a discrete set of configurations and then rank them accordingly [14, 77]. Guerra et al. [61] train an SVM meta-regressor per classification algorithm to predict its accuracy, under default settings, on a new task tnew given its meta-features. Reif et al. [130] train a similar meta-regressor on more meta-data to predict its optimized performance. Davis et al. [32] use a MultiLayer Perceptron based meta-learner instead, predicting the performance of a specific algorithm configuration.
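A minimal sketch of such a meta-regressor in the spirit of [61, 130]: an SVM regressor that predicts an algorithm's accuracy on a task from the task's meta-features. The variable names and model settings are assumptions for illustration.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

def performance_regressor(meta_features_prior, accuracy_prior):
    """Meta-regressor: one training row per prior task, target = observed accuracy
    of a given algorithm on that task."""
    reg = make_pipeline(StandardScaler(), SVR(C=1.0, epsilon=0.01))
    reg.fit(meta_features_prior, accuracy_prior)
    return reg   # reg.predict([meta_features_new]) estimates accuracy on t_new
```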

Instead of predicting predictive performance, a meta-regressor can also be trained to predict algorithm training/prediction time, for instance, using an SVM regressor trained on meta-features [128], itself tuned via genetic algorithms [119]. Yang et al. [189] predict configuration runtime using polynomial regression, based only on the number of instances and features. Hutter et al. [68] provide a general treatise on predicting algorithm runtime in various domains.
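As a small illustration of the polynomial-regression idea above, the following sketch fits runtime as a low-degree polynomial of the number of instances and features only; the degree, the linear model, and the argument names are assumptions, not the exact model of [189].

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

def runtime_model(n_instances, n_features, runtimes, degree=2):
    """Predict configuration runtime from dataset size alone."""
    X = np.column_stack([n_instances, n_features])    # one row per prior run
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, runtimes)
    return model   # model.predict([[n_inst_new, n_feat_new]]) estimates runtime on t_new
```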

Most of these meta-models generate promising configurations, but don't actually tune these configurations to tnew themselves. Instead, the predictions can be used to warm-start or guide any other optimization technique, which allows for all kinds of combinations of meta-models and optimization techniques. Indeed, some of the work discussed in Sect. 2.3.3 can be seen as using a distance-based meta-model to warm-start Bayesian optimization [48, 54] or evolutionary algorithms [59, 129]. In principle, other meta-models could be used here as well.

Instead of learning the relationship between a task's meta-features and configuration performance, one can also build surrogate models predicting the performance of configurations on specific tasks [40]. One can then learn how to combine these per-task predictions to warm-start or guide optimization techniques on a new task tnew [45, 114, 161, 187], as discussed in Sect. 2.2.3. While meta-features could also be used to combine per-task predictions based on task similarity, it is ultimately more effective to gather new observations Pi,new, since these allow us to refine the task similarity estimates with every new observation [47, 85, 187].

2.3.5 Pipeline Synthesis

When creating entire machine learning pipelines [153], the number of configuration options grows dramatically, making it even more important to leverage prior experience. One can control the search space by imposing a fixed structure on the pipeline, fully described by a set of hyperparameters. One can then use the most promising pipelines on similar tasks to warm-start a Bayesian optimization [46, 54].

Other approaches give recommendations for certain pipeline steps [118, 163], which can be leveraged in larger pipeline construction approaches, such as planning [55, 74, 105, 184] or evolutionary techniques [110, 164]. Nguyen et al. [105] construct new pipelines using a beam search focused on components recommended by a meta-learner, which is itself trained on examples of successful prior pipelines. Bilalli et al. [18] predict which pre-processing techniques are recommended for a given classification algorithm. They build a meta-model per target classification algorithm that, given the tnew meta-features, predicts which preprocessing technique should be included in the pipeline. Similarly, Schoenfeld et al. [152] build meta-models predicting when a preprocessing algorithm will improve a particular classifier's accuracy or runtime.
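A minimal sketch of such a per-classifier meta-model, in the spirit of [18, 152]: a binary classifier that, given task meta-features, predicts whether a particular preprocessing step helped the target algorithm on prior tasks. The labels, model choice, and names are illustrative assumptions.

```python
from sklearn.ensemble import GradientBoostingClassifier

def preprocessing_recommender(meta_features_prior, helped_prior):
    """One meta-model per (target classifier, preprocessing step) pair.
    helped_prior: 0/1 labels saying whether the step improved the classifier
    on each prior task."""
    clf = GradientBoostingClassifier(random_state=0)
    clf.fit(meta_features_prior, helped_prior)
    return clf   # clf.predict([meta_features_new]) -> include the step or not
```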

AlphaD3M [38] uses a self-play reinforcement learning approach in which the current state is represented by the current pipeline, and actions include the addition, deletion, or replacement of pipeline components. A Monte Carlo Tree Search (MCTS) generates pipelines, which are evaluated to train a recurrent neural network (LSTM) that can predict pipeline performance, in turn producing the action probabilities for the MCTS in the next round. The state description also includes meta-features of the current task, allowing the neural network to learn across tasks. Mosaic [123] also generates pipelines using MCTS, but instead uses a bandits-based approach to select promising pipelines.

2.3.6 To Tune or Not to Tune?

To reduce the number of configuration parameters to be optimized, and to save valuable optimization time in time-constrained settings, meta-models have also been proposed to predict whether or not it is worth tuning a given algorithm given the meta-features of the task at hand [133], and how much improvement we can expect from tuning a specific algorithm versus the additional time investment [144]. More focused studies on specific learning algorithms yielded meta-models predicting when it is necessary to tune SVMs [96], what are good default hyperparameters for SVMs given the task (including interpretable meta-models) [97], and how to tune decision trees [95].

2.4 Learning from Prior Models

The final type of meta-data we can learn from are prior machine learning models themselves, i.e., their structure and learned model parameters. In short, we want to train a meta-learner L that learns how to train a (base-) learner lnew for a new task tnew, given similar tasks tj ∈ T and the corresponding optimized models lj ∈ L, where L is the space of all possible models. The learner lj is typically defined by its model parameters W = {wk}, k = 1 . . . K and/or its configuration θi ∈ Θ.

2.4.1 Transfer Learning

In transfer learning [170], we take models trained on one or more source tasks tj, and use them as starting points for creating a model on a similar target task tnew. This can be done by forcing the target model to be structurally or otherwise similar to the source model(s). This is a generally applicable idea, and transfer learning approaches have been proposed for kernel methods [41, 42], parametric Bayesian models [8, 122, 140], Bayesian networks [107], clustering [168] and reinforcement learning [36, 62]. Neural networks, however, are exceptionally suitable for transfer learning because both the structure and the model parameters of the source models can be used as a good initialization for the target model, yielding a pre-trained model which can then be further fine-tuned using the available training data on tnew [11, 13, 24, 169]. In some cases, the source network may need to be modified before transferring it [155]. We will focus on neural networks in the remainder of this section.
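A minimal sketch of this pre-train-then-fine-tune recipe for neural networks, assuming a recent torchvision is available; the architecture choice (ResNet-18), the decision to freeze the backbone, and the helper name are illustrative, not prescribed by the cited work.

```python
import torch.nn as nn
from torchvision import models

def build_finetune_model(num_classes, freeze_backbone=True):
    """Start from an ImageNet-pretrained network, replace the output layer for
    the target task, and optionally freeze the transferred weights so that only
    the new head is trained before (or instead of) full fine-tuning."""
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    if freeze_backbone:
        for p in model.parameters():
            p.requires_grad = False          # keep the transferred representation fixed
    model.fc = nn.Linear(model.fc.in_features, num_classes)   # new task-specific head
    return model
```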


Especially large image datasets, such as ImageNet [78], have been shown to yield pre-trained models that transfer exceptionally well to other tasks [37, 154]. However, it has also been shown that this approach doesn't work well when the target task is not so similar [191]. Rather than hoping that a pre-trained model 'accidentally' transfers well to a new problem, we can purposefully imbue meta-learners with an inductive bias (learned from many similar tasks) that allows them to learn new tasks much faster, as we will discuss below.

2.4.2 Meta-Learning in Neural Networks

An early meta-learning approach is to create recurrent neural networks (RNNs) able to modify their own weights [149, 150]. During training, they use their own weights as additional input data and observe their own errors to learn how to modify these weights in response to the new task at hand. The updating of the weights is defined in a parametric form that is differentiable end-to-end and can jointly optimize both the network and training algorithm using gradient descent, yet is also very difficult to train. Later work used reinforcement learning across tasks to adapt the search strategy [151] or the learning rate for gradient descent [31] to the task at hand.

Inspired by the feeling that backpropagation is an unlikely learning mechanism for our own brains, Bengio et al. [12] replace backpropagation with simple biologically-inspired parametric rules (or evolved rules [27]) to update the synaptic weights. The parameters are optimized, e.g. using gradient descent or evolution, across a set of input tasks. Runarsson and Jonsson [142] replaced these parametric rules with a single layer neural network. Santoro et al. [146] instead use a memory-augmented neural network to learn how to store and retrieve 'memories' of prior classification tasks. Hochreiter et al. [65] use LSTMs [66] as a meta-learner to train multi-layer perceptrons.

Andrychowicz et al. [6] also replace the optimizer, e.g. stochastic gradient descent, with an LSTM trained on multiple prior tasks. The loss of the meta-learner (optimizer) is defined as the sum of the losses of the base-learners (optimizees), and optimized using gradient descent. At every step, the meta-learner chooses the weight update estimated to reduce the optimizee's loss the most, based on the learned model weights {wk} of the previous step as well as the current performance gradient. Later work generalizes this approach by training an optimizer on synthetic functions, using gradient descent [28]. This allows meta-learners to optimize optimizees even if these do not have access to gradients.

In parallel, Li and Malik [89] proposed a framework for learning optimization algorithms from a reinforcement learning perspective. It represents any particular optimization algorithm as a policy, and then learns this policy via guided policy search. Follow-up work [90] shows how to leverage this approach to learn optimization algorithms for (shallow) neural networks.

The field of neural architecture search includes many other methods that build a model of neural network performance for a specific task, for instance using Bayesian optimization or reinforcement learning. See Chap. 3 for an in-depth discussion. However, most of these methods do not (yet) generalize across tasks and are therefore not discussed here.

2.4.3 Few-Shot Learning

A particularly challenging meta-learning problem is to train an accurate deep learning model using only a few training examples, given prior experience with very similar tasks for which we have large training sets available. This is called few-shot learning. Humans have an innate ability to do this, and we wish to build machine learning agents that can do the same [82]. A particular example of this is 'K-shot N-way' classification, in which we are given many examples (e.g., images) of certain classes (e.g., objects), and want to learn a classifier lnew able to classify N new classes using only K examples of each.
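To make the K-shot N-way setting concrete, the sketch below samples one training episode (N classes, K support examples plus some query examples per class) from a labelled dataset; the dataset format and class-count assumptions are illustrative.

```python
import random
from collections import defaultdict

def sample_episode(dataset, n_way=5, k_shot=1, query_per_class=15):
    """Sample one K-shot N-way episode from a list of (example, label) pairs.
    Assumes every class has at least k_shot + query_per_class examples."""
    by_class = defaultdict(list)
    for x, y in dataset:
        by_class[y].append(x)
    classes = random.sample(list(by_class), n_way)       # pick N classes for this episode
    support, query = [], []
    for new_label, c in enumerate(classes):              # relabel classes 0..N-1
        examples = random.sample(by_class[c], k_shot + query_per_class)
        support += [(x, new_label) for x in examples[:k_shot]]
        query += [(x, new_label) for x in examples[k_shot:]]
    return support, query   # the meta-learner is trained on many such episodes
```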

Using prior experience, we can, for instance, learn a common feature representation of all the tasks, start training lnew with a better model parameter initialization Winit, and acquire an inductive bias that helps guide the optimization of the model parameters, so that lnew can be trained much faster than otherwise possible.

Earlier work on one-shot learning is largely based on hand-engineered features [10, 43, 44, 50]. With meta-learning, however, we hope to learn a common feature representation for all tasks in an end-to-end fashion.

Vinyals et al. [181] state that, to learn from very little data, one should look to non-parametric models (such as k-nearest neighbors), which use a memory component rather than learning many model parameters. Their meta-learner is a Matching Network that applies the idea of a memory component in a neural net. It learns a common representation for the labelled examples, and matches each new test instance to the memorized examples using cosine similarity. The network is trained on minibatches with only a few examples of a specific task each.

Snell et al. [157] propose Prototypical Networks, which map examples to a p-dimensional vector space such that examples of a given output class are close together. It then calculates a prototype (mean vector) for every class. New test instances are mapped to the same vector space and a distance metric is used to create a softmax over all possible classes. Ren et al. [131] extend this approach to semi-supervised learning.
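A minimal sketch of the prototype computation under these assumptions: an embedding network `embed`, support/query tensors from one episode, and negative squared Euclidean distance as the similarity that the softmax is applied to. It illustrates the idea rather than reproducing the reference implementation.

```python
import torch

def prototypical_logits(embed, support_x, support_y, query_x, n_way):
    """Class prototypes = mean embedding of each class's support examples;
    queries are scored by negative squared distance to each prototype."""
    z_support = embed(support_x)                           # (n_way * k_shot, d)
    z_query = embed(query_x)                               # (n_query, d)
    prototypes = torch.stack(
        [z_support[support_y == c].mean(0) for c in range(n_way)])
    logits = -torch.cdist(z_query, prototypes) ** 2        # (n_query, n_way)
    return logits   # cross-entropy on these logits trains `embed` across episodes
```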

Ravi and Larochelle [126] use an LSTM-based meta-learner to learn an update rule for training a neural network learner. With every new example, the learner returns the current gradient and loss to the LSTM meta-learner, which then updates the model parameters {wk} of the learner. The meta-learner is trained across all prior tasks.

Model-Agnostic Meta-Learning (MAML) [51], on the other hand, does not try to learn an update rule, but instead learns a model parameter initialization Winit that generalizes better to similar tasks. Starting from a random {wk}, it iteratively selects a batch of prior tasks, and for each it trains the learner on K examples to compute the gradient and loss (on a test set). It then backpropagates the meta-gradient to update the weights {wk} in the direction in which they would have been easier to update. In other words, after each iteration, the weights {wk} become a better Winit to start finetuning any of the tasks. Finn and Levine [52] also argue that MAML is able to approximate any learning algorithm when using a sufficiently deep fully connected ReLU network and certain losses. They also conclude that the MAML initializations are more resilient to overfitting on small samples, and generalize more widely than meta-learning approaches based on LSTMs.
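The sketch below shows one meta-update with a first-order approximation of MAML on a tiny, functionally written network; it omits the second-order term of full MAML, and all names (mlp, fomaml_step, the task tuple format, the step sizes) are illustrative assumptions rather than the original algorithm's implementation.

```python
import torch

def mlp(params, x):
    """Tiny two-layer network written functionally, so it can be evaluated
    with arbitrary (adapted) parameter tensors."""
    w1, b1, w2, b2 = params
    return torch.relu(x @ w1 + b1) @ w2 + b2

def fomaml_step(params, tasks, loss_fn, inner_lr=0.01, outer_lr=0.001):
    """One meta-update: adapt a copy of the initialization on each task's
    support set, then move the initialization using the adapted parameters'
    gradient on the task's query set (first-order approximation).
    `tasks` is a list of (support_x, support_y, query_x, query_y) tensors."""
    meta_grads = [torch.zeros_like(p) for p in params]
    for sx, sy, qx, qy in tasks:
        fast = [p.clone().detach().requires_grad_(True) for p in params]
        grads = torch.autograd.grad(loss_fn(mlp(fast, sx), sy), fast)        # inner step
        fast = [(w - inner_lr * g).detach().requires_grad_(True)
                for w, g in zip(fast, grads)]
        outer = torch.autograd.grad(loss_fn(mlp(fast, qx), qy), fast)        # query gradient
        meta_grads = [mg + g for mg, g in zip(meta_grads, outer)]
    with torch.no_grad():                        # move W_init; second-order terms ignored
        for p, mg in zip(params, meta_grads):
            p -= outer_lr * mg / len(tasks)
```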

REPTILE [106] is an approximation of MAML that executes stochastic gradient descent for K iterations on a given task, and then gradually moves the initialization weights in the direction of the weights obtained after the K iterations. The intuition is that every task likely has more than one set of optimal weights {w*i}, and the goal is to find a Winit that is close to at least one of those {w*i} for every task.
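A minimal sketch of one Reptile meta-update under stated assumptions (a PyTorch model, an iterable of batches from one sampled task, and illustrative step sizes); it is not the reference implementation.

```python
import copy
import torch

def reptile_step(model, task_batches, loss_fn, inner_lr=0.01, eps=0.1):
    """Run SGD on batches from a single sampled task, then move the shared
    initialization a small step towards the adapted weights."""
    adapted = copy.deepcopy(model)
    opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
    for x, y in task_batches:                    # K inner SGD iterations on this task
        opt.zero_grad()
        loss_fn(adapted(x), y).backward()
        opt.step()
    with torch.no_grad():                        # W_init += eps * (W_adapted - W_init)
        for p, q in zip(model.parameters(), adapted.parameters()):
            p += eps * (q - p)
    return model
```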

Finally, we can also derive a meta-learner from a black-box neural network. Santoro et al. [145] propose Memory-Augmented Neural Networks (MANNs), which train a Neural Turing Machine (NTM) [60], a neural network with augmented memory capabilities, as a meta-learner. This meta-learner can then memorize information about previous tasks and leverage that to learn a learner lnew. SNAIL [101] is a generic meta-learner architecture consisting of interleaved temporal convolution and causal attention layers. The convolutional networks learn a common feature vector for the training instances (images) to aggregate information from past experiences. The causal attention layers learn which pieces of information to pick out from the gathered experience to generalize to new tasks.

Overall, the intersection of deep learning and meta-learning proves to be particularly fertile ground for groundbreaking new ideas, and we expect this field to become more important over time.

2.4.4 Beyond Supervised Learning

Meta-learning is certainly not limited to (semi-)supervised tasks, and has been successfully applied to solve tasks as varied as reinforcement learning, active learning, density estimation and item recommendation. The base-learner may be unsupervised while the meta-learner is supervised, but other combinations are certainly possible as well.

Duan et al. [39] propose an end-to-end reinforcement learning (RL) approach consisting of a task-specific fast RL algorithm which is guided by a general-purpose slow meta-RL algorithm. The tasks are interrelated Markov Decision Processes (MDPs). The meta-RL algorithm is modeled as an RNN, which receives the observations, actions, rewards and termination flags. The activations of the RNN store the state of the fast RL learner, and the RNN's weights are learned by observing the performance of fast learners across tasks.

In parallel, Wang et al. [182] also proposed to use a deep RL algorithm to train an RNN, receiving the actions and rewards of the previous interval in order to learn a base-level RL algorithm for specific tasks. Rather than using relatively unstructured tasks such as random MDPs, they focus on structured task distributions (e.g., dependent bandits) in which the meta-RL algorithm can exploit the inherent task structure.

Pang et al. [112] offer a meta-learning approach to active learning (AL). The base-learner can be any binary classifier, and the meta-learner is a deep RL network consisting of a deep neural network that learns a representation of the AL problem across tasks, and a policy network that learns the optimal policy, parameterized as weights in the network. The meta-learner receives the current state (the unlabeled point set and base classifier state) and reward (the performance of the base classifier), and emits a query probability, i.e. which points in the unlabeled set to query next.

Reed et al. [127] propose a few-shot approach for density estimation (DE). The goal is to learn a probability distribution over a small number of images of a certain concept (e.g., a handwritten letter) that can be used to generate images of that concept, or compute the probability that an image shows that concept. The approach uses autoregressive image models which factorize the joint distribution into per-pixel factors. Usually these are conditioned on (many) examples of the target concept. Instead, a MAML-based few-shot learner is used, trained on examples of many other (similar) concepts.

Finally, Vartak et al. [178] address the cold-start problem in matrix factorization. They propose a deep neural network architecture that learns a (base) neural network whose biases are adjusted based on task information. While the structure and weights of the neural net recommenders remain fixed, the meta-learner learns how to adjust the biases based on each user's item history.

All these recent new developments illustrate that it is often fruitful to look at problems through a meta-learning lens and find new, data-driven approaches to replace hand-engineered base-learners.

2.5 Conclusion

Meta-learning opportunities present themselves in many different ways, and can be embraced using a wide spectrum of learning techniques. Every time we try to learn a certain task, whether successful or not, we gain useful experience that we can leverage to learn new tasks. We should never have to start entirely from scratch. Instead, we should systematically collect our 'learning experiences' and learn from them to build AutoML systems that continuously improve over time, helping us tackle new learning problems ever more efficiently. The more new tasks we encounter, and the more similar those new tasks are, the more we can tap into prior experience, to the point that most of the required learning has already been done beforehand. The ability of computer systems to store virtually infinite amounts of prior learning experiences (in the form of meta-data) opens up a wide range of opportunities to use that experience in completely new ways, and we are only starting to learn how to learn from prior experience effectively. Yet, this is a worthy goal: learning how to learn any task empowers us far beyond knowing how to learn any specific task.

Acknowledgements The author would like to thank Pavel Brazdil, Matthias Feurer, Frank Hutter, Raghu Rajan, Erin Grant, Hugo Larochelle, Jan van Rijn and Jane Wang for many invaluable discussions and feedback on the manuscript.

Bibliography

1. Abdulrahman, S., Brazdil, P., van Rijn, J., Vanschoren, J.: Speeding up Algorithm Selectionusing Average Ranking and Active Testing by Introducing Runtime. Machine Learning 107,79–108 (2018)

2. Afif, I.N.: Warm-Starting Deep Learning Model Construction using Meta-Learning. Master’sthesis, TU Eindhoven (2018)

3. Agresti, A.: Categorical Data Analysis. Wiley Interscience (2002)4. Ali, S., Smith-Miles, K.A.: Metalearning approach to automatic kernel selection for support

vector machines. Neurocomputing 70(1), 173–186 (2006)5. Ali, S., Smith-Miles, K.A.: On learning algorithm selection for classification. Applied Soft

Computing 6(2), 119–138 (2006)6. Andrychowicz, M., Denil, M., Gomez, S., Hoffman, M.W., Pfau, D., Schaul, T., Shillingford,

B., De Freitas, N.: Learning to learn by gradient descent by gradient descent. In: Advances inNeural Information Processing Systems. pp. 3981–3989 (2016)

7. Arinze, B.: Selecting appropriate forecasting models using rule induction. Omega 22(6), 647–658 (1994)

8. Bakker, B., Heskes, T.: Task Clustering and Gating for Bayesian Multitask Learning. Journalof Machine Learning Research 4, 83–999 (2003)

9. Bardenet, R., Brendel, M., Kégl, B., Sebag, M.: Collaborative hyperparameter tuning. In:Proceedings of ICML 2013. pp. 199–207 (2013)

10. Bart, E., Ullman, S.: Cross-generalization: Learning novel classes from a single example byfeature replacement. In: Proceedings of CVPR 2005. pp. 672–679 (2005)

11. Baxter, J.: Learning Internal Representations. In: Advances in Neural Information ProcessingSystems, NeurIPS (1996)

12. Bengio, S., Bengio, Y., Cloutier, J.: On the search for new learning rules for anns. NeuralProcessing Letters 2(4), 26–30 (1995)

13. Bengio, Y.: Deep learning of representations for unsupervised and transfer learning. In: ICMLWorkshop on Unsupervised and Transfer Learning. pp. 17–36 (2012)

14. Bensusan, H., Kalousis, A.: Estimating the predictive accuracy of a classifier. Lecture Notesin Computer Science 2167, 25–36 (2001)

15. Bensusan, H., Giraud-Carrier, C.: Discovering task neighbourhoods through landmark learn-ing performances. In: Proceedings of PKDD 2000. pp. 325–330 (2000)

16. Bensusan, H., Giraud-Carrier, C., Kennedy, C.: A higher-order approach to meta-learning. In:Proceedings of ILP 2000. pp. 33–42 (2000)

17. Bilalli, B., Abelló, A., Aluja-Banet, T.: On the predictive power of meta-features in OpenML.International Journal of Applied Mathematics and Computer Science 27(4), 697–712 (2017)

18. Bilalli, B., Abelló, A., Aluja-Banet, T., Wrembel, R.: Intelligent assistance for data pre-processing. Computer Standards and Interfaces 57, 101–109 (2018)

19. Bischl, B., Kerschke, P., Kotthoff, L., Lindauer, M., Malitsky, Y., Fréchette, A., Hoos, H.,Hutter, F., Leyton-Brown, K., Tierney, K., Vanschoren, J.: ASLib: A benchmark library foralgorithm selection. Artificial Intelligence 237, 41–58 (2016)

20. Bishop, C.M.: Pattern recognition and machine learning. Springer (2006)


21. Brazdil, P., Soares, C., da Costa, J.P.: Ranking learning algorithms: Using IBL and meta-learning on accuracy and time results. Machine Learning 50(3), 251–277 (2003)

22. Brazdil, P., Giraud-Carrier, C., Soares, C., Vilalta, R.: Metalearning: Applications to DataMining. Springer-Verlag Berlin Heidelberg (2009)

23. Brazdil, P.B., Soares, C., Da Coasta, J.P.: Ranking learning algorithms: Using IBL and meta-learning on accuracy and time results. Machine Learning 50(3), 251–277 (2003)

24. Caruana, R.: Learning many related tasks at the same time with backpropagation. NeuralInformation Processing Systems pp. 657–664 (1995)

25. Caruana, R.: Multitask Learning. Machine Learning 28(1), 41–75 (1997)26. Castiello, C., Castellano, G., Fanelli, A.M.: Meta-data: Characterization of input features

for meta-learning. In: 2nd International Conference on Modeling Decisions for ArtificialIntelligence (MDAI). pp. 457–468 (2005)

27. Chalmers, D.J.: The evolution of learning: An experiment in genetic connectionism. In:Connectionist Models, pp. 81–90. Elsevier (1991)

28. Chen, Y., Hoffman, M.W., Colmenarejo, S.G., Denil, M., Lillicrap, T.P., Botvinick, M., deFreitas, N.: Learning to learn without gradient descent by gradient descent. In: Proceedingsof ICML 2017, PMLR 70, pp. 748–756 (2017)

29. Cheng, W., Hühn, J., Hüllermeier, E.: Decision tree and instance-based learning for labelranking. In: Proceedings of ICML 2009. pp. 161–168 (2009)

30. Cook, W.D., Kress, M., Seiford, L.W.: A general framework for distance-based consensus inordinal ranking models. European Journal of Operational Research 96(2), 392–397 (1996)

31. Daniel, C., Taylor, J., Nowozin, S.: Learning step size controllers for robust neural networktraining. In: Proceedings of AAAI 2016. pp. 1519–1525 (2016)

32. Davis, C., Giraud-Carrier, C.: Annotative experts for hyperparameter selection. In: AutoMLWorkshop at ICML 2018 (2018)

33. De Sa, A., Pinto, W., Oliveira, L.O., Pappa, G.: RECIPE: A grammar-based frameworkfor automatically evolving classification pipelines. In: European Conference on GeneticProgramming. pp. 246–261 (2017)

34. Demšar, J.: Statistical Comparisons of Classifiers over Multiple Data Sets. Journal of MachineLearning Research 7, 1–30 (2006)

35. Dietterich, T.: Ensemble methods in machine learning. In: International workshop on multipleclassifier systems. pp. 1–15 (2000)

36. Dietterich, T., Busquets, D., Lopez de Mantaras, R., Sierra, C.: Action Refinement inReinforcement Learning by Probability Smoothing. In: 19th International Conference onMachine Learning. pp. 107–114 (2002)

37. Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., Darrell, T.: DeCAF: Adeep convolutional activation feature for generic visual recognition. In: Proceedings of ICML2014. pp. 647–655 (2014)

38. Drori, I., Krishnamurthy, Y., Rampin, R., de Paula Lourenco, R., Ono, J.P., Cho, K., Silva, C.,Freire, J.: AlphaD3M: Machine learning pipeline synthesis. In: AutoML Workshop at ICML(2018)

39. Duan, Y., Schulman, J., Chen, X., Bartlett, P.L., Sutskever, I., Abbeel, P.: RL2: Fastreinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779(2016)

40. Eggensperger, K., Lindauer, M., Hoos, H., Hutter, F., Leyton-Brown, K.: Efficient Bench-marking of Algorithm Configuration Procedures via Model-Based Surrogates . MachineLearning 107, 15–41 (2018)

41. Evgeniou, T., Micchelli, C., Pontil, M.: Learning Multiple Tasks with Kernel Methods.Journal of Machine Learning Research 6, 615–637 (2005)

42. Evgeniou, T., Pontil, M.: Regularized multi-task learning. In: Tenth Conference on Knowl-edge Discovery and Data Mining (2004)

43. Fei-Fei, L.: Knowledge transfer in learning to recognize visual objects classes. In: Interna-tional Conference on Development and Learning. Art. 51 (2006)

44. Fei-Fei, L., Fergus, R., Perona, P.: One-shot learning of object categories. Pattern analysis andmachine intelligence 28(4), 594–611 (2006)


45. Feurer, M., Letham, B., Bakshy, E.: Scalable meta-learning for Bayesian optimization. arXiv1802.02219 (2018)

46. Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., Hutter, F.: Efficient androbust automated machine learning. In: Advances in Neural Information Processing Systems28. pp. 2944–2952 (2015)

47. Feurer, M., Letham, B., Bakshy, E.: Scalable meta-learning for Bayesian optimization usingranking-weighted gaussian process ensembles. In: AutoML Workshop at ICML 2018 (2018)

48. Feurer, M., Springenberg, J.T., Hutter, F.: Using meta-learning to initialize Bayesian opti-mization of hyperparameters. In: International Conference on Metalearning and AlgorithmSelection. pp. 3–10 (2014)

49. Filchenkov, A., Pendryak, A.: Dataset metafeature description for recommending featureselection. In: Proceedings of AINL-ISMW FRUCT 2015. pp. 11–18 (2015)

50. Fink, M.: Object classification from a single example utilizing class relevance metrics. In:Advances in Neural information processing systems, NeurIPS 2005. pp. 449–456 (2005)

51. Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deepnetworks. In: Proceedings of ICML 2017. pp. 1126–1135 (2017)

52. Finn, C., Levine, S.: Meta-learning and universality: Deep representations and GradientDescent can Approximate any Learning Algorithm. In: Proceedings of ICLR 2018 (2018)

53. Fürnkranz, J., Petrak, J.: An evaluation of landmarking variants. ECML/PKDD 2001 Work-shop on Integrating Aspects of Data Mining, Decision Support and Meta-Learning pp. 57–68(2001)

54. Fusi, N., Sheth, R., Elibol, H.M.: Probabilistic matrix factorization for automated machinelearning. In: Advances in Neural information processing systems, NeurIPS 2018, pp. 3352–3361 (2018)

55. Gil, Y., Yao, K.T., Ratnakar, V., Garijo, D., Ver Steeg, G., Szekely, P., Brekelmans, R.,Kejriwal, M., Luo, F., Huang, I.H.: P4ML: A phased performance-based pipeline plannerfor automated machine learning. In: AutoML Workshop at ICML 2018 (2018)

56. Giraud-Carrier, C.: Metalearning-a tutorial. In: Tutorial at the International Conference onMachine Learning and Applications. pp. 1–45 (2008)

57. Giraud-Carrier, C., Provost, F.: Toward a justification of meta-learning: Is the no free lunchtheorem a show-stopper. In: Proceedings of the ICML-2005 Workshop on Meta-learning.pp. 12–19 (2005)

58. Golovin, D., Solnik, B., Moitra, S., Kochanski, G., Karro, J., Sculley, D.: Google vizier: Aservice for black-box optimization. In: Proceedings of ICDM 2017. pp. 1487–1495 (2017)

59. Gomes, T.A., Prudêncio, R.B., Soares, C., Rossi, A.L., Carvalho, A.: Combining meta-learning and search techniques to select parameters for support vector machines. Neurocom-puting 75(1), 3–13 (2012)

60. Graves, A., Wayne, G., Danihelka, I.: Neural turing machines. arXiv preprintarXiv:1410.5401 (2014)

61. Guerra, S.B., Prudêncio, R.B., Ludermir, T.B.: Predicting the performance of learningalgorithms using support vector machines as meta- regressors. In: Proceedings of ICANN.pp. 523–532 (2008)

62. Hengst, B.: Discovering Hierarchy in Reinforcement Learning with HEXQ. In: InternationalConference on Machine Learning. pp. 243–250 (2002)

63. Hilario, M., Kalousis, A.: Fusion of meta-knowledge and meta-data for case-based modelselection. Lecture Notes in Computer Science 2168, 180–191 (2001)

64. Ho, T.K., Basu, M.: Complexity measures of supervised classification problems. PatternAnalysis and Machine Intelligence. 24(3), 289–300 (2002)

65. Hochreiter, S., Younger, A., Conwell, P.: Learning to learn using gradient descent. In: LectureNotes on Computer Science, 2130. pp. 87–94 (2001)

66. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural computation 9(8), 1735–1780 (1997)

67. Hutter, F., Hoos, H., Leyton-Brown, K.: An Efficient Approach for Assessing HyperparameterImportance. In: Proceedings of ICML (2014)


68. Hutter, F., Xu, L., Hoos, H., Leyton-Brown, K.: Algorithm runtime prediction: Methods &evaluation. Artificial Intelligence 206, 79–111 (2014)

69. Jones, D.R., Schonlau, M., Welch, W.J.: Efficient global optimization of expensive black-boxfunctions. Journal of Global Optimization 13(4), 455–492 (1998)

70. Kalousis, A.: Algorithm Selection via Meta-Learning. Ph.D. thesis, University of Geneva,Department of Computer Science (2002)

71. Kalousis, A., Hilario, M.: Representational issues in meta-learning. Proceedings of ICML2003 pp. 313–320 (2003)

72. Kalousis, A., Hilario, M.: Model selection via meta-learning: a compara- tive study. Interna-tional Journal on Artificial Intelligence Tools 10(4), 525–554 (2001)

73. Kendall, M.G.: A new measure of rank correlation. Biometrika 30(1/2), 81–93 (1938)74. Kietz, J.U., Serban, F., Bernstein, A., Fischer, S.: Designing KDD-workflows via HTN-

planning for intelligent discovery assistance. In: 5th Planning to Learn Workshop at ECAI2012 (2012)

75. Kim, J., Kim, S., Choi, S.: Learning to warm-start Bayesian hyperparameter optimization.arXiv preprint arXiv:1710.06219 (2017)

76. Köpf, C., Iglezakis, I.: Combination of task description strategies and case base properties formeta-learning. ECML/PKDD Workshop on Integration and Collaboration Aspects of DataMining pp. 65–76 (2002)

77. Köpf, C., Taylor, C., Keller, J.: Meta-analysis: From data characterization for meta-learningto meta-regression. In: PKDD Workshop on Data Mining, Decision Support, Meta-Learningand ILP. pp. 15–26 (2000)

78. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutionalneural networks. In: Advances in neural information processing systems. pp. 1097–1105(2012)

79. Kuba, P., Brazdil, P., Soares, C., Woznica, A.: Exploiting sampling and meta-learning forparameter setting support vector machines. In: Proceedings of IBERAMIA 2002. pp. 217–225 (2002)

80. Kullback, S., Leibler, R.A.: On information and sufficiency. The annals of mathematicalstatistics 22(1), 79–86 (1951)

81. Lacoste, A., Marchand, M., Laviolette, F., Larochelle, H.: Agnostic Bayesian learning ofensembles. In: Proceedings of ICML. pp. 611–619 (2014)

82. Lake, B.M., Ullman, T.D., Tenenbaum, J.B., Gershman, S.J.: Building machines that learnand think like people. Behavior and Brain Science 40 (2017)

83. Leite, R., Brazdil, P.: Predicting relative performance of classifiers from samples. Proceedingsof ICML pp. 497–504 (2005)

84. Leite, R., Brazdil, P.: An iterative process for building learning curves and predicting relativeperformance of classifiers. Lecture Notes in Computer Science 4874, 87–98 (2007)

85. Leite, R., Brazdil, P., Vanschoren, J.: Selecting Classification Algorithms with Active Testing.Lecture Notes in Artificial Intelligence 10934, 117–131 (2012)

86. Leite, R., Brazdil, P.: Active testing strategy to predict the best classification algorithm viasampling and metalearning. In: Proceedings of ECAI 2010. pp. 309–314 (2010)

87. Lemke, C., Budka, M., Gabrys, B.: Metalearning: a survey of trends and technologies.Artificial intelligence review 44(1), 117–130 (2015)

88. Ler, D., Koprinska, I., Chawla, S.: Utilizing regression-based landmarkers within a meta-learning framework for algorithm selection. Technical Report 569. University of Sydneypp. 44–51 (2005)

89. Li, K., Malik, J.: Learning to optimize. In: Proceedings of ICLR 2017 (2017)90. Li, K., Malik, J.: Learning to optimize neural nets. arXiv preprint arXiv:1703.00441 (2017)91. Lin, S.: Rank aggregation methods. WIREs Computational Statistics 2, 555–570 (2010)92. Lindner, G., Studer, R.: AST: Support for algorithm selection with a CBR approach. In: ICML

Workshop on Recent Advances in Meta-Learning and Future Work. pp. 38–47. J. StefanInstitute (1999)

93. Lorena, A.C., Maciel, A.I., de Miranda, P.B.C., Costa, I.G., Prudêncio, R.B.C.: Datacomplexity meta-features for regression problems. Machine Learning 107(1), 209–246 (2018)


94. Luo, G.: A review of automatic selection methods for machine learning algorithms and hyper-parameter values. Network Modeling Analysis in Health Informatics and Bioinformatics 5(1),18 (2016)

95. Mantovani, R.G., Horváth, T., Cerri, R., Vanschoren, J., de Carvalho, A.C.: Hyper-parametertuning of a decision tree induction algorithm. In: Brazilian Conference on Intelligent Systems.pp. 37–42 (2016)

96. Mantovani, R.G., Rossi, A.L., Vanschoren, J., Bischl, B., Carvalho, A.C.: To tune ornot to tune: recommending when to adjust SVM hyper-parameters via meta-learning. In:Proceedings of IJCNN. pp. 1–8 (2015)

97. Mantovani, R.G., Rossi, A.L., Vanschoren, J., Carvalho, A.C.: Meta-learning recommenda-tion of default hyper-parameter values for SVMs in classifications tasks. In: ECML PKDDWorkshop on Meta-Learning and Algorithm Selection (2015)

98. Mantovani, R.: Use of meta-learning for hyperparameter tuning of classification problems.Ph.D. thesis, University of Sao Carlos, Brazil (2018)

99. Michie, D., Spiegelhalter, D.J., Taylor, C.C., Campbell, J.: Machine Learning, Neural andStatistical Classification. Ellis Horwood (1994)

100. Miranda, P., Prudêncio, R.: Active testing for SVM parameter selection. In: Proceedings ofIJCNN. pp. 1–8 (2013)

101. Mishra, N., Rohaninejad, M., Chen, X., Abbeel, P.: A simple neural attentive meta-learner.In: Proceedings of ICLR (2018)

102. Misir, M., Sebag, M.: Algorithm Selection as a Collaborative Filtering Problem. Researchreport, INRIA (2013)

103. Mısır, M., Sebag, M.: Alors: An algorithm recommender system. Artificial Intelligence 244,291–314 (2017)

104. Nadaraya, E.A.: On estimating regression. Theory of Probability & Its Applications 9(1),141–142 (1964)

105. Nguyen, P., Hilario, M., Kalousis, A.: Using meta-mining to support data mining workflowplanning and optimization. Journal of Artificial Intelligence Research 51, 605–644 (2014)

106. Nichol, A., Achiam, J., Schulman, J.: On first-order meta-learning algorithms. arXiv1803.02999v2 (2018)

107. Niculescu-Mizil, A., Caruana, R.: Learning the Structure of Related Tasks. In: Proceedingsof NIPS Workshop on Inductive Transfer (2005)

108. Nisioti, E., Chatzidimitriou, K., Symeonidis, A.: Predicting hyperparameters from meta-features in binary classification problems. In: AutoML Workshop at ICML (2018)

109. Olier, I., Sadawi, N., Bickerton, G., Vanschoren, J., Grosan, C., Soldatova, L., King, R.: Meta-QSAR: learning how to learn QSARs. Machine Learning 107, 285–311 (2018)

110. Olson, R.S., Bartley, N., Urbanowicz, R.J., Moore, J.H.: Evaluation of a tree-based pipelineoptimization tool for automating data science. In: Proceedings of GECCO. pp. 485–492(2016)

111. Pan, S.J., Yang, Q.: A survey on transfer learning. IEEE Transactions on knowledge and dataengineering 22(10), 1345–1359 (2010)

112. Pang, K., Dong, M., Wu, Y., Hospedales, T.: Meta-learning transferable active learningpolicies by deep reinforcement learning. In: AutoML Workshop at ICML (2018)

113. Peng, Y., Flach, P., Soares, C., Brazdil, P.: Improved dataset characterisation for meta-learning. Lecture Notes in Computer Science 2534, 141–152 (2002)

114. Perrone, V., Jenatton, R., Seeger, M., Archambeau, C.: Multiple adaptive Bayesian linearregression for scalable Bayesian optimization with warm start. In: Advances in Neuralinformation processing systems, NeurIPS 2018 (2018)

115. Pfahringer, B., Bensusan, H., Giraud-Carrier, C.G.: Meta-learning by landmarking variouslearning algorithms. In: 17th International Conference on Machine Learning (ICML).pp. 743–750 (2000)

116. Pinto, F., Cerqueira, V., Soares, C., Mendes-Moreira, J.: autoBagging: Learning to rankbagging workflows with metalearning. arXiv 1706.09367 (2017)

117. Pinto, F., Soares, C., Mendes-Moreira, J.: Towards automatic generation of metafeatures. In:Proceedings of PAKDD. pp. 215–226 (2016)


118. Post, M.J., van der Putten, P., van Rijn, J.N.: Does Feature Selection Improve Classification?A Large Scale Experiment in OpenML. In: Advances in Intelligent Data Analysis XV.pp. 158–170 (2016)

119. Priya, R., De Souza, B.F., Rossi, A., Carvalho, A.: Using genetic algorithms to improveprediction of execution times of ML tasks. In: Lecture Notes in Computer Science. vol. 7208,pp. 196–207 (2012)

120. Probst, P., Bischl, B., Boulesteix, A.L.: Tunability: Importance of hyperparameters of machinelearning algorithms. ArXiv 1802.09596 (2018)

121. Prudêncio, R., Ludermir, T.: Meta-learning approaches to selecting time series models.Neurocomputing 61, 121–137 (2004)

122. Raina, R., Ng, A.Y., Koller, D.: Transfer Learning by Constructing Informative Priors. In:Proceedings of ICML (2006)

123. Rakotoarison, H., Sebag, M.: AutoML with Monte Carlo Tree Search. In: ICML Workshopon AutoML 2018 (2018)

124. Ramachandran, A., Gupta, S., Rana, S., Venkatesh, S.: Information-theoretic transfer learningframework for Bayesian optimisation. In: Proceedings of ECMLPKDD (2018)

125. Ramachandran, A., Gupta, S., Rana, S., Venkatesh, S.: Selecting optimal source for transferlearning in Bayesian optimisation. In: Proceedings of PRICAI. pp. 42–56 (2018)

126. Ravi, S., Larochelle, H.: Optimization as a model for few-shot learning. In: Proceedings ofICLR (2017)

127. Reed, S., Chen, Y., Paine, T., Oord, A.v.d., Eslami, S., Rezende, D., Vinyals, O., de Freitas,N.: Few-shot autoregressive density estimation: Towards learning to learn distributions. In:Proceedings of ICLR 2018 (2018)

128. Reif, M., Shafait, F., Dengel, A.: Prediction of classifier training time including parameteroptimization. In: Proceedings of GfKI 2011. pp. 260–271 (2011)

129. Reif, M., Shafait, F., Dengel, A.: Meta-learning for evolutionary parameter optimization ofclassifiers. Machine learning 87(3), 357–380 (2012)

130. Reif, M., Shafait, F., Goldstein, M., Breuel, T., Dengel, A.: Automatic classifier selection fornon-experts. Pattern Analysis and Applications 17(1), 83–96 (2014)

131. Ren, M., Triantafillou, E., Ravi, S., Snell, J., Swersky, K., Tenenbaum, J.B., Larochelle, H.,Zemel, R.S.: Meta-learning for semi-supervised few- shot classification. In: Proceedings ofICLR 2018 (2018)

132. Rendle, S.: Factorization machines. In: Proceedings of ICDM 2015. pp. 995–1000 (2010)133. Ridd, P., Giraud-Carrier, C.: Using metalearning to predict when parameter optimization

is likely to improve classification accuracy. In: ECAI Workshop on Meta-learning andAlgorithm Selection. pp. 18–23 (2014)

134. van Rijn, J., Abdulrahman, S., Brazdil, P., Vanschoren, J.: Fast Algorithm Selection UsingLearning Curves. In: Proceedings of IDA (2015)

135. van Rijn, J., Holmes, G., Pfahringer, B., Vanschoren, J.: The Online Performance EstimationFramework. Heterogeneous Ensemble Learning for Data Streams. Machine Learning 107,149–176 (2018)

136. van Rijn, J.N., Hutter, F.: Hyperparameter importance across datasets. In: Proceedings ofKDD. pp. 2367–2376 (2018)

137. van Rijn, J.N., Holmes, G., Pfahringer, B., Vanschoren, J.: Algorithm selection on datastreams. In: Discovery Science. pp. 325–336 (2014)

138. Rivolli, A., Garcia, L., Soares, C., Vanschoren, J., de Carvalho, A.: Towards reproducibleempirical research in meta-learning. arXiv preprint 1808.10406 (2018)

139. Robbins, H.: Some aspects of the sequential design of experiments. In: Herbert RobbinsSelected Papers, pp. 169–177. Springer (1985)

140. Rosenstein, M.T., Marx, Z., Kaelbling, L.P.: To Transfer or Not To Transfer. In: NIPSWorkshop on transfer learning (2005)

141. Rousseeuw, P.J., Hubert, M.: Robust statistics for outlier detection. Wiley InterdisciplinaryReviews: Data Mining and Knowledge Discovery 1(1), 73–79 (2011)


142. Runarsson, T.P., Jonsson, M.T.: Evolution and design of distributed learning rules. In: IEEESymposium on Combinations of Evolutionary Computation and Neural Networks. pp. 59–63(2000)

143. Salama, M.A., Hassanien, A.E., Revett, K.: Employment of neural network and rough set inmeta-learning. Memetic Computing 5(3), 165–177 (2013)

144. Sanders, S., Giraud-Carrier, C.: Informing the use of hyperparameter optimization throughmetalearning. In: Proceedings of ICDM 2017. pp. 1051–1056 (2017)

145. Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., Lillicrap, T.: Meta-learning withmemory-augmented neural networks. In: International conference on machine learning.pp. 1842–1850 (2016)

146. Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., Lillicrap, T.: One-shot learning withmemory-augmented neural networks. arXiv preprint arXiv:1605.06065 (2016)

147. dos Santos, P., Ludermir, T., Prudêncio, R.: Selection of time series forecasting models basedon performance information. 4th International Conference on Hybrid Intelligent Systemspp. 366–371 (2004)

148. Schilling, N., Wistuba, M., Drumond, L., Schmidt-Thieme, L.: Hyperparameter optimizationwith factorized multilayer perceptrons. In: Proceedings of ECML PKDD. pp. 87–103 (2015)

149. Schmidhuber, J.: Learning to control fast-weight memories: An alternative to dynamicrecurrent networks. Neural Computing 4(1), 131–139 (1992)

150. Schmidhuber, J.: A neural network that embeds its own meta-levels. In: Proceedings of ICNN.pp. 407–412 (1993)

151. Schmidhuber, J., Zhao, J., Wiering, M.: Shifting inductive bias with success-story algorithm,adaptive levin search, and incremental self-improvement. Machine Learning 28(1), 105–130(1997)

152. Schoenfeld, B., Giraud-Carrier, C., Poggeman, M., Christensen, J., Seppi, K.: Featureselection for high-dimensional data: A fast correlation-based filter solution. In: AutoMLWorkshop at ICML (2018)

153. Serban, F., Vanschoren, J., Kietz, J., Bernstein, A.: A survey of intelligent assistants for dataanalysis. ACM Computing Surveys 45(3), Art.31 (2013)

154. Sharif Razavian, A., Azizpour, H., Sullivan, J., Carlsson, S.: Cnn features off-the-shelf: anastounding baseline for recognition. In: Proceedings of CVPR 2014. pp. 806–813 (2014)

155. Sharkey, N.E., Sharkey, A.J.C.: Adaptive Generalization. Artificial Intelligence Review 7,313–328 (1993)

156. Smith-Miles, K.A.: Cross-disciplinary perspectives on meta-learning for algorithm selection.ACM Computing Surveys 41(1), 1–25 (2009)

157. Snell, J., Swersky, K., Zemel, R.: Prototypical networks for few-shot learning. In: NeuralInformation Processing Systems. pp. 4077–4087 (2017)

158. Soares, C., Brazdil, P., Kuba, P.: A meta-learning method to select the kernel width in supportvector regression. Machine Learning 54, 195–209 (2004)

159. Soares, C., Ludermir, T., Carvalho, F.D.: An analysis of meta-learning techniques for rankingclustering algorithms applied to artificial data. Lecture Notes in Computer Science 5768, 131–140 (2009)

160. Soares, C., Petrak, J., Brazdil, P.: Sampling based relative landmarks: Systematically testdriv-ing algorithms before choosing. Lecture Notes in Computer Science 3201, 250–261 (2001)

161. Springenberg, J., Klein, A., Falkner, S., Hutter, F.: Bayesian optimization with robustBayesian neural networks. In: Advances in Neural Information Processing Systems (2016)

162. Stern, D.H., Samulowitz, H., Herbrich, R., Graepel, T., Pulina, L., Tacchella, A.: Collaborativeexpert portfolio management. In: Proceedings of AAAI. pp. 179–184 (2010)

163. Strang, B., van der Putten, P., van Rijn, J.N., Hutter, F.: Don’t Rule Out Simple ModelsPrematurely. In: Advances in Intelligent Data Analysis (2018)

164. Sun, Q., Pfahringer, B., Mayo, M.: Towards a Framework for Designing Full Model Selectionand Optimization Systems. In: International Workshop on Multiple Classifier Systems.pp. 259–270 (2013)


165. Sun, Q., Pfahringer, B.: Pairwise meta-rules for better meta-learning-based algorithm ranking.Machine Learning 93(1), 141–161 (2013)

166. Swersky, K., Snoek, J., Adams, R.P.: Multi-task Bayesian optimization. In: Advances inneural information processing systems. pp. 2004–2012 (2013)

167. Thompson, W.R.: On the likelihood that one unknown probability exceeds another in view ofthe evidence of two samples. Biometrika 25(3/4), 285–294 (1933)

168. Thrun, S.: Lifelong Learning Algorithms. In: Learning to Learn, chap. 8, pp. 181–209. KluwerAcademic Publishers, MA (1998)

169. Thrun, S., Mitchell, T.: Learning One More Thing. In: Proceedings of IJCAI. pp. 1217–1223(1995)

170. Thrun, S., Pratt, L.: Learning to Learn: Introduction and Overview. In: Learning to Learn,pp. 3–17. Kluwer (1998)

171. Todorovski, L., Blockeel, H., Džeroski, S.: Ranking with predictive clustering trees. LectureNotes in Artificial Intelligence 2430, 444–455 (2002)

172. Todorovski, L., Brazdil, P., Soares, C.: Report on the experiments with feature selection inmeta-level learning. PKDD 2000 Workshop on Data mining, Decision support, Meta-learningand ILP pp. 27–39 (2000)

173. Todorovski, L., Dzeroski, S.: Experiments in meta-level learning with ILP. Lecture Notes inComputer Science 1704, 98–106 (1999)

174. Vanschoren, J., van Rijn, J.N., Bischl, B., Torgo, L.: OpenML: networked science in machinelearning. ACM SIGKDD Explorations Newsletter 15(2), 49–60 (2014)

175. Vanschoren, J.: Understanding Machine Learning Performance with Experiment Databases.Ph.D. thesis, Leuven Univeristy (2010)

176. Vanschoren, J.: Meta-learning: A survey. arXiv:1810.03548 (2018)177. Vanschoren, J., Blockeel, H., Pfahringer, B., Holmes, G.: Experiment databases. Machine

Learning 87(2), 127–158 (2012)178. Vartak, M., Thiagarajan, A., Miranda, C., Bratman, J., Larochelle, H.: A meta-learning

perspective on cold-start recommendations for items. In: Advances in Neural InformationProcessing Systems. pp. 6904–6914 (2017)

179. Vilalta, R.: Understanding accuracy performance through concept characterization andalgorithm analysis. ICML Workshop on Recent Advances in Meta-Learning and Future Work(1999)

180. Vilalta, R., Drissi, Y.: A characterization of difficult problems in classification. Proceedingsof ICMLA (2002)

181. Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al.: Matching networks for one shotlearning. In: Advances in Neural Information Processing Systems. pp. 3630–3638 (2016)

182. Weerts, H., Meuller, M., Vanschoren, J.: Importance of tuning hyperparameters of machinelearning algorithms. Technical report, TU Eindhoven (2018)

183. Weerts, H., Meuller, M., Vanschoren, J.: Importance of tuning hyperparameters of machinelearning algorithms. Tech. rep., TU Eindhoven (2018)

184. Wever, M., Mohr, F., Hüllermeier, E.: Ml-plan for unlimited-length machine learningpipelines. In: AutoML Workshop at ICML 2018 (2018)

185. Wistuba, M., Schilling, N., Schmidt-Thieme, L.: Hyperparameter search space pruning, a newcomponent for sequential model-based hyperparameter optimization. In: ECML PKDD 2015.pp. 104–119 (2015)

186. Wistuba, M., Schilling, N., Schmidt-Thieme, L.: Learning hyperparameter optimizationinitializations. In: 2015 IEEE International Conference on Data Science and AdvancedAnalytics (DSAA). pp. 1–10 (2015)

187. Wolpert, D., Macready, W.: No free lunch theorems for search. Technical Report SFI-TR-95-02-010, The Santa Fe Institute (1996)


188. Yang, C., Akimoto, Y., Kim, D., Udell, M.: OBOE: Collaborative filtering for automlinitialization. In: NeurIPS 2018 Workshop on Metalearning (2018)

189. Yang, C., Akimoto, Y., Kim, D., Udell, M.: Oboe: Collaborative filtering for automlinitialization. arXiv preprint arXiv:1808.03233 (2018)

190. Yogatama, D., Mann, G.: Efficient transfer learning method for automatic hyperparametertuning. In: AI and Statistics. pp. 1077–1085 (2014)

191. Yosinski, J., Clune, J., Bengio, Y., Lipson, H.: How transferable are features in deep neuralnetworks? In: Advances in neural information processing systems. pp. 3320–3328 (2014)

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.


Chapter 3
Neural Architecture Search

Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter

Abstract Deep Learning has enabled remarkable progress over the last years on a variety of tasks, such as image recognition, speech recognition, and machine translation. One crucial aspect for this progress are novel neural architectures. Currently employed architectures have mostly been developed manually by human experts, which is a time-consuming and error-prone process. Because of this, there is growing interest in automated neural architecture search methods. We provide an overview of existing work in this field of research and categorize them according to three dimensions: search space, search strategy, and performance estimation strategy.

3.1 Introduction

The success of deep learning in perceptual tasks is largely due to its automation of the feature engineering process: hierarchical feature extractors are learned in an end-to-end fashion from data rather than manually designed. This success has been accompanied, however, by a rising demand for architecture engineering, where increasingly more complex neural architectures are designed manually. Neural Architecture Search (NAS), the process of automating architecture engineering, is thus a logical next step in automating machine learning. NAS can be seen as a subfield of AutoML and has significant overlap with hyperparameter optimization and meta-learning (which are described in Chaps. 1 and 2 of this book, respectively).

T. Elsken
Institut für Informatik, University of Freiburg, Freiburg, Baden-Württemberg, Germany
e-mail: [email protected]; [email protected]

J. H. Metzen
Bosch Center for Artificial Intelligence, Robert Bosch GmbH, Renningen, Baden-Württemberg, Germany

F. Hutter
Department of Computer Science, University of Freiburg, Freiburg, Germany

© The Author(s) 2019
F. Hutter et al. (eds.), Automated Machine Learning, The Springer Series on Challenges in Machine Learning, https://doi.org/10.1007/978-3-030-05318-5_3


We categorize methods for NAS according to three dimensions: search space, search strategy, and performance estimation strategy:

• Search Space. The search space defines which architectures can be represented in principle. Incorporating prior knowledge about properties well-suited for a task can reduce the size of the search space and simplify the search. However, this also introduces a human bias, which may prevent finding novel architectural building blocks that go beyond the current human knowledge.

Fig. 3.1 Abstract illustration of Neural Architecture Search methods. A search strategy selects an architecture A from a predefined search space A. The architecture is passed to a performance estimation strategy, which returns the estimated performance of A to the search strategy

• Search Strategy. The search strategy details how to explore the search space. It encompasses the classical exploration-exploitation trade-off since, on the one hand, it is desirable to find well-performing architectures quickly, while on the other hand, premature convergence to a region of suboptimal architectures should be avoided.

• Performance Estimation Strategy. The objective of NAS is typically to find architectures that achieve high predictive performance on unseen data. Performance estimation refers to the process of estimating this performance: the simplest option is to perform a standard training and validation of the architecture on data, but this is unfortunately computationally expensive and limits the number of architectures that can be explored. Much recent research therefore focuses on developing methods that reduce the cost of these performance estimations.

We refer to Fig. 3.1 for an illustration. The chapter is also structured according to these three dimensions: we start with discussing search spaces in Sect. 3.2, cover search strategies in Sect. 3.3, and outline approaches to performance estimation in Sect. 3.4. We conclude with an outlook on future directions in Sect. 3.5.
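The interplay of these three components can be pictured as a simple loop. The following is a minimal sketch, not taken from the chapter: the names (SEARCH_SPACE, propose, estimate_performance) are hypothetical placeholders, the search strategy is plain random search, and the performance estimate is faked with a random number where a real system would train and validate the architecture.

```python
import random

# Hypothetical chain-structured search space: each layer picks one operation.
SEARCH_SPACE = {"num_layers": range(1, 6),
                "operations": ["conv3x3", "conv5x5", "maxpool", "identity"]}

def propose(search_space):
    """Search strategy: here simply random search over the space."""
    depth = random.choice(list(search_space["num_layers"]))
    return [random.choice(search_space["operations"]) for _ in range(depth)]

def estimate_performance(architecture):
    """Performance estimation strategy: stand-in for 'train and validate'.
    A real estimator would train the network and return validation accuracy."""
    return random.random()

best_arch, best_perf = None, float("-inf")
for _ in range(20):                      # budget of 20 architecture evaluations
    arch = propose(SEARCH_SPACE)         # search strategy selects A from the search space
    perf = estimate_performance(arch)    # performance estimation strategy evaluates A
    if perf > best_perf:
        best_arch, best_perf = arch, perf
print(best_arch, best_perf)
```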

This chapter is based on a very recent survey article [23].

3.2 Search Space

The search space defines which neural architectures a NAS approach might discover in principle. We now discuss common search spaces from recent works.

A relatively simple search space is the space of chain-structured neural networks, as illustrated in Fig. 3.2 (left).

Fig. 3.2 An illustration of different architecture spaces. Each node in the graphs corresponds to a layer in a neural network, e.g., a convolutional or pooling layer. Different layer types are visualized by different colors. An edge from layer Li to layer Lj denotes that Lj receives the output of Li as input. Left: an element of a chain-structured space. Right: an element of a more complex search space with additional layer types and multiple branches and skip connections

A chain-structured neural network architecture A can be written as a sequence of n layers, where the i'th layer Li receives its input from layer i − 1 and its output serves as the input for layer i + 1, i.e., A = Ln ◦ . . . ◦ L1 ◦ L0. The search space is then parametrized by: (i) the (maximum) number of layers n (possibly unbounded); (ii) the type of operation every layer can execute, e.g., pooling, convolution, or more advanced layer types like depthwise separable convolutions [13] or dilated convolutions [68]; and (iii) the hyperparameters associated with the operation, e.g., number of filters, kernel size and strides for a convolutional layer [4, 10, 59], or simply number of units for fully-connected networks [41]. Note that the parameters from (iii) are conditioned on (ii); hence the parametrization of the search space is not fixed-length but rather a conditional space.
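To make the conditional structure of (i)-(iii) concrete, the sketch below samples one random element of such a chain-structured space. The operation names and hyperparameter ranges are purely illustrative assumptions, not those of any cited work.

```python
import random

# Hyperparameters in (iii) are conditioned on the operation type chosen in (ii).
OPERATIONS = {
    "conv":    {"num_filters": [16, 32, 64], "kernel_size": [3, 5], "stride": [1, 2]},
    "pooling": {"kernel_size": [2, 3]},
    "dense":   {"num_units": [64, 128, 256]},
}

def sample_chain_architecture(max_layers=10):
    """Sample one element of a chain-structured, conditional search space."""
    n = random.randint(1, max_layers)                 # (i) number of layers
    layers = []
    for _ in range(n):
        op = random.choice(list(OPERATIONS))          # (ii) operation type
        hps = {name: random.choice(choices)           # (iii) conditional hyperparameters
               for name, choices in OPERATIONS[op].items()}
        layers.append((op, hps))
    return layers

print(sample_chain_architecture())
```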

Recent work on NAS [9, 11, 21, 22, 49, 75] incorporates modern design elements known from hand-crafted architectures, such as skip connections, which allow building complex, multi-branch networks, as illustrated in Fig. 3.2 (right). In this case the input of layer i can be formally described as a function g_i(L_{i−1}^{out}, . . . , L_0^{out}) combining previous layer outputs. Employing such a function results in significantly more degrees of freedom. Special cases of these multi-branch architectures are (i) the chain-structured networks (by setting g_i(L_{i−1}^{out}, . . . , L_0^{out}) = L_{i−1}^{out}), (ii) Residual Networks [28], where previous layer outputs are summed (g_i(L_{i−1}^{out}, . . . , L_0^{out}) = L_{i−1}^{out} + L_j^{out}, j < i), and (iii) DenseNets [29], where previous layer outputs are concatenated (g_i(L_{i−1}^{out}, . . . , L_0^{out}) = concat(L_{i−1}^{out}, . . . , L_0^{out})).
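The three special cases of the combination function g_i can be written down directly. The sketch below is a toy illustration assuming all previous outputs share the same shape (real architectures need projections when shapes differ); the function names are hypothetical.

```python
import numpy as np

# prev_outputs = [L_0^out, ..., L_{i-1}^out], all assumed to have the same shape here.
def g_chain(prev_outputs):
    return prev_outputs[-1]                       # chain-structured: use only L_{i-1}^out

def g_residual(prev_outputs, j=0):
    return prev_outputs[-1] + prev_outputs[j]     # ResNet-style: sum with an earlier output

def g_dense(prev_outputs):
    return np.concatenate(prev_outputs, axis=-1)  # DenseNet-style: concatenate all outputs

outs = [np.ones((4, 8)) * k for k in range(3)]    # dummy outputs of layers L_0, L_1, L_2
print(g_chain(outs).shape, g_residual(outs).shape, g_dense(outs).shape)
```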

Motivated by hand-crafted architectures consisting of repeated motifs [28, 29, 62], Zoph et al. [75] and Zhong et al. [71] propose to search for such motifs, dubbed cells or blocks, respectively, rather than for whole architectures. Zoph et al. [75] optimize two different kinds of cells: a normal cell that preserves the dimensionality of the input and a reduction cell that reduces the spatial dimension. The final architecture is then built by stacking these cells in a predefined manner, as illustrated in Fig. 3.3.



Fig. 3.3 Illustration of the cell search space. Left: Two different cells, e.g., a normal cell (top) and a reduction cell (bottom) [75]. Right: an architecture built by stacking the cells sequentially. Note that cells can also be combined in a more complex manner, such as in multi-branch spaces, by simply replacing layers with cells

This search space has two major advantages compared to the ones discussed above:

1. The size of the search space is drastically reduced since cells can be comparably small. For example, Zoph et al. [75] estimate a seven-times speed-up compared to their previous work [74] while achieving better performance.

2. Cells can more easily be transferred to other datasets by adapting the number of cells used within a model. Indeed, Zoph et al. [75] transfer cells optimized on CIFAR-10 to ImageNet and achieve state-of-the-art performance.

Consequently, this cell-based search space was also successfully employed by many later works [11, 22, 37, 39, 46, 49, 72]. However, a new design choice arises when using a cell-based search space, namely how to choose the meta-architecture: how many cells should be used, and how should they be connected to build the actual model? For example, Zoph et al. [75] build a sequential model from cells, in which each cell receives the outputs of the two preceding cells as input, while Cai et al. [11] employ the high-level structure of well-known manually designed architectures, such as DenseNet [29], and use their cells within these models. In principle, cells can be combined arbitrarily, e.g., within the multi-branch space described above by simply replacing layers with cells. Ideally, the meta-architecture should be optimized automatically as part of NAS; otherwise one easily ends up doing meta-architecture engineering, and the search for the cell becomes overly simple if most of the complexity is already accounted for by the meta-architecture.
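As an illustration of a hard-coded meta-architecture, the sketch below stacks a normal cell several times and inserts a reduction cell at fixed positions, in the spirit of the cell-based designs discussed above. Both cell functions are toy placeholders (a real normal or reduction cell would be a learned sub-network); the structure only shows how a fixed stacking pattern determines the overall model.

```python
import numpy as np

def normal_cell(x):
    """Placeholder for a learned normal cell (preserves spatial dimensions)."""
    return x + 1.0

def reduction_cell(x):
    """Placeholder for a learned reduction cell (halves spatial dimensions)."""
    return x[:, ::2, ::2]

def build_and_run(x, num_normal_per_block=2, num_blocks=3):
    """Hard-coded meta-architecture: [normal x N, reduction] repeated num_blocks times."""
    for block in range(num_blocks):
        for _ in range(num_normal_per_block):
            x = normal_cell(x)
        if block < num_blocks - 1:          # no reduction after the last block
            x = reduction_cell(x)
    return x

x = np.zeros((16, 32, 32))                  # dummy input: channels x height x width
print(build_and_run(x).shape)               # spatial size is reduced by the reduction cells
```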

One step in the direction of optimizing meta-architectures is the hierarchical search space introduced by Liu et al. [38], which consists of several levels of motifs. The first level consists of the set of primitive operations, the second level of different motifs that connect primitive operations via a directed acyclic graph, the third level of motifs that encode how to connect second-level motifs, and so on. The cell-based search space can be seen as a special case of this hierarchical search space in which the number of levels is three, the second-level motifs correspond to the cells, and the third level is the hard-coded meta-architecture.
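One way to picture this hierarchy is as nested composition: a higher-level motif is defined by how it wires together lower-level motifs. The sketch below is purely illustrative and deliberately simplified, composing sub-motifs in a chain rather than in the DAG-based construction of Liu et al. [38]; all names are hypothetical.

```python
import random

PRIMITIVES = ["conv3x3", "maxpool", "identity"]        # level-1 motifs: primitive operations

def sample_motif(sub_motifs, size=3):
    """A higher-level motif: an ordered selection of lower-level motifs.
    (Liu et al. [38] connect sub-motifs via DAGs; a chain is used here for brevity.)"""
    return [random.choice(sub_motifs) for _ in range(size)]

level2 = [sample_motif(PRIMITIVES) for _ in range(4)]  # level-2 motifs built from primitives
level3 = sample_motif(level2, size=5)                  # level-3 motif: the full architecture
print(level3)
```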

The choice of the search space largely determines the difficulty of the optimization problem: even for the case of the search space based on a single cell with fixed meta-architecture, the optimization problem remains (i) non-continuous and (ii) relatively high-dimensional (since more complex models tend to perform better, resulting in more design choices). We note that the architectures in many search spaces can be written as fixed-length vectors; e.g., the search space for each of the two cells by Zoph et al. [75] can be written as a 40-dimensional search space with categorical dimensions, each of which chooses between a small number of different building blocks and inputs. Similarly, unbounded search spaces can be constrained to have a maximal depth, giving rise to fixed-size search spaces with (potentially many) conditional dimensions.
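The fixed-length, categorical view of a cell-based space can be sketched as follows. The numbers of blocks, candidate operations, and inputs below are illustrative assumptions and do not reproduce the exact space of Zoph et al. [75]; with these choices, one cell becomes a vector of 4 * B categorical entries, so two cells would give a 40-entry vector for B = 5.

```python
import random

# Each of B blocks in a cell picks two inputs and an operation for each input,
# so a cell is a fixed-length vector of categorical choices.
OPS = ["sep_conv_3x3", "sep_conv_5x5", "max_pool_3x3", "identity"]
B = 5  # number of blocks per cell (illustrative)

def sample_cell():
    cell = []
    for b in range(B):
        possible_inputs = list(range(b + 2))     # the two cell inputs plus earlier blocks
        cell.extend([
            random.choice(possible_inputs),      # input 1
            random.choice(possible_inputs),      # input 2
            random.choice(range(len(OPS))),      # operation applied to input 1
            random.choice(range(len(OPS))),      # operation applied to input 2
        ])
    return cell                                  # fixed length: 4 * B categorical entries

print(sample_cell())
```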

In the next section, we discuss search strategies that are well-suited for these kinds of search spaces.

3.3 Search Strategy

Many different search strategies can be used to explore the space of neural architectures, including random search, Bayesian optimization, evolutionary methods, reinforcement learning (RL), and gradient-based methods. Historically, evolutionary algorithms were already used by many researchers to evolve neural architectures (and often also their weights) decades ago [see, e.g., 2, 25, 55, 56]. Yao [67] provides a literature review of work earlier than 2000.

Bayesian optimization celebrated several early successes in NAS since 2013, leading to state-of-the-art vision architectures [7], state-of-the-art performance for CIFAR-10 without data augmentation [19], and the first automatically-tuned neural networks to win competition datasets against human experts [41]. NAS became a mainstream research topic in the machine learning community after Zoph and Le [74] obtained competitive performance on the CIFAR-10 and Penn Treebank benchmarks with a search strategy based on reinforcement learning. While Zoph and Le [74] used vast computational resources to achieve this result (800 GPUs for three to four weeks), a wide variety of methods has since been published in quick succession to reduce the computational costs and achieve further improvements in performance.

To frame NAS as a reinforcement learning (RL) problem [4, 71, 74, 75], the generation of a neural architecture can be considered to be the agent's action, with the action space identical to the search space. The agent's reward is based on an estimate of the performance of the trained architecture on unseen data (see Sect. 3.4). Different RL approaches differ in how they represent the agent's policy and how they optimize it: Zoph and Le [74] use a recurrent neural network (RNN) policy to sequentially sample a string that in turn encodes the neural architecture. They initially trained this network with the REINFORCE policy gradient algorithm, but switched to Proximal Policy Optimization (PPO) in follow-up work [75]. Baker et al. [4] use Q-learning to train a policy which sequentially chooses a layer's type and corresponding hyperparameters. An alternative view of these approaches is as sequential decision processes in which the policy samples actions to generate the architecture sequentially, the environment's "state" contains a summary of the actions sampled so far, and the (undiscounted) reward is obtained only after the final action. However, since no interaction with an environment occurs during this sequential process (no external state is observed, and there are no intermediate rewards), we find it more intuitive to interpret the architecture sampling process as the sequential generation of a single action; this simplifies the RL problem to a stateless multi-armed bandit problem.
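To make the bandit view concrete, the following sketch applies REINFORCE to a factorized categorical policy (one softmax per decision) rather than to an RNN controller; this is a simplification, not the method of the cited works. The reward is a synthetic stand-in for validation accuracy, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
OPS = ["conv3x3", "conv5x5", "maxpool", "identity"]
NUM_DECISIONS = 6                              # number of layers/tokens to choose
logits = np.zeros((NUM_DECISIONS, len(OPS)))   # policy parameters, one softmax per decision

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fake_reward(actions):
    """Stand-in for the validation accuracy of the trained architecture:
    here we simply reward choosing operation 0 ('conv3x3') often."""
    return float(np.mean(actions == 0))

baseline, lr = 0.0, 0.5
for step in range(200):
    probs = softmax(logits)
    actions = np.array([rng.choice(len(OPS), p=p) for p in probs])  # sample one architecture
    reward = fake_reward(actions)                                   # "train and validate"
    baseline = 0.9 * baseline + 0.1 * reward        # moving-average baseline reduces variance
    # REINFORCE gradient for each categorical decision: (one_hot(action) - probs) * advantage
    one_hot = np.eye(len(OPS))[actions]
    logits += lr * (reward - baseline) * (one_hot - probs)

print(softmax(logits).round(2))                     # policy should now favor operation 0
```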

A related approach was proposed by Cai et al. [10], who frame NAS as a sequential decision process: in their approach the state is the current (partially trained) architecture, the reward is an estimate of the architecture's performance, and the action corresponds to an application of function-preserving mutations, dubbed network morphisms [12, 63] (see also Sect. 3.4), followed by a phase of training the network. In order to deal with variable-length network architectures, they use a bi-directional LSTM to encode architectures into a fixed-length representation. Based on this encoded representation, actor networks decide on the sampled action. The combination of these two components constitutes the policy, which is trained end-to-end with the REINFORCE policy gradient algorithm. We note that this approach will not visit the same state (architecture) twice, so that strong generalization over the architecture space is required from the policy.

An alternative to using RL is the use of neuro-evolutionary approaches, which employ evolutionary algorithms for optimizing the neural architecture. The first such approach for designing neural networks we are aware of dates back almost three decades: Miller et al. [44] use genetic algorithms to propose architectures and use backpropagation to optimize their weights. Many neuro-evolutionary approaches since then [2, 55, 56] use genetic algorithms to optimize both the neural architecture and its weights; however, when scaling to contemporary neural architectures with millions of weights for supervised learning tasks, SGD-based weight optimization methods currently outperform evolutionary ones.1 More recent neuro-evolutionary approaches [22, 38, 43, 49, 50, 59, 66] therefore again use gradient-based methods for optimizing weights and solely use evolutionary algorithms for optimizing the neural architecture itself. Evolutionary algorithms evolve a population of models, i.e., a set of (possibly trained) networks; in every evolution step, at least one model from the population is sampled and serves as a parent to generate offspring by applying mutations to it. In the context of NAS, mutations are local operations, such as adding or removing a layer, altering the hyperparameters of a layer, adding skip connections, or altering training hyperparameters. After training the offspring, their fitness (e.g., performance on a validation set) is evaluated and they are added to the population.

Neuro-evolutionary methods differ in how they sample parents, update populations, and generate offspring. For example, Real et al. [50], Real et al. [49], and Liu et al. [38] use tournament selection [27] to sample parents, whereas Elsken et al. [22] sample parents from a multi-objective Pareto front using an inverse density. Real et al. [50] remove the worst individual from a population, while Real et al. [49] found it beneficial to remove the oldest individual (which decreases greediness), and Liu et al. [38] do not remove individuals at all. To generate offspring, most approaches initialize child networks randomly, while Elsken et al. [22] employ Lamarckian inheritance, i.e., knowledge (in the form of learned weights) is passed on from a parent network to its children by using network morphisms. Real et al. [50] also let an offspring inherit all parameters of its parent that are not affected by the applied mutation; while this inheritance is not strictly function-preserving, it might still speed up learning compared to a random initialization. Moreover, they also allow mutating the learning rate, which can be seen as a way of optimizing the learning rate schedule during NAS.
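A compact sketch of such an evolutionary loop, in the spirit of tournament selection combined with removal of the oldest individual (aging), is given below. The architecture encoding, the mutation, and the fitness function are toy placeholders; in a real setting, fitness would be the validation performance of the trained offspring.

```python
import random
from collections import deque

OPS = ["conv3x3", "conv5x5", "maxpool", "identity"]

def random_arch(depth=8):
    return [random.choice(OPS) for _ in range(depth)]

def mutate(arch):
    child = list(arch)
    child[random.randrange(len(child))] = random.choice(OPS)   # local change: swap one layer type
    return child

def fitness(arch):
    """Toy stand-in for the validation accuracy of the trained model."""
    return arch.count("conv3x3") / len(arch)

POP_SIZE, TOURNAMENT_SIZE, STEPS = 20, 5, 300
population = deque()                                  # the oldest individual sits at the left
for _ in range(POP_SIZE):
    arch = random_arch()
    population.append((arch, fitness(arch)))

best = max(population, key=lambda p: p[1])
for _ in range(STEPS):
    sample = random.sample(list(population), TOURNAMENT_SIZE)  # tournament selection of a parent
    parent = max(sample, key=lambda p: p[1])
    child = mutate(parent[0])
    child_fit = fitness(child)
    population.append((child, child_fit))
    population.popleft()                              # aging: remove the oldest, not the worst
    if child_fit > best[1]:
        best = (child, child_fit)
print(best)
```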

Real et al. [49] conduct a case study comparing RL, evolution, and random search (RS), concluding that RL and evolution perform equally well in terms of final test accuracy, with evolution having better anytime performance and finding smaller models. Both approaches consistently perform better than RS in their experiments, but with a rather small margin: RS achieved test errors of approximately 4% on CIFAR-10, while RL and evolution reached approximately 3.5% (after "model augmentation", where depth and number of filters were increased; the difference on the actual, non-augmented search space was approximately 2%). The difference was even smaller for Liu et al. [38], who reported a test error of 3.9% on CIFAR-10 and a top-1 validation error of 21.0% on ImageNet for RS, compared to 3.75% and 20.3% for their evolution-based method, respectively.

1 Some recent work shows that evolving even millions of weights is competitive to gradient-based optimization when only high-variance estimates of the gradient are available, e.g., for reinforcement learning tasks [15, 51, 57]. Nonetheless, for supervised learning tasks gradient-based optimization is by far the most common approach.

Bayesian Optimization (BO, see, e.g., [53]) is one of the most popular methods for hyperparameter optimization (see also Chap. 1 of this book), but it has not been applied to NAS by many groups since typical BO toolboxes are based on Gaussian processes and focus on low-dimensional continuous optimization problems. Swersky et al. [60] and Kandasamy et al. [31] derive kernel functions for architecture search spaces in order to use classic GP-based BO methods, but so far without achieving new state-of-the-art performance. In contrast, several works use tree-based models (in particular, treed Parzen estimators [8] or random forests [30]) to effectively search very high-dimensional conditional spaces and achieve state-of-the-art performance on a wide range of problems, optimizing both neural architectures and their hyperparameters jointly [7, 19, 41, 69]. While a full comparison is lacking, there is preliminary evidence that these approaches can also outperform evolutionary algorithms [33].

Architectural search spaces have also been explored in a hierarchical manner, e.g., in combination with evolution [38] or by sequential model-based optimization [37]. Negrinho and Gordon [45] and Wistuba [65] exploit the tree structure of their search space and use Monte Carlo Tree Search. Elsken et al. [21] propose a simple yet well-performing hill climbing algorithm that discovers high-quality architectures by greedily moving in the direction of better-performing architectures without requiring more sophisticated exploration mechanisms.

In contrast to the gradient-free optimization methods above, Liu et al. [39] propose a continuous relaxation of the search space to enable gradient-based optimization: instead of fixing a single operation o_i (e.g., convolution or pooling) to be executed at a specific layer, the authors compute a convex combination from a set of operations {o_1, . . . , o_m}. More specifically, given a layer input x, the layer output y is computed as y = ∑_{i=1}^{m} λ_i o_i(x), with λ_i ≥ 0 and ∑_{i=1}^{m} λ_i = 1, where the convex coefficients λ_i effectively parameterize the network architecture. Liu et al. [39] then optimize both the network weights and the network architecture by alternating gradient descent steps on training data for the weights and on validation data for architectural parameters such as λ. Eventually, a discrete architecture is obtained by choosing the operation i with i = arg max_i λ_i for every layer. Shin et al. [54] and Ahmed and Torresani [1] also employ gradient-based optimization of neural architectures; however, they only consider optimizing layer hyperparameters or connectivity patterns, respectively.
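The continuous relaxation can be illustrated with a single mixed layer. The sketch below uses toy 1-D operations and a softmax over architecture parameters to obtain the convex coefficients λ_i, which is one common way to enforce the constraints; the exact formulation and the alternating training/validation update schedule of the cited work are omitted, and all names are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Candidate operations o_1, ..., o_m for one layer (toy 1-D stand-ins).
ops = [
    lambda x: x,                      # identity
    lambda x: np.maximum(x, 0.0),     # "ReLU-like" operation
    lambda x: 0.5 * x,                # scaled operation, standing in for e.g. a convolution
]

alpha = np.zeros(len(ops))            # architecture parameters (learned on validation data)

def mixed_layer(x, alpha):
    lam = softmax(alpha)              # convex coefficients: lam_i >= 0, sum_i lam_i = 1
    return sum(lam_i * op(x) for lam_i, op in zip(lam, ops))

x = np.linspace(-1.0, 1.0, 5)
print(mixed_layer(x, alpha))                        # with alpha = 0, all ops weighted equally
print("discrete choice:", int(np.argmax(alpha)))    # final architecture: argmax over lambda
```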

3.4 Performance Estimation Strategy

The search strategies discussed in Sect. 3.3 aim at finding a neural architecture A that maximizes some performance measure, such as accuracy on unseen data. To guide their search process, these strategies need to estimate the performance of a given architecture A they consider. The simplest way of doing this is to train A on training data and evaluate its performance on validation data. However, training each architecture to be evaluated from scratch frequently yields computational demands in the order of thousands of GPU days for NAS [49, 50, 74, 75].