Epilogue: what have you learned this semester?
[Figure: a decision tree for spam classification on the features 'Viagra' and 'lottery', with leaf predictions ĉ(x) = spam / ĉ(x) = ham, alongside a scatter plot of labeled data.]
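The figure's spam tree can be sketched with scikit-learn. This is a hedged toy example: the four messages, their two binary features, and their labels are all made up for illustration and are not from the slides.

```python
# Toy version of the figure's spam tree, using two hypothetical binary
# features: does the message contain 'Viagra'? does it contain 'lottery'?
from sklearn.tree import DecisionTreeClassifier

X = [[1, 0], [1, 1], [0, 1], [0, 0]]   # [has 'Viagra', has 'lottery']
y = ["spam", "spam", "spam", "ham"]    # made-up labels for illustration

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
chat = clf.predict([[0, 0], [1, 1]])   # c-hat(x) for two new messages
```

On this consistent toy data the tree fits the training set exactly, so a message with neither word comes back "ham" and one with both comes back "spam".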
What did you get out of this course?

■ What skills have you learned in this course that you feel would be useful?
■ What are the most important insights you gained this semester?
■ What advice would you give future students?
■ What was your biggest challenge in this course?
■ What would you like me to do differently?
What I hope you got out of this course

The machine learning toolbox:
■ Formulating a problem as an ML problem
■ Understanding a variety of ML algorithms
■ Running and interpreting ML experiments
■ Understanding what makes ML work – theory and practice
Learning scenarios we covered

■ Classification: discrete/categorical labels
■ Regression: continuous labels
■ Clustering: no labels

[Figure: scatter plot illustrating the scenarios.]
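The three scenarios differ only in what (if anything) plays the role of the label. A hedged one-dimensional sketch with scikit-learn, on made-up data:

```python
# Classification, regression, and clustering on the same toy inputs.
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.cluster import KMeans

X = np.array([[0.0], [1.0], [2.0], [3.0]])

# Classification: discrete labels
clf = LogisticRegression().fit(X, [0, 0, 1, 1])
# Regression: continuous labels
reg = LinearRegression().fit(X, [0.1, 1.1, 2.0, 3.2])
# Clustering: no labels at all
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
```

Same inputs, three different questions: which class, which value, which group.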
A variety of learning tasks

■ Supervised learning
■ Unsupervised learning
■ Semi-supervised learning
  – Access to a lot of unlabeled data
■ Multi-label classification
  – Each example can belong to multiple classes
■ Multi-task classification
  – Solving multiple related tasks
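Multi-label classification can be sketched with a label-indicator matrix and a one-vs-rest wrapper. A hedged toy example (the data and the two label names are invented for illustration):

```python
# Multi-label: each example may carry several labels at once.
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
# Label-indicator matrix: column 0 = "sports", column 1 = "politics";
# the last example belongs to both classes.
Y = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])

clf = OneVsRestClassifier(LogisticRegression()).fit(X, Y)
pred = clf.predict(X)   # one 0/1 indicator vector per example
```

One-vs-rest trains one binary classifier per label, which is the simplest way to let predictions overlap.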
A variety of learning tasks (continued)

■ Outlier/novelty detection
  – Novelty: anything that is not part of the normal behavior of a system
■ Reinforcement learning
  – Learn actions to maximize payoff
■ Structured output learning
Other Learning Aides

1. Nonlinear dimension reduction
2. Hints (invariances and prior information): rotational invariance, monotonicity, symmetry, …
3. Removing noisy data
4. Advanced validation techniques: Rademacher and permutation penalties – more efficient than CV, more convenient and accurate than VC.

[Figures: before/after plots in (x1, x2) coordinates for dimension reduction, and in (intensity, symmetry) coordinates for noisy-data removal.]

© AML. Creator: Malik Magdon-Ismail. Learning Aides: 16/16
Learning in structured output spaces

Handle prediction problems with complex output spaces:
■ Structured outputs: multivariate, correlated, constrained
■ A general way to solve many learning problems

Examples taken from Ben Taskar's NIPS 2007 tutorial.
Local vs. Global

Global classification takes advantage of correlations and satisfies the constraints in the problem.
Other techniques

■ Graphical models (conditional random fields, Bayesian networks)
■ Bayesian model averaging
The importance of features and their representation

Choosing the right features is one of the most important aspects of applying ML. What you can do with features:
■ Normalization
■ Selection
■ Construction
■ Filling in missing features
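The four operations above compose naturally in a scikit-learn pipeline. A hedged sketch on invented data (the feature values and the choice of `k` are illustrative, not from the slides; feature construction, e.g. products of features, would be one more step):

```python
# Imputation, normalization, and selection chained into one pipeline.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif

X = np.array([[1.0, 200.0, np.nan],
              [2.0, 180.0, 5.0],
              [3.0, np.nan, 6.0],
              [4.0, 210.0, 7.0]])
y = np.array([0, 0, 1, 1])

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),  # fill in missing features
    ("scale", StandardScaler()),                 # normalization
    ("select", SelectKBest(f_classif, k=2)),     # selection
])
Xt = pipe.fit_transform(X, y)  # 4 examples, 2 surviving features
```

Wrapping the steps in a pipeline also ensures that imputation and scaling statistics are learned only from training data during cross-validation.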
Types of models

Geometric models
■ Ridge regression, SVM, perceptron
■ Neural networks
Distance-based models
■ k-nearest-neighbors
Probabilistic models
■ Naïve Bayes: P(Y = spam | Viagra, lottery)
Logical models: tree/rule based
■ Decision trees
Ensembles

[Figures: a linear decision boundary separating wᵀx + b > 0 from wᵀx + b < 0, and the spam decision tree on 'Viagra' and 'lottery'.]
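The probabilistic view estimates P(Y = spam | Viagra, lottery) directly. A hedged sketch with a Bernoulli naive Bayes on made-up data (the five messages and their labels are invented for illustration):

```python
# Naive Bayes on two hypothetical binary word features.
import numpy as np
from sklearn.naive_bayes import BernoulliNB

X = np.array([[1, 0], [1, 1], [0, 1], [0, 0], [0, 0]])
y = np.array([1, 1, 1, 0, 0])   # 1 = spam, 0 = ham (made up)

nb = BernoulliNB().fit(X, y)
# Posterior for a message containing both words: [P(ham|x), P(spam|x)]
proba = nb.predict_proba([[1, 1]])[0]
```

Unlike the tree, this model returns a calibrated-looking posterior rather than a hard label, which is useful when the cost of a false positive differs from that of a false negative.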
Loss + regularization

Many of the models we studied are based on a cost function of the form:

loss + regularization

Example: ridge regression

$$\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \mathbf{w}^{\top}\mathbf{x}_i\right)^2 + \lambda\,\mathbf{w}^{\top}\mathbf{w}$$
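Because the ridge objective is quadratic in w, setting its gradient to zero gives a closed-form solution: (XᵀX + Nλ I) w = Xᵀy. A minimal numpy sketch, on synthetic data invented for illustration:

```python
# Ridge regression via its normal equations.
# Objective: (1/N) * sum_i (y_i - w^T x_i)^2 + lam * w^T w
# Gradient = 0  =>  (X^T X + N*lam*I) w = X^T y
import numpy as np

rng = np.random.default_rng(0)
N, d = 50, 3
X = rng.normal(size=(N, d))
w_true = np.array([1.0, -2.0, 0.5])          # illustrative ground truth
y = X @ w_true + 0.1 * rng.normal(size=N)

lam = 0.1
w = np.linalg.solve(X.T @ X + N * lam * np.eye(d), X.T @ y)
```

The N in front of λ comes from the 1/N on the loss term; with the loss written as a plain sum, the diagonal term would be λ I instead.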
Loss + regularization for classification

SVM (hinge loss + L2 regularizer):

$$\frac{C}{N}\sum_{i=1}^{N}\max\left[1 - y_i h_{\mathbf{w}}(\mathbf{x}_i),\, 0\right] + \frac{1}{2}\mathbf{w}^{\top}\mathbf{w}$$

The hinge loss is a margin-maximizing loss function. Can use other regularizers: the L1 norm $||\mathbf{w}||_1$ leads to very sparse solutions and is non-differentiable. Elastic net regularizer:

$$\alpha||\mathbf{w}||_1 + (1-\alpha)||\mathbf{w}||_2^2$$
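The two formulas above translate directly into code. A hedged numpy sketch, assuming a linear model h_w(x) = ⟨w, x⟩ (the function names are mine, not from the slides):

```python
# Evaluate the SVM objective and the elastic net penalty for a given w.
import numpy as np

def svm_objective(w, X, y, C=1.0):
    # C/N * sum_i max(1 - y_i * <w, x_i>, 0)  +  0.5 * w^T w
    margins = y * (X @ w)
    hinge = np.maximum(1.0 - margins, 0.0)
    return C / len(y) * hinge.sum() + 0.5 * (w @ w)

def elastic_net_penalty(w, alpha=0.5):
    # alpha * ||w||_1 + (1 - alpha) * ||w||_2^2
    return alpha * np.abs(w).sum() + (1.0 - alpha) * (w @ w)
```

Points with margin y_i⟨w, x_i⟩ ≥ 1 contribute zero loss, which is exactly what makes the SVM solution depend only on the support vectors.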
Loss + regularization for classification

SVM (hinge loss + L2 regularizer):

$$\frac{C}{N}\sum_{i=1}^{N}\max\left[1 - y_i h_{\mathbf{w}}(\mathbf{x}_i),\, 0\right] + \frac{1}{2}\mathbf{w}^{\top}\mathbf{w}$$

Logistic regression (log loss + L2 regularizer):

$$\frac{1}{N}\sum_{i=1}^{N}\log\left(1 + \exp(-y_i h_{\mathbf{w}}(\mathbf{x}_i))\right) + \frac{\lambda}{2}\mathbf{w}^{\top}\mathbf{w}$$

AdaBoost can be shown to optimize the exponential loss:

$$\frac{1}{N}\sum_{i=1}^{N}\exp(-y_i h_{\mathbf{w}}(\mathbf{x}_i))$$
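Both losses above are functions of the same margin m_i = y_i h_w(x_i). A hedged numpy sketch, again assuming a linear h_w(x) = ⟨w, x⟩ (function names are mine):

```python
# Log loss (logistic regression) and exponential loss (AdaBoost),
# both written in terms of the margins y_i * <w, x_i>.
import numpy as np

def log_loss_obj(w, X, y, lam=1.0):
    # (1/N) sum_i log(1 + exp(-y_i <w, x_i>))  +  (lam/2) w^T w
    m = y * (X @ w)
    return np.mean(np.log1p(np.exp(-m))) + 0.5 * lam * (w @ w)

def exp_loss(w, X, y):
    # (1/N) sum_i exp(-y_i <w, x_i>)
    return np.mean(np.exp(-y * (X @ w)))
```

At w = 0 every margin is zero, so the log loss equals log 2 and the exponential loss equals 1; both decrease as the margins grow, but the exponential loss punishes negative margins much more aggressively.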
Loss + regularization for regression

Ridge regression (closed-form solution; sensitivity to outliers):

$$\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \mathbf{w}^{\top}\mathbf{x}_i\right)^2 + \lambda\,\mathbf{w}^{\top}\mathbf{w}$$

Lasso (sparse solutions; non-differentiable):

$$\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \mathbf{w}^{\top}\mathbf{x}_i\right)^2 + \lambda||\mathbf{w}||_1$$

Can use alternative loss functions.
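The sparsity contrast between the two penalties is easy to see empirically. A hedged scikit-learn sketch on synthetic data where only 3 of 10 features matter (data and regularization strengths are invented for illustration):

```python
# L1 (lasso) drives irrelevant coefficients exactly to zero;
# L2 (ridge) only shrinks them toward zero.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
w_true = np.zeros(10)
w_true[:3] = [2.0, -3.0, 1.5]            # only 3 relevant features
y = X @ w_true + 0.1 * rng.normal(size=100)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
ridge_zeros = int((np.abs(ridge.coef_) < 1e-8).sum())
lasso_zeros = int((np.abs(lasso.coef_) < 1e-8).sum())
```

On data like this, the lasso typically zeroes out most of the seven irrelevant coefficients while keeping the strong ones, whereas the ridge fit leaves all ten coefficients nonzero.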
Comparison of learning methods

Table 10.1 from "Elements of Statistical Learning" (Section 10.7, "Off-the-Shelf" Procedures for Data Mining). Key: ▲ = good, ◆ = fair, ▼ = poor.

| Characteristic | Neural Nets | SVM | Trees | MARS | k-NN, Kernels |
| --- | --- | --- | --- | --- | --- |
| Natural handling of data of "mixed" type | ▼ | ▼ | ▲ | ▲ | ▼ |
| Handling of missing values | ▼ | ▼ | ▲ | ▲ | ▲ |
| Robustness to outliers in input space | ▼ | ▼ | ▲ | ▼ | ▲ |
| Insensitive to monotone transformations of inputs | ▼ | ▼ | ▲ | ▼ | ▼ |
| Computational scalability (large N) | ▼ | ▼ | ▲ | ▲ | ▼ |
| Ability to deal with irrelevant inputs | ▼ | ▼ | ▲ | ▲ | ▼ |
| Ability to extract linear combinations of features | ▲ | ▲ | ▼ | ▼ | ◆ |
| Interpretability | ▼ | ▼ | ◆ | ▲ | ▼ |
| Predictive power | ▲ | ▲ | ▼ | ◆ | ▲ |

(The slide also carries the annotation "random forests.")
Comparison of learning methods (continued)

[Table 10.1 repeated from the previous slide; from "Elements of Statistical Learning".]
The scikit-learn algorithm cheat sheet

http://scikit-learn.org/stable/tutorial/machine_learning_map/
An extended version of the cheat sheet: https://medium.com/@chris_bour/an-extended-version-of-the-scikit-learn-cheat-sheet-5f46efc6cbb#.g942x8l3d
Applying machine learning

■ Always try multiple models
  – What would you start with?
■ If accuracy is not high enough:
  – Design new features
  – Collect more data
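"Always try multiple models" is a one-liner with cross-validation. A hedged sketch using a built-in scikit-learn dataset (illustrative only; the course's own data and the model shortlist would differ):

```python
# Compare several off-the-shelf models with 5-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
models = {
    "logreg": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(random_state=0),
    "knn": KNeighborsClassifier(),
}
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    results[name] = scores.mean()
```

Comparing mean CV scores (and their spread) before committing to one model is cheap insurance against picking a method that merely looked good on a single split.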