
Slide 1 (title)
In the Name of God
Machine Learning: Regression
Mohammad Ali Keyvanrad
Fall 1393-1394
Thanks to: M. Soleymani (Sharif University of Technology), R. Zemel (University of Toronto), R. Gutierrez-Osuna (Texas A&M University)



Slide 2: Outline
- 1-D regression
- Least-squares regression
- Non-iterative least-squares regression
- Basis functions
- Overfitting
- Validation

Slide 3: A simple example: 1-D regression

Slide 4: Example: Boston Housing data
Concerns housing values in suburbs of Boston. Features:
- CRIM: per capita crime rate by town
- RM: average number of rooms per dwelling
Use these to predict house prices in other neighborhoods.

Slide 5: Represent the Data

Slide 6: Noise
A simple model typically does not exactly fit the data; the lack of fit can be considered noise. Sources of noise:
- Imprecision in the data attributes (input noise)
- Errors in the data targets (mislabeling)
- Additional attributes, not captured by the data attributes, that affect the target values (latent variables)
- A model that is too simple to account for the data targets

Slide 7: Least-squares Regression

Slide 8: Optimizing the Objective

Slide 9: Optimizing Across the Training Set
A system of 2 linear equations

Slide 10: Non-iterative Least-squares Regression
An alternative optimization approach is non-iterative: take derivatives, set them to zero, and solve for the parameters.

Slide 11: Multi-dimensional inputs

Slide 12: Linear Regression
It is mathematically easy to fit linear models to data. There are many ways to make linear models more powerful while retaining their nice mathematical properties:
1. By using non-linear, non-adaptive basis functions, we can get generalized linear models that learn non-linear mappings from input to output but are linear in their parameters; only the linear part of the model learns.

Slide 13: Linear Regression (continued)
2. By using kernel methods, we can handle expansions of the raw data that use a huge number of non-linear, non-adaptive basis functions.
3. By using large-margin kernel methods, we can avoid overfitting even when we use huge numbers of basis functions.
But linear methods will not solve most AI problems; they have fundamental limitations.
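The non-iterative least-squares fit of Slide 10 (take derivatives, set them to zero, solve) can be sketched with the normal equations. This is a minimal sketch; the toy data below is an assumption for illustration, not taken from the slides.

```python
import numpy as np

# Toy 1-D data (an assumption for illustration): targets roughly 2*x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
t = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Design matrix with a bias column: row n is (1, x_n).
X = np.column_stack([np.ones_like(x), x])

# Non-iterative least squares: setting the gradient of the squared
# error to zero gives the normal equations (X^T X) w = X^T t.
w = np.linalg.solve(X.T @ X, X.T @ t)

print(w)  # w[0] is the intercept (about 1), w[1] is the slope (about 2)
```

The same two-equation system appears on Slide 9 for the 1-D case; stacking a bias column into the design matrix extends it directly to multi-dimensional inputs (Slide 11).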
Slide 14: Some types of basis functions in 1-D
Polynomials, sigmoids, and Gaussians. Sigmoid and Gaussian basis functions can also be used in multilayer neural networks, but neural networks learn the parameters of the basis functions. This is much more powerful, but also much harder and much messier.

Slide 15: Two types of linear model that are equivalent with respect to learning
The first model has the same number of adaptive coefficients as the dimensionality of the data + 1. The second model has the same number of adaptive coefficients as the number of basis functions + 1. Once we have replaced the data by the outputs of the basis functions, fitting the second model is exactly the same problem as fitting the first model (unless we use the kernel trick), so we'll just focus on the first model.

Slide 16: Fitting a polynomial
Now we use one of these basis functions: an M-th order polynomial function. We can use the same approaches, analytic and iterative, to optimize the values of the weights on each coefficient.

Slide 17: A sample for fitting a polynomial

Slide 18: A sample for fitting a polynomial

Slide 19: Minimizing squared error

Slide 20: Minimizing squared error

Slide 21: Online least mean squares: an alternative approach for really big datasets
This is called online learning. It can be more efficient if the dataset is very redundant, and it is simple to implement in hardware. It is also called stochastic gradient descent if the training cases are picked at random. Care must be taken with the learning rate to prevent divergent oscillations, and the rate must decrease at the end to get a good fit.

Slide 22: 1-D regression illustrates key concepts
- Data fits: is the linear model best (model selection)?
- The simplest models do not capture all the important variations (signal) in the data: they underfit.
- A more complex model may overfit the training data (fit not only the signal but also the noise in the data), especially if there is not enough data to constrain the model.
- One method of assessing fit: test generalization, i.e., the model's ability to predict held-out data.
- Optimization is essential: stochastic and batch iterative approaches; analytic when available.

Slide 23: Some fits to the data: which is best?

Slide 24: Overfitting

Slide 25: Overfitting
Causes of over-fitting:
- Model complexity, e.g. a model with a large number of parameters (degrees of freedom)
- A low number of training data: a small data size compared to the complexity of the model

Slide 26: Model complexity
Polynomials with larger M become increasingly tuned to the random noise on the target values.

Slide 27: Number of training data & overfitting
The over-fitting problem becomes less severe as the size of the training data increases.

Slide 28: Avoiding Over-fitting
Determine a suitable value for the model complexity:
- Simple method: hold some data out of the training set (the validation set) and use the held-out data to optimize the model complexity.
- Regularization: an explicit preference towards simple models; penalize model complexity in the objective function.

Slide 29: Validation
Almost invariably, all the pattern recognition techniques that we have introduced have one or more free parameters. Two issues arise at this point:
- Model selection: how do we select the optimal parameter(s) for a given classification problem?
- Validation: once we have chosen a model, how do we estimate its true error rate? The true error rate is the classifier's error rate when tested on the ENTIRE POPULATION.

Slide 30: Validation
In real applications, only a finite set of examples is available, and this number is usually smaller than we would hope for! Why?
Data collection is a very expensive process. One may be tempted to use the entire training data to select the optimal classifier and then estimate the error rate. This naive approach has two fundamental problems:
- The final model will normally overfit the training data.
- The error rate estimate will be overly optimistic (lower than the true error rate). In fact, it is not uncommon to achieve 100% correct classification on training data.

Slide 31: Validation
We must make the best use of our (limited) data for training, model selection, and performance estimation. Methods: holdout and cross-validation.

Slide 32: The holdout method
Split the dataset into two or three groups:
- Training set: used to train the classifier
- Validation set: used to select the optimal parameter(s) for a given classification problem
- Test set: used to estimate the error rate of the trained classifier

Slide 33: Simple hold-out: model selection

Slide 34: Training, validation, test sets

Slide 35: Validation
The holdout method has two basic drawbacks:
- In problems where we have a sparse dataset, we may not be able to afford the luxury of setting aside a portion of the dataset for testing.
- Since it is a single train-and-test experiment, the holdout estimate of the error rate will be misleading if we happen to get an unfortunate split.
These limitations of the holdout can be overcome, at the expense of higher computational cost, by cross-validation: random subsampling, K-fold cross-validation, leave-one-out cross-validation, and the bootstrap.

Slide 36: Random subsampling

Slide 37: Cross-validation

Slide 38: Leave-One-Out Cross Validation

Slide 39: Cross Validation
In practice, the choice of K depends on the size of the dataset. For large datasets, even 3-fold cross-validation will be quite accurate. For very sparse datasets, we may have to use leave-one-out in order to train on as many examples as possible. A common choice is K = 10.

Slide 40: Bootstrap
The bootstrap is a resampling technique with replacement. From a dataset with N examples, randomly
select N examples (with replacement) and use this set for training. The remaining examples that were not selected for training are used for testing; their number is likely to change from fold to fold. Repeat this process for a specified number of folds (K).

Slide 41: Bootstrap

Slide 42: Procedure outline in training
1. Divide the available data into training, validation, and test sets.
2. Select an architecture and training parameters.
3. Train the model using the training set.
4. Evaluate the model using the validation set.
5. Repeat steps 2 through 4 using different architectures and training parameters.
6. Select the best model and train it using data from both the training and validation sets.
7. Assess this final model using the test set.
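The bootstrap resampling of Slides 40-41 can be sketched as follows. This is a minimal sketch under assumed values: the dataset size N = 10, the fold count of 3, and the index-based stand-in examples are illustrative choices, not values from the slides.

```python
import random

random.seed(0)  # deterministic for illustration

N = 10                 # assumed dataset size (not from the slides)
data = list(range(N))  # stand-in examples: just their indices

for fold in range(3):  # a few bootstrap folds for illustration
    # Randomly select N examples WITH replacement for training.
    train = [random.choice(data) for _ in range(N)]
    # Examples never selected form the test set; its size varies per fold.
    test = [d for d in data if d not in train]
    print(fold, sorted(set(train)), test)
```

Because each draw is with replacement, some examples appear in `train` several times while others not at all; the leftover examples give a test set whose size changes from fold to fold, exactly the behavior noted on Slide 40.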