Regression - Carnegie Mellon School of Computer Science - SCS
Regression
Amr, 1/28
Slide credit: Aarti's lecture slides and Eric's lecture slides
Big Picture
• Supervised Learning
– Classification
• Input x: feature vector
• Output y: discrete class label
– Regression
• Input x: feature vector
• Output y: continuous value
Classification Tasks

Features, X → Labels, Y
• Tax fraud detection
• Diagnosing sickle cell anemia: anemic cell vs. healthy cell
• Web classification: Sports / Science / News
• Predicting Squirrel Hill residents: features such as "drives to CMU", "Rachel's fan", "shops at the SH Giant Eagle"; labels: resident / not resident
Classification

Goal: a classifier mapping features X to labels Y (e.g., Sports / Science / News) that minimizes the probability of error.
Classification

Optimal predictor (Bayes classifier): f*(x) = argmax_y P(Y = y | X = x)

It depends on the unknown distribution P(X, Y).
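As a concrete sketch (with a made-up joint distribution, since the slide leaves P(X, Y) abstract), the Bayes classifier just picks the most probable label for each x:

```python
import numpy as np

# Made-up joint distribution P(X, Y): rows index x in {0,1,2}, columns y in {0,1}
P = np.array([[0.10, 0.25],
              [0.20, 0.05],
              [0.30, 0.10]])

# Bayes classifier: f*(x) = argmax_y P(Y=y | X=x) = argmax_y P(X=x, Y=y)
f_star = P.argmax(axis=1)

# Its probability of error: the mass of the label it does NOT pick, summed over x
bayes_error = 1.0 - P.max(axis=1).sum()
print(f_star, bayes_error)
```

No classifier can do better than this error for the given P, which is why it serves as the benchmark, but computing it requires knowing P(X, Y).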
Discrete to Continuous Labels

Classification:
• X = Document, Y = Topic (Sports / Science / News)
• X = Cell Image, Y = Diagnosis (anemic cell vs. healthy cell)

Regression:
• Stock market prediction: X = Feb 01, Y = ? (a continuous value)
Regression

• What is the equivalent of the Bayes-optimal classifier?
• What if we can model P(Y|X)?
• How can we predict Y given a new X?
• We need a LOSS function
  – How about squared loss?
  – What should the prediction be?
Regression (see board)

Optimal predictor (conditional mean): f*(x) = E[Y | X = x]

(Dropping subscripts for notational convenience.)
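A quick numerical check of this claim (toy Gaussian samples, all parameters made up): among constant predictions for a fixed x, the sample mean minimizes the average squared loss.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=3.0, scale=1.0, size=100_000)   # draws of Y for a fixed X

# Empirical risk of predicting the constant c under squared loss
candidates = np.linspace(0.0, 6.0, 601)
risks = np.array([np.mean((y - c) ** 2) for c in candidates])
best = candidates[risks.argmin()]

# The minimizer is (up to grid resolution) the sample mean, i.e. E[Y | X]
print(best, y.mean())
```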
Models
• So how can we proceed?
• We need to make some assumptions to model P(Y|X)
– Linear form (basis function)
– Noise distribution
– Loss function
– Etc.
Regression Algorithms

• Linear regression
• Lasso, ridge regression (regularized linear regression)
• Nonlinear regression
• Kernel regression
• Regression trees, splines, wavelet estimators, …

Learning algorithm: the Empirical Risk Minimizer, which replaces the expected loss by its empirical mean over the training data.
Least Squares Estimator (on board)

β̂ = argmin_β ||y − Xβ||², with closed form β̂ = (XᵀX)⁻¹Xᵀy.
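A minimal sketch of the least squares estimator via the normal equations (synthetic data; the true β is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 3
X = rng.normal(size=(n, d))
beta_true = np.array([2.0, -1.0, 0.5])            # assumed ground truth
y = X @ beta_true + 0.1 * rng.normal(size=n)      # signal plus noise

# Normal equations: solve (X^T X) beta = X^T y rather than forming the inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)   # close to beta_true
```

Solving the linear system is both cheaper and numerically safer than explicitly inverting XᵀX.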
Vector Derivatives (see notes from the website)

Some useful facts (assume A is symmetric):
• ∂(aᵀx)/∂x = a
• ∂(xᵀx)/∂x = 2x
• ∂(xᵀAx)/∂x = (A + Aᵀ)x = 2Ax
Probabilistic Interpretation: MLE

Intuition: a signal plus (zero-mean) noise model, Y = Xβ + ε with ε ~ N(0, σ²).

Maximizing the log likelihood over β is equivalent to minimizing the sum of squared errors, so the least squares estimate is the same as the maximum likelihood estimate under a Gaussian model!
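The equivalence can be seen numerically: over any grid of candidate slopes (a one-parameter model with made-up numbers), the Gaussian log likelihood is a decreasing affine function of the sum of squared errors, so the two criteria pick the same β.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
x = rng.normal(size=n)
y = 1.5 * x + rng.normal(scale=0.5, size=n)   # assumed slope and noise level

sigma = 0.5
betas = np.linspace(0.0, 3.0, 3001)
sse = np.array([np.sum((y - b * x) ** 2) for b in betas])
# Gaussian log likelihood: -(n/2) log(2 pi sigma^2) - SSE / (2 sigma^2)
loglik = -(n / 2) * np.log(2 * np.pi * sigma ** 2) - sse / (2 * sigma ** 2)

b_ls = betas[sse.argmin()]     # least squares pick
b_mle = betas[loglik.argmax()] # maximum likelihood pick
print(b_ls, b_mle)             # identical
```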
Variations
• What if the noise terms are independent but not identically distributed?
  – Homework
• What if they are i.i.d. but not Gaussian?
• Think about robustness
– What if we have outliers?
Robustness

• The best fit from a quadratic regression (figure on slide)
• But this is probably better … (figure on slide)
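To make the outlier point concrete: squared loss lets one bad point drag the fit, while absolute loss largely ignores it. A toy line-through-origin example with a grid-search fit (all numbers made up):

```python
import numpy as np

x = np.arange(10, dtype=float)
y = x.copy()          # data lies on the line y = x ...
y[9] = 50.0           # ... except for one gross outlier

slopes = np.linspace(0.0, 5.0, 5001)
sq_loss = np.array([np.sum((y - s * x) ** 2) for s in slopes])
abs_loss = np.array([np.sum(np.abs(y - s * x)) for s in slopes])

s_sq = slopes[sq_loss.argmin()]    # dragged toward the outlier
s_abs = slopes[abs_loss.argmin()]  # stays on the true line y = x
print(s_sq, s_abs)
```

The squared-loss slope lands well above 2, while the absolute-loss slope stays at 1, which is the robustness trade-off the slide is pointing at.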
Regularized Least Squares and MAP

What if XᵀX is not invertible?

I) Gaussian prior: log likelihood + log prior. A prior belief that β is Gaussian with zero mean biases the solution toward "small" β.

Ridge Regression: β̂ = argmin_β ||y − Xβ||² + λ||β||₂², λ > 0. Closed form: HW.
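A minimal sketch using the standard ridge closed form (the one derived in the HW); the data and λ below are made up. Adding λI makes the system invertible for any λ > 0 and shrinks β:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 50, 10
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

lam = 1.0   # regularization strength (assumed value)
# Ridge: beta = (X^T X + lam I)^{-1} X^T y  -- invertible for any lam > 0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
beta_ls = np.linalg.solve(X.T @ X, X.T @ y)   # plain least squares, for contrast
print(np.linalg.norm(beta_ridge), np.linalg.norm(beta_ls))
```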
Regularized Least Squares and MAP

What if XᵀX is not invertible?

II) Laplace prior: log likelihood + log prior. A prior belief that β is Laplace with zero mean biases the solution toward "small" (sparse) β.

Lasso: β̂ = argmin_β ||y − Xβ||² + λ||β||₁. Closed form: HW (in general the lasso has no closed-form solution and is solved numerically).
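Since the lasso objective has no closed form in general, it is typically solved numerically; below is a minimal coordinate descent sketch (one soft-thresholding update per coordinate; the data, sparsity pattern, and λ are all made up):

```python
import numpy as np

def soft_threshold(z, t):
    """Scalar lasso solution: shrink toward zero, clipping at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for min_b 0.5*||y - X b||^2 + lam*||b||_1."""
    n, d = X.shape
    b = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(d):
            r_j = y - X @ b + X[:, j] * b[j]      # residual excluding feature j
            b[j] = soft_threshold(X[:, j] @ r_j, lam) / col_sq[j]
    return b

rng = np.random.default_rng(5)
n, d = 100, 20
X = rng.normal(size=(n, d))
beta_true = np.zeros(d)
beta_true[:3] = [3.0, -2.0, 1.5]                  # sparse ground truth (assumed)
y = X @ beta_true + 0.1 * rng.normal(size=n)

beta_lasso = lasso_cd(X, y, lam=10.0)
print(np.count_nonzero(beta_lasso))   # only a few coordinates survive
```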
Ridge Regression vs. Lasso

Ridge regression uses an l2 penalty; the lasso uses an l1 penalty (HOT!).

The lasso (l1 penalty) results in sparse solutions, i.e., vectors with more zero coordinates. Good for high-dimensional problems: you don't have to store all coordinates!

(Figure: in the (β1, β2) plane, the level sets of J(β) meet the constant-l1-norm ball, a diamond, at a corner, giving exact zeros, while they meet the constant-l2-norm ball, a circle, at a generic point.)
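The corner-touching picture has a clean algebraic counterpart in the orthonormal-design special case (XᵀX = I): ridge rescales every least-squares coefficient, while the lasso soft-thresholds them, producing exact zeros. (The coefficients and λ below are made up, and the ridge form assumes a (λ/2)||β||₂² penalty.)

```python
import numpy as np

beta_ls = np.array([3.0, 0.5, -0.2, 0.05])   # least-squares coefficients (assumed)
lam = 0.6

# Ridge with orthonormal design: uniform shrinkage, nothing becomes exactly zero
beta_ridge = beta_ls / (1 + lam)
# Lasso with orthonormal design: soft thresholding, small coefficients become zero
beta_lasso = np.sign(beta_ls) * np.maximum(np.abs(beta_ls) - lam, 0.0)
print(beta_ridge, beta_lasso)
```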
© Eric Xing @ CMU, 2006-2008
Case study: predicting gene expression

The genetic picture: a DNA sequence (e.g., CGTTTCACTGTACAATTT) contains causal SNPs, and the target is a univariate phenotype, i.e., the expression intensity of a gene.
Association Mapping as Regression

Genotype (diploid: two allele strings per individual)          Phenotype (BMI)
Individual 1:  . . C . . . . . T . . C . . . . . . . T . . .   2.5
               . . C . . . . . A . . C . . . . . . . T . . .
Individual 2:  . . G . . . . . A . . G . . . . . . . A . . .   4.8
               . . C . . . . . T . . C . . . . . . . T . . .
Individual N:  . . G . . . . . T . . C . . . . . . . T . . .   4.7
               . . G . . . . . T . . G . . . . . . . T . . .

(One causal SNP among benign SNPs.)
Association Mapping as Regression

Genotype (encoded as 0/1/2)                                    Phenotype (BMI)
Individual 1:  . . 0 . . . . . 1 . . 0 . . . . . . . 0 . . .   2.5
Individual 2:  . . 1 . . . . . 1 . . 1 . . . . . . . 1 . . .   4.8
Individual N:  . . 2 . . . . . 2 . . 1 . . . . . . . 0 . . .   4.7

y_i = Σ_{j=1}^{J} β_j x_ij

SNPs with large |β_j| are relevant.
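Putting this setup into code (toy sizes and one assumed causal SNP; everything here is synthetic, not the asthma data from the experiments):

```python
import numpy as np

rng = np.random.default_rng(6)
n, J = 200, 15                                    # individuals, SNPs (toy sizes)
X = rng.integers(0, 3, size=(n, J)).astype(float) # 0/1/2 genotype codes
beta_true = np.zeros(J)
beta_true[4] = 2.0                                # the one causal SNP (assumed)
y = X @ beta_true + 0.3 * rng.normal(size=n)      # phenotype, e.g. BMI

# Fit by least squares; SNPs with large |beta_j| are flagged as relevant
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
top_snp = int(np.abs(beta_hat).argmax())
print(top_snp, beta_hat[top_snp])
```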
Experimental setup

• Asthma dataset
  – 543 individuals, genotyped at 34 SNPs
  – Diploid data was transformed into 0/1 (for homozygotes) or 2 (for heterozygotes)
  – X = 543×34 matrix
  – Y = phenotype variable (continuous)
• A single phenotype was used for regression
• Implementation details
  – Iterative methods: batch update and online update implemented
  – For both methods, the step size α is a small fixed value (10⁻⁶), chosen based on the data used for the experiments
  – Both methods are run for at most 2000 epochs, or until the change in training MSE is less than 10⁻⁴
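A minimal sketch of the two update rules on synthetic data (the sizes, step size, and epoch count below are made up; the slides use α = 10⁻⁶ tuned to their data). With a small fixed step, one online epoch makes n times as many updates as one batch epoch, which is why it drives the MSE down much faster:

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 100, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def mse(b):
    return float(np.mean((y - X @ b) ** 2))

alpha, epochs = 1e-3, 200        # small fixed step size (assumed values)

# Batch update: one full-gradient step per epoch, using all n examples
b_batch = np.zeros(d)
for _ in range(epochs):
    b_batch -= alpha * (-2.0 / n) * (X.T @ (y - X @ b_batch))

# Online update: n single-example steps per epoch
b_online = np.zeros(d)
for _ in range(epochs):
    for i in range(n):
        b_online += alpha * (y[i] - X[i] @ b_online) * X[i]

print(mse(b_batch), mse(b_online))
```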
Convergence Curves

• For the batch method, the training MSE is initially large due to the uninformed initialization
• For the online update, the N per-example updates in every epoch reduce the MSE to a much smaller value
The Learned Coefficients
Performance vs. Training Size

• The results from the batch (B) and online (O) updates are almost identical, so the plots coincide.
• For small training sets, the test MSE from the normal equation is larger than that of B and O, probably due to overfitting.
• Since B and O are run for at most 2000 iterations, this cap roughly acts as a mechanism that avoids overfitting (early stopping).