
Lecture 6: Applications


Xiugang Wu

Fall 2019

University of Delaware


Statistics and Machine Learning

[Diagram: feature X → predictor h → prediction Ŷ = h(X)]

Consider the problem of predicting Y ∈ 𝒴 when given the information X ∈ 𝒳.

Here X ∈ 𝒳 is called the feature and Y ∈ 𝒴 is called the label (or target). Note that the problem includes the special case when X = ∅.

- Data generation mechanism: (X, Y) ∼ P, with P = P_X P_{Y|X}

- Performance measure: under a loss function ℓ : 𝒴 × 𝒴 → R₊, the performance of predictor h is measured by the risk L(h, P) = E_P[ℓ(Y, h(X))]

- If P is known, the optimal predictor is given by the Bayes predictor

  h* = argmin_h E_P[ℓ(Y, h(X))]

- What if P is unknown and instead we have access to data {(X_i, Y_i)}_{i=1}^n that are generated i.i.d. according to P?


Statistics and Machine Learning

Two approaches to the problem, which are generally known as the generative approach and the discriminative approach:

- Generative approach (statistical decision theory): estimate the distribution P based on data {(X_i, Y_i)}_{i=1}^n and then design the predictor; includes parametric and nonparametric estimation

- Discriminative approach (statistical learning theory): learn the predictor directly from data {(X_i, Y_i)}_{i=1}^n without the intermediate step of estimating P; includes classification and regression


Outline

•  Parametric Estimation
•  Nonparametric Estimation
•  Linear Regression and Logistic Regression
•  Support Vector Machine


Parametric Estimation

Distribution estimation problem: estimate the probability density p(y) of a random variable from observed data.

Parametric distribution estimation: choose from a family of densities p(y; x), indexed by a parameter x.

MLE (maximum likelihood estimation):  maximize_x  log p(y; x)
- y is the observed data
- l(x) = log p(y; x) is called the log-likelihood function
- can add constraints x ∈ C explicitly, or define p(y; x) = 0 for x ∉ C
- a convex optimization problem if log p(y; x) is concave in x for fixed y


Linear Measurements with IID Noise

Linear measurement model: y_i = a_i^T x + v_i, i = 1, 2, . . . , m
- x ∈ R^n is the vector of unknown parameters
- v_i is i.i.d. measurement noise, with density p(z)
- y_i is the measurement: y ∈ R^m has density p(y; x) = ∏_{i=1}^m p(y_i − a_i^T x)

ML estimate x_ML: any solution x of

  maximize  l(x) = Σ_{i=1}^m log p(y_i − a_i^T x)

Interpretation:
- estimate the probability density p(y) from observed data y_1, y_2, . . . , y_m
- densities are parameterized by x as p(y; x); e.g., if the noise is zero-mean, then the problem becomes estimating the mean of y, which is of the form (a_1^T x, . . . , a_m^T x)


Examples

- Gaussian noise N(0, σ²): p(z) = (2πσ²)^{−1/2} e^{−z²/(2σ²)}

  l(x) = −(m/2) log(2πσ²) − (1/(2σ²)) Σ_{i=1}^m (a_i^T x − y_i)²

  ML estimate is the least-squares (LS) solution

- Laplacian noise: p(z) = (1/(2a)) e^{−|z|/a}

  l(x) = −m log(2a) − (1/a) Σ_{i=1}^m |a_i^T x − y_i|

  ML estimate is the solution to an ℓ1-norm minimization problem
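As a concrete illustration of the two noise models, here is a minimal sketch in Python (not from the slides; it assumes numpy and cvxpy are available, and the problem sizes and noise levels are made up for illustration). Under Gaussian noise the ML estimate is the least-squares solution; under Laplacian noise it is the solution of an ℓ1-norm minimization.

```python
# Hypothetical data for the linear measurement model y_i = a_i^T x + v_i.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
m, n = 100, 10
A = rng.standard_normal((m, n))            # rows are a_i^T
x_true = rng.standard_normal(n)

# Gaussian noise: maximizing l(x) is equivalent to least squares.
y_gauss = A @ x_true + 0.1 * rng.standard_normal(m)
x_ml_gauss, *_ = np.linalg.lstsq(A, y_gauss, rcond=None)

# Laplacian noise: maximizing l(x) is equivalent to l1-norm minimization.
y_lap = A @ x_true + rng.laplace(scale=0.1, size=m)
x = cp.Variable(n)
cp.Problem(cp.Minimize(cp.norm(y_lap - A @ x, 1))).solve()
x_ml_lap = x.value
```

The ℓ1 fit is the classical way to be robust to occasional large residuals, which is consistent with the heavier tails of the Laplacian density.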


MAP Estimation

Maximum a posteriori probability (MAP) estimation is a Bayesian version of maximum likelihood estimation, with a prior probability density on the underlying parameter x.

We assume that x (the vector to be estimated) and y (the observation) are random variables with a joint probability density p(x, y) = p(x)p(y|x), where p(x) is the prior density of x and p(y|x) is the conditional density of y given x.

Given the observation y, the MAP estimate is the x that maximizes the posterior density of x given y, i.e.,

  x_MAP = argmax_x p(x|y)
        = argmax_x p(x, y)
        = argmax_x p(y|x) p(x)
        = argmax_x (log p(y|x) + log p(x))

- MAP reduces to ML when x is uniformly distributed
- for any MLE problem with concave log-likelihood, we can add a prior density p(x) that is log-concave, and the resulting MAP problem will be convex


Revisiting Linear Measurements with IID Noise

Linear measurement model: y_i = a_i^T x + v_i, i = 1, 2, . . . , m
- x ∈ R^n has prior density p(x)
- v_i is i.i.d. measurement noise, with density p_z(z)
- conditional density p(y|x) = ∏_{i=1}^m p_z(y_i − a_i^T x)

MAP estimate x_MAP can be found by solving

  maximize  Σ_{i=1}^m log p_z(y_i − a_i^T x) + log p(x)

- For example, if p_z(z) is N(0, σ_z²) and p(x) is N(x̄, Σ_x), then the MAP estimate can be found by solving the QP:

  minimize  (1/σ_z²) Σ_{i=1}^m (y_i − a_i^T x)² + (x − x̄)^T Σ_x^{−1} (x − x̄)
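For the Gaussian-noise, Gaussian-prior case, the QP above has a closed-form solution obtained by setting its gradient to zero. A minimal numpy sketch (not from the slides; the prior, noise level, and data are illustrative):

```python
# Hypothetical setup for y_i = a_i^T x + v_i with Gaussian noise and a Gaussian prior on x.
import numpy as np

rng = np.random.default_rng(1)
m, n = 100, 10
A = rng.standard_normal((m, n))            # rows are a_i^T
x_bar = np.zeros(n)                        # prior mean (illustrative)
Sigma_x = np.eye(n)                        # prior covariance (illustrative)
sigma_z = 0.1                              # noise standard deviation (illustrative)

x_true = x_bar + np.linalg.cholesky(Sigma_x) @ rng.standard_normal(n)
y = A @ x_true + sigma_z * rng.standard_normal(m)

# Setting the gradient of (1/sigma_z^2) * sum_i (y_i - a_i^T x)^2 + (x - x_bar)^T Sigma_x^{-1} (x - x_bar)
# to zero gives a linear system in x (the normal equations of the QP).
Sigma_inv = np.linalg.inv(Sigma_x)
H = A.T @ A / sigma_z**2 + Sigma_inv
b = A.T @ y / sigma_z**2 + Sigma_inv @ x_bar
x_map = np.linalg.solve(H, b)
```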


Outline

•  Parametric Estimation
•  Nonparametric Estimation
•  Linear Regression and Logistic Regression
•  Support Vector Machine


Nonparametric Estimation

Consider a discrete random variable X that takes values in the finite set 𝒳 = {a_1, . . . , a_n} and let p_i denote the probability of X being equal to a_i, i.e., p_i = p_X(a_i). The nonparametric estimation problem is to estimate p from the probability simplex

  {p | p ⪰ 0, 1^T p = 1}

based on a combination of prior information and, possibly, observations.

- many types of prior information about p can be expressed as linear equality or inequality constraints on p, e.g., E[X] = Σ_{i=1}^n a_i p_i = 3.3, E[X²] = Σ_{i=1}^n a_i² p_i ≤ 4, E[f(X)] = Σ_{i=1}^n f(a_i) p_i ∈ [l, u], Pr(X ∈ C) = Σ_{a∈C} p(a) = 0.3

- can also include prior constraints involving nonlinear functions of p, e.g., var(X) = E[X²] − E[X]² = Σ_{i=1}^n a_i² p_i − (Σ_{i=1}^n a_i p_i)²; a lower bound on the variance of X can be expressed as a convex quadratic inequality on p

- As another example, the prior constraint Pr(X ∈ A | X ∈ B) ∈ [l, u] can be expressed as c^T p / d^T p ∈ [l, u], i.e., l d^T p ≤ c^T p ≤ u d^T p

- In general, we can express the prior information about the distribution p as p ∈ 𝒫. We assume that 𝒫 can be described by a set of linear equalities and convex inequalities, including the basic constraints p ⪰ 0, 1^T p = 1


Nonparametric Estimation

Maximum Likelihood Estimation: Suppose we observe N independent samples x_1, . . . , x_N from the distribution. Let k_i denote the number of these samples with value a_i, so that k_1 + · · · + k_n = N. The log-likelihood function is then l(p) = Σ_{i=1}^n k_i log p_i, which is a concave function of p. The ML estimate of p can be found by solving the convex problem

  maximize    Σ_{i=1}^n k_i log p_i
  subject to  p ∈ 𝒫

Maximum Entropy Estimation: The maximum entropy distribution consistent with the prior assumptions can be found by solving the convex problem

  maximize    −Σ_{i=1}^n p_i log p_i
  subject to  p ∈ 𝒫

where the objective function −Σ_{i=1}^n p_i log p_i is concave in p.
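Both convex problems can be written down almost verbatim with a modeling tool. Here is a minimal sketch assuming cvxpy (with its default solvers) is available; the support values a_i, the counts k_i, and the prior constraint E[X] = 3.3 are illustrative, with 𝒫 taken to be the simplex plus that single moment constraint.

```python
# ML and maximum-entropy estimates of a discrete distribution p over {a_1, ..., a_n}.
import numpy as np
import cvxpy as cp

a = np.arange(1, 7)                        # support {1, ..., 6} (illustrative)
k = np.array([3, 5, 8, 10, 7, 2])          # observed counts, k_1 + ... + k_n = N (illustrative)

p = cp.Variable(len(a), nonneg=True)
prior = [cp.sum(p) == 1, a @ p == 3.3]     # simplex plus one linear moment constraint

# Maximum likelihood: maximize sum_i k_i log p_i over p in P
cp.Problem(cp.Maximize(k @ cp.log(p)), prior).solve()
p_ml = p.value

# Maximum entropy: maximize -sum_i p_i log p_i over p in P
cp.Problem(cp.Maximize(cp.sum(cp.entr(p))), prior).solve()
p_me = p.value
```

With only the simplex constraint, the ML estimate reduces to the empirical frequencies k_i/N; the maximum entropy problem uses only the prior constraints and ignores the samples.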


Outline

•  Parametric Estimation
•  Nonparametric Estimation
•  Linear Regression and Logistic Regression
•  Support Vector Machine


Linear Regression

Linear regression with squared loss: learn a linear predictor h_a(x) = a^T x from training data {(x_i, y_i)}_{i=1}^n

- 𝒳 = R^d, 𝒴 = R, Ŷ = R
- a ∈ R^d is the parameter to be learned
- loss function ℓ(y, ŷ) = (y − ŷ)²
- risk of h_a under distribution P: L(h_a, P) = E_P[(Y − a^T X)²]
- empirical risk of h_a: (1/n) Σ_{i=1}^n (y_i − a^T x_i)² = L(h_a, P̂), where P̂ denotes the empirical distribution of (X, Y)

Empirical Risk Minimization (ERM):

  minimize_a  Σ_{i=1}^n (y_i − a^T x_i)²

which is an ordinary least-squares (OLS) problem
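A minimal numpy sketch (not from the slides; the data is illustrative) of the ERM/OLS problem, also showing the empirical risk, i.e., the risk evaluated under the empirical distribution P̂:

```python
# ERM for linear regression with squared loss is an OLS problem.
import numpy as np

rng = np.random.default_rng(6)
n, d = 80, 4
X = rng.standard_normal((n, d))            # rows are x_i^T
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + 0.2 * rng.standard_normal(n)

a_erm, *_ = np.linalg.lstsq(X, y, rcond=None)
empirical_risk = np.mean((y - X @ a_erm) ** 2)   # (1/n) sum_i (y_i - a^T x_i)^2
```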


Regularization

Two types of regularization: constrained ERM and penalized ERM, where constrained ERM explicitly constrains the complexity of the model, and penalized ERM penalizes models with high complexity.

Constrained ERM: e.g., linear regression with a constraint on the ℓ1 or ℓ2 norm

  minimize    Σ_{i=1}^n (y_i − a^T x_i)²
  subject to  ‖a‖₁ ≤ r

  minimize    Σ_{i=1}^n (y_i − a^T x_i)²
  subject to  ‖a‖₂² ≤ r

Penalized ERM: linear regression with an ℓ1 or ℓ2 norm regularizer

  minimize  Σ_{i=1}^n (y_i − a^T x_i)² + λ‖a‖₁   (LASSO Regression)

  minimize  Σ_{i=1}^n (y_i − a^T x_i)² + λ‖a‖₂²   (Ridge Regression)
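A minimal sketch of the two penalized ERM problems, assuming cvxpy is available; the data and the value of λ are illustrative:

```python
# Penalized ERM for linear regression with l1 (LASSO) and squared-l2 (ridge) regularizers.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(2)
n, d = 50, 10
X = rng.standard_normal((n, d))            # rows are x_i^T
y = X @ np.array([1.5, -2.0] + [0.0] * (d - 2)) + 0.1 * rng.standard_normal(n)
lam = 1.0                                  # illustrative regularization weight

a = cp.Variable(d)
residual = cp.sum_squares(X @ a - y)

cp.Problem(cp.Minimize(residual + lam * cp.norm(a, 1))).solve()
a_lasso = a.value                          # tends to be sparse

cp.Problem(cp.Minimize(residual + lam * cp.sum_squares(a))).solve()
a_ridge = a.value                          # shrunk but generally dense
```

The ℓ1 penalty tends to produce sparse coefficient vectors, while the squared ℓ2 penalty shrinks all coefficients without zeroing them out.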


Multi-criterion Interpretation

  minimize (w.r.t. R²₊)  (‖Da − y‖₂², ‖a‖₂²)

[Figure: optimal trade-off curve of ‖a‖₂² versus ‖Da − y‖₂²]

- example for D ∈ R^{100×10} with D = [x_1^T; x_2^T; . . . ; x_100^T]; the heavy line is formed by Pareto optimal points

- to determine Pareto optimal points, take λ = (1, γ) with γ > 0 and minimize

  ‖Da − y‖₂² + γ‖a‖₂²

- for fixed γ, this is an OLS problem

In general, constrained ERM and penalized ERM are equivalent if the criterion functions are all convex.
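A minimal numpy sketch (not from the slides; D, y, and the γ grid are illustrative) of the scalarization: sweeping γ and solving the resulting regularized least-squares problem in closed form traces out Pareto optimal points of the bi-criterion problem.

```python
# Trace the Pareto frontier of (||Da - y||_2^2, ||a||_2^2) by sweeping gamma.
import numpy as np

rng = np.random.default_rng(3)
D = rng.standard_normal((100, 10))
y = rng.standard_normal(100)

frontier = []
for gamma in np.logspace(-3, 3, 25):
    # minimize ||Da - y||_2^2 + gamma * ||a||_2^2 has the closed-form solution below.
    a = np.linalg.solve(D.T @ D + gamma * np.eye(10), D.T @ y)
    frontier.append((np.sum((D @ a - y) ** 2), np.sum(a ** 2)))
# Each pair in `frontier` is a Pareto optimal point of the bi-criterion problem.
```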


Logistic Regression

Logistic regression: learn the predictor h_a(x) = 1/(1 + e^{−a^T x}) from training data {(x_i, y_i)}_{i=1}^n

- 𝒳 = R^d, 𝒴 = {+1, −1}, Ŷ = [0, 1]
- ŷ is the predicted probability of the label of x being +1
- a ∈ R^d is the parameter to be learned
- loss function ℓ(y, h_a(x)) = log(1 + e^{−y a^T x})
- risk of h_a under distribution P: L(h_a, P) = E_P[log(1 + e^{−Y a^T X})]
- empirical risk of h_a: (1/n) Σ_{i=1}^n log(1 + e^{−y_i a^T x_i}) = L(h_a, P̂)

Empirical Risk Minimization (ERM):

  minimize_a  Σ_{i=1}^n log(1 + e^{−y_i a^T x_i})

which is a convex optimization problem. The convexity of the optimization problem is inherited from the convexity of the loss function for a given data point (x, y).
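A minimal numpy sketch (not from the slides; the data, step size, and iteration count are illustrative) of logistic-regression ERM by gradient descent on the convex objective:

```python
# Logistic regression ERM: minimize sum_i log(1 + exp(-y_i a^T x_i)) by gradient descent.
import numpy as np

rng = np.random.default_rng(4)
n, d = 200, 5
X = rng.standard_normal((n, d))            # rows are x_i^T
a_true = rng.standard_normal(d)
y = np.where(X @ a_true + 0.3 * rng.standard_normal(n) > 0, 1.0, -1.0)

a = np.zeros(d)
step = 0.01
for _ in range(2000):
    margins = y * (X @ a)
    # Gradient of sum_i log(1 + exp(-margins_i)) with respect to a.
    grad = -(X.T @ (y / (1.0 + np.exp(margins))))
    a -= step * grad

prob_positive = 1.0 / (1.0 + np.exp(-(X @ a)))   # h_a(x) on the training points
```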


Outline

•  Parametric Estimation
•  Nonparametric Estimation
•  Linear Regression and Logistic Regression
•  Support Vector Machine


Classification

Binary classification: learn a halfspace predictor h_a(x) = sgn(a^T x)

- 𝒳 = R^d, 𝒴 = Ŷ = {+1, −1}
- a ∈ R^d is the parameter to be learned
- loss function ℓ(y, h_a(x)) = I(y ≠ sgn(a^T x))
- risk of h_a under distribution P: L(h_a, P) = E_P[I(Y ≠ sgn(a^T X))]
- empirical risk of h_a: (1/n) Σ_{i=1}^n I(y_i ≠ sgn(a^T x_i)) = L(h_a, P̂)

Empirical Risk Minimization (ERM):

  minimize_a  Σ_{i=1}^n I(y_i ≠ sgn(a^T x_i))

- If the data is linearly separable, then there exists some a_1 such that y_i a_1^T x_i > 0, ∀i.
- Let a_2 = a_1 / (min_i y_i a_1^T x_i). Then we have y_i a_2^T x_i ≥ 1, ∀i.
- Therefore, ERM is equivalent to the feasibility problem:

  find a  subject to  y_i a^T x_i ≥ 1, ∀i

- There are infinitely many ERM solutions. Which one should we pick?


Support Vector Machine

The Support Vector Machine (SVM) seeks an ERM hyperplane that separates the training set with the largest margin.
- the margin γ of a hyperplane with respect to a training set is the minimal Euclidean distance between a point in the training set and the hyperplane
- γ = min_i y_i a^T x_i / ‖a‖
- if we scale a such that min_i y_i a^T x_i = 1, then γ = 1/‖a‖

Support Vector Machine (SVM):

  minimize    ‖a‖₂
  subject to  y_i a^T x_i ≥ 1, ∀i
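A minimal sketch of the hard-margin SVM, assuming cvxpy is available; the two-dimensional toy data is generated to be linearly separable:

```python
# Hard-margin SVM: minimize ||a||_2 subject to y_i a^T x_i >= 1 for all i.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(5)
n, d = 40, 2
X = rng.standard_normal((n, d))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)
X += 0.5 * y[:, None]                      # push the classes apart so the data is separable

a = cp.Variable(d)
constraints = [cp.multiply(y, X @ a) >= 1]
cp.Problem(cp.Minimize(cp.norm(a, 2)), constraints).solve()

margin = 1.0 / np.linalg.norm(a.value)     # geometric margin of the separating hyperplane
```

The constraints that are active at the optimum (those with y_i a^T x_i = 1) correspond to the support vectors, which alone determine the maximum-margin hyperplane.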