2. Supervised Learning


Supervised Learning. By: Shedriko

Outline & Content
- Preliminary
- Linear Regression
- Classification and Logistic Regression
- Generalized Linear Models

Preliminary
Supervised learning is the machine learning task of inferring a function from labeled training data. The training data consist of a set of training examples; each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal).

x(i) = input, y(i) = output; (i) is an index (not an exponent)
(x(i), y(i)) = a training example
{(x(i), y(i)); i = 1, 2, ..., m} = the training set

Preliminary
X = the space of input values, Y = the space of output values
Regression: the target variable is continuous, such as in the housing problem
Classification: the target variable takes only a small number of discrete values, such as whether a dwelling is a house or an apartment

A. Linear Regression
Outline & Content
- Problem
- LMS (least mean square) algorithm
- Normal equations
- Probabilistic interpretation
- LWR (locally weighted linear regression)

Preliminary
θ : the weights (parameters), parameterizing the space of linear functions mapping from X to Y
Hypothesis: h(x) = Σᵢ θᵢ xᵢ = θᵀx, which we want to satisfy h(x) ≈ y

Problem
How do we learn the parameters θ? Make h(x) close to y, i.e., make h(x(i)) close to y(i) for each training example. Each value of θ is scored by the cost function
J(θ) = (1/2) Σ_{i=1..m} (h(x(i)) − y(i))²

LMS algorithm
Minimize J(θ) using the gradient descent algorithm, which starts with an initial guess for θ and repeatedly performs the update (for each parameter index j)
θⱼ := θⱼ − α ∂J(θ)/∂θⱼ
α : the learning rate
Consider first the case where there is only one training example.

LMS algorithm
For a single training example, working out the partial derivative gives the LMS update rule
θⱼ := θⱼ + α (y(i) − h(x(i))) xⱼ(i)
For the whole training set (i = 1 to m), batch gradient descent repeats until convergence
θⱼ := θⱼ + α Σ_{i=1..m} (y(i) − h(x(i))) xⱼ(i)   (for every j)

LMS algorithm
Example (housing data): batch gradient descent initialized at (48, 30) gives
θ₀ = 71.27, θ₁ = 0.1345 (with one input feature)
or, with two input features,
θ₀ = 89.60, θ₁ = 0.1392, θ₂ = −8.738
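The following is a rough NumPy sketch of the batch LMS update above, on made-up data (the feature values, learning rate, and iteration count are illustrative assumptions, not from the slides):

```python
import numpy as np

# Toy data: m examples, one feature, plus an intercept column of ones.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.uniform(0, 10, size=100)])
y = 3.0 + 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

alpha = 0.01                  # learning rate
theta = np.zeros(X.shape[1])  # initial guess

for _ in range(5000):
    # Batch LMS update: theta_j := theta_j + alpha * sum_i (y(i) - h(x(i))) * x_j(i)
    grad = X.T @ (y - X @ theta)
    theta = theta + alpha * grad / len(y)  # dividing by m keeps alpha stable

print(theta)  # should approach [3.0, 2.0]
```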

The normal equations
Matrix derivatives
For a function f : R^{m×n} → R mapping m-by-n matrices to real numbers, we define the derivative of f with respect to A to be the matrix of partial derivatives ∂f/∂Aᵢⱼ.

Thus, the gradient ∇_A f(A) is itself an m-by-n matrix, whose (i, j) element is ∂f/∂Aᵢⱼ.

The normal equations
Example: suppose A is a 2-by-2 matrix, and the function f : R^{2×2} → R is given by
f(A) = (3/2) A₁₁ + 5 A₁₂² + A₂₁A₂₂

Here, Aᵢⱼ denotes the (i, j) entry of the matrix A. We then have
∇_A f(A) = [ 3/2    10 A₁₂ ]
           [ A₂₂    A₂₁    ]

The normal equations
Trace: for an n-by-n (square) matrix A, the trace of A (written tr A) is defined to be the sum of its diagonal entries:
tr A = Σ_{i=1..n} Aᵢᵢ

If a is a real number (i.e., a 1-by-1 matrix), then tr a = a. A property of the trace for two matrices A and B such that AB is square is
tr AB = tr BA

We also have, e.g.,
tr ABC = tr CAB = tr BCA
tr A = tr Aᵀ,  tr(A + B) = tr A + tr B,  tr aA = a tr A
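A quick numeric spot-check of these trace identities, using random matrices (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(4, 3))   # AB is square (3x3), BA is square (4x4)
a = 2.5

assert np.isclose(np.trace(A @ B), np.trace(B @ A))            # tr AB = tr BA
C = rng.normal(size=(3, 3))
D = rng.normal(size=(3, 3))
assert np.isclose(np.trace(C), np.trace(C.T))                   # tr A = tr A^T
assert np.isclose(np.trace(C + D), np.trace(C) + np.trace(D))   # tr(A+B) = tr A + tr B
assert np.isclose(np.trace(a * C), a * np.trace(C))             # tr aA = a tr A
```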

The normal equations
We then have the following facts about matrix derivatives:
∇_A tr AB = Bᵀ              (1)
∇_{Aᵀ} f(A) = (∇_A f(A))ᵀ   (2)

And:
∇_A tr ABAᵀC = CAB + CᵀABᵀ  (3)
∇_A |A| = |A| (A⁻¹)ᵀ        (4)

Equation (4) applies only to non-singular square matrices A, where |A| denotes the determinant of A.

The normal equations
Least squares revisited
Define the design matrix X to be the m-by-n matrix (m-by-(n+1) if we include the intercept term) whose rows contain the training examples' input values:
X = [ (x(1))ᵀ ; (x(2))ᵀ ; ... ; (x(m))ᵀ ]

Also, let y⃗ be the m-dimensional vector containing all the target values of the training set:
y⃗ = [ y(1) ; y(2) ; ... ; y(m) ]

The normal equations
Since h(x(i)) = (x(i))ᵀθ, we can easily verify that
Xθ − y⃗ = [ h(x(1)) − y(1) ; ... ; h(x(m)) − y(m) ]

For a vector z, we have that zᵀz = Σᵢ zᵢ², and thus
(1/2)(Xθ − y⃗)ᵀ(Xθ − y⃗) = (1/2) Σ_{i=1..m} (h(x(i)) − y(i))² = J(θ)

The normal equations
To minimize J, we take its derivatives with respect to θ. Combining equations (2) and (3), we find that
∇_{Aᵀ} tr ABAᵀC = BᵀAᵀCᵀ + BAᵀC   (5)

Hence
∇_θ J(θ) = ∇_θ (1/2)(Xθ − y⃗)ᵀ(Xθ − y⃗) = (1/2) ∇_θ (θᵀXᵀXθ − θᵀXᵀy⃗ − y⃗ᵀXθ + y⃗ᵀy⃗) = XᵀXθ − Xᵀy⃗

To minimize J we set its derivatives to zero, and obtain the normal equations
XᵀXθ = Xᵀy⃗
Thus, the value of θ that minimizes J(θ) is given in closed form by
θ = (XᵀX)⁻¹Xᵀy⃗
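A minimal sketch of solving the normal equations on toy data (the data and names are assumptions; numerically, solving the linear system is preferable to forming the explicit inverse):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.uniform(0, 10, size=100)])
y = 3.0 + 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

# Normal equations: X^T X theta = X^T y
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # closed-form least-squares solution, close to [3.0, 2.0]
```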

Probabilistic Interpretation
Assume y(i) = θᵀx(i) + ε(i), where ε(i) is an error term (capturing unmodeled effects or random noise). Further assume that the ε(i) are distributed independently and identically (i.i.d.), according to a Gaussian distribution (normal distribution), i.e., ε(i) ~ N(0, σ²). The density of ε(i) is given by
p(ε(i)) = (1/(√(2π) σ)) exp( −(ε(i))² / (2σ²) )
This implies that
p(y(i) | x(i); θ) = (1/(√(2π) σ)) exp( −(y(i) − θᵀx(i))² / (2σ²) )

The notation p(y(i) | x(i); θ) indicates that this is the distribution of y(i) given x(i) and parameterized by θ.

Probabilistic Interpretation
Given X (the design matrix, which contains all the x(i)) and θ, the probability of the data is given by p(y⃗ | X; θ). Viewed as a function of θ, this is called the likelihood function:
L(θ) = L(θ; X, y⃗) = p(y⃗ | X; θ)
By the independence assumption on the ε(i), this can also be written
L(θ) = Π_{i=1..m} p(y(i) | x(i); θ) = Π_{i=1..m} (1/(√(2π) σ)) exp( −(y(i) − θᵀx(i))² / (2σ²) )

The principle of maximum likelihood says we should choose θ so as to make the data as high-probability as possible, i.e., we maximize L(θ).

Probabilistic Interpretation
The derivations will be a bit simpler if we instead maximize the log likelihood ℓ(θ):
ℓ(θ) = log L(θ) = m log(1/(√(2π) σ)) − (1/σ²) · (1/2) Σ_{i=1..m} (y(i) − θᵀx(i))²

Maximizing ℓ(θ) is therefore the same as minimizing
(1/2) Σ_{i=1..m} (y(i) − θᵀx(i))²
which we recognize as J(θ), the original least-squares cost function.
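A small numeric illustration of this equivalence on made-up data with a fixed σ: a parameter vector with higher Gaussian log likelihood also has lower least-squares cost J(θ):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.5, -2.0]) + rng.normal(scale=0.5, size=50)
sigma = 0.5

def log_likelihood(theta):
    resid = y - X @ theta
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - resid**2 / (2 * sigma**2))

def J(theta):
    resid = y - X @ theta
    return 0.5 * np.sum(resid**2)

theta_a, theta_b = np.array([1.5, -2.0]), np.array([0.0, 0.0])
assert log_likelihood(theta_a) > log_likelihood(theta_b)
assert J(theta_a) < J(theta_b)   # higher likelihood <=> lower squared error
```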

LWR algorithm
In the original linear regression algorithm, to make a prediction at a query point x (i.e., to evaluate h(x)), we would:
1. Fit θ to minimize Σᵢ (y(i) − θᵀx(i))²
2. Output θᵀx
In contrast, the locally weighted linear regression algorithm does the following:
1. Fit θ to minimize Σᵢ w(i) (y(i) − θᵀx(i))²
2. Output θᵀx

LWR algorithm
The w(i) are non-negative valued weights; a standard choice is
w(i) = exp( −(x(i) − x)² / (2τ²) )

If |x(i) − x| is small, then w(i) is close to 1 (one); if |x(i) − x| is large, then w(i) is small.
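A minimal sketch of a locally weighted prediction at a single query point, assuming the Gaussian-shaped weights above with bandwidth τ (the function and variable names are illustrative):

```python
import numpy as np

def lwr_predict(X, y, x_query, tau=0.5):
    """Locally weighted linear regression prediction at x_query."""
    # w(i) = exp(-||x(i) - x_query||^2 / (2 tau^2))
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * tau ** 2))
    W = np.diag(w)
    # Weighted normal equations: (X^T W X) theta = X^T W y
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return x_query @ theta

# Toy usage: noisy sine curve with an intercept feature.
rng = np.random.default_rng(0)
x = np.linspace(0, 6, 80)
X = np.column_stack([np.ones_like(x), x])
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)
print(lwr_predict(X, y, np.array([1.0, 3.0])))  # roughly sin(3.0)
```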

B. Classification and Logistic Regression
Outline & Content
- Preliminary
- Logistic regression
- Digression: the perceptron learning algorithm
- Another algorithm for maximizing ℓ(θ)

Preliminary
Focus on the binary classification problem, in which y can take on only two values, 0 (zero) and 1 (one). For example, to build a spam classifier for email, x(i) may be some features of a piece of email, and y may be 1 if it is spam or 0 otherwise.
0 (zero) : the negative class (−)
1 (one) : the positive class (+)
Given x(i), the corresponding y(i) is called the label for the training example.

Logistic regression
Since it makes little sense for h(x) to take values larger than 1 or smaller than 0, we will choose
h(x) = g(θᵀx) = 1 / (1 + e^(−θᵀx))

Logistic regression
g(z) = 1 / (1 + e^(−z)) is called the logistic function or sigmoid function; g(z) tends towards 1 as z → ∞, and g(z) tends towards 0 as z → −∞.

A useful property of the sigmoid function is the form of its derivative:
g′(z) = g(z)(1 − g(z))
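A quick finite-difference check of this derivative property at a few points (illustrative only):

```python
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-3.0, 0.0, 2.0])
numerical = (g(z + 1e-6) - g(z - 1e-6)) / 2e-6   # finite-difference derivative
analytic = g(z) * (1 - g(z))                      # g'(z) = g(z)(1 - g(z))
assert np.allclose(numerical, analytic, atol=1e-6)
```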

Logistic regression
Assume that
P(y = 1 | x; θ) = h(x)
P(y = 0 | x; θ) = 1 − h(x)
Note: this can be written more compactly as
p(y | x; θ) = (h(x))^y (1 − h(x))^(1−y)

Assuming that the m training examples were generated independently, we can write down the likelihood of the parameters as
L(θ) = p(y⃗ | X; θ) = Π_{i=1..m} p(y(i) | x(i); θ) = Π_{i=1..m} (h(x(i)))^{y(i)} (1 − h(x(i)))^{1−y(i)}

Logistic regression
As before, it is easier to maximize the log likelihood:
ℓ(θ) = log L(θ) = Σ_{i=1..m} [ y(i) log h(x(i)) + (1 − y(i)) log(1 − h(x(i))) ]

Logistic regression
We maximize ℓ(θ) by gradient ascent, θ := θ + α ∇_θ ℓ(θ) (similar to the derivation in linear regression). Taking derivatives for one training example (x, y), and using the property g′(z) = g(z)(1 − g(z)), it is given by
∂ℓ(θ)/∂θⱼ = (y − h(x)) xⱼ

This gives us the stochastic gradient ascent rule
θⱼ := θⱼ + α (y(i) − h(x(i))) xⱼ(i)
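A minimal sketch of (batch) gradient ascent on ℓ(θ) using this update, on made-up binary data (the learning rate and iteration count are arbitrary assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy binary data: label tends to be 1 when the second feature is positive.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = (X[:, 1] + 0.3 * rng.normal(size=200) > 0).astype(float)

alpha, theta = 0.1, np.zeros(2)
for _ in range(2000):
    h = sigmoid(X @ theta)
    # Gradient ascent on l(theta): theta_j += alpha * sum_i (y(i) - h(x(i))) x_j(i)
    theta = theta + alpha * X.T @ (y - h) / len(y)

print(theta)  # theta_1 should be clearly positive
```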

Digression: The perceptron learning algorithm
Consider modifying the logistic regression method to "force" it to output values that are exactly either 0 or 1. To do so, it seems natural to change the definition of g to be the threshold function:
g(z) = 1 if z ≥ 0,  0 if z < 0

If we then let h(x) = g(θᵀx) as before, but with this modified definition of g, and use the update rule
θⱼ := θⱼ + α (y(i) − h(x(i))) xⱼ(i)
then we have the perceptron learning algorithm.

Another algorithm for maximizing ℓ(θ)
Suppose we have some function f : R → R, and we wish to find a value of θ so that f(θ) = 0; here θ is a real number. Newton's method performs the update
θ := θ − f(θ) / f′(θ)

The leftmost figure shows the function f plotted along with the line y = 0; the value of θ for which f(θ) = 0 is about 1.3. In the middle figure, one step of Newton's method follows the tangent line down to the next guess, θ ≈ 2.8. In the rightmost figure, one more iteration gives θ ≈ 1.8. After a few more iterations, we rapidly approach θ = 1.3.
What if we want to use it to maximize some function ℓ? The maxima of ℓ correspond to points where its first derivative ℓ′(θ) is zero.

So, by letting f(θ) = ℓ′(θ), we can use the same algorithm to maximize ℓ, and we obtain the update rule
θ := θ − ℓ′(θ) / ℓ″(θ)

In logistic regression θ is vector-valued, so Newton's method must be generalized. The Newton-Raphson method (the generalization of Newton's method to this multidimensional setting) is given by
θ := θ − H⁻¹ ∇_θ ℓ(θ)
∇_θ ℓ(θ) is the vector of partial derivatives of ℓ(θ) with respect to the θᵢ's; H is an n-by-n matrix called the Hessian, whose entries are given by
Hᵢⱼ = ∂²ℓ(θ) / (∂θᵢ ∂θⱼ)


When Newton's method is applied to maximize the logistic regression log likelihood function ℓ(θ), the resulting method is also called Fisher scoring.
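A sketch of one way to implement this Newton update for logistic regression, using the standard gradient ∇ℓ = Xᵀ(y − h) and Hessian H = −XᵀSX with S = diag(h(1 − h)); the toy data and names are assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = (X[:, 1] + 0.5 * rng.normal(size=200) > 0).astype(float)

theta = np.zeros(2)
for _ in range(10):                 # Newton's method converges in a few steps
    h = sigmoid(X @ theta)
    grad = X.T @ (y - h)            # gradient of the log likelihood
    S = h * (1 - h)
    H = -(X.T * S) @ X              # Hessian of the log likelihood
    theta = theta - np.linalg.solve(H, grad)

print(theta)
```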


C. Generalized Linear Models
Outline & Content
- Preliminary
- The exponential family
- Constructing GLMs

Preliminary
In the regression example, we had y | x; θ ~ N(μ, σ²). In the classification one, we had y | x; θ ~ Bernoulli(φ).

In this section, we will show that both of those methods are special cases of a broader family of models, called Generalized Linear Models (GLMs).

Other models in the GLM family can also be derived and applied to other classification and regression problems.

The exponential family
A class of distributions is in the exponential family if it can be written in the form
p(y; η) = b(y) exp( ηᵀ T(y) − a(η) )   (6)

η : the natural parameter (also called the canonical parameter)
T(y) : the sufficient statistic (often T(y) = y)
a(η) : the log partition function
e^(−a(η)) : essentially a normalization constant, making sure p(y; η) sums/integrates over y to 1 (one)
A fixed choice of T, a and b defines a family (or set) of distributions that is parameterized by η; as we vary η, we then get different distributions within this family.

The exponential family
The Bernoulli distribution with mean φ, written Bernoulli(φ), specifies a distribution over y ∈ {0, 1}, so that
p(y = 1; φ) = φ ;  p(y = 0; φ) = 1 − φ

There is a choice of T, a and b so that equation (6) becomes exactly the class of Bernoulli distributions.

The exponential family
We write the Bernoulli distribution as
p(y; φ) = φ^y (1 − φ)^(1−y)
        = exp( y log φ + (1 − y) log(1 − φ) )
        = exp( (log(φ / (1 − φ))) y + log(1 − φ) )

Thus, the natural parameter is given by η = log(φ / (1 − φ)). If we invert this definition by solving for φ in terms of η, we obtain φ = 1 / (1 + e^(−η)). To complete the formulation of the Bernoulli distribution as an exponential family distribution, we have:
T(y) = y
a(η) = −log(1 − φ) = log(1 + e^η)
b(y) = 1
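A tiny numeric check that this exponential-family form reproduces the Bernoulli pmf (φ chosen arbitrarily):

```python
import numpy as np

phi = 0.3
eta = np.log(phi / (1 - phi))      # natural parameter
a = np.log(1 + np.exp(eta))        # log partition function, a(eta) = -log(1 - phi)
for y in (0, 1):
    direct = phi**y * (1 - phi)**(1 - y)   # Bernoulli pmf
    exp_family = np.exp(eta * y - a)       # b(y) = 1, T(y) = y
    assert np.isclose(direct, exp_family)
```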

The exponential family
Gaussian distribution: recall that when deriving linear regression, the value of σ² had no effect on the final choice of θ, so we can set σ² = 1 and write
p(y; μ) = (1/√(2π)) exp( −(y − μ)²/2 ) = (1/√(2π)) exp( −y²/2 ) · exp( μy − μ²/2 )

Thus, the Gaussian is in the exponential family, with
η = μ
T(y) = y
a(η) = μ²/2 = η²/2
b(y) = (1/√(2π)) exp(−y²/2)

Constructing GLMs
To derive a GLM for predicting some random variable y as a function of x in a classification or regression problem, we make three assumptions about the conditional distribution of y given x and about our model:
1. y | x; θ ~ ExponentialFamily(η), i.e., given x and θ, the distribution of y follows some exponential family distribution with parameter η.
2. Given x, our goal is to predict the expected value of T(y). In most of our examples we will have T(y) = y, so this means we would like the prediction h(x) output by our learned hypothesis h to satisfy h(x) = E[y | x].
(Remember that this is satisfied in logistic regression, where we had h(x) = p(y = 1 | x; θ) = 0 · p(y = 0 | x; θ) + 1 · p(y = 1 | x; θ) = E[y | x; θ].)
3. The natural parameter η and the inputs x are related linearly: η = θᵀx (if η is vector-valued, ηᵢ = θᵢᵀx).

Constructing GLMs
Ordinary Least Squares
Consider the setting where the target variable y (also called the response variable in GLM terminology) is continuous, and we model the conditional distribution of y given x as a Gaussian N(μ, σ²).

We let the ExponentialFamily(η) distribution above be the Gaussian distribution. We had μ = η, so we have
h(x) = E[y | x; θ] = μ = η = θᵀx

Constructing GLMs
Logistic Regression
Given that y is binary-valued, it seems natural to choose the Bernoulli family of distributions to model the conditional distribution of y given x, for which we had φ = 1 / (1 + e^(−η)).

Furthermore, note that if y | x; θ ~ Bernoulli(φ), then E[y | x; θ] = φ.

So, following a similar derivation as the one for ordinary least squares, we get
h(x) = E[y | x; θ] = φ = 1 / (1 + e^(−η)) = 1 / (1 + e^(−θᵀx))

Constructing GLMs
Softmax Regression
For example: instead of classifying email into two classes (spam or not-spam), which is binary classification, we now classify it into three classes (spam, personal, and work-related mail). Thus we model it as a multinomial distribution over k possible outcomes.
For notational convenience, T(y) is no longer equal to y; here T(y) is a (k−1)-dimensional vector, rather than a real number.
We also use the indicator function 1{·}, e.g., 1{2 = 3} = 0, 1{3 = 5 − 2} = 1, where 0 = false, 1 = true.

Constructing GLMs
The multinomial is a member of the exponential family:
p(y; φ) = b(y) exp( ηᵀ T(y) − a(η) )
where
ηᵢ = log(φᵢ / φₖ)   (for i = 1, ..., k−1)
a(η) = −log(φₖ)
b(y) = 1

The link function is given (for i = 1, ..., k) by
ηᵢ = log(φᵢ / φₖ)

Constructing GLMs
For convenience, we have also defined ηₖ = log(φₖ / φₖ) = 0. To invert the link function and derive the response function, we therefore have that
e^(ηᵢ) = φᵢ / φₖ
φₖ e^(ηᵢ) = φᵢ   (7)
φₖ Σ_{i=1..k} e^(ηᵢ) = Σ_{i=1..k} φᵢ = 1
which implies φₖ = 1 / Σ_{i=1..k} e^(ηᵢ).

This can be substituted back into equation (7) to give the response function
φᵢ = e^(ηᵢ) / Σ_{j=1..k} e^(ηⱼ)

This function mapping from the η's to the φ's is called the softmax function.
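A minimal implementation of this response function (the max-subtraction is a standard numerical-stability trick and does not change the result):

```python
import numpy as np

def softmax(eta):
    """Map natural parameters eta_1..eta_k to probabilities phi_1..phi_k."""
    z = eta - np.max(eta)       # subtract max for numerical stability
    e = np.exp(z)
    return e / np.sum(e)        # phi_i = e^{eta_i} / sum_j e^{eta_j}

print(softmax(np.array([2.0, 1.0, 0.0])))  # sums to 1, largest weight on eta_1
```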

Constructing GLMs
To complete our model, we use:
1. assumption 3, given earlier, that the ηᵢ's are linearly related to the x's;
2. so we have ηᵢ = θᵢᵀx (for i = 1, ..., k−1), where θ₁, ..., θ_{k−1} are the parameters of our model;
3. we also define θₖ = 0, so that ηₖ = θₖᵀx = 0, as given previously.
Hence, the conditional distribution of y given x is
p(y = i | x; θ) = φᵢ = e^(θᵢᵀx) / Σ_{j=1..k} e^(θⱼᵀx)

Constructing GLMs
This model, which applies to classification problems where y ∈ {1, ..., k}, is called softmax regression. Our hypothesis will output
h(x) = E[T(y) | x; θ] = [ φ₁ ; φ₂ ; ... ; φ_{k−1} ]
i.e., the estimated probability p(y = i | x; θ) for each value i = 1, ..., k−1 (and p(y = k | x; θ) can be obtained as 1 − Σᵢ φᵢ).

Constructing GLMs
If we have a training set of m examples {(x(i), y(i)); i = 1, ..., m}, we would begin by writing down the log-likelihood
ℓ(θ) = Σ_{i=1..m} log p(y(i) | x(i); θ) = Σ_{i=1..m} log Π_{l=1..k} ( e^(θₗᵀx(i)) / Σ_{j=1..k} e^(θⱼᵀx(i)) )^{1{y(i)=l}}
and then obtain the maximum likelihood estimate of the parameters by maximizing ℓ(θ), for example with gradient ascent or Newton's method.
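A rough sketch of fitting softmax regression by gradient ascent on this log-likelihood, with θₖ fixed at 0 as above (toy three-class data; all names and settings are assumptions):

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

# Toy 3-class data with an intercept feature.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(300), rng.normal(size=(300, 2))])
scores = X[:, 1:] @ rng.normal(size=(2, 3)) + rng.normal(scale=0.3, size=(300, 3))
y = np.argmax(scores, axis=1)

k, n = 3, X.shape[1]
Theta = np.zeros((n, k - 1))    # theta_k is fixed at 0
alpha = 0.1
Y = np.eye(k)[y]                # one-hot labels, shape (m, k)

for _ in range(2000):
    logits = np.column_stack([X @ Theta, np.zeros(len(X))])  # append eta_k = 0
    Phi = softmax(logits)
    # Gradient of l(Theta) w.r.t. theta_l: sum_i (1{y(i)=l} - phi_l(x(i))) x(i)
    grad = X.T @ (Y[:, : k - 1] - Phi[:, : k - 1])
    Theta = Theta + alpha * grad / len(X)

print((Phi.argmax(axis=1) == y).mean())  # training accuracy, well above chance (1/3)
```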

Thank you.