Nonlinear representations - GitHub Pages

Nonlinear representations

Transcript of Nonlinear representations - GitHub Pages

Page 1: Nonlinear representations - GitHub Pages

Nonlinear representations

Page 2: Nonlinear representations - GitHub Pages

Comments

• Assignment 2 deadline extended to the end of Friday

• I do apologize for the late additions to the assignment. Any issues?

2

Page 3: Nonlinear representations - GitHub Pages

Representations for learning nonlinear functions

• Generalized linear models enabled many p(y | x) distributions
  • Still, however, we are learning a simple function for E[Y | x], i.e., f(⟨x, w⟩)

• Approach we discussed earlier: augment current features x using polynomials

• There are many strategies for augmenting x
  • fixed representations, like polynomials and wavelets

• learned representations, like neural networks and matrix factorization

3

Page 4: Nonlinear representations - GitHub Pages

What if classes are not linearly separable?

4

How to learn f(x) such that f(x) > 0 predicts + and f(x) < 0 predicts negative?

Decision boundary: x_1^2 + x_2^2 = 1

f(x) = x_1^2 + x_2^2 − 1

x_1 = x_2 = 0  ⟹  f(x) = −1 < 0

x_1 = 2, x_2 = −1  ⟹  f(x) = 4 + 1 − 1 = 4 > 0

Page 5: Nonlinear representations - GitHub Pages

What if classes are not linearly separable?

5

How to learn f(x) such that f(x) > 0 predicts + and f(x) < 0 predicts negative?

Decision boundary: x_1^2 + x_2^2 = 1,   f(x) = x_1^2 + x_2^2 − 1

φ(x) = [x_1^2, x_2^2, 1]^T,   f(x) = φ(x)^T w

If we use logistic regression, what is p(y = 1 | x)?
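A minimal Python sketch of this example, assuming the weights w = [1, 1, -1] so that f(x) = φ(x)^T w reproduces x_1^2 + x_2^2 − 1; with logistic regression the prediction would be p(y = 1 | x) = σ(φ(x)^T w):

```python
import numpy as np

def phi(x):
    """Quadratic feature map from the slide: phi(x) = [x1^2, x2^2, 1]."""
    x1, x2 = x
    return np.array([x1**2, x2**2, 1.0])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([1.0, 1.0, -1.0])   # reproduces f(x) = x1^2 + x2^2 - 1

for x in [(0.0, 0.0), (2.0, -1.0)]:
    f = phi(x) @ w               # linear in phi(x), nonlinear in x
    p = sigmoid(f)               # logistic regression: p(y=1|x) = sigma(phi(x)^T w)
    print(x, f, p, "+" if f > 0 else "-")
```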

Page 6: Nonlinear representations - GitHub Pages

What if classes are not linearly separable?

6

Decision boundary: x_1^2 + x_2^2 = 1,   f(x) = x_1^2 + x_2^2 − 1

φ(x) = [x_1^2, x_2^2, 1]^T,   f(x) = φ(x)^T w

Imagine we have learned w. How do we predict on a new x?

Page 7: Nonlinear representations - GitHub Pages

Nonlinear transformation

7

x → φ(x) = [φ_1(x), …, φ_p(x)]^T

e.g., x = [x_1, x_2],   φ(x) = [x_1, x_2, x_1^2, x_1 x_2, x_2^2, x_1^3, x_2^3]^T
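A minimal sketch of the feature map above; the set of monomials is exactly the one listed on the slide (a full cubic expansion would contain a few more terms):

```python
import numpy as np

def poly_features(x):
    """Feature map from the slide: phi(x) = [x1, x2, x1^2, x1*x2, x2^2, x1^3, x2^3]."""
    x1, x2 = x
    return np.array([x1, x2, x1**2, x1 * x2, x2**2, x1**3, x2**3])

print(poly_features((2.0, -1.0)))   # [ 2. -1.  4. -2.  1.  8. -1.]
```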

Page 8: Nonlinear representations - GitHub Pages

Gaussian kernel / Gaussian radial basis function

8

k(x, x') = exp( −‖x − x'‖_2^2 / σ^2 )

x = [x_1, x_2]^T  →  [k(x, x_1), …, k(x, x_k)]^T

[Figure: a query point x and three centres x_1, x_2, x_3; the kernel similarities to the centres (values such as 0.1, 0.8, 0.5 in the picture) become the new feature vector for x.]

f(x) = Σ_{i=1}^{k} w_i k(x, x_i)
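A minimal sketch of this representation; the centres, weights, and kernel width σ = 1 are hypothetical values chosen only for illustration:

```python
import numpy as np

def gaussian_kernel(x, c, sigma=1.0):
    """Gaussian RBF from the slide: k(x, c) = exp(-||x - c||_2^2 / sigma^2)."""
    return np.exp(-np.sum((x - c) ** 2) / sigma**2)

# Hypothetical centres and weights, only to illustrate the shapes involved.
centres = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, -1.0]])
w = np.array([0.5, -1.0, 2.0])

x = np.array([0.5, 0.5])
features = np.array([gaussian_kernel(x, c) for c in centres])  # the new representation of x
print(features, features @ w)                                  # f(x) = sum_i w_i k(x, x_i)
```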

Page 9: Nonlinear representations - GitHub Pages

Gaussian kernel / Gaussian radial basis function

9

[Figure: left, the kernel k(x, x'); right, a possible function f built from several centres.]

k(x, x') = exp( −‖x − x'‖_2^2 / σ^2 )

f(x) = Σ_{i=1}^{k} w_i k(x, x_i)

Page 10: Nonlinear representations - GitHub Pages

Selecting centers

• Many different strategies to decide on centers
  • many ML algorithms use kernels, e.g., SVMs and Gaussian process regression

• For kernel representations, typical strategy is to select training data as centers

• Clustering techniques to find centers

• A grid of values to best (exhaustively) cover the space

• Many other strategies, e.g., using information gain, coherence criterion, informative vector machine

10

Page 11: Nonlinear representations - GitHub Pages

Covering space uniformly with centres

• Imagine we have a 1-d space, with range [-10, 10]

• How would we pick p centers uniformly?

• What if we have a 5-d space, with each dimension in range [-1, 1]?
  • To cover the entire 5-dimensional space, we need to consider all possible combinations
  • Split each dimension into m values; the total number of centres is then m^5
  • i.e., for each value of x1, we can try all m values for each of x2, …, x5 (see the sketch below)
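A minimal sketch of the exhaustive grid just described, with m = 5 values per dimension in a 5-d space (both numbers are only illustrative):

```python
import itertools
import numpy as np

# Exhaustively cover a d-dimensional space with a uniform grid of centres:
# m values per dimension gives m**d centres in total.
m, d = 5, 5
values = np.linspace(-1.0, 1.0, m)                       # m evenly spaced values in [-1, 1]
centres = np.array(list(itertools.product(values, repeat=d)))
print(centres.shape)                                     # (3125, 5), i.e. m**d = 5**5
```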

11

Page 12: Nonlinear representations - GitHub Pages

Why select training data as centers?

• Observed data indicates the part of the space that we actually need to model
  • can be much more compact than exhaustively covering the space
  • imagine you only see narrow trajectories in the world; even if the data is d-dimensional, the data you encounter may lie on a much lower-dimensional manifold (space)

• Any issues with using all training data for centres?
  • How can we subselect centres from training data?

12

Page 13: Nonlinear representations - GitHub Pages

How would we use clustering to select centres?

• Clustering takes the data and finds p groups

• What distance measure should we use?
  • If k(x, c) is between 0 and 1 and k(x, x) = 1 for all x, then 1 − k(x, c) gives a distance
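A minimal sketch of using clustering to pick centres, here with ordinary k-means (Euclidean distance) from scikit-learn rather than the kernel-based distance 1 − k(x, c) mentioned above; the data, p = 10, and σ = 1 are hypothetical:

```python
import numpy as np
from sklearn.cluster import KMeans

# Pick p centres by clustering the training inputs, then use Gaussian kernel
# similarities to those centres as the new features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))      # hypothetical training inputs

p = 10
centres = KMeans(n_clusters=p, n_init=10, random_state=0).fit(X).cluster_centers_

def rbf_features(X, centres, sigma=1.0):
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / sigma**2)  # shape (n, p): one similarity per centre

print(rbf_features(X, centres).shape)   # (200, 10)
```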

13

Page 14: Nonlinear representations - GitHub Pages

What if we select centres randomly?

• Are there any issues with the linear regression with a kernel transformation, if we select centres randomly from the data?

• If so, any suggestions to remedy the problem?

14

Σ_{i=1}^{n} ( Σ_{j=1}^{p} k(x_i, z_j) w_j − y_i )^2 + λ ‖w‖_1

Page 15: Nonlinear representations - GitHub Pages

Exercise

• What would it mean to use an ℓ1 regularizer with a kernel representation?
  • Recall that ℓ1 prefers to zero out elements in w

15

Σ_{i=1}^{n} ( Σ_{j=1}^{p} k(x_i, z_j) w_j − y_i )^2 + λ ‖w‖_1
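A minimal sketch of the objective above, assuming Gaussian kernel features with randomly chosen centres and scikit-learn's Lasso for the ℓ1-penalized least squares; the data, the 50 centres, and alpha = 0.01 are hypothetical choices:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Least squares on Gaussian kernel features with an l1 penalty, which drives
# many of the centre weights w_j to exactly zero.
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)            # hypothetical 1-d data

idx = rng.choice(len(X), size=50, replace=False)
centres = X[idx]                                            # centres picked randomly from the data

def rbf_features(X, centres, sigma=1.0):
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / sigma**2)

model = Lasso(alpha=0.01).fit(rbf_features(X, centres), y)  # alpha plays the role of lambda
print((model.coef_ != 0).sum(), "of", len(centres), "centre weights are nonzero")
```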

Page 16: Nonlinear representations - GitHub Pages

Exercise: How do we decide on the nonlinear transformation?

• We can pick a 5-th order polynomial or 6-th order, or… Which should we pick?

• We can pick p centres. How many should we pick?

• How can we avoid overparameterizing or underparameterizing?

16

Page 17: Nonlinear representations - GitHub Pages

Other similarity transforms

• Linear kernel:

• Laplace kernel (Laplace distribution instead of Gaussian)

• Binning transformation

17

k(x, c) = x^T c

k(x, c) = exp(−b ‖x − c‖_1)

s(x, c) = { 1 if x in a box around c; 0 else }
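A minimal sketch of the three similarities above; the box width used in the binning transformation is a hypothetical choice:

```python
import numpy as np

def linear_kernel(x, c):
    return x @ c                                   # k(x, c) = x^T c

def laplace_kernel(x, c, b=1.0):
    return np.exp(-b * np.abs(x - c).sum())        # k(x, c) = exp(-b ||x - c||_1)

def binning_similarity(x, c, width=0.5):
    return float(np.all(np.abs(x - c) <= width))   # 1 if x falls in a box around c, else 0

x, c = np.array([0.2, 0.4]), np.array([0.0, 0.5])
print(linear_kernel(x, c), laplace_kernel(x, c), binning_similarity(x, c))
```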

Page 18: Nonlinear representations - GitHub Pages

Dealing with non-numeric data

• What if we have categorical features?
  • e.g., feature is the blood type of a person

• Even worse, what if we have strings describing the object?
  • e.g., feature is occupation, like “retail”

18

Page 19: Nonlinear representations - GitHub Pages

Some options

• Convert categorical feature into integers
  • e.g., {A, B, AB, O} → {1, 2, 3, 4}

• Any issues?

• Convert categorical feature into indicator vector
  • e.g., A → [1 0 0 0], B → [0 1 0 0], … (see the sketch below)

• Any issues?
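A minimal sketch of the two options for the blood-type example:

```python
import numpy as np

categories = ["A", "B", "AB", "O"]

def to_integer(value):
    # {A, B, AB, O} -> {1, 2, 3, 4}: imposes an arbitrary ordering on blood types
    return categories.index(value) + 1

def to_indicator(value):
    # A -> [1 0 0 0], B -> [0 1 0 0], ...: no ordering, one feature per category
    vec = np.zeros(len(categories))
    vec[categories.index(value)] = 1.0
    return vec

print(to_integer("AB"), to_indicator("AB"))   # 3 [0. 0. 1. 0.]
```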

19

Page 20: Nonlinear representations - GitHub Pages

Using kernels for categorical or non-numeric data

• An alternative is to use kernels (or similarity transforms)

• If you know something about your data/domain, you might have a good similarity measure for non-numeric data

• Some more generic kernel options as well
  • Matching kernel: similarity between two items is the proportion of features that are equal

20

Page 21: Nonlinear representations - GitHub Pages

Example: matching kernel

Census dataset: Predict hours worked per week

x = (age, gender, income, education), where
  age ∈ {15-24, 25-34, …, 65+}
  gender ∈ {F, M}
  income ∈ {Low, Medium, High}
  education ∈ {Bachelors, Trade-Sch, High-Sch, …}

21

Page 22: Nonlinear representations - GitHub Pages

Example: Matching similarity for categorical data

22

x = (age, gender, income, education)

k(x_1, x_2) = k( (25-34, F, Medium, Trade-Sch), (35-44, F, Medium, Bachelors) ) = 0.5

(gender and income match, age and education do not, so 2 of the 4 features agree)
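A minimal sketch of the matching kernel for this example:

```python
def matching_kernel(x1, x2):
    """Similarity = proportion of features that are equal."""
    assert len(x1) == len(x2)
    return sum(a == b for a, b in zip(x1, x2)) / len(x1)

x1 = ("25-34", "F", "Medium", "Trade-Sch")
x2 = ("35-44", "F", "Medium", "Bachelors")
print(matching_kernel(x1, x2))   # 0.5: gender and income match, age and education do not
```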

Page 23: Nonlinear representations - GitHub Pages

Representational properties of transformations

• Approximation properties: which transformations can approximate “any function”?

• Radial basis functions (a huge number of them)

• Polynomials and the Taylor series

• Fourier basis and wavelets

23

Page 24: Nonlinear representations - GitHub Pages

Distinction with the kernel trick

• When is the similarity actually called a “kernel”?

• Nice property of kernels: can always be written as a dot product in some feature space

• In some cases, they are used to compute inner products efficiently, assuming one is actually learning with the feature expansion
  • This is called the kernel trick

• Implicitly learning with the feature expansion
  • not learning with the expansion that is similarities to centres

24

k(x, c) = φ(x)^T φ(c)   for some feature expansion φ(x)

Page 25: Nonlinear representations - GitHub Pages

Example: polynomial kernel

25

φ(x) = [x_1^2, √2 x_1 x_2, x_2^2]^T

k(x, x') = ⟨φ(x), φ(x')⟩ = ⟨x, x'⟩^2

In general, for order-d polynomials, k(x, x') = ⟨x, x'⟩^d
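A small numeric check of this identity, using arbitrary vectors x and x':

```python
import numpy as np

# phi(x) = [x1^2, sqrt(2) x1 x2, x2^2] satisfies <phi(x), phi(x')> = <x, x'>^2,
# so the kernel can be computed without ever forming phi explicitly.
def phi(x):
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(x) @ phi(xp))   # explicit feature space; prints 1.0 (up to floating point)
print((x @ xp) ** 2)      # kernel trick, no expansion needed; also 1.0
```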

Page 26: Nonlinear representations - GitHub Pages

Gaussian kernel

26

φ(x) = exp(−γ x^2) [ 1, √(2γ/1!) x, √((2γ)^2/2!) x^2, … ]^T

Infinite polynomial representation

k(x, x') = exp( −‖x − x'‖_2^2 / σ^2 )

Page 27: Nonlinear representations - GitHub Pages

Regression with new features

27

min_w Σ_{i=1}^{n} (φ(x_i)^T w − y_i)^2 = min_w Σ_{i=1}^{n} ( Σ_{j=1}^{p} φ_j(x_i) w_j − y_i )^2

min_w Σ_{i=1}^{n} (φ(x_i)^T w − y_i)^2 = min_a Σ_{i=1}^{n} ( Σ_{j=1}^{n} ⟨φ(x_i), φ(x_j)⟩ a_j − y_i )^2

What if p is really big?

If we can compute the dot product efficiently, then we can solve this regression problem efficiently
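A minimal sketch of the right-hand (dual) problem above: with a Gaussian kernel the feature expansion is effectively infinite-dimensional, yet the regression can be fit using only the n × n Gram matrix of dot products. The data are hypothetical, and a small ridge term is added for numerical stability (not part of the objective on the slide):

```python
import numpy as np

# Learn one coefficient a_i per training point, using only the Gram matrix
# K_ij = <phi(x_i), phi(x_j)> = k(x_i, x_j).
rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)

def gram(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / sigma**2)            # Gaussian kernel; phi is never formed explicitly

K = gram(X, X)                               # n x n Gram matrix
a = np.linalg.solve(K + 1e-6 * np.eye(len(X)), y)   # small ridge for numerical stability

X_new = np.array([[0.5]])
print(gram(X_new, X) @ a, np.sin(0.5))       # f(x) = sum_j a_j k(x, x_j)
```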

Page 28: Nonlinear representations - GitHub Pages

What about learning the representation?

• We have talked about fixed nonlinear transformations
  • polynomials
  • kernels

• How do we introduce learning?
  • could learn the centers, for example
  • learning is quite constrained, since we can only pick centres and widths

• Neural networks learn this transformation more from scratch

28

Page 29: Nonlinear representations - GitHub Pages

Fixed representation vs. NN

29

[Figure: left panel, “GLM with augmented fixed representation”: the inputs x1…x4 pass through a fixed transform φ(x) (no learning here), and only the weights from φ(x) to the output are learned (Figure 7.2: Generalized linear model, such as logistic regression). Right panel, “Two-layer neural network”: input, hidden, and output layers, where both the weights into the hidden representation φ(x) and the weights to the output are learned (Figure 7.3: Standard neural network).]

Page 30: Nonlinear representations - GitHub Pages

Explicit steps in visualizations

30

[Figure: explicit view of the jth hidden node: the d-dimensional input is combined with the node’s weight vector θ_j and passed through a transfer function, e.g., sigmoid, tanh, or ReLU.]

Page 31: Nonlinear representations - GitHub Pages

Example: logistic regression versus neural network

• Both try to predict p(y = 1 | x)

• Logistic regression learns W such that p(y = 1 | x) = σ(xW)

• Neural network learns W^(1) and W^(2) such that p(y = 1 | x) = σ(σ(x W^(2)) W^(1))

31

From the course notes (shown on the slide):

… and y ∈ {0, 1}. If y ∈ ℝ, we use linear regression for this last layer and so learn weights w^(2) ∈ ℝ^2 such that h w^(2) approximates the true output y. If y ∈ {0, 1}, we use logistic regression for this last layer and so learn weights w^(2) ∈ ℝ^2 such that σ(h w^(2)) approximates the true output y.

Now we consider the more general case with any d, k1, m. To provide some intuition for this more general setting, we will begin with one hidden layer, for the sigmoid transfer function and cross-entropy output loss. For logistic regression we estimated W ∈ ℝ^{d×m}, with f(xW) = σ(xW) ≈ y. We will predict an output vector y ∈ ℝ^m, because it will make later generalizations more clear-cut and make notation for the weights in each layer more uniform. When we add a hidden layer, we have two parameter matrices W^(2) ∈ ℝ^{d×k1} and W^(1) ∈ ℝ^{k1×m}, where k1 is the dimension of the hidden layer

h = σ(x W^(2)) = [σ(x W^(2)_{:1}), σ(x W^(2)_{:2}), …, σ(x W^(2)_{:k1})]^T ∈ ℝ^{k1}

where the sigmoid function is applied to each entry in x W^(2) and h W^(1). This hidden layer is the new set of features and again you will do the regular logistic regression optimization to learn weights on h:

p(y = 1 | x) = σ(h W^(1)) = σ(σ(x W^(2)) W^(1)).

With the probabilistic model and parameters specified, we now need to derive an algorithm to obtain those parameters. As before, we take a maximum likelihood approach and derive gradient descent updates. This composition of transfers seems to complicate matters, but we can still take the gradient w.r.t. our parameters. We simply have more parameters now: W^(2) ∈ ℝ^{k1×d}, W^(1) ∈ ℝ^{1×k1}. Once we have the gradient w.r.t. each parameter matrix, we simply take a step in the direction of the negative of the gradient, as usual. The gradients for these parameters share information; for computational efficiency, the gradient is computed first for W^(1), and duplicate gradient information is sent back to compute the gradient for W^(2). This algorithm is typically called back propagation, which we describe next.

In general, we can compute the gradient for any number of hidden layers. Denote each differentiable transfer function f_1, …, f_H, ordered with f_1 as the output transfer, and k_1, …, k_{H−1} as the hidden dimensions with H − 1 hidden layers. Then the output from the neural network is

f_1( f_2( … f_{H−1}( f_H( x W^(H) ) W^(H−1) ) … ) W^(1) )

where W^(1) ∈ ℝ^{k1×m}, W^(2) ∈ ℝ^{k2×k1}, …, W^(H) ∈ ℝ^{d×k_{H−1}}.

Backpropagation algorithm

We will start by deriving back propagation for two layers; the extension to multiple layers will be more clear given this derivation. Due to the size of the network, we will often learn …


σ(σ(x W^(2)) W^(1)) = p(y = 1 | x)
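A minimal sketch of the two-layer forward pass p(y = 1 | x) = σ(σ(x W^(2)) W^(1)), with hypothetical dimensions and random (untrained) weights, just to show the shapes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
d, k1, m = 4, 3, 1                # hypothetical dimensions
W2 = rng.normal(size=(d, k1))     # input -> hidden weights (would be learned by backprop)
W1 = rng.normal(size=(k1, m))     # hidden -> output weights

x = rng.normal(size=(1, d))       # one input row vector
h = sigmoid(x @ W2)               # hidden representation: the learned features
p = sigmoid(h @ W1)               # p(y = 1 | x) = sigma(sigma(x W2) W1)
print(h, p)
```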

Page 32: Nonlinear representations - GitHub Pages

No representation learning vs. neural network

32

[Figure: left panel, “GLM (e.g. logistic regression)”: input layer x1…x4 connected directly to the output layer (Figure 7.2: Generalized linear model, such as logistic regression). Right panel, “Two-layer neural network”: input layer, hidden layer, and output layer (Figure 7.3: Standard neural network).]

Page 33: Nonlinear representations - GitHub Pages

What are the representational capabilities of neural nets?

• Single hidden-layer neural networks with sigmoid transfer can represent any continuous function on a bounded space within epsilon accuracy, for a large enough number of hidden nodes
  • see Cybenko, 1989: “Approximation by Superpositions of a Sigmoidal Function”

33

[Figure: the two-layer network again, annotated with its weight matrices W^(2) (input to hidden) and W^(1) (hidden to output).]

h^(2) = f_2(x W^(2))

y = f_1(h^(2) W^(1))