Nonlinear representations - GitHub Pages
Transcript of Nonlinear representations - GitHub Pages
Nonlinear representations
Comments
• Assignment 2 deadline extended to the end of Friday
• I do apologize for the late additions to the assignment. Any issues?
2
Representations for learning nonlinear functions
• Generalized linear models enabled many p(y | x) distributions• Still however learning a simple function for E[Y | x], i.e., f(<x,w>)
• Approach we discussed earlier: augment current features x using polynomials
• There are many strategies to augmenting x• fixed representations, like polynomials, wavelets
• learned representations, like neural networks and matrix factorization
3
What if classes are not linearly separable?
4
How to learn f(x) such that f(x) > 0 predicts + and f(x) < 0 predicts negative?
x
21 + x
22 = 1
x1 = x2 = 0
=) f(x) = �1 < 0
x1 = 2, x2 = �1
=) f(x) = 4 + 1� 1 = 4 > 0
f(x) = x
21 + x
22 � 1
What if classes are not linearly separable?
5
How to learn f(x) such that f(x) > 0 predicts + and f(x) < 0 predicts negative?
x
21 + x
22 = 1 f(x) = x
21 + x
22 � 1
�(x) =
2
4x
21
x
22
1
3
5 f(x) = �(x)>w
If use logistic regression, what is p(y=1 | x)?
What if classes are not linearly separable?
6
x
21 + x
22 = 1 f(x) = x
21 + x
22 � 1
�(x) =
2
4x
21
x
22
1
3
5 f(x) = �(x)>w
Imagine learned w. How do we predict on a new x?
Nonlinear transformation
7
x ! �(x) =
0
@�1(x). . .
�p(x)
1
A
e.g., x = [x1, x2], �(x) =
0
BBBBBBBB@
x1
x2
x
21
x1x2
x
22
x
31
x
32
1
CCCCCCCCA
Gaussian kernel / Gaussian radial basis function
8
k(x,x0) = exp
✓�kx� x
0k22�2
◆
x =
x1
x2
�!
2
64k(x,x1)
...k(x,xk)
3
75
x → [K (x , x1) K (x , x
2) K (x , x
3)] '
= [5 2 3] '
x
x2
x3 x
1
5
3
2
0.1
0.8
0.5
.1 .8 .5
p
f(x) =kX
i=1
wik(x,xi)
p
Gaussian kernel / Gaussian radial basis function
9
Possible function f with several centersKernel
k(x,x0) = exp
✓�kx� x
0k22�2
◆f(x) =
kX
i=1
wik(x,xi)
p
Selecting centers
• Many different strategies to decide on centers• many ML algorithms use kernels e.g., SVMs, Gaussian process regression
• For kernel representations, typical strategy is to select training data as centers
• Clustering techniques to find centers
• A grid of values to best (exhaustively) cover the space
• Many other strategies, e.g., using information gain, coherence criterion, informative vector machine
10
Covering space uniformly with centres
• Imagine has 1-d space, from range [-10, 10]
• How would we pick p centers uniformly?
• What if we have a 5-d space, in ranges [-1,1]?• To cover entire 5-dimensional, need to consider all possible options
• Split up 1-d into m values, then total number of centres is m^5
• i.e., for first value of x1, can try all other m values for x2, …, x5
11
Why select training data as centers?
• Observed data indicates the part of the space that we actually need to model• can be much more compact than exhaustively covering the space
• imagine only see narrow trajectories in world, even if data is d-dimensional, data you encounter may lie on a much lower-dimensional manifold (space)
• Any issues with using all training data for centres?• How can we subselect centres from training data?
12
How would we use clustering to select centres?
• Clustering is taking data and finding p groups
• What distance measure should we use?• If k(x,c) between 0 and 1 and k(x,x) = 1 for all x, then 1-k(x,c) gives a
distance
13
What if we select centres randomly?
• Are there any issues with the linear regression with a kernel transformation, if we select centres randomly from the data?
• If so, any suggestions to remedy the problem?
14
nX
i=1
0
@pX
j=1
k(x, zj)wj � yi
1
A2
+ �kwk1
Exercise
• What would it mean to use an l1 regularizer with a kernel representation?• Recall that l1 prefers to zero elements in w
15
nX
i=1
0
@pX
j=1
k(x, zj)wj � yi
1
A2
+ �kwk1
Exercise: How do we decide on the nonlinear transformation?
• We can pick a 5-th order polynomial or 6-th order, or… Which should we pick?
• We can pick p centres. How many should we pick?
• How can we avoid overparametrizing or underparameterizing?
16
Other similarity transforms
• Linear kernel:
• Laplace kernel (Laplace distribution instead of Gaussian)
• Binning transformation
17
k(x, c) = x
>c
k(x, c) = exp(�bkx� ck1)
s(x, c) =
⇢1 if x in box around c
0 else
Dealing with non-numeric data
• What if we have categorical features?• e.g., feature is the blood type of a person
• Even worse, what if we have strings describing the object?• e.g., feature is occupation, like “retail”
18
Some options
• Convert categorical feature into integers• e.g., {A, B, AB, O} —> {1, 2, 3, 4}
• Any issues?
• Convert categorical feature into indicator vector• e.g., A —> [1 0 0 0], B —> [0 1 0 0], …
• Any issues?
19
Using kernels for categorical or non-numeric data
• An alternative is to use kernels (or similarity transforms)
• If you know something about your data/domain, might have a good similarity measure for non-numeric data
• Some more generic kernel options as well• Matching kernel: similarity between two items is the proportion of
features that are equal
20
age
gender
income
education
x =
{15-24, 25-34, …, 65+}
{F, M}
{Low, Medium, High}
{Bachelors, Trade-Sch, High-Sch, …}
Census dataset: Predict hours worked per week
Example: matching kernel
21
Example: Matching similarity for categorical data
22
age
gender
income
education
x =k(x1,x2) = k
24-34
F
Medium
Trade-Sch
35-44
F
Medium
Bachelors
= 0.5
Representational properties of transformations
• Approximation properties: which transformations can approximate “any function”?
• Radial basis functions (a huge number of them)
• Polynomials and the Taylor series
• Fourier basis and wavelets
23
Distinction with the kernel trick• When is the similarity actually called a “kernel”?
• Nice property of kernels: can always be written as a dot product in some feature space
• In some cases, they are used to compute inner products efficiently, assuming one is actually learning with the feature expansion • This is called the kernel trick
• Implicitly learning with feature expansion • not learning with expansion that is similarities to centres
24
k(x, c) = (x)> (c)
(x)
Example: polynomial kernel
25
�(x) =
2
4x
21p
2x1x2
x
22
3
5
k(x,x0) = h�(x),�(x0
)i = hx,x0i2
In general, for order d polynomials, k(x,x0) = hx,x0id
Gaussian kernel
26
�(x) = exp(��x
2)
2
666664
1q2�1! xq
(2�)2
2! x
2
.
.
.
3
777775
Infinite polynomial representation
k(x,x0) = exp
✓�kx� x
0k22�2
◆
Regression with new features
27
minw
nX
i=1
(�(xi)>w � yi)
2 = minw
nX
i=1
0
@
0
@pX
j=1
�j(xi)wj
1
A� yi
1
A2
minw
nX
i=1
(�(xi)>w � yi)
2 = mina
nX
i=1
0
@
0
@nX
j=1
h�(xi),�(xj)iaj
1
A� yi
1
A2
What if p is really big?
If can compute dot product efficiently, then can solve this regression problem efficiently
What about learning the representation?
• We have talked about fixed nonlinear transformations• polynomials
• kernels
• How do we introduce learning?• could learn centers, for example
• learning is quite constrained, since can only pick centres and widths
• Neural networks learn this transformation more from scratch
28
Fixed representation vs. NN
29
x1
x2
x3
x4
y
y
Inputlayer
Outputlayer
Figure 7.2: Generalized linear model, such as logistic regression.
x1
x2
x3
x4
y
y
Hiddenlayer
Inputlayer
Outputlayer
Figure 7.3: Standard neural network.
106
GLM with augmented fix representation Two-layer neural network
x1
x2
x3
x4
y
y
Inputlayer
Outputlayer
Figure 7.2: Generalized linear model, such as logistic regression.
x1
x2
x3
x4
y
y
Hiddenlayer
Inputlayer
Outputlayer
Figure 7.3: Standard neural network.
106
Fixed transform
No learning
here
Learn weights
here
Learn weights
here
Learn weights
here�(x) �(x)
Explicit steps in visualizations
30
✓j
e.g., sigmoid tanh ReLUd d
jth hidden node
Example: logistic regression versus neural network
• Both try to predict p(y = 1 | x)
• Logistic regression learns W such that
• Neural network learns W1 and W2 such that
31
and y 2 {0, 1}. If y 2 R, we use linear regression for this last layer and so learn weightsw(2) 2 R2 such that hw(2) approximates the true output y. If y 2 {0, 1}, we use logisticregression for this last layer and so learn weights w(2) 2 R2 such that �(hw(2)
) approximatesthe true output y. ⇤
Now we consider the more general case with any d, k1
,m. To provide some intuitionfor this more general setting, we will begin with one hidden layer, for the sigmoid transferfunction and cross-entropy output loss. For logistic regression we estimated W 2 Rd⇥m,with f(xW) = �(xW) ⇡ y. We will predict an output vector y 2 Rm, because it will makelater generalizations more clear-cut and make notation for the weights in each layer moreuniform. When we add a hidden layer, we have two parameter matrices W(2) 2 Rd⇥k1 andW(1) 2 Rk1⇥m, where k
1
is the dimension of the hidden layer
h = �(W(2)x) =
2
6
6
6
6
4
�(xW(2)
:1
)
�(xW(2)
:2
)
...�(xW(2)
:k1)
3
7
7
7
7
5
2 Rk1
where the sigmoid function is applied to each entry in xW(2) and hW(1). This hidden layeris the new set of features and again you will do the regular logistic regression optimizationto learn weights on h:
p(y = 1|x) = �(hW(1)
) = �(�(xW(2)
)W(1)
).
With the probabilistic model and parameter specified, we now need to derive an algo-rithm to obtain those parameters. As before, we take a maximum likelihood approach andderive gradient descent updates. This composition of transfers seems to complicate matters,but we can still take the gradient w.r.t. our parameters. We simply have more parametersnow: W(2) 2 Rk1⇥d,W(1) 2 R1⇥k1 . Once we have the gradient w.r.t. each parameter ma-trix, we simply take a step in the direction of the negative of the gradient, as usual. Thegradients for these parameters share information; for computational efficiency, the gradientis computed first for W(1), and duplicate gradient information sent back to compute thegradient for W(2). This algorithm is typically called back propagation, which we describenext.
In general, we can compute the gradient for any number of hidden layers. Denoteeach differentiable transfer function f
1
, . . . , fH , ordered with f1
as the output transfer, andk1
, . . . , kH�1
as the hidden dimensions with H � 1 hidden layers. Then the output from theneural network is
f1
⇣
f2
⇣
. . . fH�1
⇣
fH⇣
xW(H)
⌘
W(H�1)
⌘
. . .⌘
W(1)
⌘
where W(1) 2 Rk1⇥m, W(2) 2 Rk2⇥k1 , . . . ,W(H) 2 Rd⇥kH�1 .
Backpropagation algorithm
We will start by deriving back propagation for two layers; the extension to multiple layerswill be more clear given this derivation. Due to the size of the network, we will often learn
85
and y 2 {0, 1}. If y 2 R, we use linear regression for this last layer and so learn weightsw(2) 2 R2 such that hw(2) approximates the true output y. If y 2 {0, 1}, we use logisticregression for this last layer and so learn weights w(2) 2 R2 such that �(hw(2)
) approximatesthe true output y. ⇤
Now we consider the more general case with any d, k1
,m. To provide some intuitionfor this more general setting, we will begin with one hidden layer, for the sigmoid transferfunction and cross-entropy output loss. For logistic regression we estimated W 2 Rd⇥m,with f(xW) = �(xW) ⇡ y. We will predict an output vector y 2 Rm, because it will makelater generalizations more clear-cut and make notation for the weights in each layer moreuniform. When we add a hidden layer, we have two parameter matrices W(2) 2 Rd⇥k1 andW(1) 2 Rk1⇥m, where k
1
is the dimension of the hidden layer
h = �(W(2)x) =
2
6
6
6
6
4
�(xW(2)
:1
)
�(xW(2)
:2
)
...�(xW(2)
:k1)
3
7
7
7
7
5
2 Rk1
where the sigmoid function is applied to each entry in xW(2) and hW(1). This hidden layeris the new set of features and again you will do the regular logistic regression optimizationto learn weights on h:
p(y = 1|x) = �(hW(1)
) = �(�(xW(2)
)W(1)
).
With the probabilistic model and parameter specified, we now need to derive an algo-rithm to obtain those parameters. As before, we take a maximum likelihood approach andderive gradient descent updates. This composition of transfers seems to complicate matters,but we can still take the gradient w.r.t. our parameters. We simply have more parametersnow: W(2) 2 Rk1⇥d,W(1) 2 R1⇥k1 . Once we have the gradient w.r.t. each parameter ma-trix, we simply take a step in the direction of the negative of the gradient, as usual. Thegradients for these parameters share information; for computational efficiency, the gradientis computed first for W(1), and duplicate gradient information sent back to compute thegradient for W(2). This algorithm is typically called back propagation, which we describenext.
In general, we can compute the gradient for any number of hidden layers. Denoteeach differentiable transfer function f
1
, . . . , fH , ordered with f1
as the output transfer, andk1
, . . . , kH�1
as the hidden dimensions with H � 1 hidden layers. Then the output from theneural network is
f1
⇣
f2
⇣
. . . fH�1
⇣
fH⇣
xW(H)
⌘
W(H�1)
⌘
. . .⌘
W(1)
⌘
where W(1) 2 Rk1⇥m, W(2) 2 Rk2⇥k1 , . . . ,W(H) 2 Rd⇥kH�1 .
Backpropagation algorithm
We will start by deriving back propagation for two layers; the extension to multiple layerswill be more clear given this derivation. Due to the size of the network, we will often learn
85
= p(y = 1|x)
No representation learning vs. neural network
32
x1
x2
x3
x4
y
y
Inputlayer
Outputlayer
Figure 7.2: Generalized linear model, such as logistic regression.
x1
x2
x3
x4
y
y
Hiddenlayer
Inputlayer
Outputlayer
Figure 7.3: Standard neural network.
106
x1
x2
x3
x4
y
y
Inputlayer
Outputlayer
Figure 7.2: Generalized linear model, such as logistic regression.
x1
x2
x3
x4
y
y
Hiddenlayer
Inputlayer
Outputlayer
Figure 7.3: Standard neural network.
106
GLM (e.g. logistic regression) Two-layer neural network
What are the representational capabilities of neural nets?
• Single hidden-layer neural networks with sigmoid transfer can represent any continuous function on a bounded space within epsilon accuracy, for a large enough number of hidden nodes• see Cybenko, 1989: “Approximation by Superpositions of a Sigmoidal
Function”
33
x1
x2
x3
x4
y
y
Inputlayer
Outputlayer
Figure 7.2: Generalized linear model, such as logistic regression.
x1
x2
x3
x4
y
y
Hiddenlayer
Inputlayer
Outputlayer
Figure 7.3: Standard neural network.
106
W(1)
W(2)
h
(2) = f2(xW(2))
y = f1(h(1)
W
(1))