Machine Learning for NLP
Lecture 2: Basic linear algebra and optimization
University of Gothenburg
Richard Johansson
September 4, 2015
math in machine learning
- machine learning is a "mathy" subject...
- the most important branches of mathematics used in ML:
  - probability and statistical theory
  - linear algebra
  - optimization
- in this lecture, we'll have a look at the latter two
overview
basic linear algebra and its implementation in Python
basic optimization
recap: mapping features to numerical vectors
- we convert symbolic features to numbers when we use scikit-learn's classifiers:
vec = DictVectorizer()
Xe = vec.fit_transform(X)
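
for example, a minimal self-contained sketch (the feature dicts here are invented for illustration):

from sklearn.feature_extraction import DictVectorizer

# a tiny invented training set: one attribute-value dict per instance
X = [{'word': 'prices', 'pos': 'NNS'},
     {'word': 'fall', 'pos': 'VB'}]

vec = DictVectorizer()
Xe = vec.fit_transform(X)    # a sparse matrix: one row per instance
print(Xe.toarray())          # dense view of the numerical vectors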
types of vectorizers
- a DictVectorizer converts from attribute-value dicts
- a CountVectorizer converts from texts (after applying a tokenizer) or lists
- a TfidfVectorizer is like a CountVectorizer, but also uses TF*IDF (see the sketch below)
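
a small sketch of the difference, with invented example texts:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ['prices fall', 'prices rise and prices fall']   # invented examples

cv = CountVectorizer()
print(cv.fit_transform(docs).toarray())   # raw term counts per document

tv = TfidfVectorizer()
print(tv.fit_transform(docs).toarray())   # the same counts, TF*IDF-weighted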
vectors
- a tuple consisting of n numbers is called a vector
- the set of all possible tuples of length n is called an n-dimensional vector space
- for instance: (1, 2) is a 2-dimensional vector
- vectors can be interpreted geometrically, either as a point in a coordinate system ...

  [figure: the point (1, 2) drawn in a coordinate system]

- ... or as a direction (e.g. of motion or force)
basic linear algebra
the basic operations on vectors:
- scaling: α · v = α · (v1, ..., vn) = (α·v1, ..., α·vn)
- addition and subtraction: v + w = (v1, ..., vn) + (w1, ..., wn) = (v1 + w1, ..., vn + wn)
- scalar product or dot product: v · w = (v1, ..., vn) · (w1, ..., wn) = v1·w1 + ... + vn·wn
- vector length or norm: |v| = |(v1, ..., vn)| = √(v1·v1 + ... + vn·vn) = √(v · v)
examples: basic linear algebra
- 0.5 · (1, 0, 0, 1) = (0.5, 0, 0, 0.5)
- (1, 0, 0, 1) + (0, 0, 1, 1) = (1, 0, 1, 2)
- (1, 0, 0, 1) · (0, 0, 1, 1) = 1·0 + 0·0 + 0·1 + 1·1 = 1
- |(1, 0, 0, 1)| = √(1·1 + 0·0 + 0·0 + 1·1) = √2
simple linear algebra implementation
- naively, we could implement the basic vector operations in Python:

import math

def scale(a, v):
    return [a*vk for vk in v]

def vsum(v, w):
    return [vk+wk for (vk, wk) in zip(v, w)]

def dot(v, w):
    return sum(vk*wk for (vk, wk) in zip(v, w))

def vlength(v):
    return math.sqrt(dot(v, v))

- however, this is inefficient if the dimension of the vector space is high
linear algebra implementation: better
- NumPy and SciPy are Python libraries containing many mathematical functions
  - they are interlinked and typically installed together
  - scikit-learn relies on both of them
- they use specialized math libraries to make computations faster
  - e.g. BLAS for your processor or graphics card
- example with a 100-million-dimensional random vector (a timing sketch follows below):
  - my simple function dot(v, v) takes 81 seconds
  - numpy.dot(v, v) takes 0.15 seconds
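
the exact numbers depend on the machine, but roughly this comparison can be reproduced as follows:

import time
import numpy

def dot(v, w):
    return sum(vk*wk for (vk, wk) in zip(v, w))

v = numpy.random.random(10**8)   # 100 million dimensions (about 800 MB!)

t0 = time.perf_counter()
dot(v, v)
print('naive dot:', time.perf_counter() - t0, 'seconds')

t0 = time.perf_counter()
numpy.dot(v, v)
print('numpy.dot:', time.perf_counter() - t0, 'seconds')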
NumPy linear algebra examples
>>> import numpy
>>> v1 = numpy.array([1, 0, 0, 1, 0])
>>> v2 = numpy.array([0, 2, 1, -2, 1])
>>> v1
array([1, 0, 0, 1, 0])
>>> v2
array([ 0, 2, 1, -2, 1])
>>> v1 + v2
array([ 1, 2, 1, -1, 1])
>>> 100 * v1
array([100, 0, 0, 100, 0])
>>> numpy.dot(v1, v2)
-2
>>> v1.dot(v2)
-2
>>> numpy.linalg.norm(v1)
1.4142135623730951
sparse vectors
- in NLP, feature vectors are a bit peculiar compared to some other fields (e.g. speech and image processing):
  - the vector spaces often have a very high dimension
  - in each feature vector, most of the entries are zero
  - ["prices", "fall"] → (0, 1, 0, ..., 0, 1, 0, ..., 0, 0, 0)
- sparse vector: keep track of non-zero entries only: [(2, 1), (10, 1)]
- in some cases, this saves memory and is much faster
sparse vectors in Python
- SciPy includes five different types of sparse vectors
- in scikit-learn, DictVectorizer and CountVectorizer create vectors of the class csr_matrix (a hand-built example follows below)
  - more on this when we discuss classifier implementation
- see also http://docs.scipy.org/doc/scipy/reference/sparse.html
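
as a sketch, the sparse vector for the ["prices", "fall"] example above (non-zero entries at positions 2 and 10; the dimension 20 is invented) could be built by hand like this:

import numpy
from scipy.sparse import csr_matrix

# a 1-by-20 row vector with ones at positions 2 and 10
data    = numpy.array([1, 1])      # the non-zero values
indices = numpy.array([2, 10])     # their column positions
indptr  = numpy.array([0, 2])      # row boundaries (a single row here)
v = csr_matrix((data, indices, indptr), shape=(1, 20))

print(v.nnz)          # only 2 entries are actually stored
print(v.toarray())    # the full dense view, mostly zeros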
matrices
- a matrix is a 2-dimensional array of numbers: a "list of lists"

    [  1  2  0 ]
    [ -2  1  0 ]

- note that a vector can be seen as a special case of a matrix: a row or a column

    row:    [ -2  1  0 ]

    column: [ -2 ]
            [  1 ]
            [  0 ]
reasons for using matrices
- matrices have a geometric interpretation, as we'll see in a moment
- however, in this context we mainly care about them to speed up our programs
  - we can see matrices as collections of vectors
  - in Python, it's more efficient to carry out a small number of operations on large matrices than many operations on small vectors (see the sketch below)
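
a sketch of what this means in practice: a single matrix-vector product replaces a Python loop over many dot products (sizes invented):

import numpy

X = numpy.random.random((10000, 50))   # 10,000 vectors stored as matrix rows
w = numpy.random.random(50)

scores_loop   = [x.dot(w) for x in X]  # slow: one dot product per row
scores_matrix = X.dot(w)               # fast: one matrix-vector product

print(numpy.allclose(scores_loop, scores_matrix))   # True: same result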
basic matrix operations
the basic elementwise operations on matrices, similar to what we did for the vectors:

- scaling: multiply all the cells by some number

    10 · [ 1  2 ]  =  [ 10  20 ]
         [ 3  4 ]     [ 30  40 ]

- addition / subtraction:

    [ 1  2 ]  +  [ 10  20 ]  =  [ 11  22 ]
    [ 3  4 ]     [ 30  40 ]     [ 33  44 ]
matrix multiplication
- matrix multiplication is an extension of the dot product for vectors
- each cell in the new matrix is computed as the dot product between a row and a column:

    [ 1  2 ] · [ 10  20 ]  =  [  70  100 ]
    [ 3  4 ]   [ 30  40 ]     [ 150  220 ]
geometric interpretation of matrix multiplication
- as mentioned, we use matrix multiplication (and other matrix operations) mainly for efficiency in this course
  - a matrix multiplication instead of many dot products
- however, in geometry matrix multiplication can be used to express many useful transformations
  - scaling
  - rotation
  - projection from 3D to 2D
  - ...
matrix multiplication in NumPy
import numpy

A = numpy.array([[1, 2], [3, 4]])
B = numpy.array([[10, 20], [30, 40]])
print(A.dot(B))   # [[ 70 100]
                  #  [150 220]]
overview
basic linear algebra and its implementation in Python
basic optimization
optimization
- what is optimization?
- unconstrained optimization: find the x that gives us the minimal (or maximal) value of some function f:

    min_x f(x)

- constrained optimization: find the x that gives us the minimal (or maximal) value of f, where x satisfies some extra conditions:

    min_x f(x)   such that x > 0

- today unconstrained optimization only (a small SciPy sketch follows below)
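
in practice we rarely write optimizers from scratch: for instance, SciPy ships general-purpose unconstrained optimizers (the function below is invented for illustration):

import scipy.optimize

def f(x):
    return (x[0] - 3)**2 + (x[1] + 1)**2   # minimum at (3, -1)

result = scipy.optimize.minimize(f, x0=[0.0, 0.0])
print(result.x)   # approximately [ 3. -1.]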
optimization in machine learning
- as we will see in the next lecture, several ML models are formulated as optimization of some mathematical function:
  - support vector machine
  - logistic regression
  - neural networks
  - ...
- typically, we want to optimize a goodness of fit (how well we handle the training set) and a regularizer (simplicity of the classifier)
one-variable example
[figure: a one-variable function plotted on [−3, 3], with its minimum marked]
two-variable example
remember your high-school calculus...
- in your early school days, you might have seen the derivative of a function
- intuition: the derivative measures the slope

  [figure: the one-variable function from before, with its minimum marked]

- if a "nice" function has a maximum or minimum, then the derivative will be zero there
the gradient
- the multidimensional equivalent of the derivative is called the gradient
- if f is a function of n variables, then the gradient is an n-dimensional vector, often written ∇f(x)
- intuition: the gradient points in the uphill direction
- again: the gradient is zero if we have an optimum
computing the gradient
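
one way to compute a gradient without doing the calculus by hand is the finite-difference approximation; a minimal sketch for a two-variable function:

def numerical_gradient(f, x, y, h=1e-6):
    # central differences: perturb each variable a little and
    # measure how much the function value changes
    dfdx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dfdy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return (dfdx, dfdy)

this is also a handy way to check a hand-derived gradient, such as the ones later in this lecture.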
gradient descent
- as we saw, the gradient points in the uphill direction
- this intuition leads to a simple idea for finding the minimum:
  - take a small step in the direction opposite to the gradient
  - repeat until the gradient is close enough to zero
- this is called gradient descent
gradient descent, pseudocode
- the same thing again, in pseudocode:
  1. set x to some initial value, and select a suitable step size c
  2. compute the gradient ∇f(x)
  3. if ∇f(x) is small enough, we are done
  4. otherwise, subtract c · ∇f(x) from x and go back to step 2
- conversely, to find the maximum we can do gradient ascent: then we instead add c · ∇f(x) to x
in Python
import numpy

def gradient_ascent(x_init, y_init,
                    threshold=0.001,
                    steplength=0.01):
    x = x_init
    y = y_init
    done = False
    while not done:
        # take a small step in the uphill (gradient) direction;
        # gradient_of_my_function stands for the gradient of
        # whatever function we are optimizing
        gxy = gradient_of_my_function(x, y)
        x += steplength * gxy[0]
        y += steplength * gxy[1]
        # stop once the gradient is close enough to zero
        if numpy.linalg.norm(gxy) < threshold:
            done = True
    return (x, y)
gradient ascent example
- let's optimize this function:

import math

def f(x, y):
    return math.exp(-(x-2)**2 - (y+1)**2)

- its gradient is

def gradient_of_f(x, y):
    return (-2*(x-2)*f(x, y), -2*(y+1)*f(x, y))
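
putting the pieces together: if gradient_of_my_function in the gradient_ascent code above is replaced by gradient_of_f, the search should end up near the maximum of f at (2, −1):

# assumes gradient_ascent and gradient_of_f as defined above,
# with gradient_of_my_function = gradient_of_f
x_top, y_top = gradient_ascent(0.0, 0.0)
print(x_top, y_top)    # approximately 2.0 -1.0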
gradient ascent example
[figure: contour plot of f (levels 0.15 to 0.90); successive frames show the gradient ascent steps moving toward the maximum at (2, −1)]
will we always reach the top?
- yes, if:
  - there is actually a top
  - the step is short enough
  - the surface isn't too jumpy
- smarter versions of gradient ascent/descent try to adapt the step length, so that we don't go too slowly in the beginning or bounce around the top at the end
gradient ascent example (2)
- let's optimize another function:

import math

def f(x, y):
    return math.exp(-(x-2)**2 - 0.5*(y+1)**2) \
           + 0.7 * math.exp(-0.7*(x+1)**2 - 0.8*(y-1)**2)
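
its gradient can be derived by hand just as before; a sketch, together with two runs that end up on different bumps (the starting points are invented):

import math

def gradient_of_f(x, y):
    g1 = math.exp(-(x-2)**2 - 0.5*(y+1)**2)
    g2 = math.exp(-0.7*(x+1)**2 - 0.8*(y-1)**2)
    return (-2*(x-2)*g1 - 0.98*(x+1)*g2,
            -(y+1)*g1 - 1.12*(y-1)*g2)

# with gradient_ascent from before (gradient_of_my_function = gradient_of_f):
# starting at (1, 0) climbs to the higher bump near (2, -1);
# starting at (-2, 1) climbs to the lower bump near (-1, 1)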
gradient ascent example (2)
[figure: contour plot of the two-bump function (levels 0.15 to 0.90); successive frames show gradient ascent climbing the nearest bump]
local and global maxima/minima
- some functions have local maxima or minima
- these functions are harder to optimize, because the local (but not global) optima also have a gradient of 0
convex and concave functions
- a function is convex if it always curves upwards, like a bowl
- equivalently, if we draw a line between two points on the graph, the graph is always below the line
  - for instance, f(x) = x² is convex, since f''(x) = 2 > 0 everywhere

  [figure: two plots illustrating the chord condition for convexity]

- the point of this: if we find a local optimum (gradient is 0) of a convex function, this is guaranteed to be the global minimum
- conversely, a function is concave if it always curves downwards
stochastic gradient descent
- in some cases it is cumbersome to compute the gradient, because it depends on all the data in the training set
- stochastic gradient descent: simplify the computation by computing the gradient using just a small part
  - typically, a single training example
- pseudocode (a code sketch follows below):
  1. set w to some initial value, and select a suitable step size c
  2. select a single training instance x
  3. compute the gradient ∇f(w) using x only
  4. if we are "done", stop
  5. otherwise, subtract c · ∇f(w) from w and go back to step 2
- (the stopping criterion shouldn't be based on just a single instance)
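
as a concrete illustration, a minimal sketch of SGD for least-squares linear regression; the model, data and step size are invented for illustration:

import numpy

def sgd_least_squares(X, y, steplength=0.01, n_epochs=100):
    # minimize sum_i (w . x_i - y_i)**2, one training instance at a time
    n_instances, n_features = X.shape
    w = numpy.zeros(n_features)
    for epoch in range(n_epochs):
        for i in numpy.random.permutation(n_instances):
            # gradient of (w . x_i - y_i)**2 with respect to w,
            # computed from this single instance only
            gradient = 2 * (w.dot(X[i]) - y[i]) * X[i]
            w -= steplength * gradient
    return w

# invented toy data generated from y = 2*x1 - x2 plus a little noise
X = numpy.random.random((100, 2))
y = 2*X[:, 0] - X[:, 1] + 0.01*numpy.random.randn(100)
print(sgd_least_squares(X, y))   # roughly [ 2. -1.]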
next lecture
- linear classifiers
  - expressed as a dot product w · x
- we'll use the concepts discussed today to go into the details of a few different learning algorithms:
  - perceptron
  - support vector classifier
  - logistic regression
- implementations in NumPy/SciPy similar to the classifiers in scikit-learn
- preparation for the second assignment