Machine Learning for NLP
Lecture 2: Basic linear algebra and optimization
University of Gothenburg
Richard Johansson
September 4, 2015
math in machine learning
- machine learning is a "mathy" subject...
- the most important branches of mathematics used in ML:
  - probability and statistical theory
  - linear algebra
  - optimization
- in this lecture, we'll have a look at the latter two
overview
basic linear algebra and its implementation in Python
basic optimization
recap: mapping features to numerical vectors
- we convert symbolic features to numbers when we use scikit-learn's classifiers:
vec = DictVectorizer()
Xe = vec.fit_transform(X)
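
for example, a minimal self-contained sketch (the feature dicts here are invented for illustration):

from sklearn.feature_extraction import DictVectorizer

# a tiny invented training set: one attribute-value dict per instance
X = [{'word': 'prices', 'pos': 'NNS'},
     {'word': 'fall', 'pos': 'VB'}]

vec = DictVectorizer()
Xe = vec.fit_transform(X)    # a sparse matrix: one row per instance
print(Xe.toarray())          # dense view of the numerical vectors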
types of vectorizers
- a DictVectorizer converts from attribute-value dicts
- a CountVectorizer converts from texts (after applying a tokenizer) or lists
- a TfidfVectorizer is like a CountVectorizer, but also uses TF*IDF (see the sketch below)
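
a small sketch of the difference, with invented example texts:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ['prices fall', 'prices rise and prices fall']   # invented examples

cv = CountVectorizer()
print(cv.fit_transform(docs).toarray())   # raw term counts per document

tv = TfidfVectorizer()
print(tv.fit_transform(docs).toarray())   # the same counts, TF*IDF-weighted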
vectors
- a tuple consisting of n numbers is called a vector
- the set of all possible tuples of length n is called an n-dimensional vector space
- for instance: (1, 2) is a 2-dimensional vector
- vectors can be interpreted geometrically, either as a point in a coordinate system ...

  [figure: the point (1, 2) drawn in a coordinate system]

- ... or as a direction (e.g. of motion or force)
basic linear algebra
the basic operations on vectors:
- scaling: α · v = α · (v1, ..., vn) = (α·v1, ..., α·vn)
- addition and subtraction: v + w = (v1, ..., vn) + (w1, ..., wn) = (v1 + w1, ..., vn + wn)
- scalar product or dot product: v · w = (v1, ..., vn) · (w1, ..., wn) = v1·w1 + ... + vn·wn
- vector length or norm: |v| = |(v1, ..., vn)| = √(v1·v1 + ... + vn·vn) = √(v · v)
examples: basic linear algebra
- 0.5 · (1, 0, 0, 1) = (0.5, 0, 0, 0.5)
- (1, 0, 0, 1) + (0, 0, 1, 1) = (1, 0, 1, 2)
- (1, 0, 0, 1) · (0, 0, 1, 1) = 1·0 + 0·0 + 0·1 + 1·1 = 1
- |(1, 0, 0, 1)| = √(1·1 + 0·0 + 0·0 + 1·1) = √2
simple linear algebra implementation
- naively, we could implement the basic vector operations in Python:

import math

def scale(a, v):
    return [a*vk for vk in v]

def vsum(v, w):
    return [vk+wk for (vk, wk) in zip(v, w)]

def dot(v, w):
    return sum(vk*wk for (vk, wk) in zip(v, w))

def vlength(v):
    return math.sqrt(dot(v, v))

- however, this is inefficient if the dimension of the vector space is high
linear algebra implementation: better
- NumPy and SciPy are Python libraries containing many mathematical functions
  - they are interlinked and typically installed together
  - scikit-learn relies on both of them
- they use specialized math libraries to make computations faster
  - e.g. BLAS for your processor or graphics card
- example with a 100-million-dimensional random vector (a timing sketch follows below):
  - my simple function dot(v, v) takes 81 seconds
  - numpy.dot(v, v) takes 0.15 seconds
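
the exact numbers depend on the machine, but roughly this comparison can be reproduced as follows:

import time
import numpy

def dot(v, w):
    return sum(vk*wk for (vk, wk) in zip(v, w))

v = numpy.random.random(10**8)   # 100 million dimensions (about 800 MB!)

t0 = time.perf_counter()
dot(v, v)
print('naive dot:', time.perf_counter() - t0, 'seconds')

t0 = time.perf_counter()
numpy.dot(v, v)
print('numpy.dot:', time.perf_counter() - t0, 'seconds')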
NumPy linear algebra examples
>>> import numpy
>>> v1 = numpy.array([1, 0, 0, 1, 0])
>>> v2 = numpy.array([0, 2, 1, -2, 1])
>>> v1
array([1, 0, 0, 1, 0])
>>> v2
array([ 0, 2, 1, -2, 1])
>>> v1 + v2
array([ 1, 2, 1, -1, 1])
>>> 100 * v1
array([100, 0, 0, 100, 0])
>>> numpy.dot(v1, v2)
-2
>>> v1.dot(v2)
-2
>>> numpy.linalg.norm(v1)
1.4142135623730951
sparse vectors
- in NLP, feature vectors are a bit peculiar compared to some other fields (e.g. speech and image processing):
  - the vector spaces often have a very high dimension
  - in each feature vector, most of the entries are zero
  - ["prices", "fall"] → (0, 1, 0, ..., 0, 1, 0, ..., 0, 0, 0)
- sparse vector: keep track of non-zero entries only: [(2, 1), (10, 1)]
- in some cases, this saves memory and is much faster
sparse vectors in Python
- SciPy includes five different types of sparse vectors
- in scikit-learn, DictVectorizer and CountVectorizer create vectors of the class csr_matrix (a hand-built example follows below)
  - more on this when we discuss classifier implementation
- see also http://docs.scipy.org/doc/scipy/reference/sparse.html
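
as a sketch, the sparse vector for the ["prices", "fall"] example above (non-zero entries at positions 2 and 10; the dimension 20 is invented) could be built by hand like this:

import numpy
from scipy.sparse import csr_matrix

# a 1-by-20 row vector with ones at positions 2 and 10
data    = numpy.array([1, 1])      # the non-zero values
indices = numpy.array([2, 10])     # their column positions
indptr  = numpy.array([0, 2])      # row boundaries (a single row here)
v = csr_matrix((data, indices, indptr), shape=(1, 20))

print(v.nnz)          # only 2 entries are actually stored
print(v.toarray())    # the full dense view, mostly zeros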
matrices
- a matrix is a 2-dimensional array of numbers: a "list of lists"

    [  1  2  0 ]
    [ -2  1  0 ]

- note that a vector can be seen as a special case of a matrix: a row or a column

    row:    [ -2  1  0 ]

    column: [ -2 ]
            [  1 ]
            [  0 ]
reasons for using matrices
- matrices have a geometric interpretation, as we'll see in a moment
- however, in this context we mainly care about them to speed up our programs
  - we can see matrices as collections of vectors
  - in Python, it's more efficient to carry out a small number of operations on large matrices than many operations on small vectors (see the sketch below)
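
a sketch of what this means in practice: a single matrix-vector product replaces a Python loop over many dot products (sizes invented):

import numpy

X = numpy.random.random((10000, 50))   # 10,000 vectors stored as matrix rows
w = numpy.random.random(50)

scores_loop   = [x.dot(w) for x in X]  # slow: one dot product per row
scores_matrix = X.dot(w)               # fast: one matrix-vector product

print(numpy.allclose(scores_loop, scores_matrix))   # True: same result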
basic matrix operations
the basic elementwise operations on matrices, similar to what we did for the vectors:

- scaling: multiply all the cells by some number

    10 · [ 1  2 ]  =  [ 10  20 ]
         [ 3  4 ]     [ 30  40 ]

- addition / subtraction:

    [ 1  2 ]  +  [ 10  20 ]  =  [ 11  22 ]
    [ 3  4 ]     [ 30  40 ]     [ 33  44 ]
matrix multiplication
- matrix multiplication is an extension of the dot product for vectors
- each cell in the new matrix is computed as the dot product between a row and a column:

    [ 1  2 ] · [ 10  20 ]  =  [  70  100 ]
    [ 3  4 ]   [ 30  40 ]     [ 150  220 ]
geometric interpretation of matrix multiplication
- as mentioned, we use matrix multiplication (and other matrix operations) mainly for efficiency in this course
  - a matrix multiplication instead of many dot products
- however, in geometry matrix multiplication can be used to express many useful transformations
  - scaling
  - rotation
  - projection from 3D to 2D
  - ...
matrix multiplication in NumPy
import numpy

A = numpy.array([[1, 2], [3, 4]])
B = numpy.array([[10, 20], [30, 40]])
print(A.dot(B))   # [[ 70 100]
                  #  [150 220]]
overview
basic linear algebra and its implementation in Python
basic optimization
optimization
- what is optimization?
- unconstrained optimization: find the x that gives us the minimal (or maximal) value of some function f:

    min_x f(x)

- constrained optimization: find the x that gives us the minimal (or maximal) value of f, where x satisfies some extra conditions:

    min_x f(x)   such that x > 0

- today unconstrained optimization only (a small SciPy sketch follows below)
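
in practice we rarely write optimizers from scratch: for instance, SciPy ships general-purpose unconstrained optimizers (the function below is invented for illustration):

import scipy.optimize

def f(x):
    return (x[0] - 3)**2 + (x[1] + 1)**2   # minimum at (3, -1)

result = scipy.optimize.minimize(f, x0=[0.0, 0.0])
print(result.x)   # approximately [ 3. -1.]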
optimization in machine learning
- as we will see in the next lecture, several ML models are formulated as optimization of some mathematical function:
  - support vector machine
  - logistic regression
  - neural networks
  - ...
- typically, we want to optimize a goodness of fit (how well we handle the training set) and a regularizer (simplicity of the classifier)
one-variable example
[figure: a one-variable function plotted on [−3, 3], with its minimum marked]
two-variable example
remember your high-school calculus...
- in your early school days, you might have seen the derivative of a function
- intuition: the derivative measures the slope

  [figure: the one-variable function from before, with its minimum marked]

- if a "nice" function has a maximum or minimum, then the derivative will be zero there
the gradient
- the multidimensional equivalent of the derivative is called the gradient
- if f is a function of n variables, then the gradient is an n-dimensional vector, often written ∇f(x)
- intuition: the gradient points in the uphill direction
- again: the gradient is zero if we have an optimum
computing the gradient
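
one way to compute a gradient without doing the calculus by hand is the finite-difference approximation; a minimal sketch for a two-variable function:

def numerical_gradient(f, x, y, h=1e-6):
    # central differences: perturb each variable a little and
    # measure how much the function value changes
    dfdx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dfdy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return (dfdx, dfdy)

this is also a handy way to check a hand-derived gradient, such as the ones later in this lecture.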
gradient descent
- as we saw, the gradient points in the uphill direction
- this intuition leads to a simple idea for finding the minimum:
  - take a small step in the direction opposite to the gradient
  - repeat until the gradient is close enough to zero
- this is called gradient descent
gradient descent, pseudocode
- the same thing again, in pseudocode:
  1. set x to some initial value, and select a suitable step size c
  2. compute the gradient ∇f(x)
  3. if ∇f(x) is small enough, we are done
  4. otherwise, subtract c · ∇f(x) from x and go back to step 2
- conversely, to find the maximum we can do gradient ascent: then we instead add c · ∇f(x) to x
in Python
import numpy

def gradient_ascent(x_init, y_init,
                    threshold=0.001,
                    steplength=0.01):
    x = x_init
    y = y_init
    done = False
    while not done:
        # take a small step in the uphill (gradient) direction;
        # gradient_of_my_function stands for the gradient of
        # whatever function we are optimizing
        gxy = gradient_of_my_function(x, y)
        x += steplength * gxy[0]
        y += steplength * gxy[1]
        # stop once the gradient is close enough to zero
        if numpy.linalg.norm(gxy) < threshold:
            done = True
    return (x, y)
gradient ascent example
- let's optimize this function:

import math

def f(x, y):
    return math.exp(-(x-2)**2 - (y+1)**2)

- its gradient is

def gradient_of_f(x, y):
    return (-2*(x-2)*f(x, y), -2*(y+1)*f(x, y))
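
putting the pieces together: if gradient_of_my_function in the gradient_ascent code above is replaced by gradient_of_f, the search should end up near the maximum of f at (2, −1):

# assumes gradient_ascent and gradient_of_f as defined above,
# with gradient_of_my_function = gradient_of_f
x_top, y_top = gradient_ascent(0.0, 0.0)
print(x_top, y_top)    # approximately 2.0 -1.0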
gradient ascent example
[figure: contour plot of f (levels 0.15 to 0.90); successive frames show the gradient ascent steps moving toward the maximum at (2, −1)]
will we always reach the top?
- yes, if:
  - there is actually a top
  - the step is short enough
  - the surface isn't too jumpy
- smarter versions of gradient ascent/descent try to adapt the step length, so that we don't go too slowly in the beginning or bounce around the top at the end
gradient ascent example (2)
- let's optimize another function:

import math

def f(x, y):
    return math.exp(-(x-2)**2 - 0.5*(y+1)**2) \
           + 0.7 * math.exp(-0.7*(x+1)**2 - 0.8*(y-1)**2)
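
its gradient can be derived by hand just as before; a sketch, together with two runs that end up on different bumps (the starting points are invented):

import math

def gradient_of_f(x, y):
    g1 = math.exp(-(x-2)**2 - 0.5*(y+1)**2)
    g2 = math.exp(-0.7*(x+1)**2 - 0.8*(y-1)**2)
    return (-2*(x-2)*g1 - 0.98*(x+1)*g2,
            -(y+1)*g1 - 1.12*(y-1)*g2)

# with gradient_ascent from before (gradient_of_my_function = gradient_of_f):
# starting at (1, 0) climbs to the higher bump near (2, -1);
# starting at (-2, 1) climbs to the lower bump near (-1, 1)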
gradient ascent example (2)
[figure: contour plot of the two-bump function (levels 0.15 to 0.90); successive frames show gradient ascent climbing the nearest bump]
local and global maxima/minima
- some functions have local maxima or minima
- these functions are harder to optimize, because the local (but not global) optima also have a gradient of 0
convex and concave functions
- a function is convex if it always curves upwards, like a bowl
- equivalently, if we draw a line between two points on the graph, the graph is always below the line
  - for instance, f(x) = x² is convex, since f''(x) = 2 > 0 everywhere

  [figure: two plots illustrating the chord condition for convexity]

- the point of this: if we find a local optimum (gradient is 0) of a convex function, this is guaranteed to be the global minimum
- conversely, a function is concave if it always curves downwards
stochastic gradient descent
- in some cases it is cumbersome to compute the gradient, because it depends on all the data in the training set
- stochastic gradient descent: simplify the computation by computing the gradient using just a small part
  - typically, a single training example
- pseudocode (a code sketch follows below):
  1. set w to some initial value, and select a suitable step size c
  2. select a single training instance x
  3. compute the gradient ∇f(w) using x only
  4. if we are "done", stop
  5. otherwise, subtract c · ∇f(w) from w and go back to step 2
- (the stopping criterion shouldn't be based on just a single instance)
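
as a concrete illustration, a minimal sketch of SGD for least-squares linear regression; the model, data and step size are invented for illustration:

import numpy

def sgd_least_squares(X, y, steplength=0.01, n_epochs=100):
    # minimize sum_i (w . x_i - y_i)**2, one training instance at a time
    n_instances, n_features = X.shape
    w = numpy.zeros(n_features)
    for epoch in range(n_epochs):
        for i in numpy.random.permutation(n_instances):
            # gradient of (w . x_i - y_i)**2 with respect to w,
            # computed from this single instance only
            gradient = 2 * (w.dot(X[i]) - y[i]) * X[i]
            w -= steplength * gradient
    return w

# invented toy data generated from y = 2*x1 - x2 plus a little noise
X = numpy.random.random((100, 2))
y = 2*X[:, 0] - X[:, 1] + 0.01*numpy.random.randn(100)
print(sgd_least_squares(X, y))   # roughly [ 2. -1.]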
next lecture
- linear classifiers
  - expressed as a dot product w · x
- we'll use the concepts discussed today to go into the details of a few different learning algorithms:
  - perceptron
  - support vector classifier
  - logistic regression
- implementations in NumPy/SciPy similar to the classifiers in scikit-learn
- preparation for the second assignment