Lecture 2.1: Vector Calculus
CSC 84020 - Machine Learning
Andrew Rosenberg
February 5, 2009
Today
Last Time
Probability Review
Today
Vector Calculus
Background
Let’s talk.
Linear Algebra
Vectors
Matrices
Basis spaces
Eigenvectors/values
Inversion and transposition
Calculus
Differentiation
Integration
Vector Calculus
Gradients
Differentiation w.r.t. a vector
Linear Algebra Basics
What is a vector?
What is a matrix?
Transposition
Adding matrices and vectors
Multiplying matrices.
Definitions
A vector is a one-dimensional array. We denote vectors as x (typically set in boldface). If we don't specify otherwise, assume x is a column vector:

$$\mathbf{x} = \begin{pmatrix} x_0 \\ x_1 \\ \vdots \\ x_{n-1} \end{pmatrix}$$
Definitions
A matrix is a two-dimensional array. We typically denote matrices as capital letters, e.g., A. If A is an n-by-m matrix, it has the following structure:

$$A = \begin{pmatrix} a_{0,0} & a_{0,1} & \cdots & a_{0,m-1} \\ a_{1,0} & a_{1,1} & \cdots & a_{1,m-1} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n-1,0} & a_{n-1,1} & \cdots & a_{n-1,m-1} \end{pmatrix}$$
Matrix transposition
Transposing a matrix or vector swaps rows and columns.
A column vector becomes a row vector:

$$\mathbf{x} = \begin{pmatrix} x_0 \\ x_1 \\ \vdots \\ x_{n-1} \end{pmatrix} \qquad \mathbf{x}^T = \begin{pmatrix} x_0 & x_1 & \cdots & x_{n-1} \end{pmatrix}$$
Matrix transposition
Transposing a matrix or vector swaps rows and columns.
The rows of A become the columns of A^T:

$$A = \begin{pmatrix} a_{0,0} & a_{0,1} & \cdots & a_{0,m-1} \\ a_{1,0} & a_{1,1} & \cdots & a_{1,m-1} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n-1,0} & a_{n-1,1} & \cdots & a_{n-1,m-1} \end{pmatrix} \qquad A^T = \begin{pmatrix} a_{0,0} & a_{1,0} & \cdots & a_{n-1,0} \\ a_{0,1} & a_{1,1} & \cdots & a_{n-1,1} \\ \vdots & \vdots & \ddots & \vdots \\ a_{0,m-1} & a_{1,m-1} & \cdots & a_{n-1,m-1} \end{pmatrix}$$
If A is n-by-m, then AT is m-by-n.
Adding Matrices
Matrices can only be added if they have the same dimension.
$$A + B = \begin{pmatrix} a_{0,0}+b_{0,0} & a_{0,1}+b_{0,1} & \cdots & a_{0,m-1}+b_{0,m-1} \\ a_{1,0}+b_{1,0} & a_{1,1}+b_{1,1} & \cdots & a_{1,m-1}+b_{1,m-1} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n-1,0}+b_{n-1,0} & a_{n-1,1}+b_{n-1,1} & \cdots & a_{n-1,m-1}+b_{n-1,m-1} \end{pmatrix}$$
Multiplying matrices
To multiply two matrices, the inner dimensions must match.
An n-by-m matrix can be multiplied by an n′-by-m′ matrix iff m = n′; the result C is n-by-m′.
$$AB = C, \qquad c_{ij} = \sum_{k=0}^{m-1} a_{ik} b_{kj}$$

That is, multiply the i-th row of A by the j-th column of B.
[Figure: row-by-column matrix multiplication; image from Wikipedia.]
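Not part of the original slides; a minimal NumPy sketch of the row-by-column rule, with the dimensions chosen purely for illustration:

```python
import numpy as np

A = np.arange(6).reshape(2, 3)      # 2-by-3
B = np.arange(12).reshape(3, 4)     # 3-by-4: inner dimensions match (3 = 3)

# c_ij = sum_k a_ik * b_kj, written out explicitly
C = np.zeros((2, 4))
for i in range(2):
    for j in range(4):
        for k in range(3):
            C[i, j] += A[i, k] * B[k, j]

assert np.allclose(C, A @ B)        # matches NumPy's built-in product
```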
Useful matrix operations
Inversion
Norm
Eigenvector decomposition
Matrix Inversion
The inverse of an n-by-m matrix A is denoted $A^{-1}$, and has the following property:

$$AA^{-1} = I$$

where $I$ is the identity matrix: an n-by-n matrix with $I_{ij} = 1$ iff $i = j$, and 0 otherwise. If A is a square matrix (n = m), then also

$$A^{-1}A = I$$
What is the inverse of a vector? $\mathbf{x}^{-1} = ?$
Some useful Matrix Inversion Properties
$$(A^{-1})^{-1} = A$$
$$(kA)^{-1} = k^{-1}A^{-1}$$
$$(A^T)^{-1} = (A^{-1})^T$$
$$(AB)^{-1} = B^{-1}A^{-1}$$
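A small NumPy sketch (not from the slides) verifying two of these properties; random matrices like these are almost surely invertible:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))

# AA^{-1} = I, and (AB)^{-1} = B^{-1} A^{-1}
assert np.allclose(A @ np.linalg.inv(A), np.eye(3))
assert np.allclose(np.linalg.inv(A @ B),
                   np.linalg.inv(B) @ np.linalg.inv(A))
```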
The norm of a vector
The norm of a vector $\mathbf{x}$ is written $\|\mathbf{x}\|$. The norm represents the Euclidean length of a vector:

$$\|\mathbf{x}\| = \sqrt{\sum_{i=0}^{n-1} x_i^2} = \sqrt{x_0^2 + x_1^2 + \cdots + x_{n-1}^2}$$
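A one-line check (not from the slides), assuming NumPy:

```python
import numpy as np

x = np.array([3.0, 4.0])
assert np.isclose(np.sqrt(np.sum(x ** 2)), 5.0)   # sqrt(3^2 + 4^2)
assert np.isclose(np.linalg.norm(x), 5.0)         # NumPy's Euclidean norm
```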
Positive Definite/Semi-Definite
A positive definite matrix $M$ has the property that

$$\mathbf{x}^T M \mathbf{x} > 0 \quad \text{for all } \mathbf{x} \neq \mathbf{0}$$

A positive semi-definite matrix $M$ has the property that

$$\mathbf{x}^T M \mathbf{x} \geq 0 \quad \text{for all } \mathbf{x}$$
Why might we care about these matrices?
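Not from the slides: for a symmetric matrix M, positive definiteness is equivalent to all eigenvalues being positive, which gives a simple numerical test. A sketch:

```python
import numpy as np

M = np.array([[2.0, -1.0],
              [-1.0, 2.0]])         # symmetric

# For symmetric M: positive definite <=> all eigenvalues > 0
eigvals = np.linalg.eigvalsh(M)
print(eigvals)                       # [1. 3.] -> positive definite

x = np.array([0.5, -2.0])
print(x @ M @ x)                     # x^T M x = 10.5 > 0, as expected
```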
Eigenvectors
For a square matrix A, an eigenvector is defined by

$$A\mathbf{u}_i = \lambda_i \mathbf{u}_i$$

where $\mathbf{u}_i$ is an eigenvector and $\lambda_i$ is its corresponding eigenvalue.

In general, eigenvalues are complex numbers, but if A is symmetric, they are real.

Eigenvectors describe how a matrix transforms a vector, and can be used to define a basis space, namely the eigenspace.

Who cares? The eigenvectors of a covariance matrix have some very interesting properties.
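A small NumPy sketch (not from the slides) computing an eigendecomposition and verifying the defining property:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])          # symmetric, so eigenvalues are real

lams, U = np.linalg.eig(A)          # columns of U are eigenvectors
for i in range(len(lams)):
    u = U[:, i]
    assert np.allclose(A @ u, lams[i] * u)   # A u_i = lambda_i u_i
```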
Basis Spaces
Basis spaces allow vectors to be represented in different spaces. Our normal 2-dimensional basis space is generated by the vectors [1, 0] and [0, 1].

Any 2-D vector can be expressed as a linear combination of these two basis vectors.

However, any two non-collinear vectors can generate a 2-D basis space. Expressed in the coordinates of that basis, the generating vectors themselves play the roles of [1, 0] and [0, 1].
Basis Spaces

[Figure slides illustrating basis spaces are not reproduced in this transcript.]
Why do we care?
Dimensionality reduction.
Calculus Basics
What is a derivative?
What is an integral?
Derivatives
A derivative, $\frac{d}{dx}f(x)$, can be thought of as defining the slope of a function $f(x)$. This is sometimes also written as $f'(x)$.
Derivative Example
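The example figure is not reproduced in this transcript; in its place, a minimal worked example:

$$f(x) = x^2 - 3x \qquad f'(x) = 2x - 3$$

so the slope of $f$ at $x = 2$ is $f'(2) = 1$.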
Integrals
Integrals are an inverse operation of the derivative (up to a constant):

$$\int f(x)\,dx = F(x) + c \qquad F'(x) = f(x)$$

An integral can be thought of as a calculation of the area under the curve defined by $f(x)$.

A definite integral evaluates the area over a finite region $[a, b]$; an indefinite integral yields the antiderivative $F(x) + c$.
Integration Example
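The example figure is not reproduced in this transcript; in its place, a minimal worked example:

$$\int_0^1 x^2\,dx = \left[\frac{x^3}{3}\right]_0^1 = \frac{1}{3}$$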
Useful calculus operations
Product, quotient, summation rules for derivatives.
Useful integration and derivative identities.
Chain rule
Integration by parts
Variable substitution (don’t forget the Jacobian!)
Calculus Identities
Summation rule:
$$g(x) = f_0(x) + f_1(x) \implies g'(x) = f_0'(x) + f_1'(x)$$

Product rule:
$$g(x) = f_0(x)f_1(x) \implies g'(x) = f_0(x)f_1'(x) + f_0'(x)f_1(x)$$

Quotient rule:
$$g(x) = \frac{f_0(x)}{f_1(x)} \implies g'(x) = \frac{f_0'(x)f_1(x) - f_0(x)f_1'(x)}{f_1^2(x)}$$
Calculus Identities
Constant multipliers:
$$g(x) = cf(x) \implies g'(x) = cf'(x)$$

Exponent rule:
$$g(x) = f(x)^k \implies g'(x) = kf(x)^{k-1}f'(x)$$

Chain rule:
$$g(x) = f_0(f_1(x)) \implies g'(x) = f_0'(f_1(x))\,f_1'(x)$$
Calculus Identities
Exponential rule:
$$g(x) = e^x \implies g'(x) = e^x$$
$$g(x) = k^x \implies g'(x) = \ln(k)\,k^x$$

Logarithm rule:
$$g(x) = \ln(x) \implies g'(x) = \frac{1}{x}$$
$$g(x) = \log_b(x) \implies g'(x) = \frac{1}{x \ln b}$$
Calculus Operations
Integration by parts:

$$\int f(x)\frac{dg(x)}{dx}\,dx = f(x)g(x) - \int g(x)\frac{df(x)}{dx}\,dx$$

Variable substitution:

$$\int_a^b f(g(x))\,g'(x)\,dx = \int_{g(a)}^{g(b)} f(u)\,du$$
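A small worked instance of variable substitution (not in the original slides): with $g(x) = x^2 + 1$ and $g'(x) = 2x$,

$$\int_0^1 (x^2+1)^3\,2x\,dx = \int_1^2 u^3\,du = \left[\frac{u^4}{4}\right]_1^2 = \frac{15}{4}$$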
Vector Calculus
Differentiation with respect to a vector or matrix.
Gradient of a vector.
Change of variables with a vector.
Differentiation with respect to a vector

Given a vector $\mathbf{x} = (x_0, x_1, \ldots, x_{n-1})^T$ and a function $f(\mathbf{x}): \mathbb{R}^n \to \mathbb{R}$, how can we find $\frac{\partial f(\mathbf{x})}{\partial \mathbf{x}}$?

$$\frac{\partial f(\mathbf{x})}{\partial \mathbf{x}} = \begin{pmatrix} \frac{\partial f(\mathbf{x})}{\partial x_0} \\ \frac{\partial f(\mathbf{x})}{\partial x_1} \\ \vdots \\ \frac{\partial f(\mathbf{x})}{\partial x_{n-1}} \end{pmatrix}$$

This is also called the gradient of the function, and is often written $\nabla f(\mathbf{x})$ or $\nabla f$.

Why might this be useful?
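Not part of the lecture; a minimal Python sketch (assuming NumPy) that approximates the gradient by central finite differences and checks it against the analytic gradient of $f(\mathbf{x}) = \mathbf{x}^T\mathbf{x}$, which is $2\mathbf{x}$:

```python
import numpy as np

def f(x):
    return np.sum(x ** 2)            # f(x) = x^T x, so grad f(x) = 2x

def numeric_gradient(f, x, h=1e-6):
    # Approximate each partial derivative df/dx_i with a central difference
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

x = np.array([1.0, -2.0, 0.5])
assert np.allclose(numeric_gradient(f, x), 2 * x, atol=1e-4)
```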
Useful Vector Calculus identities
Given a vector $\mathbf{x}$ with $|\mathbf{x}| = n$ and a scalar variable $y$:

$$\frac{\partial \mathbf{x}}{\partial y} = \begin{pmatrix} \frac{\partial x_0}{\partial y} \\ \frac{\partial x_1}{\partial y} \\ \vdots \\ \frac{\partial x_{n-1}}{\partial y} \end{pmatrix}$$
Useful Vector Calculus identities
Given a vector $\mathbf{x}$ with $|\mathbf{x}| = n$ and a vector $\mathbf{y}$ with $|\mathbf{y}| = m$:

$$\frac{\partial \mathbf{x}}{\partial \mathbf{y}} = \begin{pmatrix} \frac{\partial x_0}{\partial y_0} & \frac{\partial x_0}{\partial y_1} & \cdots & \frac{\partial x_0}{\partial y_{m-1}} \\ \frac{\partial x_1}{\partial y_0} & \frac{\partial x_1}{\partial y_1} & \cdots & \frac{\partial x_1}{\partial y_{m-1}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial x_{n-1}}{\partial y_0} & \frac{\partial x_{n-1}}{\partial y_1} & \cdots & \frac{\partial x_{n-1}}{\partial y_{m-1}} \end{pmatrix}$$

This matrix of partial derivatives is the Jacobian.
Vector Calculus Identities
Similar to the scalar multiplication rule:

$$\frac{\partial}{\partial \mathbf{x}}(\mathbf{x}^T\mathbf{a}) = \frac{\partial}{\partial \mathbf{x}}(\mathbf{a}^T\mathbf{x}) = \mathbf{a}$$

Similar to the product rule:

$$\frac{\partial}{\partial x}(AB) = \frac{\partial A}{\partial x}B + A\frac{\partial B}{\partial x}$$

Derivative of a matrix inverse:

$$\frac{\partial}{\partial x}(A^{-1}) = -A^{-1}\frac{\partial A}{\partial x}A^{-1}$$
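A quick numerical check of the first identity (not from the slides); the finite-difference gradient of $\mathbf{a}^T\mathbf{x}$ should recover $\mathbf{a}$:

```python
import numpy as np

a = np.array([1.0, 2.0, -0.5])
f = lambda x: a @ x                  # f(x) = a^T x

x0 = np.zeros(3)
h = 1e-6
grad = np.array([(f(x0 + h * e) - f(x0 - h * e)) / (2 * h)
                 for e in np.eye(3)])
assert np.allclose(grad, a, atol=1e-4)   # gradient equals a
```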
Change of Variable in an Integral
$$\int f(\mathbf{x})\,d\mathbf{x} = \int f(\mathbf{u})\left|\frac{\partial \mathbf{x}}{\partial \mathbf{u}}\right|\,d\mathbf{u}$$

where $\left|\frac{\partial \mathbf{x}}{\partial \mathbf{u}}\right|$ is the absolute value of the determinant of the Jacobian.
Calculating the Expectation of a Gaussian
Now we have enough tools to calculate the expectation of a variable given a Gaussian distribution.
Recall:
$$E[x|\mu,\sigma^2] = \int p(x|\mu,\sigma^2)\,x\,dx = \int \mathcal{N}(x|\mu,\sigma^2)\,x\,dx = \int \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{1}{2\sigma^2}(x-\mu)^2\right\} x\,dx$$
Calculating the Expectation of a Gaussian
$$E[x|\mu,\sigma^2] = \int \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{1}{2\sigma^2}(x-\mu)^2\right\} x\,dx$$

Substitute $u = x - \mu$, so $du = dx$:

$$E[x|\mu,\sigma^2] = \int \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{1}{2\sigma^2}u^2\right\} (u+\mu)\,du$$

$$= \int \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{1}{2\sigma^2}u^2\right\} u\,du + \mu \int \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{1}{2\sigma^2}u^2\right\} du$$
Calculating the Expectation of a Gaussian
$$E[x|\mu,\sigma^2] = \int \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{1}{2\sigma^2}u^2\right\} u\,du + \mu \int \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{1}{2\sigma^2}u^2\right\} du$$

The second integral is of a normalized Gaussian density, so

$$\int \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{1}{2\sigma^2}u^2\right\} du = 1$$

$$E[x|\mu,\sigma^2] = \int \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{1}{2\sigma^2}u^2\right\} u\,du + \mu$$
Aside: A function is odd iff $f(x) = -f(-x)$. Odd functions have the property that $\int_{-\infty}^{\infty} f(x)\,dx = 0$ (provided the integral converges).

A function is even iff $f(x) = f(-x)$.

The product of an odd function and an even function is an odd function.
Calculating the Expectation of a Gaussian
$$E[x|\mu,\sigma^2] = \int \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{1}{2\sigma^2}u^2\right\} u\,du + \mu$$

$\exp\left\{-\frac{1}{2\sigma^2}u^2\right\}$ is even, and $u$ is odd, so their product $\exp\left\{-\frac{1}{2\sigma^2}u^2\right\} u$ is odd. Therefore

$$\int \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{1}{2\sigma^2}u^2\right\} u\,du = 0$$

$$E[x|\mu,\sigma^2] = \mu$$
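Not part of the lecture; a quick Monte Carlo sanity check of this result, assuming NumPy:

```python
import numpy as np

mu, sigma = 1.5, 2.0
rng = np.random.default_rng(0)
samples = rng.normal(mu, sigma, size=1_000_000)
print(samples.mean())                # ~1.5: the sample mean approaches mu
```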
Why does Machine Learning need these tools?
Calculus
We need to find maximum likelihoods or minimum risks. This optimization is accomplished with derivatives.
Integration allows us to marginalize continuous probability density functions.
Linear Algebra
We will be working in high-dimensional spaces.
Vectors and matrices allow us to refer to high-dimensional points – groups of features – as vectors.
Matrices allow us to describe the feature space.
Why does machine learning need these tools?
Vector Calculus
We need to do all of the calculus operations in high-dimensional feature spaces.
We will want to optimize multiple values simultaneously –Gradient Descent.
We will need to take marginals over high-dimensional distributions – Gaussians.
Broader Context
What we have so far:
Entities in the world are represented as feature vectors and maybe a label.
We want to construct statistical models of the feature vectors.
Finding the most likely model is an optimization problem.
Since the feature vectors may have more than one dimension,linear algebra can help us work with them.
Bye
Next
Linear Regression