
  • Math Camp, Part II

    Komunjer, Sobel, Watson

    2014

    For use only by Econ 205 students at UCSD in 2014. These notes are not to be distributed elsewhere.


    Preface

    These notes are a continuation of those distributed on the first day of class. Ivana and Joel W. are responsible for useful parts. Joel S. is responsible for the irrelevant material and the mistakes. We apologize for inconsistent or incomplete cross references and notation. You are welcome to offer corrections and suggestions. These notes contain far more material than will be covered in the third week of class.

  • Contents

    1 Taylor's Theorem
    2 Univariate Optimization
    3 Basic Linear Algebra
      3.1 Preliminaries
      3.2 Matrices
        3.2.1 Matrix Algebra
        3.2.2 Inner Product and Distance
      3.3 Systems of Linear Equations
      3.4 Linear Algebra: Main Theory
      3.5 Eigenvectors and Eigenvalues
      3.6 Quadratic Forms
    4 Multivariable Calculus
      4.1 Linear Structures
      4.2 Linear Functions
      4.3 Representing Functions
      4.4 Limits and Continuity
      4.5 Sequences
      4.6 Partial Derivatives and Directional Derivatives
      4.7 Differentiability
      4.8 Properties of the Derivative
      4.9 Gradients and Level Sets
      4.10 Homogeneous Functions
      4.11 Higher-Order Derivatives
      4.12 Taylor Approximations
    5 Invertibility and Implicit Function Theorem
      5.1 Inverse Functions
      5.2 Implicit Functions
      5.3 Examples
      5.4 Envelope Theorem for Unconstrained Optimization
    6 Monotone Comparative Statics
    7 Convexity
      7.1 Convex Sets
      7.2 Quasi-Concave and Quasi-Convex Functions
        7.2.1 Relationship between Concavity and Quasiconcavity
    8 Unconstrained Extrema of Real-Valued Functions
      8.1 Definitions
      8.2 First-Order Conditions
      8.3 Second Order Conditions
        8.3.1 S.O. Sufficient Conditions
        8.3.2 S.O. Necessary Conditions
    9 Constrained Optimization
      9.1 Equality Constraints
      9.2 The Kuhn-Tucker Theorem
      9.3 Saddle Point Theorems
      9.4 Second-Order Conditions
      9.5 Examples
    10 Integration
      10.1 Introduction
      10.2 Fundamental Theorems of Calculus
      10.3 Properties of Integrals
      10.4 Computing Integrals
    11 Ordinary Differential Equations
      11.1 Introduction and Terminology
      11.2 Existence of Solutions
      11.3 Solving Ordinary Differential Equations
        11.3.1 Separable Equations
        11.3.2 First Order Linear Differential Equations
        11.3.3 Constant Coefficients
      11.4 Stability

  • Chapter 1

    Taylor’s Theorem

    Using the first derivative, we were able to come up with a way to find the bestlinear approximation to a function. It is natural to ask whether it is possible tofind higher order approximations. What does this mean? By analogy with zerothand first order approximations, we first decide what an appropriate approximatingfunction is and then what the appropriate definition of approximation is.

    First-order approximations were affine functions. In general, an nth-order approximation is a polynomial of degree n, that is, a function of the form

        a_0 + a_1 x + \cdots + a_{n-1} x^{n-1} + a_n x^n.

    Technically, the degree of a polynomial is the largest power of x that appears with non-zero coefficient. So this polynomial has degree n if and only if a_n ≠ 0. Plainly a zeroth degree polynomial is a constant, a first degree polynomial is an affine function, a second degree polynomial is a quadratic, and so on. An nth order approximation of the function f at x is a polynomial A_n of degree at most n that satisfies

        \lim_{y \to x} \frac{f(y) - A_n(y)}{(y - x)^n} = 0.

    This definition generalizes the earlier definition. Notice that the denominator is a power of y − x. When y approaches x the denominator is really small. If the ratio has limit zero it must be that the numerator is really, really small. We know that zeroth order approximations exist for continuous functions and first-order approximations exist for differentiable functions. It is natural to guess that higher-order approximations exist under stricter assumptions. This guess is correct.

    Definition 1. The nth derivative of a function f, denoted f^{(n)}, is defined inductively to be the derivative of f^{(n-1)}.


    We say f is of class C^n on (a, b) (f ∈ C^n) if f^{(n)}(x) exists and is continuous ∀x.

    One can check that, just as differentiability of f implies continuity of f, if f^{(n)} exists, then f^{(n-1)} is continuous.

    Theorem 1 (Taylor's Theorem). Let f ∈ C^n on (a, b), assume that f^{(n+1)} exists on (a, b), and let c and d be any points in (a, b). Then there exists a point t between c and d and a polynomial A_n of degree at most n such that

        f(d) = A_n(d) + \frac{f^{(n+1)}(t)}{(n+1)!} (d - c)^{n+1},     (1.1)

    where A_n is the Taylor Polynomial for f centered at c:

        A_n(d) = \sum_{k=0}^{n} \frac{f^{(k)}(c)}{k!} (d - c)^k.

    The theorem decomposes f into a polynomial and an error term

        E_n = \frac{f^{(n+1)}(t)}{(n+1)!} (d - c)^{n+1}.

    Notice that

        \lim_{d \to c} \frac{E_n}{(d - c)^n} = 0,

    so the theorem states that the Taylor polynomial is, in fact, the nth order approximation of f at c.

    The form of the Taylor approximation may seem mysterious at first, but the coefficients can be seen to be the only choices with the property that f^{(k)}(c) = A_n^{(k)}(c) for k ≤ n. As impressive as the theorem appears, it is just a disguised version of the mean-value theorem.

    FIGURE GOES HERE

    Proof. Define

        F(x) \equiv f(d) - \sum_{k=0}^{n} \frac{f^{(k)}(x)}{k!} (d - x)^k

    and

        G(x) \equiv F(x) - \left( \frac{d - x}{d - c} \right)^{n+1} F(c).


    It follows that F(d) = 0 and (lots of terms cancel) F'(x) = -\frac{f^{(n+1)}(x)}{n!}(d - x)^n. Also, G(c) = G(d) = 0. It follows from the mean value theorem that there exists a t between c and d such that G'(t) = 0. That is, there exists a t such that

        0 = -\frac{f^{(n+1)}(t)}{n!}(d - t)^n + (n+1)\frac{(d - t)^n}{(d - c)^{n+1}} F(c)

    or

        F(c) = \frac{f^{(n+1)}(t)}{(n+1)!}(d - c)^{n+1}.

    An examination of the definition of F confirms that this completes the proof.

    Taylor’s Theorem has several uses. As a conceptual tool it makes precise thenotion that well behaved functions have polynomial approximations. This per-mits you to understand “complicated” functions like the logarithm or exponentialby using their Taylor’s expansion. As a computational tool, it permits you tocompute approximate values of functions. Of course, doing this is not practical(because calculators and computers are available). As a practical tool, first- andsecond-order approximations permit you to conduct analyses in terms of linear orquadratic approximations. This insight is especially important for solving opti-mization problems, as we will see soon.

    Next we provide examples of the first two uses.

    Consider the logarithm function: f(x) = log x. (Unless otherwise mentioned, logarithms are always with respect to the base e.) This function is defined for x > 0. It is not hard to show that f^{(k)}(x) = x^{-k}(-1)^{k-1}(k-1)!. So f^{(k)}(1) = (-1)^{k-1}(k-1)!. Hence:

        f(x) = \sum_{k=1}^{N} (-1)^{k-1} \frac{(x-1)^k}{k} + E_N,

    where E_N = (-1)^N \frac{(x-1)^{N+1}}{(N+1)\, y^{N+1}} for some y between 1 and x. Notice that this expansion is done around x_0 = 1. This is a point at which the function is nicely behaved. Next notice that the function f^{(k)} is differentiable at 1 for all k. This suggests that you can extend the polynomial for an infinite number of terms. It is the case that

        \log(x) = \sum_{k=1}^{\infty} (-1)^{k-1} \frac{(x-1)^k}{k}.

    Sometimes this formula is written in the equivalent form:

        \log(y + 1) = \sum_{k=1}^{\infty} (-1)^{k-1} \frac{y^k}{k}.
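    To see the first use in action, here is a small numerical sketch comparing partial sums of this expansion with the library logarithm (the choice of x = 1.5 and the number of terms are our illustration, not from the notes):

    import math

    def log_taylor(x, N):
        # Partial sum of the expansion around 1: sum_{k=1}^{N} (-1)^(k-1) (x-1)^k / k
        return sum((-1) ** (k - 1) * (x - 1) ** k / k for k in range(1, N + 1))

    for N in (1, 2, 5, 20):
        print(N, log_taylor(1.5, N), math.log(1.5))
    # The partial sums approach log(1.5) = 0.405465... as N grows.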

    The second way to use Taylor's Theorem is to find approximations. The formula above can let you compute logarithms. How about square roots? The Taylor expansion of the square root of x around 1, evaluated at x = 2, is:

        \sqrt{2} = 1 + .5 - .125 + E_2,

    where E_2 = x^{-2.5}/16 for some x ∈ [1, 2]. Check to make sure you know where the terms come from. The approximation says that √2 = 1.375 up to an error. The error term is largest when x = 1. Hence the error is no more than .0625. The error term is smallest when x = 2. I'm not sure what the error is then, but it is certainly positive. Hence I know that the square root of 2 is at least 1.375 and no greater than 1.4375. Perhaps this technique will come in handy the next time you need to compute a square root without the aid of modern electronics.
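    A quick numerical check of these bounds (a minimal sketch; only the standard library is used, and the printing is ours):

    import math

    approx = 1 + 0.5 - 0.125      # second-order Taylor polynomial of sqrt at 1, evaluated at 2
    error_bound = 1 / 16          # max of x^(-2.5)/16 over [1, 2], attained at x = 1

    # The remainder is positive, so sqrt(2) lies between the approximation and
    # the approximation plus the bound.
    assert approx <= math.sqrt(2) <= approx + error_bound
    print(approx, math.sqrt(2), approx + error_bound)   # 1.375  1.4142...  1.4375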

  • Chapter 2

    Univariate Optimization

    Economics is all about optimization subject to constraints. Consumers do it. Theymaximize utility subject to a budget constraint. Firms do it. They maximizeprofits (or minimize costs) subject to technological constraints. You will studyoptimization problems in many forms. The one-variable theory is particularlyeasy, but it teaches many lessons that extend to general settings.

    There are two features of an optimization problem: the objective function, which is the real-valued function that you are trying to maximize or minimize, and the constraint set. (Although the domain of the function you are trying to optimize varies in applications, the range is typically the real numbers. In general, you need the range to have an ordering that enables you to compare any pair of points.)

    You already know a lot about one-variable optimization problems. You knowthat continuous functions defined on closed and bounded intervals attain theirmaximum and minimum values. You know that local optima of differentiablefunctions satisfy the first-order condition.

    The next step is to distinguish between local maxima and local minima. Suppose that f is differentiable on an open interval. The point x∗ is a critical point if f′(x∗) = 0. We know that local minima and local maxima must be critical points. We know from examples that critical points need not be local optima. It turns out that properties of the second derivative of f classify critical points.

    Theorem 2 (Second-Order Conditions). Let f be twice continuously differen-tiable on an open interval (a, b) and let x∗ ∈ (a, b) be a critical point of f . Then

    1. If x∗ is a local maximum, then f′′(x∗) ≤ 0.

    2. If x∗ is a local minimum, then f′′(x∗) ≥ 0.


    3. If f′′(x∗) < 0, then x∗ is a local maximum.

    4. If f′′(x∗) > 0, then x∗ is a local minimum.

    Conditions (1) and (3) are almost converses (as are (2) and (4)), but not quite. Knowing that x∗ is a local maximum is enough to guarantee that f′′(x∗) ≤ 0. Knowing that f′′(x∗) ≥ 0 is not enough to guarantee that you have a local minimum (you may have a local maximum or you may have neither a minimum nor a maximum). (All the intuition you need comes from thinking about the behavior of f(x) = x^n at x = 0 for different values of n.) You might think that you could improve the statements by trying to characterize strict local maxima. It is true that if f′′(x∗) < 0, then x∗ is a strict local maximum, but it is possible to have a strict local maximum and f′′(x∗) = 0. The conditions in Theorem 2 parallel the results about first derivatives and monotonicity stated earlier.

    Proof. By Taylor’s Theorem we can write:

        f(x) = f(x^*) + f'(x^*)(x - x^*) + \frac{1}{2} f''(t)(x - x^*)^2     (2.1)

    for t between x and x∗. If f′′(x∗) > 0, then by continuity of f′′, f′′(t) > 0 for t sufficiently close to x∗ and so, by (2.1) and the fact that f′(x∗) = 0, f(x) > f(x∗) for all x ≠ x∗ sufficiently close to x∗. Consequently, if x∗ is a local maximum, f′′(x∗) ≤ 0, proving (1). (2) is similar.

    If f′′(x∗) < 0, then by continuity of f′′, there exists δ > 0 such that if |t − x∗| < δ, then f′′(t) < 0. By (2.1), it follows that if 0 < |x − x∗| < δ, then f(x) < f(x∗), which establishes (3). (4) is similar.

    The theorem allows us to refine our method for looking for maxima. If f is defined on an interval (and is twice continuously differentiable), the maximum (if it exists) must occur either at a boundary point or at a critical point x∗ that satisfies f′′(x∗) ≤ 0. So you can search for maxima by evaluating f at the boundaries and at the appropriate critical points.
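    Here is a minimal sketch of that search for a particular function (the function f(x) = x³ − 3x and the interval [−2, 3] are our illustration, not from the notes):

    # Maximize f(x) = x**3 - 3x on [-2, 3].
    # f'(x) = 3x**2 - 3 = 0 at x = -1 and x = 1, both inside the interval.
    def f(x):
        return x ** 3 - 3 * x

    candidates = [-2, 3, -1, 1]     # boundary points plus critical points
    best = max(candidates, key=f)
    print(best, f(best))            # 3  18: here the maximum occurs at the boundary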

    This method still does not permit you to say when a local maximum is reallya global maximum. You can do this only if f satisfies the appropriate globalconditions.

    Definition 2. We say a function f is concave over an interval X ⊂ R if ∀x, y ∈ Xand δ ∈ (0, 1), we have

    f(δx+ (1− δ)y) ≥ δf(x) + (1− δ)f(y) (2.2)

    If f is only a function of one argument you can think of this graph as having aninverted “u” shape.

  • 7

    Geometrically the definition says that the graph of the function always liesabove segments connecting two points on the graph. Another way to say this isthat the graph of the function always lies below its tangents (when the tangentsexist). If the inequality in (2.2) is strict, then we say that the function is strictlyconcave. A linear function is concave, but not strictly concave.

    Concave functions have nicely behaved sets of local maximizers. It is an im-mediate consequence of the definition that if x and y are local maxima, then soare all of the points on the line segment connecting x to y. A fancy way to get atthis result is to note that concavity implies

    f(δx+ (1− δ)y) ≥ δf(x) + (1− δ)f(y) ≥ min{f(x), f(y)}. (2.3)

    Moreover, the inequality in (2.3) is strict if δ ∈ (0, 1) and either (a) f(x) ≠ f(y) or (b) f is strictly concave. Suppose that x is a local maximum of f. It follows that f(x) ≥ f(y) for all y. Otherwise f(λx + (1 − λ)y) > f(x) for all λ ∈ (0, 1), contradicting the hypothesis that x is a local maximum. This means that any local maximum of f must be a global maximum. It follows that if x and y are both local maxima, then they both must be global maxima and so f(x) = f(y). In this case it follows from (2.3) that all of the points on the segment connecting x and y must also be maxima. It further implies that x = y if f is strictly concave.

    These results are useful. They guarantee that local maxima are global maxima (so you know that a critical point must be a maximum without worrying about boundary points or local minima) and they provide a tractable sufficient condition for uniqueness. Notice that these nice properties follow from (2.3), which is a weaker condition than (2.2). (As an exercise, try to find a function that satisfies (2.3) but not (2.2).) This suggests that the following definition might be useful.

    Definition 3. We say a function f is quasi concave over an interval X ⊂ R if∀x, y ∈ X and δ ∈ (0, 1), we have f(δx+ (1− δ)y) ≥ min{f(x), f(y)}.

    If quasi-concavity is so great, why bother with concavity? It turns out thatconcavity has a characterization in terms of second derivatives.

    We can repeat the same analysis with signs reversed.

    Definition 4. We say a function f is convex over an interval X ⊂ R if ∀x, y ∈ Xand δ ∈ (0, 1), we have

    f(δx+ (1− δ)y) ≤ δf(x) + (1− δ)f(y)

    If f is only a function of one argument you can think of this graph as having an“u” shape.


    Convex functions have nicely behaved sets of local minimizers (local minima are global minima), and twice differentiable convex functions have non-negative second derivatives.

    Theorem 3. Let f : X −→ R, X an open interval and f ∈ C² on X. Then f′′(x) ≤ 0 ∀x ∈ X if and only if f is concave on X.

    Proof. If f is concave, then for all λ ∈ (0, 1)

        \frac{f(\lambda x + (1-\lambda)y) - f(x)}{1 - \lambda} \ge \frac{f(y) - f(\lambda x + (1-\lambda)y)}{\lambda}.

    Routine manipulation demonstrates that the limit of the left-hand side (if it exists) as λ → 1 is equal to (y − x)f′(x) (note that f(λx + (1 − λ)y) − f(x) = f(x + (1 − λ)(y − x)) − f(x)), while the limit of the right-hand side as λ → 0 is equal to (y − x)f′(y). Taking λ → 1 in the displayed inequality gives (y − x)f′(x) ≥ f(y) − f(x), and taking λ → 0 gives f(y) − f(x) ≥ (y − x)f′(y). It follows that if f is concave and differentiable, then

    (y − x)f ′(x) ≥ (y − x)f ′(y),

    which in turn implies that f ′ is decreasing. Consequently, if f ′′ exists, then it isnon-positive.

    Conversely, if f is differentiable, then by the Mean Value Theorem

    f(λx+ (1− λ)y)− f(x) = (1− λ)f ′(c)(y − x)

    for some c between x and λx+ (1− λ)y. You can check that this means that if f ′is decreasing, then

    f(λx+ (1− λ)y)− f(x) ≥ (1− λ)f ′(λx+ (1− λ)y)(y − x). (2.4)

    Similarly,

        f(λx + (1 − λ)y) − f(y) = −λf′(c)(y − x)

    for some c between λx+ (1− λ)y and y and if f ′ is decreasing, then

    f(λx+ (1− λ)y)− f(y) ≥ −λf ′(λx+ (1− λ)y)(y − x). (2.5)

    The result follows from adding λ times inequality (2.4) to 1 − λ times inequal-ity (2.5).

    Note

    • The previous theorem used the fact that f ′ was decreasing (rather than f ′′ ≤0).


    • The Mean-Value Theorem says that if f is differentiable, then f(y) − f(x) = f′(c)(y − x) for some c between x and y. You can check that this means that if f′ is decreasing, then f(y) − f(x) ≤ f′(x)(y − x). This inequality is the algebraic way of expressing the fact that the tangent line to the graph of f at (x, f(x)) lies above the graph.

    • If f′′ < 0 so that f is strictly concave, then f′ is strictly decreasing. That means that f can have at most one critical point. Since f′′ < 0, this critical point must be a local maximum. These statements simply reiterate earlier observations: Local maxima of strictly concave functions are global maxima, and such a maximum is unique.


  • Chapter 3

    Basic Linear Algebra

    Many of the elements of one-variable differential calculus extend naturally tohigher dimensions. The definition of continuity is (stated appropriately) identi-cal. Maxima exist for real-valued continuous functions defined on the appropriatedomain. The derivative still plays the role of linear approximation. Critical pointsplay the same role in optimization theory. Taylor’s Theorem and Second-OrderConditions reappear. Before we return to calculus, however, we need to record afew facts about domains that are rich enough to permit many variables.

    3.1 Preliminaries

    Definition 5. n-dimensional Euclidean Space:

    Rn = R× R× · · · × R× R

    Note if X and Y are sets, then

    X × Y ≡ {(x, y) | x ∈ X , y ∈ Y}

    soRn = {(x1, x2, . . . , xn) | xi ∈ R,∀ i = 1, 2, . . . , n}

    There are different ways to keep track of elements of R^n. When doing calculus, it is standard to think of a point in R^n as a list of n real numbers, and write x = (x_1, . . . , x_n). When doing linear algebra, it is common to think of the elements of R^n as column vectors. When we do this we write

        x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}.

    Given x ∈ R^n, we understand x_i to be the ith coordinate. In these notes, we will try (and almost certainly fail) to denote vectors using bold face (x) and elements more plainly (x). Although there may be exceptions, for the most part you'll see x in discussions of linear algebra and x in discussions about calculus.

    Definition 6. The zero element:

        0 = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}

    Definition 7 (Vector Addition). For x, y ∈ Rn we have

        x + y = \begin{pmatrix} x_1 + y_1 \\ x_2 + y_2 \\ \vdots \\ x_n + y_n \end{pmatrix}

    Vector addition is commutative

    x + y = y + x

    Definition 8 (Scalar Multiplication). For x ∈ Rn, and a ∈ R we have

        ax = \begin{pmatrix} a x_1 \\ a x_2 \\ \vdots \\ a x_n \end{pmatrix}

    In other words every element of x gets multiplied by a.


    You may hear people talk about vector spaces. Maybe they are showing off. Maybe they really need a more general structure. In any event a vector space is a general set V in which the operations of addition and multiplication by a scalar make sense, where addition is commutative and associative (as above), there is a special zero vector that is the additive identity (0 + v = v), additive inverses exist (for each v there is a −v), and where scalar multiplication is defined as above. Euclidean Spaces are the leading example of vector spaces. We will need to talk about subsets of Euclidean Spaces that have a linear structure (they contain 0, and if x and y are in the set, then so is x + y and all scalar multiples of x and y). We will call these subspaces (this is a correct use of the technical term), but we have no reason to talk about more general kinds of vector spaces.

    3.2 Matrices

    Definition 9. An m × n matrix is an element of M_{m×n}, written in the form

        A = \begin{pmatrix} \alpha_{11} & \alpha_{12} & \cdots & \alpha_{1n} \\ \alpha_{21} & \alpha_{22} & \cdots & \alpha_{2n} \\ \vdots & \vdots & & \vdots \\ \alpha_{m1} & \alpha_{m2} & \cdots & \alpha_{mn} \end{pmatrix} = [\alpha_{ij}],

    where m denotes the number of rows and n denotes the number of columns.

    Note An m × n matrix is just a collection of nm numbers organized in a particular way. Hence we can think of a matrix as an element of R^{m×n}. The extra notation M_{m×n} makes it possible to distinguish the way that the numbers are organized. Note We denote vectors in boldface lower-case letters. Matrices are represented in capital boldface.

    Note Vectors are just a special case of matrices, e.g.

        x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} \in M_{n×1}.

    In particular real numbers are just another special case of matrices, e.g. the number 6:

        6 ∈ R = R^{1×1}


    This notation emphasizes that we think of a vector with n components as a matrix with n rows and 1 column.

    Example 1.

        A_{2×3} = \begin{pmatrix} 0 & 1 & 5 \\ 6 & 0 & 2 \end{pmatrix}

    Definition 10. The transpose of a matrix A is denoted A^t. To get the transpose of a matrix, we let the first row of the original matrix become the first column of the new (transposed) matrix. Using Definition 9 we would get

        A^t = \begin{pmatrix} \alpha_{11} & \alpha_{21} & \cdots & \alpha_{m1} \\ \alpha_{12} & \alpha_{22} & \cdots & \alpha_{m2} \\ \vdots & \vdots & & \vdots \\ \alpha_{1n} & \alpha_{2n} & \cdots & \alpha_{mn} \end{pmatrix} = [\alpha_{ji}]

    Definition 11. A matrix A is symmetric if A = A^t.

    So we can see that if A ∈Mm×n, then At ∈Mn×m.

    Example 2. Using Example 1 we see that

        A^t_{3×2} = \begin{pmatrix} 0 & 6 \\ 1 & 0 \\ 5 & 2 \end{pmatrix}

    3.2.1 Matrix Algebra

    You have probably guessed by now how to add two (m × n) matrices: term by term! Note as we saw beforehand with vectors, it is totally meaningless to add matrices that are of different dimensions.

    Definition 12 (Addition of Matrices). If A_{m×n} = [\alpha_{ij}] and B_{m×n} = [\beta_{ij}] are written out as in Definition 9, then

        A + B = D_{m×n} = \begin{pmatrix} \alpha_{11}+\beta_{11} & \alpha_{12}+\beta_{12} & \cdots & \alpha_{1n}+\beta_{1n} \\ \alpha_{21}+\beta_{21} & \alpha_{22}+\beta_{22} & \cdots & \alpha_{2n}+\beta_{2n} \\ \vdots & \vdots & & \vdots \\ \alpha_{m1}+\beta_{m1} & \alpha_{m2}+\beta_{m2} & \cdots & \alpha_{mn}+\beta_{mn} \end{pmatrix} = [\delta_{ij}] = [\alpha_{ij} + \beta_{ij}],

    which is again an m × n matrix.

    Definition 13 (Multiplication of Matrices). If A_{m×k} and B_{k×n} are given, then we define

        A_{m×k} · B_{k×n} = C_{m×n} = [c_{ij}]

    such that

        c_{ij} \equiv \sum_{l=1}^{k} a_{il} b_{lj},

    so note above that the only index being summed over is l.

    The above expression may look quite daunting if you have never seen summation signs before, so a simple example should help to clarify.

    Example 3. Let

        A_{2×3} = \begin{pmatrix} 0 & 1 & 5 \\ 6 & 0 & 2 \end{pmatrix}  and  B_{3×2} = \begin{pmatrix} 0 & 3 \\ 1 & 0 \\ 2 & 3 \end{pmatrix}.

    Then

        A_{2×3} · B_{3×2} = \begin{pmatrix} (0×0)+(1×1)+(5×2) & (0×3)+(1×0)+(5×3) \\ (6×0)+(0×1)+(2×2) & (6×3)+(0×0)+(2×3) \end{pmatrix} = \begin{pmatrix} 11 & 15 \\ 4 & 24 \end{pmatrix},

    which is a 2 × 2 matrix.

    Note that A_{2×3} · B_{4×5} is meaningless. The second dimension of A must be equal to the first dimension of B.

    Note further that this brings up the very important point that matrices do not multiply like regular numbers. They are NOT commutative, i.e. in general

        A · B ≠ B · A.

    For example,

        A_{2×3} · B_{3×4} ≠ B_{3×4} · A_{2×3};

    in fact, not only does the LHS not equal the RHS, the RHS does not even exist. We will see later that one interpretation of a matrix is as a representation of a linear function. With that interpretation, matrix multiplication takes on a specific meaning and there will be another way to think about why you can only multiply certain "conformable" pairs of matrices.
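    A small numerical check of Example 3, using numpy (our choice of tool; the notes do not assume any software), which also shows that reversing the order gives a different object:

    import numpy as np

    A = np.array([[0, 1, 5],
                  [6, 0, 2]])          # 2 x 3
    B = np.array([[0, 3],
                  [1, 0],
                  [2, 3]])             # 3 x 2

    print(A @ B)          # [[11 15]
                          #  [ 4 24]]  -- matches Example 3
    print((B @ A).shape)  # (3, 3)     -- for these shapes B.A happens to exist,
                          #               but it is a 3 x 3 matrix, not A.B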

    Definition 14. Any matrix which has the same number of rows as columns is known as a square matrix, and is denoted A_{n×n}.

    For example the matrix

        A_{3×3} = \begin{pmatrix} 0 & 1 & 5 \\ 6 & 0 & 2 \\ 3 & 2 & 1 \end{pmatrix}

    is a square (3 × 3) matrix.


    Definition 15. There is a special square matrix known as the identity matrix (which is likened to the number 1, the multiplicative identity), in that any matrix multiplied by this identity matrix gives back the original matrix. The identity matrix is denoted I_n and is equal to

        I_n = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & \cdots & 0 & 1 \end{pmatrix}.

    Definition 16. A square matrix is called a diagonal matrix if a_{ij} = 0 whenever i ≠ j. (The main diagonal is always defined as the diagonal going from the top left corner to the bottom right corner, i.e. ↘.)

    Definition 17. A square matrix is called an upper triangular matrix (resp. lower triangular) if a_{ij} = 0 whenever i > j (resp. i < j).

    Diagonal matrices are easy to deal with. Triangular matrices are also somewhat tractable. You'll see that for many applications you can replace an arbitrary square matrix with a related diagonal matrix.

    For any matrix A_{m×n} we have the results that

        A_{m×n} · I_n = A_{m×n}

    and

        I_m · A_{m×n} = A_{m×n}.

    Note that unlike normal algebra it is not the same matrix which multiplies A_{m×n} on both sides to give back A_{m×n} (unless n = m).

    Definition 18. We say a matrix A_{n×n} is invertible or non-singular if ∃ B_{n×n} such that

        A_{n×n} · B_{n×n} = B_{n×n} · A_{n×n} = I_n.

    If A is invertible, we denote its inverse as A^{-1}. So we get

        A · A^{-1} = A^{-1} · A = I_n.


    A square matrix that is not invertible is called singular. Note that this only applies to square matrices. (You can find one-sided "pseudo inverses" for all matrices, even those that are not square.) Note We will see how to calculate inverses soon.

    Definition 19. The determinant of a matrix A (written det A = |A|) is defined inductively.

    n = 1: for A_{1×1}, det A = |A| ≡ a_{11}.

    n ≥ 2: for A_{n×n},

        det A = |A| ≡ a_{11} |A_{-11}| - a_{12} |A_{-12}| + a_{13} |A_{-13}| - \cdots ± a_{1n} |A_{-1n}|,

    where A_{-1j} is the matrix formed by deleting the first row and jth column of A.

    Note A_{-1j} is an (n − 1) × (n − 1) dimensional matrix.

    Example 4. If

        A_{2×2} = [a_{ij}] = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}  \Longrightarrow  |A| = a_{11}a_{22} - a_{12}a_{21}

    Example 5. If

        A_{3×3} = [a_{ij}] = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix}

        \Longrightarrow  |A| = a_{11} \begin{vmatrix} a_{22} & a_{23} \\ a_{32} & a_{33} \end{vmatrix} - a_{12} \begin{vmatrix} a_{21} & a_{23} \\ a_{31} & a_{33} \end{vmatrix} + a_{13} \begin{vmatrix} a_{21} & a_{22} \\ a_{31} & a_{32} \end{vmatrix}

    The determinant is useful primarily because of the following result:


    Theorem 4. A matrix is invertible if and only if its determinant ≠ 0.

    Definition 20. The inverse of a matrix A_{n×n} is defined as

        A^{-1} = \frac{1}{|A|} \cdot adj A,

    where adj A is the adjoint of A and we will not show how to calculate it here.

    Example 6. If A is a (2 × 2) matrix and invertible then

        A^{-1} = \frac{1}{|A|} \begin{pmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{pmatrix}.

    Note It is worthwhile memorizing the formula for the inverse of a 2 × 2 matrix.

    Leave inverting higher-order matrices to computers.
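    For instance, the 2 × 2 formula and a computer agree (a minimal sketch using numpy, our choice of tool; the particular matrix is ours):

    import numpy as np

    A = np.array([[3.0, -2.0],
                  [8.0, 1.0]])
    det = A[0, 0] * A[1, 1] - A[0, 1] * A[1, 0]            # 19
    inv_by_formula = (1 / det) * np.array([[A[1, 1], -A[0, 1]],
                                           [-A[1, 0], A[0, 0]]])

    print(np.allclose(inv_by_formula, np.linalg.inv(A)))   # True
    print(np.allclose(A @ inv_by_formula, np.eye(2)))      # True: A A^{-1} = I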

    3.2.2 Inner Product and Distance

    Definition 21 (Inner Product). If x, y ∈ M_{n×1}, then the inner product (or dot product or scalar product) is given by

        x^t y = x_1 y_1 + x_2 y_2 + \cdots + x_n y_n = \sum_{i=1}^{n} x_i y_i.

    Note that x^t y = y^t x. We will have reason to use this concept when we do calculus, and will write x · y = \sum_{i=1}^{n} x_i y_i.

    Definition 22 (Distance). We generalize the notion of distance.For Rn, a function

    d : Rn × Rn −→ R

    is called a metric if for any two points x and y ∈ Rn

    1. d(x, y) ≥ 0

    2. d(x, y) = 0⇐⇒ x = y


    3. d(x, y) = d(y, x)

    4. d(x, y) ≤ d(x, z) + d(z, y), for any z ∈ Rn

    The last of these properties is called the triangle inequality.If you think about an example on a map (e.g. in R2), all this is saying is that it isa shorter distance to walk in a straight line from x to y, than it is to walk from xto z and then from z to y.

    A metric is a generalized distance function. It is possible to do calculus onabstract spaces that have a metric (they are called metric spaces). Usually thereare many possible ways to define a metric. Even in Rn there are many possibilities:

    Example 7.

        d(x, y) = \begin{cases} 1, & \text{if } x \ne y, \\ 0, & \text{if } x = y. \end{cases}

    This satisfies the definition of a metric. It basically tells you whether x = y. The metric defined by

    Example 8.

        d(x, y) = \sum_{i=1,\ldots,n} |x_i - y_i|

    states that the distance between two points is the length of the path connecting the two points using segments parallel to the coordinate axes.

    We will be satisfied with the standard Euclidean metric:

    Example 9.

        d(x, y) = ‖x − y‖,

    where

        ‖z‖ = \sqrt{z_1^2 + z_2^2 + \cdots + z_n^2} = \sqrt{\sum_{i=1}^{n} z_i^2}.

    Under the Euclidean metric, the distance between two points is the length of the line segment connecting the points. We call ‖z‖, which is the distance between 0 and z, the norm of z.

    Notice that ‖z‖² = z · z.
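    A minimal numerical illustration of the Euclidean distance and of the identity ‖z‖² = z · z (numpy is our choice of tool; the vectors are ours):

    import numpy as np

    x = np.array([1.0, 2.0, 2.0])
    y = np.array([0.0, 0.0, 0.0])

    print(np.linalg.norm(x - y))                   # 3.0, the Euclidean distance from x to y
    print(np.dot(x, x), np.linalg.norm(x) ** 2)    # both 9.0: ||x||^2 = x . x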


    When x · y = 0 we say that x and y are orthogonal/at right angles/perpendicular.

    It is a surprising geometric property that two vectors are perpendicular if and only if their inner product is zero. This fact follows rather easily from "The Law of Cosines." The law of cosines states that if a triangle has sides A, B, and C and the angle θ opposite the side C, then

        c^2 = a^2 + b^2 - 2ab\cos(\theta),

    where a, b, and c are the lengths of A, B, and C respectively. This means that:

        (x - y) \cdot (x - y) = x \cdot x + y \cdot y - 2\,\|x\|\,\|y\|\cos(\theta),

    where θ is the angle between x and y. If you multiply everything out you get the identity:

        \|x\|\,\|y\|\cos(\theta) = x^t y.     (3.1)

    Equation (3.1) has two nice consequences. First, it justifies the use of the term orthogonal: The inner product of two non-zero vectors is zero if and only if the cosine of the angle between them is zero. Second, it gives you an upper bound on the inner product (because the absolute value of the cosine is less than or equal to one):

        \|x\|\,\|y\| \ge |x^t y|.

    3.3 Systems of Linear Equations

    Consider the system of m equations in n variables:

        y_1 = \alpha_{11}x_1 + \alpha_{12}x_2 + \cdots + \alpha_{1n}x_n
        y_2 = \alpha_{21}x_1 + \alpha_{22}x_2 + \cdots + \alpha_{2n}x_n
        \vdots
        y_i = \alpha_{i1}x_1 + \alpha_{i2}x_2 + \cdots + \alpha_{in}x_n
        \vdots
        y_m = \alpha_{m1}x_1 + \alpha_{m2}x_2 + \cdots + \alpha_{mn}x_n

    Here the variables are the x_j. This can be written as

        y_{m×1} = A_{m×n} \cdot x_{n×1}


    where

        y_{m×1} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{pmatrix}, \quad x_{n×1} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}, \quad A_{m×n} = \begin{pmatrix} \alpha_{11} & \alpha_{12} & \cdots & \alpha_{1n} \\ \alpha_{21} & \alpha_{22} & \cdots & \alpha_{2n} \\ \vdots & \vdots & & \vdots \\ \alpha_{m1} & \alpha_{m2} & \cdots & \alpha_{mn} \end{pmatrix} = [\alpha_{ij}]

    or, putting it all together,

        \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{pmatrix} = \begin{pmatrix} \alpha_{11} & \alpha_{12} & \cdots & \alpha_{1n} \\ \alpha_{21} & \alpha_{22} & \cdots & \alpha_{2n} \\ \vdots & \vdots & & \vdots \\ \alpha_{m1} & \alpha_{m2} & \cdots & \alpha_{mn} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}.

    Note you should convince yourself that if you multiply out the RHS of the above equation and then compare corresponding entries of the new (m × 1) vectors, the result is equivalent to the original system of equations. The value of matrices is that they permit you to write the complicated system of equations in a simple form. Once you have written a system of equations in this way, you can use matrix operations to solve some systems.

    Example 10. In high school you probably solved equations of the form:

        3x_1 − 2x_2 = 7
        8x_1 + x_2 = 25

    Well matrix algebra is just a clever way to solve these in one go. So here we have that

        A_{2×2} = \begin{pmatrix} 3 & -2 \\ 8 & 1 \end{pmatrix}, \quad x_{2×1} = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \quad y_{2×1} = \begin{pmatrix} 7 \\ 25 \end{pmatrix},

    and can write this as

        A_{2×2} \cdot x_{2×1} = y_{2×1}.

    And obviously we want to solve for x. So we multiply both sides of the equation on the left (recall that it does matter what side of the equation you multiply) by A^{-1}:

        A^{-1}A x = A^{-1} y  ⟺  I_2 x = A^{-1} y  ⟺  x = A^{-1} y,

    and we have that

        A^{-1} = \frac{1}{|A|} \begin{pmatrix} 1 & 2 \\ -8 & 3 \end{pmatrix} = \frac{1}{(3 × 1) − (−2 × 8)} \begin{pmatrix} 1 & 2 \\ -8 & 3 \end{pmatrix} = \frac{1}{19} \begin{pmatrix} 1 & 2 \\ -8 & 3 \end{pmatrix}.

    So

        x = A^{-1} y = \frac{1}{19} \begin{pmatrix} 1 & 2 \\ -8 & 3 \end{pmatrix} \begin{pmatrix} 7 \\ 25 \end{pmatrix} = \frac{1}{19} \begin{pmatrix} 57 \\ 19 \end{pmatrix} = \begin{pmatrix} 3 \\ 1 \end{pmatrix}.

    The example illustrates some general properties. If you have exactly as many equations as unknowns (so that the matrix A is square), then the system has a unique solution if and only if A is invertible. If A is invertible, it is obvious that the solution is unique (and given by the formula: x = A^{-1}y). If A is not invertible, it is the case that there is a nonzero z such that Az = 0. This means that if you can find one solution to Ax = y (when A is singular, there is no guarantee that a solution to this equation exists), then you can find infinitely many solutions (by adding arbitrary multiples of z to the original solution). Intuitively, it is hard for a matrix to be singular, so "most" of the time systems of n equations and n unknowns have unique solutions.
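    In practice you hand such systems to a solver rather than forming the inverse by hand. A minimal sketch using numpy (our choice of tool) for the system of Example 10:

    import numpy as np

    A = np.array([[3.0, -2.0],
                  [8.0, 1.0]])
    y = np.array([7.0, 25.0])

    x = np.linalg.solve(A, y)     # solves A x = y without forming A^{-1} explicitly
    print(x)                      # [3. 1.], matching the calculation above
    print(np.linalg.det(A))       # 19.0, nonzero, so the solution is unique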

    These comments do not tell you about situations where the number of equa-tions is not equal to the number of unknowns. In most cases, when there are extraequations, a system of linear equations will have no solutions. If there are moreunknowns than equations, typically the system will have infinitely many solutions(it is possible for the system to have no solutions, but it is not possible for thesystem to have a unique solution).

    A system of equations of the form Ax = 0 is called a homogeneous system ofequations. Such a system always has a solution (x = 0). The solution will not beunique if there are more unknowns than equations.

    The standard way to establish these results is by applying “elementary rowoperations” to A to transform a system of equations into an equivalent system thatis easier to analyze.

    3.4 Linear Algebra: Main Theory

    Euclidean Spaces (and Vector Spaces more generally) have a nice structure that permits you to write all elements in terms of a fixed set of elements. In order to describe this theory, we need a few definitions.

    A linear combination of a collection of vectors {x_1, . . . , x_k} is a vector of the form \sum_{i=1}^{k} \lambda_i x_i for some scalars \lambda_1, . . . , \lambda_k.

    Rn has the property that sums and scalar multiples of elements of Rn remainin the set. Hence if we are given some elements of the set, all linear combina-tions will also be in the set. Some subsets are special because they contain noredundancies:

    Definition 23. A collection of vectors {x_1, . . . , x_k} is linearly independent if \sum_{i=1}^{k} \lambda_i x_i = 0 if and only if \lambda_i = 0 for all i.

    Here is the way in which linear independence captures the idea of no redun-dancies:


    Theorem 5. If X = {x_1, . . . , x_k} is a linearly independent collection of vectors and z ∈ S(X), then there are unique \lambda_1, . . . , \lambda_k such that z = \sum_{i=1}^{k} \lambda_i x_i.

    Proof. Existence follows from the definition of span. Suppose that there are two linear combinations of the elements of X that yield z, so that

        z = \sum_{i=1}^{k} \lambda_i x_i  and  z = \sum_{i=1}^{k} \lambda'_i x_i.

    Subtract the equations to obtain:

        0 = \sum_{i=1}^{k} (\lambda'_i - \lambda_i) x_i.

    By linear independence, \lambda_i = \lambda'_i for all i, the desired result.

    Next, let us investigate the set of things that can be described by a collectionof vectors.

    Definition 24. The span of a collection of vectors X = {x_1, . . . , x_k} is S(X) = {y : y = \sum_{i=1}^{k} \lambda_i x_i for some scalars \lambda_1, . . . , \lambda_k}.

    S(X) is the smallest vector space containing all of the vectors in X.

    Definition 25. The dimension of a vector space is N , where N is the smallestnumber of vectors needed to span the space.

    We deal only with finite dimensional vector spaces. We’ll see this definitionagrees with the intuitive notion of dimension. In particular, Rn has dimension n.

    Definition 26. A basis for a vector space V is any collection of linearly independent vectors that spans V.

    Theorem 6. If X = {x_1, . . . , x_k} is a set of linearly independent vectors that does not span V, then there exists v ∈ V such that X ∪ {v} is linearly independent.


    Proof. Take v ∈ V such that v ≠ 0 and v ∉ S(X). X ∪ {v} is a linearly independent set. To see this, suppose that there exist \lambda_i, i = 0, . . . , k, such that at least one \lambda_i ≠ 0 and

        \lambda_0 v + \sum_{i=1}^{k} \lambda_i x_i = 0.     (3.2)

    If \lambda_0 = 0, then X is not linearly independent. If \lambda_0 ≠ 0, then equation (3.2) can be rewritten

        v = -\sum_{i=1}^{k} \frac{\lambda_i}{\lambda_0} x_i.

    In either case, we have a contradiction.

    Definition 27. The standard basis for R^n consists of the set of n vectors e_i, i = 1, . . . , n, where e_i is the vector with component 1 in the ith position and zero in all other positions.

    You should check that the standard basis really is a linearly independent set that spans R^n. Also notice that the elements of the standard basis are mutually orthogonal. When this happens, we say that the basis is orthogonal. It is also the case that each basis element has unit length. When this also happens, we say that the basis is orthonormal. It is always possible to find an orthonormal basis. (We have exhibited an orthonormal basis for R^n; it is possible to construct an orthonormal basis for any vector space.) Orthonormal bases are particularly useful because it is easy to figure out how to express an arbitrary element of the space in terms of the basis.

    It follows from these observations that each vector v has a unique representa-tion in terms of the basis, where the representation consists of the λi used in thelinear combination that expresses v in terms of the basis. For the standard basis,this representation is just the components of the vector.

    It is not hard (but a bit tedious) to prove that all bases have the same number ofelements. (This follows from the observation that any system of n homogeneousequations and m > n unknowns has a non-trivial solution, which in turn followsfrom “row-reduction” arguments.)

    3.5 Eigenvectors and Eigenvalues

    An eigenvalue of the square matrix A is a number λ with the property that A − λI is singular. If λ is an eigenvalue of A, then any x ≠ 0 such that (A − λI)x = 0 is called an eigenvector of A associated with the eigenvalue λ.


    Eigenvalues are those values for which the equation

        Ax = λx

    has a non-zero solution. You can compute eigenvalues (in theory) by solving the equation det(A − λI) = 0. If A is an n × n matrix, then this characteristic equation is a polynomial equation of degree n. By the Fundamental Theorem of Algebra, it will have n (not necessarily distinct and not necessarily real) roots. That is, the characteristic polynomial can be written

        P(λ) = (λ − r_1)^{m_1} \cdots (λ − r_k)^{m_k},

    where r_1, r_2, . . . , r_k are the distinct roots (r_i ≠ r_j when i ≠ j) and m_i are positive integers summing to n. We call m_i the multiplicity of root r_i. Eigenvalues and their corresponding eigenvectors are important because they enable one to relate complicated matrices to simple ones.

    Theorem 7. If A is an n × n matrix that has n distinct eigenvalues or is symmetric, then there exists an invertible n × n matrix P and a diagonal matrix D such that A = PDP^{-1}. Moreover, the diagonal entries of D are the eigenvalues of A and the columns of P are the corresponding eigenvectors.

    The theorem says that if A satisfies certain conditions, then it is "related" to a diagonal matrix. It also tells you how to find the diagonal matrix. The relationship A = PDP^{-1} is quite useful. For example, it follows from the relationship that A^k = PD^kP^{-1}. It is much easier to raise a diagonal matrix to a power than to find a power of a general matrix.

    Proof. Suppose that λ is an eigenvalue of A and x is an eigenvector. This means that Ax = λx. If P is a matrix with column i equal to an eigenvector associated with λ_i, it follows that AP = PD. The theorem would follow if we could guarantee that P is invertible.

    When A is symmetric, one can prove that A has only real eigenvalues andthat one can find n linearly independent eigenvectors even if the eigenvalues arenot distinct. This result is elementary (but uses some basic facts about complexnumbers).

    Suppose that you have

        Ax = λx

    for a real symmetric matrix A. Taking the complex conjugate of both sides yields

        Ax* = λ*x*.


    We have

        (x*)^t A x − x^t A x* = (λ − λ*) x^t x*     (3.3)

    (using (x*)^t x = x^t x*). On the other hand, the left-hand side of equation (3.3) is zero:

        x^t A x* = (x^t A x*)^t = (x*)^t A x,

    where the first equation is true because the transpose of a number is equal to itself and the second equation follows because A is symmetric. Since x^t x* > 0 for x ≠ 0, it follows that λ is equal to its conjugate, so it must be real.

    It is also the case that the eigenvectors corresponding to distinct eigenvalues of a symmetric matrix are orthogonal. To see this observe that Ax_i = λ_i x_i and Ax_j = λ_j x_j imply that

        λ_i x_j^t x_i = x_j^t A x_i = x_i^t A x_j = λ_j x_j^t x_i,

    so (λ_i − λ_j) x_j^t x_i = 0 and, since λ_i ≠ λ_j, x_j^t x_i = 0.

    In general, one can prove that eigenvectors corresponding to distinct eigenvalues are linearly independent. To see this, suppose that λ_1, . . . , λ_k are distinct eigenvalues and x_1, . . . , x_k are associated eigenvectors. In order to reach a contradiction, suppose that the vectors are linearly dependent. Without loss of generality, we may assume that {x_1, . . . , x_{k−1}} are linearly independent, but that x_k can be written as a linear combination of the first k − 1 vectors. This means that there exist α_i, i = 1, . . . , k − 1, not all zero, such that:

        \sum_{i=1}^{k-1} \alpha_i x_i = x_k.     (3.4)

    Multiply both sides of equation (3.4) by A and use the eigenvalue property to obtain:

        \sum_{i=1}^{k-1} \alpha_i \lambda_i x_i = \lambda_k x_k.     (3.5)

    Multiply equation (3.4) by λ_k and subtract it from equation (3.5) to obtain:

        \sum_{i=1}^{k-1} \alpha_i (\lambda_i - \lambda_k) x_i = 0.     (3.6)

    Since the eigenvalues are distinct, equation (3.6) gives a non-trivial linear combination of the first k − 1 x_i that is equal to 0, which contradicts linear independence.

    Here are some useful facts about determinants and eigenvalues. (The proofsrange from obvious to tedious.)


    1. det AB = det BA

    2. If D is a diagonal matrix, then det D is equal to the product of its diagonal elements.

    3. det A is equal to the product of the eigenvalues of A.

    4. The trace of a square matrix A is equal to the sum of the diagonal elements of A. That is, tr(A) = \sum_{i=1}^{n} a_{ii}. Fact: tr(A) = \sum_{i=1}^{n} \lambda_i, where \lambda_i is the ith eigenvalue of A (eigenvalues counted with multiplicity).

    One variation on the symmetric case is particularly useful in the next section. When A is symmetric, we can take the eigenvectors of A to be orthonormal. In this case, the P in the previous theorem has the property that P^{-1} = P^t. Eigenvalues turn out to be important in many different places. They play a role in the study of stability of difference and differential equations. They make certain computations easy. They make it possible to define a sense in which matrices can be positive and negative that allows us to generalize the one-variable second-order conditions. The next topic will do this.
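    A minimal numerical illustration of the decomposition for a symmetric matrix, including the facts that P^{-1} = P^t and A^k = PD^kP^{-1} (numpy is our choice of tool; the matrix is ours):

    import numpy as np

    A = np.array([[2.0, 1.0],
                  [1.0, 2.0]])                    # symmetric, eigenvalues 1 and 3

    eigenvalues, P = np.linalg.eigh(A)            # orthonormal eigenvectors for a symmetric matrix
    D = np.diag(eigenvalues)

    print(np.allclose(np.linalg.inv(P), P.T))     # True: P^{-1} = P^t
    print(np.allclose(A, P @ D @ P.T))            # True: A = P D P^{-1}
    print(np.allclose(np.linalg.matrix_power(A, 5),
                      P @ np.diag(eigenvalues ** 5) @ P.T))   # True: A^5 = P D^5 P^{-1}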

    3.6 Quadratic Forms

    Definition 28. A quadratic form in n variables is any function Q : R^n −→ R that can be written Q(x) = x^t A x where A is a symmetric n × n matrix.

    When n = 1 a quadratic form is a function of the form a x^2. When n = 2 it is a function of the form a_{11}x_1^2 + 2a_{12}x_1x_2 + a_{22}x_2^2 (remember a_{12} = a_{21}). When n = 3, it is a function of the form a_{11}x_1^2 + a_{22}x_2^2 + a_{33}x_3^2 + 2a_{12}x_1x_2 + 2a_{13}x_1x_3 + 2a_{23}x_2x_3. A quadratic form is a second-degree polynomial that has no constant term.

    We will see soon that the second derivative of a real-valued function on R^n is not a single function, but a collection of n^2 functions. Quadratic forms will be the way in which we study second derivatives. In particular, they are important for checking second-order conditions and concavity of functions.

    Definition 29. A quadratic form Q(x) is

    1. positive definite if Q(x) > 0 for all x 6= 0.

    2. positive semi definite if Q(x) ≥ 0 for all x.

    3. negative definite if Q(x) < 0 for all x 6= 0.

    4. negative semi definite if Q(x) ≤ 0 for all x.


    5. indefinite if there exists x and y such that Q(x) > 0 > Q(y).

    The definition provides a notion of positivity and negativity for matrices. You should check to confirm that the definitions coincide with ordinary notions of positive and negative when Q is a quadratic form of one variable (that is, Q(x) = Ax^2). In this case the quadratic form is positive definite if A > 0, negative (and positive) semi-definite when A = 0, and negative definite when A < 0. When n > 1 it is not hard to find indefinite matrices. We will see this soon.

    If A happens to be a diagonal matrix, then it is easy to classify the associated quadratic form according to the definition: Q(x) = x^t A x = \sum_{i=1}^{n} a_{ii} x_i^2. This quadratic form is positive definite if and only if all of the a_{ii} > 0, negative definite if and only if all of the a_{ii} < 0, positive semi definite if and only if a_{ii} ≥ 0 for all i, negative semi definite if and only if a_{ii} ≤ 0 for all i, and indefinite if A has both negative and positive diagonal entries.

    The theory of diagonalization gives us a way to use these results for all matrices. We know that if A is a symmetric matrix, then it can be written A = R^tDR, where D is a diagonal matrix with (real) eigenvalues down the diagonal and R is an orthogonal matrix. This means that the quadratic form is Q(x) = x^tAx = x^tR^tDRx = (Rx)^t D (Rx). This expression is useful because it means that the definiteness of A is equivalent to the definiteness of its diagonal matrix of eigenvalues, D. (Notice that if I can find an x such that x^tAx > 0, then I can find a y such that y^tDy > 0 (take y = Rx), and conversely.)

    Theorem 8. The quadratic form Q(x) = xtAx is

    1. positive definite if λi > 0 for all i.

    2. positive semi definite if λi ≥ 0 for all i.

    3. negative definite if λi < 0 for all i.

    4. negative semi definite if λi ≤ 0 for all i.

    5. indefinite if there exists j and k such that λj > 0 > λk.

    There is a computational trick that often allows you to identify definiteness without computing eigenvalues.

    Definition 30. A principal submatrix of a square matrix A is the matrix obtained by deleting any k rows and the corresponding k columns. The determinant of a principal submatrix is called a principal minor of A. The leading principal submatrix of order k of an n × n matrix is obtained by deleting the last n − k rows and columns of the matrix. The determinant of a leading principal submatrix is called a leading principal minor of A.


    Theorem 9. A matrix is

    1. positive definite if and only if all of its leading principal minors are positive.

    2. negative definite if and only if its odd leading principal minors are negative and its even leading principal minors are positive.

    3. indefinite if one of its kth order leading principal minors is negative for an even k, or if there are two odd leading principal minors that have different signs.

    The theorem permits you to classify the definiteness of matrices without finding eigenvalues. It may seem strange at first, but you can remember it by thinking about diagonal matrices.
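    A minimal check that the leading-principal-minor test and the eigenvalue test agree on an example (numpy is our choice of tool; the matrix is ours):

    import numpy as np

    A = np.array([[2.0, -1.0, 0.0],
                  [-1.0, 2.0, -1.0],
                  [0.0, -1.0, 2.0]])             # a symmetric matrix

    leading_minors = [np.linalg.det(A[:k, :k]) for k in (1, 2, 3)]
    print(leading_minors)                        # [2.0, 3.0, 4.0], all positive
    print(np.linalg.eigvalsh(A))                 # all eigenvalues positive as well,
                                                 # so x'Ax is positive definite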


  • Chapter 4

    Multivariable Calculus

    The goal is to extend the calculus from real-valued functions of a real variable to general functions from R^n to R^m. Some ideas generalize easily, but going from one dimensional domains to many dimensional domains raises new issues that need discussion. Raising the dimension of the range space, on the other hand, raises no new conceptual issues. Consequently, we begin our discussion with real-valued functions. We will explicitly consider higher dimensional ranges only when convenient or necessary. (It will be convenient to talk about linear functions in general terms. It is necessary when we discuss the most interesting generalization of the Chain Rule and when we discuss inverse and implicit functions.)

    4.1 Linear Structures

    To do calculus, you need to understand linear "objects" in R^n. In the plane there are three kinds of subspace (recall that a subspace is a set L with the property that if x and y are in the set, then so are x + y and λx for all λ). These sets are: the entire plane, any line through the origin, and the origin. In higher dimensional spaces, there are more linear subsets.

    It turns out that the most useful to us are one dimensional linear subsets, lines,and n− 1 dimensional subsets, hyperplanes.

    Definition 31. A line is described by a point x and a direction v. It can be repre-sented as {z : there exists t ∈ R such that z = x+ tv}.

    If we constrain t ∈ [0, 1] in the definition, then the set is the line segment connecting x to x + v. Two points still determine a line: The line connecting x to y can be viewed as the line containing x in the direction v = y − x. You should check that this is the same as the line through y in the direction v.


    Definition 32. A hyperplane is described by a point x_0 and a normal direction p ∈ R^n, p ≠ 0. It can be represented as {z : p · (z − x_0) = 0}. p is called the normal direction of the plane.

    The interpretation of the definition is that a hyperplane consists of all of the zwith the property that the direction z − x0 is normal to p.

    In R2 lines are hyperplanes. In R3 hyperplanes are “ordinary” planes.Lines and hyperplanes are two kinds of “flat” subset of Rn. Lines are subsets

    of dimension one. Hyperplanes are subsets of dimension n − 1 or co-dimensionone. You can have a flat subsets of any dimension less than n. Although, ingeneral, lines and hyperplanes are not subspaces (because they do not contain theorigin) you obtain these sets by “translating” a subspace that is, by adding thesame constant to all of its elements.

    Definition 33. A linear manifold of R^n is a set S such that there is a subspace V of R^n and x_0 ∈ R^n with S = V + {x_0}.

    In the above definition, V + {x0} ≡ {y : y = v + x0 for some v ∈ V }.The definition is a bit pedantic. Officially lines and hyperplanes are linear

    manifolds and not linear subspaces.It is worthwhile reviewing the concepts of line and hyperplane. Given two

    points x and y, you can construct a line that passes through the points. This line is

    {z : z = x+ t(y − x) for some t.}

    This formulation is somewhat different from the one normally sees, but it is equiv-alent. Writing out the two-dimensional version yields:

    z1 = x1 + t(y1 − x1) and z2 = x2 + t(y2 − x2).

    If you use the equation for z1 to solve for t and substitute out you get:

        z_2 = x_2 + \frac{(y_2 - x_2)(z_1 - x_1)}{y_1 - x_1}

    or

        z_2 - x_2 = \frac{y_2 - x_2}{y_1 - x_1} (z_1 - x_1),

    which is the standard way to represent the equation of a line (in the plane) through the point (x_1, x_2) with slope (y_2 − x_2)/(y_1 − x_1). This means that the "parametric" representation is essentially equivalent to the standard representation in R^2. (The parametric representation is actually a bit more general, since it allows you to describe lines that are parallel to the vertical axis. Because these lines have infinite slope, they cannot be represented in standard form.) The familiar ways to represent lines do not work in higher dimensions. The reason is that one linear equation in R^n typically has an n − 1 dimensional solution set, so it is a good way to describe a one dimensional set only if n = 2.

    You need two pieces of information to describe a line. If the informationconsists of a point and a direction, then the parametric version of the line is im-mediately available. If the information consists of two points, then you form adirection by subtracting one point from the other (the order is not important).

    You can describe a hyperplane easily given a point and a (normal) direction. Note that the direction of a line is the direction you follow to stay on the line. The direction for a hyperplane is the direction you follow to go away from the hyperplane. If you are given a point and a normal direction, then you can immediately write the equation for the hyperplane. What other pieces of information determine a hyperplane? In R^3, a hyperplane is just a standard plane. Typically, three points determine a plane (if the three points are all on the same line, then infinitely many planes pass through the points). How can you determine the equation of a plane in R^3 that passes through three given points? A mechanical procedure is to note that the equation for the plane can always be written Ax_1 + Bx_2 + Cx_3 = D and use the three points to find values for the coefficients. For example, if the points are (1, 2, −3), (0, 1, 1), (2, 1, 1), then we can solve:

        A + 2B − 3C = D
        B + C = D
        2A + B + C = D

    An alternate computation technique is to look for a normal direction. A normaldirection is a direction that is orthogonal to all directions in the plane. A directionin the plane is a direction of a line in the plane. You can get such a direction bysubtracting any two points in the plane. A two dimensional hyperplane will havetwo independent directions. For this example, one direction can come from thedifference between the first two points: (1, 1,−4) and the other can come fromthe difference between the second and third points (−2, 0, 0) (a third directionwill be redundant, but you can do the computation using the direction of the lineconnecting (1, 2,−3) and (2, 1, 1) instead of either of directions computed above).Once you have two directions, you want to find a normal to both of them. That is,a p such that p 6= 0 and p · (1, 1,−4) = p · (−2, 0, 0) = 0. This is a system of


    two equations and three variables. All multiples of (0, 4, 1) solve the equations.2

    Hence the equation for the hyperplane is (0, 4, 1) · (x1 − 1, x2 − 2, x3 + 3) = 0. You can check that this agrees with the equation we found earlier. It also would be equivalent to the equation you would obtain if you used either of the other two given points as “the point on the plane.”
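    As a sanity check (a Python sketch, not part of the notes), the cross product of the two in-plane directions should give a multiple of the normal (0, 4, 1), and the resulting plane should contain all three points:

        import numpy as np

        d1 = np.array([1.0, 1.0, -4.0])    # (1, 2, -3) - (0, 1, 1)
        d2 = np.array([-2.0, 0.0, 0.0])    # (0, 1, 1) - (2, 1, 1)

        n = np.cross(d1, d2)
        print(n)                           # [0. 8. 2.] -- a multiple of (0, 4, 1)

        # The plane through (1, 2, -3) with normal n: n . (x - (1, 2, -3)) = 0,
        # which simplifies to 4 x2 + x3 = 5, the equation found above.
        for p in [(1, 2, -3), (0, 1, 1), (2, 1, 1)]:
            print(np.dot(n, np.array(p, dtype=float) - np.array([1.0, 2.0, -3.0])))   # all zeros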

    4.2 Linear Functions

    Definition 34. A function

    L : Rn −→ Rm

    is linear if and only if

    1. for all x and y, L(x + y) = L(x) + L(y), and

    2. for all scalars λ, L(λx) = λL(x).

    The first condition says that a linear function must be additive. The second condition says that it must have constant returns to scale. The conditions generate several obvious consequences. If L is a linear function, then L(0) = 0 and, more generally, L(x) = −L(−x). It is an important observation that any linear function can be “represented” by matrix multiplication. Given a linear function, compute L(ei), where ei is the ith standard basis element. Call this ai and let A be the matrix with ith column equal to ai. Note that A must have n columns and m rows. Note also that (by the properties of matrix multiplication and linear functions) L(x) = Ax for all x. This means that any linear function can be thought of as matrix multiplication (and the matrix has a column for each dimension in the domain and a row for each dimension in the range). Conversely, every matrix gives rise to a linear function. Hence the identification between linear functions and matrices is perfect. When the linear function is real valued, linear functions can be thought of as taking an inner product.

    The identification of linear functions with matrix multiplication supplies another interpretation of matrix multiplication. If L : Rm −→ Rk and M : Rk −→ Rn are functions, then we can form the composite function M ◦ L : Rm −→ Rn by M ◦ L(x) = M(L(x)). If M and L are linear functions, and A is the matrix that represents M and B is the matrix that represents L, then AB is the matrix that represents M ◦ L. (Officially, you must verify this by checking how the composite function transforms the standard basis elements.)

    2The “cross product” is a computational tool that allows you to mechanically compute a direction perpendicular to two given directions.
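    The column-by-column construction of the representing matrix, and the fact that composition corresponds to matrix multiplication, can be verified numerically. Below is a Python sketch (not from the notes; the two linear maps are made-up examples):

        import numpy as np

        def L(x):                              # a hypothetical linear map from R^3 to R^2
            x1, x2, x3 = x
            return np.array([2*x1 - x3, x2 + 4*x3])

        def M(y):                              # a hypothetical linear map from R^2 to R^2
            y1, y2 = y
            return np.array([y1 + y2, 3*y2])

        # The ith column of the representing matrix is the image of the ith standard basis vector.
        B = np.column_stack([L(e) for e in np.eye(3)])   # B represents L (2 x 3)
        A = np.column_stack([M(e) for e in np.eye(2)])   # A represents M (2 x 2)

        x = np.array([1.0, -2.0, 0.5])
        print(np.allclose(L(x), B @ x))            # True: L(x) = Bx
        print(np.allclose(M(L(x)), (A @ B) @ x))   # True: AB represents M composed with L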


    4.3 Representing Functions

    Functions of one variable are easy to visualize because we can draw their graphs in the plane. In general, the graph of a function f is Gr(f) = {(x, y) : y = f(x)}. This means that if f : Rn −→ R, then the graph is a subset of Rn+1. If n = 2, then the graph is a subset of R3, so someone with a good imagination of a three-dimensional drawing surface could visualize it. If n > 2 there is no hope. You can get some intuition by looking at “slices” of the graph obtained by holding the function’s value constant.

    Definition 35. A level set is the set of points at which the function achieves the same value. Formally it is defined as the set

    {x ∈ Rn | f(x) = c} for any c ∈ R

    While the graph of the function is a subset of Rn+1, the level set (actually, level sets) are subsets of Rn.

    Example 11. f(x) = x1^2 + x2^2.

    So the function is R2 −→ R and therefore the graph is in R3.

    The level sets of this function are circles in the plane. (The graph is a paraboloid.)

    FIGURE GOES HERE
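    Plotting a few of these slices is exactly what a contour plot does. A small Python sketch (not part of the notes), using matplotlib:

        import numpy as np
        import matplotlib.pyplot as plt

        x1, x2 = np.meshgrid(np.linspace(-2, 2, 200), np.linspace(-2, 2, 200))
        f = x1**2 + x2**2

        plt.contour(x1, x2, f, levels=[0.5, 1.0, 2.0, 4.0])   # each curve is {x : f(x) = c}
        plt.gca().set_aspect("equal")
        plt.xlabel("x1"); plt.ylabel("x2")
        plt.title("Level sets of f(x) = x1^2 + x2^2")
        plt.show()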

    Example 12. A good example to help understand this is a utility function (of which you will see lots!). A utility function is a function which “measures” a person’s happiness. It is usually denoted U. In 200A you will see conditions necessary for the existence of the utility function but for now we will just assume that it exists, and is strictly increasing in each argument. Suppose we have a guy Joel whose utility function is just a function of the number of apples and the number of bananas he eats. So his happiness is determined solely by the number of apples and bananas he eats, and nothing else. Thus we lose no information when we think about utility as a function of two variables:

    U : R2 −→ R

    U(xA, xB) where xA is the number of apples he eats, and xB is the number of bananas he eats.


    A level set is all the different possible combinations of apples and bananas that give him the same utility level, i.e. that leave him equally happy! For example Joel might really like apples and only slightly like bananas. So 3 apples and 2 bananas might make him as happy as 1 apple and 10 bananas. In other words he needs lots of bananas to compensate for the loss of the 2 apples.

    If the only functions we dealt with were utility functions, we would call level sets “indifference curves.” In economics, typically curves that are “iso-SOMETHING” are level sets of some function.

    In fact maybe another example is in order.

    Example 13. Suppose we have a guy Joel who only derives joy from teaching mathematics. Nothing else in the world gives him any pleasure, and as such his utility function UJ is only a function of the number of hours he spends teaching mathematics, HM. Now we also make the assumption that Joel’s utility is strictly increasing in hours spent teaching mathematics: the more he teaches the happier he is3. So the question is: does Joel’s utility function UJ have any level sets? Since utility is a function of one variable, Joel’s level “sets” are zero dimensional objects – points. For example, if his utility function is defined as UJ(HM) = HM^(1/2) and Joel teaches for 4 hours (i.e. HM = 4), then UJ(4) = 4^(1/2) = 2. So is there any other combination of hours teaching that could leave Joel equally happy? Obviously not: since his utility is only a function of one argument, and it is strictly increasing, no two distinct values can leave him equally happy.4

    Example 14. Recall Example 11

    So plotting level sets of a function f : R2 −→ R is really just a clever way of representing the information from a 3d graph, but only drawing a 2d graph (much easier to draw!).

    A good example of this is a map of a country. When you spread a map out in front of you we know that left-right (←→) represents East-West, and that up-down (↕) represents North-South. So how do we represent mountains and valleys,

    3You may doubt that Joel is really like this but I assure you he is!

    4So you can see that level sets reference the arguments of a function. And functions with two or more arguments are much more likely to have level sets than functions of one argument, since you can have many different combinations of the arguments.


    i.e. points on the earth of different altitude? The answer is with level sets! There are many contour lines drawn all over a map, and at each point along these lines the earth is at the same altitude. Of course, to be completely rigorous it should also be obvious which direction altitude is increasing in. So in this case our function would take in coordinates (east–west and north–south values) and spit out the value (altitude) at those coordinates.

    Definition 36. We define the upper contour set of f as the set

    {x ∈ Rn | f(x) ≥ c} c ∈ R

    And we define the lower contour set of f as the set

    {x ∈ Rn | f(x) ≤ c} c ∈ R

    So, referring back to our map example: the upper contour set of a point x would be the set of all the coordinates such that, if we plugged those coordinates into our altitude function, it would give out a value greater than or equal to the value at the point x, i.e. all points that are at least as high as x.

    4.4 Limits and Continuity

    Definition 37 (Limit of a Function). For f : Rn −→ R,

    lim_{x −→ a} f(x) = c ∈ R

    ⇐⇒ ∀ ε > 0, ∃ δ > 0 such that 0 < d(x, a) < δ =⇒ |f(x) − c| < ε.

    This definition agrees with the earlier definition, although there are two twists. First, a general “distance function” replaces absolute values in the condition that says that x is close to a. For our purposes, the distance function will always be the standard Euclidean distance. Second, we do not define one-sided limits here.


    Definition 38 (Continuity of a Function). f : Rn −→ R is called continuous at a point a ∈ Rn if

    lim_{x −→ a} f(x) = f(a).

    Again, this definition is a simple generalization of the one-variable definition.

    Exercise 1. Prove that f(x) = x1x2 is continuous at the point (1, 1).

    We must show that ∀ ε > 0, ∃ δ > 0 such that

    ‖(x1, x2) − (1, 1)‖ < δ =⇒ |f(x1, x2) − 1| < ε.

    Note that

    ‖(x1, x2) − (1, 1)‖ = √((x1 − 1)^2 + (x2 − 1)^2).

    Also,

    |f(x1, x2) − 1| = |x1x2 − 1| = |x1x2 − x1 + x1 − 1| = |x1(x2 − 1) + x1 − 1| ≤ |x1(x2 − 1)| + |x1 − 1|,

    where the last inequality uses the triangle inequality.

    For any given ε > 0 let δ = min{ε/4, 1}. Then

    ‖(x1, x2) − (1, 1)‖ < δ =⇒ |x1 − 1| < ε/4 and |x2 − 1| < ε/4.

    Also we have that x1 < 2 (because |x1 − 1| < 1). Thus

    |x1(x2 − 1)| + |x1 − 1| < 2 · (ε/4) + ε/4 = (3/4)ε,

    implying that

    |f(x1, x2) − 1| < (3/4)ε < ε.
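    A quick numerical sanity check of this argument (a Python sketch, not part of the notes): sample points within δ of (1, 1) and confirm the bound.

        import numpy as np

        rng = np.random.default_rng(0)
        for eps in [1.0, 0.1, 0.01]:
            delta = min(eps / 4, 1.0)
            # offsets in a square of half-width delta/2 have norm < delta
            offsets = rng.uniform(-delta / 2, delta / 2, size=(10_000, 2))
            pts = np.array([1.0, 1.0]) + offsets
            assert np.all(np.linalg.norm(offsets, axis=1) < delta)
            assert np.all(np.abs(pts[:, 0] * pts[:, 1] - 1.0) < eps)
        print("the epsilon-delta bound holds at every sampled point")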

    4.5 Sequences

    (We now use superscripts for elements of a sequence.) Let {x^k}_{k=1}^∞ be a sequence of vectors in Rn:

    x^k ∈ Rn, ∀ k,  x^k = (x^k_1, x^k_2, . . . , x^k_n).

    Definition 39. A sequence {x^k}_{k=1}^∞ converges to a point x ∈ Rn, that is x^k −→ x, if and only if ∀ ε > 0, ∃ K ∈ P such that k ≥ K =⇒

    d(x^k, x) < ε.

    Definition 40. For a, b ∈ Rn, we say

    a ≥ b ⇐⇒ ai ≥ bi ∀ i = 1, 2, . . . , n

    and

    a > b ⇐⇒ ai ≥ bi ∀ i = 1, 2, . . . , n, and aj > bj for some j.


    Definition 41. Let X ⊂ Rn. X is bounded from above if ∃ M ∈ Rn such that

    M ≥ x, ∀ x ∈ X.

    X is bounded from below if ∃ m ∈ Rn such that

    m ≤ x, ∀ x ∈ X.

    (These definitions use Definition 40, which says what we mean by one vector being “greater” or “less” than another.)

    Definition 42. X is said to be closed if, for every sequence {x^k} from X, if x^k −→ x ∈ Rn (so x is a limit point of X), then x ∈ X.

    Definition 43. X is said to be a compact space if every sequence from X has a subsequence that converges to a point in X.

    Definition 44. For a metric d, a ball of radius ε around a point x ∈ Rn is defined as

    Bε(x) ≡ {y ∈ Rn | d(x, y) < ε}

    FIGURE GOES HERE

    Definition 45. X is called open if ∀ x ∈ X, ∃ ε > 0 such that

    Bε(x) ⊂ X

    Note the following two common misconceptions:

    X not open does not imply X closed.

    X not closed does not imply X open.

    For example, Rn is both open and closed; [0, 1) is neither open nor closed.


    4.6 Partial Derivatives and Directional Derivatives

    Definition 46. Take f : Rn −→ R. The ith partial derivative of f at x is defined as

    ∂f/∂xi (x) ≡ lim_{h −→ 0} [f(x + hei) − f(x)] / h.

    Treat every other xj as a constant and take the derivative as though f were a function of just xi.

    As in the one variable case, partial derivatives need not exist. If the ith partial derivative exists, then the function (when viewed as a function of xi alone) must be continuous.

    The definition illustrates the way that we will think about functions of many variables. We think of them as many functions of one variable. If it is too hard to figure out how f is behaving when you move several variables at once, hold all but one of the variables fixed and analyze the one-variable function that results. A partial derivative is just an ordinary derivative of a function of one variable. The one variable is one of the components of the function’s argument. Roughly speaking, if a function is well behaved, knowing how it changes in every direction allows you to know how it behaves in general.

    These considerations suggest a more general concept.

    Definition 47. Take f : Rn −→ R and let v be a unit vector in Rn. The directional derivative of f in the direction v at x is defined as

    Dvf(x) ≡ lim_{h −→ 0} [f(x + hv) − f(x)] / h.

    It follows from the definition that

    ∂f/∂xi (x) ≡ Dei f(x).

    That is, the ith partial derivative is just a directional derivative in the direction ei. Notice that to compute directional derivatives you are just computing the derivative of a function of one variable. The one-variable function is the function of h of the form f(x + hv) with x and v fixed.
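    Both kinds of derivatives can be approximated by one-variable difference quotients. A Python sketch (not from the notes; the function is a made-up example):

        import numpy as np

        def f(x):
            return x[0]**2 * x[1]            # f(x1, x2) = x1^2 * x2

        def directional_derivative(f, x, v, h=1e-6):
            v = v / np.linalg.norm(v)        # the direction is a unit vector
            return (f(x + h * v) - f(x)) / h

        x = np.array([1.0, 2.0])
        e1, e2 = np.eye(2)

        print(directional_derivative(f, x, e1))                      # ~4 = df/dx1 = 2 x1 x2
        print(directional_derivative(f, x, e2))                      # ~1 = df/dx2 = x1^2
        print(directional_derivative(f, x, np.array([1.0, 1.0])))    # ~(4 + 1)/sqrt(2)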


    4.7 Differentiability

    Definition 48. We say a function f : Rn −→ Rm is differentiable at a ∈ Rn if and only if there is a linear function L : Rn −→ Rm such that

    lim_{x −→ a} ‖f(x) − f(a) − L(x − a)‖ / ‖x − a‖ = 0,

    where f(x), f(a), and L(x − a) are m × 1 vectors and x − a is n × 1.

    If L exists we call it the derivative of f at a and denote it by Df(a). In the case f : Rn −→ R this is equivalent to

    lim_{x −→ a} [ (f(x) − f(a))/‖x − a‖ − L(x − a)/‖x − a‖ ] = 0

    This implies that if f is differentiable, then for any direction defined by y ∈ Rn and a magnitude given by α ∈ R,

    lim_{α −→ 0} [f(a + αy) − f(a)] / (|α| ‖y‖) = lim_{α −→ 0} Df(a)(αy) / (|α| ‖y‖) = Df(a)y / ‖y‖.

    That is, if a function is differentiable at a point, then all directional derivatives exist at the point.

    Stated more officially:

    Theorem 10. If f : Rn −→ R is differentiable then all of its directional derivatives exist. The directional derivative of f at a ∈ Rn in direction v ∈ Rn is given by

    Df(a)v

    We assume that the direction v in the definition of directional derivative is of unit length.

    The theorem says two things. First, it says that if a function is differentiable, then the matrix representation of the derivative is just the matrix of partial derivatives. (This follows because the way that you get the matrix representation is to evaluate the linear function on the standard basis elements. Here, this would give you partial derivatives.) Second, it gives a formula for computing directional


    derivatives in any direction. A directional derivative is just a weighted sum of partial derivatives.

    The definition of differentiability should look like the one-variable definition. In both cases, the derivative is the best linear approximation to the function. In the multivariable setting a few things change. First, Euclidean distance replaces absolute values. Notice that the objects inside the ‖·‖ are vectors and not scalars. Taking their norms replaces a difference with a non-negative scalar. If we did not do this (at least for the denominator), then the ratios would not make sense. Second, because the linear function has the same domain and range as f, it is more complicated than in the one variable case. In the one variable case, the derivative of f evaluated at a point is a single number. This allows us to think about f′ as a real-valued function. When f : Rn −→ Rm, the derivative is a linear function from Rn into Rm. This means that it can be represented by matrix multiplication of a matrix with m rows and n columns. That is, the derivative is described by mn numbers. What are these numbers? The computation after the definition demonstrates that the entries in the matrix that represents the derivative are the partial derivatives of (the component functions of) f. This is why we typically think of the derivatives of multivariable functions as “matrices of partial derivatives.”

    Sometimes people represent the derivative of a function from Rn to R as a vector rather than a linear function.

    Definition 49. Given f : Rn −→ R, the gradient of f at x ∈ Rn is

    Df(x) = ∇f(x) = ( ∂f/∂x1 (x), ∂f/∂x2 (x), . . . , ∂f/∂xn (x) )

    The definition just introduces notation. In terms of the notation, Theorem 10 states that

    Dvf(a) = ∇f(a) · v. (4.1)
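    Equation (4.1) is easy to check numerically by comparing a difference quotient with the inner product of the gradient and the unit direction. A Python sketch (not part of the notes; the function is a made-up example):

        import numpy as np

        def f(x):
            return x[0]**2 * x[1] + np.sin(x[2])

        def grad_f(x):                       # gradient computed by hand
            return np.array([2*x[0]*x[1], x[0]**2, np.cos(x[2])])

        a = np.array([1.0, 2.0, 0.5])
        v = np.array([1.0, -1.0, 2.0])
        v = v / np.linalg.norm(v)            # unit direction

        h = 1e-6
        numerical = (f(a + h*v) - f(a)) / h  # difference-quotient definition of Dvf(a)
        formula = grad_f(a) @ v              # equation (4.1)
        print(numerical, formula)            # the two agree to several decimal places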

    We have a definition of differentiability. We also have definitions of partial derivatives. Theorem 10 says that if you know that a function is differentiable, then you know that it has partial derivatives and that these partial derivatives tell


    you everything you need to know about the derivative. What about the converse? That is, if a function has partial derivatives, then do we know it is differentiable? The answer is “almost”.

    Theorem 11. Take f : Rn −→ R. If f’s partial derivatives exist and are continuous (f ∈ C1), then f is differentiable.

    Theorem 11 states that if a function has partial derivatives in all directions, then the function is differentiable provided that these partial derivatives are continuous. There are standard examples of functions that have partial derivatives in all directions but still are not differentiable. In these examples the partial derivatives are discontinuous. These examples also provide situations in which equation (4.1) fails to be true (because the gradient can exist although the function is not differentiable).

    Example 15. f(x) = x1x2.

    ∂f/∂x1 (0, 0) = x2 |_(x1=0, x2=0) = 0

    ∂f/∂x2 (0, 0) = x1 |_(x1=0, x2=0) = 0

    Consider

    lim_{h −→ 0} [f((0, 0) + h(1, 1)) − f(0, 0)] / (h√2)

    (from (0, 0) the distance traveled to the point (1, 1) is √2, so that is why we normalize and divide in the denominator by √2):

    = lim_{h −→ 0} [f((h, h)) − f(0, 0)] / (h√2) = lim_{h −→ 0} h^2 / (h√2) = 0

    So for this function the derivative in every direction is zero.


    Example 16. f(x) = |x1|^(1/2) |x2|^(1/2).

    So

    lim_{h −→ 0} [f((0, 0) + h(1, 1)) − f(0, 0)] / (h√2) = lim_{h −→ 0} [f((h, h)) − f(0, 0)] / (h√2)

    = lim_{h −→ 0} |h|^(1/2) |h|^(1/2) / (h√2) = lim_{h −→ 0} |h| / (h√2).

    Why include this example? The computation above tells you that the directional derivative of f at (0, 0) in the direction (1/√2, 1/√2) does not exist. (The one-sided limits exist and are equal to 1/√2 and −1/√2.) On the other hand, you can easily check that both partial derivatives of the function at (0, 0) exist and are equal to zero. Hence the formula for computing the directional derivative from the partials fails. Why? Because f is not differentiable at (0, 0).
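    The failure is easy to see numerically. A Python sketch (not part of the notes): the partial difference quotients at (0, 0) are identically zero, but the diagonal quotient never approaches ∇f(0, 0) · v = 0.

        import numpy as np

        def f(x):
            return np.sqrt(np.abs(x[0])) * np.sqrt(np.abs(x[1]))

        origin = np.array([0.0, 0.0])
        e1, e2 = np.eye(2)
        v = np.array([1.0, 1.0]) / np.sqrt(2)

        for h in [1e-2, 1e-4, 1e-6]:
            partial1 = (f(origin + h*e1) - f(origin)) / h   # 0 for every h
            partial2 = (f(origin + h*e2) - f(origin)) / h   # 0 for every h
            diagonal = (f(origin + h*v) - f(origin)) / h    # stays at ~0.707 = 1/sqrt(2)
            print(h, partial1, partial2, diagonal)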

    Question: What is the direction from x that most increases the value of f? Answer: It’s the direction given by the gradient.

    Theorem 12. Suppose f : Rn −→ R is differentiable at x. Then the direction that maximizes the directional derivative at x is given by

    v = ∇f(x)

    One proves this theorem by noting that, by the Cauchy–Schwarz inequality,

    |Dvf(x)| = |Df(x)v| ≤ ||∇f(x)|| × ||v||

    with equality if and only if v = λDf(x).
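    A brute-force numerical check (a Python sketch, not from the notes): compare ∇f(x) · v over many random unit directions v with the value attained in the gradient direction.

        import numpy as np

        def grad_f(x):                        # gradient of the made-up f(x) = x1^2 x2 + 3 x2
            return np.array([2*x[0]*x[1], x[0]**2 + 3.0])

        x = np.array([1.0, 2.0])
        g = grad_f(x)

        rng = np.random.default_rng(1)
        dirs = rng.normal(size=(100_000, 2))
        dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)   # random unit vectors

        best_random = (dirs @ g).max()
        at_gradient = g @ (g / np.linalg.norm(g))             # equals ||grad f(x)||
        print(best_random, at_gradient)    # no random direction beats the gradient direction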

    4.8 Properties of the Derivative

    Theorem 13. Let g, f : Rn −→ Rm both be differentiable at a ∈ Rn. Then


    1. D[cf](a) = cDf(a) for all c ∈ R.

    2. D[f + g](a) = Df(a) + Dg(a).

    For the case m = 1:

    3. D[g · f](a) = g(a) Df(a) + f(a) Dg(a), where D[g · f](a), Df(a), and Dg(a) are 1 × n and g(a), f(a) are 1 × 1 (scalars).

    4. D[f/g](a) = [ g(a) Df(a) − f(a) Dg(a) ] / [g(a)]^2.

    Theorem 14 (Chain Rule Theorem). Suppose U ⊂ Rn and V ⊂ Rm are open sets and suppose g : U −→ Rm is differentiable at x ∈ U and f : V −→ Rl is differentiable at y ≡ g(x) ∈ V. Then f ◦ g is differentiable at x and

    D[f ◦ g](x) = Df(y) Dg(x),

    where D[f ◦ g](x) is l × n, Df(y) is l × m, and Dg(x) is m × n.

    Proof. Though this is not a complete proof, it is much more intuitive to consider the case

    f : Rm −→ R, and g : R −→ Rm.

    So we can see that l = n = 1 and f ◦ g : R −→ R. Then

    D[f ◦ g](t) = Df(x) Dg(t) for x = g(t).

    That is,

    D[f ◦ g](t) = ∂f/∂x1 · dx1/dt + ∂f/∂x2 · dx2/dt + · · · + ∂f/∂xm · dxm/dt.

    Example 17. Let the variable t denote the price of oil. This one variable induces an array of population responses (which together form a vector-valued function of t) such as:


    1. What type of car to buy?

    2. How fast to expand your business?

    3. How far to drive on holiday?

    4. etc.

    and then these responses in turn have their own effect, like determining GNP, the variable y (which is obtained by applying the function f to these population responses).

    t −−g−→ x = g(t) −−f−→ y = f(g(t)) ∈ R

    (price of oil) −→ (actions taken by individuals) −→ (GNP)

    D[f ◦ g](t) = ∂y/∂t = Df(g(t)) Dg(t)

    = ( ∂f/∂x1 (g(t)), . . . , ∂f/∂xm (g(t)) ) · ( dg1/dt, . . . , dgm/dt )

    = Σ_{i=1}^{m} ∂f/∂xi (g(t)) · dgi/dt

    Example 18. g(x) = x − 1 and

    f(y) = ( 2y  )
           ( y^2 )

    So note that

    g : R −→ R, and f : R −→ R2.

    Then

    [f ◦ g](x) = ( 2(x − 1)   )
                 ( (x − 1)^2 )

    D[f ◦ g](x) = ( 2        )
                  ( 2(x − 1) )


    Now let’s see if we get the same answer doing it the chain rule way:

    Dg(x) = 1

    Df(y) = ( 2  )
            ( 2y )

    Df(g(x)) Dg(x) = ( 2        )
                     ( 2(x − 1) )

    Example 19.

    f(y) = f(y1, y2) = ( y1^2 + y2  )
                       ( y1 − y1 y2 )

    g(x) = g(x1, x2) = ( x1^2 − x2 ) = y
                       ( x1 x2     )

    So we note here that both g and f take in two arguments and spit out a (2 × 1) vector, so we must have

    g : R2 −→ R2, and f : R2 −→ R2.

    Again, it is very, very important to keep track of the dimensions!

    D[f ◦ g](x) = Df(g(x)) Dg(x)

    = ( ∂f1/∂y1  ∂f1/∂y2 ) ( ∂g1/∂x1  ∂g1/∂x2 )
      ( ∂f2/∂y1  ∂f2/∂y2 ) ( ∂g2/∂x1  ∂g2/∂x2 )

    = ( 2y1      1   ) · ( 2x1  −1 )
      ( 1 − y2  −y1 )    ( x2   x1 )

    and we know that

    y1 = x1^2 − x2


    y2 = x1x2

    So

    = ( 2x1^2 − 2x2   1         ) · ( 2x1  −1 )
      ( 1 − x1 x2    x2 − x1^2 )    ( x2   x1 )

    = ( 4x1(x1^2 − x2) + x2                    x1 − 2(x1^2 − x2)         )
      ( 2x1(1 − x1 x2) + x2(x2 − x1^2)         x1(x2 − x1^2) + x1 x2 − 1 )
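    A finite-difference check of this product (a Python sketch, not part of the notes) confirms the chain-rule Jacobian, including the constant term in the bottom-right entry:

        import numpy as np

        def g(x):
            return np.array([x[0]**2 - x[1], x[0]*x[1]])

        def f(y):
            return np.array([y[0]**2 + y[1], y[0] - y[0]*y[1]])

        def Dg(x):
            return np.array([[2*x[0], -1.0],
                             [x[1],    x[0]]])

        def Df(y):
            return np.array([[2*y[0],     1.0],
                             [1 - y[1], -y[0]]])

        def numerical_jacobian(F, x, h=1e-6):
            return np.column_stack([(F(x + h*e) - F(x)) / h for e in np.eye(len(x))])

        x = np.array([1.5, -0.5])
        print(np.round(Df(g(x)) @ Dg(x), 4))
        print(np.round(numerical_jacobian(lambda z: f(g(z)), x), 4))   # the two agree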

    4.9 Gradients and Level Sets

    EXAMPLES AND FIGURES GO HERE

    Example 20. f : R3 −→ R, so the graph is in R4 (pretty difficult to draw!), but the level set is in R3. Consider the level set

    f(x) = x1^2 + x2^2 + x3^2 = 1.

    Note that this is just a sphere of radius 1. So at any point on the sphere there is not just one tangent line through the point; there are lots of them.

    In general, a surface in Rn+1 can be viewed as the solution to a system of equations. For convenience, represent a point in Rn+1 as a pair (x, y), with x ∈ Rn and y ∈ R. If F : Rn+1 −→ R, then the set {(x, y) : F(x, y) = 0} is typically an n dimensional set. We can talk about what it means to be a tangent to this surface. Certainly the tangent at (x0, y0) should be an n dimensional linear manifold in Rn+1 that contains (x0, y0). It should also satisfy the approximation property: if (x, y) is a point on the surface that is close to (x0, y0), then it should be approximated up to first order by a point on the tangent. One way to think about this property is to think about directions on the surface. Consider a function


    G : R −→ Rn+1 such that G(0) = (x0, y0) and F ◦ G(t) ≡ 0 for t in a neighborhood of 0. G defines a curve on the surface through (x0, y0). A direction on the surface at (x0, y0) is just a direction of a curve through (x0, y0), or DG(0). By the chain rule it follows that

    ∇F (x0, y0) ·DG(0) = 0,

    which implies that ∇F(x0, y0) is orthogonal to all of the directions on the surface. This generates a non-trivial hyperplane provided that DF(x0, y0) ≠ 0. We summarize these observations below. (You can decide whether this is a definition or a theorem.)

    Definition 50. Let F : Rn+1 −→ R be differentiable at the point (x0, y0). Assume that F(x0, y0) = 0 and that DF(x0, y0) ≠ 0. The equation of the hyperplane tangent to the surface F(x, y) = 0 at the point (x0, y0) is

    ∇F (x0, y0) · ((x, y)− (x0, y0)) = 0. (4.2)

    Now suppose that f : Rn −→ R is differentiable at x ∈ Rn. Consider the function F(x, y) = f(x) − y. The surface F(x, y) = 0 is exactly the graph of f. Hence the tangent to the surface is the tangent to the graph of f. This means that the formula for the equation of the tangent hyperplane given above can be used to find the formula for the equation of the tangent to the graph of a function.

    Theorem 15. If f : Rn −→ R is differentiable at x0 ∈ Rn, then the vector ∇f(x0) is normal (perpendicular) to the tangent of the level set of f at value f(x0) at the point x0, and the equation of the hyperplane tangent to the graph of f at the point (x0, f(x0)) is

    ∇f(x0) · (x − x0) = y − y0,

    where y0 = f(x0).

    Proof. Substitute ∇F(x0, y0) = (∇f(x0), −1) into equation (4.2) and re-arrange terms.

    Example 21. Find the tangent plane to {x | x1x2 − x3^2 = 6} ⊂ R3 at x̂ = (2, 5, 2). Notice if you let f(x) = x1x2 − x3^2, then this is a level set of f for value 6.

    ∇f(x) = ( ∂f/∂x1, ∂f/∂x2, ∂f/∂x3 ) = (x2, x1, −2x3)


    ∇f(x̂) = ∇f(x) |x=(2,5,2)= (5, 2,−4)

    Tangent Plane:

    {x̂ + y | y · ∇f(x̂) = 0} = {(2, 5, 2) + (y1, y2, y3) | 5y1 + 2y2 − 4y3 = 0}
    = {x | 5x1 − 10 + 2x2 − 10 − 4x3 + 8 = 0}
    = {x | 5x1 + 2x2 − 4x3 = 12}

    Example 22. Suppose f(x, y, z) = 3x^2 + 2xy − z^2. Then ∇f(x, y, z) = (6x + 2y, 2x, −2z). Notice that f(2, 1, 3) = 7. The level set of f when f(x, y, z) = 7 is {(x, y, z) : f(x, y, z) = 7}. This set is a (two-dimensional) surface in R3: it can be written F(x, y, z) = 0 (for F(x, y, z) = f(x, y, z) − 7). Consequently the tangent to the level set of f is a (two-dimensional) hyperplane in R3. At the point (2, 1, 3), the hyperplane has normal equal to ∇f(2, 1, 3) = (14, 4, −6). Hence the equation of the hyperplane tangent to the level set at (2, 1, 3) is equal to:

    (14, 4, −6) · (x − 2, y − 1, z − 3) = 0

    or

    14x + 4y − 6z = 14.

    On the other hand, the graph of f is a three-dimensional subset of R4: {(x, y, z, w) : w = f(x, y, z)}. A point on this surface is (2, 1, 3, 7) = (x, y, z, w). The tangent hyperplane at this point can be written:

    w − 7 = ∇f(2, 1, 3) · (x − 2, y − 1, z − 3) = 14x + 4y − 6z − 14

    or

    14x + 4y − 6z − w = 7.
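    A quick numerical check of these numbers (a Python sketch, not part of the notes): compute the gradient at (2, 1, 3) by finite differences and confirm that the point lies on the tangent plane to the level set.

        import numpy as np

        def f(p):
            x, y, z = p
            return 3*x**2 + 2*x*y - z**2

        p0 = np.array([2.0, 1.0, 3.0])
        h = 1e-6
        grad = np.array([(f(p0 + h*e) - f(p0)) / h for e in np.eye(3)])
        print(np.round(grad, 3))     # ~[14.  4. -6.]

        # Tangent plane to the level set {f = 7} at p0: grad . (p - p0) = 0,
        # i.e. 14x + 4y - 6z = 14; p0 itself satisfies it.
        print(grad @ p0)             # ~14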

    4.10 Homogeneous Functions

    Definition 51. The function F : Rn −→ R is homogeneous of degree k if F(λx) = λ^k F(x) for all λ.

    Homogeneity of degree one is weaker than linearity: all linear functions are homogeneous of degree one, but not conversely. For example, f(x, y) = √(xy) is homogeneous of degree one but not linear.


    Theorem 16 (Euler’s Theorem). If F : Rn −→ R is differentiable at x and homogeneous of degree k, then ∇F(x) · x = kF(x).

    Proof. Fix x. Consider the function H(λ) = F(λx). This is a composite function, H(λ) = F ◦ G(λ), where G : R −→ Rn is such that G(λ) = λx. By the chain rule, DH(λ) = DF(G(λ))DG(λ). If we evaluate this when λ = 1 we have

    DH(1) = ∇F(x) · x. (4.3)

    On the other hand, we know from homogeneity that H(λ) = λ^k F(x). Differentiating the right hand side of this equation yields DH(λ) = kλ^(k−1) F(x), and evaluating when λ = 1 yields

    DH(1) = kF (x). (4.4)

    Combining equations (4.3) and (4.4) yields the theorem.
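    A numerical illustration of Euler’s Theorem (a Python sketch, not from the notes), using the degree-one homogeneous function f(x, y) = √(xy) from above:

        import numpy as np

        def f(p):
            return np.sqrt(p[0] * p[1])

        p = np.array([4.0, 9.0])
        h = 1e-6
        grad = np.array([(f(p + h*e) - f(p)) / h for e in np.eye(2)])

        print(grad @ p)    # ~6.0
        print(f(p))        # 6.0 -> grad f(p) . p = k f(p) with k = 1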

    In economics, functions that are homogeneous of degree zero and one arise naturally in consumer theory. A cost function depends on the wages you pay to workers. If all of the wages double, then the cost doubles. This is homogeneity of degree one. On the other hand, a consumer’s demand behavior is typically homogeneous of degree zero. Demand is a function φ(p, w) that gives the consumer’s utility maximizing feasible demand given prices p and wealth w. The demand is the best affordable consumption for the consumer. The consumptions x that are affordable satisfy p · x ≤ w (and possibly another constraint like non-negativity). If p and w are multiplied by the same factor, λ, then the budget constraint remains unchanged. Hence the demand function is homogeneous of degree zero.

    Euler’s Theorem provides a nice decomposition of a function F. Suppose that F describes the profit produced by a team of n agents, when agent i contributes effort xi. How should the team divide the profit it generates? If F is linear, the answer is easy: if F(x) = p · x, then just give agent i pixi. Here you give each agent a constant “per unit” payment equal to the marginal contribution of her effort. When you do so, you distribute the entire surplus (and nothing else). When F is non-linear, it is harder to figure out the contribut