Lecture 1: Introduction

Andre Martins

Deep Structured Learning Course, Fall 2018

Andre Martins (IST) Lecture 1 IST, Fall 2018 1 / 82

Course Website

https://andre-martins.github.io/pages/deep-structured-learning-ist-fall-2018.html

There I’ll post:

• Syllabus

• Lecture slides

• Literature pointers

• Homework assignments

• ...

Andre Martins (IST) Lecture 1 IST, Fall 2018 2 / 82

Outline

1 Introduction

2 Class Administrativia

3 Recap

Linear Algebra

Probability Theory

Optimization

Andre Martins (IST) Lecture 1 IST, Fall 2018 3 / 82

What is “Deep Learning”?

• Neural networks?

• Neural networks with many hidden layers?

• Anything beyond shallow (linear) models for statistical learning?

• Anything that learns representations?

• A form of learning that is really intense and profound?

Andre Martins (IST) Lecture 1 IST, Fall 2018 4 / 82

Where is the “Structure”?

• In the input objects (text, graphs, images, ...)

• In the outputs we want to predict (parsing, graph labeling, image segmentation, ...)

• In our model (convolutional networks, attention mechanisms)

• Related: latent structure (typically a way of encoding prior knowledge into the model)

Andre Martins (IST) Lecture 1 IST, Fall 2018 5 / 82

This Course: “Deep Learning + Structure”

Andre Martins (IST) Lecture 1 IST, Fall 2018 6 / 82

Why Did Deep Learning Become Mainstream?

Lots of recent breakthroughs:

• Object recognition

• Speech and language processing

• Chatbots and dialog systems

• Self-driving cars

• Machine translation

• Solving games (Atari, Go)

No signs of slowing down...

Andre Martins (IST) Lecture 1 IST, Fall 2018 7 / 82

[Slides 8–14 contain images only; no text survives in the transcript.]
Why Now?

Why does deep learning work now, but not 20 years ago?

Many of the core ideas were there, after all.

But now we have:

• more data

• more computing power

• better software engineering

• a few algorithmic innovations (many layers, ReLUs, better initialization and learning rates, dropout, LSTMs, convolutional nets)

Andre Martins (IST) Lecture 1 IST, Fall 2018 15 / 82

“But It’s Non-Convex”

Why does gradient-based optimization work at all in neural nets despite the non-convexity?

One possible, partial answer:

• there are generally many hidden units

• there are many ways a neural net can approximately implement the desired input-output relationship

• we only need to find one

Andre Martins (IST) Lecture 1 IST, Fall 2018 16 / 82

Recommended Books

Main book:

• Deep Learning. Ian Goodfellow, Yoshua Bengio, and Aaron Courville. MIT Press, 2016. Chapters available at http://deeplearningbook.org

Andre Martins (IST) Lecture 1 IST, Fall 2018 17 / 82

Recommended Books

Secondary books:

• Machine Learning: a Probabilistic Perspective. Kevin P. Murphy. MIT Press, 2013.

• Linguistic Structured Prediction. Noah A. Smith. Morgan & Claypool Synthesis Lectures on Human Language Technologies. 2011.

Andre Martins (IST) Lecture 1 IST, Fall 2018 18 / 82

Syllabus

Sep 19  Introduction and Course Description
Sep 26  Linear Classifiers
Oct 3   Feedforward Neural Networks
Oct 10  Training Neural Networks
Oct 17  Linear Sequence Models
Oct 24  Representation Learning and Convolutional Neural Networks
Oct 31  Structured Prediction and Graphical Models
Nov 7   Recurrent Neural Networks
Nov 14  Sequence-to-Sequence Learning
Nov 21  Attention Mechanisms and Neural Memories
Nov 28  Deep Reinforcement Learning
Dec 5   Deep Generative Models (VAEs, GANs)
Dec 12  Final Projects I
Dec 19  Final Projects II

Andre Martins (IST) Lecture 1 IST, Fall 2018 19 / 82

Outline

1 Introduction

2 Class Administrativia

3 Recap

Linear Algebra

Probability Theory

Optimization

Andre Martins (IST) Lecture 1 IST, Fall 2018 20 / 82

What This Class Is About

• Introduction to deep learning

• Introduction to structured prediction

• Goal: after finishing this class, you should be able to:
  • Understand how deep learning works without magic
  • Understand the intuition behind deep structured learning models
  • Apply the learned techniques on a practical problem (NLP, vision, ...)

• Target audience:
  • MSc/PhD students with basic background in ML and good programming skills

Andre Martins (IST) Lecture 1 IST, Fall 2018 21 / 82

What This Class Is Not About

It’s not about:

• Just playing with a deep learning toolkit without learning the fundamental concepts

• Introduction to ML (see Mario Figueiredo’s Statistical Learning course and Jorge Marques’ Estimation and Classification course)

• Optimization (check Joao Xavier’s Non-Linear Optimization course)

• Natural Language Processing

• ...

Andre Martins (IST) Lecture 1 IST, Fall 2018 22 / 82

Prerequisites

• Calculus and basic linear algebra

• Basic probability theory

• Basic knowledge of machine learning

• Programming (Python & PyTorch preferred)

• Helpful: basic optimization

Andre Martins (IST) Lecture 1 IST, Fall 2018 23 / 82

Course Information

• Instructor: Andre Martins ([email protected])

• TAs/Guest Lecturers: Erick Fonseca & Vlad Niculae

• Location: LT2 (North Tower, 4th floor)

• Schedule: Wednesdays 14:30–18:00 (tentative)

• Communication: piazza.com/tecnico.ulisboa.pt/fall2018/pdeecdsl

Andre Martins (IST) Lecture 1 IST, Fall 2018 24 / 82

Grading

• 4 homework assignments: 60%
  • Theoretical questions & implementation
  • Late days: 10% penalization each late day

• Final project (in groups of 2–3): 40%
  • Final class presentations & poster session (tentative)

Andre Martins (IST) Lecture 1 IST, Fall 2018 25 / 82

Final Project

• Possible idea: apply a deep learning technique to a structured problem relevant to your research (NLP, vision, robotics, ...)

• Otherwise, pick a project from a list of suggestions

• Must be finished this semester

• Four evaluation stages: project proposal (10%), midterm report (10%), final report (10%, conference paper format), class presentation (10%)

• List of project suggestions will be made available soon

Andre Martins (IST) Lecture 1 IST, Fall 2018 26 / 82

Collaboration Policy

• Assignments are individual

• Students may discuss the questions, as long as they write their own answers and their own code

• If this happens, acknowledge with whom you collaborate!

• Zero tolerance on plagiarism!!

• Always credit your sources!!!

Andre Martins (IST) Lecture 1 IST, Fall 2018 27 / 82

Caveat

• This is the first year I’m teaching this class

• ... which means you’re the first batch of students taking it :)

• Constructive feedback will be highly appreciated (and encouraged!)

Andre Martins (IST) Lecture 1 IST, Fall 2018 28 / 82

Questions?

Andre Martins (IST) Lecture 1 IST, Fall 2018 29 / 82

Outline

1 Introduction

2 Class Administrativia

3 Recap

Linear Algebra

Probability Theory

Optimization

Andre Martins (IST) Lecture 1 IST, Fall 2018 30 / 82

Quick Background Recap

Slide credits: Prof. Mario Figueiredo (taken from his LxMLS class)

Andre Martins (IST) Lecture 1 IST, Fall 2018 31 / 82

Outline

1 Introduction

2 Class Administrativia

3 Recap

Linear Algebra

Probability Theory

Optimization

Andre Martins (IST) Lecture 1 IST, Fall 2018 32 / 82

Linear Algebra

• Linear algebra provides (among many other things) a compact way of representing, studying, and solving linear systems of equations

• Example: the system

  4x_1 − 5x_2 = −13
  −2x_1 + 3x_2 = 9

  can be written compactly as Ax = b, where

  A = [4 −5; −2 3],   b = [−13; 9],

  and can be solved as

  x = A^{−1}b = [1.5 2.5; 1 2] [−13; 9] = [3; 5].

Andre Martins (IST) Lecture 1 IST, Fall 2018 33 / 82
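
For a concrete check, this example can be reproduced with a short NumPy sketch (illustration only, not part of the original slides); np.linalg.solve is generally preferable to forming the inverse explicitly:

```python
import numpy as np

# System from the slide: 4*x1 - 5*x2 = -13, -2*x1 + 3*x2 = 9
A = np.array([[4.0, -5.0],
              [-2.0, 3.0]])
b = np.array([-13.0, 9.0])

x = np.linalg.solve(A, b)      # numerically preferable to computing the inverse
print(x)                       # [3. 5.]
print(np.linalg.inv(A) @ b)    # same result, via x = A^{-1} b
```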

Notation: Matrices and Vectors

• A ∈ R^{m×n} is a matrix with m rows and n columns,

  A = [A_{1,1} ··· A_{1,n}; ⋮ ⋱ ⋮; A_{m,1} ··· A_{m,n}].

• x ∈ R^n is a vector with n components,

  x = [x_1; ⋮; x_n].

• A (column) vector is a matrix with n rows and 1 column.

• A matrix with 1 row and n columns is called a row vector.

Andre Martins (IST) Lecture 1 IST, Fall 2018 34 / 82
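
A small NumPy illustration of these shape conventions (an added sketch, assuming NumPy is available):

```python
import numpy as np

A = np.arange(6.0).reshape(2, 3)     # A in R^{2x3}: 2 rows, 3 columns
x = np.array([[1.0], [2.0], [3.0]])  # column vector: 3 rows, 1 column

print(A.shape, x.shape)              # (2, 3) (3, 1)
print((A @ x).shape)                 # (2, 1): matrix times column vector
print(x.T.shape)                     # (1, 3): the transpose is a row vector
```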

Matrix Transpose and Products

• Given a matrix A ∈ R^{m×n}, its transpose A^T is such that (A^T)_{i,j} = A_{j,i}.

• A matrix A is symmetric if A^T = A.

• Given matrices A ∈ R^{m×n} and B ∈ R^{n×p}, their product is

  C = AB ∈ R^{m×p},   where C_{i,j} = ∑_{k=1}^{n} A_{i,k} B_{k,j}.

• Inner product between vectors x, y ∈ R^n:

  ⟨x, y⟩ = x^T y = y^T x = ∑_{i=1}^{n} x_i y_i ∈ R.

• Outer product between vectors x ∈ R^n and y ∈ R^m: x y^T ∈ R^{n×m}, where (x y^T)_{i,j} = x_i y_j.

Andre Martins (IST) Lecture 1 IST, Fall 2018 35 / 82
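
A brief NumPy sketch (illustrative, not from the slides) checking the transpose, matrix-product, inner-product, and outer-product definitions above:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 2))
x = rng.standard_normal(5)
y = rng.standard_normal(5)
u = rng.standard_normal(3)

print((A @ B).shape)                     # (3, 2): C_{i,j} = sum_k A_{i,k} B_{k,j}
print(A.T[1, 2] == A[2, 1])              # (A^T)_{i,j} = A_{j,i}
print(np.isclose(x @ y, np.sum(x * y)))  # inner product <x, y> = sum_i x_i y_i
print(np.outer(x, u).shape)              # (5, 3): outer product x u^T
```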

Properties of Matrix Products and Transposes

• Given matrices A ∈ R^{m×n} and B ∈ R^{n×p}, their product is

  C = AB ∈ R^{m×p},   where C_{i,j} = ∑_{k=1}^{n} A_{i,k} B_{k,j}.

• Matrix product is associative: (AB)C = A(BC).

• In general, matrix product is not commutative: AB ≠ BA.

• Transpose of product: (AB)^T = B^T A^T.

• Transpose of sum: (A + B)^T = A^T + B^T.

Andre Martins (IST) Lecture 1 IST, Fall 2018 36 / 82
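
These identities are easy to verify numerically; a minimal NumPy sketch for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))
C = rng.standard_normal((3, 3))

print(np.allclose((A @ B) @ C, A @ (B @ C)))   # True: associativity
print(np.allclose(A @ B, B @ A))               # False in general: not commutative
print(np.allclose((A @ B).T, B.T @ A.T))       # True: transpose of a product
print(np.allclose((A + B).T, A.T + B.T))       # True: transpose of a sum
```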

Norms

• The norm of a vector is (informally) its “magnitude.” Euclidean norm:

  ‖x‖_2 = √⟨x, x⟩ = √(x^T x) = √(∑_{i=1}^{n} x_i^2).

• More generally, the ℓ_p norm of a vector x ∈ R^n, where p ≥ 1, is

  ‖x‖_p = (∑_{i=1}^{n} |x_i|^p)^{1/p}.

• Notable case: the ℓ_1 norm, ‖x‖_1 = ∑_i |x_i|.

• Notable case: the ℓ_∞ norm, ‖x‖_∞ = max{|x_1|, ..., |x_n|}.

• Notable case: the ℓ_0 “norm” (not actually a norm), ‖x‖_0 = |{i : x_i ≠ 0}|.

Andre Martins (IST) Lecture 1 IST, Fall 2018 37 / 82
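
The norms above are available directly in NumPy; a minimal illustrative example:

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])

print(np.linalg.norm(x))          # 5.0, Euclidean (l2) norm
print(np.linalg.norm(x, 1))       # 7.0, l1 norm: sum of |x_i|
print(np.linalg.norm(x, np.inf))  # 4.0, l_inf norm: max |x_i|
print(np.count_nonzero(x))        # 2, the l0 "norm": number of nonzeros
```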

Special Matrices

• The identity matrix I ∈ R^{n×n} is a square matrix such that I_{i,j} = 1 if i = j, and I_{i,j} = 0 if i ≠ j:

  I = [1 0 ··· 0; 0 1 ··· 0; ⋮ ⋮ ⋱ ⋮; 0 0 ··· 1].

• Neutral element of matrix product: A I = I A = A.

• Diagonal matrix: A ∈ R^{n×n} is diagonal if (i ≠ j) ⇒ A_{i,j} = 0.

• Upper triangular matrix: (j < i) ⇒ A_{i,j} = 0.

• Lower triangular matrix: (j > i) ⇒ A_{i,j} = 0.

Andre Martins (IST) Lecture 1 IST, Fall 2018 38 / 82
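
A short NumPy sketch (illustrative) of these special matrices and the neutral-element property:

```python
import numpy as np

I = np.eye(3)                      # identity matrix
A = np.arange(9.0).reshape(3, 3)

print(np.allclose(A @ I, A) and np.allclose(I @ A, A))  # True: I is neutral
print(np.diag([1.0, 2.0, 3.0]))    # a diagonal matrix
print(np.triu(A))                  # upper triangular part of A (zeros where j < i)
print(np.tril(A))                  # lower triangular part of A (zeros where j > i)
```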

Eigenvalues, eigenvectors, determinant, trace

• A vector x ∈ R^n is an eigenvector of matrix A ∈ R^{n×n} if

  Ax = λx,

  where λ ∈ R is the corresponding eigenvalue.

• The eigenvalues of a diagonal matrix are the elements on the diagonal.

• Matrix trace: trace(A) = ∑_i A_{i,i} = ∑_i λ_i

• Matrix determinant: |A| = det(A) = ∏_i λ_i

• Properties: |AB| = |A| |B|,   |A^T| = |A|,   |αA| = α^n |A|

Andre Martins (IST) Lecture 1 IST, Fall 2018 39 / 82
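
A small NumPy check of the eigenvalue, trace, and determinant relations above (illustrative; a symmetric matrix is used so that the eigenvalues are real):

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((4, 4))
A = (M + M.T) / 2                      # symmetric, so eigenvalues are real

lam, V = np.linalg.eigh(A)             # eigendecomposition for symmetric matrices
print(np.allclose(A @ V[:, 0], lam[0] * V[:, 0]))   # A x = lambda x
print(np.isclose(np.trace(A), lam.sum()))           # trace(A) = sum_i lambda_i
print(np.isclose(np.linalg.det(A), lam.prod()))     # det(A) = prod_i lambda_i
```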

Matrix Inverse

• A matrix A ∈ R^{n×n} is invertible if there is a matrix B ∈ R^{n×n} such that AB = BA = I; this matrix B is denoted A^{−1}.

• Matrix A ∈ R^{n×n} is invertible ⇔ det(A) ≠ 0.

• Determinant of inverse: det(A^{−1}) = 1 / det(A).

• Solving the system Ax = b when A is invertible: x = A^{−1}b.

• Properties: (A^{−1})^{−1} = A,   (A^{−1})^T = (A^T)^{−1},   (AB)^{−1} = B^{−1}A^{−1}

• There are many algorithms to compute A^{−1}; in the general case, the computational cost is O(n³).

Andre Martins (IST) Lecture 1 IST, Fall 2018 40 / 82
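
An illustrative NumPy check of these properties (a random Gaussian matrix is invertible with probability 1):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((3, 3))        # almost surely invertible
B = rng.standard_normal((3, 3))
b = rng.standard_normal(3)

A_inv = np.linalg.inv(A)
print(np.allclose(A @ A_inv, np.eye(3)))                 # A A^{-1} = I
print(np.allclose(np.linalg.solve(A, b), A_inv @ b))     # x = A^{-1} b
print(np.isclose(np.linalg.det(A_inv), 1.0 / np.linalg.det(A)))
print(np.allclose(np.linalg.inv(A @ B), np.linalg.inv(B) @ A_inv))  # (AB)^{-1} = B^{-1} A^{-1}
```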

Quadratic Forms and Positive (Semi-)Definite Matrices

• Given a matrix A ∈ R^{n×n} and a vector x ∈ R^n,

  x^T A x = ∑_{i=1}^{n} ∑_{j=1}^{n} A_{i,j} x_i x_j ∈ R

  is called a quadratic form.

• A symmetric matrix A ∈ R^{n×n} is positive semi-definite (PSD) if, for any x ∈ R^n, x^T A x ≥ 0.

• A symmetric matrix A ∈ R^{n×n} is positive definite (PD) if, for any x ∈ R^n, (x ≠ 0) ⇒ x^T A x > 0.

• Matrix A ∈ R^{n×n} is PSD ⇔ all λ_i(A) ≥ 0.

• Matrix A ∈ R^{n×n} is PD ⇔ all λ_i(A) > 0.

Andre Martins (IST) Lecture 1 IST, Fall 2018 41 / 82
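
A minimal NumPy sketch (illustrative) that builds a PSD matrix as a Gram matrix M M^T and checks both characterizations, via the eigenvalues and via the quadratic form:

```python
import numpy as np

rng = np.random.default_rng(4)
M = rng.standard_normal((4, 3))
A = M @ M.T                            # Gram matrices are symmetric PSD

lam = np.linalg.eigvalsh(A)            # real eigenvalues of a symmetric matrix
print(np.all(lam >= -1e-10))           # True: all (numerically) non-negative => PSD

x = rng.standard_normal(4)
print(x @ A @ x >= -1e-10)             # True: quadratic form x^T A x is non-negative
```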

Convex Sets

Andre Martins (IST) Lecture 1 IST, Fall 2018 42 / 82

Convex Functions

Andre Martins (IST) Lecture 1 IST, Fall 2018 43 / 82

Outline

1 Introduction

2 Class Administrativia

3 Recap

Linear Algebra

Probability Theory

Optimization

Andre Martins (IST) Lecture 1 IST, Fall 2018 44 / 82

Probability theory

• “Essentially, all models are wrong, but some are useful”; G. Box, 1987

• The study of probability has roots in games of chance (dice, cards, ...)

• Great names in science: Cardano, Fermat, Pascal, Laplace, Gauss, Huygens, Legendre, Poisson, Kolmogorov, Bernoulli, Cauchy, Gibbs, Boltzmann, de Finetti, ...

• Natural tool to model uncertainty, information, knowledge, belief, ...

• ...thus also learning, decision making, inference, ...

Andre Martins (IST) Lecture 1 IST, Fall 2018 45 / 82

What is probability?

• Classical definition: P(A) = N_A / N

  ...with N mutually exclusive, equally likely outcomes, N_A of which result in the occurrence of event A. (Laplace, 1814)

  Example: P(randomly drawn card is ♣) = 13/52.
  Example: P(getting 1 in throwing a fair die) = 1/6.

• Frequentist definition: P(A) = lim_{N→∞} N_A / N

  ...the relative frequency of occurrence of A in an infinite number of trials.

• Subjective probability: P(A) is a degree of belief. (de Finetti, 1930s)

  ...gives meaning to P(“Tomorrow it will rain”).

Andre Martins (IST) Lecture 1 IST, Fall 2018 46 / 82
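
A tiny plain-Python simulation (illustrative, not from the slides) of the frequentist view: the relative frequency of rolling a 1 with a fair die approaches 1/6 as the number of trials grows:

```python
import random

random.seed(0)
N = 100_000
hits = sum(1 for _ in range(N) if random.randint(1, 6) == 1)
print(hits / N)   # close to 1/6 ~= 0.1667 for large N
```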

Key concepts: Sample space and events

• Sample space X = set of possible outcomes of a random experiment.

  Examples:
  • Tossing two coins: X = {HH, TH, HT, TT}
  • Roulette: X = {1, 2, ..., 36}
  • Drawing a card from a shuffled deck: X = {A♣, 2♣, ..., Q♦, K♦}

• An event A is a subset of X: A ⊆ X.

  Examples:
  • “exactly one H in a 2-coin toss”: A = {TH, HT} ⊂ {HH, TH, HT, TT}
  • “odd number in the roulette”: B = {1, 3, ..., 35} ⊂ {1, 2, ..., 36}
  • “drawing a ♥ card”: C = {A♥, 2♥, ..., K♥} ⊂ {A♣, ..., K♦}

Andre Martins (IST) Lecture 1 IST, Fall 2018 47 / 82
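
For illustration, the two-coin sample space and the event “exactly one H” can be enumerated directly in Python (an added sketch):

```python
from itertools import product

# Sample space for tossing two coins and the event "exactly one H".
X = [a + b for a, b in product("HT", repeat=2)]   # ['HH', 'HT', 'TH', 'TT']
A = [outcome for outcome in X if outcome.count("H") == 1]
print(X, A, len(A) / len(X))   # A = ['HT', 'TH'], classical P(A) = 0.5
```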

Kolmogorov’s Axioms for Probability

• Probability is a function that maps events A into the interval [0, 1].

Kolmogorov’s axioms (1933) for probability P:
• For any A, P(A) ≥ 0
• P(X) = 1
• If A_1, A_2, ... ⊆ X are disjoint events, then P(⋃_i A_i) = ∑_i P(A_i)

• From these axioms, many results can be derived. Examples:
  • P(∅) = 0
  • C ⊂ D ⇒ P(C) ≤ P(D)
  • P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
  • P(A ∪ B) ≤ P(A) + P(B) (union bound)

Andre Martins (IST) Lecture 1 IST, Fall 2018 48 / 82
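
These derived results can be checked on a finite sample space; an illustrative Python sketch using exact fractions for a fair die:

```python
from fractions import Fraction

# Finite sample space of a fair die; events as Python sets.
X = set(range(1, 7))

def P(E):
    return Fraction(len(E), len(X))    # classical (equally likely) probability

A = {2, 4, 6}                          # "even number"
B = {4, 5, 6}                          # "at least 4"

print(P(A | B) == P(A) + P(B) - P(A & B))   # True: inclusion-exclusion
print(P(A | B) <= P(A) + P(B))              # True: union bound
print(P(set()) == 0 and P(X) == 1)          # True: P(empty set) = 0, P(X) = 1
```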

Conditional Probability and Independence

• If P(B) > 0, P(A|B) = P(A ∩ B) / P(B)   (conditional probability of A given B)

• ...satisfies all of Kolmogorov’s axioms:
  • For any A ⊆ X, P(A|B) ≥ 0
  • P(X|B) = 1
  • If A_1, A_2, ... ⊆ X are disjoint, then P(⋃_i A_i | B) = ∑_i P(A_i|B)

• Events A, B are independent (A ⊥⊥ B) ⇔ P(A ∩ B) = P(A)P(B).

Andre Martins (IST) Lecture 1 IST, Fall 2018 49 / 82


Page 103: Lecture 1: Introduction

Conditional Probability and Independence

• If P(B) > 0, P(A|B) = P(A ∩ B) / P(B)

• Events A, B are independent (A ⊥⊥ B) ⇔ P(A ∩ B) = P(A) P(B).

• Relationship with conditional probabilities:

A ⊥⊥ B ⇔ P(A|B) = P(A ∩ B) / P(B) = P(A) P(B) / P(B) = P(A)

• Example: X = “52 cards”, A = {3♥, 3♣, 3♦, 3♠}, and B = {A♥, 2♥, ..., K♥}; then, P(A) = 1/13, P(B) = 1/4

P(A ∩ B) = P({3♥}) = 1/52

P(A) P(B) = (1/13)(1/4) = 1/52

P(A|B) = P(“3” | “♥”) = 1/13 = P(A)

Andre Martins (IST) Lecture 1 IST, Fall 2018 50 / 82

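To make the card example concrete, here is a minimal Python sketch (not part of the original slides) that enumerates a 52-card deck and checks that the events "the card is a 3" and "the card is a heart" are independent:

```python
from fractions import Fraction
from itertools import product

# Build a 52-card deck as (rank, suit) pairs.
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "clubs", "diamonds", "spades"]
deck = list(product(ranks, suits))

def prob(event):
    """Probability of an event (a subset of the deck) under the uniform measure."""
    return Fraction(len(event), len(deck))

A = {c for c in deck if c[0] == "3"}        # "the card is a 3"
B = {c for c in deck if c[1] == "hearts"}   # "the card is a heart"

assert prob(A) == Fraction(1, 13)
assert prob(B) == Fraction(1, 4)
assert prob(A & B) == prob(A) * prob(B) == Fraction(1, 52)   # independence
assert prob(A & B) / prob(B) == prob(A)                      # P(A|B) = P(A)
print("A and B are independent:", prob(A & B) == prob(A) * prob(B))
```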

Page 109: Lecture 1: Introduction

Bayes' Theorem

• Law of total probability: if A1, ..., An are a partition of X,

P(B) = ∑i P(B|Ai) P(Ai) = ∑i P(B ∩ Ai)

• Bayes' theorem: if {A1, ..., An} is a partition of X,

P(Ai |B) = P(B ∩ Ai) / P(B) = P(B|Ai) P(Ai) / ∑j P(B|Aj) P(Aj)

Andre Martins (IST) Lecture 1 IST, Fall 2018 51 / 82

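A small Python sketch of the law of total probability and Bayes' theorem over a partition; the prior and likelihood numbers below are illustrative, not from the slides:

```python
from fractions import Fraction

# Hypothetical partition of the sample space into three events A1, A2, A3.
prior = {"A1": Fraction(1, 2), "A2": Fraction(3, 10), "A3": Fraction(1, 5)}        # P(Ai)
likelihood = {"A1": Fraction(1, 10), "A2": Fraction(1, 2), "A3": Fraction(9, 10)}  # P(B|Ai)

# Law of total probability: P(B) = sum_i P(B|Ai) P(Ai)
p_b = sum(likelihood[a] * prior[a] for a in prior)

# Bayes' theorem: P(Ai|B) = P(B|Ai) P(Ai) / P(B)
posterior = {a: likelihood[a] * prior[a] / p_b for a in prior}

print("P(B) =", p_b)
print("posterior:", posterior)
assert sum(posterior.values()) == 1  # posteriors over a partition sum to 1
```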

Page 111: Lecture 1: Introduction

Random Variables

• A (real) random variable (RV) is a function X : X → R

• Discrete RV: range of X is countable (e.g., N or {0, 1})

• Continuous RV: range of X is uncountable (e.g., R or [0, 1])

• Example: number of heads in tossing two coins, X = {HH, HT, TH, TT}, X(HH) = 2, X(HT) = X(TH) = 1, X(TT) = 0. Range of X = {0, 1, 2}.

• Example: distance traveled by a tossed coin; range of X = R+.

Andre Martins (IST) Lecture 1 IST, Fall 2018 52 / 82


Page 116: Lecture 1: Introduction

Random Variables: Distribution Function

• Distribution function: FX(x) = P({ω ∈ X : X(ω) ≤ x})

• Example: number of heads in tossing 2 coins; range(X) = {0, 1, 2}.

• Probability mass function (discrete RV): fX(x) = P(X = x), with FX(x) = ∑_{xi ≤ x} fX(xi).

Andre Martins (IST) Lecture 1 IST, Fall 2018 53 / 82

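A minimal sketch of the pmf and distribution function for the two-coin example, assuming fair coins:

```python
from fractions import Fraction

# pmf of X = number of heads in two fair coin tosses (values 0, 1, 2)
pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

def cdf(x):
    """Distribution function F_X(x) = P(X <= x), built from the pmf."""
    return sum(p for value, p in pmf.items() if value <= x)

for x in [-1, 0, 0.5, 1, 2, 3]:
    print(f"F_X({x}) = {cdf(x)}")
# F_X is a right-continuous step function: 0 below 0, then 1/4, 3/4, 1.
```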

Page 119: Lecture 1: Introduction

Important Discrete Random Variables

• Uniform: X ∈ {x1, ..., xK}, pmf fX(xi) = 1/K.

• Bernoulli RV: X ∈ {0, 1}, pmf fX(x) = p if x = 1, and 1 − p if x = 0.
Can be written compactly as fX(x) = p^x (1 − p)^(1−x).

• Binomial RV: X ∈ {0, 1, ..., n} (sum of n Bernoulli RVs),

fX(x) = Binomial(x; n, p) = (n choose x) p^x (1 − p)^(n−x)

Binomial coefficient ("n choose x"): (n choose x) = n! / ((n − x)! x!)

Andre Martins (IST) Lecture 1 IST, Fall 2018 54 / 82

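The Bernoulli and binomial pmfs can be written directly from the formulas above; a short Python sketch (the values of n and p are illustrative):

```python
from math import comb

def bernoulli_pmf(x, p):
    """f_X(x) = p^x (1-p)^(1-x) for x in {0, 1}."""
    return p**x * (1 - p) ** (1 - x)

def binomial_pmf(x, n, p):
    """f_X(x) = (n choose x) p^x (1-p)^(n-x) for x in {0, ..., n}."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

p, n = 0.3, 10
assert abs(sum(binomial_pmf(x, n, p) for x in range(n + 1)) - 1.0) < 1e-12
print([round(binomial_pmf(x, n, p), 4) for x in range(n + 1)])
```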

Page 123: Lecture 1: Introduction

More Important Discrete Random Variables

• Geometric(p): X ∈ N, pmf fX(x) = p (1 − p)^(x−1) (e.g., number of trials until the first success).

• Poisson(λ): X ∈ N ∪ {0}, pmf fX(x) = e^(−λ) λ^x / x!

Notice that ∑_{x=0}^∞ λ^x / x! = e^λ, thus ∑_{x=0}^∞ fX(x) = 1.

“...probability of the number of independent occurrences in a fixed (time/space) interval if these occurrences have known average rate”

Andre Martins (IST) Lecture 1 IST, Fall 2018 55 / 82


Page 125: Lecture 1: Introduction

Random Variables: Distribution Function

• Distribution function: FX(x) = P({ω ∈ X : X(ω) ≤ x})

• Example: continuous RV with uniform distribution on [a, b].

• Probability density function (pdf, continuous RV): fX(x), with

FX(x) = ∫_{−∞}^{x} fX(u) du,   P(X ∈ [c, d]) = ∫_{c}^{d} fX(x) dx,   P(X = x) = 0

Andre Martins (IST) Lecture 1 IST, Fall 2018 56 / 82


Page 131: Lecture 1: Introduction

Important Continuous Random Variables

• Uniform: fX(x) = Uniform(x; a, b) = 1/(b − a) if x ∈ [a, b], and 0 if x ∉ [a, b] (previous slide).

• Gaussian: fX(x) = N(x; µ, σ²) = (1 / √(2π σ²)) exp(−(x − µ)² / (2σ²))

• Exponential: fX(x) = Exp(x; λ) = λ e^(−λx) if x ≥ 0, and 0 if x < 0

Andre Martins (IST) Lecture 1 IST, Fall 2018 57 / 82

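The three densities above, written as plain Python functions, with a crude Riemann-sum check (not from the slides) that each integrates to roughly 1; the parameter values are illustrative:

```python
import math

def uniform_pdf(x, a, b):
    """Uniform(x; a, b): 1/(b-a) on [a, b], 0 elsewhere."""
    return 1.0 / (b - a) if a <= x <= b else 0.0

def gaussian_pdf(x, mu, sigma2):
    """N(x; mu, sigma^2)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def exponential_pdf(x, lam):
    """Exp(x; lambda): lambda * exp(-lambda x) for x >= 0, 0 otherwise."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

# Rough numerical check that each density integrates to (approximately) 1.
xs = [i * 0.001 for i in range(-20000, 20001)]
for pdf in (lambda x: uniform_pdf(x, -1.0, 2.0),
            lambda x: gaussian_pdf(x, 0.0, 1.5),
            lambda x: exponential_pdf(x, 0.7)):
    print(round(sum(pdf(x) * 0.001 for x in xs), 3))
```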

Page 134: Lecture 1: Introduction

Expectation of Random Variables

• Expectation:

E(X) = ∑i xi fX(xi), if X ∈ {x1, ..., xK} ⊂ R (discrete);
E(X) = ∫_{−∞}^{∞} x fX(x) dx, if X continuous.

• Example: Bernoulli, fX(x) = p^x (1 − p)^(1−x), for x ∈ {0, 1}. E(X) = 0 (1 − p) + 1 p = p.

• Example: Binomial, fX(x) = (n choose x) p^x (1 − p)^(n−x), for x ∈ {0, ..., n}. E(X) = n p.

• Example: Gaussian, fX(x) = N(x; µ, σ²). E(X) = µ.

• Linearity of expectation: E(X + Y) = E(X) + E(Y); E(αX) = αE(X), α ∈ R

Andre Martins (IST) Lecture 1 IST, Fall 2018 58 / 82

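A short sketch of the discrete expectation formula, checked against the binomial mean n p (illustrative n and p):

```python
from math import comb

def expectation(pmf):
    """E(X) = sum_i x_i f_X(x_i) for a discrete pmf given as {value: probability}."""
    return sum(x * p for x, p in pmf.items())

n, p = 12, 0.25
binomial = {x: comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(n + 1)}

print(expectation(binomial))          # close to n * p = 3.0
print(expectation({0: 1 - p, 1: p}))  # Bernoulli mean = p
```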

Page 139: Lecture 1: Introduction

Expectation of Functions of Random Variables

• E(g(X)) = ∑i g(xi) fX(xi), if X discrete, g(xi) ∈ R;
E(g(X)) = ∫_{−∞}^{∞} g(x) fX(x) dx, if X continuous.

• Example: variance, var(X) = E((X − E(X))²) = E(X²) − E(X)²

• Example: Bernoulli variance, E(X²) = E(X) = p, thus var(X) = p(1 − p).

• Example: Gaussian variance, E((X − µ)²) = σ².

• Probability as expectation of an indicator, 1A(x) = 1 if x ∈ A, and 0 if x ∉ A:

P(X ∈ A) = ∫_A fX(x) dx = ∫ 1A(x) fX(x) dx = E(1A(X))

Andre Martins (IST) Lecture 1 IST, Fall 2018 59 / 82

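A sketch of E(g(X)) for a discrete RV, used to recover the variance identity and the probability-as-expected-indicator identity; the parameters are illustrative:

```python
from math import comb

def expectation(pmf, g=lambda x: x):
    """E(g(X)) = sum_i g(x_i) f_X(x_i) for a discrete pmf {value: probability}."""
    return sum(g(x) * p for x, p in pmf.items())

def variance(pmf):
    """var(X) = E(X^2) - E(X)^2."""
    return expectation(pmf, lambda x: x**2) - expectation(pmf) ** 2

p = 0.3
bernoulli = {0: 1 - p, 1: p}
print(variance(bernoulli), p * (1 - p))   # both equal p(1-p)

# Probability as the expectation of an indicator: P(X in A) = E(1_A(X))
n, q = 8, 0.5
binomial = {x: comb(n, x) * q**x * (1 - q) ** (n - x) for x in range(n + 1)}
A = {0, 1, 2}
print(expectation(binomial, lambda x: 1.0 if x in A else 0.0))  # = P(X <= 2)
```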

Page 145: Lecture 1: Introduction

Two (or More) Random Variables

• Joint pmf of two discrete RVs: fX,Y(x, y) = P(X = x ∧ Y = y). Extends trivially to more than two RVs.

• Joint pdf of two continuous RVs: fX,Y(x, y), such that P((X, Y) ∈ A) = ∫∫_A fX,Y(x, y) dx dy. Extends trivially to more than two RVs.

• Marginalization: fY(y) = ∑x fX,Y(x, y), if X is discrete; fY(y) = ∫_{−∞}^{∞} fX,Y(x, y) dx, if X continuous.

• Independence: X ⊥⊥ Y ⇔ fX,Y(x, y) = fX(x) fY(y), which implies (but is not implied by) E(X Y) = E(X) E(Y).

Andre Martins (IST) Lecture 1 IST, Fall 2018 60 / 82


Page 150: Lecture 1: Introduction

Conditionals and Bayes' Theorem

• Conditional pmf (discrete RVs): fX|Y(x|y) = P(X = x |Y = y) = P(X = x ∧ Y = y) / P(Y = y) = fX,Y(x, y) / fY(y).

• Conditional pdf (continuous RVs): fX|Y(x|y) = fX,Y(x, y) / fY(y) ...the meaning is technically delicate.

• Bayes' theorem: fX|Y(x|y) = fY|X(y|x) fX(x) / fY(y) (pdf or pmf).

• Also valid in the mixed case (e.g., X continuous, Y discrete).

Andre Martins (IST) Lecture 1 IST, Fall 2018 61 / 82


Page 154: Lecture 1: Introduction

Joint, Marginal, and Conditional Probabilities: An Example

• A pair of binary variables X, Y ∈ {0, 1}, with joint pmf (table reconstructed here from the marginal sums below):
fX,Y(0, 0) = 1/5, fX,Y(0, 1) = 2/5, fX,Y(1, 0) = 1/10, fX,Y(1, 1) = 3/10.

• Marginals: fX(0) = 1/5 + 2/5 = 3/5, fX(1) = 1/10 + 3/10 = 4/10,
fY(0) = 1/5 + 1/10 = 3/10, fY(1) = 2/5 + 3/10 = 7/10.

• Conditional probabilities: obtained by dividing each joint entry by the corresponding marginal (see the sketch below).

Andre Martins (IST) Lecture 1 IST, Fall 2018 62 / 82

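A small Python sketch (not in the original slides) that recovers the marginals and the conditional probabilities from the joint pmf of the example above:

```python
from fractions import Fraction

# Joint pmf f_{X,Y}(x, y) for the binary example above.
joint = {
    (0, 0): Fraction(1, 5),
    (0, 1): Fraction(2, 5),
    (1, 0): Fraction(1, 10),
    (1, 1): Fraction(3, 10),
}

# Marginalization: f_X(x) = sum_y f_{X,Y}(x, y), f_Y(y) = sum_x f_{X,Y}(x, y).
f_x = {x: sum(p for (xx, _), p in joint.items() if xx == x) for x in (0, 1)}
f_y = {y: sum(p for (_, yy), p in joint.items() if yy == y) for y in (0, 1)}
print("f_X:", f_x)   # {0: 3/5, 1: 2/5}
print("f_Y:", f_y)   # {0: 3/10, 1: 7/10}

# Conditionals: f_{X|Y}(x|y) = f_{X,Y}(x, y) / f_Y(y).
cond_x_given_y = {(x, y): joint[(x, y)] / f_y[y] for (x, y) in joint}
print("f_{X|Y}:", cond_x_given_y)

# X and Y are not independent here: the joint is not the product of marginals.
print(all(joint[(x, y)] == f_x[x] * f_y[y] for (x, y) in joint))  # False
```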

Page 157: Lecture 1: Introduction

An Important Multivariate RV: Multinomial

• Multinomial: X = (X1, ..., XK), Xi ∈ {0, ..., n}, such that ∑i Xi = n,

fX(x1, ..., xK) = (n choose x1 x2 · · · xK) p1^x1 p2^x2 · · · pK^xK, if ∑i xi = n, and 0 if ∑i xi ≠ n,

where (n choose x1 x2 · · · xK) = n! / (x1! x2! · · · xK!)

Parameters: p1, ..., pK ≥ 0, such that ∑i pi = 1.

• Generalizes the binomial from binary to K classes.

• Example: tossing n independent fair dice, p1 = · · · = p6 = 1/6; xi = number of outcomes with i dots. Of course, ∑i xi = n.

Andre Martins (IST) Lecture 1 IST, Fall 2018 63 / 82

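A sketch of the multinomial pmf as defined above; as a sanity check, with K = 2 it reduces to the binomial pmf (the counts and probabilities below are illustrative):

```python
from math import comb, factorial, prod

def multinomial_pmf(counts, probs):
    """f_X(x_1, ..., x_K) = (n choose x_1 ... x_K) * prod_i p_i^{x_i}, with n = sum(counts)."""
    n = sum(counts)
    coef = factorial(n) // prod(factorial(x) for x in counts)
    return coef * prod(p**x for x, p in zip(counts, probs))

# Example from the slide: n fair dice, p_1 = ... = p_6 = 1/6 (here n = 5).
probs = [1 / 6] * 6
print(multinomial_pmf([2, 1, 1, 0, 0, 1], probs))

# Sanity check: with K = 2 the multinomial reduces to the binomial pmf.
n, p = 7, 0.3
for x in range(n + 1):
    binom = comb(n, x) * p**x * (1 - p) ** (n - x)
    assert abs(multinomial_pmf([x, n - x], [p, 1 - p]) - binom) < 1e-12
```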

Page 160: Lecture 1: Introduction

An Important Multivariate RV: Gaussian

• Multivariate Gaussian: X ∈ Rn,

fX(x) = N(x; µ, C) = (1 / √det(2π C)) exp(−(1/2) (x − µ)^T C^{−1} (x − µ))

• Parameters: vector µ ∈ Rn and matrix C ∈ Rn×n. Expected value: E(X) = µ. Meaning of C: next slide.

Andre Martins (IST) Lecture 1 IST, Fall 2018 64 / 82

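The multivariate Gaussian density, written with NumPy exactly as in the formula above (the values of µ and C below are illustrative):

```python
import numpy as np

def mvn_pdf(x, mu, C):
    """N(x; mu, C) = exp(-0.5 (x-mu)^T C^{-1} (x-mu)) / sqrt(det(2 pi C))."""
    x, mu = np.asarray(x, float), np.asarray(mu, float)
    diff = x - mu
    quad = diff @ np.linalg.solve(C, diff)          # (x-mu)^T C^{-1} (x-mu)
    norm = np.sqrt(np.linalg.det(2 * np.pi * np.asarray(C, float)))
    return np.exp(-0.5 * quad) / norm

mu = np.array([1.0, -2.0])
C = np.array([[2.0, 0.3],
              [0.3, 0.5]])
print(mvn_pdf([1.0, -2.0], mu, C))  # density at the mean
```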

Page 163: Lecture 1: Introduction

Covariance, Correlation, and all that...

• Covariance between two RVs:

cov(X, Y) = E[(X − E(X)) (Y − E(Y))] = E(X Y) − E(X) E(Y)

• Relationship with variance: var(X) = cov(X, X).

• Correlation: corr(X, Y) = ρ(X, Y) = cov(X, Y) / (√var(X) √var(Y)) ∈ [−1, 1]

• X ⊥⊥ Y ⇔ fX,Y(x, y) = fX(x) fY(y), which implies (but is not implied by) cov(X, Y) = 0 (example below)

• Covariance matrix of a multivariate RV, X ∈ Rn:

cov(X) = E[(X − E(X)) (X − E(X))^T] = E(X X^T) − E(X) E(X)^T

• Covariance of a Gaussian RV: fX(x) = N(x; µ, C) ⇒ cov(X) = C

Andre Martins (IST) Lecture 1 IST, Fall 2018 65 / 82

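The "(example below)" above refers to the classic counterexample showing that zero covariance does not imply independence; a minimal sketch with X uniform on {−1, 0, 1} and Y = X²:

```python
from fractions import Fraction

# X uniform on {-1, 0, 1}, and Y = X^2: cov(X, Y) = 0, yet X and Y are dependent.
pmf_x = {-1: Fraction(1, 3), 0: Fraction(1, 3), 1: Fraction(1, 3)}
joint = {(x, x * x): p for x, p in pmf_x.items()}  # f_{X,Y} concentrated on y = x^2

E = lambda g: sum(g(x, y) * p for (x, y), p in joint.items())

cov_xy = E(lambda x, y: x * y) - E(lambda x, y: x) * E(lambda x, y: y)
print("cov(X, Y) =", cov_xy)                      # 0

# ...but the joint is not the product of the marginals, e.g. at (x, y) = (0, 1):
f_x0 = sum(p for (x, _), p in joint.items() if x == 0)
f_y1 = sum(p for (_, y), p in joint.items() if y == 1)
print(joint.get((0, 1), Fraction(0)), "vs", f_x0 * f_y1)  # 0 vs 2/9 -> dependent
```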

Page 170: Lecture 1: Introduction

More on Expectations and Covariances

Let A ∈ Rn×n be a matrix and a ∈ Rn a vector.

• If E(X) = µ and Y = AX, then E(Y) = Aµ;

• If cov(X) = C and Y = AX, then cov(Y) = A C A^T;

• If cov(X) = C and Y = a^T X ∈ R, then var(Y) = a^T C a ≥ 0;

• If cov(X) = C and Y = C^{−1/2} X, then cov(Y) = I;

• If fX(x) = N(x; 0, I) and Y = µ + C^{1/2} X, then fY(y) = N(y; µ, C);

• If fX(x) = N(x; µ, C) and Y = C^{−1/2}(X − µ), then fY(y) = N(y; 0, I).

Andre Martins (IST) Lecture 1 IST, Fall 2018 66 / 82

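A NumPy sketch of two of the identities above: sampling Y = µ + C^{1/2} X with X ∼ N(0, I), and cov(AY) = A C A^T. The Cholesky factor is used here as one convenient choice of C^{1/2} (an assumption; any square root of C works):

```python
import numpy as np

rng = np.random.default_rng(0)
n, samples = 3, 200_000

mu = np.array([1.0, -1.0, 0.5])
C = np.array([[2.0, 0.4, 0.0],
              [0.4, 1.0, 0.3],
              [0.0, 0.3, 0.5]])

# Draw Y = mu + C^{1/2} X with X ~ N(0, I): Y should be (approximately) N(mu, C).
L = np.linalg.cholesky(C)            # one valid "square root" of C
X = rng.standard_normal((samples, n))
Y = mu + X @ L.T

print(np.round(Y.mean(axis=0), 2))   # close to mu
print(np.round(np.cov(Y.T), 2))      # close to C

# cov(AY) = A C A^T for any matrix A.
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, -1.0]])
print(np.round(np.cov((Y @ A.T).T), 2))
print(np.round(A @ C @ A.T, 2))
```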

Page 177: Lecture 1: Introduction

Central Limit Theorem

Take n independent RVs X1, ..., Xn such that E[Xi] = µi and var(Xi) = σi²

• Their sum, Yn = ∑_{i=1}^{n} Xi, satisfies:

E[Yn] = ∑_{i=1}^{n} µi ≡ µ    var(Yn) = ∑i σi² ≡ σ²

• ...thus, if Zn = (Yn − µ) / σ,

E[Zn] = 0    var(Zn) = 1

• Central limit theorem (CLT): under some mild conditions on X1, ..., Xn,

lim_{n→∞} Zn ∼ N(0, 1)

Andre Martins (IST) Lecture 1 IST, Fall 2018 67 / 82


Page 181: Lecture 1: Introduction

Central Limit Theorem

Illustration (the slide shows a figure, not reproduced in this transcript; see the sketch below)

Andre Martins (IST) Lecture 1 IST, Fall 2018 68 / 82
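A sketch of the kind of illustration this slide shows: sums of i.i.d. Uniform(0, 1) variables, once standardized, quickly behave like N(0, 1). The choice of distribution and sample sizes is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sum n i.i.d. Uniform(0, 1) variables, standardize, and compare a few summaries
# of Z_n with those of N(0, 1); a histogram of z would reproduce the usual picture.
for n in (1, 2, 10, 50):
    x = rng.uniform(0.0, 1.0, size=(100_000, n))
    y = x.sum(axis=1)                                   # Y_n
    mu, sigma = n * 0.5, np.sqrt(n / 12.0)              # E[Y_n], sqrt(var(Y_n))
    z = (y - mu) / sigma                                # Z_n
    # P(Z_n <= 1) approaches the standard normal value Phi(1) ~= 0.8413
    print(n, round(z.mean(), 3), round(z.var(), 3), round((z <= 1.0).mean(), 3))
```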

Page 182: Lecture 1: Introduction

Important Inequalities

• Markov's inequality: if X ≥ 0 is an RV with expectation E(X), then

P(X > t) ≤ E(X) / t

Simple proof:

t P(X > t) = ∫_{t}^{∞} t fX(x) dx ≤ ∫_{t}^{∞} x fX(x) dx = E(X) − ∫_{0}^{t} x fX(x) dx ≤ E(X)

(the last step uses ∫_{0}^{t} x fX(x) dx ≥ 0)

• Chebyshev's inequality: if µ = E(Y) and σ² = var(Y), then

P(|Y − µ| ≥ s) ≤ σ² / s²

...a simple corollary of Markov's inequality, with X = |Y − µ|², t = s²

Andre Martins (IST) Lecture 1 IST, Fall 2018 69 / 82

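A Monte Carlo sketch (using an exponential distribution as an illustrative choice) showing that the empirical tail probabilities indeed sit below the Markov and Chebyshev bounds:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=1_000_000)   # X >= 0 with E(X) = 2

# Markov: P(X > t) <= E(X) / t
for t in (1.0, 4.0, 10.0):
    print(f"t={t}:  P(X>t) ~= {(x > t).mean():.4f}  <=  E(X)/t = {x.mean() / t:.4f}")

# Chebyshev: P(|Y - mu| >= s) <= var(Y) / s^2
mu, var = x.mean(), x.var()
for s in (2.0, 4.0, 8.0):
    print(f"s={s}:  P(|Y-mu|>=s) ~= {(np.abs(x - mu) >= s).mean():.4f}  <=  {var / s**2:.4f}")
```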

Page 185: Lecture 1: Introduction

Other Important Inequalities: Cauchy-Schwarz

• Cauchy-Schwarz's inequality for RVs:

|E(X Y)| ≤ √(E(X²) E(Y²))

...why? Because 〈X, Y〉 ≡ E[XY] is a valid inner product

• Important corollary: let E[X] = µ and E[Y] = ν,

|cov(X, Y)| = |E[(X − µ)(Y − ν)]| ≤ √(E[(X − µ)²] E[(Y − ν)²]) = √(var(X) var(Y))

• Implication for correlation:

corr(X, Y) = ρ(X, Y) = cov(X, Y) / (√var(X) √var(Y)) ∈ [−1, 1]

Andre Martins (IST) Lecture 1 IST, Fall 2018 70 / 82


Page 190: Lecture 1: Introduction - andre-martins.github.io · Lecture 1: Introduction Andr e Martins Deep Structured Learning Course, Fall 2018 Andr e Martins (IST) Lecture 1 IST, Fall 2018

Other Important Inequalities: Jensen

• Recall that a real function g is convex if, for any x, y, and α ∈ [0, 1],

g(αx + (1 − α)y) ≤ αg(x) + (1 − α)g(y)

Jensen’s inequality: if g is a real convex function, then

g(E(X)) ≤ E(g(X))

Examples: with g(x) = x², E(X)² ≤ E(X²), hence var(X) = E(X²) − E(X)² ≥ 0.

For concave g (e.g. log) the inequality reverses: E(log X) ≤ log E(X), for X a positive RV.

Andre Martins (IST) Lecture 1 IST, Fall 2018 71 / 82
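A quick check of both examples (not from the slides; the Gamma-distributed positive RV is an arbitrary choice):

import numpy as np

rng = np.random.default_rng(2)
x = rng.gamma(shape=3.0, scale=1.0, size=1_000_000)  # an arbitrary positive RV

# convex g(x) = x^2: g(E X) <= E g(X)
print("E(X)^2   =", x.mean() ** 2, "<= E(X^2)   =", (x ** 2).mean())
# concave log: the inequality reverses
print("E(log X) =", np.log(x).mean(), "<= log E(X) =", np.log(x.mean()))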

Page 193: Lecture 1: Introduction - andre-martins.github.io · Lecture 1: Introduction Andr e Martins Deep Structured Learning Course, Fall 2018 Andr e Martins (IST) Lecture 1 IST, Fall 2018

Entropy and all that...

Entropy of a discrete RV X ∈ {1, ...,K}: H(X) = −∑_{x=1}^{K} f_X(x) log f_X(x)

• Positivity: H(X) ≥ 0; H(X) = 0 ⇔ f_X(i) = 1, for exactly one i ∈ {1, ...,K}.

• Upper bound: H(X) ≤ log K; H(X) = log K ⇔ f_X(x) = 1/K, for all x ∈ {1, ...,K}

• Measure of uncertainty/randomness of X

Continuous RV X, differential entropy: h(X) = −∫ f_X(x) log f_X(x) dx

• h(X) can be positive or negative. Example: if f_X(x) = Uniform(x; a, b), then h(X) = log(b − a).

• If f_X(x) = N(x; µ, σ²), then h(X) = ½ log(2πeσ²).

• If var(Y) = σ², then h(Y) ≤ ½ log(2πeσ²): the Gaussian maximizes differential entropy for a given variance.

Andre Martins (IST) Lecture 1 IST, Fall 2018 72 / 82
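A minimal illustration (not from the slides; the particular pmf and σ are arbitrary):

import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])           # an arbitrary pmf, K = 4
H = -np.sum(p * np.log(p))
print("H(X) =", H, "<= log K =", np.log(len(p)))  # equality only for the uniform pmf

sigma = 2.0                                       # arbitrary standard deviation
print("h(N(mu, sigma^2)) =", 0.5 * np.log(2 * np.pi * np.e * sigma**2))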

Page 201: Lecture 1: Introduction - andre-martins.github.io · Lecture 1: Introduction Andr e Martins Deep Structured Learning Course, Fall 2018 Andr e Martins (IST) Lecture 1 IST, Fall 2018

Kullback-Leibler divergence

Kullback-Leibler divergence (KLD) between two pmfs:

D(f_X‖g_X) = ∑_{x=1}^{K} f_X(x) log ( f_X(x) / g_X(x) )

Positivity: D(f_X‖g_X) ≥ 0; D(f_X‖g_X) = 0 ⇔ f_X(x) = g_X(x), for all x ∈ {1, ...,K}

KLD between two pdfs:

D(f_X‖g_X) = ∫ f_X(x) log ( f_X(x) / g_X(x) ) dx

Positivity: D(f_X‖g_X) ≥ 0; D(f_X‖g_X) = 0 ⇔ f_X(x) = g_X(x), almost everywhere

Andre Martins (IST) Lecture 1 IST, Fall 2018 73 / 82
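A short sketch for the discrete case (not from the slides; the pmfs f and g are arbitrary and assumed to have full support so the ratio is well defined, and the helper kld is just an illustrative name):

import numpy as np

def kld(f, g):
    """D(f || g) for two pmfs with full support."""
    f, g = np.asarray(f, dtype=float), np.asarray(g, dtype=float)
    return np.sum(f * np.log(f / g))

f = np.array([0.5, 0.3, 0.2])                     # arbitrary pmfs
g = np.array([0.4, 0.4, 0.2])
print("D(f||g) =", kld(f, g))                     # >= 0
print("D(g||f) =", kld(g, f))                     # >= 0, and different: KLD is not symmetric
print("D(f||f) =", kld(f, f))                     # = 0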

Page 205: Lecture 1: Introduction - andre-martins.github.io · Lecture 1: Introduction Andr e Martins Deep Structured Learning Course, Fall 2018 Andr e Martins (IST) Lecture 1 IST, Fall 2018

Mutual information

Mutual information (MI) between two random variables:

I(X;Y) = D(f_{X,Y} ‖ f_X f_Y)

Positivity: I(X;Y) ≥ 0; I(X;Y) = 0 ⇔ X, Y are independent.

MI is a measure of dependency between two random variables

Andre Martins (IST) Lecture 1 IST, Fall 2018 74 / 82
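A small sketch for a discrete pair (not from the slides; the joint pmf is arbitrary): MI is computed directly as the KLD between the joint and the product of its marginals.

import numpy as np

joint = np.array([[0.30, 0.10],                   # arbitrary joint pmf f_{X,Y}
                  [0.15, 0.45]])
fx = joint.sum(axis=1, keepdims=True)             # marginal of X (column)
fy = joint.sum(axis=0, keepdims=True)             # marginal of Y (row)

print("I(X;Y) =", np.sum(joint * np.log(joint / (fx * fy))))  # > 0: dependent

indep = fx * fy                                   # same marginals, but independent
print("I(X;Y) =", np.sum(indep * np.log(indep / (fx * fy))))  # = 0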

Page 208: Lecture 1: Introduction - andre-martins.github.io · Lecture 1: Introduction Andr e Martins Deep Structured Learning Course, Fall 2018 Andr e Martins (IST) Lecture 1 IST, Fall 2018

Recommended Reading

• K. Murphy, “Machine Learning: A Probabilistic Perspective”, MIT Press, 2012.

• L. Wasserman, “All of Statistics: A Concise Course in Statistical Inference”, Springer, 2004.

Andre Martins (IST) Lecture 1 IST, Fall 2018 75 / 82

Page 209: Lecture 1: Introduction - andre-martins.github.io · Lecture 1: Introduction Andr e Martins Deep Structured Learning Course, Fall 2018 Andr e Martins (IST) Lecture 1 IST, Fall 2018

Outline

1 Introduction

2 Class Administrativia

3 Recap

Linear Algebra

Probability Theory

Optimization

Andre Martins (IST) Lecture 1 IST, Fall 2018 76 / 82

Page 210: Lecture 1: Introduction - andre-martins.github.io · Lecture 1: Introduction Andr e Martins Deep Structured Learning Course, Fall 2018 Andr e Martins (IST) Lecture 1 IST, Fall 2018

Minimizing a function

• We are given a function f : R^d → R.

• Goal: find x∗ that minimizes f.

• Global minimum: f(x∗) ≤ f(x) for all x ∈ R^d.

• Local minimum: there is a δ > 0 such that f(x∗) ≤ f(x) for all x with ‖x − x∗‖ ≤ δ.

Are all stationary points (points with zero gradient) global minima?

• No: they can be local minima, saddle points, . . .

Andre Martins (IST) Lecture 1 IST, Fall 2018 77 / 82
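A 1-D illustration of the distinction (not from the slides; the quartic below is an arbitrary non-convex example):

import numpy as np

f = lambda x: x**4 - 3 * x**2 + x                 # an arbitrary non-convex function

xs = np.linspace(-3.0, 3.0, 100_001)
print("global minimum near x =", xs[np.argmin(f(xs))], " f =", f(xs).min())

# its stationary points solve f'(x) = 4x^3 - 6x + 1 = 0; only one of them is global
print("stationary points:", np.sort(np.roots([4.0, 0.0, -6.0, 1.0]).real))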

Page 211: Lecture 1: Introduction - andre-martins.github.io · Lecture 1: Introduction Andr e Martins Deep Structured Learning Course, Fall 2018 Andr e Martins (IST) Lecture 1 IST, Fall 2018

Iterative descent methods

Goal: find the minimum/minimizer of f : R^d → R

• Proceed in small steps in a descent direction until a stopping criterion is met.

• Gradient descent: updates of the form x^(t+1) ← x^(t) − η^(t) ∇f(x^(t))

Figure: Illustration of gradient descent. The blue circles correspond to the function values at different points, while the red lines correspond to steps taken in the negative gradient direction.

Andre Martins (IST) Lecture 1 IST, Fall 2018 78 / 82
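A minimal gradient descent sketch (not from the slides), assuming a least-squares objective f(x) = ½‖Ax − b‖², an arbitrary A and b, and a fixed step size η small enough for convergence; the helper gradient_descent is just an illustrative name:

import numpy as np

def gradient_descent(grad, x0, eta=0.1, tol=1e-8, max_iter=10_000):
    """Iterate x <- x - eta * grad(x) until the gradient norm drops below tol."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - eta * g
    return x

A = np.array([[3.0, 1.0], [1.0, 2.0]])            # arbitrary data
b = np.array([1.0, 0.0])
grad = lambda x: A.T @ (A @ x - b)                # gradient of 1/2 ||Ax - b||^2

x_star = gradient_descent(grad, x0=np.zeros(2))
print("gradient descent:", x_star)
print("closed form:     ", np.linalg.solve(A, b))  # here Ax = b is solvable exactly

With a step size that is too large the iterates diverge; in practice η is tuned or decayed over iterations.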

Page 212: Lecture 1: Introduction - andre-martins.github.io · Lecture 1: Introduction Andr e Martins Deep Structured Learning Course, Fall 2018 Andr e Martins (IST) Lecture 1 IST, Fall 2018

Convex functions

Pro: guarantee of a global minimum ✓ (every local minimum is a global minimum)

Figure: Illustration of a convex function. The line segment between any two points on the graph lies entirely above the curve.

Andre Martins (IST) Lecture 1 IST, Fall 2018 79 / 82

Page 213: Lecture 1: Introduction - andre-martins.github.io · Lecture 1: Introduction Andr e Martins Deep Structured Learning Course, Fall 2018 Andr e Martins (IST) Lecture 1 IST, Fall 2018

Non-Convex functions

Con: no guarantee of finding a global minimum ✗

Figure: Illustration of a non-convex function. Note the line segment intersecting the curve.

Andre Martins (IST) Lecture 1 IST, Fall 2018 80 / 82

Page 214: Lecture 1: Introduction - andre-martins.github.io · Lecture 1: Introduction Andr e Martins Deep Structured Learning Course, Fall 2018 Andr e Martins (IST) Lecture 1 IST, Fall 2018

Thank you!

Questions?

Andre Martins (IST) Lecture 1 IST, Fall 2018 81 / 82

Page 215: Lecture 1: Introduction - andre-martins.github.io · Lecture 1: Introduction Andr e Martins Deep Structured Learning Course, Fall 2018 Andr e Martins (IST) Lecture 1 IST, Fall 2018

References I

Andre Martins (IST) Lecture 1 IST, Fall 2018 82 / 82