
Multiclass Classification

Alan Ritter (many slides from Greg Durrett, Vivek Srikumar, Stanford CS231n)

This Lecture

‣ Multiclass fundamentals

‣ Multiclass logistic regression

‣ Multiclass SVM

‣ Feature extraction

‣ Optimization

Multiclass Fundamentals

Text Classification

~20 classes, e.g. Sports, Health

Image Classification

‣ Thousands of classes (ImageNet), e.g. Car, Dog

Entity Linking

‣ 4,500,000 classes (all articles in Wikipedia)

Although he originally won the event, the United States Anti-Doping Agency announced in August 2012 that they had disqualified Armstrong from his seven consecutive Tour de France wins from 1999–2005.

Which article does Armstrong refer to? Two candidates:

Lance Edward Armstrong is an American former professional road cyclist

Armstrong County is a county in Pennsylvania…

Reading Comprehension

‣ Multiple choice questions, 4 classes (but classes change per example)

Richardson (2013)

Binary Classification

‣ Binary classification: one weight vector defines positive and negative classes

[figure: positive (+) and negative (−) points separated by a single linear boundary]

Multiclass Classification

[figure: data points from three classes, labeled 1, 2, and 3]

‣ Can we just use binary classifiers here?

Multiclass Classification

‣ One-vs-all: train k classifiers, one to distinguish each class from all the rest

[figure: the three-class data with a binary boundary separating class 1 from classes 2 and 3]

‣ How do we reconcile multiple positive predictions? Highest score? (see the sketch below)
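A minimal sketch (mine, not from the slides) of the one-vs-all decision with the highest-score rule, assuming one trained weight vector per class:

```python
import numpy as np

def predict_one_vs_all(weight_vectors, x):
    # Score x under each class's binary classifier; even if several
    # classifiers fire positive, the highest score wins.
    scores = [float(np.dot(w, x)) for w in weight_vectors]
    return int(np.argmax(scores))
```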

Multiclass Classification

‣ Not all classes may even be separable using this approach

[figure: class 3 points lying between the class 1 and class 2 clusters]

‣ Can separate 1 from 2+3 and 2 from 1+3, but not 3 from the others (with these features)

Multiclass Classification

‣ All-vs-all: train n(n−1)/2 classifiers to differentiate each pair of classes

[figure: pairwise decision boundaries, e.g. class 1 vs. class 3 and class 1 vs. class 2]

‣ Again, how to reconcile?

Multiclass Classification

[figure: the binary (+/−) dataset beside the three-class dataset]

‣ Binary classification: one weight vector defines both classes

‣ Multiclass classification: different weights and/or features per class

Multiclass Classification

‣ Formally: instead of two labels, we have an output space $\mathcal{Y}$ containing a number of possible classes

‣ Same machinery that we'll use later for exponentially large output spaces, including sequences and trees

‣ Decision rule: $\mathrm{argmax}_{y \in \mathcal{Y}} \, w^\top f(x, y)$ (features depend on the choice of label now! note: this y isn't the gold label)

‣ Multiple feature vectors, one weight vector

‣ Can also have one weight vector per class: $\mathrm{argmax}_{y \in \mathcal{Y}} \, w_y^\top f(x)$

‣ The single weight vector approach will generalize to structured output spaces, whereas per-class weight vectors won't

Feature Extraction

Block Feature Vectors

‣ Decision rule: $\mathrm{argmax}_{y \in \mathcal{Y}} \, w^\top f(x, y)$

Example: too many drug trials, too few patients, with labels Health, Sports, Science

‣ Base feature function: f(x) = [I[contains drug], I[contains patients], I[contains baseball]] = [1, 1, 0]

f(x, y = Health) = [1, 1, 0, 0, 0, 0, 0, 0, 0]

f(x, y = Sports) = [0, 0, 0, 1, 1, 0, 0, 0, 0]

‣ Feature vector blocks for each label: the first coordinate of the Health block is I[contains drug & label = Health]

‣ Equivalent to having three weight vectors in this case
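A minimal sketch (not from the slides) of how such block vectors could be assembled, assuming the slide's three base features and three labels:

```python
import numpy as np

LABELS = ["Health", "Sports", "Science"]

def base_features(text):
    words = text.replace(",", " ").split()
    return np.array([float("drug" in words),
                     float("patients" in words),
                     float("baseball" in words)])

def block_features(text, y):
    # Copy the base features into the block for label y; all other
    # blocks stay zero.
    f = base_features(text)
    out = np.zeros(len(f) * len(LABELS))
    k = LABELS.index(y)
    out[k * len(f):(k + 1) * len(f)] = f
    return out

# block_features("too many drug trials, too few patients", "Health")
# -> [1. 1. 0. 0. 0. 0. 0. 0. 0.]
```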

Making Decisions

too many drug trials, too few patients

f(x) = [I[contains drug], I[contains patients], I[contains baseball]]

f(x, y = Health) = [1, 1, 0, 0, 0, 0, 0, 0, 0]

f(x, y = Sports) = [0, 0, 0, 1, 1, 0, 0, 0, 0]

w = [+2.1, +2.3, −5, −2.1, −3.8, 0, +1.1, −1.7, −1.3]

(e.g., "word drug in Science article" = +1.1)

$w^\top f(x, y)$ = Health: +4.4, Sports: −5.9, Science: −0.6

argmax: Health
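As a quick check of the slide's arithmetic, a sketch that scores each label's block against this w:

```python
import numpy as np

# Rows of F are f(x, y) for y = Health, Sports, Science on the example
# "too many drug trials, too few patients".
F = np.array([[1, 1, 0, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 1, 1, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 0, 1, 1, 0]])
w = np.array([2.1, 2.3, -5.0, -2.1, -3.8, 0.0, 1.1, -1.7, -1.3])

scores = F @ w                          # -> [ 4.4, -5.9, -0.6]
labels = ["Health", "Sports", "Science"]
print(labels[int(np.argmax(scores))])   # -> Health
```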

Another example: POS tagging

the router blocks the packets

‣ Classify blocks as one of 36 POS tags (NNS? VBZ? NN? DT? …)

‣ Example x: sentence with a word (in this case, blocks) highlighted

‣ Extract features with respect to this word:

f(x, y = VBZ) = [I[curr_word=blocks & tag=VBZ], I[prev_word=router & tag=VBZ], I[next_word=the & tag=VBZ], I[curr_suffix=s & tag=VBZ]]

(not saying that the is tagged as VBZ! saying that the follows the VBZ word)

‣ Next two lectures: sequence labeling!
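A sketch of these feature templates as a hypothetical extractor; the `<s>` and `</s>` boundary sentinels are my convention, not the slides':

```python
def pos_features(words, i, y):
    # Feature templates mirroring the slide; each feature is conjoined
    # with the candidate tag y, so every tag gets its own block.
    prev_w = words[i - 1] if i > 0 else "<s>"
    next_w = words[i + 1] if i + 1 < len(words) else "</s>"
    return {
        f"curr_word={words[i]}&tag={y}": 1.0,
        f"prev_word={prev_w}&tag={y}": 1.0,
        f"next_word={next_w}&tag={y}": 1.0,
        f"curr_suffix={words[i][-1:]}&tag={y}": 1.0,
    }

# pos_features("the router blocks the packets".split(), 2, "VBZ") gives
# {"curr_word=blocks&tag=VBZ": 1.0, "prev_word=router&tag=VBZ": 1.0,
#  "next_word=the&tag=VBZ": 1.0, "curr_suffix=s&tag=VBZ": 1.0}
```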

Multiclass Logistic Regression

$$P_w(y \mid x) = \frac{\exp\left(w^\top f(x, y)\right)}{\sum_{y' \in \mathcal{Y}} \exp\left(w^\top f(x, y')\right)} \quad \text{(sum over output space to normalize)}$$

‣ Compare to binary: $P(y = 1 \mid x) = \frac{\exp(w^\top f(x))}{1 + \exp(w^\top f(x))}$; the negative class implicitly had f(x, y = 0) = the zero vector

‣ This is the softmax function. Why use it? To interpret raw classifier scores as probabilities:

too many drug trials, too few patients

$w^\top f(x, y)$: Health: +1.8, Sports: +3.1, Science: −0.6

exp (probabilities must be ≥ 0): 6.05, 22.2, 0.55 (unnormalized probabilities)

normalize (probabilities must sum to 1): 0.21, 0.77, 0.02 (probabilities)

compare to the correct (gold) probabilities: 1.00, 0.00, 0.00

$\log(0.21) = -1.56$ (the log-likelihood of the gold label Health)
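A quick numeric check of this chain:

```python
import numpy as np

# Checking the softmax arithmetic above.
scores = np.array([1.8, 3.1, -0.6])       # Health, Sports, Science
unnorm = np.exp(scores)                   # -> [6.05, 22.2, 0.55]
probs = unnorm / unnorm.sum()             # -> [0.21, 0.77, 0.02]
print(round(float(np.log(probs[0])), 2))  # -> -1.56
```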

‣ Training: maximize

$$\mathcal{L}(x, y) = \sum_{j=1}^{n} \log P(y_j^* \mid x_j) = \sum_{j=1}^{n} \left( w^\top f(x_j, y_j^*) - \log \sum_{y} \exp\left(w^\top f(x_j, y)\right) \right)$$

Training

‣ Multiclass logistic regression: $P_w(y \mid x) = \frac{\exp\left(w^\top f(x, y)\right)}{\sum_{y' \in \mathcal{Y}} \exp\left(w^\top f(x, y')\right)}$

‣ Likelihood: $\mathcal{L}(x_j, y_j^*) = w^\top f(x_j, y_j^*) - \log \sum_{y} \exp\left(w^\top f(x_j, y)\right)$

$$\frac{\partial}{\partial w_i} \mathcal{L}(x_j, y_j^*) = f_i(x_j, y_j^*) - \frac{\sum_y f_i(x_j, y) \exp\left(w^\top f(x_j, y)\right)}{\sum_y \exp\left(w^\top f(x_j, y)\right)} = f_i(x_j, y_j^*) - \sum_y f_i(x_j, y) \, P_w(y \mid x_j) = f_i(x_j, y_j^*) - \mathbb{E}_y[f_i(x_j, y)]$$

gold feature value minus the model's expectation of the feature value

Training

too many drug trials, too few patients

f(x, y = Health) = [1, 1, 0, 0, 0, 0, 0, 0, 0]

f(x, y = Sports) = [0, 0, 0, 1, 1, 0, 0, 0, 0]

y* = Health, and the model currently gives $P_w(y \mid x)$ = [0.21, 0.77, 0.02]

gradient: $f_i(x_j, y_j^*) - \sum_y f_i(x_j, y) \, P_w(y \mid x_j)$

= [1, 1, 0, 0, 0, 0, 0, 0, 0] − 0.21 · [1, 1, 0, 0, 0, 0, 0, 0, 0] − 0.77 · [0, 0, 0, 1, 1, 0, 0, 0, 0] − 0.02 · [0, 0, 0, 0, 0, 0, 1, 1, 0]

= [0.79, 0.79, 0, −0.77, −0.77, 0, −0.02, −0.02, 0]

update (gradient ascent step, w + gradient):

[1.3, 0.9, −5, 3.2, −0.1, 0, 1.1, −1.7, −1.3] + [0.79, 0.79, 0, −0.77, −0.77, 0, −0.02, −0.02, 0] = [2.09, 1.69, −5, 2.43, −0.87, 0, 1.08, −1.72, −1.3]

new $P_w(y \mid x)$ = [0.89, 0.10, 0.01]
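A sketch reproducing this gradient step end to end, with the slides' numbers:

```python
import numpy as np

F = np.array([[1, 1, 0, 0, 0, 0, 0, 0, 0],   # f(x, y=Health)
              [0, 0, 0, 1, 1, 0, 0, 0, 0],   # f(x, y=Sports)
              [0, 0, 0, 0, 0, 0, 1, 1, 0]])  # f(x, y=Science)
w = np.array([1.3, 0.9, -5.0, 3.2, -0.1, 0.0, 1.1, -1.7, -1.3])
gold = 0                                      # y* = Health

probs = np.array([0.21, 0.77, 0.02])          # P_w(y|x) from the slide
grad = F[gold] - probs @ F                    # f(x,y*) - E_y[f(x,y)]
w = w + grad                                  # gradient ascent step

scores = F @ w
new_probs = np.exp(scores) / np.exp(scores).sum()
print(new_probs.round(2))                     # -> [0.89 0.1  0.01]
```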

Logistic Regression: Summary

‣ Model: $P_w(y \mid x) = \frac{\exp\left(w^\top f(x, y)\right)}{\sum_{y' \in \mathcal{Y}} \exp\left(w^\top f(x, y')\right)}$

‣ Inference: $\mathrm{argmax}_y \, P_w(y \mid x)$

‣ Learning: gradient ascent on the discriminative log-likelihood

$$f(x, y^*) - \mathbb{E}_y[f(x, y)] = f(x, y^*) - \sum_y P_w(y \mid x) \, f(x, y)$$

"towards gold feature value, away from expectation of feature value"

Multiclass SVM

Soft Margin SVM

Minimize $\lambda \lVert w \rVert_2^2 + \sum_{j=1}^{m} \xi_j$

s.t. $\forall j \;\; \xi_j \geq 0$ and $\forall j \;\; (2y_j - 1)(w^\top x_j) \geq 1 - \xi_j$

(slack variables: $\xi_j > 0$ iff the example is a support vector)

Image credit: Lang Van Tran

Multiclass SVM

Minimize $\lambda \lVert w \rVert_2^2 + \sum_{j=1}^{m} \xi_j$

s.t. $\forall j \;\; \xi_j \geq 0$

$\forall j \;\; \forall y \in \mathcal{Y} \quad w^\top f(x_j, y_j^*) \geq w^\top f(x_j, y) + \ell(y, y_j^*) - \xi_j$

‣ Correct prediction now has to beat every other class

‣ Score comparison is more explicit now

‣ The 1 that was in the binary constraint is replaced by a loss function

(slack variables: $\xi_j > 0$ iff the example is a support vector)

Training (loss-augmented)

‣ Are all decisions equally costly?

too many drug trials, too few patients (gold label: Health)

Predicted Sports: bad error

Predicted Science: not so bad

‣ We can define a loss function $\ell(y, y^*)$:

$\ell(\text{Sports}, \text{Health}) = 3$

$\ell(\text{Science}, \text{Health}) = 1$

Multiclass SVM

$\forall j \;\; \forall y \in \mathcal{Y} \quad w^\top f(x_j, y_j^*) \geq w^\top f(x_j, y) + \ell(y, y_j^*) - \xi_j$

$w^\top f(x, y) + \ell(y, y^*)$ for each $y \in \mathcal{Y}$: Health: 2.4 + 0, Science: 1.8 + 1, Sports: 1.3 + 3

‣ Does gold beat every label + loss? No!

‣ Most violated constraint is Sports; what is $\xi_j$?

‣ $\xi_j$ = 4.3 − 2.4 = 1.9 (reproduced in the sketch below)

‣ Perceptron would make no update here
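A tiny sketch reproducing this slack computation:

```python
# The most violated constraint sets the slack xi_j.
scores = {"Health": 2.4, "Science": 1.8, "Sports": 1.3}   # w^T f(x, y)
loss   = {"Health": 0.0, "Science": 1.0, "Sports": 3.0}   # ell(y, y*)
gold = "Health"

xi = max(scores[y] + loss[y] for y in scores) - scores[gold]
print(round(xi, 2))  # -> 1.9: Sports (1.3 + 3 = 4.3) is the most violated
```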

Pages 130-134: Multiclass Classification

Multiclass SVM

Minimize \quad \lambda \|w\|_2^2 + \sum_{j=1}^{m} \xi_j

s.t. \quad \forall j\; \xi_j \ge 0

\forall j\, \forall y \in \mathcal{Y}: \quad w^\top f(x_j, y_j^*) \ge w^\top f(x_j, y) + \ell(y, y_j^*) - \xi_j

‣ One slack variable per example, so it's set to be whatever the most violated constraint is for that example:

\xi_j = \max_{y \in \mathcal{Y}} \left[ w^\top f(x_j, y) + \ell(y, y_j^*) \right] - w^\top f(x_j, y_j^*)

‣ Plug in the gold y and you get 0, so slack is always nonnegative!
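To spell out that last step: the max ranges over all labels, including the gold one, and with the standard assumption \ell(y^*, y^*) = 0 the gold term alone already gives

```latex
\xi_j \;\ge\; \big[\, w^\top f(x_j, y_j^*) + \ell(y_j^*, y_j^*) \,\big] - w^\top f(x_j, y_j^*) \;=\; 0 .
```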

Pages 135-140: Multiclass Classification

Computing the Subgradient

Minimize \quad \lambda \|w\|_2^2 + \sum_{j=1}^{m} \xi_j

s.t. \quad \forall j\; \xi_j \ge 0

\forall j\, \forall y \in \mathcal{Y}: \quad w^\top f(x_j, y_j^*) \ge w^\top f(x_j, y) + \ell(y, y_j^*) - \xi_j

\xi_j = \max_{y \in \mathcal{Y}} \left[ w^\top f(x_j, y) + \ell(y, y_j^*) \right] - w^\top f(x_j, y_j^*)

‣ If \xi_j = 0, the example is not a support vector, and the gradient is zero

‣ Otherwise, \quad \frac{\partial}{\partial w_i} \xi_j = f_i(x_j, y_{\max}) - f_i(x_j, y_j^*)

(the update looks backwards — we're minimizing here!)

‣ Perceptron-like, but we update away from the *loss-augmented* prediction
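A minimal sketch of one stochastic subgradient step under these definitions; the feature dictionaries and toy numbers below are illustrative stand-ins, not anything from the slides:

```python
import numpy as np

def svm_sgd_step(w, feats, gold, loss, alpha=0.1):
    """One stochastic subgradient step on xi_j for a single example.

    feats: dict mapping each label y to its feature vector f(x, y)
    gold:  the gold label y*
    loss:  dict mapping each label y to l(y, y*), with loss[gold] == 0
    """
    # Loss-augmented decode: y_max = argmax_y [ w^T f(x, y) + l(y, y*) ]
    y_max = max(feats, key=lambda y: w @ feats[y] + loss[y])
    # If no loss-augmented score beats the gold score, xi_j = 0:
    # not a support vector, no update.
    if w @ feats[y_max] + loss[y_max] <= w @ feats[gold]:
        return w
    # Minimizing xi_j means stepping along f(x, y*) - f(x, y_max).
    return w + alpha * (feats[gold] - feats[y_max])

# Toy usage with three labels and two features:
w = np.zeros(2)
feats = {"A": np.array([1.0, 0.0]), "B": np.array([0.0, 1.0]),
         "C": np.array([1.0, 1.0])}
loss = {"A": 0.0, "B": 1.0, "C": 1.0}
w = svm_sgd_step(w, feats, gold="A", loss=loss)
```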

Pages 141-145: Multiclass Classification

Putting it Together

Minimize \quad \lambda \|w\|_2^2 + \sum_{j=1}^{m} \xi_j

s.t. \quad \forall j\; \xi_j \ge 0

\forall j\, \forall y \in \mathcal{Y}: \quad w^\top f(x_j, y_j^*) \ge w^\top f(x_j, y) + \ell(y, y_j^*) - \xi_j

‣ (Unregularized) gradients:

‣ SVM: \quad f(x, y^*) - f(x, y_{\max}) \quad (loss-augmented max)

‣ Log reg: \quad f(x, y^*) - \mathbb{E}_y[f(x, y)] = f(x, y^*) - \sum_y P_w(y|x)\, f(x, y)

‣ SVM: a max over the \ell(y, y^*)-augmented scores to compute the gradient; LR: need to sum over all labels
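A side-by-side sketch of the two gradient computations; the matrix layout and names are illustrative assumptions:

```python
import numpy as np

def svm_gradient(w, F, gold, loss):
    """F: (num_labels, num_feats) matrix whose rows are f(x, y);
    loss: length-num_labels vector of l(y, y*)."""
    y_max = np.argmax(F @ w + loss)   # one max over loss-augmented scores
    return F[gold] - F[y_max]         # zero when gold wins the augmented argmax

def logreg_gradient(w, F, gold):
    scores = F @ w
    p = np.exp(scores - scores.max())
    p /= p.sum()                      # P_w(y | x): a sum over all labels
    return F[gold] - p @ F            # f(x, y*) - E_y[f(x, y)]
```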

Page 146: Multiclass Classification

Optimization

Pages 147-151: Multiclass Classification

Recap

‣ Four elements of a machine learning method:

‣ Model: probabilistic, max-margin, deep neural network

‣ Objective:

‣ Inference: just maxes and simple expectations so far, but will get harder

‣ Training: gradient descent?

Pages 152-160: Multiclass Classification

Optimization

‣ Stochastic gradient *ascent*: \quad w \leftarrow w + \alpha g, \quad g = \frac{\partial}{\partial w} \mathcal{L}

‣ Very simple to code up

[Figure (Stanford CS231n): a gradient descent trajectory over a 2D loss surface with axes W_1 and W_2]

‣ What if loss changes quickly in one direction and slowly in another direction?

‣ Then the loss function has a high condition number: the ratio of the largest to smallest singular value of the Hessian matrix is large (Stanford CS231n)

‣ Gradient descent makes very slow progress along the shallow dimension and jitters along the steep direction
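A small demo of that failure mode, assuming a toy quadratic loss with condition number 100 (every number here is illustrative):

```python
import numpy as np

# Gradient descent on L(w) = 0.5 * w^T H w, where H has condition
# number 100: steep curvature in w_1, shallow curvature in w_0.
H = np.diag([1.0, 100.0])
w = np.array([10.0, 1.0])
alpha = 0.019  # just under the stability limit 2/100 of the steep direction

for _ in range(50):
    g = H @ w          # gradient of the quadratic
    w = w - alpha * g  # descent step

print(w)  # w_0 has only decayed to ~3.8; w_1 overshot and oscillated to ~0
```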

Pages 161-162: Multiclass Classification

Optimization

‣ What if the loss function has a local minimum or saddle point? The gradient there is zero, so plain gradient descent stalls.

‣ Saddle points are much more common in high dimensions (Dauphin et al., "Identifying and attacking the saddle point problem in high-dimensional non-convex optimization", NIPS 2014)
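A tiny illustration of stalling at a saddle point, under the (contrived) assumption of a start exactly on the flat axis:

```python
import numpy as np

# f(w) = w_0^2 - w_1^2 has a saddle at the origin. Starting exactly on
# the w_1 = 0 axis, the w_1 gradient is always zero, so gradient descent
# marches straight to the saddle and stops there.
w = np.array([1.0, 0.0])
for _ in range(200):
    g = np.array([2 * w[0], -2 * w[1]])  # gradient of f
    w = w - 0.1 * g

print(w)  # ~[0, 0]: zero gradient, but not a minimum
```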

Pages 163-170: Multiclass Classification

Optimization

‣ Stochastic gradient *ascent*: \quad w \leftarrow w + \alpha g, \quad g = \frac{\partial}{\partial w} \mathcal{L}

‣ Very simple to code up

‣ "First-order" technique: only relies on having the gradient

[Figure (Stanford CS231n): First-order optimization: (1) use the gradient to form a linear approximation of the loss; (2) step to minimize the approximation. Second-order optimization: (1) use the gradient and Hessian to form a quadratic approximation; (2) step to the minimum of the approximation.]

‣ Setting the step size is hard (decrease it when held-out performance worsens?)

‣ Newton's method: \quad w \leftarrow w + \left( \frac{\partial^2}{\partial w^2} \mathcal{L} \right)^{-1} g

‣ Second-order technique: the inverse Hessian is an n x n matrix, expensive!

‣ Optimizes a quadratic instantly

‣ Quasi-Newton methods (L-BFGS, etc.) approximate the inverse Hessian
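A sketch of one Newton step on a toy quadratic, written as minimization (the slides' formula is the ascent form); all numbers are illustrative:

```python
import numpy as np

# Newton's method on L(w) = 0.5 * w^T H w - b^T w. For a quadratic the
# Hessian is constant, so a single Newton step lands exactly on the
# minimizer H^{-1} b: it "optimizes a quadratic instantly".
H = np.array([[1.0, 0.0], [0.0, 100.0]])
b = np.array([1.0, 1.0])
w = np.zeros(2)

g = H @ w - b                   # gradient at the current w
w = w - np.linalg.solve(H, g)   # Newton step: w <- w - H^{-1} g

print(np.allclose(w, np.linalg.solve(H, b)))  # True: converged in one step
```

Solving that linear system is the expensive part (the Hessian is n x n), which is exactly why quasi-Newton methods like L-BFGS approximate the inverse Hessian instead.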

Pages 171-175: Multiclass Classification

AdaGrad

‣ Optimized for problems with sparse features

‣ Per-parameter ("adaptive") learning rate: smaller updates are made to parameters that get updated frequently, by scaling each gradient dimension by its historical sum of squares

w_i \leftarrow w_i + \alpha \frac{1}{\sqrt{\epsilon + \sum_{\tau=1}^{t} g_{\tau,i}^2}} \, g_{t,i}

(the denominator is a smoothed sum of squared gradients from all updates)

‣ Generally more robust than SGD, requires less tuning of the learning rate

‣ Other techniques for optimizing deep models — more later!

Duchi et al., "Adaptive subgradient methods for online learning and stochastic optimization", JMLR 2011
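A minimal AdaGrad sketch matching the update above (written as ascent, as in the slides); the function name and toy values are illustrative:

```python
import numpy as np

def adagrad_update(w, sumsq, g, alpha=0.1, eps=1e-8):
    """One AdaGrad step. sumsq accumulates the per-parameter sum of
    squared gradients, so frequently-updated parameters get smaller
    and smaller effective step sizes."""
    sumsq += g ** 2
    w += alpha * g / np.sqrt(eps + sumsq)
    return w, sumsq

# Toy usage: a sparse gradient only touches a few coordinates, and only
# those coordinates' learning rates shrink.
w = np.zeros(5)
sumsq = np.zeros(5)
g = np.array([0.0, 1.0, 0.0, 0.5, 0.0])
w, sumsq = adagrad_update(w, sumsq, g)
```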

Page 176: Multiclass Classification

Summary

Pages 177-180: Multiclass Classification

Summary

‣ Design tradeoffs need to reflect interactions:

‣ Model and objective are coupled: probabilistic model <-> maximize likelihood

‣ …but not always: a linear model or neural network can be trained to minimize any differentiable loss function

‣ Inference governs what learning is practical: you need to be able to compute expectations over labels to use logistic regression