1. RECAP 2 Parallel NB Training Documents/labels Documents/labels – 1 Documents/labels – 2...

55
1

Transcript of 1. RECAP 2 Parallel NB Training Documents/labels Documents/labels – 1 Documents/labels – 2...

Page 1: 1. RECAP 2 Parallel NB Training Documents/labels Documents/labels – 1 Documents/labels – 2 Documents/labels – 3 counts DFs Split into documents subsets.

1

Page 2: 1. RECAP 2 Parallel NB Training Documents/labels Documents/labels – 1 Documents/labels – 2 Documents/labels – 3 counts DFs Split into documents subsets.

2

RECAP

Page 3: 1. RECAP 2 Parallel NB Training Documents/labels Documents/labels – 1 Documents/labels – 2 Documents/labels – 3 counts DFs Split into documents subsets.

3

Parallel NB Training

Documents/labels

Documents/labels – 1

Documents/labels – 2

Documents/labels – 3

counts counts counts

DFs Split into documents subsets

Sort and add vectors

Compute partial counts

counts

Key Points: • The “full” event counts are a sum of the “local” counts• Easy to combine independently computed local counts

Page 4: 1. RECAP 2 Parallel NB Training Documents/labels Documents/labels – 1 Documents/labels – 2 Documents/labels – 3 counts DFs Split into documents subsets.

4

Parallel Rocchio

Documents/labels

Documents/labels – 1

Documents/labels – 2

Documents/labels – 3

v-1 v-2 v-3

DFs Split into documents subsets

Sort and add vectors

Compute partial v(y)’s

v(y)’s

Key Points: • We need shared read access to DF’s, but not write access.• The “full classifier” is a weighted average of the “local” classifiers –

still easy!

Page 5: 1. RECAP 2 Parallel NB Training Documents/labels Documents/labels – 1 Documents/labels – 2 Documents/labels – 3 counts DFs Split into documents subsets.

5

Parallel Perceptron Learning?

Documents/labels

Documents/labels – 1

Documents/labels – 2

Documents/labels – 3

v-1 v-2 v-3

Classifier Split into documents subsets

v(y)’s

Like DFs or event counts, size is O(|V|)

Key Points: • The “full classifier” is a weighted average of the “local”

classifiers• Obvious solution requires read/write access to a shared

classifier.

Page 6: 1. RECAP 2 Parallel NB Training Documents/labels Documents/labels – 1 Documents/labels – 2 Documents/labels – 3 counts DFs Split into documents subsets.

6

Parallel Streaming Learning

Documents/labels

Documents/labels – 1

Documents/labels – 2

Documents/labels – 3

v-1 v-2 v-3

Classifier Split into documents subsets

v(y)’s

Key Point: We need shared write access to the classifier – not just read access. So we only need to not copy the information but synchronize it. Question: How much extra communication is there?Answer: Depends on how the learner behaves……how many weights get updated with each example … (in Naïve Bayes and Rocchio, only weights for features with non-zero weight in x are updated when scanning x)…how often it needs to update weight … (how many mistakes it makes)

Like DFs or event counts, size is O(|V|)

Page 7: 1. RECAP 2 Parallel NB Training Documents/labels Documents/labels – 1 Documents/labels – 2 Documents/labels – 3 counts DFs Split into documents subsets.

7

The perceptron game

A Binstance xi Compute: yi = sign(vk . xi )

^

yi

^

yi

If mistake: vk+1 = vk + yi xi

x is a vectory is -1 or +1

Page 8: 1. RECAP 2 Parallel NB Training Documents/labels Documents/labels – 1 Documents/labels – 2 Documents/labels – 3 counts DFs Split into documents subsets.

8

The perceptron game

A Binstance xi Compute: yi = sign(vk . xi )

^

yi

^

yi

If mistake: vk+1 = vk + yi xi

x is a vectory is -1 or +1

u

-u

v1

+x2

v2

Depends on how easy the learning problem is, not dimension of vectors x

Fairly intuitive:• “Similarity” of v to u looks like

(v.u)/|v.v|• (v.u) grows by >= γafter

mistake• (v.v) grows by <= R2

Page 9: 1. RECAP 2 Parallel NB Training Documents/labels Documents/labels – 1 Documents/labels – 2 Documents/labels – 3 counts DFs Split into documents subsets.

9

The Voted Perceptron for Ranking and Structured Classification

Page 10: 1. RECAP 2 Parallel NB Training Documents/labels Documents/labels – 1 Documents/labels – 2 Documents/labels – 3 counts DFs Split into documents subsets.

10

The voted perceptron for ranking

A Binstances x1 x2 x3 x4… Compute: yi = vk . xi

Return: the index b* of the “best” xi

^

b*

bIf mistake: vk+1 = vk + xb - xb*

Page 11: 1. RECAP 2 Parallel NB Training Documents/labels Documents/labels – 1 Documents/labels – 2 Documents/labels – 3 counts DFs Split into documents subsets.

11

u

-u

xx

x

x

x

γ

Ranking some x’s with the target vector u

Page 12: 1. RECAP 2 Parallel NB Training Documents/labels Documents/labels – 1 Documents/labels – 2 Documents/labels – 3 counts DFs Split into documents subsets.

12

u

-u

xx

x

x

x

γ

v

Ranking some x’s with some guess vector v – part 1

Page 13: 1. RECAP 2 Parallel NB Training Documents/labels Documents/labels – 1 Documents/labels – 2 Documents/labels – 3 counts DFs Split into documents subsets.

13

u

-u

xx

x

x

x

v

Ranking some x’s with some guess vector v – part 2.

The purple-circled x is xb* - the one the learner has chosen to rank highest. The green circled x is xb, the right answer.

Page 14: 1. RECAP 2 Parallel NB Training Documents/labels Documents/labels – 1 Documents/labels – 2 Documents/labels – 3 counts DFs Split into documents subsets.

14

u

-u

xx

x

x

x

v

Correcting v by adding xb – xb*

Page 15: 1. RECAP 2 Parallel NB Training Documents/labels Documents/labels – 1 Documents/labels – 2 Documents/labels – 3 counts DFs Split into documents subsets.

15

xx

x

x

x

vk

Vk+1

Correcting v by adding xb – xb*

(part 2)

Page 16: 1. RECAP 2 Parallel NB Training Documents/labels Documents/labels – 1 Documents/labels – 2 Documents/labels – 3 counts DFs Split into documents subsets.

16

(3a) The guess v2 after the two positive examples: v2=v1+x2

u

-u

v1

+x2

v2

u

-u

u

-u

xx

x

x

x

v

Page 17: 1. RECAP 2 Parallel NB Training Documents/labels Documents/labels – 1 Documents/labels – 2 Documents/labels – 3 counts DFs Split into documents subsets.

17

u

-u

v1

+x2

v2

(3a) The guess v2 after the two positive examples: v2=v1+x2

3

u

-u

u

-u

xx

x

x

x

v

Page 18: 1. RECAP 2 Parallel NB Training Documents/labels Documents/labels – 1 Documents/labels – 2 Documents/labels – 3 counts DFs Split into documents subsets.

18

u

-u

v1

+x2

v2

(3a) The guess v2 after the two positive examples: v2=v1+x2

3

u

-u

u

-u

xx

x

x

x

v

Page 19: 1. RECAP 2 Parallel NB Training Documents/labels Documents/labels – 1 Documents/labels – 2 Documents/labels – 3 counts DFs Split into documents subsets.

19

u

-u

v1

+x2

v2

(3a) The guess v2 after the two positive examples: v2=v1+x2

u

-u

u

-u

xx

x

x

x

v

Notice this doesn’t depend at all on the number of x’s being ranked

Neither proof depends on the dimension of the x’s.

2

2

Page 20: 1. RECAP 2 Parallel NB Training Documents/labels Documents/labels – 1 Documents/labels – 2 Documents/labels – 3 counts DFs Split into documents subsets.

20

Ranking perceptrons structured perceptrons

• The API:– A sends B a (maybe

huge) set of items to rank

– B finds the single best one according to the current weight vector

– A tells B which one was actually best

• Structured classification on a sequence– Input: list of words:

x=(w1,…,wn)

– Output: list of labels: y=(y1,…,yn)

– If there are K classes, there are Kn labels possible for x

Page 21: 1. RECAP 2 Parallel NB Training Documents/labels Documents/labels – 1 Documents/labels – 2 Documents/labels – 3 counts DFs Split into documents subsets.

21

Ranking perceptrons structured perceptrons

• Suppose we can1. Given x and y, form a feature vector F(x,y)– Then we can score

x,y using a weight vector: w.F(x,y)

2. Given x, find the top-scoring y, with respect to w.F(x,y)

Then we can learn….

• Structured classification on a sequence– Input: list of words:

x=(w1,…,wn)

– Output: list of labels: y=(y1,…,yn)

– If there are K classes, there are Kn labels possible for x

Page 22: 1. RECAP 2 Parallel NB Training Documents/labels Documents/labels – 1 Documents/labels – 2 Documents/labels – 3 counts DFs Split into documents subsets.

22

Example Structure classification problem: segmentation or NER

– Example: Addresses, bib records– Problem: some DBs may split records up differently (eg

no “mail stop” field, combine address and apt #, …) or not at all

– Solution: Learn to segment textual form of records

P.P.Wangikar, T.P. Graycar, D.A. Estell, D.S. Clark, J.S. Dordick (1993) Protein and Solvent Engineering of Subtilising BPN' in Nearly Anhydrous Organic Media J.Amer. Chem. Soc. 115, 12231-12237.

Author Year Title JournalVolume Page

When will prof Cohen post the notes …

Page 23: 1. RECAP 2 Parallel NB Training Documents/labels Documents/labels – 1 Documents/labels – 2 Documents/labels – 3 counts DFs Split into documents subsets.

23

Converting segmentation to a feature set (Begin,In,Out)

When will prof Cohen post the notes …

Idea 1: features are properties of two adjacent tokens, and the pair of labels assigned to them.

• (y(i)==B or y(i)==I) and (token(i) is capitalized)

• (y(i)==I and y(i-1)==B) and (token(i) is hyphenated)

• (y(i)==B and y(i-1)==B)

• eg “tell Rahul William is on the way”

Idea 2: construct a graph where each path is a possible sequence labeling.

Page 24: 1. RECAP 2 Parallel NB Training Documents/labels Documents/labels – 1 Documents/labels – 2 Documents/labels – 3 counts DFs Split into documents subsets.

24

Find the top-scoring y

B

I

O

B

I

O

B

I

O

B

I

O

B

I

O

B

I

O

B

I

O

When will prof Cohen post the notes …

•Inference: find the highest-weight path•This can be done efficiently using dynamic programming (Viterbi)

Page 25: 1. RECAP 2 Parallel NB Training Documents/labels Documents/labels – 1 Documents/labels – 2 Documents/labels – 3 counts DFs Split into documents subsets.

25

Ranking perceptrons structured perceptrons

• New API:– A sends B the word

sequence x– B finds the single best

y according to the current weight vector using Viterbi

– A tells B which y was actually best

– This is equivalent to ranking pairs g=(x,y’) based on w.F(x,y)

• Structured classification on a sequence– Input: list of words:

x=(w1,…,wn)

– Output: list of labels: y=(y1,…,yn)

– If there are K classes, there are Kn labels possible for x

Page 26: 1. RECAP 2 Parallel NB Training Documents/labels Documents/labels – 1 Documents/labels – 2 Documents/labels – 3 counts DFs Split into documents subsets.

26

The voted perceptron for NER

A Binstances g1 g2 g3 g4… Compute: yi = vk . gi

Return: the index b* of the “best” gi

^

b* bIf mistake: vk+1 = vk + gb - gb*

1. A sends B feature functions, and instructions for creating the instances g:

• A sends a word vector xi. Then B could create the instances g1

=F(xi,y1), g2= F(xi,y2), …

• but instead B just returns the y* that gives the best score for the dot product vk . F(xi,y*) by using Viterbi.

2. A sends B the correct label sequence yi.

3. On errors, B sets vk+1 = vk + gb - gb* = vk + F(xi,y) - F(xi,y*)

Page 27: 1. RECAP 2 Parallel NB Training Documents/labels Documents/labels – 1 Documents/labels – 2 Documents/labels – 3 counts DFs Split into documents subsets.

27

EMNLP 2002

Page 28: 1. RECAP 2 Parallel NB Training Documents/labels Documents/labels – 1 Documents/labels – 2 Documents/labels – 3 counts DFs Split into documents subsets.

28

Some background…• Collins’ parser: generative model…• …New Ranking Algorithms for Parsing and Tagging: Kernels over Discrete

Structures, and the Voted Perceptron, Collins and Duffy, ACL 2002.• …Ranking Algorithms for Named-Entity Extraction: Boosting and the Voted

Perceptron, Collins, ACL 2002.– Propose entities using a MaxEnt tagger (as in MXPOST)– Use beam search to get multiple taggings for each document (20)– Learn to rerank the candidates to push correct ones to the top, using some

new candidate-specific features:• Value of the “whole entity” (e.g., “Professor_Cohen”)• Capitalization features for the whole entity (e.g., “Xx+_Xx+”)• Last word in entity, and capitalization features of last word• Bigrams/Trigrams of words and capitalization features before and after

the entity

Page 29: 1. RECAP 2 Parallel NB Training Documents/labels Documents/labels – 1 Documents/labels – 2 Documents/labels – 3 counts DFs Split into documents subsets.

29

Some background…

Page 30: 1. RECAP 2 Parallel NB Training Documents/labels Documents/labels – 1 Documents/labels – 2 Documents/labels – 3 counts DFs Split into documents subsets.

30

Collins’ Experiments

• POS tagging • NP Chunking (words and POS tags from Brill’s

tagger as features) and BIO output tags• Compared Maxent Tagging/MEMM’s (with iterative

scaling) and “Voted Perceptron trained HMM’s”– With and w/o averaging– With and w/o feature selection (count>5)

Page 31: 1. RECAP 2 Parallel NB Training Documents/labels Documents/labels – 1 Documents/labels – 2 Documents/labels – 3 counts DFs Split into documents subsets.

31

Collins’ results

Page 32: 1. RECAP 2 Parallel NB Training Documents/labels Documents/labels – 1 Documents/labels – 2 Documents/labels – 3 counts DFs Split into documents subsets.

32

STRUCTURED PERCEPTRONS…

Page 33: 1. RECAP 2 Parallel NB Training Documents/labels Documents/labels – 1 Documents/labels – 2 Documents/labels – 3 counts DFs Split into documents subsets.

33

NAACL 2010

Page 34: 1. RECAP 2 Parallel NB Training Documents/labels Documents/labels – 1 Documents/labels – 2 Documents/labels – 3 counts DFs Split into documents subsets.

35

Parallel Structured Perceptrons• Simplest idea:

– Split data into S “shards”– Train a perceptron on each shard independently

• weight vectors are w(1) , w(2) , …

– Produce some weighted average of the w(i)‘s as the final result

Page 35: 1. RECAP 2 Parallel NB Training Documents/labels Documents/labels – 1 Documents/labels – 2 Documents/labels – 3 counts DFs Split into documents subsets.

36

Parallelizing perceptrons

Instances/labels

Instances/labels – 1 Instances/labels – 2 Instances/labels – 3

vk -1 vk- 2 vk-3

vk

Split into example subsets

Combine by some sort of

weighted averaging

Compute vk’s on subsets

Page 36: 1. RECAP 2 Parallel NB Training Documents/labels Documents/labels – 1 Documents/labels – 2 Documents/labels – 3 counts DFs Split into documents subsets.

37

Parallel Perceptrons• Simplest idea:

– Split data into S “shards”– Train a perceptron on each shard independently

• weight vectors are w(1) , w(2) , …

– Produce some weighted average of the w(i)‘s as the final result

• Theorem: this doesn’t always work.• Proof: by constructing an example where you can converge on every

shard, and still have the averaged vector not separate the full training set – no matter how you average the components.

Page 37: 1. RECAP 2 Parallel NB Training Documents/labels Documents/labels – 1 Documents/labels – 2 Documents/labels – 3 counts DFs Split into documents subsets.

38

Parallel Perceptrons – take 2

Idea: do the simplest possible thing iteratively.

• Split the data into shards• Let w = 0• For n=1,…

• Train a perceptron on each shard with one pass starting with w

• Average the weight vectors (somehow) and let w be that average

Extra communication cost: • redistributing the weight vectors• done less frequently than if fully synchronized, more frequently than if fully parallelized

All-Reduce

Page 38: 1. RECAP 2 Parallel NB Training Documents/labels Documents/labels – 1 Documents/labels – 2 Documents/labels – 3 counts DFs Split into documents subsets.

39

Parallelizing perceptrons – take 2

Instances/labels

Instances/labels – 1

Instances/labels – 2

Instances/labels – 3

w -1 w- 2 w-3

w

Split into example subsets

Combine by some sort of

weighted averaging

Compute local vk’s

w (previous)

Page 39: 1. RECAP 2 Parallel NB Training Documents/labels Documents/labels – 1 Documents/labels – 2 Documents/labels – 3 counts DFs Split into documents subsets.

40

A theorem

Corollary: if we weight the vectors uniformly, then the number of mistakes is still bounded.

I.e., this is “enough communication” to guarantee convergence.

Page 40: 1. RECAP 2 Parallel NB Training Documents/labels Documents/labels – 1 Documents/labels – 2 Documents/labels – 3 counts DFs Split into documents subsets.

41

What we know and don’t know

uniform mixing…μ=1/S

could we lose our speedup-from-parallelizing to slower convergence?

CP speedup by factor of S is cancelled by slower convergence by factor of S

Page 41: 1. RECAP 2 Parallel NB Training Documents/labels Documents/labels – 1 Documents/labels – 2 Documents/labels – 3 counts DFs Split into documents subsets.

42

Results on NER

perceptron Averaged perceptron

Page 42: 1. RECAP 2 Parallel NB Training Documents/labels Documents/labels – 1 Documents/labels – 2 Documents/labels – 3 counts DFs Split into documents subsets.

43

Results on parsing

perceptron Averaged perceptron

Page 43: 1. RECAP 2 Parallel NB Training Documents/labels Documents/labels – 1 Documents/labels – 2 Documents/labels – 3 counts DFs Split into documents subsets.

44

The theorem…

Shard iIteration n

This is not new….

Page 44: 1. RECAP 2 Parallel NB Training Documents/labels Documents/labels – 1 Documents/labels – 2 Documents/labels – 3 counts DFs Split into documents subsets.

45

u

-u

u

-u

v1

+x2

v2

+x1v1

-x2

v2

(3a) The guess v2 after the two positive examples: v2=v1+x2

(3b) The guess v2 after the one positive and one negative example: v2=v1-x2

If mistake: vk+1 = vk + yi xi

Page 45: 1. RECAP 2 Parallel NB Training Documents/labels Documents/labels – 1 Documents/labels – 2 Documents/labels – 3 counts DFs Split into documents subsets.

46

The theorem…

This is not new….

Page 46: 1. RECAP 2 Parallel NB Training Documents/labels Documents/labels – 1 Documents/labels – 2 Documents/labels – 3 counts DFs Split into documents subsets.

47

u

-u

u

-u

v1

+x2

v2

+x1v1

-x2

v2

(3a) The guess v2 after the two positive examples: v2=v1+x2

(3b) The guess v2 after the one positive and one negative example: v2=v1-x2

If mistake: yi xi vk < 0

2

2

2

Page 47: 1. RECAP 2 Parallel NB Training Documents/labels Documents/labels – 1 Documents/labels – 2 Documents/labels – 3 counts DFs Split into documents subsets.

48This is new …. We’ve never considered averaging operations before

Follows from: u.w(i,1) >= k1,i γ

Page 48: 1. RECAP 2 Parallel NB Training Documents/labels Documents/labels – 1 Documents/labels – 2 Documents/labels – 3 counts DFs Split into documents subsets.

49

IH1 inductive case:

γ

IH1

From A1

Distribute

μ’s are distribution

Page 49: 1. RECAP 2 Parallel NB Training Documents/labels Documents/labels – 1 Documents/labels – 2 Documents/labels – 3 counts DFs Split into documents subsets.

50

IH2 proof is similar

IH1, IH2 together imply the bound (as in the usual perceptron case)

Page 50: 1. RECAP 2 Parallel NB Training Documents/labels Documents/labels – 1 Documents/labels – 2 Documents/labels – 3 counts DFs Split into documents subsets.

51

Review/outline

• Streaming learning algorithms … and beyond– Naïve Bayes– Rocchio’s algorithm

• Similarities & differences– Probabilistic vs vector space models– Computationally similar– Parallelizing Naïve Bayes and Rocchio

• Alternative:– Adding up contributions for every example vs

conservatively updating a linear classifier– On-line learning model: mistake-bounds

• some theory• a mistake bound for perceptron

– Parallelizing the perceptron

Page 51: 1. RECAP 2 Parallel NB Training Documents/labels Documents/labels – 1 Documents/labels – 2 Documents/labels – 3 counts DFs Split into documents subsets.

52

What we know and don’t know

uniform mixing…

could we lose our speedup-from-parallelizing to slower convergence?

Page 52: 1. RECAP 2 Parallel NB Training Documents/labels Documents/labels – 1 Documents/labels – 2 Documents/labels – 3 counts DFs Split into documents subsets.

53

What we know and don’t know

Page 53: 1. RECAP 2 Parallel NB Training Documents/labels Documents/labels – 1 Documents/labels – 2 Documents/labels – 3 counts DFs Split into documents subsets.

54

What we know and don’t know

Page 54: 1. RECAP 2 Parallel NB Training Documents/labels Documents/labels – 1 Documents/labels – 2 Documents/labels – 3 counts DFs Split into documents subsets.

55

What we know and don’t know

Page 55: 1. RECAP 2 Parallel NB Training Documents/labels Documents/labels – 1 Documents/labels – 2 Documents/labels – 3 counts DFs Split into documents subsets.

56

Review/outline

• Streaming learning algorithms … and beyond– Naïve Bayes– Rocchio’s algorithm

• Similarities & differences– Probabilistic vs vector space models– Computationally similar– Parallelizing Naïve Bayes and Rocchio

• Alternative:– Adding up contributions for every example vs

conservatively updating a linear classifier– On-line learning model: mistake-bounds

• some theory• a mistake bound for perceptron

– Parallelizing the perceptron