It is the best of times (and the worst of times)


Page 1: It is the best of times (and the worst of times)

It is the best of times (and the worst of times)

Kenneth Church, Microsoft

[email protected]

Page 2

July 25, 2004 EMNLP-2004 & Senseval-2004 2

Wow! (What a difference a decade makes)

• Empiricism has come of age
  – Radical Fringe → Mainstream
• 1993: Workshop on Very Large Corpora (WVLC)
  – Intended to be a 1-time event
  – But so successful that it evolved into a series of EMNLP conferences
• EMNLP-2004 received so many submissions that the program committee had to be expanded at the last minute
  – Success/Catastrophe

[Chart: ACL Meeting, % Statistical Papers, 1985-2005; annotated with Bob Moore and Fred Jelinek]

Lonely → Preaching to Choir

Interesting & Controversial

Responsibility; Attribute Dangerous Positions to Others

Page 3

The Structure of Scientific Revolutions (1962) – Kuhn (p. 10)

• Paradigms
  – Examples from Physics
    • Aristotle’s Physica
    • Ptolemy’s Almagest
    • Newton’s Principia and Optics
    • Franklin’s Electricity
    • Lavoisier’s Chemistry
    • Lyell’s Geology
• Two characteristics:
  1. Sufficiently unprecedented to attract an enduring group of adherents from competing modes of scientific activity
  2. Simultaneously, sufficiently open-ended to leave all sorts of problems for the redefined group of practitioners to resolve

Page 4

Organizational Innovations (Radical → Mainstream)

• Late Submission Deadline
  – Immediately after ACL notifications
• ACL was rejecting good papers for bad reasons
  – Short review cycles → Freshness
• Invest in the Future: Encourage Innovation
  – Chair (Energetic, Promising, Source of new ideas)
  – Co-chair (Established, Knows how it has been done)
• Avoid incremental papers
  – Reviewers prefer boring papers over radical ones
  – Reviewers do what reviewers do; chairs → correction
• Inclusiveness: Diversity → Growth (Sales)
  – Thankless chores → Marketing carrots
  – 1/3 promising, 1/3 stability, 1/3 outreach
  – Hold conferences in Europe, Asia & America

Innovation

Checks & Balances

Short term ≠ Long term

Page 5

What Worked and What Didn’t?

• Stay on message: It is data, stupid!
  – WVLC (Very Large) >> EMNLP (Empirical Methods)
  – If you have a lot of data,
    • Then you don’t need a lot of methodology
• Empiricism means different things to different people
  1. Machine Learning (Self-organizing Methods)
  2. Exploratory Data Analysis (EDA)
  3. Corpus-Based Lexicography
  – Lots of papers on 1
  – EMNLP-2004 theme (error analysis) → 2
  – Senseval grew out of 3

Kucera & Francis gave great invited talk

(but they couldn’t follow submitted talks)

Data

Methodology

Page 6

Word Sense Disambiguation (WSD) History

• Bar-Hillel (1960):
  – Abandoned Machine Translation (MT)
  – Couldn’t see how to make progress on WSD (pen)
  – Can’t translate without disambiguating
    • bank (money) → banque
    • bank (river) → banc
• 1990s
  – Parallel text ≈ Labeled corpus for supervised training and testing
  – Isn’t it great the translators have WSD labeled all this data for us!
• Yarowsky:
  – Parallel corpus → encyclopedia + thesaurus
  – Bilingual ≠ Monolingual
    • interest
    • wear
  – ML: Co-training
    • Supervised → Unsupervised
• Lexicography: Hector
  – Joint collaboration: Oxford University Press & DEC
  – flagging / flogging
• Senseval
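The 1990s point above says parallel text can stand in for a sense-labeled corpus. That idea fits in a few lines of code. This is a minimal illustration that assumes word alignments are already available. The aligned triples and the `SENSE_OF` table are invented for the example; they are not drawn from any real corpus or from the work named on the slide.

```python
from collections import Counter

# Hypothetical mapping from the French translation of "bank" to an English sense label.
SENSE_OF = {"banque": "bank/money", "banc": "bank/river"}

def label_from_alignments(aligned_pairs, target="bank"):
    """Turn (english_word, french_word, context) alignments into sense-labeled examples.

    An alignment becomes a (context, sense) example when its English side is
    `target` and its French side appears in SENSE_OF; all others are skipped.
    """
    examples = []
    for english, french, context in aligned_pairs:
        if english == target and french in SENSE_OF:
            examples.append((context, SENSE_OF[french]))
    return examples

# Toy word alignments, invented for illustration (not real corpus data).
aligned = [
    ("bank", "banque", "deposit money at the bank"),
    ("bank", "banc", "fish near the bank of the river"),
    ("bank", "banque", "the bank raised interest rates"),
    ("river", "rivière", "the river flooded"),
]

labeled = label_from_alignments(aligned)
print(Counter(sense for _, sense in labeled))
```

The translators' word choices act as free sense labels. A supervised classifier can then be trained on the contexts.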

Page 7

A Road Rarely Taken: Tukey’s Exploratory Data Analysis (EDA)

• Linear Regression
  – Standard practice:
    • Plug data into off-the-shelf package
    • Publish (if “significant”)
  – Better:
    • Check for outliers
    • Bowed residuals
      – Evidence of a positive or negative second derivative (curvature)
    • Deviations from assumptions (normality)
      – Fanout
• Slocum’s Thesis (1981)
  – “Proof” that CKY takes linear time

[Two scatter plots: Time (0 to 50,000) vs. Sentence Length (0 to 30)]

Standard texts (e.g., Aho)… consider … worst case… This assumption clearly fails to apply to natural language… Our experiments have shown that average-case time performance… is approximately linear (p. 102)

No Result
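The “Better” practice above is to look at the residuals instead of just publishing the fit. Synthetic data makes the point.

This sketch assumes cost grows like n³, the textbook worst case for CKY, over sentence lengths 1 to 30. It does not use Slocum’s measurements. A straight line fits the cubic data deceptively well, but the residuals are bowed: positive at both ends and negative in the middle. The helper `looks_bowed` is a hypothetical, deliberately crude check, not a standard statistical test.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def residuals(xs, ys):
    """Observed minus fitted values under the least-squares line."""
    intercept, slope = fit_line(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

def looks_bowed(res, tol=1e-9):
    """Crude EDA check: both ends on one side of zero, the middle on the other."""
    first, middle, last = res[0], res[len(res) // 2], res[-1]
    convex = first > tol and last > tol and middle < -tol
    concave = first < -tol and last < -tol and middle > tol
    return convex or concave

lengths = list(range(1, 31))
cubic_times = [n ** 3 for n in lengths]        # hypothetical worst-case cost ~ n^3
linear_times = [50 * n + 7 for n in lengths]   # a genuinely linear relationship

print(looks_bowed(residuals(lengths, cubic_times)))   # True
print(looks_bowed(residuals(lengths, linear_times)))  # False
```

A good-looking fit is not evidence of linearity. The bow in the residuals is what gives the cubic away.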

Page 8

Many Machine Learning (ML) Techniques (SVMs, Perceptrons) are Similar to (Logistic) Regression;

Rarely see EDA (Robust Statistical) Methods in ML

The Elements of Statistical Learning – Hastie, Tibshirani, Friedman (2001), p. 380

Page 9

Historical Context

• 1950s:
  – Rigorous methodology
    • Information theory
    • Behaviorism
• Unfulfilled unrealistic expectations (video)
  – ALPAC report
  – Whither Speech Recognition?
• 1970s:
  – Let it all hang out
    • Artificial Intelligence
    • Cognitive Psychology
• 1990s:
  – Revival of empiricism

[Chart: ACL Meeting, % Statistical Papers, 1985-2005; annotated with Bob Moore and Fred Jelinek]

Empiricists feel lonely

Rationalists feel lonely

Kuhn Crisis

Kuhn Crisis

Page 10

…ASR is attractive to money. The attraction is perhaps similar to the attraction of schemes for turning water into gasoline, extracting gold from the sea, or going to the moon.

Most recognizers behave not like scientists, but like mad inventors or untrustworthy engineers.

…performance will continue to be very limited unless the recognizing device understands what is being said with something of the facility of a native speaker (that is, better than a foreigner fluent in the language)

Any application of the foregoing discussion to work in the general area of pattern recognition is left as an exercise for the reader.

“Whither Speech Recognition?” Pierce, JASA 1969

Borrowed Slide: Jelinek (LREC)

Also, ALPAC (chair) & Bell Labs exec

Page 11

ALPAC (1966): the (in)famous report

John Hutchins

• The best known event in the history of MT is …
  – Automatic Language Processing Advisory Committee (ALPAC)
• Its effect was to bring to an end the substantial funding of MT research in US for some twenty years.
  – More significantly was the clear message to the general public and the rest of the scientific community that MT was hopeless.
  – For years afterwards, an interest in MT was something to keep quiet about; it was almost shameful.
  – To this day, the 'failure' of MT is still repeated by many as an indisputable fact.
• The impact of ALPAC is undeniable
  – While the fame or notoriety of ALPAC is familiar,
  – What the report actually said is now becoming less familiar and often forgotten or misunderstood…

Page 12

ALPAC Recommendations

The committee recommends expenditures in two distinct areas:

• Computational linguistics as part of linguistics
  – Studies of parsing, generation… including experiments in translation…
  – Linguistics should be supported as science,
    • and should not be judged by any immediate or foreseeable contribution to practical translation
• Improvement of translation:
  1. practical methods for evaluation of translations;
  2. means for speeding up the human translation process;
  3. evaluation of quality and cost of various sources of translations;
  4. investigation of the utilization of translations, to guard against production of translations that are never read;
  5. study of delays in the over-all translation process, and means for eliminating them, both in journals and in individual items;
  6. evaluation of the relative speed and cost of various sorts of machine-aided translation;
  7. adaptation of existing mechanized editing and production processes in translation;
  8. the over-all translation process; and
  9. production of adequate reference works for the translator, including the adaptation of glossaries that now exist primarily for automatic dictionary look-up in machine translation

Theory

Practice

Page 13

Outline

• We’re making consistent progress, or
• We’re running around in circles, or
– Don’t worry; be happy
• We’re going off a cliff…

Best of Times

Page 14

Where have we been and where are we going? Moore’s Law: Ideal Answer

Moores: Bob ≠ Gordon ≠ Roger

Page 15

[Chart: Error Rate vs. Date (15 years)]

Moore’s Law Time Constant:
• 10x improvement per decade

Borrowed Slide: Audrey Le (NIST)
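The time constant above converts to a per-year rate with one line of arithmetic. A 10x improvement per decade means the error rate is multiplied by 10^(-1/10) ≈ 0.794 each year, which is about a 21% relative reduction.

The sketch below assumes the trend simply continues. The starting error rate of 40% is a hypothetical value, not a number read off the chart.

```python
def annual_reduction(factor_per_decade=10.0):
    """Fraction by which the error rate shrinks each year under a constant time constant."""
    return 1.0 - factor_per_decade ** (-1.0 / 10.0)

def projected_error_rate(current_rate, years, factor_per_decade=10.0):
    """Extrapolate an error rate that falls by `factor_per_decade` every ten years."""
    return current_rate / factor_per_decade ** (years / 10.0)

start = 0.40  # hypothetical starting error rate (40%)
print(round(annual_reduction(), 4))
for years in (5, 10, 15):
    print(years, round(projected_error_rate(start, years), 4))
```

Such extrapolation is only as good as the assumption that progress stays on the same exponential curve.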

Page 16

Charles Wayne’s Challenge: Demonstrate Consistent Progress Over Time

• Controversial in 1980s
  – But not in 1990s
  – Though, grumbling
• Benefits
  1. Agreement on what to do
  2. Limits endless discussion
  3. Helps sell the field
    • Manage expectations
    • Fund raising
• Risks (similar to benefits)
  1. All our eggs are in one basket (lack of diversity)
  2. Not enough discussion
    • Hard to change course
  3. Methodology Burden

Managing Expectations

Page 17

Hockey Stick Business Case

[Chart: $ vs. t, 2003 to 2005; labels: Last Year, This Year, Next Year]

Page 18

Where have we been and where are we going? Consistent Progress over Time

[Chart: $ vs. t, 2003 to 2005; regions labeled “Extrapolation/Prediction is Applicable” and “Extrapolation/Prediction is Not Applicable”]

Manage Expectations

Page 19

When will we see the last non-statistical paper? 2010?

[Chart: ACL Meeting, % Statistical Papers, 1985-2005; annotated with Bob Moore and Fred Jelinek]

Page 20

Top Ten Metrics of Success
1. Value Creation (Reality)
2. Stock Prices (Belief)
3. Startup Companies Raise Venture Capital (Excitement)
4. Prototype Applications (Plausibility)
5. Grand-Students (Survive the Test of Time)
6. Students Get Good Jobs
7. Students Finish PhD Theses
8. Citations
9. Conference Registrations
10. Publications (Quantity)

We are here

Senseval wants to be here

Speech

Search

Page 21

Outline

• We’re making consistent progress, or
• We’re running around in circles, or
– Don’t worry; be happy
• We’re going off a cliff…

Best of Times (Not!)

Been there; Done that

Page 22

It has been claimed that
Recent progress made possible by Empiricism

Progress (or Oscillating Fads)?
• 1950s: Empiricism was at its peak
  – Dominating a broad set of fields
    • Ranging from psychology (Behaviorism)
    • To electrical engineering (Information Theory)
  – Psycholinguistics: Word frequency norms (correlated with reaction time, errors)
    • Word association norms (priming): bread and butter, doctor / nurse
  – Linguistics/psycholinguistics: focus on distribution (correlate of meaning)
    • Firth: “You shall know a word by the company it keeps”
    • Collocations: Strong tea v. powerful computers
• 1970s: Rationalism was at its peak
  – with Chomsky’s criticism of ngrams in Syntactic Structures (1957)
  – and Minsky and Papert’s criticism of neural networks in Perceptrons (1969).
• 1990s: Revival of Empiricism
  – Availability of massive amounts of data (popular argument, even before the web)
    • “More data is better data”
    • Quantity >> Quality (balance)
  – Pragmatic focus:
    • What can we do with all this data?
    • Better to do something than nothing at all
  – Empirical methods (and focus on evaluation): Speech → Language
• 2010s: Revival of Rationalism (?)
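The Firth and collocation bullets above can be made concrete with pointwise mutual information (PMI), a standard association score for word pairs:

PMI(x, y) = log2 P(x, y) / (P(x) P(y))

This is a minimal sketch over adjacent words in an invented toy corpus. Real collocation work uses large corpora, windows wider than one word, and frequency cutoffs, because PMI is unreliable for rare pairs.

```python
import math
from collections import Counter

def pmi_table(tokens):
    """PMI for adjacent word pairs, with probabilities estimated from raw counts."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = len(tokens) - 1
    scores = {}
    for (x, y), c_xy in bigrams.items():
        p_xy = c_xy / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores

# Invented toy corpus: "strong tea" and "powerful computers" recur; "strong computers" is rare.
corpus = ("strong tea . strong tea . strong tea . "
          "powerful computers . powerful computers . powerful computers . "
          "strong computers").split()

scores = pmi_table(corpus)
for pair in [("strong", "tea"), ("powerful", "computers"), ("strong", "computers")]:
    print(pair, round(scores[pair], 2))
```

Pairs that co-occur more often than their individual frequencies predict get high scores. That is the distributional intuition behind “strong tea” versus “powerful computers.”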


Consistent progress?

• Periodic signals are continuous
• Support extrapolation/prediction
• Progress? Consistent progress?

Extrapolation/Prediction: Applicable?

Page 26

Speech → Language: Has the pendulum swung too far?

• What happened between TMI-1992 and TMI-2002 (if anything)?
• Have empirical methods become too popular?
  – Has too much happened since TMI-1992?
• I worry that the pendulum has swung so far that
  – We are no longer training students for the possibility
    • that the pendulum might swing the other way
• We ought to be preparing students with a broad education including:
  – Statistics and Machine Learning
  – as well as Linguistic Theory
• History repeats itself: Mark Twain; bad idea then and still a bad idea now
  – 1950s: empiricism
  – 1970s: rationalism (empiricist methodology became too burdensome)
  – 1990s: empiricism
  – 2010s: rationalism (empiricist methodology is burdensome, again)


Plays well at Machine Translation conferences


Grandparents and grandchildren have a natural alliance…

Page 30

| | Rationalism | Empiricism |
|---|---|---|
| Well-known advocates | Chomsky, Minsky | Shannon, Skinner, Firth, Harris |
| Model | Competence Model | Noisy Channel Model |
| Contexts of Interest | Phrase-Structure | N-Grams |
| Goals | All and Only; Explanatory; Theoretical | Minimize Prediction Error (Entropy); Descriptive; Applied |
| Linguistic Generalizations | Agreement & Wh-movement | Collocations & Word Associations |
| Parsing Strategies | Principle-Based, CKY (Chart), ATNs, Unification | Forward-Backward (HMMs), Inside-outside (PCFGs) |
| Applications | Understanding (Who did what to whom) | Recognition (Noisy Channel Applications) |
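The Noisy Channel Model in the Empiricism column picks the source word w that maximizes P(w) · P(observed | w). Below is a minimal sketch applied to spelling correction. The candidate words, the prior `PRIOR`, and the channel table `CHANNEL` are hypothetical numbers chosen for illustration, not estimates from any corpus.

```python
import math

# Hypothetical language-model priors P(w).
PRIOR = {"the": 0.05, "ten": 0.001, "tea": 0.0005}

# Hypothetical channel probabilities P(observed | w) for the typo "teh".
CHANNEL = {
    ("teh", "the"): 0.01,   # transposed letters
    ("teh", "ten"): 0.002,  # n typed as h
    ("teh", "tea"): 0.001,  # a typed as h
}

def decode(observed, candidates):
    """Noisy channel decoding: argmax over w of log P(w) + log P(observed | w)."""
    best, best_score = None, -math.inf
    for w in candidates:
        p_channel = CHANNEL.get((observed, w), 0.0)
        p_prior = PRIOR.get(w, 0.0)
        if p_channel == 0.0 or p_prior == 0.0:
            continue  # this candidate cannot explain the observation
        score = math.log(p_prior) + math.log(p_channel)
        if score > best_score:
            best, best_score = w, score
    return best

print(decode("teh", ["the", "ten", "tea"]))
```

Working in log space avoids underflow when many small probabilities are multiplied. That is the same reason speech recognizers score hypotheses with summed log probabilities.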

Page 31

Covering all the Bases
It is hard to make predictions (especially about the future)

• When will we see the last non-statistical paper?
  – 2010?
• Revival of rationalism:
  – 2010?

The answer to any question: 6 years!

Page 32

Outline

• We’re making consistent progress, or
• We’re running around in circles, or
– Don’t worry; be happy
• We’re going off a cliff…

Rising tide of data lifts all boats

No matter what happens, it’s goin’ be great!

Page 33

Rising Tide of Data Lifts All Boats
If you have a lot of data, then you don’t need a lot of methodology

• 1985: “There is no data like more data”
  – Fighting words uttered by radical fringe elements (Mercer at Arden House)
• 1993 Workshop on Very Large Corpora
  – Perfect timing: Just before the web
  – Couldn’t help but succeed
  – Fate
• 1995: The Web changes everything
• All you need is data (magic sauce)
  – No linguistics
  – No artificial intelligence (representation)
  – No machine learning
  – No statistics
  – No error analysis

Page 34

“It never pays to think until you’ve run out of data” – Eric Brill

Banko & Brill: Mitigating the Paucity-of-Data Problem (HLT 2001)

Fire everybody and spend the money on data

More data is better data!

No consistently best learner

Quoted out of context

Moore’s Law Constant: Data Collection Rates vs. Improvement Rates

Page 35

Benefit of Data

LIMSI: Lamel (2002) – Broadcast News

Supervised: transcripts
Lightly supervised: closed captions

[Chart: WER vs. hours]

Borrowed Slide: Jelinek (LREC)

Page 36

The rising tide of data will lift all boats!

TREC Question Answering & Google:

What is the highest point on Earth?

Page 37

The rising tide of data will lift all boats!

Acquiring Lexical Resources from Data: Dictionaries, Ontologies, WordNets, Language Models, etc.

http://labs1.google.com/sets

| England | Japan | Cat | cat |
|---|---|---|---|
| France | China | Dog | more |
| Germany | India | Horse | ls |
| Italy | Indonesia | Fish | rm |
| Ireland | Malaysia | Bird | mv |
| Spain | Korea | Rabbit | cd |
| Scotland | Taiwan | Cattle | cp |
| Belgium | Thailand | Rat | mkdir |
| Canada | Singapore | Livestock | man |
| Austria | Australia | Mouse | tail |
| Australia | Bangladesh | Human | pwd |

Page 38

Rising Tide of Data Lifts All Boats
If you have a lot of data, then you don’t need a lot of methodology

• More data → better results
  – TREC Question Answering
    • Remarkable performance: Google and not much else
      – Norvig (ACL-02)
      – AskMSR (SIGIR-02)
  – Lexical Acquisition
    • Google Sets
      – We tried similar things
        » but with tiny corpora
        » which we called large

Page 39

Applications
• What good is word sense disambiguation (WSD)?
  – Information Retrieval (IR)
    • Salton: Tried hard to find ways to use NLP to help IR
      – but failed to find much (if anything)
    • Croft: WSD doesn’t help because IR is already using those methods
    • Sanderson (next two slides)
  – Machine Translation (MT)
    • Original motivation for much of the work on WSD
    • But IR arguments may apply just as well to MT
• What good is POS tagging? Parsing? NLP? Speech?
• Commercial Applications of Natural Language Processing, CACM 1995
  – $100M opportunity (worthy of government/industry’s attention)
    1. Search (Lexis-Nexis)
    2. Word Processing (Microsoft)
• Warning: premature commercialization is risky

Don’t worry; Be happy

ALPAC

5 Ian Andersons

Page 40

Sanderson (SIGIR-94)
http://dis.shef.ac.uk/mark/cv/publications/papers/my_papers/SIGIR94.pdf

Not much?

• Could WSD help IR?
• Answer: no
  – Introducing ambiguity by pseudo-words doesn’t hurt (much)

Short queries matter most, but hardest for WSD

[Chart: F vs. Query Length (Words)]

5 Ian Andersons
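Pseudo-words, as referenced above, create artificial ambiguity with a known answer key. Two unrelated words are merged into one token everywhere they occur, and the original word is kept as the gold “sense.”

A minimal sketch follows. The word pair and sentence are hypothetical, and the details of Sanderson’s actual setup (which words were merged, and how many at a time) are not reproduced here.

```python
def make_pseudowords(tokens, pairs):
    """Merge each word pair into a shared pseudo-word such as 'banana/door'.

    Returns the ambiguous token stream and, per position, the original word
    (the hidden 'sense') or None for tokens that were not merged.
    """
    merged = {}
    for a, b in pairs:
        pseudo = f"{a}/{b}"
        merged[a] = pseudo
        merged[b] = pseudo
    ambiguous = [merged.get(t, t) for t in tokens]
    gold_senses = [t if t in merged else None for t in tokens]
    return ambiguous, gold_senses

sentence = "open the door and eat a banana".split()
ambiguous, gold = make_pseudowords(sentence, [("banana", "door")])
print(ambiguous)
print(gold)
```

Because the answer key comes for free, the level of ambiguity in a retrieval collection can be raised on purpose and its effect on retrieval measured directly.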

Page 41

Sanderson (SIGIR-94)
http://dis.shef.ac.uk/mark/cv/publications/papers/my_papers/SIGIR94.pdf

• Resolving ambiguity badly is worse than not resolving at all
  – 75% accurate WSD degrades performance
  – 90% accurate WSD: breakeven point

Soft WSD?

[Chart: F vs. Query Length (Words)]

Page 42

Some Promising Suggestions
(Generate lots of conference papers, but may not support the field)

• Two Languages are Better than One
  – For many classic hard NLP problems
    • Word Sense Disambiguation (WSD)
    • PP-attachment
    • Conjunction
    • Predicate-argument relationships
    • Japanese and Chinese word breaking
  – Parallel corpora → plenty of annotated (labeled) testing and training data
  – Don’t need unsupervised magic (data >> magic)
• Demonstrate that NLP is good for something
  – Statistical methods (IR & WSD) focus on bags of nouns,
    • Ignoring verbs, adjectives, predicates, intensifiers, etc.
  – Hypothesis: Ignored because perceptrons can’t model XOR
  – Task: classify “comments” into “good,” “bad” and “neutral”
    • Lots of terms associated with just one category
    • Some associated with two
      – Depending on argument
    • Good & Bad, but not neutral: Mickey Mouse, Rinky Dink
      – Bad: Mickey Mouse(us)
      – Good: Mickey Mouse(them)
  – Current IR/WSD methods don’t capture predicate-argument relationships

An example of Error Analysis/Representation

Senseval++
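The hypothesis above rests on a classic result: a single-layer perceptron can only learn linearly separable functions, so it can learn AND but not XOR. The sketch below trains a textbook perceptron on both truth tables, with labels in {+1, -1}.

Reading “Mickey Mouse” as XOR-like, because its polarity flips with the argument, is an analogy. It is not a claim about any particular system.

```python
def train_perceptron(data, epochs=20, lr=1.0):
    """Classic perceptron on 2-D inputs; labels are +1 / -1."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), y in data:
            activation = w[0] * x1 + w[1] * x2 + b
            if y * activation <= 0:  # misclassified or on the boundary: update
                w[0] += lr * y * x1
                w[1] += lr * y * x2
                b += lr * y
    return w, b

def accuracy(data, w, b):
    """Fraction of points strictly on the correct side of the learned line."""
    correct = sum(1 for (x1, x2), y in data if y * (w[0] * x1 + w[1] * x2 + b) > 0)
    return correct / len(data)

AND = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
XOR = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), -1)]

for name, data in (("AND", AND), ("XOR", XOR)):
    w, b = train_perceptron(data)
    print(name, accuracy(data, w, b))
```

No choice of weights can separate XOR with one line, so the XOR accuracy stays below 1.0 no matter how many epochs are run. Capturing such interactions needs extra features (for example, an explicit predicate-argument conjunction) or hidden layers.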

Page 43

English Lexical Sample (fine-grained scoring)

[Chart: Precision and Recall (0% to 90%) for Unsupervised, Supervised, and Baseline systems]

Supervision >> Magic > Baseline
http://www.sle.sharp.co.uk/senseval2/Results/all_graphs.xls

Bragging Rights

Supervision

Magic

Baseline
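Senseval reports both precision and recall because systems may decline to answer:

- Precision is correct answers over attempted instances.
- Recall is correct answers over all instances.

This sketch simplifies by assuming exactly one gold sense and at most one guess per instance. It ignores any handling of multiple or weighted answers in the official scorer. The sense labels are hypothetical.

```python
def precision_recall(gold, predicted):
    """Simplified Senseval-style scoring.

    gold: one sense label per instance.
    predicted: one sense label per instance, or None where the system abstained.
    """
    attempted = [(g, p) for g, p in zip(gold, predicted) if p is not None]
    correct = sum(1 for g, p in attempted if g == p)
    precision = correct / len(attempted) if attempted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

gold = ["s1", "s2", "s1", "s3"]
predicted = ["s1", "s1", None, "s3"]  # the system skips the third instance
print(precision_recall(gold, predicted))
```

A system that answers everything has precision equal to recall. Abstaining can raise precision but never recall.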

Page 44

Breakdown by Systems & Words
• Spelling correction task
  – Golding & Schabes (1996)
    • Some methods work better on some words
    • and other methods work better on other words
• Should break down Senseval results by both systems and words
• Discover opportunities for hybrids across systems
• Error analysis
  – POS distinctions (easy)
  – Local context (trigrams)
  – Larger contexts (IR)
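Breaking results down by system and word, as suggested above, takes only a small table. This sketch uses invented results for two hypothetical systems, A and B, on the words “interest” and “wear,” and picks the best system per word.

Choosing per word on the same data used for scoring gives an optimistic (oracle) estimate. In practice the choice should be made on held-out data.

```python
from collections import defaultdict

def accuracy_by_word(results):
    """results: (system, word, is_correct) tuples -> {word: {system: accuracy}}."""
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))  # word -> system -> [correct, total]
    for system, word, ok in results:
        cell = counts[word][system]
        cell[0] += int(ok)
        cell[1] += 1
    return {word: {system: c / n for system, (c, n) in systems.items()}
            for word, systems in counts.items()}

def best_system_per_word(table):
    """For each word, the most accurate system (ties go to the alphabetically first)."""
    return {word: max(sorted(scores), key=lambda s: scores[s]) for word, scores in table.items()}

def hybrid_accuracy(results, choice):
    """Accuracy when each word is handled only by the system chosen for it."""
    picked = [ok for system, word, ok in results if choice.get(word) == system]
    return sum(picked) / len(picked)

# Invented results for two hypothetical systems (not Senseval data).
results = [
    ("A", "interest", True), ("A", "interest", True), ("A", "interest", False),
    ("B", "interest", True), ("B", "interest", False), ("B", "interest", False),
    ("A", "wear", False), ("A", "wear", False), ("A", "wear", True),
    ("B", "wear", True), ("B", "wear", True), ("B", "wear", False),
]

table = accuracy_by_word(results)
choice = best_system_per_word(table)
print(choice)
print(hybrid_accuracy(results, choice))
```

Here each system alone scores 3/6, while the per-word hybrid scores 4/6. This is the kind of opportunity that averaged scores hide.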

Page 45

Goals of Shared Evaluations
• Marketing & Sales
  – Scores going up and up → Funding goes up and up
  – Rising tide lifts all boats
• Shared learnings
  – Compare and contrast
  – What works and what doesn’t?
  – Error analysis
• Benchmarking:
  – How hard are various problems?
  – What makes problems easier or harder?
  – Rate of progress?
• Not bragging rights:
  – Mirror, mirror on the wall, who’s the smartest of them all…

English Lexical Sample (fine-grained scoring)

[Chart: Precision and Recall (0% to 90%) by system, grouped as Unsupervised, Supervised, and Baseline. Systems: UNED-LS-U, ITRI-WASPS-Workbench, CL Research-DIMAP, IIT 2 (R), IIT 1 (R), IIT 2, IIT 1, JHU (R), SMUls, KUNLP, Stanford-CS224N, Sinequa-LIA-SCT, TALP, Duluth 3, JHU, UMD-SST, BCU-ehu-dlist-all, Duluth 5, Duluth C, Duluth 4, Duluth 2, Duluth 1, Duluth A, Duluth B, UNED-LS-T, Alicante, IRST, BCU-ehu-dlist-best. Baselines: Lesk Corpus, Commonest, Grouping Lesk Corpus, Grouping Commonest, Grouping Lesk, Grouping Lesk Def, Lesk, Grouping Random, Lesk Def, Random]

Page 46

Outline

• We’re making consistent progress, or
• We’re running around in circles, or
– Don’t worry; be happy
• We’re going off a cliff…

According to unnamed sources: Speech Winter → Language Winter

Dot Boom → Dot Bust

Page 47

Early Warning Signs for Future
• Senseval feels the need to demonstrate applications of their stuff (and maybe there aren’t any)
• Complacency (don’t worry; be happy)
  – Too little dissent: students aren’t rebelling against their teachers
  – I get uncomfortable when
    • There is so much agreement on what to do and so much optimism
    • And so few worries and so little dissent/controversy.
• Mindless Metrics
  – Whatever you measure, you get…
  – Scores go up and up and up, but are we really doing better?
    • According to the scores, parsing is doing well without words,
    • But you can’t solve classic problems (PPs) without words!
• Burdensome Methodology → Exclusiveness
  – Can’t play (in speech) unless you work in a big lab
• Following Speech off a Cliff
  – Empirical methods: Speech → Language
  – Speech Winter → Language Winter (Dot Boom → Dot Bust)
  – What goes up, (usually) comes down…

Been great, but…

Kuhn Crisis

Campbell (ACL-04): Rules >> ML


Page 50

Sample of 20 Survey Questions (Strong Emphasis on Applications)

• When will
  – More than 50% of new PCs have dictation on them, either at purchase or shortly after.
  – Most telephone Interactive Voice Response (IVR) systems accept speech input.
  – Automatic airline reservation by voice over the telephone is the norm.
  – TV closed-captioning (subtitling) is automatic and pervasive.
  – Telephones are answered by an intelligent answering machine that converses with the calling party to determine the nature and priority of the call.
  – Public proceedings (e.g., courts, public inquiries, parliament, etc.) are transcribed automatically.

• Two surveys of ASRU attendees: 1997 & 2003

Page 51: It is the best of times (and the worst of times)


2003 Responses ≈ 1997 Responses + 6 Years (6 years of hard work → No progress)
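
The headline compresses a short piece of arithmetic. Writing $Y_q^{(s)}$ for the year that respondents to the survey held in year $s$ predict for milestone $q$ (notation introduced here for illustration, not from the slide), the observation and its consequence are:

$$Y_q^{(2003)} \approx Y_q^{(1997)} + 6 \quad\Longrightarrow\quad Y_q^{(2003)} - 2003 \approx Y_q^{(1997)} - 1997$$

The expected wait, measured from the date of each survey, is the same in both: after six more years of work, each milestone was still predicted to be just as far in the future.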

Page 52: It is the best of times (and the worst of times)


Top Ten Metrics of Success (Risky to Promise Apps and Fail to Deliver)

1. Value Creation (Reality)
2. Stock Prices (Belief)
3. Startup Companies Raise Venture Capital (Excitement)
4. Prototype Applications (Plausibility)
5. Grand-Students (Survive the Test of Time)
6. Students Get Jobs
7. Students Finish PhD Theses
8. Citations
9. Conference Registrations
10. Publications (Quantity)

We are here

Senseval wants to be here

Speech
Search

Page 53: It is the best of times (and the worst of times)


Wrong Apps?
• New Priorities
  – Increase demand for space >> Data entry
• New Killer Apps
  – Search >> Dictation
• Speech Google!
  – Data mining
• Old Priorities
  – Dictation app dates back to days of dictation machines
  – Speech recognition has not displaced typing
    • Speech recognition has improved
    • But typing skills have improved even more
  – My son will learn typing in 1st grade
  – Secretaries rarely take dictation
  – Dictation machines are history
    • My son may never see one
    • Museums have slide rules and steam trains
  – But dictation machines?

Page 54: It is the best of times (and the worst of times)


Speech Data Mining & Call Centers: An Intelligence Bonanza

• Some companies are collecting information with technology designed to monitor incoming calls for service quality.

• Last summer, Continental Airlines Inc. installed software from Witness Systems Inc. to monitor the 5,200 agents in its four reservation centers.

• But the Houston airline quickly realized that the system, which records customer phone calls and information on the responding agent's computer screen, also was an intelligence bonanza, says André Harris, reservations training and quality-assurance director.

Page 55: It is the best of times (and the worst of times)


Speech Data Mining
• Label calls as success or failure based on some subsequent outcome (sale/no sale)
• Extract features from speech
• Find patterns of features that can be used to predict outcomes
• Hypotheses:
  – Customer: “I’m not interested” → no sale
  – Agent: “I just want to tell you…” → no sale

Inter-ocular effect (hits you between the eyes); Don’t need a statistician to know which way the wind is blowing
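
The four steps on this slide can be sketched as a tiny pipeline. This is a minimal illustration under loud assumptions: calls are already transcribed to text (the slide extracts features from speech itself), the only features are the presence of a few hand-picked phrases, the "pattern finder" is just the sale rate per phrase, and the call records are invented toy data, not results from Continental or any real call center.

```python
from collections import defaultdict

# Invented toy records (transcript, outcome), for illustration only.
# Assumption: calls have already been transcribed; a real system would
# extract features from the audio itself.
CALLS = [
    ("customer: i'm not interested thanks", "no sale"),
    ("agent: i just want to tell you about our offer", "no sale"),
    ("customer: yes please book the flight", "sale"),
    ("agent: i just want to tell you one more thing", "no sale"),
    ("customer: sounds good book it", "sale"),
    ("customer: i'm not interested", "no sale"),
]

# Hypothetical phrase features: the slide's two hypotheses plus one positive cue.
PHRASES = ["i'm not interested", "i just want to tell you", "book"]


def extract_features(transcript):
    """Return the set of watched phrases that occur in one transcript."""
    return {phrase for phrase in PHRASES if phrase in transcript}


def sale_rate_by_feature(calls):
    """Map each phrase to (number of calls containing it, fraction of those calls that were sales)."""
    counts = defaultdict(lambda: [0, 0])  # phrase -> [calls, sales]
    for transcript, outcome in calls:
        for phrase in extract_features(transcript):
            counts[phrase][0] += 1
            if outcome == "sale":
                counts[phrase][1] += 1
    return {phrase: (n, sales / n) for phrase, (n, sales) in counts.items()}


if __name__ == "__main__":
    for phrase, (n, rate) in sorted(sale_rate_by_feature(CALLS).items()):
        print(f"{phrase!r}: {n} calls, sale rate {rate:.2f}")
```

When an effect is this large, a plain rate table is enough to see it; no significance test is needed, which is the slide's point about not needing a statistician.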

Page 56: It is the best of times (and the worst of times)


Ways for Conferences to Fail
• Incrementalism/Burdensome Methodology (Lesson from 1950s)
  – We do research for fun and profit – Arno Penzias
  – Fun and/or Profit >> By-the-Book Correctness
• Arrogance, Mindless Metrics, etc.
• Control
  – Too much control
    • Excessive Exclusiveness (mutual admiration society/old-boy network)
    • Change (serendipity) is essential: New and Different → Fun and Excitement
    • Growth and prosperity depend on new talent (students) & new topics
    • Can’t afford to keep doing what we already know how to do
  – Too little control
    • Stay on msg: It’s data, stupid! (Our msg ≠ ACL’s)
• Set Inappropriate Expectations
  – Promise too little
    • Senseval feels the need to become more applied
  – Promise too much: Promise Applications and Fail to Deliver
  – Success/Catastrophe
    • What if we actually achieved all our goals?

Rarely a problem, especially with thesis proposals

Rarely a problem (except for March of Dimes)

Page 57: It is the best of times (and the worst of times)


Ways for Conferences to Succeed

• I wish I knew…
• Fate (can’t fail)
  – Rising Tide of Data Lifts All Boats
• Luck/timing: WVLC-93 was just before Web
• Sales & Marketing
  – Evaluation, Evaluation, Evaluation
• Strategic Vision
  – In retrospect, 1993 WVLC worked wonderfully
  – Distinguished us from mainstream
  – Offered excitement and hope for future
    • Especially appealing to students (growth opportunity)

Page 58: It is the best of times (and the worst of times)


Great Challenge: Annotating Data

• Produce annotated data with minimal supervision

• Active learning
  – Identify reliable labels
  – Identify best candidates for annotation
• Co-training
• Bootstrap (project) resources from one application to another

Borrowed Slide: Jelinek (LREC)

Self-organizing “Magic” ≠ Error Analysis

Great Strategy → Success
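
The two active-learning bullets (identify reliable labels; identify the best candidates for annotation) can be illustrated with uncertainty sampling, one common active-learning strategy; the slide does not say which strategy it has in mind. In this minimal sketch the model probabilities are a plain dictionary standing in for any binary classifier, and the names `triage`, `reliable_threshold`, and `budget` are hypothetical.

```python
def confidence(p):
    """Distance of a binary probability from the 0.5 decision boundary, scaled to [0, 1]."""
    return abs(p - 0.5) * 2


def triage(probs, reliable_threshold=0.8, budget=2):
    """Split unlabeled items into machine-labeled and sent-to-annotator.

    probs: item id -> model probability that the item has the positive label.
    Returns (reliable, candidates):
      reliable   -- id -> predicted label (0 or 1) for confident predictions
      candidates -- up to `budget` of the remaining ids, least confident first
    """
    reliable = {
        item: int(p >= 0.5)
        for item, p in probs.items()
        if confidence(p) >= reliable_threshold
    }
    remaining = [item for item in probs if item not in reliable]
    candidates = sorted(remaining, key=lambda item: confidence(probs[item]))[:budget]
    return reliable, candidates


if __name__ == "__main__":
    # Hypothetical classifier outputs for five unlabeled sentences.
    probs = {"s1": 0.97, "s2": 0.52, "s3": 0.05, "s4": 0.70, "s5": 0.45}
    reliable, candidates = triage(probs)
    print(reliable)    # {'s1': 1, 's3': 0}
    print(candidates)  # ['s2', 's5']
```

In a full loop, the `reliable` labels would be added to the training set, the `candidates` sent to human annotators, and the model retrained; that loop is omitted here.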

Page 59: It is the best of times (and the worst of times)


Grand Challenges
ftp://ftp.cordis.lu/pub/ist/docs/istag040319-draftnotesofthemeeting.pdf

Page 60: It is the best of times (and the worst of times)


Roadmaps: Structure of a Strategy (not the union of what we are all doing)

• Goals
  – Example: Replace keyboard with microphone
  – Exciting (memorable) sound bite
  – Broad grand challenge that we can work toward but never solve
• Metrics
  – Examples:
    • WER: word error rate
    • Time to perform task
  – Easy to measure
• Milestones
  – Should be no question if it has been accomplished
  – Example: reduce WER on task x by y% by time t
• Accomplishments v. Activities
  – Accomplishments are good
  – Activity is not a substitute for accomplishments
  – Milestones look forward whereas accomplishments look backward
    • Serendipity is good!
• Small is beautiful
  – Quantity is not a good thing
  – Awareness
  – 1-slide version
    • if successful, you get maybe 3 more slides
• Size of container
  – Goal: 1-3
  – Metrics: 3
  – Milestones: a dozen
    • Mostly for next year: Q1-4
    • Plus some for years 2, 5, 10 & 20
  – Accomplishments: a dozen
    • Broad applicability & illustrative
      – Don’t cover everything
      – Highlight stuff that
        • Applies to multiple groups
        • Forward-Looking / Exciting
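
The metric and milestone bullets can be made concrete. WER is conventionally the word-level edit distance (substitutions + deletions + insertions) between a reference transcript and the recognizer output, divided by the number of reference words. The sketch below computes it with the standard dynamic program and adds a yes/no milestone check in the spirit of "should be no question if it has been accomplished". Reading "reduce WER by y%" as a relative reduction is an assumption (an absolute drop in percentage points is the other common reading), and `milestone_met` is a hypothetical helper.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words (one row of the DP table at a time).
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # substitution (or match when cost is 0)
            )
        prev = curr
    return prev[-1] / len(ref)


def milestone_met(baseline_wer, new_wer, target_reduction):
    """True if WER fell by at least `target_reduction` (e.g. 0.10 for 10%) relative to baseline."""
    return baseline_wer - new_wer >= target_reduction * baseline_wer


if __name__ == "__main__":
    ref = "replace the keyboard with a microphone"
    hyp = "replace a keyboard with microphone"
    print(round(word_error_rate(ref, hyp), 3))  # 0.333: 1 substitution + 1 deletion over 6 words
    print(milestone_met(0.20, 0.17, 0.10))      # True: a 15% relative reduction
```

Both functions return unambiguous numbers or booleans, which is what makes a milestone checkable rather than a matter of opinion.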

Page 61: It is the best of times (and the worst of times)


€ € €

Resources
Apps & Techniques

Grand Challenges

Goal: Reduce barriers to entry

Goals:
1. The multilingual companion
2. Life log

Goal: Produce NLP apps that improve the way people communicate with one another

Evaluation

Page 62: It is the best of times (and the worst of times)


Summary: What Worked and What Didn’t?

• Data
  – Stay on msg: It is the data, stupid!
    • WVLC (Very Large) >> EMNLP (Empirical Methods)
    • If you have a lot of data,
      – Then you don’t need a lot of methodology
    • Rising Tide of Data Lifts All Boats
• Methodology
  – Empiricism means different things to different people
    1. Machine Learning (Self-organizing Methods)
    2. Exploratory Data Analysis (EDA)
    3. Corpus-Based Lexicography
  – Lots of papers on 1
    • EMNLP-2004 theme (error analysis) → 2
    • Senseval grew out of 3

Substance: Recommended if…

Magic: Recommended if…

Promise: Recommended if…

Short term ≠ Long term

Lonely

What’s the right answer?

There’ll be a quiz at the end of the decade…

Page 63: It is the best of times (and the worst of times)

Backup

Page 64: It is the best of times (and the worst of times)


Speech → Language

• Been great so far,
  – But too much of a good thing…

• Take the good

Page 65: It is the best of times (and the worst of times)


Fire
• Fuel
  – Infrastructure: Shared datasets and lexical resources
    • WordNet, LDC, the Web
  – Organizers
    • Walker & Zampolli
  – Funding
    • DARPA (Charles Wayne), EU…
• Sparks
  – Exciting Applications (The Web)
  – Grand Challenges
  – Leaders: Jelinek, Mercer, Miller, Kucera & Francis, Leech, Sinclair, Tukey, Liberman…

Page 66: It is the best of times (and the worst of times)


• Hi Ken,

• Rada probably has more to add, but obviously we would like to hear something about WSD or word senses. We are currently trying to move Senseval to include application-specific evaluations (e.g., within MT or IR, or in specialized domains) and to more general semantic analysis of text (e.g., frames or subcats). Something to inspire people in this direction would be great.

• Phil.

Page 67: It is the best of times (and the worst of times)


Organizational Innovations (Radical → Mainstream)

• Late Submission Deadline
  – Immediately after ACL notifications
    • ACL was rejecting good papers for bad reasons
  – Short review cycles → Freshness
• Invest in the Future: Encourage Innovation
  – Chair (Energetic, Promising, Source of new ideas)
  – Co-chair (Established, Knows how it has been done)
• Inclusiveness:
  – Thankless Chores → Marketing Carrots (Maximize # of reviewers)
  – Balance program committee, reviewers (and hopefully submissions, acceptances and registrations):
    • 1/3 stability, 1/3 promising, 1/3 outreach
    • Diversity: experience, gender, geography, topic
  – Hold conferences in Europe, Asia & America
    • Huge potential market in Asia: 4 out of 5 jumbo jets
  – Maintain 20-25% acceptance rate → Parallel Sessions & Posters
    • Avoid incremental papers
  – Average grades (low grade dominates) → Advocate + Second

Innovation

Checks & Balances
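
The last bullet contrasts two ways of turning reviewer grades into a decision. The sketch below shows why averaging lets a single low grade sink a paper that one reviewer champions, while favoring uniformly lukewarm (often incremental) papers. The 1-5 scale, the 3.5 cutoff, and the reading of "Advocate + Second" as "one top grade plus one other grade of 4 or more" are hypothetical assumptions for illustration, not the rules EMNLP actually used.

```python
from statistics import mean

ACCEPT_MEAN = 3.5  # hypothetical cutoff for the averaging rule (grades on a 1-5 scale)


def accept_by_average(scores):
    """Averaging rule: accept if the mean grade clears a fixed cutoff."""
    return mean(scores) >= ACCEPT_MEAN


def accept_by_advocate(scores, advocate=5, second=4):
    """Advocate + Second (one possible reading): accept if some reviewer gives the
    top grade and at least one other reviewer gives `second` or better."""
    ranked = sorted(scores, reverse=True)
    return len(ranked) >= 2 and ranked[0] >= advocate and ranked[1] >= second


if __name__ == "__main__":
    bold = [5, 4, 1]         # one champion, one supporter, one strong objection
    incremental = [4, 4, 3]  # nobody excited, nobody opposed
    for name, scores in [("bold", bold), ("incremental", incremental)]:
        print(name, accept_by_average(scores), accept_by_advocate(scores))
    # bold False True
    # incremental True False
```

Under these assumptions the two rules give opposite decisions for the two grade profiles, matching the slide's concern that with averaging the low grade dominates.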