Large-scale Knowledge Resources in Speech and Language Research

Mark Liberman
University of Pennsylvania

[email protected]

LKR2004 3/8/2004


Outline

• Glimpse of LKR in the U.S. landscape

• What is the relationship between large-scale knowledge resources and research and development on speech and language?

• What are some needs and opportunities?

• What are the trends?

• Illustrative examples


Glimpses of the U.S. LKR landscape

• DARPA research areas
  – Human Language Technology
  – Cognitive Information Processing
• NSF initiatives
  – Digital Libraries
  – ITR, Human Social Dynamics
  – “terascale linguistics”
• Biomedical research:
  – text, ontologies, databases, experiments
  – collaborations with Japan and Europe
• Language documentation
• Web archives in many disciplines
• ...too many other things to list...


What is the relationship between large-scale knowledge resources and research and development on speech and language?

Speech and language R&D needs LKR

Modeling text: 10^4 to 10^6 words in 1975, 10^9 to 10^12 words today
Modeling speech: 1 to 10 hours in 1975, 10^3 to 10^4 hours today
+ lexicons, parallel text, DBs for entity tracking, etc.
+ a thousand languages and dialects
+ history, social variation, register and genre, ...

Speech and language R&D creates LKR

see above, but also something entirely new...


Some needs and opportunities

• Standards and tools for LKR
  – for creation, improvement, maintenance
  – for publication, distribution, archiving
  – for search, access and use
• An academic culture that rewards production and distribution of LKR
  – most LKR are a side effect of individual and small-group research
  – virtual “meta-resources” from many sources
• Part of the answer: integrate LKR into the system of (scientific and scholarly) publication


Themes and trends

• A New Empiricism
  focus on large-scale resources, because quantity (of data) → quality (of knowledge)
• Language + Life = Meaning
  something new emerges from large collections of symbols, signals, contexts, connections
• People and machines: better together
  – cognitive prosthetics
  – interactive working, playing and learning
• Failure is the basis for success
  if we can measure error, we can learn to improve


Some illustrative examples...


A famous argument

  (1) Colorless green ideas sleep furiously.
  (2) Furiously sleep ideas green colorless.

“. . . It is fair to assume that neither sentence (1) nor (2) (nor indeed any part of these sentences) has ever occurred in an English discourse. Hence, in any statistical model for grammaticalness, these sentences will be ruled out on identical grounds as equally ‘remote’ from English. Yet (1), though nonsensical, is grammatical, while (2) is not.”

Noam Chomsky, “Syntactic Structures” (1957)


But is it true?


43 years later

• someone finally checked...
  – Pereira, “Formal grammar and information theory” (2000)
  – simple “aggregate bigram model” using hidden class variables c
  – with C=16, trained on ~100 million words of newswire data
• the result: “Furiously sleep ideas green colorless” is more than 200,000 times less probable than “Colorless green ideas sleep furiously”
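For readers who want to try this, here is a minimal sketch of an aggregate bigram model, P(w2 | w1) = sum over c of P(c | w1) P(w2 | c), fitted by EM. It is a toy under stated assumptions, not Pereira's setup: the four-word "sentences", the held-out bigrams, the five classes, the pseudo-counts and the random restarts are all invented for illustration.

```python
import itertools
import math
import random
from collections import Counter


def normalized(weights):
    total = sum(weights)
    return [w / total for w in weights]


def train(tokens, num_classes, iterations=100, seed=0):
    """Fit P(w2 | w1) = sum_c P(c | w1) * P(w2 | c) by EM."""
    rng = random.Random(seed)
    vocab = sorted(set(tokens))
    bigrams = Counter(zip(tokens, tokens[1:]))
    classes = range(num_classes)
    # Random starting point so that EM can break the symmetry between classes.
    p_class = {w: normalized([rng.random() + 0.1 for _ in classes]) for w in vocab}
    p_word = [dict(zip(vocab, normalized([rng.random() + 0.1 for _ in vocab])))
              for _ in classes]
    for _ in range(iterations):
        # Tiny pseudo-counts keep every probability strictly positive.
        class_counts = {w: [1e-6] * num_classes for w in vocab}
        word_counts = [{w: 1e-6 for w in vocab} for _ in classes]
        for (w1, w2), n in bigrams.items():
            # E-step: posterior over the hidden class of this bigram.
            posterior = normalized([p_class[w1][c] * p_word[c][w2] for c in classes])
            for c in classes:
                class_counts[w1][c] += n * posterior[c]
                word_counts[c][w2] += n * posterior[c]
        # M-step: renormalize the expected counts.
        p_class = {w: normalized(class_counts[w]) for w in vocab}
        p_word = [dict(zip(vocab, normalized([word_counts[c][w] for w in vocab])))
                  for c in classes]
    return p_class, p_word


def log_probability(words, p_class, p_word):
    """Log probability of words[1:] given words[0] under the model."""
    total = 0.0
    for w1, w2 in zip(words, words[1:]):
        total += math.log(sum(pc * pw[w2] for pc, pw in zip(p_class[w1], p_word)))
    return total


# Toy corpus (hypothetical): adjective-noun-verb-adverb sentences, with three
# bigrams held out so that the test sentence contains only unseen bigrams.
ADJ, NOUN = ["big", "small", "green"], ["dogs", "cats", "plants"]
VERB, ADV = ["run", "sleep", "grow"], ["fast", "quietly", "slowly"]
held_out = {("small", "dogs"), ("dogs", "sleep"), ("sleep", "slowly")}
tokens = []
for sentence in itertools.product(ADJ, NOUN, VERB, ADV):
    if not held_out & set(zip(sentence, sentence[1:])):
        tokens.extend(sentence + (".",))

# A few random restarts; keep the model with the best training likelihood.
models = [train(tokens, num_classes=5, seed=s) for s in range(5)]
p_class, p_word = max(models, key=lambda m: log_probability(tokens, *m))

forward = ["small", "dogs", "sleep", "slowly"]
log_ratio = (log_probability(forward, p_class, p_word)
             - log_probability(forward[::-1], p_class, p_word))
print(f"log probability ratio, forward vs. reversed: {log_ratio:.1f}")
```

Neither word order occurs in the training data, yet the hidden classes let the model prefer the order that matches the category sequence it has seen.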


What changed?

• Partly:
  – new models and estimation methods
  – better computing resources
  – more accessible data
• Mostly:
  – willingness to look for solutions
  – opportunities to apply them

To be fair, this kind of modeling became a real option only about 1980.
Now it can be done as an undergraduate term project ...


Social structure from conversation

• Human social dynamics: model of conversational turn-taking

• U.S. Supreme Court oral arguments

• Modeling is simple and local
  – one session modeled at a time (~250 turns)
  – data is just a sequence of (~250) speaker IDs

• Undergraduate term project in intro course (credit to: Chris Osborn)


CHIEF JUSTICE WILLIAM H. REHNQUIST: We'll hear argument next in No. 01-298, Paul Lapides v. the Board of Regents of the University System of Georgia. Spectators are admonished, do not talk until you get outside the courtroom. The court remains in session. Mr. Bederman.

MR. DAVID J. BEDERMAN: Mr. Chief Justice, and may it please the Court: When a State affirmatively invokes the jurisdiction of the Federal court by removing a case, that acts as a waiver of the State's forum immunity to Federal jurisdiction under the Eleventh Amendment. This principle ...

JUSTICE ANTONIN SCALIA: When you say as an actor in any role, does it ever intervene as a defendant?

MR. BEDERMAN: Yes, Justice Scalia. This Court's precedents seem to indicate that wherever the State is cast in the role of plaintiff, defendant, intervenor, or claimant, that the entry into the Federal proceeding submits the State to the jurisdiction of the Federal court.

CHIEF JUSTICE REHNQUIST: How about the Ford Motor Company case?

MR. BEDERMAN: Well, of course, the authorization requirement in Ford Motor -- and that's the particular holding in Ford Motor that I think is of concern to this Court -- need not be reached here because, of course, ...

CHIEF JUSTICE REHNQUIST: So, you think a line can be drawn between the State defendant being drawn in as a respondent or involuntarily as opposed to removing and thereby invoking Federal jurisdiction.

+ ... 254 turns ...


Two-class “aggregate bigram model”, trained on a single one-hour argument (01-298), highest-probability class for each speaker:

class 1 = (chief justice william h. rehnquist, justice anthony kennedy, justice antonin scalia, justice john paul stevens, justice ruth bader ginsburg, justice sandra day o'connor, justice stephen g. breyer)

class 2 = (mr. david j. bederman, mr. irving l. gornstein, ms. devon orland, ms. julie c. parsley)
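The only input to such a model is the sequence of speaker IDs. Here is a minimal sketch of extracting that sequence from transcript lines laid out like the excerpt, and of counting who follows whom. The regular expression and the "title plus surname" normalization are assumptions made for illustration; the turn text is shortened because only the speaker label is used.

```python
import re
from collections import Counter

# Assumed layout: a title, optional given names or initials, a surname, a colon.
SPEAKER = re.compile(r"^(CHIEF JUSTICE|JUSTICE|MR\.|MS\.)\s+(?:[A-Z.']+\s+)*([A-Z']+):")


def speaker_sequence(lines):
    """Return one normalized speaker ID per transcript turn."""
    sequence = []
    for line in lines:
        match = SPEAKER.match(line)
        if match:
            title, surname = match.groups()
            sequence.append(f"{title} {surname}")
    return sequence


turns = [
    "CHIEF JUSTICE WILLIAM H. REHNQUIST: We'll hear argument next in No. 01-298",
    "MR. DAVID J. BEDERMAN: Mr. Chief Justice, and may it please the Court:",
    "JUSTICE ANTONIN SCALIA: When you say as an actor in any role,",
    "MR. BEDERMAN: Yes, Justice Scalia.",
    "CHIEF JUSTICE REHNQUIST: How about the Ford Motor Company case?",
    "MR. BEDERMAN: Well, of course, the authorization requirement",
    "CHIEF JUSTICE REHNQUIST: So, you think a line can be drawn",
]

sequence = speaker_sequence(turns)
# Speaker-to-speaker transition counts: the bigram statistics the model sees.
transitions = Counter(zip(sequence, sequence[1:]))
for (previous, following), count in sorted(transitions.items()):
    print(f"{previous} -> {following}: {count}")
```

Even in this short excerpt, justices are followed only by counsel and counsel only by justices, which is the kind of regularity a two-class model can pick up.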


So human social roles can emerge from a trivial statistical model of speaker sequencing in a formal setting.

...and sometimes you don’t need a lot of data.

...though in this case, it was crucial that Jerry Goldman’s Oyez Project is publishing all Supreme Court oral arguments (audio and transcripts).

In most cases the quantity of data is crucial: data quantity → knowledge quality ... and available resources are just starting to pass a threshold.


A case where size matters...

• English complex nominals: sequence of nouns and adjectives, e.g.

  Volume Feeding Management Success Formula Award

• Part-of-speech string offers little help in parsing:

  N N N
  [ stone [ traffic barrier ]]
  [[ job growth ] statistics ]

• Apparently, parsing requires “understanding”


The MEDLINE corpus

• U.S. National Library of Medicine

• ~12 million references and abstracts
  – biomedical journal articles
  – 1966 to present

• ~10^9 words


Parsing by counting (in MEDLINE)

Pattern  Example                          Count (words 1-2)  Count (words 2-3)
[NN]N    sickle cell anemia               10561              2422
N[NN]    rat bile duct                    203                22366
[NA]N    information theoretic criterion  112                5
N[AN]    monkey temporal lobe             16                 10154
[AN]N    giant cell tumour                7272               1345
A[NN]    cellular drug transport          262                746
[AA]N    small intestinal activity        8723               120
A[AN]    inadequate topical cooling       4                  195
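A minimal sketch of the counting rule these examples suggest: bracket a three-word nominal around whichever adjacent pair is more frequent. It assumes that the two numbers given for each example are the corpus counts of the first pair and the second pair; the function name and the tie-breaking choice are illustrative.

```python
# Pair counts copied from the MEDLINE examples above (assumed to be the counts
# of words 1-2 and words 2-3 of each phrase).
PAIR_COUNTS = {
    ("sickle", "cell"): 10561, ("cell", "anemia"): 2422,
    ("rat", "bile"): 203, ("bile", "duct"): 22366,
    ("information", "theoretic"): 112, ("theoretic", "criterion"): 5,
    ("monkey", "temporal"): 16, ("temporal", "lobe"): 10154,
    ("giant", "cell"): 7272, ("cell", "tumour"): 1345,
    ("cellular", "drug"): 262, ("drug", "transport"): 746,
    ("small", "intestinal"): 8723, ("intestinal", "activity"): 120,
    ("inadequate", "topical"): 4, ("topical", "cooling"): 195,
}


def bracket(phrase, pair_counts):
    """Bracket a three-word phrase around its more frequent adjacent pair."""
    w1, w2, w3 = phrase.split()
    left = pair_counts.get((w1, w2), 0)
    right = pair_counts.get((w2, w3), 0)
    # Ties go to left-branching here; that choice is arbitrary.
    if left >= right:
        return f"[[{w1} {w2}] {w3}]"
    return f"[{w1} [{w2} {w3}]]"


for phrase in ["sickle cell anemia", "rat bile duct", "monkey temporal lobe"]:
    print(bracket(phrase, PAIR_COUNTS))
```

The rule never looks at part-of-speech tags, so it only works when the corpus is large enough for both pairs to have been seen reasonably often.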


Parsing by counting (Google hits)

[N [N N]]  stone traffic barrier  338      7,010
[[N N] N]  job growth statistics  349,000  11,600

First attempt at this idea: for AT&T TTS in 1987
First real success: ~15 years later

The difference:
It doesn’t really work with 10^7 to 10^8 tokens
It works pretty well with 10^9 to 10^12 tokens

“You can observe a lot just by watching.” -Yogi Berra
here... “You can analyze a lot just by counting.”


As the SCOTUS example suggests, “large-scale” is not just the number of words or hours.

Structure, context and external relationships can also be crucial – here it was the sequence of speaker identities.

Here’s a simple but compelling example of how symbol-like structure emerges as zebra finches practice a song...

This is research by Ofer Tchernichovski (CCNY), Partha Mitra and others


Zebra finch song learning
Ofer Tchernichovski (CCNY)

[Figure: song spectrogram; x-axis: Time (ms), 0 to 700; y-axis: Frequency (Hz)]


Song motifs vary across individuals


Song imitation – young birds imitate adults

Tutor’s song

Pupil’s song


Song imitation

* Can be very accurate

* Critical period – developmental learning

* Song template – memory traces of a model

* Learning requires auditory feedback

[Figure: timeline of Age (days), 0 to 100, marking the sensory phase and the sensory-motor phase]


Initially: social & acoustic isolation

Days 35 / 43 / 60: start training


The training system
Laboratory of Animal Behavior, CCNY


Real-time calculation of acoustic features

4 simple acoustic features with articulatory correlates:

Feature              +      -
Wiener entropy       Noise  Pure tone
Spectral continuity  High   Low
Pitch                High   Low
FM                   High   Low
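As an illustration of one of these features, here is a minimal sketch of Wiener entropy computed as the log ratio of the geometric mean to the arithmetic mean of the power spectrum. On that scale noise is near 0 and a pure tone is strongly negative. The frame length, the naive DFT and the small power floor are assumptions of this sketch, not details of the system described in the talk.

```python
import cmath
import math
import random


def power_spectrum(frame):
    """Power at the positive-frequency DFT bins (naive O(n^2) transform)."""
    n = len(frame)
    spectrum = []
    for k in range(1, n // 2):
        coefficient = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                          for t, x in enumerate(frame))
        spectrum.append(abs(coefficient) ** 2)
    return spectrum


def wiener_entropy(frame, floor=1e-12):
    """log(geometric mean / arithmetic mean) of the power spectrum; always <= 0."""
    power = [p + floor for p in power_spectrum(frame)]  # floor avoids log(0)
    log_geometric_mean = sum(math.log(p) for p in power) / len(power)
    log_arithmetic_mean = math.log(sum(power) / len(power))
    return log_geometric_mean - log_arithmetic_mean


n = 128
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
tone = [math.sin(2 * math.pi * 8 * t / n) for t in range(n)]

noise_entropy = wiener_entropy(noise)
tone_entropy = wiener_entropy(tone)
print(f"noise: {noise_entropy:.2f}, pure tone: {tone_entropy:.2f}")
```

The value for the tone depends on the floor, because an ideal pure tone has an unbounded negative Wiener entropy.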


The training system

Song recognition

Song analysis

Database table


Start on  Duration  Mean Amp     Mean Pitch   Mean Entropy  Mean FM      Mean Continuity
5733      66        0.295980722  802.5073242  -2.626851082  33.58778763  0.804081738
6756      66        0.152581334  704.6381836  -2.524046659  27.59897423  0.802883089
7297      53        0.167008847  812.2409058  -1.880394816  45.26642609  0.73422879
7876      62        0.219140843  744.0402222  -2.562429667  34.36729431  0.77498275
8253      76        0.261799634  1212.450928  -2.24555397   48.8947258   0.649886608
8393      121       0.825781465  663.1687012  -2.535212278  20.65950394  0.749277711
8589      61        0.383003145  719.1973877  -2.427448273  29.89187622  0.67703712
8760      65        0.261223316  1119.903198  -2.556747913  45.04622269  0.633399487
8840      92        0.391378433  980.5782471  -2.776203156  29.98022079  0.742950559
9579      50        0.070019156  1089.148315  -2.479059219  29.93981934  0.839425206
10523     70        0.166663319  811.1593628  -2.734509706  27.13637352  0.836294293
10733     51        0.176689878  763.8659058  -1.616189003  45.17594528  0.496240675
10874     36        0.076791681  1103.130981  -1.929902196  58.78096008  0.811875403
10972     62        0.10109444   2110.150879  -2.650181532  46.28370285  0.830607355
11042     44        0.221805096  2779.580322  -3.222234249  60.9871254   0.79437232
11136     53        0.203947186  878.0430298  -1.2962991    46.85206223  0.485266626
11465     53        0.14567025   811.8573608  -1.186548352  41.14878082  0.42596662
11521     65        0.139529422  868.633667   -1.330822468  42.92938232  0.542328238
12355     81        0.536730945  982.7991333  -2.679917574  37.7701149   0.523121655
13481     55        0.185585603  733.9207764  -2.271656036  39.42351151  0.816531181
13669     72        0.342740119  772.1679077  -2.455365419  30.38383102  0.765049458
14466     53        0.276962578  699.7897949  -2.140806913  40.342556    0.822018743
14612     47        0.078976907  1122.309326  -1.729982138  48.15994644  0.823718846
16304     55        0.143629089  769.4672852  -1.626844049  34.90858841  0.711382151
16454     76        0.216472968  769.9150391  -2.356431723  39.29466629  0.794104338
16571     54        0.52569139   687.6394043  -1.956387162  37.81315613  0.616944551
17000     58        0.135118335  864.5578613  -2.363121986  31.00643349  0.858065724
17189     51        0.124977574  752.3527222  -1.94250226   36.36558151  0.691144586
17761     58        0.144002378  1021.027527  -2.258356094  40.53672409  0.708231866
17873     47        0.066938281  1339.068604  -1.668018103  46.29984665  0.69986397
18051     38        0.066276349  1847.560913  -2.551876307  38.55633545  0.805839062
18092     81        0.200010121  2080.408936  -3.075473547  50.34065247  0.776402116
18219     66        0.335276693  858.1080933  -1.750756502  46.40740204  0.511499882
18536     69        0.261755675  890.3964233  -1.860459447  42.50422668  0.500995994
19446     46        0.15915972   993.3217773  -1.601477981  43.11263275  0.527124286
20405     51        0.193706796  800.2883911  -1.413753867  41.22149277  0.428571522
20644     65        0.24410592   802.0982666  -1.589150429  39.50386429  0.429761887
20729     61        0.166723967  901.6841431  -1.771348119  47.49161148  0.556119919
20847     51        0.198818251  852.6430664  -1.053611994  48.11198425  0.44106108
23287     68        0.178408563  784.8914185  -2.134843588  41.99195862  0.656920671
24243     70        0.185866207  990.8589478  -2.562700748  39.49663925  0.763919473


Dynamic Vocal Development maps

Duration  Mean Pitch   Mean Entropy  Mean FM
66        802.5073242  -2.626851082  33.58778763
66        704.6381836  -2.524046659  27.59897423
53        812.2409058  -1.880394816  45.26642609
62        744.0402222  -2.562429667  34.36729431
76        1212.450928  -2.24555397   48.8947258
121       663.1687012  -2.535212278  20.65950394
61        719.1973877  -2.427448273  29.89187622
65        1119.903198  -2.556747913  45.04622269
92        980.5782471  -2.776203156  29.98022079
50        1089.148315  -2.479059219  29.93981934
70        811.1593628  -2.734509706  27.13637352

[Figure: scatter plot of Mean FM (y-axis, 0 to 90) against Duration (x-axis, 0 to 500)]
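One way to picture how such a map is built: treat each syllable as a (Duration, Mean FM) point and count the points that fall into each cell of a coarse grid. The sketch below does that for the eleven rows listed on this slide, with values rounded to two decimals. The bin widths and the function name are illustrative assumptions, not part of the original analysis.

```python
from collections import Counter

# (Duration, Mean FM) pairs from the rows on this slide, rounded to 2 decimals.
SYLLABLES = [
    (66, 33.59), (66, 27.60), (53, 45.27), (62, 34.37), (76, 48.89), (121, 20.66),
    (61, 29.89), (65, 45.05), (92, 29.98), (50, 29.94), (70, 27.14),
]


def density_map(points, duration_bin=50, fm_bin=10):
    """Count points per grid cell; a cell is keyed by its lower-left corner."""
    cells = Counter()
    for duration, fm in points:
        cell = (int(duration // duration_bin) * duration_bin,
                int(fm // fm_bin) * fm_bin)
        cells[cell] += 1
    return cells


cells = density_map(SYLLABLES)
for (duration_start, fm_start), count in sorted(cells.items()):
    print(f"duration {duration_start}+, FM {fm_start}+: {count} syllable(s)")
```

With thousands of syllables per day instead of eleven, dense cells correspond to clusters, which is where syllable types become visible.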


Dynamic Vocal Development (DVD) Map of a single bird

[Figure: Mean FM (y-axis, 0 to 90) against Duration (x-axis, 0 to 500), one layer per stage of development: Day 35, Day 45, Day 55, Day 65, Day 75, Day 85; onset of training marked]


Language + Life = Meaning

• Text (and speech) structured by:
  – conversational context
    • time, place, sequence, participants, ...
  – content
    • types and identities of referenced entities
    • explicit links (anaphora, references, hyperlinks)
    • implicit links (quotation, imitation, opposition)
  – other contextual data
    • e.g. neurological, gene expression data in birdsong learning
    • gaze, gesture, posture, physiological data in conversation


A small application: real conversational transcription

• Perfect automatic speech-to-text (STT) yields:

ew very nice yes that’s that’s the ah first car uh well my first ownership of something major that’s cool i had to buy my car my other car burned down so it was my first brand new car uh-huh but i love it so i am very happy

• STT + “metadata” yields “Rich Transcription”:

Speaker 1: Very nice.

Speaker 2: Yes. That’s my first ownership of something major.

Speaker 1: That’s cool. I had to buy my car. My other car burned down. It was my first brand new car.

Speaker 2: Uh-huh.

Speaker 1: But I love it. I am very happy.


One aspect of conversational metadata: Diarization

Goal: Label acoustic “sources” and their attributes – speakers, music, noise, DTMF, background events

[Figure: timeline (Time axis, 5.0 to 35.0) for Channel A and Channel B, with each segment labeled by source and attributes: Speaker 1 (M), Speaker 2 (F), Music, DTMF, Speaker 3 (M), DTMF, Noise (High)]


Interactive annotation

• Supervised learning: human annotates, machine learns

• Unsupervised learning: machine looks for structure in raw data

• Semi-supervised learning: human annotates a few examples, machine tries to generalize

• “Active learning”: machine selects cases that are interesting or uncertain, asks for human judgments

• Sampling experiments: human checks machine annotation of selected cases, then applies the sample confusion matrix to estimate overall statistics
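The sampling-experiment idea can be sketched in a few lines: a human checks a small sample of the machine's labels, and the resulting confusion matrix is used to correct the label totals for the whole corpus. The labels and all the numbers below are invented for illustration, and the estimate assumes that the checked sample is representative of each machine label.

```python
def estimate_true_totals(machine_totals, sample_confusion):
    """Estimate corpus-wide true-label totals from a human-checked sample.

    machine_totals:   {machine_label: number of items so labeled in the corpus}
    sample_confusion: {machine_label: {true_label: count in the checked sample}}
    """
    estimates = {}
    for machine_label, total in machine_totals.items():
        row = sample_confusion[machine_label]
        checked = sum(row.values())
        for true_label, count in row.items():
            # Share of this machine label that the human judged to be true_label.
            share = count / checked
            estimates[true_label] = estimates.get(true_label, 0.0) + total * share
    return estimates


# Hypothetical numbers: 10,000 machine-tagged tokens, 100 checked per tag.
machine_totals = {"NOUN": 6000, "VERB": 4000}
sample_confusion = {
    "NOUN": {"NOUN": 90, "VERB": 10},
    "VERB": {"NOUN": 20, "VERB": 80},
}
estimates = estimate_true_totals(machine_totals, sample_confusion)
print(estimates)
```

Here the raw machine counts (6000 nouns, 4000 verbs) become estimated true counts of about 6200 and 3800, at the cost of 200 human judgments.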


The cycle of interactive annotation

Machine Learning

(Selective) Sampling/Labeling

Hand Correction

Hand Annotation

Automatic annotation
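A toy version of this cycle, with uncertainty-based selection, might look like the sketch below. Everything in it is a stand-in: the "items" are numbers, the "machine learning" step fits a single threshold, and the `oracle` function plays the human annotator. It also assumes that the smallest and largest items have opposite labels, so the seed annotation contains one example of each class.

```python
def fit_threshold(labeled):
    """Machine learning step: midpoint between the closest opposite labels."""
    negatives = [x for x, label in labeled.items() if not label]
    positives = [x for x, label in labeled.items() if label]
    return (max(negatives) + min(positives)) / 2


def annotation_cycle(pool, oracle, rounds):
    """Alternate model fitting with human labeling of the most uncertain item."""
    pool = sorted(pool)
    # Hand annotation of two seed items (assumed to have opposite labels).
    labeled = {pool[0]: oracle(pool[0]), pool[-1]: oracle(pool[-1])}
    for _ in range(rounds):
        threshold = fit_threshold(labeled)
        unlabeled = [x for x in pool if x not in labeled]
        if not unlabeled:
            break
        # Selective sampling: the item closest to the decision boundary.
        query = min(unlabeled, key=lambda x: abs(x - threshold))
        labeled[query] = oracle(query)  # hand labeling or correction
    return fit_threshold(labeled), len(labeled)


pool = [i / 100 for i in range(101)]


def oracle(x):
    """Stand-in for the human annotator; the true boundary is hypothetical."""
    return x >= 0.62


threshold, judgments = annotation_cycle(pool, oracle, rounds=12)
print(f"learned threshold {threshold:.3f} after {judgments} human judgments")
```

In this toy case the selection step behaves like a binary search, so a dozen or so judgments are enough to label all 101 items correctly.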


POS tagger trained on WSJ, applied to MEDLINE:


Same tagger, after retraining... (~200 MEDLINE abstracts):


The key to success: learn to measure failure...

Even a badly flawed measure can produce important gains.


One year of quantitative evaluation...

Arabic to English

[Figure: bar chart of Percent of Human (y-axis, 50% to 100%) for the Best Research System and the Best COTS System in 2002 and 2003; bar labels: 51%, 89%, 57%, 58%]


Scoring Method

Percent of Human = (Machine Translation Score / Human Translation Score) x 100

Translation Score = weighted sum of n-gram matches between the translation being scored (human or machine) and three good reference translations

Reference translation: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport .

Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail , which sends out ; The threat will be able after public place and so on the airport to start the biochemistry attack , [?] highly alerts after the maintenance.

Legend: Tri-gram match, Bi-gram match, Uni-gram match
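A minimal sketch of a score of this kind: for n = 1, 2, 3, count the n-grams of the candidate that also occur in some reference, with each n-gram credited at most as often as it appears in any single reference, and combine the three match rates with weights. The equal weights, the whitespace tokenization and the toy sentences are assumptions of this sketch; it is not the exact metric used in the evaluation.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def translation_score(candidate, references, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of uni-, bi- and tri-gram match rates against the references."""
    candidate_tokens = candidate.lower().split()
    reference_tokens = [reference.lower().split() for reference in references]
    score = 0.0
    for n, weight in enumerate(weights, start=1):
        candidate_counts = ngrams(candidate_tokens, n)
        if not candidate_counts:
            continue  # candidate shorter than n words
        reference_counts = [ngrams(tokens, n) for tokens in reference_tokens]
        matched = sum(min(count, max(ref[gram] for ref in reference_counts))
                      for gram, count in candidate_counts.items())
        score += weight * matched / sum(candidate_counts.values())
    return score


def percent_of_human(machine, human, references):
    """Machine score as a percentage of the human translator's score."""
    return 100 * translation_score(machine, references) / translation_score(human, references)


# Hypothetical toy data, not taken from the evaluation.
references = [
    "the cat sat on the mat",
    "a cat was sitting on the mat",
    "the cat is on the mat",
]
human = "the cat sat on the mat"
machine = "cat the mat on sat"
percent = percent_of_human(machine, human, references)
print(f"percent of human: {percent:.1f}")
```

Scrambling the word order leaves every unigram matched but destroys most bigrams and all trigrams, which is why the longer n-grams carry the fluency signal.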


Best System Outputs

insistent Wednesday may recurred her trips to Libya tomorrow for flying

Cairo 6-4 ( AFP ) - an official announced today in the Egyptian lines company for flying Tuesday is a company " insistent for flying " may resumed a consideration of a day Wednesday tomorrow her trips to Libya of Security Council decision trace international the imposed ban comment .

And said the official " the institution sent a speech to Ministry of Foreign Affairs of lifting on Libya air , a situation her receiving replying are so a trip will pull to Libya a morning Wednesday " .

Certain are " the lines is air Libyan I will start also in of three trips running weekly to Cairo in the coordination with Egypt for flying " .

Egyptair Has Tomorrow to Resume Its Flights to Libya

Cairo 4-6 (AFP) - said an official at the Egyptian Aviation Company today that the company egyptair may resume as of tomorrow, Wednesday its flights to Libya after the International Security Council resolution to the suspension of the embargo imposed on Libya.

" The official said that the company had sent a letter to the Ministry of Foreign Affairs, information on the lifting of the air embargo on Libya, where it had received a response, the first take off a trip to Libya on Wednesday morning ".

The Libyan Arab Airways will also in the conduct of the three times a week in Cairo in coordination with egyptair ".

The first passage above is the best system output from 2002; the second is the best system output from 2003.


Human v. Machine

Egypt Air May Resume its Flights to Libya Tomorrow

Cairo, April 6 (AFP) - An Egypt Air official announced, on Tuesday, that Egypt Air will resume its flights to Libya as of tomorrow, Wednesday, after the UN Security Council had announced the suspension of the embargo imposed on Libya.

The official said that, "the company sent a letter to the Ministry of Foreign Affairs to inquire about the lifting of the air embargo on Libya, and in the event that it receives a response, then the first flight to Libya, will take off, Wednesday morning."

He stressed that "the Libyan Airlines will begin scheduling three weekly flights to Cairo, in coordination with Egypt air."

Egyptair Has Tomorrow to Resume Its Flights to Libya

Cairo 4-6 (AFP) - said an official at the Egyptian Aviation Company today that the company egyptair may resume as of tomorrow, Wednesday its flights to Libya after the International Security Council resolution to the suspension of the embargo imposed on Libya.

" The official said that the company had sent a letter to the Ministry of Foreign Affairs, information on the lifting of the air embargo on Libya, where it had received a response, the first take off a trip to Libya on Wednesday morning ".

The Libyan Arab Airways will also in the conduct of the three times a week in Cairo in coordination with egyptair ".

The first passage above is the human translation; the second is the best system output from 2003.


Summary

• Speech and Language Research
  – needs LKR
  – creates LKR
  – can help other disciplines deal with LKR
  – is helped by other disciplines, who provide
    • raw data as well as relevant LKR pieces
    • problems, algorithms, inspiration
• The whole is greater than the sum of the parts
  – Types, sources and amounts of data
  – Collaboration within and across disciplines
  – Cooperation of humans and machines