Data Mining, Information Extraction and Search in Spoken Documents


Page 1: Data Mining, Information Extraction and Search in Spoken Documents

04/22/23 1

Data Mining, Information Extraction and Search in Spoken Documents

Julia Hirschberg
CS 4706

Page 2: Data Mining, Information Extraction and Search in Spoken Documents

Today

• Data mining from text
• Searching audio data instead of text
• Information extraction from spoken documents
• Speech data mining

Page 3: Data Mining, Information Extraction and Search in Spoken Documents

Data Mining

• Discovery of trends and patterns across very large datasets, usually for decision-making purposes
  – Fraud detection in banking, telephony
  – Stock market
  – Indications of demographic disasters
  – New causes of diseases
  – …finding things you don’t know you’re looking for
• Information retrieval vs. ‘mining for nuggets’

Page 4: Data Mining, Information Extraction and Search in Spoken Documents

Data Mining in Computational Linguistics

• Finding lexical co-occurrence information
• Finding parallel text corpora on the web for MT
• Finding ‘new’ topics in news stories
  – TDT task
• Exploring citation links:
  – Networks of influence
• Information extraction, e.g. find mutual acquaintances

Page 5: Data Mining, Information Extraction and Search in Spoken Documents

• Snowball (Agichtein et al ’01):
  – Seed set of patterns (e.g. Norman Mailer, 59 → <firstname> <lastname>, <age>; the 59-year-old Mailer → the <age>-year-old <lastname>)
  – Find more patterns by looking for e.g. Mailer close to 59
    • Mailer turned 59 last week.
    • Though Mailer is 59…
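
The bootstrapping idea above can be illustrated with a small Python sketch. It is a simplified stand-in, not the Snowball system itself: the corpus, the regex templates, and the function names `induce_patterns` and `apply_patterns` are assumptions made for illustration, and the sketch omits any pattern scoring or filtering a real system would need.

```python
import re

# Hypothetical mini-corpus; only the Mailer examples come from the slide.
corpus = [
    "Norman Mailer, 59, spoke at the event.",
    "Mailer turned 59 last week.",
    "Though Mailer is 59, he still writes daily.",
    "Updike turned 49 last week.",
]

# Seed knowledge: one known (last name, age) pair.
seed_pairs = {("Mailer", "59")}


def induce_patterns(sentences, pairs):
    """Turn each context where a known name is followed closely by its age into a regex."""
    patterns = set()
    for sent in sentences:
        for name, age in pairs:
            m = re.search(re.escape(name) + r"(.{1,20}?)" + re.escape(age) + r"\b", sent)
            if m:
                # Generalize: any capitalized word, the same middle context, any number.
                patterns.add(r"\b([A-Z][a-z]+)" + re.escape(m.group(1)) + r"(\d{1,3})\b")
    return patterns


def apply_patterns(sentences, patterns):
    """Collect every (name, age) pair matched by any induced pattern."""
    found = set()
    for sent in sentences:
        for pat in patterns:
            for m in re.finditer(pat, sent):
                found.add((m.group(1), m.group(2)))
    return found


patterns = induce_patterns(corpus, seed_pairs)
new_pairs = apply_patterns(corpus, patterns)
print(sorted(new_pairs))
```

Here the context " turned " learned from the Mailer sentence also picks up the hypothetical Updike sentence; repeating the two steps with the enlarged pair set is what makes the process a bootstrap.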

Page 6: Data Mining, Information Extraction and Search in Spoken Documents

But Searching Audio Data is Harder

• Large amounts of audio data available: on the web, in company archives, in our homes
  – We have tools supporting random access to text, but for audio we’re limited to serial search
  – How can we develop methods to search audio as easily as text?

Page 7: Data Mining, Information Extraction and Search in Spoken Documents

Applications

• Searching online TV and radio news and archives
• Library of Congress
• Searching a/v archives, movies
• Searching trial recordings and legislative sessions
• Searching meetings, customer care exchanges, focus groups
• Telephone calls and voicemail

Page 8: Data Mining, Information Extraction and Search in Spoken Documents

Current Approach

• Train/adapt a speech recognizer for the corpus
• Produce an ASR transcript
• Segment spoken ‘documents’ into sentences, turns, topics
• Index (errorful) transcripts for Information Retrieval and link to audio via timestamps
• Enables audio search by content
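
The indexing step above can be sketched as an inverted index that maps each recognized word to the recordings and time offsets where it occurs. This is a minimal illustration: the `asr_output` data, the document IDs, and the function names are hypothetical, and a real system would also need to cope with recognition errors in the index.

```python
from collections import defaultdict

# Hypothetical ASR output: (word, start time in seconds) pairs per recording.
asr_output = {
    "news_01": [("good", 0.0), ("evening", 0.4), ("from", 0.9), ("boston", 1.2)],
    "vm_17": [("call", 0.0), ("me", 0.3), ("in", 0.5), ("boston", 0.7)],
}


def build_index(transcripts):
    """Map each lowercased word to a list of (recording id, start time) postings."""
    index = defaultdict(list)
    for doc_id, words in transcripts.items():
        for word, start in words:
            index[word.lower()].append((doc_id, start))
    return index


def search(index, query):
    """Return the postings for a single-word query, or an empty list."""
    return index.get(query.lower(), [])


index = build_index(asr_output)
hits = search(index, "Boston")
print(hits)  # each hit says which recording to open and where to start playback
```

Because every posting carries a timestamp, a text query can jump straight to the matching point in the audio instead of requiring serial listening.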

Page 9: Data Mining, Information Extraction and Search in Spoken Documents

Some Examples

• SpeechBot searching internet broadcasts
• Google Voice Search: search audio by voice (not yet)
• SCANMail searching voicemail

Page 10: Data Mining, Information Extraction and Search in Spoken Documents

Information Extraction and QA from Speech

• DARPA GALE project: improve information gathering from text, speech, translations
• Current Domain: newswire and news broadcasts in English, Arabic, and Mandarin
• 3 competing teams
  – ASR/MT bakeoffs
  – ‘Distillation’ evaluations
    • QA
    • User studies
• Requires identification and annotation of information and ‘formatting’ in speech

Page 11: Data Mining, Information Extraction and Search in Spoken Documents

Sample Distillation Questions

• List facts about <event>
• Find people who are mutual acquaintances of <person1> and <person2>
• Identify persons arrested from <organization> and give their name and role in that organization
• Produce a biography of <person>
• Provide information on <organization>
• Find statements made by or attributed to <person> about <topic>
• How did <country> react to <event>

Page 12: Data Mining, Information Extraction and Search in Spoken Documents

Nightingale Architecture

[Architecture diagram; layout not recoverable from the transcript. Component labels: Audio diarization, ASR, Topic modeling, Prosodic metadata, Speaker modeling, Linguistic structure, Punctuation, Capitalization, MT, Intelligence delivery, Information assimilation, Names, Relations, Prosodic analysis, Info repository. Other labels: Source Language, Target Language, Automatic Annotation, Distillation.]

Page 13: Data Mining, Information Extraction and Search in Spoken Documents

Information Annotation

• Spoken documents …
  – Lack many cues found in text documents
    • Format (sentences, turns, paragraphs)
  – Include spontaneous speech phenomena which are difficult for ASR and NLP technologies to handle
    • Disfluencies, fragments
  – Contain errors
• Annotation can turn a weakness into a strength

Page 14: Data Mining, Information Extraction and Search in Spoken Documents

From an ASR Transcript

• aides tonight in boston in depth the truth squad for special series until election day tonight the truth about the budget surplus of the candidates are promising the two international flash points getting worse while the middle east and a new power play by milosevic and a lifelong a family tries to say one child life by having another amazing breakthrough the u s was was told local own boss good evening uh from the university of massachusetts in boston the site of the widely anticipated first of eight between vice president al gore and governor george w bush with the election now just five weeks away this is the beginning of a sprint to the finish and a strong start here tonight is important this is the stage for the two candidates will appear before a national television audience taking questions from jim lehrer of p b s n b c’s david gregory is here with governor bush claire shipman is covering the vice president claire you begin tonight please

Page 15: Data Mining, Information Extraction and Search in Spoken Documents

To Speaker Segmentation (Diarization)

• Speaker: 0 - aides tonight in boston in depth the truth squad for special series until election day tonight the truth about the budget surplus of the candidates are promising the two international flash points getting worse while the middle east and a new power play by milosevic and a lifelong a family tries to say one child life by having another amazing breakthrough the u s was was told local own boss good evening uh from the university of massachusetts in boston

• Speaker: 1 - the site of the widely anticipated first of eight between vice president al gore and governor george w bush with the election now just five weeks away this is the beginning of a sprint to the finish and a strong start here tonight is important this is the stage for the two candidates will appear before a national television audience taking questions from jim lehrer of p b s n b c’s david gregory is here with governor bush claire shipman is covering the vice president claire you begin tonight please

Page 16: Data Mining, Information Extraction and Search in Spoken Documents

Add Speaker Role Labels

• Anchor - aides tonight in boston in depth the truth squad for special series until election day tonight the truth about the budget surplus of the candidates are promising the two international flash points getting worse while the middle east and a new power play by milosevic and a lifelong a family tries to say one child life by having another amazing breakthrough the u s was was told local own boss good evening uh from the university of massachusetts in boston

• Reporter - the site of the widely anticipated first of eight between vice president al gore and governor george w bush with the election now just five weeks away this is the beginning of a sprint to the finish and a strong start here tonight is important this is the stage for the two candidates will appear before a national television audience taking questions from jim lehrer of p b s n b c’s david gregory is here with governor bush claire shipman is covering the vice president claire you begin tonight please

Page 17: Data Mining, Information Extraction and Search in Spoken Documents

Perform Sentence Detection and Punctuation

• Anchor - Aides tonight in boston. In depth the truth squad for special series until election day. Tonight the truth about the budget surplus of the candidates are promising. The two international flash points getting worse. While the middle east. And a new power play by milosevic and a lifelong a family tries to say one child life by having another amazing breakthrough the u. s. was was told local own boss. Good evening uh from the university of massachusetts in boston.

• Reporter - The site of the widely anticipated first of eight between vice president al gore and governor george w. bush. With the election now just five weeks away. This is the beginning of a sprint to the finish. And a strong start here tonight is important. This is the stage for the two candidates will appear before a national television audience taking questions from jim lehrer of p. b. s. n. b. c.'s david gregory is here with governor bush. Claire shipman is covering the vice president claire you begin tonight please.

Page 18: Data Mining, Information Extraction and Search in Spoken Documents

Detect Story Boundaries

• Anchor - Aides tonight in boston. In depth the truth squad for special series until election day. Tonight the truth about the budget surplus of the candidates are promising. The two international flash points getting worse. While the middle east. And a new power play by milosevic and a lifelong a family tries to say one child life by having another amazing breakthrough the u. s. was was told local own boss. Good evening uh from the university of massachusetts in boston.

• Reporter - The site of the widely anticipated first of eight between vice president al gore and governor george w. bush. With the election now just five weeks away. This is the beginning of a sprint to the finish. And a strong start here tonight is important. This is the stage for the two candidates will appear before a national television audience taking questions from jim lehrer of p. b. s. n. b. c.'s david gregory is here with governor bush. Claire shipman is covering the vice president claire you begin tonight please.

Page 19: Data Mining, Information Extraction and Search in Spoken Documents

Detect Disfluencies (and Keep/Remove)

• Anchor - Aides tonight in boston. In depth the truth squad for special series until election day. Tonight the truth about the budget surplus of the candidates are promising. The two international flash points getting worse. While the middle east. And a new power play by milosevic and a lifelong a family tries to say one child life by having another amazing breakthrough the u. s. was was told local own boss. Good evening uh from the university of massachusetts in boston.

• Reporter - The site of the widely anticipated first of eight between vice president al gore and governor george w. bush. With the election now just five weeks away. This is the beginning of a sprint to the finish. And a strong start here tonight is important. This is the stage for the two candidates will appear before a national television audience taking questions from jim lehrer of p. b. s. n. b. c.'s david gregory is here with governor bush. Claire shipman is covering the vice president claire you begin tonight please.
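
As a rough illustration of the disfluency step, the sketch below flags two easy cases visible in the transcript: filled pauses ("uh") and immediate word repetitions ("was was"). The rules and the names `FILLED_PAUSES`, `detect_disfluencies`, and `remove_disfluencies` are assumptions for illustration; the slides do not say how detection is actually done, and these rules would miss fragments and wrongly flag legitimate repetitions such as "had had".

```python
FILLED_PAUSES = {"uh", "um"}  # assumed inventory, not taken from the slides


def detect_disfluencies(tokens):
    """Return indices of filled pauses and of the first word in an immediate repetition."""
    flagged = []
    for i, tok in enumerate(tokens):
        if tok.lower() in FILLED_PAUSES:
            flagged.append(i)
        elif i + 1 < len(tokens) and tok.lower() == tokens[i + 1].lower():
            flagged.append(i)
    return flagged


def remove_disfluencies(tokens):
    """Drop every flagged token; keeping them instead is just a matter of not calling this."""
    drop = set(detect_disfluencies(tokens))
    return [t for i, t in enumerate(tokens) if i not in drop]


# A fragment of the unpunctuated ASR transcript shown earlier.
tokens = "the u s was was told local own boss good evening uh from the university".split()
cleaned = " ".join(remove_disfluencies(tokens))
print(cleaned)
```

Separating detection from removal mirrors the slide title: the flagged positions can be kept as annotation or stripped, depending on what the downstream component needs.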

Page 20: Data Mining, Information Extraction and Search in Spoken Documents

Detect Named Entities

• Anchor - Aides tonight in Boston. In depth the truth squad for special series until election day. Tonight the truth about the budget surplus of the candidates are promising. The two international flash points getting worse. While the middle east. And a new power play by Milosevic and a lifelong a family tries to say one child life by having another amazing breakthrough the u. s. was was told local own boss. Good evening from the University of Massachusetts in Boston.

• Reporter - The site of the widely anticipated first of eight between vice president Al Gore and Governor George W. Bush. With the election now just five weeks away. This is the beginning of a sprint to the finish. And a strong start here tonight is important. This is the stage for the two candidates will appear before a national television audience taking questions from Jim Lehrer of P.B.S. N.B.C.'s David Gregory is here with Governor Bush. Claire Shipman is covering the vice president Claire you begin tonight please.

Page 21: Data Mining, Information Extraction and Search in Spoken Documents

Resolve References

• Anchor - Aides tonight in Boston. In depth the truth squad for special series until election day. Tonight the truth about the budget surplus of the candidates are promising. The two international flash points getting worse. While the middle east. And a new power play by Milosevic and a lifelong a family tries to say one child life by having another amazing breakthrough the u. s. was was told local own boss. Good evening from the University of Massachusetts in Boston.

• Reporter - The site of the widely anticipated first of eight between vice president Al Gore and Governor George W. Bush. With the election now just five weeks away. This is the beginning of a sprint to the finish. And a strong start here tonight is important. This is the stage for the two candidates will appear before a national television audience taking questions from Jim Lehrer of P.B.S. N.B.C.'s David Gregory is here with Governor Bush [Governor George W. Bush]. Claire Shipman is covering the vice president Claire [Claire Shipman] you begin tonight please.
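
The two resolutions marked in brackets above can be reproduced with a very simple heuristic: link a short name mention to the most recent earlier mention that contains all of its words. The sketch assumes the named-entity step has already produced the mentions in order; the heuristic and the function name `resolve_mentions` are illustrative assumptions, not the method used in the system described.

```python
def resolve_mentions(mentions):
    """Pair each mention with the most recent earlier, longer mention containing all its words."""
    resolved = []
    for i, mention in enumerate(mentions):
        tokens = set(mention.split())
        antecedent = mention  # default: the mention stands for itself
        for earlier in reversed(mentions[:i]):
            if tokens < set(earlier.split()):  # proper subset: earlier mention is fuller
                antecedent = earlier
                break
        resolved.append((mention, antecedent))
    return resolved


# Person mentions from the Reporter turn, in order of appearance.
mentions = [
    "Al Gore",
    "Governor George W. Bush",
    "Jim Lehrer",
    "David Gregory",
    "Governor Bush",
    "Claire Shipman",
    "Claire",
]

for mention, antecedent in resolve_mentions(mentions):
    if mention != antecedent:
        print(f"{mention} [{antecedent}]")
```

Word overlap handles partial names only; pronouns and descriptions such as "the vice president" would need additional knowledge.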

Page 22: Data Mining, Information Extraction and Search in Spoken Documents

Speech Data Mining

• How does it differ from text data mining?
  – Must handle errorful transcription
  – Lacks (reliable) formatting
  – Contains spontaneous speech phenomena
• We need to bring additional sources to bear on the problem

Page 23: Data Mining, Information Extraction and Search in Spoken Documents

Maskey et al 2004: Improving Proper Name Transcription in Voicemail

• How can we improve transcription of proper names without increasing the size of the ASR lexicon?

• Use meta-data available at runtime to hypothesize caller’s and callee’s names
  – Caller ID string – “cname”
  – Name of mailbox owner – “mname”

Page 24: Data Mining, Information Extraction and Search in Spoken Documents

Corpus

• Scanmail corpus
  – 100 hours of voicemail messages from 140 employees of AT&T
  – Manually transcribed with “cname” and “mname” tags
  – Gender balanced
  – ~12% non-native speakers
• 238 random messages for testing, rest (~10,000 messages) for training

Page 25: Data Mining, Information Extraction and Search in Spoken Documents

Approach

• Create a class-based language model
• Create a name network to give instances for the classes of the model
• Replace the class-based language model at runtime with the appropriate name networks, identified from the cname and mname of the call
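
One way to picture the approach above is as a two-part probability: the language model predicts a class token such as "<mname>" or "<cname>", and a per-call name network says which words that token may stand for. The sketch below is an assumption-laden toy: the bigram probabilities, the uniform name networks, and the function names are invented for illustration and are not figures or code from Maskey et al.

```python
# Hypothetical class-based bigram probabilities P(class | previous word),
# as if estimated from transcripts tagged with "cname" and "mname".
class_bigram = {
    ("hi", "<mname>"): 0.30,
    ("is", "<cname>"): 0.25,
}


def name_network(names):
    """Per-call distribution over the words one class token may expand to (uniform here)."""
    return {n: 1.0 / len(names) for n in names}


def word_prob(prev, word, networks):
    """P(word | prev) = sum over classes of P(class | prev) * P(word | class) for this call."""
    total = 0.0
    for cls, net in networks.items():
        total += class_bigram.get((prev, cls), 0.0) * net.get(word, 0.0)
    return total


# Networks built at runtime from one hypothetical call's metadata.
networks = {
    "<mname>": name_network(["alexander", "alex"]),  # mailbox owner
    "<cname>": name_network(["bob", "jones"]),       # caller ID string
}

p_owner = word_prob("hi", "alex", networks)
p_other = word_prob("hi", "smith", networks)
print(p_owner, p_other)
```

The base vocabulary never grows: only the small networks attached to the class tokens change from call to call, which is how names can be recognized without enlarging the ASR lexicon.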

Page 26: Data Mining, Information Extraction and Search in Spoken Documents

Name Network

• To get values for “mname” and “cname”, an internal AT&T employee directory listing (~40,000 people) was used

• “cname” created from variations of static titles (Miss, Mr), full first names and nicknames (Alexander, Alex), and last names (Jones)
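
The variation scheme described above can be sketched as a small generator. Which combinations the real name network contains is not stated on the slide, so the set produced here (bare first name, first plus last, title plus last, bare last name) and the function name `name_variants` are assumptions.

```python
def name_variants(titles, first_names, last_name):
    """Enumerate plausible spoken forms of one person's name."""
    variants = {last_name}
    for first in first_names:
        variants.add(first)                    # "Alex"
        variants.add(f"{first} {last_name}")   # "Alex Jones"
    for title in titles:
        variants.add(f"{title} {last_name}")   # "Mr Jones"
    return sorted(variants)


# Titles, first names/nicknames, and last name taken from the slide's examples.
variants = name_variants(["Miss", "Mr"], ["Alexander", "Alex"], "Jones")
print(variants)
```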

Page 27: Data Mining, Information Extraction and Search in Spoken Documents

Name Network

• Probability within class – training corpus

• Probability within first names – AT&T directory listing

Page 28: Data Mining, Information Extraction and Search in Spoken Documents

Experimental Results

• Word Error Rate (WER) improvement small
  – Absolute reduction of 0.6%
• Named Error Rate (NER) improvement significant
  – Absolute reduction of 20%
• Large reduction in NER important:
  – Getting a name right is important to business users
  – Scanmail users expressed a strong desire for the system to recognize their own names correctly
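
The gap between the two metrics is easy to see on a toy example. The sketch below computes WER as word-level edit distance divided by reference length, which is the standard definition. The slides do not define the name error rate, so `name_error_rate` uses an assumed definition (the fraction of reference names missing from the hypothesis), and the example sentences are hypothetical. Both functions assume non-empty references.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance (substitutions, insertions, deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] = edit distance between the first i reference words and first j hypothesis words
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # match or substitution
    return dist[len(ref)][len(hyp)] / len(ref)


def name_error_rate(ref_names, hypothesis):
    """Assumed definition: share of reference names that do not appear in the hypothesis."""
    hyp_words = set(hypothesis.split())
    missed = sum(1 for name in ref_names if name not in hyp_words)
    return missed / len(ref_names)


ref = "hi alex this is bob jones please call me back"
hyp = "hi alice this is bob jones please call me back"
wer = word_error_rate(ref, hyp)
ner = name_error_rate(["alex", "bob", "jones"], hyp)
print(f"WER = {wer:.2f}, name error rate = {ner:.2f}")
```

A single misrecognized name barely moves WER because names are a small share of all words, yet it is a large share of the names, which is why a small WER change can coexist with a large change in name accuracy.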

Page 29: Data Mining, Information Extraction and Search in Spoken Documents

Next Class

• HTK Toolkit and HW5 (Fadi Biadsy)