13.0 Voice-based Information Retrieval

References:

1. “Speech and Language Technologies for Audio Indexing and Retrieval”, Proceedings of the IEEE, Aug 2000

2. “Discriminating Capabilities of Syllable-based Features and Approaches of Utilizing Them for Voice Retrieval of Speech Information in Mandarin Chinese”, IEEE Transactions on Speech and Audio Processing, Vol. 10, No. 5, July 2002, pp. 303-314

3. Baeza-Yates & Ribeiro-Neto, “Modern Information Retrieval”, ACM Press, 1999

4. ACM Special Interest Group on Information Retrieval (SIGIR)

5. “A Hidden Markov Model Information Retrieval System”, ACM SIGIR, 1999

6. “Improved Spoken Document Retrieval with Dynamic Key Term Lexicon and Probabilistic Latent Semantic Analysis (PLSA)”, International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2006

7. “Position Specific Posterior Lattices for Indexing Speech”, 43rd Annual Meeting of the Association for Computational Linguistics (ACL), 2005, pp. 443-450

Voice-enabled Web-based Applications

[Figure labels: Future Networks; Private and Personal Services; Public Information and Services; voice information; text information; text-to-speech; voice-based information output; spoken dialogue]

• Network Access is Primarily Text-based today, but almost all Roles of Texts can be Replaced by Voice in the Future

• Human-Network Interactions can be Accomplished by Spoken Dialogues

• Voice-based Information Retrieval needs to be integrated with Spoken Dialogues

• More Multi-media Information including Voice but not including Enough Text will be Available on the Web in the Future

Voice-based Information Retrieval

[Figure labels: Voice Queries; Text Queries; Text Information; Voice Information. Example query: "I want to find news about the terrorist attack on New York" (我想找有關紐約受到恐怖攻擊的新聞)]


•Speech/Text Queries, Speech/Text Documents

•Mobile/Office User Environments with Multi-modality

•Speech Provides Better User Interface in Wireless Environment

Information Retrieval

• Indexing
– Document representation: d
• Query formation
– User request representation: q
• Retrieval
– Matching query to documents
– Returning relevant documents
• Relevance feedback
– Assessing retrieved results
– Modifying initial query
– Iterated retrieval: automatic (blind)/manual
• Performance evaluation
– Performance measure

[Figure labels: user request; query representation: q; document representation: d; list of relevant documents in order; Evaluation; Feedback]

Performance Measures

• Recall and Precision Rates

• Non-Interpolated Average Precision– Averaged at all relevant documents retrieved and over all queries

– e.g. relevant documents ranked at 1, 5, 10, precisions are 1/1, 2/5, 3/10, non-interpolated average precision=(1/1+2/5+3/10)/3


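[Figure: Venn diagram of the retrieved documents and the relevant documents]

$\text{Precision rate} = \dfrac{|\text{relevant} \cap \text{retrieved}|}{|\text{retrieved}|}$

$\text{Recall rate} = \dfrac{|\text{relevant} \cap \text{retrieved}|}{|\text{relevant}|}$

– similar to missing/false alarm rates
– recall-precision plot similar to ROC curves
– recall rate may be difficult to evaluate, while precision rate is directly perceived by users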
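The measures on this slide can be checked with a few lines of Python. This is a minimal sketch: the function names and the document IDs `d1` to `d10` are made up for illustration, and the ranked list reproduces the slide's example of relevant documents at ranks 1, 5 and 10.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall rates of a retrieved set against the relevant set."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


def non_interpolated_average_precision(ranked, relevant):
    """Average the precision measured at the rank of each relevant document retrieved."""
    relevant_set = set(relevant)
    hits = 0
    precisions = []
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant_set:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


# Slide example: relevant documents ranked at 1, 5 and 10.
ranked = [f"d{i}" for i in range(1, 11)]
relevant = ["d1", "d5", "d10"]
ap = non_interpolated_average_precision(ranked, relevant)
top5_precision, top5_recall = precision_recall(ranked[:5], relevant)
```

For a set of queries, the same function would be applied per query and the results averaged.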

Approaches to Speech-based Information Retrieval

• Indexing Elements
– Words: Large-vocabulary Based
· create text transcription of spoken documents/queries by speech recognition
· use text retrieval methods
· error propagation, out-of-vocabulary (OOV) problems, special terms
– Subword Units: Subword Based
· subword units: phones/syllables/something similar
· a segment of one to a few subword units may carry some indexing information
· not limited by the vocabulary
· small size/handling some OOV/probably more ambiguity
– Keywords: Keyword Based
· based on a set of keywords
· keyword selection: user-specified/a priori/fixed/automatically generated
· special terms for dynamic documents
– Hybrid: Fusion of Information

• Indexing Features
– a single element
– different combinations of more than one element
– pre-defined, or automatically selected by data-driven approaches
– each of such features is called an “indexing term”

• Retrieval Model Examples
– vector space models
– latent semantic indexing (LSI)
– statistical (probabilistic) models
– hidden Markov model (HMM)
– combinations/hybrid models

Vector Space Model

• Vector Representations of query q and document d
– for each type j of indexing feature a vector is generated
– each component in this vector is the weighted statistics $z_{jt}$ of a specific indexing term t

$z_{jt} = [1 + \ln(c_t)] \cdot \ln(N / N_t)$

– Term Frequency (TF): $1 + \ln(c_t)$; Inverse Document Frequency (IDF): $\ln(N / N_t)$
– $c_t$: frequency counts for the indexing term t present in the query q or document d (for text), or sum of normalized recognition scores or confidence measures for the indexing term t (for speech)
– N: total number of documents in the database
– $N_t$: total number of documents in the database which include the indexing term t
– IDF: the significance (or importance) or indexing power for the indexing term t

• The Overall Relevance Score is the Weighted Sum of the Relevance Scores for all Types of Indexing Features

$R_j(q_j, d_j) = \dfrac{q_j \cdot d_j}{\|q_j\| \, \|d_j\|}$

$S(q, d) = \sum_j w_j \, R_j(q_j, d_j)$

– $q_j$, $d_j$: vector representations for query and document with type j of indexing feature
– $w_j$: weighting coefficients
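A minimal sketch of this scoring, assuming the per-feature relevance score is the cosine between the query and document vectors and using the weight z = [1 + ln(c_t)]·ln(N/N_t) described on the slide. The toy documents, the feature-type name `unigram` and all function names are illustrative assumptions.

```python
import math
from collections import Counter


def tfidf_vector(term_counts, doc_freq, num_docs):
    """Sparse vector with weights [1 + ln(c_t)] * ln(N / N_t)."""
    vec = {}
    for term, count in term_counts.items():
        n_t = doc_freq.get(term, 0)
        if count > 0 and n_t > 0:
            vec[term] = (1.0 + math.log(count)) * math.log(num_docs / n_t)
    return vec


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def overall_relevance(query_vecs, doc_vecs, weights):
    """Weighted sum of the per-feature-type relevance scores."""
    return sum(w * cosine(query_vecs[j], doc_vecs[j]) for j, w in weights.items())


# Toy collection (hypothetical), one feature type only.
docs = {
    "d1": ["tai", "pei", "tai", "wan"],
    "d2": ["new", "york", "news"],
    "d3": ["tai", "wan", "news"],
}
num_docs = len(docs)
doc_freq = Counter(t for tokens in docs.values() for t in set(tokens))
query_vec = tfidf_vector(Counter(["tai", "wan"]), doc_freq, num_docs)

scores = {
    name: overall_relevance(
        {"unigram": query_vec},
        {"unigram": tfidf_vector(Counter(tokens), doc_freq, num_docs)},
        {"unigram": 1.0},
    )
    for name, tokens in docs.items()
}
```

With several feature types, each type gets its own pair of vectors and its own weighting coefficient in `weights`.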

Improved Retrieval Technique Examples

• Blind Relevance Feedback
– the information from the relevant and irrelevant documents retrieved in the previous stage is used to identify more helpful indexing terms
– the initial query is reformulated accordingly:

$q' = \alpha \cdot q + \beta \cdot \sum_{d \in D_r} d - \gamma \cdot \sum_{d \in D_{irr}} d$

– q, d: vector representations for the query and documents
– $D_r$: selected set of relevant documents retrieved in the previous stage
– $D_{irr}$: selected set of irrelevant documents deleted in the previous stage
– q': new query representation
– $\alpha, \beta, \gamma$: weighting coefficients

• Query Expansion by Term Association
– the indexing terms co-occurring frequently in the same documents are assumed to have some synonymity association
– build an association matrix for each type of the indexing features, in which each entry a(i, j) stands for the association between indexing terms $t_i$ and $t_j$:

$0 \le a(i, j) \le 1$, as an example $a(i, j) = \dfrac{\hat{n}_{i,j}}{n_i + n_j - \hat{n}_{i,j}}$

– $n_i$, $n_j$: number of documents in the database including the indexing terms $t_i$, $t_j$
– $\hat{n}_{i,j}$: number of documents in the database including both indexing terms $t_i$ and $t_j$
– reformulate the query expression by adding indexing terms with higher synonymity
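The two techniques on this slide can be sketched as follows. The query update adds the weighted relevant-document vectors and subtracts the weighted irrelevant ones. The association score shown is one common normalized co-occurrence measure, used here as an assumption. The default coefficient values and all names are illustrative.

```python
def reformulate_query(query, relevant_docs, irrelevant_docs,
                      alpha=1.0, beta=0.5, gamma=0.25):
    """New query = alpha*q + beta*sum(relevant d) - gamma*sum(irrelevant d)."""
    new_query = {t: alpha * w for t, w in query.items()}
    for doc in relevant_docs:
        for t, w in doc.items():
            new_query[t] = new_query.get(t, 0.0) + beta * w
    for doc in irrelevant_docs:
        for t, w in doc.items():
            new_query[t] = new_query.get(t, 0.0) - gamma * w
    return new_query


def association(docs_terms, t_i, t_j):
    """Co-occurrence association in [0, 1]: n_ij / (n_i + n_j - n_ij)."""
    n_i = sum(1 for terms in docs_terms if t_i in terms)
    n_j = sum(1 for terms in docs_terms if t_j in terms)
    n_ij = sum(1 for terms in docs_terms if t_i in terms and t_j in terms)
    denom = n_i + n_j - n_ij
    return n_ij / denom if denom else 0.0


new_q = reformulate_query({"a": 1.0}, [{"a": 1.0, "b": 2.0}], [{"c": 4.0}])
a_xy = association([{"x", "y"}, {"x"}, {"y", "z"}, {"x", "y"}], "x", "y")
```

In blind feedback the "relevant" set is simply the top of the first-pass ranking, so no user judgment is needed.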

Difficulties in Speech-based Information Retrieval for Chinese Language

• Even for Text-based Information Retrieval, Flexible Wording Structure Makes it Difficult to Search by Comparing the Character Strings Alone

– name/title 李登輝→李前總統登輝,李前主席登輝 (President T.H Lee)

– arbitrary abbreviation 北二高→北部第二高速公路 (Second Northern Freeway)

– similar phrases 中華文化→中國文化 (Chinese culture)

– translated terms 巴塞隆那→巴瑟隆納 (Barcelona)

• Word Segmentation Ambiguity Even for Text-based Information Retrieval

–腦科 (human brain studies) →電腦科學 (computer science)

–土地公 (God of earth) →土地公有政策 (policy of public sharing of the land)

• Uncertainties in Speech Recognition

– errors (deletion, substitution, insertion)

– out of vocabulary (OOV) words, etc.

– very often the key phrases for retrieval are OOV

Syllable-Level Indexing Features for Chinese Language

• A Whole Class of Syllable-Level Indexing Features with Complete Phonological Coverage and Better Discriminating Functions
– Overlapping syllable segments with length N: S(N)
– Syllable pairs separated by M syllables: P(M)

• Character- or Word-Level Features can be Similarly Defined

Example for a syllable sequence S1 S2 S3 S4 S5 ... S10:

Syllable Pair Separated by M Syllables    Examples
P(M), M=1                                 (s1 s3) (s2 s4) ... (s8 s10)
P(M), M=2                                 (s1 s4) (s2 s5) ... (s7 s10)
P(M), M=3                                 (s1 s5) (s2 s6) ... (s6 s10)
P(M), M=4                                 (s1 s6) (s2 s7) ... (s5 s10)
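Both feature families can be generated mechanically from a syllable sequence. The sketch below uses hypothetical function names and the ten-syllable sequence s1 to s10 from the slide.

```python
def overlapping_segments(syllables, n):
    """S(N): all overlapping syllable segments of length n."""
    return [tuple(syllables[i:i + n]) for i in range(len(syllables) - n + 1)]


def separated_pairs(syllables, m):
    """P(M): all syllable pairs separated by exactly m syllables."""
    return [(syllables[i], syllables[i + m + 1])
            for i in range(len(syllables) - m - 1)]


syllables = [f"s{i}" for i in range(1, 11)]
s2 = overlapping_segments(syllables, 2)   # (s1 s2) (s2 s3) ... (s9 s10)
p1 = separated_pairs(syllables, 1)        # (s1 s3) (s2 s4) ... (s8 s10)
p4 = separated_pairs(syllables, 4)        # (s1 s6) (s2 s7) ... (s5 s10)
```

The same functions apply unchanged to character or word sequences.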

Syllable-Level Statistical Features

• Single Syllables
– each syllable is usually shared by more than one character with different meanings, thus causing ambiguity
– all words are composed of syllables, thus partially handling the OOV problem
– very often relevant words have some syllables in common

• Overlapping Syllable Segments with Length N
– capturing the information of polysyllabic words or phrases with flexible wording structures
– majority of Chinese words are bi-syllabic
– not too many polysyllabic words share the same pronunciation

• Syllable Pairs Separated by M Syllables
– tackling the problems arising from the flexible wording structure, abbreviations, and deletion, insertion, substitution errors in speech recognition

Improved Syllable-level Indexing Features

• Syllable Lattice and Syllable-level Utterance Verification
– including multiple syllable hypotheses to construct syllable-aligned lattices for both query and documents
– generating multiple syllable-level indexing features from syllable lattices
– filtering out indexing terms with lower acoustic confidence scores

• Infrequent Term Deletion (ITD)
– syllable-level statistics trained with text corpus used to prune infrequent indexing terms

• Stop Terms (ST)
– indexing terms with the lowest IDF scores are taken as the stop terms

[Figure legend: syllables with higher acoustic confidence scores; syllables with lower acoustic confidence scores; syllable pairs S(N), N=2 pruned by ITD; syllable pairs S(N), N=2 pruned by ST]
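A small sketch of the two pruning steps, under stated assumptions: infrequent terms are those whose count in a text corpus falls below a threshold, and stop terms are the `num_stop_terms` terms with the lowest IDF = ln(N/N_t). The threshold, the counts and the function names are hypothetical.

```python
import math
from collections import Counter


def lowest_idf_terms(docs, num_stop_terms):
    """Stop terms: the indexing terms with the lowest IDF scores."""
    num_docs = len(docs)
    doc_freq = Counter(t for terms in docs for t in set(terms))
    idf = {t: math.log(num_docs / n_t) for t, n_t in doc_freq.items()}
    ranked = sorted(idf, key=lambda t: (idf[t], t))  # ties broken by term
    return set(ranked[:num_stop_terms])


def prune_terms(terms, corpus_counts, min_count, stop_terms):
    """Drop infrequent terms (ITD) and stop terms (ST)."""
    return [t for t in terms
            if corpus_counts.get(t, 0) >= min_count and t not in stop_terms]


# Syllable pairs S(N), N=2, for three toy documents.
docs = [
    [("a", "b"), ("b", "c")],
    [("a", "b"), ("c", "d")],
    [("a", "b"), ("x", "y")],
]
corpus_counts = {("a", "b"): 100, ("b", "c"): 50, ("c", "d"): 40, ("x", "y"): 1}
stop_terms = lowest_idf_terms(docs, num_stop_terms=1)
pruned = [prune_terms(d, corpus_counts, 5, stop_terms) for d in docs]
```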

Hidden Markov Model (HMM) for Speech-based Information Retrieval

• Modeling the Query q as a Sequence of Input Observations (Indexing Terms), $q = t_1 t_2 \ldots t_n \ldots t_N$, and each Document d as an HMM (1-state at the moment) Composed of Distributions of N-gram Parameters

• MAP Principle (as a simple example)

$d^* = \arg\max_d \text{Prob}(d \text{ is } R \mid q) = \arg\max_d \text{Prob}(q \mid d \text{ is } R)\, \text{Prob}(d \text{ is } R) \approx \arg\max_d \text{Prob}(q \mid d \text{ is } R)$

– q: input query, d: all documents in the database, “is R”: is relevant
– reduced to maximum likelihood without prior knowledge

• Observation Probability in the HMM State (as a simple example)

$\text{Prob}(q \mid d \text{ is } R) = \prod_{n} \left[ m_1 P(t_n \mid d) + m_2 P(t_n \mid C) + m_3 P(t_n \mid t_{n-1}, d) + m_4 P(t_n \mid t_{n-1}, C) \right]$, with $m_1 + m_2 + m_3 + m_4 = 1$

– $P(t_n \mid d)$, $P(t_n \mid t_{n-1}, d)$: unigram/bigram trained from the document d
– $P(t_n \mid C)$, $P(t_n \mid t_{n-1}, C)$: unigram/bigram trained from a large corpus, especially helpful for missing terms in the documents
– $m_1, m_2, m_3, m_4$ trained by EM/MCE
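A minimal sketch of ranking documents by the query likelihood under a one-state model that mixes document and corpus unigram/bigram probabilities with weights m1 to m4. The mixture weights below are fixed by hand purely for illustration (the slide says they are trained by EM/MCE), the first query term uses only the unigram components as a simplification, and the toy documents and function names are assumptions.

```python
import math
from collections import Counter


def train_ngrams(tokens):
    """Maximum-likelihood unigram and bigram estimators for a token list."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    histories = Counter(tokens[:-1])
    total = len(tokens)

    def p_unigram(t):
        return unigrams[t] / total if total else 0.0

    def p_bigram(t, prev):
        return bigrams[(prev, t)] / histories[prev] if histories[prev] else 0.0

    return p_unigram, p_bigram


def query_log_likelihood(query, doc_models, corpus_models, m=(0.4, 0.3, 0.2, 0.1)):
    """log Prob(q | d is R) with a four-component mixture per query term."""
    d_uni, d_bi = doc_models
    c_uni, c_bi = corpus_models
    total, prev = 0.0, None
    for t in query:
        p = m[0] * d_uni(t) + m[1] * c_uni(t)
        if prev is not None:  # bigram components need a preceding term
            p += m[2] * d_bi(t, prev) + m[3] * c_bi(t, prev)
        if p <= 0.0:
            return float("-inf")
        total += math.log(p)
        prev = t
    return total


docs = {
    "d1": "tai pei news tai wan".split(),
    "d2": "new york news".split(),
}
corpus_models = train_ngrams(sum(docs.values(), []))
query = ["tai", "wan"]
scores = {name: query_log_likelihood(query, train_ngrams(tokens), corpus_models)
          for name, tokens in docs.items()}
best = max(scores, key=scores.get)
```

The corpus components keep the likelihood non-zero when a query term is missing from a document, which is the role the slide assigns to them.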

Latent Semantic Indexing (LSI) Model for Speech-based Information Retrieval

• Term-Document Matrix
– M indexing terms $\{t_1, t_2, \ldots, t_M\}$ and N documents $\{d_1, d_2, \ldots, d_N\}$
– $W = [w_{ij}]_{M \times N}$
– $w_{ij} = l_{ij} \cdot g_i$, $l_{ij}$: local weight, $g_i$: global weight
– e.g. TF/IDF: $l_{ij} = 1 + \ln(c_{ij})$, $g_i = \ln(N / N_i)$; or normalized with document length and term entropy

• Singular Value Decomposition (SVD)
– $W \approx U S V^T$, S: diagonal with singular values
– $\bar{u}_i = u_i S$: term vector; $\bar{v}_j = v_j S$: document vector
– reduced to R-dimensional space of “latent semantic concepts”

• Query q Considered as a New Document “Folded-in”
– $\bar{v}_q = d_q^T U$
– relevance score: computed between the folded-in query vector and each document vector in the R-dimensional space

[Figure labels: Concept Matching; Term Matching]

Speech-based Information Retrieval by Keywords: An Example

• Automatic Keyword Extraction from Texts integrated with Keyword

• Integration with Other Approaches

[Figure labels: input speech query; transcription of speech; Keyword Spotting; Keyword Set; Automatic Keyword Extraction; Extracted Keywords; Keyword-based Retrieval; Retrieved Text/Speech Documents]


Voice-based Information Retrieval

— how far are we from the text-based information retrieval ?

Lin-shan Lee

National Taiwan University

Taipei, Taiwan, ROC

Voice-based Information Retrieval

Text/Voice-based Information Retrieval

• Text-based Information Retrieval Extremely Successful
– information desired by the users can be obtained efficiently in real time
– all users like it
– producing very successful applications and industry

• All Roles of Texts can be Accomplished by Voice
– spoken information or multimedia information with voice in audio part
– voice instructions/queries via handheld devices

• How about Voice-based Information Retrieval?




Server Documents/Information

Voice-based Information Retrieval (1/2)

[Figure labels: Voice Instructions/Queries; Text Instructions/Queries; Text Information; Voice Information (multimedia including audio part). Example query: "Newly elected president of US?"; example result: "Barack Obama…."]

• If Voice Documents/Queries can be Accurately Recognized
– voice-based reduced to text-based information retrieval

• Correct but Never Possible

Voice-based Information Retrieval (2/2)

[Figure labels: Voice Instructions/Queries; Text Instructions/Queries; Text Information; Voice Information (multimedia including audio part). Example query: "Newly elected president of US?"; example result: "Barack Obama…."]

• User Instructions and/or Network Content Can be in Form of Voice
– text queries/spoken documents
– spoken queries/text documents
– spoken queries/spoken documents

Text Queries/Spoken Documents

• Spoken Document Retrieval

– started with longer documents/queries at relatively higher ASR accuracies

– started with text-based approaches applied on 1-best transcriptions

– inadequate for short documents/queries with relatively poor ASR accuracies

• Spoken Term Detection

– emerged probably from the successful term matching paradigm for text-based approaches

– considering multiple alternatives from ASR output (e.g. lattices) to handle ASR errors

– different from the traditional task of Keyword Spotting in that the query set is open

[Chelba, Hazen, Saraclar, IEEE SPM 08][Vergyri, et al, Interspeech 07]

[Saraclar & Sproat, HLT 04][Mamou, et al, SIGIR 06][Chelba & Acero, ACL


Spoken Queries/Text Documents

• Voice Search
– information to be retrieved existing in a large text database (e.g. directory assistance)
– out-of-vocabulary (OOV) words in the database
– disambiguated by dialogues

[Wang & Acero, IEEE SPM 08][Acero, et al, ICASSP 08]

[Yu, Wang, Acero, Interspeech 07]

[Figure labels: user; query; ASR; n-best results; Search; Database]

• Spoken Query Processing
– using a lattice of possible terms as the queries
– more semantic analysis performed during retrieval

[Moreno-Daniel, Juang, Wilpon, ICASSP 07, Interspeech 08]

Spoken Queries/Spoken Documents

• Uncertainty on Both Sides

• Query-by-example

[Chia, et al, SIGIR 08]

• Comparing Two Lattices of Queries/Documents by Graphical Model

[Lin et al, Interspeech 08]

Wireless and Multimedia Technologies are Creating an Environment of Voice-based Information Retrieval

[Figure labels: voice information; Multimedia; text information]

• Many Hand-held Devices with Multimedia Functionalities Commercially Available Today

• Unlimited Quantities of Multimedia Content Available over the Internet

• User-Content Interaction necessary for Information Retrieval can be Accomplished by Spoken and Multi-modal Dialogues

• Network Access is Primarily Text-based today, but almost all Roles of Texts can be Accomplished by Voice

[Figure labels: Multimedia Content Analysis; Text Information Retrieval; Text Content; Voice-based Information; Text-to-Speech Synthesis; Spoken and multi-modal]

Why Is Text-based Information Retrieval Useful and Attractive?

Text-based Information Retrieval

• Rich resources: huge quantities of text documents available over the Internet

• Quantity continues to increase exponentially due to convenient access

• Retrieval accuracy acceptable to users

• Retrieved documents properly ranked and filtered

• Retrieved documents easily summarized on-screen, thus easily scanned and selected by user

• Users may easily select query terms suggested for next iteration retrieval in an interactive process

How about Voice-based Information Retrieval?

• Spoken/multimedia content is the new trend

• Can be realized even sooner given mature technologies

• Problems with speech recognition errors, especially for spontaneous speech under adverse environments

• Spoken/multimedia documents not easily summarized on-screen, thus difficult to scan and select

• Lacks efficient user-system interaction













Accuracy for Voice-based Information Retrieval

• Low Recognition Accuracies for Spontaneous Speech including Out-of-Vocabulary (OOV) Words under Adverse Environment
– considering lattices with multiple alternatives rather than 1-best output
· higher probability of including correct words, but also including more noisy words
· correct words may still be excluded (OOV and others)
· huge memory and computation requirements
– other approaches: confusion matrix, fuzzy matching…

[Figure: word lattice from start node to end node along the time index; Wi: word hypotheses]

[Mamou & Ramabhadran, Interspeech 08]

Efficient Forms of Lattices for Indexing Purposes – Indexing Structures

• Lattices

[Figure: lattice of word hypotheses W1 to W10 between start node and end node along the time index]

• An Example of Indexing Structure
– reduced memory and computation requirements (still huge…)
– added possible paths
– noisy words discriminated by posterior probabilities or similar scores
– n-grams matched and accumulated for all possible n

[Figure: indexing structure along the time index in which each word hypothesis Wi carries a score pi, for W1, p1 to W10, p10]

Examples of Indexing Structures

• Position Specific Posterior Lattices (PSPL) [Chelba & Acero, ACL 05]

• Confusion Networks (CN) [Mamou, et al, SIGIR 06][Hori, Hazen, Glass, ICASSP 07]

• Time-based Merging for Indexing (TMI) [Zhou, Chelba, Seide, HLT 06][Seide, et al, ASRU 07]

• Time-anchored Lattice Expansion (TALE) [Seide, et al, ASRU 07]

• Weighted Finite State Transducer (WFST)
– directly compile the lattice into a weighted finite state transducer

[Allauzen, et al, HLT 04][Saraclar & Sproat, HLT 04]

Two Examples of Indexing Structures: Position Specific Posterior Lattices (PSPL), Confusion Networks (CN)

• PSPL
– locating a word in a segment according to the order of the word in a path

• CN
– clustering several words in a segment according to similar time spans and word

[Figure: lattice of word hypotheses W1 to W10 from start node to end node along the time index, with all paths listed (W1W2, W3W4W5, W6W8W9W10, ...); the PSPL structure and the CN structure are each drawn as segments 1 to 4 holding entries of the form "Wi: prob"]
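The PSPL idea of locating each word by its order in a path can be sketched by accumulating path posteriors per position. This is a simplification under stated assumptions: the lattice is given as an explicit list of paths rather than processed with a forward-backward pass, only the three paths legible on the slide are used, and the path probabilities and the function name are invented for illustration.

```python
from collections import defaultdict


def build_pspl(paths):
    """Map position -> {word: posterior}, summing over all paths.

    paths: list of (word_sequence, path_probability) pairs.
    """
    total = sum(prob for _, prob in paths)
    pspl = defaultdict(lambda: defaultdict(float))
    for words, prob in paths:
        for position, word in enumerate(words, start=1):
            pspl[position][word] += prob / total
    return {pos: dict(words) for pos, words in pspl.items()}


# Hypothetical path probabilities for three of the lattice paths.
paths = [
    (["W1", "W2"], 0.5),
    (["W3", "W4", "W5"], 0.3),
    (["W6", "W8", "W9", "W10"], 0.2),
]
pspl = build_pspl(paths)
```

Each segment of the resulting structure holds the words that can occupy that position together with their posterior probabilities, which is the "Wi: prob" layout of the figure.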


OOV or Rare Words Handled by Subword Units

• OOV Word W=w1w2w3w4
– wi: subword units: phonemes, syllables…
– a, b, c, d, e: other subword units

• W can’t be Recognized and never Appears in Lattice
– can’t be found
– W=w1w2w3w4 hidden at subword level
– can be matched at subword level without being recognized

• Subword-based PSPL (S-PSPL) or CN (S-CN), for Example

[Figure: lattice along the time index]
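A toy illustration of how an OOV word can still be matched at the subword level: the recognizer outputs only in-vocabulary words, but their subword sequences, read across word boundaries, contain w1 w2 w3 w4. The lexicon entries `X`, `Y`, `Z` and the function name are hypothetical, and a single recognized path stands in for the full lattice.

```python
def find_subword_sequence(word_path, lexicon, query_subwords):
    """Return start indices where the query's subword string occurs
    in the subword expansion of a recognized word path."""
    subwords = [s for word in word_path for s in lexicon[word]]
    n = len(query_subwords)
    return [i for i in range(len(subwords) - n + 1)
            if subwords[i:i + n] == query_subwords]


# In-vocabulary words and their subword units; the OOV word W is not listed.
lexicon = {
    "X": ["a", "w1", "w2"],
    "Y": ["w3", "w4", "b"],
    "Z": ["c", "d", "e"],
}
recognized_path = ["Z", "X", "Y"]          # W itself never appears here
matches = find_subword_sequence(recognized_path, lexicon,
                                ["w1", "w2", "w3", "w4"])
```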

Subword-based Indexing Structures (1/2)

[Figure: lattices along the time index with subword arcs, labeled for example w1_3, w6_2, w8_1, w8_2]

• Constructed from Phone Lattices (assuming the subword unit is the phone) from Phone Decoder

– Relatively higher phone error rates

[Ng, MIT 00][Wallace, et al, Interspeech 07]

• Word Lattices Represented by Subword Arcs:

– Only sub-strings of subword units for in-vocabulary words can be generated

[Saraclar & Sproat, HLT 04][Vergyri, et al, Interspeech 07]

Subword-based Indexing Structures (2/2)

• Subword-based PSPL and CN (S-PSPL, S-CN)
– strings of subword units are not constrained by in-vocabulary words any longer

[Pan & Lee, Interspeech 07][Pan & Lee, ASRU 07]

[Figure: S-PSPL structure and S-CN structure, each drawn as segments 1, 2, ..., 8 holding entries such as "w1_1: prob", "w1_2: prob", "w5_4: prob", "w2_4: prob"]

• Hybrid Word-based and Subword-based Structures [Yu & Seide, HLT 05]

Frequently Used Subword Units – Language Dependent (1/2)

• Phonemes
– English and many alphabetic languages
– Phone n-grams
– Particles: groups of phonemes obtained data-driven

[Ng, MIT 00][Wallace, et al, Interspeech 07][Logan, et al, IEEE T. Multimedia 05]

• Graphemes [Wang & King, ICASSP 08]

• Graphones [Bisani & Ney, Interspeech 05][Akbacak & Vergyri, ICASSP 08]

• Morphs
– Morph-based languages: Finnish, Turkish, etc.
– Morpheme-like units

[Turunen & Kurimo, SIGIR 07][Parlak & Saraclar, ICASSP 08]

Frequently Used Subword Units – Language Dependent (2/2)

• Phonetic Word Fragments
– Derived bottom-up data-driven

[Yu & Seide, HLT 05]

• Syllables/Characters
– Mandarin Chinese and similar monosyllable-based languages
– Syllable/character n-grams
– Syllable/character pair separated by a syllable/character

[Chen & Lee, IEEE T. SAP 02][Pan & Lee, ASRU 07]

[Meng & Seide, ASRU 07, Interspeech 08]

[Shao & Seide, Interspeech 08]

User-System Interaction for Voice-based Information Retrieval

Issues in User-System Interaction: Difficulties in Browsing, Scanning, and Selecting Multimedia/Spoken Documents

• Text Documents (including those for voice search, etc.) are Better Structured and Easier to Browse
– in paragraphs with titles, or in well structured databases
– easily summarized on-screen
– easily scanned and selected by user

• Multimedia/Spoken Documents are just Video/Audio Signals
– not easily summarized on-screen
– difficult to scan and select
– lacks efficient scenario for user-system interaction

Proposed Approach: Multimedia/Spoken Document Understanding and Organization for Multi-modal User Interfaces

• Semantic Analysis for Spoken Documents
– analyzing the semantic content of the spoken documents

• Key Term Extraction from Multimedia/Spoken Documents
– very often are out-of-vocabulary (OOV) words such as person/organization/location names

• Multimedia/Spoken Document Segmentation
– automatically segmenting a spoken document into short paragraphs, each with a central topic

• Summarization and Title Generation for Multimedia/Spoken Documents
– automatically generating a summary and a title (in text or speech form) for each short paragraph

• Topic Analysis and Organization for Multimedia/Spoken Documents
– analyzing the subject topics for the short paragraphs and organizing them into graphic structures

[Lee & Chen, IEEE SPM 05][Lee, et al, Interspeech 06]

Creating A Set of Latent Topics between A Set of Terms and A Set of Documents

Modeling the Relationships by Probabilistic Models Trained with EM Algorithm

An Example Approach of Semantic Analysis for Spoken Documents : Probabilistic Latent Semantic Analysis (PLSA)
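For reference, the standard PLSA formulation that matches this description is written out below. The notation is an assumption chosen to agree with the symbols used on the following slides: terms $t_j$, documents $D_i$, latent topics $T_k$, and $n(t_j, D_i)$ for the count of term $t_j$ in document $D_i$.

```latex
% Model: each term occurrence in a document is generated through a latent topic
P(t_j \mid D_i) = \sum_{k} P(t_j \mid T_k)\, P(T_k \mid D_i)

% E-step: posterior of the latent topic for each (document, term) pair
P(T_k \mid D_i, t_j) =
  \frac{P(t_j \mid T_k)\, P(T_k \mid D_i)}
       {\sum_{l} P(t_j \mid T_l)\, P(T_l \mid D_i)}

% M-step: re-estimate the two sets of parameters
P(t_j \mid T_k) =
  \frac{\sum_{i} n(t_j, D_i)\, P(T_k \mid D_i, t_j)}
       {\sum_{j'} \sum_{i} n(t_{j'}, D_i)\, P(T_k \mid D_i, t_{j'})}
\qquad
P(T_k \mid D_i) =
  \frac{\sum_{j} n(t_j, D_i)\, P(T_k \mid D_i, t_j)}
       {\sum_{j} n(t_j, D_i)}
```

The E-step and M-step are alternated until the likelihood of the term-document counts stops improving.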

Latent Topic Entropy

[Figure: two example topic distributions of a term, one labeled "carries less topical information" and the other "carries more topical information"]

Key Term Extraction (1/2)

[Kong & Lee, ICASSP 06][Hsieh & Lee, ICASSP 06]
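A minimal sketch of latent topic entropy, assuming it is the Shannon entropy of a term's distribution over latent topics, P(Tk|tj). The two example distributions and the function name are hypothetical.

```python
import math


def latent_topic_entropy(topic_posteriors):
    """Entropy of P(T_k | t_j) over the latent topics (natural log)."""
    return -sum(p * math.log(p) for p in topic_posteriors if p > 0.0)


# A term spread evenly over four topics vs. a term focused on one topic.
spread = latent_topic_entropy([0.25, 0.25, 0.25, 0.25])
focused = latent_topic_entropy([0.97, 0.01, 0.01, 0.01])
```

Under this assumption a term concentrated on few topics has low entropy and is a better key term candidate than one spread evenly over all topics.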


Latent Topic Significance

– of a term tj with respect to a topic Tk

– P(Tk|Di): how much each document Di is focused on the topic Tk

– [1-P(Tk|Di)]: the probability that each document Di addresses all other topics different from Tk

[Kong & Lee, ICASSP 06]
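One plausible form of the latent topic significance, stated as an assumption because the formula itself is not legible here: the count of the term in each document weighted by P(Tk|Di), divided by the same counts weighted by [1-P(Tk|Di)]. Counts, probabilities and the function name are illustrative.

```python
def latent_topic_significance(term_counts, topic_given_doc):
    """Ratio of term mass inside topic T_k to term mass outside it.

    term_counts[i]     : n(t_j, D_i), count of the term in document D_i
    topic_given_doc[i] : P(T_k | D_i)
    """
    inside = sum(n * p for n, p in zip(term_counts, topic_given_doc))
    outside = sum(n * (1.0 - p) for n, p in zip(term_counts, topic_given_doc))
    return inside / outside if outside > 0.0 else float("inf")


# The term occurs 3 times in a document focused on T_k and once elsewhere.
significance = latent_topic_significance([3, 1], [0.9, 0.2])
```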

Key Term Extraction (2/2)


Spoken Document Summarization

• Selecting Important Sentences to be Concatenated into a Summary
– sentence scoring and selection
– given a summarization ratio

• Selected Sentences Collectively Represent Some Concepts Closest to those of the Complete Document
– removing the concepts already mentioned previously
– concepts presented smoothly

[Furui, et al, ICASSP 05, IEEE T. SAP 04][Hirschberg, et al, Interspeech 05]

[Murray, Renals, et al, ACL 05, HLT 06][Kawahara, et al, ICASSP 04]

[Nakagawa, et al, SLT 06][Zhu& Penn, Interspeech 06]

[Fung, et al, ICASSP 08][Kong & Lee, ICASSP 06, SLT 06]
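Sentence scoring and selection under a summarization ratio, with already-mentioned concepts removed, can be sketched as a greedy loop. This is a generic illustration, not the method of any of the cited papers: each sentence is scored by the weights of the terms it adds beyond those already covered, and the term weights, the sentences and the function names are hypothetical.

```python
def novelty_score(sentence, term_weights, covered):
    """Sum of weights of the sentence's terms not yet covered by the summary."""
    return sum(term_weights.get(t, 0.0) for t in set(sentence) - covered)


def summarize(sentences, term_weights, ratio):
    """Greedily pick sentence indices until the length budget is used up."""
    budget = ratio * sum(len(s) for s in sentences)
    covered, chosen, used = set(), [], 0
    remaining = list(range(len(sentences)))
    while remaining:
        best = max(remaining,
                   key=lambda i: novelty_score(sentences[i], term_weights, covered))
        gain = novelty_score(sentences[best], term_weights, covered)
        if gain <= 0.0 or used + len(sentences[best]) > budget:
            break
        chosen.append(best)
        covered |= set(sentences[best])
        used += len(sentences[best])
        remaining.remove(best)
    return sorted(chosen)  # keep the original sentence order


sentences = [
    ["storm", "hits", "taipei"],
    ["storm", "hits", "taipei", "again"],
    ["schools", "closed", "tomorrow"],
    ["thank", "you"],
]
term_weights = {"storm": 2.0, "taipei": 2.0, "schools": 1.5, "closed": 1.0}
summary = summarize(sentences, term_weights, ratio=0.5)
```

The second sentence is skipped because everything it says is already covered by the first, which is the "removing the concepts already mentioned previously" step.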

One example: Delicate Scored Viterbi Search







[Figure labels: Spoken document; ASR and Automatic]





Title Generation for Spoken Documents

[Witbrock & Mittal, SIGIR 99][Jin & Hauptmann, HLT 01]

[Chen & Lee, Interspeech 03] [Wang & Lee, SLT 08]

Global Semantic Structuring

— Offering a global picture of the semantic structure of the entire archive

Query-based Local Semantic Structuring

— Offering a detailed semantic structure of the relevant documents retrieved by the query

Latent Topic Analysis and Organization for Spoken Documents

Clustered by the Latent Topics and Organized in a Two-dimensional Tree Structure, or as a Multi-layered Map

— Documents addressing similar topics grouped in the same cluster

— Distance between clusters on the map has to do with relationships between topics for the documents

— A cluster with many documents can be expanded into another map in the next layer

Two-dimensional Tree Structure

for Organized Topics

Global Semantic Structuring for the Entire Archive

[Li & Lee, Interspeech 05]

• User’s Query Produces many Retrieved Spoken Documents
– difficult to be shown on-screen

• A Topic Hierarchy Constructed for the Retrieved Documents
– each node represents a cluster of retrieved documents labeled by a key term (or topic)
– user may select or delete the nodes directly

• Better User-System Interaction

[Figure labels: Multi-modal Dialogue; Retrieved Documents; Spoken Document; Retrieval System; Topic Hierarchy]

Query-based Local Semantic Structuring for Retrieved Spoken Documents

[Pan & Lee, ASRU 05]

Improved Interactive Retrieval of Spoken Documents by Ranking the Key Terms in the Topic Hierarchy

• Query Term Suggestions in Text-based Information Retrieval very helpful

• User-System Interaction for Spoken Document Retrieval

• Properly Ranking the Topics in the Topic Hierarchy
– suggesting important/relevant key terms on the top of the hierarchy
– automatically learned and performed by the dialogue manager

[Figure labels: Topic Hierarchy; Multi-modal Dialogue; Retrieved Documents; Spoken Document; Retrieval System]


[Pan & Lee, Interspeech 06, SLT 06]

User-System Interaction in Spoken Dialogue Systems


[Figure labels: Input Speech Utterance; Spoken Language Understanding; Dialogue Act Classification; Semantic Frame; User Act; Dialogue Manager; Dialogue Modeling; Dialogue State; System Action; Output Generator; Speech, Graph, Tables]

• Spoken Dialogue Systems

• Example Goals
– Higher task success rate (reliability)
– Smaller average number of turns for successful tasks (efficiency)

Dialogue Systems for Voice-based Information Retrieval

• Voice-based Information Retrieval

[Figure labels: Input Spoken Query; Spoken Documents; word/phone lattice, one-best, N-best; inverted index file; Retrieval Engine; Multimedia Document Archive; Related Documents; Dialogue Manager; Dialogue Modeling; Internal State; User Interface; Output; Multi-modal interactions; Information; Spoken Language based Information Access]

• Example Goals
– higher task success rate (success: user’s information need satisfied)
– smaller average number of dialogue turns (average number of query terms entered) for successful tasks

• Dialogues Equally Useful in Voice Search for Text Documents

[Pan & Lee, ASRU 07]

[Wang & Acero, IEEE SPM 08][Acero, et al, ICASSP 08]

Concluding Remarks

Voice/text-based Information Retrieval

Accuracy — More Reliable Retrieval Techniques

• Problems
– Poor recognition accuracies for spontaneous speech under adverse environments
– Serious OOV problem

• Possible Approaches
– Lattices and efficient indexing structures
– Subword units (covering OOV words, across different languages and using less space)
– Methods for reducing computation and memory requirements
– Other techniques useful in text-based retrieval: query expansion, semantic concept matching,





















User-System Interaction: More Efficient Interaction Scenario

• Problems
– Spoken/multimedia documents not easily summarized on-screen, thus difficult to scan and select
– Lacks efficient user-system interactions
– Disambiguation by user-system interaction always important even for text documents (e.g. voice search)

• Possible Approaches
– Automatic summary/title generation for spoken/multimedia documents
– Topic hierarchy construction for retrieved documents, with nodes labeled by key terms
– Multi-modal user-system dialogue with improved interaction

[Figure labels: Titles; Summaries; Retrieved Documents; Retrieval System; Topic Hierarchy; Multi-modal Dialogue]