DCU at the NTCIR-11 SpokenQuery&Doc Task
[Figure: MAP by spoken query type (Manual, Match, Unmatch-AM, Unmatch-LM) on the Unmatch AM/LM transcripts; runs LILPr0.2, LILPr0.5, LIP0.9, TF_IDF; MAP axis 0–0.14]
Slide-group segments with prosody
This research is supported by the Science Foundation Ireland (Grant 12/CE/I2267) as part of CNGL (www.cngl.ie)
Results
Introduction
• Speech is more than a simple sequence of words.
• Prosodic variation encodes rich information about:
– emotions, discourse structure, dialogue acts, focus, emphasis, contrast, topic shifting, etc.
• We examined the potential of prosodic prominence in the NTCIR-11 SpokenQuery&Doc Task.
Indexing: stores normalised prosodic features for each term.
David N. Racca, Gareth J.F. Jones
CNGL Centre for Global Intelligent Content, School of Computing, Dublin City University, Dublin, Ireland
{dracca, gjones}@computing.dcu.ie
Background and Previous Work
Prosody may be useful in speech search:
• Relationship between stress and TF-IDF scores [1].
• SDR exploiting amplitude and duration [2].
• Topic tracking exploiting energy and pitch [3].
• SCR exploiting pitch, loudness, and duration [4].
Conclusions
• No significant differences between prosodic and text-based runs.
• Transcript quality affects retrieval effectiveness.
• Prosodic-based models may be useful for certain queries.
Retrieval: increases the weights of prosodically prominent terms.
References
[1] F. Crestani. Towards the use of prosodic information for spoken document retrieval. SIGIR'01, 2001.
[2] B. Chen et al. Improved spoken document retrieval by exploring extra acoustic and linguistic cues. INTERSPEECH'01, 2001.
[3] C. Guinaudeau et al. Accounting for prosodic information to improve ASR-based topic tracking for TV broadcast news. INTERSPEECH'11, 2011.
[4] D.N. Racca et al. DCU search runs at MediaEval 2014 Search and Hyperlinking. MediaEval 2014 Multimedia Benchmark Workshop, 2014.
IPUs with prosody
IPU Grouping
Lectures WAV
Segment Index
[Figure: MAP by spoken query type (Manual, Match, Unmatch-AM, Unmatch-LM) on the Match transcripts; runs LILPr0.5, LIPr0.7, LIDur0.3, TF_IDF; MAP axis 0–0.14]
[Figure: Query 1, AveP of the prosodic-based run vs TF_IDF across spoken query types (Manual, Match, Unmatch); AveP axis 0–0.8]
[Figure: MAP by spoken query type (Manual, Match, Unmatch-AM, Unmatch-LM) on the Manual transcripts; runs LIPr0.7, LILPr0.7, TF_IDF; MAP axis 0–0.14]
DCU at the NTCIR-11 SpokenQuery&Doc Task
Data Pre-processing
VAD
OpenSMILE
Julius LVCSR
ChaSen
Annotation Removal
Forced Alignment
Lecture Normalisation
Queries WAV
Example of lecture normalisation (term-level maximum F0):
max(f0, "tf-idf") = 280.44 Hz raw → 0.58 normalised
max(f0, "単語") = 236.46 Hz raw → 0.49 normalised
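The lecture normalisation step maps raw pitch values in Hz into a comparable [0, 1] range, as in the example above. The poster does not spell out the exact scheme; a minimal sketch assuming per-lecture min-max scaling:

```python
def minmax_normalise(values):
    """Min-max normalise raw feature values (e.g. a lecture's F0 values in Hz)
    into [0, 1]. Per-lecture min-max scaling is an assumption here; the poster
    only shows raw and normalised value pairs."""
    lo, hi = min(values), max(values)
    if hi == lo:  # degenerate lecture with a single constant value
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical lecture-wide F0 values (Hz), including the two example maxima
f0_hz = [95.0, 120.0, 236.46, 280.44, 480.0]
f0_norm = minmax_normalise(f0_hz)
```

Normalising per lecture rather than globally keeps a speaker's habitual pitch level from dominating the feature values.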
Terrier Matching
$$tf(i,j) = \frac{k_1\, tf_{i,j}}{tf_{i,j} + k_1\left(1 - b + b\,\frac{dl_j}{avdl}\right)} \qquad idf(i,C) = \log\left(\frac{N}{n_i} + 1\right)$$

$$ac(i,j) = \begin{cases} f0(i,j) & \text{Pitch [P]} \\ l(i,j) & \text{Loudness [L]} \\ d(i,j) & \text{Duration [Dur]} \\ f0range(i,j) & \text{Pitch Range [Pr]} \\ l(i,j) \cdot f0(i,j) & \text{[LP]} \\ l(i,j) \cdot f0range(i,j) & \text{[LPr]} \end{cases}$$

$$rel(q, s_j) = \sum_{i}^{M} w(i,j)$$

$$w(i,j) = \begin{cases} idf(i,C)\,\left[\alpha \cdot tf(i,j) + (1-\alpha)\, ac(i,j)\right] & \text{LI} \\[1ex] \dfrac{\theta_{ir} \cdot tf(i,j) \cdot idf(i,C) + \theta_{ac} \cdot ac(i,j)}{\theta_{ir} + \theta_{ac}} & \text{G} \\[1ex] tf(i,j) \cdot idf(i,C) & \text{TF\_IDF} \end{cases}$$
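The weighting models above interpolate the BM25-style tf·idf score with the acoustic term score ac(i, j). A minimal sketch of the LI and TF_IDF weights; k1, b, and α are free parameters, and the default values below are illustrative assumptions, not the tuned settings from the runs:

```python
import math

def bm25_tf(tf_raw, doc_len, avg_doc_len, k1=1.2, b=0.75):
    # BM25 term-frequency normalisation: k1*tf / (tf + k1*(1 - b + b*dl/avdl))
    return (k1 * tf_raw) / (tf_raw + k1 * (1.0 - b + b * doc_len / avg_doc_len))

def idf(n_docs, n_docs_with_term):
    # idf(i, C) = log(N / n_i + 1)
    return math.log(n_docs / n_docs_with_term + 1.0)

def li_weight(tf_raw, doc_len, avg_doc_len, n_docs, n_i, ac, alpha=0.5):
    # LI model: idf(i, C) * [alpha * tf(i, j) + (1 - alpha) * ac(i, j)],
    # where ac is the normalised acoustic score of term i in segment j
    return idf(n_docs, n_i) * (alpha * bm25_tf(tf_raw, doc_len, avg_doc_len)
                               + (1.0 - alpha) * ac)

def tfidf_weight(tf_raw, doc_len, avg_doc_len, n_docs, n_i):
    # Text-only baseline: tf(i, j) * idf(i, C)
    return bm25_tf(tf_raw, doc_len, avg_doc_len) * idf(n_docs, n_i)
```

Segment relevance rel(q, s_j) is then the sum of w(i, j) over the query terms matched in the segment; with α = 1 the LI weight reduces to the TF_IDF baseline.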
$$f0(i,j) = \max_k \{\max(f0_{i,j}^{k})\} \qquad l(i,j) = \max_k \{\max(l_{i,j}^{k})\}$$

$$f0range(i,j) = \max_k \{\max(f0_{i,j}^{k})\} - \min_k \{\min(f0_{i,j}^{k})\} \qquad d(i,j) = \max_k \{d_{i,j}^{k}\}$$
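Per these definitions, term-level pitch and loudness are maxima over all occurrences k of term i in segment j, pitch range is the max–min spread, and duration is the longest occurrence. A minimal sketch, assuming each occurrence is supplied as its list of 10 ms frame values:

```python
def term_prosodic_features(occ_f0, occ_loudness, occ_dur):
    """Aggregate frame-level prosody into term-level features for term i in
    segment j. occ_f0 / occ_loudness: one list of 10 ms frame values per
    occurrence k; occ_dur: one duration per occurrence."""
    f0_max = max(max(frames) for frames in occ_f0)        # f0(i, j)
    f0_min = min(min(frames) for frames in occ_f0)
    return {
        "f0": f0_max,                                     # max over occurrences
        "loudness": max(max(fr) for fr in occ_loudness),  # l(i, j)
        "f0range": f0_max - f0_min,                       # f0range(i, j)
        "duration": max(occ_dur),                         # d(i, j)
    }
```

Taking the maximum over occurrences favours the most prominent realisation of a term, which matches the poster's goal of up-weighting prominent terms.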
IPUs WAV
Manually Annotated Transcripts
F0, loudness every 10 ms
ASR Transcripts
Manual Transcripts
Normalised F0, loudness
Enriched ASR Transcripts
Enriched Manual Transcripts
Enriched Transcripts
Terrier Indexing