Specification_of_Text_and_Speech_Corpus_for_Indonesian_LVCSR
8/14/2019 Specification_of_Text_and_Speech_Corpus_for_Indonesian_LVCSR
Text and Speech Corpus for Indonesian LVCSR
1. Background
In the period of August 2005 to April 2006, a joint research team of TELKOM R&D Center
(TELKOMRisTI) Indonesia as the leader, Telkom School of Engineering (STT Telkom)
Indonesia, and Advanced Telecommunication Research (ATR) Japan conducted research on the
development of a text and speech corpus for Indonesian Large Vocabulary Continuous Speech
Recognition (LVCSR). The project was funded by the 2005 round of the APT HRD Program for
Exchange of ICT Researchers and Engineers. The results of the project are: a text source of 500
news domain sentences, a lexicon dictionary of [?].5[?] words, two sentence sets consisting of
2,500 application domain sentences and [?],196 tri-phone balanced news domain sentences, and
[?]9[?] spoken sentences (utterances) for clean and telephony. The owner of all the results is
TELKOM R&D Center (TELKOMRisTI) Indonesia, as the project leader.
2. Text Corpus
The text corpus covers two domains: 1) application domain (five existing, running applications
in TELKOMRisTI: Directory Services for Hearing and Speaking impaired telecommunication
service, Tele-home security, Billing information Services, Reservation services, and Status
tracking feature of e-[...] place names), where the lexicon was
developed by an Indonesian language expert. The tri-phone balanced sentence set ([?],196
sentences) is extracted from the 500 news domain text source. This set covers [?],669 distinct
tri-phones. Afterward, the sentence sets from both domains are combined and distributed into
100 sentence lists. The 2,500 application domain sentences are divided into 100 sets of 100
sentences each with an overlap ratio of [?]5%, while the [?],196 news domain sentences are
divided into 100 sets of 110 sentences each with an overlap ratio of [?]0%. Thus, each sentence
list contains 210 sentences (100 application domain sentences and 110 news domain sentences).
These sentence lists will be read by 400 speakers.
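As a rough check of the list layout described above, the sketch below tallies how many sentence slots 100 lists of 100 application domain sentences create, and how often each of the 2,500 sentences would be reused on average. The function name list_statistics is hypothetical, and reading "overlap" as the share of slots filled by repeated sentences is an assumption made for illustration; the source does not define its overlap ratio. The news domain set is left out because its sentence count is incomplete in the source.

```python
# Minimal sketch of the sentence-list arithmetic, under stated assumptions.
# list_statistics is a hypothetical helper, not part of the original project.

def list_statistics(unique_sentences, num_lists, sentences_per_list):
    """Return (total slots, mean reuse per sentence, share of repeated slots)."""
    total_slots = num_lists * sentences_per_list
    mean_reuse = total_slots / unique_sentences
    # Assumed reading of "overlap": slots not needed to cover each sentence once.
    repeated_share = 1 - unique_sentences / total_slots
    return total_slots, mean_reuse, repeated_share


# Application domain: 2,500 sentences spread over 100 lists of 100 sentences.
app_slots, app_reuse, app_repeated = list_statistics(2500, 100, 100)

# Each combined list holds 100 application and 110 news domain sentences.
sentences_per_combined_list = 100 + 110

print(f"application slots: {app_slots}")
print(f"mean reuse per application sentence: {app_reuse:.1f}")
print(f"share of repeated slots (assumed definition): {app_repeated:.0%}")
print(f"sentences per combined list: {sentences_per_combined_list}")
```

Under this reading, every application domain sentence must appear in several lists, which is why the lists cannot be disjoint.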
3. Speech Corpus
3.1 Speakers
There are 400 speakers distributed by gender (201 males and 199 females), age (20% for 19-2[?] years old, [?]0% for 2[?]-[?]5 years old, [?]0% for [?]6-50 years old, and 10% for 51-60 years old), and four
major western Indonesia accents (16.5% Batak, 29.5% Jakarta, 21% Javanese, and [?].5%
Sundanese).
3.2 Soundproof room
The specifications of the soundproof room and recording equipment are as follows:
1. Soundproof parameter:
a) Sound insulation level: [?]0 dB
b) Background noise level: 22 dB
c) Reverberation time: 0.15 second
2. Soundproof design
a) Length: 290 cm
b) Width: 220 cm
c) Height: 20 cm
d) Thickness: 26.5 [?] 2.5 cm
3. Recording equipment
The recording equipment was configured to enable the recording of both clean speech
(microphone source) and telephone speech. By following the strict ATR requirements, it is
expected that noise, mainly generated by electricity, will be reduced as far as possible so
that the recordings are low in noise.
3.3 Speech Corpus Size
Each speaker uttered 210 sentences. Thus, there are 400 x 210 = 84,000 utterances. By using
monochannel, 48 kHz frequency sampling, 16 bits quantization level, and WAV file format, the
corpus size is around 26 Giga Bytes with duration of around 75 hours.
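The size figures above can be cross-checked with a short calculation. This is a minimal sketch assuming uncompressed PCM audio, decimal gigabytes (10^9 bytes), and no allowance for WAV file headers; the constant and variable names are illustrative only.

```python
# Cross-check of the corpus size from the recording parameters in the text.
# Assumptions: uncompressed PCM, 1 GB = 10**9 bytes, WAV headers ignored.

SPEAKERS = 400
SENTENCES_PER_SPEAKER = 210
SAMPLE_RATE_HZ = 48_000   # 48 kHz sampling
BYTES_PER_SAMPLE = 2      # 16-bit quantization
CHANNELS = 1              # monochannel
DURATION_HOURS = 75       # approximate total duration from the text

utterances = SPEAKERS * SENTENCES_PER_SPEAKER
bytes_per_second = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS
total_bytes = bytes_per_second * DURATION_HOURS * 3600
size_gb = total_bytes / 1e9
mean_utterance_seconds = DURATION_HOURS * 3600 / utterances

print(f"utterances: {utterances}")
print(f"audio data: {size_gb:.2f} GB")
print(f"mean utterance length: {mean_utterance_seconds:.2f} s")
```

With these assumptions, about 75 hours of audio comes to just under 26 GB, which agrees with the stated corpus size, and the average utterance is a little over three seconds long.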