    Text and Speech Corpus for Indonesian LVCSR

    1. Background

    From August 2005 to April 2006, a joint research team consisting of TELKOM R&D Center
    (TELKOM RisTI), Indonesia, as the leader, Telkom School of Engineering (STT Telkom),
    Indonesia, and Advanced Telecommunication Research (ATR), Japan, conducted research on the
    development of a text and speech corpus for Indonesian Large Vocabulary Continuous Speech
    Recognition (LVCSR). The project was funded by the 2005 round of the APT HRD Program for
    Exchange of ICT Researchers and Engineers. The results of the project are: a text source of 500
    news-domain sentences; a lexicon dictionary of 15 words; two sentence sets, consisting of 2,500
    application-domain sentences and 3,196 tri-phone balanced news-domain sentences; and 9
    spoken sentences (utterances) for clean and telephone speech. All results are owned by TELKOM
    R&D Center (TELKOM RisTI), Indonesia, as the project leader.

    2. Text Corpus

    The text corpus covers two domains: 1) the application domain (five existing, running applications
    in TELKOM RisTI: Directory Services for Hearing and Speaking impaired telecommunication
    service, Tele-home security, Billing information Services, Reservation services, and Status
    tracking feature of e-[...] place names) where the lexicon was developed by an Indonesian
    language expert. The tri-phone balanced sentence set (3,196 sentences) is extracted from the
    500 news-domain text source. This set covers ?,669 distinct tri-phones.
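
    The text does not say how the tri-phone balanced set was extracted. As an illustration only,
    the sketch below uses a greedy coverage heuristic, one common way to approximate such a
    selection: it repeatedly picks the sentence that contributes the most tri-phones not yet
    covered. The function names, the toy phone sequences, and the choice of a greedy strategy are
    all assumptions, not the project's documented method.

```python
from typing import Dict, List, Set, Tuple

Triphone = Tuple[str, ...]


def triphones(phones: List[str]) -> Set[Triphone]:
    """Return the set of tri-phones (three consecutive phones) in one sentence."""
    return {tuple(phones[i:i + 3]) for i in range(len(phones) - 2)}


def select_balanced(sentences: Dict[str, List[str]], max_sentences: int) -> List[str]:
    """Greedy sketch: keep picking the sentence that adds the most unseen tri-phones."""
    covered: Set[Triphone] = set()
    remaining = {sid: triphones(phones) for sid, phones in sentences.items()}
    chosen: List[str] = []
    while remaining and len(chosen) < max_sentences:
        # sorted() makes ties deterministic: the smallest sentence id wins.
        best = max(sorted(remaining), key=lambda sid: len(remaining[sid] - covered))
        if not remaining[best] - covered:
            break  # no remaining sentence adds a new tri-phone
        covered |= remaining.pop(best)
        chosen.append(best)
    return chosen


# Hypothetical toy input: sentence id -> phone sequence.
toy = {
    "s1": ["s", "a", "t", "u"],
    "s2": ["s", "a", "t"],
    "s3": ["d", "u", "a", "t", "u"],
}
print(select_balanced(toy, max_sentences=10))
```

    A real extraction would also need a pronunciation lexicon to turn each news sentence into a
    phone sequence, and may weight tri-phone frequencies rather than only counting coverage.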

    Afterward, the sentence sets from both domains are combined and distributed into 100 sentence
    lists. The 2,500 application-domain sentences are divided into 100 sets, where each set consists of
    100 sentences with an overlap ratio of 75%, while the 3,196 news-domain sentences are divided into
    100 sets, where each set consists of 110 sentences with an overlap ratio of 70%. Thus, each sentence
    list contains 210 sentences (100 application-domain sentences and 110 news-domain sentences).
    These sentence lists will be read by 400 speakers.
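
    The arithmetic behind the sentence lists can be checked with a short sketch. It reads the
    figures as 100 lists, each holding 100 application-domain and 110 news-domain sentences, and
    takes the news set as 3,196 sentences. It also assumes that "overlap ratio" means the
    fraction of list slots filled by a sentence that already appears in another list, and that
    sentences are dealt out cyclically; neither the definition nor the dealing scheme is stated
    in the text, and the function names are illustrative.

```python
from typing import List, Sequence


def overlap_ratio(unique_sentences: int, num_lists: int, per_list: int) -> float:
    """Fraction of list slots that reuse a sentence (assumed definition)."""
    slots = num_lists * per_list
    return 1 - unique_sentences / slots


def build_lists(sentence_ids: Sequence[int], num_lists: int, per_list: int) -> List[List[int]]:
    """Deal sentences out cyclically so that every list receives `per_list` items."""
    n = len(sentence_ids)
    return [
        [sentence_ids[(k * per_list + j) % n] for j in range(per_list)]
        for k in range(num_lists)
    ]


app_lists = build_lists(range(2500), num_lists=100, per_list=100)
news_lists = build_lists(range(3196), num_lists=100, per_list=110)

# Each combined list holds 100 + 110 = 210 sentences.
combined_sizes = {len(a) + len(b) for a, b in zip(app_lists, news_lists)}
print(combined_sizes)

print(overlap_ratio(2500, 100, 100))  # 0.75
print(overlap_ratio(3196, 100, 110))  # roughly 0.71
```

    Under these assumptions the application-domain overlap comes out at exactly 75%, and the
    news-domain overlap at about 71%.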

    3. Speech Corpus


    There are 400 speakers, distributed by gender (201 males and 199 females), age (20% for
    19-2? years old, ?0% for 2?-?5 years old, ?0% for ?6-50 years old, and 10% for 51-60 years
    old), and four major western Indonesian accents (16.5% Batak, 29.5% Jakarta, 21% Javanese,
    and ?.5% [...]).


    3.2 Soundproof Room

    The specifications of the soundproof room and recording equipment are as follows:

    1. Soundproof parameters
    a) Sound insulation level: ?0 dB
    b) Background noise level: 22 dB
    c) Reverberation time: 0.15 second

    2. Soundproof design
    a) Length: 290 cm
    b) Width: 220 cm
    c) Height: 2?0 cm
    d) Thickness: 26.5 [...] 2.5 cm

    3. Recording equipment

    The recording equipment was configured to enable the recording of both clean speech
    (microphone source) and telephone speech. By following strict ATR requirements, it is
    expected that noise, mainly generated by electricity, will be reduced as far as possible so
    that the recordings will be low-noise.

    3.3 Speech Corpus Size

    Each speaker uttered 210 sentences. Thus, there are 400 x 210 = 84,000 utterances. Using a
    mono channel, 48 kHz sampling frequency, 16-bit quantization level, and WAV file format, the
    corpus size is around 26 Giga Bytes with a duration of around 75 hours.
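
    As a consistency check on these figures, the sketch below derives the utterance count and
    the recording duration implied by the corpus size. It takes the values as 400 speakers, 210
    sentences each, mono 48 kHz 16-bit audio, and about 26 Giga Bytes in total; it further
    assumes uncompressed PCM, 1 Giga Byte = 10^9 bytes, and negligible WAV header overhead. The
    function name and parameters are illustrative.

```python
def corpus_stats(num_speakers: int, sentences_per_speaker: int, total_bytes: float,
                 sample_rate_hz: int = 48_000, bits_per_sample: int = 16,
                 channels: int = 1):
    """Return (utterances, total hours, average seconds per utterance)."""
    utterances = num_speakers * sentences_per_speaker
    # Uncompressed PCM data rate in bytes per second of audio.
    bytes_per_second = sample_rate_hz * (bits_per_sample // 8) * channels
    total_seconds = total_bytes / bytes_per_second
    return utterances, total_seconds / 3600, total_seconds / utterances


utterances, hours, avg_seconds = corpus_stats(400, 210, total_bytes=26e9)
print(utterances)             # 84000
print(round(hours, 1))        # about 75 hours
print(round(avg_seconds, 1))  # a little over 3 seconds per utterance
```

    Under these assumptions, 26 Giga Bytes of 48 kHz 16-bit mono audio corresponds to roughly 75
    hours, which agrees with the stated duration.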