An overview of Broadcast News corpora


David Graff *

Linguistic Data Consortium, Suite 200, 3615 Market Street, Philadelphia, PA 19104-2608, USA
* Tel.: +1-215-898-0887; fax: +1-215-573-2175. E-mail address: graff@ldc.upenn.edu (D. Graff).

Speech Communication 37 (2002) 15–26

Abstract

The LDC began its first Broadcast News (BN) speech collection in the spring of 1996, facing a host of challenges including IPR negotiations with broadcasters, establishment of new transcription conventions and tools, and a compressed schedule for creation and release of speech, transcripts and in-domain language model data. The amount of acoustic training data available for participants in the DARPA Hub4 English benchmark tests doubled from 50 h in 1996 to 100 h in 1997, and doubled again to 200 h in 1998. An additional 40 h has been made available as of the summer of 1999. The 1997 benchmark test also saw the addition of BN speech and transcripts in Spanish and Mandarin Chinese, though in lesser quantity, with 30 h of training data in each language. Supplements to the existing pronunciation lexicons in each language were also produced. More recently, the coordinated research project on topic detection and tracking (TDT) has called for a large collection of BN speech data, totaling about 1100 h in English and 300 h in Mandarin over two phases (TDT2 and TDT3), although the level of detail and quality in the TDT transcriptions is not comparable to that of the Hub4 collections.
© 2002 Elsevier Science B.V. All rights reserved.

1. Introduction: history and overview of Hub4 data

Starting in 1995, the Defense Advanced Research Projects Agency of the United States (DARPA) directed its research program for continuous speech recognition (CSR) to focus on automatic transcription of broadcast news (BN). This new application focus, referred to as "Hub4", was an extension of the previous CSR research programs (identified with the preceding "hubs"), which had been based on transcription of journalistic dictation. The transition to Hub4 involved a variety of significant changes that posed a wide range of new challenges, not only for CSR technology development, but also for the creation of data resources to support the research.

A fundamental difference introduced by Hub4 was the reliance on "found" speech as opposed to "elicited" speech. To make the speech recordings for earlier hubs, human subjects were recruited, placed in specific recording environments, provided with written news material, and instructed to read the material aloud or "spontaneously" dictate a new report based on the material. The recording environments for these collections were controlled to yield generally invariant background noise conditions throughout the recording sessions (and the ambient noise was typically benign or negligible). Signals were captured simultaneously from two microphones: one was always a particular head-mounted model of high quality, while the other varied in terms of microphone type and distance from the speaker's mouth. There was no human interlocutor or audience (except for the technician in charge of the recording session) – people were speaking to a recording device and striving to maximize clarity of pronunciation, for no other purpose than to provide data for CSR.

In Hub4, the speech recordings are observations of an on-going activity: the speakers are either working as employees of a broadcasting agency (involved in a daily job of relaying news reports to a nation-wide audience), or else they are participants in newsworthy events and public commentary. The recording environments, the signal characteristics, and many aspects of the speakers' behavior are determined by properties of the broadcast media and by the nature of the news being reported; the decisions and actions that determine factors such as recording site, microphone choice, speaking style and so on are made by journalists and broadcast engineers who are trying to serve and hold the interest of the audience.

The shift from elicited to found speech gave rise to a host of complicating issues. Creating a digital recording from a broadcast signal is trivial, but making such a recording available and suitable for use by the CSR research community required solutions to a variety of problems:

Intellectual Property Rights (IPR): The spoken content of a broadcast is the property of the broadcast company; the copyrights protecting this ownership require that special permission be granted by the owner for distribution and use of the recorded material.

Breadth of sampling (for both training and test data): Despite the barriers imposed by IPR, the research requires a broad sampling of voices and vocabulary; many broadcast sources needed to be included, in sufficient amounts and over a sufficient span of time, to support effective research and meaningful evaluation of CSR technology.

Language modeling: A key component in developing CSR systems is the availability of a large body of "within-domain" language data, in the form of computer readable text, to create a statistical language model representing the probability of occurrence for possible word sequences in the speech to be recognized. For earlier CSR research, where speakers were simply reading aloud from published newspaper stories, a large collection of newspaper and newswire text, comprising over 100 million words and covering several years and numerous sources, was readily available to provide an appropriate language model. For Hub4, it would be necessary to create a corpus of BN transcripts on some comparable scale to provide an adequate model for the spoken forms of journalism.

Reference transcripts: This was (and still is) by far the most difficult issue. Because previous CSR research used prompted speech, controlled recording environments and a consistently careful speaking style by native speakers, the reference transcripts used for acoustic training were fairly simple in terms of their creation and form. They were derived from prompting texts by adding notations for breath noises, and altering the word sequence to reflect the occasional disfluency, word substitution, etc., by the person reading aloud. Each speech recording contained one speaker producing one sentence. In Hub4 data, a single speech recording contained a complete news broadcast, involving several confounding factors, including:
• multiple speakers (some with foreign-accented English),

• variable speaking styles (read reports, spontaneous interviews, casual "news room chatter", with the speech of two or more people often overlapping on the single recording channel),

• variable recording conditions (studio, indoor and outdoor remote sites, telephone or radio relays),

• the presence and variation of music or background noises during speech,

• significant boundaries of information structure, separating distinct news stories and non-news portions within a broadcast.

In order to assess the effect of these new factors on CSR system performance, it would be necessary to represent them all in the reference transcriptions.

For the very first use of BN data in a CSR benchmark test, in November 1995, material was drawn from a single source, the PRI/KUSC "Marketplace" program (selected in part because earlier hubs had been based on financial news). The National Institute of Standards and Technology (NIST, part of the US Department of Commerce) prepared small data sets for training (about 5 h), development testing (2.4 h), and evaluation (1.2 h) (Pallett et al., 1996). A professional transcription service was used to create initial transcripts for selected broadcasts, and NIST then verified the content and supplied additional annotations. The added markup involved just story-boundary time marks for the training data; in the development test set, these were supplemented by time marks on speaker turn changes and information about the speakers (names, gender, anchor or regular correspondent versus others, American English dialect versus others); the evaluation set received all the above plus additional time marks for the ranges of speech accompanied by music, and ranges of limited-bandwidth versus high-fidelity speech signals. There was no BN-domain text material for language modeling. The limited resources and annotations were sufficient to conduct a pilot study of the Hub4 task, which was only one facet of the 1995 evaluation.

In 1996, the CSR evaluation focused solely on the Hub4 task, and posed much more ambitious requirements for resources and annotations. An archive of commercially produced BN transcription text, spanning 4 years and numerous broadcast sources, was made available for language modeling. Fifty hours of acoustic training data and two 10-h sets for development and evaluation testing were drawn from a variety of English broadcast sources between May and July 1996. It was decided among the participating sites that all reference transcripts (for training, development testing and evaluation) should receive the same extent of markup, including all the features annotated in the 1995 test set, plus time marks for the presence of background noise. All transcription and additional annotations for a given speech file would be folded into a single text stream, using SGML markup to represent these features:

• the information structure of the data: the markup divides each broadcast into a series of news stories, non-news sections and untranscribed sections (e.g. commercials);

• the identification of speakers and their characteristics: within each story or non-news section, SGML tags marked the extent of one or more speaker turns, and each turn tag included a speaker name, dialect (native American English or not), speaking style (spontaneous or not), and the relative fidelity of the microphone and transmission channel (high, medium or low, judged subjectively by transcribers);

• the non-hierarchical events affecting the signal: onsets and changes of music and noise levels (Doddington, 1996).

Unfortunately, a number of decisions about transcript specifications were not finalized until some time after the initial transcription process was under way. Nevertheless, the goals of the data creation project were met to a sufficient extent to permit a meaningful benchmark test to be performed on schedule (Graff, 1997; Pallett et al., 1997; Garofolo et al., 1997).

Hub4 was again the sole focus for the 1997 CSR benchmark (Pallett et al., 1998), this time with an additional 50 h of English training data and a new 10-h test pool (cf. Section 3 regarding the recording epochs for these data). Also added to the project were data collections for BN transcription in Mandarin Chinese and Spanish, which together were designated as "Hub4-NE" ("non-English"). For each of these languages, the LDC provided 30 h of acoustic training data and a test pool of 5 h; the only substantive text archives available for language modeling were from newspaper and newswire collections, and these were already available as LDC corpora.

The 1998 evaluation saw the addition of another 100 h of acoustic training data in English, which had been recorded through the latter half of 1997. Transcription for this set did not provide the same level of detail about signal conditions as the first 100 h (cf. Section 5.2 below). The new test set for this year, however, did receive fully detailed annotation, in order to support comparisons with the previous benchmarks. The pool of BN language model data remained the same, but more recent material from newswire sources was made available to cover the time period of the new training data. For Mandarin and Spanish, no new training data were added, and unused portions of the previous year's test pools were selected for use (Pallett et al., 1999).

The 1999 benchmark saw the addition of information extraction tasks, involving the identification of named entities in the broadcasts. New test sets for English and Mandarin were collected, transcribed and annotated by the LDC (following the same markup standards as the previous test sets), and NIST handled the addition of special annotations for information extraction in both languages (Pallett et al., 2000).

Two other projects, unrelated to the DARPA CSR effort, have also coordinated with the LDC to produce BN corpora in other languages.

The 1999 Summer Workshop of the Center for Language and Speech Processing at Johns Hopkins University included a research project on language-independent acoustic modeling (Byrne et al., 1999), for which the LDC created a collection of VOA Czech broadcasts. A total of 46 half-hour broadcasts were captured, and the recordings were sent to the Department of Cybernetics, University of West Bohemia, in Pilsen, Czech Republic, for transcription. This yielded a total of about 20 h of transcribed speech, annotated in a manner equivalent to the Hub4 corpora, including the creation of a pronunciation lexicon. These resources have recently been made available to the broader research community as a corpus distributed by the LDC.

MIT Lincoln Labs embarked on a research project to develop BN CSR for Korean. As part of this project, they contracted the LDC to collect and transcribe a 30-h collection of VOA Korean news broadcasts, again using annotation in the Hub4 style. As of this writing, the transcripts and pronunciation lexicon for this collection are in the final stages of preparation, and will soon be available as an LDC corpus.

The following sections provide more detail about the creation and content of the various BN data collections.

2. IPR constraints

In order to obtain permission for redistribution of broadcast recordings and transcriptions for research use, the LDC established IPR contracts with the following companies:

• American Broadcasting Company (ABC),
• Cable News Network, Inc. (CNN),
• ECO (a Spanish-language broadcast network),
• KAZN Radio, Inc. (a Mandarin-language radio station in California),
• National Cable Satellite Corporation (CSPAN),
• National Public Radio (NPR),
• Primary Source Media (PSM, for existing archives of BN transcripts),
• Public Radio International (PRI),
• UNIVISION (a Spanish-language broadcast network),
• USC Radio (Marketplace).

Some of these sources would only grant distribution rights to the LDC on the condition that the distribution be limited to those who register as members of the LDC, and some have required that special license agreements, in addition to the general LDC membership license, be executed by each recipient of the data. We will continue to work with these copyright owners, seeking to provide easier and wider access to the data. In the meantime, we have sought to avoid placing undue limits on participation in CSR and other sponsored evaluations by creating a special "evaluation-only" membership category, which can be granted at essentially no cost to participants that lack the funding for a full yearly membership in the LDC.

We also took advantage of a special relationship with the Voice of America broadcasting service run by the US Information Agency (USIA, a branch of the State Department). The legislation that created the VOA service in the 1940s prohibited domestic reception or use of VOA broadcasts, unless specific exceptions were granted by Congress. In late 1996, the US Congress passed Public Law 104-269, which designates the LDC as a domestic point of access to USIA materials for research and educational purposes, allowing us to record and distribute VOA broadcasts in Spanish and Mandarin, as well as any of the other 50 languages currently used by VOA.

3. Collection of audio samples

There were two major collection phases for Hub4 English ("Hub4-E") acoustic data, the first in 1996 and the second in 1997–98; each phase yielded, on average, 115 h of recordings, of which 100 h were designated for use as training data, and the remainder for use as test data. The end-point of Hub4-E data collection (January 1998) coincided with the start of the TDT2 collection, which extended through the next six months. Some final Hub4 collections were made in June and August 1998 to support the 1998 and 1999 Hub4 evaluations. The TDT3 corpus collection spanned the last three months of 1998.

3.1. The 1996 Hub4 collection

For the 1996 collection, the training set was recorded between 10 May and 3 July of that year; recordings for the 10-h development test set spanned 10–15 July, those for the 10-h evaluation test set used in the 1996 benchmark spanned 11–25 September, and the following year's evaluation test set was recorded between 14 October and 13 November 1996. In each of these four partitions of the collection, samples from the various broadcast sources were distributed randomly. Individual sample files ranged in duration from 30 to 120 min.

Due to the difficulties in transcribing the 1996 collection phase, that first 100 h of training data was released in two increments of about 50 h each, so that only the first 50 h were available prior to the 1996 evaluation, while all 100 h could be used in preparing for the 1997 benchmark.

Table 1 summarizes the sampling by news source for the 1996 collection. The "total hours" column indicates the total broadcast time in the recordings, while the "total speech" column gives the total amount of time encompassed by speaker turns in the transcripts; the numbers in parentheses indicate the amounts for the 50-h subset that was released prior to the 1996 benchmark test.

The three distinct test pools created in this collection period (development set, 1996 benchmark set, 1997 benchmark set) each contained a 10-h sample of data from seven programs, as listed in Table 2; for each of the benchmark sets, NIST selected a subset of a few hours for use in the evaluations.

Table 1
Sampling for the 1996 English BN training collection

  Source                      # of files   Total hours    Total speech
  ABC Nightline                23 (9)      11.5 (4.5)      7.76 (3.01)
  ABC World Nightly News       25 (9)      12.5 (4.5)      6.10 (2.14)
  ABC World News Tonight       12 (9)       6.0 (4.5)      4.05 (3.02)
  CNN Early Edition             7 (6)       5.0 (4.5)      3.00 (2.78)
  CNN Early Prime News         15 (7)       8.5 (3.5)      5.00 (2.48)
  CNN Headline News            16 (9)       8.5 (4.5)      5.42 (2.84)
  CNN Prime News               11 (9)       5.5 (4.5)      3.98 (3.23)
  CNN The World Today           7 (4)       7.0 (4.0)      4.63 (2.63)
  CSPAN Washington Journal      6 (2)      12.0 (4.0)     11.95 (4.00)
  NPR ATC                      37 (14)     20.0 (7.0)     17.16 (5.74)
  NPR Marketplace              15 (9)       7.5 (4.5)      5.87 (3.63)
  Total                       174 (87)    104.0 (50.0)    74.92 (35.5)

Table 2
Sampling for test sets in the 1996 BN collection

  Source                      Hours recorded
  ABC Prime Time News          1
  CNN Morning News             2
  CNN World View               1
  CSPAN Washington Journal     2
  NPR Morning Edition          2
  NPR Marketplace              1
  PRI The World                1
  Total                       10

3.2. The 1997 Hub4 collection

In the 1997 collection phase, acoustic training data in English were collected between June 1997 and January 1998. In comparison to the 1996 training collection, not only was this sampling of BN sources relatively sparse over time (spanning 7 months instead of 7 weeks), it was also somewhat uneven. Each source appears only in a limited portion of the sampling period (e.g. PRI was only recorded during October); within any given month, samples were typically taken from only two or three sources (during August we sampled four sources, while no samples were recorded in November). The amounts and time periods covered for each source are displayed in Table 3.

There were fewer sources available for non-English BN data to support the Hub4-NE benchmarks, and the quantities involved were lower, as shown in Table 4.

In collecting test data for these languages, we made sure that the recording dates of the test files were all at least a few days after the date of the latest training file; the time period sampled for the test data was about one week, at the end of August, 1997. The single test pool collection has been used for both the 1997 and 1998 Hub4-NE benchmarks, and is summarized in Table 5.

Table 3
Sampling for the 1997 English BN training collection

  Source                     Sampling period (YYMM)   # of files   Total hours   Total speech
  ABC World News Tonight     9801                      19            9.5           6.21
  CNN Headline News          9801                      10            5.0           3.38
  CNN Early Prime            9706–9708                 14           14.0          10.38
  CNN Prime News             9709–9710, 9712           27           13.5           8.59
  CNN The World Today        9707–9709, 9712           21           21.0          13.71
  CSPAN Public Policy        9707–9708                  5            9.0           8.23
  CSPAN Washington Journal   9706–9707                  7           14.0          12.71
  PRI The World              9710                      11           11.0           9.15
  Total                                               114           98.0          72.36

Table 4
Sampling for the 1997 non-English BN training collection

  Source                     Sampling period (YYMM)   # of files   Total hours   Total speech
  CCTV (Mainland China)      9701–9704                 25           13.0          11.7
  KAZN (California)          9702–9704                  9            4.5           2.7
  VOA Mandarin               9705–9706                 24           25.0          15.8
  Subtotal (Mandarin)                                  58           42.5          30.2
  ECO (Mexico)               9612–9704                 30           17.5          12.3
  UNIVISION (Mexico)         9705–9706                 24           12.0           8.2
  VOA Spanish                9706–9707                 27           17.5          11.8
  Subtotal (Spanish)                                   81           47.0          32.3

Table 5
Sampling for the 1997–98 Hub4-NE benchmark test pools

  Source                     Hours recorded
  CCTV                        1.5
  KAZN                        2.5
  VOA Mandarin                4.0
  Subtotal (Mandarin)         8.0
  ECO                         1.5
  UNIVISION                   2.0
  VOA Spanish                 3.0
  Subtotal (Spanish)          6.5

3.3. Hub4 test set collections in 1998

The 1998 Hub4-E benchmark test set was built from two subsets: "set 1" was composed of previously unused test material from the 1996 English BN collection described earlier; "set 2" was a collection of new recordings made in June 1998, as listed in Table 6.

Table 6
New sampling for the 1998 Hub4-E benchmark

  Source                     Hours recorded
  ABC World News Tonight      1.0
  CNN Early Edition           2.0
  CNN Headline News           0.5
  CNN Morning News            2.0
  CNN The World Today         2.0
  CSPAN Public Policy         1.0
  CSPAN Washington Journal    1.0
  PRI The World               1.0
  Total                      10.5

For the 1999 Hub4-E benchmark, a new set of 10 h was recorded from four sources during the month of August 1998, and initial transcripts were created by a professional service (Federal Document Clearing House); from this, NIST selected about 90 min of excerpts, which received careful transcription, time marking, and annotation of music and background conditions. In this case we cut down the annotation labor substantially by doing the labeling of background and music conditions as a completely independent task, and then combining those time marks with the transcripts by means of ASR-based word-level time alignment of the transcripts.

The 1999 Hub4-NE benchmark involved only Mandarin data, and here the intention was to maximize the potential for parallel coverage between the English and Mandarin components of the benchmark. Therefore, 10 h of data were recorded from the same period in August of 1998, and these recordings received the same treatment as the English portion of the test – including the use of forced recognition word alignment to allow independent annotations of background and music conditions to be folded into the transcripts.
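This fold-in step can be pictured concretely: once forced alignment has supplied a start time for every word, the separately produced background and music marks are just point events to be merged into the word stream by time. The sketch below is a minimal illustration under that assumption; the field layout and label names are hypothetical, not the LDC or NIST formats.

```python
# A minimal sketch (not the LDC/NIST tooling) of folding independently
# produced background/music time marks into a word-aligned transcript.
# It assumes forced alignment has already attached a start time, in seconds,
# to every word token; the field layout and label names are hypothetical.
from bisect import insort

def fold_in_events(aligned_words, background_events):
    """Merge point events (time, label) into a time-ordered word stream.

    aligned_words     : list of (start_time, word) pairs from forced alignment
    background_events : list of (time, label) pairs, e.g. (132.4, "music_low")
    Returns one list of (time, token) sorted by time; event tokens are wrapped
    in angle brackets to keep them distinct from words.
    """
    merged = sorted(aligned_words)            # copy of the word stream, by time
    for t, label in background_events:
        insort(merged, (t, f"<{label}>"))     # drop each event at its time point
    return merged

if __name__ == "__main__":
    words = [(10.0, "good"), (10.3, "evening"), (10.9, "from"), (11.2, "london")]
    events = [(10.5, "music_onset_low"), (11.0, "music_off")]
    for t, tok in fold_in_events(words, events):
        print(f"{t:6.2f}  {tok}")
```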

Also in 1999, the LDC released a supplemental collection of KUSC Marketplace recordings for use as training, consisting of about 40 h of material that had become available to us. As with other recent training data, the level of detail in the transcription markup was limited to exclude annotation of background and music conditions.

3.4. TDT data collections

In January 1998, the LDC embarked on the collection of BN data for the project on topic detection and tracking (TDT). The research goals defined for this project required daily sampling from a variety of sources over an extended period of time. There had been an initial, text-only TDT collection (TDT1, or "Pilot"), created by James Allan at the University of Massachusetts (Allan et al., 1998), from the text archives previously provided for Hub4 language modeling. TDT1 contained about 20 stories per day from each of two sources (CNN and the Reuters newswire service) over a one-year period. The data collection that we began in 1998 would create the next phase (TDT2), by providing daily samples from four BN sources (ABC, CNN, PRI, VOA) and two newswire sources (New York Times and Associated Press) over a six-month period. At the end of June 1998, nearly 680 h of English BN recordings had been collected for use as training, development test and benchmark test for the 1998 TDT benchmark. Because the focus of the project is on information mining and retrieval rather than speech recognition, virtually all participants based their efforts on the use of the text transcription for this collection rather than using the audio data. But a subsequent project on spoken document retrieval (SDR) has involved a wider range of researchers in performing speech recognition on this corpus (Cieri et al., 1999; Garofolo et al., 2000). There was one notable difficulty in the collection of the VOA English broadcasts: the LDC was not able to record directly from its own satellite downlink until mid-February; throughout January and the first part of February, we collected daily audio samples that were posted on the VOA web site – these were studio recordings digitized at 11 kHz, instead of the 16 kHz sample rate used to capture broadcasts at the LDC.

Mandarin audio data from VOA broadcasts were collected over the same time period as TDT2 English, and two Mandarin text news sources were collected as well (Xinhua news service from Beijing via newswire, and ZaoBao news from Singapore via World Wide Web). These sources were not prepared or distributed to TDT researchers until the 1999 phase of the project. Two problems affected the VOA Mandarin collection: first, it began in mid-February 1998, six weeks later than the start of the English collection; second, early in April, a schedule change in VOA broadcasts went into effect, but the automatic data capture process at the LDC was not adjusted accordingly (full manual audits were not being conducted on the Mandarin recordings because the data would not be annotated or used until the following year); as a result, the April, May and June recordings contained only 10–15 min of news content per hour of recorded broadcast, instead of the expected 50–55 min.

For the TDT3 corpus collection, two new English television broadcasts were added to the list of sources for daily sampling (NBC and MSNBC). The collection period spanned 1 October through 31 December 1998. This corpus was used as the test set in both the 1999 and 2000 TDT evaluations.

Table 7 summarizes the audio content of the TDT corpora by source, in terms of total broadcast time and total time encompassed by news story content (i.e. excluding commercials, program introductions, music breaks, and so on; this is roughly comparable to the "total speech" category in the tables for the Hub4 corpora).

Table 7
Audio sources and quantities in TDT2 and TDT3 corpora

                    TDT2                                       TDT3
  Source            # of files   Total hours   Total speech    # of files   Total hours   Total speech
  ABC                162           81            50              76           38            24
  CNN                641          321           198             349          175           109
  MSNBC                0            0             0              51           51            33
  NBC                  0            0             0              87           43            27
  PRI                121          121            92              65           65            53
  VOA English        227          227           194             103          103            89
  VOA Mandarin       177           56            46             121          121            98
  Total             1328          806           580             852          596           433

4. Text collections for language modeling

In order to provide a substantial body of text data in the BN domain for language modeling, the LDC established an arrangement with Primary Source Media whereby we could extract and condition the content of a four-year archive of commercially produced broadcast transcripts from a wide range of news sources, and then redistribute the resulting data to LDC members.

The archive covers January 1992 through June 1996 and comprises over 800 MB of text (over 142 million words) in 34,000 news stories, drawn from about 100 different regular and special news programs that aired during this period on four networks (ABC, CNN, NPR, PBS). Existing tools provided by BBN and MIT Lincoln Labs were adapted to extract and condition the text, producing a format suitable for building statistical language models.
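The conditioning step can be pictured as a simple text-normalization pipeline. The sketch below is a minimal illustration, assuming uppercase, punctuation-free output with one conditioned line per input line; it is not a reproduction of the BBN or MIT Lincoln Labs tools, and the real conditioning handled many more cases (numerals, abbreviations, acronym spellings, and so on).

```python
# A minimal sketch of transcript-text conditioning for language modeling.
# Illustrative only; not the actual BBN / MIT Lincoln Labs conditioning tools.
import re

def condition_line(line: str) -> str:
    """Normalize one line of commercial transcript text for LM training."""
    text = line.upper()                       # case-fold to a caseless LM vocabulary
    text = re.sub(r"[^A-Z' ]+", " ", text)    # keep only letters, apostrophes, spaces
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace

def condition_file(in_path: str, out_path: str) -> None:
    """Write one conditioned line per non-empty input line."""
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            cleaned = condition_line(line)
            if cleaned:
                fout.write(cleaned + "\n")

print(condition_line('The F.B.I. said -- "no comment" -- on Tuesday.'))
# -> THE F B I SAID NO COMMENT ON TUESDAY
```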

5. Transcription

5.1. Common structural elements in BN transcriptions

The overall structure of transcripts remained the same in the 1996 and 1997 collections. Each transcript for a complete broadcast recording was presented in SGML as one <EPISODE> element, comprising a temporally contiguous series of <SECTION> elements marking the "topical units" (news stories, non-news "fillers" between stories, and portions to be left untranscribed, such as commercial breaks); within each transcribed <SECTION>, there were one or more elements representing speaker turns – these were initially tagged as <SEGMENT> elements in the 1996 collection, and as <TURN> in the 1997 set. Within each turn, transcribers could insert additional time stamps to break up long turns for easier transcription.


Each <SECTION> and <TURN> tag included attributes providing the time offsets, in seconds from the start of the recording, for the beginning and ending of the element. The concatenation of time spans from all <SECTION> tags in an <EPISODE> would yield the full duration of the recording, without overlap. However, the time spans of two successive <TURN> tags could overlap, indicating a period of time where two people were speaking at once. In addition to the time stamps, each <SECTION> was marked as to its type ("report", "filler" or "non-trans"), and each turn was marked with a string to identify the speaker (usually the speaker's name, where this was available, or an arbitrary identifier unique to each individual speaker, where the name was never given in the broadcast).
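As a concrete illustration of this structure, the sketch below scans a small transcript fragment for section and turn tags and collects their time attributes. The tag spellings, attribute names and attribute values shown here are illustrative assumptions rather than the exact Hub4 SGML conventions, which are defined in the corpus documentation.

```python
# A minimal sketch of walking the episode/section/turn structure described
# above. Tag and attribute names (Section, Turn, startTime, endTime, type,
# speaker) are hypothetical stand-ins for the actual Hub4 markup.
import re

TAG = re.compile(r"<(Section|Turn)\b([^>]*)>", re.IGNORECASE)
ATTR = re.compile(r'(\w+)="([^"]*)"')

def scan_structure(sgml_text):
    """Yield (tag_name, attributes) for each Section/Turn opening tag."""
    for m in TAG.finditer(sgml_text):
        yield m.group(1).capitalize(), dict(ATTR.findall(m.group(2)))

example = """<Episode program="example_news">
<Section type="report" startTime="0.0" endTime="95.2">
<Turn speaker="anchor_1" startTime="0.0" endTime="31.7">
good evening and welcome
</Turn>
</Section>
</Episode>"""

for tag, attrs in scan_structure(example):
    print(tag, attrs)
# Section {'type': 'report', 'startTime': '0.0', 'endTime': '95.2'}
# Turn {'speaker': 'anchor_1', 'startTime': '0.0', 'endTime': '31.7'}
```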

5.2. Changes of transcript structure between 1996 and 1997

There were changes in transcription practice between the 1996 and 1997 collection periods. In the 1996 collection, heavy emphasis was placed on the addition of descriptive information about changes in signal quality, background conditions and speaking style; at the same time we avoided the use of conventional punctuation (periods, question marks and commas) to mark sentence and phrase boundaries (periods were used to mark single-letter initials and acronyms, e.g. "F.B.I."). For the 1997 collection, we abandoned the marking of signal quality, background conditions and speaking style, because these annotations were considered too costly in terms of the manual effort required to apply them, and to verify consistent quality in their application, over the full 100 h of recordings, given the project schedule. In addition, it was decided to apply conventional punctuation (periods, commas, question marks), because this is a natural tendency among transcribers that does provide useful information; we also added a practice of marking particular classes of words with special characters (e.g. initials and acronyms were marked with preceding underscore characters, as in "_F_B_I", so that periods would be unambiguous as sentence boundaries).
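To make the acronym convention concrete, the following small sketch rewrites 1996-style period-marked acronyms into the 1997-style underscore-marked form; it is an illustration of the convention only, not a tool that was used in the project.

```python
# A small, illustrative converter between the two acronym conventions noted
# above: 1996 transcripts wrote "F.B.I." while 1997 transcripts wrote "_F_B_I"
# so that periods could serve unambiguously as sentence boundaries.
import re

def periods_to_underscores(text: str) -> str:
    """Rewrite period-marked acronyms like 'F.B.I.' as '_F_B_I'."""
    def repl(match):
        letters = match.group(0).replace(".", "")
        return "".join("_" + letter for letter in letters)
    return re.sub(r"\b(?:[A-Z]\.){2,}", repl, text)

print(periods_to_underscores("The F.B.I. and the C.I.A. issued a statement."))
# -> The _F_B_I and the _C_I_A issued a statement.
```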

The <SEGMENT> tags of the 1996 collection (i.e. the speaker turns) were marked with attributes to identify the relative "fidelity" of the acoustic channel being used in that speaker turn, and the "mode" of speech used by the speaker in that turn. The fidelity was subjectively judged as "high", "low" or "medium", to reflect, respectively, studio-quality speech, speech transmitted over band-limited and/or noisy channels (e.g. telephone), and gradations between these extremes (e.g. field reports with hand-held microphones). The speaking mode attribute identified each turn as either "planned" (very fluent, presumably read from a script) or "spontaneous" (clearly unscripted, often containing hesitations, stutters or other disfluencies). The fidelity and mode attributes were not used in the 1997 collection.

In the 1996 collection, various background conditions (music, voices or other noise) were marked by <BACKGROUND> tags, with attributes to indicate the time offset, the type of background condition being marked, and the relative (subjective) level of background signal (high, low or "off"); unlike the section and turn tags, which spanned across regions of time and contained transcription text, the background tags marked single time offsets and were "content-less"; they could be placed wherever necessary in the transcription stream without regard to the hierarchical structure of sections and turns (e.g. music could be tagged as starting at a low level in the middle of one turn, changing to high in the middle of a subsequent turn, and reverting to off in a following section element). Background tags were not used in the 1997 training collection.

One other difference between the two phases of collection involved the treatment of overlapping speech. In the 1996 data, words within one turn that overlapped with another speaker's turn were simply delimited by hash-mark characters ("#"); this made it difficult to parse out the overlap regions, especially in programs containing spontaneous discussions among three people. In the 1997 data, new SGML structure was introduced to explicitly mark overlap regions within turns, including time stamp attributes, so that these regions could be identified unambiguously.
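The parsing difficulty with the 1996 convention can be illustrated with a small sketch that separates hash-delimited overlap words from the rest of a turn. The example text is invented, and the approach breaks down exactly where the text above notes trouble, e.g. when hash marks are unpaired or overlaps span turn boundaries.

```python
# A minimal sketch of separating the hash-delimited overlap words used in the
# 1996 transcripts from the rest of a turn. The simple even/odd split below
# fails when '#' marks are unpaired or an overlap crosses a turn boundary.
def overlap_regions(turn_text):
    """Return (overlapped_words, clear_words) for one 1996-style turn."""
    overlapped, clear = [], []
    # Chunks between '#' marks alternate: even-indexed chunks lie outside an
    # overlap region, odd-indexed chunks lie inside one.
    for i, chunk in enumerate(turn_text.split("#")):
        (overlapped if i % 2 else clear).extend(chunk.split())
    return overlapped, clear

ov, cl = overlap_regions("well i think # no let me finish # that the point stands")
print("overlapped:", ov)   # ['no', 'let', 'me', 'finish']
print("clear:", cl)        # ['well', 'i', 'think', 'that', 'the', 'point', 'stands']
```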


The transcription of the TDT corpus is quite a separate issue. Owing to the focus on information retrieval, it was decided that the primary format of transcripts would follow the conventions established for text corpora in the TIPSTER and TREC projects; this is a fairly simple and flexible SGML structure that identifies distinct news stories as <DOC> elements, which are simply concatenated within a given sample file. Each "doc" contains some amount of descriptive information (e.g. date, headline, author – the amount and variety of information depends on the source) followed by a <TEXT> element containing the actual news story. For TDT, the markup was normalized to be consistent across all TDT sources, to accommodate time stamps marking the start and end points of stories in BN sources, and to permit the simple inclusion of <TURN> and <ANNOTATION> tags within the text to identify speaker turns, speaker names, and other important features in the transcripts. Verbatim accuracy of transcripts was not an essential requirement for TDT: for television sources, the closed caption text stream was deemed an acceptable transcription; where a commercially produced transcript was regularly available for a given program in the collection schedule, this was acquired; for radio sources where commercial transcripts were not available (VOA and PRI), it was sufficient to send audio recordings to a commercial service for transcription at a fixed fee per hour of broadcast. In all cases, the only signal-based annotation of the transcripts was to establish the boundaries of all news stories, by placing time-stamped tags at appropriate points in the transcripts; no corrections or additions were made to the lexical content that was produced by the original transcription source.

5.3. Reconciling variant transcription conventions: NIST UTF format and annotation graphs

Faced with the need to use both phases of BN data in a single evaluation cycle, and at the same time add new markup structure to support additional research tasks such as named-entity recognition, NIST defined a "Universal Transcription Format" (UTF) (Fiscus et al., 1998) that would retain all the information already provided in the BN transcripts, accommodate the fact that some information was available in one phase but not another, eliminate unnecessary variations in the naming of SGML tags and attributes, and permit the addition of new tags. NIST handled the task of conditioning the original transcript collections released by the LDC, producing the UTF versions of these collections, and managing the process of adding new annotations for subsequent research tasks. The results of their efforts are now available through the LDC. The UTF SGML markup is also used for the VOA Czech and VOA Korean BN collections.

More recently, Bird and Liberman (1999) have developed a conceptual framework for unifying a wide range of speech annotations by means of projecting their content onto a directed acyclic graph: a network of nodes and arcs in which the arcs carry the annotation content (e.g. the transcription words and additional labels assigned to points or ranges along the time line of a speech signal), and the nodes that bound the arcs represent the specific (or approximate) time offsets that delimit the content of the attached arcs. In applying this design to the representation of Hub4 and TDT annotations, we have a simple means for integrating multiple data sets whose annotations contain varying amounts of detail. For example, some Hub4 files have arcs labeled with regard to background and music conditions, while others do not, but they all have arcs for stories, speaker turns, word tokens, etc.; retrieval of these common units from a collection of annotation graphs is not impeded by the variable presence of the uncommon units. Also, the relationships among arcs on a given time line are not constrained by the hierarchical structure that limits annotations in SGML.
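A minimal rendering of the annotation-graph idea is sketched below: every annotation is an arc between two time offsets, and querying one annotation type simply ignores arcs of other types, so files with and without background/music arcs can be handled uniformly. The class and field names are illustrative, not the published toolkit API.

```python
# A minimal sketch of the annotation-graph idea (after Bird and Liberman,
# 1999): each annotation is an arc between two time offsets, and annotation
# types present in only some files simply contribute no arcs in the others.
from dataclasses import dataclass, field

@dataclass
class Arc:
    start: float    # time offset (s) of the node the arc leaves
    end: float      # time offset (s) of the node the arc enters
    kind: str       # e.g. "word", "turn", "story", "background"
    label: str      # the annotation content carried by the arc

@dataclass
class AnnotationGraph:
    arcs: list = field(default_factory=list)

    def add(self, start, end, kind, label):
        self.arcs.append(Arc(start, end, kind, label))

    def of_kind(self, kind):
        """Retrieve all arcs of one annotation type, ignoring any others."""
        return [a for a in self.arcs if a.kind == kind]

g = AnnotationGraph()
g.add(0.0, 95.2, "story", "opening report")
g.add(0.0, 31.7, "turn", "anchor_1")
g.add(0.0, 0.4, "word", "good")
g.add(0.4, 1.0, "word", "evening")
g.add(12.5, 20.0, "background", "music_low")   # present in some files only

print([a.label for a in g.of_kind("word")])    # common units retrieved uniformly
```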

5.4. Remarks about best methods

Experience has shown that the complex requirements of Hub4 transcription are best treated as a set of independent passes. One pass should focus simply on establishing time stamps for important or useful points in the recordings, such as story boundaries, speaker turns, and convenient breaks (e.g. breath pauses) for splitting long turns into manageable chunks. Once this is done, a separate pass for typing a fully accurate record of what is spoken can be completed fairly quickly, assuming that the transcription tool provides easy keyboard control of audio playback using the established time stamps in the transcript file. If additional annotation is needed, to mark speaker attributes, background or other signal conditions, it is best to put this aside as one or more separate passes after the verbatim transcription. In this way, the annotator can focus more carefully on a smaller set of decisions that need to be made for a particular stage of annotation. Also, when carrying out a well-focused labeling task, the annotator may be aided by the ability to see the full text already in place, and has the opportunity to spot and correct errors made in previous passes.

An alternative which has also proven effective is to assign the typing stage to a competent commercial transcription service, where business operations, personnel and resources are all focused on the rapid completion of verbatim transcripts. Such a service offers the advantages of efficiency and reasonable cost per hour of broadcast data, and allows us to focus our more specialized efforts and tools on the technical details required by the CSR domain, such as accurate time stamping, inclusion of disfluencies, and annotation of speaker and signal conditions; these details are fairly easy to add when the correct lexical content of speaker turns has already been provided. This was especially effective in the TDT project, where the amount of detail to be added was relatively small. It was also used to good advantage in the first Hub4 Pilot data sets, and in the most recent Hub4 benchmark test data and supplemental training data.

In many respects, the design and formatting decisions that have been applied to Hub4 transcripts have proven cumbersome and difficult in terms of creating, maintaining and using the transcript files. Requiring that all the markup regarding signal conditions, speakers, overlapping speech and information structure be woven together with the transcribed word tokens into a single SGML text stream places a heavy load on the annotation process, and on the user who typically needs only a subset of this markup at any given time. Extending the markup to address other research tasks besides speech recognition (such as named entity retrieval) can greatly magnify the problem. Recent accomplishments in providing tools and infrastructure for the creation and use of annotation graphs (Bird and Liberman, 1999) are creating significant advances for design, specification, handling and integration of large, complex corpora.

6. Lexicon development

The LDC has provided pronunciation lexicons for all three Hub4 languages. In the case of English, the lexicon was already well developed to provide good coverage of journalistic data, with a total of over 92,000 entries; relatively little effort was needed to provide the additional coverage of previously unseen words found in the Hub4 training collections.

In the case of Mandarin and Spanish, the existing lexicons had been developed primarily from transcripts of casual telephone conversations – the "Callhome" collections provided for the project on large vocabulary conversational speech recognition (LVCSR). In order to provide adequate coverage for new words occurring in the Hub4-NE training data, lexicon development was a necessary concomitant to transcription. Native speakers of Mandarin and Spanish with linguistic training were enlisted to serve both as team leaders of the respective transcription crews and as lexicon editors, to make sure that the lexicons were kept current with respect to word lists from the training transcripts, and to provide important quality-control feedback to the transcribers.
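The coverage bookkeeping implied here can be sketched as a simple out-of-vocabulary check between the training transcripts' word list and the pronunciation lexicon. The file formats and names below are assumptions for illustration, not the LDC's actual lexicon tools.

```python
# A minimal sketch of the coverage bookkeeping implied above: compare the word
# list of new training transcripts against the pronunciation lexicon and
# report items still lacking entries. File formats and names are assumptions.
def load_lexicon_words(lexicon_path):
    """Assume one entry per line, orthographic form in the first field."""
    with open(lexicon_path, encoding="utf-8") as f:
        return {line.split()[0] for line in f if line.strip()}

def oov_words(transcript_paths, lexicon_words):
    """Return the set of transcript word types not covered by the lexicon."""
    seen = set()
    for path in transcript_paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                seen.update(line.split())
    return seen - lexicon_words

# Hypothetical usage:
# lex = load_lexicon_words("hub4ne_spanish_lexicon.txt")
# print(sorted(oov_words(["training_transcript_01.txt"], lex)))
```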

Because the lexicon development was being managed as part of the transcription effort, the person in charge of the lexicon for Mandarin, Shudong Huang, developed a detailed set of principles for word segmentation of the Mandarin transcription texts; these principles were applied by the transcribers working on Hub4 Mandarin data, so that the original transcript files in this corpus include manual word segmentation.


7. Conclusions

The difficulties of creating, maintaining and using collections of BN data stem partly from the magnitude of these collections, partly from the complexity of their content, and partly from the evolution of project needs and specifications. We are still looking forward to applying more rational structures and methods in order to make these corpora more manageable and more fruitful. Despite the problems experienced in the earlier releases, there has been ample evidence that the BN corpora have fostered significant progress in speech-related research and technology development, and will continue to do so.

References

Allan, J., Carbonell, J., Doddington, G., Yamron, J., Yang, Y., 1998. Topic detection and tracking pilot study: final report. In: Proc. 1998 DARPA Broadcast News Transcription and Understanding Workshop.

Bird, S., Liberman, M., 1999. A formal framework for linguistic annotation. Technical Report MS-CIS-99-01, Department of Computer and Information Science, University of Pennsylvania (expanded from version presented at ICSLP-98, Sydney), http://morph.ldc.upenn.edu/Papers/.

Byrne, B. et al., 1999. Towards language independent acoustic modeling, http://www.clsp.jhu.edu/ws99/projects/asr/index.html.

Cieri, C., Graff, D., Liberman, M., Martey, N., Strassel, S., 1999. The TDT-2 text and speech corpus. In: Proc. 1999 DARPA Broadcast News Workshop.

Doddington, G., 1996. The 1996 Hub-4 annotation specification for evaluation of speech recognition on broadcast news, ftp://jaguar.ncsl.nist.gov/csr96/h4/h4annot.ps.

Fiscus, J. et al., 1998. Universal transcription format specification, http://www.itl.nist.gov/iaui/894.01/tests/bnr/hub4_98/hub4_98.htm.

Garofolo, J., Fiscus, J., Fisher, W., 1997. Design and preparation of the 1996 Hub-4 broadcast news benchmark test corpora. In: Proc. 1997 DARPA Speech Recognition Workshop.

Garofolo, J.S. et al., 2000. The TREC spoken document retrieval track: a success story. In: RIAO'2000 Conf. Proc., Collège de France, Paris, 12–14 April 2000, pp. 1–20.

Graff, D., 1997. The 1996 broadcast news speech and language-model corpus. In: Proc. 1997 DARPA Speech Recognition Workshop.

Pallett, D., Fiscus, J., Garofolo, J., Przybocki, M., 1996. 1995 Hub-4 dry run broadcast materials benchmark test. In: Proc. 1996 DARPA Speech Recognition Workshop.

Pallett, D., Fiscus, J., Przybocki, M., 1997. 1996 preliminary broadcast news benchmark test. In: Proc. 1997 DARPA Speech Recognition Workshop.

Pallett, D., Fiscus, J., Martin, A., Przybocki, M., 1998. 1997 broadcast news benchmark test results: English and non-English. In: Proc. 1998 DARPA Broadcast News Transcription and Understanding Workshop.

Pallett, D., Fiscus, J., Garofolo, J., Martin, A., Przybocki, M., 1999. 1998 broadcast news benchmark test results. In: Proc. 1999 DARPA Broadcast News Workshop.

Pallett, D., Fiscus, J., Przybocki, M., 2000. Broadcast news 1999 test results. In: Proc. 2000 Speech Transcription Workshop, May 16–19, 2000, University College Conference Center, University of Maryland. Available at: http://www.itl.nist.gov/iaui/894.01/publications/index.htm.
