CIS 530 Orientation - November 2001 1

CIS 530 Orientation

November 2001

Linguistic Data Consortium

University of Pennsylvania

Philadelphia, PA 19104


Motivation

There are several thousand languages. Over 320 are each spoken by more than 1,000,000 speakers.

The ability to process foreign languages supports the global economy, internationalization of business, software localization, military roles, intelligence gathering, humanitarian efforts and foreign policy.

Developing language technology requires large amounts of data, appropriately selected, sampled, organized and annotated in corpora.

Corpus creation requires special equipment, unique legal arrangements and business models, and specialized skills not usually taught in the programs of users of language data.

LDC exists to make language data broadly available for linguistic education, research and technology development


LDC Role

LDC began in 1993 as a specialized publisher of language data. The data was typically produced elsewhere.

Distributed over 14,000 copies of 196 corpora to >1000 organizations worldwide

LDC gradually developed the ability to create language resources locally

newswires/text collection, collection of conversational data via telephone, broadcast news collection

transcription, time-alignment, topic relevance annotation, named entity annotation, phonological /morphological resources

LDC more recently extended its research program: TalkBank & Linguistic Exploration, Open Language Archives, African Language Lexicons, DASL

Linguistic technologies

Information Detection, Extraction and Summarization

Speech Recognition and Speech Synthesis

Machine Translation

Language and Speaker Identification

Language Teaching, Linguistics


Annotating LDC Corpora: TDT

Topic Detection & Tracking (TDT) Corpora

•TDT4 Corpus (most recent) contains 9 months of data in 6 languages

•Subset of 4 months of English, Chinese, Arabic for annotation

•Topics selected and defined from all sources

•Topic is a specific event or activity along with all directly related events (e.g., Hurricane Mitch)

•Multiple levels of annotation

•segmentation of audio signal into individual stories

•topic-story relevance judgements

•first story identification

•story-link identification

•Millions of annotation decisions


Audio Segmentation

Using commercial transcripts or closed captions, annotators

assess existing story boundaries

add, delete, move boundaries as needed

classify units as “news” or “not news” (commercials, etc.)

set and confirm timestamps for all story boundaries


Topic-Story Annotation

Annotators read and evaluate news stories against topic list

Classify story as directly, briefly or not at all related to a target topic
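The three-way judgment above can be pictured as a small record type. The following is only a sketch: the label names `BRIEF` and `NO`, the `Judgment` class, and the `on_topic_docs` helper are assumptions made for illustration (the slides show only `level=YES`), and the topic and document identifiers are placeholders.

```python
from dataclasses import dataclass
from enum import Enum


class Relevance(Enum):
    """Three-way topic-story judgment; BRIEF and NO are assumed label names."""
    YES = "directly related"
    BRIEF = "briefly related"
    NO = "not related"


@dataclass(frozen=True)
class Judgment:
    topic_id: str      # placeholder topic identifier
    docno: str         # placeholder story identifier
    level: Relevance


def on_topic_docs(judgments, topic_id):
    """Return stories judged directly or briefly related to the topic."""
    return [
        j.docno
        for j in judgments
        if j.topic_id == topic_id and j.level is not Relevance.NO
    ]


judgments = [
    Judgment("T1", "story-a", Relevance.YES),
    Judgment("T1", "story-b", Relevance.NO),
    Judgment("T1", "story-c", Relevance.BRIEF),
    Judgment("T2", "story-a", Relevance.NO),
]
print(on_topic_docs(judgments, "T1"))
```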


Annotating LDC Corpora: ACE

Automatic Content Extraction Project (ACE)

Develop technology to support automatic processing of human language in text form

Classification, filtering, representing language content

Four annotation tasks

Identify all nominal entities in news story

Categorize according to type

Entity types: persons, organizations, GPE (geo-political entity), location, facility

Mention types: name, nominal, pronominal

Co-index all mentions of single entity within story

Classify relations among entities
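One way to picture the output of these four tasks is a small data structure in which each entity carries a type and a list of co-indexed mentions. This is a sketch under assumptions: the class names, the uppercase labels, character offsets as the way to locate mentions, and the example sentence are all hypothetical, not the ACE file format.

```python
from dataclasses import dataclass, field
from typing import List

# Label sets taken from the slide; the uppercase spellings are an assumption.
ENTITY_TYPES = {"PERSON", "ORGANIZATION", "GPE", "LOCATION", "FACILITY"}
MENTION_TYPES = {"NAME", "NOMINAL", "PRONOMINAL"}


@dataclass
class Mention:
    text: str
    start: int          # character offset into the story text
    mention_type: str   # NAME, NOMINAL or PRONOMINAL


@dataclass
class Entity:
    entity_id: str
    entity_type: str
    mentions: List[Mention] = field(default_factory=list)

    def __post_init__(self):
        if self.entity_type not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {self.entity_type}")

    def add_mention(self, text, start, mention_type):
        """Co-index another mention of this entity."""
        if mention_type not in MENTION_TYPES:
            raise ValueError(f"unknown mention type: {mention_type}")
        self.mentions.append(Mention(text, start, mention_type))


@dataclass
class Relation:
    relation_type: str  # hypothetical label; the slide names no relation types
    arg1: str           # entity_id of the first argument
    arg2: str           # entity_id of the second argument


# Hypothetical example sentence, written for this sketch.
story = "Swissair defended the system. The airline said it acted voluntarily."
swissair = Entity("E1", "ORGANIZATION")
swissair.add_mention("Swissair", story.index("Swissair"), "NAME")
swissair.add_mention("The airline", story.index("The airline"), "NOMINAL")
swissair.add_mention("it", story.index(" it ") + 1, "PRONOMINAL")
```

All three mentions share the entity `E1`, which is what co-indexing within a story amounts to in this representation.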


Nominal Entity Tagging


•Best practices in use of large-scale corpora in study of linguistic variation

•Focus on -t/d deletion in American English (well-known variable)

•Four LDC Corpora, all created for linguistic technology development

•All data already transcribed, segmented to provide fine-grained access

•Basic demographic information available (gender, age, education, region, race/ethnicity)

Corpus                           ISBN           Minutes  Type of Data
TIMIT                            1-58563-019-5  630      Phonetically Rich Sentences
Switchboard-1                    1-58563-121-3  12000    Short Conversations with Constrained Topics among Strangers
CallHome American English        1-58563-111-6  1200     Long Conversations with Free Topics among Intimates
American English Broadcast News  1-58563-109-4  6240     Broadcast News


DASL Technology

Create concordance: regular expression search of corpus

Create tag set: specify which factors to code

Create annotation file: combines data with tag set

Annotate using web browser

- play each example; tool supports common audio formats

- code factors in each factor group, adding comments when needed

- demographic information displayed

Save results and output to text file: can be exported to an Excel spreadsheet or a statistical analysis package
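The concordance and export steps can be sketched roughly as follows. This is not the DASL tool itself: the regular expression is a crude orthographic stand-in for -t/d deletion candidates (a word ending in a consonant letter plus t or d, so spellings such as "missed" are not caught), and the function names, column names and sample utterance are assumptions made for illustration.

```python
import csv
import io
import re

# Assumed orthographic approximation of a -t/d deletion site:
# a word that ends in a consonant letter followed by t or d.
CANDIDATE = re.compile(r"\b[a-z']*[bcdfghjklmnpqrstvwxz][td]\b", re.IGNORECASE)

FIELDS = ["utt_id", "left", "token", "right", "deleted"]


def concordance(utterances, pattern=CANDIDATE, width=20):
    """Return one row per match, with left and right context for the coder."""
    rows = []
    for utt_id, text in utterances:
        for m in pattern.finditer(text):
            rows.append({
                "utt_id": utt_id,
                "left": text[max(0, m.start() - width):m.start()],
                "token": m.group(0),
                "right": text[m.end():m.end() + width],
                "deleted": "",  # factor left empty for the annotator to code
            })
    return rows


def to_tsv(rows):
    """Serialize rows as tab-delimited text, which spreadsheets can import."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical utterance, not taken from any of the four corpora.
rows = concordance([("u1", "I just went west and told him")])
print(to_tsv(rows))
```

Each row pairs one candidate token with an empty `deleted` column, mirroring the idea of an annotation file that combines the data with the tag set.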


TDT Overview


Transcripts

<DOC>
<DOCNO> ABC19981001.1830.0750 </DOCNO>
<DOCTYPE> NEWS STORY </DOCTYPE>
<DATE_TIME> 10/01/1998 18:42:30.46 </DATE_TIME>
<BODY>
<TEXT>
In the U.S. and Canada tonight, there is intense concern. It is fair to say, about the insulation used on 1,000 airplanes. It is the same insulation used on Swissair flight 111 and it has been linked to fires on three other planes. Swissair went down off Nova Scotia, which is why the Canadians are concerned.

The company that made that planes warned of the fire hazard years ago. ABC's Lisa Stark is in Washington.
<TURN>
<ANNOTATION> Reporter: </ANNOTATION>
This is the type of insulation in question....
<TURN>
Lisa Stark, ABC News, Washington.
</TEXT>
</BODY>
<END_TIME> 10/01/1998 18:44:37.14 </END_TIME>
</DOC>

<W recid=1651> In
<W recid=1652> the
<W recid=1653> U.S.
<W recid=1654> and
<W recid=1655> Canada
<W recid=1656> tonight,
<W recid=1657> there
<W recid=1658> is
<W recid=1659> intense
<W recid=1660> concern.
<W recid=1661> It
<W recid=1662> is
<W recid=1663> fair
<W recid=1664> to
<W recid=1665> say,
<W recid=1666> about
<W recid=1667> the
<W recid=1668> insulation
<W recid=1669> used
<W recid=1670> on
<W recid=1671> 1,000
<W recid=1672> airplanes.
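The token stream pairs each word with a record id. A minimal sketch of reading it, assuming each token has the shape `<W recid=N> word` as in the sample (the function name `parse_tokens` is hypothetical):

```python
import re

# One token: a <W recid=N> tag followed by the word it labels.
TOKEN = re.compile(r"<W recid=(\d+)>\s*([^<\s]+)")


def parse_tokens(stream):
    """Return (recid, word) pairs in the order they appear."""
    return [(int(recid), word) for recid, word in TOKEN.findall(stream)]


sample = "<W recid=1651> In<W recid=1652> the<W recid=1653> U.S.<W recid=1654> and"
tokens = parse_tokens(sample)
text = " ".join(word for _, word in tokens)
print(tokens)
print(text)
```

Because the pattern tolerates missing whitespace between tokens, it reads the stream whether the tokens sit one per line or run together.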


ASR Output

<X Bsec=749.13 Dur=1.34 Conf=NA>
<W recid=1822 Bsec=750.47 Dur=0.16 Clust=NA Conf=NA> IN
<W recid=1823 Bsec=750.63 Dur=0.10 Clust=NA Conf=NA> THE
<W recid=1824 Bsec=750.73 Dur=0.14 Clust=NA Conf=NA> U.
<W recid=1825 Bsec=750.87 Dur=0.16 Clust=NA Conf=NA> S.
<W recid=1826 Bsec=751.03 Dur=0.13 Clust=NA Conf=NA> AND
<W recid=1827 Bsec=751.16 Dur=0.41 Clust=NA Conf=NA> CANADA
<W recid=1828 Bsec=751.57 Dur=0.33 Clust=NA Conf=NA> TONIGHT
<W recid=1829 Bsec=751.90 Dur=0.21 Clust=NA Conf=NA> THERE
<W recid=1830 Bsec=752.11 Dur=0.26 Clust=NA Conf=NA> IS
<W recid=1831 Bsec=752.40 Dur=0.76 Clust=NA Conf=NA> INTENSE
<W recid=1832 Bsec=753.18 Dur=0.64 Clust=NA Conf=NA> CONCERN
<W recid=1833 Bsec=753.82 Dur=0.13 Clust=NA Conf=NA> IT
<W recid=1834 Bsec=753.95 Dur=0.12 Clust=NA Conf=NA> IS
<W recid=1835 Bsec=754.07 Dur=0.20 Clust=NA Conf=NA> FAIR
<W recid=1836 Bsec=754.27 Dur=0.09 Clust=NA Conf=NA> TO
<W recid=1837 Bsec=754.36 Dur=0.21 Clust=NA Conf=NA> SAY
<W recid=1838 Bsec=754.57 Dur=0.23 Clust=NA Conf=NA> ABOUT
<W recid=1839 Bsec=754.80 Dur=0.15 Clust=NA Conf=NA> THE
<W recid=1840 Bsec=754.95 Dur=0.69 Clust=NA Conf=NA> INSULATION
<W recid=1841 Bsec=755.64 Dur=0.44 Clust=NA Conf=NA> USED
<W recid=1842 Bsec=756.13 Dur=0.12 Clust=NA Conf=NA> ON
<W recid=1843 Bsec=756.25 Dur=0.06 Clust=NA Conf=NA> A
<W recid=1844 Bsec=756.31 Dur=0.57 Clust=NA Conf=NA> THOUSAND
<W recid=1845 Bsec=756.88 Dur=0.66 Clust=NA Conf=NA> AIRPLANES

<W recid=1651> In
<W recid=1652> the
<W recid=1653> U.S.
<W recid=1654> and
<W recid=1655> Canada
<W recid=1656> tonight,
<W recid=1657> there
<W recid=1658> is
<W recid=1659> intense
<W recid=1660> concern.
<W recid=1661> It
<W recid=1662> is
<W recid=1663> fair
<W recid=1664> to
<W recid=1665> say,
<W recid=1666> about
<W recid=1667> the
<W recid=1668> insulation
<W recid=1669> used
<W recid=1670> on
<W recid=1671> 1,000
<W recid=1672> airplanes.
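Unlike the transcript tokens, each ASR word carries a start time (`Bsec`) and a duration (`Dur`). The sketch below reads those attributes and derives an end time for each word. The attribute order is assumed to be fixed as in the sample, and the `AsrWord` class and `parse_asr` function are hypothetical names introduced for illustration.

```python
import re
from dataclasses import dataclass

# Assumes attributes always appear in the order shown in the ASR sample.
ASR_WORD = re.compile(
    r"<W recid=(\d+) Bsec=([\d.]+) Dur=([\d.]+) Clust=(\S+) Conf=([^>\s]+)>"
    r"\s*([^<\s]+)"
)


@dataclass
class AsrWord:
    recid: int
    bsec: float   # start time in seconds
    dur: float    # duration in seconds
    word: str

    @property
    def esec(self):
        """End time in seconds, rounded to the two decimals used in the data."""
        return round(self.bsec + self.dur, 2)


def parse_asr(stream):
    """Return AsrWord records in stream order; Clust and Conf are ignored."""
    return [
        AsrWord(int(recid), float(bsec), float(dur), word)
        for recid, bsec, dur, _clust, _conf, word in ASR_WORD.findall(stream)
    ]


sample = (
    "<W recid=1822 Bsec=750.47 Dur=0.16 Clust=NA Conf=NA> IN"
    "<W recid=1823 Bsec=750.63 Dur=0.10 Clust=NA Conf=NA> THE"
    "<W recid=1824 Bsec=750.73 Dur=0.14 Clust=NA Conf=NA> U."
)
words = parse_asr(sample)
for w in words:
    print(w.recid, w.word, w.bsec, w.esec)
```

In this sample each word's derived end time equals the next word's start time, which is a quick consistency check on the timestamps.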


Boundary Table


<BOUNDARY docno=ABC19981001.1830.0617 doctype=MISCELLANEOUS Bsec=617.87 Esec=750.46 Brecid=1525 Erecid=1650>

<BOUNDARY docno=ABC19981001.1830.0750 doctype=NEWS Bsec=750.46 Esec=877.14 Brecid=1651 Erecid=2014>

<BOUNDARY docno=ABC19981001.1830.0877 doctype=NEWS Bsec=877.14 Esec=896.86 Brecid=2015 Erecid=2063>


<W recid=1632> The

<W recid=1633> most

<W recid=1634> luxurious

<W recid=1635> minivan

<W recid=1636> you

<W recid=1637> can

<W recid=1638> buy...

<W recid=1639> Chrysler

<W recid=1640> town

<W recid=1641> and

<W recid=1642> country.

<W recid=1643> We

<W recid=1644> call

<W recid=1645> it

<W recid=1646> limited.

<W recid=1647> You'll

<W recid=1648> call

<W recid=1649> it

<W recid=1650> unlimited.

<W recid=1651> In

<W recid=1652> the

<W recid=1653> U.S.

<W recid=1654> and

<W recid=1655> Canada

<W recid=1656> tonight,

<W recid=1657> there

<W recid=1658> is

<W recid=1659> intense

<W recid=1660> concern.
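The boundary table assigns each token to a story by record id: in the rows above, token 1650 ("unlimited.") closes the MISCELLANEOUS unit and token 1651 ("In") opens the NEWS story. A minimal sketch of that lookup, assuming the attribute order shown in the table (the `Boundary` class and both function names are hypothetical):

```python
import re
from typing import NamedTuple, Optional

# Assumes attributes always appear in the order shown in the boundary table.
BOUNDARY = re.compile(
    r"<BOUNDARY docno=(\S+) doctype=(\S+) Bsec=([\d.]+) Esec=([\d.]+) "
    r"Brecid=(\d+) Erecid=(\d+)>"
)


class Boundary(NamedTuple):
    docno: str
    doctype: str
    bsec: float
    esec: float
    brecid: int
    erecid: int


def parse_boundaries(table):
    """Return one Boundary per <BOUNDARY ...> row."""
    return [
        Boundary(docno, doctype, float(bsec), float(esec), int(b), int(e))
        for docno, doctype, bsec, esec, b, e in BOUNDARY.findall(table)
    ]


def story_for_recid(boundaries, recid) -> Optional[Boundary]:
    """Return the unit whose record-id range contains recid, if any."""
    for boundary in boundaries:
        if boundary.brecid <= recid <= boundary.erecid:
            return boundary
    return None


table = """
<BOUNDARY docno=ABC19981001.1830.0617 doctype=MISCELLANEOUS Bsec=617.87 Esec=750.46 Brecid=1525 Erecid=1650>
<BOUNDARY docno=ABC19981001.1830.0750 doctype=NEWS Bsec=750.46 Esec=877.14 Brecid=1651 Erecid=2014>
<BOUNDARY docno=ABC19981001.1830.0877 doctype=NEWS Bsec=877.14 Esec=896.86 Brecid=2015 Erecid=2063>
"""
boundaries = parse_boundaries(table)
print(story_for_recid(boundaries, 1650).doctype)
print(story_for_recid(boundaries, 1651).docno)
```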


Relevance Table

<ONTOPIC topicid=3016 level=YES docno=ABC19981001.1830.0750 fileid=19981001_1830_1900_ABC_WNT comments="NO">
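A relevance row mixes bare and quoted attribute values. The sketch below reads such a row into a dictionary; `parse_attributes` is a hypothetical helper, and the attribute syntax is assumed from the single row shown rather than from a format specification.

```python
import re

# name=value pairs where the value is either "quoted" or a bare token.
ATTRIBUTE = re.compile(r'(\w+)=(?:"([^"]*)"|([^\s>]+))')


def parse_attributes(row):
    """Return the attributes of one table row as a dict of strings."""
    return {
        name: quoted if quoted else bare
        for name, quoted, bare in ATTRIBUTE.findall(row)
    }


row = (
    "<ONTOPIC topicid=3016 level=YES docno=ABC19981001.1830.0750 "
    'fileid=19981001_1830_1900_ABC_WNT comments="NO">'
)
judgment = parse_attributes(row)
print(judgment["docno"], judgment["level"])
```

The `docno` value is the same key used in the boundary table, so a relevance judgment can be tied back to a story's time span and record-id range.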

Topic Definition

30016. SwissAir111 Crash

Seminal Event

WHAT: SwissAir Flight 111 crashes

WHERE: Off the coast of Halifax, Nova Scotia.

WHEN: The crash occurs on 9/2/98; the investigation continues through the fall of 1998.

Topic Explication

The MD-11 aircraft was en route from New York to Geneva, Switzerland when it crashed into the Atlantic Ocean, killing all 229 people on board.

On topic: Stories covering the crash and ensuing investigation; plans to compensate the victims' families; any safety measures proposed or adopted as a direct result of this crash.

Rule of Interpretation: Rule 5: Accidents

Rule of Interpretation

5. Accidents:

Examples: plane, car or train crashes, bridge collapses, accidental shootings, boats sinking. The event would be causal activities and unavoidable consequences like death tolls, injuries, loss of property. The topic includes mourners' pursuit of legal action, investigations, issues with responsible parties (like drug and alcohol tests for drivers, etc.)


Story Links

Story Link Table

<LINK seed_docno=APW19981122.0381 comp_docno=ABC19981001.1830.0750 label=Y>



<DOC> <DOCNO> APW19981122.0381 </DOCNO>


<DATE_TIME> 11/22/1998 09:21:00 </DATE_TIME> (...)


Swissair CEO defends installation of in-flight entertainment



ZURICH, Switzerland (AP) _ Swissair ``did everything correctly'' in installing a state-of-the-art entertainment system switched off last month in the wake of the crash of Flight 111, the airline's chief executive said in an interview published Sunday.

Swissair acted voluntarily to disconnect the video-on-demand system, connected to a power supply routed through the cockpit, after Canadian investigators detected signs of heat damage on wiring and other debris from the ceiling around the cockpit of the MD-11. (...)


</BODY> (...)
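A story link row records whether a seed story and a comparison story discuss the same topic. The sketch below loads such rows and answers pair lookups. Treating the link as symmetric, reading any label other than `Y` as "not linked", and the names `load_links` and `linked` are all assumptions for illustration.

```python
import re

# Assumes attributes always appear in the order shown in the link table.
LINK = re.compile(r"<LINK seed_docno=(\S+) comp_docno=(\S+) label=(\w+)>")


def load_links(table):
    """Map (seed_docno, comp_docno) to True when the row is labeled Y."""
    return {
        (seed, comp): label == "Y"
        for seed, comp, label in LINK.findall(table)
    }


def linked(links, docno_a, docno_b):
    """Look up a pair in either order; unjudged pairs count as not linked."""
    return links.get((docno_a, docno_b), links.get((docno_b, docno_a), False))


table = "<LINK seed_docno=APW19981122.0381 comp_docno=ABC19981001.1830.0750 label=Y>"
links = load_links(table)
print(linked(links, "ABC19981001.1830.0750", "APW19981122.0381"))
```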
