LREC 2008, May 26 – June 1, Marrakesh

Bridging the Gap between Linguists & Technology Developers: Large-Scale, Sociolinguistic Annotation for Dialect and Speaker Recognition*

Christopher Cieri1, Stephanie Strassel1, Meghan Glenn1, Reva Schwartz2, Wade Shen3, Joseph Campbell3

1. Linguistic Data Consortium

3600 Market Street, Suite 810

Philadelphia, PA 19104

{ccieri, strassel, mlglenn}@ldc.upenn.edu

3. MIT Lincoln Laboratory

244 Wood Street

Lexington, MA 02421

{swade, jpc}@ll.mit.edu

2. United States Secret Service

Washington, DC

[email protected]

* This work is sponsored by the Department of Homeland Security under Air Force Contract FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the authors and are not necessarily endorsed by the United States Government


Introduction to Phanotics

- Increased interest in the speaker recognition community in high-level features that abstract from the acoustic signal
  - lexical choice, presence of idiomatic expressions, syntactic structures
- Forensic applications require robustness to channel differences
  - channel adaptation and the identification of features inherently robust to channel difference
- Language recognition community increasingly interested in mutually intelligible dialects, not just languages
- Decades of research in dialectology suggest that high-level features can enable systems to cluster speakers according to the dialects they speak.
- Phanotics (Phonetic Annotation of Typicality in Conversational Speech) seeks to
  - identify high-level features characteristic of American dialects
  - annotate a corpus for these features
  - use the data to develop dialect recognition systems
  - use the categorization to create better models for speaker recognition
- Sponsored by United States Secret Service
- MIT Lincoln Laboratory coordinates effort and develops the systems
- Linguists from Arizona State and Old Dominion universities consult on dialectal phenomena
- LDC and Appen Pty Ltd of Australia annotate data provided by LDC


Annotation Approach

- Annotating large corpora for many high-level features is impractical without
  - existing data annotations
  - technologies that simplify the annotator’s task
- Phanotics uses orthographically transcribed data to serve as a guide to potential loci for the features sought
  - orthographic transcripts, a pronouncing lexicon and a forced aligner generate a putative, time-aligned, phonetic transcription that imagines that the speaker’s utterances were standard
  - high-level features of interest are described as deviations from standard pronunciation
  - loci in which actual pronunciation differs from the putative standard are potential high-level features
- Since
  - complete phonetic transcription is cost-prohibitive
  - automatic phonetic transcription is not adequately accurate
  - we lack dialect studies for every difference one might encounter
- we do not count deviations directly but allow the technologies to guide human annotators to expected features.
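
The first step of this pipeline, turning an orthographic transcript into a putative standard phone sequence, can be sketched roughly as follows. This is only an illustration: the tiny lexicon, the ARPAbet-like symbols and the function name `putative_phones` are assumptions rather than the project's actual resources, and the forced-alignment step that attaches time stamps is omitted.

```python
from typing import Dict, List, Tuple

# Hypothetical mini pronouncing lexicon (ARPAbet-like symbols).
# A real system would load a full pronouncing dictionary instead.
LEXICON: Dict[str, List[str]] = {
    "he": ["HH", "IY"],
    "left": ["L", "EH", "F", "T"],
    "the": ["DH", "AH"],
    "car": ["K", "AA", "R"],
}


def putative_phones(
    transcript: str, lexicon: Dict[str, List[str]]
) -> List[Tuple[str, List[str]]]:
    """Pair each transcribed word with its standard (dictionary) pronunciation."""
    result = []
    for token in transcript.lower().split():
        word = token.strip(".,?")  # drop the punctuation the transcripts use
        # Words missing from the lexicon get an empty phone list.
        result.append((word, lexicon.get(word, [])))
    return result


standard = putative_phones("He left the car.", LEXICON)
flat = [phone for _, phones in standard for phone in phones]
print(flat)
```

The output is what the speaker would have produced if every word were pronounced as the lexicon lists it; annotators are then directed to the places where a dialect feature could make the actual audio differ.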


Requirements

- Requires natural speech from speakers of target dialects
  - Initial focus on distinguishing African American Vernacular English (AAVE) from all other dialects of American English (non-AAVE)
  - plan to investigate other American dialects later
- Selected data collected to minimize the effect of observation
  - recordings of subjects engaged in conversations
- Project requires subjects categorized according to the dialect spoken.
- Since goal is to establish typicality of features by dialect, categorization based on something other than features themselves
  - relied on self-reported metadata
  - AAVE
    - native speakers of American English
    - born and raised in the United States
    - ethnically African American
  - Non-AAVE
    - American English speakers of other ethnicities
- Remove subjects from either pool who appear later mis-categorized.
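
The provisional pool assignment described above might be expressed as a simple filter over self-reported metadata. The sketch below is an assumption about how such a filter could look; the `Subject` fields, the label strings and `assign_pool` are hypothetical names, not the project's actual database schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Subject:
    # All fields are self-reported; the names are illustrative only.
    native_language: str
    born_in_us: bool
    raised_in_us: bool
    ethnicity: str


def assign_pool(s: Subject) -> Optional[str]:
    """Return a provisional pool label, or None if the subject fits neither pool."""
    if s.native_language != "American English":
        return None
    if s.ethnicity == "African American":
        # The AAVE pool additionally requires being born and raised in the US.
        return "AAVE" if (s.born_in_us and s.raised_in_us) else None
    return "non-AAVE"


pool = assign_pool(Subject("American English", True, True, "African American"))
print(pool)
```

The label is only a starting point: as the slide notes, subjects who later appear mis-categorized are removed from their pool.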


Data Selection

- Mixer Corpora
  - conversational telephone speech (CTS), from LDC; supports robust speaker recognition (SR) development
  - subjects provided age, sex, occupation, cities born/raised, ethnicity
  - subjects completed >=10 six-minute calls speaking to other subjects, whom they typically did not know, about assigned topics
  - Bilinguals in Arabic, Mandarin, Russian, and Spanish used those languages & English
  - 7% of calls in cross-channel recording room (8+ microphones on one side of call)
  - calls audited for topic and audio quality but not generally transcribed
- Although not designed for the current effort, includes self-reported ethnicity.
- Pool contains speakers of multiple American English dialects who categorized themselves as African American and other ethnicities
- 126 Mixer calls transcribed by Phanotics project
  - 35 include conversations between two speakers of AAVE
  - 91 include conversations between one AAVE and one non-AAVE speaker


Data Selection

- Fisher Corpus
  - collected at LDC to support STT development within DARPA EARS
  - subjects provided age, sex, native language, and the cities where they were born and raised
  - subjects completed 1-25 10-minute calls, speaking to other participants, whom they typically did not know, about assigned topics
  - calls audited for topic and quality
  - verbatim, time-aligned orthographic transcripts were produced
  - lacks crucial information on the ethnicity of the speaker
    - but some subjects were LDC employees, their family, friends, and colleagues
    - small number (171) could be assigned to an ethnic category after the fact
- StoryCorps® Griot Initiative
  - funded by Corporation for Public Broadcasting in US
  - one-year effort to record one-hour interviews of African Americans
  - nine recording locations open for up to six weeks each
  - subjects interview friends and family on topics of their choice
  - potential users receive instructions on conducting good interviews; trained facilitator present
  - participants receive a free copy of their interview; other copies are archived and distributed
  - StoryCorps provides Phanotics selected interviews in exchange for transcripts
- Sociolinguistic Interviews
  - recorded and contributed by researchers working in the United States
  - variable quality
  - being reviewed for potential use


Transcription

- Most audio lacked transcripts; LDC designed spec for this project.
  - similar to Fisher Quick Transcription specification
  - emphasizes speed and accuracy
  - annotators segment speech at sentence level
  - sentences further segmented if >8 seconds; >0.5 seconds internal silence
  - segments overlap; audio containing no speech left un-segmented
  - standard orthography, case, punctuation (period, question mark, comma)
  - "--" marks incomplete sentences and restarts; "-" marks incomplete words
  - proper names, acronyms, letter strings capitalized
  - uttered numbers written as words, not as strings of digits
  - limited set of standard contractions are used; non-standard contractions (‘cause for because) written as the full word
  - obviously mispronounced, idiosyncratic words tagged with ‘+’
  - no other attempt made to mark dialectal pronunciation
    - accomplished in annotation phase
  - limited set of non-lexemes (um, uh) used in filled pauses
  - speech errors transcribed as produced
    - limited time to transcribe disfluencies since these will be rejected
  - background noises not marked; limited set of markers for speaker noises
  - transcribers indicate low confidence with double parentheses (( )).
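
A few of these conventions lend themselves to automatic checking. The sketch below flags segments longer than 8 seconds and digit strings, and extracts low-confidence and mispronunciation marks. It is a hypothetical helper, not part of the LDC specification, and it assumes that ‘+’ is written as a prefix on the tagged word, which the slide does not state.

```python
import re
from typing import Dict, List, Union


def check_segment(
    text: str, start: float, end: float
) -> Dict[str, Union[bool, List[str]]]:
    """Report a few spec-related properties of one transcript segment."""
    return {
        # Segments over 8 seconds are candidates for further segmentation.
        "too_long": (end - start) > 8.0,
        # Uttered numbers should be spelled out, so any digit is suspect.
        "has_digits": bool(re.search(r"\d", text)),
        # Text inside (( )) was marked low-confidence by the transcriber.
        "low_confidence": re.findall(r"\(\((.*?)\)\)", text),
        # Assumption: '+' precedes a mispronounced or idiosyncratic word.
        "flagged_words": re.findall(r"\+(\w+)", text),
    }


report = check_segment("I saw ((two)) +nucular plants in 1999", 0.0, 9.5)
print(report)
```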


Feature Annotation

- Goal: identify features that distinguish dialect from standard
- features described as rules that change standard into non-standard
- rules apply variably according to internal and external constraints
  - lexical identity, morphology of affected word, position within sentence, phonological environment, functional effect of change (for example whether it neutralizes a distinction between two words), the age, sex, socioeconomic class of speakers, dialects they speak
- Examples
  - reduction of consonant clusters in final position
    - left => lef’, missed => miss
  - deletion of r, l, w
    - car => ca’, palm => pa’m, young ones => young ‘uns
  - change of the voiced and voiceless interdental fricatives into stops
    - bother => boda’
- Data preparation, customized tools simplify the annotation process
- Rules specified as a => b/x_y
  - a becomes b when preceded by x and followed by y
  - input+environment, “xay”, constitute search term
  - input+output, a=>b, constitute a question to be answered by human
    - Did the subject say xay or xby?
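
The rule notation a => b/x_y maps naturally onto a small data structure: the triple x, a, y is the search term, and the pair of alternatives xay versus xby is the question put to the annotator. The following is a minimal sketch under stated assumptions: the `Rule` class, the `find_rois` function, the ARPAbet-like symbols and the use of "#" for a word boundary are all illustrative choices, not the project's actual tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Rule:
    a: str  # input: the standard segment
    b: str  # output: the non-standard segment ("" means deletion)
    x: str  # preceding context
    y: str  # following context ("#" = word boundary, an assumption here)

    def search_term(self) -> List[str]:
        """Input plus environment, i.e. the sequence x a y."""
        return [self.x, self.a, self.y]

    def alternative(self) -> List[str]:
        """Sequence x b y; an empty b simply drops out."""
        return [p for p in (self.x, self.b, self.y) if p]

    def question(self) -> str:
        said = " ".join(self.search_term())
        alt = " ".join(self.alternative())
        return f"Did the subject say [{said}] or [{alt}]?"


def find_rois(phones: List[str], rule: Rule) -> List[int]:
    """Return indices of `a` wherever x a y occurs in the putative phones."""
    term = rule.search_term()
    return [i + 1 for i in range(len(phones) - 2) if phones[i:i + 3] == term]


# Final cluster reduction as in left => lef': T is deleted after F word-finally.
rule = Rule(a="T", b="", x="F", y="#")
phones = ["HH", "IY", "#", "L", "EH", "F", "T", "#", "DH", "AH", "#"]
rois = find_rois(phones, rule)
print(rois, rule.question())
```

Searching the putative transcription rather than the audio is what keeps the task tractable: the machine proposes every place the rule could have applied, and the human only answers the yes/no style question for each one.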


Feature Annotation

- SPAAT (Super Phonetic Annotation and Analysis Tool) designed for rapid annotation and analysis
  - for each feature, presents list of regions of interest (ROI) where rule may have applied
  - since transcript & audio previously forced-aligned, annotator can listen to the audio with small amount of preceding and following context
- Annotator’s job is to decide whether or not the rule has applied.
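
Because each ROI carries time stamps from forced alignment, the playback span can be computed by padding the ROI on both sides. The sketch below illustrates the idea; the function name and the 0.5-second default are assumptions, since the slides say only that a small amount of context is played.

```python
from typing import Optional, Tuple


def playback_window(
    roi_start: float,
    roi_end: float,
    context: float = 0.5,  # assumed padding in seconds
    audio_duration: Optional[float] = None,
) -> Tuple[float, float]:
    """Pad an ROI with context on both sides, clamped to the recording."""
    start = max(0.0, roi_start - context)
    end = roi_end + context
    if audio_duration is not None:
        end = min(end, audio_duration)
    return start, end


window = playback_window(12.0, 12.25)
print(window)
```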


Initial Results

- average time to annotate an ROI ranges 15-25
- Approach to measuring inter-annotator agreement distinguishes
  - initial agreement, measured at beginning of effort to assess the difficulty of a task
  - from measures repeated after thorough documentation created, annotators undergone rigorous training, testing and selection
- Initial inter-annotator agreement varies by rule, rule type, annotator and annotator training
  - absolute average initial agreement across five annotators, all rules was 74.49% on three-way decision where a feature is annotated as present, intermediate or absent
  - converted to two-way decision (feature is present versus intermediate + absent) initial agreement climbs to 85.54%
  - Pairwise agreement by chance in three-way and two-way decisions is, respectively, 11.1% and 25%
  - initial two-way agreement rates were 83.81% for rules involving substitutions and 91.95% for rules involving reductions and insertions.
- Team now working to increase IAA
  - expanding training program, documentation to include audio examples
  - decision: form is standard, non-standard, intermediate, unrelated to rule, indeterminate, ROI is mistaken
  - creating a small gold standard
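
One plausible way to compute figures of this kind is mean pairwise agreement: for every pair of annotators, take the fraction of ROIs on which they chose the same label, then average over pairs. The sketch below assumes that reading (the slides do not define "absolute average" agreement) and uses invented toy labels. It also shows the arithmetic behind the chance figures: 11.1% and 25% equal (1/3)^2 and (1/2)^2, the probability that two annotators choosing uniformly at random both pick one particular label, whereas under the same uniform assumption the probability that they agree on any label is 1/3 and 1/2.

```python
from itertools import combinations
from typing import Dict, List


def pairwise_agreement(labels: Dict[str, List[str]]) -> float:
    """Mean, over annotator pairs, of the fraction of items labeled alike."""
    rates = []
    for a, b in combinations(sorted(labels), 2):
        same = sum(x == y for x, y in zip(labels[a], labels[b]))
        rates.append(same / len(labels[a]))
    return sum(rates) / len(rates)


def collapse(labels: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Three-way to two-way: present versus intermediate + absent."""
    return {
        name: ["present" if v == "present" else "other" for v in values]
        for name, values in labels.items()
    }


# Toy data for illustration only; not project results.
toy = {
    "ann1": ["present", "intermediate", "absent", "present"],
    "ann2": ["present", "absent", "absent", "present"],
    "ann3": ["present", "intermediate", "absent", "absent"],
}
three_way = pairwise_agreement(toy)
two_way = pairwise_agreement(collapse(toy))

# Chance baselines for k equally likely labels.
both_pick_one_label = {k: (1 / k) ** 2 for k in (3, 2)}
agree_on_any_label = {k: 1 / k for k in (3, 2)}
print(three_way, two_way, both_pick_one_label, agree_on_any_label)
```

As in the reported results, collapsing intermediate and absent into one category can only raise or preserve agreement, since every pair that agreed on the three-way label still agrees on the two-way one.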


Summary

- Project connects sociolinguistics and HLT
- Seeks to determine typicality of high level features in distinguishing dialect for forensic purposes
- Focuses initially on AAVE; later on other dialects of American English
- Uses existing audio from CTS and interviews
- Creates transcripts, audio-transcript time-alignments
- Combination of these with SPAAT speeds annotation
- Initial inter-annotator agreement encouraging
- Modifications of spec, training, tool expected to increase IAA
- Fisher audio and transcripts already available in LDC’s Catalog
  - LDC2005S13 Fisher English Training Part 2, Speech
  - LDC2005T19 Fisher English Training Part 2, Transcripts
  - LDC2004S13 Fisher English Training Speech Part 1 Speech
  - LDC2004T19 Fisher English Training Speech Part 1 Transcripts
- Mixer audio in queue
- Story Corps Griot and Sociolinguistic Interviews under negotiation
- To be distributed after use in the program
  - Mixer Transcripts
  - Annotations
  - possibly SPAAT