Focus Contrast in Web Harvested Data

  • Focus Contrast in Web Harvested Data
    Mats Rooth
    Linguistics and CIS
    Cornell University
    based on joint research with Jonathan Howell

  • Radio sites
    Hundreds use Everyzing/Ramp technology
    Full ASR transcripts often available
    Time offset sometimes available
    Either URL of audio or RSS feed almost always available
    Not enough hits for one target on a single site
    A lot of repetitions of same audio
    Seemingly less spontaneous speech than on Everyzing

  • YouTube
    Searchable closed captions, some obtained with ASR and some provided by video author
    Time offset available on hit page and in URL
    YouTube player can seek to a time
    Transcript of snippet available
    Full transcript not available
    Not enough data now
    Can hope that a lot of indexed spontaneous speech will become available

  • Reuters Insider
    Searchable audio based on Everyzing/Ramp
    Full transcripts available
    Player seeks to timestamp

  • Goals
    Assemble large, focused datasets of examples where intonation varies in a way that correlates with syntax, semantics, or pragmatics.
    Study correlation between lexical/grammatical/pragmatic context and acoustic realization.

  • he stayed longer than I did
    -er [[ he stayed x long ]_2 than [ I_F stayed x long ] ~2 ]

    [ y stayed x-long ]  antecedent clause
    [ speaker stayed x-long ]  scope of focus

  • I should have liked that song a lot more than I did.
    [more x [[should w [ I like that song x well in w ]] than [ I like that song x well in w_0 ]]]

  • I understand even less than I did before
    even less [[ I prs understand x much ]_2 than [ I understood x much before_F ] ~2 ]

  • Alternative semantics for focus
    -er [[ he stayed x long ]_2 than [ I_F stayed x long ] ~2 ]
    [ y stayed x-long ]  antecedent clause
    [ speaker stayed x-long ]  scope of focus
    Semantics of focus is the set of alternative propositions of the form y stayed x long.
    Licensing condition for focus: the proposition contributed by the antecedent is an element of the alternative set that is distinct from the proposition contributed by the scope.

  • Givenness/Entailment semantics for focus
    [ y stayed x-long ]  antecedent clause
    [ speaker stayed x-long ]  scope of focus
    Licensing condition for focus: the antecedent entails the union of the alternative set (focus existential closure).
    If he stayed d long, then someone stayed d long.
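The two licensing conditions can be compared side by side in a toy model. The sketch below is only an illustration under strong simplifying assumptions: propositions are reduced to (subject, predicate) pairs, focus is on the subject, and entailment of the existential closure is approximated by predicate identity. The names `Proposition`, `alternatives`, `licensed_by_alternatives`, and `licensed_by_givenness` are hypothetical and do not come from the slides.

```python
from typing import NamedTuple, Set


class Proposition(NamedTuple):
    """Toy proposition: a subject plus an unanalysed predicate string."""
    subject: str
    predicate: str


def alternatives(scope: Proposition, domain: Set[str]) -> Set[Proposition]:
    """Alternative set for subject focus: vary the subject, keep the predicate."""
    return {Proposition(y, scope.predicate) for y in domain}


def licensed_by_alternatives(antecedent: Proposition, scope: Proposition,
                             domain: Set[str]) -> bool:
    """Antecedent must be an alternative that is distinct from the scope."""
    return antecedent in alternatives(scope, domain) and antecedent != scope


def licensed_by_givenness(antecedent: Proposition, scope: Proposition) -> bool:
    """Antecedent must entail 'someone P'; in this toy model, 'y P' does so
    exactly when its predicate matches the predicate of the scope."""
    return antecedent.predicate == scope.predicate


# Hypothetical domain of individuals for the example "he stayed longer than I did".
domain = {"he", "speaker"}
antecedent = Proposition("he", "stayed x long")
scope = Proposition("speaker", "stayed x long")

print(licensed_by_alternatives(antecedent, scope, domain))
print(licensed_by_givenness(antecedent, scope))
```

In this toy setting the two conditions differ when antecedent and scope coincide: the givenness check still succeeds, while the alternative-semantics check fails because it requires distinctness.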

  • Alternative semantics and givenness semantics are predictive theories of focus licensing, if the antecedent is stipulated.
    Almost always, the antecedent for focus in the than-clause is the main clause.
    With that hedge, grammar makes a prediction about where focus should go.
    Try to correlate this with acoustic signal.

  • Focus in comparative clauses
    Coherent semantic theory about where focus should go
    Possibilities are constrained, because the main clause is usually the antecedent for focus interpretation in the comparative clause
    On a theoretical basis, we often think we know the correct grammatical analysis of comparative sentences people use, including the features that determine focus
    Nice model system for studying contextual conditioning and phonetic realization of contrastive intonation

  • Automatic harvest procedure
    Replicates how a user would interact with the website.

  • curl: retrieve information designated by URL
    cutmp3: cut audio file given offsets
    awk: process html
    awk, bash, make: control

    Time for one run retrieving 1000 hits is less than a day.
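The tool chain above can be sketched as a small command builder. This is a minimal illustration, not the authors' scripts (which used awk, bash, and make): the URL, file names, and the window of 5 seconds before and 10 seconds after the hit are hypothetical, and the cutmp3 flags (`-i`, `-a`, `-b`, `-O`) and its `m:ss` time format are assumptions that should be checked against the cutmp3 manual. The commands are only printed, not executed.

```python
import shlex


def to_min_sec(seconds: float) -> str:
    """Format a time in seconds as m:ss (time format assumed for cutmp3)."""
    whole = int(seconds)
    return f"{whole // 60}:{whole % 60:02d}"


def harvest_commands(audio_url: str, hit_offset: float,
                     before: float = 5.0, after: float = 10.0,
                     stem: str = "hit0001"):
    """Build the fetch and cut commands for one search hit.

    hit_offset is the time (in seconds) at which the target phrase occurs,
    as reported by the search site; before/after pad the clip around it.
    """
    start = max(0.0, hit_offset - before)
    end = hit_offset + after
    full = f"{stem}_full.mp3"
    clip = f"{stem}_clip.mp3"
    fetch = ["curl", "-L", "-o", full, audio_url]
    # Assumed cutmp3 usage: -i input, -a start, -b end, -O output file.
    cut = ["cutmp3", "-i", full, "-a", to_min_sec(start),
           "-b", to_min_sec(end), "-O", clip]
    return [fetch, cut]


# Hypothetical hit: target phrase found 125 seconds into the audio.
commands = harvest_commands("http://example.org/show.mp3", hit_offset=125.0)
for command in commands:
    print(shlex.join(command))
```

Building argument lists rather than shell strings keeps URLs with special characters safe if the commands are later passed to `subprocess.run`.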

  • 116  a1135.g.akamai.net
    110  hosted-media.podzinger.com
     76  media.weei.podzinger.com
     58  feeds.wnyc.org
     54  media.libsyn.com
     51  podcastdownload.npr.org
     50  feeds.feedburner.com
     39  library.kraftsportsgroup.com
     33  www.whiterosesociety.org
     24  www.kpbs.org
     21  www.podtrac.com
     21  media.wrko.podzinger.com
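A tally of this shape can be produced from a list of harvested audio URLs. The sketch below assumes that the counts are numbers of audio URLs per host, which the slide does not state; the URL paths are made up, and only the host names are taken from the list above.

```python
from collections import Counter
from urllib.parse import urlparse


def count_hosts(urls):
    """Count how many URLs point at each host."""
    return Counter(urlparse(url).netloc for url in urls)


# Hypothetical harvested URLs (paths invented for illustration).
urls = [
    "http://feeds.wnyc.org/show/episode1.mp3",
    "http://feeds.wnyc.org/show/episode2.mp3",
    "http://media.libsyn.com/media/podcast/ep7.mp3",
    "http://www.kpbs.org/audio/2009/segment.mp3",
]

host_counts = count_hosts(urls)
for host, n in host_counts.most_common():
    print(f"{n:4d}  {host}")
```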

  • Jonathan Howell

    Retrieval efficacy by pipeline stage: number of queries (normalized to 100), with raw counts for "he himself"

    Stage                        he himself (raw)  he himself  his own  for one thing  the one thing
    queried                      300               100         100      100            100
    individual hit files         285               95          100      100            98
    mp3s retrieved               285               95          97       100            97
    mp3s readable                225               75          76       73             74
    time offset file non-empty   203               67.7        69       68             67
    mp3s correctly cut           185               61.7        63       61             63
    unique short mp3s            181               60.3        53       60             63
    mp3s accurately transcribed  160               53.3        43       51             57

    Retrieval Efficacy for "he himself" (number of mp3 files): Queried 300, Retrieved 284, Cut 200, Usable 154
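The normalized "he himself" figures follow from the raw counts by dividing each stage by the number of queries. A minimal sketch of that arithmetic, using the raw counts from the data sheet (the variable and function names are illustrative):

```python
STAGES = [
    "queried",
    "individual hit files",
    "mp3s retrieved",
    "mp3s readable",
    "time offset file non-empty",
    "mp3s correctly cut",
    "unique short mp3s",
    "mp3s accurately transcribed",
]

# Raw counts for "he himself", one per stage.
HE_HIMSELF_COUNTS = [300, 285, 285, 225, 203, 185, 181, 160]


def normalize(counts):
    """Express each stage as a percentage of the number of queries."""
    queried = counts[0]
    return [100.0 * count / queried for count in counts]


percentages = normalize(HE_HIMSELF_COUNTS)
for stage, pct in zip(STAGES, percentages):
    print(f"{stage:<28} {pct:6.2f}")
```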

  • [Chart: number of tokens per target phrase (than he did, he himself, his own, for one thing, the one thing) in Switchboard, Everyzing (collected/verified), and Everyzing (projected).]

  • Classification experiment
    He stayed longer than I_F did. (s class)
      antecedent: He stayed x long
    I should have liked that song a lot more than I did_F. (ns class)
      antecedent: I should have liked that song x much
    I understand even less than I did before_F. (ns class)
      antecedent: I understand even x little

  • SVM classifier in R statistical environment (e1071 package)
    308 acoustic parameters extracted with Praat
    91 tokens in cross-validated design

    (Several hundred more tokens with similar results.)

  • Feature sets compared:
    all parameters
    duration of I only
    duration of I, duration of d closure, formant difference 40% into I
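To make the "duration of I only" condition concrete, the sketch below runs a cross-validated classification on a single duration feature. It is deliberately not the SVM from the experiment: a simple midpoint-threshold classifier stands in for it, the cross-validation scheme is leave-one-out (the slides do not say which scheme was used), and the durations are invented for illustration, not measured data.

```python
from statistics import mean


def fit_threshold(durations, labels):
    """Place the decision threshold midway between the two class means."""
    s_mean = mean(d for d, y in zip(durations, labels) if y == "s")
    ns_mean = mean(d for d, y in zip(durations, labels) if y == "ns")
    return (s_mean + ns_mean) / 2


def predict(threshold, duration):
    """Assume focused-subject (s) tokens have the longer vowel."""
    return "s" if duration > threshold else "ns"


def leave_one_out_accuracy(durations, labels):
    """Train on all tokens but one, test on the held-out token, repeat."""
    correct = 0
    for i in range(len(durations)):
        train_d = durations[:i] + durations[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        threshold = fit_threshold(train_d, train_y)
        correct += predict(threshold, durations[i]) == labels[i]
    return correct / len(durations)


# Hypothetical durations of "I" in seconds (made up for illustration).
durations = [0.21, 0.19, 0.24, 0.18, 0.22, 0.09, 0.11, 0.08, 0.12, 0.10]
labels = ["s"] * 5 + ["ns"] * 5

accuracy = leave_one_out_accuracy(durations, labels)
print(f"leave-one-out accuracy: {accuracy:.2f}")
```

With real tokens the classes overlap, so the accuracy of such a one-feature baseline gives a floor against which a classifier over all acoustic parameters can be compared.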

  • Method suggested by comparatives experiment
    Find common grammatical or lexical contexts that trigger representations with different prosodic realization, according to relatively well-understood and well-supported theory.
    Correlate the semantic-grammatical categories directly with the speech signal using machine learning.
    Don't worry about phonemic/morphemic categories like the accent types H* and L+H*, or assume they can be annotated on the basis of pitch contour.

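  • Féry and Ishihara (2009) Journal of Linguistics 45.3
    SOF (second occurrence focus): Prenuclear
    Die meisten unserer Kollegen waren beim Betriebsausflug lässig angezogen. Nur Peter hat eine Krawatte getragen.
    (Most of our colleagues were casually dressed at the company outing. Only Peter wore a tie.)
    Nur Peter hat sogar einen Anzug getragen.
    (Only Peter even wore a suit.)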

  • He's gotta pick someone who is younger than he is, and is definitely more conservative than he is.
    [ -er [ t is d young than he is d young ] ]_2 and more [ [ t is d conservative_F ]_3 than [ he_F is d conservative ] ~3 ] ~2

  • + Generic corpus of focused pronouns
    The SVM classifier is good at detecting focused pronouns using local features on the pronoun:

    Duration of vowel I [ai]
    Distance between f1 and f2 halfway into vowel I [ai]

  • Inherently contrastive phrases
    in MY opinion ... admits that other things might be true in other people's opinions
    NEXT Friday ... at end of weekly Friday radio program
    on the TENOR saxophone ... in jazz program where there is frequent mention also of the alto saxophone

  • 1162  of> my life
    1110  in> my life
     681  in> my mind
     377  in> my opinion
     276  in> my view
     231  in> my heart
     217  of> my career
     199  in> my career
     183  in> my head
     146  with> my life
     146  with> my family
     141  on> my way
     140  of> my mind
     139  on> my part
     134  in> my lifetime
     125  in> my office
     115  of> my family
     108  with> my wife
     106  on> my face
     106  in> my house
      99  on> my mind
      96  over> my head
      96  in> my family
      91  for> my family
      90  in> my face
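Counts of this kind can be gathered by tallying the word before and the word after each occurrence of "my" in transcript text. The sketch below is one way to do that; the tokenization, the sample transcript, and the function name `my_contexts` are illustrative assumptions, and the output format merely imitates the list above.

```python
import re
from collections import Counter


def my_contexts(text):
    """Count (previous word, next word) pairs around each token 'my'."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for prev, word, nxt in zip(tokens, tokens[1:], tokens[2:]):
        if word == "my":
            counts[(prev, nxt)] += 1
    return counts


# Hypothetical transcript snippet.
transcript = (
    "In my opinion it was the best day of my life. "
    "Well, in my opinion, no."
)

context_counts = my_contexts(transcript)
for (prev, nxt), n in context_counts.most_common():
    print(f"{n:5d}  {prev}> my {nxt}")
```

This simple tokenizer ignores sentence boundaries, so a trigram can span two sentences; a real harvest would probably want to split on punctuation first.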

  • + Does general SVM pronoun focus classifier work on SOF tokens?

    + How common is SOF?

  • [you made a very small amount more than I did]_2 [now_F I make much_F more than you_F do] ~2
    2 is of the form
    required form of antecedent: at t speaker makes d-much more than hearer makes
    actual: at t hearer makes d-much more than speaker makes

  • two SOF tokens

    You made a very small amount more than I did. Now I make much_F more than you_F do.

  • Is there a correlation between the string context and prosody type?
    + Learn information-theoretically
    -- two distributions of acoustic pronoun realizations
    -- two distributions of trigram contexts that condition them

  • P(in, a, opinion) =def

    P(type 1) P(in, opinion | type 1) P(a | type 1) + P(type 2) P(in, opinion | type 2) P(a | type 2)

    where a is the acoustic realization of the pronoun between in and opinion.
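A two-type mixture of this form can be evaluated directly once its component distributions are given. The sketch below is a toy illustration only: it assumes the acoustic realization has been discretized into two categories ("long" and "short"), and every probability in it is invented rather than estimated from data. It computes the joint probability of a context and a realization, and the posterior over prosody types that follows from it.

```python
def joint(context, acoustic, prior, p_context, p_acoustic):
    """P(context, acoustic) as a sum over the latent prosody types."""
    return sum(
        prior[t] * p_context[t].get(context, 0.0) * p_acoustic[t].get(acoustic, 0.0)
        for t in prior
    )


def posterior(context, acoustic, prior, p_context, p_acoustic):
    """P(type | context, acoustic) by Bayes' rule."""
    total = joint(context, acoustic, prior, p_context, p_acoustic)
    return {
        t: prior[t] * p_context[t].get(context, 0.0)
        * p_acoustic[t].get(acoustic, 0.0) / total
        for t in prior
    }


# Hypothetical parameters: type 1 = focused pronoun, type 2 = unfocused.
prior = {1: 0.3, 2: 0.7}
p_context = {
    1: {("in", "opinion"): 0.4, ("of", "life"): 0.6},
    2: {("in", "opinion"): 0.1, ("of", "life"): 0.9},
}
p_acoustic = {
    1: {"long": 0.8, "short": 0.2},
    2: {"long": 0.25, "short": 0.75},
}

p = joint(("in", "opinion"), "long", prior, p_context, p_acoustic)
post = posterior(("in", "opinion"), "long", prior, p_context, p_acoustic)
print(f"P(in, long, opinion) = {p:.4f}")
print(f"P(type 1 | in, long, opinion) = {post[1]:.3f}")
```

In practice the type labels are not observed, so the parameters would have to be learned from unlabeled tokens, for example with expectation-maximization; that step is not shown here.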