Learning Within-Sentence Semantic Coherence
Elena Eneva
Rose Hoberman
Lucian Lita
Carnegie Mellon University
Semantic (in)Coherence
Trigram: content words unrelated
Effect on speech recognition:
– Actual Utterance: “THE BIRD FLU HAS AFFECTED CHICKENS FOR YEARS BUT ONLY RECENTLY BEGAN MAKING HUMANS SICK”
– Top Hypothesis: “THE BIRD FLU HAS AFFECTED SECONDS FOR YEARS BUT ONLY RECENTLY BEGAN MAKING HUMAN SAID”
Our goal: model semantic coherence
A Whole Sentence Exponential Model [Rosenfeld 1997]
P0(s) is an arbitrary initial model (typically N-gram)
fi(s)’s are arbitrary computable properties of s (aka features)
Z is a universal normalizing constant
$$\Pr(s) \;\overset{\text{def}}{=}\; \frac{1}{Z}\, P_0(s)\, \exp\!\Big(\sum_i \lambda_i f_i(s)\Big)$$
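As a minimal illustration, here is a Python sketch of how a sentence would be scored under this model; the `baseline_logprob`, `features`, and `lambdas` arguments are hypothetical stand-ins for trained components, and Z is dropped since it is shared by all sentences:

```python
def unnormalized_logscore(sentence, baseline_logprob, features, lambdas):
    """Log of the unnormalized whole-sentence probability:
    log P0(s) + sum_i lambda_i * f_i(s).
    The universal constant Z is the same for every sentence, so it can
    be ignored when ranking recognition hypotheses against each other."""
    log_p0 = baseline_logprob(sentence)  # e.g. from the trigram baseline
    return log_p0 + sum(lam * f(sentence) for lam, f in zip(lambdas, features))
```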
A Methodology for Feature Induction
Given corpus T of training sentences:
1. Train best-possible baseline model, P0(s)
2. Use P0(s) to generate corpus T0 of “pseudo sentences”
3. Pose a challenge: find (computable) differences that allow discrimination between T and T0
4. Encode the differences as features fi(s)
5. Train a new model:
$$P_1(s) \;=\; \frac{1}{Z}\, P_0(s)\, \exp\!\Big(\sum_i \lambda_i f_i(s)\Big)$$
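A minimal Python sketch of one induction round, assuming hypothetical callables for each component named in the steps above:

```python
def feature_induction_round(T, train_baseline, sample_corpus,
                            induce_features, fit_weights):
    """One round of the feature-induction methodology (steps 1-5);
    every argument except the training corpus T is a hypothetical
    callable standing in for a component described on the slide."""
    P0 = train_baseline(T)               # 1. best-possible baseline model P0(s)
    T0 = sample_corpus(P0, size=len(T))  # 2. generate "pseudo sentence" corpus T0
    fs = induce_features(T, T0)          # 3.-4. find and encode discriminating features f_i(s)
    lambdas = fit_weights(P0, fs, T)     # 5. train the new model P1(s)
    return P0, fs, lambdas
```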
Discrimination Task:
1. - - - feel - - sacrifice - - sense - - - - - - - - -meant - - - - - - - - trust - - - - truth
2. - - kind - free trade agreements - - - living - - ziplock bag - - - - - - university japan's daiwa bank stocks step –
Are these content words generated from a trigram or a natural sentence?
Building on Prior Work
Define “content words” (all but the top 50)
Goal: model the distribution of content words in a sentence
Simplify: model pairwise co-occurrences (“content word pairs”)
Collect contingency tables; calculate a measure of association for them
Q Correlation Measure
Q values range from –1 to +1
$$Q = \frac{c_{11}c_{22} - c_{12}c_{21}}{c_{11}c_{22} + c_{12}c_{21}}$$

Derived from the co-occurrence contingency table:

|        | W1 yes | W1 no |
|--------|--------|-------|
| W2 yes | c11    | c21   |
| W2 no  | c12    | c22   |
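A direct Python implementation of the Q measure above; the example counts are made up for illustration:

```python
def yules_q(c11, c12, c21, c22):
    """Yule's Q from the 2x2 co-occurrence contingency table:
    Q = (c11*c22 - c12*c21) / (c11*c22 + c12*c21), ranging
    from -1 (never co-occur) to +1 (always co-occur)."""
    num = c11 * c22 - c12 * c21
    den = c11 * c22 + c12 * c21
    # Q is undefined when both products are 0; treating that as
    # "no association" is an assumption made here.
    return num / den if den else 0.0

# Made-up counts: 40 sentences contain both words, 870 contain neither.
print(yules_q(c11=40, c12=60, c21=30, c22=870))  # ~0.90, strongly associated
```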
Density Estimates
We hypothesized:
– Trigram sentences: wordpair correlation completely determined by distance
– Natural sentences: wordpair correlation independent of distance
Kernel density estimation:
– distribution of Q values in each corpus
– at varying distances
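A minimal sketch of the per-distance density estimation, assuming SciPy's Gaussian KDE (the slides do not specify the kernel) and a hypothetical dict-of-lists input format:

```python
import numpy as np
from scipy.stats import gaussian_kde

def q_densities_by_distance(q_values_by_distance):
    """Fit one kernel density estimate per word-pair distance d over the
    Q values observed at that distance in a corpus. `q_values_by_distance`
    maps d -> list of Q values (a hypothetical input format)."""
    return {d: gaussian_kde(np.asarray(qs))
            for d, qs in q_values_by_distance.items()}

# Fit separately for each corpus; then bnews_kde[8].evaluate([0.76])[0]
# estimates Pr(Q = 0.76 | d = 8, BNews).
```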
Q Distributions
[Figure: density of Q values for broadcast news and trigram-generated (----) sentences, shown at distance 1 and distance 3; x-axis: Q value, y-axis: density.]
Likelihood Ratio Feature
$$L \;=\; \prod_{\text{wordpairs } i,j} \frac{\Pr(Q_{ij} \mid d_{ij}, \text{BNews})}{\Pr(Q_{ij} \mid d_{ij}, \text{Trigram})}$$
she is a country singer searching for fame and fortune in nashville
Q(country, nashville) = 0.76, distance = 8
Pr(Q = 0.76 | d = 8, BNews) = 0.32
Pr(Q = 0.76 | d = 8, Trigram) = 0.11
Likelihood ratio = 0.32 / 0.11 ≈ 2.9
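A sketch of this feature in Python, assuming the per-distance conditional densities are available as callables (e.g. the KDEs sketched earlier); the example reproduces the slide's numbers:

```python
import math

def likelihood_ratio(word_pairs, pr_bnews, pr_trigram):
    """Likelihood-ratio feature: the product over content-word pairs of
    Pr(Q_ij | d_ij, BNews) / Pr(Q_ij | d_ij, Trigram). `word_pairs` is a
    list of (Q, d) tuples; the two probability arguments are hypothetical
    callables. Computed in log space to avoid underflow on sentences
    with many pairs."""
    log_l = sum(math.log(pr_bnews(q, d)) - math.log(pr_trigram(q, d))
                for q, d in word_pairs)
    return math.exp(log_l)

# The slide's worked example: Q(country, nashville) = 0.76 at distance 8.
print(likelihood_ratio([(0.76, 8)],
                       pr_bnews=lambda q, d: 0.32,
                       pr_trigram=lambda q, d: 0.11))  # 0.32/0.11 ~ 2.9
```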
Simpler Features
Q-value based:
– Mean, median, min, max of Q values for content word pairs in the sentence (Cai et al. 2000)
– Percentage of Q values above a threshold
– High/low correlations across large/small distances
Other:
– Word and phrase repetition
– Percentage of stop words
– Longest sequence of consecutive stop/content words
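A minimal sketch of the Q-value-based summary features; the 0.5 threshold is an illustrative assumption, not a value from the slides:

```python
import statistics

def simple_q_features(q_values, threshold=0.5):
    """Summary features over the Q values of a sentence's content-word
    pairs; the threshold is an illustrative choice."""
    return {
        "q_mean": statistics.mean(q_values),
        "q_median": statistics.median(q_values),
        "q_min": min(q_values),
        "q_max": max(q_values),
        "pct_above": sum(q > threshold for q in q_values) / len(q_values),
    }
```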
Datasets
LM and contingency tables (Q values) derived from 103 million words of Broadcast News (BN)
From the remainder of the BN corpus and sentences sampled from the trigram LM:
– Q-value distributions estimated from ~100,000 sentences
– Decision tree trained and tested on ~60,000 sentences
Disregarded sentences with < 7 words, e.g.:
– “Mike Stevens says it’s not real”
– “We’ve been hearing about it”
Experiments
Learners:
– C5.0 decision tree
– Boosting decision stumps with AdaBoost.MH
Methodology:
– 5-fold cross-validation on ~60,000 sentences
– Boosting for 300 rounds
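For a rough sense of the setup: AdaBoost.MH itself is not in scikit-learn, so this sketch substitutes scikit-learn's AdaBoostClassifier over depth-1 trees (decision stumps), with synthetic data standing in for the real coherence features:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: one row of coherence features per sentence,
# label 1 = broadcast news, 0 = trigram-generated.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
y = (X[:, 0] + 0.5 * rng.normal(size=1000) > 0).astype(int)

stumps = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # decision stumps
    n_estimators=300,                               # 300 boosting rounds
)
scores = cross_val_score(stumps, X, y, cv=5)        # 5-fold cross-validation
print(f"{scores.mean():.3f} +/- {scores.std():.3f}")
```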
Results
| Feature Set | Classification Accuracy (%) |
|---|---|
| Q mean, median, min, max (Previous Work) | 73.39 ± 0.36 |
| Likelihood Ratio | 77.76 ± 0.49 |
| All but Likelihood Ratio | 80.37 ± 0.42 |
| All Features | 80.37 ± 0.46 |
| Likelihood Ratio + non-Q | |
Shannon-Style Experiment
50 sentences:
– ½ “real” and ½ trigram-generated
– Stopwords replaced by dashes
30 participants:
– Average accuracy of 73.77% ± 6
– Best individual accuracy of 84%
Our classifier:
– Accuracy of 78.9% ± 0.42
Summary
Introduced a set of statistical features which capture aspects of semantic coherence
Trained a decision tree classifier that reaches 80% accuracy
Next step: incorporate features into exponential LM
Future Work
Combat data sparsity:
– Confidence intervals
– Different correlation statistic
– Stemming or clustering the vocabulary
Evaluate derived features:
– Incorporate into an exponential language model
– Evaluate the model on a practical application
Agreement among Participants
Expected Perplexity Reduction
Semantic coherence feature:
– 78% of broadcast news sentences
– 18% of trigram-generated sentences
Kullback-Leibler divergence: 0.814
Average perplexity reduction per word = 0.0419 (2^0.814/21)
Per sentence? Features modify the probability of the entire sentence; the effect of the feature on per-word probability is small.
Distribution of Likelihood Ratio
[Figure: density of the likelihood-ratio value for broadcast news and trigram-generated (----) sentences; x-axis: likelihood value, y-axis: density.]
Discrimination Task
Natural sentence:
– but it doesn't feel like a sacrifice in a sense that you're really saying this is you know i'm meant to do things the right way and you trust it and tell the truth
Trigram-generated:
– they just kind of free trade agreements which have been living in a ziplock bag that you say that i see university japan's daiwa bank stocks step though
Q Values at Distance 1
[Figure: density of Q values at distance 1 for broadcast news and trigram-generated (----) sentences; x-axis: Q value, y-axis: density.]
Q Values at Distance 3
[Figure: density of Q values at distance 3 for broadcast news and trigram-generated (----) sentences; x-axis: Q value, y-axis: density.]
Outline
The problem of semantic (in)coherence
Incorporating this into the whole-sentence exponential LM
Finding better features for this model using machine learning
Semantic coherence features
Experiments and results