Discriminative Dialog Analysis Using a Massive Collection of BBS Comments

Eiji ARAMAKI (University of Tokyo)
Takeshi ABEKAWA (University of Tokyo)
Yohei MURAKAMI (NICT)
Akiyo NADAMOTO (NICT)

Bulletin Board Systems

• Why BBS? ← In Japan, |BBS| >> |News Wire|, |Wikipedia|
• What sort of text?

ID / name

• please tell me why my nano sometimes stops even battery still remains.
• How about iriver N12? extremely light and small.
• It is because battery display approaches approx. Even battery runs out, display sometimes shows it is still left.
• iriver N series has stopped producing.
• What is the most light or small mp3 player? iPod Shuffle is the best way to do?

Labels: Reply / Not Reply

BUT: NLP suffers from gaps between corresponding comments

• Linking the Reply relations gives: “N12” is a “small and light” “MP3 player”, but now “has stopped producing”

How Often Do Such Gaps Occur? Gap length (distance) & frequency

• No gap (distance = 1) is only 50%
• Otherwise, distance is usually 2 to 5
• Gaps are a common phenomenon

【 QUESTION 】Despite gaps, how does a human being capture REPLY-TO relations?

Linguistics has already given several answers

• One of the answers is relevance theory [Sperber1986]
• Linguist: human communication is based on relevance
• Computer scientist: Not enough! How to calculate relevance?

【 This study’s GOAL 】To formalize relevance

Outline

• Background
• Method
  • Task setting / Our Approach
  • How to formalize two types of relevance
• Experiment
• Related Works
• Conclusion

Task-setting

• Natural task-setting: to which earlier comment (i-1, i-2, i-3, ...) does the i-th comment reply?
  → Complex task
• INSTEAD: our task-setting is a discriminative task
  • Input: two comments in the same BBS (P & Q)
  • Output: True (= Q is a reply to P) or False
  → Suitable for machine learning (such as SVM)
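As a rough illustration of the discriminative setting, a thread can be expanded into labeled (P, Q) pairs. This is a minimal sketch only: the `Comment` record, its `reply_to` field, and the IDs and ordering of the example comments are assumptions made for illustration, not details given in the slides.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Optional


@dataclass
class Comment:
    """Hypothetical comment record; reply_to is the ID of the target comment, if known."""
    cid: int
    text: str
    reply_to: Optional[int] = None


def make_pairs(thread):
    """Yield (P, Q, label) for every pair in which P precedes Q in the thread."""
    # combinations() preserves input order, so the earlier comment is always P.
    for p, q in combinations(thread, 2):
        yield p, q, q.reply_to == p.cid


# Toy thread (IDs and order are made up for this example).
thread = [
    Comment(1, "What is the most light or small mp3 player?"),
    Comment(2, "please tell me why my nano sometimes stops"),
    Comment(3, "How about iriver N12? extremely light and small.", reply_to=1),
]
labeled = [(p.cid, q.cid, label) for p, q, label in make_pairs(thread)]
```

Each labeled pair is one True/False instance, which is the form a binary classifier such as an SVM expects.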

Our Approach/Assumption

• 2 types of relevance are available
  (1) Contents Relevance: roughly speaking, sentence similarity
  (2) Discourse Relevance: discourse or function of comments

Example:
P: What is the most light or small mp3 player?
Q: How about iriver N12? extremely light and small.

Outline

• Background
• Method
  • Task setting / Our Approach
  • How to formalize two types of relevance
    • (1) Contents Relevance
    • (2) Discourse Relevance
• Experiment
• Related Works
• Conclusion

Two Contents Relevance

P: What is the most light or small mp3 player?
Q: How about iriver N12? extremely light and small.

• (1) Word Overlap Ratio = 4/12 = 0.33
  • Simple word overlap ratio cannot capture the link between "mp3 player" and "iriver N12"!!
• (2) WebPMI-based Sentence Similarity
  • WebPMI [Bollegala2007] is defined below
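The slide does not spell out how the 4/12 is counted. The sketch below assumes a Dice-style ratio (twice the number of shared word types over the total number of word types) and a naive English tokenizer; both are assumptions, so on the English example it yields 4/17 rather than the slide's 4/12.

```python
import re


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (a simplification)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_ratio(p, q):
    """Assumed Dice-style overlap: 2 * |shared types| / (|types in P| + |types in Q|)."""
    p_words, q_words = set(tokenize(p)), set(tokenize(q))
    if not p_words or not q_words:
        return 0.0
    return 2 * len(p_words & q_words) / (len(p_words) + len(q_words))


P = "What is the most light or small mp3 player?"
Q = "How about iriver N12? extremely light and small."
shared = set(tokenize(P)) & set(tokenize(Q))  # only "light" and "small" are shared
ratio = overlap_ratio(P, Q)
```

Whatever the exact counting, the shared words are only "light" and "small": the surface overlap never connects "mp3 player" with "iriver N12", which is the weakness the slide points out.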

Web-PMI

Mutual information of two words in WEB pages:

$$\mathrm{WebPMI}(p,q) = \log \frac{H(p \cap q)/N}{\bigl(H(p)/N\bigr)\cdot\bigl(H(q)/N\bigr)}$$

• H(p ∩ q): # of web pages that contain both “N12” & “MP3”
• H(p): # of web pages that contain “N12”
• H(q): # of web pages that contain “MP3”
• N: total # of web pages

Content Relevance: for each word in P, search Q’s word with the highest WebPMI, and sum up their values
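The WebPMI definition and the "best match per word, then sum" rule on this slide can be sketched as follows. The hit counts are passed in as plain dictionaries (no search engine is queried), and the natural logarithm, the zero-count handling, and all function and parameter names are assumptions for illustration.

```python
import math


def pmi(hits_pq, hits_p, hits_q, total):
    """log of (H(p∩q)/N) / ((H(p)/N) * (H(q)/N)); returns 0.0 if any count is zero (assumption)."""
    if hits_pq == 0 or hits_p == 0 or hits_q == 0:
        return 0.0
    return math.log((hits_pq / total) / ((hits_p / total) * (hits_q / total)))


def content_relevance(p_words, q_words, hits, pair_hits, total):
    """For each word in P, take the highest PMI against any word in Q, and sum those maxima."""
    score = 0.0
    for p in p_words:
        score += max(
            (pmi(pair_hits.get(frozenset((p, q)), 0), hits.get(p, 0), hits.get(q, 0), total)
             for q in q_words),
            default=0.0,  # Q has no words
        )
    return score


# Toy page counts, invented purely for the example.
hits = {"mp3": 100, "n12": 10, "and": 900}
pair_hits = {frozenset(("mp3", "n12")): 8, frozenset(("mp3", "and")): 90}
score = content_relevance(["mp3"], ["n12", "and"], hits, pair_hits, total=1000)
```

With these toy counts "mp3" pairs best with "n12", because the two co-occur far more often than chance, while "and" co-occurs with "mp3" at roughly chance level.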

Outline

• Background
• Method
  • Task setting
  • How to formalize two types of relevance
    • (1) Contents Relevance
    • (2) Discourse Relevance
• Experiment
• Related Works
• Conclusion

Discourse Relevance (CPMI; Corresponding PMI ← newly proposed)

• ALSO: a PMI-based measure
• BUT: counts co-occurring phrases in P and Q

$$\mathrm{CPMI}(p,q) = \log \frac{H(p \cap q)/N}{\bigl(H(p)/N\bigr)\cdot\bigl(H(q)/N\bigr)}$$

• H(p ∩ q): # of P-Q pairs that contain “please tell me why” in P and “It is because” in Q
• H(p): # of P that contain “please tell me why”
• H(q): # of Q that contain “It is because”

Example:
P: please tell me why my nano sometimes stops …
Q: It is because battery display approaches …
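The counting behind CPMI can be sketched over a collection of reply pairs. This is an illustrative sketch, not the authors' implementation: it assumes N is the number of P-Q pairs, that phrases are matched as plain substrings, and that a natural logarithm is used; the toy pairs are invented.

```python
import math
from collections import Counter


def cpmi_table(pairs, p_phrases, q_phrases):
    """Return {(p_phrase, q_phrase): CPMI} from a list of (P text, Q text) reply pairs."""
    n = len(pairs)  # assumed normalizer: number of P-Q pairs
    p_count, q_count, joint = Counter(), Counter(), Counter()
    for p_text, q_text in pairs:
        ps = [ph for ph in p_phrases if ph in p_text]
        qs = [ph for ph in q_phrases if ph in q_text]
        p_count.update(ps)
        q_count.update(qs)
        joint.update((a, b) for a in ps for b in qs)
    # Only phrase pairs that actually co-occur get a score.
    return {
        (a, b): math.log((c / n) / ((p_count[a] / n) * (q_count[b] / n)))
        for (a, b), c in joint.items()
    }


# Toy reply pairs, invented for the example.
pairs = [
    ("please tell me why my nano sometimes stops", "It is because battery display approaches"),
    ("please tell me why it is slow", "It is because of the firmware"),
    ("Thank you", "You are welcome"),
    ("How about iriver N12?", "Thank you"),
]
table = cpmi_table(pairs, ["please tell me why"], ["It is because"])
```

Unlike WebPMI, the two phrases are counted on different sides of a reply pair, so the score reflects which kind of utterance tends to answer which, not which words are topically similar.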

Building a collection of P & Q pairs by using lexical patterns

• Sometimes (= 5.1%), we can easily identify a response target by using lexical clues (NAME or COMMENT-ID)
  • Known: 5.1% / Unknown: the rest
  • Example: comment 100 says "It’s my first comment! Nice to meet you." and comment 102 says "100> nice to meet you.."
• Of COURSE: 5.1% is a low ratio
• OUR SOLUTION: we rely on the data scale (17,300,000 comments) → a sufficient amount for PMI calculation
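A pattern-based extractor for the COMMENT-ID clue might look like the sketch below. The regular expression is an assumption modeled on the slide's "100> nice to meet you.." example; the real patterns (including the NAME clue) are not given in the slides, and the function name and input format are hypothetical.

```python
import re

# Assumed pattern: a comment that starts with "<ID>>", as in "100> nice to meet you.."
ID_PATTERN = re.compile(r"^\s*(\d+)\s*>\s*(.*)$", re.DOTALL)


def extract_pairs(comments):
    """comments: {comment_id: text}. Return (P text, Q text) pairs found via the ID pattern."""
    pairs = []
    for cid, text in sorted(comments.items()):
        m = ID_PATTERN.match(text)
        if m is None:
            continue
        target = int(m.group(1))
        # Only link back to an earlier comment that exists in this thread.
        if target in comments and target < cid:
            pairs.append((comments[target], m.group(2)))
    return pairs


comments = {
    100: "It's my first comment! Nice to meet you.",
    102: "100> nice to meet you..",
}
found = extract_pairs(comments)
```

Pairs collected this way need no human annotation, which is what makes it feasible to gather enough of them for the PMI statistics.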

Outline

• Background
• Method
• Experiment
• Related Works
• Conclusion

Experiment 1

• TEST-SET: 140 comment pairs (140 P-Q pairs)
  • Half are positive (extracted by patterns); the other half are random pairs
• TASK: output whether Q is a reply to P or not
• METHODS:
  – Human-A, B, C
  – Overlap: only overlap ratio (IF ratio > Th THEN TRUE ELSE FALSE)
  – WEBPMI: only Contents Relevance
  – CPMI: only Discourse Relevance
    (WEBPMI and CPMI: IF PMI > Th THEN TRUE ELSE FALSE)
  – SVM: features = VALUE: WEBPMI & CPMI; LEXICON: words ∈ P, Q

Result Summary

Method    Accuracy  Precision  Recall  Fβ=1
Human-A   79.2      83.3       75.3    79.1
Human-B   75.7      78.2       73.9    76.0
Human-C   70.7      71.6       72.6    72.1
OVERLAP   61.4      58.7       87.6    70.3
WEBPMI    61.4      72.0       42.4    53.4
CPMI      65.7      66.2       69.8    67.9
SVM       63.8      64.4       79.4    72.1

• Human accuracy (= 70-79%) > automatic methods
• OVERLAP ≒ WEBPMI < SVM < CPMI
  • Discourse is strong
  • SVM: feature design is not suitable?
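For reference, the Fβ=1 column is the harmonic mean of precision and recall. The helper below is a small sketch of the standard F-measure formula (the function name is illustrative); it reproduces the Human-A and OVERLAP rows from their reported precision and recall.

```python
def f_measure(precision, recall, beta=1.0):
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


human_a = round(f_measure(83.3, 75.3), 1)  # 79.1
overlap = round(f_measure(58.7, 87.6), 1)  # 70.3
```

OVERLAP reaches a fairly high F value mainly through its 87.6 recall, while its accuracy stays at 61.4, so the accuracy column is the one the method ranking on this slide follows.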

Kappa Matrix

• Agreement between methods

          Human-B  Human-C  OVERLAP  WEBPMI  CPMI
Human-A   0.56     0.49     0.08     0.20    0.28
Human-B            0.47     0.09     0.21    0.25
Human-C                     0.15     0.05    0.25
OVERLAP                              0.21    0.13
WEBPMI                                       0.16

Agreement levels: Moderate (High) / Slight (Low)

• Human outputs are similar to each other
• WEBPMI & CPMI have low agreement → they succeed on different examples
• This supports our assumption, which decomposes relevance into two: (1) contents & (2) discourse
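Assuming the matrix reports Cohen's kappa (the slides only say "Kappa"), the agreement between two sets of True/False outputs can be computed as in this sketch. The judgment lists are toy data, not the experiment's outputs.

```python
def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of True/False judgments."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    pa, pb = sum(a) / n, sum(b) / n  # rate of True for each judge
    expected = pa * pb + (1 - pa) * (1 - pb)  # agreement expected by chance
    if expected == 1.0:
        return 1.0  # both judges give the same constant answer
    return (observed - expected) / (1 - expected)


# Toy judgments, invented for the example.
judge_a = [True, True, False, False]
judge_b = [True, False, False, False]
kappa = cohen_kappa(judge_a, judge_b)  # 0.5
```

Kappa corrects raw agreement for chance, which matters here because half of the test pairs are positive and two methods could agree often by accident.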

Several examples of phrase pairs that have high CPMI values

PMI    P                 Q
8.43   I’d like to go    Wait for you
8.37   Where is it …     It is in/at
7.62   Please tell me…   I think it is …
7.47   How about…        as soon as possible
7.38   You can …         I try …
7.12   I think …         Thank you
6.93   …, isn’t it ?     Maybe
6.80   Thank you         You’re welcome
6.72   I …               I…too

• ANSWER and THANKING
• Event sequence: P says “go” & Q says “wait”
• These are outside the reach of sentence similarity, motivating discourse clues

Outline

• Background
• Method
• Experiment
• Related Works (if enough time left)
• Conclusion

Related Works (1/2) in Linguistics

• 4 conversational maxims [Grice1975]
• Relevance theory [Sperber1986]
  → How to calculate maxim/relevance? We’ve formalized it!
• Adjacency Pair [Schegloff&Sacks1973]: a sequence of two utterances (such as “offer-acceptance”)
  → In BBSs, adjacency pairs are not adjacent. This motivates our task

Related Works (2/2) in NLP

• Previous (Dialog and Discourse) Studies
  – Such as DAMSL [Core&Allen1997], RST-DT [Carlson2002], Discourse Graph-Bank [Wolf2005]
  – Based on carefully annotated corpora with a rich set of labels/relations
• This Study
  – Only one relation (REPLY-TO relation)
  – BUT: does not require human annotation → large scale → makes it possible to calculate statistical values (PMI)

Outline

• Background
• Method
• Experiment
• Related Works
• Conclusion

Conclusion

• (1) NEW TASK
  – To detect REPLY-TO relations in comments
• (2) Formalization for Relevance
  – To solve the task, we formalized two types of relevance: CONTENTS & DISCOURSE relevance
• (3) Automatic Corpus Building
  – To calculate DISCOURSE relevance, we also proposed pattern-based corpus construction

FINALLY: We believe this study will boost larger-scale dialog studies (using the WEB)