Gao Cong, Long Wang, Chin-Yew Lin, Young-In Song, Yueheng Sun SIGIR’08 Speaker: Yi-Ling Tai

Gao Cong, Long Wang, Chin-Yew Lin, Young-In Song, Yueheng Sun SIGIR’08 Speaker: Yi-Ling Tai Date: 2009/02/09 Finding Question-Answer Pairs from Online Forums


Page 1

Gao Cong, Long Wang, Chin-Yew Lin, Young-In Song, Yueheng Sun

SIGIR’08

Speaker: Yi-Ling Tai

Date: 2009/02/09

Finding Question-Answer Pairs from Online Forums

Page 2

OUTLINE

Introduction
Algorithms for question detection
Algorithms for answer detection
    Preliminary
    Graph-based propagation method

Experiments

Page 3

INTRODUCTION

Online forums contain a huge amount of valuable user generated content.

It is highly desirable if the human knowledge in user generated content can be extracted and reused.

An investigation of 40 forums found that 90% of them contain question-answer knowledge.

This paper focuses on the problem of extracting question-answer pairs from forums.

Page 4

INTRODUCTION

Mining question-answer pairs from forums has the following applications:
    Enrich the knowledge base of CQA (community-based Question Answering) services.
    Improve access to forum content by querying question-answer pairs extracted from forums.
    Augment the knowledge base of chatbots.

Page 5

INTRODUCTION

Question-answer pairs embedded in forums are largely unstructured.

Question detection
    Question marks and 5W1H question words are not adequate for forum data.
    This paper proposes a classification method based on sequential patterns to detect questions.

Page 6

INTRODUCTION

Answer detection
    Multiple questions and answers may be discussed in parallel and are often interwoven.
    Consider each candidate answer as an isolated document and the question as a query.
    Model the relationship between answers to form a graph.
    For each candidate answer, an initial score of being a true answer can be computed using a ranking method.

Page 7

ALGORITHMS FOR QUESTION DETECTION

Question marks and 5W1H words are not adequate:
    Questions can be expressed as imperative sentences, e.g. “I am wondering where I can buy cheap and good clothing in beijing.”
    Question marks are often omitted.
    Short informal expressions should not be regarded as questions, e.g. “really?”

To complement this, this paper extracts labeled sequential patterns to identify question sentences.
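The shortcomings of the simple rule can be seen in a minimal sketch. The function `is_question_baseline` and its exact rule (a sentence ends with a question mark or starts with a 5W1H word) are assumptions made for illustration; the slides do not specify how the baseline is implemented.

```python
# Hypothetical baseline: question mark at the end, or a 5W1H word at the start.
FIVE_W_ONE_H = {"what", "who", "when", "where", "why", "how"}

def is_question_baseline(sentence: str) -> bool:
    text = sentence.strip().lower()
    if not text:
        return False
    # Strip surrounding punctuation from the first token before the lookup.
    first_word = text.split()[0].strip("\"'.,!?")
    return text.endswith("?") or first_word in FIVE_W_ONE_H

examples = [
    "I am wondering where I can buy cheap and good clothing in beijing.",
    "really?",
    "where can I find a job",
]
for sentence in examples:
    print(f"{is_question_baseline(sentence)!s:5}  {sentence}")
```

Under this rule the first sentence is missed (no question mark, no leading question word) while "really?" is accepted, which is the behaviour the slide criticises.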

Page 8

ALGORITHMS FOR QUESTION DETECTION

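Labeled sequential patterns (LSPs)
    Example: “i want to buy an office software and wonder which software company is best.” → “wonder which...is”
    An LSP has the form LHS → c, where LHS is a sequence and c is a class label.
    A sequence $s_1 = \langle a_1, \ldots, a_m \rangle$ is contained in $s_2 = \langle b_1, \ldots, b_n \rangle$ if
        there exist integers $1 \le i_1 < i_2 < \cdots < i_m \le n$ such that $a_1 = b_{i_1}, \ldots, a_m = b_{i_m}$, and
        the distance between two adjacent items $b_{i_j}$ and $b_{i_{j+1}}$ in $s_2$ is less than a threshold (we used 5).
    An LSP p1 is contained by p2 if the sequence p1.LHS is contained by p2.LHS and p1.c = p2.c.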
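The containment test can be sketched as follows. This is an illustrative implementation, not the authors' code: the name `is_contained` and the parameter `max_gap` are assumptions, and the "distance" between adjacent matched items is taken to be the difference of their positions.

```python
from typing import Sequence

def is_contained(pattern: Sequence[str], sequence: Sequence[str], max_gap: int = 5) -> bool:
    """Return True if `pattern` occurs in `sequence` in order, with every pair of
    adjacent matched positions less than `max_gap` apart."""
    if not pattern:
        return True
    # All positions where the first pattern item can be matched.
    positions = {i for i, item in enumerate(sequence) if item == pattern[0]}
    for wanted in pattern[1:]:
        # Keep positions of `wanted` reachable from some earlier match within the gap limit.
        positions = {
            k for k, item in enumerate(sequence)
            if item == wanted and any(0 < k - p < max_gap for p in positions)
        }
        if not positions:
            return False
    return bool(positions)

tokens = "i want to buy an office software and wonder which software company is best".split()
print(is_contained(["wonder", "which", "is"], tokens))  # pattern from the slide
print(is_contained(["wonder", "best"], tokens))         # items are 5 positions apart
```

Tracking the full set of reachable positions, rather than greedily taking the first match, avoids rejecting a pattern whose only valid alignment starts at a later occurrence of an item.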

Page 9

ALGORITHMS FOR QUESTION DETECTION

To mine LSPs, each sentence is first pre-processed with a Part-Of-Speech (POS) tagger (MXPOST Toolkit).
    Keywords, including 5W1H words, modal words, “wonder”, “any”, etc., are kept; the other words are replaced by their POS tags.
    “where can I find a job” → “where can PRP VB DT NN”

The combination of POS tags and keywords allows us to capture representative features for question sentences by mining LSPs.
    “<anyone, VB, how> → Q”; “<what, do, PRP, VB> → Q”
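The pre-processing step can be sketched as below. The input is assumed to be already POS-tagged (the slides use MXPOST, which is not called here), and the `KEYWORDS` set and the function name `abstract_sentence` are hypothetical choices for illustration.

```python
# Assumed keyword list: 5W1H words, a few modal words, plus "wonder" and "any".
KEYWORDS = {
    "what", "who", "when", "where", "why", "how",
    "can", "could", "would", "should", "may", "might", "will", "shall", "must",
    "wonder", "any",
}

def abstract_sentence(tagged_tokens):
    """Keep keywords as words and replace every other token by its POS tag."""
    return [
        word.lower() if word.lower() in KEYWORDS else tag
        for word, tag in tagged_tokens
    ]

# Tags written by hand for the slide's example sentence.
tagged = [("where", "WRB"), ("can", "MD"), ("I", "PRP"),
          ("find", "VB"), ("a", "DT"), ("job", "NN")]
print(" ".join(abstract_sentence(tagged)))
```

The resulting mixed sequences of keywords and tags are what the LSP miner would consume.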

Page 10

ALGORITHMS FOR QUESTION DETECTION

sup(p) is the percentage of tuples in database D that contain the LSP p.

$\mathrm{conf}(p) = \mathrm{sup}(p) / \mathrm{sup}(p.\mathrm{LHS})$

Example: one LSP p1 has support 66.7% and confidence 100%; another LSP p2 has support 66.7% and confidence 66.7%. p1 is therefore a better indication of class Q than p2.

In our experiments, we empirically set minimum support at 0.5% and minimum confidence at 85%.
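Support and confidence can be illustrated with a small sketch. Two things here are assumptions: confidence is taken to be sup(p) / sup(p.LHS), the usual definition for class-labeled rules, and the three-tuple database and the patterns p1 and p2 are hypothetical, chosen only so that the computed values match the percentages on the slide.

```python
def contains(pattern, sequence, max_gap=5):
    """Ordered containment with adjacent matches less than `max_gap` apart."""
    positions = {i for i, x in enumerate(sequence) if x == pattern[0]}
    for wanted in pattern[1:]:
        positions = {k for k, x in enumerate(sequence)
                     if x == wanted and any(0 < k - p < max_gap for p in positions)}
    return bool(positions)

def support(lhs, label, database):
    """Fraction of tuples that contain `lhs` and carry class `label`."""
    hits = sum(1 for seq, cls in database if cls == label and contains(lhs, seq))
    return hits / len(database)

def lhs_support(lhs, database):
    """Fraction of tuples that contain `lhs`, regardless of class."""
    return sum(1 for seq, _ in database if contains(lhs, seq)) / len(database)

def confidence(lhs, label, database):
    denominator = lhs_support(lhs, database)
    return support(lhs, label, database) / denominator if denominator else 0.0

# Hypothetical database of (sequence, class) tuples.
database = [
    (("a", "d", "e", "f"), "Q"),
    (("a", "f", "e", "f"), "Q"),
    (("d", "a", "f"), "NQ"),
]
p1 = ("a", "e", "f")  # hypothetical LSP p1: <a, e, f> -> Q
p2 = ("a", "f")       # hypothetical LSP p2: <a, f> -> Q
for name, lhs in (("p1", p1), ("p2", p2)):
    print(name, f"{support(lhs, 'Q', database):.1%}", f"{confidence(lhs, 'Q', database):.1%}")
```

Both patterns cover two of the three tuples with class Q, but the left-hand side of p2 also matches the NQ tuple, which lowers its confidence.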

Page 11

ALGORITHMS FOR ANSWER DETECTION

Input: a forum thread with the questions annotated.
Output: a list of ranked candidate answers for each question.

For each question, we assume its set of candidate answers to be the paragraphs in the posts following the question.
    Paragraphs are usually good answer segments in forums.
    The answers to a question usually appear in the posts after the post containing the question.
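Candidate generation can be sketched in a few lines. The thread representation (a list of post strings with paragraphs separated by blank lines) and the function name `candidate_answers` are assumptions for illustration.

```python
def candidate_answers(posts, question_post_index):
    """Collect the paragraphs of every post that follows the question's post."""
    candidates = []
    for post in posts[question_post_index + 1:]:
        for paragraph in post.split("\n\n"):
            paragraph = paragraph.strip()
            if paragraph:
                candidates.append(paragraph)
    return candidates

# Hypothetical three-post thread; the question is in post 0.
thread = [
    "Where can I buy cheap clothing in beijing?",
    "Try the markets near the zoo.\n\nBargaining is expected there.",
    "Thanks, that helps!",
]
print(candidate_answers(thread, 0))
```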

Page 12

ALGORITHMS FOR ANSWER DETECTION

Three IR methods to rank candidate answers:
    Cosine similarity, where f(w, X) denotes the frequency of word w in X
    Query likelihood language model
    KL-divergence language model
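Two of these rankers can be sketched as follows; the slide's formulas did not survive, so this is a standard textbook formulation rather than the authors' exact one. Cosine similarity is computed over raw word frequencies f(w, X). The query likelihood model uses Jelinek-Mercer smoothing with a hypothetical weight `lam`, which is an assumption, as is the whitespace tokenizer.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def cosine_similarity(question, answer):
    """Cosine of the angle between the word-frequency vectors of q and a."""
    fq, fa = Counter(tokenize(question)), Counter(tokenize(answer))
    dot = sum(fq[w] * fa[w] for w in fq)
    norm = (math.sqrt(sum(c * c for c in fq.values()))
            * math.sqrt(sum(c * c for c in fa.values())))
    return dot / norm if norm else 0.0

def query_log_likelihood(question, answer, collection, lam=0.5):
    """log P(question | answer model), smoothed with a collection model."""
    fa, fc = Counter(tokenize(answer)), Counter(tokenize(collection))
    len_a, len_c = sum(fa.values()), sum(fc.values())
    score = 0.0
    for w in tokenize(question):
        p_answer = fa[w] / len_a if len_a else 0.0
        p_collection = fc[w] / len_c if len_c else 0.0
        p = (1 - lam) * p_answer + lam * p_collection
        if p == 0.0:
            return float("-inf")  # word unseen even in the collection
        score += math.log(p)
    return score

question = "cheap hotel"
a1 = "a cheap hotel near the station"
a2 = "take the train"
collection = a1 + " " + a2
for answer in (a1, a2):
    print(round(cosine_similarity(question, answer), 3),
          round(query_log_likelihood(question, answer, collection), 3), answer)
```

In both rankers the answer that shares words with the question scores higher, which is all that isolated-document ranking can exploit.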

Page 13

ALGORITHMS FOR ANSWER DETECTION

The candidate answers for a question are not independent in forums.

Graph-based propagation method
    Given a question q and the set Aq of its candidate answers, build a weighted directed graph denoted as (V, E).
    Given two candidate answers a0 and ag, use KL(a0 | ag) to determine whether there will be an edge a0 → ag.
    If 1 / (1 + KL(a0 | ag)) > µ, an edge is formed from a0 to ag.
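The edge rule can be sketched as below. The unigram language models with additive smoothing (`alpha`), the shared vocabulary, and the exclusion of self-loops are assumptions made so the example is runnable; the slides only give the threshold test itself.

```python
import math
from collections import Counter

def language_model(text, vocabulary, alpha=0.1):
    """Unigram model over a shared vocabulary with additive smoothing (assumed)."""
    counts = Counter(text.lower().split())
    total = sum(counts.values()) + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / total for w in vocabulary}

def kl_divergence(p, q):
    """KL(p | q) for two distributions defined over the same vocabulary."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

def build_edges(answers, mu):
    """Return the set of directed edges (o, g) with 1 / (1 + KL(a_o | a_g)) > mu."""
    vocabulary = {w for a in answers for w in a.lower().split()}
    models = [language_model(a, vocabulary) for a in answers]
    edges = set()
    for o, p_o in enumerate(models):
        for g, p_g in enumerate(models):
            # Self-loops are skipped here; this is a choice of the sketch.
            if o != g and 1.0 / (1.0 + kl_divergence(p_o, p_g)) > mu:
                edges.add((o, g))
    return edges

answers = [
    "take the subway to the airport",
    "take the subway to the airport",
    "visa rules changed recently",
]
print(sorted(build_edges(answers, mu=0.5)))
```

Because KL-divergence is asymmetric, the edge a0 → ag may exist without the reverse edge, which is why the graph is directed.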

Page 14

ALGORITHMS FOR ANSWER DETECTION

Computing weights: the weight of an edge takes into account
    the distance between a candidate answer and the question, denoted by d(q, a);
    the author of the candidate answer: the number of his replying posts and the number of threads initiated by him.
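The weight formula itself is not recoverable from the slides, so the following is only a hypothetical way to combine the listed factors. The linear form, the weights `lambda1` and `lambda2`, the reciprocal of the distance, and the author-activity ratio are all assumptions, not the paper's definition.

```python
def edge_weight(kl, distance, replies, threads_started, lambda1=0.5, lambda2=0.5):
    """Hypothetical edge weight built from the factors named on the slide.

    kl               KL-divergence between the two candidate answers
    distance         d(q, a), e.g. number of posts between question and answer (>= 1)
    replies          number of replying posts by the answer's author
    threads_started  number of threads initiated by the answer's author
    """
    similarity = 1.0 / (1.0 + kl)
    proximity = 1.0 / distance
    total_posts = replies + threads_started
    # Assumed proxy: authors who mostly reply are treated as more likely answerers.
    activity = replies / total_posts if total_posts else 0.0
    return similarity + lambda1 * proximity + lambda2 * activity

print(edge_weight(kl=0.0, distance=1, replies=3, threads_started=1))
print(edge_weight(kl=0.0, distance=4, replies=3, threads_started=1))
```

Whatever the exact form, the intent described on the slide is that answers closer to the question and written by more active repliers receive more weight.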

Page 15

ALGORITHMS FOR ANSWER DETECTION

Normalize the weights, as in the PageRank algorithm.

Computing propagated scores
    Propagation without initial score
    Propagation with initial score
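The slides do not preserve the update equations, so the sketch below is a generic PageRank-style iteration rather than the paper's exact method. The damping factor, the number of iterations, the handling of nodes without outgoing edges, and the use of the initial ranking scores as the restart distribution are all assumptions.

```python
def normalize_weights(weights):
    """Scale each node's outgoing edge weights so that they sum to 1."""
    normalized = {}
    for src, out_edges in weights.items():
        total = sum(out_edges.values())
        if total > 0:
            normalized[src] = {dst: w / total for dst, w in out_edges.items()}
    return normalized

def propagate(nodes, weights, initial=None, damping=0.85, iterations=50):
    """PageRank-style score propagation over a weighted directed graph.

    With `initial=None` the restart distribution is uniform (no initial score);
    otherwise it is proportional to the initial ranking scores.
    """
    if initial is None:
        prior = {v: 1.0 / len(nodes) for v in nodes}
    else:
        total = sum(initial[v] for v in nodes)
        prior = {v: initial[v] / total for v in nodes}
    norm = normalize_weights(weights)
    scores = dict(prior)
    for _ in range(iterations):
        # Score mass of nodes without outgoing edges is redistributed by the prior.
        dangling = sum(scores[u] for u in nodes if u not in norm)
        scores = {
            v: (1 - damping) * prior[v]
               + damping * (sum(scores[u] * norm[u].get(v, 0.0) for u in norm)
                            + dangling * prior[v])
            for v in nodes
        }
    return scores

nodes = ["a1", "a2", "a3"]
weights = {"a1": {"a3": 0.9}, "a2": {"a3": 0.6, "a1": 0.2}}  # hypothetical edge weights
print(propagate(nodes, weights))
print(propagate(nodes, weights, initial={"a1": 0.1, "a2": 0.8, "a3": 0.1}))
```

In this toy graph both other answers point to a3, so it accumulates the highest score without initial scores; supplying initial scores shifts mass toward the answer the ranker already favoured.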

Page 16

EXPERIMENTS

Data
    1,212,153 threads from the TripAdvisor forum
    86,772 threads from the LonelyPlanet forum
    25,298 threads from the BootsnAll Network

From the TripAdvisor data, we randomly sampled 650 threads.

Two annotators were asked to tag questions and their answers in each thread.

Page 17

EXPERIMENTS

In Q-TUnion, a sentence was labeled as a question if it was marked as a question by either annotator; in Q-TInter, a sentence was labeled as a question if both annotators marked it as a question.
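The two label sets correspond to a set union and a set intersection. In the sketch below the sentence ids and the annotators' choices are made up for illustration.

```python
# Hypothetical sentence ids marked as questions by each annotator.
annotator_a = {2, 5, 9}
annotator_b = {2, 9, 11}

q_tunion = annotator_a | annotator_b  # marked by either annotator
q_tinter = annotator_a & annotator_b  # marked by both annotators

print(sorted(q_tunion))
print(sorted(q_tinter))
```

Q-TInter is always a subset of Q-TUnion, so it is the stricter of the two label sets.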
