Corroborate and Learn Facts from the Web

Shubin Zhao, Jonathan Betz (KDD '07)

Advisor: Dr. Koh Jia-Ling
Speaker: Tu Yi-Lang
Date: 2008.03.06

Introduction

The web contains lots of interesting factual information about entities, such as celebrities, movies or products.

If we can collect them and provide a way to search them, it would be very helpful for answering questions or for improving web search in general.

Introduction

For example

Introduction

Facts are represented in the form of attribute-value pairs.

For instance “Birthday: June 4, 1975” and “Birth Name: Angelina Jolie Voight” are two facts for the entity “Angelina Jolie”.

MapReduce

The system described in this paper is called GRAZER, and MapReduce is the computing model the GRAZER system is based on.

MapReduce (OSDI '04) is a programming model for processing large data sets in parallel.

A MapReduce job is composed of mappers and reducers.

MapReduce

Mapper:

• Mappers take the input data as key-value pairs.

• The intermediate data is sorted by keys and then shuffled to reducers.

Reducer:

• Each reducer can process the key-value pairs again and output new values.

• Intermediate values with the same key are always processed by one reduce step of a reducer.
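
The mapper/reducer flow can be illustrated with a minimal single-process sketch. It is not the distributed framework itself: `run_mapreduce`, `word_mapper` and `count_reducer` are hypothetical names, and word counting is used only as a familiar stand-in task.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Apply mapper to each (key, value) record, group by key, then reduce."""
    intermediate = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            intermediate[out_key].append(out_value)
    # Iterating over sorted keys mimics the sort-and-shuffle phase:
    # all values with the same key reach one reduce step.
    return {key: reducer(key, intermediate[key]) for key in sorted(intermediate)}


def word_mapper(doc_id, text):
    # Emit one (word, 1) pair per word in the document.
    for word in text.lower().split():
        yield word, 1


def count_reducer(word, counts):
    # Combine all intermediate values that share the same key.
    return sum(counts)


docs = [("d1", "facts from the web"), ("d2", "learn facts")]
word_counts = run_mapreduce(docs, word_mapper, count_reducer)
```

Here `word_counts["facts"]` is 2 because both documents emit the key "facts" and a single reduce step sums them.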

The Grazer System

Definitions:

• Fact: an attribute-value pair with a list of sources (URLs) where the fact is mentioned.

• Entity: formed by a list of facts.

• Relevant page: a page that is relevant to an entity.

• Pattern: any contiguous HTML tag sequence that repeats at least two times in a page.

• Pattern instance: each repetition of a pattern.
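
The fact and entity definitions map naturally onto small record types. The sketch below is one possible representation, not GRAZER's actual data model; the class names, the `add_fact` helper and the example.org URLs are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Fact:
    attribute: str
    value: str
    sources: list = field(default_factory=list)  # URLs where the fact is mentioned


@dataclass
class Entity:
    name: str
    facts: list = field(default_factory=list)

    def add_fact(self, attribute, value, url):
        """Record a mention; an existing fact just gains a new source."""
        for fact in self.facts:
            if fact.attribute == attribute and fact.value == value:
                if url not in fact.sources:
                    fact.sources.append(url)
                return fact
        fact = Fact(attribute, value, [url])
        self.facts.append(fact)
        return fact


jolie = Entity("Angelina Jolie")
jolie.add_fact("Birthday", "June 4, 1975", "http://example.org/page-a")  # placeholder URL
jolie.add_fact("Birthday", "June 4, 1975", "http://example.org/page-b")  # placeholder URL
jolie.add_fact("Birth Name", "Angelina Jolie Voight", "http://example.org/page-a")
```

After these calls the entity holds two facts, and the "Birthday" fact lists two sources.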

The Grazer System

System Overview:

The Grazer System

Generate Initial Facts:

• In this paper, initial facts are generated by scraping en.wikipedia.org using manually generated scripts.

• Wikipedia is a good source of facts because it covers many knowledge domains, and the algorithm will corroborate and learn facts for each domain.

Retrieve Relevant Pages

The solution in GRAZER is to match the anchor text of a page with entity names.

To improve the precision of the result, other heuristics can also be used.

• One heuristic is to require the name to appear in the page title.

• Another heuristic is to require the name to appear in a salient position in a page, such as a heading.
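
The two precision heuristics can be sketched with simple regular expressions. This is only an approximation that assumes reasonably well-formed HTML; the function names are hypothetical, and a real system would more likely use an HTML parser.

```python
import re


def _tag_texts(html, tag):
    """Return the text content of every <tag>...</tag> element (regex approximation)."""
    pattern = rf"<{tag}\b[^>]*>(.*?)</{tag}>"
    matches = re.findall(pattern, html, flags=re.IGNORECASE | re.DOTALL)
    # Strip any nested tags from the captured content.
    return [re.sub(r"<[^>]+>", "", m) for m in matches]


def name_in_title(html, name):
    return any(name.lower() in text.lower() for text in _tag_texts(html, "title"))


def name_in_heading(html, name):
    # Headings h1..h6 are treated as "salient positions" here (an assumption).
    return any(
        name.lower() in text.lower()
        for level in range(1, 7)
        for text in _tag_texts(html, f"h{level}")
    )


page = (
    "<html><head><title>Angelina Jolie - Biography</title></head>"
    "<body><h1>Biography</h1><p>Angelina Jolie ...</p></body></html>"
)
title_hit = name_in_title(page, "Angelina Jolie")
heading_hit = name_in_heading(page, "Angelina Jolie")
```

For this page the title heuristic fires and the heading heuristic does not, since the name appears only in the title and the body paragraph.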

Retrieve Relevant Pages

The page retrieval algorithm is implemented as a MapReduce.

If the name can be found in the page, the mapper outputs the entity name as the key and the content of the page as the value.

The reducer simply combines the pages for the same key (entity name) into a list.
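
A single-process sketch of this mapper and reducer follows. It makes two assumptions that are not from the slides: the entity names are available to the mapper as an in-memory list (`ENTITY_NAMES`), and a plain substring test stands in for anchor-text matching.

```python
from collections import defaultdict

ENTITY_NAMES = ["Angelina Jolie", "Independence Day"]  # hypothetical seed entity names


def retrieval_mapper(url, content):
    # Emit (entity name, page content) when the name is found in the page.
    for name in ENTITY_NAMES:
        if name.lower() in content.lower():
            yield name, content


def retrieval_reducer(name, pages):
    # Simply combine the pages for the same key into a list.
    return list(pages)


def retrieve_relevant_pages(crawl):
    """crawl: iterable of (url, content) pairs; returns {entity name: [pages]}."""
    grouped = defaultdict(list)
    for url, content in crawl:
        for name, page in retrieval_mapper(url, content):
            grouped[name].append(page)
    return {name: retrieval_reducer(name, pages) for name, pages in grouped.items()}


crawl = [
    ("http://example.org/1", "Angelina Jolie biography page"),
    ("http://example.org/2", "Independence Day movie page"),
    ("http://example.org/3", "Independence Day holiday page"),
    ("http://example.org/4", "A page about something else"),
]
relevant = retrieve_relevant_pages(crawl)
```

Both "Independence Day" pages end up in the same list because they share one key.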

Retrieve Relevant Pages

The MapReduce algorithm is as follows:

Sometimes different entities share the same name, e.g. “Independence Day” can be a movie or a holiday.

For these entities, all the relevant pages are mixed together as they are indexed by the same entity name.

Corroborate Known Facts

Corroboration can go wrong for common facts such as gender, which has only two values, “male” and “female”.

To avoid this kind of error, the system computes the probability p that all the matched fact values appear randomly in a page, given their attribute names.
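
The slides do not give the formula for p. One simple reading, assumed here purely for illustration, treats the matched values as independent and multiplies a per-attribute chance of a random match; the chance values, the threshold and the function names below are all hypothetical.

```python
from math import prod

# Hypothetical chance that a value of each attribute appears in a page at random.
CHANCE = {"Gender": 0.5, "Birthday": 1e-4}


def random_match_probability(matched_attributes, chance=CHANCE):
    """p under an independence assumption: product of per-attribute chances."""
    return prod(chance[attr] for attr in matched_attributes)


def is_corroborated(matched_attributes, chance=CHANCE, threshold=0.01):
    # Accept the page only if a purely random match would be unlikely.
    if not matched_attributes:
        return False
    return random_match_probability(matched_attributes, chance) < threshold


gender_only = is_corroborated(["Gender"])
gender_and_birthday = is_corroborated(["Gender", "Birthday"])
```

Under these assumed numbers, matching only "Gender" is rejected (p = 0.5), while matching "Gender" together with "Birthday" is accepted.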

Corroborate Known Facts

Corroboration Strategies:

(1) Values of a fact tend to vary, e.g. “June 4, 1975”, “4-June-1975” and “06/04/1975”.

(2) Synonyms of attribute names can be used, e.g. “Date of Birth” and “Birthdate” for attribute “Birthday”.

(3) For some facts, the attribute name does not appear explicitly in text, e.g.
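
Strategies (1) and (2) can be sketched as shallow string matching over value variants and attribute synonyms. The helper names and the synonym table are assumptions, and the sketch does not cover strategy (3), where the attribute name is absent from the text.

```python
from datetime import date

MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

# Hypothetical synonym table; the entries come from the example on the slide.
ATTRIBUTE_SYNONYMS = {"Birthday": ["Birthday", "Date of Birth", "Birthdate"]}


def date_variants(d):
    """Render one date in the three surface forms shown on the slide."""
    month = MONTHS[d.month - 1]
    return [
        f"{month} {d.day}, {d.year}",
        f"{d.day}-{month}-{d.year}",
        f"{d.month:02d}/{d.day:02d}/{d.year}",
    ]


def mentions_fact(text, attribute, d):
    """True if the text has any synonym of the attribute and any variant of the value."""
    lowered = text.lower()
    names = ATTRIBUTE_SYNONYMS.get(attribute, [attribute])
    has_name = any(name.lower() in lowered for name in names)
    has_value = any(variant.lower() in lowered for variant in date_variants(d))
    return has_name and has_value


birthday = date(1975, 6, 4)
variants = date_variants(birthday)
hit = mentions_fact("Date of Birth: 4-June-1975", "Birthday", birthday)
```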

Corroborate Known Facts

Parallelization: The corroboration algorithm needs to corroborate each fact of the entity in the relevant pages. This is implemented as a MapReduce.

• Reducer: just passes input key-value pairs to output.

Extract New Facts

Pattern discovery: Pattern discovery is applied to the enclosing node of examples in structured text to find repeated HTML patterns.

It tries to find clusters of nodes under the same parent that have similar HTML format (or tags).

Extract New Facts

If multiple patterns are found under the same parent, the pattern with the largest span will be preferred.

E.g. if the html text is: <b>text</b><b>text</b>text<br><b>text</b>text<br>

the pattern discovered will be “<b></b><br>” but not “<b></b>”.

If a pattern can be matched and it contains an example fact, the extraction process will start to extract facts from it.
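
The preference for the largest span can be illustrated on the example above. The sketch is a simplification: it works on a flat tag sequence rather than on sibling nodes under a parent, and it reads "largest span" as "longest tag sequence that occurs at least twice without overlap", which is an assumption. All function names are hypothetical.

```python
import re


def tag_sequence(html):
    """Reduce HTML to its sequence of tags, dropping text and attributes."""
    names = re.findall(r"<(/?[a-zA-Z][a-zA-Z0-9]*)[^>]*>", html)
    return [f"<{name.lower()}>" for name in names]


def count_non_overlapping(tokens, pattern):
    """Count left-to-right, non-overlapping occurrences of pattern in tokens."""
    count, i, k = 0, 0, len(pattern)
    while i + k <= len(tokens):
        if tokens[i:i + k] == pattern:
            count += 1
            i += k
        else:
            i += 1
    return count


def discover_pattern(html):
    """Return the longest contiguous tag sequence that repeats at least twice."""
    tokens = tag_sequence(html)
    # Try the longest candidates first so the largest span wins.
    for k in range(len(tokens) // 2, 0, -1):
        for start in range(len(tokens) - k + 1):
            candidate = tokens[start:start + k]
            if count_non_overlapping(tokens, candidate) >= 2:
                return "".join(candidate)
    return None


example = "<b>text</b><b>text</b>text<br><b>text</b>text<br>"
pattern = discover_pattern(example)
```

On the slide's example this returns "<b></b><br>": the shorter "<b></b>" also repeats, but the longer sequence is found first.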

Extract New Facts

Fact extraction:

If we can find an HTML pattern in which one pattern instance contains an example fact, it is likely that other pattern instances also contain facts about the same entity.

The pattern instance that contains the corroborated fact is referred to as PIe.

If the number of textblocks in PIe is more than two:

• One of them contains the example attribute.

• Another one contains the example value.

Extract New Facts

The positions of the two corresponding textblocks in PIe are recorded.

• The attribute name may appear in the first textblock (attribute index = 1).

• The value may appear in the second one (value index = 2).

When extracting from other pattern instances, we require that they have the same number of textblocks as PIe.

New facts will be created from the extracted attribute-value pairs.
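
The index-based extraction can be sketched as follows, assuming each pattern instance has already been split into a list of textblocks. The function names, the trailing "more" textblock and the "Related"/"links" instance are made up for illustration.

```python
def locate_example(pie, attribute, value):
    """Return 1-based (attribute index, value index) of the example fact in PIe."""
    attr_idx = value_idx = None
    for i, block in enumerate(pie, start=1):
        if attr_idx is None and attribute.lower() in block.lower():
            attr_idx = i
        elif value_idx is None and value.lower() in block.lower():
            value_idx = i
    if attr_idx is None or value_idx is None:
        return None
    return attr_idx, value_idx


def extract_facts(instances, pie, attribute, value):
    """Read attribute-value pairs from instances shaped like PIe."""
    located = locate_example(pie, attribute, value)
    if located is None:
        return []
    attr_idx, value_idx = located
    facts = []
    for inst in instances:
        # Skip PIe itself and any instance with a different number of textblocks.
        if inst is pie or len(inst) != len(pie):
            continue
        facts.append((inst[attr_idx - 1].strip(" :"), inst[value_idx - 1].strip()))
    return facts


pie = ["Birthday:", "June 4, 1975", "more"]
instances = [
    pie,
    ["Birth Name:", "Angelina Jolie Voight", "more"],
    ["Related", "links"],  # different number of textblocks, so it is skipped
]
indices = locate_example(pie, "Birthday", "June 4, 1975")
new_facts = extract_facts(instances, pie, "Birthday", "June 4, 1975")
```

Here the recorded positions are attribute index 1 and value index 2, and the only new fact extracted is the "Birth Name" pair.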

Extract New Facts

Bootstrapping:

New facts extracted from a page are added to the seed entity.

Both the new facts and the original facts will be taken as seeds for corroboration in the following pages.

Extract New Facts

The bootstrapping process described here converges very well.

This is because incorrect facts extracted from one page are unlikely to be corroborated in other pages.

Therefore the chance of error propagation is small.

In practice we can stop the algorithm when no more new facts can be extracted from any page.
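
The stopping rule can be shown with a toy bootstrapping loop. It assumes every page has already been reduced to a list of (attribute, value) pairs, and it treats a page as corroborating when it shares at least one pair with the known facts; both are simplifications, and "Attribute X" and "Unrelated" are placeholder data.

```python
def bootstrap(seed_facts, pages, max_rounds=10):
    """Grow the known fact set until a round extracts nothing new."""
    known = set(seed_facts)
    for round_no in range(1, max_rounds + 1):
        new_facts = set()
        for page in pages:
            page_facts = set(page)
            # A page is used only if it corroborates at least one known fact.
            if known & page_facts:
                new_facts |= page_facts - known
        if not new_facts:
            return known, round_no - 1  # number of rounds that learned something
        known |= new_facts
    return known, max_rounds


seed = {("Birthday", "June 4, 1975")}
pages = [
    [("Birth Name", "Angelina Jolie Voight"), ("Attribute X", "value x")],
    [("Birthday", "June 4, 1975"), ("Birth Name", "Angelina Jolie Voight")],
    [("Unrelated", "value y")],
]
known_facts, productive_rounds = bootstrap(seed, pages)
```

The first page becomes usable only in the second round, after "Birth Name" has been learned from the second page; the third page is never corroborated, so its pair is never added.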

Extract New Facts

Fact selection:

The result of GRAZER is an enlarged set of known facts.

All corroborated facts have at least two sources, and uncorroborated facts have only one source.

In general, facts with more sources are more reliable.

To determine the quality of facts, other signals can also be used, such as the reliability and diversity of sources.
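
A minimal selection step based only on source counts might look like the sketch below. The tuple layout, the function name and the placeholder URLs are assumptions, and source reliability or diversity is not modeled.

```python
def select_facts(facts, min_sources=2):
    """Keep facts with enough distinct sources, most-supported first.

    facts: list of (attribute, value, sources) tuples.
    """
    corroborated = [f for f in facts if len(set(f[2])) >= min_sources]
    return sorted(corroborated, key=lambda f: len(set(f[2])), reverse=True)


candidate_facts = [
    ("Birth Name", "Angelina Jolie Voight", ["http://example.org/a", "http://example.org/b"]),
    ("Birthday", "June 4, 1975", ["http://example.org/a", "http://example.org/b", "http://example.org/c"]),
    ("Attribute X", "value x", ["http://example.org/a"]),  # single source: uncorroborated
]
selected = select_facts(candidate_facts)
```

With the default threshold of two sources, the single-source fact is dropped and the "Birthday" fact ranks first.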

Experiment

Experiment on Wikipedia Facts: Seed facts:

• Entity names are extracted from the first sentence of the page.

• It generated 1.75 million entities and 12.6 million facts.

Experiment

Relevant Pages:

• The x-axis represents entities sorted by the number of relevant pages.

• The y-axis is the number of relevant pages in logarithmic scale.

Experiment

Learning results:

• For entities with lots of relevant pages, bootstrapping with more rounds would generate more facts.

• It contains much redundant information about films, famous people and books.

Experiment

It is difficult to evaluate all the learning results because it involves millions of webpages and facts.

We think most of the facts do not get corroborated for two reasons:

(1) There is less redundancy about less popular entities on the web.

(2) Many facts are specific to Wikipedia, and we cannot find them on other sites by shallow string matching.

Conclusion

The paper presents an approach to find relevant pages about entities and extract new facts from them by corroborating existing facts.

Fact corroboration and extraction of each entity is a bootstrapping process which terminates well within a few learning rounds.

The algorithm is based on string match and HTML pattern discovery, so it is language-independent.

The corroboration can be based on the semantic values of facts. This should be a better way to handle value variations, but it is language-dependent.