Truth Discovery with Multiple Conflicting Information Providers on the Web KDD 07.

24
Truth Discovery with Multiple Conflicting Information Providers on the Web KDD 07

Transcript of Truth Discovery with Multiple Conflicting Information Providers on the Web KDD 07.

Page 1: Truth Discovery with Multiple Conflicting Information Providers on the Web KDD 07.

Truth Discovery with Multiple Conflicting InformationProviders on the Web

KDD 07

Page 2: Truth Discovery with Multiple Conflicting Information Providers on the Web KDD 07.

2

Motivation

• Example: Authors of books– We tried to find out who wrote the book “Rapid Contextual

Design”.• Many different sets of authors from different online bookstores

Accurate information

Incompleteinformation

Page 3: Truth Discovery with Multiple Conflicting Information Providers on the Web KDD 07.

3

Motivation

• Is the world-wide web always trustable?

• Unfortunately, the answer is “NO”.

• There is no guarantee for the correctness of information on the web.

Page 4: Truth Discovery with Multiple Conflicting Information Providers on the Web KDD 07.

4

Motivation

• Different web sites often provide conflicting information on a subject.

• 54% of Internet users trust news web sites

• 26% for web sites that sell products

• 12% for blogs

Page 5: Truth Discovery with Multiple Conflicting Information Providers on the Web KDD 07.

5

Veracity

• Veracity i.e., conformity to truth• How to find true facts from a large amount of

conflicting information on many subjects that is provided by various web sites.

• This paper invent an algorithm called TRUTHFINDER, – A web site is trustworthy if it provides many pieces of true

information, and a piece of information is likely to be true if it is provided by many trustworthy web sites.

Page 6: Truth Discovery with Multiple Conflicting Information Providers on the Web KDD 07.

6

Bookstore Books

Authors

Authors

Problem DefinitionsFacts: properties of the objects

Readers News

Emotions

Emotions

Page 7: Truth Discovery with Multiple Conflicting Information Providers on the Web KDD 07.

7

Problem Definitions

• Definition 1: (Confidence of facts.) – The confidence of a fact f (denoted by s(f)) is the probabilit

y of f being correct, according to the best of our knowledge.

• Definition 2: (Trustworthiness of web sites.) – The trustworthiness of a web site w (denoted by t(w)) is the

expected confidence of the facts provided by w.

Page 8: Truth Discovery with Multiple Conflicting Information Providers on the Web KDD 07.

8

Problem Definitions

• Implication between facts– Imp( f1 f2 ) :

• how much f2’s confidence should be increased or decreased according to f1’s confidence.

– Imp( f1 f2 ) is a value between -1 and 1.

• A positive value indicates if f1 is correct, f2 is likely to be correct.

• A negative value means if f1 is correct, f2 is likely to be wrong.

– Imp( f1 f2 ) = sim(f1, f2) - base_sim,

• where sim(f1, f2) is the similarity between f1 and f2, and base_sim is a threshold for similarity.

T

F

Page 9: Truth Discovery with Multiple Conflicting Information Providers on the Web KDD 07.

9

Computational Model

• Heuristic 1: Usually there is only one true fact for a property of an object.

• Heuristic 2: This true fact appears to be the same or similar on different web sites.

• Heuristic 3: The false facts on different web sites are less likely to be the same or similar.

• Heuristic 4: In a certain domain, a web site that provides mostly true facts for many objects will likely provide true facts for other objects.

Page 10: Truth Discovery with Multiple Conflicting Information Providers on the Web KDD 07.

10

Computational Model

• If a fact is provided by many trustworthy web sites, it is likely to be true

• If a fact is conflicting with the facts provided by many trustworthy web sites, it is unlikely to be true.

• A web site is trustworthy if it provides facts with high confidence

• Web site trustworthiness and fact confidence can be determined by each other

• True facts are more consistent than false facts (Heuristic 3)

Page 11: Truth Discovery with Multiple Conflicting Information Providers on the Web KDD 07.

11

Computational Model- Basic Inference

• Basic Inference

t(w1) = 0.9 and t(w2) = 0.99t(w2) = 1.1 × t(w1)(w2) = 2× (w1)

Page 12: Truth Discovery with Multiple Conflicting Information Providers on the Web KDD 07.

12

Computational Model- Basic Inference

• f1 is provided by w1 and w2, if f1 is wrong then both w1 and w2 are wrong

• the probability that both of them are wrong is – (1 − t(w1)) · (1 − t(w2))

• the probability that f1 is not wrong is

– 1 − (1 − t(w1)) · (1 − t(w2))

• s(f) can be computed as :

Page 13: Truth Discovery with Multiple Conflicting Information Providers on the Web KDD 07.

13

Computational Model- Basic Inference

Page 14: Truth Discovery with Multiple Conflicting Information Providers on the Web KDD 07.

14

Computational Model- Inferences between Facts

• Adjusted confidence score

New score of the fact

Page 15: Truth Discovery with Multiple Conflicting Information Providers on the Web KDD 07.

15

Computational Model- Handling Additional Subtlety

• Different web sites are not independent with each other– dampening factor

• the confidence of a fact f can easily be negative

Page 16: Truth Discovery with Multiple Conflicting Information Providers on the Web KDD 07.

16

Computational Model- Iterative Computation

New score of the fact

Page 17: Truth Discovery with Multiple Conflicting Information Providers on the Web KDD 07.

17

Experiments

• Baseline– Voting :

• This method chooses the fact that is provided by most web sites.

= 0.5 = 0.3• Book Authors Dataset:

– 1,265 computer science books

– Using ISBN search on www.abebooks.com • 894 bookstores and 34,031 listings

• 5.4 different authors / book

Page 18: Truth Discovery with Multiple Conflicting Information Providers on the Web KDD 07.

18

Experiments

• randomly select 100 books• manually find out their authors• accuracy :

• Partially correct facts:– last name , first name and middle name : 3:2:1

– for example : “Graeme Simsion” is 5/6 (omit middle name)

• If f1 has x authors and f2 has y authors, and there are z shared ones, then imp(f1 f2) = z/x − base-sim– base-sim = 0.5

Page 19: Truth Discovery with Multiple Conflicting Information Providers on the Web KDD 07.

19

Experiments- Book Authors Dataset

Page 20: Truth Discovery with Multiple Conflicting Information Providers on the Web KDD 07.

20

Experiments- Book Authors Dataset

• One book may make multiple errors

• Miss authors:– only provide subset of

all authors

Page 21: Truth Discovery with Multiple Conflicting Information Providers on the Web KDD 07.

21

Experiments- Query with Google

• Google is good at finding authoritative web sites.

• But do these web sites provide accurate information ?

• Compare the online bookstores– highest ranks by Google vs.

– highest trustworthiness found by TruthFinder

• Querying Google with “bookstore”– Find all bookstores that exist in their dataset from the top 300

Google results.

Page 22: Truth Discovery with Multiple Conflicting Information Providers on the Web KDD 07.

22

Experiments- Query with Google

Page 23: Truth Discovery with Multiple Conflicting Information Providers on the Web KDD 07.

23

Conclusion

• Veracity problem

• TRUTHFINDER Algorithm

Page 24: Truth Discovery with Multiple Conflicting Information Providers on the Web KDD 07.

24