Truth Discovery with Multiple Conflicting Information Providers on the Web KDD 07.
-
Upload
marjorie-harvey -
Category
Documents
-
view
216 -
download
0
Transcript of Truth Discovery with Multiple Conflicting Information Providers on the Web KDD 07.
![Page 1: Truth Discovery with Multiple Conflicting Information Providers on the Web KDD 07.](https://reader033.fdocuments.in/reader033/viewer/2022051417/56649f325503460f94c4d80d/html5/thumbnails/1.jpg)
Truth Discovery with Multiple Conflicting InformationProviders on the Web
KDD 07
![Page 2: Truth Discovery with Multiple Conflicting Information Providers on the Web KDD 07.](https://reader033.fdocuments.in/reader033/viewer/2022051417/56649f325503460f94c4d80d/html5/thumbnails/2.jpg)
2
Motivation
• Example: Authors of books– We tried to find out who wrote the book “Rapid Contextual
Design”.• Many different sets of authors from different online bookstores
Accurate information
Incompleteinformation
![Page 3: Truth Discovery with Multiple Conflicting Information Providers on the Web KDD 07.](https://reader033.fdocuments.in/reader033/viewer/2022051417/56649f325503460f94c4d80d/html5/thumbnails/3.jpg)
3
Motivation
• Is the world-wide web always trustable?
• Unfortunately, the answer is “NO”.
• There is no guarantee for the correctness of information on the web.
![Page 4: Truth Discovery with Multiple Conflicting Information Providers on the Web KDD 07.](https://reader033.fdocuments.in/reader033/viewer/2022051417/56649f325503460f94c4d80d/html5/thumbnails/4.jpg)
4
Motivation
• Different web sites often provide conflicting information on a subject.
• 54% of Internet users trust news web sites
• 26% for web sites that sell products
• 12% for blogs
![Page 5: Truth Discovery with Multiple Conflicting Information Providers on the Web KDD 07.](https://reader033.fdocuments.in/reader033/viewer/2022051417/56649f325503460f94c4d80d/html5/thumbnails/5.jpg)
5
Veracity
• Veracity i.e., conformity to truth• How to find true facts from a large amount of
conflicting information on many subjects that is provided by various web sites.
• This paper invent an algorithm called TRUTHFINDER, – A web site is trustworthy if it provides many pieces of true
information, and a piece of information is likely to be true if it is provided by many trustworthy web sites.
![Page 6: Truth Discovery with Multiple Conflicting Information Providers on the Web KDD 07.](https://reader033.fdocuments.in/reader033/viewer/2022051417/56649f325503460f94c4d80d/html5/thumbnails/6.jpg)
6
Bookstore Books
Authors
Authors
Problem DefinitionsFacts: properties of the objects
Readers News
Emotions
Emotions
![Page 7: Truth Discovery with Multiple Conflicting Information Providers on the Web KDD 07.](https://reader033.fdocuments.in/reader033/viewer/2022051417/56649f325503460f94c4d80d/html5/thumbnails/7.jpg)
7
Problem Definitions
• Definition 1: (Confidence of facts.) – The confidence of a fact f (denoted by s(f)) is the probabilit
y of f being correct, according to the best of our knowledge.
• Definition 2: (Trustworthiness of web sites.) – The trustworthiness of a web site w (denoted by t(w)) is the
expected confidence of the facts provided by w.
![Page 8: Truth Discovery with Multiple Conflicting Information Providers on the Web KDD 07.](https://reader033.fdocuments.in/reader033/viewer/2022051417/56649f325503460f94c4d80d/html5/thumbnails/8.jpg)
8
Problem Definitions
• Implication between facts– Imp( f1 f2 ) :
• how much f2’s confidence should be increased or decreased according to f1’s confidence.
– Imp( f1 f2 ) is a value between -1 and 1.
• A positive value indicates if f1 is correct, f2 is likely to be correct.
• A negative value means if f1 is correct, f2 is likely to be wrong.
– Imp( f1 f2 ) = sim(f1, f2) - base_sim,
• where sim(f1, f2) is the similarity between f1 and f2, and base_sim is a threshold for similarity.
T
F
![Page 9: Truth Discovery with Multiple Conflicting Information Providers on the Web KDD 07.](https://reader033.fdocuments.in/reader033/viewer/2022051417/56649f325503460f94c4d80d/html5/thumbnails/9.jpg)
9
Computational Model
• Heuristic 1: Usually there is only one true fact for a property of an object.
• Heuristic 2: This true fact appears to be the same or similar on different web sites.
• Heuristic 3: The false facts on different web sites are less likely to be the same or similar.
• Heuristic 4: In a certain domain, a web site that provides mostly true facts for many objects will likely provide true facts for other objects.
![Page 10: Truth Discovery with Multiple Conflicting Information Providers on the Web KDD 07.](https://reader033.fdocuments.in/reader033/viewer/2022051417/56649f325503460f94c4d80d/html5/thumbnails/10.jpg)
10
Computational Model
• If a fact is provided by many trustworthy web sites, it is likely to be true
• If a fact is conflicting with the facts provided by many trustworthy web sites, it is unlikely to be true.
• A web site is trustworthy if it provides facts with high confidence
• Web site trustworthiness and fact confidence can be determined by each other
• True facts are more consistent than false facts (Heuristic 3)
![Page 11: Truth Discovery with Multiple Conflicting Information Providers on the Web KDD 07.](https://reader033.fdocuments.in/reader033/viewer/2022051417/56649f325503460f94c4d80d/html5/thumbnails/11.jpg)
11
Computational Model- Basic Inference
• Basic Inference
t(w1) = 0.9 and t(w2) = 0.99t(w2) = 1.1 × t(w1)(w2) = 2× (w1)
![Page 12: Truth Discovery with Multiple Conflicting Information Providers on the Web KDD 07.](https://reader033.fdocuments.in/reader033/viewer/2022051417/56649f325503460f94c4d80d/html5/thumbnails/12.jpg)
12
Computational Model- Basic Inference
• f1 is provided by w1 and w2, if f1 is wrong then both w1 and w2 are wrong
• the probability that both of them are wrong is – (1 − t(w1)) · (1 − t(w2))
• the probability that f1 is not wrong is
– 1 − (1 − t(w1)) · (1 − t(w2))
• s(f) can be computed as :
![Page 13: Truth Discovery with Multiple Conflicting Information Providers on the Web KDD 07.](https://reader033.fdocuments.in/reader033/viewer/2022051417/56649f325503460f94c4d80d/html5/thumbnails/13.jpg)
13
Computational Model- Basic Inference
![Page 14: Truth Discovery with Multiple Conflicting Information Providers on the Web KDD 07.](https://reader033.fdocuments.in/reader033/viewer/2022051417/56649f325503460f94c4d80d/html5/thumbnails/14.jpg)
14
Computational Model- Inferences between Facts
• Adjusted confidence score
New score of the fact
![Page 15: Truth Discovery with Multiple Conflicting Information Providers on the Web KDD 07.](https://reader033.fdocuments.in/reader033/viewer/2022051417/56649f325503460f94c4d80d/html5/thumbnails/15.jpg)
15
Computational Model- Handling Additional Subtlety
• Different web sites are not independent with each other– dampening factor
• the confidence of a fact f can easily be negative
![Page 16: Truth Discovery with Multiple Conflicting Information Providers on the Web KDD 07.](https://reader033.fdocuments.in/reader033/viewer/2022051417/56649f325503460f94c4d80d/html5/thumbnails/16.jpg)
16
Computational Model- Iterative Computation
New score of the fact
![Page 17: Truth Discovery with Multiple Conflicting Information Providers on the Web KDD 07.](https://reader033.fdocuments.in/reader033/viewer/2022051417/56649f325503460f94c4d80d/html5/thumbnails/17.jpg)
17
Experiments
• Baseline– Voting :
• This method chooses the fact that is provided by most web sites.
= 0.5 = 0.3• Book Authors Dataset:
– 1,265 computer science books
– Using ISBN search on www.abebooks.com • 894 bookstores and 34,031 listings
• 5.4 different authors / book
![Page 18: Truth Discovery with Multiple Conflicting Information Providers on the Web KDD 07.](https://reader033.fdocuments.in/reader033/viewer/2022051417/56649f325503460f94c4d80d/html5/thumbnails/18.jpg)
18
Experiments
• randomly select 100 books• manually find out their authors• accuracy :
• Partially correct facts:– last name , first name and middle name : 3:2:1
– for example : “Graeme Simsion” is 5/6 (omit middle name)
• If f1 has x authors and f2 has y authors, and there are z shared ones, then imp(f1 f2) = z/x − base-sim– base-sim = 0.5
![Page 19: Truth Discovery with Multiple Conflicting Information Providers on the Web KDD 07.](https://reader033.fdocuments.in/reader033/viewer/2022051417/56649f325503460f94c4d80d/html5/thumbnails/19.jpg)
19
Experiments- Book Authors Dataset
![Page 20: Truth Discovery with Multiple Conflicting Information Providers on the Web KDD 07.](https://reader033.fdocuments.in/reader033/viewer/2022051417/56649f325503460f94c4d80d/html5/thumbnails/20.jpg)
20
Experiments- Book Authors Dataset
• One book may make multiple errors
• Miss authors:– only provide subset of
all authors
![Page 21: Truth Discovery with Multiple Conflicting Information Providers on the Web KDD 07.](https://reader033.fdocuments.in/reader033/viewer/2022051417/56649f325503460f94c4d80d/html5/thumbnails/21.jpg)
21
Experiments- Query with Google
• Google is good at finding authoritative web sites.
• But do these web sites provide accurate information ?
• Compare the online bookstores– highest ranks by Google vs.
– highest trustworthiness found by TruthFinder
• Querying Google with “bookstore”– Find all bookstores that exist in their dataset from the top 300
Google results.
![Page 22: Truth Discovery with Multiple Conflicting Information Providers on the Web KDD 07.](https://reader033.fdocuments.in/reader033/viewer/2022051417/56649f325503460f94c4d80d/html5/thumbnails/22.jpg)
22
Experiments- Query with Google
![Page 23: Truth Discovery with Multiple Conflicting Information Providers on the Web KDD 07.](https://reader033.fdocuments.in/reader033/viewer/2022051417/56649f325503460f94c4d80d/html5/thumbnails/23.jpg)
23
Conclusion
• Veracity problem
• TRUTHFINDER Algorithm
![Page 24: Truth Discovery with Multiple Conflicting Information Providers on the Web KDD 07.](https://reader033.fdocuments.in/reader033/viewer/2022051417/56649f325503460f94c4d80d/html5/thumbnails/24.jpg)
24