Tracking Information Epidemics in Blogspace A paper synopsis Alistair Wright, Ken Tan, Kisan Kansagra, Jenn Houston

Page 1

Tracking Information Epidemics in Blogspace

A paper synopsis

Alistair Wright, Ken Tan,

Kisan Kansagra, Jenn Houston

Page 2

Contents

• Introduction

• Terminology

• Spread of URLs

• Inferring Infection Routes

• Visualisation

• Discussion

• Conclusion

Page 3

Introduction

• What is a blog?

– First appeared in 1994

– Term “blog” coined by Peter Merholz in early 1999

– 60 million blogs as of November 2006

• Information often republished by other blog users

Page 4

Introduction

• Blogs form a complex social structure

• Propagation of information could be visualised as “infection”

• Paper aims to track infection through blogspace and determine the original source

• Most closely related work is on the spread of foot-and-mouth disease

Page 5

Terminology

• Meme

• Infected

• Patient zero

• Infection inference

• Infection tree

Page 6

Spread of URLs

• Infection: www.giantmicrobes.com

• Data source: www.blogpulse.com

Page 7

Spread of URLs

• Do not expect all blogs which mention a given URL to have seen it at the source

• Aim is to determine the infection source for any given blog

• Most URLs appearing on blogs are free-floating

– From external channels, different URLs for same page

• Cannot guarantee links with timelines and infection inference but can rule out some possibilities and find the most plausible
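
The idea of ruling out possibilities from timelines can be sketched as follows. This is a minimal illustration, not the paper's method: the function name, the data layout and the example dates are all assumptions, and treating same-day mentions as ruled out is a simplification.

```python
from datetime import date

def plausible_sources(first_seen, blog):
    """Return blogs that mentioned the URL strictly before `blog` did.

    first_seen maps a blog name to the date of its first mention of the URL.
    Blogs that mentioned it on the same day or later are ruled out.
    """
    t = first_seen[blog]
    earlier = [b for b, seen in first_seen.items() if b != blog and seen < t]
    # Earliest mentions first, so a candidate "patient zero" leads the list.
    return sorted(earlier, key=lambda b: first_seen[b])

# Hypothetical first-mention dates for one URL.
first_seen = {
    "blog_a": date(2003, 5, 3),
    "blog_b": date(2003, 5, 1),
    "blog_c": date(2003, 5, 2),
    "blog_d": date(2003, 5, 4),
}

print(plausible_sources(first_seen, "blog_a"))
```

Timing alone only narrows the candidates; choosing the most plausible one among them still needs the link and similarity evidence discussed on the following pages.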

Page 8

Spread of URLs

• Blogrolls

– Two-way links to other blogs (e.g. trackbacks)

– One user links to another’s blog and that automatically links back to the original

• Frequently find no explicit links to explain infection

– Via links very rare

Page 9

Inferring Infection Routes

• Where explicit links are not present, use 5 classifiers to infer likely routes

– Number of blog-blog links in common

– Number of blog-non-blog links in common

– Text similarity

– Order and frequency of repeated infections

– In- and out-link counts for both blogs
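
The link-based items in this list can be illustrated with a short sketch. It assumes each blog's out-links are available as a set of URLs and that a set of known blog URLs exists; the function name, dictionary keys and example hosts are hypothetical, and only out-link counts are shown.

```python
def link_features(blog_a, blog_b, out_links, blog_urls):
    """Count shared out-links for a blog pair, split by target type.

    out_links maps a blog to the set of URLs it links to.
    blog_urls is the set of URLs known to be blogs.
    """
    common = out_links[blog_a] & out_links[blog_b]
    return {
        "common_blog_links": len(common & blog_urls),
        "common_non_blog_links": len(common - blog_urls),
        "out_links_a": len(out_links[blog_a]),
        "out_links_b": len(out_links[blog_b]),
    }

# Hypothetical crawl data.
out_links = {
    "a.example": {"c.example", "d.example", "news.example/story", "shop.example"},
    "b.example": {"c.example", "e.example", "news.example/story", "shop.example"},
}
blog_urls = {"a.example", "b.example", "c.example", "d.example", "e.example"}

features = link_features("a.example", "b.example", out_links, blog_urls)
print(features)
```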

Page 10

Inferring Infection Routes

• Classify blogs’ likeliness to be linked based on similarity

– Blog-blog and blog-non-blog links:

– Textual similarity: Term Frequency-Inverse Document Frequency weighted vector

• Features obtained from full text and differential text crawls
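
As a rough illustration of the textual similarity feature, the sketch below builds TF-IDF weighted vectors for tokenised blog texts and compares them. Cosine similarity is assumed as the comparison measure, and the function names and example texts are invented for the example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build one TF-IDF weighted term vector (a dict) per tokenised document."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical blog texts, already split into words.
docs = [
    "giant microbes plush toys".split(),
    "plush giant microbes gift".split(),
    "election news politics today".split(),
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

The first two texts share most of their terms and score above zero; the third shares none with the first and scores exactly zero.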

Page 11

Inferring Infection Routes

• Similarity features often useful in predicting the existence of a link

Page 12

Inferring Infection Routes

• Classify explicit links’ likeliness to participate in infection

• Infection six times more likely to happen again where it has happened previously

% Blog Pairs Citing 1 Common URL

Link type   Same   A > B   A < B   Either
A↔B         17.4   24.5    24.5    45
A→B         10.9   22.9    17.0    36
None         0.6    1.5     1.3     3

Page 13

Inferring Infection Routes

• Likeliness of links to participate in infection not generally linked to similarity of blogs

Page 14

Inferring Infection Routes

• First link classifier used with a three-class SVM performed with only 57% accuracy

– Difficult to distinguish reciprocated and unreciprocated links

• Second link classifier performed better

– SVM: 91.2% accuracy

– Logistic regression: 91.9% accuracy but based on fewer factors

Page 15

Inferring Infection Routes

• Additional classifiers were created for plausible infection routes from links

– Logistic regression: up to 77% accuracy

– SVM: up to 71.5% accuracy

• Accuracy depended on which subset of classifiers was selected

Page 16

Visualisation

• From inferred routes, can construct infection trees

• Directed Acyclic Graph (DAG) created for each URL

• Thinned out to make it more manageable

• Label each link with an inference score and dynamically control the display
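
One simple way to control the display from the inference scores is a score threshold, sketched below. The threshold mechanism, the function name and the example scores are assumptions for illustration; the slides do not say how the display control works.

```python
def visible_links(scored_links, min_score):
    """Keep only the links whose inference score reaches min_score.

    scored_links is a list of (source_blog, infected_blog, score) tuples.
    """
    return [(src, dst, s) for src, dst, s in scored_links if s >= min_score]

# Hypothetical scored links for one URL.
links = [
    ("blog_b", "blog_a", 0.9),
    ("blog_c", "blog_a", 0.4),
    ("blog_b", "blog_d", 0.7),
]

print(visible_links(links, 0.5))
```

Raising the threshold thins the graph further; lowering it reveals weaker, less certain routes.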

Page 17

Visualisation

Sparse Tree Algorithm:

For blog A and URLx, collect sets of blogs, B:

– indicated by A as explicit sources of URLx

– explicitly linked to A and also infected by a common URLx

– with an unreciprocated link to A that were infected by URLx prior to A

– inferred by the classifier with timing restrictions

Page 18

Visualisation

• For each blog A infected by URLx and for the first non-empty set, draw a link to each blog B in that set

• If more than one link exists between A and a previously infected blog, use the classifier score to remove all but the highest scoring link

• Note: doesn’t guarantee an “upward” link for each blog
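
The selection steps of the sparse tree algorithm can be sketched as below. This is a reading of the slides under stated assumptions, not the authors' implementation: the candidate sets are assumed to be precomputed in priority order, classifier scores are assumed to be keyed by (source, infected) pairs, and all names and values are hypothetical.

```python
def sparse_tree_links(candidate_sets, scores):
    """Choose at most one source per infected blog.

    candidate_sets maps blog A to a list of candidate source sets,
    ordered from strongest to weakest evidence.
    scores maps (B, A) pairs to a classifier score.
    Returns a dict mapping A to its chosen source, or None if all sets are empty.
    """
    links = {}
    for blog, sets in candidate_sets.items():
        # Use the first non-empty set only.
        first = next((s for s in sets if s), None)
        if first is None:
            links[blog] = None  # no "upward" link for this blog
            continue
        # Keep only the highest scoring link; sorting makes ties deterministic.
        links[blog] = max(sorted(first), key=lambda b: scores.get((b, blog), 0.0))
    return links

# Hypothetical candidate sets and classifier scores for one URL.
candidate_sets = {
    "blog_a": [set(), {"blog_b", "blog_c"}, set(), {"blog_d"}],
    "blog_b": [set(), set(), set(), set()],
    "blog_c": [{"blog_b"}, set(), set(), set()],
}
scores = {
    ("blog_b", "blog_a"): 0.6,
    ("blog_c", "blog_a"): 0.8,
    ("blog_b", "blog_c"): 0.5,
}

tree = sparse_tree_links(candidate_sets, scores)
print(tree)
```

Here blog_b ends up with no source, which matches the note that an “upward” link is not guaranteed for every blog.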

Page 19

Visualisation

• Further refinement uses via data to incorporate “hidden” blogs

• Both types of graphs are available as a web service for any user

Page 20

Visualisation

• Giant Microbes Infection Tree:

• CNN News Story Infection Tree:

Page 21

Discussion

• Incompleteness of crawl

• Small dataset

• Unknown robustness of classifiers

• Meme residing at multiple URLs

Page 22

Discussion

• Novel application of “infection” model to blogspace

• Useful visualisation tool developed

• Further research into influence of graph structure on spread of infection

• Could be useful for blog search engines

Page 23

Conclusion

• Difficult objectives achieved to a limited extent

• Problems with dataset affect significance of work

• Further work required to fully determine usefulness of technique

Page 24

Summary

• Introduction

• Terminology

• Spread of URLs

• Inferring Infection Routes

• Visualisation

• Discussion

Page 25

Any questions?