Tracking Information Epidemics in Blogspace A paper synopsis Alistair Wright, Ken Tan, Kisan...

Tracking Information Epidemics in Blogspace

A paper synopsis

Alistair Wright, Ken Tan, Kisan Kansagra, Jenn Houston

Contents

• Introduction

• Terminology

• Spread of URLs

• Inferring Infection Routes

• Visualisation

• Discussion

• Conclusion

Introduction

• What is a blog?

– First appeared in 1994

– Term “blog” coined by Peter Merholz in early 1999

– 60 million blogs as of November 2006

• Information often republished by other blog users

Introduction

• Blogs form a complex social structure

• Propagation of information can be visualised as “infection”

• Paper aims to track infection through blogspace and determine the original source

• Most closely related work concerns the spread of foot-and-mouth disease

Terminology

• Meme

• Infected

• Patient zero

• Infection inference

• Infection tree

Spread of URLs

• Infection: www.giantmicrobes.com

• Data source: www.blogpulse.com

Spread of URLs

• Do not expect all blogs which mention a given URL to have seen it at the source

• Aim is to determine the infection source for any given blog

• Most URLs appearing on blogs are free-floating

– From external channels, different URLs for same page

• Cannot guarantee links with timelines and infection inference, but can rule out some possibilities and find the most plausible

Spread of URLs

• Blogrolls

– Two-way links to other blogs (e.g. trackbacks)

– One user links to another’s blog and that automatically links back to the original

• Frequently find no explicit links to explain infection

– Via links very rare

Inferring Infection Routes

• Where explicit links are not present, use classifiers built on five features to infer likely routes

– Number of blog-blog links in common

– Number of blog-non-blog links in common

– Text similarity

– Order and frequency of repeated infections

– In- and out-link counts for both blogs
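
The two shared-link features listed above can be sketched as a set intersection. This is a minimal illustration, not the paper's implementation: the function name `common_link_counts`, the `known_blogs` set and the example hosts are all hypothetical.

```python
def common_link_counts(links_a, links_b, blog_urls):
    """Count links two blogs share, split into blog and non-blog targets.

    links_a, links_b: iterables of URLs each blog links to (assumed input).
    blog_urls: set of URLs known to be blogs (assumed to come from the crawl).
    """
    shared = set(links_a) & set(links_b)
    blog_blog = sum(1 for url in shared if url in blog_urls)
    return {"blog_blog": blog_blog, "blog_non_blog": len(shared) - blog_blog}


# Toy data, invented for illustration only.
known_blogs = {"blog-c.example", "blog-d.example"}
links_of_a = {"blog-c.example", "blog-d.example", "news.example/story"}
links_of_b = {"blog-c.example", "news.example/story", "shop.example"}

features = common_link_counts(links_of_a, links_of_b, known_blogs)
print(features)
```

Raw counts are shown here for simplicity; any normalisation of these counts is left out because the slides do not preserve it.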

Inferring Infection Routes

• Classify blogs’ likelihood of being linked based on similarity

– Blog-blog and blog-non-blog links in common

– Textual similarity: Term Frequency-Inverse Document Frequency (TF-IDF) weighted vector

• Features obtained from full text and differential text crawls
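
As a rough illustration of the textual similarity feature, the sketch below builds TF-IDF weighted vectors and compares them with cosine similarity. The weighting variant (raw term count times log(N/df)), the use of cosine similarity, the whitespace tokeniser and the sample posts are assumptions made for this example; the slides do not state which variant the paper uses.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, weight = tf * log(N / df)."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        vectors.append({t: c * math.log(n_docs / doc_freq[t]) for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Invented sample posts: two on the same topic, one unrelated.
posts = [
    "giant microbes plush toys are cute",
    "plush giant microbes make odd gifts",
    "election coverage and polling numbers",
]
vecs = tfidf_vectors(posts)
sim_related = cosine(vecs[0], vecs[1])
sim_unrelated = cosine(vecs[0], vecs[2])
print(sim_related > sim_unrelated)
```

The two posts about the same topic share weighted terms and score above zero, while the unrelated pair shares no terms and scores zero.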

Inferring Infection Routes

• Similarity features often useful in predicting the existence of a link

Inferring Infection Routes

• Classify explicit links’ likelihood of participating in infection

• Infection six times more likely to happen again where it has happened previously

% Blog Pairs Citing 1 Common URL

Link type   Same   A > B   A < B   Either
A ↔ B       17.4   24.5    24.5    45
A → B       10.9   22.9    17.0    36
None         0.6    1.5     1.3     3

Inferring Infection Routes

• Likelihood of a link participating in infection is not generally related to the similarity of the blogs

Inferring Infection Routes

• First link classifier, a three-class SVM, achieved only 57% accuracy

– Difficult to distinguish reciprocated and unreciprocated links

• Second link classifier performed better

– SVM: 91.2% accuracy

– Logistic regression: 91.9% accuracy but based on fewer factors

Inferring Infection Routes

• Additional classifiers were created for plausible infection routes from links

– Logistic regression: up to 77% accuracy

– SVM: up to 71.5% accuracy

• Accuracy depended on which subset of classifiers was selected
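
To make the logistic regression step concrete, here is a minimal pure-Python sketch trained by stochastic gradient descent. It is not the paper's model: the two features (text similarity and a shared-link score), the toy rows and labels, and the learning rate and epoch count are all invented for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(rows, labels, lr=0.5, epochs=1000):
    """Fit weights and bias by per-example gradient descent on log loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = p - y  # gradient of log loss with respect to the logit
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def predict(weights, bias, x):
    """Return the estimated probability that the pair is a plausible route."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


# Hypothetical feature rows: (text similarity, shared-link score) per blog pair.
rows = [(0.9, 0.8), (0.8, 0.6), (0.7, 0.9), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2)]
labels = [1, 1, 1, 0, 0, 0]  # 1 = plausible route, 0 = not (toy labels)

weights, bias = train_logistic(rows, labels)
predictions = [predict(weights, bias, x) > 0.5 for x in rows]
print(predictions)
```

Training on a different subset of feature columns is a matter of slicing each row, which is one way to see how accuracy can depend on the subset chosen.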

Visualisation

• From inferred routes, can construct infection trees

• Directed Acyclic Graph (DAG) created for each URL

• Thinned out to make it more manageable

• Label each link with an inference score and dynamically control the display
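
Controlling the display by inference score can be pictured as a simple threshold filter over scored edges. The edge tuples, blog names and scores below are hypothetical; this is only a sketch of the idea, not the tool described in the paper.

```python
def visible_edges(edges, min_score):
    """Keep only edges whose inference score meets the display threshold."""
    return [(src, dst, score) for src, dst, score in edges if score >= min_score]


# Invented scored edges: (source blog, infected blog, inference score).
edges = [
    ("blogB", "blogA", 0.92),
    ("blogC", "blogA", 0.40),
    ("blogC", "blogD", 0.75),
]

strict = visible_edges(edges, 0.7)
print(strict)
```

Raising or lowering `min_score` thins or thickens the graph without recomputing any scores.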

Visualisation

Sparse Tree Algorithm:

For blog A and URLx, collect sets of blogs B:

– indicated by A as explicit sources of URLx

– explicitly linked to A and also infected by a common URLx

– with an unreciprocated link to A that were infected by URLx prior to A

– inferred by the classifier with timing restrictions

Visualisation

• For each blog A infected by URLx and for the first non-empty set, draw a link to each blog B in that set

• If more than one link exists between A and a previously infected blog, use the classifier score to remove all but the highest scoring link

• Note: doesn’t guarantee an “upward” link for each blog
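
The sparse tree steps above can be sketched as follows. This is one reading of the slides, not the authors' code: the input structures (`infected`, `candidate_sets`, `score`) are hypothetical, and the pruning rule is interpreted as keeping only the highest-scoring link from A to an earlier-infected blog.

```python
def sparse_tree(infected, candidate_sets, score):
    """Build a sparse infection graph for one URL.

    infected: dict blog -> infection time for this URL.
    candidate_sets: dict blog -> list of candidate source sets, in priority
        order (explicit via, linked and infected, unreciprocated and earlier,
        classifier-inferred).
    score: dict (source, blog) -> classifier score.
    Returns a list of (source, blog) edges.
    """
    edges = []
    for blog in infected:
        # Take the first non-empty candidate set, or nothing if all are empty.
        chosen = next((s for s in candidate_sets.get(blog, []) if s), set())
        links = [(src, blog) for src in chosen]
        # Links to blogs infected before this one are candidates for pruning.
        earlier = [e for e in links if infected.get(e[0], float("inf")) < infected[blog]]
        if len(earlier) > 1:
            best = max(earlier, key=lambda e: score.get(e, 0.0))
            links = [e for e in links if e not in earlier or e == best]
        edges.extend(sorted(links))
    return edges


# Toy example: P is infected first, then Q, then R.
infected = {"P": 1, "Q": 2, "R": 3}
candidate_sets = {
    "P": [set(), set(), set(), set()],
    "Q": [{"P"}, set(), set(), set()],
    "R": [set(), {"P", "Q"}, set(), set()],
}
score = {("P", "R"): 0.3, ("Q", "R"): 0.8}

tree = sparse_tree(infected, candidate_sets, score)
print(tree)
```

In this toy run P ends up with no incoming edge, which matches the note that an "upward" link is not guaranteed for every blog.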

Visualisation

• Further refinement uses via data to incorporate “hidden” blogs

• Both types of graph are available as a web service for any user

Visualisation

• Figure: Giant Microbes infection tree

• Figure: CNN news story infection tree

Discussion

• Incompleteness of crawl

• Small dataset

• Unknown robustness of classifiers

• Meme residing at multiple URLs


Discussion

• Novel application of “infection” model to blogspace

• Useful visualisation tool developed

• Further research into influence of graph structure on spread of infection

• Could be useful for blog search engines

Conclusion

• Difficult objectives achieved to a limited extent

• Problems with dataset affect significance of work

• Further work required to fully determine usefulness of technique

Summary

• Introduction

• Terminology

• Spread of URLs

• Inferring Infection Routes

• Visualisation

• Discussion

Any questions?
