Tracking Information Epidemics in Blogspace
A paper synopsis
Alistair Wright, Ken Tan,
Kisan Kansagra, Jenn Houston
Contents
• Introduction
• Terminology
• Spread of URLs
• Inferring Infection Routes
• Visualisation
• Discussion
• Conclusion
Introduction
• What is a blog?
– First appeared in 1994
– Term “blog” coined by Peter Merholz in early 1999
– 60 million blogs as of November 2006
• Information often republished by other blog users
Introduction
• Blogs form a complex social structure
• Propagation of information can be visualised as “infection”
• Paper aims to track infection through blogspace and determine the original source
• Most closely related work is on the spread of foot-and-mouth disease
Terminology
• Meme
• Infected
• Patient zero
• Infection inference
• Infection tree
Spread of URLs
• Infection: www.giantmicrobes.com
• Data source: www.blogpulse.com
Spread of URLs
• Do not expect every blog that mentions a given URL to have seen it at the source
• Aim is to determine the infection source for any given blog
• Most URLs appearing on blogs are free-floating
– From external channels, different URLs for same page
• Timelines and infection inference cannot guarantee links, but can rule out some possibilities and find the most plausible
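The timeline constraint above can be sketched in a few lines: a blog can only have caught a URL from a blog that mentioned it earlier. This is a minimal illustration, not the paper's code; the blog names, dates, and the `plausible_sources` helper are hypothetical, and treating same-day mentions as ruled out by default is an assumption.

```python
from datetime import date

def plausible_sources(target, first_mention, allow_same_day=False):
    """Return blogs that mentioned the URL before `target` did.

    first_mention maps blog name -> date of its first mention of the URL.
    Same-day mentions are excluded unless allow_same_day is True
    (an assumption; day-level timestamps make that case ambiguous).
    """
    t = first_mention[target]
    sources = []
    for blog, day in first_mention.items():
        if blog == target:
            continue
        if day < t or (allow_same_day and day == t):
            sources.append(blog)
    return sorted(sources)

# Hypothetical first-mention dates for one URL.
first_mention = {
    "blog_a": date(2004, 5, 3),
    "blog_b": date(2004, 5, 1),
    "blog_c": date(2004, 5, 7),
}

print(plausible_sources("blog_a", first_mention))
```

The filter only rules candidates out; choosing the most plausible source among the survivors is left to the inference step.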
Spread of URLs
• Blogrolls
– Two-way links to other blogs (e.g. trackbacks)
– One user links to another’s blog and that automatically links back to the original
• Frequently find no explicit links to explain infection
– Via links very rare
Inferring Infection Routes
• Where explicit links are not present, use 5 classifiers to infer likely routes
– Number of blog-blog links in common
– Number of blog-non-blog links in common
– Text similarity
– Order and frequency of repeated infections
– In- and out-link counts for both blogs
Inferring Infection Routes
• Classify blogs’ likelihood of being linked based on similarity
– Blog-blog and blog-non-blog links
– Textual similarity: Term Frequency-Inverse Document Frequency (TF-IDF) weighted vector
• Features obtained from full text and differential text crawls
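As a rough illustration of the textual similarity feature, the sketch below builds TF-IDF weighted vectors and compares them with cosine similarity. The slides name only the TF-IDF weighting, so the use of cosine similarity, the plain `log(N / df)` weighting, the toy token lists, and the function names are all assumptions made here.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs maps blog name -> list of tokens; returns name -> {term: weight}."""
    n = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))  # document frequency: count each term once per blog
    vectors = {}
    for name, tokens in docs.items():
        tf = Counter(tokens)
        vectors[name] = {t: c * math.log(n / df[t]) for t, c in tf.items()}
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical blog texts, already tokenised.
docs = {
    "blog_a": "giant microbes plush toys".split(),
    "blog_b": "plush giant microbes gift".split(),
    "blog_c": "election news coverage today".split(),
}
vecs = tfidf_vectors(docs)
sim_ab = cosine(vecs["blog_a"], vecs["blog_b"])
sim_ac = cosine(vecs["blog_a"], vecs["blog_c"])
print(sim_ab > sim_ac)
```

The same `cosine` function could in principle be applied to binary vectors of shared links, which would give a comparable score for the link-based features.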
Inferring Infection Routes
• Similarity features often useful in predicting the existence of a link
Inferring Infection Routes
• Classify explicit links’ likelihood of participating in infection
• Infection six times more likely to happen again where it has happened previously
% Blog Pairs Citing 1 Common URL
Link type            Same   A > B   A < B   Either
AB (reciprocated)    17.4   24.5    24.5    45
AB (unreciprocated)  10.9   22.9    17.0    36
None                  0.6    1.5     1.3     3
Inferring Infection Routes
• Likelihood of a link participating in infection is not generally related to the similarity of the blogs
Inferring Infection Routes
• First link classifier, a three-class SVM, achieved only 57% accuracy
– Difficult to distinguish reciprocated and unreciprocated links
• Second link classifier performed better
– SVM: 91.2% accuracy
– Logistic regression: 91.9% accuracy but based on fewer factors
Inferring Infection Routes
• Additional classifiers were created for plausible infection routes from links
– Logistic regression: up to 77% accuracy
– SVM: up to 71.5% accuracy
• Accuracy depended on which subset of classifiers was selected
Visualisation
• From inferred routes, can construct infection trees
• Directed Acyclic Graph (DAG) created for each URL
• Thinned out to make it more manageable
• Label each link with an inference score and dynamically control the display
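One simple way to picture the score-controlled display is a threshold filter over scored edges. This is only a sketch: the edge list, the scores, and the `visible_edges` helper are hypothetical, and the actual tool may control the display differently.

```python
def visible_edges(edges, min_score):
    """Keep only edges whose inference score reaches the display threshold.

    edges is a list of (source_blog, infected_blog, score) tuples.
    """
    return [(src, dst, s) for src, dst, s in edges if s >= min_score]

# Hypothetical scored edges for one URL's graph.
edges = [
    ("blog_a", "blog_b", 0.9),
    ("blog_a", "blog_c", 0.4),
    ("blog_b", "blog_c", 0.7),
]

print(visible_edges(edges, 0.5))
```

Raising the threshold thins the graph further; lowering it reveals weaker inferred links.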
Visualisation
Sparse Tree Algorithm:
For blog A and URLx, collect sets of blogs, B:
– indicated by A as explicit sources of URLx
– explicitly linked to A and also infected by a common URLx
– with an unreciprocated link to A that were infected by URLx prior to A
– inferred by the classifier with timing restrictions
Visualisation
• For each blog A infected by URLx and for the first non-empty set, draw a link to each blog B in that set
• If more than one link exists between A and a previously infected blog, use the classifier score to remove all but the highest scoring link
• Note: doesn’t guarantee an “upward” link for each blog
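The sparse tree steps can be sketched as below. This is one reading of the slides rather than the authors' implementation: it assumes the four candidate sets have already been collected for each blog in priority order, and it interprets the pruning rule as keeping only the highest-scoring candidate from the first non-empty set. The names `sparse_tree`, `candidate_sets`, and `score` are hypothetical.

```python
def sparse_tree(infected, candidate_sets, score):
    """Pick at most one parent per infected blog.

    infected: list of blog names infected by the URL.
    candidate_sets: blog -> list of candidate-source sets, highest priority first.
    score: (source, target) -> classifier score; missing pairs count as 0.0.
    Returns blog -> chosen parent, or None when every set is empty.
    """
    parents = {}
    for blog in infected:
        parents[blog] = None
        for candidates in candidate_sets.get(blog, []):
            if candidates:
                # Use only the first non-empty set; sorting makes ties deterministic.
                parents[blog] = max(
                    sorted(candidates),
                    key=lambda src: score.get((src, blog), 0.0),
                )
                break
    return parents

# Hypothetical data: four infected blogs, four priority sets each.
infected = ["a", "b", "c", "d"]
candidate_sets = {
    "a": [set(), set(), set(), set()],
    "b": [{"a"}, set(), set(), set()],
    "c": [set(), {"a", "b"}, set(), {"d"}],
    "d": [set(), set(), set(), set()],
}
score = {("a", "b"): 0.9, ("a", "c"): 0.4, ("b", "c"): 0.7}

tree = sparse_tree(infected, candidate_sets, score)
print(tree)
```

Blogs `a` and `d` end up without a parent, which matches the note that an “upward” link is not guaranteed for every blog.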
Visualisation
• Further refinement uses via data to incorporate “hidden” blogs
• Both types of graphs are available as a web service for any user
Visualisation
• Giant Microbes Infection Tree (figure)
• CNN News Story Infection Tree (figure)
Discussion
• Incompleteness of crawl
• Small dataset
• Unknown robustness of classifiers
• Meme residing at multiple URLs
Discussion
• Novel application of “infection” model to blogspace
• Useful visualisation tool developed
• Further research into influence of graph structure on spread of infection
• Could be useful for blog search engines
Conclusion
• Difficult objectives achieved to a limited extent
• Problems with dataset affect significance of work
• Further work required to fully determine usefulness of technique
Summary
• Introduction
• Terminology
• Spread of URLs
• Inferring Infection Routes
• Visualisation
• Discussion
Any questions?