Thomas van der Elsen, Richard Lawrence,
Jumi Oladimeji, Alastair Smith
Introduction
People increasingly publish their reactions to public events using a blog
  A tool that enables this info to be published quickly
  A journal that is available on the web
Need for effective data-mining techniques specific to blogs and similar tools (e.g. the Semantic Web)
“Our goal is to develop a method of capturing hot conversations by automating readers’ processes for characterizing and monitoring blogs.”
Overview
Data-mining techniques
  Creation of blog link structure
  Analysing link structure
Types of important bloggers
  Agitators
  Summarisers
Applications, analysis and conclusions
  Real-world applications and extensions
  Pros and cons of the paper
Crawling blogs
Extracting hyperlinks
Extracting blog threads
Crawling blogs
System crawls through the RSS list, registering for each entry:
  Title
  Permalink
  List entry date
Aggregator: gathers RSS feeds from multiple sources and organises them
OPML: file format used to share RSS feed lists
RSS: A format for distributing content on the web
[Diagram: aggregators, RSS list, RSS feeds, OPML]
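The crawl step can be sketched roughly as follows. This is a minimal illustration, assuming the RSS list is an OPML file whose outline elements carry an xmlUrl attribute and that each feed is RSS 2.0. The sample documents, URLs and function names are hypothetical and do not come from the paper, and network fetching is left out so the example stays self-contained.

```python
import xml.etree.ElementTree as ET

# Hypothetical OPML feed list (the "RSS list" shared between aggregators).
OPML = """<opml version="2.0"><body>
  <outline text="Blog A" type="rss" xmlUrl="http://example.com/a/rss.xml"/>
  <outline text="Blog B" type="rss" xmlUrl="http://example.com/b/rss.xml"/>
</body></opml>"""

# Hypothetical RSS 2.0 feed for one blog.
RSS = """<rss version="2.0"><channel><title>Blog A</title>
  <item>
    <title>First post</title>
    <link>http://example.com/a/1</link>
    <pubDate>Mon, 02 Jan 2006 10:00:00 GMT</pubDate>
    <description>Some text</description>
  </item>
</channel></rss>"""


def feed_urls(opml_text):
    """Return the feed URLs listed in an OPML document."""
    root = ET.fromstring(opml_text)
    return [o.get("xmlUrl") for o in root.iter("outline") if o.get("xmlUrl")]


def register_entries(rss_text):
    """Register title, permalink and date for every item in an RSS feed."""
    root = ET.fromstring(rss_text)
    return [
        {
            "title": item.findtext("title"),
            "permalink": item.findtext("link"),
            "date": item.findtext("pubDate"),
        }
        for item in root.iter("item")
    ]
```

Calling feed_urls(OPML) yields the list of feeds to visit, and register_entries(RSS) yields one record per blog entry.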
Extracting hyperlinks
Problem: Different tag structures per server
[Diagram: RSS feed from list, description, blog entries, hyperlink list]
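One way to build the hyperlink list from an entry's description is sketched below, assuming the description is an HTML fragment. The class and function names are hypothetical. A tolerant HTML parser copes with differences in case and quoting, but it does not by itself solve the per-server tag structure problem noted above.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every anchor tag that is fed in."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_hyperlinks(description_html):
    """Return the hyperlinks found in one entry description, in order."""
    collector = LinkCollector()
    collector.feed(description_html)
    collector.close()
    return collector.links
```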
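Extracting blog threads
[Flow chart: each hyperlink is handled as either a sourceLink or a replyLink. The chart includes a step that checks the links exist in thread data before adding, and two checks combined with &&.]
The two checks:
  Check departure URL exists in thread data
  Check destination URL points to entry on list
Action for each outcome of the two checks:
  11: Add destination entry to thread
  10: Add destination entry to entry list and add to thread
  01: Add departure entry to thread
  00: Create new thread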
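The four outcomes can be read as a small decision table. The sketch below assumes that the first bit is the departure-URL check and the second bit is the destination-URL check. That pairing is an interpretation of the flow chart, not something the slides state outright, and the function and variable names are hypothetical.

```python
# Decision table keyed by (departure URL exists in thread data,
# destination URL points to an entry on the list).
# ASSUMPTION: bit order and action pairing are inferred from the slide.
ACTIONS = {
    (True, True): "add destination entry to thread",
    (True, False): "add destination entry to entry list and add to thread",
    (False, True): "add departure entry to thread",
    (False, False): "create new thread",
}


def thread_action(departure_url, destination_url, thread_urls, entry_urls):
    """Pick the action for one hyperlink from the two membership checks."""
    departure_known = departure_url in thread_urls
    destination_listed = destination_url in entry_urls
    return ACTIONS[(departure_known, destination_listed)]
```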
Example Results
Agitators
Summarisers
Joe Bloggs
Agitators
Discussion stimulator
Threads often grow after an agitator’s entry
Three discriminants for an agitator
  Link (Agi1)
  Popularity (Agi2)
  Topic (Agi3)
The three discriminants can be weighted using the following formula:
Link-based discriminant
e_x is an agitator if k_x > θ_1
  e_x = a blog entry
  k_x = number of entries in thread_i with a replyLink to e_x
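As a rough illustration of the link-based test, the sketch below represents a thread's replyLinks as (source entry, target entry) pairs and counts the distinct entries that link to a given entry. The data model, the names and the threshold value are assumptions made for the example, not taken from the paper.

```python
def reply_in_count(entry, reply_links):
    """Count distinct entries in the thread with a replyLink to `entry`."""
    return len({src for src, dst in reply_links if dst == entry})


def is_link_agitator(entry, reply_links, theta1):
    """Link-based discriminant: the incoming reply count exceeds theta1."""
    return reply_in_count(entry, reply_links) > theta1


# Hypothetical thread: b and c reply to a (c links twice), d replies to b.
links = [("b", "a"), ("c", "a"), ("d", "b"), ("c", "a")]
```

With these links and a threshold of 1, entry a qualifies (two distinct repliers) while entry b does not (one replier).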
Popularity-based discriminant
e_x is an agitator if l_x / m_x > θ_2
  e_x = a blog entry
  l_x = number of entries in thread_i published t days after e_x
  m_x = number of entries in thread_i published t days before e_x
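The popularity test can be sketched as below. It assumes "t days after" and "t days before" mean within a window of t days on either side of the entry, and it compares the after-count against θ2 times the before-count instead of dividing, so an entry with no earlier posts in the window qualifies whenever anything follows it. Both choices, and all the names, are assumptions for illustration.

```python
from datetime import date, timedelta


def popularity_counts(entry_date, thread_dates, t_days):
    """Return (after, before): other thread entries within t days on each side."""
    window = timedelta(days=t_days)
    after = sum(1 for d in thread_dates if entry_date < d <= entry_date + window)
    before = sum(1 for d in thread_dates if entry_date - window <= d < entry_date)
    return after, before


def is_popularity_agitator(entry_date, thread_dates, t_days, theta2):
    """Popularity-based discriminant, written as after > theta2 * before."""
    after, before = popularity_counts(entry_date, thread_dates, t_days)
    return after > theta2 * before


# Hypothetical thread: publication dates of the other entries.
others = [date(2006, 1, 8), date(2006, 1, 11), date(2006, 1, 12),
          date(2006, 1, 13), date(2006, 1, 30)]
```

For an entry published on 10 January 2006 with t = 3, three entries fall in the window after it and one before it, so the entry passes with a threshold of 2 but not with a threshold of 3.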
Topic-based discriminant
e_x is an agitator if
  e_x = a blog entry
  n = number of entries
Summarisers
Publish entries that collate and compact previous posts
Provide a convenient way of digesting an entire thread
The discriminant for summarisers is link-based:
e_x is a summariser if p_x > θ_4
  e_x = a blog entry
  p_x = number of entries in thread_i that have a replyLink from e_x
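The summariser test mirrors the link-based agitator test but counts outgoing links. The sketch below again assumes a thread's replyLinks are stored as (source entry, target entry) pairs. The representation, names and threshold are hypothetical.

```python
def reply_out_count(entry, reply_links):
    """Count distinct entries in the thread that `entry` has a replyLink to."""
    return len({dst for src, dst in reply_links if src == entry})


def is_summariser(entry, reply_links, theta4):
    """Summariser discriminant: the outgoing reply count exceeds theta4."""
    return reply_out_count(entry, reply_links) > theta4


# Hypothetical thread: entry s links back to a, b and c; b replies to a.
thread_links = [("s", "a"), ("s", "b"), ("s", "c"), ("b", "a")]
```

With a threshold of 2, entry s qualifies as a summariser (three outgoing links) and entry b does not (one outgoing link).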
Applications
Pros and Cons
Conclusions
Applications
Supplementary info, e.g. TV, news sites, etc.
  Agitator: Home and Away – who shot Josh West
  Summariser: Sports, etc. – used by studios and media to highlight points of interest in a match
Analysis – Pros
Basis for future research – a brief intro to the subject
  Multiple thread analysis
  Identification of areas of bloggers’ expertise
Highly effective in certain specific areas
  News and reviews
Implementation of theory (feature vector)
Analysis – Cons
Only 25 sites used in sample (but 1000s of blogs)
Does not take context into consideration
  E.g., an agitator may be posting offensive entries
No measurement of summary success
Comments are not analysed
Inappropriate for certain areas
  MySpace, Bebo, et al. (due to target audience)
Conclusions
Created a data-mining framework for future research
May instigate research into further work
Nice idea and potentially useful but needs to be extended
Thank you for your time