Identifying and Ranking Topic Clusters in the Blogosphere
Muhammad Atif Qureshi, Korea Advanced Institute of Science and Technology
Arjumand Younus, Korea Advanced Institute of Science and Technology
Muhammad Saeed, University of Karachi
Nasir Touheed, Institute of Business Administration
Outline
1. Introduction
2. Approach
3. Experiments and Results
4. Conclusions
COLING 2010 CCSR WORKSHOP
Web 1.0 to Web 2.0
Paradigm shift
- From a read-only Web to a read-write Web
- Increased user participation
- User-generated content
  - Wikis (Wikipedia, Wiktionary)
  - Social networking sites (Facebook, Myspace, Twitter)
  - Digital media sharing websites (YouTube, Flickr)
  - Blogs (Blogspot, Wordpress)
The Blogosphere
Blogs empower people to voice their opinions and share their ideas.
Bloggers can also link to other blogs, forming a social network of bloggers who share an interest in the same topics.
How can we identify these topic clusters? Who is the most influential blogger in a given cluster?
Problem Definition
Given the blogosphere, with blogs containing diverse information on a broad range of topics:
- Find the cluster of blogs that are interested in some particular topic.
- Which blog holds the greatest influence for that particular topic?
Link Based Methods for the Blogosphere
Link-based methods don’t work well for the blogosphere:
- Weakly linked nature of blog pages
- Blog posts need some time to get in-links
- Bloggers try to exploit link-based methods by assuming the role of spammers
Blog Communities vs. Topic Clusters
Blog community
- Discovered by following blog threads’ discussions
Topic clusters
- Role of blogs as a conversational medium diminished
- Bloggers having interest in a specific topic form a socially linked network with other bloggers writing about the same topic
Blog Dimensions
A blog is considered along three dimensions:
- Part of speech
- Occurrence
- Blog post number
Topic Discussion Isolation Rank
Metric used to discover the topic clusters
Based on a set of given topic words and some linguistic rules
We define the TDIR score of a blog as follows:
$$\mathrm{TDIR} = \frac{1}{\text{Number of total posts}} \times \left[ (n_{noun} \times w_{n}) + (n_{adjective} \times w_{adj}) + (n_{adverb} \times w_{adv}) \right]$$
$n_{noun}$, $n_{adjective}$ and $n_{adverb}$ are, respectively, the number of times a noun, adjective or adverb for a specific topic is found in all the blog posts
$w_{n}$, $w_{adj}$ and $w_{adv}$ are the respective weights assigned to the noun, adjective and adverb for a specific topic
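A minimal Python sketch of the TDIR computation, assuming the score is a weighted sum of the three part-of-speech counts divided by the total number of posts. The function name `tdir_score`, the argument names, and the default weights of 1.0 are illustrative assumptions; the slides do not state the weight values.

```python
def tdir_score(n_noun, n_adjective, n_adverb, total_posts,
               w_n=1.0, w_adj=1.0, w_adv=1.0):
    """Weighted count of topic words, normalized by the number of posts.

    The default weights are placeholders, not values from the slides.
    """
    if total_posts <= 0:
        raise ValueError("total_posts must be positive")
    weighted_hits = (n_noun * w_n) + (n_adjective * w_adj) + (n_adverb * w_adv)
    return weighted_hits / total_posts


# Hypothetical blog: 4 posts with 4 noun, 2 adjective and 2 adverb matches.
example = tdir_score(n_noun=4, n_adjective=2, n_adverb=2, total_posts=4)
print(example)
```

Under this reading, a blog that never mentions the topic scores 0, and a blog with many off-topic posts is penalized by the normalization.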
Topic Discussion Rank
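Metric used to rank the blogs within a topic cluster
Based on the hyperlinked social network of blogs and on blog post contents
We define the TDR score of a blog as follows:
$$\mathrm{TDR}(b) = \begin{cases} \mathrm{TDIR}(b), & \text{if zero outlinks from blog } b \\ \mathrm{TDIR}(b) + \sum_{o:(o,b)} \left[ \frac{\text{Matching\_Outlinks}}{\text{Total\_Outlinks}} \times \mathrm{TDIR}(o) \times \mathrm{damp} \right], & \text{otherwise} \end{cases}$$
Matching_Outlinks represents the outlinked blogs that are part of the topic cluster
o : (o,b) denotes the outlinks from blog b
damp is the damping factor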
Role of Damping Factor
Assume TDIR of blog A is 2 and TDIR of blog B is 1
TDR without damping factor
- A: 2 + (1/1 x 1) = 3
- B: 1 + (1/1 x 2) = 3
TDR with damping factor
- A: 2 + (1/1 x 1 x 0.9) = 2.9
- B: 1 + (1/1 x 2 x 0.9) = 2.8
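The arithmetic above can be reproduced with a short Python sketch. It assumes that blogs A and B link only to each other, that both belong to the topic cluster, and that damp is 0.9 as in the calculation. The function name `tdr_score` and the dictionary layout are hypothetical, and scaling each outlinked blog's TDIR by the Matching_Outlinks / Total_Outlinks ratio is one reading of the TDR definition, not a confirmed implementation.

```python
def tdr_score(blog, tdir, outlinks, cluster, damp=0.9):
    """TDR of `blog`: its own TDIR plus damped TDIR of the blogs it links to."""
    links = outlinks.get(blog, [])
    if not links:
        # Zero outlinks: TDR falls back to the blog's own TDIR.
        return tdir[blog]
    matching = sum(1 for o in links if o in cluster)
    ratio = matching / len(links)  # Matching_Outlinks / Total_Outlinks
    # Blogs without a known TDIR contribute 0 (an assumption of this sketch).
    return tdir[blog] + sum(ratio * tdir.get(o, 0.0) * damp for o in links)


tdir = {"A": 2.0, "B": 1.0}
outlinks = {"A": ["B"], "B": ["A"]}  # assumed link structure
cluster = {"A", "B"}

without_damping = {b: tdr_score(b, tdir, outlinks, cluster, damp=1.0) for b in tdir}
with_damping = {b: tdr_score(b, tdir, outlinks, cluster, damp=0.9) for b in tdir}

print(without_damping)  # A and B tie
print({b: round(s, 2) for b, s in with_damping.items()})  # A now ranks above B
```

Without damping the two blogs tie; with damping, the blog with the higher TDIR of its own keeps the higher rank, which is the point of the slide.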
Experimental Setup
Experimental data
- Real blog data collected by crawling the blogspot domain
- 102 blog sites comprising 50,471 blog posts
Experimental topics
- “compute”, “democracy”, “secularism”, “bioinformatics”, “Haiti”, “Obama”
Experimental Measures
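Precision
$$\text{Precision} = \frac{|C_a \cap C_t|}{|C_a|}$$
Recall
$$\text{Recall} = \frac{|C_a \cap C_t|}{|C_t|}$$
$C_a$ represents the topic cluster set found using our algorithm
$C_t$ represents the true topic cluster set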
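As a small illustration of these two measures, the following Python sketch computes precision and recall from two sets of blog identifiers. The function name `precision_recall` and the blog identifiers are made up for the example; returning 0.0 for an empty set is a convention chosen here, not something the slides specify.

```python
def precision_recall(found, true):
    """Precision and recall of a found cluster against the true cluster."""
    found, true = set(found), set(true)
    overlap = len(found & true)  # |Ca intersect Ct|
    precision = overlap / len(found) if found else 0.0
    recall = overlap / len(true) if true else 0.0
    return precision, recall


# Hypothetical clusters: the algorithm returns one blog too many.
c_a = {"blog1", "blog2", "blog3", "blog4"}  # found by the algorithm
c_t = {"blog1", "blog2", "blog3"}           # true topic cluster
precision, recall = precision_recall(c_a, c_t)
print(precision, recall)
```

In this toy case every true blog is found, so recall is perfect, while the extra blog lowers precision.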
Experimental Results - Precision
Average precision found to be 0.87
Experimental Results - Recall
Average recall found to be 0.971
Conclusions
This work presents the concept of “topic clusters” to solve the blog categorization problem for the Information Retrieval domain.
The proposed method takes into account both blog posts’ content and link structure.
Natural language processing techniques incorporated into the method ensure high coverage.
The method was evaluated using a real-world dataset from the blogspot domain.
Appendix
Additional Experiments
Experiment on topic “Obama” repeated with additional term “Democrats”
- Precision increased from 0.907 to 0.95
- Ranks of some blogs were higher than the ranks obtained previously
Two more experiments on fine-grained topics
- Healthcare bill: precision was found to be 0.857 and recall obtained was 1; additional term “obamacare” was used
- Avatar: precision was found to be 0.47 and recall obtained was 1; additional terms had no effect