Web Communities
Prasanna Desikan(06/13/2002)
Definition
Web community: A group of individuals who share common interests, together with the web pages most popular among them.
Alternatively, a collection of web pages with a shared topic.
Types of Communities
Explicitly defined: Communities that manifest themselves as newsgroups or as resource collections on directories such as Yahoo!
Implicitly defined: Communities that result from the nature of content creation on the web.
Terms and Definitions
Directed Bipartite Graph: A graph whose node set can be partitioned into two sets F and C such that every directed edge in the graph goes from a node u in F to a node v in C.
Complete Bipartite Graph: A bipartite graph that contains all possible edges between a vertex of F and a vertex of C.
Core: A complete bipartite sub-graph with at least i nodes from F and at least j nodes from C. In the web world, the i pages that contain the links are referred to as ‘fans’ and the j pages that are referenced are referred to as ‘centers’.
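As a minimal sketch of the core definition above, the check below tests whether a candidate set of fans and centers forms a complete bipartite sub-graph. The function name `is_core`, the dictionary layout of `links`, and the toy page names are all assumptions made for illustration; they do not come from the slides.

```python
def is_core(fans, centers, links):
    """Return True if every fan links to every center.

    `links` maps a page to the set of pages it points to
    (a hypothetical layout chosen for this sketch).
    """
    return all(center in links.get(fan, set())
               for fan in fans
               for center in centers)


# Toy data, invented for illustration: three fans citing the same two centers.
links = {
    "fan1": {"c1", "c2"},
    "fan2": {"c1", "c2", "c3"},
    "fan3": {"c1", "c2"},
}

# Three fans and two centers with all six edges present: a (3, 2) core.
print(is_core({"fan1", "fan2", "fan3"}, {"c1", "c2"}, links))
# fan1 does not link to c3, so this candidate is not a core.
print(is_core({"fan1", "fan2"}, {"c1", "c3"}, links))
```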
Inferring Web Communities From Link Topology
A community is a core of central authoritative pages linked together by hub pages.
Communities are identified with the principal and non-principal eigenvectors discovered by HITS.
For communities on broad topics, the grouping of pages discovered is relatively independent of the exact choice of root set.
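To make the hub and authority idea concrete, here is a rough sketch of the HITS iteration in plain Python. It only approximates the principal eigenvector pair by repeated updates; extracting the non-principal eigenvectors mentioned above would need a full eigen-decomposition, which this sketch does not attempt. The function name `hits`, the `iterations` parameter, and the toy graph are assumptions for illustration.

```python
import math


def hits(links, iterations=50):
    """Approximate hub and authority scores by repeated updates.

    `links` maps each page to the pages it points to (hypothetical layout).
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    hub = dict.fromkeys(pages, 1.0)
    auth = dict.fromkeys(pages, 1.0)
    for _ in range(iterations):
        # Authority update: add up the hub scores of the pages linking in.
        auth = dict.fromkeys(pages, 0.0)
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub update: add up the authority scores of the pages linked to.
        hub = {p: sum(auth[d] for d in links.get(p, ())) for p in pages}
        # Rescale both score vectors to unit Euclidean length.
        for scores in (auth, hub):
            norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Toy graph, invented for illustration: three hub pages and two authorities.
toy = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
hub, auth = hits(toy)
print(max(auth, key=auth.get))  # the page with the highest authority score
```

In this toy graph the page cited by all three hubs ends up with the highest authority score, and the hubs that cite both authorities score higher than the one that cites only one.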
Findings on Structure of Communities
Robustness: For broad topics, HITS produces stable, robust communities.
Topic Generalization: HITS tends to generalize topics that are not broad. “Michael Jordan” produces links to pages on MJ and his team. “Dennis Ritchie” produces links that refer to the “C Programming Language.”
Other Generalization: HITS tends to converge on topics with greater density of linkage. E.g., for a query on “linguistics”, the top authorities focus on the sub-topic “computational linguistics” because of its greater density of linkage on the web.
Temporal Issues: To obtain the long-term “core” of a topic, we can superimpose the results of HITS runs on the same topic spaced several months apart.
Trawling the Web for Emerging Web Communities
Trawling: Systematic enumeration of emerging communities from a web crawl.
Scan through a web crawl and identify all instances of graph structures that are indicative signatures of communities.
Data Source: A copy of the web from Alexa.
Pre-processing data:
Identify potential fan pages (a page that has links to at least six different websites). Out of 200 million pages, around 24 million were extracted.
Eliminate mirrors. Out of the 24 million, this removed around 60% of pages.
Prune by in-degree: Eliminate all pages that have an in-degree greater than a threshold value k. k is set to 50 in the experiments.
Iterative pruning: When looking for (i,j) cores, any potential fan with out-degree smaller than j can be pruned and the corresponding edges deleted from the graph.
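The iterative pruning step can be sketched as a loop that repeats until no more edges are removed. The fan rule (out-degree smaller than j) is the one stated above; the matching rule for centers (in-degree smaller than i) is an assumption added here by symmetry. The function name `iterative_prune` and the toy edge list are hypothetical.

```python
from collections import Counter


def iterative_prune(edges, i, j):
    """Repeatedly drop edges whose endpoints cannot belong to an (i, j) core.

    `edges` is an iterable of (fan, center) pairs.
    """
    edges = set(edges)
    while True:
        out_degree = Counter(fan for fan, _ in edges)
        in_degree = Counter(center for _, center in edges)
        kept = {
            (fan, center)
            for fan, center in edges
            # Fan rule from the slide: out-degree must be at least j.
            # Center rule (assumed by symmetry): in-degree must be at least i.
            if out_degree[fan] >= j and in_degree[center] >= i
        }
        if kept == edges:  # nothing was pruned in this pass, so stop
            return kept
        edges = kept


# Toy graph, invented for illustration: a (3, 3) core plus two stray fans.
core_edges = {(f, c) for f in ("f1", "f2", "f3") for c in ("c1", "c2", "c3")}
stray_edges = {("f4", "c1"), ("f4", "c4"), ("f5", "c4")}
survivors = iterative_prune(core_edges | stray_edges, i=3, j=3)
print(sorted(survivors))
```

Pruning one fan lowers the in-degree of the centers it pointed to, which is why the check has to be repeated rather than applied once.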
Inclusion-exclusion pruning: Let {c1, c2, …, cj} be the centers adjacent to a fan x, and let N(ct) be the neighborhood of ct, the set of fans that point to ct. Then x is part of a core if and only if the intersection of the sets N(ct) has size at least i.
Filter nepotistic cores.
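The inclusion-exclusion test above can be sketched as a set intersection. This sketch assumes the fan x being tested has exactly j out-links, so that its centers are the only candidates for the center side of the core; that reading, the function name `fan_survives`, and the toy data are assumptions for illustration.

```python
def fan_survives(x, out_links, i):
    """Check whether fan x can be part of a core with at least i fans.

    `out_links` maps each fan to the set of centers it points to
    (hypothetical layout).
    """
    centers = out_links.get(x, set())
    if not centers:
        return False
    # N(c): the set of fans that point to center c.
    neighborhoods = [
        {fan for fan, targets in out_links.items() if c in targets}
        for c in centers
    ]
    # Fans that point to every center adjacent to x (x itself is included).
    common_fans = set.intersection(*neighborhoods)
    return len(common_fans) >= i


# Toy data, invented for illustration.
out_links = {
    "f1": {"c1", "c2"},
    "f2": {"c1", "c2"},
    "f3": {"c1", "c2"},
    "f4": {"c1", "c3"},
}
print(fan_survives("f1", out_links, i=3))  # f1, f2, f3 all share c1 and c2
print(fan_survives("f4", out_links, i=3))  # only f4 points to both c1 and c3
```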
Evaluation of Communities
Fossilization: 30% of communities were fossilized. A fossil is a community core not all of whose fans exist on the web today.
Reliability: Only 4% of the trawled cores were coincidental, i.e., a collection of fan pages without any cogent theme unifying them.
Quality: 56% of the communities were not in Yahoo! as reconstructed from the crawl, and 29% were still not in Yahoo! at the time of the paper. This indicates that trawling identifies emerging communities.
Self Organization and Identification of Web Communities
A web community is defined as a collection of web pages such that each member page has more hyperlinks (in either direction) within the community than outside of the community.
Approach: Maximal Flow – Minimal Cut framework.
Benefits: Focused crawling, automatic population of portal categories.
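The definition above can be checked directly for a candidate set of pages. The sketch below counts, for each member, its links inside and outside the set, ignoring link direction. The function name `satisfies_definition` and the toy graph (two four-page cliques joined by one link) are assumptions for illustration.

```python
from itertools import combinations


def satisfies_definition(members, edges):
    """Return True if every member has more links inside the set than outside.

    `edges` is a collection of (u, v) hyperlinks; direction is ignored,
    following the "in either direction" wording of the definition.
    """
    members = set(members)
    for page in members:
        inside = outside = 0
        for u, v in edges:
            if page == u:
                other = v
            elif page == v:
                other = u
            else:
                continue
            if other in members:
                inside += 1
            else:
                outside += 1
        if inside <= outside:
            return False
    return True


# Toy graph, invented for illustration: two cliques joined by the link d-e.
left = ["a", "b", "c", "d"]
right = ["e", "f", "g", "h"]
edges = list(combinations(left, 2)) + list(combinations(right, 2)) + [("d", "e")]
print(satisfies_definition(left, edges))        # every page has 3 links inside
print(satisfies_definition(["a", "b"], edges))  # a has 1 inside, 2 outside
```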
A Simple Community Identification Example
Figure: Maximum flow methods will separate the two subgraphs with any choice of s and t that has s in the left subgraph and t in the right subgraph, removing the three dashed links.
Approximate Flow Community
Exact Flow Community
An artificial source ‘s’ is added, with infinite-capacity edges routed to all seed vertices in S.
Each pre-existing edge is made bi-directional and rescaled to a constant capacity k.
All vertices except the source, sink, and seed vertices are routed to the artificial sink with unit capacity.
A residual flow graph is produced by a maximum flow procedure.
All vertices reachable from s through edges with positive residual capacity form the desired result and satisfy our definition of a community.
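The construction described above can be sketched end to end in plain Python. The maximum flow routine used here is a breadth-first augmenting-path search (Edmonds-Karp), chosen only because it is short; the slides do not say which maximum flow procedure the authors used. The function name `flow_community`, the node labels `_source` and `_sink`, the value k=2, and the toy graph are all assumptions for illustration.

```python
from collections import defaultdict, deque
from itertools import combinations


def flow_community(edges, seeds, k):
    """Return the pages reachable from the artificial source after max flow."""
    source, sink = "_source", "_sink"  # hypothetical labels for s and t
    cap = defaultdict(lambda: defaultdict(float))  # residual capacities
    vertices = set()
    for u, v in edges:
        vertices.update((u, v))
        cap[u][v] = k  # each pre-existing edge is made bi-directional
        cap[v][u] = k  # with constant capacity k
    for seed in seeds:
        cap[source][seed] = float("inf")
    for v in vertices - set(seeds):
        cap[v][sink] = 1  # unit-capacity edge to the artificial sink

    def reachable_from_source():
        """Breadth-first search over edges with positive residual capacity."""
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    while True:
        parent = reachable_from_source()
        if sink not in parent:
            # No augmenting path is left: the reached set is the community.
            return set(parent) - {source}
        # Walk back from the sink to recover the augmenting path.
        path = []
        v = sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(cap[u][v] for u, v in path)
        for u, v in path:
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck


# Toy graph, invented for illustration: two cliques joined by the link d-e.
left = ["a", "b", "c", "d"]
right = ["e", "f", "g", "h"]
edges = list(combinations(left, 2)) + list(combinations(right, 2)) + [("d", "e")]
community = flow_community(edges, seeds={"a"}, k=2)
print(sorted(community))
```

On this toy graph the cheapest cut severs the single bridging link plus the sink edges of the three non-seed pages in the left clique, so the seed's clique is returned. The choice of k matters: with k=1 the same graph returns only the seed, because cutting the seed's own links becomes the cheapest option.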
Sample Results From Community Identification
The scores are the total number of inbound and outbound links that a web page has to other pages that are also in the community.
Characterization of Communities
Table 3: The fifteen most significant text features for each community, sorted in descending order of the Kullback-Leibler metric.
Discovering Seeds of New Interest Spread From Premature Pages.
A method for discovering topics that stimulate communities of people into earnest communication about the topics’ meaning and that grow into a trend of popular interest.
A community is a group of people sharing some value.
Agora Method on Links
Archive page: The page of highest rank according to Google in a community.
Agora pages: Pages that are linked from multiple archive-pages but are not in any community themselves. They are taken as novel topics attracting multiple communities and are called agora-topic pages.
Step 1: A query representing the user’s interest domain is entered into a search engine (Google here, obtaining 10^5 to 10^6 pages).
Step 2: Communities are extracted from the pages obtained in Step 1, and an archive-page is selected from each community.
Step 3: Pages that are not in the communities but are linked from multiple archive-pages are obtained as agora-pages. With all results obtained so far, the archive-pages (black nodes), agora-pages (red nodes), and the links between them are visualized as in Fig. 1.
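Step 3 amounts to a counting filter, sketched below. The function name `find_agora_pages`, the `min_archives` parameter (set to 2 as one reading of "multiple"), and the toy page names are assumptions for illustration; the slides do not specify a data layout.

```python
from collections import Counter


def find_agora_pages(archive_links, community_pages, min_archives=2):
    """Return pages outside every community that several archive-pages link to.

    `archive_links` maps each archive-page to the pages it links to;
    `community_pages` is the set of all pages belonging to some community.
    """
    counts = Counter()
    for targets in archive_links.values():
        for page in set(targets):  # count each archive-page at most once
            if page not in community_pages:
                counts[page] += 1
    return {page for page, n in counts.items() if n >= min_archives}


# Toy data, invented for illustration.
archive_links = {
    "archive_A": ["member_1", "new_topic", "other"],
    "archive_B": ["new_topic", "member_2"],
    "archive_C": ["other_2"],
}
community_pages = {"archive_A", "archive_B", "archive_C", "member_1", "member_2"}
print(find_agora_pages(archive_links, community_pages))
```

Only the page linked from two different archive-pages and absent from every community is returned.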
Fig. 1: The output of Agora on Links for the domain query “Human Genome”
Evaluation
Stage 1: An interest domain is fixed, a group of people relevant to the domain is gathered, and the domain name is input as a query (e.g. “information retrieval”).
Stage 2: The output graph, containing both real and fake red nodes presented as if they had all been obtained as agora-pages, is shown to the subjects. That is, some red nodes that were not really obtained were added, with red links to black archive-nodes. Subjects reported individual impressions and exchanged ideas in the group.
Sample Results
Institutes shown in ‘red’ were the ones that have data sources on human or mouse genomes, which are useful for researchers at other institutes to consult.
8 of the 12 ‘red’ nodes were described by the subjects as “interesting for thinking of future work”.
References
[1] D. Gibson, J. Kleinberg, and P. Raghavan. Inferring web communities from link topology. In Proc. 9th ACM Conference on Hypertext and Hypermedia.
[2] Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, and Andrew Tomkins. Trawling the web for emerging cyber-communities. In Proc. 8th Int. World Wide Web Conf., 1999.
[3] Gary William Flake, Steve Lawrence, and C. Lee Giles. Efficient Identification of Web Communities. Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
[4] Gary William Flake, Steve Lawrence, C. Lee Giles, and Frans M. Coetzee. Self-Organization and Identification of Web Communities. IEEE Computer, 35(3), 66–71, 2002.
[5] Naohiro Matsumura, Yukio Ohsawa, and Mitsuru Ishizuka. Discovering Seeds of New Interest Spread from Premature Pages Cited by Multiple Communities. 2001 International Conference on Web Intelligence.
Kullback-Leibler Metric
Let p and q be probability distributions with support X and Y respectively. The relative entropy, or Kullback-Leibler distance, between the two probability distributions p and q is defined as
$$D(p \,\|\, q) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)}$$
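A small sketch of this quantity for discrete distributions, assuming the standard definition in which each outcome x contributes p(x) log(p(x)/q(x)) to the sum. The natural logarithm is used here as an assumption, since the slide does not state a base. The function name `kl_divergence` and the dictionary layout are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Relative entropy of p with respect to q, using the natural logarithm.

    `p` and `q` map outcomes to probabilities (hypothetical layout).
    """
    total = 0.0
    for x, px in p.items():
        if px == 0:
            continue  # outcomes with zero probability under p contribute nothing
        qx = q.get(x, 0.0)
        if qx == 0:
            return math.inf  # p puts mass where q has none
        total += px * math.log(px / qx)
    return total


# Toy distributions, invented for illustration.
uniform = {"a": 0.5, "b": 0.5}
skewed = {"a": 0.9, "b": 0.1}
print(kl_divergence(uniform, skewed))
print(kl_divergence(skewed, uniform))
```

The two printed values differ, which illustrates that the measure is not symmetric in p and q and so is not a distance in the strict metric sense.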