Identifying Overlapping Communities in Folksonomies or Tripartite Hypergraphs
Thesis submitted to Indian Institute of Technology, Kharagpur
In partial fulfillment of the requirements
For the award of the degree of
Master of Technology
by
Pushkar Kane
09CS6019
Under the guidance of
Dr. Niloy Ganguly
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
INDIAN INSTITUTE OF TECHNOLOGY, KHARAGPUR
MAY 2011
Certificate
This is to certify that the thesis entitled "Identifying Overlapping Communities in
Folksonomies or Tripartite Hypergraphs" submitted by Pushkar Kane, to the Department
of Computer Science and Engineering, Indian Institute of Technology, Kharagpur, in
partial fulfillment of the requirements for the degree of Master of Technology in
Computer Science and Engineering, is a bonafide record of the work and investigation
carried out by him under my supervision and guidance.
Prof Niloy Ganguly
Dept. of Computer Science and Engineering
Indian Institute of Technology
Kharagpur – 721302, India
IIT Kharagpur
May 4, 2011
ANNEXURE - IIIHATD
Handling and Archiving of Theses and Dissertations submitted to the
Indian Institute of Technology, Kharagpur 721302
Declaration by the Author of the Thesis or Dissertation
I, Sri/Smt/Kum …………………………………………………………………………………….
Roll no ……………………………registered as a Research Scholar or a student of programs such as B.Tech / B.Arch. / B.Sc. / M.Sc. / M.Tech. / MCP / MS / MMST / MBM or equivalent/ Ph.D./ D.Sc. (tick whichever is applicable) in the Department/ Centre / School of …………………………………………………………………….……………
Indian Institute of Technology, Kharagpur, India (hereinafter referred to as the ‘Institute’) do hereby submit my thesis, title:
…………………………………………………………..………………………………………………………………………………………………………………………………..
(hereinafter referred to as ‘my thesis’) in a printed as well as in an electronic version for holding in the library of record of the Institute.
I hereby declare that:
1. The electronic version of my thesis submitted herewith on CDROM is in …………………...format. (mention whether PostScript or PDF).
2. My thesis is my original work of which the copyright vests in me and my thesis does not infringe or violate the rights of anyone else.
3. The contents of the electronic version of my thesis submitted herewith are the same as that submitted as final hard copy of my thesis after my viva voce and adjudication of my thesis on …………………………(date).
4. I agree to abide by the terms and conditions of the Institute Policy on Intellectual Property (hereinafter Policy) currently in effect, as approved by the competent authority of the Institute.
5. I agree to allow the Institute to make available the abstract of my thesis in both hard copy (printed) and electronic form.
6. For the Institute’s own, non-commercial, academic use I grant to the Institute the non-exclusive license to make limited copies of my thesis in whole or in part and to loan such copies at the Institute’s discretion to academic persons and bodies approved of from time to time by the Institute for non-commercial academic use. All usage under this clause will be governed by the relevant fair use provisions in the Policy and by the Indian Copyright Act in force at the time of submission of the thesis.
ANNEXURE – III (contd.)
7. Furthermore (strike out whichever is not applicable)
(a) I agree / do not agree to allow the Institute to place such copies of the electronic version of my thesis on the private Intranet maintained by the Institute for its own academic community.
(b) I agree / do not agree to allow the Institute to publish such copies of the electronic version of my thesis on a public access website of the Internet should it so desire.
8. That in keeping with the said Policy of the Institute I agree to assign to the Institute
(or its Designee/s) according to the following categories all rights in inventions, discoveries or rights of patent and/or similar property rights derived from my thesis where my thesis has been completed (tick whichever relevant):
a. with use of Institute-supported resources as defined by the Policy and
revisions thereof,
b. with support, in part or whole, from a sponsored project or program, vide
clause 6(m) of the Policy.
I further recognize that:
c. All rights in intellectual property described in my thesis where my work does not qualify under sub-clauses 8(a) and/or 8(b) remain with me.
9. The Institute will evaluate my thesis under clause 6(b1) of the Policy. If intellectual
property described in my thesis qualifies under clause 6(b1) (ii) as Institute-owned intellectual property, the Institute will proceed for commercialization of the property under clause 6(b4) of the Policy. I agree to maintain confidentiality as per clause 6(b4) of the Policy.
10. If the Institute does not wish to file a patent based on my thesis, and it is my opinion
that my thesis describes patentable intellectual property to which I wish to restrict access, I agree to notify the Institute to that effect. In such a case no part of my thesis may be disclosed by the Institute to any person(s) without my written authorization for one year after the date of submission of the thesis or the period necessary for sealing the patent, whichever is earlier.
(Name of Student) (Name of supervisor) 1.
2.
(Signature of the Student)
Department/Centre/School:
Signature of the Head of the Department/Center/School
Acknowledgement
The entire work described in this report was carried out at the Department of Computer
Science and Engineering, IIT Kharagpur. I would like to express my sincere thanks and
gratitude to Prof. Niloy Ganguly for his constant motivation and valuable guidance
throughout the course of this work. Saptarshi Ghosh, a research scholar at the
department, along with my supervisor, closely followed and guided this work. Without
him this report would not have materialised. I am highly indebted to both of them for
clarifying my doubts and for bringing into perspective the different aspects of the topic
with suggestions and criticisms on my work.
Pushkar Kane
09CS6019
Abstract
The advent of Web 2.0 has seen a rapid rise in the popularity of online social systems,
and the major online social systems have hundreds of millions of users at present. Some
of the most popular online social systems today are folksonomies, which allow users
to share content online. The shared content can be annotated with user-defined keywords
called tags. Over a period of time, as a result of collaborative tagging, a non-hierarchical
system develops that can be used to share and search for resource content.
Online folksonomies are modelled as tripartite hypergraphs in order to study their
structural and behavioural properties. Detecting communities of similar nodes in
such networks is a challenging and well-studied problem. However, almost every
existing algorithm known to us for community detection in hypergraphs assigns each
node to a single community, whereas in reality, nodes in folksonomies belong to multiple
overlapping communities. For instance, users have multiple topical interests, and the
same resource is often tagged with semantically different tags. In this thesis, we
propose an algorithm to detect overlapping communities in folksonomies by
customizing a recently proposed edge-clustering algorithm (originally designed for
traditional graphs) for use on hypergraphs. Experiments carried out on synthetically
generated data as well as on real data show the effectiveness of the proposed
algorithm.
Table of Contents
1 Introduction
1.1 Online Social Networks
1.2 Folksonomy
1.2.1 Formal Definition of a folksonomy
1.2.2 Modeling folksonomies: tripartite hypergraphs
1.3 Communities in folksonomies
1.4 Motivations of community detection
1.5 Our contribution
1.6 Organization of the report
2 Literature Survey
3 Algorithm for community detection
3.1 Community detection in folksonomies
3.2 Overview of the algorithm
3.3 Definitions
3.4 Algorithm for detecting communities in folksonomies
3.5 Discussions
4 Experiments and Results
4.1 Data Collection
4.2 Synthetic Data Generation
4.3 Metrics for evaluation
4.4 Experiments on the synthetic data
4.5 Experiments on the real world data
4.5.1 Quality of the communities
4.5.2 Quality of the overlap
5 Conclusion
Publications
References
List of figures
1.1 Folksonomy in Delicious.com
1.2 Folksonomy as a three regular tripartite hypergraph
1.3 Overlapping communities in a folksonomy
3.1 Neighbor sets of two adjacent hyperedges
3.2 Similarity between two hyperedge communities
3.3 Agglomerative clustering of hyperedge communities
4.1 Community quality as a function of fraction of nodes in multiple communities
4.2 Community quality as a function of fraction of scattered hyperedges
4.3 Average cosine similarity values for the detected communities
4.4 Distribution of node community sizes for the detected communities
4.5 Distribution of nodes in multiple detected communities
Chapter 1
Introduction
This chapter summarizes the preliminary concepts that form the basis of the following
chapters. It begins with a brief introduction to online social networks and
then explains the concept of folksonomies in detail. The problem of
community detection is then explained, with emphasis on community detection in
folksonomies.
1.1 Online Social Networks
A social network service is an online service, platform, or site that focuses on building
and reflecting social networks or social relations among people who, for example, share
interests and/or activities. A social network service essentially consists of a
representation of each user (often a profile), his/her social links, and a variety of
additional services. Most social network services are web based and provide means for
users to interact over the internet, such as e-mail and instant messaging. Online
community services are sometimes considered social network services, although in a
broader sense a social network service usually means an individual-centered service,
whereas online community services are group-centered. Social networking sites allow
users to share ideas, activities, events, and interests within their individual networks.
Broadly, there are two types of Online Social Networks. In the first type, the users
(members of the social networking service) and their social relationships are the most
important aspects, e.g. Facebook and Twitter. Online Social Networks of the second type,
which focus on the maintenance and sharing of a certain type of
resource, are called folksonomies. In folksonomies, the users interact with each other
primarily through their mutual liking for these resources, and they annotate the
resources with keywords (known as tags). Different folksonomies, also called social
tagging systems, focus on different types of resources, e.g. webpages for Delicious,
photos for Flickr, music files for Last.fm, publication entries for Bibsonomy, etc.
1.2 Folksonomy
A new family of so-called “Web 2.0” applications is currently emerging on the Web.
These include user-centric publishing and knowledge management platforms like wikis,
blogs and social resource sharing systems. Many such systems allow users to annotate
content on the web. Over a period of time, this annotation leads to the formation of a
vocabulary of tags called the folksonomy. The word folksonomy is a blend of the words
taxonomy and folk, and stands for conceptual structures created by the people.
A folksonomy is basically the collection of all tag assignments (user-tag-resource bindings)
in the system. It can be modeled as a graph, which makes it possible to apply graph-based
search and ranking algorithms. Users share resources in a network, and the resources are
annotated with user-defined keywords called “tags”. The collection of such tags, together
with the underlying system of organization, is called a folksonomy. Folksonomies have no
hierarchy in the categorization and no predefined categories.
1.2.1 Formal Definition of a folksonomy
A folksonomy describes the users, resources, and tags, and the user-based assignment
of tags to resources. Formally, a folksonomy is a quadruple
F := (U, T, R, Y), where
U := finite set of users
T := finite set of tags
R := finite set of resources
Y ⊆ U × T × R (tag assignment relation)
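As a rough illustration of the definition above, a folksonomy can be held in memory as
three sets plus a set of (user, tag, resource) triples. This is only a sketch: the class
name Folksonomy, the helper from_assignments and the sample data are hypothetical and
are not taken from the thesis.

```python
from dataclasses import dataclass, field


@dataclass
class Folksonomy:
    """F := (U, T, R, Y) with Y a subset of U x T x R."""
    users: set = field(default_factory=set)
    tags: set = field(default_factory=set)
    resources: set = field(default_factory=set)
    assignments: set = field(default_factory=set)  # the relation Y

    @classmethod
    def from_assignments(cls, triples):
        """Build U, T and R from an iterable of (user, tag, resource) triples."""
        f = cls()
        for user, tag, resource in triples:
            f.users.add(user)
            f.tags.add(tag)
            f.resources.add(resource)
            f.assignments.add((user, tag, resource))
        return f

    def is_valid(self):
        """Check that every tag assignment lies in U x T x R."""
        return all(
            u in self.users and t in self.tags and r in self.resources
            for u, t, r in self.assignments
        )


# Made-up sample data, for illustration only.
sample = Folksonomy.from_assignments([
    ("alice", "python", "docs.example/tutorial"),
    ("alice", "howto", "docs.example/tutorial"),
    ("bob", "python", "docs.example/reference"),
])
```

Storing Y as a set of triples mirrors the relation directly; U, T and R are derived from
it here, although a real system may also contain users, tags or resources that occur in
no assignment.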
Figure 1.1 is a screenshot of the Delicious website which shows the resources, tags and
users to illustrate the concept of Folksonomy.
Figure 1.1 Folksonomy in Delicious.com
Delicious.com, an online bookmarking website, allows annotation of bookmarks with
user-defined tags. The screenshot shows bookmarks as resources and the users along
with the tags that have been used to annotate the bookmarks. A user may apply any
number of tags to any number of bookmarks. Each tag assignment consists of the
application of a tag by a user to a resource. The collection of these tag assignments
comprises the Delicious folksonomy.
1.2.2 Modeling folksonomies: tripartite hypergraphs
In order to study the structural and behavioral properties of folksonomies from the
viewpoint of network theory, such systems are usually represented as tripartite
hypergraphs. A hypergraph is a generalization of a graph, where an edge (or hyperedge)
can connect any number of vertices. Formally, a hypergraph G can be defined as a pair
(V, E), where V is a set of vertices, and E is a set of hyperedges between the vertices.
Each hyperedge is a set of vertices: E ⊆ 2^V, where 2^V is the power set of V. A k-partite
hypergraph is a hypergraph in which the vertices are divided into k partite sets and no
two vertices of the same set are part of the same hyperedge. To represent the folksonomy
we make use of a tripartite
hypergraph in which there are three types of vertices representing resources, tags, and
users, and three-way hyperedges joining them in such a way that each hyperedge links
together exactly one resource, one tag, and one user. Each hyperedge corresponds to
the act of a user applying a tag to a resource, and hence the tripartite hypergraph
preserves the full structure of the folksonomy. This is illustrated in Figure 1.2, where
users are represented by circles, resources by squares and tags by diamonds.
Figure 1.2 Folksonomy as a three regular tripartite hypergraph
Figure 1.2 shows a folksonomy as a three-regular, tripartite hypergraph, in which the
node set V is partitioned into three disjoint sets:
V = U ∪ T ∪ R,
where
U is the set of users (circular nodes in red)
T is the set of tags (diamond shaped nodes in green)
R is the set of resources (square nodes in blue)
and every hyperedge {t, u, r} consists of exactly one tag, one user, and one resource.
In this work, folksonomies are treated as hypergraphs whose partite sets are called
Type X, Type Y and Type Z rather than users, tags and resources specifically.
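A minimal sketch of this generic representation follows, assuming the hypergraph is
given as a list of 3-tuples whose positions correspond to Type X, Type Y and Type Z.
The function names and the sample triples are hypothetical. Each node is stored together
with its type so that the three partite sets stay disjoint even when the same label
occurs in more than one of them.

```python
from collections import Counter

TYPES = ("X", "Y", "Z")  # generic names for the three partite sets


def build_hypergraph(triples):
    """Return hyperedges as tuples of typed nodes, e.g. (("X", a), ("Y", b), ("Z", c))."""
    edges = set()
    for triple in triples:
        if len(triple) != 3:
            raise ValueError("each hyperedge must join exactly three nodes")
        edges.add(tuple(zip(TYPES, triple)))
    return edges


def partite_sets(edges):
    """Collect the nodes of each type that occur in at least one hyperedge."""
    sets = {t: set() for t in TYPES}
    for edge in edges:
        for node_type, label in edge:
            sets[node_type].add(label)
    return sets


def node_degrees(edges):
    """Number of hyperedges incident on each typed node."""
    return Counter(node for edge in edges for node in edge)


# Made-up sample data, for illustration only.
edges = build_hypergraph([("u1", "t1", "r1"), ("u1", "t2", "r1"), ("u2", "t1", "r2")])
sets = partite_sets(edges)
degrees = node_degrees(edges)
```

Because every hyperedge takes exactly one node from each position, the tripartite
property holds by construction and does not need a separate check.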
1.3 Communities in folksonomies
Community structures are quite common in real networks. Social networks often
include community groups (the origin of the term, in fact) based on common location,
interests, occupation, etc. Metabolic networks have communities based on functional
groupings. Citation networks form communities by research topic. Being able to identify
these sub-structures within a network can provide insight into how network function
and topology affect each other. Finding communities within an arbitrary network can be
a difficult task. The number of communities, if any, within the network is typically
unknown and the communities are often of unequal size and/or density. Despite these
difficulties, however, several methods for community finding have been developed and
employed with varying levels of success.
Folksonomies grow as a result of consistent social interaction, which results in the
addition of resources and users to the folksonomies and in the use of new tags.
Eventually, the folksonomy starts to develop different topics of interest. A user may be
interested in multiple topics, each of which is defined by a set of resources and
described by a set of tags.
Identification of communities in folksonomies aids in searching various topics of interest
as well as in recommendation of resources to the users.
Detecting communities from hypergraphs is practically important to identify users
having similar topical interests as well as similar resources and tags; this helps in
classification of resources into semantic categories and recommendation of potential
friends and resources of matching interest to users of the folksonomy. Though several
algorithms for community detection in hypergraphs have been proposed (e.g. [2]), one
important aspect of the problem that has seldom been considered is that nodes in
folksonomies frequently belong to multiple overlapping communities (rather than a
single community). Most users have multiple topics of interest, and thus link to
resources and tags of many different semantic categories. Similarly, the same resource
(e.g. photo, web-page) is frequently associated with semantically different tags by users
who appreciate different properties of the resource. The only work known to us on
detecting overlapping communities in folksonomies is [3], which considers communities
of tags only. However, detecting overlapping communities of users and resources in
folksonomies is equally necessary for personalized recommendation and categorization
of resources and tags.
As a motivating example, consider a popular photo of a daffodil in Flickr (See Figure 1.3).
Since many users are likely to tag the photo with ‘flower’ (or ‘daffodil’), as compared to
few users using the tag ‘yellow’, algorithms assigning single communities to nodes
would place this photo in the community related to flowers (or daffodils).
Figure 1.3 Overlapping communities in a folksonomy
Community-based recommendation schemes, which recommend resources to users
based on common-memberships in communities, would thus overlook the fact that this
photo is an excellent candidate for recommendation to a user who favors tagging
objects that are yellow-colored (e.g. photos of yellow cars, sunset, etc). On the other
hand, an algorithm detecting multiple overlapping communities would place the photo
in both communities related to flowers and the color ‘yellow’, and thus raise the
chances that this popular photo is recommended to the said user. Among the few
algorithms for detecting overlapping communities of nodes in traditional graphs (but
not in hypergraphs), a recently proposed one identifies communities as sets of closely
inter-related edges, so that the different edges incident on a node make the node a part
of multiple overlapping communities [1]. In this thesis, we identify overlapping
communities in folksonomies by customizing the algorithm in [1] for use on
hypergraphs.
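The daffodil example can be made concrete with a toy sketch. The hyperedges below and
their grouping into two hyperedge communities are invented by hand for illustration;
they are not output of the proposed algorithm. The sketch only shows how a node that
appears in hyperedges of several communities ends up in all of them.

```python
from collections import defaultdict

# Hand-made (user, tag, resource) hyperedges, grouped by hand into
# two hyperedge communities. All names are invented for illustration.
hyperedge_communities = {
    "flowers": [
        ("ann", "flower", "daffodil.jpg"),
        ("ben", "daffodil", "daffodil.jpg"),
        ("ann", "flower", "rose.jpg"),
    ],
    "yellow": [
        ("cat", "yellow", "daffodil.jpg"),
        ("cat", "yellow", "taxi.jpg"),
    ],
}


def node_memberships(communities):
    """Map every node to the set of communities whose hyperedges contain it."""
    membership = defaultdict(set)
    for name, edges in communities.items():
        for edge in edges:
            for node in edge:
                membership[node].add(name)
    return membership


membership = node_memberships(hyperedge_communities)
overlapping = {node for node, names in membership.items() if len(names) > 1}
```

In this toy data the photo daffodil.jpg is the only node that belongs to both
communities, so a community-based recommender could offer it to the user who favors
the tag 'yellow'.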
1.4 Motivations of community detection
There are strong motivations for detecting communities in social
networks, viz.
1. Identifying close friends (nodes within the same community) can help in
recommending new friends and resources to users.
2. Meeting the scaling requirements of rapidly growing online social networks (OSNs) by
partitioning the storage among different servers; users within the same community
(e.g. a group of users who frequently tag resources uploaded by one another) can be
allocated to the same server so that frequent interactions among such users do not
produce high network traffic.
3. Identifying these sub-structures within a network can provide insight into
how network function and topology affect each other.
1.5 Our contribution
Various algorithms, explained in the following chapter, find communities in
folksonomies. However, to our knowledge, there is no algorithm that finds overlapping
communities in folksonomies. We have proposed an algorithm for this purpose and
evaluated it using synthetic as well as real-world data. This algorithm can be used to
recommend resources to users. Recommending other like-minded users to a user is
another possible application.
1.6 Organization of the report
The following chapter gives a brief overview of the literature survey that
was carried out before the work began. Chapter 3 describes the proposed
algorithm, whereas Chapter 4 details the validation of the results and
the methodologies used for it. Chapter 5 concludes the report and lays out
the possibilities for future work.
Chapter 2
Literature Survey
Many technical papers were studied, and summaries of a few follow. The motivation
behind the work is largely attributed to this literature. Folksonomies are a unique
phenomenon that became popular in the last decade.
They have attracted a lot of research in recent times, with directions
including but not limited to recommendation of resources and
tags, community detection, and understanding the network properties and the evolution of
folksonomies.
When folksonomies are modeled as graphs, their network properties are peculiar because
folksonomies grow as a result of collaborative tagging. These properties help in
understanding the growth and formation of communities in folksonomies. Various
structural properties of folksonomies such as characteristic path length,
clustering coefficients, cliquishness, and connectedness have been studied and analyzed
in [4]. The paper also analyzes the tag co-occurrence network obtained from the
folksonomy.
Hotho et al. have proposed a ranking algorithm for information retrieval in folksonomies
[12]. The algorithm, called FolkRank, is an adaptation of the PageRank algorithm.
It proceeds by transforming the hypergraph into an undirected
unweighted tripartite graph. A random surfer model is used to traverse the graph and
rank its nodes. The ranks can then be used for searching as well as for
recommendation.
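To illustrate the random-surfer idea described above, the following sketch runs a plain
PageRank power iteration on the undirected graph obtained by linking the three nodes of
every hyperedge pairwise. It is not the FolkRank algorithm of [12]; the function names,
the damping factor of 0.85, the iteration count and the sample hyperedges are all
assumptions made for this example, and node labels are assumed to be distinct across
users, tags and resources.

```python
from collections import defaultdict
from itertools import combinations


def tripartite_graph(hyperedges):
    """Link the three nodes of every hyperedge pairwise (undirected, unweighted)."""
    neighbors = defaultdict(set)
    for edge in hyperedges:
        for a, b in combinations(edge, 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors


def pagerank(neighbors, damping=0.85, iterations=100):
    """Power iteration for the random-surfer model on an undirected graph."""
    nodes = list(neighbors)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Teleportation share, spread uniformly over all nodes.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            # Every node has at least two neighbors, so no division by zero.
            share = damping * rank[node] / len(neighbors[node])
            for other in neighbors[node]:
                new_rank[other] += share
        rank = new_rank
    return rank


# Made-up (user, tag, resource) hyperedges, for illustration only.
hyperedges = [("u1", "t1", "r1"), ("u2", "t1", "r1"), ("u3", "t1", "r2")]
ranks = pagerank(tripartite_graph(hyperedges))
```

In this toy input the tag t1 is linked to every other node and therefore receives the
highest rank.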
Tagging has emerged as a powerful mechanism that enables users to find, organize, and
understand online entities. Recommendation has always been an important application
in social networks, and recommender systems enable users to efficiently navigate vast
collections of items. Some recommender systems recommend tags to users instead of
resources, by predicting the user's preferences from previous tagging behavior and
recommending tags that the user is likely to use to annotate the resource.
Recommender systems for tags, called Tagommenders, have been described in [9].
Common interests shared by groups of users in social networks are discovered by
utilizing user tags in [6]. Tags implicitly and concisely represent users' interests. A
topic of interest, consisting of a set of tags describing a particular popular topic, is
identified and the corresponding users and resources are clustered and indexed.
Depending on the application, this index is used as an aid for recommendation.
A folksonomy grows as a result of collaborative tagging, and the nodes therein have
various similarities that can be inferred from the structure of the folksonomy. Various
measures of similarity between nodes in folksonomies, based on the properties of
the nodes and hyperedges as well as the structure of the folksonomies, have been
discussed in [5]. These similarity measures can be applied directly to
community detection and recommendation. The paper discusses
measures such as the matching/overlap measure on weighted and unweighted projections,
the Jaccard coefficient of the overlapped tags/resources, cosine similarity, the Dice
coefficient and a mutual information measure, among others. The paper also
focuses on tag and resource similarity as a way of finding communities for
recommendation and answering queries.
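Three of the measures named above have simple set-based forms, sketched below for two
tags that are each described by the set of resources they annotate. This is an
illustration under that assumption only; the measures evaluated in [5] are defined on
several projections of the folksonomy and may differ in detail. The function names and
the sample data are hypothetical.

```python
import math


def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|, taken as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dice(a, b):
    """2 |A ∩ B| / (|A| + |B|), taken as 0.0 when both sets are empty."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def cosine(a, b):
    """|A ∩ B| / sqrt(|A| |B|): cosine of the binary indicator vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


# Resources annotated with each tag (made-up data).
resources_of = {
    "python": {"r1", "r2", "r3"},
    "programming": {"r2", "r3", "r4", "r5"},
}
a, b = resources_of["python"], resources_of["programming"]
```

All three measures lie between 0 and 1 and equal 1 only when the two sets coincide; they
differ in how strongly they penalize a difference in set sizes.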
Bipartite networks are similar to folksonomies in that they have nodes of different
types, and the nodes can be divided into disjoint sets such that no two
nodes within the same set are linked. Murata et al. have proposed a modularity
technique for bipartite networks in [11]. A modularity measure that quantifies the
goodness of the communities in k-partite hypergraphs has been defined in
[7]. The paper discusses the conversion of multipartite graphs into bipartite graphs by
reducing k-partite graphs into k(k-1)/2 bipartite graphs. The paper also discusses two
ways of converting k-partite graphs into unipartite graphs, viz. flattening and
projections.
Murata et al. have also proposed a modularity measure for tripartite hypergraphs in
[12]. The method is based on greedy optimization of this measure and is not applicable
to real-world folksonomies. Moreover, the method does not detect overlapping communities.
Zhang et al. have proposed and defined an edge clustering coefficient for bipartite
graphs, in analogy to the node clustering coefficient in graphs, in [8]. Based on the
measure, triples of nodes are formed with two adjacent edges, and the edge similarity is
calculated as the node similarity between the two end nodes of the triple. In this way
communities are detected that contain nodes from both the partite sets. The paper also
demonstrates how the communities detected by one-mode projections of the graph are
not accurate due to the loss of information.
Yong-Yeol Ahn et al. have proposed a novel
approach for community detection in graphs by progressively grouping edges together
instead of nodes, as is the conventional approach [1]. In this way, edge communities are
formed, from which node communities can be obtained. The paper discusses an
agglomerative bottom-up method for clustering of edges based on a measure called
partition density, which is analogous to the modularity of communities. The
resultant edge grouping defines edge communities. The nodes incident upon an edge
lie in the community containing that edge. In this way, a node can be a part of multiple
communities, which therefore overlap. Our work is largely based
on this idea of edge clustering, which has been extended to hyperedges.
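For ordinary graphs, the similarity that [1] uses between two edges sharing a node is
the Jaccard coefficient of the inclusive neighbor sets (a node together with its
neighbors) of their two non-shared endpoints. The sketch below illustrates that
graph-level measure only; the function names and the sample graph are hypothetical, and
the hyperedge similarity used in this thesis is defined in Chapter 3 and may differ.

```python
from collections import defaultdict


def inclusive_neighbors(edges):
    """Map each node to the set containing itself and its neighbors."""
    n_plus = defaultdict(set)
    for i, j in edges:
        n_plus[i].update((i, j))
        n_plus[j].update((i, j))
    return n_plus


def edge_similarity(e1, e2, n_plus):
    """Jaccard similarity of the non-shared endpoints' inclusive neighbor sets.

    Returns 0.0 for edges that do not share exactly one node.
    """
    shared = set(e1) & set(e2)
    if len(shared) != 1:
        return 0.0
    (i,) = set(e1) - shared
    (j,) = set(e2) - shared
    return len(n_plus[i] & n_plus[j]) / len(n_plus[i] | n_plus[j])


# A triangle a-b-c with a pendant node d attached to c (made-up data).
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
n_plus = inclusive_neighbors(edges)
```

In this toy graph the two triangle edges (a, b) and (a, c) are more similar to each
other than the triangle edge (b, c) is to the pendant edge (c, d), so an agglomerative
procedure would merge the triangle edges first.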
A brief summary of the papers that were studied as a part of the literature survey is
detailed as follows.
Brief Summary of the papers studied
| Serial Number | Name of the paper | Name of the authors | Remarks |
| --- | --- | --- | --- |
| 1 | Network Properties of Folksonomies | Christoph Schmitz, Miranda Grahl, Andreas Hotho, Gerd Stumme | Describes various structural properties unique to folksonomies and analyzes the tag co-occurrence network |
| 2 | Tagommenders: Connecting Users to Items through Tags | Shilad Sen, Jesse Vig, John Riedl | Recommendation of tags to users by predictions based on previous tagging behavior |
| 3 | Tag-based Social Interest Discovery | Xin Li, Lei Guo, Yihong (Eric) Zhao | Identifies topics of interest, each identified by a set of tags in folksonomies, and clusters and indexes resources |
| 4 | Detecting Communities from Bipartite Networks Based on Bipartite Modularities | Tsuyoshi Murata | Proposes a modularity measure for bipartite graphs for detecting communities of nodes that contain nodes of different partite sets together |
| 5 | Towards Community Detection in k-partite k-uniform hypergraphs | Nicolas Neubauer, Klaus Obermayer | Proposes a modularity measure for multipartite graphs by reducing k-partite graphs into k(k-1)/2 bipartite graphs |
| 6 | Detecting Communities from Tripartite Networks | Tsuyoshi Murata | Proposes a modularity measure for tripartite graphs to evaluate the partition of a folksonomy into communities |
| 7 | Evaluating Similarity Measures for Emergent Semantics of Social Tagging | Benjamin Markines, Ciro Cattuto, Filippo Menczer, Dominik Benz, Andreas Hotho, Gerd Stumme | Discusses various similarity measures between nodes for folksonomies and focuses on tag and resource similarity as a way of finding communities for recommendation and answering queries |
| 8 | Clustering coefficient and community structure of bipartite networks | Peng Zhang, Jinliang Wang, Xiaojia Li, Menghui Li, Zengru Di, Ying Fan | Proposes a measure called the edge clustering coefficient for clustering nodes of bipartite graphs into communities |
| 9 | Link communities reveal multiscale complexity in networks | Yong-Yeol Ahn, James P. Bagrow, Sune Lehmann | Proposes a new method for detecting communities in graphs by progressively grouping edges together and detecting edge communities, which further give rise to overlapping node communities |
| 10 | Information Retrieval in Folksonomies: Search and Ranking | Andreas Hotho, Robert Jaschke, Christoph Schmitz, Gerd Stumme | Proposes a ranking algorithm for folksonomies based on an adaptation of the PageRank algorithm |
Chapter 3
ALGORITHM FOR DETECTION OF
OVERLAPPING COMMUNITIES IN
FOLKSONOMIES
3.1 Community detection in folksonomies
Communities in folksonomies arise as a result of social tagging by users. As
the folksonomy grows, various topics of interest develop and overlapping topics of
interest arise. Detecting communities, i.e. sub-networks that are densely connected
inside and sparsely connected outside, in folksonomies is practically important for
finding similar entities and understanding the structure of social media. Almost all
existing methods known to us assign each node in a folksonomy to a single community,
but as stated earlier, users
and resources in real-world folksonomies are likely to be members of multiple
overlapping communities. Here we propose an algorithm to detect such overlapping
communities in tripartite hypergraphs. The proposed algorithm initially detects
communities of similar hyperedges, and later uses these communities of hyperedges to
identify communities of nodes. To the best of our knowledge, no method
exists that finds overlapping communities for nodes in a folksonomy.
3.2 Overview of the algorithm
The community detection algorithm proceeds in a bottom-up hierarchical way, by
repeatedly merging the most similar communities of hyperedges until one community
remains. The hyperedges are clustered based on the similarity between adjacent
hyperedges. To define adjacency we considered several notions, viz. two nodes common
between two hyperedges, one node common between two hyperedges, and at least one node
common between two hyperedges. Of these, we found that the criterion of at least one
node common between two hyperedges captures the notion of adjacency best. The result of
the merging is a dendrogram, which is cut at the level where the optimization measure,
partition density (described later), is maximum. Node communities are then formed from
the hyperedge communities. Each community comprises nodes from all three sets, Type X,
Type Y and Type Z.
3.3 Definitions
1. Folksonomy representation: A folksonomy is represented as a hypergraph, given as a
list of hyperedges between vertices of Type X, Type Y and Type Z.
2. Hyperedge representation: (a, b, c) represents a hyperedge between node a of Type
X, node b of Type Y and node c of Type Z.
3. Adjacency of hyperedges: Two hyperedges (a, b, c) and (p, q, r) are adjacent if they
have at least one vertex in common, i.e. a = p or b = q or c = r.
An alternative measure for adjacency of hyperedges, analogous to this one, considers
two hyperedges (a, b, c) and (p, q, r) to be adjacent if they have exactly two vertices
in common, i.e. either (a = p and b = q) or (b = q and c = r) or (a = p and c = r). This
measure is revisited in the discussions section of this chapter.
4. Neighborhood of hyperedges: The neighborhood of a hyperedge is defined with respect
to an adjacent hyperedge, so it depends on the neighboring hyperedge under
consideration. It is based on the sets of neighbors of the constituent nodes of the
hyperedge, as follows:
Consider two adjacent hyperedges (a, b, c) and (p, q, r). Without loss of generality,
let node a and node p be the same (a = p), i.e. the node of Type X is common (either of
the other two nodes may be common as well). The following figure shows these hyperedges
with a common node.
Figure 3.1 Neighbor sets of two adjacent hyperedges
In Figure 3.1 the neighbor sets of the constituent nodes of each hyperedge are shown
enclosed in ellipses. The red circles, blue triangles and green squares represent the
nodes of Type X, Type Y and Type Z respectively. The figure shows the hyperedges
(a, b, c) and (a, q, r) with node a of Type X, nodes b and q of Type Y and nodes c and r
of Type Z. NX(b) is the set of Type X neighbors of node b, i.e. the set of nodes of
Type X which are connected to node b by a hyperedge. NY(c) is the set of nodes of Type Y
that are neighbors of node c. The sets for the other nodes are defined similarly. Thus
set S1 is obtained as the union of the sets NX(b) and NX(c).
Based on these neighbor sets of the nodes, the Type X, Type Y and Type Z neighbor sets
are defined for the hyperedges (a, b, c) and (a, q, r). The Type X, Type Y and Type Z
neighbor sets S1, S2 and S3 for the hyperedge (a, b, c) are defined as follows:
S1 = {Neighbor nodes of b and c of Type X} = NX(b) ∪ NX(c)
S2 = {Neighbor nodes of c of Type Y} = NY(c)
S3 = {Neighbor nodes of b of Type Z} = NZ(b)
Similarly, the Type X, Type Y and Type Z neighbor sets S1’, S2’ and S3’ of (a, q, r)
are defined as follows:
S1’ = NX(q) ∪ NX(r)
S2’ = NY(r)
S3’ = NZ(q)
5. Similarity of hyperedges: The similarity of non-adjacent hyperedges is defined to be
zero. For adjacent hyperedges (a, b, c) and (a, q, r), the similarity is defined to be:
Similarity = ( |S1 ∩ S1’| + |S2 ∩ S2’| + |S3 ∩ S3’| ) / ( |S1 ∪ S1’| + |S2 ∪ S2’| + |S3 ∪ S3’| )
where S1, S2 and S3 are the neighbor sets of hyperedge (a, b, c) and S1’, S2’ and S3’
are the neighbor sets of hyperedge (a, q, r) as explained earlier. Higher values of this
expression indicate higher similarity between the hyperedges.
6. Similarity of hyperedge communities: The similarity of two hyperedge communities is
equal to the maximum similarity over pairs of constituent hyperedges, one from each
community.
Figure 3.2 Similarity between two hyperedge communities
Figure 3.2 shows two hyperedge communities as black circles, with their constituent
hyperedges drawn as red circles inside them, represented in a vector space. In this
picture, proximity between two hyperedges stands for the similarity between them, so the
closest pair is the most similar. The similarity between the two communities is equal to
the maximum similarity between constituent hyperedges, one from each community, and is
depicted by a blue line.
7. Partition density: The partition density of a community is defined as follows:
X = Number of hyperedges in the community
Y = Maximum number of hyperedges possible in the community
Partition density = X / Y
The partition density of a partition of the folksonomy into communities is the average
of the partition densities of the individual communities, weighted by the number of
hyperedges in each community. A higher value of this weighted average indicates a better
partition of the folksonomy into communities. The algorithm identifies the partition
that gives the highest value of the partition density, subject to the number of
communities being at least two.
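The neighbor sets and the hyperedge similarity of definitions 4 and 5 can be sketched in Python as follows. This is a minimal illustration rather than the thesis implementation. The function names, the tagging of nodes with their type, and the extension to a shared node of Type Y or Type Z by symmetry are all assumptions; the text only spells out the case of a shared Type X node. When two hyperedges share more than one node, the sketch simply uses the first shared position.

```python
from collections import defaultdict

TYPES = ("X", "Y", "Z")

def build_neighbor_maps(hyperedges):
    """For every node, collect its neighbors of each type (NX, NY, NZ)."""
    nbrs = defaultdict(lambda: {t: set() for t in TYPES})
    for edge in hyperedges:
        tagged = list(zip(TYPES, edge))  # ("X", a), ("Y", b), ("Z", c)
        for u in tagged:
            for v_type, v_label in tagged:
                if (v_type, v_label) != u:
                    nbrs[u][v_type].add(v_label)
    return nbrs

def neighbor_sets(edge, shared, nbrs):
    """Neighbor sets of `edge` relative to the shared position.

    For shared == 0 (Type X) this gives S1 = NX(b) | NX(c), S2 = NY(c),
    S3 = NZ(b); other positions follow by symmetry (an assumption).
    """
    u, v = [i for i in range(3) if i != shared]
    node_u, node_v = (TYPES[u], edge[u]), (TYPES[v], edge[v])
    return {
        TYPES[shared]: nbrs[node_u][TYPES[shared]] | nbrs[node_v][TYPES[shared]],
        TYPES[u]: nbrs[node_v][TYPES[u]],
        TYPES[v]: nbrs[node_u][TYPES[v]],
    }

def similarity(e1, e2, nbrs):
    """Similarity of two hyperedges; zero when they are not adjacent."""
    shared = next((i for i in range(3) if e1[i] == e2[i]), None)
    if shared is None:
        return 0.0
    s, t = neighbor_sets(e1, shared, nbrs), neighbor_sets(e2, shared, nbrs)
    inter = sum(len(s[k] & t[k]) for k in TYPES)
    union = sum(len(s[k] | t[k]) for k in TYPES)
    return inter / union if union else 0.0
```

For the two hyperedges (a, b, c) and (a, q, r) taken alone, the only common neighbor is node a, so the numerator is 1 and the denominator is 1 + 2 + 2 = 5.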
3.4 Algorithm for detecting communities in folksonomies
A tripartite hypergraph is denoted as G = (V, E), where the set of nodes V is composed
of three partite sets (types) VX, VY and VZ, and E is the set of hyperedges; each
hyperedge connects a triple of nodes (a, b, c) where a ∈ VX, b ∈ VY and c ∈ VZ. Further,
let NX(i), NY(i) and NZ(i) denote the sets of neighbors of node i in the node sets VX,
VY and VZ respectively. The proposed algorithm performs an agglomerative hierarchical
clustering of hyperedges using single-linkage similarity among clusters of hyperedges.
The similarity between two adjacent hyperedges (i.e. hyperedges having at least one node
in common) is given by our customized measure from the definitions section. Non-adjacent
hyperedges are assumed to have zero similarity, as explained earlier.
The hierarchical clustering, continued until all hyperedges belong to a single cluster,
builds a dendrogram (as shown in Figure 3.3), and cutting this dendrogram at some
suitable level gives communities of hyperedges. The optimal level for the cut, on which
the quality of the obtained communities depends, is decided based on the partition
density metric [1] as follows. The partition density of a community C of edges (or
hyperedges, in the case of hypergraphs) is the number of edges in C, normalized by the
minimum and maximum number of edges possible among the induced nodes (i.e. the nodes
that are touched by the edges in C). The global partition density for a given
partitioning of the edges (hyperedges) is the average partition density of all
communities, weighted by the fraction of edges present in each community.
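As a rough illustration, the partition density of definition 7 (X / Y, weighted by community size) could be computed as below. The sketch assumes that the maximum number of hyperedges possible in a community is the product of the numbers of induced Type X, Type Y and Type Z nodes; the text does not state this explicitly. It also does not attempt the normalization by the minimum number of edges mentioned in the paragraph above, so under this literal reading a community consisting of a single hyperedge has density 1. The function names are hypothetical.

```python
def community_density(edges):
    """X / Y for one hyperedge community.

    X = number of hyperedges; Y = |X nodes| * |Y nodes| * |Z nodes| among
    the induced nodes (an assumed reading of "maximum possible").
    """
    edges = set(edges)
    if not edges:
        return 0.0
    n_x = len({e[0] for e in edges})
    n_y = len({e[1] for e in edges})
    n_z = len({e[2] for e in edges})
    return len(edges) / (n_x * n_y * n_z)

def partition_density(communities):
    """Average of the community densities, weighted by hyperedge count."""
    total = sum(len(set(c)) for c in communities)
    if total == 0:
        return 0.0
    return sum(len(set(c)) * community_density(c) for c in communities) / total
```

For example, the community {(a, b, c), (a, q, r)} induces one Type X node and two nodes of each other type, so at most 1 * 2 * 2 = 4 hyperedges are possible and its density is 2 / 4.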
Figure 3.3 Agglomerative clustering of hyperedge communities
Our customized partition density metric for use on hypergraphs is explained in the
definitions section. Similar to [1], the dendrogram is cut at the level at which the
global partition density is maximum (see Figure 3.3). Thus each hyperedge is placed into
a single community, and a node inherits membership of all the communities into which its
hyperedges are placed.
This procedure is summarized by the following pseudo-code:
1. Initialize all hyperedges to be in different communities.
2. Do
i. Find the similarities between all pairs of communities.
ii. Merge the two most similar communities using single-linkage clustering (the
resultant community is given the lower of the two identifiers of the merged
communities).
iii. Find the partition density for this division.
Until only one community remains
(The above loop generates a dendrogram.)
3. Trace the dendrogram and find the level at which the partition density attains its
maximum.
4. Cut the dendrogram at that level and save the resultant hyperedge communities.
5. Form node communities from the hyperedge communities, with one node community
corresponding to each hyperedge community. Each node community consists of nodes of
all three types, viz. Type X, Type Y and Type Z. Each node belongs to the node
communities corresponding to the hyperedge communities of the hyperedges that are
incident on it.
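The steps above can be sketched in Python as follows. This is a deliberately naive illustration that recomputes all pairwise linkages in every round, not an efficient or authoritative implementation. The callables `sim` (hyperedge similarity) and `density` (partition density of a list of hyperedge communities) are assumed to be supplied by the caller, and the function names are hypothetical. Instead of storing the dendrogram, the sketch keeps track of the best level seen so far, scoring only levels with at least two communities.

```python
from itertools import combinations

def cluster_hyperedges(edges, sim, density):
    """Single-linkage agglomerative clustering of hyperedges.

    Returns the hyperedge communities at the level of highest density.
    If no scored level exists, the initial singleton partition is returned.
    """
    comms = {i: [e] for i, e in enumerate(edges)}  # step 1
    best_score = float("-inf")
    best = [list(c) for c in comms.values()]

    def linkage(pair):
        # Single linkage: similarity of the most similar pair of hyperedges.
        i, j = pair
        return max(sim(e, f) for e in comms[i] for f in comms[j])

    while len(comms) > 1:  # step 2
        i, j = max(combinations(sorted(comms), 2), key=linkage)
        comms[i] = comms[i] + comms.pop(j)  # i < j, so the lower id is kept
        if len(comms) >= 2:
            score = density(list(comms.values()))
            if score > best_score:  # steps 3 and 4, done on the fly
                best_score = score
                best = [list(c) for c in comms.values()]
    return best

def node_communities(edge_comms):
    """Step 5: each node joins the communities of its incident hyperedges."""
    return [
        {(t, n) for e in comm for t, n in zip("XYZ", e)}
        for comm in edge_comms
    ]
```

Because a node appears in the node community of every hyperedge community that contains one of its hyperedges, overlapping node communities arise even though each hyperedge belongs to exactly one community.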
3.5 Discussions
To find the similarity between hyperedges, two approaches were tried for defining the
adjacency of hyperedges. The first, used in the algorithm above, considers two
hyperedges to be adjacent if they share at least one node. The other approach considers
two hyperedges to be adjacent if and only if they share exactly two nodes. This approach
is more stringent, and it was found to generate better results than the first in the
case of very dense hypergraphs. Since hypergraphs in the real world are not very dense,
we proceeded with the first approach.
Chapter 4
Experiments and Results
4.1 Data Collection
For the purpose of community detection in folksonomies we collected the folksonomy
of MovieLens.
Three data sets of varying sizes were obtained, viz.
1. A large data set containing 10,000,054 ratings and 95,580 tags applied to 10,681
movies by 71,567 users of the online movie recommender service MovieLens. The
data also contains keywords indicating the genres of the movies.
2. A medium sized data set containing tags applied to 3,592 movies by 6,040 users
with at least 20 tags per user.
3. A small sized data set containing 100,000 ratings (1-5) from 943 users on 1,682
movies.
Thus MovieLens contains two folksonomies viz. the folksonomy of users, tags and
movies and that of users, ratings and movies. The data sets are available for download
from GroupLens Research website (http://www.grouplens.org/). We have used the
folksonomy of users, tags and movies in our study.
Since ground truth is not available for most real world folksonomies, it is difficult to
validate the community structure obtained by our algorithm. Hence we have used
synthetically generated hypergraphs that have a predefined community structure with
overlapping communities for nodes. In addition to the quantification of performance
based on synthetic data, we have used a subset of the real world data for qualitative as
well as quantitative analysis of the algorithm. We have obtained metadata for the
MovieLens folksonomy data and formulated a measure to judge the performance of the
algorithm on real world folksonomies (explained in later sections).
4.2 Synthetic Data Generation
The synthetic data that were used in the experiments were generated as follows:
1. The three node sets of Type X, Type Y and Type Z were created with an equal number
of nodes in each set.
2. A fixed number of communities was decided randomly.
3. Each node from each set was assigned to a community chosen randomly from the
pre-decided number of communities.
4. A pre-decided fraction of the nodes was chosen to lie in multiple communities. For
each of these nodes, the number of communities in which the node would lie was decided
randomly, and the node was then allotted to additional communities.
5. A pre-decided fraction of the hyperedges was chosen to be scattered. Scattered
hyperedges connect nodes lying in different communities, and un-scattered hyperedges
connect nodes of the same community.
6. Un-scattered and scattered hyperedges were added randomly between nodes of the same
community and between nodes of different communities, respectively, based on the
predefined fractions.
7. A set of hypergraphs was generated for different values of the fraction of nodes in
multiple communities and the fraction of scattered hyperedges. The algorithm was
executed on each hypergraph, and the results for each setting of the variable
parameters were obtained by averaging over a set of hypergraphs.
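The generation steps above might be sketched as follows. The function name, the node labels, the use of a seeded `random.Random`, and the bounded retry when drawing scattered hyperedges are all illustrative assumptions. The sketch also allows duplicate hyperedges and requires at least two communities.

```python
import random

def generate_hypergraph(n_per_set, n_comms, n_edges,
                        overlap_frac, scatter_frac, seed=0):
    """Return (hyperedges, membership) for a synthetic tripartite hypergraph.

    membership maps each node to the set of communities it belongs to.
    Assumes n_comms >= 2.
    """
    rng = random.Random(seed)
    nodes = {t: [f"{t}{i}" for i in range(n_per_set)] for t in "XYZ"}
    all_nodes = [n for t in "XYZ" for n in nodes[t]]
    # Step 3: one randomly chosen community per node.
    member = {n: {rng.randrange(n_comms)} for n in all_nodes}
    # Step 4: a fraction of the nodes joins additional communities.
    for n in rng.sample(all_nodes, round(overlap_frac * len(all_nodes))):
        total = rng.randint(2, n_comms)
        others = [c for c in range(n_comms) if c not in member[n]]
        member[n].update(rng.sample(others, total - 1))
    by_comm = {c: {t: [n for n in nodes[t] if c in member[n]] for t in "XYZ"}
               for c in range(n_comms)}
    usable = [c for c in by_comm if all(by_comm[c][t] for t in "XYZ")]
    if not usable:
        raise ValueError("no community has nodes of all three types")
    # Steps 5 and 6: scattered edges first, then un-scattered ones.
    n_scattered = round(scatter_frac * n_edges)
    edges = []
    for k in range(n_edges):
        if k < n_scattered:
            for _ in range(100):  # retry until no community is shared
                edge = tuple(rng.choice(nodes[t]) for t in "XYZ")
                if not set.intersection(*(member[n] for n in edge)):
                    break
        else:
            c = rng.choice(usable)
            edge = tuple(rng.choice(by_comm[c][t]) for t in "XYZ")
        edges.append(edge)
    return edges, member
```

A call such as `generate_hypergraph(100, 5, 1000, 0.2, 0.1)` would correspond to one hypergraph with 100 nodes per set, 5 communities and 1000 hyperedges; step 7 would repeat such calls over a grid of the two fractions and several seeds.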
4.3 Metrics for evaluation
Using the above procedure, multiple synthetic hypergraphs were generated, and the
communities detected on them were validated using the following community quality
measure, which captures the extent to which pairs of nodes placed in the same detected
community were also together in the same predefined community.
Community quality = ( <f(x,y)> for nodes in a detected community ) / ( <f(x,y)> for all possible nodes )
where f(x,y) indicates the metadata similarity of nodes x and y, i.e. the similarity
between nodes x and y based on the ground truth (which is known for synthetic data),
and <f(x,y)> indicates the average metadata similarity value. Essentially, the community
quality measures the ratio of the average metadata similarity between all pairs of nodes
placed into the same communities by the algorithm to the average metadata similarity
between all pairs of nodes. The metadata similarity is based on the predefined
community structure. The function f(x,y) for the node pair x and y is defined as
follows:
f(x,y) = | Cx ∩ Cy | / | Cx ∪ Cy |
where Cx and Cy represent the sets of communities that nodes x and y belong to in the
synthetic hypergraphs.
Values of community quality higher than 1 indicate that the identified community
structure indeed groups similar nodes into the same communities. The experiments that
were carried out over the synthetic data used this metric for the evaluation of the
algorithm in identifying overlapping communities of nodes in folksonomies.
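A minimal sketch of the community quality measure is given below. It assumes that the numerator pools all node pairs over all detected communities (so a pair that shares two detected communities is counted twice) and that the denominator ranges over all pairs of nodes in the ground truth; the text does not pin down either detail. The function names are hypothetical.

```python
from itertools import combinations

def jaccard(cx, cy):
    """f(x, y): overlap of the ground-truth community sets of two nodes."""
    union = cx | cy
    return len(cx & cy) / len(union) if union else 0.0

def community_quality(detected, truth):
    """Ratio of mean f over co-detected pairs to mean f over all pairs.

    detected: list of sets of nodes found by the algorithm.
    truth: dict mapping each node to its set of ground-truth communities.
    """
    within = [jaccard(truth[x], truth[y])
              for comm in detected
              for x, y in combinations(sorted(comm), 2)]
    overall = [jaccard(truth[x], truth[y])
               for x, y in combinations(sorted(truth), 2)]
    if not within or not overall or sum(overall) == 0:
        raise ValueError("community quality is undefined for this input")
    return (sum(within) / len(within)) / (sum(overall) / len(overall))
```

With four nodes split evenly between two ground-truth communities and detected perfectly, every co-detected pair has f = 1 while only 2 of the 6 possible pairs do, so the ratio is 1 / (1/3) = 3.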
4.4 Experiments on the synthetic data
Experiments on synthetic data were carried out by running the algorithm over the
synthetic data and comparing the output with the predefined community structure of
the data. As explained earlier, a series of synthetic hypergraphs was generated, and the
community quality measure was obtained for each of them by running the algorithm on it.
The synthetic hypergraphs used in the experiments are governed by the following set of
parameters:
1. Number of communities in the synthetic hypergraph.
2. Number of nodes in each of the three sets Type X, Type Y and Type Z.
The numbers of nodes in the three sets were chosen to be equal to each other,
i.e. |VX| = |VY| = |VZ|.
3. Average node degree (average number of hyperedges per node).
The average node degree controls the number of hyperedges in the synthetic
hypergraph, which equals the product of the average node degree and the number of
nodes in each set.
4. Fraction of nodes in multiple communities.
This value controls the amount of overlap between the node communities.
5. Fraction of scattered hyperedges.
Scattered hyperedges connect nodes from different node communities. They denote
the passing interests of users who have tagged resources without the user, tag and
resource together constituting a topic of interest.
The experiments were carried out for synthetic data with the following values for the
parameters of the synthetic data being fixed:
1. Number of communities in the synthetic hypergraph = 5
2. Number of nodes in each of the three sets = 100
3. Average node degree = 10
4. Number of hyperedges = 1000
The experiments were carried out over hypergraphs by varying the fraction of nodes in
multiple communities from 0.0 to 1.0 in intervals of 0.2 while keeping the fraction of
scattered hyperedges constant, and then by varying the fraction of scattered hyperedges
over the same range and interval while keeping the fraction of nodes in multiple
communities constant. The following plots show the variation of community quality with
changes in the fraction of nodes in multiple communities and in the fraction of
scattered hyperedges.
Figures 4.1 and 4.2 show community quality as a function of the fraction of nodes in
multiple communities and the fraction of scattered hyperedges respectively. The
community quality value is highest when the fraction of nodes in multiple communities
and the fraction of scattered hyperedges are both zero, and it decreases gracefully as
either of the two increases.
Figure 4.1 Community quality as a function of fraction of nodes in multiple communities
Figure 4.2 Community quality as a function of fraction of scattered hyperedges
The community quality value nevertheless remains higher than 1, which indicates that the
detected community structure groups similar nodes into the same communities.
4.5 Experiments on the real world data
A subset of the MovieLens folksonomy (described in the data collection section) with
1000 hyperedges was used, chosen such that no node has a high degree. The subset was
obtained by sorting the set of hyperedges lexicographically and selecting hyperedges
while restricting the degree of each node to four. This subset was given as input to the
algorithm, and each resulting community contained users, movies and the tags describing
the movies. As explained earlier, ground truth is not available for judging the results
of the algorithm on a real world folksonomy. To get a fair idea of the performance of
the algorithm on real world data, metadata in the form of information about movies was
obtained and used. This metadata comprised information, obtained from IMDb, about the
genres of the movies in the MovieLens folksonomy. The data obtained from IMDb contains a
list of movies, each associated with a set of keywords that describe the genre of the
movie. These keywords are a subset of 18 keywords predefined by IMDb, viz. Action,
Adventure, Animation, Children, Comedy, Crime, Documentary, Drama, Fantasy, Film-Noir,
Horror, Musical, Mystery, Romance, Sci-Fi, Thriller, War, and Western. This data is used
to compare the movies that have been grouped into the same community by the algorithm.
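The subset selection described at the start of this section (lexicographic sorting followed by a cap on node degree) could look roughly like the sketch below. The function name and parameters are hypothetical, and the greedy rule of skipping any hyperedge that would push one of its nodes above the cap is an assumption about how the restriction was applied.

```python
from collections import Counter

def select_subset(hyperedges, max_degree=4, limit=1000):
    """Greedily pick hyperedges in lexicographic order under a degree cap."""
    degree = Counter()
    chosen = []
    for edge in sorted(set(hyperedges)):
        # Tag nodes with their type so equal labels in different sets differ.
        tagged = list(zip("XYZ", edge))
        if all(degree[node] < max_degree for node in tagged):
            chosen.append(edge)
            degree.update(tagged)
            if len(chosen) == limit:
                break
    return chosen
```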
4.5.1 Quality of the communities
To find the similarity between movies, each movie is treated as a vector and measures
of vector similarity are used. Each movie is represented as a vector in 18 dimensions
corresponding to the 18 keywords. The similarity between two movies is the cosine
similarity between the two vectors, i.e. the cosine of the angle between them. The
cosine function takes values between 0 and 1 for angles between 0 and π/2. Smaller
angles, and hence higher cosine values, indicate higher similarity.
The movies from each detected community were considered. The cosine similarity was
computed for each pair of movies and averaged over all pairs. Figure 4.3 shows a plot of
the average cosine similarity values for each community. The plot shows the cumulative
average cosine similarity values for the MovieLens folksonomy over all the detected
communities. As is apparent, a reasonable number of communities have high values of
average cosine similarity, indicating that the algorithm performs fairly well on the
real world MovieLens folksonomy.
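The genre-based similarity can be illustrated with the short sketch below. It assumes binary genre vectors (1 if the keyword applies to the movie, 0 otherwise), which the text does not state explicitly, and the function names are hypothetical.

```python
import math
from itertools import combinations

GENRES = ["Action", "Adventure", "Animation", "Children", "Comedy", "Crime",
          "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror", "Musical",
          "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western"]

def genre_vector(genres):
    """18-dimensional binary vector for a set of genre keywords."""
    return [1.0 if g in genres else 0.0 for g in GENRES]

def cosine(u, v):
    """Cosine of the angle between two vectors; 0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def average_cosine(movie_genres):
    """Mean cosine similarity over all pairs of movies in one community.

    movie_genres: list with one set of genre keywords per movie.
    """
    vectors = [genre_vector(g) for g in movie_genres]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)
```

For instance, a movie tagged Action and Comedy and a movie tagged only Action share one of the two genres of the first, giving a cosine of 1/√2.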
Figure 4.3 Average cosine similarity values for the detected communities
The output communities were used to determine the distribution of community sizes
(see Figure 4.4). This result gives an indication of the nature of the original graph:
if the original graph is sparse, more communities of size three (corresponding to a
single hyperedge) are formed.
4.5.2 Quality of the overlap
To find out whether the overlapping communities are actually any better than
non-overlapping ones, we examined whether the movies that have been placed in multiple
communities actually need to belong to multiple communities.
We derived another community structure for the movies wherein each movie is placed in
exactly one community. This structure is derived from the partition obtained by our
algorithm by allocating to each movie the best community among the ones that it belongs
to. Analogous to the cosine similarity of a community, we define the average cosine
similarity of a movie with respect to a community as the average of its cosine
similarities with the other movies in that community. Each movie is allocated to the
community, among those it belongs to, for which this average cosine similarity is the
highest.
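The derivation of the non-overlapping structure can be sketched as follows. The pairwise similarity `sim` is passed in as a callable (for example, a genre-based cosine similarity), the averages are taken over the original overlapping communities, and ties are broken in favor of the community that comes first; these details and the function names are assumptions made for illustration.

```python
def best_community(movie, communities, sim):
    """Index of the community in which `movie` fits best on average."""
    candidates = [i for i, c in enumerate(communities) if movie in c]

    def average_similarity(i):
        others = [m for m in communities[i] if m != movie]
        if not others:
            return 0.0
        return sum(sim(movie, m) for m in others) / len(others)

    return max(candidates, key=average_similarity)

def non_overlapping(communities, sim):
    """Place every movie in exactly one of the communities it belongs to."""
    movies = set().union(*communities)
    result = [set() for _ in communities]
    for movie in sorted(movies):
        result[best_community(movie, communities, sim)].add(movie)
    return result
```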
The experiments were carried out on the subset of the MovieLens dataset described
earlier, over the two community structures, viz.
1. The overlapping communities obtained by our algorithm
2. The non-overlapping communities obtained by the method explained earlier
The average cosine similarity values were calculated for each community, and the values
for the communities obtained by our algorithm were compared with the corresponding
values for the non-overlapping communities. Out of the 178 communities found by our
algorithm, 119 overlapping communities had higher average cosine similarity values than
the corresponding non-overlapping communities, i.e. just over two-thirds of the
overlapping communities have higher average cosine similarity values. This suggests that
the movies placed in multiple communities are placed there appropriately.
The communities are of different sizes, so to give an overview of the quality of all
communities irrespective of size, we computed the average cosine similarity values
weighted by the size of the communities, over all communities, for the two community
structures. The weighted average cosine similarity values for the two community
structures are as follows:
Community structure    Weighted average cosine similarity value
Overlapping            0.373794877108
Non-overlapping        0.21820724018
Thus the overlapping communities obtained by our algorithm have a higher weighted
average cosine similarity than the non-overlapping ones.
4.5.3 Other measures
The algorithm was run on the MovieLens data described above, and the distribution of
node community sizes and the distribution of nodes in multiple communities were
determined, to justify the purpose of the algorithm.
Figure 4.4 shows the distribution of node community sizes of the detected communities
in the MovieLens folksonomy. A large number of communities have three nodes in them,
i.e. one from each of the three sets of Type X, Type Y and Type Z. Each of these node
communities results from a single hyperedge forming a hyperedge community by itself.
This occurs because the MovieLens folksonomy is a disconnected hypergraph. The other,
larger communities have multiple movies grouped together. These movies have been
compared using the average cosine similarity values as explained earlier.
Figure 4.4 Distribution of node community sizes for the detected communities
The output communities were also used to determine the distribution of nodes that
have been placed in multiple communities, i.e. the number of communities in which each
node is placed (see Figure 4.5). A substantial number of nodes have been grouped into
multiple communities, which reflects the overlapping nature of folksonomies. Thus the
purpose of the algorithm is justified.
Figure 4.5 Distribution of nodes in multiple detected communities
Chapter 5
Conclusion
In this thesis, we propose what is, to the best of our knowledge, the first algorithm to
detect overlapping communities in folksonomies. The algorithm gives reasonably good
results for synthetic as well as real world folksonomies, and it detects overlapping
communities of nodes. This algorithm can be applied in practice for the purpose of
recommending resources to users.
Future Work
Using the algorithm in an actual recommendation or search system is an important
application. Another application of this algorithm would be to suggest friends to users.
Nodes are grouped into communities, and each community represents a topic of interest.
Many users have different areas of interest, and based on these interests like-minded
users can be found. Thus friend suggestions can be made to users. Yet another
application of the algorithm would be the searching and ranking of results pertaining to
the folksonomy. Applying the algorithm to a large real world folksonomy and using the
results for search or recommendation constitutes the future work.
Publications
[1] Saptarshi Ghosh, Pushkar Kane, Niloy Ganguly. Identifying overlapping
communities in folksonomies or tripartite hypergraphs, International World Wide
Web Conference, March 2011
References
[1] Y.-Y. Ahn, J. P. Bagrow, and S. Lehmann. Link communities reveal multiscale
complexity in networks. Nature, 466(7307):761–764, August 2010.
[2] T. Murata. Modularity for heterogeneous networks. In ACM Hypertext, pages
129–134, June 2010.
[3] S. Papadopoulos, Y. Kompatsiaris, A. Vakali. Leveraging collective intelligence
through community detection in tag networks. In CKCaR, September 2009.
[4] Christoph Schmitz, Miranda Grahl, Andreas Hotho, Gerd Stumme. Network
properties of Folksonomies. AI Communications, Vol. 20, No. 4, pages 245–262,
IOS Press, Amsterdam, The Netherlands, December 2007.
[5] Benjamin Markines, Ciro Cattuto, Filippo Menczer, Dominik Benz, Andreas Hotho,
Gerd Stumme. Evaluating Similarity Measures for Emergent Semantics of Social
Tagging. International World Wide Web Conference, 2009
[6] Xin Li, Lei Guo, Yihong (Eric) Zhao. Tag-based Social Interest Discovery.
International World Wide Web Conference, 2008
[7] Nicolas Neubauer, Klaus Obermayer. Towards Community Detection in k-partite
k-uniform hypergraphs. Workshop on Analyzing Networks and Learning with
Graphs at NIPS 2009.
[8] Peng Zhang, Jinliang Wang, Xiaojia Li, Menghui Li, Zengru Di, Ying Fan. Clustering
coefficient and community structure of bipartite networks. Physica A, Volume
387, Issue 27, pages 6869–6875.
[9] Shilad Sen, Jesse Vig, John Riedl. Tagommenders: Connecting Users to Items
through Tags. International World Wide Web Conference, April 2009
[10] Murata T. Detecting Communities from Tripartite Networks. International World
Wide Web Conference, April 2010
[11] Murata T. Detecting Communities from Bipartite Networks Based on Bipartite
Modularities. IEEE/WIC/ACM International Joint Conference on Web Intelligence
and Intelligent Agent Technology 2009.
[12] Andreas Hotho, Robert Jaschke, Christoph Schmitz, Gerd Stumme. Information
Retrieval in Folksonomies: Search and Ranking. In York Sure and John Domingue,
editors, The Semantic Web: Research and Applications, volume 4011 of LNAI,
pages 411–426, Heidelberg, June 2006. Springer