Identifying Overlapping Communities in Folksonomies or Tripartite Hypergraphs

Thesis submitted to Indian Institute of Technology, Kharagpur

In partial fulfillment of the requirements

For the award of the degree of

Master of Technology

by

Pushkar Kane

09CS6019

Under the guidance of

Dr. Niloy Ganguly

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

INDIAN INSTITUTE OF TECHNOLOGY, KHARAGPUR

MAY 2011

Certificate

This is to certify that the thesis entitled "Identifying Overlapping Communities in

Folksonomies or Tripartite Hypergraphs" submitted by Pushkar Kane, to the Department

of Computer Science and Engineering, Indian Institute of Technology, Kharagpur, in

partial fulfillment of the requirements for the degree of Master of Technology in

Computer Science and Engineering, is a bonafide record of the work and investigation

carried out by him under my supervision and guidance.

Prof Niloy Ganguly

Dept. of Computer Science and Engineering

Indian Institute of Technology

Kharagpur – 721302, India

IIT Kharagpur

May 4, 2011

Acknowledgement

The work described in this report was carried out at the Department of Computer Science and Engineering, IIT Kharagpur. I would like to express my sincere thanks and gratitude to Prof. Niloy Ganguly for his constant motivation and valuable guidance throughout the course of this work. Saptarshi Ghosh, a research scholar at the department, closely followed and guided this work along with my supervisor; without him this report would not have materialised. I am highly indebted to both of them for clarifying my doubts and for bringing the different aspects of the topic into perspective with their suggestions and criticism.

Pushkar Kane

09CS6019

Abstract

The advent of Web 2.0 has seen a rapid rise in the popularity of online social systems, and the major online social systems now have hundreds of millions of users. Some of the most popular online social systems today are folksonomies, which enable users to share content online. The shared content can be annotated with user-defined keywords called tags. Over time, as a result of collaborative tagging, a non-hierarchical system develops that can be used to share and search for resource content.

Online folksonomies are modelled as tripartite hypergraphs in order to study their structural and behavioural properties. Detecting communities of similar nodes in such networks is a challenging and well-studied problem. However, almost every existing algorithm known to us for community detection in hypergraphs assigns each node to a single community, whereas in reality nodes in folksonomies belong to multiple overlapping communities. For instance, users have multiple topical interests, and the same resource is often tagged with semantically different tags. In this thesis, we propose an algorithm to detect overlapping communities in folksonomies by customizing a recently proposed edge-clustering algorithm (originally designed for traditional graphs) for use on hypergraphs. Experiments carried out on synthetically generated data as well as on real data show the effectiveness of the proposed algorithm.

Table of Contents

1 Introduction
    1.1 Online Social Networks
    1.2 Folksonomy
        1.2.1 Formal Definition of a folksonomy
        1.2.2 Modeling folksonomies: tripartite hypergraphs
    1.3 Communities in folksonomies
    1.4 Motivations of community detection
    1.5 Our contribution
    1.6 Organization of the report
2 Literature Survey
3 Algorithm for community detection
    3.1 Community detection in folksonomies
    3.2 Overview of the algorithm
    3.3 Definitions
    3.4 Algorithm for detecting communities in folksonomies
    3.5 Discussions
4 Experiments and Results
    4.1 Data Collection
    4.2 Synthetic Data Generation
    4.3 Metrics for evaluation
    4.4 Experiments on the synthetic data
    4.5 Experiments on the real world data
        4.5.1 Quality of the communities
        4.5.2 Quality of the overlap
5 Conclusion
Publications
References

List of figures

1.1 Folksonomy in Delicious.com
1.2 Folksonomy as a three-regular tripartite hypergraph
1.3 Overlapping communities in a folksonomy
3.1 Neighbor sets of two adjacent hyperedges
3.2 Similarity between two hyperedge communities
3.3 Agglomerative clustering of hyperedge communities
4.1 Community quality as a function of fraction of nodes in multiple communities
4.2 Community quality as a function of fraction of scattered hyperedges
4.3 Average cosine similarity values for the detected communities
4.4 Distribution of node community sizes for the detected communities
4.5 Distribution of nodes in multiple detected communities

Chapter 1

Introduction

This chapter summarizes the preliminary concepts that form the basis of the following chapters. It begins with a brief introduction to online social networks and then explains the concept of folksonomies in detail. Finally, the problem of community detection is explained, with emphasis on community detection in folksonomies.

1.1 Online Social Networks

A social network service is an online service, platform, or site that focuses on building and reflecting social networks or social relations among people who, for example, share interests and/or activities. A social network service essentially consists of a representation of each user (often a profile), his/her social links, and a variety of additional services. Most social network services are web-based and provide means for users to interact over the internet, such as e-mail and instant messaging. Although online community services are sometimes considered social network services in a broader sense, a social network service usually means an individual-centered service, whereas online community services are group-centered. Social networking sites allow users to share ideas, activities, events, and interests within their individual networks.

There are two broad types of online social networks. In the first type, the users (members of the social networking service) and their social relationships are the most important aspects, e.g. Facebook and Twitter. The second type, which focuses on the maintenance and sharing of a certain type of resource, is called a folksonomy. In folksonomies, the users interact with each other primarily through their mutual liking for these resources, and they annotate the resources with keywords (known as tags). Different folksonomies, also called social tagging systems, focus on different types of resources, e.g. webpages for Delicious, photos for Flickr, music files for Last.fm, and publication entries for Bibsonomy.

1.2 Folksonomy

A new family of so-called “Web 2.0” applications is currently emerging on the Web. These include user-centric publishing and knowledge management platforms such as wikis, blogs and social resource sharing systems. Many such systems allow users to annotate content on the web. Over time, this annotation leads to the formation of a list of words called the folksonomy. The word folksonomy is a blend of the words taxonomy and folk, and stands for conceptual structures created by the people.

A folksonomy is basically a collection of all tag assignments (user-tag-resource bindings) in the system. It can be modeled as a graph, which makes it possible to apply graph-based search and ranking algorithms. Users share resources in a network, and the resources are annotated with user-defined keywords called “tags”. The collection of such tags, together with the underlying system of organization, is called a folksonomy. Folksonomies have no hierarchy in the categorization and no predefined categories.

1.2.1 Formal Definition of a folksonomy

A folksonomy describes the users, resources, and tags, and the user-based assignment of tags to resources. Formally, a folksonomy is a quadruple

F := (U, T, R, Y), where

U := finite set of users

T := finite set of tags

R := finite set of resources

Y ⊆ U × T × R (tag assignment relation)
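The quadruple above maps directly onto a small data structure. The following is a minimal sketch in Python, not part of the original work: the names `Folksonomy` and `build_folksonomy` and the sample tag assignments are hypothetical, and U, T and R are derived from Y, which assumes that every user, tag and resource takes part in at least one tag assignment.

```python
from typing import NamedTuple


class Folksonomy(NamedTuple):
    """F := (U, T, R, Y), with Y a subset of U x T x R."""
    users: frozenset
    tags: frozenset
    resources: frozenset
    assignments: frozenset  # set of (user, tag, resource) triples


def build_folksonomy(triples):
    """Derive U, T and R from an iterable of (user, tag, resource) triples."""
    y = frozenset(tuple(t) for t in triples)
    users = frozenset(u for u, _, _ in y)
    tags = frozenset(t for _, t, _ in y)
    resources = frozenset(r for _, _, r in y)
    return Folksonomy(users, tags, resources, y)


# Hypothetical tag assignments, invented for illustration only.
example = build_folksonomy([
    ("alice", "flower", "photo1"),
    ("bob", "yellow", "photo1"),
    ("alice", "python", "page7"),
])
```

Storing Y as a set of triples keeps the ternary relation intact, so no information is lost to pairwise projections.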

Figure 1.1 is a screenshot of the Delicious website which shows the resources, tags and

users to illustrate the concept of Folksonomy.

Figure 1.1 Folksonomy in Delicious.com

Delicious.com, an online bookmarking website, allows annotation of bookmarks with user-defined tags. The screenshot shows bookmarks as resources, along with the users and the tags that have been used to annotate the bookmarks. A user may apply any number of tags to any number of bookmarks. Each tag assignment consists of the application of a tag by a user to a resource. The collection of these tag assignments comprises the Delicious folksonomy.

1.2.2 Modeling folksonomies: tripartite hypergraphs

In order to study the structural and behavioral properties of folksonomies from the viewpoint of network theory, such systems are usually represented as tripartite hypergraphs. A hypergraph is a generalization of a graph, where an edge (or hyperedge) can connect any number of vertices. Formally, a hypergraph G can be defined as a pair (V, E), where V is a set of vertices and E is a set of hyperedges between the vertices. Each hyperedge is a set of vertices, so E ⊆ 2^V, where 2^V denotes the power set of V. A k-partite hypergraph is a hypergraph whose vertices are divided into k partite sets such that no two vertices of the same set are part of the same hyperedge. To represent a folksonomy we use a tripartite hypergraph with three types of vertices, representing resources, tags, and users, and three-way hyperedges that each link together exactly one resource, one tag, and one user. Each hyperedge corresponds to the act of a user applying a tag to a resource, and hence the tripartite hypergraph preserves the full structure of the folksonomy. This is evident from Figure 1.2, where users are represented by circles, resources by squares and tags by diamonds.

Figure 1.2 Folksonomy as a three-regular tripartite hypergraph

Figure 1.2 shows a folksonomy as a three-regular, tripartite hypergraph, in which the node set V is partitioned into three disjoint sets:

V = U ∪ T ∪ R,

where

U is the set of users (circular nodes in red)

T is the set of tags (diamond-shaped nodes in green)

R is the set of resources (square nodes in blue)

and every hyperedge {t, u, r} consists of exactly one tag, one user, and one resource.

In this work, folksonomies are treated as hypergraphs whose partite sets are called Type X, Type Y and Type Z rather than users, tags and resources specifically.
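The structural constraint described above (three disjoint partite sets, and exactly one node of each type per hyperedge) can be checked mechanically. The sketch below is illustrative only: the function names and the toy node labels are assumptions, and hyperedges are represented simply as tuples of node labels.

```python
def is_tripartite_hyperedge(hyperedge, type_x, type_y, type_z):
    """True if the hyperedge holds exactly one node from each partite set."""
    nodes = set(hyperedge)
    return (
        len(nodes) == 3
        and len(nodes & type_x) == 1
        and len(nodes & type_y) == 1
        and len(nodes & type_z) == 1
    )


def is_tripartite_hypergraph(hyperedges, type_x, type_y, type_z):
    """True if the partite sets are disjoint and every hyperedge is valid."""
    disjoint = not (type_x & type_y or type_x & type_z or type_y & type_z)
    return disjoint and all(
        is_tripartite_hyperedge(h, type_x, type_y, type_z) for h in hyperedges
    )


# Toy partite sets and hyperedges, invented for illustration only.
type_x = {"u1", "u2"}
type_y = {"t1", "t2"}
type_z = {"r1"}
valid_edges = [("u1", "t1", "r1"), ("u2", "t2", "r1")]
invalid_edges = [("u1", "u2", "r1")]  # two Type X nodes in one hyperedge
```

Because the check refers only to Type X, Type Y and Type Z, it does not depend on which partite set holds users, tags or resources.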

1.3 Communities in folksonomies

Community structures are quite common in real networks. Social networks often

include community groups (the origin of the term, in fact) based on common location,

interests, occupation, etc. Metabolic networks have communities based on functional

groupings. Citation networks form communities by research topic. Being able to identify

these sub-structures within a network can provide insight into how network function

and topology affect each other. Finding communities within an arbitrary network can be

a difficult task. The number of communities, if any, within the network is typically

unknown and the communities are often of unequal size and/or density. Despite these

difficulties, however, several methods for community finding have been developed and

employed with varying levels of success.

Folksonomies grow as a result of sustained social interaction, which adds resources and users to the folksonomy and brings new tags into use. Eventually, the folksonomy starts to develop different topics of interest. A user may be interested in multiple topics, each of which is defined by a set of resources and described by a set of tags. Identification of communities in folksonomies aids in searching for various topics of interest as well as in recommending resources to users.

Detecting communities in hypergraphs is practically important for identifying users with similar topical interests as well as similar resources and tags; this helps in classifying resources into semantic categories and in recommending potential friends and resources of matching interest to users of the folksonomy. Though several algorithms for community detection in hypergraphs have been proposed (e.g. [2]), one important aspect of the problem that has seldom been considered is that nodes in folksonomies frequently belong to multiple overlapping communities (rather than a single community). Most users have multiple topics of interest, and thus link to resources and tags of many different semantic categories. Similarly, the same resource (e.g. photo, web-page) is frequently associated with semantically different tags by users who appreciate different properties of the resource. The only work known to us on detecting overlapping communities in folksonomies is [3], which considers communities of tags only. However, detecting overlapping communities of users and resources in folksonomies is equally necessary for personalized recommendation and categorization of resources and tags.

As a motivating example, consider a popular photo of a daffodil in Flickr (See Figure 1.3).

Since many users are likely to tag the photo with ‘flower’ (or ‘daffodil’), as compared to

few users using the tag ‘yellow’, algorithms assigning single communities to nodes

would place this photo in the community related to flowers (or daffodils).

Figure 1.3 Overlapping communities in a folksonomy

Community-based recommendation schemes, which recommend resources to users based on common memberships in communities, would thus overlook the fact that this photo is an excellent candidate for recommendation to a user who favors tagging objects that are yellow-colored (e.g. photos of yellow cars, sunsets, etc.). On the other hand, an algorithm detecting multiple overlapping communities would place the photo in both the community related to flowers and the one related to the color ‘yellow’, and thus raise the chances that this popular photo is recommended to the said user. Among the few algorithms for detecting overlapping communities of nodes in traditional graphs (but not in hypergraphs), a recently proposed one identifies communities as sets of closely inter-related edges, so that the different edges created by a node make the node a part of multiple overlapping communities [1]. In this thesis, we identify overlapping communities in folksonomies by customizing the algorithm in [1] for use on hypergraphs.
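The daffodil example can be made concrete with a few lines of Python. In this sketch the grouping of hyperedges into two communities is assigned by hand purely for illustration (it is not the output of the proposed algorithm), and all user, tag and photo names are hypothetical. It shows only how a node inherits membership in every community that contains one of its hyperedges.

```python
from collections import defaultdict


def node_memberships(hyperedge_communities):
    """Map each node to the set of community labels of its hyperedges."""
    memberships = defaultdict(set)
    for label, hyperedges in hyperedge_communities.items():
        for hyperedge in hyperedges:
            for node in hyperedge:
                memberships[node].add(label)
    return dict(memberships)


# Hand-assigned hyperedge communities; each hyperedge is (user, tag, resource).
communities = {
    "flowers": [
        ("user_a", "flower", "daffodil_photo"),
        ("user_b", "daffodil", "daffodil_photo"),
    ],
    "yellow": [
        ("user_c", "yellow", "daffodil_photo"),
        ("user_c", "yellow", "taxi_photo"),
    ],
}
memberships = node_memberships(communities)
```

Here `daffodil_photo` ends up in both the "flowers" and the "yellow" community, whereas each hyperedge still belongs to exactly one community.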

1.4 Motivations of community detection

There are strong motivations for detecting communities in social networks, viz.

1. Identifying close friends (nodes within the same community) can help in recommending new friends and resources to users.

2. Communities can help meet the scaling requirements of rapidly growing online social networks (OSNs) when storage is partitioned among different servers: users within the same community (e.g. a group of users who frequently tag resources uploaded by one another) can be allocated to the same server so that frequent interactions among such users do not produce high network traffic.

3. Being able to identify these sub-structures within a network can provide insight into how network function and topology affect each other.

1.5 Our contribution

Various algorithms that find communities in folksonomies exist, as explained in the following chapter. However, to our knowledge, there is no algorithm that finds overlapping communities in folksonomies. We propose such an algorithm and evaluate it using synthetic as well as real-world data. The algorithm can be used to recommend resources to users; recommending like-minded users to one another is another probable application.

1.6 Organization of the report

The following chapter gives a brief overview of the literature survey that was carried out before the work began. Chapter 3 describes the proposed algorithm, whereas Chapter 4 details the validation of the results and the methodologies used for it. Chapter 5 concludes the report and lays out the possibilities for future work.

Chapter 2

Literature Survey

A number of technical papers were studied, and summaries of a few follow. The motivation behind this work is largely attributed to this literature. Folksonomies are a distinctive phenomenon that became popular in the last decade. They have attracted a lot of research in recent times, in directions including, but not limited to, recommendation of resources and tags, community detection, and understanding the network properties and evolution of folksonomies.

When folksonomies are modeled as graphs, their network properties are peculiar because folksonomies grow as a result of collaborative tagging. These properties help in understanding the growth and formation of communities in folksonomies. Various structural properties of folksonomies, such as characteristic path length, clustering coefficients, cliquishness, and connectedness, have been studied and analyzed in [4]. The paper also analyzes the tag co-occurrence network obtained from the folksonomy.

Hotho et al. have proposed a ranking algorithm for information retrieval in folksonomies [12]. The algorithm, called FolkRank, is an adaptation of the PageRank algorithm. It proceeds by transforming the folksonomy into an undirected, unweighted tripartite graph. A random surfer model is then used to traverse the graph and rank its nodes. The ranks can be used for searching as well as for recommendation.

Tagging has emerged as a powerful mechanism that enables users to find, organize, and understand online entities. Recommendation has always been an important application in social networks, and recommender systems enable users to efficiently navigate vast collections of items. Some recommender systems recommend tags to users instead of resources: they predict the user's preferences from previous tagging behavior and recommend tags that the user is likely to use to annotate the resource. Such recommender systems for tags, called Tagommenders, have been described in [9].

Common interests shared by groups of users in social networks are discovered by utilizing user tags in [6]. Tags implicitly and concisely represent users' interests. A topic of interest, consisting of a set of tags describing a particular popular topic, is identified, and the corresponding users and resources are clustered and indexed. Depending on the application, this index is used as an aid for recommendation.

A folksonomy grows as a result of collaborative tagging, and its nodes have various similarities that can be inferred from the structure of the folksonomy. Various measures of similarity between nodes, based on the properties of the nodes and hyperedges as well as the structure of the folksonomy, have been discussed in [5]. These similarity measures can be applied directly to community detection and recommendation. The paper discusses measures such as the matching/overlap measure on weighted and unweighted projections, the Jaccard coefficient of the overlapped tags/resources, cosine similarity, the Dice coefficient, and mutual information, among others. The paper also focuses on tag and resource similarity as a way of finding communities for recommendation and for answering queries.
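Three of the measures named above have simple set-based forms, sketched below for two resources described by their tag sets. These are the standard unweighted definitions, given only as an illustration; the exact variants evaluated in [5] (for example on weighted projections) may differ, and the sample tag sets are hypothetical.

```python
import math


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dice(a, b):
    """Twice the intersection size divided by the sum of the set sizes."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def cosine(a, b):
    """Cosine similarity of the binary indicator vectors of two sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


# Hypothetical tag sets of two resources.
tags_photo = {"flower", "yellow", "spring"}
tags_taxi = {"flower", "yellow", "car", "taxi"}
```

For these two sets the intersection has 2 elements and the union has 5, so the Jaccard coefficient is 2/5, the Dice coefficient is 4/7, and the cosine similarity is 2/√12.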

Bipartite networks are similar to folksonomies in that they contain nodes of different types, and the nodes can be divided into disjoint sets such that no two nodes within the same set are linked. Murata et al. have proposed a modularity technique for bipartite networks in [11]. A modularity measure that quantifies the goodness of communities in k-partite hypergraphs has been defined in [7]. The paper discusses the conversion of multipartite graphs into bipartite graphs by reducing a k-partite graph into k(k-1)/2 bipartite graphs. It also discusses two ways of converting k-partite graphs into unipartite graphs, viz. flattening and projections. Murata et al. have also proposed a modularity measure for tripartite hypergraphs in [12]. The method is based on greedy optimization and is not applicable to real-world folksonomies. Moreover, the method does not detect overlapping communities.

Zhang et al. have defined an edge clustering coefficient for bipartite graphs, in analogy to the node clustering coefficient in graphs, in [8]. Based on this measure, triples of nodes are formed from two adjacent edges, and the edge similarity is calculated as the node similarity between the two end nodes of the triple. In this way, communities containing nodes from both partite sets are detected. The paper also demonstrates that the communities detected from one-mode projections of the graph are not accurate due to the loss of information.

Yong-Yeol Ahn et al. have proposed a novel approach for community detection in graphs that progressively groups edges together, instead of nodes as in the conventional approach [1]. In this way, edge communities are formed, from which node communities can be obtained. The paper discusses an agglomerative, bottom-up method for clustering edges, based on a measure called partition density that is analogous to the modularity of communities. The resulting edge grouping defines edge communities, and the nodes incident upon an edge lie in the community containing that edge. In this way, a node can be a part of multiple communities, which gives rise to overlapping communities. A part of our work is largely based on this idea of edge clustering, extended to hyperedges.
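To illustrate the edge-clustering idea for ordinary graphs, the sketch below computes a similarity between two edges that share a node, in the form we recall from [1]: the Jaccard coefficient of the inclusive neighbourhoods of the two non-shared end nodes. It assumes a simple undirected graph without self-loops; the function names and the toy graph are hypothetical, and this is the graph version only, not the hyperedge similarity used in this thesis.

```python
def inclusive_neighbours(adjacency, node):
    """The node itself together with all of its neighbours."""
    return adjacency[node] | {node}


def edge_similarity(adjacency, edge_a, edge_b):
    """Similarity of two edges sharing exactly one node; 0.0 otherwise."""
    shared = set(edge_a) & set(edge_b)
    if len(shared) != 1:
        return 0.0
    (i,) = set(edge_a) - shared
    (j,) = set(edge_b) - shared
    n_i = inclusive_neighbours(adjacency, i)
    n_j = inclusive_neighbours(adjacency, j)
    return len(n_i & n_j) / len(n_i | n_j)


# Toy graph: a triangle a-b-c with a pendant node d attached to c.
adjacency = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
inside_triangle = edge_similarity(adjacency, ("a", "c"), ("b", "c"))
across_groups = edge_similarity(adjacency, ("a", "c"), ("c", "d"))
```

The two triangle edges get similarity 1.0, while the triangle edge and the pendant edge get 0.25, so an agglomerative procedure would merge the triangle edges first. Node c is incident on edges of both groups and would therefore belong to both resulting communities.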

A brief summary of the papers studied as part of the literature survey follows.

Brief Summary of the papers studied

Serial Number | Name of the paper | Name of the authors | Remarks
1 | Network Properties of Folksonomies | Christoph Schmitz, Miranda Grahl, Andreas Hotho, Gerd Stumme | Describes various structural properties of folksonomies and analyzes the tag co-occurrence network
2 | Tagommenders: Connecting Users to Items through Tags | Shilad Sen, Jesse Vig, John Riedl | Recommendation of tags to users by predictions based on previous tagging behavior
3 | Tag-based Social Interest Discovery | Xin Li, Lei Guo, Yihong (Eric) Zhao | Identifies topics of interest, each described by a set of tags, in folksonomies and clusters and indexes resources
4 | Detecting Communities from Bipartite Networks Based on Bipartite Modularities | Tsuyoshi Murata | Proposes a modularity measure for bipartite graphs for detecting communities that contain nodes of different partite sets together
5 | Towards Community Detection in k-partite k-uniform hypergraphs | Nicolas Neubauer, Klaus Obermayer | Proposes a modularity measure for multipartite graphs by reducing k-partite graphs into k(k-1)/2 bipartite graphs
6 | Detecting Communities from Tripartite Networks | Tsuyoshi Murata | Proposes a modularity measure for tripartite graphs to evaluate the partition of a folksonomy into communities
7 | Evaluating Similarity Measures for Emergent Semantics of Social Tagging | Benjamin Markines, Ciro Cattuto, Filippo Menczer, Dominik Benz, Andreas Hotho, Gerd Stumme | Discusses various similarity measures between nodes for folksonomies and focuses on tag and resource similarity as a way of finding communities for recommendation and answering query results
8 | Clustering coefficient and community structure of bipartite networks | Peng Zhang, Jinliang Wang, Xiaojia Li, Menghui Li, Zengru Di, Ying Fan | Proposes a measure called the edge clustering coefficient for clustering nodes of bipartite graphs into communities
9 | Link communities reveal multiscale complexity in networks | Yong-Yeol Ahn, James P. Bagrow, Sune Lehmann | Proposes a new method for detecting communities in graphs by progressively grouping edges together and detecting edge communities, which further give rise to overlapping node communities
10 | Information Retrieval in Folksonomies: Search and Ranking | Andreas Hotho, Robert Jaschke, Christoph Schmitz, Gerd Stumme | Proposes a ranking algorithm for folksonomies based on an adaptation of the PageRank algorithm

Chapter 3

ALGORITHM FOR DETECTION OF OVERLAPPING COMMUNITIES IN FOLKSONOMIES

3.1 Community detection in folksonomies

Communities in folksonomies arise as a result of social tagging by users. As the
folksonomy grows, various topics of interest develop, and some of these topics overlap.
Detecting communities, i.e. sub-networks that are densely connected inside and sparsely
connected to the rest of the network, is practically important for finding similar
entities and for understanding the structure of social media. Existing methods assign
each node of a folksonomy to a single community, but as stated earlier, users and
resources in real-world folksonomies are likely to be members of multiple overlapping
communities. Here we propose an algorithm to detect such overlapping communities in
tripartite hypergraphs. The proposed algorithm first detects communities of similar
hyperedges and then uses these hyperedge communities to identify communities of nodes.
To the best of our knowledge, no existing method finds overlapping communities for
nodes in a folksonomy.

3.2 Overview of the algorithm

The community detection algorithm proceeds in a bottom-up hierarchical way, merging the
most similar communities of hyperedges until one community remains. The hyperedges are
clustered based on the similarity between adjacent hyperedges. To define adjacency we
considered several notions, viz. two nodes common between two hyperedges, one node
common between two hyperedges, and at least one node common between two hyperedges. Of
these, we found that the criterion of at least one common node captures the notion of
adjacency best. The resultant structure is a dendrogram, which is cut at the level where
the optimization measure, partition density (described later), is maximum. Node
communities are then formed from the hyperedge communities. Each community comprises
nodes from all three sets, Type X, Type Y and Type Z.

3.3 Definitions

1. Folksonomy representation: A folksonomy is represented as a hypergraph, given as a
list of hyperedges between vertices of Type X, Type Y and Type Z.

2. Hyperedge representation: (a, b, c) represents a hyperedge between node a of Type X,
node b of Type Y and node c of Type Z.

3. Adjacency of hyperedges: Two hyperedges (a, b, c) and (p, q, r) are adjacent if they

have at least one vertex in common i.e. either a = p or b = q or c = r.

An alternative, analogous measure of adjacency considers two hyperedges (a, b, c) and
(p, q, r) to be adjacent if they have exactly two vertices in common, i.e. either
(a = p and b = q) or (b = q and c = r) or (a = p and c = r). This measure is revisited
in the discussions section of this chapter.

4. Neighborhood of hyperedges: The neighborhood of a hyperedge is defined relative to an
adjacent hyperedge, so it depends on the neighboring hyperedge under consideration. It
is based on the sets of neighbors of the constituent nodes of the hyperedge, as
explained below.

Consider hyperedges (a, b, c) and (p, q, r) that are adjacent. Without loss of
generality, let node a and node p be the same (a = p), i.e. the node of Type X is common
(either of the other two nodes may be common as well). The following figure shows these
hyperedges with a common node.

Figure 3.1 Neighbor sets of two adjacent hyperedges

In Figure 3.1 the neighbor sets of the constituent nodes of each hyperedge are shown
enclosed in ellipses. The red circles, blue triangles and green squares represent the
nodes of Type X, Type Y and Type Z respectively. The figure shows the hyperedges
(a, b, c) and (a, q, r), with node a of Type X, nodes b and q of Type Y and nodes c and r
of Type Z. NX(b) is the set of Type X neighbors of node b, i.e. the set of nodes of
Type X that are connected to node b by a hyperedge. NY(c) is the set of nodes of Type Y
that are neighbors of node c. The sets for the other nodes are defined similarly. The
set S1 used below is the union of the sets NX(b) and NX(c).

Based on these neighbor sets of the nodes, Type X, Type Y and Type Z neighbor sets are
defined for the hyperedges (a, b, c) and (a, q, r). The Type X, Type Y and Type Z
neighbor sets S1, S2 and S3 for the hyperedge (a, b, c) are defined as follows:

S1 = {Neighbor nodes of b and c of Type X} = NX(b) ∪ NX(c)

S2 = {Neighbor nodes of c of Type Y} = NY(c)

S3 = {Neighbor nodes of b of Type Z} = NZ(b)

Similarly, the Type X, Type Y and Type Z neighbor sets S1’, S2’ and S3’ of (a, q, r)
are defined as follows:

S1’ = NX(q) ∪ NX(r)

S2’ = NY(r)

S3’ = NZ(q)

5. Similarity of hyperedges: The similarity of non-adjacent hyperedges is defined to be
zero. For adjacent hyperedges (a, b, c) and (a, q, r), the similarity is defined as:

$$\mathrm{sim}\big((a, b, c), (a, q, r)\big) = \frac{|S_1 \cap S_1'| + |S_2 \cap S_2'| + |S_3 \cap S_3'|}{|S_1 \cup S_1'| + |S_2 \cup S_2'| + |S_3 \cup S_3'|}$$

where S1, S2 and S3 are the neighbor sets of hyperedge (a, b, c) and S1’, S2’ and S3’
are the neighbor sets of hyperedge (a, q, r), as explained earlier. Higher values of this
expression indicate higher similarity between the hyperedges.

6. Similarity of hyperedge communities: Similarity of hyperedge communities is equal

to the maximum similarity between pairs of constituent hyperedges one from each

community.

Figure 3.2 Similarity between two hyperedge communities

Figure 3.2 shows two hyperedge communities as black circles, with their constituent
hyperedges drawn as red circles inside them in a vector space. The Euclidean distance
between two hyperedges in this space represents the similarity between them. The
similarity between two communities is equal to the maximum similarity between a pair of
constituent hyperedges, one from each community, and is depicted by a blue line.

7. Partition density: The partition density of a community is defined as follows:

X = Number of hyperedges in the community

Y = Maximum number of hyperedges possible in the community

Partition density = X / Y

The partition density of a partition of the folksonomy into communities is the average
of the partition densities of the individual communities, weighted by the number of
hyperedges in each community. A higher value of this weighted average indicates a better
partition of the folksonomy into communities. The algorithm identifies the partition
that yields the highest partition density, subject to the number of communities being
at least two.
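The partition density definition above can be illustrated with a minimal Python sketch. It assumes that hyperedges are stored as (Type X, Type Y, Type Z) tuples and that Y, the maximum number of hyperedges possible in a community, is the product of the numbers of distinct Type X, Type Y and Type Z nodes touched by the community. Both the function names and this reading of Y are assumptions made for illustration and are not taken from the thesis.

```python
def community_density(edges):
    """X / Y for one hyperedge community (Y is an assumed interpretation)."""
    edges = set(edges)
    if not edges:
        return 0.0
    max_possible = 1
    for t in range(3):
        # distinct nodes of type t touched by the community
        max_possible *= len({e[t] for e in edges})
    return len(edges) / max_possible


def partition_density(communities):
    """Average community density, weighted by the number of hyperedges."""
    sizes = [len(set(c)) for c in communities]
    total = sum(sizes)
    if total == 0:
        return 0.0
    return sum(n * community_density(c) for n, c in zip(sizes, communities)) / total


demo = [
    [("u1", "t1", "m1"), ("u1", "t1", "m2"), ("u2", "t1", "m1")],  # 3 of 4 possible
    [("u3", "t2", "m3")],                                          # 1 of 1 possible
]
demo_density = partition_density(demo)  # (3 * 0.75 + 1 * 1.0) / 4
```

Under this plain ratio a community consisting of a single hyperedge always scores 1, so the sketch is only meant to make the weighting explicit. Any normalization that penalizes such trivial communities is not reproduced here.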

3.4 Algorithm for detecting communities in folksonomies

A tripartite hypergraph is denoted as G = (V, E), where the set of nodes V is composed
of three partite sets (types) VX, VY and VZ, and E is the set of hyperedges; each
hyperedge connects a triple of nodes (a, b, c) where a ∈ VX, b ∈ VY and c ∈ VZ. Further,
let NX(i), NY(i) and NZ(i) denote the sets of neighbors of node i that belong to the
node sets VX, VY and VZ respectively. The proposed algorithm performs an agglomerative
hierarchical clustering of hyperedges using single-linkage similarity among clusters of
hyperedges. The similarity between two adjacent hyperedges (i.e. hyperedges having at
least one node in common) is our customized measure given in the definitions section.
Non-adjacent hyperedges are assumed to have zero similarity, as explained earlier.
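As an illustration of the notation NX(i), NY(i), NZ(i) and of the hyperedge similarity, a minimal Python sketch follows. The helper names build_neighbors and similarity are hypothetical. The sketch also assumes that when the shared node is of Type Y or Type Z, the neighbor sets are formed by the same pattern as in the Type X case, and that only the first shared position is used when two hyperedges share more than one node. Neither assumption is stated in the thesis.

```python
from collections import defaultdict


def build_neighbors(edges):
    """Map (type index, node) -> three sets: its neighbors of Type X, Y and Z."""
    nbrs = defaultdict(lambda: (set(), set(), set()))
    for e in edges:
        for t in range(3):
            for u in range(3):
                if u != t:
                    nbrs[(t, e[t])][u].add(e[u])
    return nbrs


def similarity(e1, e2, nbrs):
    """Similarity of two hyperedges; zero if they share no node."""
    shared = [k for k in range(3) if e1[k] == e2[k]]
    if not shared:
        return 0.0
    k = shared[0]                      # assumption: use the first shared position
    i, j = [t for t in range(3) if t != k]

    def neighbor_sets(e):
        # For k = 0 this gives S1 = NX(b) | NX(c), S2 = NY(c), S3 = NZ(b).
        return {
            k: nbrs[(i, e[i])][k] | nbrs[(j, e[j])][k],
            i: nbrs[(j, e[j])][i],
            j: nbrs[(i, e[i])][j],
        }

    s, s_prime = neighbor_sets(e1), neighbor_sets(e2)
    inter = sum(len(s[t] & s_prime[t]) for t in range(3))
    union = sum(len(s[t] | s_prime[t]) for t in range(3))
    return inter / union if union else 0.0


edges = [("u1", "t1", "m1"), ("u1", "t2", "m2"), ("u2", "t1", "m2")]
nbrs = build_neighbors(edges)
sim_adjacent = similarity(edges[0], edges[1], nbrs)  # both contain user u1
```

For the two hyperedges sharing u1 in this toy example, the three intersections have sizes 2, 1 and 1 and the three unions have size 2 each, so the similarity is 4/6.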

The hierarchical clustering, continued until all hyperedges belong to a single cluster,
builds a dendrogram (as shown in Figure 3.3), and cutting this dendrogram at a suitable
level gives communities of hyperedges. The optimal level for the cut, on which the
quality of the obtained communities depends, is decided based on the partition density
metric [1] as follows. The partition density of a community C of edges (or hyperedges,
in the case of hypergraphs) is the number of edges in C, normalized by the minimum and
maximum number of edges possible among the induced nodes (i.e. the nodes that are
touched by the edges in C). The global partition density for a given partitioning of the
edges (hyperedges) is the average partition density of all communities, weighted by the
fraction of edges present in each community.

Figure 3.3 Agglomerative clustering of hyperedge communities

Our customized partition density metric for use on hypergraphs is explained in the
definitions section. Similar to [1], the dendrogram is cut at the level at which the
global partition density is maximum (see Figure 3.3). Thus each hyperedge is placed into
a single community, and a node inherits membership of all the communities into which its
hyperedges are placed.

This procedure is summarized in the following pseudo-code:

1. Initialize all hyperedges to be in different communities.

2. Do

i. Find the similarities between all pairs of communities.

ii. Merge the two most similar communities (the resultant community is given the

least identifier among the two merged communities) using single linkage

clustering.

iii. Find the Partition Density for this division.

Until only one community remains

(The above loop generates a dendrogram)

3. Trace the dendrogram and find the level at which the Partition Density attained a

maximum.

4. Cut the dendrogram at that level and save the resultant hyperedge communities at

that level.

5. Form node communities from hyperedge communities with a node community

corresponding to each hyperedge community. Each node community consists of

nodes of all the three types viz. Type X, Type Y and Type Z. Nodes belong to the node

communities corresponding to the hyperedge communities of those hyperedges

that are incident on them.
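The steps above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the implementation used in the thesis. Single-linkage merging is realized by visiting hyperedge pairs in decreasing order of similarity, which yields the same merge sequence up to ties. The partition density uses a min/max normalization in which the minimum is assumed to be the largest of the three per-type node counts; this choice, the tie-breaking in favor of the coarser partition, and all function names are assumptions.

```python
from collections import defaultdict
from itertools import combinations, product


def build_neighbors(edges):
    nbrs = defaultdict(lambda: (set(), set(), set()))
    for e in edges:
        for t in range(3):
            for u in range(3):
                if u != t:
                    nbrs[(t, e[t])][u].add(e[u])
    return nbrs


def similarity(e1, e2, nbrs):
    shared = [k for k in range(3) if e1[k] == e2[k]]
    if not shared:
        return 0.0                       # non-adjacent hyperedges
    k = shared[0]
    i, j = [t for t in range(3) if t != k]

    def sets(e):
        return {k: nbrs[(i, e[i])][k] | nbrs[(j, e[j])][k],
                i: nbrs[(j, e[j])][i],
                j: nbrs[(i, e[i])][j]}

    a, b = sets(e1), sets(e2)
    inter = sum(len(a[t] & b[t]) for t in range(3))
    union = sum(len(a[t] | b[t]) for t in range(3))
    return inter / union if union else 0.0


def partition_density(communities):
    total = sum(len(c) for c in communities)
    weighted = 0.0
    for c in communities:
        sizes = [len({e[t] for e in c}) for t in range(3)]
        most = sizes[0] * sizes[1] * sizes[2]
        least = max(sizes)               # ASSUMPTION: fewest hyperedges touching all nodes
        if most > least:                 # single-hyperedge communities contribute 0
            weighted += len(c) * (len(c) - least) / (most - least)
    return weighted / total


def detect_communities(edges):
    edges = sorted(set(edges))
    nbrs = build_neighbors(edges)
    pairs = sorted(combinations(edges, 2),
                   key=lambda p: similarity(p[0], p[1], nbrs), reverse=True)
    label = {e: n for n, e in enumerate(edges)}      # step 1: one community each
    members = {n: [e] for n, e in enumerate(edges)}
    best_score, best = -1.0, [list(m) for m in members.values()]
    for e1, e2 in pairs:                             # step 2: merge most similar first
        a, b = sorted((label[e1], label[e2]))
        if a == b:
            continue
        for e in members[b]:
            label[e] = a                             # keep the smaller identifier
        members[a].extend(members.pop(b))
        if len(members) < 2:                         # require at least two communities
            break
        score = partition_density(list(members.values()))
        if score >= best_score:                      # steps 3-4: remember the best cut
            best_score, best = score, [list(m) for m in members.values()]
    # step 5: a node joins the community of every hyperedge incident on it
    node_comms = [{(t, e[t]) for e in c for t in range(3)} for c in best]
    return best, node_comms


block1 = list(product(["u1", "u2"], ["t1", "t2"], ["m1", "m2"]))
block2 = list(product(["u3", "u4"], ["t3", "t4"], ["m3", "m4"]))
edge_comms, node_comms = detect_communities(block1 + block2)
```

On this toy input of two disconnected, fully dense blocks, the sketch recovers the two blocks as hyperedge communities. It computes all pairwise similarities up front, so it is quadratic in the number of hyperedges and intended for small inputs only.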

3.5 Discussions

To find the similarity between hyperedges, two approaches were tried for defining the
adjacency of hyperedges. The first, in which hyperedges sharing at least one node are
adjacent, is explained in the definitions section. The second considers two hyperedges
to be adjacent if and only if they share exactly two nodes. This approach is more
stringent and was found to generate better results than the first for very dense
hypergraphs. Since real-world hypergraphs are not very dense, we proceeded with the
first approach.

Chapter 4

Experiments and Results

4.1 Data Collection

For the purpose of community detection in folksonomies we collected the folksonomy

of MovieLens.

Three data sets of varying sizes were obtained:

1. A large data set containing 10,000,054 ratings and 95,580 tags applied to 10,681
movies by 71,567 users of the online movie recommender service MovieLens. The data also
contains keywords indicating the genres of the movies.

2. A medium-sized data set containing tags applied to 3,592 movies by 6,040 users, with
at least 20 tags per user.

3. A small data set containing 100,000 ratings (1-5) from 943 users on 1,682 movies.

Thus MovieLens contains two folksonomies viz. the folksonomy of users, tags and

movies and that of users, ratings and movies. The data sets are available for download

from GroupLens Research website (http://www.grouplens.org/). We have used the

folksonomy of users, tags and movies in our study.

Since ground truth is not available for most real-world folksonomies, it is difficult to
validate the community structure obtained by our algorithm. Hence we have used
synthetically generated hypergraphs that have a predefined community structure with
overlapping communities of nodes. In addition to quantifying performance on synthetic
data, we have used a subset of the real-world data for qualitative as well as
quantitative analysis of the algorithm. We have obtained metadata for the MovieLens
folksonomy and formulated a measure to judge the performance of the algorithm on
real-world folksonomies (explained in later sections).

4.2 Synthetic Data Generation

The synthetic data used in the experiments were generated as follows:

1. The three node sets of Type X, Type Y and Type Z were created with an equal number of
nodes in each set.

2. A fixed number of communities was decided randomly.

3. Nodes from each set were assigned to a community chosen randomly from the pre-decided
number of communities.

4. A pre-decided fraction of the nodes were chosen to lie in multiple communities. For
each of these nodes, the number of communities in which the node would lie was decided
randomly, and the node was then allotted to additional communities accordingly.

5. A pre-decided fraction of hyperedges were chosen to be scattered. Scattered hyperedges
connect nodes lying in different communities, whereas un-scattered hyperedges connect
nodes of the same community.

6. Un-scattered and scattered hyperedges were added randomly between nodes of the same
community and between nodes of different communities, respectively, based on the
predefined fractions.

7. A set of hypergraphs was generated for different values of the fraction of nodes in
multiple communities and the fraction of scattered hyperedges. The algorithm was executed
on each hypergraph, and the results for each setting of these variable parameters were
obtained by averaging over a set of hypergraphs.
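A generator following steps 1 to 6 might look like the Python sketch below. The function name, the parameter names and the defaults are hypothetical. The sketch also assumes that a scattered hyperedge is one whose three nodes have no community in common, and that the total number of hyperedges equals the number of nodes per set times the average node degree; the thesis does not spell out these details.

```python
import random


def generate_hypergraph(n_per_set=100, n_comms=5, avg_degree=10,
                        overlap_frac=0.2, scatter_frac=0.1, seed=0):
    rng = random.Random(seed)
    # Steps 1-3: three equal node sets, each node gets one random community.
    nodes = [(t, i) for t in range(3) for i in range(n_per_set)]
    member = {n: {rng.randrange(n_comms)} for n in nodes}
    # Step 4: a fraction of the nodes joins a random number of extra communities.
    for n in rng.sample(nodes, int(overlap_frac * len(nodes))):
        others = [c for c in range(n_comms) if c not in member[n]]
        if others:
            member[n].update(rng.sample(others, rng.randint(1, len(others))))
    by_type = [[n for n in nodes if n[0] == t] for t in range(3)]
    pools = [[[n for n in by_type[t] if c in member[n]] for t in range(3)]
             for c in range(n_comms)]
    usable = [p for p in pools if all(p)]   # communities with nodes of every type

    def inside():
        # Un-scattered hyperedge: all three nodes come from one community.
        if not usable:
            return None
        pool = rng.choice(usable)
        return tuple(rng.choice(pool[t]) for t in range(3))

    def scattered():
        # Scattered hyperedge (assumed meaning): no community contains all three nodes.
        e = tuple(rng.choice(by_type[t]) for t in range(3))
        return None if member[e[0]] & member[e[1]] & member[e[2]] else e

    # Steps 5-6: add scattered hyperedges first, then fill up with un-scattered ones.
    n_edges = n_per_set * avg_degree
    n_scattered = round(scatter_frac * n_edges)
    edges = set()
    for draw, target in ((scattered, n_scattered), (inside, n_edges)):
        tries = 0
        while len(edges) < target and tries < 100 * n_edges:  # bounded retries
            tries += 1
            e = draw()
            if e is not None:
                edges.add(e)
    return sorted(edges), member


demo_edges, demo_member = generate_hypergraph(seed=1)
```

The returned membership map plays the role of the ground truth when the detected communities are evaluated. Step 7 then amounts to calling the generator repeatedly with different seeds and fractions and averaging the results.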

4.3 Metrics for evaluation

Multiple synthetic hypergraphs were generated using the above procedure, and the
detected communities were validated using the following community quality measure. It
compares how similar the nodes grouped into the same detected community are, according
to the ground truth, with how similar two nodes are on average.

$$\text{Community quality} = \frac{\langle f(x, y) \rangle \ \text{for pairs of nodes in a detected community}}{\langle f(x, y) \rangle \ \text{for all possible pairs of nodes}}$$

where f(x, y) indicates the metadata similarity of nodes x and y, i.e. the similarity
between nodes x and y based on the ground truth (which is known for synthetic data), and
<f(x, y)> indicates the average metadata similarity value. Essentially, the community
quality is the ratio of the average metadata similarity between all pairs of nodes
placed into the same community by the algorithm to the average metadata similarity
between all pairs of nodes. The metadata similarity is based on the predefined community
structure. The function f(x, y) for the node pair x and y is defined as follows:

$$f(x, y) = \frac{|C_x \cap C_y|}{|C_x \cup C_y|}$$

where Cx and Cy represent the sets of communities that nodes x and y belong to in the
synthetic hypergraphs.

Values of community quality higher than 1 indicate that the identified community

structure indeed groups similar nodes into the same communities. The experiments that

were carried out over the synthetic data used this metric for the evaluation of the

algorithm in identifying overlapping communities of nodes in folksonomies.
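The community quality measure can be computed along the following lines. This Python sketch assumes that the detected communities are given as sets of nodes and that the ground truth maps each node to the set of communities it belongs to; the function names are hypothetical. A pair of nodes that appears together in several detected communities is counted once per community, which is one possible reading of the definition.

```python
from itertools import combinations


def jaccard(a, b):
    """f(x, y): overlap of the ground-truth community sets of two nodes."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def community_quality(detected, truth):
    """Ratio of the mean f(x, y) within detected communities to the mean over all pairs."""
    def mean_f(pairs):
        pairs = list(pairs)
        if not pairs:
            return 0.0
        return sum(jaccard(truth[x], truth[y]) for x, y in pairs) / len(pairs)

    within = [p for c in detected for p in combinations(sorted(c), 2)]
    overall = mean_f(combinations(sorted(truth), 2))
    return mean_f(within) / overall if overall else float("nan")


truth = {"a": {1}, "b": {1}, "c": {2}, "d": {2}}
good = community_quality([{"a", "b"}, {"c", "d"}], truth)  # matches the ground truth
bad = community_quality([{"a", "c"}, {"b", "d"}], truth)   # mixes the two communities
```

In this toy example the average f(x, y) over all six pairs is 1/3, so a detection that matches the ground truth scores 3 while the mixed detection scores 0.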

4.4 Experiments on the synthetic data

Experiments on synthetic data were carried out by running the algorithm on the synthetic
hypergraphs and comparing the output with their predefined community structure. As
explained earlier, a series of synthetic hypergraphs was generated, and the community
quality measure was obtained for each of them.

Each experiment was carried out by varying the following set of parameters:

1. Number of communities in the synthetic hypergraph.

2. Number of nodes in each of the three sets Type X, Type Y and Type Z

The numbers of nodes in each of the three sets were chosen to be equal to each

other.

i.e. |VX| = |VY| = |VZ|

3. Average node degree (average number of hyperedges per node)

The average node degree controls the number of hyperedges in the synthetic hypergraph.
Since every hyperedge contains exactly one node from each set, the number of hyperedges
is equal to the product of the average node degree and the number of nodes in each set.

4. Fraction of nodes in multiple communities.

This value controlled the amount of overlap between the node communities.

5. Fraction of scattered hyperedges

Scattered hyperedges connect nodes from different node communities. They denote the
passing interests of users who have tagged resources, where the user, tag and resource
together do not constitute a topic of interest.

The experiments were carried out for synthetic data with the following values for the

parameters of the synthetic data being fixed:

1. Number of communities in the synthetic hypergraph = 5

2. Number of nodes in each of the three sets = 100

3. Average node degree = 10

4. Number of hyperedges = 1000

The experiments were carried out over hypergraphs by first varying the fraction of nodes
in multiple communities from 0.0 to 1.0 in intervals of 0.2 while keeping the fraction of
scattered hyperedges constant, and then varying the fraction of scattered hyperedges over
the same range and interval while keeping the fraction of nodes in multiple communities
constant. The following plots show the variation of community quality with the changes
in the fraction of nodes in multiple communities and the fraction of scattered
hyperedges.

Figures 4.1 and 4.2 show community quality as a function of the fraction of nodes in
multiple communities and the fraction of scattered hyperedges respectively. The
community quality value is highest when the fraction of nodes in multiple communities
and the fraction of scattered hyperedges are both zero, and it decreases gracefully with
an increase in either of the two.

Figure 4.1 Community quality as a function of fraction of nodes in multiple communities

Figure 4.2 Community quality as a function of fraction of scattered hyperedges

The community quality value nevertheless remains higher than 1, which indicates that the
detected community structure groups similar nodes into the same communities.

4.5 Experiments on the real world data

A subset of the MovieLens folksonomy (mentioned in the datasets) with 1000 hyperedges
was used, chosen such that no node has a high degree. The subset was obtained by sorting
the set of hyperedges lexicographically and selecting hyperedges while restricting the
degree of each node to four. The resultant subset was used as input for the algorithm,
and each of the resultant communities contained users, movies and the tags describing
the movies. As explained earlier, ground truth is not available for judging the results
of the algorithm on a real-world folksonomy. In order to get a fair idea of the
performance of the algorithm on real-world data, metadata in the form of information
about movies was obtained and used. These metadata comprised information, obtained from
IMDb, about the genres of the movies in the MovieLens folksonomy. The data obtained from
IMDb contains a list of movies, with each movie associated with a set of keywords that
describe the genre of the movie. These keywords are a subset of 18 keywords predefined
by IMDb, viz. Action, Adventure, Animation, Children, Comedy, Crime, Documentary, Drama,
Fantasy, Film-Noir, Horror, Musical, Mystery, Romance, Sci-Fi, Thriller, War, and
Western. This data is used to compare the movies that have been grouped into the same
community by the algorithm.

4.5.1 Quality of the communities

In order to find the similarity between movies, each movie is treated as a vector and
measures of vector similarity are used. Each movie is represented as a vector in 18
dimensions corresponding to the 18 keywords. The similarity between two movies is the
cosine similarity between the two vectors, which is the cosine of the angle between
them. The cosine function takes values between 0 and 1 for angles between 0 and π/2.
Smaller angles and higher cosine values indicate higher similarity.
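As a small illustration, the genre vectors and their cosine similarity could be computed as in the Python sketch below. It assumes a binary encoding in which a component is 1 if the movie carries the corresponding keyword and 0 otherwise; the thesis does not state the encoding, and the function names are hypothetical.

```python
import math

GENRES = ["Action", "Adventure", "Animation", "Children", "Comedy", "Crime",
          "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror", "Musical",
          "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western"]


def genre_vector(genres):
    """18-dimensional 0/1 vector (assumed encoding) for a set of genre keywords."""
    return [1.0 if g in genres else 0.0 for g in GENRES]


def cosine(u, v):
    """Cosine of the angle between two vectors; 0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


sim_demo = cosine(genre_vector({"Action", "Sci-Fi"}),
                  genre_vector({"Action", "Thriller"}))
```

Two movies that share one of their two genres, as in this example, have a cosine similarity of about 0.5, while movies with identical genre sets score 1 and movies with no common genre score 0.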

The movies from each detected community were considered. The cosine similarity for each
pair of movies was computed and averaged over all pairs. Figure 4.3 shows a plot of the
average cosine similarity values for each community. The plot shows the cumulative
average cosine similarity values for all the detected communities of the MovieLens
folksonomy. As is apparent, a reasonable number of communities have high values of
average cosine similarity, indicating that the performance of the algorithm on the
real-world MovieLens folksonomy is fairly good.

Figure 4.3 Average cosine similarity values for the detected communities

The output communities were used to determine the distribution of community sizes (see
Figure 4.4). This result gives an indication of the nature of the original graph. If the
original graph is sparse, more communities of size three (corresponding to a single
hyperedge) are formed.

4.5.2 Quality of the overlap

In order to find out whether the overlapping communities are actually any better than
non-overlapping ones, we checked whether the movies that have been placed in multiple
communities actually need to belong to multiple communities.

We derived another community structure for the movies wherein each movie is placed in
exactly one community. This structure is derived from the partition obtained by our
algorithm by allocating to each movie the best community among the ones that it belongs
to. Analogous to the average cosine similarity of a community, we define the average
cosine similarity of a movie with respect to a community as the average of the cosine
similarities between that movie and the other movies in that community. Each movie is
allocated to the community for which this average cosine similarity is the highest.
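The derivation of the non-overlapping structure can be sketched in Python as follows. The function and argument names are hypothetical, the toy vectors are two-dimensional rather than 18-dimensional, and ties are broken in favor of the smaller community identifier, which is an assumption of this sketch.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def assign_best_community(memberships, vectors):
    """memberships: community id -> set of movies (overlapping);
    vectors: movie -> genre vector. Returns movie -> single community id."""
    best = {}
    for movie, vec in vectors.items():
        candidates = sorted(c for c, ms in memberships.items() if movie in ms)

        def avg_sim(c):
            # average cosine similarity of this movie with the others in community c
            others = [m for m in memberships[c] if m != movie]
            if not others:
                return 0.0
            return sum(cosine(vec, vectors[m]) for m in others) / len(others)

        if candidates:
            best[movie] = max(candidates, key=avg_sim)
    return best


vectors = {"x": [1.0, 0.0], "a": [1.0, 0.0], "b": [0.0, 1.0]}
memberships = {1: {"x", "a"}, 2: {"x", "b"}}   # movie x lies in both communities
assignment = assign_best_community(memberships, vectors)
```

Here movie x is identical in genre to a and orthogonal to b, so it is allocated to community 1, while a and b keep their only community.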

The experiments were carried out over the subset of the MovieLens dataset as

explained earlier over the two community structures viz.

1. The overlapping communities obtained by our algorithm

2. The non-overlapping communities obtained by the method explained earlier

The average cosine similarity values were calculated for each community, and the values
for the communities obtained by our algorithm were compared with the corresponding
values for the non-overlapping communities obtained by the approach discussed earlier.
Out of the 178 communities found by our algorithm, 119 overlapping communities had
higher average cosine similarity values than the corresponding non-overlapping
communities, i.e. just over two-thirds of the overlapping communities have higher
average cosine similarity values. This suggests that the movies placed in multiple
communities are placed appropriately.

The communities are of different sizes so in order to give an overview of the quality of

all communities irrespective of the size we found out the average similarity values

weighted by size of the communities over all the communities for the two community

structures. The weighted average cosine similarity values for the two community

structures are as follows:

Community structure | Weighted average cosine similarity value

Overlapping | 0.373794877108

Non-overlapping | 0.21820724018

Thus the overlapping community structure obtained by our algorithm has a higher weighted
average cosine similarity than the non-overlapping one.

4.5.3 Other measures

The algorithm was run on the MovieLens data described above, and the distribution of
node community sizes and the distribution of nodes in multiple communities were
determined to justify the purpose of the algorithm.

Figure 4.4 shows the distribution of node community sizes of the detected communities in
the MovieLens folksonomy. A large number of communities have three nodes in them, i.e.
one from each of the three sets of Type X, Type Y and Type Z. Each of these node
communities results from a single hyperedge that the algorithm placed in a hyperedge
community of its own. This occurs because the MovieLens folksonomy is a disconnected
hypergraph. The other, larger communities have multiple movies grouped together. These
movies have been compared using the average cosine similarity values as explained
earlier.

Figure 4.4 Distribution of node community sizes for the detected communities

The output communities were also used to determine the distribution of nodes that have
been placed in multiple communities, i.e. the number of communities in which each node
is placed (see Figure 4.5). A substantial number of nodes have been grouped into
multiple communities, which highlights the overlapping nature of folksonomies. Thus the
purpose of the algorithm is justified.

Figure 4.5 Distribution of nodes in multiple detected communities

Chapter 5

Conclusion

In this thesis we propose, to the best of our knowledge, the first algorithm to detect
overlapping communities in folksonomies. The algorithm gives reasonably good results for
synthetic as well as real-world folksonomies, and it detects overlapping communities of
nodes. This algorithm can be applied in practice for recommending resources to users.

Future Work

The implementation of the algorithm and its actual use in recommendation or search is

an important application of the algorithm. Another application of this algorithm would

be to suggest friends to users. Nodes are grouped into communities and each

community represents a topic of interest. Many users have different areas of interest

and based on these interests likeminded users can be found out. Thus friend suggestions

can be made to users. Yet another application of the algorithm would be the searching

and ranking of results pertaining to the folksonomy. Actual application of the algorithm

on a large real world folksonomy and using the results for search or recommendation

comprises the future work.

Publications

[1] Saptarshi Ghosh, Pushkar Kane, Niloy Ganguly. Identifying overlapping

communities in folksonomies or tripartite hypergraphs, International World Wide

Web Conference, March 2011

References

[1] Y.-Y. Ahn, J. P. Bagrow, and S. Lehmann. Link communities reveal multiscale complexity in networks. Nature, 466(7307):761–764, August 2010.

[2] T. Murata. Modularity for heterogeneous networks. In ACM Hypertext, pages 129–134, June 2010.

[3] S. Papadopoulos, Y. Kompatsiaris, and A. Vakali. Leveraging collective intelligence through community detection in tag networks. In CKCaR, September 2009.

[4] C. Schmitz, M. Grahl, A. Hotho, and G. Stumme. Network properties of folksonomies. AI Communications, 20(4):245–262, IOS Press, Amsterdam, The Netherlands, December 2007.

[5] B. Markines, C. Cattuto, F. Menczer, D. Benz, A. Hotho, and G. Stumme. Evaluating similarity measures for emergent semantics of social tagging. In International World Wide Web Conference, 2009.

[6] X. Li, L. Guo, and Y. (Eric) Zhao. Tag-based social interest discovery. In International World Wide Web Conference, 2008.

[7] N. Neubauer and K. Obermayer. Towards community detection in k-partite k-uniform hypergraphs. In Workshop on Analyzing Networks and Learning with Graphs at NIPS, 2009.

[8] P. Zhang, J. Wang, X. Li, M. Li, Z. Di, and Y. Fan. Clustering coefficient and community structure of bipartite networks. Physica A, 387(27):6869–6875.

[9] S. Sen, J. Vig, and J. Riedl. Tagommenders: Connecting users to items through tags. In International World Wide Web Conference, April 2009.

[10] T. Murata. Detecting communities from tripartite networks. In International World Wide Web Conference, April 2010.

[11] T. Murata. Detecting communities from bipartite networks based on bipartite modularities. In IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology, 2009.

[12] A. Hotho, R. Jaschke, C. Schmitz, and G. Stumme. Information retrieval in folksonomies: Search and ranking. In Y. Sure and J. Domingue, editors, The Semantic Web: Research and Applications, volume 4011 of LNAI, pages 411–426, Heidelberg, June 2006. Springer.
