Mining Email Social Networks

Post on 29-Nov-2014

Christian Bird, Alex Gourley, Prem Devanbu, Michael Gertz, Anand Swaminathan

University of California, Davis

Presented by: Arnamoy Bhattacharyya

Communication & Co-ordination (C&C) activities are central to large software projects

Difficult to observe and study in traditional (closed-source, commercial) settings

The email archives of OSS projects provide a useful trace of the communication and co-ordination activities of the participants

CHATTERERS & CHANGERS

A mailing list in an OSS project is a public forum

Anyone can post messages to the list.

Posted messages are visible to all the mailing list subscribers.

Posters include developers, bug-reporters, contributors (who submit patches, but don't have commit privileges) and ordinary users.

A response b to a message a is an indication that:

the sender of b (Sb) found that the sender of a (Sa) had something interesting to say

It is also an indication of Sa’s status, i.e., Sb indicates that s/he found Sa's email worth reading, and worthy of response.

However, the vast majority of individuals participating on the email list sent very few messages, and received very few replies to their messages

OF DOGS AND DEVELOPERS

“On the Internet, no one knows if you're a dog” - Peter Steiner

The same individual can use different email aliases

developer Ian Holsman uses 7 different email aliases

Ignoring these aliases would confound later steps of data analysis

Unmasking Aliases

Most emails include a header that identifies the sender, of this form:

From: "Bill Stoddard" <reddrum@attglobal.net>

Crawl messages and extract all headers to produce a list of <Name,email> identifiers (IDs)

Execute a clustering algorithm that measures the similarity between every pair of IDs

Manually post-process the clusters formed to remove further aliases

Set the cluster similarity threshold quite low: it is easier to split big clusters than to unify two disparate clusters from a very large set.
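
The header-extraction step can be sketched with Python's standard `email` package. This is only an illustration of the idea, not the authors' crawler: the function name `extract_ids` and the choice to lower-case addresses are assumptions.

```python
from email import message_from_string
from email.utils import parseaddr


def extract_ids(raw_messages):
    """Collect the distinct <name, email> pairs found in From: headers.

    raw_messages is an iterable of raw message texts (an assumption;
    the slides do not say how the archive was stored).
    """
    ids = set()
    for raw in raw_messages:
        message = message_from_string(raw)
        # parseaddr splits '"Bill Stoddard" <reddrum@attglobal.net>'
        # into a display name and an address.
        name, address = parseaddr(message.get("From", ""))
        if address:
            ids.add((name, address.lower()))
    return ids


sample = 'From: "Bill Stoddard" <reddrum@attglobal.net>\nSubject: test\n\nbody\n'
ids = extract_ids([sample, sample])
```

Because the result is a set, the same header seen in many messages yields a single ID.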

THE CLUSTERING ALGORITHM

1. Normalize name

→ Remove all punctuation and suffixes ("jr")

→ Turn all whitespace into a single space

→ Remove generic terms like "admin", "support" from the name

→ Split the name into first name and last name (using whitespace and commas as cues)
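
A minimal sketch of this normalization step follows. The suffix list beyond "jr", the handling of "Last, First" order, and the choice of first and last token as first and last name are assumptions made for illustration.

```python
import re

SUFFIXES = {"jr", "sr", "ii", "iii"}      # only "jr" is named on the slide
GENERIC_TERMS = {"admin", "support"}      # examples given on the slide


def normalize_name(raw):
    """Return (first_name, last_name) in lower case; either may be ''."""
    # Use a comma as a cue for "Last, First" before punctuation is removed.
    if "," in raw:
        last, _, first = raw.partition(",")
        raw = f"{first} {last}"
    # Remove punctuation, then let split() collapse runs of whitespace.
    cleaned = re.sub(r"[^\w\s]", "", raw.lower())
    tokens = [t for t in cleaned.split()
              if t not in SUFFIXES and t not in GENERIC_TERMS]
    if not tokens:
        return "", ""
    if len(tokens) == 1:
        return tokens[0], ""
    return tokens[0], tokens[-1]
```

For instance, `normalize_name("Stoddard, Bill")` and `normalize_name("Bill  Stoddard Jr.")` both give `("bill", "stoddard")`.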

2. Name Similarity:

Use a scoring algorithm between:

→ The full names

→ The first name and last name separately

→ Consider names similar if the full names are similar, or if both first and last names are similar

e.g. Andy Smith <-> Andrew Smith

Deepa Patel !<-> Deepa Ratnaswamy
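
The slide does not name the scoring algorithm, so the sketch below substitutes `difflib.SequenceMatcher` from the Python standard library as a stand-in. The 0.8 threshold and the function names are assumptions chosen so that the two examples above come out as shown.

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.8  # hypothetical cut-off, not from the slides


def score(a, b):
    """Similarity in [0, 1]; a stand-in for the unspecified scoring algorithm."""
    return SequenceMatcher(None, a, b).ratio()


def names_similar(first_a, last_a, first_b, last_b,
                  threshold=SIMILARITY_THRESHOLD):
    """Similar if the full names match, or if both name parts match."""
    full_a = f"{first_a} {last_a}"
    full_b = f"{first_b} {last_b}"
    if score(full_a, full_b) >= threshold:
        return True
    return (score(first_a, first_b) >= threshold
            and score(last_a, last_b) >= threshold)


andy_vs_andrew = names_similar("andy", "smith", "andrew", "smith")
patel_vs_ratnaswamy = names_similar("deepa", "patel", "deepa", "ratnaswamy")
```

Requiring both parts to agree is what keeps a shared first name alone from merging two different people.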

3. Names-email Similarity:

→ If the email contains both first and last names: match

Arnamoy Bhattacharyya <-> ar.bhat@yahoo.com

→ If the email contains the initial of one part of the name and the entirety of the other part: match

Erin Bird <-> ebird

Erin Bird <-> erinb
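
The two name-to-email rules can be sketched as a pair of substring checks. This is a simplified reading: stripping separators from the address base and requiring the initial to sit next to the full name part are assumptions, and the example addresses are made up.

```python
import re


def name_matches_email(first, last, email):
    """Apply the two rules above to a normalized name and an address."""
    first, last = first.lower(), last.lower()
    if not first or not last:
        return False
    # Keep only the part before "@", without dots, dashes or underscores.
    base = re.sub(r"[^a-z0-9]", "", email.split("@", 1)[0].lower())
    # Rule 1: both the first and the last name appear in the address base.
    if first in base and last in base:
        return True
    # Rule 2: initial of one part next to the whole of the other part.
    candidates = (first[0] + last, last + first[0],
                  first + last[0], last[0] + first)
    return any(candidate in base for candidate in candidates)
```

With this reading, `name_matches_email("erin", "bird", "ebird@example.com")` and `name_matches_email("erin", "bird", "erinb@example.com")` both match. Abbreviated forms in which both name parts are truncated would need a looser rule than this sketch applies.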

4. Email Similarity:

→ If the Levenshtein edit distance between two email address bases (the part before the "@", excluding the domain) is small: match
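
As an illustration, the edit distance can be computed with the standard dynamic-programming recurrence. The slide only says the distance must be "small", so the cut-off `max_distance=2` is an assumption, as are the helper names and the second address in the example.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def email_base(address):
    """Part of the address before the '@', lower-cased."""
    return address.split("@", 1)[0].lower()


def emails_similar(address_a, address_b, max_distance=2):
    distance = levenshtein(email_base(address_a), email_base(address_b))
    return distance <= max_distance


close = emails_similar("reddrum@attglobal.net", "redrum@example.org")
```

Here the bases "reddrum" and "redrum" differ by one deletion, so the two addresses count as similar even though the domains differ.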

5. Cumulative ID similarity:

→ The similarity between two IDs is the maximum of all the similarities mentioned above

E.g.

Name similarity: 3

Names-email similarity: 5

Email similarity: 2

The maximum is 5, so with a threshold of 4 the pair would be considered a match
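
The worked example amounts to taking a maximum and comparing it with the threshold. In the sketch below the function names are hypothetical, and treating a score equal to the threshold as a match is an assumption, since the slide only shows a score above it.

```python
def cumulative_similarity(name_score, name_email_score, email_score):
    """Cumulative ID similarity: the largest of the individual scores."""
    return max(name_score, name_email_score, email_score)


def is_match(name_score, name_email_score, email_score, threshold):
    # Assumption: a score equal to the threshold also counts as a match.
    total = cumulative_similarity(name_score, name_email_score, email_score)
    return total >= threshold


example = is_match(3, 5, 2, threshold=4)
```

Taking the maximum means one strong signal is enough to put two IDs in the same cluster, which fits the stated preference for a low threshold followed by manual splitting.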

The vast majority of people send only one message, and there are some who send a great many

Out-degree - # of different people from whom an individual has received responses

Higher out-degree <-> higher status

In-degree - # of different people to whom an individual has replied

Indicates the level of engagement of an individual in the mailing list and the breadth of his/her interests
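
Using these definitions, both degrees can be counted from a list of reply pairs. This is a sketch under assumptions: the `(replier, original_sender)` input format, the function name, and the sample names are hypothetical, and aliases are taken to be already merged.

```python
from collections import defaultdict


def degrees(replies):
    """replies: iterable of (replier, original_sender) pairs.

    Returns {person: (out_degree, in_degree)} where out-degree counts the
    distinct people who replied to the person, and in-degree counts the
    distinct people the person replied to.
    """
    responders = defaultdict(set)   # person -> people who replied to them
    replied_to = defaultdict(set)   # person -> people they replied to
    for replier, sender in replies:
        if replier == sender:
            continue  # ignore self-replies (an assumption)
        responders[sender].add(replier)
        replied_to[replier].add(sender)
    people = set(responders) | set(replied_to)
    return {p: (len(responders.get(p, ())), len(replied_to.get(p, ())))
            for p in people}


sample = [("bob", "alice"), ("carol", "alice"),
          ("bob", "alice"), ("alice", "bob")]
result = degrees(sample)
```

Sets are used so that repeated replies between the same two people count once, matching the "# of different people" wording.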

The distributions show a small-world character

High correlation (0.97) between messages sent and replies received (out-degree)

The correlation may have other explanations:

1. People who post only relevant messages receive many responses

2. Only people who receive replies from several people keep sending messages (survival effect)

Each link indicates at least 150 messages sent

C&C ACTIVITY AND DEVELOPMENT ACTIVITY

How does email activity relate to software development activity?

73 committers:

1. A correlation of 0.80 between the number of messages sent by an individual and the number of source changes they make:

more software development work <-> more C&C activity

2. A correlation of 0.57 between the number of messages sent by an individual, and number of document changes they make

Source code activities require much more co-ordination effort than documentation effort
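
The slides do not say which correlation coefficient produced these figures. Since rank correlation is mentioned later in the deck, the sketch below computes Spearman's rank correlation (Pearson's coefficient applied to ranks) as one plausible choice. The function names and the sample counts are hypothetical, not data from the study.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


def spearman(xs, ys):
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical per-committer counts, for illustration only.
messages_sent = [12, 40, 95, 300]
source_changes = [3, 10, 25, 210]
rho = spearman(messages_sent, source_changes)
```

A rank-based coefficient is less sensitive to the few very prolific posters than a coefficient computed on the raw counts.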

Are developers more likely to play the role of gatekeepers or brokers in the complete email social network?

Betweenness (BW):

High betweenness <-> the person is a kind of broker or gatekeeper
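
The formula for betweenness did not survive in this transcript. As a hedged illustration, the sketch below computes the usual shortest-path betweenness with Brandes' algorithm on an unweighted graph; whether the study used exactly this variant or any normalization is not stated here. The adjacency-dict input (every node must appear as a key) and the function name are assumptions.

```python
from collections import deque


def betweenness(adjacency):
    """Unnormalized shortest-path betweenness over ordered node pairs."""
    scores = {node: 0.0 for node in adjacency}
    for source in adjacency:
        visit_order = []
        predecessors = {node: [] for node in adjacency}
        path_counts = {node: 0 for node in adjacency}
        distance = {node: -1 for node in adjacency}
        path_counts[source] = 1
        distance[source] = 0
        queue = deque([source])
        # Breadth-first search counting shortest paths from the source.
        while queue:
            node = queue.popleft()
            visit_order.append(node)
            for neighbour in adjacency[node]:
                if distance[neighbour] < 0:
                    distance[neighbour] = distance[node] + 1
                    queue.append(neighbour)
                if distance[neighbour] == distance[node] + 1:
                    path_counts[neighbour] += path_counts[node]
                    predecessors[neighbour].append(node)
        # Accumulate dependencies, farthest nodes first.
        dependency = {node: 0.0 for node in adjacency}
        while visit_order:
            node = visit_order.pop()
            for pred in predecessors[node]:
                share = path_counts[pred] / path_counts[node]
                dependency[pred] += share * (1 + dependency[node])
            if node != source:
                scores[node] += dependency[node]
    return scores


# A three-person chain: every path between "a" and "c" passes through "b".
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
chain_scores = betweenness(chain)
```

In the chain, "b" is the only broker: it scores 2.0 (one for each direction between "a" and "c") while the endpoints score 0.0.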

Developers are higher in status than non-developers

Relative Status of Developers

Do the most active developers have the highest status among developers?

Source changes are not as highly correlated with document changes <-> not all developers are engaged in both to the same degree

Source changes show the strongest rank correlation with social network status <-> the most active developers play the strongest role of communicators, brokers, and gatekeepers

Conclusion

The level of activity on the mailing list is strongly correlated with source code change activity, and to a lesser extent with document change activity.

Social network measures such as in-degree, out-degree and betweenness indicate that developers who actually commit changes play much more significant roles in the email community than non-developers.

Even within the select group of developers, there is a strong correlation between the social network importance and level of source code change activity.

Questions?