An Investigation into the Free/Open Source Software Phenomenon using Data Mining, Social Network...

Post on 13-Jan-2016


An Investigation into the Free/Open Source Software Phenomenon using Data Mining, Social Network Theory, and Agent-Based

Greg Madey
Computer Science & Engineering

University of Notre Dame

UIUC - NSF Workshop on Continuous (Re)Design of Open Source Software

University of Illinois, Urbana-Champaign
October 8-9, 2003

This research was partially supported by the US National Science Foundation, CISE/IIS-Digital Society & Technology, under Grant No. 0222829

Contributors

• Vincent Freeh, Computer Science, North Carolina State University (Principal Investigator)
• Yongqin Gao, Computer Science and Engineering, University of Notre Dame (Graduate Student)
• Jeff Goett, University of Notre Dame (REU Student)
• Chris Hoffman, University of Notre Dame (REU Student)
• Nadir Kiyanclar, University of Notre Dame (REU Student)
• Greg Madey, Computer Science & Engineering, University of Notre Dame (Principal Investigator)
• Patrick McGovern, Director SourceForge.net, VA Software (Industrial Collaborator)
• Carlos Siu, University of Notre Dame (REU Student)
• Renee Tynan, Department of Management, College of Business, University of Notre Dame (Principal Investigator)
• Jin Xu, Computer Science & Engineering, University of Notre Dame (Graduate Student)

Outline

• Research approach
• Tools and definitions: Agents, models, simulations, collaborative social networks, computer experiments
• Data collection and analysis
• Example research question
• Simulation
• Computer experiments
• Results

One Approach to Researching F/OSSD

• Online data
– Screen scraping
– Database dumps

• Modeling
– Social network theory
– Evolutionary assumptions

• Simulation
– Verification and validation
– Computer experiments

• Variation of Classical Scientific Method

Classical Scientific Method

1. Observe the world
a) Identify a puzzling phenomenon

2. Generate a falsifiable hypothesis (K. Popper)

3. Design and conduct an experiment with the goal of disproving the hypothesis
a) If the experiment “fails”, then the hypothesis is accepted (until replaced)
b) If the experiment “succeeds”, then reject hypothesis, but additional insight into the phenomenon may be obtained and steps 2-3 repeated

The Computer Experiment

Agent-Based Simulation as a Component of the Scientific Method

[Diagram: Modeling (Hypothesis), Agent-Based Simulation (Experiment), Observation]

Agent-Based Simulation as a Component of the Scientific Method

[Diagram, with the F/OSS case filled in]
• Modeling (Hypothesis): Social Network Model of F/OSS
• Agent-Based Simulation (Experiment): Grow Artificial SourceForge
• Observation: Analysis of SourceForge Data

Agent-Based Modeling and Simulation

• Conceptual models of a phenomenon
• Simulations are computer implementations of the conceptual models
• Agents in models and simulations are distinct entities (instantiated objects)
– Tend to be simple, but with large numbers of them (thousands, or more) - i.e., swarm intelligence
– Contrasted with higher level AI “intelligent agents”

• Foundations in complexity theory
– Self-organization
– Emergence

Collaborative Social Networks

• Research-paper co-authorship, small world phenomenon, e.g., Erdos number (Barabasi 2001, Newman 2001)
• Movie actors, small world phenomenon, e.g., Kevin Bacon number (Watts 1999, 2003)
• Interlocking corporate directorships
• Terrorist Networks
• Open-source software developers (Madey et al, AMCIS 2002)
• Collaborators are nodes in a graph, and collaborative relationships are the edges of the graph => a framework to model data/phenomenon

SourceForge

• VA Software
• Part of OSDN
• Started 12/1999
• Collaboration tools
• 70,000 Projects
• 90,000 Developers
• 700,000 Registered Users

Savannah

• SourceForge Software?
• Free Software Foundation
• 1,600 Projects
• 16,000 Registered Users

Observations

• Web mining
• Web crawler (scripts)
– Python
– Perl
– AWK
– Sed

• Monthly
• Since Jan 2001
• ProjectID
• DeveloperID
• Almost 2 million records
• Relational database

PROJ|DEVELOPER
8001|dev378
8001|dev8975
8001|dev9972
8002|dev27650
8005|dev31351
8006|dev12509
8007|dev19395
8007|dev4622
8007|dev35611
8008|dev8975
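The PROJ|DEVELOPER records above can be turned into a developer collaboration network. The following is a minimal Python sketch, not the study's own scripts. It assumes that any two developers listed on the same project are linked, and the function names `parse_records` and `developer_network` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Sample rows copied from the PROJ|DEVELOPER listing on the slide.
RECORDS = """\
8001|dev378
8001|dev8975
8001|dev9972
8002|dev27650
8005|dev31351
8006|dev12509
8007|dev19395
8007|dev4622
8007|dev35611
8008|dev8975
"""

def parse_records(text):
    """Map each project ID to the set of developers listed on it."""
    members = defaultdict(set)
    for line in text.splitlines():
        if not line.strip():
            continue
        project, developer = line.strip().split("|")
        members[project].add(developer)
    return members

def developer_network(members):
    """Assumption: two developers are linked if they share a project."""
    graph = defaultdict(set)
    for developers in members.values():
        for dev in developers:
            graph[dev]  # touch the key so solo developers still appear as nodes
        for a, b in combinations(sorted(developers), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph

members = parse_records(RECORDS)
graph = developer_network(members)
for dev in sorted(graph):
    print(dev, sorted(graph[dev]))
```

The project network, the other half of the dual network, would follow the same pattern with the roles swapped: projects as nodes, linked when they share a developer.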

Collaboration Networks

Adapted from Newman, Strogatz and Watts, 2001

15850 dev[46] dev[83]
15850 dev[46] dev[48]
15850 dev[46] dev[56]
15850 dev[46] dev[58]
6882 dev[58] dev[47]
6882 dev[47] dev[79]
6882 dev[47] dev[52]
6882 dev[47] dev[55]
7028 dev[46] dev[99]
7028 dev[46] dev[51]
7028 dev[46] dev[57]
7597 dev[46] dev[45]
7597 dev[46] dev[72]
7597 dev[46] dev[55]
7597 dev[46] dev[58]
7597 dev[46] dev[61]
7597 dev[46] dev[64]
7597 dev[46] dev[67]
7597 dev[46] dev[70]
9859 dev[46] dev[49]
9859 dev[46] dev[53]
9859 dev[46] dev[54]
9859 dev[46] dev[59]

[Figure: network diagram with developers dev[45] through dev[99] as nodes, grouped by Project 6882, Project 9859, Project 7597, Project 7028, and Project 15850]

F/OSS Developers - Collaboration Social Network
Developers are nodes / Projects are links

24 Developers
5 Projects
2 Linchpin Developers
1 Cluster

Topological Analysis of the Data

• Statistics inspected
– Diameter
– Average degree
– Clustering coefficient
– Degree distribution
– Cluster size distribution
– Relative size of major cluster
– Fitness and life cycle

• Evolution of these statistics
• Dual networks
– developer network and project network

Terminology

• Diameter
– Average length of shortest paths between all pairs of vertices
• Degree
– The count of edges connected to a given vertex
• Average degree
– Average of the degrees of all vertices in the network
• Cluster
– The connected components of the network
• Clustering coefficient (CC)
– CC_i: Fraction representing the number of links actually present relative to the total possible number of links among the vertices in its neighborhood.
– CC: average of all CC_i in a network
• Degree distribution
– The distribution of degrees throughout a network
• Major cluster
– The largest cluster in the network
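The definitions above can be sketched in plain Python. This is an illustrative sketch on a small hypothetical graph, not data from the study. Two choices here are assumptions: "diameter" follows the slide's definition (average shortest-path length) and is averaged over connected pairs only, and a vertex with fewer than two neighbors is given CC_i = 0.

```python
from collections import deque

# Hypothetical undirected graph as adjacency sets (not data from the study).
GRAPH = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
    "e": {"f"},
    "f": {"e"},
}

def average_degree(graph):
    """Mean number of edges per vertex."""
    return sum(len(nbrs) for nbrs in graph.values()) / len(graph)

def clusters(graph):
    """Connected components, largest first."""
    seen, found = set(), []
    for start in graph:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:
            for nbr in graph[queue.popleft()]:
                if nbr not in component:
                    component.add(nbr)
                    queue.append(nbr)
        seen |= component
        found.append(component)
    return sorted(found, key=len, reverse=True)

def clustering_coefficient(graph, v):
    """CC_i: links present among v's neighbors divided by links possible."""
    nbrs = list(graph[v])
    k = len(nbrs)
    if k < 2:
        return 0.0  # assumed convention for vertices with fewer than 2 neighbors
    present = sum(1 for i in range(k) for j in range(i + 1, k)
                  if nbrs[j] in graph[nbrs[i]])
    return present / (k * (k - 1) / 2)

def average_path_length(graph):
    """Mean shortest-path length over connected pairs (the slide's 'diameter')."""
    total = pairs = 0
    for source in graph:
        dist, queue = {source: 0}, deque([source])
        while queue:
            u = queue.popleft()
            for nbr in graph[u]:
                if nbr not in dist:
                    dist[nbr] = dist[u] + 1
                    queue.append(nbr)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0

found = clusters(GRAPH)
mean_cc = sum(clustering_coefficient(GRAPH, v) for v in GRAPH) / len(GRAPH)
print("average degree:", average_degree(GRAPH))
print("cluster sizes:", [len(c) for c in found])
print("relative size of major cluster:", len(found[0]) / len(GRAPH))
print("CC:", mean_cc)
print("diameter (average path length):", average_path_length(GRAPH))
```

Breadth-first search from every vertex is fine for a sketch; a network of tens of thousands of nodes would more likely call for a graph library or sampling.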

Degree Distribution: Developers

Degree Distribution: Projects

Diameter of Developer Network vs. Time

• Network size increased from 30,000 to 70,000

Diameter of Project Network vs. Time

• Network size increased from 20,000 to 50,000.
• Diameter decreasing with time both for developer network and project network

Clustering Coefficient of Developer Network vs. Time

Clustering Coefficient of Project Network vs. Time

Cluster Size Distribution

• R² with major cluster is 0.7426
• R² without major cluster is 0.9799
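The slides do not say how R² was computed. One common approach, assumed here, is a least-squares line fitted to the distribution on log-log axes. The sketch below uses made-up numbers (an exact power law plus one oversized "major cluster" point) only to show why a single outlier can pull R² down; `loglog_fit` is a hypothetical helper.

```python
import math

def loglog_fit(xs, ys):
    """Least-squares line through (log x, log y); returns slope, intercept, R^2."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((x - mx) ** 2 for x in lx)
    sxy = sum((x - mx) * (y - my) for x, y in zip(lx, ly))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(lx, ly))
    ss_tot = sum((y - my) ** 2 for y in ly)
    return slope, intercept, 1 - ss_res / ss_tot

# Hypothetical cluster sizes and counts following count = 1000 * size^-2.
sizes = [1, 2, 4, 8, 16]
counts = [1000 * s ** -2 for s in sizes]

slope, intercept, r2_without = loglog_fit(sizes, counts)

# Add one very large cluster that occurs once, standing in for the major cluster.
_, _, r2_with = loglog_fit(sizes + [20000], counts + [1])

print("slope without major cluster:", slope)
print("R^2 without major cluster:", r2_without)
print("R^2 with major cluster:", r2_with)
```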

Relative Size of Major Cluster vs. Time

• Increase of the relative size of the major cluster
• Approaching steady-state?

An Example Research Question

• What processes can explain the evolution of the project and developer social networks?
– Randomly growing network (Erdos-Renyi, 1960)?
– Evolving network with preferential attachment (Barabasi-Albert, 1999)?
– Evolving network with preferential attachment and fitness (Barabasi-Albert, 2001)?
– Others?
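The difference between random growth and preferential attachment can be seen in a toy growth process. This sketch is not the ER model itself: as a simplifying assumption, the random case attaches each new node to a uniformly chosen existing node, while the preferential case chooses in proportion to current degree. The function `grow` and its parameters are hypothetical.

```python
import random

def grow(n, preferential, seed=0):
    """Grow a network one node at a time; each new node links to one existing node."""
    rng = random.Random(seed)
    degree = [1, 1]        # start from a single edge between nodes 0 and 1
    endpoints = [0, 1]     # each node appears once per edge it touches
    for new in range(2, n):
        if preferential:
            # A random edge endpoint selects nodes in proportion to their degree.
            target = rng.choice(endpoints)
        else:
            target = rng.randrange(new)  # uniform over existing nodes
        degree[target] += 1
        degree.append(1)
        endpoints += [target, new]
    return degree

n = 2000
random_deg = grow(n, preferential=False)
pref_deg = grow(n, preferential=True)
print("largest degree, uniform attachment:", max(random_deg))
print("largest degree, preferential attachment:", max(pref_deg))
```

Preferential attachment tends to produce a few very high-degree hubs, the heavy tail associated with a power-law degree distribution. Uniform attachment keeps degrees much more even.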

Computer Experiments

• Agent-based simulations
• Java programs using Swarm class library
– Validation (docking) exercises using Java/Repast

• Grow artificial SourceForges (Epstein & Axtell, 1996)
– Parameterized with observed data, e.g., developer behaviors
    • Join rates
    • New project additions
    • Leave projects
– Evaluation of multiple models (hypotheses)
– Verification/validation

Cycles of Modeling & Simulation

[Diagram]
• Modeling (Hypothesis): Social Network Models, ER => BA => BA+Fitness => BA+Dynamic Fitness
• Agent-Based Simulation (Experiment): Grow Artificial SourceForge
• Observation: Analysis of SourceForge Data
• Degree Distribution, Average Degree, Diameter, Clustering Coefficient, Cluster Size Distribution

Model for SourceForge

• ABM based on bipartite graph
• Model description
– Agent: developer
– Behaviors: Create, join, abandon and idle
– Preference: developer’s and project’s
– Fitness

• Four models in iterations
– ER, BA, BA with constant fitness and BA with dynamic fitness

• Comparison of empirical and simulated data
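The model description above can be illustrated with a small agent-based sketch on a bipartite developer/project structure. The study's simulations were Java programs using Swarm; this Python version is only an illustration. Everything numeric in it is an assumption: the behavior probabilities, one new developer arriving per step, the constant fitness drawn at project creation, and the join weight `(members + 1) * fitness`.

```python
import random

# Hypothetical behavior probabilities; the study parameterized these from observed data.
P_CREATE, P_JOIN, P_ABANDON = 0.05, 0.25, 0.05  # the remainder is "idle"

def simulate(steps, seed=0):
    rng = random.Random(seed)
    projects = {}    # project id -> set of developer ids
    fitness = {}     # project id -> constant fitness (assumed range 0.1 to 1.0)
    developers = {}  # developer id -> set of project ids

    def create(dev):
        pid = len(projects)
        projects[pid] = {dev}
        fitness[pid] = rng.uniform(0.1, 1.0)
        developers[dev].add(pid)

    def join(dev):
        candidates = [p for p in projects if p not in developers[dev]]
        if not candidates:
            return
        # Preferential attachment weighted by fitness (assumed form).
        weights = [(len(projects[p]) + 1) * fitness[p] for p in candidates]
        pid = rng.choices(candidates, weights=weights)[0]
        projects[pid].add(dev)
        developers[dev].add(pid)

    def abandon(dev):
        if developers[dev]:
            pid = rng.choice(sorted(developers[dev]))
            developers[dev].discard(pid)
            projects[pid].discard(dev)

    for step in range(steps):
        developers[step] = set()  # one new developer arrives per step
        if not projects:
            create(step)          # seed the network with a first project
        for dev in list(developers):
            r = rng.random()
            if r < P_CREATE:
                create(dev)
            elif r < P_CREATE + P_JOIN:
                join(dev)
            elif r < P_CREATE + P_JOIN + P_ABANDON:
                abandon(dev)
            # otherwise the developer idles this step
    return projects, developers

projects, developers = simulate(100)
print("developers:", len(developers), "projects:", len(projects))
print("largest project:", max(len(m) for m in projects.values()))
```

Replacing the weights with equal values gives a random-join variant, and setting every fitness to 1 gives plain preferential attachment, which loosely mirrors the ER, BA, and BA-with-fitness iterations listed on the slide.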

ER Model – Degree Distribution

• Degree distribution is normal distribution while it is power law in empirical data
• Fit Fails!

ER Model - Diameter

• Average degree is decreasing while it is increasing in empirical data
• Diameter is increasing while it is decreasing in empirical data
• Fit Fails!

ER Model – Clustering Coefficient

• Clustering coefficient is relatively low under 0.3 while it is around 0.7 in empirical data.
• Fit fails!

ER Model – Cluster Size Distribution

• Power law distribution with R² as 0.6667 (0.9653 without the major cluster) while R² in empirical data is 0.7426 (0.9799 without the major cluster)
• The actual distribution is different from empirical data
• Fit Fails!

BA Model – Degree Distribution

• Power laws in degree distributions, similar to empirical data (o for simulated data and x for empirical data).
• For developer distribution: simulated data has R² as 0.9798 and empirical data has R² as 0.9714.
• For project distribution: simulated data has R² as 0.6650 and empirical data has R² as 0.9838.
• Partial Fit!

BA Model – Diameter and Clustering Coefficient

• Small diameter and high clustering coefficient like empirical data
• Diameter and clustering coefficient are both decreasing like empirical data
• Good Fit!

BA Model with Constant Fitness

• Power laws in degree distributions, similar to empirical data (o for simulated data and x for empirical data).
• For developer distribution: simulated data has R² as 0.9742 and empirical data has R² as 0.9714.
• For project distribution: simulated data has R² as 0.7253 and empirical data has R² as 0.9838.
• Improved fit!

Discovery: Project Life Cycle

BA Model with Dynamic Fitness

• Power laws in degree distribution, similar to empirical data (o for simulated data and x for empirical data).
• For developer distribution: simulated data has R² as 0.9695 and empirical data has R² as 0.9714.
• For project distribution: simulated data has R² as 0.8051 and empirical data has R² as 0.9838.
• Somewhat better fit!

Models of the F/OSS Social Network (Alternative Hypotheses)

• General model features
– Agents are nodes on a graph (developers or projects)
– Behaviors: Create, join, abandon and idle
– Edges are relationships (joint project participation)
– Growth of network: random or types of preferential attachment, formation of clusters
– Fitness
– Network attributes: diameter, average degree, degree distribution, clustering coefficient
• Four specific models
– ER (random graph) - (1960)
– BA (preferential attachment) - (1999)
– BA ( + constant fitness) - (2001)
– BA ( + dynamic fitness) - (2003)

Summary

• Why Agent-Based Modeling and Simulation?
– Can be used as components of the Scientific Method
– A research approach for studying socio-technical systems
• Case study: F/OSS - Collaboration Social Networks
– SourceForge conceptual models: ER, BA, BA with constant fitness and BA with dynamic fitness.
– Simulations
    • Computer experiments that tested conceptual models
    • Provided insight into the phenomenon under study and guided data mining of collected observations

Questions

• Validity of approaches
– Social networks
– Simulation

• Value/Utility of approaches
• Applicability to other areas of F/OSS research
– Project sites, e.g., Mozilla.org
– Individual projects, e.g., Linux kernel

Thank you