Six Easy Pieces of Quantitatively Analyzing Open Source

Six Easy Pieces (of Quantitatively Analyzing Open Source Software) Dirk Riehle SAP Research, SAP Labs LLC [email protected], www.riehle.org, twitter.com/driehle

description

For the first time in the history of software engineering, we can both broadly and deeply analyze the behavior and dynamics of software development projects. This has become possible because of open source, which is publicly developed software. In this presentation, I will discuss our recent findings about open source software, its development process, and programmer behavior. I also discuss the challenges we encountered when quantitatively mining software repositories for such insights.

Transcript of Six Easy Pieces of Quantitatively Analyzing Open Source

Page 1: Six Easy Pieces of Quantitatively Analyzing Open Source

Six Easy Pieces (of Quantitatively Analyzing Open Source Software)

Dirk Riehle

SAP Research, SAP Labs LLC

[email protected], www.riehle.org, twitter.com/driehle

Page 2: Six Easy Pieces of Quantitatively Analyzing Open Source

Open Source Software

• Definition of open source software
– Software that is provided under an OSI-approved license

• General properties of open source software
– Software that is available in source code form
– Software that you can modify and redistribute

• Further important (not always present) properties
– The project has gathered a thriving community
– Feedback and ideas about the software are public and abundant
– Projects are egalitarian, meritocratic, and self-organizing

Page 3: Six Easy Pieces of Quantitatively Analyzing Open Source

Talk Overview (Agenda)

1. The Growth of Open Source Software
2. Data Mining for Fun and Profit
3. Efficiently Estimating Commit Sizes
4. The Commit Size Distribution of Open Source
5. Developer Activity in Open Source Software Projects
6. The Commenting Practice of Open Source
7. Team Size Evolution in Open Source Projects
8. Conclusions

Page 4: Six Easy Pieces of Quantitatively Analyzing Open Source

The Growth of Open Source Software

Amit Deshpande, Dirk Riehle. “The Total Growth of Open Source.” In Proceedings of the Fourth Conference on Open Source Systems (OSS 2008). Springer Verlag, 2008. Pages 197-209.

http://www.riehle.org/2008/03/14/the-total-growth-of-open-source/

Page 5: Six Easy Pieces of Quantitatively Analyzing Open Source

Source Code Growth in Open Source

[Figure: total SLoC in millions (y-axis, 0 to 1,200) over time (x-axis, Jan-93 to Jan-06).]

SLoC = source lines of code

Page 6: Six Easy Pieces of Quantitatively Analyzing Open Source

Model of Source Code Growth

Approach       Model                       R-square value
Upper bound    y = 784098 * e^(0.0555x)    0.961
Lower bound    y = 2E+06 * e^(0.0464x)     0.964

where
y: Total open source lines of code
x: Time from Jan 1995 to Dec 2006 in months
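As a rough illustration of what these exponents imply, the following sketch evaluates the two fitted curves and derives a doubling time from each. The function names are illustrative and not from the paper, and the lower-bound coefficient, shown on the slide as 2E+06, is assumed here to be exactly 2,000,000.

```python
import math

# Fitted models from the slide: y = c * e^(k * x), x in months since Jan 1995.
# The lower-bound coefficient is printed as 2E+06 on the slide; treating it as
# exactly 2,000,000 is an assumption.
MODELS = {
    "upper bound": (784098.0, 0.0555),
    "lower bound": (2.0e6, 0.0464),
}


def total_sloc(c: float, k: float, months: float) -> float:
    """Evaluate y = c * e^(k * months)."""
    return c * math.exp(k * months)


def doubling_time_months(k: float) -> float:
    """Time for an exponential with rate k to double: ln(2) / k."""
    return math.log(2) / k


for name, (c, k) in MODELS.items():
    print(f"{name}: doubles every {doubling_time_months(k):.1f} months")
```

Under these two fits, the total amount of open source code doubles roughly every 12.5 to 15 months.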

Page 7: Six Easy Pieces of Quantitatively Analyzing Open Source

Project Growth in Open Source

[Figure: total number of projects (y-axis, 0 to 5,000) over time (x-axis, Jan-95 to Jan-06).]

Page 8: Six Easy Pieces of Quantitatively Analyzing Open Source

Model of Project Growth

Model                       R-square value
y = 7.1511 * e^(0.0499x)    0.956

where
y: Total number of open source projects
x: Time from Jan 1995 to Dec 2006 in months
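One common way to obtain a model of this shape is ordinary least squares on ln(y), because ln(y) = ln(c) + k * x is linear in x. The slide does not say how the authors fitted their model, so the sketch below is only an assumption about the general technique. The sample is synthetic, generated from the slide's own formula, and `fit_exponential` is a hypothetical helper name.

```python
import math


def fit_exponential(xs, ys):
    """Fit y = c * e^(k * x) by ordinary least squares on ln(y).

    Returns (c, k). Assumes every y is positive.
    """
    logs = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_l = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxl = sum((x - mean_x) * (l - mean_l) for x, l in zip(xs, logs))
    k = sxl / sxx
    c = math.exp(mean_l - k * mean_x)
    return c, k


# Synthetic, noise-free sample generated from the slide's own model;
# this is NOT the real project-count data.
months = list(range(0, 144))  # 144 monthly points, Jan 1995 to Dec 2006
projects = [7.1511 * math.exp(0.0499 * m) for m in months]

c, k = fit_exponential(months, projects)
print(f"c = {c:.4f}, k = {k:.4f}")
```

On noise-free input the fit recovers the coefficients it was generated from; on real data the residuals would determine the R-square value.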

Page 9: Six Easy Pieces of Quantitatively Analyzing Open Source

Where Open Source is Growing

It is not the size of individual projects, but the total number of active working open source projects that is growing exponentially

Open source is growing in every domain, including business applications

Page 10: Six Easy Pieces of Quantitatively Analyzing Open Source

Data Mining for Fun and Profit

Oliver Arafat, Amit Deshpande, Philipp Hofmann, Dirk Riehle.

http://www.riehle.org/publications/

Page 11: Six Easy Pieces of Quantitatively Analyzing Open Source

Motivation and Approach

• Gain insight into how open source software development works
– Review processes unique to open source (or transferable)
– Possibly improve corporate software development
– Possibly build novel tools

• Post-facto detailed quantitative analysis of actual behavior
– Analyzing what people do rather than what they say they do
– Open source is publicly developed software so there is lots of data

Page 12: Six Easy Pieces of Quantitatively Analyzing Open Source

Data Source, Data Quality

• Using the Ohloh.net repository of open source project information
– Detailed information on a large number of projects (>9000 in 2008 snapshot)
– Includes project structure and members but also detailed code information
– Ohloh captures every commit down to diff sections (>8M commits)
– See http://www.ohloh.net for more information

• To be useful, data needs to be cleaned and filtered
– Depends on the question at hand
– Developed a variety of easily applicable filters

Page 13: Six Easy Pieces of Quantitatively Analyzing Open Source

Open Source Analytics Tool Chain

[Diagram: Raw Data (ohloh.net) feeds Aggr. Data (RDBMS) via SQL and Java; analysis uses Excel / Calc, R, and special tools. The numbers 1 to 5 mark the stages listed below.]

1. Raw data source
· Local database (ohloh.net snapshot, crawled sources)
· Web services access (ohloh.net, sourceforge.net, others)

2. Pre-processing
· Database querying using SQL and scripts
· Java library for computationally heavyweight filters, aggregation

3. Aggregated data source
· Output of pre-processing stage for specific analytical tasks
· Aggregated data significantly improves analysis speed

4. Analytical processing
· Mines aggregated (and raw) data for insights, hypothesis testing
· At present basic processing (Excel), machine learning next

5. Analysis output
· Results of analytical processing: averages, distributions, correlations
· Presented as models, tables, graphs, charts, etc.
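The stages above can be sketched in a few lines. This is a toy version under stated assumptions: the table and column names are invented for illustration, the five sample rows are made up, and SQLite stands in for whatever RDBMS the authors used.

```python
import sqlite3

# Hypothetical schema: one row per commit. The real Ohloh snapshot is far
# richer; table and column names here are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE commits (project TEXT, author TEXT, sloc_changed INTEGER)")
conn.executemany(
    "INSERT INTO commits VALUES (?, ?, ?)",
    [
        ("alpha", "ann", 3),
        ("alpha", "bob", 12),
        ("alpha", "ann", 1),
        ("beta", "cy", 40),
        ("beta", "cy", 2),
    ],
)

# Pre-processing: aggregate once into a small table that later analyses reuse.
conn.execute(
    """
    CREATE TABLE project_stats AS
    SELECT project,
           COUNT(*)               AS commits,
           COUNT(DISTINCT author) AS committers,
           AVG(sloc_changed)      AS avg_commit_size
    FROM commits
    GROUP BY project
    """
)

# Analytical processing: query the aggregate instead of the raw commits.
rows = conn.execute(
    "SELECT project, commits, committers, avg_commit_size "
    "FROM project_stats ORDER BY project"
).fetchall()
for row in rows:
    print(row)
```

Materializing the aggregate once is the design choice the slide points at: repeated analyses read a handful of per-project rows rather than rescanning millions of commits.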

Page 14: Six Easy Pieces of Quantitatively Analyzing Open Source

Efficiently Estimating Commit Sizes

Philipp Hofmann, Dirk Riehle. “Estimating Commit Sizes Efficiently.” In Proceedings of the 5th International Conference on Open Source Systems (OSS 2009). Springer Verlag, 2009. Forthcoming.

http://www.riehle.org/2009/02/11/estimating-commit-sizes-efficiently/

Page 15: Six Easy Pieces of Quantitatively Analyzing Open Source

Definition of Commit Size

• Commit consists of Diffs which consist of Sections
– One commit may affect multiple files, each file diff can have multiple sections

• Commit Size = sum(Diff Sizes) where Diff Size = sum(Section Sizes)
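The nesting can be written down directly. In this minimal sketch the class and field names are illustrative, and section sizes are taken as given integers; how a section's size is actually measured is the subject of the following slides.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Diff:
    """One file diff; each entry in section_sizes is the size of one section."""
    path: str
    section_sizes: List[int] = field(default_factory=list)

    def size(self) -> int:
        # Diff Size = sum(Section Sizes)
        return sum(self.section_sizes)


@dataclass
class Commit:
    diffs: List[Diff] = field(default_factory=list)

    def size(self) -> int:
        # Commit Size = sum(Diff Sizes)
        return sum(d.size() for d in self.diffs)


commit = Commit([Diff("a.c", [2, 5]), Diff("b.c", [1])])
print(commit.size())  # 8
```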

Page 16: Six Easy Pieces of Quantitatively Analyzing Open Source

What Diff Does

a.txt (9 lines):  a b c d f g h m n
b.txt (10 lines): a b c e e e g h j m

Output of diff a.txt b.txt:

    4,5c4,6
    < d
    < f
    ---
    > e
    > e
    > e
    7a9
    > j
    9d10
    < n
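The same two files can be compared with Python's standard `difflib` module to count added and removed lines. Note that `difflib` uses a different matching algorithm than GNU diff, so the two tools need not agree on other inputs; this sketch only reproduces the counts for the slide's example.

```python
import difflib

a = list("abcdfghmn")    # lines of a.txt, one letter per line
b = list("abceeeghjm")   # lines of b.txt

added = removed = 0
for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, a, b).get_opcodes():
    if tag in ("replace", "delete"):
        removed += i2 - i1   # lines present only in a.txt
    if tag in ("replace", "insert"):
        added += j2 - j1     # lines present only in b.txt

print(added, removed)  # 4 3
```

The totals match the three hunks in the diff output: the change hunk removes 2 lines and adds 3, the add hunk adds 1, and the delete hunk removes 1.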

Page 17: Six Easy Pieces of Quantitatively Analyzing Open Source

The Trouble with Diff

• From the GNU diff manual:

The way that GNU “diff” determines which lines have changed always comes up with a near-minimal set of differences. Usually it is good enough for practical purposes.

• Also, on the option “-d”:

Try hard to find a smaller set of changes.

• There is no reliable way of determining SLoC changed!

Page 18: Six Easy Pieces of Quantitatively Analyzing Open Source

Some Diff Section Size Examples

Table 1: Interpretation of the entry 1 SLoC added, 1 SLoC removed (1, 1)

           SLoC added   SLoC removed   SLoC changed   Modifications
Event 1    0            0              1              1
Event 2    1            1              0              2

Table 2: Interpretation of the entry 4 SLoC added, 3 SLoC removed (4, 3)

           SLoC added   SLoC removed   SLoC changed   Modifications
Event 1    1            0              3              4
Event 2    2            1              2              5
Event 3    3            2              1              6
Event 4    4            3              0              7
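The rows of both tables follow one pattern, which the sketch below enumerates. It assumes that a changed line appears in diff output as one removed line plus one added line; the function name is illustrative.

```python
def interpretations(added: int, removed: int):
    """Yield (sloc_added, sloc_removed, sloc_changed, modifications) tuples.

    A "changed" line shows up in diff output as one removed plus one added
    line, so c changed lines can hide inside an (added, removed) entry for
    any c from 0 to min(added, removed).
    """
    for changed in range(min(added, removed), -1, -1):
        pure_added = added - changed
        pure_removed = removed - changed
        yield pure_added, pure_removed, changed, pure_added + pure_removed + changed


for row in interpretations(4, 3):
    print(row)
```

For the entry (4, 3) this yields the four events of Table 2. The number of modifications therefore ranges from max(added, removed) up to added + removed, and the diff output alone cannot say which value is the true one.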

Page 19: Six Easy Pieces of Quantitatively Analyzing Open Source

Garden Variety of Heuristics

#   Approach            Error Mean   Error Standard Deviation
1   Lower Bound          3.86        16.64
2   Upper Bound         -4.41         6.39
3   Bounds Mean         -0.27         7.68
4   GNU diff            -1.96        19.55
5   GNU diff -d         -3.06        30.87
6   Ldiff               -5.95        40.35
7   Linear Estimation    0            5.44

Page 20: Six Easy Pieces of Quantitatively Analyzing Open Source

Visual Comparison of Heuristics

Page 21: Six Easy Pieces of Quantitatively Analyzing Open Source

Definition of Commit Size

• General form: diff_size ← (a, r)
– Diff size is a function of SLoC added and removed

• Linear form: diff_size(a, r) = c_a × a + c_r × r + b
– Straightforward linear approximation

• Open source based estimation provides linear estimates
– Linear regression run over open source sample data

function real diff_size(int a, int r)
    if (0.01269 × a + 0.01540 × r > 2.9965)
        return 0.9497 × a + 0.9744 × r - 2.9965
    else
        return 0.9370 × a + 0.9590 × r
    end
end
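A direct Python transcription of the pseudocode follows. The coefficients are copied from the slide; the `commit_size` helper, which sums the estimate over one (added, removed) pair per file diff, is an illustrative addition based on the earlier definition of commit size.

```python
def diff_size(a: int, r: int) -> float:
    """Estimate diff size from SLoC added (a) and SLoC removed (r).

    Coefficients are copied from the slide's pseudocode.
    """
    if 0.01269 * a + 0.01540 * r > 2.9965:
        return 0.9497 * a + 0.9744 * r - 2.9965
    return 0.9370 * a + 0.9590 * r


def commit_size(diffs) -> float:
    """Sum the estimates over (added, removed) pairs, one pair per file diff."""
    return sum(diff_size(a, r) for a, r in diffs)


print(round(diff_size(1, 1), 3))      # 1.896
print(round(diff_size(200, 100), 2))  # 284.38
```

The two branches agree where the condition holds with equality, since subtracting 0.01269 × a + 0.01540 × r from the first expression gives the second, so the estimate is continuous across the switch.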

Page 22: Six Easy Pieces of Quantitatively Analyzing Open Source

The Commit Size Distribution of Open Source

Oliver Arafat, Dirk Riehle. “The Commit Size Distribution of Open Source Software.” In Proceedings of the 42nd Hawaii International Conference on System Sciences (HICSS-42). IEEE Press, 2009. Pages 1-8.

http://www.riehle.org/2008/09/23/the-commit-size-distribution-of-open-source-software/

Page 23: Six Easy Pieces of Quantitatively Analyzing Open Source

The Overall Commit Size Distribution

[Figure: number of occurrences (y-axis, logarithmic, 1.E-08 to 1.E+07) by commit size in SLoC (x-axis, logarithmic, 1 to 2E+07).]

Page 24: Six Easy Pieces of Quantitatively Analyzing Open Source

The Dominance of Small Commits

[Figure: bar chart, percentage of number of occurrences (y-axis, 0.00% to 14.00%) by commit size in SLoC (x-axis, 1 to 10).]

Commit Size [SLoC]   Percentage of Number of Occurrences
1                    12.13%
2                     8.96%
3                     5.45%
4                     4.96%
5                     3.52%
6                     3.35%
7                     2.55%
8                     2.57%
9                     2.05%
10                    1.94%
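To make the dominance of small commits concrete, the following sketch accumulates the bar values. It assumes that the ten percentages printed on the chart belong to commit sizes 1 to 10 in order.

```python
# Share of commits by size, read off the bar chart (sizes 1..10 SLoC).
# Mapping the ten printed percentages to sizes 1..10 in order is an assumption.
share_percent = [12.13, 8.96, 5.45, 4.96, 3.52, 3.35, 2.55, 2.57, 2.05, 1.94]

cumulative = 0.0
for size, share in enumerate(share_percent, start=1):
    cumulative += share
    print(f"<= {size:2d} SLoC: {cumulative:5.2f}%")
```

Under that reading, commits of at most 10 SLoC account for about 47.5% of all commits, and one-line commits alone for about 12%.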

Page 25: Six Easy Pieces of Quantitatively Analyzing Open Source

The Overall Commit Size Distribution

Open source is incremental development

Small commits dominate: the smaller the commit, the more likely it is

Hypothesis: Contributors and committers follow the same behavioral programming patterns

Given that our development tools were designed with 30-50 or more lines of code in mind, they may be suboptimal for open source

Further research into comparing open with closed source development is necessary

Page 26: Six Easy Pieces of Quantitatively Analyzing Open Source

Developer Activity in Open Source Software Projects

Dirk Riehle, Oliver Arafat, Amit Deshpande. “Developer Activity in Open Source Software Projects.” In preparation.

Amit Deshpande, Dirk Riehle. “Continuous Integration in Open Source Software Development.” In Proceedings of the Fourth Conference on Open Source Systems (OSS 2008). Springer Verlag, 2008. Pages 273-280.

http://www.riehle.org/2008/03/08/continuous-integration-in-open-source-software-development/

Page 27: Six Easy Pieces of Quantitatively Analyzing Open Source

Average Commit Size

[Figure: average commit size per week in SLoC (y-axis, 0 to 50) over time (x-axis, 1990-01-01 to 2007-12-31).]

Page 28: Six Easy Pieces of Quantitatively Analyzing Open Source

Average Commit Frequency

[Figure: commits per developer per week (y-axis, 0 to 100) over time (x-axis, 1990-01-01 to 2005-12-31).]

Page 29: Six Easy Pieces of Quantitatively Analyzing Open Source

Changes in Developer Behavior

No significant changes apparent

Hypothesis: Foundations are not having an overall impact (yet)?

Agile methods, in particular continuous integration, did not change behavior or were always present

Further research into separating contributors from committers is necessary

Page 30: Six Easy Pieces of Quantitatively Analyzing Open Source

The Commenting Practice of Open Source

Oliver Arafat, Dirk Riehle. “The Comment Density of Open Source Software Code.” In Companion to Proceedings of the 31st International Conference on Software Engineering (ICSE 2009). IEEE Press, 2009. Forthcoming.

http://www.riehle.org/2009/02/04/the-comment-density-of-open-source-software-code/

Page 31: Six Easy Pieces of Quantitatively Analyzing Open Source

Average Comment Density

comment density = #comment lines / #code lines

[Figure: comment density (y-axis, 0% to 100%) by project size in lines of code (x-axis, logarithmic, 1.E+00 to 1.E+09; LoC = CL + SLoC).]

mean = 0.1867
median = 0.1674
stdev = 0.1088
correl = -0.00787
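The density formula is simple enough to state as code. This sketch assumes that "#code lines" means all lines of code, LoC = CL + SLoC, as the chart's axis label suggests; the function name is illustrative.

```python
def comment_density(comment_lines: int, source_lines: int) -> float:
    """Comment lines divided by all lines of code (CL + SLoC).

    Using CL + SLoC as the denominator follows the chart's axis label
    (LoC = CL + SLoC); that reading of "#code lines" is an assumption.
    """
    total = comment_lines + source_lines
    if total == 0:
        raise ValueError("project has no lines of code")
    return comment_lines / total


print(comment_density(20, 80))  # 0.2
```

With this definition, the reported mean of 0.1867 corresponds to roughly one comment line in every five to six lines of code.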

Page 32: Six Easy Pieces of Quantitatively Analyzing Open Source

Comment Density by Programming Language

#    Language     Average [%]   Stddev [%]   Population Size
1.   Java         26%           11%          1085
2.   php          22%           12%           559
3.   C/C++        18%            8%          1621
4.   Javascript   16%            9%           276
5.   Python       11%            8%           534
6.   Perl         10%            7%           273

Page 33: Six Easy Pieces of Quantitatively Analyzing Open Source

Comment Density by Commit Size

[Figure: comment density (y-axis, 0% to 100%) by source lines of code in commit (x-axis, 0 to 100 SLoC). Series: Comment Density by SLoC Size; Total Average Comment Density.]

sloc size = 1-100: mean = 0.2513, median = 0.2340, stdev = 0.0626
sloc size = 50-100: mean = 0.2234, median = 0.2190, stdev = 0.0169
sloc size = 80-100: mean = 0.2224, median = 0.2171, stdev = 0.0209

Page 34: Six Easy Pieces of Quantitatively Analyzing Open Source

Comment Density by Team Size

[Figure: comment density (y-axis, 0% to 50%) and percent of projects (second y-axis, 0% to 50%) by team size in number of committers (x-axis, 0 to 100). Series: Comment Density by Team Size; Total Average Comment Density; Percent Projects with Team Size.]

team size = 1-20: mean = 0.1914, median = 0.1878, stdev = 0.0255
team size = 1-50: mean = 0.1922, median = 0.1906, stdev = 0.0425
team size = 1-100: mean = 0.1856, median = 0.1857, stdev = 0.0641, correl = -0.0550

Page 35: Six Easy Pieces of Quantitatively Analyzing Open Source

Comment Density by Project Age

[Figure: average comment density (y-axis, 17.50% to 19.50%) by project age in months (x-axis, 0 to 48).]

correl = -0.9054

Page 36: Six Easy Pieces of Quantitatively Analyzing Open Source

Commenting in Open Source

A continuous on-going practice

Surprisingly high comment density, clearly no belief in self-documenting code

Hypothesis: Comment density of open source represents the sweet spot of commenting code

For any specific conclusions, watch the parameters

Further research into project types is necessary

Page 37: Six Easy Pieces of Quantitatively Analyzing Open Source

Team Size Evolution in Open Source Projects

Philipp Hofmann, Dirk Riehle. “Team Size Evolution in Open Source Software Projects.” In preparation.

Page 38: Six Easy Pieces of Quantitatively Analyzing Open Source

Team Size Evolution Figure

[Figure: Evolution of Team Sizes. Number of projects (y-axis) by team size (x-axis, ticks 1, 2, 3, 5, 7, 11, 18, 29, 46, 73, 123, 221, 397, 713), one curve per year from 1985 to 2007.]

Page 39: Six Easy Pieces of Quantitatively Analyzing Open Source

Is Open Source Scale-Free?

• Definition of scale-free distribution
– Follows power-law (straight line on log-log graph)
– Very common in nature, technology, and society
– Implies same principles govern every level of scale

• Examples of scale-free graphs in open source
– Commit size distribution
– Team size distribution

• If open source were scale-free then
– We would know how to scale to ever larger project sizes
– By basically applying the same principles across the board
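The "straight line on log-log graph" criterion can be checked numerically by regressing log(y) on log(x). The sketch below uses synthetic data that follows an exact power law, not measured commit or team sizes, and the function name is illustrative. A straight log-log line is a quick visual check rather than a rigorous statistical test of power-law behavior.

```python
import math


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x).

    For an exact power law y = c * x^(-alpha) the slope equals -alpha.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    num = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    den = sum((u - mx) ** 2 for u in lx)
    return num / den


# Synthetic data: an exact power law with exponent 2 (not measured data).
sizes = [1, 2, 4, 8, 16, 32, 64]
counts = [1000.0 * s ** -2 for s in sizes]
print(round(loglog_slope(sizes, counts), 6))  # -2.0
```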

Page 40: Six Easy Pieces of Quantitatively Analyzing Open Source

Conclusions

• Lots of interesting insights to be gained from analyzing open source

• We have the chance now to fix misconceptions by looking at data

• Lots of that insight may apply to closed source development as well

• Open source may show the most resource-efficient sweet spots

• New research agenda: Is open source scale-free?

Page 41: Six Easy Pieces of Quantitatively Analyzing Open Source

Thank you!

[email protected], www.riehle.org, twitter.com/driehle

Comments are welcome!