Six Easy Pieces of Quantitatively Analyzing Open Source

Six Easy Pieces (of Quantitatively Analyzing Open Source Software)

Dirk Riehle

SAP Research, SAP Labs LLC

[email protected], www.riehle.org, twitter.com/driehle

Open Source Software

• Definition of open source software
– Software that is provided under an OSI-approved license
• General properties of open source software
– Software that is available in source code form
– Software that you can modify and redistribute
• Further important (not always present) properties
– The project has gathered a thriving community
– Feedback and ideas about the software are public and abundant
– Projects are egalitarian, meritocratic, and self-organizing

Talk Overview (Agenda)

1. The Growth of Open Source Software
2. Data Mining for Fun and Profit
3. Efficiently Estimating Commit Sizes
4. The Commit Size Distribution of Open Source
5. Developer Activity in Open Source Software Projects
6. The Commenting Practice of Open Source
7. Team Size Evolution in Open Source Projects
8. Conclusions

The Growth of Open Source Software

Amit Deshpande, Dirk Riehle. “The Total Growth of Open Source.” In Proceedings of the Fourth Conference on Open Source Systems (OSS 2008). Springer Verlag, 2008. Page 197-209.

http://www.riehle.org/2008/03/14/the-total-growth-of-open-source/

Source Code Growth in Open Source

[Figure: total SLoC in millions (y-axis, 0 to 1,200) plotted over time (x-axis, Jan 1993 to Jan 2006)]

SLoC = source lines of code

Model of Source Code Growth

Approach      Model                      R-square value
Upper bound   y = 784098 * e^(0.0555x)   0.961
Lower bound   y = 2E+06 * e^(0.0464x)    0.964

where
y: Total open source lines of code
x: Time from Jan 1995 to Dec 2006 in months
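The two fitted models can be evaluated directly. The sketch below is only an illustration: it copies the coefficients from the slide, assumes that x = 0 corresponds to Jan 1995 (the slide does not say where the month count starts), and uses the standard fact that an exponential with rate k doubles every ln(2) / k time units. The function names are made up for this example.

```python
import math

# Coefficients copied from the slide's two fitted models, y = c * e^(k * x).
UPPER_BOUND = (784098.0, 0.0555)
LOWER_BOUND = (2e6, 0.0464)


def total_sloc(model, months):
    """Evaluate y = c * e^(k * x); x = 0 is assumed to be Jan 1995."""
    c, k = model
    return c * math.exp(k * months)


def doubling_time_months(model):
    """An exponential with rate k doubles every ln(2) / k months."""
    _, k = model
    return math.log(2) / k


for name, model in (("upper", UPPER_BOUND), ("lower", LOWER_BOUND)):
    print(name, "bound doubles every",
          round(doubling_time_months(model), 1), "months")
```

Under these assumptions both models imply that the total amount of open source code doubles in a little over a year.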

Project Growth in Open Source

[Figure: total number of projects (y-axis, 0 to 5,000) plotted over time (x-axis, Jan 1995 to Jan 2006)]

Model of Project Growth

Model                     R-square value
y = 7.1511 * e^(0.0499x)  0.956

where
y: Total number of open source projects
x: Time from Jan 1995 to Dec 2006 in months
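A model of the form y = c * e^(kx) can be fitted with ordinary least squares on ln(y). The talk does not say how its models were fitted, so the following is a sketch of one common approach, not necessarily the authors' procedure. The data is synthetic, generated from the slide's own model, and fit_exponential is a name invented for this example.

```python
import math


def fit_exponential(xs, ys):
    """Least-squares fit of ln(y) = ln(c) + k * x; returns (c, k)."""
    logs = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_log = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxl = sum((x - mean_x) * (l - mean_log) for x, l in zip(xs, logs))
    k = sxl / sxx
    c = math.exp(mean_log - k * mean_x)
    return c, k


# Synthetic monthly counts generated from the slide's model (not the real data).
months = list(range(144))  # Jan 1995 to Dec 2006
projects = [7.1511 * math.exp(0.0499 * x) for x in months]

c, k = fit_exponential(months, projects)
print(c, k)
```

Because the synthetic data follows the model exactly, the fit recovers the slide's coefficients up to floating-point error; real data would scatter around the fitted curve.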

Where Open Source is Growing

It is not the size of individual projects, but the total number of active working open source projects that is growing exponentially

Open source is growing in every domain, including business applications

Data Mining for Fun and Profit

Oliver Arafat, Amit Deshpande, Philipp Hofmann, Dirk Riehle.

http://www.riehle.org/publications/

Motivation and Approach

• Gain insight into how open source software development works
– Review processes unique to open source (or transferable)
– Possibly improve corporate software development
– Possibly build novel tools
• Post-facto detailed quantitative analysis of actual behavior
– Analyzing what people do rather than what they say they do
– Open source is publicly developed software so there is lots of data

Data Source, Data Quality

• Using the Ohloh.net repository of open source project information
– Detailed information on a large number of projects (>9000 in 2008 snapshot)
– Includes project structure and members but also detailed code information
– Ohloh captures every commit down to diff sections (>8M commits)
– See http://www.ohloh.net for more information
• To be useful, data needs to be cleaned and filtered
– Depends on the question at hand
– Developed a variety of easily applicable filters

Open Source Analytics Tool Chain

[Diagram: Raw Data (ohloh.net) feeds Aggr. Data (RDBMS) via SQL and Java; analysis uses Excel / Calc, R, and spec. tools. The numbers 1 to 5 in the diagram mark the stages below.]

1. Raw data source
· Local database (ohloh.net snapshot, crawled sources)
· Web services access (ohloh.net, sourceforge.net, others)

2. Pre-processing
· Database querying using SQL and scripts
· Java library for computationally heavyweight filters, aggregation

3. Aggregated data source
· Output of pre-processing stage for specific analytical tasks
· Aggregated data significantly improves analysis speed

4. Analytical processing
· Mines aggregated (and raw) data for insights, hypothesis testing
· At present basic processing (Excel), machine learning next

5. Analysis output
· Results of analytical processing: averages, distributions, correlations
· Presented as models, tables, graphs, charts, etc.
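The pre-processing and aggregation stages can be illustrated with a small in-memory database. This is a sketch only: the talk uses SQL and Java against an Ohloh snapshot, while the example below uses Python's sqlite3 module, and the table and column names (commits, project_stats, and so on) are hypothetical because the real schema is not shown.

```python
import sqlite3

# Hypothetical raw-data table; the real Ohloh snapshot schema is not shown.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE commits ("
    "project TEXT, author TEXT, sloc_added INTEGER, sloc_removed INTEGER)"
)
conn.executemany(
    "INSERT INTO commits VALUES (?, ?, ?, ?)",
    [
        ("alpha", "ann", 10, 2),
        ("alpha", "bob", 1, 1),
        ("beta", "ann", 300, 0),
    ],
)

# Pre-processing: aggregate once so later analysis does not rescan raw commits.
conn.execute(
    """
    CREATE TABLE project_stats AS
    SELECT project,
           COUNT(*) AS commit_count,
           COUNT(DISTINCT author) AS team_size,
           SUM(sloc_added + sloc_removed) AS sloc_touched
    FROM commits
    GROUP BY project
    """
)

# Analytical processing reads the much smaller aggregated table.
rows = conn.execute(
    "SELECT project, commit_count, team_size, sloc_touched "
    "FROM project_stats ORDER BY project"
).fetchall()
print(rows)
```

Materializing the aggregate as its own table mirrors the slide's point that aggregated data speeds up analysis.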

Efficiently Estimating Commit Sizes

Philipp Hofmann, Dirk Riehle. “Estimating Commit Sizes Efficiently.” In Proceedings of the 5th International Conference on Open Source Systems (OSS 2009). Springer Verlag, 2009. Forthcoming.

http://www.riehle.org/2009/02/11/estimating-commit-sizes-efficiently/

Definition of Commit Size

• Commit consists of Diffs which consist of Sections
– One commit may affect multiple files, each file diff can have multiple sections
• Commit Size = sum(Diff Sizes) where Diff Size = sum(Section Sizes)
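The nested definition above maps onto a small data model. The class names below are illustrative only and do not come from the authors' tool chain.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Section:
    size: float  # size of one diff section, in SLoC


@dataclass
class Diff:
    sections: List[Section]  # one diff per changed file

    def size(self) -> float:
        # Diff Size = sum(Section Sizes)
        return sum(section.size for section in self.sections)


@dataclass
class Commit:
    diffs: List[Diff]

    def size(self) -> float:
        # Commit Size = sum(Diff Sizes)
        return sum(diff.size() for diff in self.diffs)


# A commit touching two files: one with two sections, one with a single section.
commit = Commit([Diff([Section(3), Section(1)]), Diff([Section(5)])])
print(commit.size())
```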

What Diff Does

a.txt (one letter per line): a b c d f g h m n
b.txt (one letter per line): a b c e e e g h j m

Output of diff a.txt b.txt:

    4,5c4,6
    < d
    < f
    ---
    > e
    > e
    > e
    7a9
    > j
    9d10
    < n
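The added and removed line counts for this example can be reproduced with Python's difflib. Note that difflib uses a different matching algorithm than GNU diff, so the two tools can disagree on other inputs; this is only a sketch of how lines added and removed are counted from a diff.

```python
import difflib

a_lines = list("abcdfghmn")   # a.txt, one letter per line
b_lines = list("abceeeghjm")  # b.txt, one letter per line

added = removed = 0
matcher = difflib.SequenceMatcher(None, a_lines, b_lines)
for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
    # "replace" counts on both sides, like a "c" hunk in diff output.
    if tag in ("replace", "delete"):
        removed += a_end - a_start
    if tag in ("replace", "insert"):
        added += b_end - b_start

print(added, removed)  # prints: 4 3
```

The result, 4 lines added and 3 removed, says nothing about how many of those lines were really modifications of existing lines.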

The Trouble with Diff

• From the GNU diff manual:

The way that GNU “diff” determines which lines have changed always comes up with a near-minimal set of differences. Usually it is good enough for practical purposes.

• Also, on the option “-d”:

Try hard to find a smaller set of changes.

• There is no reliable way of determining SLoC changed!

Some Diff Section Size Examples

Table 1: Interpretation of the entry (1, 1): 1 SLoC added, 1 SLoC removed

Event     SLoC added   SLoC removed   SLoC changed   Modifications
Event 1   0            0              1              1
Event 2   1            1              0              2

Table 2: Interpretation of the entry (4, 3): 4 SLoC added, 3 SLoC removed

Event     SLoC added   SLoC removed   SLoC changed   Modifications
Event 1   1            0              3              4
Event 2   2            1              2              5
Event 3   3            2              1              6
Event 4   4            3              0              7
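The rows of both tables follow one rule: every changed line uses up one reported addition and one reported removal. The sketch below enumerates all readings of a diff entry on that basis; the function name is made up, and the rule is inferred from the tables rather than stated in the talk.

```python
def interpretations(added, removed):
    """List (added, removed, changed, modifications) readings of a diff entry.

    Each changed line consumes one reported addition and one reported removal.
    """
    rows = []
    for changed in range(min(added, removed), -1, -1):
        pure_added = added - changed
        pure_removed = removed - changed
        modifications = pure_added + pure_removed + changed
        rows.append((pure_added, pure_removed, changed, modifications))
    return rows


for row in interpretations(4, 3):
    print(row)
```

The number of modifications therefore ranges from max(added, removed) to added + removed, which for the entry (4, 3) is 4 to 7.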

Garden Variety of Heuristics

#   Approach            Error Mean   Error Standard Deviation
1   Lower Bound         3.86         16.64
2   Upper Bound         -4.41        6.39
3   Bounds Mean         -0.27        7.68
4   GNU diff            -1.96        19.55
5   GNU diff -d         -3.06        30.87
6   Ldiff               -5.95        40.35
7   Linear Estimation   0            5.44

Visual Comparison of Heuristics

Definition of Commit Size

• General form: diff_size ← (a, r)
– Diff size is a function of SLoC added and removed
• Linear form: diff_size(a, r) = c_a × a + c_r × r + b
– Straightforward linear approximation
• Open source based estimation provides linear estimates
– Linear regression run over open source sample data

function real diff_size(int a, int r)
    if (0.01269 × a + 0.01540 × r > 2.9965)
        return 0.9497 × a + 0.9744 × r - 2.9965
    else
        return 0.9370 × a + 0.9590 × r
    end
end
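The pseudocode translates directly into a runnable function. The coefficients below are copied from the slide; the commit_size helper is an addition for illustration and assumes one (added, removed) pair per diff.

```python
def diff_size(a: int, r: int) -> float:
    """Piecewise linear estimate of diff size from SLoC added (a) and removed (r)."""
    if 0.01269 * a + 0.01540 * r > 2.9965:
        return 0.9497 * a + 0.9744 * r - 2.9965
    return 0.9370 * a + 0.9590 * r


def commit_size(diffs):
    """Sum the estimates over (added, removed) pairs, one pair per diff."""
    return sum(diff_size(a, r) for a, r in diffs)


print(diff_size(1, 1))
print(commit_size([(4, 3), (200, 200)]))
```

The switch condition is, up to rounding, the point where the two linear forms give the same value, so the function is close to taking the larger of the two estimates.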

The Commit Size Distribution of Open Source

Oliver Arafat, Dirk Riehle. “The Commit Size Distribution of Open Source Software.” In Proceedings of the 42nd Hawaii International Conference on System Sciences (HICSS-42). IEEE Press: 2009. Page 1-8.

http://www.riehle.org/2008/09/23/the-commit-size-distribution-of-open-source-software/

The Overall Commit Size Distribution

[Figure: number of occurrences (y-axis, logarithmic, 1.E-08 to 1.E+07) by commit size in SLoC (x-axis, logarithmic, 1 to 2E+07)]

The Dominance of Small Commits

[Figure: bar chart of the percentage of the number of occurrences (y-axis, 0% to 14%) by commit size in SLoC (x-axis, 1 to 10)]

Commit Size [SLoC]   Percentage of Number of Occurrences
1                    12.13%
2                    8.96%
3                    5.45%
4                    4.96%
5                    3.52%
6                    3.35%
7                    2.55%
8                    2.57%
9                    2.05%
10                   1.94%
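Adding up the bars shows how much of the total these small commits cover. The percentages below are read off the chart, assuming the bars appear in order of commit size from 1 to 10 SLoC.

```python
from itertools import accumulate

# Share of all commits, in percent, for commit sizes 1 to 10 SLoC.
SHARE_BY_SIZE = [12.13, 8.96, 5.45, 4.96, 3.52, 3.35, 2.55, 2.57, 2.05, 1.94]

# Running total: share of commits that are at most `size` SLoC.
cumulative = [round(total, 2) for total in accumulate(SHARE_BY_SIZE)]
for size, total in enumerate(cumulative, start=1):
    print(f"<= {size:2d} SLoC: {total:5.2f}%")
```

With these numbers, commits of at most 10 SLoC make up 47.48% of all commits.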

The Overall Commit Size Distribution

Open source is incremental development

Small commits dominate: the smaller the commit, the more likely

Hypothesis: Contributors and committers follow the same behavioral programming patterns

Given that our development tools were designed with 30-50 or more lines of code in mind, they may be suboptimal for open source

Further research into comparing open with closed source development is necessary

Developer Activity in Open Source Software Projects

Dirk Riehle, Oliver Arafat, Amit Deshpande. “Developer Activity in Open Source Software Projects.” In preparation.

Amit Deshpande, Dirk Riehle. “Continuous Integration in Open Source Software Development.” In Proceedings of the Fourth Conference on Open Source Systems (OSS 2008). Springer Verlag, 2008. Page 273-280.

http://www.riehle.org/2008/03/08/continuous-integration-in-open-source-software-development/

Average Commit Size

[Figure: average commit size per week in SLoC (y-axis, 0 to 50) over time (x-axis, 1990-01-01 to 2007-12-31)]

Average Commit Frequency

[Figure: commits per developer per week (y-axis, 0 to 100) over time (x-axis, 1990-01-01 to 2005-12-31)]

Changes in Developer Behavior

No significant changes apparent

Hypothesis: Foundations are not having an overall impact (yet)?

Agile methods, in particular continuous integration, did not change behavior or were always present

Further research into separating contributors from committers is necessary

The Commenting Practice of Open Source

Oliver Arafat, Dirk Riehle. “The Comment Density of Open Source Software Code.” In Companion to Proceedings of the 31st International Conference on Software Engineering (ICSE 2009). IEEE Press, 2009: Forthcoming.

http://www.riehle.org/2009/02/04/the-comment-density-of-open-source-software-code/

Average Comment Density

comment density = #comment lines / #code lines

[Figure: comment density (y-axis, 0% to 100%) by project size in lines of code, LoC = CL + SLoC (x-axis, logarithmic, 1.E+00 to 1.E+09)]

mean = 0.1867
median = 0.1674
stdev = 0.1088
correl = -0.00787
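The density measure and its summary statistics are easy to compute. The sketch below assumes that "#code lines" in the definition means all lines, LoC = CL + SLoC, as the axis label suggests; that reading is an assumption. The project numbers are made up for illustration and are not data from the talk.

```python
from statistics import mean, median, pstdev


def comment_density(comment_lines: int, source_lines: int) -> float:
    """Comment lines divided by all lines (assumed LoC = CL + SLoC)."""
    total = comment_lines + source_lines
    return comment_lines / total if total else 0.0


# Made-up (comment lines, source lines) pairs, one per hypothetical project.
projects = [(200, 800), (50, 450), (300, 700)]
densities = [comment_density(cl, sloc) for cl, sloc in projects]

print(densities)
print(mean(densities), median(densities), pstdev(densities))
```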

Comment Density by Programming Language

#    Language     Average [%]   Stddev [%]   Population Size
1.   Java         26%           11%          1085
2.   php          22%           12%          559
3.   C/C++        18%           8%           1621
4.   Javascript   16%           9%           276
5.   Python       11%           8%           534
6.   Perl         10%           7%           273

Comment Density by Commit Size

[Figure: comment density (y-axis, 0% to 100%) by source lines of code in commit (x-axis, 0 to 100 SLoC). Series: Comment Density by SLoC Size, Total Average Comment Density.]

sloc size = 1-100: mean = 0.2513, median = 0.2340, stdev = 0.0626



Comment Density by Team Size

[Figure: comment density and percent of projects (two percentage axes, 0% to 50%) by team size in number of committers (x-axis, 0 to 100). Series: Comment Density by Team Size, Total Average Comment Density, Percent Projects with Team Size.]

team size = 1-20: mean = 0.1914, median = 0.1878, stdev = 0.0255
team size = 1-50: mean = 0.1922, median = 0.1906, stdev = 0.0425
team size = 1-100: mean = 0.1856, median = 0.1857, stdev = 0.0641, correl = -0.0550

Comment Density by Project Age

[Figure: average comment density (y-axis, 17.50% to 19.50%) by project age in months (x-axis, 0 to 48)]

correl = -0.9054

Commenting in Open Source

A continuous on-going practice

Surprisingly high comment density, clearly no belief in self-documenting code

Hypothesis: Comment density of open source represents the sweet spot of commenting code

For any specific conclusions, watch the parameters

Further research into project types is necessary

Team Size Evolution in Open Source Projects

Philipp Hofmann, Dirk Riehle. “Team Size Evolution in Open Source Software Projects.” In preparation.

Team Size Evolution Figure

[Figure: number of projects (y-axis) by team size (x-axis), labeled with the years 1985 to 2007]

Evolution of Team Sizes

[Figure: evolution of team sizes; team size axis with ticks 1, 2, 3, 5, 7, 11, 18, 29, 46, 73, 123, 221, 397, 713, ...]

Is Open Source Scale-Free?

• Definition of scale-free distribution
– Follows a power law (straight line on log-log graph)
– Very common in nature, technology, and society
– Implies same principles govern every level of scale
• Examples of scale-free graphs in open source
– Commit size distribution
– Team size distribution
• If open source were scale-free, then
– We would know how to scale to ever larger project sizes
– By basically applying the same principles across the board
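The "straight line on log-log graph" criterion can be checked numerically by fitting a line to log(y) against log(x). The sketch below uses synthetic data that follows a power law exactly; it is not data from the talk, and loglog_fit is a name invented for this example. A good linear fit on log-log axes is a necessary sign of a power law, not proof of one.

```python
import math


def loglog_fit(xs, ys):
    """Fit log(y) = intercept + slope * log(x); return (slope, intercept, r_squared)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    syy = sum((y - mean_y) ** 2 for y in ly)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x, (sxy * sxy) / (sxx * syy)


# Synthetic counts that follow y = 1000 * x^-1.5 exactly.
sizes = [1, 2, 4, 8, 16, 32, 64]
counts = [1000 * size ** -1.5 for size in sizes]

slope, intercept, r_squared = loglog_fit(sizes, counts)
print(slope, r_squared)
```

The fitted slope is the exponent of the power law, and an R-square value close to 1 indicates that the points lie close to a straight line.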

Conclusions

• Lots of interesting insights to be gained from analyzing open source

• We have the chance now to fix misconceptions by looking at data

• Lots of that insight may apply to closed source development as well

• Open source may show the most resource-efficient sweet spots

• New research agenda: Is open source scale-free?

Thank you!

[email protected], www.riehle.org, twitter.com/driehle

Comments are welcome!
