Six Easy Pieces of Quantitatively Analyzing Open Source
Six Easy Pieces (of Quantitatively Analyzing Open Source Software)
Dirk Riehle
SAP Research, SAP Labs LLC
www.riehle.org, twitter.com/driehle
Open Source Software
• Definition of open source software
  – Software that is provided under an OSI-approved license
• General properties of open source software
  – Software that is available in source code form
  – Software that you can modify and redistribute
• Further important (not always present) properties
  – The project has gathered a thriving community
  – Feedback and ideas about the software are public and abundant
  – Projects are egalitarian, meritocratic, and self-organizing
Talk Overview (Agenda)
1. The Growth of Open Source Software
2. Data Mining for Fun and Profit
3. Efficiently Estimating Commit Sizes
4. The Commit Size Distribution of Open Source
5. Developer Activity in Open Source Software Projects
6. The Commenting Practice of Open Source
7. Team Size Evolution in Open Source Projects
8. Conclusions
The Growth of Open Source Software
Amit Deshpande, Dirk Riehle. “The Total Growth of Open Source.” In Proceedings of the Fourth Conference on Open Source Systems (OSS 2008). Springer Verlag, 2008. Page 197-209.
http://www.riehle.org/2008/03/14/the-total-growth-of-open-source/
Source Code Growth in Open Source
[Figure: total SLoC in millions (y-axis, 0 to 1,200) over time (x-axis, Jan 1993 to Jan 2006)]
SLoC = source lines of code
Model of Source Code Growth
Approach      Model                        R-square value
Upper bound   y = 784098 * e^(0.0555x)     0.961
Lower bound   y = 2E+06 * e^(0.0464x)      0.964

where
y: Total open source lines of code
x: Time from Jan 1995 to Dec 2006 in months
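An exponential model implies a constant doubling time of ln(2) / k months, where k is the exponent coefficient. The following is a minimal sketch that evaluates the two fitted models from this slide. The function names are illustrative and not part of the original work.

```python
import math

# Coefficients taken from the slide: y = c * exp(k * x), x in months since Jan 1995.
MODELS = {
    "upper bound": (784098.0, 0.0555),
    "lower bound": (2e6, 0.0464),
}

def total_sloc(c, k, months):
    """Evaluate the exponential growth model at a given month offset."""
    return c * math.exp(k * months)

def doubling_time_months(k):
    """Months needed for an exponential with rate k to double."""
    return math.log(2) / k

for name, (c, k) in MODELS.items():
    print(f"{name}: doubling time {doubling_time_months(k):.1f} months")
```

Under these two models, the total amount of source code doubles roughly every 12.5 to 14.9 months.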
Project Growth in Open Source
[Figure: total number of projects (y-axis, 0 to 5,000) over time (x-axis, Jan 1995 to Jan 2006)]
Model of Project Growth
Model                        R-square value
y = 7.1511 * e^(0.0499x)     0.956

where
y: Total number of open source projects
x: Time from Jan 1995 to Dec 2006 in months
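One common way to obtain such a model is an ordinary least-squares fit on ln(y), since ln(y) = ln(c) + k·x is linear in x. The sketch below assumes that approach; the slides do not state how the fit or the R-square value was actually computed. The data is synthetic, generated from the slide's own model, and is not the original measurement series.

```python
import math

def fit_exponential(xs, ys):
    """Least-squares fit of ln(y) = ln(c) + k*x. Returns (c, k, r_squared in log space)."""
    n = len(xs)
    ln_ys = [math.log(y) for y in ys]
    mean_x = sum(xs) / n
    mean_ln = sum(ln_ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (ly - mean_ln) for x, ly in zip(xs, ln_ys))
    k = sxy / sxx
    ln_c = mean_ln - k * mean_x
    ss_res = sum((ly - (ln_c + k * x)) ** 2 for x, ly in zip(xs, ln_ys))
    ss_tot = sum((ly - mean_ln) ** 2 for ly in ln_ys)
    return math.exp(ln_c), k, 1.0 - ss_res / ss_tot

# Synthetic stand-in data: 144 months, Jan 1995 to Dec 2006.
months = list(range(144))
projects = [7.1511 * math.exp(0.0499 * x) for x in months]
c, k, r2 = fit_exponential(months, projects)
print(f"y = {c:.4f} * e^({k:.4f}x), R^2 = {r2:.3f}")
```

Because the input is noise-free, the fit recovers the slide's coefficients exactly; real project counts would give an R-square below 1.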
Where Open Source is Growing
It is not the size of individual projects,
but the total number of active working open source projects that is growing exponentially
Open source is growing in every domain,
including business applications
Data Mining for Fun and Profit
Oliver Arafat, Amit Deshpande, Philipp Hofmann, Dirk Riehle.
http://www.riehle.org/publications/
Motivation and Approach
• Gain insight into how open source software development works
  – Review processes unique to open source (or transferable)
  – Possibly improve corporate software development
  – Possibly build novel tools
• Post-facto detailed quantitative analysis of actual behavior
  – Analyzing what people do rather than what they say they do
  – Open source is publicly developed software, so there is lots of data
Data Source, Data Quality
• Using the Ohloh.net repository of open source project information
  – Detailed information on a large number of projects (>9000 in 2008 snapshot)
  – Includes project structure and members but also detailed code information
  – Ohloh captures every commit down to diff sections (>8M commits)
  – See http://www.ohloh.net for more information
• To be useful, data needs to be cleaned and filtered
  – Depends on the question at hand
  – Developed a variety of easily applicable filters
Open Source Analytics Tool Chain
[Figure: five-stage pipeline from Raw Data (ohloh.net) via SQL and Java to Aggr. Data (RDBMS), then via Excel / Calc, R, and special tools to the analysis output]

1. Raw data source
   · Local database (ohloh.net snapshot, crawled sources)
   · Web services access (ohloh.net, sourceforge.net, others)
2. Pre-processing
   · Database querying using SQL and scripts
   · Java library for computationally heavyweight filters, aggregation
3. Aggregated data source
   · Output of pre-processing stage for specific analytical tasks
   · Aggregated data significantly improves analysis speed
4. Analytical processing
   · Mines aggregated (and raw) data for insights, hypothesis testing
   · At present basic processing (Excel), machine learning next
5. Analysis output
   · Results of analytical processing: averages, distributions, correlations
   · Presented as models, tables, graphs, charts, etc.
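The step from raw data to aggregated data can be illustrated with a small SQL aggregation. The sketch below uses Python's built-in sqlite3 module and an in-memory database; the table names, column names, and sample rows are hypothetical and do not reflect the actual Ohloh schema or the authors' Java tooling.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical raw-data table: one row per commit.
conn.execute(
    "CREATE TABLE commits ("
    "project TEXT, commit_id INTEGER, sloc_added INTEGER, sloc_removed INTEGER)"
)
conn.executemany(
    "INSERT INTO commits VALUES (?, ?, ?, ?)",
    [("alpha", 1, 4, 3), ("alpha", 2, 1, 1), ("beta", 3, 120, 0)],
)

# Pre-processing: materialize a per-project aggregate so later analysis
# does not need to rescan every commit.
conn.execute(
    "CREATE TABLE project_stats AS "
    "SELECT project, COUNT(*) AS commits, "
    "AVG(sloc_added + sloc_removed) AS avg_churn "
    "FROM commits GROUP BY project"
)

stats = conn.execute(
    "SELECT project, commits, avg_churn FROM project_stats ORDER BY project"
).fetchall()
print(stats)
```

Analytical tools then read only the small aggregate table, which is the speed benefit the slide attributes to the aggregated data source.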
Efficiently Estimating Commit Sizes
Philipp Hofmann, Dirk Riehle. “Estimating Commit Sizes Efficiently.” In Proceedings of the 5th International Conference on Open Source Systems (OSS 2009). Springer Verlag, 2009. Forthcoming.
http://www.riehle.org/2009/02/11/estimating-commit-sizes-efficiently/
Definition of Commit Size
• A commit consists of diffs, which consist of sections
  – One commit may affect multiple files; each file diff can have multiple sections
• Commit Size = sum(Diff Sizes), where Diff Size = sum(Section Sizes)
What Diff Does
Line  a.txt  b.txt
01:   a      a
02:   b      b
03:   c      c
04:   d      e
05:   f      e
06:   g      e
07:   h      g
08:   m      h
09:   n      j
10:          m

diff a.txt b.txt

  4,5c4,6
  < d
  < f
  ---
  > e
  > e
  > e
  7a9
  > j
  9d10
  < n
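The same example can be reproduced with Python's standard difflib module, which also shows how per-section sizes add up to a diff size. This is a sketch only: difflib uses its own matching algorithm, not GNU diff's, so on other inputs the sections may differ. The helper name section_sizes is illustrative.

```python
from difflib import SequenceMatcher

# File contents from the slide, one character per line.
a_lines = list("abcdfghmn")
b_lines = list("abceeeghjm")

def section_sizes(old, new):
    """Return one (added, removed) pair per changed section."""
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    sections = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # j-range counts lines only in the new file, i-range lines only in the old.
            sections.append((j2 - j1, i2 - i1))
    return sections

sections = section_sizes(a_lines, b_lines)
added = sum(s[0] for s in sections)
removed = sum(s[1] for s in sections)
print(sections, added, removed)
```

For this input, the three sections correspond to the change, add, and delete hunks of the diff output, for a total of 4 lines added and 3 lines removed.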
The Trouble with Diff
• From the GNU diff manual:
The way that GNU “diff” determines which lines have changed always comes up with a near-minimal set of differences. Usually it is good enough for practical purposes.
• Also, on the option “-d”:
Try hard to find a smaller set of changes.
• There is no reliable way of determining SLoC changed!
Some Diff Section Size Examples
Table 1: Interpretation of the entry (1, 1): 1 SLoC added, 1 SLoC removed

          Number of    Number of      Number of      Number of
          SLoC added   SLoC removed   SLoC changed   Modifications
Event 1   0            0              1              1
Event 2   1            1              0              2

Table 2: Interpretation of the entry (4, 3): 4 SLoC added, 3 SLoC removed

          Number of    Number of      Number of      Number of
          SLoC added   SLoC removed   SLoC changed   Modifications
Event 1   1            0              3              4
Event 2   2            1              2              5
Event 3   3            2              1              6
Event 4   4            3              0              7
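The rows of both tables follow one rule: a changed line appears in diff output as one removed line plus one added line, so anywhere from 0 to min(added, removed) of the reported lines may really be changes. The sketch below enumerates those readings; function names are illustrative. The smallest and largest modification counts give natural lower and upper bounds, which presumably correspond to the bound-based heuristics compared on the next slide.

```python
def interpretations(added, removed):
    """Enumerate every reading of a diff entry (added, removed), fewest modifications first."""
    events = []
    for changed in range(min(added, removed), -1, -1):
        events.append({
            "added": added - changed,
            "removed": removed - changed,
            "changed": changed,
            "modifications": added + removed - changed,
        })
    return events

def lower_bound(added, removed):
    """Fewest possible modifications: as many lines as possible were changes."""
    return max(added, removed)

def upper_bound(added, removed):
    """Most possible modifications: no line was a change."""
    return added + removed

def bounds_mean(added, removed):
    """Midpoint between the two bounds."""
    return (lower_bound(added, removed) + upper_bound(added, removed)) / 2

for event in interpretations(4, 3):
    print(event)
```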
Garden Variety of Heuristics
#   Approach            Error Mean   Error Standard Deviation
1   Lower Bound         3.86         16.64
2   Upper Bound         -4.41        6.39
3   Bounds Mean         -0.27        7.68
4   GNU diff            -1.96        19.55
5   GNU diff -d         -3.06        30.87
6   Ldiff               -5.95        40.35
7   Linear Estimation   0            5.44
Visual Comparison of Heuristics
Definition of Commit Size
• General form: diff_size ← (a, r)
  – Diff size is a function of SLoC added (a) and SLoC removed (r)
• Linear form: diff_size(a, r) = c_a × a + c_r × r + b
  – Straightforward linear approximation
• Open source based estimation provides linear estimates
  – Linear regression run over open source sample data

function real diff_size(int a, int r)
  if (0.01269 × a + 0.01540 × r > 2.9965)
    return 0.9497 × a + 0.9744 × r - 2.9965
  else
    return 0.9370 × a + 0.9590 × r
  end
end
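The pseudocode above translates directly into Python. The coefficients are copied from the slide; the commit_size helper is an illustrative addition that applies the estimate per diff and sums the results, following the earlier definition of commit size.

```python
def diff_size(a: int, r: int) -> float:
    """Estimate the size of a diff from SLoC added (a) and SLoC removed (r)."""
    # The two linear models cross where this condition flips, so the
    # function effectively returns the larger of the two estimates.
    if 0.01269 * a + 0.01540 * r > 2.9965:
        return 0.9497 * a + 0.9744 * r - 2.9965
    return 0.9370 * a + 0.9590 * r

def commit_size(diffs):
    """Sum the estimated sizes of all (added, removed) diffs in one commit."""
    return sum(diff_size(a, r) for a, r in diffs)

print(diff_size(1, 1))
print(commit_size([(4, 3), (1000, 1000)]))
```

Small diffs use the model without an intercept, so an empty diff yields an estimate of zero rather than a negative size.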
The Commit Size Distribution of Open Source
Oliver Arafat, Dirk Riehle. “The Commit Size Distribution of Open Source Software.” In Proceedings of the 42nd Hawaiian International Conference on System Science (HICSS-42). IEEE Press: 2009. Page 1-8.
http://www.riehle.org/2008/09/23/the-commit-size-distribution-of-open-source-software/
The Overall Commit Size Distribution
[Figure: number of occurrences (y-axis, logarithmic, 1.E-08 to 1.E+07) by commit size in SLoC (x-axis, logarithmic, 1 to 2E+07)]
The Dominance of Small Commits
[Figure: percentage of number of occurrences (y-axis, 0% to 14%) by commit size in SLoC (x-axis, 1 to 10)]

Commit Size [SLoC]   Percentage of Occurrences
1                    12.13%
2                    8.96%
3                    5.45%
4                    4.96%
5                    3.52%
6                    3.35%
7                    2.55%
8                    2.57%
9                    2.05%
10                   1.94%
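Adding up the ten percentages on this slide shows how much of all commit activity the smallest commits account for. The sketch below assumes the chart labels map to commit sizes 1 through 10 in reading order; the cumulative total for size 10 does not depend on that assumption.

```python
# Share of all commits by commit size in SLoC, as labeled on the chart.
shares = {
    1: 12.13, 2: 8.96, 3: 5.45, 4: 4.96, 5: 3.52,
    6: 3.35, 7: 2.55, 8: 2.57, 9: 2.05, 10: 1.94,
}

cumulative = {}
running = 0.0
for size in sorted(shares):
    running += shares[size]
    cumulative[size] = round(running, 2)

for size, total in cumulative.items():
    print(f"<= {size:2d} SLoC: {total:.2f}%")
```

Commits of at most 10 SLoC make up about 47.5% of all commits in this data.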
The Overall Commit Size Distribution
Open source is incremental development
Small commits dominate: the smaller the commit, the more likely
Hypothesis: Contributors and committers follow the same behavioral programming patterns
Given that our development tools were designed with 30-50 or more lines of code in mind, they may be suboptimal for open source
Further research into comparing open with closed source development is necessary
Developer Activity in Open Source Software Projects
Dirk Riehle, Oliver Arafat, Amit Deshpande. “Developer Activity in Open Source Software Projects.” In preparation.
Amit Deshpande, Dirk Riehle. “Continuous Integration in Open Source Software Development.” In Proceedings of the Fourth Conference on Open Source Systems (OSS 2008). Springer Verlag, 2008. Page 273-280.
http://www.riehle.org/2008/03/08/continuous-integration-in-open-source-software-development/
Average Commit Size
[Figure: average commit size per week in SLoC (y-axis, 0 to 50) over time (x-axis, 1990-01-01 to 2007-12-31)]
Average Commit Frequency
[Figure: commits per developer per week (y-axis, 0 to 100) over time (x-axis, 1990-01-01 to 2005-12-31)]
Changes in Developer Behavior
No significant changes apparent
Hypothesis: Foundations are not having an overall impact (yet)?
Agile methods, in particular continuous integration, either did not change behavior or were always present
Further research into separating contributors from committers is necessary
The Commenting Practice of Open Source
Oliver Arafat, Dirk Riehle. “The Comment Density of Open Source Software Code.” In Companion to Proceedings of the 31st International Conference on Software Engineering (ICSE 2009). IEEE Press, 2009: Forthcoming.
http://www.riehle.org/2009/02/04/the-comment-density-of-open-source-software-code/
Average Comment Density
comment density = #comment lines / #code lines
[Figure: comment density (y-axis, 0% to 100%) by project size in lines of code (x-axis, logarithmic, 1.E+00 to 1.E+09), where LoC = CL + SLoC]
mean = 0.1867, median = 0.1674, stdev = 0.1088, correl = -0.00787
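The summary statistics on this slide can be computed with Python's statistics module once a density is available per project. The sketch below assumes that "#code lines" in the definition means total lines of code, LoC = CL + SLoC, as on the chart's x-axis; the slide itself does not spell this out. The three sample projects are invented for illustration.

```python
import statistics

def comment_density(comment_lines: int, source_lines: int) -> float:
    """Comment lines as a fraction of all lines (assumes LoC = CL + SLoC)."""
    total = comment_lines + source_lines
    if total == 0:
        raise ValueError("project has no lines of code")
    return comment_lines / total

# Hypothetical (comment lines, source lines) pairs, one per project.
projects = [(20, 80), (10, 90), (30, 70)]
densities = [comment_density(cl, sloc) for cl, sloc in projects]

print("mean  ", statistics.mean(densities))
print("median", statistics.median(densities))
print("stdev ", statistics.stdev(densities))
```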
Comment Density by Programming Language
#    Language     Average [%]   Stddev [%]   Population Size
1.   Java         26%           11%          1085
2.   php          22%           12%          559
3.   C/C++        18%           8%           1621
4.   Javascript   16%           9%           276
5.   Python       11%           8%           534
6.   Perl         10%           7%           273
Comment Density by Commit Size
[Figure: comment density (y-axis, 0% to 100%) by source lines of code in commit (x-axis, 0 to 100 SLoC). Series: Comment Density by SLoC Size; Total Average Comment Density]

SLoC size   mean     median   stdev
1-100       0.2513   0.2340   0.0626
50-100      0.2234   0.2190   0.0169
80-100      0.2224   0.2171   0.0209
Comment Density by Team Size
[Figure: comment density (left y-axis, 0% to 50%) and percentage of projects (right y-axis, 0% to 50%) by team size in number of committers (x-axis, 0 to 100). Series: Comment Density by Team Size; Total Average Comment Density; Percent Projects with Team Size]

Team size   mean     median   stdev    correl
1-20        0.1914   0.1878   0.0255
1-50        0.1922   0.1906   0.0425
1-100       0.1856   0.1857   0.0641   -0.0550
Comment Density by Project Age
[Figure: average comment density (y-axis, 17.50% to 19.50%) by project age in months (x-axis, 0 to 48)]
correl = -0.9054
Commenting in Open Source
A continuous on-going practice
Surprisingly high comment density; clearly no belief in self-documenting code
Hypothesis: Comment density of open source represents the sweet spot of commenting code
For any specific conclusions, watch the parameters
Further research into project types is necessary
Team Size Evolution in Open Source Projects
Philipp Hofmann, Dirk Riehle. “Team Size Evolution in Open Source Software Projects.” In preparation.
Team Size Evolution Figure
[Figure: number of projects (y-axis) by team size (x-axis), one series per year from 1985 to 2007]
Evolution of Team Sizes
[Figure: team size distribution; x-axis ticks at 1, 2, 3, 5, 7, 11, 18, 29, 46, 73, 123, 221, 397, 713, and higher values whose labels are not recoverable]
Is Open Source Scale-Free?
• Definition of scale-free distribution
  – Follows a power law (straight line on a log-log graph)
  – Very common in nature, technology, and society
  – Implies the same principles govern every level of scale
• Examples of scale-free graphs in open source
  – Commit size distribution
  – Team size distribution
• If open source were scale-free, then
  – We would know how to scale to ever larger project sizes
  – By basically applying the same principles across the board
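A first, rough check of the "straight line on a log-log graph" criterion is a least-squares line through (log x, log count). The sketch below shows that check on synthetic data; the exponent -1.5 and the constant 1000 are arbitrary illustration values, not results from the talk. A good linear fit in log-log space is only suggestive, since other heavy-tailed distributions can look similar over a limited range.

```python
import math

def log_log_fit(xs, counts):
    """Fit log(count) = intercept + slope * log(x). Returns (slope, r_squared)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(c) for c in counts]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(lx, ly))
    ss_tot = sum((y - mean_y) ** 2 for y in ly)
    return slope, 1.0 - ss_res / ss_tot

# Synthetic power-law data: count = 1000 * size^(-1.5) for sizes 1, 2, 4, ..., 1024.
sizes = [2 ** i for i in range(11)]
counts = [1000.0 * s ** -1.5 for s in sizes]
slope, r2 = log_log_fit(sizes, counts)
print(f"estimated exponent: {slope:.3f}, R^2: {r2:.3f}")
```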
Conclusions
• Lots of interesting insights to be gained from analyzing open source
• We have the chance now to fix misconceptions by looking at data
• Much of that insight may apply to closed source development as well
• Open source may show the most resource-efficient sweet spots
• New research agenda: Is open source scale-free?