The adoption of FOSS workflows in commercial software development: the case of git and github
Daniel M German
University of Victoria
Canada
Open Source is everywhere
On SSL and Heartbleed
“[Heartbleed] is a software flaw that has left up to two-thirds of the world’s websites vulnerable to attack by hackers.”
– The Economist
“There is no such thing as bad publicity except your own obituary.”
– Brendan Behan
● “Most open-source software – and Open SSL is no exception – is produced voluntarily by people who are not paid for creating it. They do it for love, professional pride or as a way of demonstrating technical virtuosity. And mostly they do it in their spare time.”
– John Naughton The Observer/The Guardian
'Heartbleed' bug can't be simply blamed on coders, April 13, 2014
“Responsible corporate use of open-source software should therefore involve some measure of reciprocity: a corporation that benefits hugely from such software ought to put something back, either in the form of financial support for a particular open-source project, or – better still – by encouraging its own software people to contribute to the project.”
“Much of the invisible backbone of websites from Google to Amazon to the Federal Bureau of Investigation was built by volunteer programmers in what is known as the open-source community.”
“... volunteers, connected over the Internet, work together to build free software, to maintain and improve it and to look for bugs. Ideally, they check one another’s work in a peer review system similar to that found in science.”
Linus's Law:
“Given Enough Eyeballs, all Bugs are Shallow”
Eric Raymond, The Cathedral and the Bazaar
In the case of Heartbleed
“There weren't enough eyeballs”
– Eric Raymond
● Code was created by a grad student
● Reviewed by S. Henson, core developer of OpenSSL
● Included in OpenSSL in the Spring 2011
● Not discovered for 3 years!
Budget of OpenSSL:
– US$2,000 for 2013
the OpenSSL problem
● important infrastructure projects that are run by small teams of volunteers
● on April 24, the Linux Foundation announces the “Core Infrastructure Initiative” to address it
Core Infrastructure Initiative
● Funded by:
– Amazon, Cisco, Dell, Facebook, Fujitsu, Google, IBM, Intel, Microsoft, NetApp, Rackspace, Qualcomm, VMware and The Linux Foundation
● Funding to core projects:
– Fellowships to core developers
– as well as other resources to assist the project in improving its security, enabling outside reviews, and improving responsiveness to patch requests.
What is FOSS development?
● Most important feature of FOSS
– its free or open source license
● License
– Guarantees code is available to others to reuse
– Becomes a social contract among participants
What is OSS development?
● Most frequently defined as:
– Self organized teams developing software without a central authority
● Code is open for review
– and reuse!!!
● Anybody can participate
What makes OSS development possible?
● Teams of self-organized developers and contributors
● The Internet
● A common toolkit
● Version control systems
Teams
● Come from all sectors:
– Professionals and hobbyists
– Paid and volunteers
– Novices and Experienced
– High-school students to PhDs
– All over the world!!!
● Highly motivated!
Common Toolkit
● To be able to collaborate you need a common set of tools
– Programming languages: gcc, perl, python, java, ruby, lua, php...
– Editors and IDEs: Emacs, vim, Eclipse, Netbeans...
– Libraries: boost, maven, cpan, PyPI...
– Infrastructure: Make, ant, cmake, bugzilla, etc.
– Hosting infrastructure: Sourceforge, Google Code, github, bitbucket
● They must be available at zero cost to anybody
FOSS Toolkit
● I posit that one of the biggest influences of FOSS on the practice of Software Development is the wide use of FOSS tools for the development of software
– Most implementations of popular programming languages today are open source
– FOSS Editors and IDEs are widely used too
Free Software Foundation
● The FSF had to bootstrap the development of the OSS toolkit
– To build an Operating System you need a compiler
– Before you build a compiler you need an editor, but you need a compiler to build an editor
– gcc, emacs, coreutils (ls, echo, cat, etc.), etc.
Richard Stallman
Created the legal and technical infrastructure for Free and Open Source software
on Code Reviews
Need for Code Reviews
● Many FOSS teams discovered that to ship good quality software they needed to review the source code
Fagan Code Inspections
● Code reviews performed at specific stages of development
Effective, but not widely used
Open Source style Code Reviews
● Fagan inspections were unfeasible
– Required participants to be in the same room
● Instead, code reviews started to be incremental
– Rather than reviewing the whole, review the delta (the patch)
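To make "review the delta" concrete, here is a minimal sketch that produces a patch from two versions of a file using Python's standard difflib module. The helper name make_patch, the file name example.py and the sample contents are illustrative assumptions, not part of the talk.

```python
import difflib


def make_patch(old_text: str, new_text: str, path: str = "example.py") -> str:
    """Return a unified diff (the delta) between two versions of one file.

    `path` is a hypothetical file name used only to label the diff headers.
    """
    diff = difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(diff)


# Two versions of a small, made-up file.
old = "def area(r):\n    return 3.14 * r * r\n"
new = "import math\n\ndef area(r):\n    return math.pi * r * r\n"

patch = make_patch(old, new)
print(patch)  # the reviewer reads only these changed lines, not the whole file
```

The reviewer sees only the added and removed lines plus a little context, which is what makes frequent, asynchronous review of small contributions practical.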
Code Reviews in FOSS
the spectrum of Code Reviews
code reviews in FOSS
(1) early, frequent reviews
(2) of small, independent, complete contributions
(3) that are broadcast to a large group of stakeholders, but only reviewed by a small set of self-selected experts
(4) resulting in an efficient and effective peer review technique.
- Peter Rigby
Lessons from FOSS
on Version Control systems
Version Control Systems
● At the beginning, FOSS used tar files in USENET
– the FSF would ship physical tapes!
● Today, version control systems are the norm
– Centralized or Distributed
● FOSS has a continuous and proven track record of innovation in version control systems
– FOSS democratized VC
On Version Control
● The VC is the circulatory system of software development
● It brings the code to all stakeholders
● A contribution is a patch
– one or more commits
the patch
● the patch should be reviewed
● most VCs don't support reviewing of patches
the patch and its review
● Two models:
– Commit Then Review (CTR)
● Review the code after it has been integrated
or
– Review Then Commit (RTC)
● Review the patch before it is integrated
Linux
● Linux incorporated RTC early in its process
● Linus needed the review process integrated with the VC
● No FOSS VC did it
– he turned to bitkeeper
Bitkeeper and Linux
● Symbiotic relationship
– Free (as in beer) licenses to linux developers with one big condition
● User should not develop competing tools
– Bitkeeper rapidly improved the Linux integration process
● simplified integration of reviewed code
– Bitkeeper was probably influenced by Linus's workflow
– in 2005 bitkeeper revokes its license to Linux developers
Git
● Many other distributed version control systems before it
● What makes it special?
– Many features, but especially:
● Pull-requests
● git incorporates the code review process with a distributed version control system
– Even via email patches
How is distributed version control software being used?
Git
● Software engineers are moving towards git
– And other DVCs
● Github is a major reason
The Promise of Git
From: http://thkoch2001.github.io/whygitisbetter/
Challenge 1
● Personal repos are beyond reach
● Local commits might never be observable
“History is written by the victors”
Challenge 2: History
Rebasing changes history
Save history before it is lost!
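A toy model helps show why rebasing changes history. The sketch below assumes a simplified commit id that hashes the parent id together with the author, message and patch; real git hashes a full commit object (tree, parents, author, committer, message), so commit_id here is a hypothetical stand-in, and the parent names are made up.

```python
import hashlib


def commit_id(parent_id: str, author: str, message: str, patch: str) -> str:
    """Toy commit id: a SHA-1 over the parent id and the commit's content.

    This is a simplification for illustration, not git's actual object format.
    """
    payload = "\n".join([parent_id, author, message, patch])
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


patch = "+fix off-by-one in loop\n"

# The same change, first on top of one parent, then replayed (rebased) on another.
original = commit_id("base-v1", "alice", "Fix loop bound", patch)
rebased = commit_id("base-v2", "alice", "Fix loop bound", patch)

print(original)
print(rebased)
print(original != rebased)  # same patch, different identity
```

Because the id depends on the parent, replaying identical work on a new base yields a new id, and the old commit can disappear from every repository that rebased.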
Super-repository
● Collection of repositories cloned (recursively) from the same repo
– At least one per developer
● On their personal computer
– At least one public repository
● The blessed
– In git, no way to trace them
Moving commits across the superRepo
Method
Push: done at source, needs write access to destination
Pull: done at destination, needs read access to source
Email: source creates a patch and mails it; recipient applies it
Ecosystem of Repos
Can we learn from Linux?
Life of a Patch in Linux
Continuous Mining of Linux
● Linux has no centralized logging
– Nobody really knows what the superRepo is
– Commits flow without any event broadcasting mechanism
● How do we find the activity?
– Repos
– Commits
Semiautomatic Process
● Every 3 hrs, ask every repo
– What new commits do you have?
– What commits did you delete?
– Automatically resolve propagations
● Commits might propagate before we scan
● Daily:
– Are there commits in a repo by unknown committers?
● If so, answer: is there a new repo, or is the committer new to the repo?
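The periodic scan can be sketched as a set difference between two consecutive snapshots of a repository's commit ids, emitting one record per change (commit id, added or deleted, repo, time). This is a minimal sketch under assumptions: the names Propagation and diff_scan, the repo label and the commit ids are hypothetical, and it does not reproduce the actual mining tool.

```python
from datetime import datetime, timezone
from typing import List, NamedTuple, Set


class Propagation(NamedTuple):
    """One observed change in a repo (hypothetical record layout)."""
    commit_id: str
    event: str  # "added" or "deleted"
    repo: str
    when: datetime


def diff_scan(repo: str, previous: Set[str], current: Set[str],
              when: datetime) -> List[Propagation]:
    """Compare two scans of one repo and report commits that appeared or vanished."""
    added = [Propagation(c, "added", repo, when) for c in sorted(current - previous)]
    deleted = [Propagation(c, "deleted", repo, when) for c in sorted(previous - current)]
    return added + deleted


# Made-up commit ids from two scans of the same repo, three hours apart.
scan_time = datetime(2012, 1, 1, 3, 0, tzinfo=timezone.utc)
before = {"a1", "b2", "c3"}
after = {"a1", "c3", "d4"}

records = diff_scan("repo-x", before, after, scan_time)
for record in records:
    print(record)
```

A commit that shows up in several repos between two scans produces several "added" records with the same timestamp, which is why propagation order sometimes has to be resolved afterwards.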
Implementation
● Running since Nov. 2011
– Currently scans 650 repos every 3 hrs
– Retrieved
● 2.3 million commits (compared to 400k in Linus's repo)
● 109 million records in the propagation table
<commit-id, added|deleted, repo, when>
                                  Snapshot (Linus)   Continuous
No. of repos                      1                  479
Commits                           64k                533k
Non-merge commits                 59k                485k
Unique non-merges                 58k                135k
% unique non-merges               98.9%              27.9%
Non-merges that reached Blessed                      43.1%
Different author emails           3434               5646
Different authors                 2883               4575
Different committer emails        283                1185
Different committers              245                1058
Commit vs Patches
● Commit ids are insufficient to track patches
● Large amount of work not reaching blessed
Arrival of Commits at Blessed
Arrival of Commits at Blessed...
● We can classify patches as a new feature or bug-fix
The Latency
Time of Authorship Time of Commit
The Repos
Path to Linus
● Large ecosystem of repositories
– Producers
– Consumers
Contributors vs Consumers
Linux Dashboard
● We asked two linux maintainers:
– Can this info be useful?
● Answer:
– “Yes”
… but not for what we expected...
Tracking commits in Linux
● Need to track patches, not commits
– Particularly important in consumer repositories
– Need to cross-reference commits
● What commits contain the same patch?
– Some repos track commits from blessed via cherry-picking
● Commit ids are useless
● So they annotate the log with the origin commit id
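One way to answer "what commits contain the same patch?" is to fingerprint the change itself rather than the commit. The sketch below is in the spirit of git's patch-id command, but it is a simplified assumption rather than git's algorithm: patch_fingerprint and group_by_patch are hypothetical names, and the diffs are made up.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def patch_fingerprint(diff_text: str) -> str:
    """Hash only the added/removed lines of a diff, ignoring offsets and whitespace.

    Toy limitation: a changed line whose own text starts with "--" or "++"
    would be mistaken for a file header and skipped.
    """
    kept = []
    for line in diff_text.splitlines():
        if line.startswith(("+++", "---", "@@", "index ", "diff ")):
            continue  # headers carry paths, offsets and blob ids, not the change
        if line.startswith(("+", "-")):
            kept.append(line[0] + "".join(line[1:].split()))
    return hashlib.sha1("\n".join(kept).encode("utf-8")).hexdigest()


def group_by_patch(diffs: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each patch fingerprint to the commit ids that carry it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for commit, diff_text in diffs.items():
        groups[patch_fingerprint(diff_text)].append(commit)
    return dict(groups)


# c1 and c2 carry the same change at different offsets; c3 is unrelated.
diff_a = "@@ -10,3 +10,3 @@\n context\n-x = 1\n+x = 2\n"
diff_b = "@@ -42,3 +42,3 @@\n context\n-x = 1\n+x  =  2\n"
diff_c = "@@ -1,1 +1,1 @@\n-y = 1\n+y = 3\n"

groups = group_by_patch({"c1": diff_a, "c2": diff_b, "c3": diff_c})
shared = [ids for ids in groups.values() if len(ids) > 1]
print(shared)
```

Grouping by fingerprint links a rebased or cherry-picked commit back to its original even though the commit ids differ.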
Linux Commits Dashboard
● Where is my commit?
– My original commit, has it reached Linus?
● What was merged?
– What commits were merged at once by Linus?
● What commits are related to this one?
– Same patch
● Rebasing
● Cherry picking
– Mentioned in a commit
● This commit fixes bug introduced in X
● This commit reverts commit X
● http://o.cs.uvic.ca:20810/perl/cid.pl?cid=70cb8bb0d365f0bc8b20fa67347caf9598a4674e
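Commits "mentioned in a commit" can be found by scanning log messages for common conventions. This is a sketch that assumes three patterns: the line that git cherry-pick -x appends, the default git revert message, and the "Fixes:" tag used in Linux kernel commits. The function name find_references and the sample messages (with made-up ids) are hypothetical.

```python
import re
from typing import List, Tuple

SHA = r"([0-9a-f]{7,40})"  # abbreviated or full hexadecimal commit id

PATTERNS = {
    "cherry-pick": re.compile(r"\(cherry picked from commit " + SHA + r"\)"),
    "revert": re.compile(r"This reverts commit " + SHA),
    "fixes": re.compile(r"^Fixes: " + SHA, re.MULTILINE),
}


def find_references(message: str) -> List[Tuple[str, str]]:
    """Return (kind, commit id) pairs for every reference found in a log message."""
    refs = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(message):
            refs.append((kind, match.group(1)))
    return refs


# Made-up log messages; the commit ids are placeholders.
revert_msg = (
    'Revert "net: speed up lookup"\n\n'
    "This reverts commit 0123456789abcdef0123456789abcdef01234567.\n"
)
backport_msg = (
    "net: fix lookup crash\n\n"
    'Fixes: 1234567 ("net: speed up lookup")\n'
    "(cherry picked from commit abcdef1234567)\n"
)

print(find_references(revert_msg))
print(find_references(backport_msg))
```

Messages edited by hand may not follow these conventions, so such matching finds a lower bound on the related commits.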
Researcher states:
“40% of pull requests are not merged”
● Based on simply querying ghtorrent data
● But it ignores what really happens
● Many pull requests are merged without being marked as merged in github
● Ghtorrent data has many potential threats to validity
What is github used for?
"I store my presentations in github. I don't need a USB stick anymore!"
Are there potential threats to validity for studies that assume github is only about software engineering?
Methodology
● Data sources:
– Surveys
– Sampling of repositories
● Mixed methods:
– Quantitative, and
– Qualitative
I. A repository is not necessarily a project
II. Most projects have few commits
III. Most projects are inactive
IV. A large proportion of repositories are not for software engineering
V. More than two thirds of projects are personal
VI. Only a fraction of repos use pull requests
VII. If the commits in a pull-request are reworked, github only records the resulting patch
VIII. Most pull-requests appear as non-merged, even though they were merged
IX. Many active projects do not conduct all their software development activity in github
Uses:
Most projects are inactive
Social?
67% of projects are personal repos
95% have 3 or fewer committers
Self contained?
“Any serious project would have to have some separate infrastructure - mailing lists, forums, irc channels and their archives, build farms, etc. [...] Thus while GitHub and all other project hosts are used for collaboration, they are not and can not be a complete solution.”
Others are already using github's information to reach conclusions!
the open source report card
http://osrc.dfm.io/dmgerman/
how are github users collaborating?
How does github support collaboration?
● Methodology:
– Survey
● 240 responses (24% response rate)
– Interviews
● 35 interviews from survey respondents
– 71% professional developers
– 11% managers
– 9% students
– 9% interns
● Approximately 1hr each
Survey: why do you use github?
Code centric collaboration
Themes: focus
● Simple tools
– git branching/merging
– github features seem to be enough for most
● Pull requests and issue tracking
● Focused interaction
– code-centric, focused communication
– asynchronous and unobtrusive
Focus: independence
● Decentralized work:
– git allows them to work independently
– yet they have visibility of what others do
● Low need for management:
– Need for a clear process (the workflow)
– They shy away from rigid management and team structure
– Team managers recognize this
– Managers should be educated on using git/github
Focus: Exposure
● Easy contribution process
– Fork and potentially contribute without pre-authorization
● Peer pressure
– Developers are conscious that their code is readily visible to others
– Adoption of small, frequent contributions
OSS mentality
● At the operational level
– the nature of the work allows independence and self-organization.
– developers are familiar with the idea of working this way and share the mentality behind it.
● developers are self-driven
● share the mentality of
– self-organizing,
– minimizing communication and coordination needs,
– having ownership of code, and
– operating on a meritocratic, expertise-based model
The github ecosystem
The Github Ecosystem
● github is creating an ecosystem of proprietary, cloud enabled applications for software development teams
– Service integration
– JSON API
● Asana, Campfire, Lighthouse, Jira, Travis, Trello, etc, etc.
Conclusions
● git and github are promoting the use of the pull-request workflow
– small, independent contributions
– that can be reviewed before integration
● Effectively, adopting open source code practices into their development
– Independent work
– Code reviews of contributions before they are integrated