Overview - PLG

4
University of Waterloo Evolution, Growth, and Cloning in Linux: A Case Study Michael W. Godfrey Davor Svetinovic Qiang Tu University of Waterloo Overview Ongoing CSER project: Investigating growth and evolution of open source software Linux, vim, gcc, … Lehman’s laws of evolution and Linux Why is Linux still growing so fast? Hyp: cloning is common Case study of Linux SCSI drivers (in progress) How/why does cloning really occur? Parallel evolution? How well do clone detection tools work in spotting “real-world” cloning? What is software evolution? Evolution is what happens while you’re busy making other plans.” Usually, we consider evolution to begin once the first version has been delivered: Maintenance is the planned set of tasks to effect changes. e.g., corrective, perfective, adaptive, preventive Evolution is what actually happens to the software. Lehman’s Laws of software evolution in a nutshell • Observations: (Most) useful software must evolve or die. As a software system gets bigger, its resulting complexity tends to limit its ability to grow. Development progress/effort is (more or less) constant. • Advice: Need to manage complexity. Do periodic redesigns. Treat software and its development process as a feedback system (and not as a passive theorem). Lehman’s examples Growth of Linux Growth of compressed tarfile 0 2,000,000 4,000,000 6,000,000 8,000,000 10,000,000 12,000,000 14,000,000 16,000,000 18,000,000 20,000,000 Jan 1993 Jun 1994 Oct 1995 Mar 1997 Jul 1998 Dec 1999 Apr 2001 Size in bytes Development releases (1.1, 1.3, 2.1, 2.3) Stable releases (1.0, 1.2, 2.0, 2.2) Growth in number of source files (*.[ch]) 0 1000 2000 3000 4000 5000 6000 Jan 1993 Jun 1994 Oct 1995 Mar 1997 Jul 1998 Dec 1999 Apr 2001 # of source code files (*.[ch] ) Development releases (1.1, 1.3, 2.1, 2.3) Stable releases (1.0, 1.2, 2.0, 2.2) Growth in number of functions, variables, macros 0 20,000 40,000 60,000 80,000 100,000 120,000 140,000 Jan 1993 Jun 1994 Oct 1995 Mar 1997 Jul 1998 Dec 1999 Apr 2001 # of global fcns, variables, and macros Development releases (1.1, 1.3, 2.1, 2.3) Stable releases (1.0, 1.2, 2.0, 2.2) Growth in LOC (w. and w/o comments) 0 500,000 1,000,000 1,500,000 2,000,000 2,500,000 Jan 1993 Jun 1994 Oct 1995 Mar 1997 Jul 1998 Dec 1999 Apr 2001 Total LOC Total LOC ("wc -l") -- development releases Total LOC ("wc -l") -- stable releases Total LOC uncommented -- development releases Total LOC uncommented -- stable releases

Transcript of Overview - PLG

University of Waterloo

Evolution, Growth, and Cloning in Linux: A Case Study

Michael W. Godfrey

Davor Svetinovic

Qiang Tu

University of Waterloo

Overview

• Ongoing CSER project: – Investigating growth and evolution of open source

software• Linux, vim, gcc, …

• Lehman’s laws of evolution and Linux– Why is Linux still growing so fast?

• Hyp: cloning is common

• Case study of Linux SCSI drivers (in progress)– How/why does cloning really occur?– Parallel evolution?– How well do clone detection tools work in spotting

“real-world” cloning?

What is software evolution?

“Evolution is what happens while you’re busy

making other plans.”

• Usually, we consider evolution to begin once the first version has been delivered:

– Maintenance is the planned set of tasks to effect changes.• e.g., corrective, perfective, adaptive, preventive

– Evolution is what actually happens to the software.

Lehman’s Laws of software evolution in a nutshell• Observations:

– (Most) useful software must evolve or die.

– As a software system gets bigger, its resulting complexity tends to limit its ability to grow.

– Development progress/effort is (more or less) constant.

• Advice: – Need to manage complexity.

– Do periodic redesigns.

– Treat software and its development process as a feedback system (and not as a passive theorem).

Lehman’s examples Growth of Linux

Growth of compressed tarfile

0

2,000,000

4,000,000

6,000,000

8,000,000

10,000,000

12,000,000

14,000,000

16,000,000

18,000,000

20,000,000

Jan 1993 Jun 1994 Oct 1995 Mar 1997 Jul 1998 Dec 1999 Apr 2001

Siz

e in

byt

es

Development releases (1.1, 1.3, 2.1, 2.3)

Stable releases (1.0, 1.2, 2.0, 2.2)

Growth in number of source files (*.[ch])

0

1000

2000

3000

4000

5000

6000

Jan 1993 Jun 1994 Oct 1995 Mar 1997 Jul 1998 Dec 1999 Apr 2001

# o

f so

urc

e co

de

file

s (*

.[ch

] )

Development releases (1.1, 1.3, 2.1, 2.3)

Stable releases (1.0, 1.2, 2.0, 2.2)

Growth in number of functions, variables, macros

0

20,000

40,000

60,000

80,000

100,000

120,000

140,000

Jan 1993 Jun 1994 Oct 1995 Mar 1997 Jul 1998 Dec 1999 Apr 2001

# o

f g

lob

al f

cns,

var

iab

les,

an

d m

acro

s

Development releases (1.1, 1.3, 2.1, 2.3)

Stable releases (1.0, 1.2, 2.0, 2.2)

Growth in LOC (w. and w/o comments)

0

500,000

1,000,000

1,500,000

2,000,000

2,500,000

Jan 1993 Jun 1994 Oct 1995 Mar 1997 Jul 1998 Dec 1999 Apr 2001

To

tal L

OC

Total LOC ("wc -l") -- development releases

Total LOC ("wc -l") -- stable releases

Total LOC uncommented -- development releases

Total LOC uncommented -- stable releases

Observations and hypotheses

• Growth along devel. path is super-linear

y = .21*x^2 + 252*x + 90,055 r2=.997y = size in LOC x = days since v1.0 r2 is “coefficient of determination” using least squares

[Lehman/Turski’s model: y’ = y + E/y^2 (3Ex)^(1/3)]

– Linux’s strong growth is continuing.– This is stronger growth at MLOC level than observed by

others (Lehman, Gall), even for other OSs.

Linux growth phenomena

Average and median .c file size

0

100

200

300

400

500

600

700

Jan 1993 Jun 1994 Oct 1995 Mar 1997 Jul 1998 Dec 1999 Apr 2001

Un

com

men

ted

LO

C

Average .c file size -- dev. releasesAverage .c file size -- stable releasesMedian .c file size -- dev. releasesMedian .c file size -- stable releases

Average and median .h file size

0

20

40

60

80

100

120

140

Jan 1993 Jun 1994 Oct 1995 Mar 1997 Jul 1998 Dec 1999 Apr 2001

Un

com

men

ted

LO

C

Average .h file size -- dev. releasesAverage .h file size -- stable releasesMedian .h file size -- dev. releasesMedian .h file size -- stable releases

Growth of major subsystems

0

200,000

400,000

600,000

800,000

1,000,000

1,200,000

Jan 1993 Jun 1994 Oct 1995 Mar 1997 Jul 1998 Dec 1999 Apr 2001

To

tal u

nco

mm

ente

d L

OC

drivers

arch

include

net

fs

kernel

mm

ipc

lib

init

SS LOC as percentage of whole system

0.0

10.0

20.0

30.0

40.0

50.0

60.0

70.0

Jan 1993 Jun 1994 Oct 1995 Mar 1997 Jul 1998 Dec 1999 Apr 2001

Per

cen

tag

e o

f to

tal s

yste

m u

nco

mm

ente

d L

OC

driversarchincludenetfskernelmmipclibinit

Linux growth phenomena

SS growth as percentage of whole system (ignoring drivers SS)

0.0

5.0

10.0

15.0

20.0

25.0

30.0

Jan 1993 Jun 1994 Oct 1995 Mar 1997 Jul 1998 Dec 1999 Apr 2001

Per

cen

tag

e o

f to

tal

syst

em u

nco

mm

ente

d L

OC arch

includenetfskernelmmipclibinit

Growth of small core SSs

0

1000

2000

3000

4000

5000

6000

7000

8000

9000

Jan 1993 Jun 1994 Oct 1995 Mar 1997 Jul 1998 Dec 1999 Apr 2001

To

tal u

nco

mm

ente

d L

OC

kernelmmipclibinit

Growth of drivers SS

0

50,000

100,000

150,000

200,000

250,000

300,000

Jan 1993 Jun 1994 Oct 1995 Mar 1997 Jul 1998 Dec 1999 Apr 2001

To

tal

un

com

men

ted

LO

C

drivers/netdrivers/scsidrivers/chardrivers/videodrivers/isdndrivers/sounddrivers/acorndrivers/blockdrivers/cdromdrivers/usbdrivers/"others"

Growth of arch SS

0

5,000

10,000

15,000

20,000

25,000

30,000

35,000

40,000

Jan 1993 Jun 1994 Oct 1995 Mar 1997 Jul 1998 Dec 1999 Apr 2001

To

tal u

nco

mm

ente

d L

OC

arch/ppc/arch/sparc/arch/sparc64/arch/m68k/arch/mips/arch/i386/arch/alpha/arch/arm/arch/sh/arch/s390/

Why has Linux been able to continue its geometric growth?• Core code quality is carefully maintained

• Architecture/problem domain– It’s largely drivers

– Much of the code is “parallel”

– It’s not as big as you might think• Vanilla configuration used only 15% of files

• Development model (OSD) and its sociology– Popularity and visibility has encouraged outsiders (both

hackers and industry) to contribute

– “Clone and hack” is an acceptable development style

Case study: Linux SCSI drivers

• Nice, controlled experiment:– Large body of code, multiple versions, well used system,

open source

– SCSI drivers all do similar tasks

– Source comments shows cloning has occurred!

• Approx. 500 releases of Linux since 1994.

• Kernel v2.3.39: (released Jan 2000)

– 5000 source files, 2.2 MLOC, 10 hardware architectures– drivers/scsi has 212 source files, 166 KLOC,

Goals of case study

• Examine “real world” cloning:– How common is it?– Why is it done?– What do the “cloning patterns” look like?

• Examine parallel evolution:– What kinds of changes are common?– Do developers (need to) change clone relatives too?

• Is there a better design structure lurking?• Compare against clone detection tools

– Are detections tools looking for the right indications of cloning?

SCSI Subsystem - Size (rel. 2.2.16)

• Number of source files: 211

• Number of functions: 2512

• Number of lines: 254,953

• % of comments: 38

• Number of low-level drivers: 80

• File size: – on average ~3000 lines

– large multi-card drivers ~15,000 lines

SCSI Subsystem - Architecture

• Upper Layer – Uniform way of handling devices

– Hard Disk, CD-ROM Disk, Tape, Generic

• Middle Layer– “bridge” between Upper Layer and Low-Level Devices

• Low-Level Device Drivers– low-level driver functionality and management

Clones Expected?

• Why did we expect to find clones:– Every driver must implement uniform interface

– Design of subsystem does not support other forms of reuse

– Driver logic is relatively simple (!)

– Devices from same family ⇒ more cloning

– Completely different hardware ⇒ less or no cloning

– Open source ⇒ anyone can reuse code

– Easier and more efficient to reuse existing code

– Reused code already tested, so probably better quality than if we build it from scratch

Clones - Manual Inspection

• From source code comments, we have found:

esp.[ch]

jazz_esp.[ch]

dec_esp.[ch]

cyberstorm.[ch]

cyberstormII.[ch] mca_53c9x.[ch] blz2060.[ch] fastlane.[ch]

qlogicisp.[ch]

qlogicpti.[ch]

fdomain.[ch]

fd_mcs.[ch]

sd.[ch]

sr.[ch]

t128.[ch]

pas16.[ch]

Types of Changes Detected

• Names of variables

• Initialization parameters and constants

• Driver specific initialization logic removed/added

• Small change in supporting functions

• Small changes in driver management code

• Comments are updated

• Code changed is highly embedded into other code, which makes extraction of that code hard

Automatic Clone Detection

• We have looked for commercial and research clone detection software

• Clone Finder - www. studio501.com– free trial edition (C, C++)

– easy to use

– groups clones and highlights them in the source code

• Clone DR [Baxter] www.semdesigns.com (future)– Cobol trial edition (supports also C, C++, Java)

• Merlo et al. tool (future)

Clone Finder Results

• Number of files scanned: 8• Number of source lines: 4081• Elapsed time in seconds: 0.44• Number of Groupings: 14• Number of Blocks within those groupings: 30• Total number of duplicated lines: 373• Percent of source lines which are duplicated: 9.14

Something missed?

cyberstorm.c

….

static void dma_dump_state(struct NCR_ESP *esp)

{ESPLOG(("esp%d: dma -- cond_reg<%02x>\n",

esp->esp_id, ((struct cyber_dma_registers *)

(esp->dregs))->cond_reg));

ESPLOG(("intreq:<%04x>, intena:<%04x>\n",

custom.intreqr, custom.intenar));}

static void dma_init_read(struct NCR_ESP *esp, __u32 addr, int length)

{

struct cyber_dma_registers *dregs = (struct cyber_dma_registers *) esp->dregs;

cache_clear(addr, length);

addr &= ~(1);

dregs->dma_addr0 = (addr >> 24) & 0xff;

dregs->dma_addr1 = (addr >> 16) & 0xff;

dregs->dma_addr2 = (addr >> 8) & 0xff;

dregs->dma_addr3 = (addr ) & 0xff;ctrl_data &= ~(CYBER_DMA_WRITE);

…….

cyberstormII.c

….

static void dma_dump_state(struct NCR_ESP *esp)

{ESPLOG(("esp%d: dma -- cond_reg<%02x>\n",

esp->esp_id, ((struct cyberII_dma_registers *)

(esp->dregs))->cond_reg));

ESPLOG(("intreq:<%04x>, intena:<%04x>\n",

custom.intreqr, custom.intenar));}

static void dma_init_read(struct NCR_ESP *esp, __u32 addr, int length)

{

struct cyberII_dma_registers *dregs = (struct cyberII_dma_registers *) esp->dregs;

cache_clear(addr, length);

addr &= ~(1);

dregs->dma_addr0 = (addr >> 24) & 0xff;

dregs->dma_addr1 = (addr >> 16) & 0xff;

dregs->dma_addr2 = (addr >> 8) & 0xff;

dregs->dma_addr3 = (addr ) & 0xff;}

……...

How to Solve Cloning “Problem”

• Clone management through development process?– Unlikely in this case, since it’s hard to incorporate into

open source development

• Automatic clone detection and removal?– Not clear that tools are adequate for “real world”

cloning problems

– Software developed and maintained by different parties

– Architecture of the subsystem would be “broken”

Proposed Clone Solution

• Combination of clone control and removal:– Make driver “template” that separates generic code

from driver specific one

– Clearly indicate which parts of driver are to be changed and which not

– “Alarm” other developers when bug discovered in common code

• This allows independent development, preserves architecture, and simplifies design

• Applicable to all “plug-in” based software

Conclusion

• It’s not clear that current clone detection tools “do the right thing”

• Theory developed on clone management, detection, and removal is not universally applicable to all types of applications, languages, and designs– Need more qualitative analysis of “cloning in the real

world”

• Combination of different approaches should give the best results

Ongoing & Future Work

• More detailed qualitative analysis of “cloning in the real world”

• More investigation of relative effectiveness of clone detection tools

• Investigation of “parallel evolution” by maintenance type– bug fixes– new features– restructuring

• Investigate another driver family, see if results are similar e.g., Linux network card drivers