The half-life of code & the ship of Theseus - Sauf · The half-life of code & the ship of Theseus...

Erik Bernhardsson

The half-life of code & the ship of Theseus

2016-12-05

As a project evolves, does the new code just add on top of the old code? Or does it replace the old code slowly over time? In order to understand this, I built a little thing to analyze Git projects, with help from the formidable GitPython project. The idea is to go back to historical points in a project’s history and run git blame at each of them. (Making this somewhat fast was a bit nontrivial, as it turns out, but I’ll spare you the details, which involve some opportunistic caching of files, picking historical points spread out in time, using git diff to invalidate changed files, etc.)
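As a rough illustration of the "historical points spread out in time" step, here is a minimal sketch. It is not the actual Git of Theseus code: the function name pick_sample_points, the (sha, datetime) input format, and the one-week gap are all assumptions made for this example.

```python
from datetime import datetime, timedelta


def pick_sample_points(commits, min_gap=timedelta(weeks=1)):
    """Keep only commits at least `min_gap` after the previously kept one.

    `commits` is an iterable of (sha, datetime) pairs (hypothetical format).
    """
    picked = []
    last_time = None
    # Walk through history from oldest to newest.
    for sha, when in sorted(commits, key=lambda c: c[1]):
        if last_time is None or when - last_time >= min_gap:
            picked.append(sha)
            last_time = when
    return picked


# Made-up commits, for illustration only.
commits = [
    ("a1", datetime(2016, 1, 1)),
    ("b2", datetime(2016, 1, 3)),
    ("c3", datetime(2016, 1, 9)),
    ("d4", datetime(2016, 1, 20)),
]
print(pick_sample_points(commits))
```

Only the picked commits would then get a full git blame, which keeps the number of expensive blame runs roughly proportional to the age of the repo rather than to its commit count.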

In a moment of clarity, I named it “Git of Theseus”, a terrible pun on the ship of Theseus. I’m a dad now, so I can make terrible puns. It refers to a philosophical paradox, where the pieces of a ship are replaced over hundreds of years. If all pieces are replaced, is it still the same ship?

The ship wherein Theseus and the youth of Athens returned from Crete had thirty oars, and was preserved by the Athenians down even to the time of Demetrius Phalereus, for they took away the old planks as they decayed, putting in new and stronger timber in their places, insomuch that this ship became a standing example among the philosophers, for the logical question of things that grow; one side holding that the ship remained the same, and the other contending that it was not the same.

It turns out that code doesn’t exactly evolve the way I expected. There is a “ship of Theseus” effect, but there’s also a compounding effect where code bases keep growing over time (maybe I should call it the “Second Avenue Subway” effect, after the construction project in NYC that’s been going on since 1919).

Let’s start by analyzing Git itself. Git became self-hosting early on, and it’s one of the most popular and oldest Git projects:

This plots the aggregate number of lines of code over time, broken down into cohorts by the year added. I would have expected more of a decay here, and I’m surprised to see that so much code written back in 2006 is still alive in the code base. Interesting!
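To make the cohort idea concrete, here is a small sketch that counts surviving lines per year from the output of git blame --line-porcelain, which repeats an author-time header (a Unix timestamp) for every line of the file. The function name lines_per_cohort and the abbreviated sample text are assumptions for illustration; this is not necessarily how Git of Theseus extracts the data.

```python
from collections import Counter
from datetime import datetime, timezone


def lines_per_cohort(porcelain_text):
    """Count blamed lines by the year of the commit that introduced them."""
    counts = Counter()
    for line in porcelain_text.splitlines():
        # Header lines never start with a tab; file content lines always do.
        if line.startswith("author-time "):
            timestamp = int(line.split()[1])
            year = datetime.fromtimestamp(timestamp, tz=timezone.utc).year
            counts[year] += 1
    return counts


# Abbreviated, made-up sample of `git blame --line-porcelain` output.
sample = "\n".join([
    "abc123 1 1 2", "author-time 1150000000", "\tint main() {",
    "abc123 2 2", "author-time 1150000000", "\treturn 0;",
    "def456 3 3 1", "author-time 1480000000", "\t}",
])
print(lines_per_cohort(sample))
```

Summing these counters over all files at one historical point gives one vertical slice of a stacked cohort plot.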

We can compute the decay for individual commits too. If we align all commits at x=0, we can look at the aggregate decay for code in a certain repo. This analysis is somewhat harder to implement than it sounds, mostly because newer commits have had less time, so the right end of the curve represents an aggregate of fewer commits.
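One way to handle the "newer commits have had less time" problem is to compute each point of the curve only over the commits that have been observed for at least that long. The sketch below assumes a hypothetical input format: one list per commit, where entry t is the number of that commit's lines still alive t years later and entry 0 is the number of lines it added. It illustrates the idea and is not the author's implementation.

```python
def survival_curve(commits):
    """Aggregate fraction of lines surviving at each age (in years).

    `commits` is a list of lists; commits[i][t] is the number of lines from
    commit i still present t years after it was made (hypothetical format).
    """
    curve = []
    max_age = max(len(c) for c in commits) - 1
    for age in range(max_age + 1):
        # Only commits old enough to have been observed at this age count.
        eligible = [c for c in commits if len(c) > age]
        alive = sum(c[age] for c in eligible)
        added = sum(c[0] for c in eligible)
        curve.append(alive / added)
    return curve


# Made-up data: an older commit observed for 2 years, a newer one for 1 year.
commits = [[100, 80, 60], [50, 45]]
print(survival_curve(commits))
```

Note that the last point of the curve here rests on a single commit, which is exactly why the right end of such a plot is noisier than the left.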

For Git, this plot looks like this:

Even after 10 years, 40% of the lines of code are still present! Let’s look at a broader range of (somewhat randomly selected) open source projects:

It looks like Git is somewhat of an outlier here. Fitting an exponential decay to Git and solving for the half-life gives approximately 6 years.

Hmm… not convinced this is necessarily a perfect fit, but as the famous quote goes: All models are wrong, some models are useful. I like the explanatory power of an exponential decay: code has an expected lifetime and a constant risk of being replaced.
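For readers who want to try this on their own survival curve, here is a minimal sketch of one way to fit an exponential decay and solve for the half-life. It assumes the model fraction(t) = exp(-k * t) and fits k by least squares on the logarithm of the data, forced through the origin. That fitting method is an assumption of this example; the post does not say which one was actually used.

```python
import math


def fit_half_life(ages, fractions):
    """Fit fraction = exp(-k * age) and return the half-life ln(2) / k.

    Uses a least-squares fit of ln(fraction) = -k * age through the origin.
    """
    pairs = [(t, f) for t, f in zip(ages, fractions) if f > 0]
    numerator = sum(t * math.log(f) for t, f in pairs)
    denominator = sum(t * t for t, f in pairs)
    k = -numerator / denominator
    return math.log(2) / k


# Synthetic data generated with a half-life of exactly 6 years.
ages = list(range(11))
fractions = [0.5 ** (t / 6) for t in ages]
print(fit_half_life(ages, fractions))
```

Under this model the fraction of code left after t years is 0.5 ** (t / half_life), which is where the "constant risk of being replaced" interpretation comes from.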

I suspect a slightly better model would be to fit a sum of exponentials. This would work for a repo with some code that changes fast and some code that changes slowly. But before going down a rabbit hole of curve fitting, I reminded myself of von Neumann’s quote: With four parameters I can fit an elephant, and with five I can make him wiggle his trunk. There’s probably some way to make it work, but I’ll revisit it some other time.
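To show what a sum of exponentials would look like, here is a tiny sketch of the model itself (not a fitting procedure). The weights and half-lives are invented for illustration and do not come from any of the repos analyzed here.

```python
def surviving_fraction(age_years, components):
    """Sum-of-exponentials survival model.

    `components` is a list of (weight, half_life_years) pairs whose weights
    add up to 1; each pair describes one class of code.
    """
    return sum(w * 0.5 ** (age_years / hl) for w, hl in components)


# Hypothetical repo: half the code churns fast, half is very stable.
mix = [(0.5, 1.0), (0.5, 8.0)]
for age in (0, 1, 4, 8):
    print(age, round(surviving_fraction(age, mix), 3))
```

With such a mixture the curve drops quickly at first and then flattens out, so a single half-life no longer describes it. Each extra component also adds two parameters, which is the elephant-fitting risk mentioned above.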

Let’s look at a lot of projects in aggregate (also sampled somewhat arbitrarily):

In aggregate, the half-life is roughly 3.33 years. I like that, it’s an easy number to remember. But the spread between different projects is big. The aggregate model doesn’t necessarily have super strong predictive power: it’s hard to point to an arbitrary open source project and expect half of it to be gone 3.33 years later.

Moar repos

Apache (aka HTTPD) is another repo that goes way back:

Rails:

Beautiful exponential fit!

Node:

Wanna run it for your own repo? Again, code is available at https://github.com/erikbern/git-of-theseus.

The monster repo of them all

Note that most of these repos took at most a few minutes to analyze, using my script. As a final test I decided to run it over the Linux kernel, which is HUGE: 635,229 commits as of today. This is 16 times larger than the second biggest repo I looked at (rails) and took multiple days to analyze on my shitty computer. To make it faster I ended up computing the full git blame only for commits spread out at least 3 weeks apart and also limited it to .c files:

The squiggly lines are probably from the sampling mechanism. But look at this beauty: a whopping 16M lines! The code contribution from each year’s cohort is extremely smooth at this scale. Individual commits have absolutely no meaning at this scale; the cumulative sum of them is very predictable. It’s like going from Newton’s laws to thermodynamics.

Linux also clearly exhibits more of a linear growth pattern. I’m speculating that this has to do with its high modularity. The drivers directory has by far the largest number of files (22,091), followed by arch (17,967), which contains support for various architectures. This is exactly the kind of thing you would expect to scale very well with complexity, since they have a well-defined interface.

Somewhat off topic, but I like the notion of how well a project scales with complexity. Linear scalability is the ultimate goal, where each marginal feature takes roughly the same amount of code. Bad projects scale superlinearly, and every marginal feature takes more and more code.

It’s interesting to go back and contrast Linux to something like Angular, which basically exhibits the opposite behavior:

The half-life of a randomly selected line in Angular is about 0.32 years. Does this reflect on Angular? Is the architecture basically not as “linear” and consistent? You might say the comparison is unfair, because Angular is new. That’s a fair point. But I wouldn’t be surprised if it does reflect on some questionable design. Don’t mean to be shitting on Angular here, but it’s an interesting contrast.

Half-life by repository

A somewhat arbitrary sample of projects and their half-lives:

project        half-life (years)   first commit
angular        0.32                2014
bluebird       0.56                2013
kubernetes     0.59                2014
keras          0.69                2015
tensorflow     1.08                2015
express        1.23                2009
scikit-learn   1.29                2011
luigi          1.30                2012
backbone       1.48                2010
ansible        1.52                2012
react          1.66                2013
node           1.76                2009
underscore     1.97                2009
requests       2.10                2011
rails          2.43                2004
django         3.38                2005
theano         3.71                2008
numpy          4.15                2006
moment         4.54                2015
scipy          4.62                2007
tornado        4.80                2009
redis          5.20                2010
flask          5.22                2010
httpd          5.38                1999
git            6.04                2005
chef           6.18                2008
linux          6.60                2005

It’s interesting that moment has such a high half-life, but the reason is that so much of the code is locale-specific. This creates a more linear scalability, with a stable core of code and linear additions over time. express is an outlier in the other direction. It’s 7 years old but its code changes extremely quickly. I’m guessing this is partly because of (a) a lack of linear scalability in the code and (b) it being probably one of the first major JavaScript open source projects to hit mainstream popularity, surfing on the Node.js wave. Possibly the code base also sucks, but I have no idea.

Has coding changed?

I can think of three reasons why there’s such a strong relationship between the year the project was initiated and the half-life:

1. Code churns more early on in projects, and becomes more stable a while in

2. Coding has changed from 2006 to 2016, and modern projects evolve faster

3. There’s some kind of selection bias where the only projects that survive are the scalable, stable ones

Interestingly, I don’t find any clear evidence of #1 in the data. The half-life of code written early in old projects is as high as that of later code. I’m skeptical about #3 as well because I don’t see why there would be a relation between survival and code structure (but maybe there is). My conclusion is that writing code has fundamentally changed in the last 10 years. Code really seems to change at a much faster rate in modern projects.
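The relationship between first-commit year and half-life can be checked directly against the table of half-lives by repository. The sketch below computes a plain Pearson correlation over those rows; choosing Pearson (rather than, say, a rank correlation) is an assumption of this example, and with only 27 somewhat arbitrarily chosen projects the result should be read as a rough indication at best.

```python
import math

# (project, half-life in years, year of first commit), copied from the table.
projects = [
    ("angular", 0.32, 2014), ("bluebird", 0.56, 2013), ("kubernetes", 0.59, 2014),
    ("keras", 0.69, 2015), ("tensorflow", 1.08, 2015), ("express", 1.23, 2009),
    ("scikit-learn", 1.29, 2011), ("luigi", 1.30, 2012), ("backbone", 1.48, 2010),
    ("ansible", 1.52, 2012), ("react", 1.66, 2013), ("node", 1.76, 2009),
    ("underscore", 1.97, 2009), ("requests", 2.10, 2011), ("rails", 2.43, 2004),
    ("django", 3.38, 2005), ("theano", 3.71, 2008), ("numpy", 4.15, 2006),
    ("moment", 4.54, 2015), ("scipy", 4.62, 2007), ("tornado", 4.80, 2009),
    ("redis", 5.20, 2010), ("flask", 5.22, 2010), ("httpd", 5.38, 1999),
    ("git", 6.04, 2005), ("chef", 6.18, 2008), ("linux", 6.60, 2005),
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


years = [year for _, _, year in projects]
half_lives = [half_life for _, half_life, _ in projects]
r = pearson(years, half_lives)
print(f"correlation between first-commit year and half-life: {r:.2f}")
```

A negative value means that projects started later tend to have shorter half-lives, which is the pattern the three explanations above try to account for. It says nothing about which of them is right.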

By the way, see discussion on Hacker News and on Reddit!


https://erikbern.com/

https://github.com/erikbern/git-of-theseus

https://gitpython.readthedocs.io/en/stable/

https://en.wikipedia.org/wiki/Second_Avenue_Subway#Initial_attempts

https://en.wikipedia.org/wiki/Self-hosting

https://github.com/apache/httpd

https://github.com/rails/rails

https://github.com/nodejs/node

https://github.com/erikbern/git-of-theseus

https://github.com/torvalds/linux

https://github.com/rails/rails

https://en.wikipedia.org/wiki/Kinetic_theory_of_gases

https://github.com/moment/moment

https://github.com/expressjs/express

https://news.ycombinator.com/item?id=13112449

https://www.reddit.com/r/programming/comments/5gqurc/the_halflife_of_code_the_ship_of_theseus/

