IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 40, NO. 7, JULY 2014

An Empirical Study of Refactoring Challenges and Benefits at Microsoft

Miryung Kim, Member, IEEE, Thomas Zimmermann, Member, IEEE, and Nachiappan Nagappan, Member, IEEE

Abstract—It is widely believed that refactoring improves software quality and developer productivity. However, few empirical studies quantitatively assess refactoring benefits or investigate developers’ perception towards these benefits. This paper presents a field study of refactoring benefits and challenges at Microsoft through three complementary study methods: a survey, semi-structured interviews with professional software engineers, and quantitative analysis of version history data. Our survey finds that the refactoring definition in practice is not confined to a rigorous definition of semantics-preserving code transformations and that developers perceive that refactoring involves substantial cost and risks. We also report on interviews with a designated refactoring team that has led a multi-year, centralized effort on refactoring Windows. The quantitative analysis of Windows 7 version history finds the top 5 percent of preferentially refactored modules experience higher reduction in the number of inter-module dependencies and several complexity measures but increase size more than the bottom 95 percent. This indicates that measuring the impact of refactoring requires multi-dimensional assessment.

Index Terms—Refactoring, empirical study, software evolution, component dependencies, defects, churn


1 INTRODUCTION

It is widely believed that refactoring improves software quality and developer productivity by making it easier to maintain and understand software systems [1]. Many believe that a lack of refactoring incurs technical debt to be repaid in the form of increased maintenance cost [2]. For example, eXtreme programming claims that refactoring saves development cost and advocates the rule of refactor mercilessly throughout the entire project life cycle [3]. On the other hand, there exists a conventional wisdom that software engineers often avoid refactoring when they are constrained by a lack of resources (e.g., right before major software releases). Some also believe that refactoring does not provide immediate benefit, unlike new features or bug fixes [4].

Recent empirical studies show contradicting evidence on the benefit of refactoring as well. Ratzinger et al. [5] found that, if the number of refactorings increases in the preceding time period, the number of defects decreases. On the other hand, Weißgerber and Diehl found that a high ratio of refactoring is often followed by an increasing ratio of bug reports [6], [7] and that incomplete or incorrect refactorings cause bugs [8]. We also found similar evidence that there exists a strong correlation between the location and timing of API-level refactorings and bug fixes [9].

These contradicting findings motivated us to conduct a field study of refactoring definition, benefits, and challenges in a large software development organization and investigate whether there is a visible benefit of refactoring a large system. In this paper, we address the following research questions: (1) What is the definition of refactoring from developers’ perspectives? By refactoring, do developers indeed mean behavior-preserving code transformations that modify a program structure [1], [10]? (2) What is the developers’ perception about refactoring benefits and risks, and in which contexts do developers refactor code? (3) Are there visible refactoring benefits such as reduction in the number of bugs, reduction in the average size of code changes after refactoring, and reduction in the number of component dependencies?

To answer these questions, we conducted a survey with 328 professional software engineers whose check-in comments included the keyword “refactor*”. From our survey participants, we also came to know about a multi-year refactoring effort on Windows. Because Windows is one of the largest, long-surviving software systems within Microsoft and a designated team led an intentional effort of system-wide refactoring, we interviewed the refactoring team of Windows. Using the version history, we then assessed the impact of refactoring on various software metrics such as defects, inter-module dependencies, size and locality of code changes, complexity, test coverage, and people and organization related metrics.

• M. Kim is with the Department of Electrical and Computer Engineering at the University of Texas at Austin.

• T. Zimmermann and N. Nagappan are with Microsoft Research at Redmond.

Manuscript received 25 Mar. 2013; revised 3 Jan. 2014; accepted 16 Mar. 2014. Date of publication 17 Apr. 2014; date of current version 18 July 2014. Recommended for acceptance by W.F. Tichy. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference the Digital Object Identifier below. Digital Object Identifier no. 10.1109/TSE.2014.2318734

0098-5589 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

To distinguish the impact of refactoring versus regular changes, we define the degree of preferential refactoring: applying refactorings more frequently to a module, relative to the frequency of regular changes. For example, if a module is ranked fifth in terms of regular commits but third in terms of refactoring commits, the rank difference is 2. This positive number indicates that refactoring is preferentially applied to the module relative to regular commits. We use the rank difference measure specified in Section 4.4 instead of the proportion of refactoring commits out of all commits per module, because the preferential refactoring measure is less sensitive to the total number of commits made in each module. We then investigate the relationship between preferential refactoring and software metrics. In terms of a threshold, we contrast the top 5 percent of preferentially refactored modules against the bottom 95 percent, because the top 5 percent of modules cover most refactoring commits (over 90 percent). We use both bivariate correlation analysis and multivariate regression analysis to investigate how much different development factors may impact the decision to apply refactoring and how these factors contribute to reduction of inter-module dependencies and defects [11].
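The rank-difference idea can be illustrated with a minimal sketch. The code below is not the measure as specified in Section 4.4; it only reproduces the worked example in the text (ranked fifth by regular commits, third by refactoring commits, difference 2). The function names, the tie-breaking by module name, and the rounding used for the 5 percent cutoff are all assumptions made for illustration.

```python
from typing import Dict, List


def rank_by_count(counts: Dict[str, int]) -> Dict[str, int]:
    """Rank modules by commit count; rank 1 has the most commits.

    Ties are broken by module name (an assumption, chosen only to make
    the result deterministic).
    """
    ordered = sorted(counts, key=lambda module: (-counts[module], module))
    return {module: position + 1 for position, module in enumerate(ordered)}


def rank_difference(regular: Dict[str, int],
                    refactoring: Dict[str, int]) -> Dict[str, int]:
    """Rank in regular commits minus rank in refactoring commits.

    A positive value means the module ranks higher for refactoring
    commits than for regular commits.
    """
    # Modules with no refactoring commits are counted as zero.
    refactoring_counts = {m: refactoring.get(m, 0) for m in regular}
    regular_rank = rank_by_count(regular)
    refactoring_rank = rank_by_count(refactoring_counts)
    return {m: regular_rank[m] - refactoring_rank[m] for m in regular}


def top_fraction(scores: Dict[str, int], fraction: float = 0.05) -> List[str]:
    """Modules with the highest scores; at least one module is returned."""
    cutoff = max(1, round(len(scores) * fraction))
    ordered = sorted(scores, key=lambda module: (-scores[module], module))
    return ordered[:cutoff]


if __name__ == "__main__":
    # Hypothetical commit counts for five hypothetical modules.
    regular = {"a.dll": 50, "b.dll": 40, "c.dll": 30, "d.dll": 20, "e.dll": 10}
    refactoring = {"a.dll": 9, "b.dll": 8, "c.dll": 2, "d.dll": 1, "e.dll": 7}
    differences = rank_difference(regular, refactoring)
    print(differences)
    print(top_fraction(differences))
```

Because the measure compares ranks rather than raw proportions, a module with very few commits overall does not dominate simply because most of its handful of commits happen to be refactorings.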

Our field study makes the following contributions:

• The refactoring definition in practice seems to differ from a rigorous academic definition of behavior-preserving program transformations. Our survey participants perceived that refactoring involves substantial cost and risks, and they needed various types of tool support beyond automated refactoring within IDEs.

• The interviews with a designated Windows refactoring team provide insights into how system-wide refactoring was carried out in a large organization. The team led a centralized refactoring effort by conducting an analysis of a de-facto dependency structure and by developing custom refactoring support tools and processes.

• To collect refactoring data from version histories, we explore two separate methods. First, we isolate refactoring commits based on branches relevant to refactoring tasks. Second, we analyze check-in comments from version histories based on the refactoring related keywords identified from our survey.

• The top 5 percent of preferentially refactored modules decrease the number of dependencies by a factor of 0.85, while the rest increase it by a factor of 1.10 compared to the average number of dependency changes per module.

• The top 5 percent of preferentially refactored modules decrease post-release defects by 7 percent less than the rest, indicating that defect reduction cannot be attributed to the refactoring changes alone. It is more likely that the defect reduction in Windows 7 is enabled by both refactoring and non-refactoring changes in the modified modules.

• We also collect data and conduct statistical analysis to measure the impact of refactoring on the size and locality of code changes, test coverage metrics, and people and organization related metrics, etc. The study results indicate that the refactoring effort was preferentially focused on the modules that, in Windows Vista, had relatively low churn measures, higher test block coverage, and relatively few developers working on them. The top 5 percent of preferentially refactored modules experience a greater rate of reduction in certain complexity measures, but increase LOC and crosscutting changes more than other modules (see footnote 1).

While there are many anecdotes about the benefit of refactoring, few empirical studies quantitatively assess refactoring benefit. To the best of our knowledge, our study is the first to quantitatively assess the impact of multi-year, system-wide refactoring on various software metrics in a large organization. Consistent with our interview study, refactoring was preferentially applied to modules with a large number of dependencies and preferential refactoring is correlated with reduction in the number of dependencies. Preferentially refactored modules have higher test adequacy and they experience defect reduction; however, this defect reduction cannot be attributed to the role of refactoring changes alone according to our regression analysis. Preferentially refactored modules experience higher reduction in several complexity measures but increase size more than the bottom 95 percent. This indicates that measuring the impact of refactoring requires multi-dimensional assessment.

Based on our study, we propose future research directions on refactoring: we need to provide various types of tool support beyond automated refactorings in IDEs, such as refactoring-aware code reviews, refactoring cost and benefit estimation, and automated validation of program correctness after refactoring edits. As the benefit of refactoring is multi-dimensional and not consistent across various metrics, we believe that managers and developers can benefit from automated tool support for assessing the impact of refactoring on various software metrics.

2 A SURVEY OF REFACTORING PRACTICES

In order to understand refactoring practices at Microsoft, we sent a survey to 1,290 engineers whose change comments included the keyword “refactor*” in the last two years in five Microsoft products: Windows Phone, Exchange, Windows, office communication and services (OCS), and Office. We purposely targeted the engineers who are already familiar with the terms refactor, refactoring, refactored, etc., because our goal is to understand their own refactoring definition and their perception about the value of refactoring. The survey consisted of 22 multiple choice and free-form questions, which were designed to understand the participant’s own refactoring definition, when and how they refactor code, including refactoring tool usage, and developers’ perception toward the benefits, risks, and challenges of refactoring.

Table 1 shows a summary of the survey questions. The full list is available as a technical report [12]. We analyzed the survey responses by identifying the topics and keywords and by tagging individual responses with the identified topics. The first author did a two-pass analysis by first identifying emerging categories and next by tagging individual answers using the categories. Suppose that a developer answered the following to the question, “Based on your own experience, what are the risks involved in refactoring?” We then tagged the answer with the categories regression bugs and build breaks, merge conflicts, and time taken from other tasks, which emerged from the participant’s answer.

“Depending on the scope of the refactoring, it can be easy to unintentionally introduce subtle bugs if you aren’t careful, especially if you are deliberately changing the behavior of the code at the same time. Other risks include making it difficult to merge changes from others (especially troublesome because larger refactoring typically takes a significant amount of time during which others are likely to make changes to the same code), and making it difficult for others to merge with you (effectively spreading out the cost of the merge to everyone else who made changes to the same code).”

1. A module unit in Windows is a .dll or .exe component and its definition appears in Section 4.

In total, 328 engineers participated in the survey. Eighty-three percent of them were developers, 16 percent were test engineers, 0.9 percent were build engineers, and 0.3 percent were program managers. The participants had 6.35 years of experience at Microsoft and 9.74 years of experience in the software industry on average, with a familiarity with C++, C, and C#.

2.1 What Is a Refactoring Definition in Practice?

When we asked, “how do you define refactoring?”, we found that developers do not necessarily consider that refactoring is confined to behavior preserving transformations [10]. Seventy-eight percent define refactoring as code transformation that improves some aspects of program behavior such as readability, maintainability, or performance. Forty-six percent of developers did not mention preservation of behavior, semantics, or functionality in their refactoring definition at all. This observation is consistent with Johnson’s argument [13] that, while refactoring preserves some behavior, it does not preserve behavior in all aspects. The following shows a few examples of refactoring definitions by developers (see footnote 2).

“Rewriting code to make it better in some way.”

“Changing code to make it easier to maintain. Strictly speaking, refactoring means that behavior does not change, but realistically speaking, it usually is done while adding features or fixing bugs.”

TABLE 1. Summary of Survey Questions (The Full List Is Available as a Technical Report [12])

2. In the following, each italicized, indented paragraph corresponds to a quote from answers to our survey (Section 2) or interviews (Section 3).

When we asked, “how does the abstraction level of Martin Fowler’s refactorings or refactoring types supported by Visual Studio match the kinds of refactoring that you perform?”, 71 percent said these basic refactorings are often a part of a larger, higher-level effort to improve existing software. Forty-six percent of developers agree that refactorings supported by automated tools differ from the kind of refactorings they perform manually. In particular, one developer said that the refactorings listed in Table 1 form the minimum granular unit of any refactoring effort, but none are worthy of being called refactoring in and of themselves. The refactorings she performs are larger efforts aimed at interfaces and contracts to reduce software complexity, which may utilize any of the listed low-level refactoring types, but have a larger idea behind them. As another example, a participant said,

“These (Fowler’s refactoring types or refactoring types supported by Visual Studio) are the small code transformation tasks often performed, but they are unlikely to be performed alone. There’s usually a bigger architectural change behind them.”

These remarks indicate that the scope and types of code transformations supported by refactoring engines are often too low-level and do not directly match the kinds of refactoring that developers want to make.

2.2 What Are the Challenges Associated with Refactoring?

When we asked developers, “what are the challenges associated with doing refactorings at Microsoft?”, 28 percent of developers pointed out inherent challenges such as working on large code bases, a large number of inter-component dependencies, the need for coordination with other developers and teams, and the difficulty of ensuring program correctness after refactoring. Twenty-nine percent of developers also mentioned a lack of tool support for refactoring change integration, code review tools targeting refactoring edits, and sophisticated refactoring engines in which a user can easily define new refactoring types. The difficulty of merging and integration after refactoring often discourages people from doing refactoring [14]. The version control systems that they use are sensitive to rename and move refactorings, which makes it hard for developers to understand code change history after refactorings. The following quotes describe the challenges of refactoring change integration and code reviews after refactoring:

“Cross-branch integration was the biggest problem [15]. We have this sort of problem every time we fix any bug or refactor anything, although in this case it was particularly painful because refactoring moved files, which prevented cross-branch integration patches from being applicable.”

“It (refactoring) typically increases the number of lines/files involved in a check-in. That burdens code reviewers and increases the odds that your change will collide with someone else’s change.”

Many participants also mentioned that, when a regression test suite is inadequate, there is no safety net for checking the correctness of refactoring. This often prevents developers from initiating refactoring effort.

“If there are extensive unit tests, then (it’s) great, (one) would need to refactor the unit tests and run them, and do some sanity testing on scenarios as well. If there are no tests, then (one) need to go from known scenarios and make sure they all work. If there is insufficient documentation for scenarios, refactoring should not be done.”

In addition to these inherent and technical challenges of refactoring reported by the participants, maintaining backward compatibility often discourages them from initiating refactoring effort.

According to self-reported data, developers do most refactoring manually and they do not use refactoring tools despite their awareness of refactoring types supported by the tools. When we asked, “what percentage of your refactoring is done manually as opposed to using automated refactoring tools?”, developers said they do 86 percent of refactoring manually on average. Surprisingly, 51 percent of developers do 100 percent of their refactoring manually. Fig. 1 shows the percentages of developers who usually apply individual refactoring types manually despite the awareness and availability of automated refactoring tool support. Considering that 55 percent of these developers reported that they have automated refactoring engines available in their development environments, this lack of usage of automated refactoring engines is very surprising. With the exception of rename refactoring, more than half of the participants said that they apply those refactorings manually, despite their awareness of the refactoring types and availability of automated tool support. This result is aligned with Vakilian et al. [16]. Our survey responses indicate that the investment in tool support for refactoring must go beyond automated code transformation, for example, tool support for change integration, code reviews after refactoring, validation of program correctness, estimation of refactoring cost and benefit, etc.

“I’d love a tool that could estimate the benefits of refactoring. Also, it’d be awesome to have better tools to help figure out who knows a lot about the existing code to have somebody to talk to and how it has evolved to understand why the code was written the way it was, which helps avoid the same mistakes.”

“I hope this research leads to improved code understanding tools. I don’t feel a great need for automated refactoring tools, but I would like code understanding and visualization tools to help me make sure that my manual refactorings are valid.”

“What we need is a better validation tool that checks correctness of refactoring, not a better refactoring tool.”

2.3 What Are the Risks and Benefits of Refactoring?

When we asked developers, “based on your experience, what are the risks involved in refactorings?”, they reported regression bugs, code churn, merge conflicts, time taken from other tasks, the difficulty of doing code reviews after refactoring, and the risk of over-engineering. Fig. 2 summarizes the percentage of developers who mentioned each particular risk factor. Note that the total sum is over 100 percent as one developer could mention more than one risk factor.

Fig. 1. The percentage of survey participants who know individual refactoring types but do those refactorings manually.

Seventy-six percent of the participants consider that refactoring comes with a risk of introducing subtle bugs and functionality regression; 11 percent say that code merging is hard after refactoring; and 24 percent mention increased testing cost.

“The primary risk is regression, mostly from misunderstanding subtle corner cases in the original code and not accounting for them in the refactored code.”

“Over-engineering—you may create an unnecessary architecture that is not needed by any feature but all code chunks have to adapt to it.”

“The value of refactoring is difficult to measure. How do you measure the value of a bug that never existed, or the time saved on a later undetermined feature? How does this value bubble up to management? Because there’s no way to place immediate value on the practice of refactoring, it makes it difficult to justify to management.”

When we asked, “what benefits have you observed from refactoring?”, developers reported improved maintainability, improved readability, fewer bugs, improved performance, reduction of code size, reduction of duplicate code, improved testability, improved extensibility and easier addition of new features, improved modularity, reduced time to market, etc., as shown in Fig. 3.

When we asked, “in which situations do you perform refactorings?” developers reported the symptoms of code that help them decide on refactoring (see Fig. 4). Twenty-two percent mentioned poor readability; 11 percent mentioned poor maintainability; 11 percent mentioned the difficulty of repurposing existing code for different scenarios and anticipated features; 9 percent mentioned the difficulty of testing code without refactoring; 13 percent mentioned code duplication; 8 percent mentioned slow performance; 5 percent mentioned dependencies on other teams’ modules; and 9 percent mentioned old legacy code that they need to work on. Forty-six percent of developers said they do refactoring in the context of bug fixes and feature additions, and 57 percent of the responses indicate that refactoring is driven by immediate, concrete, visible needs of changes that they must implement in the short term, rather than potentially uncertain benefits of long-term maintainability. In addition, more than 95 percent of developers do refactoring across all milestones and not only in MQ milestones (a period designated to fix bugs and clean up code without the responsibility to add new features). This indicates the pervasiveness of refactoring effort. According to self-reported data, developers spend about 13 hours per month working on refactoring, which is close to 10 percent of their work, assuming developers work about 160 hours per month.

3 INTERVIEWS WITH THE WINDOWS REFACTORING TEAM

In order to examine how the survey respondents’ perception matches reality in terms of refactoring and to investigate whether there are visible benefits of refactoring, we decided to conduct follow-up interviews with a subset of the survey participants and to analyze the version history data. In terms of a subject program, we decided to focus on Windows, because it is the largest, long-surviving software system within Microsoft and because we learned from our survey that a designated refactoring team has led an intentional, system-wide refactoring effort for many years.

We conducted one-on-one interviews with six key members of this team. The interviews were audio-recorded and transcribed later for analysis. The first author of this paper led all interviews and spent two weeks categorizing the interview transcripts in terms of refactoring motivation, intended benefits, process, and tool support. She then discussed the findings with the sixth subject (a researcher who is familiar with the Windows refactoring project and collaborated with the refactoring team) to check her interpretation. The following list describes the roles of the interview participants and the duration of each interview.

• Architect (90 minutes).

• Architect / Development Manager (30 minutes).

• Development Team Lead (75 minutes).

• Development Team Lead (85 minutes).

• Developer (75 minutes).

• Researcher (60 minutes).

Fig. 2. The risk factors associated with refactoring.

Fig. 3. Various types of refactoring benefits that developers experienced.

Fig. 4. The symptoms of code that help developers initiate refactoring.


The interview study results are organized by the questions raised during the interviews.

“What motivated your team to lead this refactoring effort?” The refactoring effort was initiated by a few architects who recognized that a large number of dependencies at the module level could be reduced and optimized to make modular reasoning of the system more efficient, to maximize parallel development efficiency, to avoid unwanted parallel change interference, and to selectively rebuild and retest subsystems effectively.

“If X percent of the modules are at a strongly connected component and you touch one of those things and you have to retest X percent of the modules again. . .”

“How did you carry out system-wide refactorings on a very large system?” The refactoring team analyzed the de-facto module level dependency structure before making refactoring decisions. After the initial analysis of module level dependencies, the team came up with a layered architecture, where individual modules were assigned layer numbers, so that the partial ordering dependency relationships among modules could be documented and enforced. To help with the analysis of the de-facto dependency structure, the team used a new tool called MaX [17]. MaX not only computes module level dependencies but can also distinguish benign dependency cycles within a layer from undesirable dependencies that go from low-level layers to the layers above. The refactoring team consulted other teams about how to decompose existing functionality into a set of logical sub-groupings (layers).

“Our goal was actually (A) to understand the system, and to develop a layered model of the system; and (B) to protect the model programmatically and automatically. So by developing a mathematical model of the entire system that is based on layer numbers and associating modules with a layer number, we could enforce a partial ordering—that’s what we call it, the layer map.”

The team introduced quality gate checks, which prevented developers from committing code changes that violate the layer architecture constraints to the version control system. The refactoring team then refactored the existing system by splitting existing modules into sub-component modules or by replacing existing modules with new modules.
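The layer map and the quality gate described above can be sketched as a simple check over a dependency list. This is an illustration only and not the MaX tool or the actual Windows gate: the function names, the module names, and the convention that a larger number means a higher layer are assumptions. Following the text, dependencies within a layer (including cycles) are treated as benign, while a dependency from a lower layer to a higher one is flagged.

```python
from typing import Dict, Iterable, List, Tuple

Dependency = Tuple[str, str]  # (module, module it depends on)


def find_layer_violations(layer_map: Dict[str, int],
                          dependencies: Iterable[Dependency]) -> List[Dependency]:
    """Return dependencies that point from a lower layer to a higher layer.

    Assumes larger layer numbers denote higher layers. Same-layer
    dependencies are allowed, so cycles inside one layer pass the check.
    """
    violations = []
    for source, target in dependencies:
        if layer_map[source] < layer_map[target]:
            violations.append((source, target))
    return violations


def passes_quality_gate(layer_map: Dict[str, int],
                        dependencies: Iterable[Dependency]) -> bool:
    """A change is accepted only if it introduces no upward dependency."""
    return not find_layer_violations(layer_map, dependencies)


if __name__ == "__main__":
    # Hypothetical modules and layer numbers.
    layers = {"core.dll": 0, "net.dll": 1, "netaux.dll": 1, "ui.dll": 2}
    deps = [
        ("ui.dll", "net.dll"),      # downward: allowed
        ("net.dll", "netaux.dll"),  # same layer: allowed
        ("netaux.dll", "net.dll"),  # same-layer cycle: benign
        ("core.dll", "ui.dll"),     # upward: violation
    ]
    print(find_layer_violations(layers, deps))
    print(passes_quality_gate(layers, deps))
```

Keeping the rule to a single comparison per dependency is what makes it cheap enough to run on every commit, which is the role the text assigns to the quality gate.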

They created two custom tools to ease migration of existing modules to new modules. Similar to how Java allows creation of abstract classes which can later be bound to concrete subclasses, the team created a technology that allows other teams to import an empty header module for each logical group of API family, which can later be bound to a concrete module implementation depending on the system configuration. Then a customized loader loads an appropriate target module implementation instead of the empty header module during module loading time. This separates API contracts from API implementations, thus avoiding inclusion of unnecessary modules in a different execution environment, where only minimal functionality instead of full functionality is desired. The above technology takes care of switching between two different API implementations during load time, but does not take care of cases where the execution of two different API implementations must be woven together carefully during runtime. To handle such cases, the team systematically inserted program changes to existing code. Such code changes followed a special coding style guideline for better readability and were partially automated by stub code generation functionality.
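The separation of API contracts from API implementations can be pictured with a small registry that resolves a contract name to an implementation for a given configuration. This is only an analogy for the idea in the text, not the Windows loader: the class names, the contract and configuration strings, and the registry design are hypothetical.

```python
from typing import Dict, Tuple


class FullFileApi:
    """Hypothetical full-functionality implementation of a file API contract."""

    def describe(self) -> str:
        return "full file API"


class MinimalFileApi:
    """Hypothetical minimal implementation for a reduced environment."""

    def describe(self) -> str:
        return "minimal file API"


class ContractLoader:
    """Binds a contract name to an implementation per system configuration.

    Callers depend only on the contract name (the analogue of the empty
    header module); the implementation is chosen when load() is called.
    """

    def __init__(self) -> None:
        self._bindings: Dict[Tuple[str, str], object] = {}

    def bind(self, contract: str, configuration: str, implementation: object) -> None:
        self._bindings[(contract, configuration)] = implementation

    def load(self, contract: str, configuration: str) -> object:
        try:
            return self._bindings[(contract, configuration)]
        except KeyError:
            raise LookupError(
                f"no implementation bound for {contract!r} in {configuration!r}"
            ) from None


if __name__ == "__main__":
    loader = ContractLoader()
    loader.bind("file-api", "full", FullFileApi())
    loader.bind("file-api", "minimal", MinimalFileApi())
    print(loader.load("file-api", "minimal").describe())
```

In this sketch the choice is made once, at load time, which mirrors the limitation noted in the text: it does not address cases where two implementations must interleave at runtime.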

In summary, we found that the refactoring effort had the following distinctive characteristics:

• The team’s refactoring decisions were made after substantial analysis of a de-facto dependency structure.

• The refactoring effort was centralized and top-down: the designated team made software changes systematically, integrated the changes to a main source tree, and educated others on how to use new APIs, while preventing architectural degradation by others.

• The refactoring was enabled and facilitated by development of custom refactoring support tools and processes such as MaX and quality gate checks.

4 QUANTITATIVE ANALYSIS OF WINDOWS 7 VERSION HISTORY

To examine whether the refactoring done by this team had a visible benefit, we analyze Windows 7 version history data. We first generate hypotheses based on our survey and interview study findings. We then conduct statistical analysis using the software metrics data collected from version histories.

4.1 Study Hypotheses

As software functionality varies for different projects and the expertise level of developers who work on the projects varies across different organizations, our study goal is to contrast the impact and characteristics of refactoring changes against those of non-refactoring changes in the same organization and the same project. In other words, we compare the characteristics of refactoring versus non-refactoring versus all changes, noted as refactoring churn, non-refactoring churn, and regular churn (i.e., the union of refactorings and non-refactorings).

We generate study hypotheses based on our qualitative study findings. These hypotheses are motivated by our survey and interviews, as well as the refactoring literature. The hypotheses are described in Table 2 and the following subsections discuss our data collection and analysis method and corresponding findings.

• H1 (Dependency). We investigate the relationship between refactoring and dependency because our interview study indicates that the primary goal of Windows refactoring is to reduce undesirable inter-module dependencies (i.e., the number of neighbor modules connected via dependencies).

• H2 (Defect). We investigate the relationship between refactoring and defects because many of our survey participants perceive that refactoring comes with a risk of introducing defects and regression bugs.

• H3 (Complexity). The hypotheses on complexity are motivated by prior studies on technical debt [18], [19], [20], [21], [22], [23].

• H4 (Size, Churn and Locality). The hypotheses on size, churn, and locality are motivated by the fact that developers often initiate refactoring to improve changeability and maintainability [24] and that crosscutting concerns pose challenges in evolving software systems [25], [26], [27], [28].

• H5 (Developer and Organization). The hypotheses on organizational characteristics are motivated by the fact that the more people who touch the code, the higher the chance of code decay and the higher the need for coordination among the engineers, calling for refactoring of the relevant modules [29].

• H6 (Test Coverage). We investigate the hypothesis on test adequacy because our survey respondents said, “If there are extensive unit tests, then (it’s) great. If there are no tests or there is insufficient documentation for test scenarios, refactoring should not be done.”

• H7 (Layer). We investigate the hypothesis on a layered architecture to confirm our interview findings that the refactoring team split core modules and moved reusable functionality from upper layers to lower layers to repurpose Windows for different execution environments.

4.2 Data Collection: Identifying Refactoring Commits

To identify refactoring events, we use two separate methods. First, we identify refactoring-related branches and isolate changes from those branches. Second, we identify refactoring-related keywords and mine refactoring commits from the Windows 7 version history by searching for these keywords in the commit logs [30].

Refactoring branch identification. In many development organizations, changes are made to specific branches and later merged to the main trunk. For example, the Windows refactoring team created certain branches to apply refactoring exclusively. So we asked the team to classify each branch as a refactoring versus non-refactoring branch. We believe that our method of identifying refactorings is reliable because the team confirmed all refactoring branches manually and reached a consensus about the role of those refactoring branches within the team.

During Windows 7 development, 1.27 percent of changes are made to the refactoring branches owned by the refactoring team; 98.73 percent of changes are made to non-refactoring branches. The proportion of committers who worked on the refactoring branches is 2.04 percent, while the proportion of committers on non-refactoring branches is 99.84 percent. Please note that the sum of the two is greater than 100 percent because some committers work on both refactoring branches and non-refactoring branches. 94.64 percent of modules are affected by at least one change from the refactoring branches, and 99.05 percent of modules are affected by at least one change from non-refactoring branches. In our study, refactored modules are modules into which at least one change from the refactoring branches is compiled. For example, if the refactoring team made edits on the refactoring branches to split a single Vista module into three modules in Windows 7, we call the three modules refactored modules in Windows 7.

Mining commit logs. To identify refactoring from version histories, we look for commit messages with certain keywords. In our survey, we asked the survey participants, “Which keywords do you use or have you seen being used to mark refactoring activities in change commit messages?” Based on the responses, we identify a list of the top ten keywords that developers use or have seen being used to mark refactoring events: refactor (254), clean-up (42), rewrite (22), restructure (15), redesign (15), move (15), extract (11), improve (9), split (7), reorganize (7), rename (7) out of 328 responses. By matching the keywords against the commit messages, we detected refactoring commits from version histories.

According to this method, 5.76 percent of commits are refactoring, while 94.29 percent of commits are non-refactoring. The proportion of committers for the refactoring changes is 50.74 percent, while the proportion is 98.56 percent for non-refactoring. 95.07 percent of modules are affected by at least one refactoring commit, and 99.92 percent of modules are affected by at least one non-refactoring commit.
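The keyword-based identification described above can be sketched as follows. This is only an illustration: the keyword list is the one reported in the text, but the matching rule (a case-insensitive prefix match at a word boundary) and the function names are our assumptions, since the exact matching procedure is not specified.

```python
import re

# Keywords reported by the survey respondents (as listed in the text).
KEYWORDS = ["refactor", "clean-up", "rewrite", "restructure", "redesign",
            "move", "extract", "improve", "split", "reorganize", "rename"]

# Assumption: a keyword matches when it starts a word, ignoring case,
# so "Refactored" and "moved" match while "removed" does not.
PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in KEYWORDS) + r")",
    re.IGNORECASE,
)


def is_refactoring_commit(message: str) -> bool:
    """Return True if the commit message contains a refactoring keyword."""
    return PATTERN.search(message) is not None


def split_commits(messages):
    """Partition commit messages into (refactoring, non-refactoring)."""
    refactoring = [m for m in messages if is_refactoring_commit(m)]
    regular = [m for m in messages if not is_refactoring_commit(m)]
    return refactoring, regular
```

A stricter or looser rule (for example, also accepting "cleanup" without the hyphen) would change the reported proportions, so the rule should be fixed before any counts are compared.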

4.3 Data Collection: Software Metrics

This section discusses our data collection method for defect, dependency, and developer metrics. For other categories, Table 3 clarifies how the data are collected.

TABLE 2
A Summary of Study Hypotheses and Results

Dependencies. For our study, we analyzed dependencies at a module level. Here, a module refers to an executable file (COM, EXE, etc.) or a dynamic-link library file (DLL) shipped with Windows. Modules are assembled from several source files and typically form a logical unit, e.g., user32.dll may provide programs with functionality to implement graphical user interfaces. This module unit is typically used for program analysis within Microsoft and is the smallest unit to which defects are accurately mapped. A software dependency is a directed relation between two pieces of code such as expressions or methods. There exist different kinds of dependencies: data dependencies between the definition and use of values and call dependencies between the declaration of functions and the sites where they are called. Microsoft has an automated tool called MaX [17] that tracks dependency information at the function level, including calls, imports, exports, RPC, COM, and Registry access. MaX generates a system-wide dependency graph from both native x86 and .NET managed modules. MaX is used for change impact analysis and for integration testing [17]. For our analysis, we generated a system-wide dependency graph with MaX at the function level. Since modules are the lowest level of granularity to which defects can be accurately mapped, we lifted this graph up to the module level in a separate post-processing step.
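The lifting step from function-level to module-level dependencies can be illustrated with a small sketch. It does not reproduce MaX or its data format; the edge representation, the function-to-module map, and the choice to count neighbors regardless of edge direction are all assumptions made for illustration.

```python
from collections import defaultdict


def lift_to_modules(function_edges, module_of):
    """Collapse function-level edges (caller, callee) into module-level edges."""
    module_edges = set()
    for caller, callee in function_edges:
        src, dst = module_of[caller], module_of[callee]
        if src != dst:  # edges inside one module are not inter-module dependencies
            module_edges.add((src, dst))
    return module_edges


def neighbor_counts(module_edges):
    """Count distinct neighbor modules, ignoring edge direction (an assumption)."""
    neighbors = defaultdict(set)
    for src, dst in module_edges:
        neighbors[src].add(dst)
        neighbors[dst].add(src)
    return {module: len(others) for module, others in neighbors.items()}


# Hypothetical data: the function names and the module map are invented.
module_of = {"f1": "a.dll", "f2": "a.dll", "g1": "b.dll", "h1": "c.exe"}
function_edges = [("f1", "f2"), ("f1", "g1"), ("f2", "g1"), ("h1", "g1")]
module_edges = lift_to_modules(function_edges, module_of)
counts = neighbor_counts(module_edges)
```

In this toy graph, the two calls from a.dll into b.dll collapse into a single module-level edge, which is the effect of measuring dependencies at module granularity.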

Defects. Microsoft records all problems that are reported for Windows in a database. In this study, we measured the changes in the number of post-release defects, that is, defects leading to failures that occurred in the field within six months after the initial releases of Windows Vista or Windows 7. We collected all problem reports classified as non-trivial (in contrast to enhancement requests [33]) and for which the problem was fixed in a later product update. The location of the fix is used as the location of the post-release defect. To understand the impact of Windows 7 refactoring, we compared the number of dependencies and the number of post-release defects at the module level between Windows Vista and Windows 7.

Developers and organization. We use committer information extracted from version control systems. Based on the Windows product organization’s structure and committer information, we measure seven metrics that represent the number of contributors, the cohesiveness of ownership, and the diffusion of contribution used by Nagappan et al. These metrics are summarized in Table 3, and the detailed description and data collection method of these measures are available elsewhere [32].

4.4 Analysis Method: Preferential Refactoring

To distinguish the role of refactoring versus regular changes, we define the degree of preferential refactoring: applying refactorings more frequently to a module, relative to the frequency of regular changes.

TABLE 3
Software Metrics to Be Used for Data Collection and Statistical Analysis

This notion of preferential refactoring is used throughout the later sections to distinguish the impact of refactoring versus regular changes. For example, if a module is ranked fifth in terms of regular commits but third in terms of refactoring commits, the rank difference is two. This positive number indicates that refactoring is preferentially applied to the module relative to regular commits. We use the rank difference measure, Eq. (1), instead of the proportion of refactoring commits out of all commits per module, defined as $\frac{\mathit{refactoring\_commit\_count}(m)}{\mathit{all\_commit\_count}(m)}$, because this proportion measure is very sensitive to the total number of commits made to each module and because modules with very few changes introduce significant noise.

We then sort the modules based on the rank difference in descending order and contrast the characteristics of the top 5 percent group against the characteristics of the bottom 95 percent group. We choose the particular threshold of top 5 percent, instead of top n percent, because the top 5 percent of modules with the most refactoring commits account for 90 percent of all refactoring commits; in other words, the top 5 percent group represents concentrated refactoring effort.
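The rank-difference measure and the top-group split can be sketched as below. The sketch assumes that rank 1 denotes the module with the most commits and that the rank difference is the all-commit rank minus the refactoring-commit rank, as in the expression used for the correlation analysis in Section 4.12. The tie-breaking rule and all names are hypothetical.

```python
def ranks_desc(counts):
    """Rank modules by commit count; rank 1 = most commits.
    Ties are broken by module name (an arbitrary, assumed rule)."""
    ordered = sorted(counts, key=lambda m: (-counts[m], m))
    return {m: i + 1 for i, m in enumerate(ordered)}


def rank_difference(all_commits, ref_commits):
    """all_commit_rank(m) - ref_commit_rank(m) for every module m."""
    ref = {m: ref_commits.get(m, 0) for m in all_commits}
    all_rank = ranks_desc(all_commits)
    ref_rank = ranks_desc(ref)
    return {m: all_rank[m] - ref_rank[m] for m in all_commits}


def split_top(rank_diff, fraction=0.05):
    """Sort by rank difference (descending) and split off the top fraction."""
    ordered = sorted(rank_diff, key=lambda m: (-rank_diff[m], m))
    k = max(1, int(len(ordered) * fraction))
    return ordered[:k], ordered[k:]


# Hypothetical commit counts per module.
all_commits = {"a": 50, "b": 40, "c": 30, "d": 20}
ref_commits = {"a": 1, "b": 2, "c": 9, "d": 5}
diff = rank_difference(all_commits, ref_commits)
top, rest = split_top(diff, fraction=0.25)
```

Here module c is third by all commits but first by refactoring commits, so its rank difference is positive and it lands in the preferentially refactored group.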

4.5 Hypothesis H1. Dependencies

TABLE 4
The Relationship between Windows 7 Refactoring and Various Software Metrics
Statistically significant test results (p-value ≤ 0.05) are marked in yellow background.

To investigate H1.A, for each module in Vista, we measure Neighbors, the number of neighbor modules connected via dependencies. We then contrast the average inter-module dependencies of the top 5 percent preferentially refactored modules versus that of the bottom 95 percent of preferentially refactored modules in Vista. These results are summarized in Table 4. The second column and the third column report the relative ratio of a software metric with respect to the average metric value in Vista for the top 5 percent and bottom 95 percent groups respectively. The fourth column reports the p-value of the Wilcoxon test. Statistically significant test results (p-value ≤ 0.05) are marked in yellow background. Our results indicate that the top 5 percent most preferentially refactored modules have 7 percent more inter-module dependencies on average than the average modified module in Vista, while the bottom 95 percent have almost the same number of inter-module dependencies as the average modified module. In other words, developers did preferentially apply refactorings to the modules with a higher number of inter-module dependencies in Vista. A two-sided, unpaired Wilcoxon (Mann-Whitney) test indicates that the top 5 percent group is statistically different from the bottom 95 percent in terms of its inter-module dependency coverage (p-value: 0.0026).
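For readers unfamiliar with the two-sided, unpaired Wilcoxon (Mann-Whitney) test, the following is a minimal sketch of the statistic. It is a simplified illustration rather than the procedure used in the study: it relies on the normal approximation and omits the tie and continuity corrections that statistical packages usually apply.

```python
from statistics import NormalDist


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Return (U statistic for x, two-sided p-value by normal approximation)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5  # no tie correction
    z = (u1 - mean_u) / sd_u
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return u1, p
```

In practice one group would hold the metric values of the top 5 percent modules and the other those of the bottom 95 percent; for small samples an exact test is preferable to this approximation.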

To investigate H1.B, we contrast the changes in the metric value for the top 5 percent group versus the bottom 95 percent group. The changes in a metric are then normalized with respect to the absolute average delta of the software metric among all modified modules. Suppose that the top 5 percent of most preferentially refactored modules decreased the value of a software metric by 5 on average, while the bottom 95 percent increased the metric value by 10 on average. On average, a modified module has increased the metric value by 9.25. We then normalize the decrease in the top 5 percent group (−5) and the increase in the bottom 95 percent group (+10) with respect to the absolute value of the average change of all modified modules (9.25), resulting in −0.54 and +1.08 respectively.
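The normalization in the example above can be reproduced in a few lines. The sketch assumes that the overall average change is the 5/95 weighted mean of the two group averages, which is consistent with the value 9.25 given in the text; the function name is ours.

```python
def normalize_deltas(top_delta, bottom_delta, top_share=0.05):
    """Normalize group-level average changes by the absolute value of the
    overall average change across all modified modules."""
    overall = top_share * top_delta + (1 - top_share) * bottom_delta
    scale = abs(overall)
    return overall, top_delta / scale, bottom_delta / scale


overall, top_norm, bottom_norm = normalize_deltas(-5, 10)
print(round(overall, 2), round(top_norm, 2), round(bottom_norm, 2))
# prints: 9.25 -0.54 1.08
```

Dividing by the absolute value keeps the sign of each group's change, so a negative normalized value still denotes a decrease even when the overall average change is itself negative.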

Using this method, we measure the impact of preferential refactoring on the changes in inter-module dependencies. The result indicates that the top 5 percent of preferentially refactored modules decreased the number of inter-module dependencies by a factor of 0.85, while the bottom 95 percent of preferentially refactored modules increased the number of inter-module dependencies by a factor of 1.10 with respect to the average change of all modified modules. The Wilcoxon test indicates that the trend is statistically significant (p-value: 0.000022).

We conclude that preferential refactoring is correlated with the decrease of inter-module dependencies. This is consistent with our interviews with the Windows refactoring team, who stated that their goal is to reduce undesirable inter-module dependencies.

4.6 Hypothesis H2. Defect

Similar to the analysis of contrasting the top 5 percent group vs. the bottom 95 percent group in Section 4.5, we investigate the relationship between preferential refactoring treatment and defects. The results in Table 4 indicate that the top 5 percent group has 9 percent fewer post-release defects than the rest in Vista, confirming that refactoring was not necessarily preferentially applied to the modules with a high number of defects.

Further, the top 5 percent group decreased post-release defects by a factor of 0.93, while the bottom 95 percent group decreased by a factor of 1.00 with respect to the average change in defect counts. In other words, the top 5 percent group is correlated with the reduction of post-release defects, but less so, compared to the rest. This result indicates that the cause of defect reduction cannot be attributed to refactoring changes alone. It is possible that the defect reduction is enabled by other events in those refactored modules as well.

4.7 Hypothesis H3. Complexity

We use the same method of contrasting the top 5 percent versus the bottom 95 percent. Table 4 summarizes the results. The top 5 percent of most preferentially refactored modules tend to have lower complexity measures in Vista compared to the bottom 95 percent. However, this distinction is not statistically significant according to the Wilcoxon test, rejecting hypothesis H3.A.

The top 5 percent group reduces fan-in, fan-out, and cyclomatic complexity measures more than the rest. For example, if we assume that an average modified module has 100 fan-ins in Windows Vista, the top 5 percent group covers 67, while the bottom 95 percent covers 102 fan-ins. The member read based coupling (C5.1) increases much less for the top 5 percent, compared to the rest. The results are statistically significant (p-value ≤ 0.05). However, for other complexity metrics, the same trend does not hold or is not statistically significant. In summary, we found that the Windows refactoring effort did not reduce various complexity measures consistently.

4.8 Hypothesis H4. Size, Churn, and Locality

We investigate the relationship between refactoring and size, churn, and locality respectively. The hypotheses H4.A, H4.B and H4.C are motivated by the fact that developers often initiate refactoring to improve changeability and maintainability [24] and that crosscutting concerns pose challenges in evolving software systems [25], [26], [27], [28]. We use the same study method of contrasting the top 5 percent versus the bottom 95 percent described in Section 4.5.

Regarding H4.A, the top 5 percent has smaller size metrics than the rest in Vista. This indicates that refactoring changes were not preferentially applied to the modules with large size. Furthermore, the top 5 percent increased size more in terms of LOC, while the rest decreased their module size. In other words, preferential refactoring did not play any role in decreasing various size measures, rejecting H4.A.

Regarding H4.B, in contrast to our original hypothesis, the top 5 percent group consistently has lower churn measures in Vista, compared to the rest. In other words, the less frequently modules were changed in Vista, the more likely they are to receive preferential refactoring treatment during Windows 7 development. One possible explanation is that developers might have determined that the modules to be refactored are problematic and that making too many changes to them could impede the stability of the system. Thus, to preserve stability, they may have applied fewer changes intentionally. The churn measures in the top 5 percent group decrease less, compared to the rest. The less frequent refactorings are, the greater the decrease in churn measurements, rejecting H4.B.

Regarding H4.C, to measure the degree of crosscutting changes, we measure the average number of modified files per check-in. Our results indicate that preferential refactoring is applied to the modules with a lower degree of crosscutting changes in Vista, and the modules which received preferential refactoring treatment tend to increase crosscutting changes more than other modules in Windows 7 (173 versus 96 percent), rejecting H4.C. We believe that the increase in the crosscutting changes is caused by refactorings themselves, which tend to be more scattered than regular changes (on average refactorings touched 20 percent more files than regular changes).

4.9 Hypothesis H5. Developer and Organization

The more people who touch the code, the higher the chance of code decay, and there could be a higher need for coordination among the engineers [29]. Such coordination needs may call for refactoring of the relevant modules. We hypothesize that refactoring could reduce the number of engineers who worked on the modules by aligning the modular structure of the system with the team structure [34], [35], [36]. To check these hypotheses, we study the cohesiveness and diffusion measures of ownership and contribution. The less cohesive and local the contributions are, the more refactoring effort may be called for, while diffusive ownership could impede the effort of initiating refactoring.

Table 4 summarizes the results. Regarding hypothesis H5.A, the NOE and NOEE measures are lower in the top 5 percent than the rest, indicating that the refactoring effort was preferentially made to the modules on which fewer developers worked. This is contrary to our initial hypothesis. One possible explanation is that the refactored modules are crucial, important modules, and for these modules typically only a small, tight group of developers was allowed to check in code changes.

The cohesiveness of contribution measures (DMO, PO, and OCO) are lower in the top 5 percent than the rest in Vista, validating H5.B. The refactoring effort was preferentially made to the modules with lower cohesiveness in terms of contribution. However, we did not find any distinctive trends in terms of changes in those measures before and after refactoring.

Regarding the hypothesis H5.C, the diffusion measure (OOW) is lower in the top 5 percent than the rest in Vista. However, we did not find any distinctive trends in terms of changes in those measures before and after refactoring.

4.10 Hypothesis H6. Test Coverage

We investigate the following hypothesis about refactoring and test adequacy, because our survey indicates that, when a regression test suite is inadequate, it could prevent developers from initiating refactoring effort. Developers perceive that there is no safety net for checking the correctness of refactoring edits when the test adequacy is low. Table 4 summarizes the results. The block and arc test coverages for the top 5 percent of most preferentially refactored modules are indeed higher than the rest.

4.11 Hypothesis H7. Layers

The layer data in Windows is a means of quantifying the architectural constraints of the system. Simplistically speaking, the lowest level is the Windows kernel and the highest level is the UI components. The goal of assigning layer numbers to modules is to prevent reverse dependencies. In other words, modules in level 0 are more important than, say, level 25 for the reliability of the system, as modules at level n can depend only on the modules of level n − 1 and below. We investigate H7 to confirm our interview findings that the refactoring team split core modules and moved reusable functionality from an upper layer to a lower layer to repurpose Windows for different execution environments.
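The layering constraint can be expressed as a simple check over module-level dependency edges. The sketch below follows a literal reading of the rule stated above (a module at level n may depend only on modules at level n − 1 and below), so same-level dependencies are also flagged; the module names and layer numbers are invented for illustration.

```python
def reverse_dependencies(module_edges, layer):
    """Return edges (src, dst) whose target is not strictly below the source."""
    return [(src, dst) for src, dst in module_edges if layer[dst] >= layer[src]]


# Hypothetical layer assignment: 0 is the kernel, higher numbers are closer to the UI.
layer = {"kernel.sys": 0, "core.dll": 3, "shell.dll": 25}
edges = [
    ("shell.dll", "core.dll"),   # allowed: upper layer depends on lower layer
    ("core.dll", "kernel.sys"),  # allowed
    ("core.dll", "shell.dll"),   # reverse dependency: lower layer depends on upper
]
violations = reverse_dependencies(edges, layer)
```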

The top 5 percent and the bottom 95 percent of most preferentially refactored modules have similar average layer numbers (1.01 and 1.00 respectively in Table 4). The difference is not statistically significant. In contrast to the interviewees’ understanding, the refactoring was not preferentially applied to modules with a high layer number. Further, the layer number increase for the top 5 percent group is less than the increase of the bottom 95 percent.

4.12 Multivariate Regression Analysis

To identify how much and which factors may impact a refactoring investment decision (or selection of the modules to apply refactorings preferentially), we use multivariate regression analysis.

To select independent variables for multivariate regression, we first measure the Spearman rank correlation between the rank difference and individual software metrics before refactoring. We use the metrics from Vista, as they represent the characteristics before the refactoring effort was made:

$$\mathrm{cor}\bigl(\mathit{all\_commit\_rank}(m) - \mathit{ref\_commit\_rank}(m),\ \mathit{metric\_vista}(m)\bigr). \qquad (2)$$

If the correlation is a positive number close to 1, it implies that preferentially refactored modules tend to have a higher metric value before refactoring. If the correlation is near zero, it implies that there exists no correlation between preferential refactoring and the metric.
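As an illustration of Eq. (2), the Spearman rank correlation can be computed as the Pearson correlation of the ranked values. The sketch below is a generic textbook formulation with average ranks for ties, not the statistical package used in the study; the input vectors in the usage example are hypothetical.

```python
def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical inputs: rank differences and a Vista metric for four modules.
rank_diff = [2, 2, -1, -3]
metric_vista = [120, 95, 60, 40]
rho = spearman(rank_diff, metric_vista)
```

A value of rho near 1 here would mean that modules with a larger rank difference also had larger metric values in Vista.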

We then select all metrics with significant correlations at p < 0.0001, even if the correlation values do not appear to be very high. We use the lower p-value to account for multiple hypothesis testing (Bonferroni correction) [37]. We construct an initial model where the dependent variable is the rank difference per module, and the independent variables are {# of files per check-in, # of sources, # of incoming dependencies, OCO, OOW, relative churn, NOE, # of changed files, # of defects, total churn, NOEE, NOE, and # of check-ins}. Then we use forward and backward stepwise regression to reduce the number of variables in the model.

The final model is described in Table 5. The rows represent the selected variables and the cells represent coefficients, standard errors, and t-values. The results indicate that the locality of change, the number of dependent modules (sources), the number of defects, and the number of developers who worked on the modules are significant factors for preferential refactoring. As discussed in previous sections, factors such as complexity, size, and test adequacy do not play much of a role in the decision of refactoring investment.


To investigate how other factors may contribute to reduction of defects and dependencies, we use multivariate regression analysis. Using the same method described, we select all metrics that have significant correlations with preferential refactoring at p < 0.0001 to build an initial model. Then we use forward and backward stepwise regression to reduce the number of variables in the model. In addition to these selected variables, we use several independent variables: (1) preferential refactoring, as measured by the rank differences, (2) refactoring churn and churn, as measured by the number of refactoring commits and all commits, and (3) refactoring ratio, as measured by $\frac{\#\text{ of refactoring commits}}{\#\text{ of all commits}}$.

In Table 6, column Δdefects describes a regression model for the changes in the number of defects, i.e., $\mathit{defects\_windows7}(m) - \mathit{defects\_vista}(m)$. Column Δdependencies describes a regression model for the changes in the number of inter-module dependencies. Each cell describes a coefficient for a selected independent variable. The resulting model for Δdefects indicates that one of the most important factors for defect reduction is the number of previous defects in Vista. The higher the number of defects in Vista, the higher the decrease of defects in Windows 7 (coefficient, −0.965). Another important factor is the amount of refactoring churn, indicated by # of refactoring commits (coefficient, −0.072). This result indicates that, among many metrics, refactoring churn is likely to play a significant role in reducing the number of defects.

Similarly, the resulting model for Δdependencies indicates that important factors include the number of previous defects in Vista and the amount of refactoring churn (coefficients −0.182 and −0.130 respectively).

4.13 Results of Keyword Based Identification of Refactoring Commits

In addition to the designated refactoring team, other developers in Windows also apply refactoring to the system. To understand the impact of such refactoring, we identify refactoring changes by matching refactoring keywords against commit messages, as described in Section 4.2.

Refactoring commits found using a branch-based method overlap with refactoring commits found using a keyword-based method. When normalizing the absolute number of commits, suppose that the total number of commits is X. The number of refactoring commits identified based on keywords is Y = 0.058X. The number of refactoring commits identified based on branches is Z = 0.013X. The overlap between refactoring commits identified using both methods is 0.004X, which is 0.006Y or 0.279Z. The absolute number of commits is normalized for presentation purposes.

Table 7 presents the same information as Table 4, and the only difference between the two is that Table 7 uses a keyword-based method, while Table 4 uses a branch-based isolation method. According to the results of Table 4, the refactoring team’s effort was preferentially applied to the modules with a relatively high number of inter-module dependencies, low post-release defects, low churn measures, and high organization cohesiveness metrics. The team effort is correlated with a relatively higher degree of reduction in inter-module dependencies and certain complexity measures. On the other hand, the refactoring effort indicated by the keyword method focused on the modules with a relatively low number of inter-module dependencies, low post-release defects, low complexity measures, low churn measures, small sizes, and high organization cohesiveness metrics.

Since refactorings identified by a keyword method and a branch isolation method are both correlated with reduction in inter-module dependencies, complexity metrics, and churn measures, these reduction trends are unlikely to be enabled by the refactoring team’s effort alone. Instead, both the refactoring made by individual developers in Windows 7 and the designated refactoring team are likely to have contributed to the reduction of these metrics.

4.14 Discussion

Table 2 summarizes the study results of refactoring commits found by the branch method.

• Refactoring was preferentially applied to the modules with a large number of dependencies. Preferential refactoring is correlated with reduction in the number of dependencies. These findings are expected and not surprising because interview participants stated the goal of system-wide refactoring is to reduce undesirable inter-module dependencies. However, we believe that there is a value in validating intended benefits using version history analysis.

• Preferential refactoring is correlated with the reduction of post-release defects, but less so, compared to the rest. This indicates that the cause of defect reduction cannot be attributed to refactoring changes alone.

• Refactoring was not preferentially applied to the modules with high complexity. Preferentially refactored modules experience a higher rate of reduction in certain complexity measures, but increase LOC and crosscutting changes more than the rest of the modules. This implies that managers may need automated tool support for monitoring the impact of refactoring in a multi-dimensional way.

• Refactoring was preferentially applied to the modules with high test coverage. This is consistent with the survey participants’ view that test adequacy affects developers’ likelihood to initiate refactoring.

TABLE 5
Multivariate Regression Analysis for Preferential Refactoring
Software metrics are measured for Vista before refactoring effort.

TABLE 6
Multivariate Regression Analysis Results for the Changes in Defects and Dependencies

5 THREATS TO VALIDITY

Internal validity. Our findings in Section 3 indicate only correlation between the refactoring effort and reduction of the number of inter-module dependencies and post-release defects, not causation; there are other confounding factors such as the expertise level of developers that we did not examine. It is possible that the changes to the number of module dependencies and post-release defects in Windows 7 are caused by factors other than refactoring such as the types of features added in Windows 7.

Construct validity. Construct validity issues arise when there are errors in measurement. This is negated to an extent by the fact that the entire data collection process of failures and VCS is automated. When selecting target participants for refactoring, we searched all check-ins with the keyword “refactor*” based on the assumption that people who used the word know at least approximately what it means.

TABLE 7
The Relationship between Refactoring Identified by a Keyword Based Method and Various Software Metrics
Statistically significant test results (p-value ≤ 0.05) are marked in yellow background.


The definition of refactoring from developers’ perspectives is broader than behavior-preserving transformations, and the granularity of refactorings also varies among the participants. For example, some survey participants refer to Fowler’s refactorings, while a large number of the participants (71 percent) consider that refactorings are often a part of larger, higher-level effort to improve existing software. We do not have data on how many of the survey participants participated in large refactorings vs. small refactorings, as such a question was not part of the survey. In our Windows case study, we focused on system-wide refactoring, because such refactoring granularity seems to be aligned with the refactoring granularity mentioned by a large number of the survey participants.

To protect confidential information, we used standard normalizations. All analyses were performed on the actual values, and the normalization was done for presentation purposes only to protect confidential information.

External validity. In our case, we came to know about a multi-year refactoring effort in Windows from several survey participants, and to leverage this best possible scenario where intentional refactoring was performed, we focused on the case study of Windows. As opposed to formal experiments that often have a narrow focus and an emphasis on controlling context variables, case studies test theories and collect data through observation in an unmodified setting.

Our study about Windows 7 refactoring may not generalize to other refactoring practices, because the Windows refactoring team had a very specific goal of reducing undesirable inter-module dependencies; the findings therefore may not be applicable where refactoring efforts have a different or less precise goal. The hypothesis H7 assumes that the software system has a layered architecture. Thus it may not be applicable to systems without a layered architecture. Because the Windows refactoring is a large scale, system-wide refactoring, its benefit may differ from other small scale refactorings that appear in Fowler’s catalog.

While we acknowledge that our case study on Windows may not generalize to other systems, most development practices are similar to those outside of Microsoft. Furthermore, developers at Microsoft are highly representative of software developers all over the world, as they come from diverse educational and cultural backgrounds.³ We believe that lifting the veil on the Windows refactoring process and quantifying the correlation between refactoring and various metrics could be valuable to other development organizations. To facilitate replication of our study outside Microsoft, we published the full survey questions as a technical report [12].

6 RELATED WORK

Refactoring definition. While refactoring is defined as a behavior-preserving code transformation in the academic literature [10], the de-facto definition of refactoring in practice seems to be very different from such a rigorous definition. Fowler catalogs 72 types of structural changes in object oriented programs, but these transformations do not necessarily guarantee behavior preservation [1]. In fact, Fowler recommends that developers write test code first, since these refactorings may change a program’s behavior. Murphy-Hill et al. analyzed refactoring logs and found that developers often interleave refactorings with other behavior-modifying transformations [38], indicating that pure refactoring revisions are rare. Our survey in Section 2 also finds that refactoring is not confined to low-level, semantics-preserving transformations from developers’ perspectives.

Quantitative assessment of refactoring benefits. While several prior research efforts have conceptually advanced our understanding of the benefit of refactoring through metaphors, few empirical studies assess refactoring benefits quantitatively. Sullivan et al. first linked software modularity with option theories [39]. A module provides an option to substitute it with a better one without symmetric obligations, and investing in refactoring activities can be seen as purchasing options for future adaptability, which will produce benefits when changes happen and the module can be replaced easily. Baldwin and Clark [40] argued that the modularization of a system can generate tremendous value in an industry, given that this strategy creates valuable options for module improvement. Ward Cunningham drew the comparison between debt and a lack of refactoring: a quick and dirty implementation leaves technical debt that incurs penalties in terms of increased maintenance costs [21]. While these projects advanced conceptual understanding of refactoring impact, they do not quantify the benefits of refactoring.

Xing and Stroulia found that 70 percent of structural changes in Eclipse’s evolution history are due to refactorings and that existing IDEs lack support for complex refactorings [41]. Dig and Johnson studied the role of refactorings in API evolution and found that 80 percent of the changes that break client applications are API-level refactorings [42]. While these studies focused on the frequency and types of refactorings, they did not focus on how refactoring impacts inter-module dependencies and defects. MacCormack et al. [43] defined modularity metrics and used these metrics to study the evolution of Mozilla and Linux. They found that the redesign of Mozilla resulted in an architecture that was significantly more modular than that of its predecessor. However, unlike our study on Windows, their study merely monitored design structure changes in terms of modularity metrics without identifying the modules where refactoring changes are made.

Several research projects automatically detect the symptoms of poor software design, coined as code smells by Fowler [1]. Guéhéneuc and Albin-Amiot detect inter-class design defects [44] and Marinescu identifies design flaws using software metrics [45]. Izurieta and Bieman detect accumulation of non design-pattern related code [19]. Guo et al. define domain-specific code smells [46] and investigate the consequence of technical debt [20]. Wong et al. [47] identify modularity violations, that is, recurring discrepancies between which modules should change together and which modules actually change together according to version histories. While these studies correlate the symptoms of poor design with quality measurements such as the number of bugs, these studies do not directly investigate the consequence of refactoring, that is, purposeful actions to reverse or mitigate the symptoms of poor design.

3. Global diversity and inclusion http://www.microsoft.come/about/diversity/en/us/default.aspx.

Conflicting evidence on refactoring benefit. Kataoka et al. [18] proposed a refactoring evaluation method that compares software before and after refactoring in terms of coupling metrics. Kolb et al. [23] performed a case study on the design and implementation of existing software and found that refactoring improves software with respect to maintainability and reusability. Moser et al. [48] conducted a case study in an industrial, agile environment and found that refactoring enhances quality and reusability related metrics. Carriere et al.’s case study found that the average time taken to resolve tickets decreases after re-architecting the system [49]. Ratzinger et al. developed defect prediction models based on software evolution attributes and found that refactoring related features and defects have an inverse correlation [5]: if the number of refactoring edits increases in the preceding time period, the number of defects decreases. These studies indicated that refactoring positively affects productivity or quality measurements.

On the other hand, several research efforts found contradicting evidence that refactoring may affect software quality negatively. Weißgerber and Diehl found that refactoring edits often occur together with other types of changes and that refactoring edits are followed by an increasing number of bugs [6]. Kim et al. found that the number of bug fixes increases after API refactorings [9]. Nagappan and Ball found that code churn (the number of added, deleted, and modified lines of code) is correlated with defect density [50]. Since refactoring often introduces a large amount of structural changes to the system, some question the benefit of refactoring. Görg and Weißgerber detected errors caused by incomplete refactorings by relating API-level refactorings to the corresponding class hierarchy [8].
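As a rough illustration of the churn measure mentioned in the preceding paragraph, the following sketch counts added, deleted, and modified lines between two versions of a file using Python's standard `difflib`. It is not the procedure used in [50]; the helper name `code_churn` and the rule that paired lines in a replaced block count as "modified" are assumptions made only for this example.

```python
import difflib


def code_churn(old_lines, new_lines):
    """Count added, deleted, and modified lines between two file versions.

    Hypothetical helper for illustration only.
    """
    added = deleted = modified = 0
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "insert":
            added += j2 - j1
        elif tag == "delete":
            deleted += i2 - i1
        elif tag == "replace":
            # Assumption: lines that pair up inside a replaced block are
            # "modified"; any surplus on either side is added or deleted.
            old_n, new_n = i2 - i1, j2 - j1
            paired = min(old_n, new_n)
            modified += paired
            added += new_n - paired
            deleted += old_n - paired
    return {
        "added": added,
        "deleted": deleted,
        "modified": modified,
        "total": added + deleted + modified,
    }


if __name__ == "__main__":
    before = ["a", "b", "c"]
    after = ["a", "x", "c", "d"]
    print(code_churn(before, after))
```

A relative churn measure could then be obtained by dividing the total by the file's line count, although the exact normalization is again a choice this sketch leaves open.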

Because manual refactoring is often tedious and error-prone, modern IDEs provide features that automate the application of refactorings [51], [52]. However, recent research found several limitations of tool-assisted refactorings as well. Daniel et al. found dozens of bugs in the refactoring tools in popular IDEs [53]. Murphy-Hill et al. found that refactoring tools do a poor job of communicating errors and that programmers do not leverage them as effectively as they could [38]. Vakilian et al. [16] and Murphy et al. [54] found that programmers do not use some automated refactorings despite being aware of their availability.

These contradicting findings on refactoring benefits motivate our survey on the value perception about refactoring. They also motivate our analysis on the relationship between refactoring and inter-module dependencies and defects.

Refactoring change identification. A number of existing techniques address the problem of automatically inferring refactorings from two program versions. These techniques compare code elements in terms of their name [41] and structure similarity to identify move and rename refactorings [55]. Prete et al. encode Fowler’s refactoring types in template logic rules and use a logic query approach to automatically find complex refactorings from two program versions [56]. This work also describes a survey of existing refactoring reconstruction techniques. Kim et al. use the results of API-level refactoring reconstruction to study the correlation between API-level refactorings and bug fixes [9]. While it is certainly possible to identify refactorings using refactoring reconstruction techniques, in our Windows 7 analysis, we identify the branches that a designated refactoring team created to apply and maintain refactorings exclusively and isolate changes from those branches. We believe that our method of identifying refactorings is reliable, as a designated team confirmed all refactoring branches manually and reached a consensus about the role of those refactoring branches within the team.
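To give a feel for name-based refactoring inference, the sketch below proposes rename candidates by pairing identifiers that disappeared from the old version with the most similar identifier that appeared in the new one. It is a toy under stated assumptions and does not reproduce the algorithms of [41], [55], or [56]: the function `rename_candidates`, the use of `difflib` string similarity, and the 0.7 threshold are all hypothetical choices.

```python
from difflib import SequenceMatcher


def rename_candidates(old_names, new_names, threshold=0.7):
    """Pair each removed name with its most similar added name.

    Hypothetical helper: real tools also compare structure, callers,
    and signatures, not only the text of the name.
    """
    removed = sorted(set(old_names) - set(new_names))
    added = sorted(set(new_names) - set(old_names))
    candidates = []
    for old in removed:
        best, best_score = None, 0.0
        for new in added:
            score = SequenceMatcher(None, old, new).ratio()
            if score > best_score:
                best, best_score = new, score
        # Report a pair only when the similarity clears the threshold.
        if best is not None and best_score >= threshold:
            candidates.append((old, best, round(best_score, 2)))
    return candidates


if __name__ == "__main__":
    old_version = ["parseConfig", "loadFile", "render"]
    new_version = ["parseConfiguration", "loadFile", "draw"]
    for old, new, score in rename_candidates(old_version, new_version):
        print(f"{old} -> {new} (similarity {score})")
```

Name similarity alone produces false positives and misses renames that change a name completely, which is one reason the branch-based identification described above avoids reconstruction altogether.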

Empirical studies on Windows. Prior studies on Windows focused primarily on defect prediction. Nagappan and Ball investigated the impact of code churn on defect density and found that relative code churn measures were indicators of code quality [50]. Zimmermann and Nagappan built a system-wide dependency graph of Windows Server 2003. By computing network centrality measures, they observed that network measures based on dependency structure were 10 percent more effective in defect prediction, compared to complexity metrics [57]. More recently, Bird et al. observed that socio-technical network measures combined with dependency measures were stronger indicators of failures than dependency measures alone [58]. Our current study differs significantly from these prior studies by distinguishing refactoring changes from non-refactoring changes and by focusing on the impact of refactoring on inter-module dependencies and defects. In addition, this paper uses a multivariate regression analysis to investigate the factors beyond refactoring churn that could have affected the changes in the defects and inter-module dependencies.
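The idea of a network measure on a dependency graph can be sketched with the simplest one, degree centrality. The example below is only illustrative: [57] used a broader set of network measures, and the function name `degree_centrality`, the edge-list input format, and the module names are assumptions made for this sketch.

```python
from collections import defaultdict


def degree_centrality(dependencies):
    """Compute in-degree, out-degree, and normalized degree centrality.

    `dependencies` is an iterable of (source, target) pairs, read as
    "source depends on target". Hypothetical helper for illustration.
    """
    in_deg = defaultdict(int)
    out_deg = defaultdict(int)
    nodes = set()
    for src, dst in set(dependencies):  # ignore duplicate edges
        nodes.update((src, dst))
        out_deg[src] += 1
        in_deg[dst] += 1
    n = len(nodes)
    # Normalize by the maximum possible number of neighbors.
    scale = 1.0 / (n - 1) if n > 1 else 0.0
    return {
        module: {
            "in": in_deg[module],
            "out": out_deg[module],
            "centrality": (in_deg[module] + out_deg[module]) * scale,
        }
        for module in sorted(nodes)
    }


if __name__ == "__main__":
    # Made-up module names; not taken from the Windows data.
    edges = [("ui", "core"), ("net", "core"), ("ui", "net"), ("core", "log")]
    for module, stats in degree_centrality(edges).items():
        print(module, stats)
```

In this toy graph the `core` module is connected to every other module, so it receives the highest centrality, which mirrors the intuition that heavily depended-upon modules deserve extra attention.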

7 CONCLUSIONS AND FUTURE WORK

This paper presents a three-pronged view of refactoring in a large software development organization through a survey, interviews, and version history data analysis. To investigate a de-facto definition and the value perception about refactoring in practice, we conducted a survey with 328 professional software engineers. Then, to examine whether the survey respondents’ perception matches reality and whether there are visible benefits of refactoring, we interviewed six engineers who led the Windows refactoring effort and analyzed Windows 7 version history data.

Our study finds that the definition of refactoring in practice is broader than behavior-preserving program transformations. Developers perceive that refactoring involves substantial cost and risks, and they need various types of refactoring support beyond automated refactoring within IDEs. Our interview study shows how system-wide refactoring was carried out in Windows. The quantitative analysis of Windows 7 version history shows that refactoring effort was focused on the modules with a high number of inter-module dependencies and high test adequacy. Consistent with the refactoring goal stated by the interview participants, preferentially refactored modules indeed experienced higher reduction in the number of inter-module dependencies than other changed modules. While preferentially refactored modules experience a higher rate of reduction in certain complexity measures, they increase LOC and crosscutting changes more than the rest. As the benefit of refactoring is multi-dimensional and not consistent across various metrics, we believe that managers and developers can benefit from automated tool support for monitoring the impact of refactoring on various software metrics.

ACKNOWLEDGMENTS

The authors thank Galen Hunt, Tom Ball, Chris Bird, Mike Barnett, Rob DeLine, and Andy Begel for their insightful comments. Thanks to the Microsoft Windows refactoring team for their help in understanding the data. Thanks to the many managers and developers who volunteered their time to participate in our research. Miryung Kim performed a part of this work while working at Microsoft Research. This work was supported in part by the National Science Foundation under Grants CCF-1149391, CCF-1117902, and CCF-1043810, and by a Microsoft SEIF award.

REFERENCES

[1] M. Fowler, Refactoring: Improving the Design of Existing Code. Reading, MA, USA: Addison-Wesley, 2000.

[2] L. A. Belady and M. Lehman, “A model of large program development,” IBM Syst. J., vol. 15, no. 3, pp. 225–252, 1976.

[3] K. Beck, Extreme Programming Explained, Embrace Change. Reading, MA, USA: Addison-Wesley, 2000.

[4] W. Opdyke, Refactoring, Reuse & Reality, Lucent Technologies, Murray Hill, NJ, USA, 1999.

[5] J. Ratzinger, T. Sigmund, and H. C. Gall, “On the relation of refactorings and software defect prediction,” in Proc. ACM Int. Working Conf. Mining Softw. Repositories, 2008, pp. 35–38.

[6] P. Weißgerber and S. Diehl, “Are refactorings less error-prone than other changes?” in Proc. ACM Int. Workshop Mining Softw. Repositories, 2006, pp. 112–118.

[7] P. Weißgerber and S. Diehl, “Identifying refactorings from source-code changes,” in Proc. 21st IEEE/ACM Int. Conf. Autom. Softw. Eng., 2006, pp. 231–240.

[8] C. Görg and P. Weißgerber, “Error detection by refactoring reconstruction,” in Proc. Int. Workshop Mining Softw. Repositories, 2005, pp. 1–5.

[9] M. Kim, D. Cai, and S. Kim, “An empirical investigation into the role of refactorings during software evolution,” in Proc. ACM IEEE 33rd Int. Conf. Softw. Eng., 2011, pp. 151–160.

[10] T. Mens and T. Tourwé, “A survey of software refactoring,” IEEE Trans. Softw. Eng., vol. 30, no. 2, pp. 126–139, Feb. 2004.

[11] T. Hastie, R. Tibshirani, and J. H. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York, NY, USA: Springer-Verlag, 2001.

[12] M. Kim, T. Zimmermann, and N. Nagappan, “Appendix to a field study of refactoring rationale, benefits, and challenges at Microsoft,” Microsoft Res., Tech. Rep. MSR-TR-2012-4, Redmond, WA, USA, 2012.

[13] R. Johnson, “Beyond behavior preservation,” Microsoft Faculty Summit 2011, Invited Talk, Jul. 2011.

[14] Y. Brun, R. Holmes, M. D. Ernst, and D. Notkin, “Proactive detection of collaboration conflicts,” in Proc. 19th ACM SIGSOFT Symp. 13th Eur. Conf. Found. Softw. Eng., 2011, pp. 168–178. [Online]. Available: http://doi.acm.org/10.1145/2025113.2025139

[15] C. Bird and T. Zimmermann, “Assessing the value of branches with what-if analysis,” in Proc. ACM SIGSOFT 20th Int. Symp. Found. Softw. Eng., 2012, pp. 45:1–45:11. [Online]. Available: http://doi.acm.org/10.1145/2393596.2393648

[16] M. Vakilian, N. Chen, S. Negara, B. A. Rajkumar, B. P. Bailey, and R. E. Johnson, “Use, disuse, and misuse of automated refactorings,” in Proc. 34th Int. Conf. Softw. Eng., Jun. 2012, pp. 233–243.

[17] A. Srivastava, J. Thiagarajan, and C. Schertz, “Efficient integration testing using dependency analysis,” Microsoft Res., Tech. Rep. MSR-TR-2005-94, Redmond, WA, USA, 2005.

[18] Y. Kataoka, T. Imai, H. Andou, and T. Fukaya, “A quantitative evaluation of maintainability enhancement by refactoring,” in Proc. Int. Conf. Softw. Maintenance, 2002, pp. 576–585.

[19] C. Izurieta and J. M. Bieman, “How software designs decay: A pilot study of pattern evolution,” in Proc. 1st Int. Symp. Empirical Softw. Eng. Meas., 2007, pp. 449–451.

[20] Y. Guo, C. Seaman, R. Gomes, A. Cavalcanti, G. Tonin, F. Q. B. Da Silva, A. L. M. Santos, and C. Siebra, “Tracking technical debt: An exploratory case study,” in Proc. 27th IEEE Int. Conf. Softw. Maintenance, Sep. 2011, pp. 528–531.

[21] W. Cunningham, “The WyCash portfolio management system,” in Addendum Proc. Object-Oriented Programm. Syst., Languages, Appl., 1992, pp. 29–30.

[22] N. Brown, Y. Cai, Y. Guo, R. Kazman, M. Kim, P. Kruchten, E. Lim, A. MacCormack, R. Nord, I. Ozkaya, R. Sangwan, C. Seaman, K. Sullivan, and N. Zazworka, “Managing technical debt in software-reliant systems,” in Proc. ACM FSE/SDP Workshop Future Softw. Eng. Res., 2010, pp. 47–52. [Online]. Available: http://doi.acm.org/10.1145/1882362.1882373

[23] R. Kolb, D. Muthig, T. Patzke, and K. Yamauchi. (Mar. 2006). Refactoring a legacy component for reuse in a software product line: A case study: Practice articles. J. Softw. Maintenance Evol. [Online]. 18, pp. 109–132. Available: http://dl.acm.org/citation.cfm?id=1133105.1133108

[24] M. Kim, T. Zimmermann, and N. Nagappan, “A field study of refactoring challenges and benefits,” in Proc. ACM SIGSOFT 20th Int. Symp. Found. Softw. Eng., 2012, pp. 50:1–50:11. [Online]. Available: http://doi.acm.org/10.1145/2393596.2393655

[25] P. Tarr, H. Ossher, W. Harrison, and S. M. Sutton, Jr., “N degrees of separation: Multi-dimensional separation of concerns,” in Proc. 21st Int. Conf. Softw. Eng., 1999, pp. 107–119.

[26] G. Kiczales, J. Lamping, A. Mendhekar, C. Maeda, C. Lopes, J.-M. Loingtier, and J. Irwin, “Aspect-oriented programming,” in Proc. Eur. Conf. Object-Oriented Programm., 1997, vol. 1241, pp. 220–242. [Online]. Available: citeseer.ist.psu.edu/kiczales97aspectoriented.html

[27] M. Eaddy, T. Zimmermann, K. D. Sherwood, V. Garg, G. C. Murphy, N. Nagappan, and A. V. Aho, “Do crosscutting concerns cause defects?” IEEE Trans. Softw. Eng., vol. 34, no. 4, pp. 497–515, Jul. 2008.

[28] R. J. Walker, S. Rawal, and J. Sillito, “Do crosscutting concerns cause modularity problems?” in Proc. ACM SIGSOFT 20th Int. Symp. Found. Softw. Eng., 2012, pp. 49:1–49:11. [Online]. Available: http://doi.acm.org/10.1145/2393596.2393654

[29] F. P. Brooks, The Mythical Man-Month: Essays on Software Engineering. Reading, MA, USA: Addison-Wesley, Aug. 1975. [Online]. Available: http://www.amazon.com/exec/obidos/redirect?tag=citeulike07-20&path=ASIN/0201835959

[30] E. Murphy-Hill, A. P. Black, D. Dig, and C. Parnin, “Gathering refactoring data: A comparison of four methods,” in Proc. 2nd ACM Workshop Refactoring Tools, 2008, pp. 1–5.

[31] T. J. McCabe, “A complexity measure,” IEEE Trans. Softw. Eng., vol. 2, no. 4, pp. 308–320, Jul. 1976.

[32] N. Nagappan, B. Murphy, and V. Basili, “The influence of organizational structure on software quality: An empirical case study,” in Proc. 30th ACM Int. Conf. Softw. Eng., 2008, pp. 521–530. [Online]. Available: http://doi.acm.org/10.1145/1368088.1368160

[33] G. Antoniol, K. Ayari, M. Di Penta, F. Khomh, and Y.-G. Guéhéneuc, “Is it a bug or an enhancement?: A text-based approach to classify change requests,” in Proc. ACM Conf. Center Adv. Stud. Collaborative Res.: Meeting Minds, 2008, pp. 304–318.

[34] M. E. Conway, “How do committees invent?” Datamation, vol. 14, no. 4, pp. 28–31, 1968.

[35] J. Herbsleb and R. Grinter, “Architectures, coordination, and distance: Conway’s law and beyond,” IEEE Softw., vol. 16, no. 5, pp. 63–70, Sep./Oct. 1999.

[36] S. Bailey, S. Godbole, C. Knutson, and J. Krein, “A decade of Conway’s law: A literature review from 2003–2012,” in Proc. 3rd Int. Workshop Replication Empir. Softw. Eng. Res., Oct. 2013, pp. 1–14.

[37] P. Dalgaard, Introductory Statistics with R, 2nd ed. New York, NY, USA: Springer-Verlag, 2008.

[38] E. Murphy-Hill, C. Parnin, and A. P. Black, “How we refactor, and how we know it,” in Proc. 31st Int. Conf. Softw. Eng., 2009, pp. 287–297.


[39] K. Sullivan, P. Chalasani, S. Jha, and V. Sazawal, Software Design as an Investment Activity: A Real Options Perspective in Real Options and Business Strategy: Applications to Decision Making. London, U.K.: Risk Books, Nov. 1999.

[40] C. Y. Baldwin and K. B. Clark, Design Rules: The Power of Modularity. Cambridge, MA, USA: MIT Press, 1999.

[41] Z. Xing and E. Stroulia, “UMLDiff: An algorithm for object-oriented design differencing,” in Proc. 20th IEEE/ACM Int. Conf. Autom. Softw. Eng., 2005, pp. 54–65.

[42] D. Dig and R. Johnson, “The role of refactorings in API evolution,” in Proc. 21st IEEE Int. Conf. Softw. Maintenance, 2005, pp. 389–398.

[43] A. MacCormack, J. Rusnak, and C. Y. Baldwin, “Exploring the structure of complex software designs: An empirical study of open source and proprietary code,” Manage. Sci., vol. 52, no. 7, pp. 1015–1030, 2006.

[44] Y.-G. Guéhéneuc and H. Albin-Amiot, “Using design patterns and constraints to automate the detection and correction of inter-class design defects,” in Proc. 39th Int. Conf. Exhib. Technol. Object-Oriented Languages Syst., 2001, pp. 296–305. [Online]. Available: http://dl.acm.org/citation.cfm?id=882501.884740

[45] R. Marinescu, “Detection strategies: Metrics-based rules for detecting design flaws,” in Proc. 20th IEEE Int. Conf. Softw. Maintenance, 2004, pp. 350–359. [Online]. Available: http://dl.acm.org/citation.cfm?id=1018431.1021443

[46] Y. Guo, C. Seaman, N. Zazworka, and F. Shull, “Domain-specific tailoring of code smells: An empirical study,” in Proc. 32nd ACM/IEEE Int. Conf. Softw. Eng., vol. 2, 2010, pp. 167–170. [Online]. Available: http://doi.acm.org/10.1145/1810295.1810321

[47] S. Wong, Y. Cai, M. Kim, and M. Dalton, “Detecting software modularity violations,” in Proc. ACM IEEE 33rd Int. Conf. Softw. Eng., 2011, pp. 411–420.

[48] R. Moser, A. Sillitti, P. Abrahamsson, and G. Succi, “Does refactoring improve reusability?” in Proc. 9th Int. Conf. Reuse Off-the-Shelf Compon., 2006, pp. 287–297.

[49] J. Carriere, R. Kazman, and I. Ozkaya, “A cost-benefit framework for making architectural decisions in a business context,” in Proc. 32nd ACM/IEEE Int. Conf. Softw. Eng., 2010, pp. 149–157.

[50] N. Nagappan and T. Ball, “Use of relative code churn measures to predict system defect density,” in Proc. 27th ACM Int. Conf. Softw. Eng., 2005, pp. 284–292.

[51] W. G. Griswold, “Program restructuring as an aid to software maintenance,” Ph.D. dissertation, Dept. Comput. Sci. Eng., Univ. Washington, Seattle, WA, USA, 1991.

[52] D. Roberts, J. Brant, and R. Johnson, “A refactoring tool for Smalltalk,” Theory Practice Object Syst., vol. 3, no. 4, pp. 253–263, 1997.

[53] B. Daniel, D. Dig, K. Garcia, and D. Marinov, “Automated testing of refactoring engines,” in Proc. 6th Joint Meet. Eur. Softw. Eng. Conf. ACM SIGSOFT Symp. Found. Softw. Eng., 2007, pp. 185–194.

[54] G. C. Murphy, M. Kersten, and L. Findlater. (2006, Jul.). How are Java software developers using the Eclipse IDE? IEEE Softw. [Online]. 23(4), pp. 76–83. Available: http://dx.doi.org/10.1109/MS.2006.105

[55] D. Dig and R. Johnson, “Automated detection of refactorings in evolving components,” in Proc. Eur. Conf. Object-Oriented Programm., 2006, pp. 404–428.

[56] K. Prete, N. Rachatasumrit, N. Sudan, and M. Kim, “Template-based reconstruction of complex refactorings,” in Proc. IEEE Int. Conf. Softw. Maintenance, Sep. 2010, pp. 1–10.

[57] T. Zimmermann and N. Nagappan, “Predicting defects using network analysis on dependency graphs,” in Proc. 30th Int. Conf. Softw. Eng., 2008, pp. 531–540. [Online]. Available: http://doi.acm.org/10.1145/1368088.1368161

[58] C. Bird, N. Nagappan, H. Gall, B. Murphy, and P. Devanbu, “Putting it all together: Using socio-technical networks to predict failures,” in Proc. 20th Int. Symp. Softw. Rel. Eng., 2009, pp. 109–119. [Online]. Available: http://dx.doi.org/10.1109/ISSRE.2009.17

Miryung Kim received the BS degree in computer science from the Korea Advanced Institute of Science and Technology in 2001. She received the PhD in computer science & engineering from the University of Washington in 2008 and then joined the faculty at the University of Texas at Austin, where she is currently an assistant professor. Her research interests include the area of software engineering, with a focus on improving developer productivity in evolving software systems. She received a Google Faculty Research Award in 2014, a US National Science Foundation CAREER Award in 2011, a Microsoft Software Engineering Innovation Foundation Award in 2011, an IBM Jazz Innovation Award in 2009, and an ACM SIGSOFT Distinguished Paper Award in 2010. She is a member of the IEEE.

Thomas Zimmermann received the PhD degree from Saarland University in 2008. He is a researcher in the Research in Software Engineering (RiSE) group at Microsoft Research Redmond, an adjunct assistant professor at the University of Calgary, and an affiliate faculty at the University of Washington. His research interests include empirical software engineering, mining software repositories, computer games, recommender systems, development tools, and social networking. He is best known for his research on systematic mining of version archives and bug databases to conduct empirical studies and to build tools to support developers and managers. He received ACM SIGSOFT Distinguished Paper Awards in 2007, 2008, and 2012. He is a member of the IEEE.

Nachiappan Nagappan received the PhD degree from North Carolina State University in 2005. He is a principal researcher in the Research in Software Engineering (RiSE) group at Microsoft Research Redmond. His research interests include software reliability, software metrics, software testing, and empirical software processes. His interdisciplinary research projects span the spectrum of software analytics, ranging from intelligent software design for games and identifying data center reliability to software engineering optimization for energy in mobile devices. He is a member of the IEEE.

