CloudBuild: Microsoft’s Distributed and Caching Build Service · PDF fileCloudBuild:...
Embed Size (px)
Transcript of CloudBuild: Microsoft’s Distributed and Caching Build Service · PDF fileCloudBuild:...
CloudBuild: Microsofts Distributed and CachingBuild Service
Hamed Esfahani, Jonas Fietz, Qi Ke, Alexei Kolomiets, Erica Lan,Erik Mavrinac, Wolfram Schulte, Newton Sanches, Srikanth Kandula
ABSTRACT_ousands of Microso engineers build and test hundreds of so-ware products several times a day. It is essential that this con-tinuous integration scales, guarantees short feedback cycles, andfunctions reliably with minimal human intervention. _is paperdescribes CloudBuild, the build service infrastructure developedwithin Microso over the last few years. CloudBuild is responsi-ble for all aspects of a continuous integration workow, includingbuilds, test and code analysis, as well as drops, package and sym-bol creation and storage. CloudBuild supports multiple build lan-guages as long as they fulll a coarse grained, le IO based contract.CloudBuild uses content based caching to run build-related tasksonly when needed. Lastly, it builds on many machines in parallel.CloudBuild oers a reliable build service in the presence of un-reliable components. It aims to rapidly onboard teams and hencehas to support non-deterministic build tools and specication lan-guages that under-declare dependencies. Wewill outlinehowwe ad-dressed these challenges and characterize the operations of Cloud-Build. CloudBuild has on-boarded hundreds of codebases withonly man-months of eort each. Some of these codebases are usedby thousands of developers. _e speed ups of build and test rangefrom 1.3 to 10, and service availability is 99%.
KeywordsBuild Systems, CloudBuild, Distributed Systems, Caching, Speci-cation Languages, Operations
1. INTRODUCTIONMicroso is rapidly adopting an agile methodology to enable a
much faster delivery cadence of days to weeks. For instance, Oce365,Visual StudioOnline and SQLAzurehave release cycles rangingfrom 3 months to 3 weeks. On the extreme end, Bings website shipsseveral times a day.Delivering rapidly requires that builds, which are at the core of
the inner loop of any engineering system, are fast and reliable. _ischallenge has given us an opportunity to design a new in-house in-frastructure for continuous integration known as CloudBuild. Itsdesign was motivated by these needs:Permission to make digital or hard copies of all or part of this work for personal orclassroom use is granted without fee provided that copies are not made or distributedfor profit or commercial advantage and that copies bear this notice and the full citationon the first page. Copyrights for components of this work owned by others than theauthor(s) must be honored. Abstracting with credit is permitted. To copy otherwise, orrepublish, to post on servers or to redistribute to lists, requires prior specific permissionand/or a fee. Request permissions from [email protected]
ICSE 16 Companion, May 14 - 22, 2016, Austin, TX,USA 2016 Copyright held by the owner/author(s). Publication rights licensed to ACM.ISBN 978-1-4503-4205-6/16/05. . . $15.00
execute builds, tests, and related tasks as fast as possible, on-board as many product groups as eortlessly as possible, integrate into existing workows, ensure high reliability of builds and their necessary infras-
tructure, reduce costs by avoiding separate build labs per organization, leverageMicrosos resources in the cloud for scale and elas-
ticity, and lastly consolidate disparate engineering eorts into one common
We found it dicult to simultaneously satisfy these requirements.Most build labs, for instance, have only a few dependencies on othersystems; they connect to one version control system or build dropstorage solution. To avoid separate build labs, CloudBuild sup-ports theunion of all lab dependencies andworkows. However, theconsequent interoperability issues can lead to lower reliability. Asanother instance of conicting requirements, product groups haveoptimized their builds (and build labs) over many years, oen ex-ploiting their particular product architecture and workow to opti-mize for incremental and parallel builds. Since CloudBuild oersa generic cloud-based solution across the various product groups,our current speedup improvements are less dramatic (not 100 forexample) relative to the highly optimized custom systems.When architecting CloudBuild, we had to make a crucial de-
cision. We found that existing build tools and specication lan-guageswere unsuitable for cached and parallel execution. In partic-ular, the languageswould under-specify dependencies and the toolsproduced non-deterministic outputs. Hence, while single machine(or single threaded) execution would run correctly, distributed ex-ecution and caching of task outputs could lead to unpredictable be-havior or limited speedup. A clean solution would be to design anew build specication language, rewrite all build specications inthis language and force tools to respect additional constraints. Sucheorts are underway [5, 27]. CloudBuild, however, is our quicksolution. _at is, here we show how to onboard existing specica-tion languages (and tools) on to a cloud-based parallel and cachedbuild system. _e key advantage with this choice is the speed of on-boarding. Builds and tests can be sped up and disparate labs can beconsolidated into the cloud quickly. We also found CloudBuild tobe able to cover a wide range of tools and specications. _e cost,however, is the rather heuristic nature of CloudBuild: our goal isnot to function correctly in every possible setting (of specicationand tool behavior). We mitigate this with customer education andhardening of CloudBuild to cover frequently recurring issues.We believe that our design decision has been successful. Cloud-
Build is currently being used by more than 4000 developers inBing, Exchange, SQL,OneDrive, Azure andOce. It executesmorethan 20K builds per day on up to 10K machines spread over several
data centers. Build and test speeds typically improved by 1.3 times to10 times compared to earlier lab based systems, sustaining an avail-ability of 99.9%. Reliability, dened as the number of requests thatare servedwithout a CloudBuild internal error, is better than 99%.CloudBuild connectswith three dierent version control and fourbinary storage systems. REST APIs and Azures Service Bus  areused for integration into multiple workow systems.
While CloudBuilds service design follows existing designs forlarge scale data-parallel systems [3, 8, 41], CloudBuilds build en-gine is unique in that it allows the execution of arbitrary build lan-guages and tools, even in the presence of under-specied dependen-cies and non-deterministic tools.
Given a build request, CloudBuild evaluates build specica-tions, typically expressed in Make or MsBuild les, to infer project-to-project level dependencies. _is means that CloudBuilds unitof scheduling is a coarse grained project. While not as ne grainedas other systems like Googles Bazel , the approach allows greaterreuse of pre-existing build specications,without having to compre-hend all the runtime semantics of the underlying build tools.CloudBuild can determine whether to execute a given project
or to reuse outputs captured during an earlier execution, and storedin a system-wide cache. _e determination is based on the evalua-tion of the pertinent source les and parent projects which is sum-marized into a cache lookup key for the project. _e approach tocompute this key ensures that the build output is deterministic andconsistent with other projects in the same build.Finally CloudBuild schedules work, i.e. the tasks for each
project that needs to be built, in a location-awaremanner over mul-tiple worker machines. Locality here implies two things: fresh-ness of the cache at the machine as well as the freshness of theirenlistment from source control. Without this locality, the build-preparation-time to congure the machine with the right set ofSDKs and sources before it can participate in the build will be sub-stantial. CloudBuild uses standard DAG scheduling logic. _etasks themselves have heterogeneous resource demands. _e de-pendencies between tasks are more arbitrary than in data parallelclusters (no shues for example). Furthermore, unlike the caseof data-parallel jobs where outputs are much smaller than the in-put (e.g., queries in TPC-DS  or TPC-H ), the outputs ofthe median build job are quite large since many libraries and exe-cutables are generated and packaged. _is requires CloudBuild tomore carefully pipeline the fetching of inputs with the execution oftasks.For a given task, CloudBuild creates a dedicated le system di-
rectory structure, on which build tools are to execute. We call thisa sandbox, and it addresses reliability issues due to under-specieddependencies and allows for multi-tenant builds. _is is similar inintent to Unixs chroot jail , though it is implemented at the ap-plication level. As we will explain later, cache and sandbox play to-gether to guarantee the absence of Frankenbuilds, i.e. buildswhereoutputs from dierent build jobs can combine in inconsistent waysdue to cache re-use.Despite allowing so much exibility for specications, the result-
ing CloudBuild system is fast and reliable. At Microso, we haveobserved that 70 to 90% of build artifacts can be shared; this num-ber varies across code bases. And CloudBuilds overhead for dis-tributed builds is smallwith enough builders, 95% of builds takeno more than 1.05 the total time cost of the most expensive pathin the directed task graph described in the build specications.CloudBuildsmain contribution is thus to allow for reliable, fast
builds in the presence of under-specied dependencies and non-deterministic tools. CloudBuilds engineering contribution is thatit is easy to use, scales to varying deman