AsiaBSDCon 2009 Accepted Paper.

Isolating Cluster Jobs for Performance and Predictability

Brooks Davis, Michael AuYeung
The Aerospace Corporation
El Segundo, CA
{brooks,mauyeung}@aero.org

Abstract

At The Aerospace Corporation, we run a large FreeBSD-based computing cluster to support engineering applications. These applications come in all shapes, sizes, and qualities of implementation. To support them and our diverse userbase we have been searching for ways to isolate jobs from one another in ways that are more effective than Unix time sharing and more fine grained than allocating whole nodes to jobs. In this paper we discuss the problem space and our efforts so far. These efforts include implementation of partial file system virtualization and CPU isolation using CPU sets.

1 Introduction

The Aerospace Corporation operates a federally funded research and development center in support of national-security, civil, and commercial space programs. Many of our 2400+ engineers use a variety of computing technologies to support their work. Engineering applications range from small models which are easily handled by desktops to parameter studies involving thousands of CPU hours and traditional, large scale parallel codes such as computational fluid dynamics and molecular modeling applications. Our primary resources used to support these large applications are computing clusters. Our current primary cluster, the Fellowship[Davis] cluster, consists of 352 dual-processor nodes with a total of 1392 cores running FreeBSD. Two additional clusters, beginning at 150 dual-processor nodes each, are being constructed to augment Fellowship.

As in any multiuser computing environment with limited resources, user competition for resources is a significant burden. Users want everything they need to do their job, right now. Unfortunately, other users may need those resources at the same time.

© 2008 The Aerospace Corporation. All trademarks, service marks, and trade names are the property of their respective owners.

Figure 1: Fellowship Circa February 2007

Thus, systems to arbitrate this resource contention are necessary. On Fellowship we have deployed the Sun Grid Engine[SGE] scheduler, which schedules batch jobs across the nodes.

In the next section we discuss the performance problems that can occur when sharing resources in a high performance computing cluster. We then discuss a range of possibilities to address these problems. We then explain the solutions we are investigating and describe our experiments with them. We then conclude with a discussion of potential future work.

2 The Trouble With Sharing

In the early days of computing, time sharing schemes were devised so that users could share computers. Early systems were batch oriented and allowed only a single program at once. Later systems allowed multiple users to use the system at the same time. Today, virtually all operating systems (except for a few small real-time or embedded operating systems) support time sharing to allow users to run multiple programs at once. Depending on the nature of the load imposed by those programs, time sharing may help or hinder performance. For instance, if processes are often blocked waiting for information from storage or the network, time sharing allows other processes to run when they cannot. On the other hand, if processes are performing pure computation, the overhead of performing context switches between the processes will impose a performance penalty by consuming cycles that could otherwise have been used for useful computation. A common worst case scenario is running the system out of main memory and thus being forced to swap processes out. All system resources can easily be consumed moving program state to and from disk.

In the area of high performance computing (HPC), time sharing is employed on individual systems by virtue of the fact that standard operating systems are generally used. Some scheduling systems provide coarse-grained time sharing (typically on the order of hours or minutes), but fine-grained time sharing is generally considered too expensive due to the cost of maintaining synchronization between machines. The typical mode of operation is that users submit a job requesting to run one or more processes. The scheduler allocates space on one or more systems and typically provides some assistance in starting up those processes. The individual processes then run in a standard time sharing environment on each host until completion. If each process is assigned its own system, then application performance will be highly predictable. On the other hand, if resources on these systems are heavily contended, applications may have unpredictable run times. This is particularly problematic in parallel applications where work is divided between processes in a static manner. In that case, the process which ends up on the most overloaded system will hold up the whole computation. While such designs are generally considered a bad idea today, they remain common in practice.

This sort of unfairness creates difficulties for users, administrators, and scheduling systems. Users cannot reliably predict when their job will finish, which forces them to provide inaccurate estimates of job completion time. This in turn leads to reduced scheduler efficiency because schedulers can only make optimal decisions when they know when current processes will finish and how long queued jobs will run.

3 The Range of Possibilities

A number of possible techniques exist to isolate user jobs from each other. These options include classic time sharing based approaches, single application sub-clusters, and full or partial virtualization.

3.1 Time Sharing and Gang Scheduling

The historical solution to resource sharing, significantly predating Unix[GE], is some form of time sharing. On a cluster this can either be done per-host or the cluster can be gang scheduled, with the whole cluster dedicated to a particular job at a given time. Either method has the advantage that more jobs than can be simultaneously run can work toward partial results instead of one job blocking the next entirely. On the down side, switching contexts between jobs has significant costs.

Time sharing on individual hosts is a standard feature of all modern operating systems, so it can be implemented with no administrator effort. However, completion times in the face of unknown resource contention are highly unpredictable. With parallel jobs things are worse because in many algorithms, a single node can slow down the whole computation. If a lack of predictability of run times is acceptable, this approach is very effective at maximizing resource utilization.

Gang scheduling suffers less from contention and unpredictability since resources are typically wholly dedicated to a job during the (usually long) time slice. Context switching costs are much higher because complex state such as network connections may need to be created and destroyed. Also, memory resources used by gang scheduled processes are likely to be pushed out when they are not scheduled, resulting in large swap-in penalties when rescheduled. The largest drawback of gang scheduling is the general lack of available implementations. This lack seems to indicate that the HPC community is not sufficiently interested in the capability to deal with the complex issues involved in making it work for arbitrary applications.

3.2 Single Application (Sub-)Clusters

An approach at the opposite end of the spectrum from time sharing is single application clusters. In this approach, users are either given a private cluster for a period of time or are allocated clusters on demand using systems like Emulab or Sun's Hedeby Project[Emulab, Hedeby]. Some labs with annual allocations for resources use the first variant.

This allows users complete control over how they run their jobs. They can let the systems run relatively idle for maximum predictability or heavily oversubscribe the system to get early results from many experiments. Since users are given their own systems, this allows more complete data security than a conventional multi-user system. On the down side, users may oversubscribe nodes to the point of reducing overall throughput due to the cost of swapping. There is also no easy way to harvest underutilized resources unless users have a mix of job types. Additionally, this method only works well for relatively large jobs or projects.

3.3 Virtualization

One method of implementing per-application sub-clusters is to use a virtualization system such as VMWare, Xen, or VirtualBox[VMWare, Xen, Virtualbox]. These systems can be deployed in ways that allow very rapid deployment of system images, which means that virtual clusters can be deployed rapidly, potentially based on scheduler demand. Because multiple virtual images can be run on one machine at the same time, jobs can be given isolated environments with dependable performance while still allowing other uses such as cycle scavenging applications. Other benefits of virtualized systems include the ability to hide hardware details from applications and for applications to run in specially tailored operating environments, for example running with an uncommon operating system version or configuration.

Virtualization does have a number of downsides which offset some of these benefits. Because hardware is hidden from the virtualized operating system, the OS cannot take full advantage of it. A common example of this is high performance network devices. New network cards such as those from Neterion support allocating hardware queues to particular virtual images which can help in this area, but most current devices do not have such features. Additionally, there is some CPU, memory, and storage overhead from virtualization: CPU overhead is typically in the 5-20% range, whereas memory and storage overhead is often up to 100%. There may also be problems with licensing software in a virtualized environment, but this situation is improving over time.

3.4 Virtual Private Servers

The Internet hosting industry has developed an alternative to full virtualization which they use to provide virtual private servers. Their solution is to provide operating system enhancements which allow partitioning sets of processes off from the main system and other partitions. Implementations typically start with the file system virtualization provided by the chroot() system call and add new features to further isolate processes and prevent escape from the virtual root. The FreeBSD jail system is one such implementation. It restricts processes to a single IP address, reduces their ability to see processes and other objects outside a jail, and adds a set of restrictions on the root user within the jail. Another widely available implementation of this type of functionality is Solaris Zones, which also includes CPU and memory use restrictions. Many hosting companies have developed their own versions in house.

The main advantage of these techniques is that the overhead is low compared to full virtualization. Some ISPs run hundreds or even thousands of virtual private servers on a single machine that could only handle tens of Xen or VMWare sessions. Additionally, the administrator or developer can choose how much they want to virtualize. For instance, multiple FreeBSD jails can share the same view of the file system. This allows processes to be separated and to run on different networks, but lets them share files.

The primary downside of this approach is that, since virtualization is incomplete, users have the ability to affect each other's processes to a greater extent than with full virtualization. An additional downside compared to full virtualization is that while virtual environments can differ, they are much more constrained in how they do so than with full virtualization. For example, a FreeBSD jail can include a system image of an older version of FreeBSD (with some limitations) or from some Linux versions, but there is no support for Solaris or Windows images.

3.5 Resource Limits or Partitions

All modern Unix-like systems implement some sort of per-process resource limits. These generally include current memory use, total CPU time, number of open files, etc. The goal of these limits is to keep poorly designed programs or badly behaved users from exhausting system resources. Most cluster schedulers, including Sun Grid Engine, use these limits to partially enforce constraints on jobs. SGE cannot rely on the limits alone because processes are allowed to create child processes which are subject to the limits independently. As a result, SGE tracks process resource use and keeps its own list of total use for each job to enforce limits as needed.
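
As a concrete illustration of the per-process nature of these limits (this fragment is ours, not part of SGE), the following Ruby snippet sets a CPU-time limit; a forked child inherits the same limit but is charged against its own copy of it, which is exactly why SGE must keep per-job totals itself:

    # Illustration only: cap CPU time at one hour (soft and hard limit).
    # Each child process created afterwards inherits the limit but burns
    # its CPU time against its own independent budget.
    Process.setrlimit(Process::RLIMIT_CPU, 3600, 3600)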

Some operating systems provide the additional ability to limit resources for sets of processes. Irix has the concept of jobs which are subject to a collective resource limit[SGI-job]. Similarly, Solaris 9 has a concept of tasks[Sun]. A scheduler could take advantage of these limits to implement limits on job components. In general these limits are limits on total use over time or on resource growth and as such do not protect against temporary exhaustion of resources such as CPU cycles or bandwidth to memory, disk, or network accessible resources. This means that they do not generally protect the predictability of application performance. Some exceptions such as per-process bandwidth limits have been demonstrated, but they are not widely deployed.

In addition to placing limits on resource use, some resources can be partitioned. For instance, most modern operating systems allow processes and their children to be tied to one or more CPUs. If this functionality is used to tie each job to a unique set of CPUs, the ability of processes to interfere with each other will be significantly reduced. SGE once implemented such an allocation on some platforms, but the support was primarily implemented on Irix and no longer works. The biggest downside of this approach is that not all resources may be easily partitioned and often each resource must be partitioned through a different interface.

4 Our Experiments

With our diverse set of user applications, we need an approach to job isolation that works well for jobs using between one and several hundred processes with run times ranging from minutes to days and occasionally weeks. After examining the pros and cons of the available techniques, we decided to investigate approaches based on resource partitioning and virtual private servers. We felt that deploying application specific sub-clusters was attractive from many perspectives, but it is too coarse-grained without cheap, efficient virtualization. Our experience has shown that simply letting the OS try to handle the problem is fairly effective most of the time, and thus we believe that appropriate partitions will solve many of our problems. One common issue is poorly designed jobs that consume more than their allotment of CPU cycles. Another problem is applications accidentally consuming all of a key resource such as local disk space. For example, an application was once filling /tmp on cluster nodes, leading to an array of difficult to diagnose problems such as programs failing to perform necessary interprocess communications due to an inability to create Unix domain sockets.

We have been experimenting in a few areas of resource partitioning. These include per-job memory backed temporary files, binding of jobs to sets of CPUs, and partial file system virtualization using variant symbolic links. We have implemented these on our cluster using a wrapper script around the SGE shepherd program.

4.1 SGE Shepherd Wrapper

The SGE shepherd program (sge_shepherd) is executed by the SGE execution daemon (sge_execd) to manage and monitor the execution of the portions of an SGE job residing on a particular host. It handles both direct starting of single jobs and remote starting of components of parallel jobs. It has the task of setting resource limits on processes and tracking, monitoring, and reporting on their resource use throughout the life of the job. We initially planned to modify the shepherd's job startup code to implement further resource restrictions, but fortunately discovered that SGE provides the ability to specify an alternate command that can act as a wrapper around the actual sge_shepherd program. We have implemented a modular wrapper system in Ruby. Using a scripting language makes development simpler and allows us to easily leverage both library calls and command line tools.

The wrapper system allows wrapper modules to be registered and then invokes callbacks on them to set themselves up, to adjust the shepherd command line, and to clean up after the shepherd exits. This allows changes to be made such as mounting temporary file systems before execution and allows programs like env, chroot, or jail to wrap the sge_shepherd program and change its environment directly.
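
The actual registration API of our wrapper is not reproduced in this paper; the Ruby skeleton below is only a sketch of the shape of such a module, with the class name, hook names, and registry call chosen for illustration.

    # Hypothetical wrapper module skeleton; names are illustrative.
    class ExampleWrapper
      def precmd(job_config)
        # set up before sge_shepherd starts, e.g. mount a file system
      end

      def cmdwrapper(argv)
        # optionally rewrite the shepherd command line, e.g. prepend
        # env(1), chroot(8), or jail(8) and their arguments
        argv
      end

      def postcmd(job_config)
        # clean up after sge_shepherd exits, e.g. unmount the file system
      end
    end

    WrapperRegistry.register(ExampleWrapper.new)  # assumed registry API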

4.2 Memory Backed Temporary Directories

The first wrapper we implemented extends the SGE feature of automatically created per-job temporary directories. By default, the SGE execution daemon creates a per-job sub-directory in an administrator defined directory and passes the path to the shepherd as part of its configuration. The shepherd then sets the TMPDIR environment variable so that well designed applications place any temporary files in this directory. The directory is removed at the end of the job's execution so stray files left by crashes or inadequate cleanup algorithms are automatically removed. This feature is very helpful, but we found it does not go far enough in some cases. The problem is that all these temporary directories live on the same file system and thus a single job can run all jobs out of temporary space.
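
For example, a well behaved job written in Ruby would derive its scratch paths from that variable rather than hard coding /tmp (file name chosen for illustration):

    # Respect the per-job temporary directory SGE provides; fall back to
    # /tmp only when running outside the scheduler.
    scratch_dir = ENV.fetch("TMPDIR", "/tmp")
    scratch_file = File.join(scratch_dir, "intermediate-results.dat")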

We have solved this problem with the mdtmpdir wrapper that mounts a memory backed file system (specifically a swap-backed md device with UFS) over the top of the temporary directory. We have made the size a user requestable resource. We currently set a hard limit on the total allocation size, but intend to replace that with an SGE load sensor based on swap levels in the future. The module registers a precmd hook which creates, formats, and mounts the file system and a postcmd hook which removes it. As a side benefit, many applications will see vastly improved temporary file performance due to the use of memory disks. We plan to deploy this hook on our cluster soon.
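
The essential steps are sketched below using the standard FreeBSD tools; the hook names, their arguments, and the lack of error handling are simplifications for illustration rather than the wrapper's actual code.

    # precmd: attach a swap-backed memory disk, format it, and mount it
    # over the per-job temporary directory (sketch only).
    def mdtmpdir_precmd(tmpdir, size)
      unit = `mdconfig -a -t swap -s #{size}`.strip   # e.g. "md0"
      system("newfs", "-U", "/dev/#{unit}") or raise "newfs failed"
      system("mount", "/dev/#{unit}", tmpdir) or raise "mount failed"
      unit
    end

    # postcmd: unmount and destroy the memory disk when the job ends.
    def mdtmpdir_postcmd(tmpdir, unit)
      system("umount", tmpdir)
      system("mdconfig", "-d", "-u", unit.sub(/^md/, ""))
    end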

4.3 Variant Symbolic Links

While the memory TMPDIR wrapper solves a number of problems for us, it is not a complete solution to the problem of shared temporary space. It is not uncommon for applications, particularly locally written scripts or older code, to hard code /tmp in paths. This can lead to running out of space in /tmp despite the existence of TMPDIR. Worse, some applications hard code full paths including /tmp, which can result in corrupt results or bizarre failures if multiple application instances attempt to use the same paths. To work around this, we would like to virtualize /tmp for SGE jobs using variant symbolic links.

Variant symbolic links are a concept dating to at least Apollo Domain/OS[Wikipedia]. The idea is to make special symbolic links which contain variables that are expanded at run time and thus can point to different places for different processes. We have ported the variant symlink implementation from DragonFly BSD[DragonFly] to FreeBSD with significant modifications. These modifications include the ability to specify default values for links with the %{VARIABLE:default-value} syntax and a more secure precedence order between the variable scopes. In our implementation, system-wide variables override per-process ones rather than the other way around. This means there is no risk of setuid binaries loading libraries from user controlled locations due to symlinks.

We intend to replace /tmp with a symlink like %{TMPDIR:/var/tmp}. We will then set the per-process TMPDIR variable in the wrapper script so that it points to the memory backed TMPDIR. That will result in all processes other than SGE jobs seeing /tmp as a symlink to /var/tmp, but SGE processes seeing it as a link to their own private directory. This configuration should fully resolve issues with contention over /tmp space (I/O bandwidth is another story).
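
For illustration only, node setup might create the link as follows (assuming the real /tmp directory has already been moved aside); the mechanism the wrapper then uses to set the per-process TMPDIR variant variable depends on the interface of our port and is not shown here.

    # Sketch only: make /tmp a variant symlink that resolves to /var/tmp
    # for any process that has no TMPDIR variant variable set.
    File.symlink("%{TMPDIR:/var/tmp}", "/tmp")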

4.4 CPU Set Allocator

With the introduction of CPU sets[FreeBSD-cpuset] and CPU affinity[FreeBSD-cpuaff] support in FreeBSD 7.1, it is possible to tie groups of processes to groups of CPUs. We have leveraged this support to allocate independent sets of CPUs to each job.

Our CPU set allocation algorithm is a relatively naive recursive one which does not take cache affinity into account. The algorithm first attempts to find the smallest available range that the request will fit into. If that fails, it allocates the largest range and recursively requests an allocation for the remaining request size. We hope this results in minimal fragmentation over time, but we have not performed a detailed analysis.
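
The sketch below captures the strategy just described; it is a simplification of our module, and the function name and the free-range representation are made up for illustration.

    # Allocate `want` CPUs from `free_ranges`, a list of [first, last]
    # pairs of unallocated CPU ids.  Returns a list of allocated ranges,
    # or nil if the request cannot be satisfied.
    def allocate_cpus(free_ranges, want)
      # Prefer the smallest free range the whole request fits into.
      fitting = free_ranges.select { |lo, hi| hi - lo + 1 >= want }
      best = fitting.min_by { |lo, hi| hi - lo + 1 }
      return [[best[0], best[0] + want - 1]] if best

      # Otherwise take the largest range and recurse for the remainder.
      largest = free_ranges.max_by { |lo, hi| hi - lo + 1 }
      return nil if largest.nil?
      lo, hi = largest
      rest = allocate_cpus(free_ranges - [largest], want - (hi - lo + 1))
      rest && [[lo, hi]] + rest
    end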

We store CPU set allocations in /var/run/sge_cpuset and lock the file when allocating or deallocating ranges. The module registers a precmd hook which handles range allocation, a postcmd hook which deallocates ranges, and a cmdwrapper hook which invokes the cpuset(1) utility to restrict the sge_shepherd process and thus its children to the allocated CPU range.
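
The cmdwrapper step amounts to prefixing the shepherd command line, roughly as follows; the hook name and arguments are illustrative rather than our module's actual interface.

    # Confine sge_shepherd, and therefore every process it spawns for the
    # job, to the allocated CPUs, e.g. cpu_list = "4-7".
    def cpuset_cmdwrapper(shepherd_argv, cpu_list)
      ["cpuset", "-l", cpu_list] + shepherd_argv
    end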

Since the point of the CPU set allocator is to isolate performance impacts, we conducted an experiment to verify this effect. Our test platform consisted of a system with dual Intel Xeon E5430 CPUs running at 2.66GHz for a total of 8 cores. The system has 16 GB of RAM and four 1TB SATA disks, neither of which should have a measurable impact on benchmark results. The system is running FreeBSD 7.1-PRERELEASE amd64, corresponding to subversion revision r182969. Sun Grid Engine 6.2 is installed. For our benchmark, we chose an implementation of the nqueens problem installed from the FreeBSD Ports Collection as nqueens-1.0. The executable is named qn24b_base. The program finds all possible layouts of N queens placed on an N×N chess board where no queen can capture another. Program performance is almost entirely dependent on integer CPU performance.

Our benchmark runs consisted of running one instance at a time of “qn24b_base 18” via SGE job submissions and using the reported elapsed time as our timing result. To provide competing load, we ran 0, 7, or 8 instances of “qn24b_base 20” in an SGE job which requested 7 of the 8 slots on the system. Each load process ran for more than an hour.

A summary of our results is presented in Table 1. The 7 load process case without CPU sets shows the expected result that fully loading the system yields a small performance degradation and an increase in variability. Likewise, the 8 load process case yields a significant performance degradation and a corresponding increase in variability. With CPU sets enabled, no statistically valid difference can be demonstrated between the baseline and the 7 and 8 load process test cases, and the variability of run times drops appreciably. These results demonstrate the basic validity of our approach and confirm our hypothesis that CPU sets will benefit programs even when a node is not overloaded.

Table 1: Results of CPU set benchmark runs

                                               Without CPU sets       With CPU sets
                                    Baseline   7 Load    8 Load      7 Load    8 Load
                                                Procs     Procs       Procs     Procs
  Runs                                     8        8        17          11        12
  Average Run Time (sec)*             345.73   347.32    393.35      346.63    346.74
  Standard Deviation                    0.21     0.64      14.6        0.05      0.04
  Difference from Baseline                       0.59     46.63           †         †
  Margin of Error                                0.51     10.81           †         †
  Percent Difference from Baseline              0.17%    13.45%           †         †

  * Smaller numbers are better.   † No difference at 95% confidence.

5 Future Work

A number of areas of possible future work exist. Extending the current SGE shepherd wrapper framework to add more partitioning would be generally helpful. One we are particularly interested in is adding a wrapper to place jobs in individual chroots or jails, possibly with different user requested OS versions. This could enable a number of useful things, including upgrading the kernel to take advantage of new performance features without exposing the users to disruptive userland changes. Users could also potentially request Linux jails for their jobs, which would allow easier deployment of commercial applications. Potentially, tools such as DTrace could be used to analyze these jobs, something not possible on Linux due to licensing compatibility issues.

Other areas worth investigating are the creation of per-job VLANs for parallel jobs so jobs would be isolated from each other on the network, and using the mandatory access control framework to better isolate jobs without the use of jails. Using features such as dummynet to allocate network bandwidth to jobs might also be interesting. For MPI traffic this would be fairly easy, but NFS would be more difficult due to the need to account for the actual process making the request. Similarly, it would be useful if disk bandwidth could be reserved, similar to the guaranteed-rate I/O mechanisms in Irix[SGE-grio].

Also on the FreeBSD side, adding Irix-like job resource limits or Solaris-like tasks would simplify tracking certain limits in SGE and could provide a useful place to hang per-job rate limits.

All in all, the use of resource partitioning and techniques from the virtual private server space is a profitable space for further exploration in the quest to make clusters more useful and more manageable. We intend to continue exploring this area.

References

[Davis] Brooks Davis, Michael AuYeung, J. Matt Clark, Craig Lee, James Palko, Mark Thomas, Reflections on Building a High-performance Computing Cluster Using FreeBSD. Proceedings, AsiaBSDCon 2007.

[DragonFly] DragonFly BSD http://www.dragonflybsd.org/about/history.shtml

[FreeBSD-cpuaff] The FreeBSD Project, cpuset_getaffinity(2), FreeBSD System Calls Manual, http://www.freebsd.org/cgi/man.cgi?query=cpuset_getaffinity&manpath=FreeBSD+8-current, March 29, 2008.

[FreeBSD-cpuset] The FreeBSD Project, cpuset(2), FreeBSD System Calls Manual, http://www.freebsd.org/cgi/man.cgi?query=cpuset&manpath=FreeBSD+8-current, March 29, 2008.

[Emulab] White, Lepreau, Stoller, Ricci, Guruprasad, Newbold, Hibler, Barb, and Joglekar, An Integrated Experimental Environment for Distributed Systems and Networks, appeared at OSDI 2002, December 2002.

[GE] General Electric Computer Dept. Laboratory, The Dartmouth Time-Sharing System – A Brief Description, http://www.dtss.org/ge_dtss.php, Sunnyvale, California, 26 March 1965.

[Hedeby] The Hedeby Project, http://hedeby.sunsource.net/

[SGE] Sun Grid Engine Project, http://gridengine.sunsource.net/

[SGI-job] SGI, Inc., job_limits(5), IRIX Admin: Resource Administration, http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi?cmd=getdoc&coll=0650&db=man&fname=5%20job_limits

[SGE-grio] SGI, Inc., grio(5), IRIX Admin: Man Pages, http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi?cmd=getdoc&coll=0650&db=man&fname=5%20grio

[Sun] Sun Microsystems, System Administration Guide: Solaris Containers-Resource Management and Solaris Zones, 2008.

[Wikipedia] Wikipedia contributors, “Symbolic link,” Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/w/index.php?title=Symbolic_link&oldid=234623190 (accessed September 19, 2008).

[Virtualbox] Sun xVM VirtualBox http://www.sun.com/software/products/virtualbox/

[VMWare] VMWare http://vmware.com/

[Xen] Xen Hypervisor http://xen.org/