
A Compiler and Runtime for Heterogeneous Computing

Joshua Auerbach  David F. Bacon  Ioana Burcea  Perry Cheng  Stephen J. Fink  Rodric Rabbah  Sunil Shukla

IBM Thomas J. Watson Research Center
{josh, dfb, ioana, perry, sjfink, rabbah, skshukla}@us.ibm.com

ABSTRACT

Heterogeneous systems show a lot of promise for extracting high performance by combining the benefits of conventional architectures with specialized accelerators in the form of graphics processors (GPUs) and reconfigurable hardware (FPGAs). Extracting this performance often entails programming in disparate languages and models, making it hard for a programmer to work equally well on all aspects of an application. Further, relatively little attention is paid to co-execution—the problem of orchestrating program execution using multiple distinct computational elements that work seamlessly together.

We present Liquid Metal, a comprehensive compiler and runtime system for a new programming language called Lime. Our work enables the use of a single language for programming heterogeneous computing platforms, and the seamless co-execution of the resultant programs on CPUs and accelerators that include GPUs and FPGAs.

We have developed a number of Lime applications, and successfully compiled some of these for co-execution on various GPU- and FPGA-enabled architectures. Our experience so far leads us to believe the Liquid Metal approach is promising and can make the computational power of heterogeneous architectures more easily accessible to mainstream programmers.

Categories and Subject Descriptors

D.3.3 [Programming Languages]: Language Constructs and Features; B.6.3 [Logic Design]: Design Aids

General Terms

Design, Languages

Keywords

Heterogeneous, GPU, FPGA, Streaming, Java

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. DAC 2012, June 3-7, San Francisco, California, USA. Copyright 2012 ACM 978-1-4503-1199-1/12/06 ...$10.00.

1. INTRODUCTION

The mixture of computational elements that make up a processor is increasingly heterogeneous. There are already processors that couple a conventional general purpose CPU with a graphics processor (GPU), and there is increasing use of the GPU for more than graphics processing because of its exceptional computational abilities at particular problems. In this way, a computer with a CPU and a GPU is heterogeneous because it offers a specialized computational element (the GPU) for computational tasks that suit its architecture, and a truly general purpose computational element (the CPU) for all other tasks (including, if needed, the computational tasks that are well suited for the GPU). The GPU is an example of a hardware accelerator. Other forms of hardware accelerators are gaining wider consideration, including field programmable gate arrays (FPGAs) and fixed-function accelerators.

Programming technologies exist for CPUs, GPUs, FPGAs, and various accelerators in isolation. For example, programming languages for a GPU include OpenMP, OpenCL [5], and CUDA, all of which can be viewed as extensions of the C programming language. A GPU-specific compiler inputs a program written in one of these languages, and preprocesses the program to separate the GPU-specific code (hereinafter referred to as device code) from the remaining program code (hereinafter referred to as the host code). The device code is typically recognized by the presence of explicit device-specific language extensions or compiler directives (e.g., pragma), or syntax (e.g., kernel launch with <<<...>>> in CUDA). The device code is further translated and compiled into device-specific machine code (hereinafter referred to as an artifact). The host code is modified as part of the compilation process to invoke the device artifact when the program executes. The device artifact may either be embedded into the host machine code, or it may exist in a repository and be identified via a unique identifier that is part of the invocation process. Programming solutions for heterogeneous computers that include FPGAs are comparable to GPU programming solutions. However, FPGAs do not yet enjoy the benefits of a widely accepted C dialect.

Regardless of the heterogeneous mix of processing elements in a computer, the programming methodology to date is generally similar and shares the following characteristics. First, the disparate languages or dialects in which different architectures must be programmed make it hard for a single programmer or programming team to work equally well on all aspects of a project. Second, relatively little attention is paid to co-execution—the problem of orchestrating program execution using multiple distinct computational elements that work seamlessly together. Co-execution requires (1) partitioning a program into tasks that can be mapped to the computational elements, (2) scheduling the tasks onto the computational elements, and (3) handling the communication between computational elements, which in itself involves serializing data and preparing it for transmission, routing data between processors, and receiving and deserializing data. Given the complexities associated with orchestrating the execution of a program on a heterogeneous computer, a very early static decision must be made on what will execute where, a decision that is complex and also costly to revisit as a project evolves. This is exacerbated by the fact that some of the accelerators, for example FPGAs, are difficult to program well and place a heavy engineering burden on programmers.

This paper presents Liquid Metal, a comprehensive compiler infrastructure and runtime system that together enable the use of a single language for programming computing systems with heterogeneous accelerators, and the co-execution of the resultant programs on such architectures. Using Liquid Metal, a program is lowered into an intermediate representation that describes the computation as independent but interconnected computational nodes. Liquid Metal then gives a series of quasi-independent backend compilers a chance to compile one or more groups of computational nodes for different target architectures. Most backend compilers are under no obligation to compile everything. However, the CPU compiler always compiles the entire program, guaranteeing that every node has at least one implementation.

The result of a compilation with Liquid Metal is a collection of artifacts for different architectures, each labeled with the particular computational node that it implements. As a consequence, the runtime can choose from a large number of functionally-equivalent configurations depending on which nodes are activated during execution. In this way, Liquid Metal solves the problem of a premature static partitioning of an application by offering a dynamic partitioning of the program across available processing elements, targeting CPUs, GPUs, and FPGAs. The advantage of runtime partitioning is that it is not permanent, and it may adapt to changes in the program workloads, program phase changes, availability of resources, and other fine-grained dynamic features.

Liquid Metal source files are programmed using a new language called Lime [1], which is a Java-compatible object-oriented language, with new features that make it feasible to (1) define a unit of migration between CPUs and accelerators, (2) statically compile efficient code for the various accelerators in a given heterogeneous system, and (3) orchestrate the dataflow across the system. We have developed a number of benchmarks in Lime and successfully compiled and run these on CPU+GPU and CPU+FPGA systems. We present an overview of Lime in the following section, with an emphasis on those particular language features that facilitate application development for heterogeneous architectures. Then we describe the Liquid Metal compiler and runtime system, and highlight a development flow using Lime. We conclude with related work and future directions.

2. LIME LANGUAGE OVERVIEW

The foundation for the Liquid Metal compiler and runtime is a new language called Lime, a general purpose object-oriented language with functional and stream-oriented programming features. Lime is Java-compatible and, as a consequence, it enables an evolutionary migration of existing Java code to Lime, exploiting the new language features where most profitable first. Lime is designed to ease the programming burden for heterogeneous architectures via (1) strong isolation and (2) abstract parallelism with explicit communication and computation operators.

2.1 Strong Isolation

Strong isolation allows the Liquid Metal runtime to safely migrate computation between heterogeneous processors. Isolation is realized using a combination of orthogonal language features called value and local. The former is a type modifier which declares that any instance of the type is recursively immutable. That is, a value cannot mutate once it is created. All primitive types in Lime (e.g., int, float) are also values. The local modifier is attributed to method declarations, and requires that local methods only call other local methods and access fields that are either reachable from the method arguments, fields of the object instance itself (if the method is not static), or compile-time constants.

    01 public value enum bit {
    02   zero, one;
    03   public bit ~ this {
    04     return this == zero ? one : zero;
    05   }
    06 }
    07 public class Bitflip {
    08   local static bit flip(bit b) {
    09     return ~b;
    10   }
    11   local static bit[[]] mapFlip(bit[[]] input) {
    12     var flipped = Bitflip @ flip(input);
    13     return flipped;
    14   }
    15   static bit[[]] taskFlip(bit[[]] input) {
    16     bit[] result = new bit[input.length];
    17     var flipit = input.source(1)
    18       => ([ task flip ])
    19       => result.<bit>sink();
    20     flipit.finish();
    21     return new bit[[]](result);
    22   }
    23 }

Figure 1: Lime examples.

An example of a value type is shown in Figure 1, lines 1-6. The type bit is an enumeration of two values, zero and one. Unlike Java enumerations, which are mutable, this type is declared immutable using the value keyword. The methods of a value type are local by default. In contrast, the method flip in Figure 1, line 8, is explicitly declared local because Bitflip is not a value class and its methods are global by default. A global method may perform side-effecting operations, including I/O.
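The contrast with Java can be made concrete in plain Java (an illustrative analogue for exposition, not Lime code; the type names are invented here): a Java enum constant may carry mutable state, which is exactly what the value modifier rules out.

```java
// Hypothetical Java analogue of Lime's bit type. A plain Java enum
// permits mutable instance fields, so this type would NOT qualify
// as a value in Lime's sense.
enum MutableBit {
    ZERO, ONE;
    int useCount = 0;      // mutable state shared via the enum constant
    MutableBit flip() {
        useCount++;        // side effect: legal Java, forbidden in a Lime value
        return this == ZERO ? ONE : ZERO;
    }
}

// An immutable rendering closer to Lime's bit: no mutable fields,
// and flip is a pure function of its receiver.
enum Bit {
    ZERO, ONE;
    Bit flip() { return this == ZERO ? ONE : ZERO; }
}
```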

The value and local modifiers provide practical invariants for the Liquid Metal compiler and runtime to rely on and exploit. Namely, a local method whose arguments are values is pure (i.e., stateless) if it is either a static method (e.g., flip) or an instance method of a value type (e.g., bit ~). Pure methods are candidates for migration between processors in a heterogeneous architecture because they provide implementation freedom. On a GPU or CPU, the compiler and runtime can invoke threads as needed to apply pure methods and scale up throughput. Similarly, on an FPGA, where methods may be embodied as modules, the compiler may instantiate modules as needed to scale up throughput.

Stateful instance methods are also candidates for co-execution if they are local and the object instance is constructed using an isolating constructor: a local constructor with value arguments. Unlike pure methods, which provide data-parallelism, stateful methods require the exploitation of pipeline-parallelism to achieve higher throughput. Lime provides additional language features to facilitate the expression of data and pipeline parallelism.

2.2 Abstract Parallelism

There are two primary ways of expressing parallelism in Lime. These are illustrated in the methods mapFlip and taskFlip in Figure 1. The first is an example using the Lime map operator "@" (line 12) which, in this case, applies the method flip (lines 8-10) to every element of the input array to produce a new bit array of the same size. The argument to the method mapFlip is a value array (denoted using [[]]) and hence its elements are read-only and cannot be assigned. One may call the method using a bit array or bit literal. Lime provides bit literals as syntactic sugar to compactly define bit arrays because of their prevalence in FPGA designs. For example, the bit literal 100b is a 3-bit array where bit[0]=0 and bit[2]=1. The result of mapFlip(100b) is a bit array equal to the bit literal 011b.
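The element-wise semantics of the "@" map operator can be modeled in plain Java (a sketch for exposition, not the Liquid Metal implementation; ints stand in for Lime's bit values, and the class name is invented):

```java
import java.util.stream.IntStream;

// Plain-Java model of mapping a pure method over a value array, as
// Lime's "@" operator does with flip. Because flip has no side
// effects, every element can safely be computed in parallel.
class MapFlipModel {
    static int flip(int b) { return b == 0 ? 1 : 0; }

    static int[] mapFlip(int[] input) {
        return IntStream.range(0, input.length)
                        .parallel()              // safe: flip is pure
                        .map(i -> flip(input[i]))
                        .toArray();
    }
}
```

Modeling the bit literal 100b as the array {0, 0, 1} (index 0 first), mapFlip returns {1, 1, 0}, i.e., the literal 011b.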

When the map operator is a local static method applied to value arguments, the compiler can infer data-parallelism and optimize the implementation accordingly. In Liquid Metal, the map and reduce (not shown) operators are exploited heavily for optimizing code for co-execution on a GPU. We achieved end-to-end speedups of 12×–431× for a number of benchmarks co-executing between CPU and GPU using an NVidia GTX580 (Fermi) [3].

Pipeline-parallelism in Lime is expressed using explicit operators to create tasks (i.e., pipeline stages) and connect them so data flows between them. A 3-stage pipeline is shown on lines 17-19 in Figure 1 as an example. The first stage is a "source" task that produces a stream of bits, one bit at a time. The second stage applies the flip method, which flips one bit at a time. The third and final "sink" stage accumulates the bits into a new bit array, one bit at a time. The source and sink tasks use utility methods provided by Lime array types. The flip task (line 18) explicitly uses the Lime task operator. When applied to a static method, as in this case, the result is a dataflow actor that repeatedly applies the named method. The actor consumes data from its input port and produces data to its output stream, applying the named method when the port contains sufficient data to satisfy the argument requirements of the method. The connect operator "=>" connects tasks so values flow between them. Hence, every bit produced by the source (line 17) flows to the flip task (line 18), and the result of the flip task flows to the sink task (line 19). Connected tasks are called task graphs in Lime.
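The source => filter => sink structure can be sketched in plain Java with one thread per stage and bounded queues for the connections (a model of the dataflow semantics under stated assumptions, not the Liquid Metal runtime; the class name, queue capacities, and end-of-stream convention are invented here):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Three-stage pipeline model: a source emits bits one at a time, a
// filter flips each bit, and a sink accumulates the results. Each
// stage runs on its own thread; the queues model "=>" connections.
class PipelineModel {
    static final int EOS = -1; // end-of-stream marker

    static List<Integer> run(int[] bits) throws InterruptedException {
        BlockingQueue<Integer> q1 = new ArrayBlockingQueue<>(4);
        BlockingQueue<Integer> q2 = new ArrayBlockingQueue<>(4);
        List<Integer> result = new ArrayList<>();

        Thread source = new Thread(() -> {
            try {
                for (int b : bits) q1.put(b);
                q1.put(EOS);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        Thread flip = new Thread(() -> {
            try {
                int b;
                while ((b = q1.take()) != EOS) q2.put(b == 0 ? 1 : 0);
                q2.put(EOS);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        Thread sink = new Thread(() -> {
            try {
                int b;
                while ((b = q2.take()) != EOS) result.add(b);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        source.start(); flip.start(); sink.start();
        // Like finish() on line 20 of Figure 1: block until the graph drains.
        source.join(); flip.join(); sink.join();
        return result;
    }
}
```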

Only values may flow between tasks. This restriction, which is enforced by the Lime type system, guarantees that data that cross physical boundaries in a heterogeneous system cannot mutate in flight. Hence, the data may be marshaled on one end and unmarshaled on the other without concern for data races. Furthermore, values are cycle-free and may be marshaled using custom strategies that are tailored to the physical wire format of the system.

Lime tasks which are either source or sink nodes in the task graph (i.e., have no input connections or output connections, respectively) are allowed to perform I/O and may have side effects. In contrast, inner tasks, called filters, must be strongly isolated in that the task operator can only be applied to a local method with value arguments. It is the inner tasks that are usually migrated from the CPU and co-executed on accelerators.

Task graph construction is separated from task graph execution. The task graph construction does not cause any of the actors in the graph to execute. Instead, Lime requires an explicit operation to cause the actors to execute. This is accomplished using a start() or finish() method on tasks (line 20). The latter causes the execution to start and blocks the caller until computation has terminated. In the example, the graph execution terminates when the last bit produced by the source is consumed by the sink.

2.3 Task Relocation and Co-Execution

Lime requires the use of relocation brackets ([ ]) around task expressions (Figure 1, line 18) in order to inform the compiler and runtime of the programmer's desire to co-execute tasks. When omitted, the compiler and runtime provide no guarantees that the task graph contains any co-executable regions.

The relocation brackets provide a lightweight and convenient mechanism for a programmer to experiment with many different partitions between device code and host code without perturbing the rest of their code. In addition, since the methods that a task operator acts on may be used seamlessly in a number of contexts (e.g., map/reduce, task graph, or imperative code), the programmer can develop and debug their code in a single semantic domain and reuse much of their existing code. We believe that programmers will favor using the Lime value and local modifiers because they are non-intrusive, provide general soundness guarantees that a programmer may wish to assert, and most notably, when combined with task and map/reduce operators, provide a path for exploiting heterogeneous architectures.

[Figure 2 depicts the toolchain layers: user code (task graphs and methods such as doWork below) passes through the compiler frontend and backend, generating Java bytecode, OpenCL, and Verilog artifacts; the JVM-hosted Lime runtime dispatches these artifacts to CPU, GPU, and FPGA hardware.]

    public static int doWork(int[[]] values) {
        int acc = 0;
        for (int i = 0; i < values.length; i++)
            acc += values[i];
        return acc;
    }

Figure 2: Liquid Metal compiler & runtime.

3. COMPILING FOR HETEROGENEITY

Figure 2 presents an overview of our compilation and runtime toolchain. Liquid Metal accepts a set of source files and produces artifacts for execution. An artifact is an executable entity that may correspond to either the entire program (as is the case with the bytecode generation) or its subsets (as is the case with the OpenCL/GPU and Verilog/FPGA backends). An artifact is packaged in such a way that it can be replaced at runtime with another artifact that is its semantic equivalent.

The compiler frontend performs shallow optimizations and generates Java bytecode for executing the entire program in a Java virtual machine (JVM). The compiler backend generates code for GPUs and FPGAs. The backend operates on a subset of the input program, focusing solely on compiling Lime task graphs.

The backend consists of architecture-specific device compilers; currently, a GPU compiler and an FPGA compiler. The former generates OpenCL for the GPU, while the latter generates Verilog for the FPGA. These are subsequently compiled using device-specific toolflows to complete the artifact generation for each accelerator.

The compiler relies on the presence of relocation brackets around task graphs to learn of the tasks it must compile. In general, task graphs can be constructed using rich control flow (e.g., iterative, recursive). The compiler discovers the shape and other properties of these task graphs statically. As expected, compile-time analysis may not discover all possible task graphs that the program might build. If the relocation brackets are present and the compiler fails to determine the shape of the task graph, the programmer is informed at compile time with an appropriate error message. The benchmarks we have developed so far use task construction idioms that our compiler can recognize. We believe these benchmarks are written in a style that is natural to programmers.

Each of the device compilers operates autonomously and independently of the other compilers. It examines the tasks that make up each task graph and decides whether the code that comprises the tasks is suitable for the device. A task containing language constructs that are not suitable for the device is excluded from further compilation by that backend. The device compiler produces artifacts for the task (sub)graphs that are not subject to exclusion.

The frontend and backend compilers cooperate to produce a manifest describing each generated artifact and labeling it with a unique task identifier. The set of manifests and artifacts is used by the runtime system in determining which task implementations are semantically equivalent to each other. The same task identifiers are incorporated into the final phase of bytecode compilation such that the task graph construction in the Lime program can pass along these identifiers to the runtime when the program is executed.

4. LIQUID METAL RUNTIME

The Lime language runtime is implemented primarily in Java and runs in any modern JVM. It constructs task graphs at run time, elects which implementations of a task to use from the available artifacts, performs task scheduling, and if necessary, orchestrates marshaling and communication with native artifacts. Its structure is best understood by focusing on its phases of execution.

4.1 Task Graph Construction

The runtime contains a class for every distinct kind of task that can arise in the Lime language (e.g., sources, sinks, filters). When the bytecode compiler produces the code for a task, it generates an object allocation for one of these runtime classes. A connect operation "=>" creates a FIFO queue between tasks. When the program executes, the task creation and connection operators are reflected in an actual graph of runtime objects. The generated bytecode that instantiates a task also provides the runtime with the unique identifiers that were generated by the backend compilers to advertise the set of available task artifacts.

The start() method, when called on any task graph, activates the scheduling portion of the runtime. Assuming for the moment that the tasks in the graph only have bytecode artifacts, the runtime creates a thread for each task. These threads will block on the incoming connections until enough data is available for the task to produce data on its outgoing connection.

4.2 Task Substitution

The runtime learns about the tasks in the graph that have non-bytecode artifacts while examining the task graph during construction, since the unique identifiers of tasks, which are stored in the task runtime objects, can be looked up efficiently in the artifact store populated by the backends. For each task (sub)graph that has an alternative implementation, the runtime is in a position to perform a substitution. At present, the runtime algorithm for doing this substitution is primitive: it prefers a larger substitution to a smaller one. It also favors GPU and FPGA artifacts over bytecode, although that choice can be manually directed as well. A more sophisticated algorithm that accounts for communication costs, performs dynamic migration, or adapts at runtime is left to future work. Performing the actual substitution involves the native connection and marshaling issues described in the next section. Substitution precedes the creation and starting of threads as described in the previous section but is otherwise transparent to most of the runtime.
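The stated policy — prefer a larger substitution, and prefer accelerator artifacts over bytecode — could be sketched as follows (an illustrative model; the class, record, and method names are invented here and do not reflect the actual runtime's API):

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Illustrative model of the runtime's substitution choice. An artifact
// covers some number of tasks in the graph and targets a device; the
// policy prefers larger coverage, then accelerators over bytecode.
class SubstitutionModel {
    enum Device { BYTECODE, GPU, FPGA }

    record Artifact(String taskId, Device device, int tasksCovered) {}

    // Rank devices: accelerators beat bytecode (ties broken arbitrarily).
    static int devicePreference(Device d) {
        return d == Device.BYTECODE ? 0 : 1;
    }

    static Optional<Artifact> choose(List<Artifact> candidates) {
        return candidates.stream().max(
            Comparator.comparingInt(Artifact::tasksCovered)        // larger subgraph first
                      .thenComparingInt(a -> devicePreference(a.device())));
    }
}
```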

4.3 Native Connections and Marshaling

Although the data and code isolation of Lime tasks makes computation offloading possible, the runtime system must still efficiently transfer data between the main system memory and the device. Because the runtime supports a number of disparate accelerators that include GPUs and FPGAs, the runtime implementation adopts a universal "wire" format that relies only on sending a byte stream, as shown in Figure 3.

Figure 3: Data transfer between JVM and native device using a float array as input and int array as output.

The communication steps between the host JVM and the native device entail (1) serializing a Lime value to a byte array, (2) crossing the JNI boundary, and (3) converting this byte array into a C-style value. The return path is a mirror image in which we convert the native data structure to a byte array, return from the JNI call, and then deserialize from the byte array back into a heap-resident value.
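Step (1) and its mirror image can be sketched with java.nio (a minimal model of Figure 3's float-in, int-out transfer; the class name and the choice of byte order are assumptions, since the runtime's custom serializers and wire format are not specified here):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Minimal model of marshaling a float array into a byte stream for
// the JNI crossing, and unmarshaling an int array on the way back.
// A fixed byte order stands in for the agreed wire format.
class MarshalModel {
    static byte[] serializeFloats(float[] values) {
        ByteBuffer buf = ByteBuffer.allocate(values.length * Float.BYTES)
                                   .order(ByteOrder.LITTLE_ENDIAN);
        for (float v : values) buf.putFloat(v);
        return buf.array();
    }

    static int[] deserializeInts(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        int[] out = new int[bytes.length / Integer.BYTES];
        for (int i = 0; i < out.length; i++) out[i] = buf.getInt();
        return out;
    }
}
```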

During the task substitution process, the runtime will find a custom serializer based on the task I/O data type. Marshaling on the C side is similar but more specialized because the data is generally densely packed.

Our current design affords a common format as a starting point for a communication layer that supports heterogeneous devices. One might further optimize the protocol by creating specific communication channels so that the sender and receiver are aware of the data format the other party desires. Going even further, one might be able to avoid a low-level memory copy by pinning memory pages and managing memory explicitly. However, these changes come at the cost of OS and JVM portability.

5. DESIGN FLOW USING LIME

In addition to the Liquid Metal compiler and runtime, we developed an integrated development environment (IDE) to aid Lime developers in designing and implementing their applications. The IDE is an Eclipse plugin and provides a number of features that a typical Java developer is accustomed to. Among these are a package explorer, seen in the leftmost panel in Figure 4, and a class outline in the rightmost panel. The Bitflip Lime class is open in the editor (this is the same class used in an earlier part of the paper, albeit here it is more complete). Syntax highlighting is visible in the figure as well. In addition, the reader may notice a green underline at line 18 and a round marker to the left of the line number. These are Lime-specific markers indicating that the compiler generated a device artifact for the corresponding task in the relocation brackets. The console panel toward the bottom of the screen shot shows the result of executing the program, which has emitted two rows of bit values.

The panel labeled “Run Configuration”, right of center in the screen shot, shows how a Lime developer interacts with the compiler and runtime to generate artifacts for co-execution. It is currently possible to compile Lime tasks to generate artifacts for native binaries, GPUs, and FPGAs. In the case of native binaries, the compiler generates C code and builds shared libraries that are dynamically loaded by the Liquid Metal runtime to co-execute with the remaining Lime bytecodes. For GPUs, the compiler mostly follows the same flow as with native binaries, but the tasks are compiled to OpenCL kernels. Lastly, for FPGAs, the compiler generates Verilog and then drives FPGA-specific logic synthesis flows to generate bitfiles that may be loaded into the FPGA at runtime for co-execution. However, it is also possible to simulate the Verilog alongside the Liquid Metal runtime using a Verilog/VHDL simulator.

Figure 4: Liquid Metal IDE (top) and co-execution CPU+FPGA simulator (bottom).
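To illustrate the co-execution idea, the runtime can be thought of as holding several compiled artifacts per task and picking one at launch time, falling back to bytecode when no accelerator is available. The following Java sketch is purely conceptual; the enum, preference order, and method names are invented for illustration and are not the Liquid Metal API.

```java
import java.util.EnumSet;
import java.util.List;

// Conceptual sketch (not the actual Liquid Metal API): each task may
// have several compiled artifacts, and the runtime selects one that
// the current machine can execute, preferring accelerators.
public class ArtifactPicker {
    enum Kind { BYTECODE, NATIVE, OPENCL, VERILOG }

    // Hypothetical preference order when the matching device is present.
    static final List<Kind> PREFERENCE =
        List.of(Kind.VERILOG, Kind.OPENCL, Kind.NATIVE, Kind.BYTECODE);

    // compiled: artifacts the compiler produced for this task.
    // runnableHere: artifact kinds this machine can actually execute.
    static Kind pick(EnumSet<Kind> compiled, EnumSet<Kind> runnableHere) {
        for (Kind k : PREFERENCE)
            if (compiled.contains(k) && runnableHere.contains(k))
                return k;
        return Kind.BYTECODE; // bytecode always runs on the JVM
    }
}
```

The point of the sketch is that partitioning decisions can be deferred to launch time precisely because the compiler emits multiple artifacts for the same task.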

A typical design flow starts off with the developer writing Lime code that describes the computation, then introducing map/reduce operators and task-graphs, verifying correctness (in bytecodes) using normal debugging methodologies (e.g., using the Eclipse interactive symbolic debugger), and lastly introducing relocation brackets to invoke the backend device compilers. For FPGA development, where the Lime source code may affect the behavioral synthesis in a number of ways, it is useful for the developer to rapidly iterate over their designs and access low-level timing results from a simulator so they may improve their code and achieve better performance. This is especially important at this point in time because our FPGA backend is a work in progress. The bottom half of Figure 4 shows the result of a co-execution between the Liquid Metal runtime and a Verilog simulator, visualized using a waveform viewer.

The waveform captures the entire execution of the taskFlip task graph. The example is driven with 9 input bits (line 9 in the editor). In the waveform, these are represented by the 9 transitions on the inReady signal. The generated logic uses a FIFO which produces a value on the next rising edge of the clock. This is most clearly visible starting at the clock cursor at 92ns (input[1]=1). The inReady signal is asserted and inData[0] is high one cycle later. Another three cycles later, the output of the module is ready as indicated by the outReady signal. In this case, the module I/O is not fully pipelined and the computation essentially consists of one cycle to read, one cycle to compute, and one cycle to publish the result.
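The timing behavior described above can be mimicked with a small software model. The class below is an illustrative stand-in for the generated hardware, not the actual Verilog (the names and the bit-flip operation are assumed from the Bitflip example); it shows why an input accepted on cycle c appears on the output at cycle c+3.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Illustrative cycle model (not the generated Verilog): one cycle to
// read, one to compute, one to publish, so an input accepted on
// cycle c becomes visible on the output at cycle c+3.
public class ThreeStageModel {
    private final Queue<Integer> read = new ArrayDeque<>();
    private final Queue<Integer> compute = new ArrayDeque<>();
    private final Queue<Integer> publish = new ArrayDeque<>();

    // Advance one clock cycle; in == null means no new input this cycle.
    // Returns the published value, or null if nothing is ready yet.
    Integer tick(Integer in) {
        Integer out = publish.poll();          // publish stage drains
        Integer c = compute.poll();
        if (c != null) publish.add(~c & 1);    // compute stage: flip the bit
        Integer r = read.poll();
        if (r != null) compute.add(r);         // hand read value to compute
        if (in != null) read.add(in);          // read stage accepts new input
        return out;
    }
}
```

Driving the model with a bit at cycle 0 yields null for three ticks and the flipped bit on the fourth, matching the three-cycle latency visible in the waveform.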

6. RELATED WORK

The work described in this paper is an entirely new compilation infrastructure for Lime that includes bytecode, OpenCL, and Verilog code generation. This is in contrast to our previous work [4], which compiled a much smaller language lacking explicit task-graph features or map/reduce operations. The current work also includes a runtime that enables co-execution of the generated artifacts on GPUs and FPGAs. The previous paper does not discuss runtime replacement or the kind of compilation methodology described in this paper.

Several research projects address the programming burdens of heterogeneous systems. For example, SoC-C [8] is an extension to the C language that allows the programmer to manage distributed memory, express pipeline parallelism, and map different tasks to resources. EXOCHI [12], a research effort from Intel, describes a runtime for integrating accelerators with general purpose processors with shared memory and dynamic allocation of tasks to accelerators. The Qilin system [7] provides a C API that can be used to write parallel programs and an adaptable runtime that dynamically maps computations to processing elements in a CPU+GPU system. Accelerator [10] provides a C# API library that uses a data-parallel model based on parallel arrays to program GPUs. Elastic computing [13] is another approach to support heterogeneous systems, in which a library of functions is built with several implementations (for different devices or algorithms). During an installation phase, the system profiles each implementation, and at runtime the profile is used to determine the best implementation to use based on the input parameters.

Liquid Metal differs from these approaches in a number of ways: the compiler and runtime are founded upon a new language called Lime. Unlike its C counterparts, Lime is much richer, unifies several forms of parallelism, and makes the computation and communication explicit. All of the programming features in support of heterogeneous computing are first-class language constructs, and as such, they are checkable, provide strong static guarantees, and empower the compiler to perform simple yet highly effective optimizations [3]. Furthermore, the relocation brackets in Lime provide a convenient and lightweight feature for a programmer to experiment with many different program partitions without specializing the bulk of their code for a particular style of computation. While the current Liquid Metal runtime does not employ intelligent heuristics for partitioning and mapping, we view much of the existing work on this particular topic as complementary to ours. Our work has focused on identifying a robust unit of migration and co-execution (e.g., task-graphs), and exposing the computation and communication in a way that facilitates compilation to multiple targets and co-execution.

OpenCL [5] provides an interesting comparison point to Liquid Metal. While OpenCL is largely used for programming GPUs today, there is significant interest in extending its reach to FPGAs as well. The main advantage of our work is that Lime was specifically designed with synthesis to FPGAs in mind. Hence, bit data types are first class, there are no predefined data types that imply a particular vector width, and complicated synchronization and explicit communication orchestration are not required. The Liquid Metal compiler and runtime automatically generate the synchronization and communication code for GPUs and FPGAs as needed, thus freeing the programmer from tedious and error-prone coding.

Compilation of restricted subsets of C or C++ to FPGA code is common (see [2] for a recent survey). There is arguably some degree of co-execution supported by these tools in that they often generate a “testbench” in addition to the FPGA artifact. However, these tools are not oriented toward building complete applications in a co-execution style, and often the extensions for FPGA programming are not well suited for GPU programming.

Virgil [11, 6] is an object-oriented language originally designed for micro-controllers, but unlike Java or Lime, it restricts object creation to an initialization phase. The Kiwi [9] project synthesizes FPGA code for selected library components in C#. These efforts do not emphasize co-execution; rather, they generate entire designs to run wholly in the FPGA.

7. CURRENT AND FUTURE DIRECTIONS

In this paper, we highlighted features of the Lime language that we expect to lighten the burden on developers who wish to exploit heterogeneous architectures. The Liquid Metal compiler and runtime leverage the properties of the language to free the programmer from the tedious partitioning of computation between host and device code, and from the orchestration of communication for co-execution. Instead, a Lime developer writes their application for GPU and FPGA accelerators in a single semantic domain, uses map/reduce and task graphs to expose parallelism, and then experiments with partitioning their code for co-execution using a lightweight language mechanism, namely relocation brackets.

We have successfully developed a number of applications using Lime and run them on CPU+GPU and CPU+FPGA systems. Our current system supports virtually all GPUs, and we have demonstrated significant performance gains on AMD and NVIDIA GPUs [3]. In contrast, because FPGAs require device-specific synthesis flows, we currently support the following devices: PCIe-attached Nallatech 280 boards with Xilinx FPGA parts, and Xilinx XUP V5 and Spartan LX9 development boards connected to the host via UART. In addition, we support Verilog/VHDL simulators including NCSim and ModelSim.

Our current and future emphasis is on improving the quality of our FPGA device compiler and broadening the set of language features that are compiled to Verilog. In addition, we are exploring a number of research topics related to runtime introspection and adaptation of the task-graph partitioning so that tasks run where they are best suited. Lastly, we are exploring applications that can benefit simultaneously from CPU+GPU+FPGA co-execution.

8. REFERENCES

[1] J. Auerbach, D. F. Bacon, P. Cheng, and R. Rabbah. Lime: a Java-compatible and synthesizable language for heterogeneous architectures. In OOPSLA, pp. 89–108. ACM, Oct. 2010.

[2] J. M. Cardoso and P. C. Diniz. Compilation Techniques for Reconfigurable Architectures. Springer, 2008.

[3] C. Dubach, P. Cheng, R. Rabbah, D. F. Bacon, and S. J. Fink. Compiling a high-level language for GPUs. In PLDI, June 2012.

[4] S. S. Huang, A. Hormati, D. F. Bacon, and R. Rabbah. Liquid Metal: object-oriented programming across the hardware/software boundary. In ECOOP, pp. 76–103. Springer-Verlag, 2008.

[5] Khronos OpenCL Working Group. The OpenCL Specification.

[6] S. Kou and J. Palsberg. From OO to FPGA: fitting round objects into square hardware? In OOPSLA, pp. 109–124. ACM, 2010.

[7] C.-K. Luk, S. Hong, and H. Kim. Qilin: exploiting parallelism on heterogeneous multiprocessors with adaptive mapping. In MICRO, pp. 45–55. ACM, 2009.

[8] A. D. Reid, K. Flautner, E. Grimley-Evans, and Y. Lin. SoC-C: efficient programming abstractions for heterogeneous multicore systems on chip. In CASES, pp. 95–104. ACM, 2008.

[9] S. Singh and D. J. Greaves. Kiwi: synthesis of FPGA circuits from parallel programs. In FCCM, pp. 3–12. IEEE Computer Society, 2008.

[10] D. Tarditi, S. Puri, and J. Oglesby. Accelerator: using data parallelism to program GPUs for general-purpose uses. In ASPLOS, pp. 325–335. ACM, 2006.

[11] B. L. Titzer. Virgil: objects on the head of a pin. In OOPSLA, pp. 191–208. ACM, 2006.

[12] P. H. Wang, J. D. Collins, G. N. Chinya, H. Jiang, X. Tian, M. Girkar, N. Y. Yang, G.-Y. Lueh, and H. Wang. EXOCHI: architecture and programming environment for a heterogeneous multi-core multithreaded system. In PLDI, pp. 156–166. ACM, 2007.

[13] J. R. Wernsing and G. Stitt. Elastic computing: a framework for transparent, portable, and adaptive multi-core heterogeneous computing. In LCTES, pp. 115–124. ACM, 2010.
