MARC: A Many-Core Approach to Reconfigurable Computing
Ilia Lebedev, Shaoyi Cheng, Austin Doupnik, James Martin, Christopher Fletcher,
Daniel Burke, Mingjie Lin, and John Wawrzynek
Department of EECS, University of California at Berkeley, CA 94704
Abstract—We present a Many-core Approach to Reconfigurable Computing (MARC), enabling efficient high-performance computing for applications expressed using parallel programming models such as OpenCL. The MARC system exploits abundant special FPGA resources such as distributed block memories and DSP blocks to implement complete single-chip high-efficiency many-core microarchitectures. The key benefits of MARC are that it (i) allows programmers to easily express parallelism through the API defined in a high-level programming language, (ii) supports coarse-grain multithreading and dataflow-style fine-grain threading while permitting bit-level resource control, and (iii) greatly reduces the effort required to repurpose the hardware system for different algorithms or different applications. A MARC prototype machine with 48 processing nodes was implemented using a Virtex-5 (XCV5LX155T-2) FPGA for a well-known Bayesian network inference problem. We compare the runtime of the MARC machine against a manually optimized implementation. With fully synthesized application-specific processing cores, our MARC machine comes within a factor of 3 of the performance of a fully optimized FPGA solution, but with a considerable reduction in development effort and a significant increase in retargetability.
Keywords-Reconfigurable Computing; Many-Core; FPGA; Compiler; Performance; Throughput.
I. INTRODUCTION
Reconfigurable devices such as FPGAs exhibit huge potential for exploiting application-specific parallelism and performing power-efficient computation. As a result, the overall performance of FPGA-based solutions is often significantly higher than that of CPU-based ones [1], [2]. Unfortunately, truly unleashing an FPGA's performance potential usually requires cumbersome HDL programming and laborious manual optimization. Specifically, programming FPGAs demands skills and techniques well outside the application-oriented expertise of many developers, thus forcing them to step beyond their traditional programming abstractions and embrace hardware design concepts such as clock management, state machines, pipelining, and device-specific memory management. Finally, wide acceptance in the marketplace requires binary compatibility across a range of implementations. However, the current crop of FPGAs requires lengthy reimplementation for each new chip version, even within the same FPGA family.
These observations therefore raise a natural question: for a given class of computation-intensive applications, is it possible to build a reconfigurable computing machine constrained to resemble a many-core computer, program it using a high-level imperative language such as C/C++, and yet still achieve orders of magnitude in performance gain relative to conventional computing means? If so, such a methodology would 1) allow programmers to easily express parallelism through the API defined in a high-level programming language, 2) support coarse-grain multithreading and dataflow-style fine-grain threading while permitting bit-level resource control, and 3) greatly reduce the effort required to repurpose the implemented hardware platform for different algorithms or different applications.
Intuitively, constraining reconfigurable architectures in this way would likely degrade performance compared to fully customized solutions. As depicted in Fig. 1, from the lowest-performing platform to the highest, the amount of effort required for application mapping increases significantly. We contend that there is a sizable design space (the gray area in Fig. 1) between hand-optimized FPGA solutions and general-purpose processors that warrants systematic exploration. To that end, this work attempts to show that, although the MARC approach trails hand-optimized FPGA platforms in performance and efficiency, a disciplined approach with architectural constraints, one that avoids cumbersome HDL programming and time-consuming manual optimization, can win overwhelmingly in hardware portability and in reduced design time and effort. We believe MARC, our proposed Many-core Approach to Reconfigurable Computing, to be a first step toward finding the right computational abstractions to characterize a wide range of reconfigurable devices, expose a uniform view to the programmer, capture computation in a manner that diverse hardware implementations can exploit efficiently, and leverage ongoing work in parallel programming.
Figure 1. Landscape of modern computing platforms: ease of application design and implementation vs. performance. (Platforms plotted: GPP, GPU, MARC, FPGA (HDL), ASIC.)
We believe that the key to retaining the performance
efficiency of FPGAs is to allow the many-core architecture
to be customized on a per-application basis. We think of the architecture model as a template with a set of parameters to be chosen based on characteristics of the target application. Our understanding of which aspects of the architecture to parameterize continues to evolve as we investigate different application mappings. However, the obviously important parameters are: the number of processing cores; core arithmetic width; core pipeline depth; richness and topology of the interconnection network; customization of cores, ranging from the addition of specialized instructions to fixed-function datapaths; and details of the cache and local store hierarchies. In this study we explore a part of this parameterization space and compare the performance of a customized many-core FPGA implementation to a hand-optimized version.
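As an illustration only (these names are ours, not part of any MARC toolflow or API), such a template might be captured as a configuration record:

    /* Illustrative encoding of the MARC template parameters discussed
     * above; identifiers are ours, not part of the MARC toolflow. */
    typedef enum { NET_CROSSBAR, NET_RING, NET_MESH, NET_HTREE, NET_TORUS }
        net_topology_t;

    typedef struct {
        unsigned       num_a_cores;       /* number of algorithmic cores       */
        unsigned       core_width_bits;   /* core arithmetic width             */
        unsigned       pipeline_depth;    /* core pipeline depth               */
        unsigned       threads_per_core;  /* fine-grained multithreading depth */
        net_topology_t topology;          /* interconnect from the library     */
        unsigned       local_mem_bytes;   /* per-core local store size         */
        int            custom_datapath;   /* nonzero: fully customized A-cores */
    } marc_template_params_t;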
The rest of the paper is organized as follows. Section II introduces the target application, a Bayesian network inference problem. Section III describes the high-level architecture of the MARC system. Finally, Section IV illustrates a prototype of the MARC machine using a Virtex-5 FPGA on a BEE3 system and compares the performance of several different MARC variants with that of a manually optimized FPGA solution as well as a general-purpose processor (GPP) implementation.
II. APPLICATION: BAYESIAN NETWORK INFERENCE
We evaluated MARC against a Bayesian network (BN) inference implementation called ParaLearn [3]. BNs are statistical models that show conditional independences between nodes via the local Markov property: a node is conditionally independent of its non-descendants, given its parents. Bayesian inference is the process of finding a BN's structure from quantitative data ("evidence") taken for the BN. Once a BN's structure (made up of nodes V_1, ..., V_n) is known, and each node V_i has been conditioned on its parent set Π_i, the joint distribution over the nodes becomes the product:
    P(V_1, \ldots, V_n) = \prod_{i=1}^{n} P(V_i \mid \Pi_i)
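For example, for a three-node network in which V_1 has no parents and is the sole parent of both V_2 and V_3 (that is, \Pi_1 = \emptyset and \Pi_2 = \Pi_3 = \{V_1\}), the factorization reads

    P(V_1, V_2, V_3) = P(V_1)\, P(V_2 \mid V_1)\, P(V_3 \mid V_1)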
We chose to compare MARC and ParaLearn for two primary reasons. First, ParaLearn is a computationally intensive application believed to be particularly well suited for FPGA acceleration, as illustrated by [3]. Second, our group, in collaboration with Stanford University, has expended significant effort over the previous two years developing several generations of a hand-optimized FPGA implementation tailored for ParaLearn [3]. Therefore, we have not only a concrete reference design but also well-corroborated performance results for fair comparisons with a manually optimized FPGA implementation.
This paper compares ParaLearn's kernel, the "order sampler", against various MARC configurations. The order sampler takes as input prior evidence (D) of the BN's structure and produces a set of "high-scoring" BN orders. The other steps in the BN inference workflow, known as the pre-processing, graph sampling, and post-processing [3] steps, are outside the scope of this work. We chose to implement the order sampler because it has the highest asymptotic complexity of any step in the workflow.
The order sampler uses Markov chain Monte Carlo (MCMC) sampling to perform an iterative random walk in the space of BN orders. Per iteration, a "proposed" order ≺ is (1) formed by exchanging two nodes within the current order, (2) scored, and (3) accepted or rejected based on the Metropolis-Hastings rule. The score of a proposed order is given by [4]:
    \mathrm{Score}(\prec \mid D) = \prod_{i=1}^{n} \sum_{\Pi_i \in \Pi_\prec} \mathrm{LocalScore}(V_i, \Pi_i; D, G)

for a network of n nodes, where the innermost loop is an accumulation over each node's local scores (a statistical representation of the evidence) whose corresponding parent sets are compatible with the given order. Generally, steps 1–3 repeat until a domain expert is confident that the high-scoring orders will yield networks that represent D.
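As a concrete illustration of steps 1–3, the following C sketch shows the shape of the sampler loop. It is our simplification, not ParaLearn code: score_order stands in for the accumulation of local scores that the FPGA accelerates, and we assume scores are kept in log space.

    #include <math.h>
    #include <stdlib.h>

    /* Placeholder for the score accumulation in the equation above;
     * in ParaLearn this is the FPGA-accelerated step. Assumes the
     * returned score is a log-probability. */
    extern double score_order(const int *order, int n);

    void order_sampler(int *order, int n, int iterations) {
        double current = score_order(order, n);
        for (int it = 0; it < iterations; it++) {
            /* (1) propose: exchange two nodes within the current order */
            int i = rand() % n, j = rand() % n;
            int tmp = order[i]; order[i] = order[j]; order[j] = tmp;
            /* (2) score the proposed order */
            double proposed = score_order(order, n);
            /* (3) Metropolis-Hastings: accept with prob. min(1, exp(delta)) */
            if (log((double)rand() / RAND_MAX) < proposed - current) {
                current = proposed;                               /* accept */
            } else {
                tmp = order[i]; order[i] = order[j]; order[j] = tmp; /* undo */
            }
        }
    }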
III. THE MARC ARCHITECTURE
A. Many-Core Template
The overall architecture of a MARC system, as illustrated in Fig. 2, resembles a scalable, many-core-style processor architecture comprising one Control Processor (C-core) and multiple Algorithmic Processing Cores (A-cores). Both the C-core and the A-cores can be implemented as conventional pipelined RISC processors. However, unlike embedded processors commonly found in modern FPGAs, the processing cores in MARC are completely parameterized, with variable bit-width, reconfigurable multithreading, and even aggregate/fused instructions. Furthermore, A-cores can alternatively be synthesized as fully customized datapaths. For example, in order to hide global memory access latency, improve processing node utilization, and increase overall system throughput, a MARC system can perform fine-grained multithreading through shift register insertion and automatic retiming. Finally, while each processing core possesses a dedicated local memory accessible only to itself, a MARC system has a global memory space, implemented as distributed block memories, accessible to all processing cores through the interconnect network. Communication between a MARC system and its host is realized by reading and writing global memory.
Figure 2. Diagram of key components in a MARC machine: one C-core and multiple A-cores, each with a scheduler and private/local memory, connected to distributed memory blocks through the interconnect network.
B. Execution Model and Software Infrastructure
Our MARC system builds upon both LLVM, a production-grade open-source compiler infrastructure [5], and OpenCL (Open Computing Language) [6], a framework for writing programs that execute across heterogeneous platforms consisting of GPPs, GPUs, and other accelerators. Fig. 3 presents a high-level schematic of a typical MARC machine. A user application runs on a host according to the models native to the host platform, a high-performance PC in our study. Execution of a MARC program occurs in two parts: kernels that run on one or more A-cores of the MARC device, and a control program that runs on the C-core. The control program defines the context for the kernels and manages their execution. During execution, the MARC application spawns kernel threads to run on the A-cores, either as SIMD units, which execute in lockstep with a single stream of instructions, or as SPMD units, where each processing core maintains its own program counter.
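On the host side, this division of labor follows the usual OpenCL pattern. As a minimal host-side sketch (ours; error handling is omitted and the kernel name "score_order" is purely illustrative):

    #include <CL/cl.h>

    /* Minimal OpenCL 1.0-style host sketch: build a kernel and enqueue it
     * on the device. Error checks omitted; "score_order" is illustrative. */
    void run_kernel(cl_device_id dev, const char *src, size_t n_threads,
                    cl_mem orders, cl_mem scores) {
        cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
        cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
        cl_kernel k = clCreateKernel(prog, "score_order", NULL);
        clSetKernelArg(k, 0, sizeof(cl_mem), &orders);  /* global mem input  */
        clSetKernelArg(k, 1, sizeof(cl_mem), &scores);  /* global mem output */
        clEnqueueNDRangeKernel(q, k, 1, NULL, &n_threads, NULL, 0, NULL, NULL);
        clFinish(q);  /* block until all kernel threads have completed */
    }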
Figure 3. Schematic of a MARC machine's implementation: a host-MARC interface, a kernel scheduler with kernel and results queues, global memory, and per-core thread counter, local scheduler, MIPS core with instruction/data memories, boot memory, memory map, and private/local memories.
C. Application-Specific Processing Core
One strength of MARC is its capability to integrate fully
customized application-specific processing cores/datapaths
so that the kernels in an application can be more efficiently
executed. To this end, a high-level synthesis flow depicted
by Fig. 4 was developed to generate customized datapaths
for a target application.
Kernel Written in C
llvm-gcc
SSA IR
Predication
Data Flow Graph
Datapath Generation
Instruction Mapping
Pipelining Datapath
Scheduling/Ctr Gen.
Loop Scheduling
Multithreading
HDL Code of Customized Datapath
Figure 4. CAD flow of synthesizing application-specific processing cores.
The original kernel source code in C/C++ is first compiled by llvm-gcc to generate the intermediate representation (IR) in static single assignment (SSA) form, which constitutes a control flow graph where instructions are grouped into basic blocks. Within each basic block, instruction parallelism can be extracted easily, as all false dependencies have been removed in the SSA representation. Between basic blocks, control dependencies can then be transformed into data dependencies through branch predication. In our implementation, only memory operations are predicated, since they are the only instructions that can generate stalls in the pipeline. By converting control dependencies to data dependencies, the boundaries between basic blocks can be eliminated. This results in a single data flow graph in which each node corresponds to a single instruction in the IR. Creating hardware from this graph involves a one-to-one mapping between each instruction and various predetermined hardware primitives. Finally, the customized cores have the original function arguments converted into inputs. In addition, a simple set of control signals is created so that cores can be started and can signal completion of execution. For memory accesses within the original code, each non-aliasing memory pointer used by the C function is mapped to a memory interface capable of accommodating variable memory access latency. Integrating the customized cores into a MARC machine involves mapping the inputs of the cores to memory addresses accessible by the control core, as well as adding a memory handshake mechanism that allows cores to access global and local memories. For the results reported in this paper, loop pipelining and predication are done manually, but a fully automated flow from C to HDL is currently under development in our group.
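To make the if-conversion step concrete, consider the following toy example (ours, not output of the MARC flow); in MARC itself only memory operations carry real predicates, but the principle of trading control dependence for data dependence is the same:

    /* Before if-conversion: the result is control-dependent on p, and
     * the two assignments live in separate basic blocks. */
    int f(int a, int b, int p) {
        int r;
        if (p) r = a + b;
        else   r = a - b;
        return r;
    }

    /* After if-conversion: both sides execute unconditionally and a
     * select turns the control dependence into a data dependence,
     * dissolving the basic-block boundary. */
    int f_pred(int a, int b, int p) {
        int t1 = a + b;      /* executed regardless of p */
        int t2 = a - b;      /* executed regardless of p */
        return p ? t1 : t2;  /* select replaces the branch */
    }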
D. Host-MARC Interface
Gigabit Ethernet is used to implement the communication
link between the host and the MARC device. We leveraged
the GateLib [7] project from Berkeley to implement the
host interface, allowing the physical transport to be easily
replaced by a faster medium in the future.
E. Memory Organization
In a MARC machine, threads executing a kernel have
access to three distinct memory regions: private, local, and
global. Global memory permits read and write access to
all threads within any executing kernels on any processing
core (ideally, reads and writes to global memory may be
cached depending on the capabilities of the device, however
in our current MARC machine implementation, caching is
not supported). Local memory is a section of the address
space shared by the threads within a computing core. This
memory region can be used to allocate variables that are
shared by all threads spawned from the same computing
kernel. Finally, private memory is a memory region that is
dedicated to a single thread. Variables defined in one thread’s
private memory are not visible to another thread, even when
they belong to the same executing kernel.
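In OpenCL terms these regions correspond to the __global, __local, and __private address-space qualifiers. A minimal kernel sketch (ours; identifiers are illustrative) showing all three:

    /* Sketch of the three memory regions as seen from inside a kernel. */
    __kernel void score_chunk(__global const int *parent_sets, /* all cores */
                              __global float *out,
                              __local  float *partial)         /* one core  */
    {
        __private float acc = 0.0f;   /* per-thread; LUT-RAM backed in MARC */
        size_t tid = get_local_id(0);
        /* ... accumulate this thread's compatible local scores into acc ... */
        partial[tid] = acc;           /* share within the core via local mem */
        barrier(CLK_LOCAL_MEM_FENCE);
        if (tid == 0) {               /* one thread publishes the core's sum */
            float sum = 0.0f;
            for (size_t t = 0; t < get_local_size(0); t++)
                sum += partial[t];
            out[get_group_id(0)] = sum;  /* visible to all cores via global */
        }
    }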
Physically, the private memory regions in a MARC system are implemented in distributed LUT RAMs, while local memory and part of global memory reside in block RAM (BRAM). To permit a larger memory space, we also allow external memory to be used as part of the global memory region. To increase the number of global memory ports, we use both ports of each BRAM block separately, exposing each BRAM as two smaller single-port memories. Obviously, the achievable aggregate memory access bandwidth inside an FPGA is often far below its peak value, and the available amount of memory is small in comparison with other platforms, such as a modern GPU. Nevertheless, the flexibility of the FPGA enables the MARC approach to use application-specific access patterns in order to achieve high memory bandwidth.
F. Kernel Scheduler
To achieve high throughput, kernels must be scheduled
to avoid memory access conflicts. The MARC system
allows for a globally aware kernel scheduler, which can
orchestrate the execution of kernels and control access to
shared resources. The global scheduler is controlled via a
set of memory-mapped registers, which are implementation-
specific. This approach allows a range of schedulers,
from simple round-robin or priority schedules to complex
problem-specific scheduling algorithms.
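Since the register map is implementation-specific, the following sketch only suggests how a C-core control program might drive such a scheduler; every name and offset here is hypothetical:

    #include <stdint.h>

    /* Hypothetical memory-mapped scheduler registers; real offsets and
     * semantics are implementation-specific. */
    #define SCHED_BASE       ((volatile uint32_t *)0x40000000u)
    #define SCHED_KERNEL_ID  0   /* which kernel to run                  */
    #define SCHED_CORE_MASK  1   /* bitmask of A-cores to gang-start     */
    #define SCHED_GO         2   /* write 1 to launch                    */
    #define SCHED_DONE       3   /* nonzero once all cores have finished */

    static void dispatch_gang(uint32_t kernel_id, uint32_t core_mask) {
        SCHED_BASE[SCHED_KERNEL_ID] = kernel_id;
        SCHED_BASE[SCHED_CORE_MASK] = core_mask;  /* coarse-grain start */
        SCHED_BASE[SCHED_GO]        = 1;
        while (SCHED_BASE[SCHED_DONE] == 0)
            ;  /* spin until the ganged threads complete */
    }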
The MARC machine optimized for ParaLearn uses the global scheduler to dispatch threads at a coarse grain (ganging up thread starts). The use of the global scheduler is therefore rather limited, as the problem does not greatly benefit from a globally aware approach to scheduling.
G. System Interconnect
One of the key advantages of reconfigurable computing
is the ability to exploit application-specific communication
patterns in the hardware system. MARC allows the network
to be selected from a library of various topologies, such
as mesh, H-tree, crossbar, or torus. Application-specific
communication patterns can thus be exploited by providing
low-latency links along common routes.
The MARC machine optimized for ParaLearn explores two topologies: a pipelined crossbar and a ring, as shown in Fig. 5. The pipelined crossbar makes no assumptions about the communication pattern of the target application: it is a non-blocking network that provides uniform latency to all locations in the global memory address space. Due to the large number of endpoints on the network, the crossbar is limited to 120 MHz with 8 cycles of latency.
The ring interconnect implements only nearest-neighbor links, thereby providing very low latency access to some locations in global memory while requiring multiple hops for other accesses. Nearest-neighbor communication is important in the accumulation phase of ParaLearn and helps reduce overall latency. Moreover, this network topology is significantly more compact and can be clocked at a much higher frequency: as high as 300 MHz in our implementations.
Figure 5. System diagram of a MARC system: (a) ring network; (b) pipelined crossbar. Both connect the C-core and A-cores to global memory blocks, with mapping and scheduler nodes and a link to the host.
IV. MARC IMPLEMENTATION AND PERFORMANCE
A. Hardware Prototyping
The prototype MARC implementation for this study comprises one C-core and 48 A-cores implemented on a single Virtex-5 (XCV5LX155T-2) of a BEEcube BEE3 module. While the C-core is implemented as a conventional 4-stage RISC processor, all A-cores are application-specific/customized with multithreading support. Each A-core normally executes multiple concurrent threads to saturate the long cycles in the application dataflow graph and to maintain high throughput. In this study, we implemented single-threaded, two-way multithreaded, and four-way multithreaded A-cores. When individually instantiated, the cores are clocked at 119 MHz, 180 MHz, and 260 MHz, respectively. However, they achieve 105 MHz, 160 MHz, and 148 MHz, respectively, in the completely assembled MARC system due to high FPGA resource utilization.
The placed-and-routed MARC prototype with the RISC A-cores is shown in Fig. 6 (left panel), and the system with application-specific A-cores is shown in Fig. 6 (right panel), with the main components highlighted. Constrained by long CAD tool run-times (about 20 hours), the hardware utilization of our MARC implementations is 84% and 71%, respectively, while the full-custom FPGA implementation of ParaLearn utilizes about 92% of total chip resources.
Figure 6. FPGA layouts after placement and routing.
As in other computing platforms, memory accesses significantly impact the overall performance of a MARC system. In the current MARC implementation, private and local memory accesses take exactly one cycle, while global memory accesses typically involve longer, network-dependent latencies. Such discrepancies between local and global memory access latencies, we believe, provide ample opportunities for memory optimization and performance improvements, especially considering the hardware flexibility of the MARC system, manifested in application-specific processing cores and customized interconnect networks. This benefit becomes even more pronounced when local memory accesses constitute the majority of all memory accesses, as in ParaLearn.
B. Mapping ParaLearn onto the MARC Machine
The ParaLearn order sampler comprises a C-core to
control the main loop and A-cores to implement the scoring
step. Per iteration, the C-core performs the node swap
operation, broadcasts the proposed order, and applies the
Metropolis-Hastings check. These operations take up a neg-
ligible amount of time relative to the scoring process.
Scoring is composed of (1) the parent set compatibility check and (2) an accumulation across all compatible parent sets. Step 1 must be performed for every parent set; its performance is limited by how many parent sets can be accessed simultaneously. We store parent sets in BRAMs that serve as A-core private memory, and we are therefore limited by the number of A-cores and the attainable A-core throughput. Step 2 must be carried out first independently by each A-core thread, then across A-core threads, and finally across the A-cores themselves. We serialize cross-thread and cross-core accumulations. Each accumulation is implemented with a global memory access.
The benchmark we chose consists of 16 nodes, each of
which has 2517 parent sets. We divide these into a total
of 12 chunks and allocate 12 threads per node. In the
implementations surveyed, 48 cores are used to run 192
threads. Thus, 3 cores are used per node and 4 threads are
used per core.
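These counts are consistent with one another: 16 nodes × 12 threads per node gives 192 threads; 192 threads across 48 cores gives 4 threads per core; 48 cores over 16 nodes gives 3 cores per node; and each thread is responsible for roughly 2517 / 12 ≈ 210 parent sets.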
Because step 2 in the scoring process depends on the
global memory latency, network customization is key to
improving performance. Clustering each node’s 3 A-cores
is beneficial when local memory is used for storing scores
across all 3 A-cores. With this setup, it takes only 12 cycles
to initially write the 12 threads’ scores. Furthermore, since
one core can be given exclusive access to local memory, all
further memory accesses are single-cycle.
Clustering also reduces the size of the global network and the number of FIFOs needed to decouple the cores from the network. This optimization greatly reduces area consumption and increases clock frequency, especially in the case of the largest, 4-way multithreaded cores.
Table I. Naming convention for MARC machines in this study.
Alias Description
MARC-Rgen RISC A-core with Generic Network
MARC-Ropt RISC A-core with Optimized Network
MARC-C1 Customized A-core
MARC-C2 Customized A-core (2-way MT)
MARC-C4 Customized A-core (4-way MT)
MARC-C1c Clustered Customized A-core
MARC-C2c Clustered Customized A-core (2-way MT)
MARC-C4c Clustered Customized A-core (4-way MT)
C. Performance Comparison and Analysis
We compare the performance of MARC machines with and without application-specific customized processing cores. The A-core variations and associated names are listed in Table I. We benchmark the order scoring algorithm against the manually optimized FPGA solution, as well as against a conventional microprocessor (GPP) reference implementation.
The runtime of each hardware platform, absolute and relative to the FPGA reference implementation, is shown in Table II. We also list the LUT utilization (the "Device Utilization" column) along with performance normalized by LUT utilization (the "Relative Area Eff." column). The performance relative to the full-custom FPGA implementation is also shown graphically in Fig. 7.
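The area-normalized column is consistent with scaling each configuration's relative performance by the ratio of the FPGA reference's LUT utilization to its own:

    \text{Rel. Area Eff.} = \text{Rel. Perf.} \times \frac{U_{\mathrm{FPGA}}}{U_{\mathrm{config}}},
    \qquad \text{e.g., } 0.3492 \times \frac{0.92}{0.57} \approx 0.5636 \text{ for MARC-C4c.}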
Using RISC A-cores achieves only about 5% of the performance of the FPGA reference implementation, even with an optimized interconnect topology (a ring network versus a pipelined crossbar). Customizing the A-cores, however, yields a significant leap in performance, moving MARC to within an order of magnitude of the reference FPGA implementation. Further optimizing the A-cores through clustering and multithreading significantly accelerates the accumulation phase of the order sampling algorithm, allowing MARC to perform within a factor of 3 of the reference. Furthermore, when we normalize for LUT utilization (an approximation of chip area), MARC performs within a factor of 2 of the hand-optimized FPGA reference design.
Although the main objective of this work is to compare various implementations using the many-core approach to reconfigurable computing, we also benchmark an implementation on a general-purpose processor. We do not claim that the GPP reference implementation is fully optimized; rather, it is included to give a rough idea of MARC's performance relative to a non-reconfigurable platform on this algorithm. The GPP implementation used in this study was written in OpenCL and run on a 3.33 GHz Intel Core i7 Nehalem 975 with 4 cores, using 1 hardware thread per core (with a 32 KB L1 and 256 KB L2 cache per core).
Figure 7. Performance comparison to the full-custom FPGA implementation.
Table II. Performance comparison between MARC, the FPGA reference, and the GPP.
Configuration    Device Utilization   Execution Time (µs)   Relative Perf.   Relative Area Eff.
GPP Reference    n/a                  350                   0.0055           n/a
MARC-Rgen        0.90                 58.48                 0.0327           0.0334
MARC-Ropt        0.84                 38                    0.0503           0.0551
MARC-C1          0.55                 10.89                 0.1754           0.2935
MARC-C2          0.63                 7.76                  0.2462           0.3595
MARC-C4          0.71                 9.93                  0.1924           0.2493
MARC-C1c         0.46                 9.4                   0.2033           0.4066
MARC-C2c         0.53                 6.77                  0.2819           0.4894
MARC-C4c         0.57                 5.47                  0.3492           0.5636
FPGA Reference   0.92                 1.91                  1.0000           1.0000
V. CONCLUSION
MARC offers a new methodology for designing FPGA-based computing systems by combining a many-core architectural template, a high-level imperative programming model [6], and modern compiler technology [5] to efficiently target FPGAs for general-purpose, compute-intensive applications. The primary objective of our work is to evaluate a many-core architecture as an abstraction layer (or execution model) for FPGA-based computation. We are motivated by recent renewed interest and effort in parallel programming for emerging many-core platforms, and we feel that finding an efficient many-core abstraction for FPGAs would bring the advances in parallel programming to reconfigurable computing. Of course, constraining an FPGA to an execution template reduces the flexibility of implementation, and therefore the potential for performance. However, we hypothesize that much of the potential loss in efficiency is recoverable through per-application customization of the many-core system. This paper outlines our initial efforts to quantify this tradeoff for one real-world application (Bayesian inference).
We have demonstrated that performance competitive with a highly optimized FPGA solution is attainable via a productive abstraction (days versus months of development time). Despite these results, the effectiveness of MARC remains to be investigated further. We are limited by our ability to produce many high-quality custom FPGA solutions in a variety of domains. Nonetheless, we plan to expand this study, surveying more applications and improving the many-core template in a systematic way. We are optimistic that a MARC-like approach will open new frontiers for high-performance reconfigurable computing.
VI. ACKNOWLEDGMENTS
This project was supported by the NIH, grant no. 130826-02, the DoE¹, award no. DE-SC0003624, and the Berkeley Wireless Research Center. We would also like to thank the members of the Berkeley Reconfigurable Computing group for contributing ideas and discussions surrounding this work, and the Stanford Nolan Lab for the benchmark data set.
REFERENCES
[1] M. Lin, I. Lebedev, and J. Wawrzynek, "High-throughput Bayesian computing machine with reconfigurable hardware," in FPGA '10: Proceedings of the 18th Annual ACM/SIGDA International Symposium on Field Programmable Gate Arrays. New York, NY, USA: ACM, 2010, pp. 73–82.
[2] M. Lin, I. Lebedev, and J. Wawrzynek, "OpenRCL: From Sea-of-Gates to Sea-of-Cores," in Proceedings of the 20th IEEE International Conference on Field Programmable Logic and Applications, 2010.
[3] N. Bani Asadi, C. W. Fletcher, G. Gibeling, E. N. Glass, K. Sachs, D. Burke, Z. Zhou, J. Wawrzynek, W. H. Wong, and G. P. Nolan, "ParaLearn: a massively parallel, scalable system for learning interaction networks on FPGAs," in ICS '10: Proceedings of the 24th ACM International Conference on Supercomputing. New York, NY, USA: ACM, 2010, pp. 83–94.
[4] N. Friedman and D. Koller, "Being Bayesian about network structure," in UAI '00: Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2000, pp. 201–210.
[5] C. Lattner and V. Adve, "LLVM: A compilation framework for lifelong program analysis & transformation," in Proceedings of the 2004 International Symposium on Code Generation and Optimization (CGO'04), Palo Alto, California, Mar. 2004.
[6] Khronos OpenCL Working Group, The OpenCL Specification, version 1.0.29, 8 December 2008. [Online]. Available: http://khronos.org/registry/cl/specs/opencl-1.0.29.pdf
[7] G. Gibeling et al., "GateLib: A library for hardware and software research," Tech. Rep., 2010.
¹Support from the DoE does not constitute the DoE's endorsement of the views expressed in this paper.