[Figure 1 diagram: visible neurons v1–v4 fully connected to hidden neurons h1–h4]
Sang Kyun Kim, Lawrence McAfee, Peter McMahon, Kunle Olukotun
[Figure 6 diagram: RBM module. An Avalon master and slave connect a main controller (32-bit CPU↔RBM register interface; 256-bit CPU↔RBM memory and DDR2↔RBM streams) through a MUX and stream logic to four neuron groups, GRP0–GRP3. Each group contains local FSMs, buffers, weight memories, Sigmoid/RNG/Compare units, and a Tree Add/Accumulate stage; visible-neuron values and tree-add results are broadcast across groups over 256-bit Main↔Local links with 16-bit lanes.]
[Figure 3 diagram: Restricted Boltzmann Machine, alternating Gibbs sampling between visible units v_i and hidden units h_j at t = 0 (data) and t = 1 (reconstruction)]

Δw_ij = ε(⟨v_i h_j⟩^0 − ⟨v_i h_j⟩^1)
Figure 1. RBM Structure (above). An RBM is a two-layer neural network with all-to-all connections between the layers.
Figure 2. Deep Belief Nets (right). A Deep Belief Network is a multi-layer generative model. The network is first learned with all the weights tied, which is equivalent to training an RBM. The first layer is then frozen and the remaining weights are learned, which is again equivalent to training another RBM.
Figure 3. RBM Training (from Hinton's tutorial at NIPS'07). Training begins with the data at the visible layer, computing the probabilities of the hidden layer. The hidden layer is updated and stochastically fires to reconstruct the visible layer. The hidden layer is then recomputed from the reconstructed visible layer. The weights are updated based on the difference between the visible-hidden product of the original training example and the visible-hidden product of the reconstructed data.
Introduction
Restricted Boltzmann Machines (RBMs), the building block for newly popular Deep Belief Networks (DBNs), are a promising new tool for machine learning practitioners. However, future research in applications of DBNs is hampered by the considerable computation that training requires. We have designed a novel architecture and FPGA implementation that accelerates the training of general RBMs in a scalable manner, with the goal of producing a system that machine learning researchers can use to investigate ever larger networks. Our current (single-FPGA) design uses a highly efficient, fully pipelined architecture based on 16-bit arithmetic for performing RBM training on an FPGA. Single-board results show a speedup of 25-30X over an optimized software implementation on a high-end CPU. Current design efforts are for a multi-board implementation.
RBM Training Procedure
Contrastive-divergence (CD) learning approximates the infinite alternating Gibbs sampling chain with a single reconstruction step.
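The CD-1 procedure described in Figure 3 can be sketched in NumPy as a software reference for the algorithm (not the FPGA datapath; the learning rate, batch shape, and bias-free formulation here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(W, v0, lr=0.1):
    """One contrastive-divergence (CD-1) weight update for a bias-free RBM.

    W  : (n_visible, n_hidden) weight matrix
    v0 : (batch, n_visible) batch of training data
    """
    # t = 0: hidden probabilities from the data, then a stochastic sample
    h0_prob = sigmoid(v0 @ W)
    h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)
    # reconstruct the visible layer from the sampled hidden units
    v1_prob = sigmoid(h0 @ W.T)
    v1 = (rng.random(v1_prob.shape) < v1_prob).astype(float)
    # t = 1: recompute hidden probabilities from the reconstruction
    h1_prob = sigmoid(v1 @ W)
    # dW = lr * (<v h>^0 - <v h>^1), averaged over the batch
    dW = lr * (v0.T @ h0_prob - v1.T @ h1_prob) / v0.shape[0]
    return W + dW

W = rng.normal(scale=0.01, size=(6, 4))
v0 = (rng.random((16, 6)) < 0.5).astype(float)
W = cd1_step(W, v0)
print(W.shape)  # (6, 4)
```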
Experimental Platform
Stackable Altera Stratix III FPGA board with a DDR2 SDRAM interface.
Figure 4. Terasic DE3 Fast Prototyping Board. The left image shows the DE3 board. It has an Altera Stratix III FPGA with a high-speed I/O interface for communication with multiple other boards. It also includes a DDR2 SO-DIMM interface, a USB JTAG interface, and a USB 2.0 interface. The right image illustrates multiple DE3 boards connected in a stacked manner.
Implementation Details
A single-FPGA implementation of the RBM has been developed. A multi-FPGA version is currently being designed.
[Figure 5 diagram: a NiosII CPU, DDR2 controller, and the RBM module (main controller, weight array, multiply array, adder array, sigmoid array, RNG/Compare array, neuron array update logic, and memory stream) on an Avalon MM interconnect, with 32-bit control and 256-bit data paths; off-chip DDR2, USB (ISP1761), JTAG, and UART interfaces]
Figure 5. Overall System. The single FPGA contains a CPU, a DDR2 controller, and an RBM module. These components are connected via Altera's Avalon interface.
Figure 6. RBM Module. The RBM module is the most important module. It was designed to exploit all the available multipliers, maximizing performance. It was also segmented into groups to avoid long wires and enforce localization. The weight matrix is assumed to fit on-chip.
Figure 7. Core Multiply Array. To eliminate the transpose operation in the learning algorithm, the matrix multiply is performed (a) as a linear combination of weight vectors in the hidden phase, and (b) as vector inner products in the reconstruction phase.
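The transpose-free scheme of Figure 7 can be illustrated in NumPy (the row-per-visible-neuron layout follows the poster; the sizes are illustrative): both phases read the stored weight rows in the same order, so no transposed copy of W is ever needed.

```python
import numpy as np

rng = np.random.default_rng(1)
n_v, n_h = 4, 3
W = rng.normal(size=(n_v, n_h))   # one stored row of W per visible neuron
v = rng.random(n_v)
h = rng.random(n_h)

# (a) Hidden phase: h_pre = v @ W computed as a linear combination of
#     weight rows -- each broadcast visible value scales one stored row.
h_pre = np.zeros(n_h)
for i in range(n_v):
    h_pre += v[i] * W[i]          # accumulate v_i * W[i, :]

# (b) Reconstruction phase: v_pre = W @ h computed as per-row inner
#     products -- each stored row of W is dotted with the hidden vector.
v_pre = np.array([W[i] @ h for i in range(n_v)])

assert np.allclose(h_pre, v @ W)  # same result as the untransposed multiply
assert np.allclose(v_pre, W @ h)  # same result as the transposed multiply
```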
Performance Results
The RBM module runs at 200 MHz. The comparison was made against one core of a 2.4 GHz Intel Core2 processor running MATLAB.
Figure 8. Speedup. The speedup over a single-precision floating-point MATLAB run is around 25X. The speedup depends on the network structure and size.
Extending to a Multiple-FPGA System
Two issues are being tackled:
1. The weight matrix grows as O(n²) with the number of nodes. Thus, the weight matrix no longer fits on-chip and must be streamed from DRAM. Using a batch size greater than 16, we can exploit data parallelism to reduce the required bandwidth to a level feasible for DDR2 SDRAM.
2. The single-FPGA scheme issues a broadcast of visible data every cycle. If we extend this to multiple FPGAs, the height of the board stack may limit the maximum clock rate driving the RBM module.
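A back-of-the-envelope sketch of the batching argument in issue 1 (the network size, sweep count, and target rate below are illustrative assumptions, not figures from the poster): streaming the weight matrix once per batch, rather than once per example, divides the required DRAM bandwidth by the batch size.

```python
# Rough DRAM bandwidth model for streamed weights. All numbers are
# illustrative assumptions, not measurements from the poster.
n = 1024                    # neurons per layer (hypothetical)
bytes_per_weight = 2        # 16-bit fixed-point weights
sweeps_per_example = 3      # weight-matrix passes per CD-1 step (assumed)
target_rate = 10_000        # training examples per second (assumed)

matrix_bytes = n * n * bytes_per_weight

for batch in (1, 16, 64):
    # With batch-level data parallelism the matrix is streamed once per
    # batch instead of once per example.
    bw = matrix_bytes * sweeps_per_example * target_rate / batch
    print(f"batch {batch:3d}: {bw / 1e9:6.2f} GB/s")
```

Under these assumptions, batch sizes of 16 and above bring the requirement under the roughly 6.4 GB/s peak of a DDR2-800 SO-DIMM, while batch 1 does not.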
[Figure 8 plot: speedup for 50 epochs on 256x256, 256x1024, and 512x512 networks, against single-precision and double-precision baselines; y-axis from 0 to 35]
Conclusion
Deep Belief Nets are an emerging area with the Restricted Boltzmann Machine at their heart. FPGAs can effectively exploit the inherent fine-grain parallelism in RBMs to reduce the computational bottleneck for large-scale DBN research. As a prototype of a fast DBN research machine, we implemented a high-speed configurable RBM on a single FPGA. The RBM has shown approximately 25X speedup compared to a single-precision software implementation of the RBM running on an Intel Core2 processor. Working with Stanford's AI research group, our future multiple-FPGA implementation is expected to provide enough speedup to attack large problems that have remained unsolved for decades.