A Reconfigurable Low Power High Throughput Streaming
Architecture for Big Data Processing
Raqibul Hasan, Tarek Taha, and Zahangir Alom
Email: hasanm1, tarek.taha, [email protected]
Department of Electrical and Computer Engineering
University of Dayton, Ohio, USA
Abstract—General purpose computing systems are used for a
large variety of applications, but the extensive support for
flexibility in these systems limits their energy efficiency. Given
that big data applications are among the main emerging
workloads for computing systems, specialized architectures for
big data processing are needed to enable low power, high
throughput execution. Many big data applications focus on
classification and clustering tasks. In this paper we propose a
multicore heterogeneous architecture for big data processing.
This system can process key machine learning algorithms such
as deep neural networks, autoencoders, and k-means clustering.
Memristor crossbars are utilized to provide low power, high
throughput execution of neural networks. The system has both
training and recognition (evaluation of new input) capabilities,
and can be used for classification, unsupervised clustering,
dimensionality reduction, feature extraction, and anomaly
detection applications. The system level area and power benefits
of the specialized architecture are compared with those of the
NVIDIA Tesla K20 GPGPU. Our experimental evaluations show
that the proposed architecture can provide four to six orders of
magnitude more energy efficiency than GPGPUs for big data
processing.
Keywords–Low power architecture; memristor crossbars;
autoencoder; unsupervised training; big data.
I. INTRODUCTION
With the increasingly large volumes of data being
generated in almost all fields including commerce,
science, medicine, security, and social networks, one of the
key application drivers of future computing systems is the
processing of these data to decipher patterns within it to help
make better decisions. This process is typically described as
big data analytics and involves machine learning to analyze
and categorize the data.
Two broad categories of machine learning algorithms are
supervised and unsupervised learning. In supervised learning,
we know exactly what classes of data exist and a sample of
data for each class is available (this is described as labeled
data). Supervised machine learning algorithms are trained on
these labeled datasets, and once trained, the algorithms can
sort new unlabeled data into the existing set of classes.
A more challenging, and perhaps more common, form of
big data analytics involves unlabeled data. In this case the
machine learning algorithms need to examine large volumes
of data to uncover hidden patterns, unknown correlations, and
other useful information. Some of the most commonly used
unsupervised approaches include autoencoders, restricted
Boltzmann machines, self-organizing maps, and k-means
clustering.
Big data processing using machine learning algorithms has
traditionally been run on clusters of GPUs. The major
problems with these systems are their high power
consumption and the need for higher throughput. Due to the
limits of Dennard scaling and Moore’s Law, alternative
architectures are gaining importance. Several studies have
examined low power, high throughput architectures for
machine learning applications. Examples of this include
DaDianNao [1] and PuDianNao [2], both of which accelerate
machine learning algorithms through digital circuits. In
PuDianNao, neuron synaptic weights are stored in off-chip
memory, and hence training requires moving data back and
forth between the off-chip memory and the processing chip.
In DaDianNao, neuron synaptic weights are stored in
eDRAM and are later brought into Neural Functional Units
for execution; input data are initially stored in the central
eDRAM, and the system utilizes dynamic wormhole routing
for data transfers.
This paper presents a memristor crossbar based multicore
architecture to implement unsupervised training of streaming
data. It is a heterogeneous architecture consisting of
memristor neural cores, a digital unit for clustering, a RISC
core for system configuration and initialization, and an on-
chip routing network for data communication. The key
benefits of this approach are high throughput at very low
power and area consumption. In this paper we utilize
memristor crossbars for neural network based system design
and examine supervised and unsupervised learning of the
system. The proposed system could be used for classification,
unsupervised clustering, dimensionality reduction, feature
extraction and anomaly detection applications.
Memristors [3,4] have received significant attention as a
potential building block for neuromorphic systems [5,6]. In
these systems memristors are used in a crossbar structure to
efficiently evaluate multiply-add operations in parallel in the
analog domain (these are the dominant operations in neural
networks). This enables highly dense neuromorphic systems
with great computational efficiency [7]. We use memristor
crossbars because they provide high synaptic weight density
and parallel analog processing at very low energy. Memristor
based synapses are 146 times denser than traditional digital
SRAM synapses (assuming 8 bit precision in both cases). In
this system, processing happens at the physical location of the
data, significantly reducing both data transfer energy and
functional unit energy consumption.
The most recent results for memristor based neuromorphic
systems can be seen in [12-15]. These works examine neither
multicore designs nor system level evaluations. The studies in
[12,13] examined only linearly separable problems. The work
in [16] examined memristor based neural networks utilizing
four memristors per synapse, whereas the system proposed
here utilizes two memristors per synapse. The authors of [16]
also propose deploying their memristor based system as a
neural accelerator attached to a RISC processor, while the
proposed system is a standalone embedded processing
architecture. Furthermore, the proposed system has on-chip
unsupervised learning capability.
Detailed circuit level simulations (in SPICE) of memristor
crossbars were carried out to verify the neural operations.
Both the programming and the recognition phase of the
neural networks were examined. These simulations utilized
an accurate memristor device model and took low level
circuit details, such as wire resistance, capacitance, and
driver circuits, into consideration. We have examined the
power, area, and performance of the proposed multicore
system and compared them with a GPU based system. Our
results indicate that the memristor based architecture can
provide up to six orders of magnitude more energy efficiency
over GPU for the selected benchmarks.
The rest of the paper is organized as follows: section II
presents the overall heterogeneous multicore architecture.
Section III describes the memristor device, neuron circuit
design, the autoencoder, and the memristor based autoencoder
design. Section IV describes the designs of the heterogeneous
cores. Sections V and VI describe the experimental setup and
results respectively. Section VII describes related work in the
area, and finally section VIII summarizes our work.
II. SYSTEM OVERVIEW
In this paper we propose an energy efficient multicore
heterogeneous architecture to accelerate both supervised and
unsupervised machine learning applications. The architecture
implements the backpropagation training algorithm, the
autoencoder, deep neural networks, and the k-means
clustering algorithm. These are some of the fundamental
algorithms in deep learning and are used for big data
processing.
A common approach for unsupervised learning is
clustering which uses dimensionality reduction and feature
extraction as preprocessing. Dimensionality reduction allows
data of large dimension to be represented using a smaller
set of features, and thus allows a simpler clustering algorithm
to process an unlabeled dataset. In this paper, we implement
the commonly used autoencoder algorithm for dimensionality
reduction and the k-means algorithm for clustering. The
autoencoder is based on multilayered neural networks and
thus is implemented using memristor based neural processing
cores. As the input data can be of high dimensionality (such
as a high resolution image), multiple neural cores are needed
to implement the autoencoder.
We utilize the back-propagation algorithm for training the
autoencoder. Due to the use of the autoencoder, a reduced
dimension k-means implementation is sufficient, and hence
we utilize a single digital core to implement k-means
clustering. Deep neural network applications with supervised
training could also be implemented in the proposed system.
For deep networks, the autoencoder is used for unsupervised
layer-wise pre-training, and supervised fine-tuning is then
performed on the pre-trained weights utilizing the back-
propagation algorithm. The system has both training and
inference capabilities. It can be used for classification,
unsupervised clustering, dimensionality reduction, feature
extraction, and anomaly detection applications.
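To fix ideas before the hardware details, a minimal software analogue of this pipeline is sketched below in Python (toy data; the 4→2→4 network size, learning rate, and iteration counts are illustrative choices of ours, not the hardware configuration): an autoencoder is trained to reconstruct unlabeled data, its hidden layer codes become the reduced features, and k-means with the Manhattan distance used by the clustering core groups those codes.

    import numpy as np

    rng = np.random.default_rng(0)
    # Toy unlabeled data: two blobs in 4 dimensions.
    X = np.vstack([rng.normal(0, 0.1, (100, 4)) + [1, 1, 0, 0],
                   rng.normal(0, 0.1, (100, 4)) + [0, 0, 1, 1]])

    # Autoencoder (4 -> 2 -> 4), trained to reconstruct its input.
    W1 = rng.normal(0, 0.1, (4, 2))
    W2 = rng.normal(0, 0.1, (2, 4))
    for _ in range(2000):
        H = np.tanh(X @ W1)                   # 2-D codes (reduced features)
        E = H @ W2 - X                        # reconstruction error
        dW2 = H.T @ E                         # gradient for decoder weights
        dW1 = X.T @ ((E @ W2.T) * (1 - H**2)) # back-propagate through tanh
        W1 -= 1e-3 * dW1
        W2 -= 1e-3 * dW2

    # k-means (Manhattan distance) on the learned 2-D codes.
    codes = np.tanh(X @ W1)
    centers = codes[rng.choice(len(codes), 2, replace=False)]
    for _ in range(10):
        assign = np.abs(codes[:, None] - centers).sum(-1).argmin(1)
        centers = np.array([codes[assign == k].mean(0) for k in range(2)])
    print(np.bincount(assign))                # ideally ~100 points per cluster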
Fig. 1. Heterogeneous architecture including memristor based multicore
system as one of the processing components. Proposed multicore system
with several neural cores (NC), one clustering core connected through a 2-D
mesh routing network. (R: router).
Fig. 1 shows the overall system architecture which includes
memristor crossbar based neural cores (NC) and a digital unit
for clustering. Memristor crossbar neural cores are utilized to
provide low power, high throughput execution of neural
networks. The cores in the system are connected through an
on-chip routing network (R). The neural cores receive their
inputs from main memory holding the training data through
the on-chip routers and a buffer between the routing and main
memory as shown in Fig. 1.
Since the proposed architecture is geared towards machine
learning applications (including big data processing), the
training data will be used multiple times during the training
phase. As a result, access to a high bandwidth memory system
is needed, and thus we propose the use of 3D stacked DRAM.
Using through silicon vias (TSVs) reduces memory access
time and energy, thereby reducing overall system power.
Efficient access to the main memory is provided by a DMA
controller that is initialized by a RISC processing core. A
single issue pipelined RISC core is used to reduce power
consumption.
An on-chip routing network is needed to transfer neuron
outputs among cores in a multicore system. In feed-forward
neural networks, the outputs of a neuron layer are sent to the
following layer after every iteration (as opposed to a spiking
network, where outputs are sent only if a neuron fires). This
means that the communication between neurons is
deterministic and hence a static routing network can be used
for the core-to-core communications. In this study, we
assumed a static routing network, as this consumes less
power than a dynamic routing network. The network is
statically time multiplexed between cores for exchanging
multiple neuron outputs.
SRAM based static routing is utilized to facilitate re-
programmability in the switches. Fig. 2 shows the routing
switch design. Note that the switch allows outputs from a
core to be routed back into the core, to implement recurrent
networks or multi-layer networks where all the neuron layers
are within the same core.
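Because the communication pattern of a feed-forward network is deterministic, the switch settings can be compiled offline and simply replayed. The following small Python sketch (the table layout and port numbering are our own illustration, not the actual switch configuration format) shows the idea of a static time multiplexed schedule, including the local feedback path mentioned above:

    # Hypothetical static TDM schedule: slot -> (input port -> output port).
    # Ports: 0-3 = N/E/S/W links, 4 = local core. Replays every iteration.
    schedule = {
        0: {4: 1},          # slot 0: local core output -> east link
        1: {3: 4},          # slot 1: west link -> local core (layer input)
        2: {4: 4},          # slot 2: loop back for a layer on the same core
    }

    def route(slot, in_port):
        """Return the output port for a flit arriving in this slot."""
        return schedule[slot % len(schedule)].get(in_port)  # None: no route

    assert route(2, 4) == 4   # recurrent/local feedback path, as in Fig. 2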
Fig. 2. SRAM based static routing switch. Each blue circle in the left part of
the figure represents the 8x8 SRAM based switch shown in the middle
(assuming an 8-bit network bus).
III. AUTOENCODER AND MEMRISTOR NEURON CIRCUIT
A. Memristor Devices
The memristor device was first theorized in 1971 by Dr.
Leon Chua [3]. Memristors are essentially nanoscale variable
resistors, where the resistance is modified if a voltage greater
than a write threshold voltage (Vth) is applied [17].
Conversely, their resistance can be read, without changing the
device state, by applying a voltage below Vth across them. Several
research groups have demonstrated memristive behavior
using different materials. One such device, composed of
layers of HfOx and AlOx [18], has a high on state resistance
(RON≈10kΩ), a very high resistance ratio (ROFF/RON≈1000)
and a write threshold voltage of about 1.3 V. Physical
memristors can be laid out in a high density grid known as a
crossbar. The schematic and layout of a memristor crossbar
can each be seen in Fig. 3. A memristor in the crossbar
structure occupies 4F2 area (where F is the device feature
size). This is 36 times smaller than an SRAM memory cell. A
memristor crossbar can perform many multiply-add
operations in parallel, and the conductances of multiple
memristors in a crossbar can be updated in parallel. Multiply-
add operations are the dominant operations in neural networks,
and training a neural network requires updating the synaptic
weights iteratively. As a consequence, memristors have
great potential as synaptic elements in neural network based
system designs.
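As a quick sanity check on these density figures (assuming a conventional 6T SRAM cell of roughly 144F², a common estimate not stated in this paper): 144F²/4F² = 36, matching the per-cell claim above, and an 8 bit SRAM synapse built from eight such cells (about 1152F²) versus a two-memristor synapse (8F²) gives 1152/8 = 144, in line with the roughly 146× synapse density figure cited in the introduction.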
Fig. 3. (a) Memristor crossbar schematic and (b) memristor crossbar
layout.
B. Neuron circuit
Fig. 4 shows the block diagram of a neuron. The neuron
performs two types of operations: (1) a dot product of the
inputs x1,…,xn and the weights w1,…,wn, and (2) the
evaluation of an activation function. The dot product
operation is shown in Eq. (1), and the activation function of
the neuron is shown in Eq. (2). In a multi-layer neural
network, a non-linear activation function is desired
(e.g. f(x) = tan⁻¹(x)).
Fig. 4. Neuron block diagram.
DPj = Σi=1..n xi wij            (1)

yj = f(DPj)            (2)
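As a concrete instance of Eqs. (1)-(2), the following Python sketch (with arbitrary example values of ours) evaluates a single neuron:

    import numpy as np

    x = np.array([0.2, -0.5, 0.8])   # inputs x1..x3
    w = np.array([0.4,  0.1, -0.3])  # weights w1..w3
    dp = np.dot(x, w)                # Eq. (1): DPj = sum_i xi*wij = -0.21
    y = np.arctan(dp)                # Eq. (2) with f(x) = tan^-1(x)
    print(dp, y)                     # -0.21, approx -0.207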
In this paper we have utilized memristors as neuron
synaptic weights. The circuit in Fig. 5 shows the memristor
based neuron circuit design utilized in this paper. In this
circuit, each data input is connected to two virtually grounded
op-amps (operational amplifiers) through a pair of
memristors. For a given row, if the conductance of a
memristor connected to the first column (σA+) is higher than
the conductance of the memristor connected to the second
column (σA-), then the pair of memristors represents a positive
synaptic weight. In the inverse situation, when σA+ < σA−, the
memristor pair represents a negative synaptic weight. As the
op-amps in Fig. 5 virtually ground the two neuron columns,
the currents through the first and second columns can be
calculated as (A·σA+ + … + β·σβ+) and (A·σA− + … + β·σβ−)
respectively.
Fig. 5. Circuit diagram for a single memristor based neuron.
Assume that

DPj = 4Rf[(A·σA+ + … + β·σβ+) − (A·σA− + … + β·σβ−)]
    = 4Rf[A(σA+ − σA−) + … + β(σβ+ − σβ−)]

The output yj of the op-amp connected directly to the
second column represents the neuron output. When the power
rails of the op-amps, VDD and VSS, are set to 0.5 V and −0.5 V
respectively, the neuron circuit implements the activation
function h(x) in Eq. (3), where
x = 4Rf[A(σA+ − σA−) + … + β(σβ+ − σβ−)]. This implies that
the neuron output can be expressed as h(DPj). Fig. 6 shows
that h(x) closely approximates the activation function
f(x) = 1/(1 + e⁻ˣ) − 0.5. The values of VDD and VSS are chosen
such that no memristor sees a voltage greater than Vth across it
during evaluation.

h(x) = x/4  if |x| < 2;  0.5·sgn(x) otherwise            (3)
Fig. 6. Plot of functions f(x) and h(x).
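To make the differential encoding and Eq. (3) concrete, the sketch below (Python, with illustrative conductance values only) computes a neuron output from a conductance column pair and applies the clipped-linear activation, which can be compared against the shifted sigmoid of Fig. 6:

    import numpy as np

    Rf = 1.0                               # feedback resistance (normalized)
    A = np.array([0.3, -0.6, 0.9])         # row input voltages
    g_pos = np.array([0.08, 0.02, 0.05])   # column-1 conductances (sigma+)
    g_neg = np.array([0.03, 0.06, 0.01])   # column-2 conductances (sigma-)

    # Effective signed weights and dot product, as in the DPj derivation.
    w = g_pos - g_neg                      # positive if sigma+ > sigma-
    dp = 4 * Rf * np.dot(A, w)             # dp = 0.3 for these values

    def h(x):                              # Eq. (3): clipped-linear activation
        return np.clip(x / 4.0, -0.5, 0.5)

    f = lambda x: 1 / (1 + np.exp(-x)) - 0.5   # shifted sigmoid of Fig. 6
    print(h(dp), f(dp))                    # 0.075 vs ~0.074: close for small |dp|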
C. Autoencoder
Several big data applications are particularly focused on
classification and clustering tasks. The robustness of such
systems depends on how well they can extract important
features from the raw data. For big data processing, we are
interested in a generic feature extraction mechanism that
works for a variety of applications. The autoencoder [19]
is a popular dimensionality reduction and feature extraction
approach that automatically learns features from unlabeled
data; that is, it is an unsupervised approach and requires no
labeled data.
The architecture of the autoencoder is similar to a multi-
layer neural network as shown in Fig. 7. An autoencoder tries
to learn the function hW,b(x) ≈ x. That is, it tries to learn an
approximation to the identity function, such that output x’ is
similar to input x. By placing constraints on the network, such
as by limiting the number of hidden units, we can discover
useful patterns within the data.
Fig. 7. Two layer network having four inputs, three hidden neurons and four
output neurons.
As a concrete example, suppose the inputs x are the pixel
intensity values from a 10×10 image (100 pixels) so n = 100.
An autoencoder computing a compressed representation of
these data has 100 inputs, 50 neurons in the hidden layer, and
100 output neurons. The network is trained such that the output
layer neurons generate the same values as the applied inputs.
Since there are only 50 hidden units, the network is forced to
learn a compressed representation of the input. This paper
utilizes memristor devices to develop novel extreme low
power implementations of on-chip training circuitry for
unsupervised training of the autoencoder.
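As a rough resource estimate (our illustration, based on the two-columns-per-neuron scheme of section III(B) and the bias rows used in section III(D)): the 100→50 encoder layer needs a (100+1)×(2×50) = 101×100 crossbar, and the temporarily attached 50→100 decoder layer needs a (50+1)×(2×100) = 51×200 crossbar, so this example autoencoder fits within a single 400×200 neural core (section IV(A)).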
D. Memristor Based Neural Network Implementation
The structures of a multi-layer neural network and an
autoencoder are similar. Both can be viewed as a feed forward
neural network. Training of the two systems, however, is
different. The autoencoder is trained layer by layer. The
training of each layer is similar to training a two layer neural
network, where a temporarily added second layer tries to learn
the inputs applied to the first layer (detailed in the previous
subsection). Fig. 7
shows a simple two layer feed forward neural network with
four inputs, four outputs, and three hidden layer neurons. Fig.
8 shows a memristor crossbar based circuit that can be used
to evaluate the neural network in Fig. 7. There are two
memristor crossbars in this circuit, each representing a layer of
neurons. Each crossbar utilizes the neuron circuit shown in
Fig. 5.
In Fig. 8, the first layer of neurons is implemented using
a 5×6 crossbar. The second layer of four neurons is
implemented using a 4×8 memristor crossbar, where three of
the inputs come from the three neuron outputs of the first
crossbar and one additional input is used for bias. By applying
the inputs to a crossbar, an entire layer of neurons can be
processed in parallel within one cycle.
Fig. 8. Schematic of the neural network shown in Fig. 7 for forward
pass.
E. The Training Algorithm
In order to provide proper functionality, a multi-layer
neural network needs to be trained using a training algorithm.
Back-propagation (BP) [20] and the variants of the BP
algorithm are widely used for training such networks. The
stochastic BP algorithm is used to train the memristor based
multi-layer neural network and is described below:
1) Initialize the memristors with high random resistances.
2) For each input pattern x:
i) Apply the input pattern x to the crossbar circuit
and evaluate the DPj values and outputs (yj) of all
neurons (hidden neurons and output neurons).
ii) For each output layer neuron j, calculate the error,
δj, between the neuron output (yj) and the target
output (tj).
δj = (tj − yj)            (4)
iii) Back propagate the error for each hidden layer
neuron j.
δj = Σk δk wk,j            (5)
where neuron k in the following layer is connected to
neuron j in the previous layer.
iv) Determine the amount, Δwj, by which each neuron's
synapses should be changed (2η is the learning
rate and f is the activation function):
Δwj = 2η × δj × f′(DPj) × x            (6)
If the error in the output layer has not converged to a
sufficiently small value, go to step 2.
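The following Python sketch (a software analogue on made-up toy data, not the circuit itself) walks through steps i-iv above for a two layer network, using Eqs. (4)-(6) with f(x) = tanh(x):

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.uniform(-1, 1, (4, 3))           # 4 training patterns, 3 inputs
    T = np.sign(X.sum(1, keepdims=True)) / 2 # toy targets in [-0.5, 0.5]
    W1 = rng.normal(0, 0.1, (3, 5))          # hidden layer weights
    W2 = rng.normal(0, 0.1, (5, 1))          # output layer weights
    eta = 0.05

    for epoch in range(500):
        for x, t in zip(X, T):               # step 2: one pattern at a time
            dp1 = x @ W1;  y1 = np.tanh(dp1)      # step i: forward pass
            dp2 = y1 @ W2; y2 = np.tanh(dp2)
            d2 = t - y2                            # step ii, Eq. (4)
            d1 = W2 @ d2                           # step iii, Eq. (5)
            # step iv, Eq. (6); f'(DP) = 1 - tanh(DP)^2 for f = tanh
            W2 += 2 * eta * np.outer(y1, d2 * (1 - y2**2))
            W1 += 2 * eta * np.outer(x,  d1 * (1 - y1**2))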
F. Circuit Implementation of the Training Algorithm
The circuit implementation of the training algorithm for the
neural network in Fig. 8 can be broken down into the
following major steps:
1. Apply inputs to layer 1 and record the layer 2 neuron
output errors.
2. Back-propagate layer 2 errors through the second
layer weights and record the layer 1 errors.
3. Update the synaptic weights.
The circuit implementations of these steps are detailed below:
Step 1: A set of inputs is applied to the layer 1 neurons, and
the layer 2 neuron outputs are measured. This process is
shown in Fig. 8. Errors are discretized into 8 bit
representations (one sign bit and 7 bits for magnitude).
Step 2: The layer 2 errors (δL2,1 and δL2,2) are applied to the
layer 2 weights after conversion from digital to analog form
as shown in Fig. 9 to generate the layer 1 errors (δL1,1 to δL1,6).
The memristor crossbar in Fig. 9 is the same as the layer 2
crossbar in Fig. 8. Assume that the synaptic weight associated
with input i and second layer neuron j is wij = σij+ − σij− for
i = 1,2,3 and j = 1,2,…,4. In the backward phase we want to
evaluate, for i = 1,2,3,

δL1,i = Σj wij δL2,j
      = Σj (σij+ − σij−) δL2,j
      = Σj σij+ δL2,j − Σj σij− δL2,j            (7)

The circuit in Fig. 9 essentially evaluates the same
operations as Eq. (7), applying both δL2,j and −δL2,j to the
crossbar columns. The back propagated errors are
discretized using ADCs and are stored in buffers. To reduce
the training circuit overhead, we can multiplex the back
propagated error generation circuit as shown in Fig. 10. In this
circuit, by enabling the appropriate pass transistors, the back
propagated errors are generated sequentially and stored in the
buffer. Access to the pass transistors is controlled by a
shift register.
Fig. 9. Schematic of the neural
network shown in Fig. 7 for back
propagating errors to layer 1.
Fig. 10. Implementing the back propagation phase by
multiplexing the error generation circuit.
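A small numerical check of Eq. (7) in Python (arbitrary values of ours): applying +δL2,j and −δL2,j to the two column groups of the differential crossbar is equivalent to multiplying the error vector by the signed weight matrix.

    import numpy as np

    rng = np.random.default_rng(2)
    g_pos = rng.uniform(0, 0.1, (3, 4))   # sigma+ conductances (3 in, 4 out)
    g_neg = rng.uniform(0, 0.1, (3, 4))   # sigma- conductances
    d_l2 = rng.normal(size=4)             # layer 2 errors, delta_L2

    W = g_pos - g_neg                     # signed weights w_ij
    direct = W @ d_l2                     # delta_L1,i = sum_j w_ij delta_L2,j
    via_columns = g_pos @ d_l2 - g_neg @ d_l2   # Eq. (7), +/- errors applied
    assert np.allclose(direct, via_columns)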
Step 3: The weight update procedures for both layers are
similar. They take neuron inputs, errors, and intermediate
neuron outputs (DPj) to generate a set of training pulses. The
training unit essentially implements Eq. (6). To update a
synaptic weight by an amount Δw, the conductance of the
memristor connected to the first column of the corresponding
neuron will be updated by the amount Δw/2 (where
Δw/2 = η × δj × f′(DPj) × xi), and the conductance of the
memristor connected to the second column of the
corresponding neuron by the amount −Δw/2. We
will describe the weight update procedure for the first
columns of the neurons (odd crossbar columns in Fig. 8). The
weight update procedure for the second columns of the
neurons (even columns) is similar to the procedure for the first
columns, except that we need to multiply the neuron errors
(𝛿𝑗) by -1.
In Eq. (6) we need to evaluate the derivative of the
activation function for the dot product of the neuron inputs
and weights (DPj). The DPj value of neuron j is essentially the
scaled difference of the currents through the two columns
implementing the neuron (see section III(B)). By applying the
inputs to the neuron again, the difference of the column
currents is converted into digital form and stored in the buffer. The
derivative of the activation function (f’) is evaluated using a
lookup table. A digital multiplier is utilized to multiply the
neuron error (δj) and f′(DPj). The product δj×f′(DPj) is
converted into analog form and applied to the training pulse
generation unit (see Fig. 11(b)). For training, pulses of
variable amplitude and variable duration are produced. The
amplitude of the training signal is modulated by the neuron
input (xi) and is applied to the row wire connecting the desired
memristor (see Fig. 11(a)). The duration of the training pulse
is modulated by η×δj×f’(DPj) and is applied to the column
wire connecting the desired memristor (see Fig. 11(b)). The
combined effect of the two voltages applied across the
memristor will update the conductance by an amount
proportional to 𝜂 × 𝛿𝑗 × 𝑓′(𝐷𝑃𝑗) × 𝑥𝑖.
Fig. 11. Training pulse generation module. Here the value of Vb is 1.2 V and
V∆1 is a triangular wave whose duration determines the learning rate.
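The effect of step 3 on a differential pair can be sanity checked in software (Python sketch with toy values of ours): adding Δw/2 to the first column conductance and subtracting Δw/2 from the second column conductance changes the effective weight by exactly Δw.

    eta, delta_j, fprime_dpj, x_i = 0.05, 0.3, 0.9, 0.7
    dw = 2 * eta * delta_j * fprime_dpj * x_i     # Eq. (6)

    g_pos, g_neg = 0.040, 0.055                   # conductance pair
    w_before = g_pos - g_neg
    g_pos += dw / 2                               # first column: +dw/2
    g_neg -= dw / 2                               # second column: -dw/2
    w_after = g_pos - g_neg
    assert abs((w_after - w_before) - dw) < 1e-12 # net weight change is dw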
IV. HETEROGENEOUS CORES
The proposed system has a RISC core, a digital core for
clustering and memristor neural cores. These cores are
connected through an on-chip routing network. The
memristor neural cores could be used to implement
autoencoders and feed forward neural networks of different
configurations.
A. Memristor Neural Core
Fig. 12 shows the memristor based single neural core
architecture. It consists of a memristor crossbar of size
400×200, input and output buffers, a training unit, and a
control unit. Our objective was to take as big crossbar as
possible, because this enables more computations to be done
in parallel. Doing experiment on different crossbar sizes, we
observed that 400×200 crossbar has very little impact of sneak
paths for the memristor device considered (high resistance
values). The control unit will manage the input and output
buffers and will interact with the specific routing switch
associated with the core. The control unit will be implemented
as a finite state machine and thus will be of low overhead.
Processing in the core is analog and entire core process in one
cycle for an input. The neural cores communicate between
each other in digital form as it is expensive to exchange and
latch analog signals. Neuron outputs are discretized using a
three bit ADC converter for each neuron and are stored in the
output buffer for routing. Inputs come to the neural core
through the routing network in digital form and are stored in
the input buffer. Inputs are applied to the memristor crossbar,
converting them into analog form.
Fig. 12. Single memristor neural core architecture.
B. Digital Clustering Core
We have designed a digital core for performing k-means
clustering on the feature representation output of the
autoencoder. The core can be configured to generate up to
32 clusters, and the maximum input dimension is 32 after
dimensionality reduction using the autoencoder. The core
performs clustering based on Manhattan distance
calculations.
Assume that our data dimension is n and the number of
clusters is m. For an element of the data sample, the
corresponding Manhattan distances for the current m cluster
centers are evaluated in parallel and are accumulated in the
dist. registers (Fig. 13). If the output of a subtractor (in the
first row of subtractors) is negative, it is subtracted from the
corresponding dist. register; otherwise it is added. After n
iterations (one for each element of the input sample), the
Manhattan distance between the input sample and the current
cluster centers will be in dist. registers. In the next m cycles,
the minimum distance value and the corresponding cluster
center index will be evaluated using the circuit shown at the
right of Fig. 13.
The system has another set of registers for each cluster
center (Fig. 13) to store the accumulated values of the data
samples belonging to that cluster. One counter per cluster
keeps track of how many data samples belong to that cluster.
Each data sample is accumulated in the center accumulator
register corresponding to its minimum distance cluster. To
minimize hardware cost and processing time, this operation is
overlapped with the distance calculation step for the next data
pattern in Fig. 13. When all the training data samples have
been assigned to clusters based on the minimum distances, the
new cluster centers are evaluated by dividing the center
accumulator registers by the corresponding sample counter
values. The new cluster centers are then loaded, and
operations for the next epoch are performed based on them.
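The following Python sketch mirrors the loop structure described above (element-serial Manhattan distance accumulation, center accumulators, and per-cluster counters); it is a behavioral model of the core, not the register-transfer design:

    import numpy as np

    def cluster_epoch(data, centers):
        m, n = centers.shape                        # m clusters, n dimensions
        acc = np.zeros_like(centers)                # center accumulators
        cnt = np.zeros(m, dtype=int)                # per-cluster counters
        for x in data:
            dist = np.zeros(m)
            for e in range(n):                      # n iterations per sample
                dist += np.abs(x[e] - centers[:, e])  # |diff| added to dist.
            k = int(np.argmin(dist))                # minimum distance index
            acc[k] += x
            cnt[k] += 1
        cnt = np.maximum(cnt, 1)                    # guard against empty cluster
        return acc / cnt[:, None]                   # division step: new centers

    rng = np.random.default_rng(3)
    data = rng.uniform(0, 1, (1000, 32))            # up to 32-D inputs
    centers = data[:4].copy()                       # 4 of up to 32 clusters
    for _ in range(5):                              # a few epochs
        centers = cluster_epoch(data, centers)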
Fig. 13. Digital clustering core design implementing k-means clustering algorithm.
V. EXPERIMENTAL SETUP
A. Applications and Datasets
We have examined the MNIST [21] and ISOLET [22]
datasets for classification and clustering tasks. The
classification applications utilized deep networks where the
autoencoder was utilized for unsupervised layer-wise pre-
training of the networks. For clustering applications,
dimensionality reduction and feature extraction tasks are
performed using the autoencoder. After that, the k-means
clustering algorithm is applied to the data generated by the
autoencoder. An autoencoder based anomaly detection was
examined on the KDD dataset [23]. Table I shows the neural
network configurations for different datasets and applications.
Table I: Neural network configurations.

  Application                 Dataset   Configuration
  Anomaly detection           KDD       41→15→41
  Classification              MNIST     784→300→200→100→10
                              ISOLET    617→2000→1000→500→250→26
  Dimensionality reduction    MNIST     784→300→200→100→20
                              ISOLET    617→2000→1000→500→250→20
B. Mapping Neural Networks to Cores
The neural hardware is not able to time-multiplex neurons,
as the synaptic weights are stored directly within the neural
circuits. Hence a neural network's structure may need to be
modified to fit into a neural core. In cases where a network
was significantly smaller than a neural core's capacity,
multiple neural layers were mapped to one core. In this case,
the layers executed in a pipelined manner, where the outputs
of layer 1 were fed back into layer 2 on the same core through
the core's routing switch.
When a software network layer was too large to fit into a
core (either because it needed too many inputs or it had too
many neurons), the network layer was split amongst multiple
cores. Splitting a layer across multiple cores due to a large
number of output neurons is trivial. When there were too
many inputs per neuron for a core, each neuron was split into
multiple smaller neurons as shown in Fig. 14. When splitting
a neuron, the network needs to be trained based on the new
network topology. As the network topology is determined
prior to training (based on the neural hardware architecture),
the split neuron weights are trained correctly.
Fig. 14. Splitting a neuron into multiple smaller neurons.
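The arithmetic behind this splitting can be sketched as follows (Python, with a hypothetical 800 input neuron and the 400 row core limit of section IV): partial neurons compute partial dot products that a combining neuron sums.

    import numpy as np

    rng = np.random.default_rng(4)
    x = rng.normal(size=800)               # 800 inputs > 400-row core limit
    w = rng.normal(size=800)

    # Split into two 400-input partial neurons on two cores; a combining
    # neuron with unit weights sums the partial dot products.
    dp_parts = [x[:400] @ w[:400], x[400:] @ w[400:]]
    dp_combined = np.dot([1.0, 1.0], dp_parts)
    assert np.isclose(dp_combined, x @ w)

    # With a nonlinear activation on the partial neurons the split network
    # is no longer identical, which is why (as noted above) training is
    # performed on the split topology itself.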
C. Area Power Calculations
The area, power, and timing of the SRAM array, used in
the clustering core, were calculated using CACTI [24] with
the low operating power transistor option utilized. We
assumed a 45 nm process in our area power calculations.
Components of a typical cache that would not be needed in
the neural core (such as the tag array and tag comparator)
were not included in the calculations. The clustering core also
requires adders, registers, multiplexers, and a control unit.
The power of these basic components was determined
through SPICE simulations. A frequency of 200 MHz was
assumed for the digital system to keep power consumption
low.
For the memristor cores, detailed SPICE simulations were
used for power and timing calculations of the analog circuits
(drivers, crossbar, and activation function circuits). These
simulations considered the wire resistance and capacitance
within the crossbar as well. The results show that the crossbar
required 20 ns to be evaluated. As the memristor crossbars
evaluate all neurons in one step, the majority of time in these
systems is spent in transferring neuron outputs between cores
through the routing network. We assumed that the routing
would run at a 200 MHz clock, resulting in 4 cycles needed
for crossbar processing. The routing link power was calculated using
Orion [25] (assuming 8 bits per link). Off-chip I/O energy was
also considered as described in section II. Data transfer
energy via TSV was assumed to be 0.05 pJ/bit [26].
VI. RESULTS
A. Supervised Training Result
We have performed detailed simulations of the memristor
crossbar based neural network systems. MATLAB (R2014a)
and SPICE (LTspice IV) were used to develop a simulation
framework for applying the training algorithm to a multi-layer
neural network circuit. SPICE was mainly used for detailed
analog simulations of the memristor crossbar array and
MATLAB was used to simulate the rest of the system. The
circuits were trained by applying input patterns one by one
until the errors were below the desired levels.
Simulation of the memristor device used an accurate model
of the device published in [27]. The memristor device
simulated for this circuit was published in [18] and the
switching characteristics for the model are displayed in Fig.
15. This device was chosen for its high minimum resistance
value and large resistance ratio. According to the data
presented in [18] this device has a minimum resistance of 10
kΩ, a resistance ratio of 103, and the full resistance range of
the device can be switched in 20 μs by applying 2.5 V across
the device.
Fig. 15. Simulation results displaying the input voltage and current waveforms for the memristor model [27] that was based on the device in [18].
The following parameter values were used in the model to obtain this result:
Vp=1.3V, Vn=1.3V, Ap=5800, An=5800, xp=0.9995, xn=0.9995, αp=3, αn=3,
a1=0.002, a2=0.002, b=0.05, x0=0.001.
SPICE simulations of large memristor crossbars are very
time consuming. Therefore, we examined the on-chip
supervised training approach with the Iris dataset [28], which
requires only crossbars of manageable sizes. A two layer network
with four inputs, ten hidden neurons, and one output neuron
was utilized. Fig. 16 shows the Matlab-SPICE training
simulation on the Iris dataset. It shows that the neural network
was able to learn the desired classifiers.
Fig. 16. Learning curve for the classification based on Iris dataset.
B. Unsupervised Training Result
Unsupervised training of the autoencoder was performed
on the Iris dataset. The network configuration utilized was
4→2→4. After training, the two hidden neuron outputs give
the feature representation of the corresponding input data in a
reduced dimension of two. Fig. 17 shows the distribution of
the data of three different classes (setosa, versicolor,
virginica) in the feature space. We can observe that data
belonging to the same class appear close together in the
feature space and could potentially be linearly separated.
Fig. 17. Distribution of the data of different classes in the feature space.
C. Anomaly Detection
We have examined autoencoder based anomaly detection on
the KDD dataset. The network configuration for this
application was 39→15→39. As the SPICE simulation of this
network size takes a very long time, we performed this
experiment in MATLAB. The autoencoder was trained using
only normal data packets (5292 normal packets; no anomalous
packets were used during training). It is expected that during
evaluation, the differences between the input and its
reconstruction will be smaller for normal data packets than for
attack packets.
Figures 18 and 19 show the distribution of the distances
between inputs and the corresponding reconstructions for the
normal packets and the attack packets respectively. From Fig.
20 it is seen that the system can detect about 96.6% of the
anomalous packets with a 4% false detection rate. Similar
approaches could be used for other big data applications
where the objective is to detect anomalous patterns from large
volumes of continuously streaming data in real time at low
power.
Fig. 18. Distance between original
data and reconstructed data for
normal packets.
Fig. 19. Distance between
original data and reconstructed
data for attack packets.
Fig. 20. Anomaly detection rate for the test dataset for different decision
parameters.
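The decision procedure behind Figs. 18-20 amounts to thresholding the reconstruction distance. The sketch below (Python, with synthetic distance distributions standing in for the actual KDD results) shows how detection and false detection rates trade off as the threshold sweeps:

    import numpy as np

    rng = np.random.default_rng(5)
    # Synthetic stand-ins for reconstruction distances (not the KDD data):
    normal_d = rng.gamma(2.0, 0.05, 5000)   # small distances for normal packets
    attack_d = rng.gamma(2.0, 0.25, 2000)   # larger distances for attacks

    for thr in [0.1, 0.2, 0.3, 0.5]:        # sweep the decision parameter
        detection = (attack_d > thr).mean() * 100   # % of attacks flagged
        false_det = (normal_d > thr).mean() * 100   # % of normals flagged
        print(f"thr={thr:.2f}  detect={detection:5.1f}%  false={false_det:5.1f}%")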
D. Impact of System Constraints on Application Accuracy
The hardware neural network training circuit differs from
the software implementations in the following aspects:
- Limited precision of the discretized neuron errors
and DPj values.
- Limited precision of the discretized neuron output
values.
- Each neuron can have a maximum of 400 synapses.
Fig. 21 compares the accuracies obtained from Matlab
implementations of the applications with and without the
hardware system constraints. It is seen that even with the
system constraints enforced, the applications still give
competitive performance.
Fig. 21. Impact of memristor system constraints (3 bits neuron output and 8
bits neuron error precision) on application accuracy.
E. Single Core Area and Power
The area of the digital clustering core is 0.039 mm2 and its
power consumption is 1.36 mW. The training of 1000 samples
for one epoch in this core takes 0.32 μs.
The memristor neural core configuration is 400×100, i.e. it
can take a maximum of 400 inputs and can process a
maximum of 100 neurons. The area of a single memristor
neural core is 0.0163 mm2. Table II shows a single memristor
core power and timing at different steps of execution. The
RISC core is considered to be used only for configuring the
cores, routers, and DMA engine. As a result, we assume that
the RISC core is turned off afterwards during the actual
training or evaluation phases.
Table II: Memristor core timing and power for different execution steps.

  Execution step               Time (μs)   Power (mW)
  Forward pass (recognition)   0.27        0.794
  Backward pass                0.80        0.706
  Weight update                1.00        6.513
  Control unit                 —           0.0004
F. System Level Evaluations
Total system area: The whole multicore system includes
144 memristor neural cores, one digital clustering core, one
RISC core, one DMA controller, 4 kB of input buffer and 1
kB of output buffer. The RISC core area was evaluated using
McPAT [36] and came to 0.52 mm2. The total system area was
2.94 mm2.
Energy efficiency: We have compared the throughput and
energy benefits of the proposed architecture over an Nvidia
Tesla K20 GPU. The GPU consumes 225 W and occupies
561 mm2 in a 28 nm process. Figures 22 and 23 show the
throughput and energy efficiencies respectively of the
proposed system over the GPU for different applications
during training. For training, the proposed architecture
provides up to 30× speedup and four to six orders of
magnitude more energy efficiency over the GPU.
Fig. 22. Application speedup over GPU for training.
Fig. 23. Energy efficiency of the proposed system over GPU for training.
Figures 24 and 25 show the throughput and energy
efficiencies respectively of the proposed system over the GPU
for different applications during evaluation of new inputs. For
recognition, the proposed architecture provides up to 50×
speedup and five to six orders of magnitude more energy
efficiency over the GPU.
Fig. 24. Application speedup over GPU for recognition.
Fig. 25. Energy efficiency of the proposed system over GPU for evaluation of a new input.
Table III shows the number of cores used, the time and the
energy for a single training data item in one iteration. Table
IV shows the evaluation time and energy for a single test data item.
In both training and recognition phases the computation
energy dominated the total system energy consumption.
Table III: Number of cores used, and the time and energy for a single
training input in the proposed architecture.

  Application      # of cores  Time (μs)  Compute energy (J)  IO energy (J)  Total energy (J)
  Mnist_class          57       7.29      4.18E-07            8.48E-09       4.26E-07
  Mnist_AE             57      17.99      8.37E-07            8.57E-09       8.45E-07
  Mnist_kmeans          1       0.42      9.67E-10            4.47E-12       9.71E-10
  Isolate_AE          132      24.41      1.97E-06            2.68E-08       1.99E-06
  Isolate_kmeans        1       0.42      9.67E-10            4.47E-12       9.71E-10
  Isolet_class        132       8.86      9.67E-07            2.67E-08       9.94E-07
  KDD_anomaly           1       4.15      7.33E-09            4.51E-09       1.18E-08
Table IV: Recognition time and energy for one input in the proposed
architecture.

  Application      Time (μs)  Compute energy (J)  IO energy (J)  Total energy (J)
  Mnist_class        0.77     1.42E-08            8.43E-09       2.26E-08
  Mnist_AE           0.77     1.42E-08            8.43E-09       2.26E-08
  Mnist_kmeans       0.32     8.89E-10            3.69E-12       8.93E-10
  Isolate_AE         0.77     3.28E-08            2.67E-08       5.94E-08
  Isolate_kmeans     0.32     8.89E-10            3.69E-12       8.93E-10
  Isolet_class       0.77     3.28E-08            2.67E-08       5.94E-08
  KDD_anomaly        0.77     2.48E-10            4.48E-09       4.73E-09
VII. RELATED WORK
A. Digital Systems
High performance computing platforms can simulate a
large number of neurons, but are very expensive from a power
consumption standpoint. Specialized architectures [29-32]
can significantly reduce the power consumption for neural
network applications and yet provide high performance.
IBM's TrueNorth chip consists of 5.4 billion transistors
[33]. It has 4,096 neurosynaptic cores interconnected via an
intra-chip network that integrates one million programmable
spiking neurons and 256 million configurable synapses. The
basic building block is a core, a self-contained neural network
with 256 input lines (axons), and 256 outputs (neurons)
connected via 256×256 directed, programmable synaptic
connections. With a 400×240-pixel video input at 30 frames-
per-second, the chip consumes 63 mW. Key differences
between this system and our system are that TrueNorth uses
spiking neurons, has asynchronous communications, and
allows the SRAM arrays to enter ultra low leakage modes. We
examine the impact of using leakage-less SRAM arrays in our
study.
DaDianNao and PuDianNao [1,2] are accelerators for deep
neural networks (DNNs) and convolutional neural networks
(CNNs). These systems are based on fully digital designs. In
PuDianNao [2], neuron synaptic weights are stored in off-chip
memory, which requires data to move back and forth between
the off-chip memory and the processing chip during training.
In DaDianNao [1], neuron synaptic weights are stored
in eDRAM and are later brought into a neural functional unit
for execution. Input data are initially stored in a central
eDRAM. This system utilized dynamic wormhole routing for
data transfers. It was able to achieve 150× energy reduction
over GPUs. ShiDianNao is an accelerator only for CNNs. As
CNNs utilize shared weights, a small on-chip SRAM cache
was able to store the whole network’s weights.
St. Amant et al. [34] presented a compilation approach
where a user would annotate a code segment for conversion
to a neural network. This study examined transformation of
several kernels including FFT, JPEG, K-means, and the Sobel
edge detector into equivalent neural networks. In this work
they utilized analog neural cores to accelerate a RISC
processor for general purpose codes. Our approach varies
from this in that we are examining standalone neural cores for
embedded applications, and using memristors for the
processing.
Belhadj et al. [35] proposed multicore architectures for
spiking neurons and evaluated them for embedded signal
processing applications. They stored synaptic weights in
digital form and used digital to analog converters to process
the neurons. There were significant differences in the neuron
circuit designs and neural algorithm compared to our system.
B. Memristor Based Systems
The most recent results for memristor based neuromorphic
systems can be seen in [12-15]. Alibart et al. [12] and
Prezioso et al. [13] examined only linearly separable
problems. Li et al. [15] demonstrated in-situ training of
memristor crossbar based neural networks for nonlinear
classifier designs. They proposed having two copies of the
same synaptic weights, one for the forward pass and another,
transposed version for the backward pass. However, it is not
practical to maintain an exact copy of a memristor crossbar
because of memristor device stochasticity. Soudry et al. [14]
proposed an
implementation of gradient descent based training on
memristor crossbar neural networks. They utilized two
transistors and one memristor per synapse. Their synaptic
weight precision is less than that of the two memristors per
synapse neuron design utilized in this paper.
Liu et al. [16] examined memristor based neural networks
utilizing four memristors per synapse, while the proposed
system utilizes two memristors per synapse. They proposed
deploying their system as a neural accelerator attached to a
RISC processor. In contrast, the proposed system processes
data coming directly from a 3D stacked sensor chip.
VIII. CONCLUSION
In this paper we have proposed a multicore heterogeneous
architecture for big data processing. This system has the
capability to process key machine learning algorithms such as
deep neural networks, autoencoders, and k-means clustering.
The system has both training and recognition (evaluation of
new input) capabilities. We utilized memristor crossbars to
implement energy efficient neural cores. Our experimental
evaluations show that the proposed architecture can provide
four to six orders of magnitude more energy efficiency over
GPUs for big data processing.
REFERENCES
[1] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam, "DaDianNao: A Machine-Learning Supercomputer," in Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-47), IEEE Computer Society, Washington, DC, USA, pp. 609-622, 2014.
[2] D. Liu, T. Chen, S. Liu, J. Zhou, S. Zhou, O. Teman, X. Feng, X. Zhou, and Y. Chen, "PuDianNao: A Polyvalent Machine Learning Accelerator," in Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '15), ACM, New York, NY, USA, pp. 369-381, 2015.
[3] L. O. Chua, "Memristor—The Missing Circuit Element," IEEE Transactions on Circuit Theory, vol. 18, no. 5, pp. 507-519, 1971.
[4] D. Chabi, W. Zhao, D. Querlioz, and J.-O. Klein, "Robust Neural Logic Block (NLB) Based on Memristor Crossbar Array," IEEE/ACM International Symposium on Nanoscale Architectures, pp. 137-143, 2011.
[5] C. Zamarreño-Ramos, L. A. Camuñas-Mesa, J. A. Pérez-Carrasco, T. Masquelier, T. Serrano-Gotarredona, and B. Linares-Barranco, "On spike-timing-dependent-plasticity, memristive devices, and building a self-learning visual cortex," Frontiers in Neuroscience, Neuromorphic Engineering, vol. 5, article 26, pp. 1-22, Mar. 2011.
[6] D. B. Strukov, G. S. Snider, D. R. Stewart, and R. S. Williams, "The missing memristor found," Nature, vol. 453, pp. 80-83, 2008.
[7] T. M. Taha, R. Hasan, C. Yakopcic, and M. R. McLean, "Exploring the Design Space of Specialized Multicore Neural Processors," IEEE International Joint Conference on Neural Networks (IJCNN), 2013.
[8] M. Hu, H. Li, Y. Chen, Q. Wu, G. S. Rose, and R. W. Linderman, "Memristor Crossbar-Based Neuromorphic Computing System: A Case Study," IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 10, pp. 1864-1878, 2014.
[9] A. M. Sheri, A. Rafique, W. Pedrycz, and M. Jeon, "Contrastive divergence for memristor-based restricted Boltzmann machine," Engineering Applications of Artificial Intelligence, vol. 37, pp. 336-342, 2015.
[10] D. Soudry, D. Di Castro, A. Gal, A. Kolodny, and S. Kvatinsky, "Memristor-Based Multilayer Neural Networks With Online Gradient Descent Training," IEEE Transactions on Neural Networks and Learning Systems, accepted.
[11] X. Liu et al., "A Heterogeneous Computing System with Memristor-Based Neuromorphic Accelerators," IEEE High Performance Extreme Computing Conference (HPEC), 2014.
[12] F. Alibart, E. Zamanidoost, and D. B. Strukov, "Pattern classification by memristive crossbar circuits with ex-situ and in-situ training," Nature Communications, 2013.
[13] M. Prezioso, F. Merrikh-Bayat, B. D. Hoskins, G. C. Adam, K. K. Likharev, and D. B. Strukov, "Training and operation of an integrated neuromorphic network based on metal-oxide memristors," Nature, vol. 521, no. 7550, pp. 61-64, 2015.
[14] D. Soudry, D. Di Castro, A. Gal, A. Kolodny, and S. Kvatinsky, "Memristor-Based Multilayer Neural Networks With Online Gradient Descent Training," IEEE Transactions on Neural Networks and Learning Systems, issue 99, 2015.
[15] B. Li, Y. Wang, Y. Wang, Y. Chen, and H. Yang, "Training itself: Mixed-signal training acceleration for memristor-based neural network," 19th Asia and South Pacific Design Automation Conference (ASP-DAC), Jan. 2014.
[16] X. Liu, M. Mao, B. Liu, H. Li, Y. Chen, B. Li, Y. Wang, H. Jiang, M. Barnell, Q. Wu, and J. Yang, "RENO: A high-efficient reconfigurable neuromorphic computing accelerator design," 52nd ACM/EDAC/IEEE Design Automation Conference (DAC), pp. 1-6, June 2015.
[17] X. Dong, C. Xu, Y. Xie, and N. P. Jouppi, "NVSim: A Circuit-Level Performance, Energy, and Area Model for Emerging Nonvolatile Memory," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, no. 7, pp. 994-1007, July 2012.
[18] S. Yu, Y. Wu, and H.-S. P. Wong, "Investigating the switching dynamics and multilevel capability of bipolar metal oxide resistive switching memory," Applied Physics Letters, vol. 98, 103514, 2011.
[19] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P. A. Manzagol, "Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion," The Journal of Machine Learning Research, vol. 11, pp. 3371-3408, 2010.
[20] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, 2nd ed., Prentice Hall, 2002.
[21] http://yann.lecun.com/exdb/mnist/
[22] https://archive.ics.uci.edu/ml/datasets/ISOLET
[23] https://kdd.ics.uci.edu/
[24] N. Muralimanohar, R. Balasubramonian, and N. Jouppi, "Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0," in Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 40), Washington, DC, USA, pp. 3-14, 2007.
[25] A. B. Kahng, B. Li, L. S. Peh, and K. Samadi, "ORION 2.0: A fast and accurate NoC power and area model for early-stage design space exploration," Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 423-428, April 2009.
[26] J. Gorner, S. Hoppner, D. Walter, M. Haas, D. Plettemeier, and R. Schuffny, "An energy efficient multi-bit TSV transmitter using capacitive coupling," 21st IEEE International Conference on Electronics, Circuits and Systems (ICECS), pp. 650-653, Dec. 2014.
[27] C. Yakopcic, T. M. Taha, G. Subramanyam, and R. E. Pino, "Memristor SPICE Model and Crossbar Simulation Based on Devices with Nanosecond Switching Time," IEEE International Joint Conference on Neural Networks (IJCNN), August 2013.
[28] https://archive.ics.uci.edu/ml/datasets/Iris
[29] S. B. Furber, S. Temple, and A. D. Brown, "High-Performance Computing for Systems of Spiking Neurons," Proceedings of the AISB'06 workshop on GC5: Architecture of Brain and Mind, vol. 2, pp. 29-36, Bristol, April 2006.
[30] J. Schemmel, J. Fieres, and K. Meier, "Wafer-Scale Integration of Analog Neural Networks," IEEE International Joint Conference on Neural Networks (IJCNN), 2008.
[31] P. Merolla, J. Arthur, F. Akopyan, N. Imam, R. Manohar, and D. S. Modha, "A digital neurosynaptic core using embedded crossbar memory with 45pJ per spike in 45nm," IEEE Custom Integrated Circuits Conference (CICC), pp. 1-4, Sept. 2011.
[32] J. V. Arthur, P. A. Merolla, F. Akopyan, R. Alvarez, A. Cassidy, S. Chandra, S. K. Esser, N. Imam, W. Risk, D. B. D. Rubin, R. Manohar, and D. S. Modha, "Building block of a programmable neuromorphic substrate: A digital neurosynaptic core," International Joint Conference on Neural Networks (IJCNN), pp. 1-8, June 2012.
[33] P. A. Merolla, J. V. Arthur, R. Alvarez-Icaza, A. S. Cassidy, J. Sawada, F. Akopyan, B. L. Jackson, N. Imam, C. Guo, Y. Nakamura, B. Brezzo, I. Vo, S. K. Esser, R. Appuswamy, B. Taba, A. Amir, M. D. Flickner, W. P. Risk, R. Manohar, and D. S. Modha, "A million spiking-neuron integrated circuit with a scalable communication network and interface," Science, vol. 345, no. 6197, pp. 668-673, 2014.
[34] R. St. Amant, A. Yazdanbakhsh, J. Park, B. Thwaites, H. Esmaeilzadeh, A. Hassibi, L. Ceze, and D. Burger, "General-purpose code acceleration with limited-precision analog computation," in Proceedings of the 41st Annual International Symposium on Computer Architecture (ISCA '14), IEEE Press, Piscataway, NJ, USA, pp. 505-516, 2014.
[35] B. Belhadj, A. Joubert, Z. Li, R. Héliot, and O. Temam, "Continuous real-world inputs can open up alternative accelerator designs," SIGARCH Computer Architecture News, vol. 41, no. 3, June 2013.
[36] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, "McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures," 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-42), pp. 469-480, Dec. 2009.