480 • 2012 IEEE International Solid-State Circuits Conference
ISSCC 2012 / SESSION 28 / ADAPTIVE & LOW-POWER CIRCUITS / 28.2
28.2 A 1.0TOPS/W 36-Core Neocortical Computing Processor with 2.3Tb/s Kautz NoC for Universal Visual Recognition
Chuan-Yung Tsai, Yu-Ju Lee, Chun-Ting Chen, Liang-Gee Chen
National Taiwan University, Taipei, Taiwan
Unlike human brains, where various kinds of visual recognition tasks are carried out with homogeneous neocortical circuits and a unified working mechanism, existing visual recognition processors [1-4] rely on multiple algorithms and heterogeneous multicore architectures to accomplish their narrowly predefined recognition tasks. Recently, however, new brain-mimicking recognition algorithms have been proposed [5,6]; we label such algorithms as belonging to the Neocortical Computing (NC) model. Based on neuroscience findings, NC algorithms model both the human brain's static and dynamic visual recognition streams (as shown in Fig. 28.2.1) through a series of unified matching/pooling operations. These algorithms exhibit high visual recognition capability and perform accurately on a wide range of image/video recognition tasks. However, using the NC model for real-time recognition involves several hundred GOPS of both dense and sparse matrix calculations and requires over 1.5Tb/s of inter-stage data bandwidth, requirements that cannot be met efficiently by existing parallel architectures.
In this paper, an NC processor for power-efficient real-time universal visual recognition is proposed, with the following features: 1) a grey matter-like homogeneous many-core architecture with event-driven hybrid MIMD execution provides efficient 1.0TOPS/W acceleration for NC operations; 2) a white matter-like Kautz NoC provides 2.3Tb/s throughput, fault/congestion avoidance, and redundancy-free multicast at 151Tb/s/W NoC power efficiency, which is 2.7-3.9× higher than previous NoC-based visual recognition processors [1,2].
Figure 28.2.2 shows the system architecture of the Kautz NoC-based 36-core NC processor, where the Kautz graph is implemented as an NoC. A Kautz graph is defined by its degree d and diameter k (both 3 in this work) and has N = (d+1)d^(k-1) nodes (N = 36 cores in this work). The Kautz NoC provides white matter-like routing capability: the maximum hop count between any 2 cores is less than log_d(N). Fault/congestion tolerance is also possible, as there are d disjoint routing paths between any 2 cores. The grey matter-like homogeneous cores support unified NC operations and provide programmability for various recognition workloads. As the examples in Fig. 28.2.2 show, this combination provides better scalability, in terms of both application and architecture, than heterogeneous designs [1-4]. In this processor, each core is addressed by a string of 3 base-4 numbers (i.e. 0/1/2/3; core addresses are 6b in total) in which adjacent numbers must differ (e.g. 122 is not a valid address). Each core unidirectionally links to the 3 cores with left-shifted addresses. For example, core 121 links to cores 210, 212 and 213, and receives links from cores 012, 212 and 312. Through this string-shifting procedure, a core can reach any other core within 3 hops. All cores share a 32b address space, where each core is addressed by the most significant 6b and has a 26b private address space. A memory map, which defines the NC model, is loaded through two 64b system buses as memory pages.
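The address arithmetic above can be sketched in a few lines. This is our own illustration of the K(d=3, k=3) topology (node set, left-shift links, 3-hop diameter), not the chip's implementation:

```python
from collections import deque
from itertools import product

# Nodes of the Kautz graph K(3, 3): length-3 base-4 strings in which no
# two adjacent digits are equal, giving N = (d+1) * d**(k-1) = 36 cores.
d, k = 3, 3
digits = "0123"
nodes = ["".join(p) for p in product(digits, repeat=k)
         if all(a != b for a, b in zip(p, p[1:]))]

def neighbors(addr):
    # Left-shift the address and append any digit differing from the new
    # last digit; these are the 3 unidirectional output links of a core.
    return [addr[1:] + c for c in digits if c != addr[-1]]

def hops(src, dst):
    # Breadth-first search over the directed Kautz links
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, h = queue.popleft()
        if node == dst:
            return h
        for nxt in neighbors(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, h + 1))

print(len(nodes))                         # 36
print(sorted(neighbors("121")))           # ['210', '212', '213']
print(max(hops(a, b) for a in nodes for b in nodes))   # 3
```

The BFS confirms the property quoted in the text: any core reaches any other core within k = 3 hops.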
Figure 28.2.3 shows a block diagram of the NC core and its processing element (PE). All computation is event-driven: a core is turned on only when a packet is received from another core through the NoC, or from the system bus. Inactive cores and their components are clock-gated. Arriving packets are decoded into instructions and queued in the instruction FIFO. The instruction dispatcher can issue up to 2 instructions to the PEs and paging memory units that cover the instructions' target addresses. Page misses are handled by the memory management unit (MMU) and DMA. In the proposed hybrid MIMD mode, each of the 2 issued instructions can be either SIMD or SISD. Thus, the core can flexibly switch between maximal acceleration for dense NC operations (by SIMD) and minimal power consumption for sparse NC operations (by SISD). This scheme yields a 10.3× recognition speed increase and a 4.43× power reduction. Inside each PE, the 16b arithmetic unit can execute up to 4 operations per cycle, i.e. 1GOPS at 250MHz. Successful instruction execution causes the resultant datum to be sent to the packet encoder and fired to its subsequent target addresses through the NoC.
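A toy cycle-count model conveys why hybrid MIMD helps. This is our illustrative sketch, assuming the 10 PEs shown in Fig. 28.2.3 and the 4 operations per PE per cycle stated above; it is not a measured model of the chip:

```python
# Dense NC operations run SIMD across every PE; sparse ones run SISD on a
# single PE so the idle PEs can remain clock-gated.
def cycles(n_ops, mode, n_pe=10, ops_per_pe=4):
    # Each 16b arithmetic unit executes up to 4 operations per cycle
    # (1GOPS at 250MHz); SIMD engages all PEs, SISD only one.
    lanes = n_pe if mode == "SIMD" else 1
    per_cycle = lanes * ops_per_pe
    return -(-n_ops // per_cycle)        # ceiling division

dense = 9 * 9 * 1024                     # e.g. a 9x9 convolution over 1K pixels
print(cycles(dense, "SIMD"))             # 2074
print(cycles(dense, "SISD"))             # 20736
```

The same dense kernel takes roughly 10× fewer cycles in SIMD mode, while a sparse kernel with only a handful of operations wastes no PE activity in SISD mode.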
Figure 28.2.4 shows a block diagram of the Kautz NoC router and defines its routing rules. The router performs distributed low-radix routing, and all input ports share one routing FIFO, which minimizes area and improves power efficiency. All packets can be routed/multicast with the distributed routing-string-based procedure illustrated in Fig. 28.2.4. Information regarding faulty/congested cores or NoC links is used to correct routing strings and redirect packets for fault/congestion avoidance with minimal hop-count overhead. Based on the properties of a Kautz graph, the NC processor can sustain at least 2 faulty cores/links. The proposed redundancy-free multicast scheme exploits the way cores are named, using disallowed name strings as group multicast addresses without header redundancy (e.g. 122, a disallowed name, can represent the cores 120, 121 and 123 as a subgroup, and 11X represents the entire core group 1). With this scheme, a distributed minimum-spanning-tree-based multicast is realized, providing a 1.75× recognition speed increase. Compared to a non-fault-tolerant star-/tree-based NoC [1-3], the Kautz NoC is more efficient, since centralized high-radix routers are area/power hungry. Compared to a fault-tolerant mesh-based NoC, the Kautz NoC has a smaller diameter and lower routing delay (3 hops vs. 10 hops, and 16% less average packet delay than a 6×6 mesh when running NC applications).
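The routing-string generation and the disallowed-name multicast trick can be sketched as follows. This is our reading of the overlap rules in Fig. 28.2.4, not the chip's router logic:

```python
# Routing string: merge source and destination addresses on their digit
# overlap; every digit beyond the first three corresponds to one hop.
def routing_string(src, dst):
    if src[1:] == dst[:2]:
        return src + dst[2:]      # 1 hop, e.g. 121 -> 213
    if src[2] == dst[0]:
        return src + dst[1:]      # 2 hops
    return src + dst              # 3 hops

# Multicast: a disallowed name (two equal trailing digits) addresses the
# subgroup of cores sharing its first two digits, with no header redundancy.
def expand_multicast(addr):
    if addr[1] == addr[2]:
        return [addr[:2] + c for c in "0123" if c != addr[1]]
    return [addr]

print(routing_string("121", "213"))   # 1213: one hop
print(routing_string("213", "132"))   # 2132: one hop, as in Fig. 28.2.4
print(expand_multicast("122"))        # ['120', '121', '123']
```

Each hop consumes one leading digit of the routing string, so intermediate routers need no global state; fault avoidance then amounts to inserting one or two detour digits when the string contains a faulty/congested substring.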
Figure 28.2.5 shows this chip's single-step communication/execution scheme, called push-based processing. Unlike conventional parallel processors, where data are first placed on, then pulled from, caches multiple times by multiple cores/threads, push-based processing passes data and instructions as multicast packets through the energy-efficient Kautz NoC, mimicking a neuron's firing process, as shown in Fig. 28.2.1. According to each datum's associated NC operation, a SIMD or SISD instruction is executed. Zero-valued data and data associated with zero-valued coefficients do not trigger any computation, providing further power reductions. Push-based processing improves recognition speed by 2.09× and reduces power by 1.38×. The benefits of the incorporated features are summarized in Fig. 28.2.5. Overall, 37.8× acceleration and a 6.25× total power reduction are achieved.
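The zero-skipping behavior can be sketched in plain Python. This is our simplification of Fig. 28.2.5 (the names `push`, `coeff_per_target` and the toy 2×2 arrays are ours, not the chip's packet format):

```python
# Push-based processing: the source fires only nonzero data to its target
# arrays, and zero-valued coefficients trigger no computation at all.
def push(source, coeff_per_target, targets):
    # targets maps a core name to a 2D accumulator shaped like source
    for i, row in enumerate(source):
        for j, x in enumerate(row):
            if x == 0:
                continue              # zero datum: no packet is fired
            for name, acc in targets.items():
                w = coeff_per_target[name]
                if w != 0:            # zero coefficient: skip the update
                    acc[i][j] += w * x

src = [[0.0, 2.0], [1.0, 0.0]]
tgts = {"A": [[0.0, 0.0], [0.0, 0.0]], "B": [[0.0, 0.0], [0.0, 0.0]]}
push(src, {"A": 0.5, "B": 0.0}, tgts)
print(tgts["A"])    # [[0.0, 1.0], [0.5, 0.0]]; only nonzero data pushed
```

Contrast this with a pull-based scheme, where every target would read the full source array, zeros included, from a shared cache before discovering there is nothing to compute.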
Figure 28.2.6 summarizes the chip's features. The NC processor is implemented on a 4.5×4.5mm2 die in a TSMC 65nm CMOS process. A wide range of visual recognition applications, including image (object/face/scene) and video (action/sport) recognition, is supported, and the chip works in real time with high accuracy. Compared with other programmable visual recognition processors [1-4], high power efficiency is achieved. The die photo is shown in Fig. 28.2.7.
Acknowledgments:
We thank the TSMC University Shuttle Program for chip fabrication and the National Chip Implementation Center for chip testing. This work is funded by the National Science Council. We also thank Tung-Chien Chen, Yu-Han Chen, Chih-Chi Cheng, Shao-Yi Chien, Tzu-Der Chuang, David Cox and Pai-Heng Hsiao for their valuable suggestions.
References:
[1] K. Kim, et al., "A 125GOPS 583mW Network-on-Chip Based Parallel Processor with Bio-Inspired Visual Attention Engine," ISSCC Dig. Tech. Papers, pp. 308-309, 2008.
[2] J.-Y. Kim, et al., "A 201.4GOPS 496mW Real-Time Multi-Object Recognition Processor with Bio-Inspired Neural Perception Engine," ISSCC Dig. Tech. Papers, pp. 150-151, 2009.
[3] S. Lee, et al., "A 345mW Heterogeneous Many-Core Processor with an Intelligent Inference Engine for Robust Object Recognition," ISSCC Dig. Tech. Papers, pp. 332-333, 2010.
[4] T. Kurafuji, et al., "A Scalable Massively Parallel Processor for Real-Time Image Processing," ISSCC Dig. Tech. Papers, pp. 334-335, 2010.
[5] H. Jhuang, et al., "A biologically inspired system for action recognition," IEEE International Conf. on Computer Vision, pp. 1-8, 2007.
[6] J. Mutch, et al., "Object class recognition and localization using sparse features with limited receptive fields," International Journal of Computer Vision, vol. 80, no. 1, pp. 45-57, 2008.
978-1-4673-0377-4/12/$31.00 ©2012 IEEE
ISSCC 2012 / February 22, 2012 / 2:00 PM
Figure 28.2.1: NC model and supported applications of NC processor.
Figure 28.2.2: NC processor architecture and scalability.
Figure 28.2.3: Event-driven NC core/PE architectures and execution modes.
Figure 28.2.4: Kautz NoC router architecture and routing/multicast flows.
Figure 28.2.5: Push-based processing mechanism and speed/power summary of proposed features.
Figure 28.2.6: Chip features and comparison.
[Figure 28.2.1 content: the NC model pipeline, with alternating matching operations (multi-scale 2D/3D convolution; dense/sparse 2D linear, Laplacian or Gaussian kernels; dense/sparse 1D linear kernel for SVM, or L1/L2 distance for KNN) and pooling operations (2D/pyramidal average or maximum; dense/sparse 2D average or maximum; dense/sparse 1D maximum for video or KNN), taking image/video input through feature pyramids and feature vectors to a recognition answer. The ventral stream handles static visual recognition and the dorsal stream dynamic visual recognition, mapped onto feature extraction stages and a feature classification stage of unified neocortical computing operations, with white matter providing connection and grey matter providing computing. Supported applications span image recognition (indoor/outdoor object, face, scene identification, etc.) and video recognition (human action surveillance, sport video classification, etc.), a wider range than [1-3] and the partial coverage of [4].]
[Figure 28.2.2 content: the Kautz NoC connecting four core groups (CG 0-3) of nine NC cores each, with one NoC router per core, two 64b system bus interfaces, and off-chip DDR SDRAM, a RISC host and a still/video camera. Scalability examples for two applications contrast heterogeneous processors [1-4], where feature extraction or feature classification cores become overloaded and links congested, with the homogeneous NC processor, whose cores and NoC stay balanced under unified workloads.]
[Figure 28.2.3 content: the NC core block diagram, comprising the Kautz NoC router, packet decode/encode, a 4-deep instruction FIFO, the instruction dispatcher, DMA/MMU/TLB, nine paging memory units (each a 2KB single-port SRAM, 16b × 1K), a local memory unit with an 81B register file (72b × 9), and PEs 0-9 on an internal bus. Each PE contains a 16b arithmetic unit (two adder-subtractors, a multiplier, a comparator), a data parser/packer, registers and PE control, with links to neighboring PEs, the packet encoder and memory. Execution examples: SISD mode for dense 1D and sparse 2D/3D instructions (e.g. maximum), SIMD mode for dense 2D/3D instructions (e.g. 9×9 convolution), and hybrid MIMD mode issuing 2 SISD/SIMD instructions (e.g. concurrent 5×5 and 3×3 convolutions).]
[Figure 28.2.4 content: the Kautz NoC router, with per-port input buffers and control feeding a port-shared routing FIFO (4-in-3-out, 84b wide, 8 deep), a multicast unit (MU) and three routing units (RU 1-3) under routing control that holds fault/congestion information strings (FCIS1/FCIS2). RU flow: 1) generate the routing string (RS) by overlapping source S1S2S3 and destination D1D2D3 (if S2==D1 and S3==D2, 1 hop; else if S3==D1, 2 hops; else 3 hops); 2) check the RS against FCIS substrings, trying to insert 1 number, then 2 numbers, to detour; 3) decide the routing port. Example: route 213-132-323 if fault-free, or 210-103-032-323 if core 132 is faulty. MU flow: generate the RS, then detect multicast, clone packets and modify target addresses; e.g. a multicast to 211 detected at core 121 clones the packet to 210, 212 and 213 (the subgroup defined by the disallowed name 211) via all output ports.]
[Figure 28.2.5 content: pull-based processing (1) places data, including zeros, on a cache and (2) has targets pull them repeatedly, whereas push-based processing (1) pushes data directly to target arrays on other cores via multicast SIMD instructions, excluding zeros and not-to-be-used values. Measured summary on a 128×128 1000-class indoor object image recognition benchmark:

Configuration          | Recognition speed | Average power
All features on        | 63fps             | 205mW
Multicast off          | 36fps             | 209mW
Hybrid MIMD off        | 3.5fps            | 927mW
Push-based proc. off   | 1.7fps            | 1284mW

The per-feature factors are 1.75×/1.02× (multicast), 10.3×/4.43× (hybrid MIMD) and 2.09×/1.38× (push-based processing) for speed/power, giving 37.8× acceleration and a 6.25× power reduction overall.]
[Figure 28.2.6 content, chip summary:

Technology: TSMC 65nm 1P9M CMOS
Die size: 20.25mm2 (4.5mm × 4.5mm)
Power supply: core 1.0V, I/O 2.5V
Operating frequency: 250MHz
Gate count / SRAM: 2.2M / 648KB
Total power consumption: 205mW / 351mW (average / peak)
Peak performance: 360GOPS
Power efficiency: 1026GOPS/W
NoC throughput: 2.3Tb/s (aggregated bandwidth)
NoC power consumption: 15mW
NoC power efficiency: 151Tb/s/W
Fault/congestion tolerance: 2 cores / 2 NoC links / 1 core + 1 NoC link
Recognition speed: 33-68fps (128×128 image, enhanced NC model); 63-130fps (128×128 image); 15-32fps (256×256 image); 14-26fps (128×128 video)
Online instance learning: yes (KNN mode)

Comparison charts: power efficiency* (GOPS/W) of 616 (ISSCC08 [1]), 835 (ISSCC09 [2]), 933 (ISSCC10 [3]**), 310 (ISSCC10 [4]) and 1026 (this work); NoC power efficiency* (Tb/s/W) of 38.5 (ISSCC08 [1]**), 56.8 (ISSCC09 [2]**) and 151 (this work).
*Measured under the applications shown in Fig. 28.2.1 (11 image/video databases); technology scaling of power from 130nm to 65nm: P65 = P130 × (C65/C130) × (V65/V130)^2 = P130/2.88.
**Number from the corresponding JSSC paper.]
ISSCC 2012 PAPER CONTINUATIONS
Figure 28.2.7: Chip micrograph.
[Figure 28.2.7 content: die micrograph showing core groups 0-3 (9 NC cores each), the paging memory array of each core group, and the Kautz NoC at the center of the die.]