ParCo2017 – International Conference on Parallel Computing
Bologna, Italy, 12-15 September 2017
The brain on low power scalable architectures: efficient simulation of cortical slow waves and asynchronous states
Andrea Biagioni
INFN – Sezione di Roma
for the APE Lab ExaNeSt and WaveScalES team
Human Brain Project
• WaveScalES: 2016 – 2023
• Measures of brain Slow Waves during deep sleep and anesthesia, and during the transition to awareness
• Large-scale spiking simulations (hundreds of billions of synapses) distributed over (tens of) thousands of processes.
Distributed and Plastic Spiking Neural Networks (DPSNN)
• Neural networks heavily interconnected at multiple distances; local activity rapidly produces effects at all distances → Prototype of a non-trivial parallelization problem
• Each neural spike originates a cascade of synaptic events at multiple times: t + Δt_s → Complex data structures and synchronization. Mixed time-driven (delivery of spiking messages) and event-driven (neural dynamics and synaptic activity)
• Multiple time-scales (neural, synaptic, long- and short-term plasticity models) → Non-trivial synchronization at all scales
• Gigantic synaptic database. A key issue for large-scale simulations → Clever parallel resource management required.
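The cascade described above, where one spike produces synaptic events at several later times t + Δt_s, can be illustrated with a small delay queue. This is only a sketch, not the DPSNN implementation: the `Synapse` record, the `deliver_spikes` function and the integer-step delays are hypothetical names and simplifications chosen for illustration.

```python
import heapq
from collections import namedtuple

# Hypothetical synapse record: target neuron, delay in time steps, weight.
Synapse = namedtuple("Synapse", ["target", "delay", "weight"])


def deliver_spikes(spikes, synapses_of):
    """Turn spikes into time-ordered synaptic events.

    spikes: list of (emission_step, source_neuron)
    synapses_of: dict mapping source neuron -> list of Synapse
    Returns a list of (arrival_step, target, weight) sorted by arrival step.
    """
    queue = []
    for t, source in spikes:
        for syn in synapses_of.get(source, []):
            # Each spike fans out into one event per synapse, at t + delay.
            heapq.heappush(queue, (t + syn.delay, syn.target, syn.weight))
    events = []
    while queue:
        events.append(heapq.heappop(queue))
    return events


synapses_of = {
    0: [Synapse(target=1, delay=3, weight=0.5),
        Synapse(target=2, delay=1, weight=0.2)],
    1: [Synapse(target=2, delay=2, weight=-0.4)],
}
events = deliver_spikes([(0, 0), (1, 1)], synapses_of)
print(events)
```

The spikes arrive time-driven (one batch per step), while the queue is consumed event by event; this is the mixed scheme the slide refers to, reduced to a single process.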
13/09/2017 Andrea Biagioni – ParCo2017 – International Conference on Parallel Computing – The brain on low power scalable architectures: efficient simulation of cortical slow waves and asynchronous states
P. S. Paolucci et al., Distributed simulation of polychronous and plastic spiking neural networks: strong and weak scaling of a representative mini-application benchmark executed on a small-scale commodity cluster, arXiv:1310.8478, Oct. 2013.
Neuron Model
• The unit of the system:
  • Simplifications are needed
  • Balance between computing cost (flops) and biological plausibility
  • Point-like Leaky Integrate and Fire with Spike Frequency Adaptation
Gigante et al. 2007. Diverse population-bursting modes of adapting spiking neurons. Phys. Rev. Lett. 98:148101. DOI: 10.1103/PhysRevLett.98.148101.
E. M. Izhikevich. 2004. Which model to use for cortical spiking neurons? Trans. Neur. Netw. 15, 5 (September 2004), 1063-1070. DOI: http://dx.doi.org/10.1109/TNN.2004.832719
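To make the neuron model concrete, here is a minimal sketch of a point-like leaky integrate-and-fire neuron with spike-frequency adaptation. It uses a generic textbook form (a leak term plus an adaptation variable that jumps at each spike and then decays), which is an assumption and not necessarily the equations used in DPSNN or in the cited papers. The function name and all parameter values are arbitrary and dimensionless.

```python
def simulate_lif_sfa(i_ext, steps, dt=0.1, tau_m=20.0, tau_a=200.0,
                     v_thresh=1.0, v_reset=0.0, a_jump=0.1):
    """Euler integration of a leaky integrate-and-fire neuron with adaptation.

    dv/dt = -v / tau_m + i_ext - a
    da/dt = -a / tau_a, and a += a_jump at each spike.
    Returns the list of spike times (same unit as dt).
    """
    v, a = 0.0, 0.0
    spike_times = []
    for step in range(steps):
        v += dt * (-v / tau_m + i_ext - a)
        a += dt * (-a / tau_a)
        if v >= v_thresh:
            spike_times.append(step * dt)
            v = v_reset      # membrane potential is reset after the spike
            a += a_jump      # adaptation grows, slowing the next spike
    return spike_times


spikes = simulate_lif_sfa(i_ext=0.2, steps=20000)
intervals = [t2 - t1 for t1, t2 in zip(spikes, spikes[1:])]
print(len(spikes), intervals[0], intervals[-1])
```

With a constant input the inter-spike intervals lengthen over time, which is the adaptation effect. The per-neuron cost is a handful of floating-point operations per step, in line with the trade-off between flops and biological plausibility.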
Neural Columns
• Grey matter
• Different families of neurons are in the column (excitatory, inhibitory)
• Configurable number of families
• Configurable number of neurons
• Parametric
V. Braitenberg. 2007. Grey substance and white substance. Scholarpedia, 2(11):2918.
White Matter: Long Range Inter-areal Communication
Grey Matter: Neurons + Intra-areal connections, Short range communication
• 2 Excitatory, 1 Inhibitory
• Family Ratio 1:3:1
• Neurons ~1250
• Cortical Area: a segment of the cerebral cortex that carries out a given function
• Cortical Column: a group of neurons in the cortex that can be successively penetrated by a probe inserted perpendicularly to the cortical surface.
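As a small illustration of the parametric column, the sketch below splits the ~1250 neurons of the example column among three families according to the 1:3:1 ratio. The helper `split_by_ratio` is hypothetical, and the slide does not say which ratio entry belongs to which family, so the result is just three sizes in the order of the ratio.

```python
def split_by_ratio(total, ratio):
    """Split `total` items into groups proportional to `ratio`.

    Uses largest-remainder rounding so the group sizes sum to `total`.
    """
    weight = sum(ratio)
    exact = [total * r / weight for r in ratio]
    sizes = [int(x) for x in exact]
    # Hand out the leftover items to the groups with the largest remainders.
    leftovers = total - sum(sizes)
    order = sorted(range(len(ratio)),
                   key=lambda i: exact[i] - sizes[i], reverse=True)
    for i in order[:leftovers]:
        sizes[i] += 1
    return sizes


family_sizes = split_by_ratio(1250, [1, 3, 1])
print(family_sizes)
```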
Testbed
• QUonG (32 nodes; 256 cores):
  • Intel Ivy Bridge CPU E5-2630 v2 @ 2.60GHz (dual processor; hexa-core)
  • 128 GB per node (10 GB per core)
  • NIC: IB gen2
  • Limited to 96 cores
  • INFN Roma
• Galileo (516 nodes; 8256 cores):
  • Intel Haswell 2.40 GHz per node (dual processor; octo-core)
  • 128 GB per node (8 GB per core)
  • NIC: IB gen3 (4x QDR switch)
  • 281 on TOP500 (July 2017)
  • Limited to 1024 cores
  • CINECA (Bologna)
R. Ammendola et al., QUonG: A GPU-based HPC system dedicated to LQCD computing, in Application Accelerators in High-Performance Computing (SAAHPC), 2011 Symposium on, pp. 113–122, July 2011
DPSNN: Strong and Weak Scaling measures
Strong scaling: from 1 to 1024 cores @ 2.4 GHz, simulating various total network sizes. Execution time normalized to synapse count.
Weak scaling for various local network sizes.
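For reference, the usual definitions behind strong and weak scaling plots can be written down as follows. The function names are hypothetical and the timings in the example are made up purely to exercise the code; they are not measurements from this work.

```python
def strong_scaling(times_by_procs):
    """Speed-up and parallel efficiency for a fixed total problem size.

    times_by_procs: dict {process_count: wall_clock_seconds}
    The run with the fewest processes is taken as the reference.
    """
    p0 = min(times_by_procs)
    t0 = times_by_procs[p0]
    result = {}
    for p, t in sorted(times_by_procs.items()):
        speedup = t0 / t
        efficiency = speedup / (p / p0)
        result[p] = (speedup, efficiency)
    return result


def weak_scaling_efficiency(times_by_procs):
    """Efficiency when the problem size per process is kept fixed.

    Ideal weak scaling keeps the wall-clock time constant, so the
    efficiency is reference_time / time.
    """
    p0 = min(times_by_procs)
    t0 = times_by_procs[p0]
    return {p: t0 / t for p, t in sorted(times_by_procs.items())}


def time_per_synapse(wall_clock_seconds, synapse_count):
    """Execution time normalized to the synapse count."""
    return wall_clock_seconds / synapse_count


# Made-up timings, only to exercise the functions.
example = {1: 100.0, 2: 52.0, 4: 28.0}
print(strong_scaling(example))
print(weak_scaling_efficiency({1: 10.0, 4: 12.5}))
```

Normalizing to the synapse count, as in the strong scaling plot, lets runs with different total network sizes share one axis.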
Distribution of Cortical Modules among Software Processes
A sample grid of 64=8x8 neural columns.
Excitatory neurons project 76% of their synapses toward neurons located in the same column, 3% to first-neighbour columns, 2% to second neighbours and 1% to third neighbours.
a) Grid of 64 processes: 1 column per process
b) Grid of 4 processes: 16 columns per process
c) Grid of 256 processes: ¼ of a column per process
One computational core hosts one software process.
Strong scaling measures: a, b, c) Examples of distribution of a grid composed of 64 neural columns over a varying number of software processes (computational cores)
Node connectivity matrix is not equal to the Columns connectivity matrix.
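The sketch below illustrates why the process-level connectivity differs from the column-level one. It assumes a square block decomposition, that "n-th neighbour" means Chebyshev distance n on the grid, and no wrap-around at the border; none of these details are stated on the slide, and the function names are hypothetical. Only cases with at least one whole column per process are handled.

```python
from fractions import Fraction


def columns_per_process(grid_side, processes):
    """Number of neural columns hosted by each process (may be fractional)."""
    return Fraction(grid_side * grid_side, processes)


def owner(x, y, grid_side, proc_side):
    """Process that hosts column (x, y) under a square block decomposition."""
    block = grid_side // proc_side  # columns per process along each axis
    return (y // block) * proc_side + (x // block)


def process_neighbours(grid_side, proc_side, reach=3):
    """Set of remote processes each process sends spikes to.

    Assumes a column projects to every column within `reach` steps
    (Chebyshev distance) and no wrap-around at the grid border.
    """
    talks_to = {p: set() for p in range(proc_side * proc_side)}
    for y in range(grid_side):
        for x in range(grid_side):
            src = owner(x, y, grid_side, proc_side)
            for ty in range(max(0, y - reach), min(grid_side, y + reach + 1)):
                for tx in range(max(0, x - reach), min(grid_side, x + reach + 1)):
                    dst = owner(tx, ty, grid_side, proc_side)
                    if dst != src:
                        talks_to[src].add(dst)
    return talks_to


for procs in (64, 4, 256):
    print(procs, "processes:", columns_per_process(8, procs), "columns each")

print(len(process_neighbours(8, 8)[27]))  # interior process, case a)
print(len(process_neighbours(8, 2)[0]))   # case b)
```

Under these assumptions, packing more columns into a process keeps much of the traffic local, while spreading columns thinly turns column-to-column links into inter-process messages.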
Overview of DPSNN tasks
GRID                 12x12    24x24    48x48
NEURONS              0.18M    0.71M    2.86M
SYNAPSES             0.20G    0.80G    3.20G
COLUMNS              144      576      2304
PROCESSES            144      192      192
COLUMN/PROCESS       1        3        12
SIMULATED SECONDS    30       12       18
WALL CLOCK SEC.      1484     2148     15182
COMMUNICATION        35.2%    10.7%    0.9%
SYNCHRONIZATION      22.9%    36.3%    36.2%
COMPUTATION          21.3%    34.2%    45.1%
LIST MANAGEMENT      17.1%    16.7%    16.9%
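Two derived quantities help when reading the table: the wall-clock time needed per simulated second, and the number of neurons per column. The sketch below computes both from the tabulated values. The helper name `slowdown` is hypothetical, and the neuron counts are rounded in the table, so the per-column figures are approximate.

```python
# Values copied from the table above:
# grid -> (simulated s, wall-clock s, synapses, neurons, columns)
runs = {
    "12x12": (30, 1484, 0.20e9, 0.18e6, 144),
    "24x24": (12, 2148, 0.80e9, 0.71e6, 576),
    "48x48": (18, 15182, 3.20e9, 2.86e6, 2304),
}


def slowdown(simulated_s, wall_clock_s):
    """Wall-clock seconds needed per simulated second."""
    return wall_clock_s / simulated_s


for grid, (sim_s, wall_s, synapses, neurons, columns) in runs.items():
    print(grid,
          "slowdown:", round(slowdown(sim_s, wall_s), 1),
          "neurons/column:", round(neurons / columns))
```

The neurons-per-column values come out close to the ~1250 quoted for the example column.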
4x4 GRID: Real-time domain
[Figure: Strong Scaling; Grid 4x4. Speed Up (1 to 64) vs. Processes/Cores (1 to 64). Series: SPEED UP, IDEAL]
• Simulated time: 10 s (QUonG)
• Wall clock time
  • 32 processes: 12.03 sec
  • 64 processes: 16.46 sec
• Similar to expected behavior (λ = 350 µm)
• Communication doesn't scale!
• Traditional distributed computing system:
  • Throughput: OK
  • Latency: NO
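The figures above can be turned into a real-time factor (wall-clock time divided by simulated time). This is plain arithmetic on the numbers quoted on the slide; the variable names are only illustrative.

```python
SIMULATED_SECONDS = 10.0

# Wall-clock times quoted above for the 4x4 grid on QUonG.
wall_clock = {32: 12.03, 64: 16.46}

for procs, seconds in wall_clock.items():
    factor = seconds / SIMULATED_SECONDS
    print(f"{procs} processes: {factor:.2f}x real time")

# Doubling the processes from 32 to 64 changes the wall-clock time by this factor.
change = wall_clock[64] / wall_clock[32]
print(f"32 -> 64 processes: wall-clock time x{change:.2f}")
```

A factor above 1 when going from 32 to 64 processes means the run gets slower, which is consistent with the remark that communication does not scale in this regime.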
[Figure: Strong Scaling; Grid 4x4. Seconds (0.13 to 256) vs. Processes/Cores (1 to 64). Series: COMMUNICATION, COMPUTATION, BARRIER, TOTAL]
D. S. Modha et al. The Cat is Out of the Bag: Cortical Simulations with 10^9 Neurons, 10^13 Synapses, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, Portland, Oregon, pages 1-12, 2009, ACM
ExaNeSt Objectives
• H2020 - FETHPC-2014 (December 2015 – November 2018)
• System architecture for data-centric Exascale-class HPC
  • Low-latency unified interconnect (compute & storage traffic)
    • RDMA + PGAS to reduce communication overhead
  • Fast, distributed in-node non-volatile memory
• Extreme compute-power density
  • Advanced totally-liquid cooling technology
  • Scalable packaging for ARM-based (v8, 64-bit) microservers
    • Low Energy Compute (256 ARM cores + 1 TB DDR4 memory in a 1U blade)
    • Heterogeneous: FPGA accelerator (~4 TFlops per node)
• Real scientific and data-center applications
  • Applications used to identify system requirements
  • Tuned versions will evaluate our solutions
INFN activities are strongly synergistic with project objectives:
• APE supercomputer: VLSI, system design, high density packing
• APEnet: FPGA-based NIC for clusters (low-latency, high-throughput)
R. Ammendola et al., APEnet+: a 3D Torus network optimized for GPU-based HPC systems, Journal of Physics: Conference Series, vol. 396, no. 4, p. 042059, 2012
M. Katevenis et al., The next Generation of Exascale-class Systems: the ExaNeSt Project, in 2017 Euromicro Conference on Digital System Design (DSD), Aug 2017
DPSNN on low-power computing architectures
• Evaluate the performance of low-power processors in scalable simulations of spiking neural network models.
• Compare performance against traditional server-platform processors.
• Try to identify the critical architectural features enabling better time-to-solution and energy-to-solution figures on this application.
• Intel Xeon vs. ARM Cortex cores (two generations to evaluate the trend):
  1. Westmere Xeon E5620 @2.4 GHz vs. ARMv7-A Cortex-A15 @2.3 GHz
  2. Haswell E5-2620 v3 @2.4 GHz vs. ARMv8-A Cortex-A57 @1.9 GHz
1° Gen
Server platform:
• Dimensions: 1U standard rackmountable
• Motherboard: X8DTG-DF
• CPU: Dual Intel Westmere quad-core Xeon E5620
• DRAM: 48 GB DDR3 1333 MHz
• NIC: Mellanox ConnectX VPI IB QDR
• OS: CentOS release 6.7, kernel 2.6.32-573.7.1.el6.x86_64
Low-power platform:
• Tegra K1 SOC
• CPU: NVIDIA "4-Plus-1" 2.32GHz ARM quad-core Cortex-A15 CPU with Cortex-A15 battery-saving shadow-core
• GPU: NVIDIA Kepler "GK20a" GPU with 192 SM3.2 CUDA cores (up to 326 GFLOPS in SP)
• DRAM: 2GB DDR3L 933MHz EMC x16 using 64-bit data width
• Storage: 16GB fast eMMC 4.51
• Ethernet: RTL8111GS Realtek 10/100/1000Base-T Gigabit LAN
Comparison of 1° Gen server and low-power architectures
• Same # of cores, ~same clock frequency.
• Intel Xeon E5620 supports Hyper-Threading (ARM Cortex-A15 does not).
• SIMD Floating Point Theoretical Peak Performance (2x in DP for Intel)
  • ARM Cortex-A15 (NEON):
    • 2 DP FLOPs/cycle: scalar FMA or scalar multiply-add
    • 8 SP FLOPs/cycle: 4-wide NEONv2 FMA or 4-wide NEON multiply-add
  • Intel Westmere (SSE4.2):
    • 4 DP FLOPs/cycle: 2-wide SSE2 addition + 2-wide SSE2 multiplication
    • 8 SP FLOPs/cycle: 4-wide SSE addition + 4-wide SSE multiplication
• Memory Bandwidth: 14.9 GB/s (ARM Cortex-A15) vs 25.6 GB/s (Intel Xeon E5620)
  • DPSNN makes intensive use of memory (e.g. for delivering spikes to post-synaptic neuron queues).
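The per-cycle figures above translate into theoretical peak rates once multiplied by clock frequency and core count. The sketch below does this for one quad-core chip of each type, using the nominal clocks quoted earlier in the talk (2.3 GHz and 2.4 GHz). The helper name is hypothetical, and turbo or frequency scaling is ignored, so these are rough upper bounds rather than measured values.

```python
def peak_gflops(flops_per_cycle, clock_ghz, cores):
    """Theoretical peak = FLOPs/cycle x clock (GHz) x number of cores."""
    return flops_per_cycle * clock_ghz * cores


# Per-cycle figures from the list above; four cores per chip.
arm_dp = peak_gflops(2, 2.3, 4)
arm_sp = peak_gflops(8, 2.3, 4)
xeon_dp = peak_gflops(4, 2.4, 4)
xeon_sp = peak_gflops(8, 2.4, 4)

print(f"Cortex-A15: {arm_dp:.1f} DP, {arm_sp:.1f} SP GFLOPS")
print(f"Xeon E5620: {xeon_dp:.1f} DP, {xeon_sp:.1f} SP GFLOPS")
print(f"DP peak ratio (Xeon/ARM): {xeon_dp / arm_dp:.2f}")
print(f"Memory bandwidth ratio (Xeon/ARM): {25.6 / 14.9:.2f}")
```

Since DPSNN is memory-intensive, the bandwidth ratio is probably at least as relevant as the floating-point peak when interpreting the timing results.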
Benchmark Configuration (1° Gen)
• DPSNN:
  • Simulated time: 3 s
  • 10K LIFCA neurons
  • 18M synapses
• Low-power platform:
  • 2 quad-core ARM A15 Jetson TK1 + Gigabit switch
  • 8 MPI processes
• Server platform:
  • 1 Supermicro SuperServer 6016GT-TF (2 Intel E5620 quad-core processors)
  • 8 MPI processes (hyperthreading turned off)
1° Gen Results
• TIME: Server platform is 3.3x better than low-power platform
• POWER: Server platform is 14.4x worse than low-power platform
• ENERGY: Server platform is 4.4x worse than low-power platform
• We did not subtract any baseline power consumption.
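The three ratios above are linked: energy-to-solution is power multiplied by time, so the energy ratio is the power ratio divided by the speed-up. The sketch below checks this with the quoted figures. The function name is hypothetical, and it assumes the power ratio refers to average power over the whole run.

```python
def energy_ratio(power_ratio, speedup):
    """Energy-to-solution ratio of platform A over platform B.

    power_ratio: P_A / P_B (A draws this many times more power)
    speedup:     T_B / T_A (A finishes this many times faster)
    Energy = power x time, so E_A / E_B = power_ratio / speedup.
    """
    return power_ratio / speedup


ratio = energy_ratio(power_ratio=14.4, speedup=3.3)
print(f"Server / low-power energy ratio: {ratio:.1f}")
```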
2° Gen
Low-power platform:
• Tegra X1 SOC (20 nm)
• CPU: ARMv8 ARM Cortex-A57 quad-core (2MB L2 cache) + ARM Cortex-A53 quad-core (64-bit) in big.LITTLE configuration, 102 MHz / 1.9 GHz
• GPU: NVIDIA Maxwell "GM20B" with 256 CUDA cores: 512 GFLOPS (FP32), 1 TFLOPS (FP16).
• DRAM: 4GB LPDDR4 (25.6 GB/s BW)
• Storage: 16GB eMMC
• Ethernet: 10/100/1000Base-T
• OS: Ubuntu 14.04.1 LTS (GNU/Linux 3.10.67-g458d45c aarch64)
• SW stack: gcc 4.8.4 (Ubuntu/Linaro 4.8.4-2ubuntu1~14.04.3), Open MPI 1.6.5
Server platform:
• Dimensions: 4U standard
• Motherboard: X10DRG-Q
• CPU: Dual hexa-core Intel E5-2620 v3 @2.4 GHz (15MB L3 cache), 1.2 up to 3.2 GHz frequency scaling, 22 nm, mem BW up to 59 GB/s
• DRAM: 64GB DDR4 2133 MHz
• NIC: Mellanox ConnectX VPI IB QDR
• OS: CentOS 7.2, kernel 4.5.3-1.el7.elrepo.x86_64
• SW stack: gcc 4.8.5, Open MPI 1.10.0
Benchmark Configuration (2° Gen)
• DPSNN:
  • Simulated time: 3 s
  • 10K LIFCA neurons
  • 18M synapses
• Low-power platform:
  • 1 Jetson TX1 (quad-core ARM Cortex-A57)
  • 4 MPI processes, interactive freq. scaling governor
• Server platform:
  • 1 Supermicro SuperServer 7048GR-TR (2 hexa-core Intel E5-2620 v3 @ 2.40GHz)
  • 4 MPI processes, powersave freq. scaling governor
2° Gen Results
• TIME: Server platform is 5x faster than low-power platform
• POWER: Server platform is 14.5x worse than low-power platform
• ENERGY: Server platform is 2.9x worse than low-power platform
• We did not subtract any baseline power consumption.
Haswell vs. Cortex A57: Comments on Results
• Effective Cortex A57 usable max freq. is 1734 MHz.
• Taking into account the full baseline power consumption is unfair to the Haswell platform (4 cores used out of 12). If we renormalize the baseline to 1/3 for the Haswell, the results would be:
  • Power consumption ratio: 10.9 (instead of 14.5)
  • Energy to solution ratio: 2.2 (instead of 2.9)
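As a consistency check, both energy ratios follow from the corresponding power ratios and the 5x speed-up reported for the 2° Gen server, since energy is power times time. The snippet only reuses numbers quoted in the talk; the dictionary labels are illustrative.

```python
SPEEDUP = 5.0  # server is 5x faster (2° Gen results)

# Power ratios server / low-power: as measured, and with the Haswell
# baseline scaled to 1/3.
power_ratios = {"full baseline": 14.5, "baseline x 1/3": 10.9}

energy_ratios = {label: p / SPEEDUP for label, p in power_ratios.items()}
for label, e in energy_ratios.items():
    print(f"{label}: energy-to-solution ratio {e:.1f}")
```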
THANK YOU!!!
APE lab: R. Ammendola (1), A. Biagioni (2), F. Capuani (2), P. Cretaro (2), G. De Bonis (2), O. Frezza (2), F. Lo Cicero (2), A. Lonardo (2), M. Martinelli (2), P. S. Paolucci (2), E. Pastorelli (2), L. Pontisso (2), F. Simula (2), P. Vicini (2)
(1) INFN, Roma Tor Vergata
(2) INFN, Roma
This project has received funding from the European Union’s Horizon 2020 Research and Innovation Programme under Grant Agreement No. 720270 (HBP SGA1) and No. 671553 (EXANEST)