A Teraflop Linux Cluster for Lattice Gauge Simulations in India
N.D. Hari Dass
Institute of Mathematical Sciences, Chennai
Indian Lattice Community
• IMSc (Chennai): Sharatchandra, Anishetty and Hari Dass.
• IISc (Bangalore): Apoorva Patel.
• TIFR (Mumbai): Rajiv Gavai, Sourendu Gupta.
• SINP (Kolkata): Asit De, Harindranath.
• HRI (Allahabad): S. Naik.
• SNBose (Kolkata): Manu Mathur.
• The community is small but very active and well recognised.
• So far its research has been mostly theoretical, or small-scale simulations, except for international collaborations.
• At the International Lattice Symposium held in Bangalore in 2000, the Indian Lattice Community decided to change this situation:
• Form the Indian Lattice Gauge Theory Initiative (ILGTI).
• Develop suitable infrastructure at different institutions for collective use.
• Launch new collaborations that would make the best use of such infrastructure.
• At IMSc we have finished integrating a 288-CPU Xeon Linux cluster.
• At TIFR a Cray X1 with 16 CPUs has been acquired.
• At SINP plans are under way to acquire substantial computing resources.
Compute Nodes and Interconnect
• After much deliberation it was decided that the compute nodes would be dual Intel Xeons @ 2.4 GHz.
• The motherboard and 1U rackmountable chassis were developed by Supermicro.
• For the interconnect the choice was the SCI technology developed by Dolphinics of Norway.
Interconnect Technologies

[Figure: design space for different interconnect technologies, plotting application requirements (bandwidth, latency, distance) against application areas (WAN, LAN, I/O, memory, processor). Technologies shown: ATM, Ethernet, Myrinet/cLan, Infiniband, FibreChannel, SCSI, PCI, proprietary busses, cache. Cluster interconnect requirements sit between the LAN and I/O regimes.]
Dolphin SCI Technology

• PCI-SCI adapter card (D336): 64-bit/66 MHz, single-slot card with 3 link controllers (LCs).
• Supports 3 SCI ring connections, i.e. 3 dimensions.
• EZ-Dock plug-up module.
• Used for WulfKit 3D clusters (WulfKit product code D236).

[Diagram: adapter card block layout; PCI bus, PCI-SCI bridge (PSB), and three SCI link controllers (LCs).]
Theoretical Scalability with a 66 MHz/64-bit PCI Bus

[Figure, courtesy of Scali NA: aggregate bandwidth (GBytes/s, log scale) versus number of nodes (1 to 10000) for ringlet, 2D-torus, 3D-torus and 4D-torus SCI topologies, compared against the PCI bus limit, with markers at 12, 144 and 1728 nodes.]
System Interconnects

• High-performance interconnect: torus topology, IEEE/ANSI Std 1596 SCI, 667 MBytes/s per segment per ring, shared address space, channel bonding option.
• Maintenance and LAN interconnect: 100 Mbit/s Ethernet.

Courtesy of Scali NA
Scali’s MPI Fault Tolerance

• A 2D or 3D torus topology offers more routing options.
• With the XYZ routing algorithm, if node 33 fails, the nodes on 33’s ringlets become unavailable and the cluster is fractured under the current routing setting.

[Diagram, courtesy of Scali NA: 4x4 grid of nodes 11-44 with node 33 failed.]
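The fracture described above can be reproduced with a toy model. The sketch below is my own illustration: the node labelling follows the slide, but the shortest-path dimension-order router is an assumption, not Scali's implementation. It counts the node pairs whose route would pass through the failed node 33.

```python
# Toy dimension-order (XY) routing on a 4x4 torus, node labels "xy"
# as in the slide. Illustrative only; not Scali's actual router.
FAILED = (3, 3)
nodes = [(x, y) for x in range(1, 5) for y in range(1, 5) if (x, y) != FAILED]

def xy_route(src, dst):
    """Route by correcting X first, then Y, taking the shorter way around each ring."""
    path = [src]
    x, y = src

    def step(cur, tgt, size=4):
        # one hop toward tgt around a ring labelled 1..size
        fwd = (tgt - cur) % size
        return cur % size + 1 if fwd <= size - fwd else (cur - 2) % size + 1

    while x != dst[0]:
        x = step(x, dst[0]); path.append((x, y))
    while y != dst[1]:
        y = step(y, dst[1]); path.append((x, y))
    return path

# Pairs of healthy nodes whose XY route runs through the failed node:
broken = [(s, d) for s in nodes for d in nodes
          if s != d and FAILED in xy_route(s, d)]
print(len(broken), "of", len(nodes) * (len(nodes) - 1),
      "pairs would route through node 33")
```

Under dimension-order routing these pairs lose connectivity, which is exactly the fractured-cluster situation on the slide; the turn-model routing on the next slide avoids it.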
SCAMPI Fault Tolerance (contd.)

• Scali’s advanced routing algorithm comes from the “Turn Model” family of routing algorithms.
• All nodes but the failed one can be utilised as one big partition.

[Diagram, courtesy of Scali NA: the same 4x4 grid rerouted around the failed node 33.]
• It was decided to build the cluster in stages.
• A 9-node pilot cluster was the first stage.
• Actual QCD codes as well as extensive benchmarks were run on it.
Kabru Configuration
• Number of nodes: 144
• Nodes: dual Intel Xeon @ 2.4 GHz
• Motherboard: Supermicro X5DPA-GG
• Chipset: Intel E7501, 533 MHz FSB
• Memory: 266 MHz ECC DDRAM
• Memory: 2 GB/node on 120 nodes + 4 GB/node on 24 nodes
• Interconnect: Dolphin 3D SCI
• OS: Red Hat Linux v8.0
• MPI: Scali MPI
Physical Characteristics

• 1U rackmountable servers.
• Cluster housed in six 42U racks.
• Each rack holds 24 nodes.
• Nodes connected in a 6x6x4 3D torus topology.
• The entire system fits in a small 400 sq ft hall.
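For concreteness, here is one way the 144 ranks could be laid out on the 6x6x4 torus. The actual rank-to-coordinate mapping used by Scali MPI is not specified on the slides, so this layout is purely illustrative.

```python
# Illustrative rank layout on the 6x6x4 torus (mapping assumed, not
# taken from Scali MPI): rank = x + 6*(y + 6*z).
DIMS = (6, 6, 4)
assert 6 * 6 * 4 == 144  # matches the node count

def coords(rank):
    x, r = rank % DIMS[0], rank // DIMS[0]
    y, z = r % DIMS[1], r // DIMS[1]
    return x, y, z

def neighbours(rank):
    """The six nearest neighbours, with periodic (torus) wrap-around."""
    x, y, z = coords(rank)
    out = []
    for d, size in enumerate(DIMS):
        for step in (+1, -1):
            c = [x, y, z]
            c[d] = (c[d] + step) % size   # wrap around the ring
            out.append(c[0] + DIMS[0] * (c[1] + DIMS[1] * c[2]))
    return out

print(sorted(neighbours(0)))
```

Each node has exactly six torus neighbours, matching the three SCI ring connections (one ring per dimension) of the Dolphin adapter cards.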
Communication Characteristics

• With the PCI slot at 33 MHz, the highest sustained bandwidth between nodes is 165 MB/s at a packet size of 16 MB.
• Between processors on the same node it is 864 MB/s at a packet size of 98 KB.
• With the PCI slot at 66 MHz these figures double.
• The lowest latency between nodes is 3.8 microseconds; between processors on the same node it is 0.7 microseconds.
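These figures fit the standard linear cost model T(n) ≈ latency + n/bandwidth. The sketch below plugs in the inter-node numbers quoted above; the model itself is an assumption, not a measurement.

```python
# Linear latency/bandwidth cost model for inter-node transfers,
# using the 33 MHz PCI figures quoted above (3.8 us, 165 MB/s).
LATENCY = 3.8e-6           # seconds, lowest inter-node latency
BANDWIDTH = 165e6          # bytes/second, sustained at 16 MB packets

def transfer_time(nbytes):
    return LATENCY + nbytes / BANDWIDTH

def effective_bandwidth(nbytes):
    return nbytes / transfer_time(nbytes)

for n in (1 << 10, 1 << 20, 16 << 20):   # 1 KB, 1 MB, 16 MB packets
    print(f"{n:>10d} B: {effective_bandwidth(n) / 1e6:6.1f} MB/s")
```

The model shows why the peak bandwidth is only reached at large packet sizes: for small messages the fixed 3.8 microsecond latency dominates the transfer time.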
HPL Benchmarks
• Best performance, with GOTO BLAS and dgemm from Intel, was 959 GFlops on all 144 nodes (problem size 183000).
• Theoretical peak: 1382.4 GFlops; efficiency: 70%.
• With 80 nodes the best performance was 537 GFlops.
• Between 80 and 144 nodes the scaling is nearly 98.5%.
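The quoted peak is straightforward arithmetic; a quick check, assuming 2 double-precision flops per cycle per Xeon (the SSE2 rate):

```python
# Back-of-envelope check of the theoretical peak quoted above:
# 144 nodes x 2 Xeons x 2.4 GHz x 2 DP flops/cycle (SSE2 assumed).
nodes, cpus_per_node, clock_ghz, flops_per_cycle = 144, 2, 2.4, 2
peak_gflops = nodes * cpus_per_node * clock_ghz * flops_per_cycle
print(peak_gflops)                    # 1382.4, matching the slide

efficiency = 959 / peak_gflops        # HPL result vs. peak
print(f"HPL efficiency: {efficiency:.0%}")
```

959/1382.4 is about 69-70%, consistent with the efficiency quoted above.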
MILC Benchmarks
• Numerous QCD codes, with and without dynamical quarks, have been run.
• We independently developed SSE2 assembly code for the double-precision implementation of the MILC codes.
• For the ks_imp_dyn1 codes we got 70% scaling going from 2 to 128 nodes with 1 process/node, and 74% going from 1 to 64 nodes with 2 processes/node.
• These were for 32x32x32x48 lattices in single precision.
MILC Benchmarks Contd.
• For 64^4 lattices in single precision the scaling was close to 86%.
• For double-precision runs on 32^4 lattices the scaling was close to 80% as the number of nodes was increased from 4 to 64.
• For pure-gauge simulations in double precision on 32^4 lattices the scaling was 78.5% going from 2 to 128 nodes.
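The scaling percentages above follow the usual definition of relative parallel efficiency. The sketch below shows the formula with invented timings; only the formula matters, the numbers are not Kabru measurements.

```python
# Relative parallel efficiency between two node counts (the conventional
# definition, assumed here): E = (T_small * n_small) / (T_large * n_large).
def relative_efficiency(t_small, n_small, t_large, n_large):
    return (t_small * n_small) / (t_large * n_large)

# Hypothetical timings for illustration only (not measured on Kabru):
t2 = 1000.0                        # seconds on 2 nodes
t128 = t2 * 2 / (128 * 0.70)       # what 70% efficiency at 128 nodes implies
print(f"{relative_efficiency(t2, 2, t128, 128):.0%}")
```

At 100% efficiency, doubling the node count halves the wall time; "70% from 2 to 128 nodes" means the 64x larger machine delivers about 0.7 x 64 = 45x the throughput.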
Physics Planned on Kabru
• Very accurate simulations in pure gauge theory (with Pushan Majumdar) using the Lüscher-Weisz multihit algorithm.
• A novel parallel code for both Wilson-loop and Polyakov-loop correlators has been developed, and preliminary runs have been carried out on lattices up to 32^4.
• 64^4 simulations in double precision require 200 GB of memory.
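A rough accounting shows why 64^4 in double precision is memory-hungry. The sketch assumes SU(3) links stored as full 3x3 complex double matrices with 4 links per site; how many field-sized arrays the multihit code actually keeps is not stated on the slides, so the 200 GB figure is simply compared against one gauge-field copy.

```python
# Rough memory accounting for a 64^4 double-precision pure-gauge run.
# Assumption: SU(3) links as full 3x3 complex double matrices (144 bytes),
# 4 links per site; everything beyond one copy is algorithm-dependent.
L = 64
sites = L ** 4                         # 16,777,216 lattice sites
bytes_per_link = 3 * 3 * 2 * 8         # 3x3 complex entries, 8-byte doubles
gauge_field_gb = sites * 4 * bytes_per_link / 2**30
print(f"one gauge-field copy: {gauge_field_gb:.1f} GB")
print(f"200 GB corresponds to {200 / gauge_field_gb:.1f} such copies")
```

One gauge-field copy is 9 GB, so the 200 GB requirement corresponds to roughly twenty field-sized arrays of working storage.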
Physics on Kabru (contd.)
• Using the same multihit algorithm we have a long term plan to carry out very accurate measurements of Wilson loops in various representations as well as their correlation functions to get a better understanding of confinement.
• We also plan to study string breaking in the presence of dynamical quarks.
• We propose to use scalar quarks to bypass the problems of dynamical fermions.
• With Sourendu Gupta (TIFR) we are carrying out preliminary simulations on sound velocity in finite temperature QCD.