
A Teraflop Linux Cluster for Lattice Gauge Simulations in India

N.D. Hari Dass
Institute of Mathematical Sciences, Chennai

Indian Lattice Community

• IMSc (Chennai): Sharatchandra, Anishetty and Hari Dass.
• IISc (Bangalore): Apoorva Patel.
• TIFR (Mumbai): Rajiv Gavai, Sourendu Gupta.
• SINP (Kolkata): Asit De, Harindranath.
• HRI (Allahabad): S. Naik.
• SNBose (Kolkata): Manu Mathur.
• It is small but very active and well recognised.
• So far its research has been mostly theoretical, or small-scale simulations, except for international collaborations.
• At the International Lattice Symposium held in Bangalore in 2000, the Indian Lattice Community decided to change this situation:
• Form the Indian Lattice Gauge Theory Initiative (ILGTI).
• Develop suitable infrastructure at different institutions for collective use.
• Launch new collaborations that would make the best use of such infrastructure.
• At IMSc we have finished integrating a 288-CPU Xeon Linux cluster.
• At TIFR a Cray X1 with 16 CPUs has been acquired.
• At SINP plans are under way to have substantial computing resources.

Compute Nodes and Interconnect

• After a lot of deliberation it was decided that the compute nodes shall be dual Intel Xeons @ 2.4 GHz.

• The motherboard and 1U rack-mountable chassis were developed by Supermicro.

• For the interconnect the choice was the SCI technology developed by Dolphinics of Norway.

Interconnect Technologies

[Figure: design space for different interconnect technologies, plotting application requirements (latency, bandwidth, distance) against application areas, from processor and memory busses through I/O (PCI, SCSI, FibreChannel) and cluster interconnects (Dolphin SCI, RapidIO, HyperTransport, InfiniBand, Myrinet, cLAN) out to LAN/WAN technologies (Ethernet, ATM).]

PCI-SCI Adapter Card (1 slot, 3 dimensions)

• SCI adapters (64 bit, 66 MHz)

– PCI/SCI adapter (D336)

– Single-slot card with 3 LCs (link controllers)

– EZ-Dock plug-up module

– Supports 3 SCI ring connections

– Used for WulfKit 3D clusters

– WulfKit product code D236

[Figure: block diagram of the adapter card: the PCI bus feeds a PSB bridge chip, which drives three LCs, one per SCI link.]

Theoretical Scalability with 66MHz/64bits PCI Bus

[Figure: theoretical aggregate bandwidth (GBytes/s, log scale) versus number of nodes (log scale, with 12, 144 and 1728 nodes marked) for ringlet, 2D-torus, 3D-torus and 4D-torus SCI topologies, compared with the PCI bus limit. Courtesy of Scali NA.]

System Interconnects

• High-performance interconnect:

– Torus topology

– IEEE/ANSI Std 1596 SCI

– 667 MBytes/s per segment per ring

– Shared address space

– Channel bonding option

• Maintenance and LAN interconnect:

– 100 Mbit/s Ethernet

Courtesy of Scali NA.

Scali’s MPI Fault Tolerance

• 2D or 3D torus topology: more routing options.

• XYZ routing algorithm:

– Node 33 fails.

– Nodes on 33’s ringlets become unavailable.

– The cluster is fractured with the current routing setting.

[Figure: 4x4 torus with nodes labelled 11–44; the failure of node 33 cuts off the ringlets it sits on.]

Courtesy of Scali NA.

ScaMPI Fault Tolerance (contd.)

• Scali’s advanced routing algorithm is from the “Turn Model” family of routing algorithms.

• All nodes but the failed one can be utilised as one big partition (an illustrative routing sketch follows below).

[Figure: the same 4x4 torus after rerouting; every node except the failed node 33 remains reachable.]

Courtesy of Scali NA.
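To make the routing picture concrete, here is a small stand-alone C sketch of dimension-ordered ("XY") routing on a 4x4 torus with one failed node. It is only an illustration, not Scali's routing code: the coordinate labels, the shortest-ring-direction hop rule and all names are assumptions for the example. It counts how many source/destination pairs are blocked when every packet must travel X-first-then-Y, which is the fracturing shown above; a turn-model router avoids this by allowing other turn orders.

#include <stdio.h>
#include <stdbool.h>

#define N 4
static bool dead[N][N];                 /* dead[x][y]: node (x,y) has failed   */

/* One hop toward dst along a single dimension, taking the shorter way
 * around the ring (an assumption made for this illustration).                 */
static int hop(int cur, int dst) {
    if (cur == dst) return cur;
    int fwd = (dst - cur + N) % N;
    return (fwd <= N - fwd) ? (cur + 1) % N : (cur - 1 + N) % N;
}

/* Can plain XY routing deliver a packet from (sx,sy) to (dx,dy)?              */
static bool xy_route_ok(int sx, int sy, int dx, int dy) {
    int x = sx, y = sy;
    while (x != dx) { x = hop(x, dx); if (dead[x][y]) return false; }
    while (y != dy) { y = hop(y, dy); if (dead[x][y]) return false; }
    return true;
}

int main(void) {
    dead[2][2] = true;                  /* one failed node, like node 33 above */
    int blocked = 0, total = 0;
    for (int sx = 0; sx < N; sx++) for (int sy = 0; sy < N; sy++)
        for (int dx = 0; dx < N; dx++) for (int dy = 0; dy < N; dy++) {
            if (dead[sx][sy] || dead[dx][dy] || (sx == dx && sy == dy)) continue;
            total++;
            if (!xy_route_ok(sx, sy, dx, dy)) blocked++;
        }
    printf("%d of %d source/destination pairs blocked under plain XY routing\n",
           blocked, total);
    return 0;
}

Compiled and run, it reports the number of pairs that plain XY routing cannot serve on this toy 4x4 torus.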

• It was decided to build the cluster in stages.

• A 9-node Pilot cluster as the first stage.

• Actual QCD codes as well as extensive benchmarks were run on it.

Integration starts on 17 Nov 2003

KABRU in Final Form

Kabru Configuration

• Number of nodes: 144
• Nodes: dual Intel Xeon @ 2.4 GHz
• Motherboard: Supermicro X5DPA-GG
• Chipset: Intel E7501, 533 MHz FSB
• Memory: 266 MHz ECC DDRAM
• Memory: 2 GB/node on 120 nodes + 4 GB/node on 24 nodes
• Interconnect: Dolphin 3D SCI
• OS: Red Hat Linux 8.0
• MPI: Scali MPI

Physical Characteristics

• 1U rackmountable servers

• Cluster housed in six 42U racks.

• Each rack has 24 nodes.

• Nodes connected in a 6x6x4 3D torus topology.

• The entire system sits in a small 400 sq ft hall.

Communication Characteristics

• With the PCI slot at 33 MHz, the highest sustained bandwidth between nodes is 165 MB/s, at a packet size of 16 MB.

• Between processors on the same node it is 864 MB/s, at a packet size of 98 KB.

• With the PCI slot at 66 MHz these figures double. The lowest latency between nodes is 3.8 μs; between processors on the same node it is 0.7 μs. (A sketch of the kind of ping-pong test behind such measurements follows below.)
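The numbers above are of the kind a standard MPI ping-pong test produces. The sketch below is an illustration, not the benchmark actually run on Kabru: it times repeated send/receive round trips between two ranks at a fixed 16 MB message size (the packet size quoted above); the repetition count and output format are arbitrary choices for the example.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) {
        if (rank == 0) fprintf(stderr, "run with at least 2 ranks\n");
        MPI_Finalize();
        return 1;
    }

    const int nreps  = 100;
    const int nbytes = 16 * 1024 * 1024;          /* 16 MB messages            */
    char *buf = malloc(nbytes);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < nreps; i++) {
        if (rank == 0) {                          /* ping ...                  */
            MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {                   /* ... pong                  */
            MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double rtt = (MPI_Wtime() - t0) / nreps;      /* mean round-trip time      */

    if (rank == 0) {
        printf("sustained bandwidth : %.1f MB/s\n", 2.0 * nbytes / rtt / 1e6);
        printf("half round-trip time: %.2f us\n", 0.5 * rtt * 1e6);
    }
    free(buf);
    MPI_Finalize();
    return 0;
}

Placing the two ranks on different nodes probes the SCI path; placing them on the same node probes the shared-memory path.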

HPL Benchmarks

• Best performance with GOTO BLAS and dgemm from Intel was 959 GFlops on all 144 nodes (problem size 183000).

• Theoretical peak: 1382.4 GFlops.

• Efficiency: ~70% (see the arithmetic check below).

• With 80 nodes the best performance was 537 GFlops.

• Between 80 and 144 nodes the scaling is nearly 98.5%.
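For reference, the peak and efficiency figures follow from simple arithmetic: 144 nodes x 2 CPUs/node x 2.4 GHz x 2 double-precision flops per cycle. The tiny C check below just encodes that product; the 2 flops/cycle figure for these Xeons is the usual assumption behind the 1382.4 GFlops number.

#include <stdio.h>

int main(void)
{
    const double nodes = 144.0, cpus_per_node = 2.0;
    const double ghz = 2.4, flops_per_cycle = 2.0;           /* assumed DP rate */
    const double measured = 959.0;                           /* HPL, GFlops     */

    double peak = nodes * cpus_per_node * ghz * flops_per_cycle;
    printf("theoretical peak = %.1f GFlops\n", peak);        /* 1382.4          */
    printf("HPL efficiency   = %.1f %%\n", 100.0 * measured / peak);  /* ~69.4  */
    return 0;
}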

MILC Benchmarks

• Numerous QCD codes with and without dynamical quarks have been run.

• We independently developed SSE2 assembly code for a double-precision implementation of the MILC codes (a minimal illustration of such a kernel follows this list).

• For the ks_imp_dyn1 codes we got 70% scaling as we went from 2 to 128 nodes with 1 proc/node, and 74% as we went from 1 to 64 nodes with 2 procs/node.

• These were for 32x32x32x48 lattices in single precision.
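The core of such an SSE2 kernel is the double-precision complex multiply-accumulate used inside SU(3) matrix-vector products. Below is a minimal, self-contained sketch using SSE2 intrinsics rather than hand-written assembly; it is not the code developed for Kabru or part of the MILC distribution, and the dcomplex type and function name are invented for the illustration.

#include <emmintrin.h>                              /* SSE2 intrinsics          */

typedef struct { double re, im; } dcomplex;

/* acc += a * b for complex doubles, one complex number per __m128d register.  */
static inline void cmadd_sse2(dcomplex *acc, const dcomplex *a, const dcomplex *b)
{
    __m128d va   = _mm_loadu_pd(&a->re);            /* {a.re, a.im}             */
    __m128d vb   = _mm_loadu_pd(&b->re);            /* {b.re, b.im}             */
    __m128d vacc = _mm_loadu_pd(&acc->re);          /* {acc.re, acc.im}         */

    __m128d are  = _mm_unpacklo_pd(va, va);         /* {a.re, a.re}             */
    __m128d aim  = _mm_unpackhi_pd(va, va);         /* {a.im, a.im}             */
    __m128d bsw  = _mm_shuffle_pd(vb, vb, 1);       /* {b.im, b.re}             */

    __m128d t1   = _mm_mul_pd(are, vb);             /* {a.re*b.re, a.re*b.im}   */
    __m128d t2   = _mm_mul_pd(aim, bsw);            /* {a.im*b.im, a.im*b.re}   */

    /* The real part needs a subtraction and the imaginary part an addition;
       plain SSE2 has no addsub, so flip the sign of the low lane of t2.        */
    __m128d sign = _mm_set_pd(1.0, -1.0);           /* high = +1, low = -1      */
    __m128d prod = _mm_add_pd(t1, _mm_mul_pd(t2, sign));

    _mm_storeu_pd(&acc->re, _mm_add_pd(vacc, prod));
}

An SU(3) matrix times colour-vector multiply is then nine such calls, three per output component.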

MILC Benchmarks Contd.

• For 64^4 lattices with single precision the scaling was close to 86%.

• For double-precision runs on 32^4 lattices the scaling was close to 80% as the number of nodes was increased from 4 to 64.

• For pure-gauge simulations with double precision on 32^4 lattices the scaling was 78.5% going from 2 to 128 nodes.

Physics Planned on Kabru

• Very accurate simulations in pure gauge theory (with Pushan Majumdar) using the Luscher-Weisz multihit algorithm.

• A novel parallel code for both Wilson loop and Polyakov loop correlators has been developed, and preliminary runs have been carried out on lattices up to 32^4.

• 64^4 simulations in double precision require about 200 GB of memory.

Physics on Kabru Contd.

• Using the same multihit algorithm, we have a long-term plan to carry out very accurate measurements of Wilson loops in various representations, as well as their correlation functions, to get a better understanding of confinement.

• We also plan to study string breaking in the presence of dynamical quarks.

• We propose to use scalar quarks to bypass the problems of dynamical fermions.

• With Sourendu Gupta (TIFR) we are carrying out preliminary simulations on sound velocity in finite temperature QCD.

Why KABRU?