Planning and building linux based cluster for NWP


Page 1: Planning  and building  linux based cluster  for NWP

Planning and building linux based cluster

for NWP

Climatological Research Institute (CRI Cluster)

Dr. Jamali Chezgi

Page 2: Planning  and building  linux based cluster  for NWP

Outline

Introduction
Our problem
Our solution
Building CRI Cluster
Monitoring and controlling
Benchmarking
Future plans
References

Page 3: Planning  and building  linux based cluster  for NWP

Introduction

Page 4: Planning  and building  linux based cluster  for NWP

Environment / Climate / Weather
Aeronautics and space exploration
Energy research
Virtual reality
Scientific visualization
Health sciences

Page 5: Planning  and building  linux based cluster  for NWP

Make observation

Collect and process data

Run forecast model

Create product

Provide for end users

Page 6: Planning  and building  linux based cluster  for NWP

Main issues

Very large data sets
Distributed data
High processing requirements
Need for real-time processing
Coupled models

Page 7: Planning  and building  linux based cluster  for NWP

Our problems

Data management: Lisa
NWP models: ARPS, MM5, HRM
Climatological models: NCM

Page 8: Planning  and building  linux based cluster  for NWP
Page 9: Planning  and building  linux based cluster  for NWP

NWP models

ARPS
MM5
HRM

Page 10: Planning  and building  linux based cluster  for NWP

ARPS (Advanced Regional Prediction System)

Open source
Parallel code
Runs on all Unix systems:

1. IBM RS/6000 workstation
2. Cray C-90
3. Cray T3D
4. Cray J90
5. CM-5
6. PC Linux

Page 11: Planning  and building  linux based cluster  for NWP

ARPS Model Process Flow Chart

Input data:
  Indexed terrain elevation file (1°, 5 min, or 30 sec)
  Soil, vegetation type and other land-use data
  Rawinsondes, VAD, and wind profilers
  Doppler radar data
  Single-level data
  User-supplied gridded data (e.g. OLAPS, NMC analysis)

Components:
  ARPSTERN (terrain data preprocessor), input file arpstern.input
  ARPSSFC (surface characteristics data preprocessor), input file arps40.input
  EXT2ARPS (gridded data interpolator)
  ARPS Analysis System
  ARPS Data Assimilation System
  ARPSRETRV (Doppler radar data retrieval system)
  ARPS (main model driver), input file arps40.input
  ARPSPLT (vector graphics post-processor), input file arpsplt.input
  ARPSCVT (history data format converter), input file arpscvt.input
  Other post-processing tools and visualization packages (Savi3D, AVS, etc.)

Page 12: Planning  and building  linux based cluster  for NWP

Climatological Models

Page 13: Planning  and building  linux based cluster  for NWP

Our solution

Memory: use bigger memory?

CPU: use a better CPU?

Cluster: scales both memory and CPU

Page 14: Planning  and building  linux based cluster  for NWP

Building Cluster

Page 15: Planning  and building  linux based cluster  for NWP

CRI Cluster

Page 16: Planning  and building  linux based cluster  for NWP

Prebuilt clusters?

Direct relation between technology and end user
Customize it for our users
Obtaining this technology
Better use
We can upgrade it
Lower costs
Samples around the world

Page 17: Planning  and building  linux based cluster  for NWP
Page 18: Planning  and building  linux based cluster  for NWP

OU Cluster: Breakdown of Nodes

132 compute nodes (computing jobs)
8 storage nodes (Parallel Virtual File System)
2 head nodes (login, compile, debug, test)
1 management node (PVFS control, batch queue)

Each node:
  2 Pentium 4 Xeon DP CPUs (2 GHz, 512 KB L2 cache)
  2 GB RDRAM (400 MHz, 3.2 GB/sec)
  Myrinet-2000 adapter

Page 19: Planning  and building  linux based cluster  for NWP

Cluster Architecture

Page 20: Planning  and building  linux based cluster  for NWP

Cluster room

Space
Packing
Power
Air conditioning
Ease of repair
Security
Cabling

Page 21: Planning  and building  linux based cluster  for NWP

Linux

True multitasking
Virtual memory
Shared libraries
Demand loading
Shared copy-on-write executables
Proper memory management
TCP/IP networking
Up to 64 GB memory support on i386

IP Virtual Server support:
  Virtual server via NAT
  Virtual server via tunneling
  Virtual server via direct routing
VLAN
Fast switching
Bonding driver
EQL

Platforms: 386/486-based PC, ARM, DEC Alpha, Sun SPARC, Motorola 68000, MIPS, PowerPC, …

Page 22: Planning  and building  linux based cluster  for NWP

Communication protocols

Internet protocols
Low-latency protocols:
  Active Messages
  Fast Messages
  VMMC
  U-Net
  BIP

Page 23: Planning  and building  linux based cluster  for NWP

TCP/IP problems for clustering

1. Latency

for small packets

2. Bandwidth

for big packets
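The two problems can be seen together in the common linear cost model, time = latency + size / bandwidth. The sketch below is only an illustration: the latency and bandwidth figures are assumed values for a Fast Ethernet class link, not measurements from the CRI cluster.

```python
def transfer_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Linear cost model: fixed per-message latency plus size / bandwidth."""
    return latency_s + size_bytes / bandwidth_bytes_per_s


def effective_bandwidth(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Bytes per second actually achieved for one message of the given size."""
    return size_bytes / transfer_time(size_bytes, latency_s, bandwidth_bytes_per_s)


# Hypothetical link: 100 microseconds latency, 12.5 MB/s peak (100 Mbps).
LATENCY = 100e-6
PEAK = 12.5e6

for size in (64, 1024, 65536, 1048576):
    eff = effective_bandwidth(size, LATENCY, PEAK)
    print(f"{size:>8} bytes: {eff / PEAK:6.1%} of peak bandwidth")
```

Under these assumed numbers, small messages are dominated by latency and reach only a few percent of peak, while large messages are limited by the link bandwidth itself.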

Page 24: Planning  and building  linux based cluster  for NWP
Page 25: Planning  and building  linux based cluster  for NWP

Protocol overhead

(Diagram labels: user process, user memory, OS, OS memory, internal buffers, NIC, system)

1) Preparing data
2) Sending interrupt
3) Copy
4) Interrupt to send data out
5) Send to NIC

Page 26: Planning  and building  linux based cluster  for NWP

Cluster computing standards

VIA (Virtual Interface Architecture)
  Combination of many protocols
  Like U-Net, uses a virtual network interface
  Native and emulated implementations
  A version of emulated VIA performs better than TCP/IP
  MPICH over VIA

InfiniBand
  Backed by Compaq, Dell, HP, IBM, Intel, Microsoft, Sun
  Replaces shared I/O with a high-speed, serial, channel-based, message-passing, scalable, switched fabric
  Uses HCAs (host channel adapters) and TCAs (target channel adapters) to connect to the channel
  Uses six transfer types: reliable and unreliable connections, reliable and unreliable datagrams, multicast connections, raw packets
  Supports DMA
  IPv6

Page 27: Planning  and building  linux based cluster  for NWP

Hardware products

Ethernet, Fast Ethernet and Gigabit Ethernet
Giganet (cLAN)
Myrinet
QsNet
ServerNet
SCI (Scalable Coherent Interface)
ATM
Fibre Channel
HIPPI
Reflective Memory
ATOLL

Page 28: Planning  and building  linux based cluster  for NWP

Installing and configuring

Installing the server
Building services
Auto-installing clients
Auto-configuring clients
Management of the nodes

Page 29: Planning  and building  linux based cluster  for NWP

NIS configuration: on the server

1) Specify the domain name:
# domainname <DOMAIN_NAME>

2) Put it in "/etc/sysconfig/network":
NISDOMAIN=<DOMAIN_NAME>

3) Specify the server name in "/etc/yp.conf":
domain <DOMAIN_NAME> server <SERVER_NAME>

4) Restart the daemons:
# /etc/rc.d/ypserv restart
# /etc/rc.d/ypbind restart

5) Add them to the init scripts so they start at boot

6) Edit "/etc/yp/Makefile":
Set MERGE_PASSWD and MERGE_GROUP (FALSE or TRUE) as required
Remove netgrp from the "all" target

7) Build the NIS database:
# /usr/lib/yp/ypinit -m

8) After any future change, only run:
# cd /var/yp; make

Page 30: Planning  and building  linux based cluster  for NWP

NIS configuration: on the client

1) Specify the domain name:
# domainname <DOMAIN_NAME>

2) Put it in "/etc/sysconfig/network":
NISDOMAIN=<DOMAIN_NAME>

3) Specify the server name in "/etc/yp.conf":
domain <DOMAIN_NAME> server <SERVER_NAME>

4) Restart the daemon:
# /etc/rc.d/ypbind restart

5) Add it to the init scripts so it starts at boot

6) Test by logging in as one of the server's users
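When many clients are configured automatically, the two client-side settings can be generated rather than typed. The following is a minimal sketch, assuming the `domain ... server ...` form accepted by yp.conf; the function name and the example domain and server names are hypothetical, and nothing is written to disk.

```python
def nis_client_config(domain, server):
    """Return the lines each NIS client file needs, keyed by file path."""
    return {
        "/etc/sysconfig/network": [f"NISDOMAIN={domain}"],
        "/etc/yp.conf": [f"domain {domain} server {server}"],
    }


# Hypothetical names, for illustration only.
for path, lines in nis_client_config("cri-cluster", "master").items():
    print(path)
    for line in lines:
        print("    " + line)
```

An auto-configuration script could append these lines on each node and then restart ypbind as in step 4.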

Page 31: Planning  and building  linux based cluster  for NWP

Monitoring and controlling

1) Scripts: Perl, Python, bash

2) Prebuilt tools: Webmin, Scyld, SCD

Page 32: Planning  and building  linux based cluster  for NWP

Hardware monitoring and control (ICE Box)

ICE Box hardware management
Monitor temperatures within nodes and remotely reset motherboards through internally placed probes
SNMP compliant
DHCP or static network configuration
NIMP (Network ICE Management Protocol)
SIMP (Serial ICE Management Protocol)
Out-of-band serial data buffering
Accessible with several protocols (NIMP, SIMP, null modem, Telnet, SNMP, ClusterWorX)
Remote monitoring of CPU temperatures
Remote power management
Power sequencing to start up nodes
Optional cabinet temperature monitoring (eight sensors per ICE Box)
Node reset
Multiple ICE Boxes scale to support large clusters
Embedded CPU powered by Linux for a stable run-time environment
Ability to easily and safely update the ICE Box operating system without cluster downtime

Page 33: Planning  and building  linux based cluster  for NWP

Security

SSH
PAM
Xinetd

Page 34: Planning  and building  linux based cluster  for NWP

Running ARPS

Fortran 77 compiler (GNU)
Preprocessing data
BC and IC data from other models
Post-processing tools (NCARG)

Running flowchart:
  Preprocessing (always one time)
  Splitting
  Initializing
  Boundary conditions
  Running
  Joining
  Post-processing (on other computers)
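The splitting step determines how large the global grid is for a given per-processor subdomain. The 16-CPU benchmark slide in this deck uses (200-3)*4+3 = 791, which implies that neighbouring subdomains share 3 grid points along each axis. The sketch below simply encodes that arithmetic; the function names are hypothetical and it is not taken from the ARPS splitting tools.

```python
OVERLAP = 3  # grid points shared by neighbouring subdomains, per the 16-CPU slide


def global_points(sub_points, nproc, overlap=OVERLAP):
    """Global grid points along one axis for nproc subdomains of sub_points each."""
    return (sub_points - overlap) * nproc + overlap


def subdomain_points(total_points, nproc, overlap=OVERLAP):
    """Per-processor grid points along one axis, or an error if the split is uneven."""
    interior = total_points - overlap
    if interior % nproc != 0:
        raise ValueError("global grid does not split evenly across processors")
    return interior // nproc + overlap


print(global_points(200, 4))      # 791
print(subdomain_points(791, 4))   # 200
```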

Page 35: Planning  and building  linux based cluster  for NWP

Parallel architecture of ARPS

Page 36: Planning  and building  linux based cluster  for NWP

Transform Tool

Page 37: Planning  and building  linux based cluster  for NWP
Page 38: Planning  and building  linux based cluster  for NWP

200*200

800*800

800*400

Page 39: Planning  and building  linux based cluster  for NWP

10 km

3 km

1 km

Grid computing?

1. A big coarse domain at low resolution, with smaller domains at better resolution
2. In data assimilation, the code moves close to the data

Page 40: Planning  and building  linux based cluster  for NWP

AUI

Page 41: Planning  and building  linux based cluster  for NWP

Benchmarking

ARPS results
GMandel
BPS

Page 42: Planning  and building  linux based cluster  for NWP

Performance Utilities

1. AIMS - instrumentors, monitoring library, and analysis tools
2. MPE logging library and Nupshot performance visualization tool
3. Pablo - monitoring library and analysis tools
4. Paradyn - dynamic instrumentation and run-time analysis tool
5. SvPablo - integrated instrumentor, monitoring library, and analysis tool
6. VAMPIRtrace monitoring library and VAMPIR performance visualization tool
7. VT - monitoring library and performance analysis and visualization tool for the IBM SP

Page 43: Planning  and building  linux based cluster  for NWP

ARPS performance

Performance is better with a larger domain per CPU, because the cluster network is the limiting factor and we need more calculation per data transfer.
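One way to see why a larger domain per CPU helps is to compare the work done inside a subdomain with the data exchanged along its edges. The sketch below assumes a square n x n subdomain whose four edges are exchanged with neighbours; the halo width and the function name are illustrative assumptions, not values taken from ARPS.

```python
def compute_to_comm_ratio(n, halo_width=1):
    """Grid points updated per boundary point exchanged, for an n x n subdomain."""
    computed = n * n
    exchanged = 4 * n * halo_width  # four edges; corner points ignored for simplicity
    return computed / exchanged


for n in (50, 100, 200, 400):
    print(f"{n:>3} x {n:<3}: {compute_to_comm_ratio(n):6.1f} points computed per point exchanged")
```

The ratio grows linearly with n, so doubling the subdomain edge doubles the calculation done per unit of data sent over the network.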

Page 44: Planning  and building  linux based cluster  for NWP

Model situation

200*200 per processor
Prediction time = 60 s
Output = NONE
Dtbig = 6 s
1 km * 1 km * 500 m grids

Page 45: Planning  and building  linux based cluster  for NWP

--200 * 200 per domain {200 x 200}-1 cpu--

ARPS stopped normally in the main program. The ending time was 60.000 seconds.
Thanks for using ARPS.

Process           CPU time used   Percentage
-----------------------------------------------
Initialization  : 0.760000E+01s     1.40%
Data output     : 0.829005E+01s     1.53%
Wind advection  : 0.190701E+02s     3.52%
Scalar advection: 0.397800E+02s     7.34%
Coriolis force  : 0.000000E+00s     0.00%
Buoyancy term   : 0.618995E+01s     1.14%
Small time steps: 0.241000E+03s    44.48%
Radiation       : 0.000000E+00s     0.00%
Soil model      : 0.000000E+00s     0.00%
Surface physics : 0.000000E+00s     0.00%
Turbulence      : 0.874099E+02s    16.13%
Comput. mixing  : 0.352601E+02s     6.51%
Rayleigh damping: 0.271003E+01s     0.50%
TKE src terms   : 0.287300E+02s     5.30%
Bound.conditions: 0.220026E+00s     0.04%
Gridscale precp.: 0.000000E+00s     0.00%
Kuo cumulus     : 0.000000E+00s     0.00%
Kain-Fritsch    : 0.000000E+00s     0.00%
Warmrain microph: 0.452400E+02s     8.35%
Lin ice microph : 0.000000E+00s     0.00%
NEM ice microph : 0.000000E+00s     0.00%
Hydrometero fall: 0.000000E+00s     0.00%
Miscellaneous   : 0.169800E+02s     3.13%

Entire model    : 0.541820E+03s   100.00%

Page 46: Planning  and building  linux based cluster  for NWP

--200 * 200 per domain {400 x 200}-2 cpu--

ARPS stopped normally in the main program. The ending time was 60.000 seconds.
Thanks for using ARPS.

Process           CPU time used   Percentage
-----------------------------------------------
Initialization  : 0.763000E+01s     1.41%
Data output     : 0.822997E+01s     1.52%
Wind advection  : 0.190600E+02s     3.52%
Scalar advection: 0.402001E+02s     7.42%
Coriolis force  : 0.000000E+00s     0.00%
Buoyancy term   : 0.615997E+01s     1.14%
Small time steps: 0.241520E+03s    44.56%
Radiation       : 0.000000E+00s     0.00%
Soil model      : 0.000000E+00s     0.00%
Surface physics : 0.000000E+00s     0.00%
Turbulence      : 0.872100E+02s    16.09%
Comput. mixing  : 0.351900E+02s     6.49%
Rayleigh damping: 0.276001E+01s     0.51%
TKE src terms   : 0.285300E+02s     5.26%
Bound.conditions: 0.240047E+00s     0.04%
Gridscale precp.: 0.000000E+00s     0.00%
Kuo cumulus     : 0.000000E+00s     0.00%
Kain-Fritsch    : 0.000000E+00s     0.00%
Warmrain microph: 0.451199E+02s     8.32%
Lin ice microph : 0.000000E+00s     0.00%
NEM ice microph : 0.000000E+00s     0.00%
Hydrometero fall: 0.000000E+00s     0.00%
Miscellaneous   : 0.168399E+02s     3.11%

Entire model    : 0.542000E+03s   100.00%

Page 47: Planning  and building  linux based cluster  for NWP

--200 * 200 per domain {400 x 400}-4 cpu--

ARPS stopped normally in the main program. The ending time was 60.000 seconds.
Thanks for using ARPS.

Process           CPU time used   Percentage
-----------------------------------------------
Initialization  : 0.762000E+01s     1.40%
Data output     : 0.827001E+01s     1.52%
Wind advection  : 0.191300E+02s     3.52%
Scalar advection: 0.404000E+02s     7.44%
Coriolis force  : 0.000000E+00s     0.00%
Buoyancy term   : 0.614000E+01s     1.13%
Small time steps: 0.241750E+03s    44.53%
Radiation       : 0.000000E+00s     0.00%
Soil model      : 0.000000E+00s     0.00%
Surface physics : 0.000000E+00s     0.00%
Turbulence      : 0.874600E+02s    16.11%
Comput. mixing  : 0.351000E+02s     6.47%
Rayleigh damping: 0.273998E+01s     0.50%
TKE src terms   : 0.285099E+02s     5.25%
Bound.conditions: 0.249939E+00s     0.05%
Gridscale precp.: 0.000000E+00s     0.00%
Kuo cumulus     : 0.000000E+00s     0.00%
Kain-Fritsch    : 0.000000E+00s     0.00%
Warmrain microph: 0.451600E+02s     8.32%
Lin ice microph : 0.000000E+00s     0.00%
NEM ice microph : 0.000000E+00s     0.00%
Hydrometero fall: 0.000000E+00s     0.00%
Miscellaneous   : 0.169001E+02s     3.11%

Entire model    : 0.542850E+03s   100.00%

Page 48: Planning  and building  linux based cluster  for NWP

--200 * 200 per domain {800 x 400}-8 cpu--

ARPS stopped normally in the main program. The ending time was 60.000 seconds.
Thanks for using ARPS.

Process           CPU time used   Percentage
-----------------------------------------------
Initialization  : 0.758000E+01s     1.39%
Data output     : 0.827006E+01s     1.52%
Wind advection  : 0.190499E+02s     3.50%
Scalar advection: 0.404402E+02s     7.44%
Coriolis force  : 0.000000E+00s     0.00%
Buoyancy term   : 0.619997E+01s     1.14%
Small time steps: 0.242260E+03s    44.57%
Radiation       : 0.000000E+00s     0.00%
Soil model      : 0.000000E+00s     0.00%
Surface physics : 0.000000E+00s     0.00%
Turbulence      : 0.873999E+02s    16.08%
Comput. mixing  : 0.352699E+02s     6.49%
Rayleigh damping: 0.271999E+01s     0.50%
TKE src terms   : 0.286100E+02s     5.26%
Bound.conditions: 0.290039E+00s     0.05%
Gridscale precp.: 0.000000E+00s     0.00%
Kuo cumulus     : 0.000000E+00s     0.00%
Kain-Fritsch    : 0.000000E+00s     0.00%
Warmrain microph: 0.451000E+02s     8.30%
Lin ice microph : 0.000000E+00s     0.00%
NEM ice microph : 0.000000E+00s     0.00%
Hydrometero fall: 0.000000E+00s     0.00%
Miscellaneous   : 0.169199E+02s     3.11%

Entire model    : 0.543510E+03s   100.00%

Page 49: Planning  and building  linux based cluster  for NWP

--{(200-3)*4+3 = 791, or ~800 in total}-16 cpu--

ARPS stopped normally in the main program. The ending time was 60.000 seconds.
Thanks for using ARPS.

Process           CPU time used   Percentage
-----------------------------------------------
Initialization  : 0.762000E+01s     1.40%
Data output     : 0.820012E+01s     1.50%
Wind advection  : 0.191300E+02s     3.50%
Scalar advection: 0.403599E+02s     7.39%
Coriolis force  : 0.000000E+00s     0.00%
Buoyancy term   : 0.615000E+01s     1.13%
Small time steps: 0.243190E+03s    44.55%
Radiation       : 0.000000E+00s     0.00%
Soil model      : 0.000000E+00s     0.00%
Surface physics : 0.000000E+00s     0.00%
Turbulence      : 0.880600E+02s    16.13%
Comput. mixing  : 0.354600E+02s     6.50%
Rayleigh damping: 0.276005E+01s     0.51%
TKE src terms   : 0.287300E+02s     5.26%
Bound.conditions: 0.309933E+00s     0.06%
Gridscale precp.: 0.000000E+00s     0.00%
Kuo cumulus     : 0.000000E+00s     0.00%
Kain-Fritsch    : 0.000000E+00s     0.00%
Warmrain microph: 0.455600E+02s     8.35%
Lin ice microph : 0.000000E+00s     0.00%
NEM ice microph : 0.000000E+00s     0.00%
Hydrometero fall: 0.000000E+00s     0.00%
Miscellaneous   : 0.169700E+02s     3.11%

Entire model    : 0.545870E+03s   100.00%
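Because each run keeps 200 * 200 points per processor, the five runs above form a weak-scaling series: ideal behaviour is a constant time as CPUs are added. The sketch below computes the ratio of the 1-CPU time to each N-CPU time from the "Entire model" figures reported on the slides. Note the caveat that these are CPU times as printed by ARPS, so time spent waiting on the network may not be fully reflected.

```python
# "Entire model" CPU times in seconds, copied from the benchmark slides.
entire_model_s = {1: 541.820, 2: 542.000, 4: 542.850, 8: 543.510, 16: 545.870}


def weak_scaling_efficiency(times):
    """Efficiency per CPU count: time on 1 CPU divided by time on N CPUs."""
    base = times[1]
    return {n: base / t for n, t in times.items()}


efficiency = weak_scaling_efficiency(entire_model_s)
for n in sorted(efficiency):
    print(f"{n:>2} CPUs: {efficiency[n]:.2%}")
```

By this measure the reported time grows by less than one percent between 1 and 16 CPUs.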

Page 50: Planning  and building  linux based cluster  for NWP

Gmandel-pvm benchmark

calculating with:
x1=-2.000000000
y1=-2.000000000
x2=2.000000000
y2=2.000000000
limit=1000000
wall time=17 secs.
MFLOPS=19461.0

calculating with:
x1=-0.760416667
y1=-0.354166667
x2=-0.614583333
y2=-0.208333333
limit=1000000
wall time=97 secs.
MFLOPS=19556.6
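The benchmark parameters (a rectangle x1, y1, x2, y2 and an iteration limit) match the usual escape-time Mandelbrot calculation. The Gmandel source is not shown in these slides, so the sketch below is the generic algorithm and not the benchmark's own code; the function names and the tiny grid size are illustrative assumptions. Each row is independent of the others, which is what makes this kind of workload easy to spread across cluster nodes.

```python
def escape_iterations(c, limit):
    """Iterations of z -> z*z + c before |z| exceeds 2, capped at limit."""
    z = 0j
    for i in range(limit):
        if abs(z) > 2.0:
            return i
        z = z * z + c
    return limit


def render(x1, y1, x2, y2, width, height, limit):
    """Escape-time counts on a width x height grid covering the rectangle."""
    rows = []
    for j in range(height):
        y = y1 + (y2 - y1) * j / (height - 1)
        row = []
        for i in range(width):
            x = x1 + (x2 - x1) * i / (width - 1)
            row.append(escape_iterations(complex(x, y), limit))
        rows.append(row)
    return rows


# Same rectangle as the first benchmark run, but a small grid and a low limit.
grid = render(-2.0, -2.0, 2.0, 2.0, width=9, height=9, limit=50)
for row in grid:
    print(" ".join(f"{v:2d}" for v in row))
```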

Page 51: Planning  and building  linux based cluster  for NWP
Page 52: Planning  and building  linux based cluster  for NWP

Future plans

Add a VRML-based monitoring and controlling system
Add scheduling for better use of the resources
Build one packaged solution
Extend it
Grid computing

Page 53: Planning  and building  linux based cluster  for NWP

References

ARPS documents
High-Speed Networking, James P. G. Sterbenz and Joseph D. Touch (Wiley)
Cluster Computing White Paper, Mark Baker, University of Portsmouth, UK
Beowulf HOWTO
www.beowulf.com
www.myricom.com
www.intel.com
www.infinibandfd.com
www.clustercomputing.com

Page 54: Planning  and building  linux based cluster  for NWP

Thank you

Page 55: Planning  and building  linux based cluster  for NWP

Hardware products

Fast Ethernet
  100 Mbps
  CSMA/CD (Carrier Sense Multiple Access with Collision Detection)
HiPPI (High Performance Parallel Interface)
  copper-based, 800/1600 Mbps over 32/64-bit lines
  point-to-point channel
ATM (Asynchronous Transfer Mode)
  connection-oriented packet switching
  fixed length (53-byte cells)
  suitable for WAN
SCI (Scalable Coherent Interface)
  IEEE standard 1596, hardware DSM support

Page 56: Planning  and building  linux based cluster  for NWP

Hardware products

ServerNet
  1 Gbps
  originally an interconnect for high-bandwidth I/O
Myrinet
  programmable microcontroller
  1.28 Gbps
Memory Channel
  800 Mbps
  virtual shared memory
  strict message ordering
Synfinity
  12.8 Gbps
  hardware support for message passing, shared memory and synchronization

Page 57: Planning  and building  linux based cluster  for NWP

Link Parameters

Page 58: Planning  and building  linux based cluster  for NWP

Comparing products

Page 59: Planning  and building  linux based cluster  for NWP

Prices - Myricom

Low Cost, One Port: $795
  64-bit PCI-X and PCI, low-profile PCI short card
  225 MHz RISC & 2 MB memory
  For applications requiring up to ~490 MB/s user-level bidirectional data rate

High End, Two Port: $1,195
  64-bit PCI-X and PCI, low-profile PCI short card
  333 MHz RISC & 2 MB memory
  For applications requiring up to ~950 MB/s user-level bidirectional data rate

Page 60: Planning  and building  linux based cluster  for NWP

Prices - Myricom

Clos network for 128 hosts with all fiber ports and monitoring capability

Myrinet-2000 Switch Enclosures

Description                                                     Product Code   Price
2U high, 3-slot enclosure for switches up to 16 ports           M3-E16         $1,600
3U high, 5-slot enclosure for switch networks up to 32 ports    M3-E32         $3,200
5U high, 9-slot enclosure for switch networks up to 64 hosts    M3-E64         $6,400
9U high, 17-slot enclosure for switch networks up to 128 ports  M3-E128        $12,800

Page 61: Planning  and building  linux based cluster  for NWP

Prices - Dolphin

SCI adapter: PMC-SCI Adapter Card (64 bit, 66 MHz PCI), $1,480

SCI switches: 8 Port Expandable Modular BxBAR SCI Switch, $4,980

Page 62: Planning  and building  linux based cluster  for NWP
Page 63: Planning  and building  linux based cluster  for NWP
Page 64: Planning  and building  linux based cluster  for NWP

Active Messages (zero copy)

Berkeley NOW project
Short messages are asynchronous (based on request-reply)
No buffering in system buffers
Messages transfer directly from user memory to user memory
GAM (Generic Active Messages): one copy at the receiver side

Page 65: Planning  and building  linux based cluster  for NWP

Fast Messages

University of Illinois
Similar to AM
Adds a transfer (flow) control mechanism
A credit system is required to manage pinned memory
Good for heterogeneous nodes
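A credit system of the kind mentioned for Fast Messages can be pictured as follows: the sender holds one credit per pinned receive buffer and may only send while it has credits left. This is a generic sketch of credit-based flow control under that assumption, with hypothetical class and method names; it is not the Fast Messages API.

```python
class CreditSender:
    """Sender side of credit-based flow control: one credit per receiver buffer."""

    def __init__(self, receiver_buffers):
        self.credits = receiver_buffers

    def try_send(self, message, wire):
        """Send only if a credit is available; otherwise the caller must retry later."""
        if self.credits == 0:
            return False
        self.credits -= 1
        wire.append(message)
        return True

    def on_credit_return(self, count=1):
        """Called when the receiver reports that it has freed buffers."""
        self.credits += count


wire = []
sender = CreditSender(receiver_buffers=2)
print(sender.try_send("a", wire))  # True
print(sender.try_send("b", wire))  # True
print(sender.try_send("c", wire))  # False, no credits left
sender.on_credit_return()
print(sender.try_send("c", wire))  # True
```

The receiver never needs more pinned buffers than the credits it has handed out, which is the point of the scheme.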

Page 66: Planning  and building  linux based cluster  for NWP

VMMC (Virtual Memory-Mapped Communication)

Princeton SHRIMP project
Sending a message = reading and writing user memory
Memory pages are mapped on both sides
Uses hardware that lets the NIC listen for memory writes and send them to the other side
A type of DSM

Page 67: Planning  and building  linux based cluster  for NWP

U-Net

Cornell University
Zero copy where possible
A virtual interface for each connection
Acting on demand

Page 68: Planning  and building  linux based cluster  for NWP

BIP (Basic Interface for Parallelism)

University of Lyon
Low-level message layer
Higher-layer message-passing libraries (like MPICH) can be built on this layer
BIP-SMP
Uses different protocols for different message sizes (zero or more copies)
Flow control
