Transcript of 01-Intro to Cluster -...
-
Introduction to Cluster Computing
-
Why do we need High Performance Computers?
[Diagram: the change in the scientific discovery process – the traditional Hypothesis → Experiment cycle is extended with Simulation and Modeling]
-
Compute-Intensive Applications
• Simulation and modeling problems
  – Based on successive approximations; more calculations give better results
  – Optimization problems
• Problems that depend on computations and manipulation of large amounts of data
• Examples
  – Weather prediction
  – Image and signal processing, graphics
  – Databases and data mining
-
CFD for Clean room
• Analyzing the behaviour of air flow in a clean room for the electronics industry
• Collaboration project
  – Suranaree University of Technology
  – Kasetsart University
  – Funded by NECTEC
-
CFD Software
• CAMETA version 3.0 (SUT)
  – Time-independent (steady-state) solution
  – Three-dimensional domain
  – Cartesian coordinate system
• Physical quantities of interest:
  – Temperature distribution
  – Relative humidity distribution
  – Particle concentration distribution
-
Molecular Dynamic Simulation
• Drug discovery using molecular docking
  – Avian flu
  – HIV
• Analyzing properties of chemical compounds
-
SWAP Model Parameter Identification – Data Assimilation using RS and GA (KU/AIT)
[Figure: plots of evapotranspiration and LAI versus day of year (0–360)]
• RS Observation → LAI, evapotranspiration
• SWAP Crop Growth Model
  – Input parameters: sowing date, soil property, water management, etc.
  – Outputs: LAI, evapotranspiration
• Fitting: assimilation by finding optimized parameters with a GA, linking the RS observations to the model
-
Challenges
• The calculation time to identify SWAP parameters for just 1 pixel (1 sq. km) ranges from several minutes up to 30 minutes.
• Thus an RS image covering 1000 × 1000 sq. km (1000 × 1000 pixels) would take more than 50 years (30 min × 1000 × 1000), which is not acceptable.
• Solution
  – Parallel computing
• Longitude: 100.008133, Latitude: 14.388195
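The runtime claim above can be checked with simple arithmetic; the sketch below also shows why parallelism is the natural fix (the 100-node figure is an illustrative assumption, not from the slides):

```python
# Back-of-the-envelope check of the runtime claim above, plus the speedup
# a cluster would give. Assumes the worst case of 30 minutes per pixel.

MINUTES_PER_PIXEL = 30
PIXELS = 1000 * 1000          # a 1000 x 1000 sq. km RS image, 1 pixel per sq. km

serial_minutes = MINUTES_PER_PIXEL * PIXELS
serial_years = serial_minutes / (60 * 24 * 365)
print(f"serial: {serial_years:.0f} years")   # ≈ 57 years

# Each pixel is independent of the others, so the work is "embarrassingly
# parallel": on a 100-node cluster the wall-clock time drops by roughly 100x.
NODES = 100
parallel_years = serial_years / NODES
print(f"on {NODES} nodes: {parallel_years:.2f} years")
```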
-
22/08/49 TAM2005 9
Graphics Rendering and Special Effect
• Rendering
  – Generating a 3D image from a model
• Problem
  – Rendering is a time-consuming process, especially for complex and realistic scenes
  – A massive number of rendering jobs is needed to create a movie
-
Top500 Fastest Installed Computers
• www.top500.org
• Lists the top 500 supercomputers at sites worldwide
• Provides a snapshot of the supercomputers installed around the world
• Began in 1993; published every 6 months
• Measures performance with the TPP Linpack Benchmark
• Provides a way to measure trends
-
Parallel Processing
• Solving a large problem by breaking it into a number of small problems, then solving them on multiple processors at the same time
• Real-life examples
  – Building a house using many workers working on separate parts
  – An assembly line in a factory
-
Parallel Processing
[Diagram: a large-scale problem is decomposed into sub-tasks that are distributed across processors]
-
Parallel Computer
• A parallel computer is a special computer designed for parallel processing
  – Multiple processors
  – A high-speed interconnection network that links these processors together
  – Hardware and software that help coordinate the computing tasks
• Sometimes this type of computer is called an MPP (Massively Parallel Processor) computer
-
Compare with SMP
-
Introduction
• Cluster computing is a technology for building high-performance, scalable computing systems from a collection of small computing systems and a high-speed interconnection network
-
Applications
• Scientific computing
  – CAD/CAM
  – Bioinformatics
  – Large-scale financial analysis
  – Simulation
  – Drug design
  – Automobile design (crash simulation)
• IT infrastructure
  – Scalable web servers, search engines (Google uses more than 10,000 server nodes)
• Entertainment
  – Rendering
  – On-line gaming
-
Why? Price/Performance!
-
Beowulf Clustering
• Clustering technology originated from the Beowulf Project at NASA
  – Project started by Thomas Sterling and Donald Becker
• Characteristics of a Beowulf cluster
  – Uses PC-based hardware and commodity components to gain high performance at low cost
  – Based on the Linux OS and open-source software
  – Very popular in the high-performance scientific computing world; moving quickly toward enterprise use
-
Beowulfs!
-
Beowulf Software Architecture
[Diagram: the Beowulf software stack spans HTC, HPC, and HPTC]
-
How to use a cluster system for your work
• High Throughput Computing
  – Use the cluster as a collection of processors
  – Run sequential applications
  – Use a job scheduler to control the execution
• High Performance Computing
  – Use the cluster as a parallel machine
  – Develop parallel applications to harness the computing power of the cluster
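The high-throughput mode can be sketched in a few lines; on a real cluster a job scheduler such as OpenPBS or SGE does this dispatching, and the factorial "jobs" here are illustrative stand-ins for sequential applications:

```python
import math
from concurrent.futures import ThreadPoolExecutor

# High-throughput computing in miniature: many *independent* sequential
# jobs are queued and farmed out to a fixed pool of "nodes". No job
# communicates with another, so no parallel programming is needed.

JOBS = [2000 + 10 * k for k in range(32)]   # 32 independent job inputs
NODES = 4                                    # pretend we have 4 compute nodes

def sequential_job(n):
    # each job is an ordinary serial computation
    return len(str(math.factorial(n)))

with ThreadPoolExecutor(max_workers=NODES) as pool:
    results = list(pool.map(sequential_job, JOBS))

print(f"{len(results)} jobs completed")
```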
-
What is needed?
• Interconnection Network
• Nodes
• Software
-
Cluster System Structure
Compute Node
Interconnection Network
Frontend
-
Components
• Node
  – CPU
  – Memory
  – Motherboard
  – Hard disk
  – Display card
  – Chassis
• Network
• Other accessories
  – KVM switch
  – UPS
-
CPU
• Various architectures – x86 or x86-64
• x86/IA32 – Pentium 4, Xeon, Xeon MP, Athlon XP
• x86-64 – Athlon 64, Athlon 64 FX
• IA64 – Itanium 2
• Affects the chassis, heat sink, and power supply
-
Heat Sink
• 2 types of CPU packaging – box and tray
• Box comes with a heat sink and fan
• Tray has only the CPU
• Box is recommended
-
CPU Performance
-
Memory
• Speed and bandwidth must be compatible with the CPU
  – Pentium 4 – DDR
  – Athlon XP – DDR
  – Opteron – dual-channel DDR
  – Athlon 64 – dual-channel DDR
  – Athlon 64 FX – dual-channel DDR
-
Memory Performance
-
Motherboard
• Must be compatible with the CPU
• Bus speed
• Compute nodes may use full-option motherboards, for example with LAN and VGA on board
  – An advantage for rack systems
-
Selected Motherboard Features
• Wake-on-LAN
• RAID
• PXE
• PCI slots
• Onboard VGA
• Onboard Gigabit network
-
Pre-boot Execution Environment (PXE)
• Network boot
• The network must provide DHCP and TFTP services
• After POST, the node uses DHCP to request an IP address and the boot server
• After the node gets its IP, it downloads the OS from the boot server via TFTP
• Advantage: installation via PXE is very convenient for a diskless cluster
-
PCI
• Peripheral Component Interconnect
• Internal data path
• Data rate depends on clock speed and bus width (number of bits)
• Currently 33 and 66 MHz, with 32- and 64-bit widths
• 2 voltage levels – 3.3 V and 5 V
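The quoted clock speeds and widths translate directly into peak transfer rates (bytes/s = clock × width ÷ 8); a quick check, with a helper function name of our own choosing:

```python
# Peak PCI transfer rate from bus clock and width:
# bytes/s = clock (Hz) * width (bits) / 8.

def pci_rate_mb(clock_mhz, width_bits):
    """Peak PCI transfer rate in MB/s."""
    return clock_mhz * 1e6 * width_bits / 8 / 1e6

print(pci_rate_mb(33, 32))   # classic 32-bit PCI: 132 MB/s
print(pci_rate_mb(66, 64))   # 64-bit / 66 MHz PCI: 528 MB/s
```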
-
PCI-X
• Next-generation PCI
• Compatible with PCI 3.3 V 66 MHz
• PCI-X version 2.0 operates at 266 to 1066 MHz. The data rate is about 2.1 to 8.5 GB/s
-
Hard Disk
• Stores system and user data
• Various interfaces
  – IDE or ATA
  – SATA
  – SCSI
-
IDE (Integrated Drive Electronics)
• Also called ATA (AT Attachment)
• Low cost
• Easy installation because every motherboard supports it
• 1 interface can connect up to 2 drives: master and slave
• Maximum speed 133 MB/s over an 80-wire connector under the Ultra ATA/133 standard
• Rotation speed: 7,200 RPM; seek time: 9 ms
• Size 10–200 GB (Seagate)
-
SATA – Serial ATA
• A bit more expensive than ATA
• Higher speed than ATA by 1–5%
• Transfer speed: 150 MB/s
• Rotation speed: 7,200 RPM; seek time about 9 ms
• Size 80–200 GB
• Needs a new kernel (2.6.2)
-
SCSI (Small Computer System Interface)
• Intelligent devices: can connect various device types, for example hard disks, scanners, CD-ROM/RW drives, and tape drives
• High cost, for high-end systems
• Only server motherboards support SCSI; low-cost motherboards need a SCSI adapter
• 1 interface can connect up to 15 devices
-
SCSI (cont’d)
• Rotation speed: 10,000–15,000 RPM; seek time is about 4.5 ms
• Various speeds. Currently available speeds are between 20 MB/s (UltraSCSI) and 320 MB/s (Ultra320, or Ultra4 SCSI)
• Capacity 18–147 GB (Seagate)
-
Hard Disk Performance
-
RAID
• Redundant Array of Independent Disks
• Increases the capacity, speed, and reliability of data
• Logically merges many hard disks into one big drive
• Both software and hardware implementations exist
• Normally, RAID is built over SCSI hard disks
• RAID has various levels; the popular ones are 0, 1, and 5
-
RAID Level
• Level 0 – Striping. Increases capacity and speed. None of the space is wasted as long as the hard drives used are identical
• Level 1 – Mirroring. Increases data reliability via redundancy
• Level 5 – Block-level striping with distributed parity. Increases capacity and data reliability
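The capacity trade-offs of these levels can be computed directly. A sketch, assuming identical disks and treating RAID 1 as a simple mirror of the whole set:

```python
# Usable capacity for n identical disks of `size` GB, following the
# RAID level descriptions above.

def usable_gb(level, n, size):
    if level == 0:          # striping: all space is usable
        return n * size
    if level == 1:          # mirroring: every block is stored on every disk
        return size
    if level == 5:          # one disk's worth of parity, distributed
        return (n - 1) * size
    raise ValueError("unsupported level")

# e.g. four 147 GB SCSI disks:
for lvl in (0, 1, 5):
    print(f"RAID {lvl}: {usable_gb(lvl, 4, 147)} GB usable")
```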
-
RAID Implementation
• Hardware – a RAID adapter, for example from Adaptec. Some motherboards have a built-in RAID controller
  – Linux supports many hardware RAID vendors, especially Adaptec
  – There is only one logical RAID drive at boot time
• Software – Linux can emulate RAID 0, 1, and 5
  – Done by creating partitions with the "RAID" type, then using a RAID utility to create the RAID drive. The RAID drives are mapped to /dev/md0, /dev/md1, etc.
-
RAID Performance
-
Display
• Monitor
  – Resolution at least 1024 × 768
  – May be LCD or CRT
• Display adapter
  – Linux supports most display adapters
  – An on-board display adapter is recommended for ease of installation
-
Chassis
• 3 main types– Tower– Rack-mount– Blade
-
Tower Chassis
• The most popular
• 2 sizes – tower and mini-tower
• Advantages – low cost; ease of maintenance, installation, and air flow
• Disadvantages – large size and wiring problems in a large cluster
-
Rack-Mount
• Smaller – suitable for large clusters
• Advantage – designed for racks
• Disadvantage – the smaller the case, the more problematic the air flow
-
Blade
• A 1U rack case is small, but wiring is still a problem
• A blade server is a case that contains one or more hot-swappable devices called blades
• Each blade is a computer that shares some equipment, such as the power supply and network connection
-
System Interconnect
• The interconnect among nodes is the key to cluster performance
• Considerations
  – Cost
  – Bandwidth and latency
  – Scalability of the interconnect
• Technologies used
  – Fast Ethernet (switched)
  – SCI (Scalable Coherent Interface)
  – Gigabit Ethernet
  – Myrinet
  – Infiniband
-
Performance of MPICH-GM
• High Performance Linpack
[Figure: HPL runtime (sec) versus problem size (4000–8000), comparing Myrinet and Ethernet]
-
Interconnection Network
• At least 1 Gb/s
• Various vendors and standards
  – Gigabit Ethernet
  – Myrinet
  – Infiniband
-
Gigabit Ethernet
• IEEE 802.3
• 1 Gb/s
• Commodity hardware
• Connected by a switch
• IEEE 802.3z using fiber or STP
• IEEE 802.3ab using CAT 5 UTP
• A 64-bit PCI internal bus is recommended
-
Gigabit Ethernet Switch
• Both layer-2 and layer-3 switches are supported
• A cluster can use a layer-2 switch because the routing function is not required
• For future upgrades, a stackable switch is recommended
-
Myrinet
• Proprietary network
• Operates at 2+2 Gb/s (full duplex)
• High scalability
• An on-board communication processor reduces main CPU utilization
• Supports both PCI-X and 64-bit PCI
• Fiber cabling
-
Infiniband
• Infiniband is a technology for connecting remote storage and networks
• Creates an IPC cluster
• Operates at 2.5 Gb/s per link, over copper or fiber cable
-
Network Performance
• Data from NetPIPE 3.6
-
KVM Switch
• Keyboard, Video, Mouse
• Connects multiple nodes to a single keyboard, monitor, and mouse
• Very convenient
-
UPS
• Uninterruptible Power Supply
• Emergency power source
• Filters surges from the AC outlet, telephone line, and LAN
• Should be able to notify the CPU to shut down the system
• Load is specified in VA
• Capacity is determined by the batteries
-
Basic Clustering Software
• Linux (Red Hat, Mandrake, Fedora, etc.)
• Cluster distributions
  – ROCKS, SCE, OSCAR
• Management tools and environments
  – SCMS, Ganglia
• Programming
  – Compilers: PGI, Intel, gcc
  – Parallel programming: PVM, MPI
• Libraries
  – PGAPack, ScaLAPACK, PETSc, Linpack
• Load schedulers
  – OpenPBS, SGE, SQMS, LSF
• Packages
  – NWCHEM, GAMESS, Fluent, Oracle 10g
-
Cluster Distribution
• A custom-made Linux distribution that can be used to build a cluster
  – Pre-selected set of software
  – Automatic installation tool or "builder"
  – Easy to use, but the user must follow some guidelines
• Examples: Scyld, OSCAR, Rocks, and SCE
-
NPACI ROCKS
National Partnership for Advanced Computational Infrastructure
• A scalable, complete, and fully automated cluster deployment solution with rational out-of-the-box default settings
• Developed by
  – San Diego Supercomputer Center – Grid & Cluster Computing Group
  – UC Berkeley Millennium Project
  – SCS Linux Competency Centre
-
Rocks Philosophy
• NPACI ROCKS is an entire cluster-aware distribution
  – Full Red Hat release (tracks closely)
  – NPACI ROCKS packages
  – Installation, configuration & management tools
  – Integrated de-facto standard cluster packages
    • PBS/MAUI, MPICH, ATLAS, ScaLAPACK, HPL...
    • SGE, PVFS (added by SCS-LCC)
• An integrated, easy-to-manage cluster system!
-
Rocks Philosophy
• Integrated, easy-to-manage cluster system
• Excellent scaling to large numbers of nodes
• Single, consistent cluster management methodology
• Avoids version skew and maintains a consistent image across the cluster!
• Highly skilled and dedicated development teams at SDSC, UC Berkeley, and SCS-LCC
• SUPPORT!
-
Closed Cluster
• Every node is behind the main node, or frontend
• Pros
  – Suitable for parallel programs because external traffic does not interfere with communication within the cluster
  – Ease of security
• Cons
  – If the frontend fails, users cannot access any of the nodes
-
Summary
• Clusters are the new dominant platform for HPC
  – Powerful
  – Cost effective
• A combination of software and hardware is needed to build a cluster
• The application is the key to harnessing the power of the cluster
-
How to measure your performance
• Standard open-source benchmarking tools
  – HPL (High Performance Linpack)
  – Linpack
  – Stream – memory benchmark
  – Bonnie++ – I/O benchmark
  – IOZone – intensive I/O benchmark
  – Iperf – network performance snapshot
  – NetPIPE – intensive bandwidth test
-
HPL (High Performance Linpack) and Linpack
• HPL
  – A software package that solves a (random) dense linear system in double-precision (64-bit) arithmetic on distributed-memory computers
  – The standard benchmark tool for cluster computers, used at top500.org
  – Performance reported in Flops (floating-point operations per second)
  – Relies on MPI (Message-Passing Interface)
  – More info at http://www.netlib.org/benchmark/hpl/
• Linpack
  – An older implementation, used to determine single-node performance
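HPL derives its Flops figure from the problem size N and the measured wall-clock time, using the standard LU-factorization operation count of 2/3·N³ + 2·N². A small sketch (function name and example timings are ours):

```python
# Convert an HPL run (problem size N, elapsed seconds) into Gflops
# using the standard dense-LU operation count.

def hpl_gflops(n, seconds):
    ops = (2.0 / 3.0) * n**3 + 2.0 * n**2
    return ops / seconds / 1e9

# e.g. a hypothetical run solving an N=10000 system in 70 s:
print(f"{hpl_gflops(10000, 70.0):.1f} Gflops")
```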
-
Stream
• Sustainable Memory Bandwidth in High Performance Computers
• Used to measure memory bandwidth using various access patterns
• Performance reported in MBytes/sec
• More info at http://www.cs.virginia.edu/stream/
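For intuition, a crude STREAM-style "copy" measurement can be done even in pure Python; real STREAM is compiled C/Fortran and also measures the scale, add, and triad kernels:

```python
import array
import time

# Time a C-level slice copy between two large double arrays and report
# MB/s, counting both the bytes read and the bytes written, as STREAM does.

N = 2_000_000                       # 2M doubles = 16 MB per array
a = array.array("d", bytes(8 * N))  # zero-filled source array
b = array.array("d", bytes(8 * N))  # destination array

start = time.perf_counter()
b[:] = a                            # copy kernel: one read + one write per element
elapsed = time.perf_counter() - start

mb_moved = 2 * 8 * N / 1e6          # bytes read + bytes written, in MB
print(f"copy bandwidth: {mb_moved / elapsed:.0f} MB/s")
```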
-
Bonnie++ and IOZone
• Bonnie++
  – A benchmark suite aimed at performing a number of simple tests of hard drive and file system performance
  – Tests database-type access to a single file
  – Performance reported as I/O rates and KBytes/sec
  – More info at http://www.coker.com.au/bonnie++/
• IOZone
  – A file system benchmark tool; it generates and measures a variety of file operations
  – Performs intensive I/O benchmarks over various file and record sizes
  – More info at http://www.iozone.org/
-
Iperf and Netpipe
• Iperf
  – A tool to measure maximum TCP bandwidth, allowing tuning of various parameters and UDP characteristics
  – Reports data in Kbps or Mbps
  – More info at http://dast.nlanr.net/Projects/Iperf/
• NetPIPE
  – A protocol-independent performance tool that visually represents network performance under a variety of conditions
  – More info at http://www.scl.ameslab.gov/netpipe/
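For intuition about what Iperf measures, an iperf-style TCP throughput test can be sketched with plain Python sockets. This runs over loopback only, so it exercises the kernel's TCP stack rather than a real cluster interconnect; the payload and transfer sizes are arbitrary choices:

```python
import socket
import threading
import time

# A toy, iperf-style throughput test: a server thread drains a TCP
# connection while the client pushes a fixed amount of data, and we
# report the achieved rate in Mbit/s.

PAYLOAD = b"x" * 65536          # 64 KB send buffer
TOTAL_BYTES = 64 * 1024 * 1024  # send 64 MB in total

def server(sock):
    conn, _ = sock.accept()
    with conn:
        while conn.recv(65536):  # drain until the sender closes
            pass

listener = socket.socket()
listener.bind(("127.0.0.1", 0))  # let the OS pick a free port
listener.listen(1)
t = threading.Thread(target=server, args=(listener,))
t.start()

start = time.perf_counter()
with socket.create_connection(listener.getsockname()) as c:
    sent = 0
    while sent < TOTAL_BYTES:
        c.sendall(PAYLOAD)
        sent += len(PAYLOAD)
elapsed = time.perf_counter() - start
t.join()
listener.close()

mbps = sent * 8 / elapsed / 1e6
print(f"throughput: {mbps:.0f} Mbit/s")
```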
-
Thank you