Introduction to Cluster Computing
Why do we need High Performance Computers?
The change in the scientific discovery process
[Diagram: Hypothesis, Experiment, Simulation and Modeling]
Compute-Intensive Applications
• Simulation and modeling problems
  – Based on successive approximations; more calculations, better results
  – Optimization problems
• Problems that depend on computations and manipulation of large amounts of data
• Examples
  – Weather prediction
  – Image and signal processing, graphics
  – Databases and data mining
CFD for Clean room
• Analyzing the behaviour of air flow in a clean room for the electronics industry
• Collaboration project
  – Suranaree University of Technology
  – Kasetsart University
  – Funded by NECTEC
CFD Software
• CAMETA version 3.0 (SUT)
  – Time-independent (steady-state) solution
  – Three-dimensional domain
  – Cartesian coordinate system
• Physical quantities of interest:
  – Temperature distribution
  – Relative humidity distribution
  – Particle concentration distribution
Molecular Dynamic Simulation
• Drug discovery using molecular docking
  – Avian flu
  – HIV
• Analyzing properties of chemical compounds
SWAP Model Parameter Identification – Data Assimilation using RS and GA (KU/AIT)
[Diagram: data assimilation loop. RS observations of LAI and evapotranspiration (plotted against day of year, 0 to 360) are fitted to SWAP crop growth model output; a GA searches for the optimized SWAP input parameters (sowing date, soil property, water management, etc.)]
Challenges
• Identifying the SWAP parameters for just 1 pixel (1 sq.km) takes from several minutes up to 30 minutes.
• Thus, an RS image covering 1000 x 1000 sq.km (1000 x 1000 pixels) would take more than 50 years (30 min x 1000 x 1000), which is not acceptable.
• Solution
  – Parallel computing
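The per-pixel parameter searches are independent of one another, which makes this workload easy to parallelize. A minimal sketch with a process pool; `fit_pixel` here is a hypothetical stand-in for the real SWAP/GA optimization, not the actual model code:

```python
from concurrent.futures import ProcessPoolExecutor

def fit_pixel(coord):
    """Hypothetical stand-in for the per-pixel SWAP parameter search."""
    x, y = coord
    # In the real workload this would run the GA against the RS observations.
    return (coord, (x + y) % 7)  # dummy "fitted parameter"

def fit_image(width, height, workers=4):
    coords = [(x, y) for x in range(width) for y in range(height)]
    # Each pixel is an independent task: ideal for a pool of worker processes.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(fit_pixel, coords))

if __name__ == "__main__":
    results = fit_image(10, 10)
    print(len(results))  # → 100
```

With enough workers, the 50-year sequential estimate above divides roughly by the number of processors, since the pixels share no state.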
• Longitude: 100.008133
• Latitude: 14.388195

22/08/49 TAM2005
Graphics Rendering and Special Effect
• Rendering
  – Generating an image from a 3D model
• Problems
  – Rendering is a time-consuming process, especially for complex, realistic scenes
  – A massive number of rendering jobs must be done to create a movie
Top500 Fastest Installed Computers
• www.top500.org
• Lists the top 500 supercomputer sites worldwide
• Provides a snapshot of the supercomputers installed around the world
• Began in 1993; published every 6 months
• Measures performance with the TPP Linpack benchmark
• Provides a way to measure trends
Parallel Processing
• Solving a large problem by breaking it into a number of small problems, then solving them on multiple processors at the same time
• Real-life examples
  – Building a house using many workers working on separate parts
  – An assembly line in a factory
[Diagram: a large-scale problem is decomposed into subtasks, each executed on a separate processor]
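The decomposition in the diagram can be sketched with a sum split across workers. This is an illustrative sketch only; in CPython, real speedup for CPU-bound work needs processes rather than threads:

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    return sum(chunk)

def parallel_sum(data, n_workers=4):
    # Split the large problem into subtasks...
    size = (len(data) + n_workers - 1) // n_workers
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # ...solve them at the same time on the workers...
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = pool.map(partial_sum, chunks)
    # ...then combine the partial results.
    return sum(partials)

print(parallel_sum(list(range(1000))))  # → 499500
```

The split/solve/combine shape is the same whether the workers are threads, processes, or cluster nodes exchanging messages.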
Parallel Computer
• A parallel computer is a special computer designed for parallel processing
  – Multiple processors
  – A high-speed interconnection network that links these processors together
  – Hardware and software that help coordinate the computing tasks
• Sometimes this type of computer is called an MPP (Massively Parallel Processor) computer
Compare with SMP
Introduction
• Cluster computing is a technology for building high-performance, scalable computing systems from a collection of small computing systems and a high-speed interconnection network
Applications
• Scientific computing
  – CAD/CAM
  – Bioinformatics
  – Large-scale financial analysis
  – Simulation
  – Drug design
  – Automobile design (crash simulation)
• IT infrastructure
  – Scalable web servers, search engines (Google uses more than 10,000 server nodes)
• Entertainment
  – Rendering
  – Online gaming
Why? Price Performance!
Beowulf Clustering
• Clustering technology originated from the Beowulf Project at NASA
  – Project started by Thomas Sterling and Donald Becker
• Characteristics of a Beowulf cluster
  – Uses PC-based hardware and commodity components to gain high performance at low cost
  – Based on the Linux OS and open-source software
  – Very popular in the high-performance scientific computing world; moving quickly toward enterprise use
Beowulfs!
Beowulf Software Architecture
[Diagram: HTC, HPC, HPTC]
How to use a cluster system for your work
• High Throughput Computing
  – Use the cluster as a collection of processors
  – Run sequential applications
  – Use a job scheduler to control execution
• High Performance Computing
  – Use the cluster as a parallel machine
  – Develop parallel applications to harness the computing power of the cluster
What is needed?
• Nodes
• Interconnection network
• Software
Cluster System Structure
Compute Node
InterconnectionNetwork
Frontend
Component
• Node
  – CPU
  – Memory
  – Motherboard
  – Hard disk
  – Display card
  – Chassis
• Network
• Other accessories
  – KVM switch
  – UPS
CPU
• Various architectures – x86 or x86-64
• x86/IA32 – Pentium 4, Xeon, Xeon MP, Athlon XP
• x86-64 – Athlon 64, Athlon 64 FX
• IA64 – Itanium 2
• The choice affects the chassis, heat sink, and power supply
Heat Sink
• 2 types of CPU packaging – box and tray
• Box comes with a heat sink and fan
• Tray has only the CPU
• Box is recommended
CPU Performance
Memory
• Speed and bandwidth must be compatible with the CPU
  – Pentium 4 – DDR
  – Athlon XP – DDR
  – Opteron – Dual-channel DDR
  – Athlon 64 – Dual-channel DDR
  – Athlon 64 FX – Dual-channel DDR
Memory Performance
Motherboard
• Must be compatible with the CPU
• Bus speed
• Compute nodes may use a full-option motherboard, for example with LAN and VGA on board
  – An advantage for rack systems
Selected Motherboard Features
• Wake-on-LAN• RAID• PXE• PCI Slot• Onboard VGA• Onboard Gigabit Network
Pre-boot Execution Environment
• Network boot
• The network must provide DHCP and TFTP services
• After POST, the node uses DHCP to request an IP address and the boot server
• After the node gets its IP, it downloads the OS image from the boot server via TFTP
• Advantage: installation via PXE is very convenient, and it enables diskless clusters
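The DHCP side of the steps above can be sketched in a dhcpd configuration. The addresses and subnet here are hypothetical examples; `next-server` points the node at the TFTP boot server and `filename` names the boot loader image it should fetch:

```
# /etc/dhcpd.conf (sketch) -- hand out IPs and point nodes at the TFTP server
subnet 192.168.1.0 netmask 255.255.255.0 {
  range 192.168.1.10 192.168.1.250;
  next-server 192.168.1.1;      # TFTP boot server
  filename "pxelinux.0";        # PXE boot loader, served via TFTP
}
```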
PCI
• Peripheral Component Interconnect
• Internal data path
• Data rate depends on clock speed and bus width
• Currently 33 and 66 MHz, with 32- and 64-bit widths
• 2 voltage levels – 3.3 and 5 V
PCI-X
• Next-generation PCI
• Compatible with PCI 3.3 V 66 MHz
• PCI-X version 2.0 operates at 266 to 1066 MHz; the data rate is about 2.1 to 8.5 GB/s
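The data rates above follow directly from bus width times clock. A quick check (these are peak rates; real throughput is lower because of protocol overhead):

```python
def pci_peak_gbytes(width_bits, clock_mhz):
    """Peak PCI/PCI-X bandwidth in GB/s: bytes per transfer x transfers per second."""
    return width_bits / 8 * clock_mhz * 1e6 / 1e9

print(round(pci_peak_gbytes(32, 33), 2))    # classic PCI: 0.13 GB/s
print(round(pci_peak_gbytes(64, 66), 2))    # PCI 64/66:   0.53 GB/s
print(round(pci_peak_gbytes(64, 266), 2))   # PCI-X 266:   2.13 GB/s
print(round(pci_peak_gbytes(64, 1066), 2))  # PCI-X 1066:  8.53 GB/s
```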
Harddisk
• Stores system and user data
• Various interfaces
  – IDE or ATA
  – SATA
  – SCSI
IDE (Integrated Drive Electronics)
• Also called ATA (AT Attachment)
• Low cost
• Easy to install because virtually every motherboard supports it
• 1 interface can connect up to 2 drives: master and slave
• Maximum speed 133 MB/s with an 80-wire connector under the Ultra ATA/133 standard
• Rotation speed: 7,200 RPM; seek time: 9 ms
• Size 10 – 200 GB (Seagate)
SATA – Serial ATA
• A bit more expensive than ATA
• Higher speed than ATA by 1 – 5%
• Transfer speed: 150 MB/s
• Rotation speed: 7,200 RPM; seek time about 9 ms
• Size 80 – 200 GB
• Needs a new kernel (2.6.2)
SCSI (Small Computer System Interface)
• Intelligent interface: can connect various devices, for example hard disks, scanners, CD-ROM/RW drives, and tape drives
• High cost, for high-end systems
• Only server motherboards support SCSI; a low-cost motherboard needs a SCSI adapter
• 1 interface can connect up to 15 devices
SCSI (cont’d)
• Rotation speed: 10,000 – 15,000 RPM; seek time is about 4.5 ms
• Various speeds: currently available speeds range from 20 MB/s (UltraSCSI) to 320 MB/s (Ultra320, also called Ultra4 SCSI)
• Capacity 18 – 147 GB (Seagate)
Harddisk Performance
RAID
• Redundant Array of Independent Disks
• Increases capacity, speed, and reliability of data
• Logically merges many hard disks into one big one
• Both software and hardware implementations exist
• Normally, RAID is built over SCSI hard disks
• RAID has various levels; the popular ones are 0, 1, and 5
RAID Level
• Level 0 – Striping. Increases capacity and speed. None of the space is wasted as long as the hard drives used are identical
• Level 1 – Mirroring. Increases data reliability via redundancy
• Level 5 – Block-level striping with distributed parity. Increases capacity and data reliability
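The parity idea behind level 5 can be sketched with XOR: the parity block is the XOR of the data blocks, and any single lost block is recovered by XOR-ing the survivors. This is a simplified model; real controllers also rotate the parity block across the disks:

```python
def xor_blocks(blocks):
    """XOR equal-length byte strings together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

data = [b"AAAA", b"BBBB", b"CCCC"]   # data blocks on three disks
parity = xor_blocks(data)            # parity block on a fourth disk

# Disk 1 fails: rebuild its block from the parity and the surviving disks.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]
```

Level 0 skips the parity entirely (speed, no redundancy); level 1 writes each block to two disks instead of computing parity.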
RAID Implementation
• Hardware
  – RAID adapter, for example from Adaptec. Some motherboards have a built-in RAID controller
  – Linux supports many hardware RAID vendors, especially Adaptec
  – There is only one logical RAID drive at boot time
• Software
  – Linux can emulate RAID 0, 1, and 5
  – Done by creating partitions with the "RAID" type, then using a RAID utility to create the RAID drive. The RAID drives are mapped to /dev/md0, /dev/md1, etc.
RAID Performance
Display
• Monitor
  – Resolution at least 1024 × 768
  – May be LCD or CRT
• Display adapter
  – Linux supports most display adapters
  – An on-board display adapter is recommended for ease of installation
Chassis
• 3 main types
  – Tower
  – Rack-mount
  – Blade
Tower Chassis
• The most popular
• 2 sizes – tower and mini-tower
• Advantages – low cost; ease of maintenance, installation, and air flow
• Disadvantages – large size and wiring problems in a big cluster
Rack-Mount
• Smaller – suitable for large clusters
• Advantage – designed for racks
• Disadvantage – the smaller the chassis, the more problems with air flow
Blade
• A 1U rack case is small, but wiring is still a problem
• A blade server is a case that contains one or more hot-swappable devices called blades
• Each blade is a computer; the blades share some equipment, such as power supplies and network connections
System Interconnect
• Interconnection among nodes is the key to cluster performance
• Considerations
  – Cost
  – Bandwidth and latency
  – Scalability of the interconnect
• Technologies used
  – Fast Ethernet (switched)
  – SCI (Scalable Coherent Interface)
  – Gigabit Ethernet
  – Myrinet
  – Infiniband
Performance of MPICH-GM
• High Performance Linpack
[Chart: HPL runtime (sec) vs. problem size (4000 to 8000) for Myrinet and Ethernet]
Interconnection Network
• At least 1 Gb/s
• Various vendors and standards
  – Gigabit Ethernet
  – Myrinet
  – Infiniband
Gigabit Ethernet
• IEEE 802.3
• 1 Gb/s
• Commodity hardware
• Connected by switches
• IEEE 802.3z using fiber or STP
• IEEE 802.3ab using CAT 5 UTP
• A 64-bit PCI internal bus is recommended
Gigabit Ethernet Switch
• Both layer-2 and layer-3 switches are supported
• A cluster can use a layer-2 switch because the routing function is not required
• For future upgrades, a stackable switch is recommended
Myrinet
• Proprietary network
• Operates at 2+2 Gb/s
• High scalability
• An on-board communication processor reduces main CPU utilization
• Supports both PCI-X and PCI 64
• Fiber cabling
Infiniband
• Infiniband is a technology for connecting remote storage and networks
• Creates IPC clusters
• Operates at 2.5 Gb/s with copper or fiber cable
Network Performance
• Data from NetPIPE 3.6
KVM Switch
• Keyboard, Video, Mouse
• Connects multiple nodes to a single keyboard, monitor, and mouse
• Very convenient
UPS
• Uninterruptible Power Supply
• Emergency power source
• Filters surges from the AC outlet, telephone line, and LAN
• Should be able to notify the system to shut down
• Load is specified in VA units
• Capacity is determined by the batteries
Basic Clustering Software
• Linux (Red Hat, Mandrake, Fedora, etc.)
• Cluster distributions
  – ROCKS, SCE, OSCAR
• Management tools and environments
  – SCMS, Ganglia
• Programming
  – Compilers: PGI, Intel, gcc
  – Parallel programming: PVM, MPI
• Libraries
  – PGAPack, ScaLAPACK, PETSc, Linpack
• Load schedulers
  – OpenPBS, SGE, SQMS, LSF
• Packages
  – NWCHEM, GAMESS, Fluent, Oracle 10g
Cluster Distribution
• A custom-made Linux distribution that can be used to build a cluster
  – Pre-selected set of software
  – Automatic installation tool or "builder"
  – Easy to use, but the user must follow some guidelines
• Examples: Scyld, OSCAR, Rocks, and SCE
NPACI ROCKS – National Partnership for Advanced Computational Infrastructure
• A scalable, complete, and fully automated cluster deployment solution with rational out-of-the-box default settings
• Developed by
  – San Diego Supercomputer Center – Grid & Cluster Computing Group
  – UC Berkeley Millennium Project
  – SCS Linux Competency Centre
Rocks Philosophy
• NPACI ROCKS is an entire cluster-aware distribution
  – Full Red Hat release (tracked closely)
  – NPACI ROCKS packages
  – Installation, configuration & management tools
  – Integrated de-facto standard cluster packages
    • PBS/MAUI, MPICH, ATLAS, ScaLAPACK, HPL...
    • SGE, PVFS (added by SCS-LCC)
• An integrated, easy-to-manage cluster system!
Rocks Philosophy
• Integrated, easy-to-manage cluster system
• Excellent scaling to large numbers of nodes
• A single, consistent cluster management methodology
• Avoids version skew and maintains a consistent image across the cluster!
• Highly skilled and dedicated development teams at SDSC, UC Berkeley, and SCS-LCC
• SUPPORT!
Closed Cluster
• Every node is behind the main node, or frontend
• Pros
  – Suitable for parallel programs because external traffic does not interfere with communication within the cluster
  – Ease of security
• Cons
  – If the frontend fails, users cannot access any of the nodes
Summary
• The cluster is the new dominant platform for HPC
  – Powerful
  – Cost-effective
• A combination of software and hardware is needed to build a cluster
• The application is the key to harnessing the power of the cluster
How to measure your performance
• Standard open-source benchmarking tools
  – HPL (High Performance Linpack)
  – Linpack
  – Stream – memory benchmark
  – Bonnie++ – I/O benchmark
  – IOZone – intensive I/O benchmark
  – Iperf – network performance snapshot
  – NetPipe – intensive bandwidth test
HPL (High Performance Linpack) and Linpack
• HPL
  – A software package that solves a (random) dense linear system in double-precision (64-bit) arithmetic on distributed-memory computers
  – The standard benchmark tool for cluster computers, used at top500.org
  – Performance reported in Flops (floating-point operations per second)
  – Relies on MPI (Message-Passing Interface)
  – More info at http://www.netlib.org/benchmark/hpl/
• Linpack
  – An older implementation of the Linpack benchmark, used to determine single-node performance
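The Flops figure these benchmarks report is simply a known operation count divided by measured time. A toy single-node sketch using plain-Python Gaussian elimination (about 2n³/3 floating-point operations for the factorization); HPL itself distributes this work across nodes with MPI and is far more sophisticated:

```python
import random, time

def solve(a, b):
    """Gaussian elimination with partial pivoting; ~2n^3/3 flops."""
    n = len(b)
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(a[r][k]))  # pivot row
        a[k], a[p] = a[p], a[k]
        b[k], b[p] = b[p], b[k]
        for i in range(k + 1, n):
            f = a[i][k] / a[k][k]
            for j in range(k, n):
                a[i][j] -= f * a[k][j]
            b[i] -= f * b[k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):          # back substitution
        s = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / a[i][i]
    return x

n = 100
a = [[random.random() for _ in range(n)] for _ in range(n)]
b = [sum(row) for row in a]                 # exact solution is all ones
t0 = time.perf_counter()
x = solve([row[:] for row in a], b[:])
elapsed = time.perf_counter() - t0
assert max(abs(v - 1.0) for v in x) < 1e-6  # check the answer, as HPL does
print("MFLOPS ~", round(2 * n**3 / 3 / elapsed / 1e6, 1))
```

Interpreted Python is orders of magnitude slower than the tuned BLAS that HPL links against, but the accounting is the same.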
Stream
• Sustainable memory bandwidth in high performance computers
• Used to measure memory bandwidth using various methods
• Performance reported in MB/s
• More info at http://www.cs.virginia.edu/stream/
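The idea can be sketched with a simple copy kernel: move a large buffer, count the bytes read plus written, and divide by time. Python's interpreter overhead means this understates what STREAM's compiled kernels would report on the same machine:

```python
import time

def copy_bandwidth_mb_s(n_bytes=50_000_000):
    src = bytearray(n_bytes)
    t0 = time.perf_counter()
    dst = bytes(src)                  # STREAM-style Copy kernel: dst[i] = src[i]
    elapsed = time.perf_counter() - t0
    assert dst == src
    # The copy touches each byte twice: one read plus one write.
    return 2 * n_bytes / elapsed / 1e6

print(round(copy_bandwidth_mb_s(), 1), "MB/s")
```

STREAM's other kernels (scale, add, triad) follow the same pattern with a little arithmetic mixed into the data movement.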
Bonnie++ and IOZone
• Bonnie++
  – A benchmark suite aimed at performing a number of simple tests of hard drive and file system performance
  – Tests database-type access to a single file
  – Performance reported as I/O rates in KB/s
  – More info at http://www.coker.com.au/bonnie++/
• IOZone
  – A file system benchmark tool; the benchmark generates and measures a variety of file operations
  – Performs intensive I/O benchmarks over various file sizes and record sizes
  – More info at http://www.iozone.org/
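A toy version of what these tools measure: write a file, force it to disk, read it back, and report MB/s. Real tools also vary record sizes, mix access patterns, and work around OS caching, which this sketch does not:

```python
import os, tempfile, time

def file_io_mb_s(n_bytes=10_000_000):
    data = os.urandom(n_bytes)
    with tempfile.NamedTemporaryFile(delete=False) as f:
        path = f.name
        t0 = time.perf_counter()
        f.write(data)
        f.flush()
        os.fsync(f.fileno())          # force the bytes to the disk, not just the cache
        write_s = time.perf_counter() - t0
    t0 = time.perf_counter()
    with open(path, "rb") as f:
        back = f.read()
    read_s = time.perf_counter() - t0
    os.unlink(path)
    assert back == data               # sanity check the round trip
    return n_bytes / write_s / 1e6, n_bytes / read_s / 1e6

w, r = file_io_mb_s()
print(f"write {w:.1f} MB/s, read {r:.1f} MB/s")
```

The read figure is usually much higher than the write figure here because the file is still in the page cache.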
Iperf and Netpipe
• Iperf
  – A tool to measure maximum TCP bandwidth, allowing the tuning of various parameters and UDP characteristics
  – Reports data in Kbps or Mbps
  – More info at http://dast.nlanr.net/Projects/Iperf/
• Netpipe
  – A protocol-independent performance tool that visually represents network performance under a variety of conditions
  – More info at http://www.scl.ameslab.gov/netpipe/
Thank you