Introduction to Cluster Computing

Transcript of 01-Intro to Cluster -...

  • Introduction to Cluster Computing

  • Why do we need High Performance Computers?

    [Diagram] The change in the scientific discovery process: from Hypothesis → Experiment to Hypothesis → Simulation and Modeling → Experiment

  • Compute-Intensive Applications

    • Simulation and modeling problems
      – Based on successive approximations; more calculations give better results
      – Optimization problems

    • Problems that depend on computations and manipulations of large amounts of data

    • Examples
      – Weather prediction
      – Image and signal processing, graphics
      – Database and data mining

  • CFD for Clean room

    • Analyzing the behaviour of air flow in a clean room for the electronics industry

    • Collaboration project
      – Suranaree University of Technology
      – Kasetsart University
      – Funded by NECTEC

  • CFD Software

    • CAMETA version 3.0 (SUT)
      – Time-independent (steady-state) solution
      – Three-dimensional domain
      – Cartesian coordinate system

    • Physical quantities of interest:
      – Temperature distribution
      – Relative humidity distribution
      – Particle concentration distribution

  • Molecular Dynamics Simulation

    • Drug discovery using molecular docking
      – Avian Flu
      – HIV

    • Analyzing properties of chemical compounds

  • SWAP Model Parameter Identification – Data Assimilation using RS and GA (KU/AIT)

    [Diagram] The RS observations and the SWAP crop growth model both produce LAI and evapotranspiration curves (plotted against day of year, 0–360). The SWAP input parameters (sowing date, soil property, water management, etc.) are optimized by a GA so that the model output fits the RS observations – assimilation by finding optimized parameters.

  • Challenges

    • The calculation time to identify the SWAP parameters for just 1 pixel (1 sq. km) takes from several minutes up to 30 minutes.

    • Thus an RS image of 1000 x 1000 sq. km, i.e. 1000 x 1000 pixels, would take more than 50 years (30 min x 1000 x 1000 ≈ 57 years), which is not acceptable.

    • Solution – Parallel computing

    • Longitude: 100.008133, Latitude: 14.388195

  • Graphics Rendering and Special Effects

    • Rendering – Generating a 3D image from a model

    • Problem
      – Rendering is a time-consuming process, especially for complex and realistic scenes
      – A massive number of rendering jobs need to be done to create a movie

  • Top500 Fastest Installed Computers

    • www.top500.org
    • Lists the top 500 supercomputers at sites worldwide.
    • Provides a snapshot of the supercomputers installed around the world.
    • Began in 1993; published every 6 months.
    • Measures performance with the TPP Linpack benchmark.
    • Provides a way to measure trends.

  • Parallel Processing

    • Solving a large problem by breaking it into a number of small problems, then solving them on multiple processors at the same time (see the MPI sketch after the diagram below)

    • Real-life examples
      – Building a house using many workers, each working on a separate part
      – An assembly line in a factory

  • Parallel Processing

    [Diagram] A LARGE SCALE PROBLEM is divided into SUB TASKs, and each sub task is assigned to a PROCESSOR.
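
    A minimal sketch of this idea in C with MPI, assuming an MPI implementation such as MPICH is installed. The workload (approximating pi by numerical integration) is our illustrative choice, not from the slides: each process computes a partial sum over its own share of the subintervals, and MPI_Reduce combines the partial results on one rank.

    /* pi_mpi.c - one large sum split across ranks.
     * Compile: mpicc pi_mpi.c -o pi_mpi ; run: mpirun -np 4 ./pi_mpi
     * Sketch only: approximates pi by integrating 4/(1+x^2) on [0,1]. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        const long n = 100000000L;     /* total subintervals: the "large problem" */
        double h, local = 0.0, pi = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        h = 1.0 / (double)n;
        /* each rank handles every size-th subinterval: its own "sub task" */
        for (long i = rank; i < n; i += size) {
            double x = h * ((double)i + 0.5);
            local += 4.0 / (1.0 + x * x);
        }
        local *= h;

        /* combine the partial results on rank 0 */
        MPI_Reduce(&local, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("pi ~= %.12f with %d processes\n", pi, size);

        MPI_Finalize();
        return 0;
    }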

  • Parallel Computer

    • A parallel computer is a special computer that is designed for parallel processing
      – Multiple processors
      – A high-speed interconnection network that links these processors together
      – Hardware and software that help coordinate the computing tasks

    • Sometimes this type of computer is called an MPP (Massively Parallel Processor) computer

  • Compare with SMP

  • Introduction

    • Cluster computing is a technology for building high-performance, scalable computing systems from a collection of small computing systems and a high-speed interconnection network

  • Applications

    • Scientific computing
      – CAD/CAM
      – Bioinformatics
      – Large-scale financial analysis
      – Simulation
      – Drug design
      – Automobile design (crash simulation)

    • IT infrastructure
      – Scalable web servers, search engines (Google uses more than 10,000 server nodes)

    • Entertainment
      – Rendering
      – On-line gaming

  • Why? Price Performance!

  • Beowulf Clustering

    • Clustering technology originated from the Beowulf Project at NASA
      – Project started by Thomas Sterling and Donald Becker

    • Characteristics of a Beowulf cluster
      – Uses PC-based hardware and commodity components to gain high performance at low cost
      – Based on the Linux OS and open-source software
      – Very popular in the high-performance scientific computing world; moving quickly toward enterprise use

  • Beowulfs!

  • Beowulf Software Architecture

    [Diagram labels: HTC, HPC, HPTC]

  • How to use a cluster system for your work

    • High Throughput Computing
      – Use the cluster as a collection of processors
      – Run sequential applications
      – Use a job scheduler to control the execution

    • High Performance Computing
      – Use the cluster as a parallel machine
      – Develop parallel applications to harness the computing power of the cluster (a task-farm sketch follows below)
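
    To make the throughput pattern concrete, here is a hypothetical master/worker task farm in C with MPI (our own illustration; in practice the slides' job schedulers play the master's role for HTC): rank 0 hands out independent task ids and collects results, while each worker runs its tasks sequentially.

    /* taskfarm.c - illustrative master/worker task farm.
     * Compile: mpicc taskfarm.c -o taskfarm -lm */
    #include <stdio.h>
    #include <math.h>
    #include <mpi.h>

    #define NTASKS   100
    #define TAG_WORK 1
    #define TAG_STOP 2

    /* stand-in for one independent sequential job (hypothetical workload) */
    static double run_task(int id) { return sqrt((double)id); }

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {                   /* master: hand out task ids */
            int next = 0, active = 0, w;
            double sum = 0.0, result;
            MPI_Status st;
            for (w = 1; w < size && next < NTASKS; w++) {
                MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
                next++; active++;
            }
            for (; w < size; w++)          /* more workers than tasks */
                MPI_Send(&next, 1, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);
            while (active > 0) {
                MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                         MPI_COMM_WORLD, &st);
                sum += result;
                if (next < NTASKS) {       /* keep the worker busy */
                    MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                             MPI_COMM_WORLD);
                    next++;
                } else {                   /* no work left: release it */
                    MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                             MPI_COMM_WORLD);
                    active--;
                }
            }
            printf("all %d tasks done, sum = %f\n", NTASKS, sum);
        } else {                           /* worker: run jobs until told to stop */
            int id;
            MPI_Status st;
            for (;;) {
                MPI_Recv(&id, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
                if (st.MPI_TAG == TAG_STOP) break;
                double r = run_task(id);
                MPI_Send(&r, 1, MPI_DOUBLE, 0, TAG_WORK, MPI_COMM_WORLD);
            }
        }
        MPI_Finalize();
        return 0;
    }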

  • What is needed?

    [Diagram] Nodes + Interconnection Network + Software

  • Cluster System Structure

    [Diagram] Compute nodes connected by an interconnection network, with a frontend node in front

  • Component

    • Node
      – CPU
      – Memory
      – Motherboard
      – Hard disk
      – Display card
      – Chassis

    • Network

    • Other accessories
      – KVM switch
      – UPS

  • CPU

    • Various architectures – x86 or x86-64
    • x86/IA32 – Pentium 4, Xeon, Xeon MP, Athlon XP
    • x86-64 – Athlon 64, Athlon 64 FX
    • IA64 – Itanium 2
    • Affects the chassis, heat sink, and power supply

  • Heat Sink

    • 2 types of CPU packaging – Box and Tray
    • Box comes with a heat sink and fan
    • Tray has only the CPU
    • Box is recommended

  • CPU Performance

  • Memory

    • Speed and bandwidth must be compatible with the CPU
      – Pentium 4 – DDR
      – Athlon XP – DDR
      – Opteron – Dual-channel DDR
      – Athlon 64 – Dual-channel DDR
      – Athlon 64 FX – Dual-channel DDR

  • Memory Performance

  • Motherboard

    • Must be compatible with the CPU
    • Bus speed
    • Compute nodes may use a full-option motherboard, for example with LAN and VGA on board
      – An advantage for rack systems

  • Selected Motherboard Features

    • Wake-on-LAN
    • RAID
    • PXE
    • PCI slots
    • Onboard VGA
    • Onboard Gigabit network

  • Pre-boot Execution Environment

    • Network boot
    • The network must provide DHCP and TFTP services
    • After POST, the node uses DHCP to request an IP address and a boot server
    • After the node gets its IP, it downloads the OS from the boot server via TFTP
    • Advantage: installation via PXE is very convenient, also for a diskless cluster (a TFTP sketch follows below)
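
    As a small illustration of the TFTP step, here is a hedged C sketch that hand-builds the TFTP read request (RRQ) packet a booting node would send to UDP port 69, per RFC 1350. The file name pxelinux.0 is just a common example, not mandated by the slides, and real PXE firmware also negotiates TFTP options.

    /* tftp_rrq.c - build and print a TFTP read request (RFC 1350) as a
     * PXE client would send it to the boot server on UDP port 69.
     * Sketch only; send the buffer with sendto() in a real client. */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        const char *file = "pxelinux.0";   /* example boot file name */
        const char *mode = "octet";        /* binary transfer mode */
        unsigned char pkt[516];
        size_t n = 0;

        pkt[n++] = 0;                      /* opcode 1 = RRQ, 2 bytes, big-endian */
        pkt[n++] = 1;
        memcpy(pkt + n, file, strlen(file) + 1);  /* filename, NUL-terminated */
        n += strlen(file) + 1;
        memcpy(pkt + n, mode, strlen(mode) + 1);  /* mode, NUL-terminated */
        n += strlen(mode) + 1;

        printf("RRQ packet (%zu bytes):", n);
        for (size_t i = 0; i < n; i++)
            printf(" %02x", pkt[i]);
        printf("\n");
        return 0;
    }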

  • PCI

    • Peripheral Component Interconnect
    • Internal data path
    • The data rate depends on clock speed and number of bits; for example, a 64-bit bus at 66 MHz moves 8 bytes x 66 M ≈ 528 MB/s
    • Currently 33 and 66 MHz, 32 and 64 bits wide
    • 2 voltage levels – 3.3 and 5 V

  • PCI-X

    • Next-generation PCI
    • Compatible with PCI 3.3 V 66 MHz
    • PCI-X version 2.0 operates at 266 to 1066 MHz; the data rate is about 2.6 to 8.5 GB/s

  • Harddisk

    • Stores system and user data
    • Various interfaces
      – IDE or ATA
      – SATA
      – SCSI

  • IDE (Integrated Drive Electronics)

    • Also called ATA (AT Attachment)
    • Low cost
    • Easy to install because every motherboard supports it
    • 1 interface can connect up to 2 drives: master and slave
    • Maximum speed 133 MB/s with an 80-wire connector under the Ultra ATA/133 standard
    • Rotation speed: 7,200 RPM; seek time: 9 ms
    • Size 10 – 200 GB (Seagate)

  • SATA – Serial ATA

    • A bit more expensive than ATA
    • Higher speed than ATA by 1–5%
    • Transfer speed: 150 MB/s
    • Rotation speed: 7,200 RPM; seek time about 9 ms
    • Size 80 – 200 GB
    • Needs a new kernel (2.6.2)

  • SCSI (Small Computer System Interface)

    • Intelligent devices: can connect various devices, for example hard disks, scanners, CD-ROM/RW, and tape drives
    • High cost, for high-end systems
    • Only server motherboards support SCSI; a low-cost motherboard needs a SCSI adapter
    • 1 interface can connect up to 15 devices

  • SCSI (cont’d)

    • Rotation speed: 10,000 – 15,000 RPM; seek time is about 4.5 ms
    • Various speeds. Currently available speeds are between 20 MB/s (UltraSCSI) and 320 MB/s (Ultra320, or Ultra4 SCSI)
    • Capacity 18 – 147 GB (Seagate)

  • Harddisk Performance

  • RAID

    • Redundant Array of Independent Disks
    • Increases the capacity, speed, and reliability of data
    • Logically merges many hard disks into one big one
    • Both software and hardware implementations
    • Normally, RAID is built over SCSI hard disks
    • RAID has various levels; the popular ones are 0, 1, and 5

  • RAID Level

    • Level 0 – Striping. Increases capacity and speed. None of the space is wasted as long as the hard drives used are identical (see the striping sketch below)

    • Level 1 – Mirroring. Increases data reliability via redundancy

    • Level 5 – Block-level striping with distributed parity. Increases capacity and data reliability
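
    A small C sketch of how level 0 striping maps logical blocks onto the member disks (our own illustration with a hypothetical 4-disk layout, not code from any RAID driver): block i lands on disk i mod n at stripe row i / n, so capacity adds up and accesses spread across all disks.

    /* stripe.c - illustrative RAID 0 block mapping (round-robin striping) */
    #include <stdio.h>

    #define NDISKS 4   /* hypothetical array of 4 identical disks */

    int main(void)
    {
        /* logical block i -> disk (i % NDISKS), stripe row (i / NDISKS);
         * total capacity is the sum of the disks, with no redundancy */
        for (long block = 0; block < 8; block++)
            printf("logical block %ld -> disk %ld, stripe %ld\n",
                   block, block % NDISKS, block / NDISKS);
        return 0;
    }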

  • RAID Implementation

    • Hardware – a RAID adapter, for example from Adaptec. Some motherboards have a built-in RAID controller
      – Linux supports many hardware RAID vendors, especially Adaptec
      – There is only one logical RAID drive at boot time

    • Software – Linux can emulate RAID 0, 1, and 5
      – Done by creating partitions of the "RAID" type, then using a RAID utility to create the RAID drive. The RAID drives are mapped to /dev/md0, /dev/md1, etc.

  • RAID Performance

  • Display

    • Monitor
      – Resolution at least 1024 × 768
      – May be LCD or CRT

    • Display adapter
      – Linux supports most display adapters
      – An on-board display adapter is recommended for ease of installation

  • Chassis

    • 3 main types
      – Tower
      – Rack-mount
      – Blade

  • Tower Chassis

    • The most popular
    • 2 sizes – Tower and mini-Tower
    • Advantages – low cost; ease of maintenance, installation, and air flow
    • Disadvantages – big size and wiring problems in a large cluster

  • Rack-Mount

    • Smaller – suitable for large clusters

    • Advantage – designed for racks

    • Disadvantage – the smaller the chassis, the more problematic the air flow

  • Blade

    • A 1U rack chassis is small, but wiring is still a problem

    • A blade server is a case that contains one or more hot-swappable devices called blades

    • Each blade is a computer; the blades share some equipment such as the power supply and network connection

  • System Interconnect

    • The interconnection among nodes is the key to cluster performance

    • Considerations
      – Cost
      – Bandwidth and latency
      – Scalability of the interconnection

    • Technology used
      – Fast Ethernet (switched)
      – SCI (Scalable Coherent Interface)
      – Gigabit Ethernet
      – Myrinet
      – Infiniband

  • Performance of MPICH-GM

    • High Performance Linpack

    [Chart] HPL runtime (sec, 0–100) vs. problem size (4000–8000) for Myrinet and Ethernet

  • Interconnection Network

    • At least 1 Gb/s
    • Various vendors and standards
      – Gigabit Ethernet
      – Myrinet
      – Infiniband

  • Gigabit Ethernet

    • IEEE 802.3
    • 1 Gb/s
    • Commodity hardware
    • Connected by a switch
    • IEEE 802.3z using fiber or STP
    • IEEE 802.3ab using CAT 5 UTP
    • A PCI 64 internal bus is recommended

  • Gigabit Ethernet Switch

    • Both layer 2 and layer 3 switches are supported
    • A cluster can use a layer 2 switch because the routing function is not required
    • For future upgrades, a stackable switch is recommended

  • Myrinet

    • Proprietary network
    • Operates at 2+2 Gb/s
    • High scalability
    • An on-board communication processor reduces main CPU utilization
    • Supports both PCI-X and PCI 64
    • Fiber cabling

  • Infiniband

    • Infiniband is a technology for connecting remote storage and networks
    • Creates an IPC cluster
    • Operates at 2.5 Gb/s with copper or fiber cables

  • Network Performance

    • Data from NetPIPE 3.6

  • KVM Switch

    • Keyboard, Video, Mouse
    • Connects multiple nodes to a single keyboard, monitor, and mouse
    • Very convenient

  • UPS

    • Uninterruptible Power Supply
    • Emergency power source
    • Filters surge current from the AC outlet, telephone line, and LAN
    • Should be able to notify the CPU to shut down the system
    • Load is specified in VA units
    • Capacity is determined by the batteries

  • Basic Clustering Software

    • Linux (Red Hat, Mandrake, Fedora, etc.)
    • Cluster distributions
      – ROCKS, SCE, OSCAR
    • Management tools and environments
      – SCMS, Ganglia
    • Programming
      – Compilers: PGI, Intel, gcc
      – Parallel programming: PVM, MPI
    • Libraries
      – PGAPack, ScaLAPACK, PETSc, Linpack
    • Load schedulers
      – OpenPBS, SGE, SQMS, LSF
    • Packages
      – NWChem, GAMESS, Fluent, Oracle 10g

  • Cluster Distribution

    • A custom-made Linux distribution that can be used to build a cluster
      – Pre-selected set of software
      – Automatic installation tool or "builder"
      – Easy to use, but the user must follow some guidelines

    • Examples: Scyld, OSCAR, Rocks, and SCE

  • NPACI ROCKS (National Partnership for Advanced Computational Infrastructure)

    • A scalable, complete, and fully automated cluster deployment solution with out-of-the-box rational default settings.

    • Developed by
      – San Diego Supercomputer Center – Grid & Cluster Computing Group
      – UC Berkeley Millennium Project
      – SCS Linux Competency Centre

  • Rocks Philosophy

    • NPACI ROCKS is an entire cluster-aware distribution
      – Full Red Hat release (tracks closely)
      – NPACI ROCKS packages
      – Installation, configuration & management tools
      – Integrated de-facto standard cluster packages
        • PBS/MAUI, MPICH, ATLAS, ScaLAPack, HPL...
        • SGE, PVFS (added by SCS-LCC)

    An integrated, easy-to-manage cluster system!

  • Rocks Philosophy

    • Integrated, easy-to-manage cluster system.
    • Excellent scaling to a large number of nodes.
    • Single, consistent cluster management methodology.
    • Avoids version skew and maintains a consistent image across the cluster!
    • Highly skilled and dedicated development teams at SDSC, UC Berkeley, and SCS-LCC.
    • SUPPORT!

  • Closed Cluster

    • Every node is behind the main node, or frontend

    • Pros
      – Suitable for parallel programs because external traffic does not interfere with communication within the cluster
      – Ease of security

    • Cons
      – If the frontend fails, users cannot access any of the nodes

  • Summary

    • The cluster is the new dominant platform for HPC
      – Powerful
      – Cost-effective

    • A combination of software and hardware is needed to build a cluster

    • The application is the key to harnessing the power of the cluster

  • How to measure your performance

    • Standard open-source benchmarking tools
      – HPL (High Performance Linpack)
      – Linpack
      – Stream – memory benchmark
      – Bonnie++ – I/O benchmark
      – IOZone – intensive I/O benchmark
      – IPerf – network performance snapshot
      – NetPipe – intensive bandwidth test

  • HPL (High Performance Linpack) and Linpack

    • HPL
      – A software package that solves a (random) dense linear system in double-precision (64-bit) arithmetic on distributed-memory computers
      – The standard benchmark tool for cluster computers, used at top500.org
      – Performance reported in Flops (floating-point operations per second)
      – Relies on MPI (Message-Passing Interface)
      – More info at http://www.netlib.org/benchmark/hpl/

    • Linpack
      – The old implementation of Linpack, used to determine single-node performance (a toy Flops sketch follows below)
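
    To show what "performance in Flops" means, here is a toy C sketch of our own (not HPL or Linpack themselves): it times a naive N x N matrix multiply, which performs about 2N^3 floating-point operations, and divides the operation count by the elapsed time.

    /* flops_toy.c - toy Flops measurement via a naive matrix multiply.
     * Not HPL: just illustrates ops/second. Compile: cc -O2 flops_toy.c */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N 512

    static double a[N][N], b[N][N], c[N][N];

    int main(void)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                a[i][j] = (double)rand() / RAND_MAX;
                b[i][j] = (double)rand() / RAND_MAX;
            }

        clock_t t0 = clock();
        for (int i = 0; i < N; i++)        /* C = A * B: ~2*N^3 flops */
            for (int j = 0; j < N; j++) {
                double s = 0.0;
                for (int k = 0; k < N; k++)
                    s += a[i][k] * b[k][j];
                c[i][j] = s;
            }
        double sec = (double)(clock() - t0) / CLOCKS_PER_SEC;

        double flops = 2.0 * N * N * N / sec;
        printf("%d x %d matmul: %.3f s, %.1f MFlop/s (c[0][0]=%f)\n",
               N, N, sec, flops / 1e6, c[0][0]);
        return 0;
    }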

  • Stream

    • Sustainable Memory Bandwidth in High Performance Computers

    • Used to measure memory bandwidth using various methods (a triad-style sketch follows below)

    • Performance reported in MBytes/sec

    • More info at http://www.cs.virginia.edu/stream/
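
    A minimal sketch in the spirit of STREAM's triad kernel (our simplification; the official benchmark runs several kernels with many repetitions): time a(i) = b(i) + q*c(i) over arrays too large for the caches and report the bytes moved per second.

    /* stream_toy.c - toy memory-bandwidth test, triad-style.
     * Compile: cc -O2 stream_toy.c */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N 10000000L   /* ~80 MB per array: large enough to defeat caches */

    int main(void)
    {
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);
        double *c = malloc(N * sizeof *c);
        if (!a || !b || !c) { fprintf(stderr, "out of memory\n"); return 1; }

        for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

        clock_t t0 = clock();
        for (long i = 0; i < N; i++)       /* triad kernel: a = b + q*c */
            a[i] = b[i] + 3.0 * c[i];
        double sec = (double)(clock() - t0) / CLOCKS_PER_SEC;

        /* the triad touches 3 arrays of doubles: 2 reads + 1 write per element */
        double mb = 3.0 * N * sizeof(double) / 1e6;
        printf("triad: %.3f s, %.1f MB/s (a[0]=%f)\n", sec, mb / sec, a[0]);

        free(a); free(b); free(c);
        return 0;
    }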

  • Bonnie++ and IOZone

    • Bonnie++
      – A benchmark suite that is aimed at performing a number of simple tests of hard drive and file system performance
      – Tests database-type access to a single file
      – Performance reported as an I/O rate in Kbytes/sec
      – More info at http://www.coker.com.au/bonnie++/

    • IOZone
      – A file system benchmark tool. The benchmark generates and measures a variety of file operations
      – Performs intensive I/O benchmarking over various file sizes and record sizes
      – More info at http://www.iozone.org/

    (A toy sequential-write sketch follows below.)

  • Iperf and Netpipe

    • Iperf
      – A tool to measure maximum TCP bandwidth, allowing the tuning of various parameters, and UDP characteristics
      – Reports data as Kbps or Mbps
      – More info at http://dast.nlanr.net/Projects/Iperf/

    • Netpipe
      – A protocol-independent performance tool that visually represents the network performance under a variety of conditions
      – More info at http://www.scl.ameslab.gov/netpipe/

    (A toy TCP throughput sketch follows below.)
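
    A toy TCP throughput test in POSIX C, over loopback for self-containment (our sketch, not Iperf or NetPIPE, which vary message sizes, directions, and protocols; port 15001 is an arbitrary choice): the parent receives and times, the forked child sends a fixed volume of data.

    /* net_toy.c - toy TCP throughput test over loopback (POSIX sockets).
     * Compile: cc -O2 net_toy.c */
    #include <stdio.h>
    #include <unistd.h>
    #include <time.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>

    #define PORT 15001
    #define BLOCK (64 * 1024)
    #define NBLOCKS 4096           /* 256 MB transferred */

    int main(void)
    {
        struct sockaddr_in addr = {0};
        addr.sin_family = AF_INET;
        addr.sin_port = htons(PORT);
        addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

        int ls = socket(AF_INET, SOCK_STREAM, 0);
        int one = 1;
        setsockopt(ls, SOL_SOCKET, SO_REUSEADDR, &one, sizeof one);
        bind(ls, (struct sockaddr *)&addr, sizeof addr);
        listen(ls, 1);

        if (fork() == 0) {                  /* child: sender */
            static char out[BLOCK];         /* zero-filled payload */
            int s = socket(AF_INET, SOCK_STREAM, 0);
            connect(s, (struct sockaddr *)&addr, sizeof addr);
            for (int i = 0; i < NBLOCKS; i++)
                send(s, out, BLOCK, 0);
            close(s);
            _exit(0);
        }

        int c = accept(ls, NULL, NULL);     /* parent: receiver + timer */
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        char in[BLOCK];
        long total = 0;
        ssize_t n;
        while ((n = recv(c, in, sizeof in, 0)) > 0)
            total += n;
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;

        printf("%ld bytes in %.3f s: %.1f Mbps\n",
               total, sec, total * 8.0 / sec / 1e6);
        close(c); close(ls);
        return 0;
    }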

  • Thank you