
Scalable Scientific Computing at Compaq

CAS 2001

Annecy, France

October 29 – November 1, 2001

Dr. Martin Walker

Compaq Computer EMEA

martin.walker@compaq.com


Agenda of the entertainment

From EV4 to EV7: four implementations of the Alpha microprocessor over ten years

Performance on a few applications, including numerical weather forecasting

The Terascale Computing System at the Pittsburgh Supercomputing Center

Marvel: the next (and last) AlphaServer

Grid Computing

Scientific basis for vector processor choice for Earth Simulator project

Comparison of Cray T3D and Cray Y-MP/C90: J.J. Hack et al., "Computational design of the NCAR community climate model", Parallel Computing 21 (1995) 1545-1569

Fraction of peak performance achieved:
– 1-7% on Cray T3D
– 30% on Cray Y-MP/C90

Cray T3D used the Alpha EV4 processor from 1992

Key ratios that determine sustained application performance (U.S. DoD/DoE)

Alpha EV6 Architecture

[Block diagram: seven pipeline stages (FETCH, MAP, QUEUE, REG, EXEC, DCACHE); branch predictors; integer and FP register maps; 20-entry integer issue queue and 15-entry FP issue queue; four instructions issued per cycle; 80 in-flight instructions plus 32 loads and 32 stores; four integer execution units; FP add/divide/sqrt and FP multiply units; 80-entry integer and 72-entry FP register files; 64 KB 2-way set-associative L1 instruction and data caches; victim buffer; miss-address and next-line-address logic.]

Weather Forecasting Benchmark

LM = local model, German Weather Service (DWD)

Current version is RAPS 2.0

Grid size is 325 × 325 × 35; predefined INPUT set "dwd" used for all benchmarks

First forecast hour timed (contains more I/O than subsequent forecast hours)

Machines:
– Cray T3E/1200 (EV5/600 MHz) in Jülich, Germany
– AlphaServer SC40 (EV67/667 MHz) in Marlboro, MA

Study performed by Pallas GmbH (www.pallas.com)

Total time (AS SC40 vs. Cray T3E)

[Plot: total time in seconds (0-900) versus number of processors (10-60) for the Compaq AlphaServer SC40 and the Cray T3E, each shown against its ideal scaling curve.]

Performance comparisons

Alpha EV67/667 MHz in AS SC40 delivers about 3 times the performance of EV5/600 MHz in Cray T3E to the LM application

EV5 is running at about 6.7% of peak

EV67 is running at about 18.5% of peak
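As a rough cross-check of these numbers, and assuming a peak of two floating-point operations per cycle for both processors (an assumption, not a figure from the slide), the peak fractions reproduce the factor of about 3:

```c
#include <stdio.h>

int main(void)
{
    /* Assumed peak = 2 flops/cycle x clock; this is an assumption,
       not a figure stated in the talk. */
    double ev5_peak  = 2.0 * 600.0;   /* MFLOPS, EV5/600 MHz  */
    double ev67_peak = 2.0 * 667.0;   /* MFLOPS, EV67/667 MHz */

    double ev5_sustained  = 0.067 * ev5_peak;   /* ~6.7% of peak  */
    double ev67_sustained = 0.185 * ev67_peak;  /* ~18.5% of peak */

    printf("EV5  sustained: %.0f MFLOPS\n", ev5_sustained);   /* ~80  */
    printf("EV67 sustained: %.0f MFLOPS\n", ev67_sustained);  /* ~247 */
    printf("ratio: %.2f\n", ev67_sustained / ev5_sustained);  /* ~3.1 */
    return 0;
}
```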

Compilation Times

Cray T3E
– Flags: -O3 -O aggress,unroll2,split1,pipeline2
– Compilation time: 41 min 37 sec

Compaq EV6/500 MHz (EV67 is faster)
– Flags: -fast -O4
– Compilation time: 5 min 15 sec

IBM SP3
– Flags: -O4 -qmaxmem=-1
– Compilation time: 40 min 19 sec
– Note: numeric_utilities.f90 had to be compiled with -O3 in order to avoid crashes

SWEEP3D

3D discrete ordinates (Sn) neutron transport

Implicit wavefront algorithm (dependence pattern sketched below)
– Convergence to stable solution

Target system: multitasked PVP / MPP
– Vector-style code
– High ratio of loads/stores to flops; memory bandwidth and latency sensitive; performance is sensitive to grid size
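The wavefront dependence is what makes the kernel latency-bound rather than vector-friendly. The sketch below is a simplified 2D illustration of that pattern, not the actual SWEEP3D source (which is Fortran and sweeps a 3D grid over discrete ordinates):

```c
/* Simplified 2D wavefront recurrence: each cell depends on its west and
 * south neighbours, so neither loop can be vectorized on its own; only
 * cells on the same anti-diagonal (i + j = const) are independent.
 * Performance is dominated by loads/stores and memory latency, as the
 * slide above notes. */
void wavefront_sweep(int nx, int ny, double phi[nx][ny],
                     double src[nx][ny])
{
    for (int i = 1; i < nx; i++)
        for (int j = 1; j < ny; j++)
            phi[i][j] = src[i][j] + 0.5 * (phi[i - 1][j] + phi[i][j - 1]);
}
```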

SWEEP3D "as is" Performance

CPU/MHz     CPI    Mflops   % Peak
EV5/613     2.27   113      9.3
EV6/500     1.57   135      13.5
EV7/1100    0.93   497      22.6

Optimizations to SWEEP3D

Fuse inner loops (before/after sketch below)
– demote temporary vectors to scalars
– reduce load/store count

Separate loops with explicit values for "i2" = -1, 1
– allows prefetch code to be generated

Fixup code moved "outside" loop
– loop unrolling, pipelining
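The first two optimizations (loop fusion plus demoting the temporary vector to a scalar) can be illustrated with a generic before/after pair; the loop bodies below are placeholders, not the real SWEEP3D loops:

```c
#define N_MAX 4096

/* Before: two loops communicate through a temporary vector, costing an
 * extra store and load per element.  Assumes n <= N_MAX. */
void kernel_before(int n, const double a[], const double b[], double out[])
{
    double tmp[N_MAX];                      /* temporary vector */
    for (int i = 0; i < n; i++)
        tmp[i] = a[i] * b[i];               /* loop 1 produces tmp */
    for (int i = 0; i < n; i++)
        out[i] += tmp[i];                   /* loop 2 consumes tmp */
}

/* After: loops fused and tmp demoted to a scalar that lives in a register,
 * reducing the load/store count exactly as described above. */
void kernel_after(int n, const double a[], const double b[], double out[])
{
    for (int i = 0; i < n; i++) {
        double t = a[i] * b[i];             /* scalar replaces tmp[i] */
        out[i] += t;
    }
}
```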

Instruction counts/iteration (+ measured cycles on EV6)

                     Original   Optimized
Instructions         144.1      90.4
Loads                39.5       19.6
Stores               14.8       9.5
Cycles/instruction   1.57       0.88
Cycles/iteration     175        75.5

Optimized SWEEP3D Performance

CPU/MHz     CPI    Mflops   % Peak
EV6/500     0.88   262      26.2
EV7/1100    0.66   767      34.9

AlphaServer ES45 (EV68/1.001 GHz)

[Block diagram: four Alpha 21264 CPUs, each with its own L2 cache, connected via a crossbar switch (Typhoon chipset, quad controller plus eight data slices) to four banks of 133 MHz SDRAM memory (128 MB - 32 GB) and to the I/O subsystem (4x AGP and several PCI buses). Labelled link bandwidths include 256b at 4.2 GB/s, "each @ 128b, 8.0 GB/s", "each @ 64b, 4.2 GB/s", 64b at 266 MB/s, and PCI buses at 64b @ 66 MHz (512 MB/s and 256 MB/s) and 32b @ 133 MHz (512 MB/s).]

Pittsburgh Supercomputing Center (PSC)

Cooperative effort of
– Carnegie Mellon University
– University of Pittsburgh
– Westinghouse Electric

Offices in Mellon Institute
– On CMU campus
– Adjacent to UofP campus

Westinghouse Electric

Energy Center, Monroeville, PA

Major computing systems

High-speed network connections

Terascale Computing System at Pittsburgh Supercomputing Center

Sponsored by the U.S. National Science Foundation

Integrated into the PACI program (Partnerships for Advanced Computational Infrastructure)

Serving the “very high end” for academic computational science and engineering

The largest open facility in the world

PSC in collaboration with Compaq and with
– Application scientists and engineers
– Applied mathematicians
– Computer scientists
– Facilities staff

Compaq AlphaServer SC technology

System Block Diagram

3040 CPUs, Tru64 UNIX, 3 TB memory, 41 TB disk, 152 CPU cabinets, 20 switch cabinets

ES45 nodes
– 5 per cabinet
– 3 local disks

[Diagram labels: switch, nodes, servers, disks, control]
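As a quick sanity check, the configuration numbers are internally consistent if each ES45 node holds 4 CPUs (as in the ES45 diagram earlier; an assumption about how the totals were counted, not a statement from this slide):

```c
#include <stdio.h>

int main(void)
{
    int cpus = 3040;
    int cpus_per_node = 4;        /* assumed: 4-processor ES45 nodes */
    int nodes_per_cabinet = 5;

    int nodes = cpus / cpus_per_node;            /* 760 nodes        */
    int cabinets = nodes / nodes_per_cabinet;    /* 152 CPU cabinets */

    printf("%d CPUs -> %d ES45 nodes -> %d cabinets\n",
           cpus, nodes, cabinets);
    return 0;
}
```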

Row upon row…

Quadrics Switches

Rail 1 & Rail 0

Middle Aisle, Switches in Center

QSW switch chassis

Fully wired switch chassis

1 of 42

Control nodes and concentrators

The Front Row

Installation: from 0 to 3.465 TFLOPS in 29 days (Latest: 4.059 TFLOPS on 3024 CPUs)

Deliveries & continual integration:
– 44 nodes arrived at PSC on Saturday, 9-1-2001
– 50 nodes arrived on Friday, 9-7-2001
– 30 nodes arrived on Saturday, 9-8-2001
– 50 nodes arrived on Monday, 9-10-2001
– 180 nodes arrived on Wednesday, 9-12-2001
– 130 nodes arrived on Sunday, 9-16-2001
– 180 nodes arrived on Thursday, 9-20-2001

To have shipped 12 September!

Federated switch cabled/operational by 9-23-01

760 nodes clustered by 9-24-01

3.465 TFLOPS Linpack by 9-29-01

4.059 TFLOPS in Dongarra's list dated Mon Oct 22 (67% of peak performance)
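The quoted 67% of peak is consistent with an assumed peak of two floating-point operations per cycle per EV68 CPU at 1.001 GHz (an assumed peak rate, not a number stated on the slide):

```c
#include <stdio.h>

int main(void)
{
    double cpus = 3024.0;
    double ghz = 1.001;
    double flops_per_cycle = 2.0;   /* assumed peak rate per CPU */

    double peak_tflops = cpus * ghz * flops_per_cycle / 1000.0;  /* ~6.05 */
    double linpack_tflops = 4.059;

    printf("peak ~ %.2f TFLOPS, Linpack efficiency ~ %.0f%%\n",
           peak_tflops, 100.0 * linpack_tflops / peak_tflops);   /* ~67%  */
    return 0;
}
```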

MM5

http://www.mmm.ucar.edu/mm5/mpp/helpdesk/20011023.html

Alpha Microprocessor Summary

EV6
– 0.35 µm, 600 MHz
– 4-wide superscalar
– Out-of-order execution
– High memory BW

EV67
– 0.25 µm, up to 750 MHz

EV68
– 0.18 µm, 1000 MHz

EV7
– 0.18 µm, 1250 MHz
– L2 cache on-chip
– Memory control on-chip
– I/O control on-chip
– Cache-coherent inter-processor communication on-chip

EV79
– 0.13 µm, ~1600 MHz

EV7 – The System is the Silicon…

• EV68 core with enhancements
• Integrated L2 cache
– 1.75 MB (ECC)
– 20 GB/s cache bandwidth
• Integrated memory controllers
– Direct RAMbus (ECC)
– 12.8 GB/s memory bandwidth
– Optional RAID in memory
• Integrated network interface
– Direct processor-processor interconnects
– 4 links, 25.6 GB/s aggregate bandwidth
– ECC (single error correct, double error detect)
– 3.2 GB/s I/O interface per processor

SMP CPU interconnect used to be external logic…

Now it’s on the chip

Alpha EV7

EV7 – The System is the Silicon…

The electronics for cache-coherent communication is placed within the EV7 chip

Alpha EV7 Core

[Block diagram: the EV6 core pipeline (FETCH, MAP, QUEUE, REG, EXEC, DCACHE stages; 20-entry integer and 15-entry FP issue queues; four instructions per cycle; 80 in-flight instructions plus 32 loads and 32 stores; 64 KB 2-way set-associative L1 instruction and data caches; victim buffer) together with the on-chip 1.75 MB 7-way set-associative L2 cache.]

Virtual Page Size

Current virtual page sizes
– 8K
– 64K
– 512K
– 4M

New virtual page sizes (boot-time selection; TLB-reach illustration below)
– 64K
– 2M
– 64M
– 512M
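The point of the much larger boot-time page sizes is TLB reach: the memory a fixed number of TLB entries can map grows linearly with page size. The entry count below is purely illustrative, not an actual Alpha TLB size:

```c
#include <stdio.h>

int main(void)
{
    /* Hypothetical TLB entry count, for illustration only. */
    const long long tlb_entries = 128;
    const long long page_sizes[] = {
        8LL << 10, 64LL << 10, 512LL << 10, 4LL << 20,   /* current sizes */
        2LL << 20, 64LL << 20, 512LL << 20               /* new sizes     */
    };

    for (int i = 0; i < 7; i++) {
        long long reach = tlb_entries * page_sizes[i];   /* bytes mapped */
        printf("page %9lld KB -> TLB reach %9lld MB\n",
               page_sizes[i] >> 10, reach >> 20);
    }
    return 0;
}
```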

Performance

SPEC95
– SPECint95 75
– SPECfp95 160

SPEC2000
– CINT2000 800
– CFP2000 1200

59% higher than EV68/1 GHz

Building Block Approach to System Design

Key components:
• EV7 processor
• IO7 I/O interface
• Dual processor module

Systems grow by adding: processors, memory, I/O

Two complementary views of the Grid

The hierarchy of understanding

Data are uninterpreted signals

Information is data equipped with meaning

Knowledge is information applied in practice to accomplish a task

The Internet is about information

The Grid is about knowledge

– Tony Hey, Director, UK eScience Core Program

Main technologies developed by man

Writing captures knowledge

Mathematics enables rigorous understanding and prediction

Computing enables prediction of complex phenomena

The Grid enables intentional design of complex systems

– Rick Stevens, ANL

What is the Grid?

“A computational grid is a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computing capabilities.”

– Ian Foster and Carl Kesselman, editors, “The GRID: Blueprint for a New Computing Infrastructure” (Morgan-Kaufmann Publishers, SF, 1999) 677 pp. ISBN 1-55860-8

The Grid is an infrastructure to enable virtual communities to share distributed resources to pursue common goals

The Grid infrastructure consists of protocols, application programming interfaces, and software development kits to provide authentication, authorization, and resource location and access

– Foster, Kesselman, Tuecke: “The anatomy of the Grid: Enabling Scalable Virtual Organizations” http://www.globus.org/research/papers.html

Compaq and The Grid

Sponsor of the Global Grid Forum (www.globalgridforum.org)

Founding member of the New Productivity Initiative for Distributed Resource Management (www.newproductivity.org)

Industrial member of the GridLab consortium (www.gridlab.org)
– 20 leading European and US institutions
– Infrastructure, applications, testbed
– Cactus "worm" demo at SC2001 (www.cactuscode.org)

Intra-Grid within the Compaq firewall
– Nodes in Annecy, Galway, Nashua, Marlboro, Tokyo
– Globus, Cactus, GridLab infrastructure and applications
– iPAQ Pocket PC (www.ipaqlinux.com)

Potential dangers for the Grid

Solution in search of a problem

Shell game for cheap (free) computing

Plethora of unsupported, incompatible, non-standard tools and interfaces

"Big Science"

As with the Internet, scientific computing will be the first to benefit from the Grid. Examples:
– GriPhyN (US Grid Physics Network for Data-intensive Science)
  Elementary particle physics, gravitational wave astronomy, optical astronomy (digital sky survey)
  www.griphyn.org
– DataGrid (led by CERN)
  Analysis of data from scientific exploration
  www.eu-datagrid.org
– There are also compute-intensive applications that can benefit from the Grid

Final Thoughts: all this will not be easy

How good have we been as a community at making parallel computing easy and transparent?

There are still some things we can't do
– predict the El Niño phenomenon correctly
– plate tectonics and Earth mantle convection
– failure mechanisms in new materials

Validation and verification of numerical simulation are crying needs

Thank You!

Please visit our HPTC Web Site: http://www.compaq.com/hpc

Stability & Continuity for AlphaServer customers

Commitment to continue implementing the Alpha Roadmap according to the current plan-of-record

– EV68, EV7 & EV79

– Marvel systems

– Tru64 UNIX support

– AlphaServer systems, running Tru64 UNIX, will be sold as long as customers demand, at least several years after EV79 systems arrive in 2004, with support continuing for a minimum of 5 years beyond that

Microprocessor and System Roadmaps

[Roadmap chart, 2001-2005:
– Alpha processors: EV68 → EV7 → EV79
– Itanium™ processor family: Itanium™ → McKinley → Madison → next generation
– AlphaServers: EV68 product family (GS 1-32P, ES 1-4P, DS 1-2P) → EV7 family (8-64P with 8P building block, 2-8P with 2P building block) → EV79 (8-64P with 8P building block, 2-8P with 2P building block)
– ProLiant servers: Itanium™ 1-4P and 1-8P, McKinley family 1-32P, next-generation server family (blades, 2P, 4P, 8P, 8-64P)]

The New HP

Chairman and CEO: Carly Fiorina
President: Michael Capellas
Imaging and Printing ($20B): Vyomesh Joshi
Access Devices ($29B): Duane Zitzner
IT Infrastructure ($23B): Peter Blackmore
Services ($15B): Ann Livermore