
© CHPC 2009 www.chpc.org.za An initiative of the Department of Science and Technology

Managed by the Meraka Institute

The Need for Parallel Thinking

Michael Inggs, Advanced Computer Engineering Laboratory

[email protected]


Presentation Overview

• Our parallel future
– Long live Moore's Law
– But it's not going faster
• Hardware goes parallel, but on an old paradigm
• The misguided youth of today:
– It's not their fault
– Revolution in teaching of programming needed
• ACELab projects aiming at the next generation
• Conclusions


Long Live Moore's Law

[Figure: Moore's Law (transistor counts over time), from Wikimedia Commons]


But the Silicon fails us

Clock speeds are no longer increasing as they once did


Current Solution: multicore

• However, von Neumann rules, with some exceptions:
– Network processors
– Cell (now gone)
– Harvard architecture on DSPs
– GPUs
• Consequences:
– Large, complex processors
– Burning power
– Using lots of transistors
– Huge, slow memories
– Lots of other problems, to be discussed later


A view from Berkeley

The Landscape of Parallel Computing Research:

A View from Berkeley*

Krste Asanović, Rastislav Bodik, Bryan Catanzaro, Joseph Gebis,

Parry Husbands, Kurt Keutzer, David Patterson,

William Plishker, John Shalf, Samuel Williams, and Katherine Yelick

December 18, 2006

*The Landscape of Parallel Computing Research: A View from Berkeley, Technical Report No. UCB/EECS-2006-183, http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html


The seven critical questions


Their Conclusions (1)

• The overarching goal should be to make it easy to write programs that execute efficiently on highly parallel computing systems

• The target should be 1000s of cores per chip, as these chips are built from processing elements that are the most efficient in MIPS (Million Instructions per Second) per watt, MIPS per area of silicon, and MIPS per development dollar.

• Instead of traditional benchmarks, use 13 “Motifs” to design and evaluate parallel programming models and architectures. (A motif is an algorithmic method that captures a pattern of computation and communication.)

• “Autotuners” should play a larger role than conventional compilers in translating parallel programs (see the sketch after this list).

• To maximize programmer productivity, future programming models must be more human-centric than the conventional focus on hardware or applications.
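To make the autotuner point above concrete, here is a minimal, hypothetical sketch in C (my illustration, not from the talk or the Berkeley report): it times a tiled matrix transpose for a few candidate tile sizes and keeps the fastest. Production autotuners such as ATLAS, FFTW, and OSKI search far larger spaces of code variants, but the principle is the same.

/* Hypothetical autotuner sketch: time several tile sizes and keep the best.
 * Compile, e.g.: gcc -O2 -o autotune autotune.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 2048   /* matrix is N x N doubles (about 32 MB each) */

static double transpose_tiled(const double *a, double *b, int tile) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int ii = 0; ii < N; ii += tile)
        for (int jj = 0; jj < N; jj += tile)
            for (int i = ii; i < ii + tile && i < N; i++)
                for (int j = jj; j < jj + tile && j < N; j++)
                    b[j * N + i] = a[i * N + j];   /* tiled transpose */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
}

int main(void) {
    double *a = malloc((size_t)N * N * sizeof *a);
    double *b = malloc((size_t)N * N * sizeof *b);
    if (!a || !b) return 1;
    for (long i = 0; i < (long)N * N; i++) a[i] = (double)i;

    int tiles[] = { 8, 16, 32, 64, 128 };
    int best_tile = tiles[0];
    double best_time = 1e30;
    for (size_t k = 0; k < sizeof tiles / sizeof *tiles; k++) {
        double t = transpose_tiled(a, b, tiles[k]);
        printf("tile %4d: %.4f s\n", tiles[k], t);
        if (t < best_time) { best_time = t; best_tile = tiles[k]; }
    }
    printf("autotuned tile size: %d\n", best_tile);
    free(a); free(b);
    return 0;
}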


Continued

• To be successful, programming models should be independent of the number of processors (see the OpenMP sketch after this list).

• To maximize application efficiency, programming models should support a wide range of data types and successful models of parallelism: task-level parallelism, word-level parallelism, and bit-level parallelism.

• Architects should not include features that significantly affect performance or energy if programmers cannot accurately measure their impact via performance counters and energy counters.

• Traditional operating systems will be deconstructed and operating system functionality will be orchestrated using libraries and virtual machines.

• To explore the design space rapidly, use system emulators based on Field Programmable Gate Arrays (FPGAs) that are highly scalable and low cost.
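A minimal sketch (mine, not from the report) of what "independent of the number of processors" means in practice: the same OpenMP loop runs unchanged on 2, 8, or 80 cores, because the runtime, not the source code, decides the decomposition.

/* Processor-count-independent sketch: no core count is hard-coded.
 * Compile, e.g.: gcc -O2 -fopenmp -o saxpy saxpy.c */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void) {
    const long n = 1L << 24;
    float *x = malloc(n * sizeof *x), *y = malloc(n * sizeof *y);
    if (!x || !y) return 1;
    for (long i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    const float a = 3.0f;
    #pragma omp parallel for          /* runtime picks the partitioning */
    for (long i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];

    printf("up to %d threads available, y[0] = %.1f\n",
           omp_get_max_threads(), y[0]);
    free(x); free(y);
    return 0;
}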


The changes

Old CW: Power is free, but transistors are expensive.

New CW is the “Power wall”: Power is expensive, but transistors are “free”. That is, we can put more transistors on a chip than we have the power to turn on.

Old CW: If you worry about power, the only concern is dynamic power.

New CW: For desktops and servers, static power due to leakage can be 40% of total power. (See Section 4.1.)
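For background (standard CMOS facts, not from the slide): dynamic switching power scales roughly as P_dynamic ≈ α · C · V² · f (activity factor α, switched capacitance C, supply voltage V, clock frequency f), while static power is P_static = V · I_leakage. Voltage can no longer be dropped fast enough to offset ever more transistors, and leakage current grows as feature sizes shrink, which is how static power reaches the roughly 40% figure quoted above.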

Old CW: Monolithic uniprocessors in silicon are reliable internally, with errors occurring only at the pins.

New CW: As chips drop below 65 nm feature sizes, they will have high soft and hard error rates. [Borkar 2005] [Mukherjee et al 2005]

Old CW: By building upon prior successes, we can continue to raise the level of abstraction and hence the size of hardware designs.

New CW: Wire delay, noise, cross coupling (capacitive and inductive), manufacturing variability, reliability (see above), clock jitter, design validation, and so on conspire to stretch the development time and cost of large designs at 65 nm or smaller feature sizes.


Continued

Old CW: Researchers demonstrate new architecture ideas by building chips.

New CW: The cost of masks at 65 nm feature size, the cost of Electronic Computer Aided Design software to design such chips, and the cost of design for GHz clock rates means researchers can no longer build believable prototypes. Thus, an alternative approach to evaluating architectures must be developed.

Old CW: Performance improvements yield both lower latency and higher bandwidth.

New CW: Across many technologies, bandwidth improves by at least the square of the improvement in latency. [Patterson 2004]

Old CW: Multiply is slow, but load and store is fast.

New CW is the “Memory wall” [Wulf and McKee 1995]: Load and store is slow, but multiply is fast. Modern microprocessors can take 200 clocks to access Dynamic Random Access Memory (DRAM), but even floating-point multiplies may take only four clock cycles.
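A rough way to see the memory wall on commodity hardware (my sketch, not from the talk): chase pointers through a large, randomly permuted array, so almost every load misses the caches, and compare against the same number of dependent floating-point multiplies.

/* Memory-wall demo sketch: dependent loads vs dependent multiplies. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)   /* 16M entries, ~128 MB of size_t: far bigger than cache */

static double seconds(void) {
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return t.tv_sec + 1e-9 * t.tv_nsec;
}

int main(void) {
    size_t *next = malloc((size_t)N * sizeof *next);
    if (!next) return 1;
    for (size_t i = 0; i < N; i++) next[i] = i;

    /* Sattolo's algorithm: a single random cycle, so p = next[p] never
     * revisits an element until all N have been touched. */
    srand(1);
    for (size_t i = N - 1; i > 0; i--) {
        size_t r = ((size_t)rand() << 16) | (size_t)(rand() & 0xffff);
        size_t j = r % i;
        size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }

    double t0 = seconds();
    size_t p = 0;
    for (long i = 0; i < N; i++) p = next[p];     /* N dependent loads */
    double t_load = seconds() - t0;

    t0 = seconds();
    double x = 1.0;
    for (long i = 0; i < N; i++) x *= 1.0000001;  /* N dependent multiplies */
    double t_mul = seconds() - t0;

    printf("loads: %.3f s, multiplies: %.3f s (p=%zu, x=%f)\n",
           t_load, t_mul, p, x);
    free(next);
    return 0;
}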


Continued

Old CW: We can reveal more instruction-level parallelism (ILP) via compilers and architecture innovation. Examples from the past include branch prediction, out-of-order execution, speculation, and Very Long Instruction Word systems.

New CW is the “ILP wall”: There are diminishing returns on finding more ILP. [Hennessy and Patterson 2007]

Old CW: Uniprocessor performance doubles every 18 months.

New CW is Power Wall + Memory Wall + ILP Wall = Brick Wall. Figure 2 plots processor performance for almost 30 years. In 2006, performance is a factor of three below the traditional doubling every 18 months that we enjoyed between 1986 and 2002. The doubling of uniprocessor performance may now take 5 years.

Old CW: Don’t bother parallelizing your application, as you can just wait a little while and run it on a much faster sequential computer.

New CW: It will be a very long wait for a faster sequential computer (see above).

Old CW: Increasing clock frequency is the primary method of improving processor performance.

New CW: Increasing parallelism is the primary method of improving processor performance. (See Section 4.1.)

Old CW: Less than linear scaling for a multiprocessor application is failure.

New CW: Given the switch to parallel computing, any speedup via parallelism is a success.
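For context (standard background, not on the slide): Amdahl's law, S(P) = 1 / ((1 − f) + f / P), bounds the speedup S on P processors by the parallelizable fraction f of the work. With f = 0.95, even infinitely many processors give at most a 20x speedup, so insisting on linear scaling was always an unrealistic bar, and any measured speedup is worth having.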


Patterns of intercommunication

• Colella noted patterns of intercommunication between processors for different computing tasks.
• Originally seven (hence "Dwarfs"), these have since been extended to thirteen (next slide).
• Now called Motifs.
• These Motifs might form the basis of the different sections of a future machine (by analogy with the cortex, medulla, etc. of the human brain).


The Thirteen Motifs (Dwarfs)

1. Dense Linear Algebra (e.g., BLAS or MATLAB)
2. Sparse Linear Algebra (e.g., SpMV, OSKI, or SuperLU)
3. Spectral Methods (e.g., FFT)
4. N-Body Methods (e.g., Barnes-Hut, Fast Multipole Method)
5. Structured Grids (e.g., Cactus or Lattice-Boltzmann Magnetohydrodynamics)
6. Unstructured Grids (e.g., ABAQUS or FIDAP)
7. MapReduce (e.g., Monte Carlo)
8. Combinational Logic
9. Graph Traversal
10. Dynamic Programming
11. Backtrack and Branch-and-Bound
12. Graphical Models
13. Finite State Machine
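As one small, concrete example of a motif (my sketch, not from the talk): motif 7, MapReduce, in its classic Monte Carlo form, where independent samples are "mapped" in parallel and "reduced" into a single estimate of π.

/* Motif 7 (MapReduce) as Monte Carlo estimation of pi.
 * Compile, e.g.: gcc -O2 -fopenmp -o pi pi.c */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void) {
    const long samples = 100000000L;
    long hits = 0;

    #pragma omp parallel reduction(+:hits)
    {
        /* per-thread seed so threads draw independent streams */
        unsigned int seed = 1234u + 7u * (unsigned)omp_get_thread_num();
        #pragma omp for
        for (long i = 0; i < samples; i++) {
            double x = rand_r(&seed) / (double)RAND_MAX;   /* "map" */
            double y = rand_r(&seed) / (double)RAND_MAX;
            if (x * x + y * y <= 1.0) hits++;              /* "reduce" */
        }
    }
    printf("pi is approximately %.5f\n", 4.0 * (double)hits / (double)samples);
    return 0;
}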


Misguided Youth of Today

• Misguided by their teachers.
• Most academics teach only sequential computing.
• Parallel thinking is accessible, and can be taught very early.
• Numerical methods: who studies these today?
• Do we need to understand the packages we use?

– Would you fly in an aircraft without preflight checklists?

A revolution in teaching is required.


Ways forward

• National Computational Science Programme?
– Honours, Masters, Doctoral?
– Precedent: NASSP
• Split it into:
– Applications bias (Science and Engineering)
– Development (Engineering) bias

• Applications facilitators helping users to transition to new systems.

• How do we get there?


Initiatives that might do it

• Education push from secondary school onwards:
– However, Computer Science and Engineering student intakes are dwindling, and failure rates are higher.
• Revise the syllabus content of CS and Engineering courses to involve parallel thought.
• Use material from the mainstream courses to illustrate and build confidence in using computers to solve problems. This requires that we:
– Re-skill the trainers to introduce computers into problem solving and exploration.


Introduction to ACE Laboratory

• Part of the South African CHPC.
• Innovate architectural solutions for real computing problems:

– If you can buy it, we probably don't want to be involved.

– Do not provide production compute services.

– Can provide advice on present technology.

• Produce innovative hardware / firmware solutions for possible industrial production:

– South Africa can do it

– We can achieve international peer recognition

• Tool chains for the new computing.


Recent ACELab Initiatives (1)

• July 2008: Analysis of FORTRAN codes (SIMCON) (10)
• December 2008: Reconfigurable computing (NCSA) (25)
• September 2009: Parallel computing (Stanford PPL) (25)
• Links via real projects with:

– Berkeley CASPER

– NCSA Innovative Systems Laboratory

– Edinburgh Parallel Computing Centre

– Liverpool Hope University

– Stanford NetFPGA

• Videos of lectures and slides available
• Google Group with 100+ local and international participants


ACELab Initiatives (2)

• Why write code to solve a problem on a given architecture (usually von Neumann, now GPGPU, Cell (RIP), etc.)?
• Why not write architectures to solve problems?
– Hard work.
• Can we automate this?
– Hydra project

• Manycore solution, deployed on FPGA at the moment.

• Automatic scheduler

• Bioinformatics on FPGA
• Pulsars / transients on GPU
• Investigating Clouds


ACELab Initiatives (3)

• Petting Zoo of interesting hardware, freely available:
– Nallatech H101 FPGA hardware
– Cell Broadband Engine (Sony PlayStation)

– ACE-1 FPGA accelerator cards

– GPU cluster

– Mini Cloud based on Eucalyptus

– ROACH computer nodes

– AstroGig SDR input processor

• Large number of software development systems.
• Expertise to assist users to implement applications.


ACE-1 Reconfigurable Accelerator Card

[Block diagram: Virtex-5 LX110T and Virtex-5 LX50 FPGAs, HTX connector, CX4 RocketIO links, QDR memory banks, clock generation and distribution, Platform Flash (PROM), 8 status LEDs, and HyperTransport / programming interfaces]

1x LX110T Virtex-5 FPGA

1x LX50 Virtex-5 FPGA

6x 2MB QDRII+ Memory Modules

2x 10GbE-CX4 connectors

1x HyperTransport connector

1x JTAG Connector


Card Maximum I/O Capabilities

[Block diagram as on the previous slide, annotated with maximum link rates]

• 2x 10GbE-CX4: 20 Gbps TX, 20 Gbps RX
• HTX: 3.2 GB/s TX, 3.2 GB/s RX
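As a rough aggregate (my arithmetic, ignoring protocol overhead): the two CX4 ports give 2 × 10 Gb/s = 20 Gb/s ≈ 2.5 GB/s per direction, so together with the 3.2 GB/s HTX link the card can move roughly 2.5 + 3.2 ≈ 5.7 GB/s in each direction.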


HTX Reconfigurable Accelerator Card: QDRII+ Memory

• 6 memory banks, each with:
– 2 MB accessible space
– 18-bit write access
– 18-bit read access
– Clock rate of up to 400 MHz
– 18 bits × 400 MHz × 2 (DDR) ≈ 1.8 GB/s bandwidth per bank
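Spelling out the arithmetic (my calculation from the figures above): 18 bits × 400 MHz × 2 (DDR) = 14.4 Gb/s = 1.8 GB/s per bank, so with all six banks active the card's aggregate QDR bandwidth is about 6 × 1.8 GB/s ≈ 10.8 GB/s.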


Configuration 1

[Network diagram: a Fujitsu XG700 10GbE switch connects four reconfigurable nodes over dual 10GbE-CX4 links, together with a sys-admin node on 1GbE, optional RAID storage, and portable file storage on 10GbE; each reconfigurable node is a HyperTransport AMD motherboard with two AMD Opteron processors, SDRAM, 250 GB storage, an HTX Reconfigurable Accelerator Card, a 10GbE link, and a 1GbE switch connection]


Configuration 1

• Suitable for DSP-type solutions with large datasets
• Fujitsu XG700:
– 12x 10GbE-CX4 ports
– 1x 1GbE port
– 240 Gbps switching throughput
– 450 ns latency
– Cost: $7,500 (best value per-port on the market)


Configuration 2

[Network diagram: reconfigurable nodes (each with two AMD Opteron processors, SDRAM, 250 GB storage, and an HTX Reconfigurable Accelerator Card) connected through 1GbE switches to a sys-admin node, optional RAID storage, and portable file storage, with 10GbE links between the reconfigurable nodes]


Configuration 2

• Suitable for computationally dense problems, with small datasets

• Uses cheap 1GbE switch and other 1GbE components

• No need for 10GbE switch.


Conclusions on ACE-1 Board

• Moderate amount of intercommunication.
• Reasonable local memory.
• Flexible topology.
• Software pipeline needs development:
– M.Sc. project: Nick Thorne
• Applications development:
– M.Sc. project: Jane Hewitson, with Simcon and Liverpool Hope University (Hydra Project)


Cooperations

• Edinburgh Parallel Computing Centre (Maxwell Computer)
• NCSA (Innovative Systems Lab)
• KAT Digital Group
• Berkeley CASPER
• Liverpool Hope University
• Stanford PPL and NetFPGA
• Hong Kong
• CPUT EE
• UCT Bioinformatics
• US Computational EM
– and more to follow, but to be based on concrete project work.


Conclusions

• Present computer architectures are running into performance problems, despite Moore's Law.

• Review of the important Berkeley assessment of how we are going to move forward with HPC:
– it is many-core,
– it is parallel, and
– it demands a huge innovation in software development and, most importantly, the training of a new generation.
• Introduction to the ACE Laboratory.
• Some of the projects of the ACE Laboratory, aiming at architectures for compute problems.