Roberto Vaccaro & Lorenzo Verdoscia Institute for High Performance Computing and Networking

Programming Models and Architectures for ManyCore Systems:

Challenges and Opportunities for the next 10 years.

Roberto Vaccaro & Lorenzo Verdoscia

Institute for High Performance Computing and Networking

National Research Council – Italy

[email protected]

Workshop, December 19, Napoli - Italy

CNR Bioinformatics

Introduction

■ The computational and storage needs of workloads in several areas, such as the life sciences, are growing exponentially.

■ Overcoming heterogeneity and computing barriers.
– The scientist should be allowed to look at the data
• easily,
• wherever it may be,
• with sufficient processing power for any desired algorithm to process it.

■ In the life sciences, the scientist's requirements concern a range of different scales, from the local parallel component processor to the global architectural level of the cross-organizational grid.

■ Integrated solutions capable of addressing the problems at the different architectural levels are needed.

[Figure (*): system levels and their interconnects]

System levels: Grid of Clusters, Cluster, Commodity Machine, Microprocessor

Interconnects: Wide Area Network, Local Area Network, System Level Network, Network on Chip

■ ManyCore Chip

■ Photonic Networks for intra-chip, inter-chip, box interconnects

(*) T. Agerwala, M. Gupta, “Systems research challenges: A scale-out perspective”, IBM Journal of Research & Development, Vol. 50, No. 2/3, March/May 2006, pp. 173-180

■ An ensemble of N nodes each comprising p computing elements

■ The p elements are tightly bound through shared memory (e.g., SMP, DSM)

■ The N nodes are loosely coupled, i.e., distributed memory

■ p is greater than N

■ The distinction is which layer gives us the most power through parallelism

■ GRIDs built over wide-area networks & across organisational boundaries.

■ Lack of (further) improvement in network latency.

The currently prevailing synchronous approach to distributed programming (using RPC primitives, for example) will have to be replaced with an

ASYNCHRONOUS PROGRAMMING APPROACH that is more
- delay-tolerant
- failure-resilient
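The contrast between the synchronous and the asynchronous style can be sketched in Python. This is a minimal illustration under stated assumptions, not part of the original slides: `remote_call` is a hypothetical stand-in for a wide-area service invocation (it only simulates latency with `asyncio.sleep`), and the timeout values are arbitrary. Issuing the calls concurrently overlaps their latencies (delay tolerance), and a per-call timeout keeps one slow peer from blocking the rest (failure resilience).

```python
import asyncio
import time


async def remote_call(name: str, latency_s: float) -> str:
    """Hypothetical remote service: it only simulates network latency."""
    await asyncio.sleep(latency_s)
    return f"{name}:ok"


async def call_with_timeout(name: str, latency_s: float, timeout_s: float) -> str:
    """Failure-resilient wrapper: a slow peer yields a fallback, not a hang."""
    try:
        return await asyncio.wait_for(remote_call(name, latency_s), timeout_s)
    except asyncio.TimeoutError:
        return f"{name}:timeout"


async def synchronous_style(latencies: dict, timeout_s: float) -> list:
    """RPC style: each call waits for the previous one to return."""
    return [await call_with_timeout(n, l, timeout_s) for n, l in latencies.items()]


async def asynchronous_style(latencies: dict, timeout_s: float) -> list:
    """All calls are in flight at once, so their latencies overlap."""
    calls = [call_with_timeout(n, l, timeout_s) for n, l in latencies.items()]
    return list(await asyncio.gather(*calls))


def timed(coro_fn, *args):
    """Run a coroutine function to completion and return (result, seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro_fn(*args))
    return result, time.perf_counter() - start
```

With three peers, one of them slower than the timeout, both styles return the same results, but the asynchronous version takes roughly the time of the slowest tolerated call instead of the sum of all calls.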

■ A first step in that direction:
- peer-to-peer (P2P) architectures
- service-oriented architectures (SOA)
capable of supporting reuse of both functionality and data.

■ Using P2P architectures and protocols it is possible to
- realize distributed systems without any centralized control or hierarchical organisation,
- achieve scalable and reliable location and exchange of scientific data and software in a decentralised manner.

■ Service-Oriented Architectures (SOA) and the web-service infrastructures that assist in their implementation facilitate reuse of functionality.

(*) G. Kandaswamy et al., “Building Web Services for Scientific Grid Applications”, IBM Journal of Research & Development, Vol. 50, No. 2/3, March/May 2006, pp. 249-260

■ The fundamental primitive of an SOA infrastructure is the ability to locate and invoke a service across machine and organisational boundaries, in either a synchronous or an asynchronous manner.

■ Computational scientists will be able to flexibly orchestrate SOA services into computational workflows.

■ Appropriate programming-language abstractions for science have to be provided.

■ Fortran and the Message Passing Interface (MPI) are no longer appropriate for the architecture described above.

■ By using abstract machines it is possible to mix compilation and interpretation, as well as to integrate code written in different languages seamlessly into an application or service.

A viable approach

■ Define a Multilevel Integrated Programming Model

■ Explore the management of concurrency in processor design on a range of different scales
- from instructions to programs
- from microgrids to global grids

■ Evaluate the possibility and modalities of implementing an integrated H/W and S/W system capable of giving the right answer in terms of:
- Inter/intra processor latency.
- A more delay-tolerant and failure-resilient programming approach.
- Capability of data and functionality reuse at the global architecture level (distributed, cross-organisational).
- Capability to take advantage of parallel and distributed resources.

Introduction

By Little’s law, the amount of concurrency needed to hide the latency of memory accesses will continue to increase as the gap between memory and processor speed grows. Since the memory latency is improving at a rate of only roughly 6% each year, the gap is projected to continue growing even as the increase in processor speed decreases from the historic rate of about 60% each year to about 20% each year.
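The argument can be made concrete with a short Python sketch. Little's law states that the number of operations in flight equals latency times throughput. The growth rates (6%, 60%, 20% per year) are the ones quoted above. The 200-cycle latency is the DRAM figure quoted under the "Memory wall" item later in the talk, and the sustained access rate of 0.5 per cycle is purely an illustrative assumption.

```python
def required_concurrency(latency_cycles: float, accesses_per_cycle: float) -> float:
    """Little's law: operations in flight = latency x throughput."""
    return latency_cycles * accesses_per_cycle


def relative_gap(years: int, cpu_rate: float, mem_rate: float) -> float:
    """Processor/memory speed ratio after `years`, relative to 1.0 today.

    cpu_rate and mem_rate are yearly improvement rates (0.20 means 20%).
    """
    return ((1.0 + cpu_rate) / (1.0 + mem_rate)) ** years


if __name__ == "__main__":
    # Assumed: 200-cycle memory latency, one memory access every two cycles.
    print("accesses in flight needed:", required_concurrency(200, 0.5))
    # Gap growth over a decade at the reduced (20%) and historic (60%) rates.
    print("gap after 10 years at 20%/year:", relative_gap(10, 0.20, 0.06))
    print("gap after 10 years at 60%/year:", relative_gap(10, 0.60, 0.06))
```

Even at the reduced 20% rate the ratio keeps growing every year, so the concurrency needed to hide memory latency grows with it.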

Computer hardware industry

2005 marked a historic change of direction for the computer hardware industry.

● The major microprocessor companies all announced that

future products would be single-chip multiprocessors

future performance improvements would rely on

○ software-specified parallelism

rather than

○ additional software-transparent parallelism extracted automatically by the microarchitecture

■ It is significant that a multibillion-dollar industry has bet its future on solving the general-purpose parallel computing problem, even though so many have previously attempted but failed to provide a satisfactory approach.

■ In order to tackle the parallel processing problem, innovative solutions are urgently needed, which in turn require extensive codevelopment of hardware and software.

■ Advances in integrated circuit technology impose new challenges: how to implement a high-performance application with low power dissipation on processors made up of hundreds of cores running at 200 MHz, rather than on one traditional processor running at 20 GHz.

■ The convergence of the high-performance and embedded industries.

Multicore or Manycore?

■ Multicore will obviously help multiprogrammed workloads, which contain a mix of independent sequential tasks, but how will individual tasks become faster?

■ Switching from sequential to modestly parallel computing will make programming much more difficult without rewarding this greater effort with a dramatic improvement in power-performance.

■ Multicore is unlikely to be the ideal answer, and sneaking up on the problem of parallelism via multicore solutions is likely to fail.

■ We desperately need a new solution for parallel hardware and software.

■ Compatibility with old binaries and C programs is valuable to industry, and some researchers are trying to help multicore product plans succeed.

■ We have been thinking bolder thoughts. Our aim is to realize thousands of processors on a chip for new applications, and we welcome new programming models and new architectures if they simplify the efficient programming of such highly parallel systems.

■ Rather than multicore, we are focused on “manycore”.

■ Between February 2005 and December 2006 a group of researchers at the University of California at Berkeley, from many backgrounds (circuit design, computer architecture, massively parallel computing, computer-aided design, embedded h/w and s/w, programming languages, compilers, scientific programming and numerical analysis), met to discuss parallelism from these many angles.

■ The result of borrowing good ideas regarding parallelism from different disciplines is the report:

“The Landscape of Parallel Computing Research: A View from Berkeley”

Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro, Joseph James Gebis, Parry Husbands, Kurt Keutzer, David A. Patterson, William Lester Plishker, John Shalf, Samuel Webb Williams, Katherine A. Yelick

Electrical Engineering and Computer Sciences

University of California at Berkeley

Technical Report No. UCB/EECS-2006-183

http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html

December 18, 2006

The Landscape

■ Seven critical questions used to frame the landscape of parallel computing research:

1. What are the applications?

2. What are common kernels of the applications?

3. What are the hardware building blocks?

4. How to connect them?

5. How to describe applications and kernels?

6. How to program the hardware?

7. How to measure success?

■ This report does not have the answers:
- on some questions, non-conventional and provocative perspectives are offered,
- on others, seemingly obvious but sometimes neglected perspectives are stated.

Embedded versus High Performance Computing

They have more in common looking forward than they did in the past:

1. Both are concerned with power, whether it is battery life for cell phones or the cost of electricity and cooling in a data center.

2. Both are concerned with hardware utilization. Embedded systems are always sensitive to cost, but efficient use of hardware is also required when you spend $10M to $100M for high-end servers.

3. As the size of embedded software increases over time, the fraction of hand tuning must be limited and so the importance of software reuse must increase.

4. Since both embedded and high-end servers now connect to networks, both need to prevent unwanted accesses and viruses.

■ The biggest difference between the two targets is the traditional emphasis on real-time computing in embedded systems, where the computer and the program need to be just fast enough to meet the deadlines, and there is no benefit to running faster.

■ Running faster is usually valuable in server computing.

■ As server applications become more media-oriented, real time may become more important for server computing as well.

Information Society Technologies (IST)

Network of Excellence on High Performance Embedded Architectures and Compilers (HiPEAC)

Mateo Valero (UPC Barcelona), HiPEAC Coordinator, introducing the publication of the first HiPEAC research roadmap (*), wrote:

“From the document it is clear that there are many challenges ahead of us in the design of future high-performance embedded systems. Some of them are familiar such as the memory wall, the power problem, and the interconnection bottleneck. Others are new like the proper support for reconfigurable components, fast simulation techniques for multi-core systems, new programming paradigms for parallel programming.”

(*) K. De Bosschere, W. Luk, X. Martorell, N. Navarro, M. O’Boyle, D. Pnevmatikatos, A. Ramirez, P. Sainrat, A. Seznec, P. Stenström, and O. Temam. “High-Performance Embedded Architecture and Compilation Roadmap”, Transactions on HiPEAC I, Lecture Notes in Computer Science 4050, pp. 5-29, Springer-Verlag, 2007

Parallelism

For at least three decades the promise of parallelism has fascinated researchers.

■ In the past, parallel computing efforts have shown promise and gathered investment, but in the end, uniprocessor computing always prevailed.

■ This time, general-purpose computing is taking an irreversible step toward parallel architectures.

● This shift toward increasing parallelism is not a triumphant stride forward based on breakthroughs in novel software and architectures for parallelism.

● This plunge into parallelism is actually a retreat from even greater challenges that thwart efficient silicon implementation of traditional uniprocessor architectures.

CW in Computer Architecture

Old & New Conventional Wisdom (CW) in Computer Architecture

1. Old CW: Power is free, but transistors are expensive. ▪ New CW is the “Power wall”: Power is expensive, but transistors are “free”. That is, we can put more transistors on a chip than we have the power to turn on.

2. Old CW: If you worry about power, the only concern is dynamic power. ▪ New CW: For desktops and servers, static power due to leakage can be 40% of total power.

3. Old CW: Monolithic uniprocessors in silicon are reliable internally, with errors occurring only at the pins. ▪ New CW: As chips drop below 65 nm feature sizes, they will have high soft and hard error rates.

4. Old CW: By building upon prior successes, we can continue to raise the level of abstraction and hence the size of hardware designs. ▪ New CW: Wire delay, noise, cross coupling (capacitive and inductive), manufacturing variability, reliability, clock jitter, design validation, and so on conspire to stretch the development time and cost of large designs at 65 nm or smaller feature sizes.

guiding principles illustrating how everything is changing in computing

5. Old CW: Researchers demonstrate new architecture ideas by building chips. ▪ New CW: The cost of masks at 65 nm feature size, the cost of Electronic Computer Aided Design software to design such chips, and the cost of design for GHz clock rates mean researchers can no longer build believable prototypes. Thus, an alternative approach to evaluating architectures must be developed.

6. Old CW: Performance improvements yield both lower latency and higher bandwidth. ▪ New CW: Across many technologies, bandwidth improves by at least the square of the improvement in latency.

7. Old CW: Multiply is slow, but load and store is fast. ▪ New CW is the “Memory wall”: Load and store is slow, but multiply is fast. Modern microprocessors can take 200 clocks to access Dynamic Random Access Memory (DRAM), but even floating-point multiplies may take only four clock cycles.

8. Old CW: We can reveal more instruction-level parallelism (ILP) via compilers and architecture innovation. Examples from the past include branch prediction, out-of-order execution, speculation, and Very Long Instruction Word systems. ▪ New CW is the “ILP wall”: There are diminishing returns on finding more ILP.

9. Old CW: Uniprocessor performance doubles every 18 months.

▪ New CW is Power Wall + Memory Wall + ILP Wall = Brick Wall. In 2006, performance is a factor of three below the traditional doubling every 18 months that we enjoyed between 1986 and 2002. The doubling of uniprocessor performance may now take 5 years.

10. Old CW: Don’t bother parallelizing your application, as you can just wait a little while and run it on a much faster sequential computer.

▪ New CW: It will be a very long wait for a faster sequential computer.

11. Old CW: Increasing clock frequency is the primary method of improving processor performance.

▪ New CW: Increasing parallelism is the primary method of improving processor performance.

12. Old CW: Less than linear scaling for a multiprocessor application is failure.

▪ New CW: Given the switch to parallel computing, any speedup via parallelism is a success.
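Some of the figures quoted in the CW list above can be checked with a little arithmetic. The Python sketch below is illustrative only: the function names are hypothetical, and the inputs are the numbers stated in items 6, 7 and 9 (square-of-latency rule, 200 versus 4 clock cycles, doubling every 18 months versus every 5 years).

```python
def annual_growth(doubling_time_years: float) -> float:
    """Yearly improvement rate implied by a given doubling time (CW 9)."""
    return 2.0 ** (1.0 / doubling_time_years) - 1.0


def min_bandwidth_gain(latency_gain: float) -> float:
    """Lower bound on bandwidth improvement for a given latency improvement (CW 6)."""
    return latency_gain ** 2


def multiplies_per_dram_access(dram_cycles: int, multiply_cycles: int) -> float:
    """How many multiplies fit in the time of one DRAM access (CW 7)."""
    return dram_cycles / multiply_cycles


if __name__ == "__main__":
    print(f"doubling every 18 months: {annual_growth(1.5):.1%} per year")
    print(f"doubling every 5 years:   {annual_growth(5.0):.1%} per year")
    print("latency improves 2x -> bandwidth improves at least",
          min_bandwidth_gain(2.0), "x")
    print("multiplies per DRAM access:", multiplies_per_dram_access(200, 4))
```

Read this way, the "Brick Wall" item corresponds to a drop from roughly 59% to roughly 15% improvement per year, and the "Memory wall" item means that a single DRAM access costs as much time as about fifty floating-point multiplies.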

Uniprocessor Performance (SPECint)

From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006

Sea change in chip design: multiple “cores” or processors per chip

• VAX: 25%/year 1978 to 1986
• RISC + x86: 52%/year 1986 to 2002
• RISC + x86: ??%/year 2002 to present

The State of Hardware

■ The analysis based on CW pairs paints a negative picture of the state of hardware.

■ There are compensating positives as well:
● Moore’s Law continues: it will soon be possible to put thousands of simple processors on a single, economical chip;
● Very low latency & very high bandwidth for the communication between these processors within a chip;
● Monolithic manycore microprocessors
- represent a very different design point from traditional multichip multiprocessors
- provide promise for the development of new architectures and programming models.

Applications and Dwarfs

■ Mining the parallelism experience of the high-performance computing community to see if there are lessons we can learn for a broader view of parallel computing.

The hypothesis
● is not that traditional scientific computing is the future of parallel computing
● is that the body of knowledge created in building programs that run well on massively parallel computers may prove useful in parallelizing future applications

■ Many of the authors from other areas, such as embedded computing, were surprised at how closely future applications in their domain mapped to problems in scientific computing.

■ The conventional way to guide and evaluate architecture innovation is to study a benchmark suite based on existing programs, such as EEMBC (Embedded Microprocessor Benchmark Consortium), SPEC (Standard Performance Evaluation Corporation) or SPLASH (Stanford Parallel Applications for Shared Memory).

■ It is currently unclear how to express a parallel computation best: a very big obstacle to innovation in parallel computing.

■ It seems unwise to let a set of existing source code drive an investigation into parallel computing.

■ There is a need to find a higher level of abstraction for reasoning about parallel application requirements.

■ The main aim is to delineate application requirements in a manner that is not overly specific to individual applications or the optimizations used for certain hardware platforms.

■ It is possible to draw broader conclusions about hardware requirements.

■ The approach is to define a number of “Dwarfs”, which each capture a pattern of computation and communication common to a class of important applications.

■ Phil Colella identified seven numerical methods that he believed will be important for science and engineering for at least the next decade

■ Seven Dwarfs
● constitute classes where membership in a class is defined by similarity in computation and data movement
● are specified at a high level of abstraction to allow reasoning about their behavior across a broad range of applications

Seven Dwarfs, their descriptions, corresponding NAS benchmarks, and example computers.

Extensions to the original Seven Dwarfs.

Recognition, Mining, Synthesis (RMS)

Intel’s RMS and how it maps down to functions that are more primitive. Of the five categories at the top of the figure, Computer Vision is classified as Recognition, Data Mining is Mining, and Rendering, Physical Simulation, and Financial Analytics are Synthesis. [Chen 2006]

Intel “Era of Tera” Computation Categories

Parallel Programming Models

Comparison of 10 current parallel programming models for 5 critical tasks, sorted from most explicit to most implicit. High-performance computing applications [Pancake and Bergmark 1990] and embedded applications [Shah et al 2004a] suggest these tasks must be addressed one way or the other by a programming model: 1) Dividing the application into parallel tasks; 2) Mapping computational tasks to processing elements; 3) Distribution of data to memory elements; 4) Mapping of communication to the inter-connection network; and 5) Inter-task synchronization.
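To show what "explicit" and "implicit" mean for these five tasks, here is a small Python sketch using the standard `concurrent.futures` module. It is an illustration only and not one of the ten models compared in the figure; `partial_sum` and `parallel_sum_of_squares` are hypothetical names. Threads are used to keep the example self-contained, so it shows the structure of the five tasks, not real CPU speedup in CPython.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(chunk):
    """Work done by one task: sum of squares over its own slice of the data."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, n_workers=4):
    # 1) Dividing the application into parallel tasks: one task per chunk (explicit).
    # 3) Distribution of data: each task receives only its own slice (explicit).
    size = max(1, -(-len(data) // n_workers))  # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]

    # 2) Mapping tasks to processing elements: left to the executor (implicit).
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # 4) Communication: arguments and results travel through futures (implicit).
        futures = [pool.submit(partial_sum, chunk) for chunk in chunks]
        # 5) Inter-task synchronization: result() blocks until a task is done (explicit).
        return sum(future.result() for future in futures)


if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(10))))
```

A more explicit model would also make the programmer place tasks on specific processing elements and route messages; a more implicit one would hide the chunking as well.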

Limits of Performance of Dwarfs

Limits to performance of dwarfs, inspired by a suggestion by IBM that a packaging technology could offer virtually infinite memory bandwidth. While the memory wall limited performance for almost half the dwarfs, memory latency is a bigger problem than memory bandwidth.

Transistor Integration Capacity

Pollack’s Rule
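The slide itself carries only the title and a figure. As a reminder of what the rule says, the sketch below uses its commonly cited form: single-core performance grows roughly with the square root of the core's complexity (area or transistor count). The function names are hypothetical, and the throughput comparison assumes perfectly parallel work, which real applications rarely offer.

```python
import math


def pollack_performance(relative_area: float) -> float:
    """Single-core performance relative to a baseline core of area 1.0."""
    return math.sqrt(relative_area)


def chip_throughput(total_area: float, core_area: float) -> float:
    """Aggregate throughput of identical cores filling the chip.

    Assumes perfectly parallel work, so per-core performance simply adds up.
    """
    n_cores = int(total_area // core_area)
    return n_cores * pollack_performance(core_area)


if __name__ == "__main__":
    # Same silicon budget (16 area units) spent three different ways.
    for core_area in (16.0, 4.0, 1.0):
        print(f"core area {core_area:>4}: throughput {chip_throughput(16.0, core_area)}")
```

Under these assumptions, sixteen small cores deliver four times the throughput of one large core on the same area, which is the usual argument for many simple cores.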

Frequency and Power Consumption

ManyCore System

Illustration of a Many Core System

Amdahl’s Law Limits Parallel Speedup

Amdahl's Law limits parallel speedup
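Amdahl's Law bounds the speedup by the serial fraction of the program: with a parallel fraction p and N cores, speedup = 1 / ((1 - p) + p / N). A minimal sketch follows; the function name is chosen for illustration.

```python
def amdahl_speedup(parallel_fraction, cores):
    # Serial part runs at the original speed; parallel part is divided
    # evenly across the cores.
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)
```

For example, a program that is 90% parallel cannot exceed a speedup of 10 however many cores are added, since the serial 10% remains.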

Page 48: Roberto Vaccaro & Lorenzo Verdoscia Institute for High Performance Computing and Networking

Core Performances

Performance of Large, Medium, and Small Cores

Page 49: Roberto Vaccaro & Lorenzo Verdoscia Institute for High Performance Computing and Networking

Fine Grain Power Management

Fine grain power management

Page 50: Roberto Vaccaro & Lorenzo Verdoscia Institute for High Performance Computing and Networking

Network Power Estimate

Network power estimate

Page 51: Roberto Vaccaro & Lorenzo Verdoscia Institute for High Performance Computing and Networking

Three Dimensional Interconnect With Stacking

Three dimensional interconnect with stacking

Page 52: Roberto Vaccaro & Lorenzo Verdoscia Institute for High Performance Computing and Networking

Assembly of 3D Memory

Assembly of 3D memory

Page 53: Roberto Vaccaro & Lorenzo Verdoscia Institute for High Performance Computing and Networking

Recommended points from Berkeley

■ The overarching goal should be to make it easy to write programs that execute efficiently on highly parallel computing systems

■ The target should be 1000s of cores per chip, as these chips are built from processing elements that are the most efficient in MIPS per watt, MIPS per area of silicon, and MIPS per development dollar.

■ Instead of traditional benchmarks, use 13 “Dwarfs” to design and evaluate parallel programming models and architectures.

A dwarf is an algorithmic method that captures a pattern of computation and communication.

“Autotuners” should play a larger role than conventional compilers in translating parallel programs.

■ To maximize programmer productivity, future programming models must be more human-centric than the conventional focus on hardware or applications.

■ To be successful, programming models should be independent of the number of processors.

Page 54: Roberto Vaccaro & Lorenzo Verdoscia Institute for High Performance Computing and Networking

Recommended points from Berkeley

■ To maximize application efficiency, programming models should support a wide range of data types and successful models of parallelism: task-level parallelism, word-level parallelism, and bit-level parallelism.

■ Architects should not include features that significantly affect performance or energy if programmers cannot accurately measure their impact via performance counters and energy counters.

■ Traditional operating systems will be deconstructed and operating system functionality will be orchestrated using libraries and virtual machines.

■ To explore the design space rapidly, use system emulators based on FPGAs that are highly scalable and low cost.

Maybe they missed some key points, for example:

■ Whenever possible, computation should execute asynchronously.

Page 55: Roberto Vaccaro & Lorenzo Verdoscia Institute for High Performance Computing and Networking

Why Asynchronous

■ Low power consumption,

… due to fine-grain clock gating and zero standby power consumption.

■ High operating speed,

… operating speed is determined by actual local latencies rather than global worst-case latency.

■ Less emission of electro-magnetic noise,

… the local clocks tend to tick at random points in time.

■ Robustness towards variations in supply voltage, temperature, and fabrication process parameters,

… timing is based on matched delays (and can even be insensitive to circuit and wire delays).

■ Better composability and modularity,

… because of the simple handshake interfaces and the local timing.

■ No clock distribution and clock skew problems,

… there is no global signal that needs to be distributed with minimal phase skew across the circuit.
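The operating-speed point above can be made concrete with a deliberately simplified timing model. It assumes a clocked unit pays the worst-case latency on every operation while a self-timed unit pays only the actual latency of each one; handshake overhead is ignored, and the function names and latency values are hypothetical.

```python
def synchronous_time(latencies, worst_case):
    # A global clock period must cover the worst case on every operation.
    return len(latencies) * worst_case


def asynchronous_time(latencies):
    # A self-timed unit signals completion as soon as each operation is done.
    return sum(latencies)
```

For latencies `[2, 3, 2, 7, 2]` with a worst case of 7, the clocked model takes 35 time units and the self-timed model 16. Real circuits narrow this gap, since the handshake itself costs time.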

Page 56: Roberto Vaccaro & Lorenzo Verdoscia Institute for High Performance Computing and Networking

Auto-tuners
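An auto-tuner, in the sense used here, tries several variants of a kernel on the target machine and keeps the fastest. The following is a minimal sketch of that idea, not any particular tool: `blocked_sum` is a hypothetical kernel whose block size is the tuning parameter, and `autotune` simply times each candidate.

```python
import time


def blocked_sum(data, block):
    # Hypothetical kernel: sum a list block by block.
    total = 0
    for start in range(0, len(data), block):
        total += sum(data[start:start + block])
    return total


def autotune(kernel, data, candidates, repeats=3):
    # Time each candidate parameter and return the fastest one.
    best_param, best_time = None, float("inf")
    for param in candidates:
        elapsed = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            kernel(data, param)
            # Keep the best of several runs to reduce timing noise.
            elapsed = min(elapsed, time.perf_counter() - t0)
        if elapsed < best_time:
            best_param, best_time = param, elapsed
    return best_param
```

A call such as `autotune(blocked_sum, list(range(10000)), [16, 256, 4096])` returns whichever block size ran fastest on the machine at hand, so the result can differ between machines. That machine-specific answer is the point of empirical tuning.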

Page 57: Roberto Vaccaro & Lorenzo Verdoscia Institute for High Performance Computing and Networking

Computational Model

■ Designing clever parallel hardware and then working out how to program it is a big mistake.

■ Designing parallel programming languages and then working out how to implement them is usually a mistake.

■ Developing the right computational model alongside languages and hardware is the key.

Page 58: Roberto Vaccaro & Lorenzo Verdoscia Institute for High Performance Computing and Networking

Computational Model

■ Think about systems, not just hardware or software.

■ There is lots of (possibly) relevant work, e.g.
- Dataflow (Single Assignment)
- Graph Rewriting (Functional Languages)
- Bulk Synchronous Parallelism (BSP)
- Transactional Memory

■ Don’t ignore previous work and, in particular, don’t re-invent the wheel!

Page 59: Roberto Vaccaro & Lorenzo Verdoscia Institute for High Performance Computing and Networking

Language Effectiveness

[Figure: Language Effectiveness for C, C++, and Java, plotted from 1970 to 2005 on a linear scale from 0 to 40]

Page 60: Roberto Vaccaro & Lorenzo Verdoscia Institute for High Performance Computing and Networking

Language Effectiveness

[Figure: Language Effectiveness compared with Moore's Law, plotted from 1970 to 2005 on a logarithmic scale from 1 to 10,000,000]

Page 61: Roberto Vaccaro & Lorenzo Verdoscia Institute for High Performance Computing and Networking

CISC Architecture

■ Huge effort into improving performance of sequential instruction stream

■ Complexity has grown unmanageable

■ Even with 1 billion transistors on a chip, what more can be done?

[Figure labels: Renaming, Out-of-Order Execution, Pipelining, Speculative Execution, Prefetching, Branch Prediction, Value Prediction]

Page 62: Roberto Vaccaro & Lorenzo Verdoscia Institute for High Performance Computing and Networking

TRIPS Prototype

Page 63: Roberto Vaccaro & Lorenzo Verdoscia Institute for High Performance Computing and Networking

Cyclops-64 Architecture

Cyclops-64 Programming Models and System Software Supports

[Figure: Cyclops-64 software stack and chip block diagram. The labels below are those recoverable from the diagram; their grouping is approximate.]

Software stack labels:
- Programming models: UPC +/-, Co-array Fortran, OpenMP-XN, EARTH-C +/-, MPI, …
- Application Programming API
- Cyclops Thread Virtual Machine
- Thread Management: Thread Creation & Termination, Scheduling, Dynamic memory management
- Shared Memory Operations: Put / get with sync, acquire / release
- fibers, async function invocation
- Fine-Grain Multithreading: Thread Synchronization, Load Balancing, Put / get, Others
- Kcc/gcc Compiler Tool chain
- System Software: Location Consistency, Percolation
- Advanced Execution/Programming Model
- Infrastructure and Tools: Simulation / Emulation, Analytical Modeling
- Base Execution Model: Fine-Grain Multithreading (e.g. EARTH, CARE)
- Cyclops-64 ISA

Chip and system labels:
- TU, SP, and FPU blocks
- Crossbar Network connected to Memory Banks
- A-Switch, DMA
- Communication Ports for 3D Mesh Inter-Chip Network (Other Chips via 3D mesh)
- Off-Chip Memory
- IDE HDD, 1 Gbit/s Ethernet
- Bandwidth labels: 4 GB/sec * 6, 4 GB/sec, 50 MB/sec
- 24x24, 24 PC cards in 1 shishkebab, 1 PetaFlops

Page 64: Roberto Vaccaro & Lorenzo Verdoscia Institute for High Performance Computing and Networking

hHLDS

The homogeneous High Level Dataflow System (hHLDS) model

Firing rules in the classical model

Let $A = \{a_1, \ldots, a_n\}$ be the set of actors and $L = \{l_1, \ldots, l_n\}$ be the set of links.

A dataflow graph is a labelled directed graph $G = (N, E)$, where $N = A \cup L$ is the set of nodes and $E \subseteq (A \times L) \cup (L \times A)$ is the set of edges.

Firing of an actor: a token on each input link and no token on each output link.
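The classical firing rule can be sketched as a small Python model. The `Link` and `Actor` classes are hypothetical names for illustration; a link holds at most one token, and `None` is used here to mean "no token".

```python
class Link:
    def __init__(self):
        self.token = None  # None means the link is empty


class Actor:
    def __init__(self, op, inputs, outputs):
        self.op = op
        self.inputs = inputs
        self.outputs = outputs

    def can_fire(self):
        # Classical rule: a token on each input link
        # and no token on each output link.
        return (all(link.token is not None for link in self.inputs)
                and all(link.token is None for link in self.outputs))

    def fire(self):
        # Consume the input tokens and place the result on every output link.
        args = [link.token for link in self.inputs]
        for link in self.inputs:
            link.token = None
        result = self.op(*args)
        for link in self.outputs:
            link.token = result
```

Note the second condition in `can_fire`: an actor stays blocked until its output link has been emptied by a downstream actor.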

Page 65: Roberto Vaccaro & Lorenzo Verdoscia Institute for High Performance Computing and Networking

hHLDS

The hHLDS model

Special actors in the classical model (Merge, Switch, Decider, Gate) are characterized by having heterogeneous I/O conditions.

[Figure: the Merge, Switch, Decider, and Gate actors, with their data links and T/F control inputs]

Page 66: Roberto Vaccaro & Lorenzo Verdoscia Institute for High Performance Computing and Networking

hHLDS

Any actor has two input links and one output link and consumes and produces only data tokens

Firing of an actor: a token on each input link.

Effect: the actor consumes all input tokens and can produce a token on its output link.

[Figure: hHLDS graphs for the expression a+b*c and for the conditional "if b≤c then a"]
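The homogeneous firing rule can be sketched for the expression a+b*c. This is an illustrative model only, not the authors' implementation: `Actor2` is a hypothetical class with exactly two input slots and one output slot, and `None` stands for "no token".

```python
class Actor2:
    """hHLDS-style actor: two input links, one output link."""

    def __init__(self, op):
        self.op = op
        self.inputs = [None, None]
        self.output = None

    def can_fire(self):
        # Only condition: a token on each input link.
        return all(token is not None for token in self.inputs)

    def fire(self):
        # Consume both input tokens; op may return None,
        # in which case no output token is produced.
        left, right = self.inputs
        self.inputs = [None, None]
        self.output = self.op(left, right)


def evaluate_a_plus_b_times_c(a, b, c):
    mul = Actor2(lambda x, y: x * y)
    add = Actor2(lambda x, y: x + y)
    mul.inputs = [b, c]
    add.inputs[0] = a          # "+" still waits for the product
    mul.fire()
    add.inputs[1] = mul.output  # the link carries the token to "+"
    add.fire()
    return add.output
```

Unlike the classical rule, `can_fire` does not inspect the output link, so every actor has the same I/O condition.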

Page 67: Roberto Vaccaro & Lorenzo Verdoscia Institute for High Performance Computing and Networking

hHLDS

Comparison between the two models

[Figure: two dataflow graphs, a) and b), for the program below, one in the classical model and one in the hHLDS model]

input (a, c)
b := 1;
repeat
    if a > 1 then a := a \ 2
    else a := a * 5
    b := b * 3;
until b = c;
output (d)
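For readers who want to trace the program, here is a direct Python transcription under three stated assumptions: `\` is read as integer division, the output `d` (never assigned in the listing) is taken to be the final value of `a`, and an iteration cap is added because the loop ends only when `c` is a power of 3 that `b` can reach. The function name and the cap are not part of the original.

```python
def run_program(a, c, max_iterations=1000):
    b = 1
    for _ in range(max_iterations):
        if a > 1:
            a = a // 2   # assumption: "\" denotes integer division
        else:
            a = a * 5
        b = b * 3
        if b == c:       # "until b = c"
            return a     # assumption: d is the final value of a
    raise RuntimeError("b never became equal to c")
```

For instance, `run_program(8, 27)` runs three iterations (a goes 8, 4, 2, 1 while b goes 1, 3, 9, 27) and returns 1.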

The hHLDS model

Page 68: Roberto Vaccaro & Lorenzo Verdoscia Institute for High Performance Computing and Networking

Dataflow Computational Model

[Figure: dataflow computational model, with labels: Initial values, DATA, "+" actors, Results, memory]