
  • SoCrates

    - A Scalable Multiprocessor System On Chip

    Authors

    Mikael Collin, Mladen Nikitovic, and Raimo Haukilahti

    {mci,mnc,rht}@mdh.se

    Supervisors

    Johan Starner and Joakim Adomat

    Examiner

    Lennart Lindh

    Department of Computer Engineering

    Computer Architecture Lab

    Mälardalen University

    Box 883, 721 23 Västerås

    Abstract

    This document is the result of a Master's Thesis in Computer Engineering, describing the analysis, specification and implementation of the first prototype of SoCrates, a configurable, scalable and predictable System-on-Chip multiprocessor platform for real-time applications. The design time of a System-on-a-Chip (SoC) is rapidly increasing today due to high complexity and a lack of efficient tools for development and verification. By combining all functions on one chip, the system becomes smaller, faster, and less power consuming, but the complexity increases. To decrease time-to-market, SoCs are entirely or partially built with IP-components. Thanks to SoC, a whole new domain of products, like small hand-held devices, has emerged. The concept has been around for a few years now, but there are still challenges that need to be resolved. There is a lack of standards enabling fast mix and match of cores from different vendors. Further needs are new design methods, tools, and verification techniques. SoC solutions need a special kind of CPU that consumes less power and is cheaper and smaller, but still meets high performance requirements. To fulfill all these demands, CPUs are getting more and more complex as the number of transistors grows rapidly, which has led to the emergence of multiprocessor systems-on-a-chip. Our initial question is to investigate whether it is possible to build these complex multiprocessor systems on a single FPGA and whether such solutions can lead to shorter time-to-market. Consumer demand for cheaper and smaller products makes FPGA solutions interesting. Our approach is to have multiple processing nodes, each containing a processing unit, memory and a network interface, all connected on a shared bus. A central in-house developed hardware real-time unit handles scheduling and synchronization. We have designed and implemented an MSoC that fits on a single FPGA in only 40 days, which to our supervisors' knowledge has not been accomplished before. Our experience is that a tightly coupled group can produce fast results, since information, new ideas and bug reports propagate immediately.

    SoCrates stands for SoC for Real-Time Systems.

  • Introduction

    This report describes the design of the first prototype of SoCrates, a generic scalable platform generator which creates a synthesizable HDL description of a multiprocessor system. The goal was to build a predictable multiprocessor system on a single FPGA with mechanisms for prefetching data and an in-house developed integrated hardware real-time unit.

    The report consists of three parts. The first part, Computer Architecture for System on Chip, is a state-of-the-art report introducing basic SoC terminology and practice with a deeper analysis of CPUs, interconnects and memory hierarchies. The purpose of this analysis was to learn about state-of-the-art techniques for designing complex multiprocessor SoCs. The design process resulted in part two, SoCrates - Specifications, which describes the prototype and each individual part's functionality and specific demands. Part three, SoCrates - Implementation Details, describes the implementation of all parts, how to configure the system, and how to compile and link the system software. We also present synthesis results and suggest future work that can be done to improve the system.

  • SoCrates

    - Document Index

    Document 1: Computer Architecture for System on Chip - A State of the Art Report
      1. Introduction
      2. Embedded CPU
      3. Interconnect
      4. Memory System
      5. Summary

    Document 2: SoCrates Specifications
      1. System Architecture
      2. CPU Node
      3. CPU
      4. Network Interface
      5. IO Node
      6. Interconnect
      7. Arbitration
      8. Boot
      9. Memory Wrapper

    Document 3: SoCrates - Implementation Details
      1. CPU
      2. Network Interface
      3. Arbiter
      4. Compiling & Linking the System Software
      5. Configuring the SoCrates Platform
      6. Current Results
      7. Future Work
      8. Conclusions

    Document 4: Appendix
      1. Demo Application
      2. I/O Routines
      4. Task Switch Routines
      5. Linker Scripts
      6. DATE 2001 Conference, Designers Forum, publication

  • Computer Architecture for System on Chip

    - A State of the Art Report

    Revision: 1.0

    Authors

    Mikael Collin, Mladen Nikitovic, and Raimo Haukilahti

    {mci,mnc,rht}@mdh.se

    Supervisors

    Johan Starner and Joakim Adomat

    Department of Computer Engineering

    Computer Architecture Lab

    Mälardalen University

    Box 883, 721 23 Västerås

    May 20, 2000

    Abstract

    This state-of-the-art report introduces basic SoC terminology and practice, with a deeper analysis of three architectural components: the CPU, the interconnect, and the memory hierarchy. A short historical view is presented before going into today's trends in SoC architecture and development. The SoC concept is not new, but there are challenges that have to be met to satisfy customer demands for faster, smaller, cheaper, and less power consuming products, today and in the future. This document is the first of three documents that form a Master's Thesis in Computer Engineering.

  • Contents

    1 Introduction
      1.1 What is SoC?
      1.2 SoC Designs
        1.2.1 Intellectual Property
        1.2.2 An Example of a SoC
      1.3 Why SoC?
        1.3.1 Motivation
        1.3.2 State of Practice and Trends
        1.3.3 Challenges
      1.4 Introduction to Computer System Architecture
        1.4.1 Computer System
      1.5 Research & Design Clusters
        1.5.1 Hydra: A next generation microarchitecture
        1.5.2 Self-Test in Embedded Systems (STES)
        1.5.3 Socware
        1.5.4 The Pittsburgh Digital Greenhouse
        1.5.5 Cadence SoC Design Centre

    2 Embedded CPU
      2.1 Introduction
      2.2 The Building Blocks of an Embedded CPU
        2.2.1 Register File
        2.2.2 Arithmetic Logic Unit
        2.2.3 Control Unit
        2.2.4 Memory Management Unit
        2.2.5 Cache
        2.2.6 Pipeline
      2.3 The Microprocessor Evolution
      2.4 Design Aspects
        2.4.1 Code Density
        2.4.2 Power Consumption
        2.4.3 Performance
        2.4.4 Predictability
      2.5 Implementation Aspects
      2.6 State of Practice
        2.6.1 ARM
        2.6.2 Motorola
        2.6.3 MIPS
        2.6.4 Patriot Scientific
        2.6.5 AMD
        2.6.6 Hitachi
        2.6.7 Intel
        2.6.8 PowerPC
        2.6.9 Sparc
      2.7 Improving Performance
        2.7.1 Multiple-issue Processors
        2.7.2 Multithreading
        2.7.3 Simultaneous Multithreading
        2.7.4 Chip Multiprocessor
        2.7.5 Prefetching
      2.8 Measuring Performance
        2.8.1 Benchmarking
        2.8.2 Simulation
      2.9 Trends and Research
        2.9.1 University
        2.9.2 Industry

    3 Interconnect
      3.1 Introduction and basic definitions
      3.2 Bus based architectures
        3.2.1 Arbitration mechanisms
        3.2.2 Synchronous versus asynchronous buses
        3.2.3 Performance metrics
        3.2.4 Pipelining and split transactions
        3.2.5 Direct Memory Access
        3.2.6 Bus hierarchies
        3.2.7 Connecting multiprocessors
      3.3 Case studies of bus standards with multiprocessor support
        3.3.1 FutureBus+
        3.3.2 VME
        3.3.3 PCI
      3.4 Point-to-point interconnections
        3.4.1 Interconnection topologies
      3.5 Interconnect performance & scaling
        3.5.1 Performance measures
        3.5.2 Shared buses
        3.5.3 Point-to-point architectures
      3.6 Interconnecting components in a SoC design
        3.6.1 VSIA efforts
        3.6.2 Differences between standard and SoC interconnects
      3.7 Case studies of existing SoC-Interconnects
        3.7.1 AMBA 2.0
        3.7.2 CoreConnect
        3.7.3 CoreFrame
        3.7.4 FPIbus
        3.7.5 FISPbus
        3.7.6 IPBus
        3.7.7 MARBLE
        3.7.8 PI-Bus
        3.7.9 SiliconBackplane
        3.7.10 WISHBONE Interconnect
        3.7.11 Motorola Unified Peripheral Bus
      3.8 Case studies of SoC multiprocessor interconnects
        3.8.1 Hydra
        3.8.2 Silicon Magic's DVine

    4 Memory System
      4.1 Semiconductor memories
        4.1.1 ROM
        4.1.2 RAM
      4.2 Memory hierarchy
      4.3 Cache memories
        4.3.1 Cache: the general case
        4.3.2 The nature of cache misses
        4.3.3 Storage strategies
        4.3.4 Replacement policies
        4.3.5 Read policies
        4.3.6 Write policies
        4.3.7 Improving cache performance
      4.4 MMU
      4.5 Multiprocessor architectures
        4.5.1 Symmetric Multiprocessors
        4.5.2 Distributed memory
        4.5.3 COMA: Cache Only Memory Access
        4.5.4 Coherence
        4.5.5 Coherence through bus-snooping
        4.5.6 Directory-based coherence
      4.6 Hardware-driven prefetching
        4.6.1 One-Block-Lookahead (OBL)
        4.6.2 Stream buffer
        4.6.3 Filter buffers
        4.6.4 Opcode-driven cache prefetch
        4.6.5 Reference Prediction Table (RPT)
        4.6.6 Data preloading
        4.6.7 Prefetching in multiprocessors

    5 Summary

  • 1 Introduction

    This State of the Art Report covers computer architecture topics with an emphasis on System on Chip (SoC). The reader is introduced to the basic ideas behind SoC and general computer architecture concepts, before an in-depth analysis of three important SoC components is presented: the CPU, the interconnect and the memory architecture.

    1.1 What is SoC?

    SoC stands for System-on-Chip and is a term for putting a complete system on a single piece of silicon. SoC has become a very popular word in the computer industry, but very few agree on a general definition of SoC [19]. There are several alternative names for putting a system on a chip, such as system-on-silicon, system-on-a-chip, system-LSI, system-ASIC, and system-level integration (SLI) device [33]. Some might say that a large design automatically makes it a SoC, but that would probably include every existing design today. A better approach would be to say that a SoC should include different structures such as a CPU-core, embedded memory and peripheral cores. This is still a wide definition, which could imply that any modern processor with an on-chip cache should be included in the SoC community. Therefore a more suitable definition of SoC would be:

    A complete system on a single piece of silicon, consisting of several types of modules including at least one processing unit designated for software execution, where the system depends on no or very few external components in order to execute its task.

    1.2 SoC Designs

    In the beginning, almost all SoCs were simply integrations of existing board-level designs [20]. This way of designing a system loses many benefits that could otherwise be taken advantage of if the system were designed from scratch. Another approach is to use already existing modules, called IP-components, and to integrate them into a complete system suitable for a single die.

    1.2.1 Intellectual Property

    When something is protected through patents, copyrights, trademarks or trade secrets, it is considered Intellectual Property (IP). Only patents and copyrights are relevant for IP-components [13] (also referred to as macros, cores and Virtual Components (VC) [10]). An IP-component is a pre-implemented, reusable module, for example a DMA-controller or a CPU-core. There are several companies that make their living by building, licensing and selling IP-components, for which the semiconductor companies pay both fees and royalties.¹ There exist three classes of IP-components with different properties regarding portability and protection characteristics. As the portability decreases through the classes, the protection increases.

    Soft This class of IP-components has its architecture specified at Register-Transfer Level (RTL), which is synthesizable. Soft IPs are functionally validated and are very portable and modifiable. Since they are not mapped to a specific technology, their behavior with respect to area, speed, and power consumption is unpredictable. Much work still needs to be done before the component can be utilized, and the end result depends on the synthesis tools used.

    Firm The firm class components are in general soft components that have been floorplanned and synthesized into one or several different technologies to get better estimations of area, speed, and power consumption.

    ¹ There are exceptions where one can acquire IP-components without any licensing or royalty fees. More information can be found at http://www.openip.org/.


    Hard Hard IPs are a further refinement of firm components. They are fully synthesized to mask level and physically validated. Very little work has to be done in order to implement the functionality in silicon. Hard IPs are neither modifiable nor portable, but the prediction of their area, speed, and power consumption is very accurate.

    1.2.2 An Example of a SoC

    A typical SoC consists of a CPU-core, a Digital Signal Processor (DSP), some embedded memory, and a few peripherals such as DMA, I/O, etc. (Figure 1). The CPU can perform several tasks with the assistance of a DSP when needed. The DSP is usually responsible for off-loading the CPU by doing numerical calculations on the incoming signals from the A/D-converter. The SoC could be built of only third-party IP-components, or it could be a mixture of IP-components and custom-made solutions. More recently, there have been efforts to implement a Multiprocessor System on Chip (MSoC) [6], which introduces new challenges regarding cost, performance, and predictability.

    Figure 1: An example of a SoC

    1.3 Why SoC?

    The first computer systems, consisting of relays and later vacuum tubes, used to occupy whole rooms, and their performance was negligible compared to today's standard workstations. The advent of the transistor in 1948 enabled engineers to shrink a functional block to an Integrated Circuit (IC). These ICs made it possible to build complex functions by combining several ICs on a circuit board. Further development of process technology increased the number of transistors on each IC, which led to the emergence of systems-on-board. Since then, there has been a constant battle between semiconductor companies to deliver the fastest, smallest and cheapest products, resulting in today's multi-billion dollar industry. Even though the SoC concept has been around for quite some time, it has not really been fully feasible until recent years, due to advances like deep sub-micron CMOS process technology.

    1.3.1 Motivation

    There are several reasons why SoC is an attractive way to implement a system. Today's refined manufacturing processes make it possible to combine both logic and memory on a single die, thus decreasing overall memory access times. Given that the application's memory requirement is small enough for the on-chip embedded memory, memory latency will be reduced due to the elimination of data traffic between separate chips. Since there is no need to access memory on external chips, the number of pins can also be reduced, and on-board buses become obsolete. Encapsulation accounts for over 50% of the overall process cost of chip manufacturing [15]. In comparison to an ordinary system-on-board, a SoC uses one or very few ICs, reducing total encapsulation cost and thereby total manufacturing cost. These characteristics, as well as lower power consumption and shorter time-to-market, enable smaller, better, and cheaper products reaching the consumers at an altogether faster rate.


    1.3.2 State of Practice and Trends

    Until now, much of SoC implementation has been about shrinking existing board-level systems onto a single chip, with little or no consideration of the benefits that could be gained from a chip-level design. Another approach to SoC is to interconnect several dies and place them inside one package. Such modules are called Multi-Chip Modules (MCM). The Hydra Multiprocessor Project at first chose an MCM implementation, which later evolved into a SoC [14, 6].

    Today it is too time-consuming for companies to implement a system from scratch. Instead, a faster and more reliable way is to use in-house or third-party pre-implemented IP-components [3], which makes designing a whole system more about integrating components than designing them. There exist three design methodologies, each with its own efficiency and cost regarding SoC design [16, 18]. The vendor design approach, which shifts the design responsibilities from the system designers to the ASIC vendors, can result in the lowest die cost, but it can also lead to higher engineering costs and longer time-to-market. A more flexible method is the partial integration approach, which divides the responsibilities of the design more equally. It lets the system designers produce the ASIC design, while the semiconductor vendors are responsible for the core and integration. This method gives the system designers more control of the working process in comparison to the vendor method. Yet more flexible is the desktop approach, which leaves the semiconductor vendors only to design the core. This reduces time-to-market and requires low engineering costs. A key property for IP-components in the future is parameterization of soft cores [16].

    There is a continuous growth in the demand for "smart products", which are expected to make our lives better and simpler. Recently, SoC products have begun to emerge on several markets in the form of Application Specific Standard Products (ASSP)² or Application Specific Instruction-set Processors (ASIP)³:

    Set-top-boxes A Set-Top-Box (STB) is a device that makes it possible for television viewers to access the Internet and also watch digital television (D-TV) broadcasts. The user has access to several services: weather and traffic updates, on-line shopping, sport statistics, news, e-commerce, etc. Integrating the STB's different components into a SoC simplifies system design and makes the product more competitive through shorter time-to-market, lower cost and lower power consumption. The Geode SC1400 chip is an example of a SoC used in an STB that meets the demands of delivering both high-quality DVD video and Internet accessibility [34].

    Cell phones A SoC in a cell phone will reduce its size and weight, and make it cheaper and less power consuming.

    Home automation Many domestic appliances at home will be "smarter". For example, the refrigerator will be able to notify its owner when a product is missing and place an order on the Internet.

    Hand-held devices A new generation of hand-held devices is coming that can send and receive email and faxes, make calls, surf the Web, etc. A SoC solution is especially suited for portable applications such as hand-held PCs, digital cameras, personal digital assistants and other hand-held devices, because its built-in peripheral functions minimize overall system cost and eliminate the need to use and configure additional components.

    1.3.3 Challenges

    One of the emerging challenges is to standardize the interfaces of IP-components to make integration and composition easier. Many different on-chip bus standards have been created by the different design houses to enable fast integration of IP-components, which has resulted in incompatibility caused by the different interfaces. To solve this dilemma, the Virtual Socket Interface Alliance (VSIA) was founded to enable the mix and match of IP-components from multiple sources, by proposing a hierarchical solution that enables multiple buses [17]. Still, some criticize VSIA for only addressing simple data flows [11]. More can be read about different on-chip bus standards in section 3.7.

    ² A high-integration chip or chipset for a specific application [59].

    ³ A field- or mask-programmable processor whose architecture and instruction set are optimized for a specific application domain [58].

    Since time-to-market is decreasing, the testing and verification of a SoC must be done very fast. When reusing IP-components, it is possible that the test development actually takes longer than the work to integrate the different functional parts [12]. The fact that the components come from different sources and may have different test methodologies complicates the testing of the whole system. In a board-level design, many of the components have their primary inputs visible, which makes testing easier, but SoCs contain deeply embedded components where there is little or no possibility to observe signals directly from an IP-component after manufacturing. Since the on-chip interconnect is inside the chip, it is also hard to test, due to the lack of observability.

    As the future lurks behind the door, integration is not likely to stop with IP-components and different memory technologies; we are also likely to see a variety of analog functionality. Analog blocks are very layout and process dependent, require isolation, and use different voltages and grounds. All these facts make them difficult to integrate in a design [10]. Are there limits to the integration urge? As process technologies become more sophisticated, transistor switching speed will increase and the voltage for logic levels will decrease. Dropping the voltages will make the units more sensitive to noise. Analog devices with higher voltage needs can encounter problems working properly in those environments [17].

    Apart from the lack of effective design and test methodologies [29] and all the technical problems with mapping a complex design, consisting of several IP-components from different design houses, onto a particular silicon platform, there are complex business issues dealing with licensing fees and royalty payments [30].

    1.4 Introduction to Computer System Architecture

    SoC is about putting a whole system on a single piece of silicon. But what is a system? This section serves as an introduction to computer system architecture and tries to give the reader a better understanding of what is actually put onto a SoC.

    1.4.1 Computer System

    In general, a typical computer system (Figure 2) consists of one or more CPUs that execute a program by fetching instructions and data from memory. To be able to access the memory, the CPU needs some kind of interface and a connection to it. The interface is usually provided by the Memory Management Unit (MMU), and the connection is handled by the interconnect. The local interconnect is often implemented as a bus consisting of a number of electrical wires. Sometimes the CPU needs assistance in fetching large amounts of data in order to be effective. This work can be done in parallel with the CPU by the Direct Memory Access (DMA) component. The system also needs some means to communicate with the outside world; this is provided by the I/O system. We proceed with a closer look at the important components that comprise a computer system.

    CPU The CPU is where arithmetic, logic, branching and data transfer are implemented [8]. It consists of registers, an Arithmetic Logic Unit (ALU) for computations, and a control unit. A CPU is classified as a Complex Instruction Set Computer (CISC) if the instruction set is complex (e.g. has a lot of instructions, several addressing modes, different instruction word lengths, etc.). The idea behind a Reduced Instruction Set Computer (RISC) is to make use of a limited instruction set to maximize performance on common instructions by working with a lot of registers, while taking penalties on the load and store instructions. RISC has a uniform length for all instructions and very few addressing modes. This uniformity is the main reason why the approach is suitable for instruction pipelining, in order to increase performance. There are other architectures that further increase performance, for example superscalar, VLIW, and vector computers. A machine is called an n-bit machine if it operates internally on n-bit data [8]. Today a lot of embedded processors still work with 8- or 16-bit words, while the majority of workstations and PCs are 32- or 64-bit machines.


    [Figure 2: A typical computer system - a CPU with cache, a DMA controller with its DMA device, and main memory, connected by address and data lines over a system bus]

    Memory System The key to an effective computer system is to implement an efficient memory hierarchy, because the latency of memory accesses has become a significant factor when executing a memory-accessing instruction. In the last decade the gap between memory and CPU speed has been growing. Memory sub-systems must be built to overcome their shortcomings relative to the processor, which otherwise result in computational time wasted waiting for memory operations to complete. Memory is often organized in a primary and a secondary part, where the primary memory usually consists of one or several RAM circuits. The secondary part is for long-term storage, like Hard Disk Drives (HDDs), disks, tapes, optical media, and sometimes FLASH memories. To bridge the gap between the memory and the CPU, nearly all modern processors have a cache that makes use of the inherent locality of data and code and that ideally can deliver data to the CPU in just one clock cycle. Usually there exist several levels of cache between the main memory and the CPU, each with different sizes and optimizations. The memory system interfaces to the outside world (e.g. processor and I/O) via the MMU, which has the responsibility to translate addresses and to fetch data from memory via the memory bus. In multiprocessor systems there is the issue of whether memory should be local to every node or global. (A small numeric sketch of the cache's effect on access time follows this component overview.)

    Interconnect A computer system's internal components need to communicate in order to perform their tasks. To make communication between components possible, an interconnect is usually used. An interconnect can be designed in a variety of ways, called topologies. Examples of topologies are bus, switch, crossbar, mesh, hypercube, torus, etc. Each topology has its own characteristics concerning latency, scalability and performance.

    I/O System The Input/Output system is the computer system's interface to the outside world, which enables it to receive input and output results to the user. Examples of I/O devices include HDDs, graphics and sound systems, serial ports, parallel ports, keyboards, mice, etc. The transfer of I/O data is usually taken care of by the DMA component, to off-load the CPU from constant data transfer.
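    To make the memory-gap discussion above concrete, the standard average-memory-access-time relation can be worked through in a few lines. The following C sketch uses assumed, illustrative cycle counts (they are not measurements from any system discussed here) to show how even a modest miss rate inflates the effective access time:

        /* Average memory access time (AMAT) for a single cache level:
           AMAT = hit time + miss rate * miss penalty.
           All cycle counts below are assumed figures for illustration. */
        #include <stdio.h>

        int main(void)
        {
            double hit_time     = 1.0;   /* cycles: the ideal one-cycle cache hit */
            double miss_rate    = 0.05;  /* 5% of accesses miss in the cache */
            double miss_penalty = 30.0;  /* cycles to fetch the data from main memory */

            double amat = hit_time + miss_rate * miss_penalty;
            printf("average memory access time: %.2f cycles\n", amat);  /* 2.50 */
            return 0;
        }

    Even with only one access in twenty missing, the effective latency is two and a half times the ideal one-cycle hit, which is why the multi-level hierarchies described above are used.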

    1.5 Research & Design Clusters

    There is a lot of research effort in computer architecture, which is of course related to some degree to SoCs, since they are all actually computers. Unlike most research areas, SoC research is led by industry and not by the universities. Of those universities that have SoC-related research projects, very few have reached the implementation stage.


    1.5.1 Hydra: A next generation microarchitecture

    The Stanford Hydra single-chip multiprocessor [6] started out as a Multi-Chip Module (MCM) in 1994, but evolved in 1997 to become a Chip Multiprocessor (CMP). The project is supervised by Associate Professor Kunle Olukotun, accompanied by Associate Professor Monica S. Lam and Mark Horowitz; also incorporated in the project are a dozen students. Early development of the project was performed by Basem A. Nayfeh, nowadays a Ph.D. The Hydra project focuses on combining shared-cache multiprocessor architectures, innovative synchronization mechanisms, advanced integrated circuit technology and parallelizing compiler technology to produce microprocessor cost/performance and parallel processor programmability. The four integrated MIPS-based processors will demonstrate that it is feasible for a multiprocessor to gain better cost/performance than wide superscalar architectures achieve on sequential applications. By using an MCM, communication bandwidth and latency are improved, resulting in better parallelism. This makes Hydra a good platform to exploit fine-grained parallelism; hence a parallelizing compiler for extracting this sort of parallelism is under development. The project is financed by US Defense Advanced Research Projects Agency (DARPA) contracts DABT and MDA.

    1.5.2 Self-Test in Embedded Systems (STES)

    STES is a co-operative project between ESLAB, the Laboratory for Dependable Computing of Chalmers University of Technology, the Electronic Design for Production Laboratory of Jönköping University, the Ericsson CadLab Research Center, FFV Test Systems, and SAAB Combitech Electronics AB. ESLAB is responsible for developing a self-test strategy for system-level testing of complex embedded systems, which utilizes the BIST (Built-In Self-Test) functionality at the device, board, and MCM level. Apart from the involved commercial participants, the project is funded by NUTEK.

    1.5.3 Socware

    An international Swedish design centre/cluster has recently been built that will work in close cooperation with the technical universities in Linköping/Norrköping, Lund and Stockholm/Kista. Socware, formerly known as the Acreo System Level Integration Center (SLIC), aims to have nearly 40 employees/specialists in the beginning, but this number is expected to grow to 1500 in the near future, with a special research institute located in Norrköping. The design centre will serve as a bridge between industry and the research activity at the universities, enabling research results to be rapidly converted into industrial products.

    The focus of research and development will be directed at the design of system components within digital media technology. Initially, special focus will be on applications in mobile radio and broadband networking. The project is financed by the government, the municipality of Norrköping and other local and regional agencies. More information can be found in [35].

    1.5.4 The Pittsburgh Digital Greenhouse

    The Pittsburgh Digital Greenhouse is a SoC design cluster that focuses on the digital video and networking markets. The non-profit organization is an initiative taken by the U.S. government, universities, and industry that started in June 1999. It involves Carnegie Mellon University, Penn State University, the University of Pittsburgh, and several industry members like Sony, Oki, and Cadence.

    Some ongoing research activities closely related to SoC are:

    Configurable System on a Chip Design Methodologies with a Focus on Network Switching This project focuses on the development of design tools for hardware/software co-design, such as those required for next-generation switches on the Internet and for cryptography.

    Architecture and Compiler Power Issues in System on a Chip This program is focused on creating a software system that characterizes the power of the major components of a SoC design and allows the design to be optimized for the lowest possible power consumption.


    MediaWorm: A Single Chip Router Architecture with Quality of Service Support This research focuses on the design, fabrication, and testing of a new high-performance switched network router, called MediaWorm. It is aimed at computer clusters where there are demands for Quality of Service (QoS) guarantees.

    Lightweight Arithmetic IP: Customizable Computational Cores for Mobile Multimedia Appliances The focus is on the complexity of multimedia algorithms and the development of mathematical software functions that provide the required level of computational performance at lower power levels.

    The long-range goal is to have SoCs in a wide range of next-generation products, from "smart homes" to hand-held devices that allow the user to surf the Web, send faxes and receive e-mail. Further goals are to provide venture capital, training and education, and to assist start-up companies that use the research results and pre-designed chips created by the Digital Greenhouse in their products. More information can be found at [37].

    1.5.5 Cadence SoC Design Centre

    In February 1998 the Cadence Design Centre opened, with the purpose of creating one of the electronics industry's largest and most advanced SoC design facilities. The centre is located on the Alba Campus in Livingston, Scotland, and is the largest European design centre. It offers expertise within the spheres of Digital IC, Multimedia/Wireless, Analogue/Mixed Signal, Datacom/Telecom, Silicon Technology Services, and Methodology Services. In 1999 the centre became authorized as the first ARM Approved Design Centre, through the ARM Technology Access Program (ATAP). Current research projects conducted at the centre involve a single-chip processor for Internet telephony and audio, a flexible receiver chip suitable for, among other things, pinpointing location by picking up high-frequency radio waves transmitted by GPS satellites, and a fully customized wireless Local Area Network (LAN) environment. There are three main pieces of the centre: the Virtual Component Exchange (VCX), the Institute for System Level Integration (SLI) and the Alba Centre. The VCX opened in 1998 and is an institution dedicated to establishing a structured framework of rules and regulations for inter-company licensing of IP blocks. Members of the VCX include ARM, Motorola, Toshiba, Hitachi, Mentor Graphics, and Cadence. The SLI institute is an educational institution dedicated to system-level integration and research. The institute was established by four of Scotland's leading universities: Edinburgh, Glasgow, Heriot-Watt and Strathclyde. Finally, the Alba Centre is the headquarters of the whole initiative and provides a central point for information about the venture and assistance for interested firms.

  • 2 Embedded CPU

    There are several different interpretations of the term CPU. Some say it is "the brains of the computer" or "where most calculations take place", and that it "acts as an information nerve center for a computer". A more concrete definition is given by John L. Hennessy and David A. Patterson [8]:

    Where arithmetic, logic, branching, and data transfer are implemented.

    This chapter serves as an introduction to CPUs that are especially suitable for SoC solutions, namely embedded CPUs. In this case, the term "embedded" does not only refer to how suitable these CPUs are for embedded systems, or as stand-alone microprocessors, but also to how they are good candidates to be "embedded" into a SoC. The purpose of this chapter is to look at the possibilities of embedded processors as SoC components and what aspects need to be considered when designing and implementing a solution. Techniques for improving and measuring performance are discussed, as well as where the research is today, together with a look at the future of embedded processors.

    The chapter begins with an introduction to embedded CPUs that explains some of the factors behind their popularity. Section 2.2 is a presentation of the building blocks of a modern embedded CPU. Section 2.3 looks at which paradigm is currently in front regarding embedded CPUs. Section 2.4 discusses the major factors affecting the design. Section 2.5 considers options on how to implement an embedded CPU. Section 2.6 presents case studies of embedded CPUs available on the market today. Section 2.7 shows several techniques for improving performance. Next, section 2.8 considers how the performance of an embedded processor can be measured. Finally, section 2.9 looks at where the research is today and what the trends are in the embedded processor market.

    2.1 Introduction

    The latest advances in process technology have increased the number of available transistors on a single die almost to the extent that today's battle between designers is not about how to fit it all on a single piece of silicon, but how to make the most use of it. This evolution has also made it possible for designers to put a complete processor, together with some or all of its peripheral components, on a single die, creating a new class of products called Application Specific Standard Products (ASSPs). The demand for ASSPs has in turn created a new domain of processors, embedded 32-bit CPUs, that are cheap, energy-efficient, and especially designed for solving their domain of tasks.

    Before getting into all the wonders of embedded CPUs, some clarifications should be made about what they are and what they are not. When CPUs are discussed, the thoughts often go to the architectures from Intel, Motorola, Sun, etc. These architectures are mainly designed for the desktop market and have dominated it for a long time. In recent years, there has been an increasing demand for CPUs designed for a specific domain of products. Among those noticing this trend was David Patterson [21]:

    Intel specializes in designing microprocessors for the desktop PC, which in five years may no longer be the most important type of computer. Its successor may be a personal mobile computer that integrates the portable computer with a cellular phone, digital camera, and video game player... Such devices require low-cost, energy-efficient microprocessors, and Intel is far from a leader in that area.

    The question of what the difference is between a desktop and an embedded processor is still unanswered. Actually, some embedded platforms arose from desktop platforms (such as MIPS, Sparc, x86), so the difference cannot lie in register organization, the instruction set or the pipelining concept. Instead, the factors that differentiate a desktop CPU from an embedded processor are power consumption, cost, integrated peripherals, interrupt response time, and the amount of on-chip RAM or ROM. The desktop world values processing power, whereas an embedded processor must do the job for a particular application at the lowest possible cost [22].


    2.2 The Building Blocks of an Embedded CPU

    This section serves as an introduction to the components of a modern embedded CPU. Readers that are familiar with the basics of computer architecture and processor design might skip this section.

    A CPU basically consists of three components: a register set, an ALU, and a control unit. Today it is often the case that the CPU also includes an on-chip cache and a pipeline, in order to achieve an adequate level of performance (Figure 3). The following text gives a brief introduction to each component's function and purpose in the CPU.

    [Figure 3: A typical embedded CPU - a 32-bit register file, barrel shifter, 32 x 8 multiplier and 32-bit ALU, an address register with incrementer, a write data register, control logic with instruction decoder and instruction pipeline, and a cache, connected over ALU, PC and increment buses to 32-bit address and data buses]

    2.2.1 Register File

    The organization of registers, i.e. how information is handled inside the computer, is part of a machine's Instruction Set Architecture (ISA) [8, 9]. An ISA includes the instruction set, the machine's memory, and all of the registers that are accessible by the programmer. ISAs are usually divided into three main categories according to how information is stored in the CPU: stack architecture, accumulator architecture, and general-purpose register (GPR) architecture. These architectures differ in how operands are handled. A stack architecture keeps its current operands on top of the stack, while an accumulator architecture keeps one implicit operand in the accumulator, and a general-purpose register architecture only has explicit operands, which can reside either in memory or in registers. The following example shows how the expression A = B + C would be evaluated on these three architectures.

        stack architecture    accumulator architecture    general-purpose architecture
        PUSH C                LOAD  R1,C                  LOAD  R1,C
        PUSH B                ADD   R1,B                  LOAD  R2,B
        ADD                   STORE A,R1                  ADD   R3,R2,R1
        POP A                                             STORE A,R3

    The machines in the early days used stack architectures and did not need any registers at all. Instead, the operands were pushed onto the stack and popped off into a memory location. Some advantages were that space could be saved because the register file was not needed, and no explicit operands were needed for arithmetic operations. As memories became slower compared to CPUs, the stack architecture became ineffective, due to the fact that most time is spent fetching the operands from memory and writing them back. This became a major bottleneck, which made the accumulator architecture a more attractive choice.

    The accumulator architecture was a step up in performance, letting the CPU hold one of the operands in a register. Often, the accumulator machines only had one data accumulator register, together with other address registers. They are called accumulators due to their responsibility to act as a source of one operand and the destination of arithmetic instructions, thus accumulating data. The accumulator machine was a good idea at a time when memories were expensive, because only one address operand had to be specified, while the other resided in the accumulator. Still, the accumulator machine has its drawbacks when evaluating longer expressions, due to the limited number of accumulator registers.

    The GPR machines solved many problems often related to stack and accumulator machines. They could store variables in registers, thus reducing the number of accesses to main memory. Also, the compiler could associate the variables of a complex expression in several different ways, making it more flexible and efficient for pipelining. A stack machine needs to evaluate the same complex expression from left to right, which might result in unnecessary stalling. Many embedded CPUs are RISC architectures, which means that they have lots of registers (usually about 32).

    2.2.2 Arithmetic Logic Unit

    The Arithmetic Logic Unit (ALU) performs arithmetic and logic functions in the CPU. It is usually capable of adding, subtracting, comparing, and shifting. The design can range from simple combinational logic units that do ripple-carry addition, shift-and-add multiplication, and single-bit shifts, to no-holds-barred units that do fast addition, hardware multiplication, and barrel shifts [9].
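    As a behavioural illustration of the operations just listed, the following C sketch models a small ALU as a single function. The opcode names, the 32-bit width and the zero flag are assumptions made for the example, not taken from any particular processor:

        /* A minimal behavioural model of an ALU that adds, subtracts,
           compares, and shifts. Opcode names and the zero flag are
           illustrative assumptions. */
        #include <stdint.h>

        typedef enum { ALU_ADD, ALU_SUB, ALU_CMP, ALU_SHL, ALU_SHR } alu_op;

        uint32_t alu(alu_op op, uint32_t a, uint32_t b, int *zero)
        {
            uint32_t result = 0;
            switch (op) {
            case ALU_ADD: result = a + b;         break;
            case ALU_SUB: result = a - b;         break;
            case ALU_CMP: result = a - b;         break;  /* only the flag matters */
            case ALU_SHL: result = a << (b & 31); break;  /* shift amount modulo 32 */
            case ALU_SHR: result = a >> (b & 31); break;
            }
            *zero = (result == 0);  /* a typical condition-code side effect */
            return result;
        }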

    2.2.3 Control Unit

    The control unit is responsible for generating proper timing and control signals to the other logical blocks (it is usually implemented as a state machine that performs the machine cycle: fetch, decode, execute, and store) in order to complete the execution of instructions.
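    The machine cycle the control unit sequences can be seen in software form as a fetch-decode-execute loop. The C sketch below runs a three-instruction program on an invented two-opcode machine; the ISA is purely an assumption for illustration:

        /* A minimal fetch-decode-execute loop modelling the machine cycle
           the control unit sequences. The tiny ISA is invented for this
           illustration only. */
        #include <stdint.h>
        #include <stdio.h>

        enum { OP_HALT = 0, OP_LOADI = 1, OP_ADD = 2 };  /* assumed opcodes */

        int main(void)
        {
            /* program: acc = 40; acc += 2; halt */
            uint8_t mem[] = { OP_LOADI, 40, OP_ADD, 2, OP_HALT };
            uint32_t pc = 0, acc = 0;

            for (;;) {
                uint8_t op = mem[pc++];                  /* fetch, advance PC */
                switch (op) {                            /* decode            */
                case OP_LOADI: acc = mem[pc++];  break;  /* execute + store   */
                case OP_ADD:   acc += mem[pc++]; break;
                case OP_HALT:  printf("acc = %u\n", (unsigned)acc); return 0;
                }
            }
        }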

    2.2.4 Memory Management Unit

    The Memory Management Unit (MMU) is located between the CPU and the main memory and is responsible for translating virtual addresses into their corresponding physical addresses. The physical address is then presented to the main memory. The MMU can also enforce memory protection when needed.
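    To make the translation step concrete, here is a hedged C sketch of the lookup an MMU performs, using a single-level page table with 4 KB pages. The page size, table layout and protection bit are assumptions chosen for illustration; real MMUs typically use multi-level tables and a TLB:

        /* Virtual-to-physical translation with a single-level page table.
           Geometry and flags are illustrative assumptions. */
        #include <stdint.h>

        #define PAGE_SHIFT 12                      /* 4 KB pages */
        #define PAGE_MASK  ((1u << PAGE_SHIFT) - 1)

        typedef struct {
            uint32_t frame;    /* physical frame number */
            int      valid;    /* mapping present? */
            int      writable; /* simple protection bit */
        } pte_t;

        /* Returns the physical address, or -1 on a fault. */
        int64_t translate(const pte_t *page_table, uint32_t vaddr, int is_write)
        {
            uint32_t vpn    = vaddr >> PAGE_SHIFT;   /* virtual page number */
            uint32_t offset = vaddr & PAGE_MASK;     /* offset passes through untranslated */
            const pte_t *pte = &page_table[vpn];

            if (!pte->valid || (is_write && !pte->writable))
                return -1;                           /* page fault / protection fault */
            return (int64_t)(((uint64_t)pte->frame << PAGE_SHIFT) | offset);
        }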

    2.2.5 Cache

    There are few processors today that don't incorporate a cache. The cache acts as a buffer between the CPU and the main memory to reduce access time, taking advantage of the locality of both code and data. There are usually several levels of cache, each with its own purpose. The first level is usually located on-chip, together with the CPU, and the cache is often separated into an instruction- and a data-cache. A cache is especially important in a RISC architecture with frequent loads and stores. For example, Digital's StrongARM chip devotes about 90% of its die area to cache [89]. The reader can learn more about caches and how they are used in section 4.3.
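    The core mechanism that lets a cache exploit locality is how it splits an address into a tag, an index and an offset. The C sketch below shows this split for a direct-mapped cache; the geometry (512 lines of 32 bytes, i.e. 16 KB) is an assumption picked for the example:

        /* Address decomposition and hit check for a direct-mapped cache.
           The cache geometry is an illustrative assumption. */
        #include <stdint.h>
        #include <stdbool.h>

        #define LINE_SIZE   32u                    /* bytes per cache line */
        #define NUM_LINES   512u                   /* 512 * 32 B = 16 KB   */
        #define OFFSET_BITS 5u                     /* log2(LINE_SIZE)      */
        #define INDEX_BITS  9u                     /* log2(NUM_LINES)      */

        typedef struct {
            bool     valid;
            uint32_t tag;
            uint8_t  data[LINE_SIZE];
        } cache_line;

        /* Returns true on a hit: the indexed line is valid and its tag matches. */
        bool cache_lookup(const cache_line cache[NUM_LINES], uint32_t addr)
        {
            uint32_t index = (addr >> OFFSET_BITS) & (NUM_LINES - 1);
            uint32_t tag   = addr >> (OFFSET_BITS + INDEX_BITS);
            return cache[index].valid && cache[index].tag == tag;
        }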

    2.2.6 Pipeline

    As with caches, there are very few processors today that don't use some kind of pipelining to improve their performance. This section serves as an introduction to pipelining and to the benefits and drawbacks of using it. Pipelining is an implementation technique that tries to achieve Instruction Level Parallelism (ILP) by letting multiple instructions overlap in execution. The objective is to increase throughput, the number of instructions completed per unit of time. By dividing the execution of an instruction into several phases, called pipeline stages, an ideal speedup equal to the pipeline depth can theoretically be achieved. Also, by dividing the pipeline into several stages, the workload in each stage becomes smaller, letting the processor run at a higher frequency [8]. Figure 4 shows a typical pipeline together with its stages. This particular pipeline has a depth of five and consists of unique pipeline stages, each with its own purpose.

Figure 4: A general pipeline with the stages IF, ID, EX, MEM, and WB.

The Instruction Fetch cycle (IF) fetches the next instruction from memory.

The Instruction Decode cycle (ID) decodes the instruction and reads the register file in case one or more of the instruction's operands are registers.

The Execution cycle (EX) evaluates ALU operations, or calculates the destination address in the case of a branch instruction.

The Memory Access cycle (MEM) is where memory is accessed when needed, or, in the case of a branch, where the new program counter is set to the destination address calculated in the previous pipeline stage. (For a conditional branch, the condition is evaluated here: if the instruction branches, the program counter is set by the previous calculation; otherwise it is incremented to point at the next instruction.)

The Write-back cycle (WB) writes the result back to the register file.
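The ideal speedup claim is easy to verify numerically. The following C sketch (an illustration, not from the thesis) compares cycle counts for n instructions on the five-stage pipeline just described against one-instruction-at-a-time execution; the speedup approaches the pipeline depth as n grows:

    #include <stdio.h>

    int main(void) {
        const int depth = 5;                 /* IF, ID, EX, MEM, WB */
        for (int n = 1; n <= 1000; n *= 10) {
            int unpiped = n * depth;         /* one instruction at a time   */
            int piped   = depth + (n - 1);   /* fill once, then 1 per cycle */
            printf("n=%4d  unpipelined=%5d  pipelined=%4d  speedup=%.2f\n",
                   n, unpiped, piped, (double)unpiped / piped);
        }
        return 0;
    }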

The time it takes for an instruction to move from one pipeline stage to another is usually referred to as a machine cycle. If one stage requires several cycles to complete, it can be decomposed into several smaller stages, resulting in a superpipeline. Because instructions need to move at the same time, the length of a machine cycle is dictated by the slowest stage of the pipeline. The designer's challenge is to reduce the number of machine cycles per instruction (CPI). If one would execute one instruction at a time, the CPI count would be equal to the pipeline length. The optimal result in a linear pipeline would be a CPI count of 1.0, which means that an instruction is completed every cycle and that every pipeline stage is fully utilized. This is not entirely achievable in practice, because a program usually contains internal dependencies, branches, etc. (The CPI count never quite reaches the ideal value of 1.0: cycles are always lost in the beginning because the pipeline is initially empty and needs to be filled with instructions, so by the time the first instruction reaches the WB phase, several cycles have been lost.) These pipeline obstacles are usually referred to as hazards and can cause delays in the pipeline, called stalls. An execution of a program completely without hazards would execute its instructions with virtually no delays, resulting in a CPI count close to 1.0. Those hazards that do cause pipeline stalls are usually classified as structural, data, and control hazards.

structural hazards are caused by resource conflicts when the hardware cannot support certain combinations of overlapped execution. The cause can be as simple as having only one port to the register file, creating conflicts between the ID and WB stages for register requests. Another source of conflict can be a memory that is not divided into code and data, causing conflicts between the IF and MEM stages due to simultaneous instruction fetching and memory writing.

data hazards are caused by an instruction depending on another instruction still in the pipeline, so that execution must be stalled, or else the written data can be inconsistent. These instruction dependencies come in three flavors: Read After Write (RAW), Write After Write (WAW), and Write After Read (WAR). RAW hazards are the most common ones and occur when a write instruction is followed by a read instruction and both instructions operate on the same register, forcing the reading instruction to wait until the write has been issued in the WB stage. This can be handled by forwarding, which introduces "shortcuts" in the pipeline so that instructions can use results before the producing instruction reaches the WB stage [8] (a small worked example follows after this list). WAW hazards cannot occur in pipelines like the one shown earlier (figure 4): for a WAW hazard to occur, either the memory stage has to be divided into several stages, making several simultaneous writes possible, or there must be some mechanism by which an instruction can bypass another instruction in the pipeline. WAR hazards are rare and happen when an instruction tries to write to a register read by an instruction that is ahead in the pipeline. As with WAW hazards, WAR hazards cannot occur in a general pipeline because register contents are always read in the ID stage; pipelines that read register contents late can, however, create WAR hazards [8].

control hazards are caused by the instructions that change the path of execution, called branches. By the time a branch instruction has calculated its destination address in the EX stage, the instructions following the branch have already reached the IF and ID stages. If the branch is unconditional, the instructions in the IF and ID stages have to be removed, because the branch changes the program counter and new instructions have to be fetched from a new address, namely the destination address of the branch. If, on the other hand, the branch is conditional, the condition needs to be evaluated in order to decide whether the branch should be taken or the program counter simply incremented. One way of dealing with this problem is to automatically stall the pipeline until the condition is evaluated. These stalls are issued in the ID stage, where the branch is first identified. Also, in order to evaluate the condition of a conditional branch and calculate the destination address simultaneously, extra logic for condition evaluation can be added alongside the ALU in the ID stage. This way, only one stall cycle is wasted when a branch instruction occurs.
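The promised worked example of a RAW hazard: the following C sketch (an illustration under stated assumptions, not from the thesis) counts the stall cycles for a back-to-back producer/consumer pair in the five-stage pipeline of figure 4, with and without an EX-to-EX forwarding path.

    #include <stdio.h>

    int main(void) {
        /* Producer:  ADD r1, r2, r3  -- r1 is written back in WB (cycle 5)
         * Consumer:  SUB r4, r1, r5  -- issued one cycle later, reads r1 in ID
         * Assumption: the register file is written in the first half of WB
         * and read in the second half, so ID may overlap with WB.          */
        int producer_wb = 5;           /* producer's WB cycle               */
        int consumer_id = 3;           /* consumer's unstalled ID cycle     */
        int stalls_no_fwd = producer_wb - consumer_id;   /* 2 stall cycles  */
        int stalls_fwd = 0;            /* EX->EX bypass feeds the result
                                          directly to the consumer's EX     */
        printf("without forwarding: %d stalls\n", stalls_no_fwd);
        printf("with forwarding:    %d stalls\n", stalls_fwd);
        return 0;
    }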

Most structural hazards can be prevented by adding more ports and dividing the memory into data and instruction memory segments. The memory can also be improved by adding cache or increasing the cache area. Data hazards can be handled by letting the compiler reschedule the instructions in order to reduce the number of dependencies. Control hazards can be reduced by trying to predict the destination of a branch. The prediction is based on tables storing historical information about whether the same branch did or did not jump in earlier executions; such tables are called Branch History Tables (BHT) or Branch Prediction Buffers (BPB). Other tables, such as the Branch Target Buffer (BTB), act as a cache storing the destination addresses of previously executed branches. The interested reader can continue in several books and articles addressing different branch penalty reduction techniques [8, 64, 63].
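A minimal sketch of such a prediction buffer (assuming the common 2-bit saturating-counter scheme and an illustrative 256-entry table; neither parameter is from the thesis):

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define ENTRIES 256
    /* counter values 0..3: strongly/weakly not taken, weakly/strongly taken */
    static uint8_t bpb[ENTRIES];

    static bool predict(uint32_t pc) {
        return bpb[pc % ENTRIES] >= 2;      /* counters 2 and 3 predict taken */
    }

    static void update(uint32_t pc, bool taken) {
        uint8_t *c = &bpb[pc % ENTRIES];
        if (taken  && *c < 3) (*c)++;       /* saturate at strongly taken     */
        if (!taken && *c > 0) (*c)--;       /* saturate at strongly not taken */
    }

    int main(void) {
        uint32_t pc = 0x400100;             /* hypothetical branch address    */
        bool outcome[] = { true, true, true, false, true, true }; /* a loop   */
        int correct = 0, n = sizeof outcome / sizeof outcome[0];
        for (int i = 0; i < n; i++) {
            if (predict(pc) == outcome[i]) correct++;
            update(pc, outcome[i]);
        }
        printf("%d/%d predicted correctly\n", correct, n);
        return 0;
    }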

    2.3 The Microprocessor Evolution

This section serves as a walk-through of the different phases in microprocessor evolution (figure 5). Although this section may seem irrelevant to embedded processors, embedded processor design has always been influenced by the microprocessor and may continue to be so in the future. The reader who feels unfamiliar with the principles behind the RISC and CISC paradigms should reread section 1.4.1 before proceeding with this text.

In the early days, there was a limited number of transistors available to the CPU designer. Usually, the chips were filled with logic that was seldom used (e.g. decoding of rarely used instructions). CISC computers used microcoding, which made it easier to execute complex instructions. As the years went by, it became harder for CISC designers to keep up with Moore's law (the capability of microprocessors doubles roughly every 18 months). Building more complex solutions each year was not enough. Some designers realized that the rule of locality of reference needed to be taken into consideration. It states that a program executes about 90% of its instructions

in 10% of its code.

Figure 5: The evolution of microprocessors. (The figure traces CISC, with microcoding, complex instructions, and variable instruction completion times; RISC, with pipelining and simple instructions for speed; the merged RISC/CISC architectures; superscalar/VLIW designs that execute multiple instructions; multithreaded processors with duplicated hardware resources such as registers, PC, and SP; simultaneous multithreading, where any context can execute each cycle; and single-chip multiprocessors with duplicated processors.)

The RISC designers thought that if they could implement the 10% most used

    instructions and throw out the other 90%, then there would be lots of free die area left for other ways of

    increasing the performance. Some of the performance enhancing techniques are listed below.

Cache Memory references were becoming a serious bottleneck, and one way to reduce the access time is to use the extra on-chip space for cache. With an on-chip cache, the processor does not need to access the main memory for every memory reference.

Pipelining By breaking down the handling of an instruction into several simpler stages, the processor is able to run at a higher clock frequency.

More registers When compiling a program into machine code, the handling of variables is usually taken care of by registers. Sometimes there are stalls in the pipeline due to dependencies between registers (e.g. a register cannot be used until it is available), which can be avoided by register renaming. This becomes possible when the number of registers is increased.

Computers using some or all of these techniques include the RISC I and the IBM 801 [2]. These enhancements gave the RISC designers the upper hand for several generations in the 80's and 90's. But when the number of available transistors on a chip passed the million mark, the transistor count disappeared as a limiting factor. The CISC designers could level the score by introducing more complex solutions that increased their performance by a couple of percent, with little concern for how much die area was used. Even though a CISC processor was several factors more complex than the corresponding RISC processor, it was still keeping up with the RISC. Nowadays, the RISC and CISC paradigms have merged and use techniques from both of the original paradigms. Now, when there are 10, 20 million or more transistors available, the problem the designer faces is more about making the most use of all the transistors than about how to fit it all on one die. A simple processor can now be realized on only a fraction of the available space, and there are limits to the performance gains from increasing the cache size, deepening the pipeline, and increasing the number of registers. So, the question is what to do with the available space? To gain more performance, new architectures like Multithreading, Simultaneous Multithreading (SMT), Very Long Instruction Word (VLIW), and Single Chip Multiprocessor (CMP) are emerging. These architectures will be discussed in section 2.7.


    2.4 Design Aspects

The designers of embedded processors are under market pressure to produce cheap, low-power, fast processors [22]. To meet the market demand for a SoC solution, the designer of an embedded processor needs to consider several design aspects, listed below.

    2.4.1 Code Density

The size of a program may not be an issue in the desktop world, but it is a major challenge in embedded systems. The embedded processor market is highly constrained by power, cost, and size. For control-oriented embedded applications, a significant portion of the final circuitry is used for instruction memory. Since the cost of an integrated circuit is strongly related to die size, smaller programs imply that smaller, and therefore cheaper, dies can be used for embedded systems [81, 82].

Thumb and MIPS16 are two approaches that try to reduce the code size of programs by compressing the code. Thumb and MIPS16 are subsets of the ARM and MIPS-III architectures, respectively. The instructions in each subset are either frequently used, do not require the full 32 bits, or are important to the compiler for generating small code. The original 32-bit instructions are re-encoded to be 16 bits wide. Thumb and MIPS16 are reported to achieve code reductions of 30% and 40%, respectively. The 16-bit instructions are fetched from instruction memory and decoded to equivalent 32-bit instructions that are run by the core as usual. Both approaches have drawbacks:

Instruction widths are shrunk at the expense of reducing the number of bits used to represent registers and immediate values.

Conditional execution and zero-latency shifts are not possible in Thumb.

Floating-point instructions are not available in MIPS16.

The number of instructions in a program grows with compression.

Thumb code runs 15-20% slower on systems with ideal instruction memories.

Both Thumb and MIPS16 use the execution-based selection form of selective compression, a technique that selects procedures to compress according to a profile of procedure execution frequency. The other form is miss-based selection, in which decompression is invoked only on an instruction cache miss. All performance loss then occurs on the cache miss path, so miss-based selection is driven by the number of cache misses rather than by the number of executed instructions as in execution-based selection. Speedup can be achieved by letting the procedures with the most cache misses remain in native code.
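A minimal sketch of the two selection policies (the procedures, profile numbers, and thresholds below are hypothetical, invented purely for illustration): both leave "hot" procedures in native code and compress the rest, differing only in the heat metric.

    #include <stdio.h>
    #include <stdbool.h>

    typedef struct {
        const char *name;
        long exec_count;   /* profile: instructions executed     */
        long cache_misses; /* profile: instruction-cache misses  */
    } proc;

    static bool compress_exec_based(const proc *p, long threshold) {
        return p->exec_count < threshold;   /* rarely-executed code compressed */
    }

    static bool compress_miss_based(const proc *p, long threshold) {
        return p->cache_misses < threshold; /* rarely-missing code compressed  */
    }

    int main(void) {
        proc procs[] = {
            { "interrupt_handler", 900000,  120 },
            { "init_hw",              500,   90 },
            { "dsp_filter",        800000, 4000 },
        };
        for (int i = 0; i < 3; i++)
            printf("%-17s exec-based:%s miss-based:%s\n", procs[i].name,
                   compress_exec_based(&procs[i], 10000) ? "compress" : "native",
                   compress_miss_based(&procs[i], 1000)  ? "compress" : "native");
        return 0;
    }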

Jim Turley has a different view on the techniques for reducing code size [89]: claimed advantages in code density should be considered in light of factors such as compiler optimization (loop unrolling, procedure inlining, etc.), the addressing (32-bit vs. 64-bit integers or pointers), and memory granularity. Finally, code density does little or nothing to affect the size of the data space. Applications working with large data sets require much more memory than the executable itself, so code reduction is of little help there.

    2.4.2 Power Consumption

Many products using embedded processors use batteries as their power supply. To preserve as much power as possible, embedded processors usually operate in three different modes: fully operational, standby mode, and clock-off mode [22]. Fully operational means that the clock signal is propagated to the entire processor, and all functional units are available to execute instructions. When the processor is in standby mode, it is not actually executing instructions, but the DRAM is still refreshed and the register contents remain available. The processor returns to fully operational mode, without losing any information, upon an activity that requires units not available in standby mode. Finally, in clock-off mode the system has to be restarted in order to continue, which takes almost as much time as an initial start-up. Power consumption is often measured in milliwatts per megahertz (mW/MHz).

The simplest way of reducing power consumption is to reduce the voltage level. Today, CPU core voltage has been reduced to about 1.8 V and is still decreasing. Embedded processors are also starting to incorporate dynamic power management into their designs. One example is a pipeline that can shut off the clock to the various logic blocks that are not needed when executing the current instruction [98]; this type of pipeline is usually referred to as a power-aware pipeline. Also, it is no longer sufficient to measure only the power consumption of the CPU, as it gets integrated with its peripherals in a SoC. Instead, the power consumption of the entire system has to be measured.
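The mW/MHz figure of merit converts directly into an absolute power estimate, and voltage reduction pays off quadratically (dynamic power scales roughly with C * V^2 * f). A small worked sketch, using the ARM7TDMI figures quoted in section 2.6.1 (the voltage-scaling step is a simplifying assumption, not a measured result):

    #include <stdio.h>

    int main(void) {
        double mw_per_mhz = 0.6;   /* ARM7TDMI figure of merit */
        double freq_mhz   = 66.0;
        double p = mw_per_mhz * freq_mhz;
        printf("core power: %.1f mW\n", p);              /* ~39.6 mW */

        /* Scaling the supply from 3.3 V to 1.8 V at the same frequency: */
        double scale = (1.8 * 1.8) / (3.3 * 3.3);
        printf("after voltage scaling: %.1f mW\n", p * scale); /* ~11.8 mW */
        return 0;
    }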

    2.4.3 Performance

Unlike in the desktop market, performance isn't everything in the embedded processor market; factors like price and power consumption are equally important. A typical embedded processor executes about one instruction per cycle. Today, performance is still often measured in Million Instructions Per Second (MIPS), which basically reveals only the number of instructions executed per second, not whether any useful instructions were executed. MIPS is not a good way of measuring performance, and section 2.8.1 looks at other alternatives. Sometimes the usual performance of one executed instruction per cycle is not enough for an embedded processor, and alternative architectures must be considered in order to increase performance. Section 4.3.7 discusses possible alternative architectures.
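A quick numerical sketch of why raw MIPS misleads (the instruction counts below are hypothetical): two runs at the same MIPS rating can take different times if the ISA needs different numbers of instructions for the same task, so task time, not MIPS, is what ultimately matters.

    #include <stdio.h>

    int main(void) {
        double freq_mhz = 66.0, ipc = 0.9;
        double mips = ipc * freq_mhz;            /* 59.4 MIPS rating       */

        long insns_native = 1000000;             /* task on a 32-bit ISA   */
        long insns_dense  = 1300000;             /* same task, compressed
                                                    ISA executes more (but
                                                    shorter) instructions  */
        printf("rating: %.1f MIPS\n", mips);
        printf("task time, native: %.2f ms\n", insns_native / (mips * 1e3));
        printf("task time, dense:  %.2f ms\n", insns_dense  / (mips * 1e3));
        return 0;
    }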

    2.4.4 Predictability

Architectures that support real-time systems must be able to achieve predictability [84]. Predictability depends on the Worst Case Execution Time (WCET), which is in turn dictated by the underlying hardware. Much focus is placed on improving an architecture's performance, and little thought goes into making it predictable. This has led to architectures that include caches, pipelines, virtual storage management, etc., all of which have improved the average-case execution time but have worsened the prospects for predictable real-time performance.

Caches have not been popular in the real-time computing community, due to their unpredictable behavior. This is true for multi-tasking, interrupt-driven environments, which are common in real-time applications [87]. Here, the execution time of an individual task can differ from run to run due to the interactions of real-time tasks and the external environment via the operating system. Preemptions may modify the cache contents and thereby cause a nondeterministic cache hit ratio, resulting in unpredictable task execution times.

Pipelines introduce problems similar to those of caches concerning worst-case execution time. There are efforts to achieve predictable pipeline performance without using a cache and without the hazards associated with pipelines [88]. This approach, called the Multiple Active Context System (MACS), uses multiple processor contexts to achieve both increased performance and predictability. Here, a single pipeline is shared among a number of threads, and the context of every thread is stored within the processor. On each cycle, a single context is selected to issue a single instruction to the pipeline. While this instruction proceeds through the pipeline, other contexts issue instructions to fill the consecutive pipeline stages. Contexts are selected in a round-robin fashion. A key feature of the MACS architecture is that its memory model allows the programmer to derive theoretical upper bounds on memory access times. The maximum number of cycles a context will wait for a shared memory request is dictated by the number of contexts, the memory issue latency, the number of threads competing for shared memory, and the number of contexts scheduled between consecutive threads.
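A minimal sketch of the round-robin issue discipline (an illustration assuming four contexts; not the MACS implementation itself): because selection is strictly cyclic, a context issues again after exactly NCTX cycles regardless of what the other contexts do, which is what makes the worst case derivable.

    #include <stdio.h>

    #define NCTX 4

    int main(void) {
        int pc[NCTX] = { 0 };           /* one program counter per context */
        for (int cycle = 0; cycle < 12; cycle++) {
            int ctx = cycle % NCTX;     /* strict round-robin selection    */
            printf("cycle %2d: context %d issues instruction %d\n",
                   cycle, ctx, pc[ctx]++);
        }
        /* Worst case between two issues of the same context: NCTX cycles,
         * independent of the behavior of the other contexts.              */
        return 0;
    }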

    2.5 Implementation Aspects

There are several options available to the designer who wants to integrate an embedded processor into a SoC. Besides building a processor from scratch, there are other options available. The first option is to acquire the processor core as a hard IP-component (those who are not familiar with the different layers of IP-components can read the section SoC Design), which is tied to a specific semiconductor fabrication process and delivered as mask data. Several hard IP-cores will be examined in section 2.6.

The second option is to acquire the CPU as a firm IP-component, which is usually delivered in the form of a netlist. The third and last option is to acquire a soft IP-component, either in the form of VHDL or Verilog code or by producing a synthesizable core with a parameterizable core generator. There have been several research efforts to develop generators of parameterizable RISC cores [73, 76]. One, conducted at the University of Hanover, has developed a parameterizable core generator that outputs fully synthesizable VHDL code. The generated core is based on a standard 5-stage pipeline (figure 4), and the designer has many choices when using the generator (e.g. pipeline length, ALU and data width, size of register file, etc.).

The generated cores are simple RISC processors with parameterizable word and instruction widths. Instruction and data memories are provided as a VHDL template file for simulation, but they are not suitable for synthesis; instead, they should be taken from a technology-specific library. Since the cores are based on RISC principles, the instruction set consists of only a few instructions and addressing modes. A typical 32-bit RISC core with a 32-bit data path and eight 32-bit registers can deliver an achievable clock frequency of about 100 MHz with a 3LM 0.5-micron standard-cell library.

    Commercial core generators are also available from Tensilica, ARC, and Triscend[100, 101, 99].

    2.6 State of Practice

The 4-, 8- and 16-bit microprocessors were, and still are, dominating the embedded control market. In fact, it was forecast that eight times more 8-bit than 32-bit CPUs would be shipped during 1999 [89]. The 32-bit embedded processor market differs from the desktop market in that there are about 100 vendors and a dozen instruction set architectures to choose from. What makes 32-bit embedded CPUs attractive is their ability to handle emerging consumer demands in the form of filtering, artificial intelligence, and multimedia, while still maintaining low power consumption, price, etc. Next follows a brief presentation of embedded processors commonly used today.

    2.6.1 ARM

The Advanced RISC Machines (ARM) company is a leading IP provider that licenses RISC processors, peripherals, and system-on-chip designs to international electronics companies. The ARM7 family of processors consists of the ARM7TDMI and ARM7TDMI-S processor cores, and the ARM710T, ARM720T and ARM740T cached processor macrocells.

An ARM7 processor consists of an ARM7TDMI or ARM7TDMI-S core (the S stands for Synthesizable, meaning that it can be acquired as VHDL or Verilog code) that can be augmented with one of the available macrocells. The macrocells provide the core with an 8 KB cache, a write buffer, and memory functions. The ARM710T also provides virtual memory support for operating systems such as Linux and Symbian's EPOC32. The ARM720T is a superset of the ARM710T and supports WindowsCE.

When writing a 32-bit program for an embedded system, there may be a problem fitting the entire program in the on-chip memory. This kind of problem is usually referred to as a code density problem. In order to address it, ARM has developed Thumb, a new instruction set. Thumb is an extension to the ARM architecture, containing 36 instruction formats drawn from the standard 32-bit ARM instruction set that have been re-coded into 16-bit-wide opcodes. Upon execution, the Thumb codes are decompressed by the processor to their real ARM instruction set equivalents, which are then run on the ARM as usual. This gives the designer the benefit of running ARM's 32-bit instruction set while reducing code size by using Thumb.



The ARM9 family is a newer and more powerful version of the ARM7, designed for system-on-chip solutions thanks to its built-in DSP capabilities. The ARM9E-S solutions are macrocells intended for integration into Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs) and System-on-chip (SoC) products.

CPU core  | Die Area               | Power             | Frequency | Performance
ARM7TDMI  | 1.0 mm^2 on 0.25 micron | 0.6 mW/MHz @ 3.3V | 66 MHz    | 0.9 MIPS/MHz
ARM9E-S   | 2.7 mm^2 on 0.25 micron | 1.6 mW/MHz @ 2.5V | 160 MHz   | 1.1 MIPS/MHz

    2.6.2 Motorola

The Motorola M-CORE microprocessor, introduced in 1997, targeted the market of analog cellular phones, digital phones, PDAs, portable GPS systems, automobile braking systems, automobile engine control, and automotive body electronics. The M-CORE architecture was designed from the ground up to achieve the lowest milliwatts per MHz. It is a 32-bit RISC processor with a 16-bit fixed-length instruction format, and it minimizes power usage by utilizing dynamic power management.

Motorola has also developed a modern version of the 68K architecture, the ColdFire, which is positioned between the 68K (low end) and the PowerPC (high end). This architecture is also known as VL-RISC because, although the core is RISC-like, the instructions are of variable length (VL). VL instructions help to attain higher code density. The ColdFire has a four-stage pipeline consisting of two subpipelines: a two-stage instruction prefetch pipeline and a two-stage operand execution pipeline.

    2.6.3 MIPS

MIPS Technologies designs and licenses embedded 32- and 64-bit intellectual property (IP) and core technology for the digital consumer and embedded systems markets. The MIPS32 architecture is a superset of the previous MIPS I and MIPS II instruction set architectures.

2.6.4 Patriot Scientific

Patriot Scientific Corporation was one of the first to develop a Java microprocessor, the PSC1000. The PSC1000 targets high-performance, low-system-cost applications like network computers, set-top boxes, cellular phones, Personal Digital Assistants (PDAs) and more. The PSC1000 microprocessor is a 32-bit RISC processor that offers the ability to execute Java(tm) programs as well as C and FORTH applications. It offers a unique architecture that is a blend of stack- and register-based designs, which enables features like 8-bit instructions for reduced code size. The idea behind the PSC1000 is to enable Internet connectivity for low-cost devices such as PDAs, set-top cable boxes and "smart" cell phones.

    2.6.5 AMD

Advanced Micro Devices' (AMD) 29000 was an early leader, frequently used in laser printers and network buses. The 29K family comprises three product lines, including three-bus Harvard-architecture processors, two-bus processors, and a microprocessor with on-chip peripheral support. The core is built around a simple four-stage pipeline: fetch, decode, execute, and write-back. The 29K has a triple-ported register file of 192 32-bit registers. In 1995, AMD cancelled all further development of the 29K to concentrate its efforts on x86 chips.

    2.6.6 Hitachi

The Hitachi SuperH (SH) became popular when Sega chose the SH7032 for its Genesis and Saturn video game consoles, and it then expanded to cover consumer-electronics markets. Its short, 16-bit instruction word gives the SuperH one of the best code densities of almost any 32-bit processor. The SH family uses a five-stage pipeline: fetch, decode, execute, memory access, and write-back to register. The CPU is built around 25 32-bit registers.

    2.6.7 Intel

The Intel i960 emerged early in the embedded market, which made it successful in printer and networking equipment. The i960 is well supported with development tools. It combines a Von Neumann architecture with a load/store architecture that centers on a core of 32 32-bit general-purpose registers. All i960s have multistage pipelines and use resource scoreboarding to track resource usage.

    2.6.8 PowerPC

The PowerPC is one of the best-known microprocessor names next to the Pentium and is steadily gaining ground in the embedded space. IBM and Motorola are pursuing different strategies with their embedded PowerPC chips, with the former inviting customer designs and the latter leveraging its massive library of peripheral I/O logic.

    2.6.9 Sparc

Sun's SPARC was the first workstation processor to be openly licensed and is still popular with some embedded users. The microSPARC is built around a large multiported register file that breaks down into a small set of global registers, for holding global variables, and sets of overlapping register windows. The microSPARC's pipeline consists of an instruction-fetch unit, two integer ALUs, a load/store unit, and an FPU.

    2.7 Improving Performance

Pipelining is a way of achieving a level of parallelism, resulting in a low CPI count. To be even more effective, linear pipelining will not suffice, and other techniques have to be considered. These techniques have the ability to execute several instructions at once, resulting in a CPI count below 1.0. The most popular techniques include Multiple-issue Processors (such as Very Long Instruction Word (VLIW) and Superscalar Processors), Multithreading, Simultaneous Multithreading (SMT) and Chip Multiprocessor (CMP). Also, another technique will be discussed that tries to come to terms with the ever-growing memory-CPU speed gap: prefetching, or preloading, hides memory latency by fetching and storing required data or instructions in a buffer before they are actually needed.

    2.7.1 Multiple-issue Processors

Although there are techniques that can remedy most of the stalls in an ordinary pipeline, the ideal result is still only a CPI count of 1.0, i.e. executing exactly one instruction per machine cycle. This performance is not always enough, and other ways of achieving a higher level of parallelism need to be considered. Multiple-issue processors try to execute several instructions per machine cycle, thus achieving a higher degree of Instruction-Level Parallelism (ILP). There are mainly two types of processors using these techniques, namely Very Long Instruction Word (VLIW) and superscalar processors. In addition to these two architectures, a third alternative, called the Multiple Instruction Stream Computer (MISC), will be discussed.

As the name implies, a VLIW processor issues a very long instruction packet that consists of several instructions. An example of an instruction packet can be seen in figure 6, where there is room for two integer/branch operations, one floating-point operation, and two memory references. In VLIW processors, the task of finding independent instructions in the code is done by the compiler instead of by dynamic hardware as in superscalar processors. Additional hardware is saved because the compiler always


    I