
  • SoCrates

    - A Scalable Multiprocessor System On Chip

    Authors

    Mikael Collin, Mladen Nikitovic, and Raimo Haukilahti

    {mci,mnc,rht}@mdh.se

    Supervisors

    Johan Starner and Joakim Adomat

    Examiner

    Lennart Lindh

    Department of Computer Engineering

    Computer Architecture Lab

    Mälardalen University

    Box 883, 721 23 Västerås

    Abstract

    This document is the result of a Master's Thesis in Computer Engineering, describing the analysis, specification and implementation of the first prototype of SoCrates, a configurable, scalable and predictable System-on-Chip multiprocessor platform for real-time applications. The design time of a System-on-a-Chip (SoC) is rapidly increasing today due to high complexity and a lack of efficient tools for development and verification. By combining all functions on one chip, the system becomes smaller, faster, and less power consuming, but the complexity increases. To decrease time-to-market, SoCs are entirely or partially built with IP-components. Thanks to SoC, a whole new domain of products, like small hand-held devices, has emerged. The concept has been around for a few years now, but there are still challenges that need to be resolved. There is a lack of standards enabling fast mix and match of cores from different vendors. Further needs are new design methods, tools, and verification techniques. SoC solutions need a special kind of CPU that consumes less power and is cheaper and smaller, but still meets high performance requirements. To fulfill all these demands, CPUs are getting more and more complex as the number of transistors grows rapidly, which has led to the emergence of multiprocessor systems-on-a-chip. Our initial question is to investigate whether it is possible to build these complex multiprocessor systems on a single FPGA and whether such solutions can lead to shorter time-to-market. Consumer demand for cheaper and smaller products makes FPGA solutions interesting. Our approach is to have multiple processing nodes, each containing a processing unit, memory and a network interface, all connected on a shared bus. A central in-house developed hardware real-time unit handles scheduling and synchronization. We have designed and implemented an MSoC that fits on a single FPGA in only 40 days, which to our supervisors' knowledge has not been accomplished before. Our experience is that a tightly coupled group can produce fast results, since information, new ideas and bug reports propagate immediately.

    SoCrates stands for SoC for Real-Time Systems.

  • Introduction

    This report describes the design of the first prototype of SoCrates, a generic scalable platform generator which creates a synthesizable HDL description of a multiprocessor system. The goal was to build a predictable multiprocessor system on a single FPGA with mechanisms for prefetching data and an in-house developed integrated hardware real-time unit.

    The report consists of three parts. The first part, Computer Architecture for System on Chip, is a state-of-the-art report introducing basic SoC terminology and practice with a deeper analysis of CPUs, interconnects and memory hierarchies. The purpose of this analysis was to learn about state-of-the-art techniques for designing complex multiprocessor SoCs. The design process resulted in part two, SoCrates - Specifications, which describes the prototype and each individual part's functionality and specific demands. Part three, SoCrates - Implementation Details, describes the implementation of all parts, how to configure the system, and how to compile and link the system software. We also present synthesis results and suggest future work that can be done to improve the system.

  • SoCrates

    - Document Index

    Document 1: Computer Architecture for System on Chip - A State of the Art Report
      1. Introduction
      2. Embedded CPU
      3. Interconnect
      4. Memory System
      5. Summary

    Document 2: SoCrates Specifications
      1. System Architecture
      2. CPU Node
      3. CPU
      4. Network Interface
      5. IO Node
      6. Interconnect
      7. Arbitration
      8. Boot
      9. Memory Wrapper

    Document 3: SoCrates - Implementation Details
      1. CPU
      2. Network Interface
      3. Arbiter
      4. Compiling & Linking the System Software
      5. Configuring the SoCrates Platform
      6. Current Results
      7. Future Work
      8. Conclusions

    Document 4: Appendix
      1. Demo Application
      2. I/O Routines
      4. Task Switch Routines
      5. Linker Scripts
      6. DATE 2001 Conference, Designers Forum, publication

  • Computer Architecture for System on Chip

    - A State of the Art Report

    Revision: 1.0

    Authors

    Mikael Collin, Mladen Nikitovic, and Raimo Haukilahti

    {mci,mnc,rht}@mdh.se

    Supervisors

    Johan Starner and Joakim Adomat

    Department of Computer Engineering

    Computer Architecture Lab

    Mälardalen University

    Box 883, 721 23 Västerås

    May 20, 2000

    Abstract

    This state-of-the-art report introduces basic SoC terminology and practice, with a deeper analysis of three architectural components: the CPU, the interconnect, and the memory hierarchy. A short historical view is presented before going into today's trends in SoC architecture and development. The SoC concept is not new, but there are challenges that have to be met to satisfy customer demands for faster, smaller, cheaper, and less power consuming products, today and in the future. This document is the first of three documents that form a Master's Thesis in Computer Engineering.

  • Contents

    1 Introduction
      1.1 What is SoC?
      1.2 SoC Designs
        1.2.1 Intellectual Property
        1.2.2 An Example of a SoC
      1.3 Why SoC?
        1.3.1 Motivation
        1.3.2 State of Practice and Trends
        1.3.3 Challenges
      1.4 Introduction to Computer System Architecture
        1.4.1 Computer System
      1.5 Research & Design Clusters
        1.5.1 Hydra: A next generation microarchitecture
        1.5.2 Self-Test in Embedded Systems (STES)
        1.5.3 Socware
        1.5.4 The Pittsburgh Digital Greenhouse
        1.5.5 Cadence SoC Design Centre

    2 Embedded CPU
      2.1 Introduction
      2.2 The Building Blocks of an Embedded CPU
        2.2.1 Register File
        2.2.2 Arithmetic Logic Unit
        2.2.3 Control Unit
        2.2.4 Memory Management Unit
        2.2.5 Cache
        2.2.6 Pipeline
      2.3 The Microprocessor Evolution
      2.4 Design Aspects
        2.4.1 Code Density
        2.4.2 Power Consumption
        2.4.3 Performance
        2.4.4 Predictability
      2.5 Implementation Aspects
      2.6 State of Practice
        2.6.1 ARM
        2.6.2 Motorola
        2.6.3 MIPS
        2.6.4 Patriot Scientific
        2.6.5 AMD
        2.6.6 Hitachi
        2.6.7 Intel
        2.6.8 PowerPC
        2.6.9 Sparc
      2.7 Improving Performance
        2.7.1 Multiple-issue Processors
        2.7.2 Multithreading
        2.7.3 Simultaneous Multithreading
        2.7.4 Chip Multiprocessor
        2.7.5 Prefetching
      2.8 Measuring Performance
        2.8.1 Benchmarking
        2.8.2 Simulation
      2.9 Trends and Research
        2.9.1 University
        2.9.2 Industry

    3 Interconnect
      3.1 Introduction and basic definitions
      3.2 Bus based architectures
        3.2.1 Arbitration mechanisms
        3.2.2 Synchronous versus asynchronous buses
        3.2.3 Performance metrics
        3.2.4 Pipelining and split transactions
        3.2.5 Direct Memory Access
        3.2.6 Bus hierarchies
        3.2.7 Connecting multiprocessors
      3.3 Case studies of bus standards with multiprocessor support
        3.3.1 FutureBus+
        3.3.2 VME
        3.3.3 PCI
      3.4 Point-to-point interconnections
        3.4.1 Interconnection topologies
      3.5 Interconnect performance & scaling
        3.5.1 Performance measures
        3.5.2 Shared buses
        3.5.3 Point-to-point architectures
      3.6 Interconnecting components in a SoC design
        3.6.1 VSIA efforts
        3.6.2 Differences between standard and SoC interconnects
      3.7 Case studies of existing SoC-Interconnects
        3.7.1 AMBA 2.0
        3.7.2 CoreConnect
        3.7.3 CoreFrame
        3.7.4 FPIbus
        3.7.5 FISPbus
        3.7.6 IPBus
        3.7.7 MARBLE
        3.7.8 PI-Bus
        3.7.9 SiliconBackplane
        3.7.10 WISHBONE Interconnect
        3.7.11 Motorola Unified Peripheral Bus
      3.8 Case studies of SoC multiprocessor interconnects
        3.8.1 Hydra
        3.8.2 Silicon Magic's DVine

    4 Memory System
      4.1 Semiconductor memories
        4.1.1 ROM
        4.1.2 RAM
      4.2 Memory hierarchy
      4.3 Cache memories
        4.3.1 Cache: the general case
        4.3.2 The nature of cache misses
        4.3.3 Storage strategies
        4.3.4 Replacement policies
        4.3.5 Read policies
        4.3.6 Write policies
        4.3.7 Improving cache performance
      4.4 MMU
      4.5 Multiprocessor architectures
        4.5.1 Symmetric Multiprocessors
        4.5.2 Distributed memory
        4.5.3 COMA: Cache Only Memory Access
        4.5.4 Coherence
        4.5.5 Coherence through bus-snooping
        4.5.6 Directory-based coherence
      4.6 Hardware-driven prefetching
        4.6.1 One-Block-Lookahead (OBL)
        4.6.2 Stream buffer
        4.6.3 Filter buffers
        4.6.4 Opcode-driven cache prefetch
        4.6.5 Reference Prediction Table (RPT)
        4.6.6 Data preloading
        4.6.7 Prefetching in multiprocessors

    5 Summary

  • 1 Introduction

    This State of the Art Report covers computer architecture topics with an emphasis on System on Chip (SoC). The reader is introduced to the basic ideas behind SoC and general computer architecture concepts, before an in-depth analysis of three important SoC components is presented: the CPU, the interconnect and the memory architecture.

    1.1 What is SoC?

    SoC stands for System-on-Chip and is a term for putting a complete system on a single piece of silicon. SoC has become a very popular word in the computer industry, but very few agree on a general definition of SoC [19]. There are several alternative names for putting a system on a chip, such as system-on-silicon, system-on-a-chip, system-LSI, system-ASIC, and system-level integration (SLI) device [33]. Some might say that a large design automatically makes it a SoC, but that would probably include every existing design today. A better approach would be to say that a SoC should include different structures such as a CPU-core, embedded memory and peripheral cores. This is still a wide definition, which could imply that any modern processor with an on-chip cache should be included in the SoC community. Therefore a more suitable definition of SoC would be:

    A complete system on a single piece of silicon, consisting of several types of modules including at least one processing unit designated for software execution, where the system depends on no or very few external components in order to execute its task.

    1.2 SoC Designs

    In the beginning, almost all SoCs were simply integrations of existing board-level designs [20]. This way of designing a system loses many benefits that could otherwise be taken advantage of if the system were designed from scratch. Another approach is to use already existing modules, called IP-components, and to integrate them into a complete system suitable for a single die.

    1.2.1 Intellectual Property

    When something is protected through patents, copyrights, trademarks or trade secrets, it is considered Intellectual Property (IP). Only patents and copyrights are relevant for IP-components [13] (also referred to as macros, cores and Virtual Components (VC) [10]). An IP-component is a pre-implemented, reusable module, for example a DMA-controller or a CPU-core. There are several companies that make their living by building, licensing and selling IP-components, for which the semiconductor companies pay both fees and royalties.¹ There exist three classes of IP-components with different properties regarding portability and protection characteristics. As the portability decreases through the classes, the protection increases.

    Soft This class of IP-components has its architecture specified at Register-Transfer Level (RTL), which is synthesizable. Soft IPs are functionally validated and are very portable and modifiable. Since they are not mapped to a specific technology, their behavior with respect to area, speed, and power consumption is unpredictable. Much work still needs to be done before the component can be utilized, and the end result depends on the synthesis tools used.

    Firm The firm class components are in general soft components that have been floorplanned and synthesized into one or several different technologies to get better estimations of area, speed, and power consumption.

    ¹ There are exceptions where one can acquire IP-components without any licensing or royalty fees. More information can be found at http://www.openip.org/.


    Hard Hard IPs are a further refinement of firm components. They are fully synthesized to mask level and physically validated. Very little work has to be done in order to implement the functionality in silicon. Hard IPs are neither modifiable nor portable, but the prediction of their area, speed, and power consumption is very accurate.

    1.2.2 An Example of a SoC

    A typical SoC consists of a CPU-core, a Digital Signal Processor (DSP), some embedded memory, and a few peripherals such as DMA, I/O, etc. (Figure 1). The CPU can perform several tasks with the assistance of a DSP when needed. The DSP is usually responsible for off-loading the CPU by doing numerical calculations on the incoming signals from the A/D-converter. The SoC could be built of only third-party IP-components, or it could be a mixture of IP-components and custom-made solutions. More recently, there have been efforts to implement a Multiprocessor System on Chip (MSoC) [6], which introduces new challenges regarding cost, performance, and predictability.

    Figure 1: An example of a SoC

    1.3 Why SoC?

    The first computer systems, consisting of relays and later vacuum tubes, used to occupy whole rooms, and their performance was negligible compared to today's standard workstations. The advent of the transistor in 1948 enabled engineers to shrink a functional block to an Integrated Circuit (IC). These ICs made it possible to build complex functions by combining several ICs on a circuit board. Further development of process technology increased the number of transistors on each IC, which led to the emergence of systems-on-board. Since then, there has been a constant battle between semiconductor companies to deliver the fastest, smallest and cheapest products, resulting in today's multi-billion dollar industry. Even though the SoC concept has been around for quite some time, it has not really been fully feasible until recent years, due to advances like deep sub-micron CMOS process technology.

    1.3.1 Motivation

    There are several reasons why SoC is an attractive way to implement a system. Today's refined manufacturing processes make it possible to combine both logic and memory on a single die, thus decreasing overall memory access times. Given that the application's memory requirement is small enough for the on-chip embedded memory, memory latency will be reduced due to the elimination of data traffic between separate chips. Since there is no need to access memory on external chips, the number of pins can also be reduced, and on-board buses become obsolete. Encapsulation accounts for over 50% of the overall process cost of chip manufacturing [15]. In comparison to an ordinary system-on-board, a SoC uses one or very few ICs, reducing total encapsulation cost and thereby total manufacturing cost. These characteristics, as well as lower power consumption and shorter time-to-market, enable smaller, better, and cheaper products reaching the consumers at an altogether faster rate.


    1.3.2 State of Practice and Trends

    Until now, much of SoC implementation has been about shrinking existing board-level systems onto a single chip, with little or no consideration of the benefits that could be gained from a chip-level design. Another approach to SoC is to interconnect several dies and place them inside one package. Such modules are called Multi-Chip Modules (MCM). The Hydra Multiprocessor Project at first chose an MCM implementation, which later evolved into a SoC [14, 6].

    Today it is too time-consuming for companies to implement a system from scratch. Instead, a faster and more reliable way is to use in-house or third-party pre-implemented IP-components [3], which makes designing a whole system more about integrating components than designing them. There exist three design methodologies, each with its own efficiency and cost regarding SoC design [16, 18]. The vendor design approach, which shifts the design responsibilities from the system designers to the ASIC vendors, can result in the lowest die cost, but it can also lead to higher engineering costs and longer time-to-market. A more flexible method is the partial integration approach, which divides the responsibilities of the design more equally. It lets the system designers produce the ASIC design, while the semiconductor vendors are responsible for the core and integration. This method gives the system designers more control of the working process in comparison to the vendor method. Yet more flexible is the desktop approach, which leaves the semiconductor vendors only to design the core. This reduces time-to-market and requires low engineering costs. A key property for IP-components in the future is parameterization of soft cores [16].

    There is a continuous growth in the demand for "smart products", which are expected to make our lives better and simpler. Recently, SoC products have begun to emerge on several markets in the form of Application Specific Standard Products (ASSP)² or Application Specific Instruction-set Processors (ASIP)³:

    Set-top-boxes A Set-Top-Box (STB) is a device that makes it possible for television viewers to access the Internet and also watch digital television (D-TV) broadcasts. The user has access to several services: weather and traffic updates, on-line shopping, sport statistics, news, e-commerce, etc. Integrating the STB's different components into a SoC simplifies system design and makes the product more competitive through shorter time-to-market, lower cost and lower power consumption. The Geode SC1400 chip is an example of a SoC used in an STB that meets the demands of delivering both high-quality DVD video and Internet accessibility [34].

    Cell phones A SoC in a cell phone will reduce its size and weight, and make it cheaper and less power consuming.

    Home automation Many domestic appliances at home will be "smarter". For example, the refrigerator will be able to notify its owner when a product is missing and place an order on the Internet.

    Hand-held devices A new generation of hand-held devices is coming that can send and receive email and faxes, make calls, surf the Web, etc. A SoC solution is especially suited for portable applications such as hand-held PCs, digital cameras, personal digital assistants and other hand-held devices, because its built-in peripheral functions minimize overall system cost and eliminate the need to use and configure additional components.

    1.3.3 Challenges

    One of the emerging challenges is to standardize the interfaces of IP-components to make integration and composition easier. Many different on-chip bus standards have been created by the different design houses to enable fast integration of IP-components, which has resulted in incompatibility caused by the different interfaces. To solve this dilemma, the Virtual Socket Interface Alliance (VSIA) was founded to enable the mix and match of IP-components from multiple sources, by proposing a hierarchical solution that enables multiple buses [17]. Still, some criticize VSIA for only addressing simple data flows [11]. More can be read about different on-chip bus standards in section 3.7.

    ² A high-integration chip or chipset for a specific application [59].

    ³ A field- or mask-programmable processor whose architecture and instruction set are optimized for a specific application domain [58].

    Since time-to-market is decreasing, the testing and verification of a SoC must be done very fast. When reusing IP-components, it is possible that the test development actually takes longer than the work to integrate the different functional parts [12]. The fact that the components come from different sources and may have different test methodologies complicates the testing of the whole system. In a board-level design, many of the components have their primary inputs visible, which makes testing easier, but SoCs contain deeply embedded components where there is little or no possibility to observe signals directly from an IP-component after manufacturing. Since the on-chip interconnect is inside the chip, it is also hard to test, due to the lack of observability.

    As the future lurks behind the door, integration is not likely to stop with IP-components and different memory technologies; we are also likely to see a variety of analog functionality. Analog blocks are very layout and process dependent, require isolation, and use different voltages and grounds. All these facts make them difficult to integrate in a design [10]. Are there limits to the integration urge? As process technologies become more sophisticated, transistor switching speed will increase and the voltage for logic levels will decrease. Dropping the voltages will make the units more sensitive to noise. Analog devices with higher voltage needs can encounter problems working properly in those environments [17].

    Apart from the lack of effective design and test methodologies [29] and all the technical problems with mapping a complex design, consisting of several IP-components from different design houses, onto a particular silicon platform, there are complex business issues dealing with licensing fees and royalty payments [30].

    1.4 Introduction to Computer System Architecture

    SoC is about putting a whole system on a single piece of silicon. But what is a system? This section serves as an introduction to computer system architecture and tries to give the reader a better understanding of what is actually put onto a SoC.

    1.4.1 Computer System

    In general, a typical computer system (Figure 2) consists of one or more CPUs that execute a program by fetching instructions and data from memory. To be able to access the memory, the CPU needs some kind of interface and a connection to it. The interface is usually provided by the Memory Management Unit (MMU), and the connection is handled by the interconnect. The local interconnect is often implemented as a bus consisting of a number of electrical wires. Sometimes the CPU needs assistance in fetching large amounts of data in order to be effective. This work can be done in parallel with the CPU by the Direct Memory Access (DMA) component. The system also needs some means to communicate with the outside world; this is provided by the I/O system. We proceed with a closer look at the important components that comprise a computer system.

    CPU The CPU is where arithmetic, logic, branching and data transfer are implemented [8]. It consists of registers, an Arithmetic Logic Unit (ALU) for computations, and a control unit. A CPU is classified as a Complex Instruction Set Computer (CISC) if the instruction set is complex (e.g. has a lot of instructions, several addressing modes, different instruction word lengths, etc.). The idea behind a Reduced Instruction Set Computer (RISC) is to make use of a limited instruction set to maximize performance on common instructions by working with a lot of registers, while taking penalties on the load and store instructions. RISC has a uniform length for all instructions and very few addressing modes. This uniformity is the main reason why the approach is suitable for instruction pipelining, in order to increase performance. There are other architectures that further increase performance, for example superscalar, VLIW, and vector computers. A machine is called an n-bit machine if it operates internally on n-bit data [8]. Today a lot of embedded processors still work with 8- or 16-bit words, while the majority of workstations and PCs are 32- or 64-bit machines.


    [Figure 2: A typical computer system - a CPU with cache, a DMA controller with its DMA device, and main memory, connected by address and data lines over a system bus]

    Memory System The key to an effective computer system is to implement an efficient memory hierarchy, because the latency of memory accesses has become a significant factor when executing a memory-accessing instruction. In the last decade the gap between memory and CPU speed has been growing. Memory sub-systems must be built to overcome their shortcomings relative to the processor, which otherwise result in computational time wasted waiting for memory operations to complete. Memory is often organized in a primary and a secondary part, where the primary memory usually consists of one or several RAM circuits. The secondary part is for long-term storage, like Hard Disk Drives (HDDs), disks, tapes, optical media, and sometimes FLASH memories. To bridge the gap between the memory and the CPU, nearly all modern processors have a cache that makes use of the inherent locality of data and code and that ideally can deliver data to the CPU in just one clock cycle. Usually there exist several levels of cache between the main memory and the CPU, each with different sizes and optimizations. The memory system interfaces to the outside world (e.g. processor and I/O) via the MMU, which has the responsibility to translate addresses and to fetch data from memory via the memory bus. In multiprocessor systems there is the issue of whether memory should be local to every node or global. (A small numeric sketch of the cache's effect on access time follows this component overview.)

    Interconnect A computer system's internal components need to communicate in order to perform their tasks. To make communication between components possible, an interconnect is usually used. An interconnect can be designed in a variety of ways, called topologies. Examples of topologies are bus, switch, crossbar, mesh, hypercube, torus, etc. Each topology has its own characteristics concerning latency, scalability and performance.

    I/O System The Input/Output system is the computer system's interface to the outside world, which enables it to receive input and output results to the user. Examples of I/O devices include HDDs, graphics and sound systems, serial ports, parallel ports, keyboards, mice, etc. The transfer of I/O data is usually taken care of by the DMA component, to off-load the CPU from constant data transfer.
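    To make the memory-gap discussion above concrete, the standard average-memory-access-time relation can be worked through in a few lines. The following C sketch uses assumed, illustrative cycle counts (they are not measurements from any system discussed here) to show how even a modest miss rate inflates the effective access time:

        /* Average memory access time (AMAT) for a single cache level:
           AMAT = hit time + miss rate * miss penalty.
           All cycle counts below are assumed figures for illustration. */
        #include <stdio.h>

        int main(void)
        {
            double hit_time     = 1.0;   /* cycles: the ideal one-cycle cache hit */
            double miss_rate    = 0.05;  /* 5% of accesses miss in the cache */
            double miss_penalty = 30.0;  /* cycles to fetch the data from main memory */

            double amat = hit_time + miss_rate * miss_penalty;
            printf("average memory access time: %.2f cycles\n", amat);  /* 2.50 */
            return 0;
        }

    Even with only one access in twenty missing, the effective latency is two and a half times the ideal one-cycle hit, which is why the multi-level hierarchies described above are used.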

    1.5 Research & Design Clusters

    There is a lot of research effort in computer architecture, which is of course related to some degree to SoCs, since they are all actually computers. Unlike most research areas, SoC research is led by industry and not by the universities. Of those universities that have SoC-related research projects, very few have reached the implementation stage.


    1.5.1 Hydra: A next generation microarchitecture

    The Stanford Hydra single-chip multiprocessor [6] started out as a Multi-Chip Module (MCM) in 1994, but evolved in 1997 to become a Chip Multiprocessor (CMP). The project is supervised by Associate Professor Kunle Olukotun, accompanied by Associate Professor Monica S. Lam and Mark Horowitz; also incorporated in the project are a dozen students. Early development of the project was performed by Basem A. Nayfeh, nowadays a Ph.D. The Hydra project focuses on combining shared-cache multiprocessor architectures, innovative synchronization mechanisms, advanced integrated circuit technology and parallelizing compiler technology to produce microprocessor cost/performance and parallel processor programmability. The four integrated MIPS-based processors will demonstrate that it is feasible for a multiprocessor to gain better cost/performance than wide superscalar architectures achieve on sequential applications. By using an MCM, communication bandwidth and latency are improved, resulting in better parallelism. This makes Hydra a good platform to exploit fine-grained parallelism; hence a parallelizing compiler for extracting this sort of parallelism is under development. The project is financed by US Defense Advanced Research Projects Agency (DARPA) contracts DABT and MDA.

    1.5.2 Self-Test in Embedded Systems (STES)

    STES is a co-operative project between ESLAB, the Laboratory for Dependable Computing of Chalmers University of Technology, the Electronic Design for Production Laboratory of Jönköping University, the Ericsson CadLab Research Center, FFV Test Systems, and SAAB Combitech Electronics AB. ESLAB is responsible for developing a self-test strategy for system-level testing of complex embedded systems, which utilizes the BIST (Built-In Self-Test) functionality at the device, board, and MCM level. Apart from the involved commercial participants, the project is funded by NUTEK.

    1.5.3 Socware

    An international Swedish design centre/cluster has recently been built that will work in close cooperation with the technical universities in Linköping/Norrköping, Lund and Stockholm/Kista. Socware, formerly known as the Acreo System Level Integration Center (SLIC), aims to have nearly 40 employees/specialists in the beginning, but this number is expected to grow to 1500 in the near future, with a special research institute located in Norrköping. The design centre will serve as a bridge between industry and the research activity at the universities, enabling research results to be rapidly converted into industrial products.

    The focus of research and development will be directed at the design of system components within digital media technology. Initially, special focus will be on applications in mobile radio and broadband networking. The project is financed by the government, the municipality of Norrköping and other local and regional agencies. More information can be found in [35].

    1.5.4 The Pittsburgh Digital Greenhouse

    The Pittsburgh Digital Greenhouse is a SoC design cluster that focuses on the digital video and networking markets. The non-profit organization is an initiative taken by the U.S. government, universities, and industry that started in June 1999. It involves Carnegie Mellon University, Penn State University, the University of Pittsburgh, and several industry members like Sony, Oki, and Cadence.

    Some ongoing research activities closely related to SoC are:

    Configurable System on a Chip Design Methodologies with a Focus on Network Switching This project focuses on the development of design tools for hardware/software co-design, such as those required for next-generation switches on the Internet and for cryptography.

    Architecture and Compiler Power Issues in System on a Chip This program is focused on creating a software system that characterizes the power of the major components of a SoC design and allows the design to be optimized for the lowest possible power consumption.


    MediaWorm: A Single Chip Router Architecture with Quality of Service Support This research focuses on the design, fabrication, and testing of a new high-performance switched network router, called MediaWorm. It is aimed at computer clusters where there are demands for Quality of Service (QoS) guarantees.

    Lightweight Arithmetic IP: Customizable Computational Cores for Mobile Multimedia Appliances The focus is on the complexity of multimedia algorithms and the development of mathematical software functions that provide the required level of computational performance at lower power levels.

    The long-range goal is to have SoCs in a wide range of next-generation products, from "smart homes" to hand-held devices that allow the user to surf the Web, send faxes and receive e-mail. Further goals are to provide venture capital, training and education, and to assist start-up companies that use the research results and pre-designed chips created by the Digital Greenhouse in their products. More information can be found at [37].

    1.5.5 Cadence SoC Design Centre

    In February 1998 the Cadence Design Centre opened, with the purpose of creating one of the electronics industry's largest and most advanced SoC design facilities. The centre is located on the Alba Campus in Livingston, Scotland, and is the largest European design centre. It offers expertise within the spheres of Digital IC, Multimedia/Wireless, Analogue/Mixed Signal, Datacom/Telecom, Silicon Technology Services, and Methodology Services. In 1999 the centre became authorized as the first ARM Approved Design Centre, through the ARM Technology Access Program (ATAP). Current research projects conducted at the centre involve a single-chip processor for Internet telephony and audio, a flexible receiver chip suitable for, among other things, pinpointing location by picking up high-frequency radio waves transmitted by GPS satellites, and a fully customized wireless Local Area Network (LAN) environment. There are three main pieces of the centre: the Virtual Component Exchange (VCX), the Institute for System Level Integration (SLI) and the Alba Centre. The VCX opened in 1998 and is an institution dedicated to establishing a structured framework of rules and regulations for inter-company licensing of IP blocks. Members of the VCX include ARM, Motorola, Toshiba, Hitachi, Mentor Graphics, and Cadence. The SLI institute is an educational institution dedicated to system-level integration and research. The institute was established by four of Scotland's leading universities: Edinburgh, Glasgow, Heriot-Watt and Strathclyde. Finally, the Alba Centre is the headquarters of the whole initiative and provides a central point for information about the venture and assistance for interested firms.

  • 2 Embedded CPU

    There are several different interpretations of the term CPU. Some say it is "the brains of the computer" or "where most calculations take place", and that it "acts as an information nerve center for a computer". A more concrete definition is given by John L. Hennessy and David A. Patterson [8]:

    Where arithmetic, logic, branching, and data transfer are implemented.

    This chapter serves as an introduction to CPUs that are especially suitable for SoC solutions, namely embedded CPUs. In this case, the term "embedded" does not only refer to how suitable these CPUs are for embedded systems, or as stand-alone microprocessors, but also to how they are good candidates to be "embedded" into a SoC. The purpose of this chapter is to look at the possibilities of embedded processors as SoC components and what aspects need to be considered when designing and implementing a solution. Techniques for improving and measuring performance are discussed, as well as where the research is today, together with a look at the future of embedded processors.

    The chapter begins with an introduction to embedded CPUs that explains some of the factors behind their popularity. Section 2.2 is a presentation of the building blocks of a modern embedded CPU. Section 2.3 looks at which paradigm is currently in front regarding embedded CPUs. Section 2.4 discusses the major factors affecting the design. Section 2.5 considers options on how to implement an embedded CPU. Section 2.6 presents case studies of embedded CPUs available on the market today. Section 2.7 shows several techniques for improving performance. Next, section 2.8 considers how the performance of an embedded processor can be measured. Finally, section 2.9 looks at where the research is today and what the trends are in the embedded processor market.

    2.1 Introduction

    The latest advances in process technology have increased the number of available transistors on a single die almost to the extent that today's battle between designers is not about how to fit it all on a single piece of silicon, but how to make the most use of it. This evolution has also made it possible for designers to put a complete processor, together with some or all of its peripheral components, on a single die, creating a new class of products called Application Specific Standard Products (ASSPs). The demand for ASSPs has in turn created a new domain of processors, embedded 32-bit CPUs, that are cheap, energy-efficient, and especially designed for solving their domain of tasks.

    Before getting into all the wonders of embedded CPUs, some clarifications should be made about what they are and what they are not. When CPUs are discussed, the thoughts often go to the architectures from Intel, Motorola, Sun, etc. These architectures are mainly designed for the desktop market and have dominated it for a long time. In recent years, there has been an increasing demand for CPUs designed for a specific domain of products. Among those noticing this trend was David Patterson [21]:

    Intel specializes in designing microprocessors for the desktop PC, which in five years may no longer be the most important type of computer. Its successor may be a personal mobile computer that integrates the portable computer with a cellular phone, digital camera, and video game player... Such devices require low-cost, energy-efficient microprocessors, and Intel is far from a leader in that area.

    The question of what the difference is between a desktop and an embedded processor is still unanswered. Actually, some embedded platforms arose from desktop platforms (such as MIPS, Sparc, x86), so the difference cannot lie in register organization, the instruction set or the pipelining concept. Instead, the factors that differentiate a desktop CPU from an embedded processor are power consumption, cost, integrated peripherals, interrupt response time, and the amount of on-chip RAM or ROM. The desktop world values processing power, whereas an embedded processor must do the job for a particular application at the lowest possible cost [22].


    2.2 The Building Blocks of an Embedded CPU

    This section serves as an introduction to the components of a modern embedded CPU. Readers that are familiar with the basics of computer architecture and processor design might skip this section.

    A CPU basically consists of three components: a register set, an ALU, and a control unit. Today it is often the case that the CPU also includes an on-chip cache and a pipeline, in order to achieve an adequate level of performance (Figure 3). The following text gives a brief introduction to each component's function and purpose in the CPU.

    [Figure 3: A typical embedded CPU - a 32-bit register file, barrel shifter, 32 x 8 multiplier and 32-bit ALU, an address register with incrementer, a write data register, control logic with instruction decoder and instruction pipeline, and a cache, connected over ALU, PC and increment buses to 32-bit address and data buses]

    2.2.1 Register File

    The organization of registers, i.e. how information is handled inside the computer, is part of a machine's Instruction Set Architecture (ISA) [8, 9]. An ISA includes the instruction set, the machine's memory, and all of the registers that are accessible by the programmer. ISAs are usually divided into three main categories according to how information is stored in the CPU: stack architecture, accumulator architecture, and general-purpose register (GPR) architecture. These architectures differ in how operands are handled. A stack architecture keeps its current operands on top of the stack, while an accumulator architecture keeps one implicit operand in the accumulator, and a general-purpose register architecture only has explicit operands, which can reside either in memory or in registers. The following example shows how the expression A = B + C would be evaluated on these three architectures.

        stack architecture    accumulator architecture    general-purpose architecture
        PUSH C                LOAD  R1,C                  LOAD  R1,C
        PUSH B                ADD   R1,B                  LOAD  R2,B
        ADD                   STORE A,R1                  ADD   R3,R2,R1
        POP A                                             STORE A,R3

    The machines in the early days used stack architectures and did not need any registers at all. Instead, the operands were pushed onto the stack and popped off into a memory location. Some advantages were that space could be saved because the register file was not needed, and no explicit operands were needed for arithmetic operations. As memories became slower compared to CPUs, the stack architecture became ineffective, due to the fact that most time is spent fetching the operands from memory and writing them back. This became a major bottleneck, which made the accumulator architecture a more attractive choice.

    The accumulator architecture was a step up in performance, letting the CPU hold one of the operands in a register. Often, the accumulator machines only had one data accumulator register, together with other address registers. They are called accumulators due to their responsibility to act as a source of one operand and the destination of arithmetic instructions, thus accumulating data. The accumulator machine was a good idea at a time when memories were expensive, because only one address operand had to be specified, while the other resided in the accumulator. Still, the accumulator machine has its drawbacks when evaluating longer expressions, due to the limited number of accumulator registers.

    The GPR machines solved many problems often related to stack and accumulator machines. They could store variables in registers, thus reducing the number of accesses to main memory. Also, the compiler could associate the variables of a complex expression in several different ways, making it more flexible and efficient for pipelining. A stack machine needs to evaluate the same complex expression from left to right, which might result in unnecessary stalling. Many embedded CPUs are RISC architectures, which means that they have lots of registers (usually about 32).

    2.2.2 Arithmetic Logic Unit

    The Arithmetic Logic Unit (ALU) performs arithmetic and logic functions in the CPU. It is usually capable of adding, subtracting, comparing, and shifting. The design can range from simple combinational logic units that do ripple-carry addition, shift-and-add multiplication, and single-bit shifts, to no-holds-barred units that do fast addition, hardware multiplication, and barrel shifts [9].
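    As a behavioural illustration of the operations just listed, the following C sketch models a small ALU as a single function. The opcode names, the 32-bit width and the zero flag are assumptions made for the example, not taken from any particular processor:

        /* A minimal behavioural model of an ALU that adds, subtracts,
           compares, and shifts. Opcode names and the zero flag are
           illustrative assumptions. */
        #include <stdint.h>

        typedef enum { ALU_ADD, ALU_SUB, ALU_CMP, ALU_SHL, ALU_SHR } alu_op;

        uint32_t alu(alu_op op, uint32_t a, uint32_t b, int *zero)
        {
            uint32_t result = 0;
            switch (op) {
            case ALU_ADD: result = a + b;         break;
            case ALU_SUB: result = a - b;         break;
            case ALU_CMP: result = a - b;         break;  /* only the flag matters */
            case ALU_SHL: result = a << (b & 31); break;  /* shift amount modulo 32 */
            case ALU_SHR: result = a >> (b & 31); break;
            }
            *zero = (result == 0);  /* a typical condition-code side effect */
            return result;
        }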

    2.2.3 Control Unit

    The control unit is responsible for generating proper timing and control signals to the other logical blocks (it is usually implemented as a state machine that performs the machine cycle: fetch, decode, execute, and store) in order to complete the execution of instructions.
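    The machine cycle the control unit sequences can be seen in software form as a fetch-decode-execute loop. The C sketch below runs a three-instruction program on an invented two-opcode machine; the ISA is purely an assumption for illustration:

        /* A minimal fetch-decode-execute loop modelling the machine cycle
           the control unit sequences. The tiny ISA is invented for this
           illustration only. */
        #include <stdint.h>
        #include <stdio.h>

        enum { OP_HALT = 0, OP_LOADI = 1, OP_ADD = 2 };  /* assumed opcodes */

        int main(void)
        {
            /* program: acc = 40; acc += 2; halt */
            uint8_t mem[] = { OP_LOADI, 40, OP_ADD, 2, OP_HALT };
            uint32_t pc = 0, acc = 0;

            for (;;) {
                uint8_t op = mem[pc++];                  /* fetch, advance PC */
                switch (op) {                            /* decode            */
                case OP_LOADI: acc = mem[pc++];  break;  /* execute + store   */
                case OP_ADD:   acc += mem[pc++]; break;
                case OP_HALT:  printf("acc = %u\n", (unsigned)acc); return 0;
                }
            }
        }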

    2.2.4 Memory Management Unit

    The Memory Management Unit (MMU) is located between the CPU and the main memory and is responsible for translating virtual addresses into their corresponding physical addresses. The physical address is then presented to the main memory. The MMU can also enforce memory protection when needed.
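    To make the translation step concrete, here is a hedged C sketch of the lookup an MMU performs, using a single-level page table with 4 KB pages. The page size, table layout and protection bit are assumptions chosen for illustration; real MMUs typically use multi-level tables and a TLB:

        /* Virtual-to-physical translation with a single-level page table.
           Geometry and flags are illustrative assumptions. */
        #include <stdint.h>

        #define PAGE_SHIFT 12                      /* 4 KB pages */
        #define PAGE_MASK  ((1u << PAGE_SHIFT) - 1)

        typedef struct {
            uint32_t frame;    /* physical frame number */
            int      valid;    /* mapping present? */
            int      writable; /* simple protection bit */
        } pte_t;

        /* Returns the physical address, or -1 on a fault. */
        int64_t translate(const pte_t *page_table, uint32_t vaddr, int is_write)
        {
            uint32_t vpn    = vaddr >> PAGE_SHIFT;   /* virtual page number */
            uint32_t offset = vaddr & PAGE_MASK;     /* offset passes through untranslated */
            const pte_t *pte = &page_table[vpn];

            if (!pte->valid || (is_write && !pte->writable))
                return -1;                           /* page fault / protection fault */
            return (int64_t)(((uint64_t)pte->frame << PAGE_SHIFT) | offset);
        }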

    2.2.5 Cache

    There are few processors today that don't incorporate a cache. The cache acts as a buffer between the CPU and the main memory to reduce access time, taking advantage of the locality of both code and data. There are usually several levels of cache, each with its own purpose. The first level is usually located on-chip, together with the CPU, and the cache is often separated into an instruction- and a data-cache. A cache is especially important in a RISC architecture with frequent loads and stores. For example, Digital's StrongARM chip devotes about 90% of its die area to cache [89]. The reader can learn more about caches and how they are used in section 4.3.
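    The core mechanism that lets a cache exploit locality is how it splits an address into a tag, an index and an offset. The C sketch below shows this split for a direct-mapped cache; the geometry (512 lines of 32 bytes, i.e. 16 KB) is an assumption picked for the example:

        /* Address decomposition and hit check for a direct-mapped cache.
           The cache geometry is an illustrative assumption. */
        #include <stdint.h>
        #include <stdbool.h>

        #define LINE_SIZE   32u                    /* bytes per cache line */
        #define NUM_LINES   512u                   /* 512 * 32 B = 16 KB   */
        #define OFFSET_BITS 5u                     /* log2(LINE_SIZE)      */
        #define INDEX_BITS  9u                     /* log2(NUM_LINES)      */

        typedef struct {
            bool     valid;
            uint32_t tag;
            uint8_t  data[LINE_SIZE];
        } cache_line;

        /* Returns true on a hit: the indexed line is valid and its tag matches. */
        bool cache_lookup(const cache_line cache[NUM_LINES], uint32_t addr)
        {
            uint32_t index = (addr >> OFFSET_BITS) & (NUM_LINES - 1);
            uint32_t tag   = addr >> (OFFSET_BITS + INDEX_BITS);
            return cache[index].valid && cache[index].tag == tag;
        }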

    2.2.6 Pipeline

    As with caches, there are very few processors today that don't use some kind of pipelining to improve their performance. This section serves as an introduction to pipelining and to the benefits and drawbacks of using it. Pipelining is an implementation technique that tries to achieve Instruction Level Parallelism (ILP) by letting multiple instructions overlap in execution. The objective is to increase throughput, the number of instructions completed per unit of time. By dividing the execution of an instruction into several phases, called pipeline stages, an ideal speedup equal to the pipeline depth can theoretically be achieved. Also, by dividing the pipeline into several stages, the workload in each stage becomes smaller, letting the processor run at a higher frequency [8]. Figure 4 shows a typical pipeline together with its stages. This particular pipeline has a depth of five and consists of unique pipeline stages, each with its own purpose.

Figure 4: A general pipeline with the stages IF, ID, EX, MEM, and WB.

The Instruction Fetch cycle (IF) fetches the next instruction from memory.

The Instruction Decode cycle (ID) decodes the instruction and reads the register file in case one or more of the instruction's operands are registers.

The Execution cycle (EX) evaluates ALU operations, or calculates the destination address in the case of a branch instruction.

The Memory Access cycle (MEM) is where memory is accessed when needed, or, in the case of a branch, where the new program counter is set to the destination address calculated in the previous pipeline stage. (For a conditional branch, the condition is evaluated here: if the instruction branches, the program counter is set by the previous calculation; otherwise it is incremented to point at the next instruction.)

The Write-back cycle (WB) writes the result back to the register file.
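The ideal speedup claim is easy to verify numerically. The following C sketch (an illustration, not from the thesis) compares cycle counts for n instructions on the five-stage pipeline just described against one-instruction-at-a-time execution; the speedup approaches the pipeline depth as n grows:

    #include <stdio.h>

    int main(void) {
        const int depth = 5;                 /* IF, ID, EX, MEM, WB */
        for (int n = 1; n <= 1000; n *= 10) {
            int unpiped = n * depth;         /* one instruction at a time   */
            int piped   = depth + (n - 1);   /* fill once, then 1 per cycle */
            printf("n=%4d  unpipelined=%5d  pipelined=%4d  speedup=%.2f\n",
                   n, unpiped, piped, (double)unpiped / piped);
        }
        return 0;
    }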

The time it takes for an instruction to move from one pipeline stage to another is usually referred to as a machine cycle. If one stage requires several cycles to complete, it can be decomposed into several smaller stages, resulting in a superpipeline. Because instructions need to move at the same time, the length of a machine cycle is dictated by the slowest stage of the pipeline. The designer's challenge is to reduce the number of machine cycles per instruction (CPI). If one would execute one instruction at a time, the CPI count would be equal to the pipeline length. The optimal result in a linear pipeline would be a CPI count of 1.0, which means that an instruction is completed every cycle and that every pipeline stage is fully utilized. This is not entirely achievable in practice, because a program usually contains internal dependencies, branches, etc. (The CPI count never quite reaches the ideal value of 1.0: cycles are always lost in the beginning because the pipeline is initially empty and needs to be filled with instructions, so by the time the first instruction reaches the WB phase, several cycles have been lost.) These pipeline obstacles are usually referred to as hazards and can cause delays in the pipeline, called stalls. An execution of a program completely without hazards would execute its instructions with virtually no delays, resulting in a CPI count close to 1.0. Those hazards that do cause pipeline stalls are usually classified as structural, data, and control hazards.

structural hazards are caused by resource conflicts when the hardware cannot support certain combinations of overlapped execution. The cause can be as simple as having only one port to the register file, creating conflicts between the ID and WB stages for register requests. Another source of conflict can be a memory that is not divided into code and data, causing conflicts between the IF and MEM stages due to simultaneous instruction fetching and memory writing.

data hazards are caused by an instruction depending on another instruction still in the pipeline, so that execution must be stalled, or else the written data can be inconsistent. These instruction dependencies come in three flavors: Read After Write (RAW), Write After Write (WAW), and Write After Read (WAR). RAW hazards are the most common ones and occur when a write instruction is followed by a read instruction and both instructions operate on the same register, forcing the reading instruction to wait until the write has been issued in the WB stage. This can be handled by forwarding, which introduces "shortcuts" in the pipeline so that instructions can use results before the producing instruction reaches the WB stage [8] (a small worked example follows after this list). WAW hazards cannot occur in pipelines like the one shown earlier (figure 4): for a WAW hazard to occur, either the memory stage has to be divided into several stages, making several simultaneous writes possible, or there must be some mechanism by which an instruction can bypass another instruction in the pipeline. WAR hazards are rare and happen when an instruction tries to write to a register read by an instruction that is ahead in the pipeline. As with WAW hazards, WAR hazards cannot occur in a general pipeline because register contents are always read in the ID stage; pipelines that read register contents late can, however, create WAR hazards [8].

control hazards are caused by the instructions that change the path of execution, called branches. By the time a branch instruction has calculated its destination address in the EX stage, the instructions following the branch have already reached the IF and ID stages. If the branch is unconditional, the instructions in the IF and ID stages have to be removed, because the branch changes the program counter and new instructions have to be fetched from a new address, namely the destination address of the branch. If, on the other hand, the branch is conditional, the condition needs to be evaluated in order to decide whether the branch should be taken or the program counter simply incremented. One way of dealing with this problem is to automatically stall the pipeline until the condition is evaluated. These stalls are issued in the ID stage, where the branch is first identified. Also, in order to evaluate the condition of a conditional branch and calculate the destination address simultaneously, extra logic for condition evaluation can be added alongside the ALU in the ID stage. This way, only one stall cycle is wasted when a branch instruction occurs.
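The promised worked example of a RAW hazard: the following C sketch (an illustration under stated assumptions, not from the thesis) counts the stall cycles for a back-to-back producer/consumer pair in the five-stage pipeline of figure 4, with and without an EX-to-EX forwarding path.

    #include <stdio.h>

    int main(void) {
        /* Producer:  ADD r1, r2, r3  -- r1 is written back in WB (cycle 5)
         * Consumer:  SUB r4, r1, r5  -- issued one cycle later, reads r1 in ID
         * Assumption: the register file is written in the first half of WB
         * and read in the second half, so ID may overlap with WB.          */
        int producer_wb = 5;           /* producer's WB cycle               */
        int consumer_id = 3;           /* consumer's unstalled ID cycle     */
        int stalls_no_fwd = producer_wb - consumer_id;   /* 2 stall cycles  */
        int stalls_fwd = 0;            /* EX->EX bypass feeds the result
                                          directly to the consumer's EX     */
        printf("without forwarding: %d stalls\n", stalls_no_fwd);
        printf("with forwarding:    %d stalls\n", stalls_fwd);
        return 0;
    }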

Most structural hazards can be prevented by adding more ports and dividing the memory into data and instruction memory segments. The memory can also be improved by adding cache or increasing the cache area. Data hazards can be handled by letting the compiler reschedule the instructions in order to reduce the number of dependencies. Control hazards can be reduced by trying to predict the destination of a branch. The prediction is based on tables storing historical information about whether the same branch did or did not jump in earlier executions; such tables are called Branch History Tables (BHT) or Branch Prediction Buffers (BPB). Other tables, such as the Branch Target Buffer (BTB), act as a cache storing the destination addresses of previously executed branches. The interested reader can continue in several books and articles addressing different branch penalty reduction techniques [8, 64, 63].
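A minimal sketch of such a prediction buffer (assuming the common 2-bit saturating-counter scheme and an illustrative 256-entry table; neither parameter is from the thesis):

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define ENTRIES 256
    /* counter values 0..3: strongly/weakly not taken, weakly/strongly taken */
    static uint8_t bpb[ENTRIES];

    static bool predict(uint32_t pc) {
        return bpb[pc % ENTRIES] >= 2;      /* counters 2 and 3 predict taken */
    }

    static void update(uint32_t pc, bool taken) {
        uint8_t *c = &bpb[pc % ENTRIES];
        if (taken  && *c < 3) (*c)++;       /* saturate at strongly taken     */
        if (!taken && *c > 0) (*c)--;       /* saturate at strongly not taken */
    }

    int main(void) {
        uint32_t pc = 0x400100;             /* hypothetical branch address    */
        bool outcome[] = { true, true, true, false, true, true }; /* a loop   */
        int correct = 0, n = sizeof outcome / sizeof outcome[0];
        for (int i = 0; i < n; i++) {
            if (predict(pc) == outcome[i]) correct++;
            update(pc, outcome[i]);
        }
        printf("%d/%d predicted correctly\n", correct, n);
        return 0;
    }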

    2.3 The Microprocessor Evolution

This section serves as a walk-through of the different phases in microprocessor evolution (figure 5). Although this section may seem irrelevant to embedded processors, embedded processor design has always been influenced by the microprocessor and may continue to be so in the future. The reader who feels unfamiliar with the principles behind the RISC and CISC paradigms should reread section 1.4.1 before proceeding with this text.

In the early days, there was a limited number of transistors available to the CPU designer. Usually, the chips were filled with logic that was seldom used (e.g. decoding of rarely used instructions). CISC computers used microcoding, which made it easier to execute complex instructions. As the years went by, it became harder for CISC designers to keep up with Moore's law (the capability of microprocessors doubles roughly every 18 months). Building more complex solutions each year was not enough. Some designers realized that the rule of locality of reference needed to be taken into consideration. It states that a program executes about 90% of its instructions

in 10% of its code.

Figure 5: The evolution of microprocessors. (The figure traces CISC, with microcoding, complex instructions, and variable instruction completion times; RISC, with pipelining and simple instructions for speed; the merged RISC/CISC architectures; superscalar/VLIW designs that execute multiple instructions; multithreaded processors with duplicated hardware resources such as registers, PC, and SP; simultaneous multithreading, where any context can execute each cycle; and single-chip multiprocessors with duplicated processors.)

The RISC designers thought that if they could implement the 10% most used

    instructions and throw out the other 90%, then there would be lots of free die area left for other ways of

    increasing the performance. Some of the performance enhancing techniques are listed below.

Cache Memory references were becoming a serious bottleneck, and one way to reduce the access time is to use the extra on-chip space for cache. With an on-chip cache, the processor does not need to access the main memory for every memory reference.

Pipelining By breaking down the handling of an instruction into several simpler stages, the processor is able to run at a higher clock frequency.

More registers When compiling a program into machine code, the handling of variables is usually taken care of by registers. Sometimes there are stalls in the pipeline due to dependencies between registers (e.g. a register cannot be used until it is available), which can be avoided by register renaming. This becomes possible when the number of registers is increased.

Computers using some or all of these techniques include the RISC I and the IBM 801 [2]. These enhancements gave the RISC designers the upper hand for several generations in the 80's and 90's. But when the number of available transistors on a chip passed the million mark, the transistor count disappeared as a limiting factor. The CISC designers could level the score by introducing more complex solutions that increased their performance by a couple of percent, with little concern for how much die area was used. Even though a CISC processor was several factors more complex than the corresponding RISC processor, it was still keeping up with the RISC. Nowadays, the RISC and CISC paradigms have merged and use techniques from both of the original paradigms. Now, when there are 10, 20 million or more transistors available, the problem the designer faces is more about making the most use of all the transistors than about how to fit it all on one die. A simple processor can now be realized on only a fraction of the available space, and there are limits to the performance gains from increasing the cache size, deepening the pipeline, and increasing the number of registers. So, the question is what to do with the available space? To gain more performance, new architectures like Multithreading, Simultaneous Multithreading (SMT), Very Long Instruction Word (VLIW), and Single Chip Multiprocessor (CMP) are emerging. These architectures will be discussed in section 2.7.


    2.4 Design Aspects

The designers of embedded processors are under market pressure to produce cheap, low-power, fast processors [22]. To meet the market demand for a SoC solution, the designer of an embedded processor needs to consider several design aspects, listed below.

    2.4.1 Code Density

The size of a program may not be an issue in the desktop world, but it is a major challenge in embedded systems. The embedded processor market is highly constrained by power, cost, and size. For control-oriented embedded applications, a significant portion of the final circuitry is used for instruction memory. Since the cost of an integrated circuit is strongly related to die size, smaller programs imply that smaller, and therefore cheaper, dies can be used for embedded systems [81, 82].

Thumb and MIPS16 are two approaches that try to reduce the code size of programs by compressing the code. Thumb and MIPS16 are subsets of the ARM and MIPS-III architectures, respectively. The instructions in each subset are either frequently used, do not require the full 32 bits, or are important to the compiler for generating small code. The original 32-bit instructions are re-encoded to be 16 bits wide. Thumb and MIPS16 are reported to achieve code reductions of 30% and 40%, respectively. The 16-bit instructions are fetched from instruction memory and decoded to equivalent 32-bit instructions that are run by the core as usual. Both approaches have drawbacks:

Instruction widths are shrunk at the expense of reducing the number of bits used to represent registers and immediate values.

Conditional execution and zero-latency shifts are not possible in Thumb.

Floating-point instructions are not available in MIPS16.

The number of instructions in a program grows with compression.

Thumb code runs 15-20% slower on systems with ideal instruction memories.

Both Thumb and MIPS16 use the execution-based selection form of selective compression, a technique that selects procedures to compress according to a profile of procedure execution frequency. The other form is miss-based selection, in which decompression is invoked only on an instruction cache miss. All performance loss then occurs on the cache miss path, so miss-based selection is driven by the number of cache misses rather than by the number of executed instructions as in execution-based selection. Speedup can be achieved by letting the procedures with the most cache misses remain in native code.
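A minimal sketch of the two selection policies (the procedures, profile numbers, and thresholds below are hypothetical, invented purely for illustration): both leave "hot" procedures in native code and compress the rest, differing only in the heat metric.

    #include <stdio.h>
    #include <stdbool.h>

    typedef struct {
        const char *name;
        long exec_count;   /* profile: instructions executed     */
        long cache_misses; /* profile: instruction-cache misses  */
    } proc;

    static bool compress_exec_based(const proc *p, long threshold) {
        return p->exec_count < threshold;   /* rarely-executed code compressed */
    }

    static bool compress_miss_based(const proc *p, long threshold) {
        return p->cache_misses < threshold; /* rarely-missing code compressed  */
    }

    int main(void) {
        proc procs[] = {
            { "interrupt_handler", 900000,  120 },
            { "init_hw",              500,   90 },
            { "dsp_filter",        800000, 4000 },
        };
        for (int i = 0; i < 3; i++)
            printf("%-17s exec-based:%s miss-based:%s\n", procs[i].name,
                   compress_exec_based(&procs[i], 10000) ? "compress" : "native",
                   compress_miss_based(&procs[i], 1000)  ? "compress" : "native");
        return 0;
    }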

Jim Turley has a different view on the techniques for reducing code size [89]: claimed advantages in code density should be considered in light of factors such as compiler optimization (loop unrolling, procedure inlining, etc.), the addressing (32-bit vs. 64-bit integers or pointers), and memory granularity. Finally, code density does little or nothing to affect the size of the data space. Applications working with large data sets require much more memory than the executable itself, so code reduction is of little help there.

    2.4.2 Power Consumption

Many products using embedded processors use batteries as their power supply. To preserve as much power as possible, embedded processors usually operate in three different modes: fully operational, standby mode, and clock-off mode [22]. Fully operational means that the clock signal is propagated to the entire processor, and all functional units are available to execute instructions. When the processor is in standby mode, it is not actually executing instructions, but the DRAM is still refreshed and the register contents remain available. The processor returns to fully operational mode, without losing any information, upon an activity that requires units not available in standby mode. Finally, in clock-off mode the system has to be restarted in order to continue, which takes almost as much time as an initial start-up. Power consumption is often measured in milliwatts per megahertz (mW/MHz).

The simplest way of reducing power consumption is to reduce the voltage level. Today, CPU core voltage has been reduced to about 1.8 V and is still decreasing. Embedded processors are also starting to incorporate dynamic power management into their designs. One example is a pipeline that can shut off the clock to the various logic blocks that are not needed when executing the current instruction [98]; this type of pipeline is usually referred to as a power-aware pipeline. Also, it is no longer sufficient to measure only the power consumption of the CPU, as it gets integrated with its peripherals in a SoC. Instead, the power consumption of the entire system has to be measured.
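The mW/MHz figure of merit converts directly into an absolute power estimate, and voltage reduction pays off quadratically (dynamic power scales roughly with C * V^2 * f). A small worked sketch, using the ARM7TDMI figures quoted in section 2.6.1 (the voltage-scaling step is a simplifying assumption, not a measured result):

    #include <stdio.h>

    int main(void) {
        double mw_per_mhz = 0.6;   /* ARM7TDMI figure of merit */
        double freq_mhz   = 66.0;
        double p = mw_per_mhz * freq_mhz;
        printf("core power: %.1f mW\n", p);              /* ~39.6 mW */

        /* Scaling the supply from 3.3 V to 1.8 V at the same frequency: */
        double scale = (1.8 * 1.8) / (3.3 * 3.3);
        printf("after voltage scaling: %.1f mW\n", p * scale); /* ~11.8 mW */
        return 0;
    }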

    2.4.3 Performance

Unlike in the desktop market, performance isn't everything in the embedded processor market; factors like price and power consumption are equally important. A typical embedded processor executes about one instruction per cycle. Today, performance is still often measured in Million Instructions Per Second (MIPS), which basically reveals only the number of instructions executed per second, not whether any useful instructions were executed. MIPS is not a good way of measuring performance, and section 2.8.1 looks at other alternatives. Sometimes the usual performance of one executed instruction per cycle is not enough for an embedded processor, and alternative architectures must be considered in order to increase performance. Section 4.3.7 discusses possible alternative architectures.
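A quick numerical sketch of why raw MIPS misleads (the instruction counts below are hypothetical): two runs at the same MIPS rating can take different times if the ISA needs different numbers of instructions for the same task, so task time, not MIPS, is what ultimately matters.

    #include <stdio.h>

    int main(void) {
        double freq_mhz = 66.0, ipc = 0.9;
        double mips = ipc * freq_mhz;            /* 59.4 MIPS rating       */

        long insns_native = 1000000;             /* task on a 32-bit ISA   */
        long insns_dense  = 1300000;             /* same task, compressed
                                                    ISA executes more (but
                                                    shorter) instructions  */
        printf("rating: %.1f MIPS\n", mips);
        printf("task time, native: %.2f ms\n", insns_native / (mips * 1e3));
        printf("task time, dense:  %.2f ms\n", insns_dense  / (mips * 1e3));
        return 0;
    }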

    2.4.4 Predictability

Architectures that support real-time systems must be able to achieve predictability [84]. Predictability depends on the Worst Case Execution Time (WCET), which is in turn dictated by the underlying hardware. Much focus is placed on improving an architecture's performance, and little thought goes into making it predictable. This has led to architectures that include caches, pipelines, virtual storage management, etc., all of which have improved the average-case execution time but have worsened the prospects for predictable real-time performance.

Caches have not been popular in the real-time computing community, due to their unpredictable behavior. This is true for multi-tasking, interrupt-driven environments, which are common in real-time applications [87]. Here, the execution time of an individual task can differ from run to run due to the interactions of real-time tasks and the external environment via the operating system. Preemptions may modify the cache contents and thereby cause a nondeterministic cache hit ratio, resulting in unpredictable task execution times.

Pipelines introduce problems similar to those of caches concerning worst-case execution time. There are efforts to achieve predictable pipeline performance without using a cache and without the hazards associated with pipelines [88]. This approach, called the Multiple Active Context System (MACS), uses multiple processor contexts to achieve both increased performance and predictability. Here, a single pipeline is shared among a number of threads, and the context of every thread is stored within the processor. On each cycle, a single context is selected to issue a single instruction to the pipeline. While this instruction proceeds through the pipeline, other contexts issue instructions to fill the consecutive pipeline stages. Contexts are selected in a round-robin fashion. A key feature of the MACS architecture is that its memory model allows the programmer to derive theoretical upper bounds on memory access times. The maximum number of cycles a context will wait for a shared memory request is dictated by the number of contexts, the memory issue latency, the number of threads competing for shared memory, and the number of contexts scheduled between consecutive threads.
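A minimal sketch of the round-robin issue discipline (an illustration assuming four contexts; not the MACS implementation itself): because selection is strictly cyclic, a context issues again after exactly NCTX cycles regardless of what the other contexts do, which is what makes the worst case derivable.

    #include <stdio.h>

    #define NCTX 4

    int main(void) {
        int pc[NCTX] = { 0 };           /* one program counter per context */
        for (int cycle = 0; cycle < 12; cycle++) {
            int ctx = cycle % NCTX;     /* strict round-robin selection    */
            printf("cycle %2d: context %d issues instruction %d\n",
                   cycle, ctx, pc[ctx]++);
        }
        /* Worst case between two issues of the same context: NCTX cycles,
         * independent of the behavior of the other contexts.              */
        return 0;
    }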

    2.5 Implementation Aspects

There are several options available to the designer who wants to integrate an embedded processor into a SoC. Besides building a processor from scratch, there are other options available. The first option is to acquire the processor core as a hard IP-component (those who are not familiar with the different layers of IP-components can read the section SoC Design), which is tied to a specific semiconductor fabrication process and delivered as mask data. Several hard IP-cores will be examined in section 2.6.

The second option is to acquire the CPU as a firm IP-component, which is usually delivered in the form of a netlist. The third and last option is to acquire a soft IP-component, either in the form of VHDL or Verilog code or by producing a synthesizable core with a parameterizable core generator. There have been several research efforts to develop generators of parameterizable RISC cores [73, 76]. One, conducted at the University of Hanover, has developed a parameterizable core generator that outputs fully synthesizable VHDL code. The generated core is based on a standard 5-stage pipeline (figure 4), and the designer has many choices when using the generator (e.g. pipeline length, ALU and data width, size of register file, etc.).

The generated cores are simple RISC processors with parameterizable word and instruction widths. Instruction and data memories are provided as a VHDL template file for simulation, but they are not suitable for synthesis; instead, they should be taken from a technology-specific library. Since the cores are based on RISC principles, the instruction set consists of only a few instructions and addressing modes. A typical 32-bit RISC core with a 32-bit data path and eight 32-bit registers can deliver an achievable clock frequency of about 100 MHz with a 3LM 0.5-micron standard-cell library.

    Commercial core generators are also available from Tensilica, ARC, and Triscend[100, 101, 99].

    2.6 State of Practice

The 4-, 8- and 16-bit microprocessors were, and still are, dominating the embedded control market. In fact, it was forecast that eight times more 8-bit than 32-bit CPUs would be shipped during 1999 [89]. The 32-bit embedded processor market differs from the desktop market in that there are about 100 vendors and a dozen instruction set architectures to choose from. What makes 32-bit embedded CPUs attractive is their ability to handle emerging consumer demands in the form of filtering, artificial intelligence, and multimedia, while still maintaining low power consumption, price, etc. Next follows a brief presentation of embedded processors commonly used today.

    2.6.1 ARM

The Advanced RISC Machines (ARM) company is a leading IP provider that licenses RISC processors, peripherals, and system-on-chip designs to international electronics companies. The ARM7 family of processors consists of the ARM7TDMI and ARM7TDMI-S processor cores, and the ARM710T, ARM720T and ARM740T cached processor macrocells.

An ARM7 processor consists of an ARM7TDMI or ARM7TDMI-S core (the S stands for Synthesizable, meaning that it can be acquired as VHDL or Verilog code) that can be augmented with one of the available macrocells. The macrocells provide the core with an 8 KB cache, a write buffer, and memory functions. The ARM710T also provides virtual memory support for operating systems such as Linux and Symbian's EPOC32. The ARM720T is a superset of the ARM710T and supports WindowsCE.

When writing a 32-bit program for an embedded system, there may be a problem fitting the entire program in the on-chip memory. This kind of problem is usually referred to as a code density problem. In order to address it, ARM has developed Thumb, a new instruction set. Thumb is an extension to the ARM architecture, containing 36 instruction formats drawn from the standard 32-bit ARM instruction set that have been re-coded into 16-bit-wide opcodes. Upon execution, the Thumb codes are decompressed by the processor to their real ARM instruction set equivalents, which are then run on the ARM as usual. This gives the designer the benefit of running ARM's 32-bit instruction set while reducing code size by using Thumb.



The ARM9 family is a newer and more powerful version of the ARM7, designed for system-on-chip solutions thanks to its built-in DSP capabilities. The ARM9E-S solutions are macrocells intended for integration into Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs) and System-on-chip (SoC) products.

CPU core  | Die Area               | Power             | Frequency | Performance
ARM7TDMI  | 1.0 mm^2 on 0.25 micron | 0.6 mW/MHz @ 3.3V | 66 MHz    | 0.9 MIPS/MHz
ARM9E-S   | 2.7 mm^2 on 0.25 micron | 1.6 mW/MHz @ 2.5V | 160 MHz   | 1.1 MIPS/MHz

    2.6.2 Motorola

The Motorola M-CORE microprocessor, introduced in 1997, targeted the market of analog cellular phones, digital phones, PDAs, portable GPS systems, automobile braking systems, automobile engine control, and automotive body electronics. The M-CORE architecture was designed from the ground up to achieve the lowest milliwatts per MHz. It is a 32-bit RISC processor with a 16-bit fixed-length instruction format, and it minimizes power usage by utilizing dynamic power management.

Motorola has also developed a modern version of the 68K architecture, the ColdFire, which is positioned between the 68K (low end) and the PowerPC (high end). This architecture is also known as VL-RISC because, although the core is RISC-like, the instructions are of variable length (VL). VL instructions help to attain higher code density. The ColdFire has a four-stage pipeline consisting of two subpipelines: a two-stage instruction prefetch pipeline and a two-stage operand execution pipeline.

    2.6.3 MIPS

MIPS Technologies designs and licenses embedded 32- and 64-bit intellectual property (IP) and core technology for the digital consumer and embedded systems markets. The MIPS32 architecture is a superset of the previous MIPS I and MIPS II instruction set architectures.

2.6.4 Patriot Scientific

Patriot Scientific Corporation was one of the first to develop a Java microprocessor, the PSC1000. The PSC1000 targets high-performance, low-system-cost applications like network computers, set-top boxes, cellular phones, Personal Digital Assistants (PDAs) and more. The PSC1000 microprocessor is a 32-bit RISC processor that offers the ability to execute Java(tm) programs as well as C and FORTH applications. It offers a unique architecture that is a blend of stack- and register-based designs, which enables features like 8-bit instructions for reduced code size. The idea behind the PSC1000 is to enable Internet connectivity for low-cost devices such as PDAs, set-top cable boxes and "smart" cell phones.

    2.6.5 AMD

Advanced Micro Devices' (AMD) 29000 was an early leader, frequently used in laser printers and network buses. The 29K family comprises three product lines, including three-bus Harvard-architecture processors, two-bus processors, and a microprocessor with on-chip peripheral support. The core is built around a simple four-stage pipeline: fetch, decode, execute, and write-back. The 29K has a triple-ported register file of 192 32-bit registers. In 1995, AMD cancelled all further development of the 29K to concentrate its efforts on x86 chips.

    2.6.6 Hitachi

The Hitachi SuperH (SH) became popular when Sega chose the SH7032 for its Genesis and Saturn video game consoles, and it then expanded to cover consumer-electronics markets. Its short, 16-bit instruction word gives the SuperH one of the best code densities of almost any 32-bit processor. The SH family uses a five-stage pipeline: fetch, decode, execute, memory access, and write-back to register. The CPU is built around 25 32-bit registers.

    2.6.7 Intel

The Intel i960 emerged early in the embedded market, which made it successful in printer and networking equipment. The i960 is well supported with development tools. It combines a Von Neumann architecture with a load/store architecture that centers on a core of 32 32-bit general-purpose registers. All i960s have multistage pipelines and use resource scoreboarding to track resource usage.

    2.6.8 PowerPC

The PowerPC is one of the best-known microprocessor names next to the Pentium and is steadily gaining ground in the embedded space. IBM and Motorola are pursuing different strategies with their embedded PowerPC chips, with the former inviting customer designs and the latter leveraging its massive library of peripheral I/O logic.

    2.6.9 Sparc

Sun's SPARC was the first workstation processor to be openly licensed and is still popular with some embedded users. The microSPARC is built around a large multiported register file that breaks down into a small set of global registers, for holding global variables, and sets of overlapping register windows. The microSPARC's pipeline consists of an instruction-fetch unit, two integer ALUs, a load/store unit, and an FPU.

    2.7 Improving Performance

Pipelining is a way of achieving a level of parallelism, resulting in a low CPI count. To be even more effective, linear pipelining will not suffice, and other techniques have to be considered. These techniques have the ability to execute several instructions at once, resulting in a CPI count below 1.0. The most popular techniques include Multiple-issue Processors (such as Very Long Instruction Word (VLIW) and Superscalar Processors), Multithreading, Simultaneous Multithreading (SMT) and Chip Multiprocessor (CMP). Also, another technique will be discussed that tries to come to terms with the ever-growing memory-CPU speed gap: prefetching, or preloading, hides memory latency by fetching and storing required data or instructions in a buffer before they are actually needed.

    2.7.1 Multiple-issue Processors

Although there are techniques that can remedy most of the stalls in an ordinary pipeline, the ideal result is still only a CPI count of 1.0, i.e. executing exactly one instruction per machine cycle. This performance is not always enough, and other ways of achieving a higher level of parallelism need to be considered. Multiple-issue processors try to execute several instructions per machine cycle, thus achieving a higher degree of Instruction-Level Parallelism (ILP). There are mainly two types of processors using these techniques, namely Very Long Instruction Word (VLIW) and superscalar processors. In addition to these two architectures, a third alternative, called the Multiple Instruction Stream Computer (MISC), will be discussed.

As the name implies, a VLIW processor issues a very long instruction packet that consists of several instructions. An example of an instruction packet can be seen in figure 6, where there is room for two integer/branch operations, one floating-point operation, and two memory references. In VLIW processors, the task of finding independent instructions in the code is done by the compiler instead of by dynamic hardware as in superscalar processors. Additional hardware is saved because the compiler always


    I