Evolution of Computing Microprocessors and SoCs


Evolution of Personal Computing by Microprocessors and SoCs

For Credit Seminar: EEC7203 (Internal Assessment)

Submitted to
Dr. T. Shanmuganantham
Associate Professor, Department of Electronics Engineering

Azmath Moosa
Reg No: 13304006, M. Tech 1st Yr
Department of Electronics Engineering, School of Engg & Tech, Pondicherry University


Abstract

Throughout history, new and improved technologies have transformed the human experience. In the 20th century, the pace of change sped up radically as we entered the computing age. For nearly 40 years, the microprocessor, driven by the innovations of companies like Intel, has continuously created new possibilities in the lives of people around the world. In this paper, I hope to capture the evolution of this amazing device that has raised computing to a whole new level and made it relevant in all fields: engineering, research, medicine, academia, business, manufacturing, commuting and more. I will highlight the significant strides made in each generation of processors and the remarkable ways in which engineers overcame seemingly insurmountable challenges and continued to push the evolution to where it is today.


Table of Contents

1. Abstract
2. Table of Contents
3. List of Figures
4. Introduction
5. x86 and Birth of the PC
6. The Pentium
7. Pipelined Design
8. The Pentium 4
9. The Core Microarchitecture
10. Tick-Tock Cadence
11. The Nehalem Microarchitecture
12. The Sandy Bridge Microarchitecture
13. The Haswell Microarchitecture
14. Performance Comparisons
15. Shift in Computing Trends
16. Advanced RISC Machines
17. System on Chip (SoC)
18. Conclusion
19. References


List of Figures

Figure 1: 4004 layout
Figure 2: Pentium chip
Figure 3: Pentium CPU based PC architecture
Figure 4: Pentium II logo
Figure 5: Pentium III logo
Figure 6: Pentium 4 HT Technology illustration
Figure 7: NetBurst architecture feature presentation at Intel Developer Forum
Figure 8: The NetBurst pipeline
Figure 9: The Core architecture feature presentation at Intel Developer Forum
Figure 10: The Core architecture pipeline
Figure 11: Macro fusion explained at IDF
Figure 12: Power management capabilities of the Core architecture
Figure 13: Intel's new tick-tock strategy revealed at IDF
Figure 14: Nehalem pipeline backend
Figure 15: Nehalem pipeline frontend
Figure 16: Improved Loop Stream Detector
Figure 17: Nehalem CPU based PC architecture
Figure 18: Sandy Bridge architecture overview at IDF
Figure 19: Sandy Bridge pipeline frontend
Figure 20: Sandy Bridge pipeline backend
Figure 21: Video transcoding capabilities of Sandy Bridge
Figure 22: Typical planar transistor
Figure 23: FinFET Tri-Gate transistor
Figure 24: FinFET delay vs power
Figure 25: SEM photograph of fabricated FinFET tri-gate transistors
Figure 26: Haswell pipeline frontend
Figure 27: Haswell pipeline backend
Figure 28: Performance comparisons of five generations of Intel processors
Figure 29: Market share of personal computing devices
Figure 30: A smartphone SoC (Texas Instruments' OMAP)
Figure 31: A tablet SoC (NVIDIA Tegra)


Introduction

Intel was founded in 1968 with the aim of manufacturing memory devices; its first product was a Schottky TTL bipolar SRAM chip. A Japanese company, Nippon Calculating Machine Corporation, approached Intel to design 12 custom chips for its new calculator. Intel's engineers instead suggested a family of just four chips, including one that could be programmed for use in a variety of products. The resulting set of four chips was known as the MCS-4. It included a central processing unit (CPU) chip, the 4004, as well as a supporting read-only memory (ROM) chip for the custom application programs, a random-access memory (RAM) chip for processing data, and a shift-register chip for the input/output (I/O) port. The MCS-4 was a "building block" that engineers could purchase and then customize with software to perform different functions in a wide variety of electronic devices.

And thus the microprocessor industry was born. The 4004, released in 1971, had 2,300 pMOS transistors on a 10 µm process and was clocked at 740 kHz. Its 4-bit bus was multiplexed for both address and data, allowing the chip to fit in a 16-pin package. The very next year, the 8008 was introduced: an 8-bit processor clocked at 500 kHz with 3,500 pMOS transistors on the same 10 µm process. At 0.05 MIPS (millions of instructions per second), it was actually slower than the 4004's 0.07 MIPS. In 1974 the 8080 was launched with ten times the performance of the 8008, built on a different transistor technology: 4,500 NMOS transistors at 6 µm, clocked at 2 MHz with a whopping 0.29 MIPS. Finally, in March 1976, the 8085 arrived, clocked at 3 MHz on yet another new transistor technology, depletion-load NMOS at 3 µm, and capable of 0.37 MIPS. The 8085 was a popular device in its time and is still used in universities across the globe to introduce students to microprocessors.

Figure 1: 4004 Layout


x86 and Birth of the PC

The 8086, a 16-bit processor, made its debut in 1978. It introduced new techniques such as memory segmentation to extend addressing capacity (sketched below) and pipelining, in the form of an instruction prefetch queue, to speed up execution. It was designed to be source-compatible with 8085 assembly mnemonics. It had 29,000 transistors at a 3 µm channel length and was clocked at 5, 8 and 10 MHz, delivering a full 0.75 MIPS at the maximum clock. It was the father of what is now known as the x86 architecture, which eventually turned out to be Intel's most successful line of processors and still powers many computing devices today. Introduced soon after was the processor that powered the first PC, the 8088: an 8086 with an external 8-bit data bus, clocked at 5 to 8 MHz and delivering 0.33 to 0.66 MIPS.
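As a concrete illustration of segmented addressing, the sketch below (in C, with illustrative values) shows how the 8086 forms a 20-bit physical address from a 16-bit segment register and a 16-bit offset: the segment is shifted left by four bits and added to the offset, giving access to 1 MB of memory through 16-bit registers.

```c
#include <stdio.h>
#include <stdint.h>

/* 8086 real-mode address formation: physical = (segment << 4) + offset,
   wrapped to 20 bits (a 1 MB address space). */
static uint32_t physical_address(uint16_t segment, uint16_t offset)
{
    return (((uint32_t)segment << 4) + offset) & 0xFFFFF;
}

int main(void)
{
    /* illustrative values: segment 0x1234, offset 0x0010 -> 0x12350 */
    printf("0x%05X\n", physical_address(0x1234, 0x0010));
    return 0;
}
```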

In 1981, a revolution stirred by the IBM PC seized the computer industry. By the late '70s, personal computers were available from many vendors, such as Tandy, Commodore, TI and Apple, but computers from different vendors were not compatible: each vendor had its own architecture, its own operating system, its own bus interface, and its own software. Backed by IBM's marketing might and name recognition, the IBM PC quickly captured the bulk of the market. Other vendors either left the PC market (TI), pursued niche markets (Commodore, Apple) or abandoned their own architecture in favor of IBM's (Tandy). With a market share approaching 90%, the PC became a de facto standard. Software houses wrote operating systems (Microsoft DOS, Digital Research DOS), spreadsheets (Lotus 1-2-3), word processors (WordPerfect, WordStar) and compilers (Microsoft C, Borland C) that ran on the PC. Hardware vendors built disk drives, printers and data acquisition systems that connected to the PC's external bus.

Although IBM initially captured the PC market, it subsequently lost it to clone vendors. Accustomed to being a monopoly supplier of mainframe computers, IBM was unprepared for the fierce competition that arose as Compaq, Leading Edge, AT&T, Dell, ALR, AST, Ampro, Diversified Technologies and others all vied for a share of the PC market. Besides low prices and high performance, the clone vendors provided one other very important thing to the PC market: an absolute hardware standard. In order to sell a PC clone, a manufacturer had to be able to guarantee that it would run all of the customer's existing PC software and work with all of the customer's existing peripheral hardware. The only way to do this was to design the clone to be identical to the original IBM PC at the register level. Thus, the standard that the IBM PC defined became graven in stone as dozens of clone vendors shipped millions of machines that conformed to it in every detail. This standardization has been an important factor in the low cost and wide availability of PC systems.


The 8086 and 80186/88 were limited to addressing 1 MB of memory, so the PC was also limited to this range. The limit was raised to 16 MB by the 80286, released in 1982. It had a maximum clock of 16 MHz, delivering more than 2 MIPS, and 134,000 transistors at 1.5 µm. The processors and the PC up to this point were all 16-bit. The 80386 range of processors, released in 1985, were the first 32-bit processors used in the PC. The first of these had 275,000 transistors at 1 µm and eventually reached 33 MHz with 5.1 MIPS. It could address 4 GB of physical memory and a far larger virtual address space. Over the next few years, Intel modified the architecture and provided improvements in memory addressing range and clock speed. The 80486 range of processors, released in 1989, brought significant advancements in computing capability, with a whopping 41 MIPS for a processor clocked at 50 MHz and 1.2 million transistors at 0.8 µm, or 800 nm. It introduced a new technique to speed up RAM reads and writes: cache memory integrated onto the CPU die, referred to as level 1 or L1 cache (as opposed to the L2 cache on the motherboard). As with the previous series, Intel slightly modified the architecture and released higher-clocked versions over the following years.

The Pentium

The Intel Pentium microprocessor was introduced in 1993. Its microarchitecture, dubbed P5, was Intel's fifth-generation design and its first superscalar microarchitecture. A superscalar architecture is one in which multiple execution or functional units (such as adders, shifters and multipliers) are provided and operate in parallel. As a direct extension of the 80486 architecture, it included dual integer pipelines, a faster floating-point unit, a wider data bus, separate code and data caches, and features for further reducing address calculation latency. In 1996, the Pentium with MMX Technology (often simply referred to as the Pentium MMX) was introduced with the same basic microarchitecture complemented by the MMX instruction set, larger caches, and some other enhancements. The Pentium was built on a 0.8 µm process, comprised 3.1 million transistors and was clocked at 60 MHz with 100 MIPS. The Pentium was truly capable of addressing 4 GB of RAM without any operating-system-based virtualization.

Figure 2: Pentium Chip


The next microarchitecture was the P6, introduced with the Pentium Pro in 1995. It had an integrated L2 cache. One major change Intel brought to the PC architecture was the front-side bus (FSB), which managed the CPU's communications with the RAM and other I/O. The RAM and graphics card were high-speed peripherals interfaced through the northbridge; other I/O devices, such as the keyboard and speakers, were interfaced through the southbridge.

The Pentium II followed soon after in 1997. It had MMX, improved 16-bit performance and double the L2 cache. The Pentium II had 7.5 million transistors, starting on a 0.35 µm process, with later revisions using 0.25 µm transistors.

The Pentium III followed in 1999 with 9.5 million transistors at 0.25 µm and a new instruction set, SSE (Streaming SIMD Extensions), which assisted DSP and graphics processing. Intel pushed the clock speed higher and higher with the Pentium III, with some variants clocked as high as 1 GHz.

Figure 3: Pentium CPU based PC architecture
Figure 4: Pentium II logo
Figure 5: Pentium III logo

Pipelined Design

At a high level, the goal of a CPU is to grab instructions from memory and execute them. All of the tricks and improvements we see from one generation to the next simply help accomplish that goal faster.

The assembly-line analogy for a pipelined microprocessor is overused, but that's because it is quite accurate. Rather than working on one instruction at a time, modern processors


feature an assembly line of steps that breaks up the grab/execute process to allow for higher

throughput.

The basic pipeline is as follows: fetch, decode, execute, and commit to memory. One would first fetch the next instruction from memory (a program counter tells the CPU where to find it). One would then decode that instruction into an internally understood format (this is key to enabling backwards compatibility). Next, one would execute the instruction (this stage, like most here, is itself split up, fetching data needed by the instruction among other things). Finally, one would commit the results of that instruction to memory and start the process over again. Modern CPU pipelines feature many more stages than the four outlined above.
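To make the overlap concrete, the following toy simulation (a sketch in C; the stage names and instruction count are illustrative, not any real CPU's) pushes instructions through the four stages one cycle at a time. In steady state, a new instruction completes every cycle even though each instruction takes four cycles end to end.

```c
#include <stdio.h>

/* Toy 4-stage pipeline: each slot holds the index of the instruction
   currently in that stage, or -1 for a bubble. */
enum { FETCH, DECODE, EXECUTE, COMMIT, STAGES };

int main(void)
{
    const char *names[STAGES] = { "F", "D", "E", "C" };
    int stage[STAGES] = { -1, -1, -1, -1 };
    int next = 0, total = 5;

    for (int cycle = 1; cycle <= total + STAGES - 1; ++cycle) {
        /* shift every in-flight instruction one stage forward */
        for (int s = STAGES - 1; s > 0; --s)
            stage[s] = stage[s - 1];
        /* fetch the next instruction, if any remain */
        stage[FETCH] = (next < total) ? next++ : -1;

        printf("cycle %d:", cycle);
        for (int s = 0; s < STAGES; ++s)
            printf(" %s=%2d", names[s], stage[s]);
        printf("\n");
    }
    return 0;
}
```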

Pipelines are divided into two halves: the frontend and the backend. The frontend is responsible for fetching and decoding instructions, while the backend deals with executing them. The division between the two halves of the CPU pipeline also separates the part of the pipeline that must operate in order from the part that can operate out of order. Instructions have to be fetched and completed in program order (you can't click Print until you click File first), but they can be executed in any order, so long as the result is correct.

Many instructions are either dependent on one another (e.g. C=A+B followed by E=C+D) or

they need data that's not immediately available and has to be fetched from main memory (a

process that can take hundreds of cycles, or an eternity in the eyes of the processor). Being able

to reorder instructions before they're executed allows the processor to keep doing work rather

than just sitting around waiting.

This document aims to highlight changes to the x86 pipeline with each generation of

processors.

The Pentium 4

The NetBurst microarchitecture started with the Pentium 4. This line of processors launched in 2000, clocked at 1.4 GHz, with 42 million transistors on a 0.18 µm process and the SSE2 instruction set. The early variants were codenamed Willamette (up to 2.0 GHz); later ones were codenamed Northwood (up to 3.0 GHz) and Prescott.


The diagram (Figure 7) is from Intel's feature presentation of the NetBurst architecture. Willamette was an early variant with SSE2, the Rapid Execution Engine (in which the ALUs operate at twice the core clock frequency) and the Instruction Trace Cache (which cached decoded instructions for faster loop execution). HT (Hyper-Threading) Technology prevents CPU resources from being wasted by assigning the processor to execute one thread or application while another waits for data to arrive from RAM. This essentially behaves like a dual-processor system.

The NetBurst pipeline was 20 stages long. As illustrated in Figure 8, the BTB (Branch Target Buffer) helps determine the address of the next micro-op in the trace cache (TC Nxt IP). Micro-ops are then fetched out of the trace cache (TC Fetch) and transferred (Drive) to the RAT (register alias table). After that, the necessary resources, such as load queues and store buffers, are allocated (Alloc), and logical registers are renamed (Rename). Micro-ops wait in the Queue until space frees up in the Schedulers, where their dependencies are resolved; they are then transferred to the register files of the corresponding Dispatch units. There the micro-op is executed and the Flags are calculated. When a jump instruction executes, the real branch address and the predicted one are compared (Branch Check), after which the new address is recorded in the BTB (Drive).

Figure 6: Pentium 4 HT Technology illustration
Figure 7: NetBurst architecture feature presentation at Intel Developer Forum
Figure 8: The NetBurst pipeline

Northwood and Prescott were later variants with further enhancements; processor-specific details are beyond the scope of this paper.

The next major advancement was the 64-bit NetBurst, released in 2005. The Prescott line-up continued, with maximum clock speeds of 3.8 GHz and transistor sizes of 0.09 µm. It had 2 MB of cache and EIST (Enhanced Intel SpeedStep Technology), which allows dynamic processor clock scaling through software. EIST was particularly useful for mobile processors, as a lot of power is conserved when running at low clock speeds. The NetBurst family continued to grow with the Pentium D (dual-core, HT-disabled) and Pentium Extreme Edition (dual-core, HT-enabled) processors.

The Core Microarchitecture

The high power consumption and heat density, the resulting inability to effectively increase clock speed, and other shortcomings such as the inefficient pipeline were the primary reasons Intel abandoned the NetBurst microarchitecture and switched to a completely different architectural design, one delivering high efficiency through a shorter pipeline rather than high clock speeds.

Intel's solution was the Core microarchitecture, released in 2006. The first of these processors were sold under the brand name "Core 2", with Duo and Quad variants (dual- and quad-core CPUs).


Merom was for mobile computing, Conroe was for desktop systems, and Woodcrest was for servers and workstations. While architecturally identical, the three processor lines differed in the socket used, bus speed, and power consumption. The diagram below (Figure 10) illustrates the Conroe architecture.

The 14-stage pipeline of the Core architecture was a trade-off between long and short pipeline designs. The architectural highlights of this generation are given below.

Wide Dynamic Execution referred to two things: first, the ability of the processor to fetch, dispatch, execute and return four instructions simultaneously; second, a technique called macro-fusion, in which two x86 instructions are combined into a single micro-op to increase performance.
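The canonical macro-fusion candidate is a compare immediately followed by a conditional branch. The C fragment below is a sketch of code whose loop test a compiler will typically lower to exactly such a CMP/Jcc pair; the instruction pattern in the comments is an assumption about typical code generation, not a guaranteed output.

```c
/* The loop condition below usually compiles to a compare followed
   directly by a conditional jump, e.g.
       cmp eax, edx     ; i < n ?
       jl  .loop_top    ; taken while the loop continues
   On the Core microarchitecture, such an adjacent CMP+Jcc pair can be
   macro-fused into a single micro-op. */
int sum_first_n(const int *a, int n)
{
    int sum = 0;
    for (int i = 0; i < n; ++i)
        sum += a[i];
    return sum;
}
```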

Figure 9: The Core architecture feature presentation at Intel Developer Forum

Figure 10: The Core architecture pipeline


In previous generations, the ALU typically broke a 128-bit SSE instruction into two 64-bit halves, which resulted in two micro-ops and thus two execution clock cycles. In this generation, Intel extended the execution width of the ALU and the load/store units to the full 128 bits, allowing four single-precision or two double-precision elements to be processed per cycle. The feature was called Advanced Digital Media Boost, because it applied to the SSE instructions utilised by multimedia transcoding applications.
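For illustration, the sketch below uses the standard SSE intrinsics to add four single-precision floats in one 128-bit operation, the width this generation executes in a single pass; the values are arbitrary.

```c
#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics */

int main(void)
{
    /* one 128-bit register holds four single-precision floats */
    __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
    __m128 b = _mm_set_ps(40.0f, 30.0f, 20.0f, 10.0f);
    __m128 c = _mm_add_ps(a, b);   /* four additions, one instruction */

    float out[4];
    _mm_storeu_ps(out, c);
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
    return 0;
}
```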

Intel Advanced Smart Cache referred to the unified L2 cache (2 MB or 4 MB) shared by the two processing cores. Caching became more effective because data was no longer stored twice in separate L2 caches (no replication). This also freed the system bus from being overloaded with RAM read/write activity, as the cores could share data directly through the cache. The Smart Memory Access feature referred to the inclusion of prefetchers. A prefetcher speculatively moves data into a higher-level cache, aiming to provide data that is likely to be requested soon, which reduces memory access latency and increases efficiency. The memory prefetchers constantly watch memory access patterns, trying to predict whether there is something worth moving from RAM into the L2 cache, just in case that data is requested next.
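The hardware prefetchers work without any help from software, but the same idea can be expressed explicitly with the SSE prefetch intrinsic, as in the sketch below; the prefetch distance of 16 elements is a made-up tuning value.

```c
#include <xmmintrin.h>   /* _mm_prefetch */

#define PREFETCH_DISTANCE 16   /* illustrative tuning value */

/* Sum an array while hinting the CPU to pull upcoming elements into
   cache ahead of use, mimicking what the hardware prefetchers do
   automatically for regular access patterns. */
long sum_array(const long *data, int n)
{
    long sum = 0;
    for (int i = 0; i < n; ++i) {
        if (i + PREFETCH_DISTANCE < n)
            _mm_prefetch((const char *)&data[i + PREFETCH_DISTANCE],
                         _MM_HINT_T0);
        sum += data[i];
    }
    return sum;
}
```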

Intelligent Power Capability was a culmination of many techniques. The 65 nm process provided a good basis for efficient ICs. Clock gating and sleep transistors made sure that any units, down to single transistors, that were not needed remained shut down. Enhanced SpeedStep still reduced the clock speed when the system was idle or under low load, and could control each core separately. Features such as the Execute Disable bit were also available, by which an operating system with support for the bit can mark certain areas of memory as non-executable; the processor will then refuse to execute any code residing in those areas. The general technique, known as executable space protection, is used to prevent certain types of malicious software from taking over a computer by inserting their code into another program's data storage area and running it from within that section: a buffer overflow attack. It is also to be noted that Hyper-Threading was removed in this generation.

Figure 11: Macro fusion explained at IDF
Figure 12: Power management capabilities of the Core architecture
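To illustrate the class of bug that executable space protection guards against, the fragment below is a minimal sketch of a stack buffer overflow: input longer than the buffer overruns it and can overwrite the saved return address with a pointer into attacker-supplied bytes. With the Execute Disable bit set on stack pages, such an injected payload faults instead of running.

```c
#include <string.h>

/* Deliberately unsafe sketch: strcpy performs no bounds check, so any
   input longer than 15 characters overruns buf and clobbers adjacent
   stack data, including the saved return address. */
void vulnerable(const char *input)
{
    char buf[16];
    strcpy(buf, input);   /* classic overflow if input is too long */
}
```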

Tick-Tock Cadence

Since 2007, Intel has followed a "Tick-Tock" model in which every microarchitectural change is followed by a die shrink of the process technology. Every "tick" is a shrink of the previous microarchitecture's process technology, and every "tock" is a new microarchitecture. Every 12 to 18 months, one tick or tock is expected.

In 2007, the Core microarchitecture underwent a tick to the 45 nm process, with processors codenamed Penryn. Process shrinks bring down energy consumption and improve power savings.

The Nehalem Microarchitecture

The next tock was introduced in 2008 with the Nehalem microarchitecture. The transistor count in this generation neared the billion mark, with around 700 million transistors in the i7. The pipeline frontend and backend are illustrated below (Figures 15 and 14).

Figure 13: Intel's new tick tock strategy revealed at IDF


The new changes to the pipeline in this generation were as follows:

Loop Stream Detector – detected and cached loops, so their instructions did not have to be fetched from cache and decoded again and again.

Improved Branch Predictor – fetched branch instructions prior to execution based on an improved prediction algorithm.

SSE 4.2 – new instructions helpful for operations on databases and DNA sequencing were introduced.

Other changes to the architecture were:

HyperThreading – HT was reintroduced.

Turbo Boost – the processor could intelligently raise its own clock speed according to application requirements and thus dynamically conserve power; unlike EIST, no OS intervention is required.

Figure 14: Nehalem pipeline backend
Figure 15: Nehalem pipeline frontend
Figure 16: Improved Loop Stream Detector
Figure 17: Nehalem CPU based PC architecture


QPI – the QuickPath Interconnect was the new system bus replacing the FSB; Intel moved the memory controller onto the CPU die.

L3 Cache – shared between all four cores.

The next tick came in 2010, codenamed Westmere, with a process shrink to 32 nm.

The Sandy Bridge Microarchitecture

The next tock came in 2011 with the Sandy Bridge microarchitecture, also marketed as the 2nd generation of i3, i5 and i7 processors. With Sandy Bridge, Intel surpassed the one-billion-transistor mark. The architectural improvements in this generation are summarised in Figure 18.

Changes to the pipeline were as follows:

A Micro-op Cache – when Sandy Bridge's fetch hardware grabs a new instruction, it first checks whether the instruction is in the micro-op cache; if it is, the cache services the rest of the pipeline and the frontend is powered down. The decode hardware is a very complex part of the x86 pipeline, and turning it off saves a significant amount of power.

Figure 18: Sandybridge architecture overview at IDF


Redesigned Branch Prediction Unit – Sandy Bridge caches twice as many branch targets as Nehalem, with more effective and longer storage of branch history.

Physical Register File – a physical register file stores micro-op operands centrally; as a micro-op travels down the out-of-order (OoO) execution engine, it carries only pointers to its operands, not the data itself. This significantly reduces the power of the OoO hardware (moving large amounts of data around a chip eats power) and also reduces die area further down the pipe. The die savings are translated into a larger out-of-order window.
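The data structure below is a rough sketch of this idea (field names and the register-file size are illustrative, not Intel's): in-flight micro-ops carry only small indices into one central array of values, so the wide operand data itself never travels with the micro-op.

```c
#include <stdint.h>

/* Illustrative physical-register-file sketch: micro-ops hold indices,
   not operand values, so only a few bytes move through the OoO engine. */
typedef struct {
    uint8_t opcode;
    uint8_t src1, src2;   /* indices into prf[], not data */
    uint8_t dest;         /* index of the allocated result register */
} micro_op;

static uint64_t prf[160]; /* central register file; size illustrative */

/* operand values are read by index only when the op finally executes */
uint64_t execute_add(micro_op u)
{
    return prf[u.dest] = prf[u.src1] + prf[u.src2];
}
```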

AVX Instruction Set – Advanced Vector Extensions are a group of instructions suited to floating-point-intensive calculations in multimedia, scientific and financial applications. Sandy Bridge supports 256-bit operands for this instruction set.
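As a sketch of what the wider operands buy, the fragment below uses the standard AVX intrinsics to add eight single-precision floats in one 256-bit operation, twice the width of the SSE example earlier; the values are arbitrary.

```c
#include <stdio.h>
#include <immintrin.h>   /* AVX intrinsics; compile with -mavx */

int main(void)
{
    /* one 256-bit register holds eight single-precision floats */
    __m256 a = _mm256_set1_ps(1.5f);
    __m256 b = _mm256_set1_ps(2.5f);
    __m256 c = _mm256_add_ps(a, b);   /* eight additions at once */

    float out[8];
    _mm256_storeu_ps(out, c);
    printf("%g ... %g\n", out[0], out[7]);
    return 0;
}
```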

Other changes to the architecture were:

Ring On-Die Interconnect – with Nehalem/Westmere, all cores, whether two, four or six of them, had their own private path to the last-level (L3) cache: roughly 1,000 wires per core. The problem with this approach is that it doesn't scale well as more agents need access to the L3 cache. Sandy Bridge adds a GPU and a video transcoding engine on-die that share the L3 cache, and rather than laying out another 2,000 wires to the L3 cache, Intel introduced a ring bus.

Figure 19: Sandy Bridge pipeline frontend
Figure 20: Sandy Bridge pipeline backend


On-Die GPU and QuickSync – the Sandy Bridge GPU is on-die, built from the same 32 nm transistors as the CPU cores, and gets equal access to the L3 cache. The GPU sits on its own power island and clock domain, and can be powered down or clocked up independently of the CPU. Graphics turbo is available on both desktop and mobile parts. QuickSync is a hardware-acceleration technology for video transcoding, making video rendering faster and more efficient.

Multimedia Transcoding – media processing in Sandy Bridge is composed of two major components: video decode and video encode. The entire video pipeline is now handled by fixed-function units, in contrast to Intel's previous design, which used the EU array for some video decode stages. Processor power is cut in half for HD video playback.

More Aggressive Turbo Boost

The next tick came in 2012 with Ivy Bridge, marketed as the 3rd generation of i3, i5 and i7 processors. The die was shrunk to a 22 nm process, and Intel used the FinFET tri-gate transistor structure for the first time. Comparisons of the new structure released by Intel are provided below (Figures 22 and 23).

As the diagram shows, a FinFET structure, or 3-D gate (as Intel calls it), allows more control over the channel by maximizing the gate area. This means high on-state current and extremely low leakage current, which translates directly into lower operating voltages, lower TDPs and hence higher clock frequencies. Comparisons of delay and operating voltage between the two structures are shown in Figure 24.

Figure 21: Video transcoding capabilities of Sandy Bridge
Figure 22: Typical planar transistor
Figure 23: FinFET Tri-Gate transistor


A scanning electron microscope image of the fabricated transistors is shown in Figure 25. A single transistor consists of multiple fins, as parallel conduction paths maximize current flow.

The Haswell Microarchitecture

Ivy Bridge was followed by the next tock in 2013, the Haswell microarchitecture. It is currently marketed as the 4th generation of Core i3, i5 and i7 processors.

Changes to the pipeline were as follows:

Wider Execution Unit – Haswell adds two more execution ports: one for integer math and branches (port 6) and one for store-address calculation (port 7). The extra ALU and port do one of two things: either improve performance for integer-heavy code, or allow integer work to continue while FP math occupies ports 0 and 1.

AVX2 and FMA – the other major addition to the execution engine is support for Intel's AVX2 instructions, including FMA (fused multiply-add). Ports 0 and 1 now include newly designed 256-bit FMA units. As each FMA operation is effectively two floating-point operations, these two units double the peak floating-point throughput of Haswell compared to Sandy/Ivy Bridge.
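The sketch below shows the operation at the heart of that claim, a*b + c computed by one fused instruction, using the standard FMA intrinsic; the constants are arbitrary.

```c
#include <stdio.h>
#include <immintrin.h>   /* AVX2/FMA intrinsics; compile with -mfma */

int main(void)
{
    __m256 a = _mm256_set1_ps(2.0f);
    __m256 b = _mm256_set1_ps(3.0f);
    __m256 c = _mm256_set1_ps(1.0f);

    /* a*b + c in one instruction: two FLOPs per element, one rounding */
    __m256 r = _mm256_fmadd_ps(a, b, c);

    float out[8];
    _mm256_storeu_ps(out, r);
    printf("%g\n", out[0]);   /* prints 7 */
    return 0;
}
```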

Figure 24: FinFET delay vs power
Figure 25: SEM photograph of fabricated FinFET tri-gate transistors


The architectural improvements in this generation can be summarised as follows:

Improved L3 Cache – cache bandwidth has been increased, and the L3 is now capable of clocking itself separately from the cores.

GPU and QuickSync – notable performance improvements have been made to the on-die GPU. QuickSync, the hardware-acceleration technology for multimedia transcoding, gains improved image quality and support for additional codecs such as SVC, Motion JPEG and MPEG-2.

Performance Comparisons

Before concluding this document, a comparison of the performance of these processors is in order. The following graphs (Figure 28) showcase the performance improvements of Intel processors over five generations, from Conroe all the way up to Haswell, using the processor naming convention illustrated alongside.

Figure 26: Haswell pipeline frontend
Figure 27: Haswell pipeline backend


Intel is about half a century old. From the 4004 to the current 4th generation of i7, i5 and i3 processors, a lot has changed in the electronics industry. But this is not the end; the evolution will continue. Intel's next tick will be Broadwell, scheduled for this year on 14 nm transistor technology.

Figure 28: Performance comparisons of 5 generations of Intel processors


Shift in Computing Trends

With its powerful x86 architecture and excellent business strategy, Intel has managed to dominate the PC market for almost its entire existence. Now, however, market analysts have noticed a significant shift in computing trends: more and more customers are losing interest in the PC and moving towards more mobile computing platforms. The chart below (courtesy: Gartner) highlights this shift.

Figure 29: Market share of personal computing devices, 2012-2014: PCs (desktop and notebook), ultramobiles, tablets and smartphones (smartphone figures normalised by 4).

PC sales are evidently beginning to drop, while the era of tablets and smartphones is beginning. A common mistake many industry giants make is giving too little importance to such shifts, and they end up losing it all. It happened to IBM (it lost the PC market), and Intel will be no exception unless it is careful.

Advanced RISC Machines

The battle for the mainstream processor market has been fought between two main protagonists, Intel and AMD, while semiconductor manufacturers like Sun and IBM traditionally concentrated on the more specialist Unix server and workstation markets. Unnoticed by many, another company has risen to a point of dominance, with sales of chips based on its technology far surpassing those of Intel and AMD combined. That pioneering company is ARM Holdings, and while it's not a name on everyone's lips in the way the 'big two' are, indications suggest that it will continue to go from strength to strength.


Early 8-bit microprocessors like the Intel 8080 or the Motorola 6800 had only a few simple instructions. They didn't even have an instruction to multiply two integers, for example, so this had to be done in long software routines involving multiple shifts and additions (sketched below). Working on the belief that hardware was fast but software was slow, subsequent microprocessor development provided processors with ever more instructions to carry out ever more complicated functions. Called the CISC (complex instruction set computer) approach, this was the philosophy that Intel adopted and that, more or less, is still followed by today's latest Core i7 processors.
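The routine below is a sketch of such a software multiply: for each set bit of one operand, a correspondingly shifted copy of the other is added to the result, exactly the shift-and-add loop an 8-bit CPU without a multiply instruction had to run.

```c
#include <stdint.h>

/* Shift-and-add multiplication, the software substitute for a missing
   MUL instruction on early 8-bit CPUs. */
uint16_t multiply(uint8_t a, uint8_t b)
{
    uint16_t result = 0;
    uint16_t shifted = a;       /* a << k for the current bit k of b */
    while (b) {
        if (b & 1)              /* bit set: add the shifted copy */
            result += shifted;
        shifted <<= 1;
        b >>= 1;
    }
    return result;
}
```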

In the early 1980s a radically different philosophy, called RISC (reduced instruction set computer), was conceived. According to this model of computing, processors would have only a few simple instructions but, as a result of this simplicity, those instructions would be extremely fast, most of them executing in a single clock cycle. So while much more of the work would have to be done in software, an overall gain in performance would be achievable. ARM was established on this philosophy.

Semiconductor companies usually design their chips and fabricate them in their own facilities (like Intel) or contract fabrication out to a foundry such as TSMC. ARM, however, designs processors but neither manufactures silicon chips nor markets ARM-branded hardware. Instead it sells, or more accurately licenses, intellectual property (IP), which allows other semiconductor companies to manufacture ARM-based hardware. Designs are supplied as a circuit description, from which the manufacturer creates a physical design to meet the needs of its own manufacturing processes. The description is provided in a hardware description language, a textual definition of how the building blocks connect together, at register transfer level (RTL).

System on Chip (SoC)

A processor is the large component that forms the heart of a PC. A core, on the other hand, is the heart of a microprocessor, one that semiconductor manufacturers can build into their own custom chip designs. That customised chip will often be much more than what most people would think of as a processor, and can provide a significant proportion of the functionality required in a particular device. Referred to as a system-on-chip (SoC) design, this type of chip


minimises the number of components, which, in turn, keeps down both the cost and the size of

the circuit board, both of which are essential for high volume portable products such as

smartphones.

ARM-powered SoCs are included in games consoles, personal media players, set-top boxes, internet radios, home automation systems, GPS receivers, ebook readers, TVs, DVD and Blu-ray players, digital cameras and home media servers. Cheaper, less powerful chips are found in home products, including toys, cordless phones and even coffee makers. They're even used in cars, driving dashboard displays, anti-lock braking, airbags and other safety-related systems, as well as engine management. Healthcare has also been a major growth area over the last five years, with products varying from remote patient monitoring systems to medical imaging scanners. ARM devices are used extensively in hard disk and solid state drives. They also crop up in wireless keyboards, and are the driving force behind printers and networking devices like wireless routers and access points.

Modern SoCs also come with advanced (DirectX 9 equivalent) graphics capabilities that can surpass game consoles like the Nintendo Wii. Imagination Technologies, once known in the PC world for its "PowerVR" graphics cards, licenses its graphics processor designs to many SoC makers, including Samsung, Apple and many more. Others, like Qualcomm and NVIDIA, design their own graphics architectures.

Figure 30: A smartphone SoC (Texas Instruments' OMAP)


Qualcomm markets its SoCs under the Snapdragon brand, NVIDIA under the Tegra brand, and other companies such as Apple market theirs as the A series. HTC, LG, Nokia and other smartphone manufacturers do not design their own SoCs but use the ones mentioned above.

Finally, SoCs come with a myriad of smaller co-processors that are critical to overall system performance. The video encoding and decoding hardware powers the video functionality of smartphones, the image processor ensures that photos are processed properly and saved quickly, and the audio processor frees the CPU(s) from having to work on audio signals. Together, all those components, and their associated drivers and software, define the overall performance of a system.

Figure 31: A tablet SoC (NVIDIA Tegra)


Conclusion

Computers have truly revolutionized our world and have changed the way we work, communicate and entertain ourselves. Fuelled by constant innovation in chip design and transistor technology, this evolution shows no sign of stopping. In recent years there have been tremendous shifts in computing trends, with mobile computers such as tablets and smartphones becoming ever more preferred, possibly due to falling costs and prices. While computing did start with the microprocessor, it is headed towards a scheme that incorporates the microprocessor as a smaller subset of a larger system, one that integrates graphics, memory, modem and video transcoding co-processors on a single chip. The SoC era has begun…


References

[1] Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 1: Basic Architecture. [Online]. Available: http://www.intel.com/products/processor/manuals

[2] J. King, E. Quinnell, F. Galloway, K. Patton, P. Seidel, J. Dinh, Hai Bui and A. Bhowmik, "The Floating-Point Unit of the Jaguar x86 Core," in 21st IEEE Symposium on Computer Arithmetic (ARITH), 2013, pp. 7-16.

[3] A. H. Ibrahim, M. B. Abdelhalim, H. Hussein and A. Fahmy, "Analysis of x86 instruction set usage for Windows 7 applications," in 2nd International Conference on Computer Technology and Development (ICCTD), 2010, pp. 511-516.

[4] "PC Architecture," Acid Reviews. [Online]. Available: http://acidreviews.blogspot.in/2008/12/pc-architecture.html (Accessed: 2 February 2014).

[5] D. Alpert and D. Avnon, "Architecture of the Pentium microprocessor," IEEE Micro, vol. 13, no. 3, pp. 11-21, 1993.

[6] "Computer Processor History," Computer Hope. [Online]. Available: http://www.computerhope.com/history/processor.htm (Accessed: 2 February 2014).

[7] Gartner Press Release. [Online]. Available: http://www.gartner.com/newsroom/id/2610015 (Accessed: 8 February 2014).

[8] "Intel Processor Numbers," CPU World. [Online]. Available: http://www.cpu-world.com/info/Intel/processor-number.html (Accessed: 9 February 2014).