Design with Microprocessors

Tajana Simunic Rosing
Department of Computer Science and Engineering
University of California, San Diego

Source: cseweb.ucsd.edu/classes/wi15/cse237A-a/handouts/8_cpus.pdf

ES Design

[Figure: embedded system design flow; surviving labels: Verification and Validation, Hardware, Hardware components]

Embedded System Hardware
• Embedded system hardware is frequently designed in a loop (“hardware in a loop”):
  → cyber-physical systems

Hardware platform architecture

System-on-Chip platforms

[Figures: Nvidia Tegra 2 die photo; Qualcomm Snapdragon block diagram]

General processor
• Application processor (CPU)
Specialized units
• Graphics processing unit (GPU)
• Various digital signal processors (DSPs)

Processor comparison metrics

Clock frequency
Computation speed
Memory subsystem
• Indeterminacy in execution
• Cache miss: compulsory, conflict, capacity
Power consumption
• Idle power draw, dynamic range, sleep modes
Chip area
• Can be critical for embedded form factor
Versatility/specialization
• FPGAs, ASICs
Non-technical
• Development environment, prior expertise, licensing

• E.g. ARM Cortex-A, Intel Atom, TI C54x, TI C60x DSPs, Xilinx Virtex-7, single purpose controller

Processor comparison metrics

Parallelism
Superscalar pipeline
• Depth & width → latency & throughput
Multithreading
• GPU workload requires different programming effort

Instruction set architecture

Complex instruction set computer (CISC):
• Many addressing modes
• Many operations per instruction
• E.g. TI C54x

Reduced instruction set computer (RISC):
• Load/store
• Easy to pipeline
• E.g. ARM

Very Long Instruction Word (VLIW)

Processor comparison metrics
• How do you define “speed”?
  • Clock speed, but instructions per cycle may differ
  • Instructions per second, but work per instruction may differ
• Practical evaluation
  • Dhrystone: synthetic benchmark, developed in 1984
    • Dhrystones/sec is normalized to Dhrystone MIPS (DMIPS) to even out the difference in instruction count between RISC and CISC
    • 1 MIPS = 1757 Dhrystones per second (based on Digital’s VAX 11/780)
  • SPEC: set of more realistic benchmarks, but oriented to desktops
  • EEMBC: EDN Embedded Microprocessor Benchmark Consortium, www.eembc.org
    • Suites of benchmarks: automotive, consumer electronics, networking, office automation, telecommunications
    • E.g. CoreMark, intended to replace Dhrystone
  • 0xbench: integrated Android benchmarks
    • Covers C library, system calls, JavaScript (web performance), graphics, Dalvik VM garbage collection
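The Dhrystone normalization can be sketched in a few lines of C. This is only an illustration of the arithmetic: the function names `dmips` and `dmips_per_mhz` are made up here, and the only constant taken from the slide is the 1757 Dhrystones/sec VAX 11/780 reference.

```c
/* Dhrystones per second achieved by the VAX 11/780, defined as 1 MIPS. */
#define VAX_DHRYSTONES_PER_SEC 1757.0

/* Convert a raw Dhrystone score into Dhrystone MIPS (DMIPS). */
double dmips(double dhrystones_per_sec)
{
    return dhrystones_per_sec / VAX_DHRYSTONES_PER_SEC;
}

/* Normalize by clock frequency so cores running at different
 * speeds can be compared (hypothetical helper). */
double dmips_per_mhz(double dhrystones_per_sec, double clock_mhz)
{
    return dmips(dhrystones_per_sec) / clock_mhz;
}
```

For example, a core that completes 3,514,000 Dhrystones per second at 1000 MHz would score 2000 DMIPS, or 2.0 DMIPS/MHz.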

PARALLEL ARCHITECTURES

Parallelism in programs
• Exists in several levels of granularity
  • Task
  • Data
  • Instruction

[Figure: tasks P1, P2, P3; instruction sequence Ld r1, r2 / Add r3,r4 / Sub r5,r6]

Parallelism extraction
• Static
  • Use compiler to analyze program code
  • Can make use of high-level language constructs
  • Cannot inspect data values
  • Simpler CPU control
• Dynamic
  • Use hardware to identify opportunities
  • Can make use of data values
  • More complex CPU

Superscalar
• Instruction-level parallelism
• Replicated execution resources
• RISC instructions are pipelined
  • Issuing n instructions/cycle requires on the order of n² hardware

[Figure: register file feeding n execution units]

Simple VLIW architecture
• Compile time assignment of instructions to FUs
• Large register file feeds multiple function units.

[Figure: register file feeding the EBOX (execution unit) with function units ALU, ALU, Load/store, Load/store, FU]

Example instruction word: Add r1,r2,r3; Sub r4,r5,r6; Ld r7,foo; St r8,baz; NOP

Clustered VLIW architecture
• Register file, function units divided into clusters.

[Figure: two clusters, each with its own execution units and register file, connected by a cluster bus]

Embedded processor trends

| | ARM11 | ARM Cortex-A8 | ARM Cortex-A9 | Qualcomm Scorpion | Qualcomm Krait [1] | ARM Cortex-A15 MPCore |
|---|---|---|---|---|---|---|
| Decode | single-issue | 2-wide | 2-wide | 2-wide | 3-wide | 3-wide |
| Pipeline depth | 8 stages | 13 stages | 8 stages | 10 stages | 11 stages | 15/17-25 stages |
| Out of order execution | No | No | Yes | Yes, non-speculative [2] | Yes | Yes |
| FPU | VFPv2 (pipelined) | VFPv3 (not pipelined) | VFPv3-D16 or VFPv3-D32 (typical) (pipelined) | VFPv3 (pipelined) | VFPv4 (pipelined) [3] | VFPv4 (pipelined) |
| NEON | None | Yes (partially 128-bit wide) | Optional (partially 128-bit wide) | Yes (128-bit wide) | Yes (128-bit wide) | Yes (128-bit wide) |
| Process technology | 90 nm | 65/45 nm | 45/40/32/28 nm | 65/45 nm | 28 nm | 32/28 nm |
| L0 cache | - | - | - | - | 4 kB + 4 kB direct mapped | - |
| L1 cache | Varying, typically 16 kB + 16 kB | 32 kB + 32 kB | 32 kB + 32 kB | 32 kB + 32 kB | 16 kB + 16 kB 4-way set associative | 32 kB + 32 kB per core |
| L2 cache | Varying, typically none | 256 or 512 (typical) kB | 1 MB | 256 kB (single-core) / 512 kB (dual-core) | 1 MB 8-way set associative (dual-core) / 2 MB (quad-core) | up to 4 MB per cluster, up to 8 MB per chip |
| Core configurations | 1 | 1 | 1, 2, 4 | 1, 2 | 2, 4 | 2, 4, 8 (4×2) |
| DMIPS/MHz per core | 1.25 | 2.0 | 2.5 | 2.1 | 3.3 | 3.5 |

NEON: Advanced SIMD extension is a combined 64- and 128-bit instruction set that provides standardized acceleration for media and DSP apps

Parallelization
• Multiple cores
• Multiple-issue per core
• Out of order execution
• SIMD (NEON)

Process technology
• Higher density allows more parallel structures
• Higher power density, thermal issues

Why isn’t everything just massively parallel?
• Types of architectural hazards
  • Data (e.g. read-after-write, pointer aliasing)
  • Structural
  • Control flow
• Difficult to fully utilize parallel structures
  • Programs have real dependencies that limit ILP
  • Utilization of parallel structures is dependent on programming model
  • Limited window size during instruction issue
  • Memory delays
• High cost of errors in prediction/speculation
  • Performance: Stalls introduced to wait for reissue
  • Energy: Wasted power going down wrong execution path
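Pointer aliasing is a data hazard that a compiler often cannot rule out on its own. The C sketch below is an illustration, not part of the lecture, and the function names are made up. Both loops do the same work, but only the second promises, via C99 `restrict`, that the arrays do not overlap.

```c
#include <stddef.h>

/* dst and src may overlap: a store to dst[i] might change a value that a
 * later iteration loads through src, so loads and stores must stay in
 * program order. */
void scale_may_alias(float *dst, const float *src, float k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = k * src[i];
}

/* restrict (C99) promises no overlap, leaving the compiler free to
 * reorder, unroll, or vectorize the iterations. */
void scale_no_alias(float *restrict dst, const float *restrict src,
                    float k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = k * src[i];
}
```

Calling `scale_may_alias(c + 1, c, 2.0f, 4)` on an array of ones creates a real read-after-write chain, because each iteration reads the value the previous one wrote. The same overlapping call to `scale_no_alias` would break the `restrict` promise and be undefined behavior.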

ARMV7 ARCHITECTURE
General/applications processing

ARMv7
• ARM assembly language - RISC
• ARM programming model
  • Audio players, pagers etc.; 130 MIPS
• ARM memory organization
• ARM data operations (32 bit)
• ARM flow of control
• Hardware-based floating point unit

ARM programming model

[Figure: registers r0 through r14 and r15 (PC), each 32 bits wide (bits 31 to 0); CPSR (Current Program Status Register) holding the flags N, Z, C, V]

ARM status bits
• Arithmetic, logical, and shifting operations can set the CPSR bits (data-processing instructions update them when the S suffix is used, e.g. ADDS):
  • N (negative), Z (zero), C (carry), V (overflow).
• Examples:
  • -1 + 1 = 0: NZCV = 0110.
  • 2^31 - 1 + 1 = -2^31: NZCV = 1001.
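The two examples can be checked with a small C model of a 32-bit add. This is a sketch for illustration; `nzcv_after_add` is a made-up helper, not an ARM API, and it packs the flags in the order N, Z, C, V.

```c
#include <stdint.h>

/* Returns the flags of a 32-bit addition packed as N:Z:C:V (N is bit 3). */
unsigned nzcv_after_add(uint32_t a, uint32_t b)
{
    uint32_t r = a + b;                 /* wraps modulo 2^32 */
    unsigned n = (unsigned)(r >> 31);   /* sign bit of the result */
    unsigned z = (r == 0);              /* result is zero */
    unsigned c = (r < a);               /* unsigned carry out of bit 31 */
    /* Signed overflow: both operands share a sign the result lacks. */
    unsigned v = (unsigned)((~(a ^ b) & (a ^ r)) >> 31) & 1u;
    return (n << 3) | (z << 2) | (c << 1) | v;
}
```

`nzcv_after_add(0xFFFFFFFF, 1)` gives binary 0110 and `nzcv_after_add(0x7FFFFFFF, 1)` gives binary 1001, matching the slide.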

ARM pipeline execution

                 1        2        3
add r0,r1,#5     fetch    decode   execute
sub r2,r3,r6              fetch    decode   execute
cmp r2,#3                          fetch    decode   execute
                 time →

ARM data instructions
• ADD, ADC : add (w. carry)
• SUB, SBC : subtract (w. carry)
• MUL, MLA : multiply (and accumulate)
• AND, ORR, EOR
• BIC : bit clear
• LSL, LSR : logical shift left/right
• ASL, ASR : arithmetic shift left/right
• ROR : rotate right
• RRX : rotate right extended with C

ARM flow of control
• All operations can be performed conditionally, testing CPSR:
  • EQ, NE, CS, CC, MI, PL, VS, VC, HI, LS, GE, LT, GT, LE
• Branch operation: B #100
  • Can be performed conditionally.

ARM comparison instructions
• CMP : compare
• CMN : negated compare
• TST : bit-wise AND
• TEQ : bit-wise XOR
• These instructions set only the NZCV bits of CPSR.

ARM load/store/move instructions
• LDR, LDRH, LDRB : load (half-word, byte)
• STR, STRH, STRB : store (half-word, byte)
• Addressing modes:
  • register indirect : LDR r0,[r1]
  • with second register : LDR r0,[r1,-r2]
  • with constant : LDR r0,[r1,#4]
• MOV, MVN : move (negated)
  MOV r0, r1 ; sets r0 to r1

Addressing modes
• Base-plus-offset addressing: LDR r0,[r1,#16]
  • Loads from location r1+16
• Auto-indexing increments base register: LDR r0,[r1,#16]!
• Post-indexing fetches, then does offset: LDR r0,[r1],#16
  • Loads r0 from r1, then adds 16 to r1
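The three modes have close analogues in C pointer arithmetic. The sketch below is an illustration only; the helper names are made up, and it assumes 32-bit words, so the 16-byte offset becomes 4 elements.

```c
#include <stdint.h>

/* LDR r0,[r1,#16] : base-plus-offset, r1 is left unchanged. */
uint32_t ldr_offset(const uint32_t *r1)
{
    return r1[4];               /* 16 bytes = 4 words past the base */
}

/* LDR r0,[r1,#16]! : auto-indexing (pre-indexed with write-back). */
uint32_t ldr_preindex(const uint32_t **r1)
{
    *r1 += 4;                   /* base register is updated first ... */
    return **r1;                /* ... then the load uses the new base */
}

/* LDR r0,[r1],#16 : post-indexing. */
uint32_t ldr_postindex(const uint32_t **r1)
{
    uint32_t value = **r1;      /* load from the old base ... */
    *r1 += 4;                   /* ... then advance it by 16 bytes */
    return value;
}
```

Post-indexing is the mode that maps onto the familiar `*p++` idiom for walking an array.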

ARM subroutine linkage
• Branch and link instruction: BL foo
  • Copies the return address (the address of the instruction after the BL) to r14, the link register.
• To return from subroutine: MOV r15,r14

ARM Summary
• Load/store architecture
• Most instructions are RISCy, operate in single cycle.
  • Some multi-register operations take longer.
• All instructions can be executed conditionally
• ARMv7-A is deployed in:
  • Cortex-A15 (Snapdragon Krait, Nvidia Tegra, TI OMAP), Cortex-A5 (AMD Fusion), Atmel microcontrollers

Next generation: ARMv8
• Addition of 64-bit support
  • Larger virtual address space
• New instruction set (A64)
  • Fewer conditional instructions
  • New instructions to support 64-bit operands
  • No arbitrary length load/store multiple instructions
• Enhanced cryptography (both 32 and 64-bit)
• Mostly backwards compatible with ARMv7
• Enable expansion into traditional/higher performance markets
  • Mobile phones, servers, supercomputers
  • Cortex-A53, Apple Cyclone, Nvidia Denver

GRAPHICS PROCESSING (GPU)

Graphics pipeline
• Primary architectural elements:
  • Vertex shaders, pixel shaders (fragment shaders)
• Performance measured in GFLOPS

Figure source: http://m.iopscience.iop.org/1742-5468/2009/06/P06016/figures

GPU programming
• Compute Unified Device Architecture (CUDA)
  • Enables GPGPUs: GPUs can be used for general purpose processing (i.e., not exclusively graphics)
• Single program multiple data (SPMD)
  • Programming has to be explicit
  • Threading directives and memory addressing

GPU programming
• A kernel scales across any number of parallel processors
  • Contains grid of thread blocks
    • Thread block shape can be 1D, 2D, 3D
  • All threads in a thread block run the same code, over the same data in shared memory space
  • Threads have thread ID numbers within block to select work and address shared data
  • Threads in different blocks cannot cooperate
• Threads are grouped by warps (scheduling units)

[Figure: kernel queue (KQueue) holding Kernel #1 and Kernel #2; a thread block divided into warps 0 through 5]
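How a thread uses its IDs to select work can be sketched in plain C, with nested loops standing in for the hardware scheduler. This is an illustration under stated assumptions, not CUDA code: `scale_thread` and `launch_scale` are made-up names, and `block_id`, `block_dim`, and `thread_id` play the roles of CUDA's `blockIdx.x`, `blockDim.x`, and `threadIdx.x`.

```c
#include <stddef.h>

/* Body of a hypothetical kernel: each thread scales one array element. */
static void scale_thread(float *data, size_t n, float k,
                         size_t block_id, size_t block_dim, size_t thread_id)
{
    size_t i = block_id * block_dim + thread_id;  /* global element index */
    if (i < n)                                    /* last block may be partial */
        data[i] *= k;
}

/* Sequential stand-in for launching a 1D grid of 1D thread blocks. */
void launch_scale(float *data, size_t n, float k, size_t block_dim)
{
    size_t grid_dim = (n + block_dim - 1) / block_dim;  /* ceil(n / block_dim) */
    for (size_t b = 0; b < grid_dim; b++)
        for (size_t t = 0; t < block_dim; t++)
            scale_thread(data, n, k, b, block_dim, t);
}
```

On a GPU the two loops disappear: every (block, thread) pair runs the kernel body concurrently, which is why the bounds check inside the body is needed.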

Nvidia GeForce Block Diagram
• Ultra low power (ULP) version of GeForce used in Tegra 3, 4
• Desktop GTX 280 shown; embedded versions have different number of TPCs, SMs, etc.

[Figure (credit: Rajib Nath), three nested levels:
 Thread Processing Cluster (TPC): Geometry Controller, SMC, Texture L1 Cache, three SMs.
 Streaming Multiprocessor (SM): Instruction Cache, Multi Thread Issue, Constant Cache, eight SPs, two SFUs, Shared Memory.
 Streaming Processor (SP): Register File, Floating Point Unit, Integer Unit.]

Qualcomm Adreno 2xx
• Snapdragon S1-4 chipsets (e.g. MSM8x60)
  • Newest generation Adreno 5xx released in 2015+
• Unified shaders
  • Same instruction set for fragment and vertex processing
  • More versatile hardware
• 5-way VLIW
• Also used for non-gaming apps, e.g. browser
  • Behavior in non-gaming is not as well-understood

Qualcomm Adreno 2xx
• Snapdragon S4 chipset (MSM8960)
• “Digital core rail power” (red) includes GPU, video decode, & modem digital blocks
• GLBenchmark: high-end gaming content
  • CPU power up to 750 mW @ 1.5 GHz
  • GPU power up to 1.6 W @ 400 MHz

Source: http://www.anandtech.com/print/5559/qualcomm-snapdragon-s4-krait-performance-preview-msm8960-adreno-225-benchmarks

DIGITAL SIGNAL PROCESSING (DSP)

Digital signal processing (DSP)
• Processing or manipulation of signals using digital techniques
• Interfacing with the physical world
  • E.g. audio, digital images, speech processing, medical monitoring (EKG)

[Figure: input signal → ADC (analog to digital converter) → digital signal processor → DAC (digital to analog converter) → output signal]

Source: Dr D. H. Crawford

Fundamental DSP Operations
• Filtering
  • Finite impulse response (FIR)
• Frequency transforms
  • Fast Fourier (FFT)
  • Discrete cosine (DCT)
  • Inverse discrete cosine (IDCT)

Source: Dr D. H. Crawford

FIR filter:

$$y(n) = \sum_{i=0}^{L-1} a_i \, x(n-i)$$

Pseudo C code (x[k] is taken as 0 for k < 0):

    for (n = 0; n < N; n++) {
        s = 0;
        /* stop at i == n so that x[n - i] never reads before x[0] */
        for (i = 0; i < L && i <= n; i++) {
            s += a[i] * x[n - i];
        }
        y[n] = s;
    }

DSP architectural features
• Fixed-point vs floating-point
• VLIW or specialized SIMD techniques
  • E.g. Qualcomm Hexagon DSP (VLIW) dispatches up to 4 instructions to 4 execution units per cycle
• No virtual memory or context switching
• Separate instruction and data storage
  • Harvard architecture vs Von Neumann
• Pipelined FUs are integrated into the datapath
• Main DSP Manufacturers:
  • Texas Instruments (http://www.ti.com)
  • Motorola (http://www.motorola.com)
  • Analog Devices (http://www.analog.com)

Source: Dr D. H. Crawford

What is DSP Used For?
• Speech processing
• Audio effects
• Image compression
• Video encoding
• Noise cancellation
• Virtual/augmented reality
• Actuation error detection

Speech processing
• Encoding
• Compression
• Synthesis
• Recognition

[Figure: waveform of the phrase “The blue spot is on the key again”, with segments labeled “oo” in “blue”, “o” in “spot”, “ee” in “key”, “e” in “again”, “s” in “spot”, “k” in “key”]

Source: Dr D. H. Crawford

Image processing
• Trade off “good enough” quality in “essential” regions
• Enable higher transmission bandwidth, minimal storage, media interactivity
• Still image encoding: JPEG (Joint Photographic Experts Group)
  • JPEG2000: Wavelet Transform based
• Video encoding: MPEG (Moving Picture Experts Group)
  • MPEG-4 Part 10 (AVC, a.k.a. H.264)
    • Variable macroblock sizes (4x4 to 16x16)
    • Enhanced to allow lossless regions

Source: Dr D. H. Crawford

JPEG Codec

Encoding: original pixel data → DCT → coefficient quantization (lossy) → zig-zag run-length encoding → Huffman encoding (lossless) → compressed data

Decoding: compressed data → Huffman decoding → zig-zag run-length expansion → coefficient denormalization → IDCT → reconstructed pixel data

Source: Xilinx

MPEG-2 Codec

Encoding

[Figure: encoder block diagram. Blocks: original video, subtractor (-), DCT, quantization, variable length coder (VLC), bitstream out, inverse quantization, IDCT, adder (+), loop filter, frame store, motion estimation, motion compensation; intraframe and interframe paths]

Source: Xilinx

MPEG-2 Codec

Decoding

[Figure: decoder block diagram. Blocks: bitstream buffer, variable length decoder (VLD), inverse quantization, IDCT, adder (+), frame store, motion compensation, output; intraframe and interframe paths]

Source: Xilinx

DCT/IDCT Concept
• Used for lossy compression
• Discrete cosine transform (DCT)
  • Represent a finite sequence of data points with a sum of cosine (even) functions of different frequencies
  • Similar to the discrete Fourier transform (DFT), but using only real numbers
  • If a coefficient has a lot of variance over a set, then it cannot be removed without affecting the picture quality
• Inverse DCT
  • Reconstruct sequence from frequency coefficients

Source: Xilinx

2D DCT & IDCT
• Image divided into blocks of 8x8 pixels
• “The DCT” of each block is an 8x8 transform coefficient array; entries represent spatial frequencies

Source: Xilinx

DCT & IDCT operations
• Dedicated functional unit: fused multiply-add
• Common to DCT/IDCT and many other DSP operations

Source: Xilinx

[Figures: DCT and IDCT equations]
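For reference, the usual 8x8 definitions (the form used by JPEG, assumed here to correspond to the slide's equation figures) are given below, where $f(x,y)$ is a pixel value, $F(u,v)$ is a frequency coefficient, and $C(0) = 1/\sqrt{2}$, $C(k) = 1$ for $k > 0$:

```latex
F(u,v) = \frac{1}{4}\, C(u)\, C(v) \sum_{x=0}^{7} \sum_{y=0}^{7}
         f(x,y)\, \cos\frac{(2x+1)u\pi}{16}\, \cos\frac{(2y+1)v\pi}{16}

f(x,y) = \frac{1}{4} \sum_{u=0}^{7} \sum_{v=0}^{7}
         C(u)\, C(v)\, F(u,v)\, \cos\frac{(2x+1)u\pi}{16}\, \cos\frac{(2y+1)v\pi}{16}
```

Every term of both sums is a product added to a running total, which is why a multiply-add unit serves both transforms.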

Group of Pictures (GOP)
• A structure of consecutive frames that can be decoded without any other reference frames
• Transmitted sequence is not the same as displayed sequence
• Inter-frame prediction/compression exploits temporal redundancy between neighboring frames
• Intra-frame coding is applied only to the current frame

Types of frames
• I frame (intra-coded)
  • Coded without reference to other frames
  • Begins each GOP
• P frame (predictive-coded)
  • Coded with reference to a previous reference frame (either I or P)
  • Size is usually about 1/3 of an I frame
• B frame (bi-directional predictive-coded)
  • Coded with reference to both previous and future reference frames (either I or P)
  • Size is usually about 1/6 of an I frame

Fixed-Point Design
• Digital signal processing algorithms
  • Early development in floating point
  • Converted into fixed point for production to gain efficiency
• Fixed-point digital hardware
  • Lower area
  • Lower power
  • Lower per unit production cost

[Figure: design flow Idea → Floating-Point Algorithm → Quantization (with Range Estimation) → Fixed-Point Algorithm → Code Generation → Target System, spanning the algorithm level and the implementation level]

Copyright Kyungtae Han [2]

Fixed-Point Design
• All variables have to be annotated manually
  • Value ranges are well known
  • Avoid overflow
  • Minimize quantization effects
  • Find optimum wordlength
• Manual process supported by simulation
  • Time-consuming
  • Error prone

Copyright Kyungtae Han [2]

Fixed-Point Representation
• Fixed point type
  • Wordlength
  • Integer wordlength
• Quantization modes
  • Round
  • Truncation
• Overflow modes
  • Saturation
  • Saturation to zero
  • Wrap-around

[Figure: bit layouts in SystemC format (www.systemc.org): a word S X X X X X annotated with its wordlength and integer wordlength, and a word X X X X X with integer wordlength = -2]

Copyright Kyungtae Han [2]
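The quantization and overflow modes can be sketched in C for a 16-bit word. This is a minimal illustration with made-up helper names, not the SystemC fixed-point types. The shifts assume non-negative inputs, since right-shifting a negative value is implementation-defined in C.

```c
#include <stdint.h>

/* Overflow mode "saturation": clamp to the 16-bit signed range. */
int16_t sat16(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Overflow mode "wrap-around": keep only the low 16 bits. */
int16_t wrap16(int32_t v)
{
    uint16_t u = (uint16_t)v;              /* reduce modulo 2^16 */
    return (u >= 0x8000u) ? (int16_t)((int32_t)u - 65536) : (int16_t)u;
}

/* Quantization mode "truncation": drop the `shift` low-order bits. */
int32_t quant_trunc(int32_t v, unsigned shift)
{
    return v >> shift;
}

/* Quantization mode "round": add half an LSB first (needs shift >= 1). */
int32_t quant_round(int32_t v, unsigned shift)
{
    return (v + ((int32_t)1 << (shift - 1))) >> shift;
}
```

For example, 40000 saturates to 32767 but wraps to -25536, and dropping two fractional bits from 7 gives 1 with truncation and 2 with rounding.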

Fixed-Point Representation

x = 0.5 × 0.125 + 0.25 × 0.125
  = 0.0625 + 0.03125
  = 0.09375

• For integer word length iwl=1 and fractional word length fwl=3 decimal digits, the less significant digits are automatically chopped off: x = 0.093
• Like a floating point system with numbers ∈ (-1..1), with no stored exponent (bits used to increase precision).
→ Automatic scaling: Shifting after multiplications and divisions in order to maintain binary point.
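The shift after a multiplication can be sketched in C. The example assumes the common Q15 format (1 sign bit, 15 fractional bits), which is a binary counterpart of the slide's decimal example rather than something the slide specifies; `q15_mul` is an illustrative name.

```c
#include <stdint.h>

#define Q15_FRAC_BITS 15   /* value = raw / 2^15, range [-1, 1) */

/* Multiply two Q15 numbers. The raw 32-bit product has 30 fractional
 * bits, so it is shifted right to restore the binary point. The
 * low-order bits are chopped off (truncation). Note that -1.0 * -1.0
 * does not fit in Q15; production code would saturate that case. */
int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t wide = (int32_t)a * (int32_t)b;     /* Q15 * Q15 = Q30 */
    return (int16_t)(wide >> Q15_FRAC_BITS);    /* back to Q15 */
}
```

With 0.5 = 16384, 0.25 = 8192, and 0.125 = 4096 in Q15, the two products are 2048 and 1024, and their sum 3072 represents 0.09375 exactly. In this binary format no digits are lost for this particular example.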

TI C54x family (CISC)
• Modified Harvard architecture: separate buses for program code and data
  • PB: program read bus
  • CB, DB: data read buses
  • EB: data write bus
  • PAB, CAB, DAB, EAB: address buses
• Can generate two data memory addresses per cycle
  • Stored in auxiliary register address units
• High performance, reproducible behavior, optimized for different memory structures

TI C54x architectural features
• 40-bit ALU & barrel shifter
  • Input from accumulator or data memory
  • Output to ALU
• 17 x 17 multiplier
• Single-cycle exponent encoder
• Two address generators with dedicated registers
• Accumulators
  • Low-order word (bits 0-15), high-order word (bits 16-31), guard bits (32-39)
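The guard bits give the accumulator headroom: a sum of 16 x 16-bit products can grow past 32 bits without overflowing. The C sketch below illustrates this using `int64_t` as a stand-in for the 40-bit register; `mac` and `fits_signed` are made-up names, not TI intrinsics.

```c
#include <stddef.h>
#include <stdint.h>

/* Multiply-accumulate over two 16-bit vectors into a wide accumulator. */
int64_t mac(const int16_t *a, const int16_t *x, size_t n)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)a[i] * (int32_t)x[i];  /* each product fits in 32 bits */
    return acc;
}

/* True if v is representable as a signed two's-complement value of
 * `bits` bits (1 <= bits <= 63). */
int fits_signed(int64_t v, unsigned bits)
{
    int64_t limit = (int64_t)1 << (bits - 1);
    return v >= -limit && v < limit;
}
```

Summing four products of 32767 x 32767 already exceeds the 32-bit range but fits comfortably in 40 bits.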

TI C54x instruction set features
• Compare, select and store unit (CSSU)
  • Compares high and low accumulator words
  • Accelerates Viterbi operations
• Repeat and block repeat instructions
• Instructions that read 2, 3 operands simultaneously
• Three IDLE instructions
  • Selectively shut down CPU, on-chip peripherals, whole chip including phase-locked loop

TI C54x pipeline
• Prefetch: Send PC address on program address bus
• Fetch: Load instruction from program bus to IR
• Decode
• Access: Put operand addresses on buses
• Read: Get operands from buses
• Execute

Addressing Modes

| Addressing mode | Operand field | Register-file contents | Memory contents |
|---|---|---|---|
| Immediate | Data | | |
| Register-direct | Register address | Data | |
| Register indirect | Register address | Memory address | Data |
| Direct | Memory address | | Data |
| Indirect | Memory address | | Memory address, then Data |

TI C55x Pipeline
• Prefetch 1:
  • Send address to memory
• Prefetch 2:
  • Wait for response
• Fetch:
  • Get instruction from memory and put in IBQ
• Predecode:
  • Identify where instructions begin and end
  • Identify parallel instructions
• Decode:
  • Decode an instruction pair or single instruction.
• Address:
  • Perform address calculations.
• Access 1/2:
  • Send address to memory; wait.
• Read:
  • Read data from memory. Evaluate condition registers.
• Execute:
  • Read/modify registers. Set conditions.
• W/W+:
  • Write data to MMR-addressed registers, memory; finish.

[Figure: 4 fetch stages, 7-8 execute stages]

C55x organization

[Figure: instruction unit, program flow unit, address unit, and data unit connected by 3 data read buses, 3 data read address buses, a program address bus, a program read bus, 2 data write buses, and 2 data write address buses; bus widths labeled 16, 24, 24, 16, 24, 32. Functions shown: instruction fetch, data read from memory, writes. D bus: single operand read; C, D buses: dual operand read; B bus: dual-multiply coefficient.]

C55x hardware extensions
- Target image/video applications
  - DCT/IDCT
  - Pixel interpolation
  - Motion estimation
- Available in 5509 and 5510
  - Equivalent C-callable functions for other devices

TI C62/C67 (VLIW)
- Up to 8 instructions/cycle
- 32 32-bit registers
- Function units
  - Two multipliers
  - Six ALUs
- Data operations
  - 8/16/32-bit arithmetic
  - 40-bit operations
  - Bit manipulation operations

Partitioned register files
- Many memory ports are required to supply enough operands per cycle.
- Memories with many ports are expensive.
- Therefore, registers are partitioned into sets, e.g. for TI C60x:

[Figure: data path A (register file A; units L1, S1, M1, D1) and data path B (register file B; units D2, M2, S2, L2), both connected to the data bus and address bus]

C6x data paths
- General-purpose register files (A and B, 16 words each)
- Eight functional units: .L1, .L2, .S1, .S2, .M1, .M2, .D1, .D2
- Two load units (LD1, LD2)
- Two store units (ST1, ST2)
- Two register file cross paths (1X and 2X)
- Two data address paths (DA1 and DA2)

C6x functional units
- .L
  - 32/40-bit arithmetic
  - Leftmost 1 counting
  - Logical ops
- .S
  - 32-bit arithmetic
  - 32/40-bit shift and 32-bit bit-field operations
  - Branches
  - Constants
- .M
  - 16 x 16 multiply
- .D
  - 32-bit add, subtract, circular address calculation
  - Load, store with 5/15-bit constant offset
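Because the C6x issues up to eight instructions per cycle across eight functional units, a group of parallel instructions is only valid if no unit is claimed twice. A minimal sketch of that check, assuming each instruction is given as a (mnemonic, unit) pair; the function name `check_execute_packet` and the sample packets are hypothetical illustrations, not output of any TI tool:

```python
# Unit names as listed on the C6x data paths slide.
UNITS = {".L1", ".L2", ".S1", ".S2", ".M1", ".M2", ".D1", ".D2"}


def check_execute_packet(packet):
    """Return a list of problems for a group of parallel instructions.

    packet is a list of (mnemonic, unit) pairs. An empty result means
    the group fits the eight functional units, one instruction per unit.
    """
    problems = []
    if len(packet) > len(UNITS):
        problems.append("more than 8 instructions in one cycle")
    seen = {}
    for mnemonic, unit in packet:
        if unit not in UNITS:
            problems.append(f"unknown unit {unit}")
        elif unit in seen:
            problems.append(f"{unit} used by both {seen[unit]} and {mnemonic}")
        else:
            seen[unit] = mnemonic
    return problems


if __name__ == "__main__":
    print(check_execute_packet([("ADD", ".L1"), ("MPY", ".M1"), ("LDW", ".D1")]))
    print(check_execute_packet([("ADD", ".L1"), ("SUB", ".L1")]))
```

In a VLIW design this kind of check belongs to the compiler or assembler, since instruction-to-unit assignment happens at compile time rather than in hardware.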

C6x system
- On-chip RAM
- 32-bit external memory: SDRAM, SRAM
- Host port
- Multiple serial ports
- Multichannel DMA
- 32-bit timer

PROGRAMMABLE LOGIC DEVICES (PLD)

Programmable Logic Devices (PLD)
- Simple PLD (SPLD)
  - Programmable logic array (PLA)
  - Programmable array logic (PAL): fixed OR plane
- Complex PLD (CPLD)
  - Building block: macrocells
- Variations:
  - Antifuse PLD
  - Erasable EPLD & EEPLD

Name       Re-programmable        Volatile   Technology
Fuse       No                     No         Bipolar
EPROM      Yes (out of circuit)   No         UVCMOS
EEPROM     Yes (in circuit)       No         EECMOS
SRAM       Yes (in circuit)       Yes        CMOS
Antifuse   No                     No         CMOS+
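The difference between a PLA and a PAL is easier to see in a small model. The sketch below treats a PLA as a programmable AND plane (product terms) feeding a programmable OR plane; in a PAL the OR plane would be fixed rather than user-defined. The data layout and the names `product_term` and `pla_eval` are assumptions made for illustration, and the half adder is just a convenient example function.

```python
def product_term(inputs, literals):
    """AND together the selected literals.

    literals maps an input name to the value it must have, so
    {"a": True, "b": False} represents the term a AND (NOT b).
    """
    return all(inputs[name] == value for name, value in literals.items())


def pla_eval(inputs, and_plane, or_plane):
    """Evaluate a two-level AND-OR array.

    and_plane: list of product terms (dicts of literals).
    or_plane:  maps each output name to the indices of the terms it ORs.
    """
    terms = [product_term(inputs, literals) for literals in and_plane]
    return {name: any(terms[i] for i in indices)
            for name, indices in or_plane.items()}


# Example programming: a half adder.
and_plane = [
    {"a": True, "b": False},   # a AND (NOT b)
    {"a": False, "b": True},   # (NOT a) AND b
    {"a": True, "b": True},    # a AND b
]
or_plane = {"sum": [0, 1], "carry": [2]}

if __name__ == "__main__":
    for a in (False, True):
        for b in (False, True):
            print(a, b, pla_eval({"a": a, "b": b}, and_plane, or_plane))
```

Reprogramming the device corresponds to changing `and_plane` and `or_plane`; the evaluation structure itself stays fixed, which mirrors how the silicon works.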

Antifuse PLDs
- Actel Axcelerator family
- Antifuse:
  - Open when not programmed
  - Low resistance when programmed

Erasable Programmable Logic Devices
- EPLDs: CMOS erasable programmable ROM (EPROM) erased by UV light
- Altera’s building block is a MACROCELL

[Figure: macrocell built from an 8-product-term AND-OR array plus programmable MUXes (clock MUX, output MUX, feedback MUX), with invert control for programmable polarity, a sequential logic block, programmable feedback, and an I/O pin]

Complex Programmable Logic Devices (CPLD)
- Altera Multiple Array Matrix (MAX) architecture
- AND-OR structures are relatively limited; they cannot share signals/product terms among macrocells
- Logic Array Blocks (similar to macrocells)
- Global routing: Programmable Interconnect Array (PIA)

EPM5128:
- 8 fixed inputs
- 52 I/O pins
- 8 LABs
- 16 macrocells/LAB
- 32 expanders/LAB

[Figure: eight LABs (A through H) arranged around a central PIA]

Altera MAX 7k (EEPLD) Logic Block

SRAM based PLD
- Altera Flex 10k Block Diagram

SRAM based PLD
- Altera Flex 10k Logic Array Block (LAB)

SRAM based PLD
- Altera Flex 10k Logic Element (LE)

Field-Programmable Gate Arrays (FPGA)
- Unlike PLA/PAL, configuration is stored in volatile SRAM
- External memory (ROM) holds the configuration while the device is unpowered
- Used to emulate/prototype ASICs

Components:
- Logic blocks: implement combinational and sequential logic; based on lookup tables (LUT)
- Interconnect: wires connecting I/O to logic blocks
- I/O blocks: special logic blocks at periphery of device for external connections
- Specialized blocks: e.g. DSP
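A lookup table implements combinational logic by storing the truth table rather than wiring gates: a k-input LUT holds 2^k configuration bits, and the inputs simply select one of them. A minimal sketch, assuming the first input is the least significant address bit (the class `LUT` and helper `program_lut` are hypothetical names, not a vendor API):

```python
class LUT:
    """A k-input lookup table holding 2**k configuration bits."""

    def __init__(self, num_inputs, truth_table):
        if len(truth_table) != 2 ** num_inputs:
            raise ValueError("truth table must have 2**num_inputs entries")
        self.num_inputs = num_inputs
        self.truth_table = list(truth_table)

    def __call__(self, *inputs):
        if len(inputs) != self.num_inputs:
            raise ValueError("wrong number of inputs")
        # The inputs form the address of the stored bit (first input = LSB).
        index = 0
        for position, value in enumerate(inputs):
            index |= (1 if value else 0) << position
        return self.truth_table[index]


def program_lut(num_inputs, function):
    """Fill a LUT by evaluating a Boolean function on every input pattern."""
    table = []
    for index in range(2 ** num_inputs):
        bits = [(index >> i) & 1 for i in range(num_inputs)]
        table.append(int(function(*bits)))
    return LUT(num_inputs, table)


# Example: a 3-input majority function.
majority = program_lut(3, lambda a, b, c: a + b + c >= 2)

if __name__ == "__main__":
    print(majority.truth_table)
    print(majority(1, 1, 0), majority(0, 0, 1))
```

Any function of k inputs costs the same single LUT, which is why FPGA logic capacity is counted in LUTs or logic cells rather than in gates.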

FPGA with DSP
- Altera Stratix II: Block Diagram

FPGA with DSP
- Altera Stratix II DSP block

Xilinx Virtex-7
- Up to 2M logic cells
- High block RAM-to-logic cell ratios (up to 68 Mb with 1,139K logic cells)
- Configurable logic blocks (CLB) organized into 2 slices
  - Lookup tables, carry chains, registers
  - Distributed memory and shift register logic
- Target market:
  - Test and measurement (T&M) applications, bridging/switch fabric, RADAR, ASIC emulation

Source: http://www.xilinx.com/support/documentation/user_guides/ug474_7Series_CLB.pdf

Xilinx Virtex-7 Slice
- Modern slices come in two varieties
  - SLICEM implementing logical, shift register, and memory functions
  - SLICEL implementing logical functions only
- SLICEM placed closer to DSP slices for easy access to coefficients

Source: http://www.xilinx.com/support/documentation/white_papers/wp405-7Series-Logical-Advantage.pdf

[Figure: Packing 2 independent logical functions into 1 LUT]
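One way to picture packing two functions into a single LUT is to split a 64-entry table into two 32-entry halves, each holding one 5-input function of the same five inputs. The sketch below is a simplified model under that assumption; the half-to-output mapping and the names `pack_two_functions` and `read_packed` are illustrative and are not taken from the Xilinx documentation.

```python
def pack_two_functions(f, g, num_shared_inputs=5):
    """Build one table holding two functions of the same shared inputs.

    The lower half stores f and the upper half stores g (an assumed layout).
    """
    half = 2 ** num_shared_inputs

    def bits(index):
        return [(index >> i) & 1 for i in range(num_shared_inputs)]

    lower = [int(f(*bits(i))) for i in range(half)]
    upper = [int(g(*bits(i))) for i in range(half)]
    return lower + upper


def read_packed(table, inputs):
    """Return both outputs for one pattern of the shared inputs."""
    half = len(table) // 2
    index = sum((1 if bit else 0) << i for i, bit in enumerate(inputs))
    return table[index], table[half + index]


# Example: 5-input parity and 5-input AND sharing one 64-entry table.
table = pack_two_functions(
    lambda a, b, c, d, e: a ^ b ^ c ^ d ^ e,
    lambda a, b, c, d, e: a & b & c & d & e,
)

if __name__ == "__main__":
    print(len(table))
    print(read_packed(table, [1, 0, 0, 0, 0]))
    print(read_packed(table, [1, 1, 1, 1, 1]))
```

The packing only works when both functions fit within the shared input pins, so a synthesis tool has to find pairs of small functions with overlapping inputs to benefit from it.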

Application-specific integrated circuits (ASICs)
- Custom integrated circuits that have been designed for a single use or application
- Standard single-purpose processors
  - “Off-the-shelf”, pre-designed for a common task (e.g. peripherals)
  - Serial transmission
  - Analog/digital conversions

Combined system: CPU+FPGA+ASICs
- Actel Fusion Family
  - ARM7 CPU with FPGA and ASIC implementations of “smart peripherals” for analog functions

PROCESSOR COMPARISON THROUGH APPLICATIONS

Stereo Vision
- Moving window with data sharing between threads to avoid reprocessing
  - Optimized implementation per device
- GPU: slow data sharing
  - Limited speed due to memory access conflicts
- FPGA: 242 data windows (121 per image)
  - Stepwise reduction after window size becomes too large
  - Pipelined execution on logic units


SURF Feature Detection
- Efficiently detect and describe interest points in images
- Enables applications such as object recognition and 3D reconstruction

                                 CPU    GPU    FPGA
Stage 1: Detector speed [ms]     5200   105    100
Stage 2: Descriptor speed [ms]   1.4    0.1    0.7
Power consumption [W]            24     24     6
Mass [g]                         850    850    210
Volume [cm³]                     600    600    180

Source: Krajník, Tomáš, et al. "FPGA-based module for SURF extraction." Machine Vision and Applications 25.3 (2014)
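The table's detector figures can be turned into speedup and energy estimates with a few lines of arithmetic. The sketch below uses only the Stage 1 times and power values listed above; it assumes the listed power is drawn for the full duration of the detector stage, which the table itself does not state, so the energy numbers are rough estimates only.

```python
# Values copied from the comparison table (Stage 1 detector only).
detector_ms = {"CPU": 5200, "GPU": 105, "FPGA": 100}
power_w = {"CPU": 24, "GPU": 24, "FPGA": 6}


def speedup_over_cpu(device):
    """Detector-stage speedup relative to the CPU implementation."""
    return detector_ms["CPU"] / detector_ms[device]


def detector_energy_joules(device):
    """Energy = power x time, assuming constant power during the stage."""
    return power_w[device] * detector_ms[device] / 1000.0


if __name__ == "__main__":
    for device in detector_ms:
        print(device,
              round(speedup_over_cpu(device), 1),
              round(detector_energy_joules(device), 2))
```

Under that assumption the GPU and FPGA detectors run at similar speed, but the FPGA's lower power gives it roughly a fourfold energy advantage over the GPU for this stage.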

Summary
- Processor metrics, trends
- Architectures and functions
  - CPUs
  - GPU
  - DSP
- Implementations
  - Programmable logic: PLDs and FPGAs
  - Custom ASICs
