CS 6354 by WeiKeng Qin, Jian Xiang, & Ren Xu December 8, 2009.


Introduction
• Motivation
  - A multi-core on our desks
  - A new microarchitecture to replace NetBurst
• Intel Core 2 Duo
  - A dual-core CPU
  - ISA with SIMD extensions
  - Intel Core microarchitecture
  - Memory hierarchy system

Instruction Set Architecture
• Base: x86-64
• No VLIW (Itanium)
• SIMD extensions: MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1

• Pentium MMX, 1996: 8 new registers, packed data types, integer operations
• Pentium III, SSE, 1999: 8 new registers, floating-point operations
• Pentium 4, SSE2, 2001: double precision, 128-bit register support
• Prescott, SSE3, 2004: DSP-oriented math, process management
• Core 2, SSSE3, July 2006: e.g. permuting bytes in a word
• Wolfdale, SSE4.1, Sep 2006

Streaming SIMD Extensions (SSE) 4.1
• Beginning with the 45 nm processors
• 47 instructions that improve performance of media data manipulation
• e.g. fast and efficient bit-width conversions
  - Convert single-byte values to word (16-bit) values


SSE2 code:
    MOVDQU    XMM0, M64
    PXOR      XMM1, XMM1
    PUNPCKLBW XMM0, XMM1

SSE4.1 code:
    PMOVZXBW  XMM0, M64

PMOVZXBW operation:
    DEST[15:0]    <-- ZeroExtend(SRC[7:0]);
    DEST[31:16]   <-- ZeroExtend(SRC[15:8]);
    DEST[47:32]   <-- ZeroExtend(SRC[23:16]);
    DEST[63:48]   <-- ZeroExtend(SRC[31:24]);
    DEST[79:64]   <-- ZeroExtend(SRC[39:32]);
    DEST[95:80]   <-- ZeroExtend(SRC[47:40]);
    DEST[111:96]  <-- ZeroExtend(SRC[55:48]);
    DEST[127:112] <-- ZeroExtend(SRC[63:56]);

Benefits
• Reduced instruction count (3 → 1)
• Better performance (~40% speedup per loop)
• Reduced register pressure (2 → 1)
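To make the byte-to-word conversion above concrete, here is a small Python emulation of the two instruction sequences. It is only a sketch: registers are modeled as 16-byte little-endian `bytes` objects, and the helper names (`punpcklbw`, `pmovzxbw`, `sse2_widen`, `sse41_widen`) are illustrative names chosen here, not an existing library API.

```python
def punpcklbw(dst: bytes, src: bytes) -> bytes:
    """Interleave the low 8 bytes of dst and src: d0, s0, d1, s1, ..."""
    out = bytearray()
    for d, s in zip(dst[:8], src[:8]):
        out += bytes([d, s])
    return bytes(out)


def pmovzxbw(src: bytes) -> bytes:
    """Zero-extend the low 8 bytes of src into 8 little-endian 16-bit words."""
    out = bytearray()
    for b in src[:8]:
        out += b.to_bytes(2, "little")
    return bytes(out)


def sse2_widen(mem: bytes) -> bytes:
    """Three-step SSE2 sequence: load, zero a second register, interleave."""
    xmm0 = (mem + bytes(16))[:16]   # load; only the low 8 bytes are used below
    xmm1 = bytes(16)                # PXOR XMM1, XMM1 gives an all-zero register
    return punpcklbw(xmm0, xmm1)    # PUNPCKLBW XMM0, XMM1


def sse41_widen(mem: bytes) -> bytes:
    """Single-step SSE4.1 sequence: PMOVZXBW XMM0, M64."""
    return pmovzxbw(mem)


data = bytes([1, 2, 3, 250, 0, 255, 7, 8])
words = sse41_widen(data)
print([int.from_bytes(words[i:i + 2], "little") for i in range(0, 16, 2)])
print(sse2_widen(data) == words)
```

Both paths produce the same 16-byte result; the SSE4.1 path simply needs one instruction and one register instead of three instructions and two registers.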

Microarchitecture
The Cores
• Single die (107 mm²)
• Two identical cores (L1 cache 64 KB x 2)
• Shared L2 cache, 6 MB
• No Hyper-Threading, no L3 cache
• Keeps the front-side bus
• Larger L2 cache

Microarchitecture
• 14-stage pipeline
• 4-wide decode
• 4-wide retire
• Macro-fusion
• Enhanced ALUs
• Deeper buffers

Another View

Decode Hardware
• 128-bit fetch bandwidth
• 18-entry instruction queue (IQ)
• Complex decode
  - produces 1-4 micro-ops
• Microcode sequencer

Macro-fusion
• New micro-op
  - Represents an instruction pair as a single micro-op

Enhanced ALUs
• Execute the new compare-and-jump (CMPJCC) micro-op in one clock
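The macro-fusion idea above can be sketched as a pass over a decoded instruction stream. This is a simplified model, not the hardware's actual rule set: the set of fusible first instructions (`cmp`, `test`), the textual instruction format, and the function names are assumptions made for illustration, and real processors apply further restrictions on which pairs may fuse.

```python
# Assumption: only CMP/TEST followed directly by a conditional jump is fused.
FUSIBLE_FIRST = {"cmp", "test"}


def mnemonic(instr: str) -> str:
    return instr.split()[0].lower()


def is_conditional_jump(instr: str) -> bool:
    m = mnemonic(instr)
    return m.startswith("j") and m != "jmp"


def macro_fuse(instrs):
    """Return a list of micro-op tuples, fusing compare + conditional jump pairs."""
    uops = []
    i = 0
    while i < len(instrs):
        cur = instrs[i]
        nxt = instrs[i + 1] if i + 1 < len(instrs) else None
        if nxt is not None and mnemonic(cur) in FUSIBLE_FIRST and is_conditional_jump(nxt):
            uops.append(("cmpjcc", cur, nxt))   # one fused micro-op for the pair
            i += 2
        else:
            uops.append(("uop", cur))           # ordinary, unfused micro-op
            i += 1
    return uops


program = ["mov eax, [mem]", "cmp eax, ebx", "jne target", "add ecx, 1", "jmp loop"]
for uop in macro_fuse(program):
    print(uop)
```

In this model the five instructions become four micro-ops, so the fused pair occupies a single slot in the decode, dispatch, and retire stages.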

Out-of-Order Execution
• 96-entry ROB
• 32-entry reservation station

Execution Units
• 6 dispatch ports (1 load, 2 store, 3 universal ports)
• 3 integer ALUs, 2 floating-point ALUs

Branch Predictor
• Loop detector
  - Tracks the number of loop iterations for future reference
• The branch prediction unit (BPU) selects among the following for every branch:
  - bimodal predictor
  - global predictor
  - loop detector
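A loop detector of the kind listed above can be sketched as a per-branch table that remembers how many times a loop branch was taken before it last fell through. The class below is a minimal illustration under that assumption; the table layout, the `LoopDetector` and `count_mispredictions` names, and the unbounded dictionaries are hypothetical and do not describe Intel's implementation.

```python
class LoopDetector:
    """Learns the trip count of a loop branch and predicts its exit."""

    def __init__(self):
        self.trip = {}    # branch IP -> taken count observed before the last exit
        self.count = {}   # branch IP -> taken count in the current run of the loop

    def predict(self, ip: int) -> bool:
        """Return True for 'taken', False for 'not taken' (loop exit)."""
        trip = self.trip.get(ip)
        if trip is not None and self.count.get(ip, 0) == trip:
            return False
        return True

    def update(self, ip: int, taken: bool) -> None:
        if taken:
            self.count[ip] = self.count.get(ip, 0) + 1
        else:
            self.trip[ip] = self.count.get(ip, 0)   # remember the trip count
            self.count[ip] = 0


def count_mispredictions(detector, ip, outcomes):
    misses = 0
    for taken in outcomes:
        if detector.predict(ip) != taken:
            misses += 1
        detector.update(ip, taken)
    return misses


detector = LoopDetector()
one_run = [True, True, True, False]   # taken three times, then the loop exits
print(count_mispredictions(detector, 0x400, one_run))
print(count_mispredictions(detector, 0x400, one_run))
```

The first run of the loop mispredicts the exit once; after the trip count has been recorded, the second run is predicted without misses.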

Cache Organization
• Private L1 DCache and ICache: 32K/core, 8-way, 64B line size, write-back (directory-based coherence)
• Shared L2 cache: 8-way, 64B line size (E8xxx)
  - pros: could mean less bus traffic
  - cons: longer access latency than a private L2 cache; potential conflicts between threads
• FSB: 1333 MHz (E8xxx)
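The L1 parameters on this slide (32K, 8-way, 64B lines) determine the number of sets and how an address splits into tag, index, and offset. The short calculation below works that out; the function names and the sample address are illustrative choices, and the split assumes a conventional power-of-two set-associative layout.

```python
def cache_geometry(size_bytes: int, ways: int, line_bytes: int):
    """Return (sets, offset_bits, index_bits) for a set-associative cache."""
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1   # log2 of a power of two
    index_bits = sets.bit_length() - 1
    return sets, offset_bits, index_bits


def split_address(addr: int, offset_bits: int, index_bits: int):
    """Split an address into (tag, set index, byte offset within the line)."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


# L1 parameters from the slide: 32 KB, 8-way, 64-byte lines
sets, offset_bits, index_bits = cache_geometry(32 * 1024, 8, 64)
print(sets, offset_bits, index_bits)                     # 64 sets, 6 offset bits, 6 index bits
print(split_address(0x12345, offset_bits, index_bits))   # sample address, chosen arbitrarily
```

With 64 sets and 64-byte lines, the index and offset together cover the low 12 address bits.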

Memory Disambiguation
• Aggressive memory dependence speculation based on a hash table indexed by the load's EIP address
• Watchdog mechanism

Prediction Implementation
• History table indexed by instruction pointer
• Each entry in the history array has a saturating counter
• Once the counter saturates, disambiguation is possible on this load (takes effect from the next iteration)
  - the load is allowed to go even if it meets unknown store addresses
• When a particular load fails disambiguation: reset its counter
• Each time a particular load is correctly disambiguated: increment its counter

Predictor Lookup
• When a load is sent from the RS, set its disambiguation bit

Load Dispatch
• If the load meets an older unknown store address, set "update"
• If the prediction is "go", dispatch and set "done"
• Else the load is blocked

Prediction Verification
• A store scans the load buffer for all previous loads; if a match is found, the "reset" bit is set
• When the load commits, update the history
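The counter policy described above (increment on a correct disambiguation, reset on a failure, allow speculation only once saturated) can be sketched as follows. The table size, the counter maximum, the modulo hash, and the class and method names are assumptions made for this example; the slides do not give those details.

```python
class DisambiguationPredictor:
    """History table of saturating counters indexed by the load's instruction pointer."""

    def __init__(self, entries: int = 64, max_count: int = 15):
        self.entries = entries        # assumed table size
        self.max_count = max_count    # assumed saturation value
        self.counters = [0] * entries

    def _index(self, load_ip: int) -> int:
        return load_ip % self.entries   # simple hash, chosen for illustration

    def predict_go(self, load_ip: int) -> bool:
        """True if the load may pass older stores whose addresses are unknown."""
        return self.counters[self._index(load_ip)] == self.max_count

    def update(self, load_ip: int, conflicted: bool) -> None:
        """Called when the load commits: reset on a conflict, else count up."""
        i = self._index(load_ip)
        if conflicted:
            self.counters[i] = 0
        else:
            self.counters[i] = min(self.max_count, self.counters[i] + 1)


predictor = DisambiguationPredictor(max_count=3)
ip = 0x401000
for _ in range(3):
    predictor.update(ip, conflicted=False)
print(predictor.predict_go(ip))    # counter is saturated, so the load may go
predictor.update(ip, conflicted=True)
print(predictor.predict_go(ip))    # counter was reset, so the load must wait
```

Resetting on a single failure keeps the predictor conservative: a load must again build up a full run of conflict-free executions before it is allowed to speculate.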

Execute Disable Bit Support
• Similar features: AMD Enhanced Virus Protection; ARM eXecute Never
• Helps prevent buffer overflow attacks
• No need for software patches against buffer overflow attacks
• Segregates memory by whether it stores code or data
• The processor disables code execution when malicious worms try to insert code into data buffers (with OS support)

Instruction Pointer Based Prefetcher
• L1 DCache: 2 IP prefetchers/core
• L1 ICache: 1 traditional prefetcher
• L2 cache: 2 IP prefetchers
• Predicts what memory address will be used and delivers it in time
• Records every load's history using the instruction pointer
  - IP history array
• Parameters for prefetch traffic control are fine-tuned for different platforms
  - prefetch monitor
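One common way to realize an IP history array like the one above is a stride prefetcher: for each load instruction, remember the last address and stride, and issue a prefetch when the same stride repeats. The sketch below assumes that scheme; the slides do not say which prediction rule the hardware uses, and the class name, the unbounded dictionary, and the single-confirmation policy are illustrative choices.

```python
class IPStridePrefetcher:
    """Per-load history keyed by instruction pointer; predicts the next address."""

    def __init__(self):
        self.history = {}   # load IP -> (last address, last stride)

    def access(self, load_ip: int, addr: int):
        """Record a load and return an address to prefetch, or None."""
        entry = self.history.get(load_ip)
        if entry is None:
            self.history[load_ip] = (addr, 0)
            return None
        last_addr, last_stride = entry
        stride = addr - last_addr
        self.history[load_ip] = (addr, stride)
        if stride != 0 and stride == last_stride:
            return addr + stride    # same stride seen twice in a row
        return None


prefetcher = IPStridePrefetcher()
for address in (1000, 1064, 1128, 1192):
    print(address, prefetcher.access(0x400, address))
```

For this access pattern with a constant 64-byte stride, the first two accesses only train the entry, and each later access returns the address one stride ahead.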


Questions?
