CS 6354 by WeiKeng Qin, Jian Xiang, & Ren Xu December 8, 2009.


Introduction

Motivation
• A multi-core on our desks
• A new microarchitecture to replace NetBurst

Intel Core 2 Duo
• A dual-core CPU
• ISA with SIMD extensions
• Intel Core microarchitecture
• Memory hierarchy system


Instruction Set Architecture
• Base: x86-64
• No VLIW (Itanium)
• SIMD extensions: MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1
  - MMX (Pentium MMX, 1996): 8 new registers, packed data type, integer operations
  - SSE (Pentium III, 1999): 8 new registers, floating-point operations
  - SSE2 (Pentium 4, 2001): double precision, 128-bit register support
  - SSE3 (Prescott, 2004): DSP-oriented math, process management
  - SSSE3 (Core 2, July 2006): e.g. permuting bytes in a word
  - SSE4.1 (Wolfdale, Sep 2006)


Streaming SIMD Extensions (SSE) 4.1
• Beginning with the 45 nm processors
• 47 instructions that improve performance of media data manipulation
• e.g. fast and efficient bit-width conversions
  - Convert single byte values to word (16-bit) values.



SSE2 Code
  MOVDQU XMM0, M64
  PXOR XMM1, XMM1
  PUNPCKLBW XMM0, XMM1


SSE4.1 Code
  PMOVZXBW XMM0, M64

  DEST[15:0] <-- ZeroExtend(SRC[7:0]);
  DEST[31:16] <-- ZeroExtend(SRC[15:8]);
  DEST[47:32] <-- ZeroExtend(SRC[23:16]);
  DEST[63:48] <-- ZeroExtend(SRC[31:24]);
  DEST[79:64] <-- ZeroExtend(SRC[39:32]);
  DEST[95:80] <-- ZeroExtend(SRC[47:40]);
  DEST[111:96] <-- ZeroExtend(SRC[55:48]);
  DEST[127:112] <-- ZeroExtend(SRC[63:56]);

Benefits
• Reduced instruction count (3 → 1)
• Better performance (~40% speedup each loop)
• Reduced register pressure (2 → 1)
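
The byte-to-word conversion in the SSE2 and SSE4.1 sequences can be checked with a small bit-level model. This is only a sketch in Python, not real SIMD code: registers are modeled as plain integers, and the function names (pmovzxbw, punpcklbw, sse2_zero_extend) are illustrative helpers named after the instructions they imitate.

```python
def pmovzxbw(src64: int) -> int:
    """Model PMOVZXBW: zero-extend 8 packed bytes to 8 packed 16-bit words."""
    dest = 0
    for i in range(8):
        byte = (src64 >> (8 * i)) & 0xFF   # SRC[8i+7 : 8i]
        dest |= byte << (16 * i)           # DEST[16i+15 : 16i]; upper byte stays 0
    return dest


def punpcklbw(dst128: int, src128: int) -> int:
    """Model PUNPCKLBW: interleave the low 8 bytes of dst and src."""
    out = 0
    for i in range(8):
        lo = (dst128 >> (8 * i)) & 0xFF    # byte i of the destination register
        hi = (src128 >> (8 * i)) & 0xFF    # byte i of the source register
        out |= lo << (16 * i)              # goes to the even byte position
        out |= hi << (16 * i + 8)          # goes to the odd byte position
    return out


def sse2_zero_extend(mem64: int) -> int:
    """Model the three-instruction SSE2 sequence using two registers."""
    xmm0 = mem64 & 0xFFFFFFFFFFFFFFFF      # load; only the low 8 bytes matter here
    xmm1 = 0                               # PXOR XMM1, XMM1
    return punpcklbw(xmm0, xmm1)           # PUNPCKLBW XMM0, XMM1


if __name__ == "__main__":
    src = 0x8877665544332211
    # Both paths should produce the same 128-bit value.
    print(hex(sse2_zero_extend(src)), hex(pmovzxbw(src)))
```

Under this model the single PMOVZXBW step and the three-step SSE2 sequence give identical results, which is the equivalence the instruction and register counts rely on.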


Microarchitecture

The Cores
• Single die (107 mm²)
• Two identical cores (L1 cache 64K x 2)
• Shared L2 cache, 6M
• No Hyper-Threading, no L3 cache
• Keep front-side bus
• Larger L2 cache


Microarchitecture
• 14-stage pipeline
• 4-wide decode
• 4-wide retire
• Macro-fusion
• Enhanced ALUs
• Deeper buffers


Another View


Decode Hardware
• 128-bit fetch bandwidth
• 18-entry IQ
• Complex decode
  - produces 1-4 micro-ops
• Microcode sequencer


Macro-fusion

New micro-op
• Represents an instruction pair as a single micro-op

Enhanced ALUs
• Execute the new compare-and-jump (CMPJCC) micro-op in one clock
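
The pairing step behind macro-fusion can be illustrated with a toy decoder pass. This is a simplified sketch, assuming instructions are given as text strings and that any CMP or TEST immediately followed by a conditional jump is fusible. Real hardware applies further restrictions that are not modeled here, and the function names are hypothetical.

```python
FUSIBLE_FIRST = {"cmp", "test"}  # assumed set of first-instruction candidates


def mnemonic(instr: str) -> str:
    """Return the lowercase mnemonic of an instruction string."""
    return instr.split()[0].lower()


def is_conditional_jump(instr: str) -> bool:
    """Treat every j* mnemonic except the unconditional jmp as a Jcc."""
    op = mnemonic(instr)
    return op.startswith("j") and op != "jmp"


def macro_fuse(instrs):
    """Walk the instruction stream and merge compare + Jcc pairs.

    Returns a list of micro-ops; a fused pair becomes one
    ("cmpjcc", compare, jump) entry, everything else ("uop", instr).
    """
    uops = []
    i = 0
    while i < len(instrs):
        cur = instrs[i]
        nxt = instrs[i + 1] if i + 1 < len(instrs) else None
        if mnemonic(cur) in FUSIBLE_FIRST and nxt and is_conditional_jump(nxt):
            uops.append(("cmpjcc", cur, nxt))  # one micro-op for the pair
            i += 2
        else:
            uops.append(("uop", cur))
            i += 1
    return uops


if __name__ == "__main__":
    program = ["mov eax, 1", "cmp eax, ebx", "jne loop", "add eax, 2"]
    for uop in macro_fuse(program):
        print(uop)
```

In this toy stream, four instructions become three micro-ops, which is the bandwidth saving the fused CMPJCC micro-op is meant to provide.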


Out-of-Order Execution
• 96-entry ROB
• 32-entry reservation station


Execution Units
• 6 dispatch ports (1 load, 2 store, 3 universal ports)
• 3 integer ALUs, 2 floating-point ALUs


Branch Predictor
• Loop detector
  - Tracks the number of loop iterations for future reference
• The branch prediction unit (BPU) selects among the following for every branch:
  - bimodal predictor
  - global predictor
  - loop detector
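
The idea of tracking loop iteration counts can be sketched as follows. This is an illustrative model only, assuming one entry per branch address and a loop branch that is taken on every iteration except the exit. The class name and table layout are hypothetical, not the actual hardware organization.

```python
class LoopDetector:
    """Toy loop predictor: learns how many times a loop branch is taken."""

    def __init__(self):
        self.trip_count = {}  # branch address -> learned number of taken outcomes
        self.current = {}     # branch address -> taken outcomes in the current run

    def predict(self, pc: int) -> bool:
        """Predict taken until the learned trip count is reached."""
        learned = self.trip_count.get(pc)
        if learned is None:
            return True       # no history yet: assume the loop continues
        return self.current.get(pc, 0) < learned

    def update(self, pc: int, taken: bool) -> None:
        """Record the real outcome; a not-taken outcome ends the loop run."""
        if taken:
            self.current[pc] = self.current.get(pc, 0) + 1
        else:
            self.trip_count[pc] = self.current.get(pc, 0)
            self.current[pc] = 0


def count_mispredictions(detector, pc, outcomes):
    """Run a sequence of outcomes through the detector and count misses."""
    misses = 0
    for taken in outcomes:
        if detector.predict(pc) != taken:
            misses += 1
        detector.update(pc, taken)
    return misses


if __name__ == "__main__":
    det = LoopDetector()
    one_pass = [True, True, True, False]  # loop body runs, then exits
    print(count_mispredictions(det, 0x400, one_pass))
    print(count_mispredictions(det, 0x400, one_pass))
```

After one pass the detector has learned the trip count, so the exit branch of a fixed-length loop is no longer mispredicted on later passes.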


Cache Organization
• Private L1 DCache and ICache: 32K/core, 8-way, 64B line size, write-back (directory-based coherence)
• Shared L2 cache: 8-way, 64B line size (E8xxx)
  - pros: could mean less bus traffic
  - cons: longer access latency than a private L2 cache; potential conflicts between threads
• FSB: 1333 MHz (E8xxx)
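
As a worked example, the L1 parameters listed here (32K, 8-way, 64B lines) determine the number of sets and how an address splits into offset and index bits. The helper below is a sketch that assumes power-of-two sizes; the function name is hypothetical.

```python
def cache_geometry(size_bytes: int, ways: int, line_bytes: int):
    """Return (number of sets, offset bits, index bits) for a set-associative cache.

    Assumes the line size and the resulting set count are powers of two.
    """
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1  # log2(line size)
    index_bits = sets.bit_length() - 1         # log2(number of sets)
    return sets, offset_bits, index_bits


if __name__ == "__main__":
    # L1 cache: 32 KB, 8-way, 64-byte lines
    print(cache_geometry(32 * 1024, 8, 64))
```

For the L1 caches this works out to 64 sets, with 6 offset bits and 6 index bits.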

Memory Disambiguation
• Aggressive memory dependence speculation based on a load's EIP
  - EIP-address-indexed hash table
• Watchdog mechanism


Prediction Implementation
• History table indexed by instruction pointer
• Each entry in the history array has a saturating counter
• Once the counter saturates: disambiguation is possible on this load (takes effect from the next iteration)
  - the load is allowed to go even if it meets unknown store addresses
• When a particular load fails disambiguation: reset its counter
• Each time a particular load is correctly disambiguated: increment its counter
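
The counter rules above can be sketched as a small table-based predictor. The table size (256 entries), the counter ceiling (15), and the modulo hash are assumptions chosen for illustration, since the slides do not give them. The class and method names are hypothetical.

```python
class DisambiguationPredictor:
    """Toy memory disambiguation predictor with saturating counters."""

    def __init__(self, entries: int = 256, max_count: int = 15):
        self.entries = entries           # assumed table size
        self.max_count = max_count       # assumed saturation value
        self.table = [0] * entries       # one counter per hashed instruction pointer

    def _index(self, ip: int) -> int:
        return ip % self.entries         # assumed hash: low bits of the IP

    def may_bypass(self, ip: int) -> bool:
        """A load may pass unknown store addresses only once its counter saturates."""
        return self.table[self._index(ip)] == self.max_count

    def update(self, ip: int, conflicted: bool) -> None:
        """Reset on a failed disambiguation, otherwise count up to the ceiling."""
        idx = self._index(ip)
        if conflicted:
            self.table[idx] = 0
        else:
            self.table[idx] = min(self.table[idx] + 1, self.max_count)


if __name__ == "__main__":
    pred = DisambiguationPredictor()
    load_ip = 0x401000
    for _ in range(15):
        pred.update(load_ip, conflicted=False)
    print(pred.may_bypass(load_ip))   # counter saturated
    pred.update(load_ip, conflicted=True)
    print(pred.may_bypass(load_ip))   # counter reset by a conflict
```

Resetting rather than decrementing on a conflict makes the predictor conservative: a load has to build up a full run of safe executions again before it is allowed to speculate.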


When a load is sent from the RS, set the disambiguation bit.

If it meets an older unknown store address, set "update".

If the prediction is "go", dispatch and set "done".

Else the load is blocked.

A store scans all previous loads in the Load Buffer; if a match is found, the "reset" bit is set.

When the load commits, update the history.

Predictor Lookup

Prediction Verification

Load Dispatch


Execute Disable Bit Support
• AMD Enhanced Virus Protection; ARM eXecute Never
• Helps prevent buffer overflow attacks
• No need for software patches for buffer overflow attacks
• Segregates memory by whether it stores code or data
• The processor disables code execution when malicious worms try to insert code into data buffers (with OS support)
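
The effect of the execute-disable bit can be illustrated with a toy page-permission check. This is a conceptual sketch only: the page table is a plain dictionary mapping page numbers to a no-execute flag, the 4 KB page size is an assumption, and the names are hypothetical rather than any real OS or hardware interface.

```python
PAGE_SIZE = 4096  # assumed page size


class ExecuteDisableFault(Exception):
    """Raised when an instruction fetch targets a no-execute page."""


def fetch_instruction(page_table: dict, addr: int) -> str:
    """Allow a fetch only if the page holding addr is not marked no-execute."""
    page_number = addr // PAGE_SIZE
    if page_table[page_number]:          # True means the NX flag is set
        raise ExecuteDisableFault(f"fetch from data page at {addr:#x}")
    return "fetched"


if __name__ == "__main__":
    # Page 0 holds code; page 1 is a data buffer marked no-execute by the OS.
    table = {0: False, 1: True}
    print(fetch_instruction(table, 0x0010))
    try:
        fetch_instruction(table, 0x1000)  # injected code in a data buffer
    except ExecuteDisableFault as fault:
        print("blocked:", fault)
```

The OS decides which pages are data; the processor only enforces the flag, which is why the feature needs OS support.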


Instruction Pointer Based Prefetcher
• L1 DCache: 2 IP prefetchers/core
• L1 ICache: 1 traditional prefetcher
• L2 Cache: 2 IP prefetchers
• Predict what memory address will be used and deliver it in time
• Record every load's history using the instruction pointer
  - IP history array
• Parameters for prefetch traffic control fine-tuned for different platforms
  - prefetch monitor
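
One common way to use a per-load history array is stride detection. The sketch below assumes that policy: it issues a prefetch once the same nonzero stride has been seen twice in a row for a given instruction pointer. The slides do not state the actual prediction rule, so the policy, the table size, and the names are illustrative assumptions.

```python
class IPStridePrefetcher:
    """Toy instruction-pointer-indexed stride prefetcher."""

    def __init__(self, entries: int = 256):
        self.entries = entries   # assumed history array size
        self.history = {}        # index -> (last address, last stride)

    def access(self, ip: int, addr: int):
        """Record a load and return an address to prefetch, or None."""
        idx = ip % self.entries  # assumed hash: low bits of the IP
        prefetch = None
        stride = 0
        previous = self.history.get(idx)
        if previous is not None:
            last_addr, last_stride = previous
            stride = addr - last_addr
            # Prefetch only when the stride repeats and is not zero.
            if stride != 0 and stride == last_stride:
                prefetch = addr + stride
        self.history[idx] = (addr, stride)
        return prefetch


if __name__ == "__main__":
    pf = IPStridePrefetcher()
    load_ip = 0x400123
    for address in (0x1000, 0x1040, 0x1080):
        print(pf.access(load_ip, address))
```

Indexing by instruction pointer keeps the histories of different loads apart, so one strided loop does not disturb the pattern learned for another.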




Questions?