CS 6354 by WeiKeng Qin, Jian Xiang, & Ren Xu December 8, 2009.
CS 6354
by WeiKeng Qin, Jian Xiang, & Ren Xu
December 8, 2009
Introduction
Motivation
• A multi-core on our desks
• A new microarchitecture to replace NetBurst
Intel Core 2 Duo
• A dual-core CPU
• ISA with SIMD extensions
• Intel Core microarchitecture
• Memory hierarchy system
Instruction Set Architecture
• Base: x86-64
• No VLIW (Itanium)
• SIMD extensions: MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1
• Pentium MMX, MMX, 1996: 8 new registers, packed data types, integer operations
• Pentium III, SSE, 1999: 8 new registers, floating-point operations
• Pentium 4, SSE2, 2001: double precision, 128-bit register support
• Prescott, SSE3, 2004: DSP-oriented math, process management
• Core 2, SSSE3, July 2006: e.g. permuting bytes in a word
• Wolfdale, SSE4.1, Sep 2006
Streaming SIMD Extensions (SSE) 4.1
• Beginning with the 45 nm processors
• 47 instructions that improve performance of media data manipulation
• e.g. fast and efficient bit-width conversions
Example: convert single-byte values to word (16-bit) values.
SSE2 Code
MOVDQU XMM0, M64
PXOR XMM1, XMM1
PUNPCKLBW XMM0, XMM1
SSE4.1 Code
PMOVZXBW XMM0, M64
DEST[15:0] <-- ZeroExtend(SRC[7:0]);
DEST[31:16] <-- ZeroExtend(SRC[15:8]);
DEST[47:32] <-- ZeroExtend(SRC[23:16]);
DEST[63:48] <-- ZeroExtend(SRC[31:24]);
DEST[79:64] <-- ZeroExtend(SRC[39:32]);
DEST[95:80] <-- ZeroExtend(SRC[47:40]);
DEST[111:96] <-- ZeroExtend(SRC[55:48]);
DEST[127:112] <-- ZeroExtend(SRC[63:56]);
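The zero-extension semantics above can be checked with a small software model. This is only a sketch: registers are modeled as Python integers, and the function names `pmovzxbw` and `sse2_sequence` are illustrative, not part of any real library. It shows that the three-instruction SSE2 sequence and the single SSE4.1 instruction produce the same 128-bit result.

```python
MASK64 = (1 << 64) - 1

def pmovzxbw(src64: int) -> int:
    """Model of PMOVZXBW: zero-extend 8 source bytes into 8 16-bit words."""
    dest = 0
    for i in range(8):
        byte = (src64 >> (8 * i)) & 0xFF
        dest |= byte << (16 * i)      # upper 8 bits of each word stay zero
    return dest

def sse2_sequence(src64: int) -> int:
    """Model of the SSE2 path: load, PXOR XMM1,XMM1, PUNPCKLBW XMM0,XMM1."""
    xmm0 = src64 & MASK64             # low 64 bits hold the source bytes
    xmm1 = 0                          # PXOR XMM1, XMM1 clears the register
    result = 0
    for i in range(8):                # PUNPCKLBW interleaves the low bytes
        lo = (xmm0 >> (8 * i)) & 0xFF
        hi = (xmm1 >> (8 * i)) & 0xFF
        result |= lo << (16 * i)
        result |= hi << (16 * i + 8)
    return result

src = 0x0807060504030201
print(hex(pmovzxbw(src)))
print(pmovzxbw(src) == sse2_sequence(src))
```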
Benefits
• Fewer instructions (3 → 1)
• Better performance (~40% speedup in each loop)
• Reduced register pressure (2 → 1)
Microarchitecture
The Cores
• Single die (107 mm²)
• Two identical cores (L1 cache 64K x 2)
• Shared L2 cache, 6M
• No Hyper-Threading, no L3 cache
• Keeps the front-side bus
• Larger L2 cache
Microarchitecture
• 14-stage pipeline
• 4-wide decode
• 4-wide retire
• Macro-fusion
• Enhanced ALUs
• Deeper buffers
Another View
Decode Hardware
• 128-bit fetch bandwidth
• 18-entry IQ
• Complex decode
- produces 1-4 micro-ops
• Micro-code sequencer
Macro-fusion
New Micro-op
• Represents an instruction pair as a single micro-op
Enhanced ALUs
• Execute the new compare and jump (CMPJCC) micro-op in one clock
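A toy decoder pass illustrates the idea of macro-fusion. This is a sketch under stated assumptions: instructions are `(mnemonic, operands)` tuples, and any `cmp` or `test` immediately followed by any conditional jump is fused. The real hardware has additional restrictions that the slides do not list, and the helper names `is_cond_jump` and `fuse` are hypothetical.

```python
def is_cond_jump(mnemonic: str) -> bool:
    """Treat every j* mnemonic except the unconditional jmp as conditional."""
    return mnemonic.startswith("j") and mnemonic != "jmp"

def fuse(instrs):
    """Return a micro-op list in which compare+jump pairs become one CMPJCC."""
    uops = []
    i = 0
    while i < len(instrs):
        op = instrs[i]
        nxt = instrs[i + 1] if i + 1 < len(instrs) else None
        if op[0] in ("cmp", "test") and nxt is not None and is_cond_jump(nxt[0]):
            uops.append(("cmpjcc", op, nxt))   # one fused micro-op for the pair
            i += 2
        else:
            uops.append(("uop", op))           # assumed: one micro-op each
            i += 1
    return uops

program = [
    ("mov", "eax, [mem]"),
    ("cmp", "eax, ecx"),
    ("jne", "target"),
    ("add", "eax, 1"),
]
for uop in fuse(program):
    print(uop)
```

With fusion, the four instructions occupy three micro-op slots, which is where the decode and retire bandwidth saving comes from.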
Out of Order Execution
• 96-entry ROB
• 32-entry reservation station
Execution Units
• 6 dispatch ports (1 load, 2 store, 3 universal ports)
• 3 integer ALUs, 2 floating-point ALUs
Branch Predictor
• Loop detector
- tracks the number of loop iterations for future reference
• The branch prediction unit (BPU) selects among the following for every branch:
- bimodal predictor
- global predictor
- loop detector
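The benefit of a loop detector over a bimodal predictor can be seen in a small simulation. This is a simplified sketch, not Intel's design: the global predictor is omitted, the bimodal predictor is a single 2-bit counter, and the loop detector is trusted once it has seen the same trip count twice. All class and function names are assumptions made for illustration.

```python
class Bimodal:
    """A single 2-bit saturating counter; values 2 and 3 predict taken."""
    def __init__(self):
        self.counter = 2
    def predict(self):
        return self.counter >= 2
    def update(self, taken):
        if taken:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)

class LoopDetector:
    """Learns how many times a loop branch is taken before it exits."""
    def __init__(self):
        self.trip = None        # trip count seen in the last loop execution
        self.count = 0          # taken branches in the current execution
        self.confident = False  # True once the same trip count repeats
    def predict(self):
        if not self.confident:
            return None         # no opinion yet; fall back to another predictor
        return self.count < self.trip
    def update(self, taken):
        if taken:
            self.count += 1
        else:                   # loop exit: compare with the previous trip count
            self.confident = (self.trip == self.count)
            self.trip = self.count
            self.count = 0

def count_mispredictions(outcomes, use_loop_detector):
    bimodal, loop = Bimodal(), LoopDetector()
    misses = 0
    for taken in outcomes:
        guess = loop.predict() if use_loop_detector else None
        if guess is None:
            guess = bimodal.predict()
        misses += guess != taken
        bimodal.update(taken)
        loop.update(taken)
    return misses

# A loop branch taken 5 times and then not taken, executed 4 times over.
OUTCOMES = ([True] * 5 + [False]) * 4
print(count_mispredictions(OUTCOMES, use_loop_detector=False))
print(count_mispredictions(OUTCOMES, use_loop_detector=True))
```

The bimodal counter mispredicts every loop exit, while the loop detector stops mispredicting once the trip count has repeated.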
Cache Organization
• Private L1 DCache and ICache: 32K/core, 8-way, 64B line size, write-back (directory-based coherence)
• Shared L2 cache: 8-way, 64B line size (E8xxx)
- pros: could mean less bus traffic
- cons: longer access latency than a private L2 cache; potential conflicts between threads
• FSB: 1333 MHz (E8xxx)
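The L1 parameters above determine how an address is split into tag, set index, and line offset. The sketch below works through that arithmetic for a 32K, 8-way cache with 64B lines. The function names are illustrative, and it assumes power-of-two sizes and simple bit-slice indexing.

```python
def cache_geometry(size_bytes: int, ways: int, line_bytes: int):
    """Return (number of sets, offset bits, index bits) for a set-associative cache."""
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1   # log2 for powers of two
    index_bits = sets.bit_length() - 1
    return sets, offset_bits, index_bits

def split_address(addr: int, offset_bits: int, index_bits: int):
    """Split an address into (tag, set index, byte offset within the line)."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

sets, offset_bits, index_bits = cache_geometry(32 * 1024, 8, 64)
print(sets, offset_bits, index_bits)
print(split_address(0x12345, offset_bits, index_bits))
```

For the L1 configuration this gives 64 sets, so 6 offset bits and 6 index bits, with the remaining upper bits forming the tag.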
Memory Disambiguation
• Aggressive memory dependence speculation based on a hash table indexed by the load's EIP address
• Watchdog mechanism
Prediction Implementation
• History table indexed by instruction pointer
• Each entry in the history array has a saturating counter
• Once the counter saturates, disambiguation is possible on this load (taking effect from the next iteration)
- the load is allowed to go even if it meets unknown store addresses
• When a particular load fails disambiguation: reset its counter
• Each time a particular load is correctly disambiguated: increment its counter
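The counter rules above can be sketched as a small table-based predictor. The table size, the counter width, and the modulo hash are assumptions chosen for illustration; the slides do not give the real values, and the class name `DisambiguationPredictor` is hypothetical.

```python
class DisambiguationPredictor:
    """History table of saturating counters indexed by a hash of the load's IP."""

    def __init__(self, entries: int = 64, max_count: int = 15):
        self.entries = entries          # assumed table size
        self.max_count = max_count      # assumed saturation value
        self.table = [0] * entries

    def _index(self, ip: int) -> int:
        return ip % self.entries        # assumed hash: low bits of the IP

    def may_bypass(self, ip: int) -> bool:
        """True if the load may go ahead of stores with unknown addresses."""
        return self.table[self._index(ip)] == self.max_count

    def update(self, ip: int, conflicted: bool) -> None:
        """Called when the load commits: reset on failure, else count up."""
        i = self._index(ip)
        if conflicted:
            self.table[i] = 0
        else:
            self.table[i] = min(self.max_count, self.table[i] + 1)

predictor = DisambiguationPredictor(entries=64, max_count=3)
load_ip = 0x401000
for _ in range(3):
    predictor.update(load_ip, conflicted=False)
print(predictor.may_bypass(load_ip))
predictor.update(load_ip, conflicted=True)
print(predictor.may_bypass(load_ip))
```

Resetting to zero on a single failure makes the predictor conservative: a load has to prove itself again over several executions before it is allowed to speculate.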
When a load is sent from the RS, set its disambiguation bit.
If it meets an older unknown store address, set "update".
If the prediction is "go", dispatch and set "done".
Else the load is blocked.
A store scans all previous loads in the Load Buffer; if a match is found, the "reset" bit is set.
When the load commits, update the history.
Predictor Lookup
Prediction Verification
Load Dispatch
Execute Disable Bit Support
• Equivalents: AMD Enhanced Virus Protection; ARM eXecute Never
• Helps prevent buffer overflow attacks
• No need for software patches against buffer overflow attacks
• Segregates memory by whether it stores code or data
• The processor disables code execution when malicious worms try to insert code into data buffers (with OS support)
Instruction Pointer Based Prefetcher
• L1 DCache: 2 IP prefetchers/core
• L1 ICache: 1 traditional prefetcher
• L2 Cache: 2 IP prefetchers
• Predicts what memory address will be used and delivers it in time
• Records every load's history using the instruction pointer
• IP history array
• Parameters for prefetch traffic control are fine-tuned for different platforms
• Prefetch monitor
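One way to picture an IP history array is a per-load stride detector. The sketch below is an assumption about how such a prefetcher could work, not a description of Intel's implementation: the slides say only that each load's history is recorded by instruction pointer. The table size, the "same stride twice" rule, and the class name `IPPrefetcher` are all hypothetical.

```python
class IPPrefetcher:
    """Per-IP history of the last address and stride; prefetches on a repeated stride."""

    def __init__(self, entries: int = 256):
        self.entries = entries      # assumed history array size
        self.history = {}           # index -> (last address, last stride)

    def access(self, ip: int, addr: int):
        """Record a load and return an address to prefetch, or None."""
        idx = ip % self.entries     # assumed indexing: low bits of the IP
        prev = self.history.get(idx)
        prefetch = None
        stride = 0
        if prev is not None:
            last_addr, last_stride = prev
            stride = addr - last_addr
            if stride != 0 and stride == last_stride:
                prefetch = addr + stride   # stride confirmed: fetch the next one
        self.history[idx] = (addr, stride)
        return prefetch

prefetcher = IPPrefetcher()
for address in (1000, 1064, 1128, 1192):
    print(address, prefetcher.access(0x400, address))
```

The prefetch monitor and traffic control parameters mentioned on the slide are left out; in a real design they would throttle how many of these requests are actually issued.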
Questions?