1. 2 Components of an IA processor Upon completion of this module, you will be able to describe:...
-
Upload
jennifer-cobb -
Category
Documents
-
view
213 -
download
0
Transcript of 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe:...
![Page 1: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/1.jpg)
1
![Page 2: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/2.jpg)
2
Upon completion of this module, you will be able to describe:
![Page 3: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/3.jpg)
![Page 4: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/4.jpg)
Advanced Digital Media Boost Single Cycle SIMD Operation
8 Single Precision Flops/cycle4 Double Precision Flops/cycle
Wide Operations128 it packed Add128 bit packed Multiply128 bit packed Load128 bit packed Store
Core™ Core™ arch arch
PreviousPrevious
X4X4
Y4Y4
X4opY4X4opY4
SOURCESOURCE
X1opY1X1opY1
X3X3
Y3Y3
X3opY3X3opY3
X2X2
Y2Y2
X2opY2X2opY2
X1X1
Y1Y1
X1opY1X1opY1
SSE/2/3 OPSSE/2/3 OP
X2opY2X2opY2
X3opY3X3opY3X4opY4X4opY4
CLOCKCLOCKCYCLE 1CYCLE 1
CLOCKCLOCKCYCLE 2CYCLE 2
127127
CLOCKCLOCKCYCLE 1CYCLE 1
SIMD OperationSIMD Operation(SSE/SSE2/SSE3/SSSE)(SSE/SSE2/SSE3/SSSE)
![Page 5: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/5.jpg)
CISC vs. RISC Super-scalarOut of Order vs. In OrderArchitecture vs. Micro-Architecture
Intel ArchitecturesIA32/X86Intel64
Historical Micro-ArchitecturesP6 (Pentium Pro, Pentium II, Pentium IIINetBurst (Pentium 4Mobile (Centrino platforms
![Page 6: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/6.jpg)
6
System Bus
2nd Level Cache 1st Level Cache (Data)
Bus Unit
Decode/IQ
Instruction Fetch Unit
Execution Unit
Renamer/AllocatorBuffers(Retirement)Scheduler
Branch Prediction Unit
Front EndFront EndExecution CoreExecution Core
![Page 7: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/7.jpg)
7
![Page 8: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/8.jpg)
8
Instruction Preparation before executed Instruction Fetch Unit
Instruction Queue
Instruction Decode Unit
Branch Prediction Unit
![Page 9: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/9.jpg)
9
Prefetches instructions that are likely to be executedCaches frequently-used instructionsPrecodes and buffers instructions
![Page 10: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/10.jpg)
04/21/23 10
I-Cache (Instruction Cache)32Kbytes/8-way/64-byte line16 aligned bytes fetched per cycle
ITLB (Instruction Translation Lookaside Buffer128 4K pages, 8 2M pages
Instruction Prefetcher16-byte aligned lookup through the ITLB into the instruction cache and instruction prefetch buffers
Instruction Pre-decoderInstruction length Decode (precode)Avoid Length Changing Prefix, for example
The REX (EM64T) prefix (4xH) is not an LCP
Avoid in loop: MOV dx, 1234h
![Page 11: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/11.jpg)
11
Buffer between instruction pre-decode unit and decoder
Up to 6 predecoded instructions written per cycle18 instructions contained in IQUp to 5 instructions read from IQ
Potential Loop cacheLoop Stream Detector (LSD) support
Re-use of decoded instructionPotential power saving
![Page 12: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/12.jpg)
12
Decode the instructions into micro-opsReady for the execution in OOO core
![Page 13: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/13.jpg)
13
Instructions converted to micro-ops (uops)1-uop includes load+op, stores, indirect jump, RET…
4 Decoders: 1 “large” and 3 “small”All decoders handle “simple” 1-uop instructionsOne large decoder handles instructions up to 4 uops
All decoders working in parallelFour (+) instructions / cycle
Micro sequencer takes over for long flows (handling instruction contains 2~ 4 uops, uCodeRom handles more complex)
![Page 14: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/14.jpg)
14
These instructions took more than one fetch since they are 22 bytesIQ buffers them togetherAll instructions are decodable by all decodersCMP and adjacent JCC are “fused” into a single uop up to 5 instructionsDecoded per cycle cmpjne EAX, [mem], label
sta_std [EAX+240], xmm0mulps xmm0, xmm0, xmm0load_add xmm0, xmm0, [EAX+16]
![Page 15: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/15.jpg)
15
Roughly ~15% of all instructions are conditional branches.Macro-fusion merges two instructions into a single micro-op, as if the two instructions were a single long instruction.Enhanced arithmetic Logic Unit (ALU) for macro-fusin. Each marco-fused instruction executes with a single dispatch.Not supported in EM64T long mode
cmpjae eax, [mem], labelScheduler
Execution
flags and target to Write back
BranchEval
![Page 16: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/16.jpg)
16
Read four instructions from Instruction Queue.Each instruction gets decoded into separate uops.Enabling example for (i=0; i<100000; i++ ) {}
![Page 17: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/17.jpg)
17
Read five instructions from Instruction Queue.Send fusable pair to single decoderSingle uop represents two instructionsEnabling example for (unsigned int i=0; i<100000; i++) {}
![Page 18: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/18.jpg)
18
Frequent pairs of micro-operations derived from the same Macro Instruction can be fused into a single micro-operation.
Micro-op fusion effectively widens the pipeline
![Page 19: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/19.jpg)
19
U-ops of a Store “mov edx, [mem1]
Sta mem1
Sta edx, [mem1]St edx, [mem1]
![Page 20: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/20.jpg)
20
ESP is calculated by dedicate logic
No explicit Micro-Ops updating ESP
Micro-Ops savingPower saving
![Page 21: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/21.jpg)
21
Allow executing instructions long before the branch outcome is decided.
Superset of Prescott/Pentium-M features
One taken branch every other clockBranch predictions for 32 bytes at a
time, twice the width of the fetch engine
![Page 22: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/22.jpg)
22
16-entry Return Stack Buffer (RSB).Front-end queuing of BPU lookupsType of predictions
Direct Calls and JumpsIndirect Calls and JumpsConditional Calls and Jumps
![Page 23: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/23.jpg)
23
Intel® Pentium® 4 Processor branch prediction PLUS the following two improvements
Indirect Branch Predictor Loop Detector
Branch miss-predictions reduced by >20%
![Page 24: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/24.jpg)
24
Accepted decoder u-ops, assign resources, execute and retire u-ops
RenamerReservation station (RS)Issue portsExecution Unit
![Page 25: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/25.jpg)
25
![Page 26: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/26.jpg)
26
4 uops renamed / retired per clockOne taken branch, any # of untakenOne fxchq per cycle
Uops written to RS and ROBDecoded uops were renamed and allocated with resource by RAT and sent to ROB read and RSRegisters not “in flight” read from ROB during RS write
![Page 27: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/27.jpg)
27
6 dispatch ports from RS3 Execution ports
(shared for integer / fp / simd)LoadStore (address)
128-bit implementationPort 0 has packed multiply (4 cycles SP 5
DP pipelined)Port 1 had packed add (3 cycles all
precisions)FP data had one additional cycle bypass latency
Do not mix SSE FP and SSE integer ops on same register
![Page 28: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/28.jpg)
28
Each uop only takes a single RS entryLoad + add dispatches twice (load, then add)Sta + std dispatches twice
Sta (address) can fire as early as possibleStd must wait for mulps to write back
Cmpjne dispatches only once (functionality is truly fused)No dependency, can fire as early as it wants
![Page 29: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/29.jpg)
29
![Page 30: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/30.jpg)
30
ReOrder Buffer (ROB)Holds micro-ops in various stages of
completionBuffers completed micro-opsUpdates the architectural state in orderManages ordering of exceptions
![Page 31: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/31.jpg)
31
32K D-cache (8-way, 64 byte line size)Loads and services
One 128-bit load and one 128-bit store per cycle to different memory locations
Out-of-order memory operationsData PrefetchingMemory DisambiguationStore forwardingShared cache
![Page 32: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/32.jpg)
32
3 clk latency and 1 clock throughput of L1D; 14 and 2 for L2
Miss LatenciesL1 miss hits L2 ~ 10 cyclesL2 miss, access to memory ~300 cycles (server/FBD)L2 miss, access to memory ~165 cycles (Desk/DDR2)C step broadwater is reported to have ~50ns
latency
Cache BandwidthBandwidth to cache ~ 8.5 bytes/cycle
Memory BandwidthDesktop ~ 6 GB/sec/socket (linux)Server ~3.5 GB/sec/socket
![Page 33: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/33.jpg)
33
Speculates the next needed data and loads it into cache by hardware and/or software.
Door(L1 Cache
Valet Parking Area
(L2 Cache)
Main Parking Lot
(External Memory
![Page 34: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/34.jpg)
34
L1D cache prefetchingData Cache Unit Prefetcher
Known as the streaming prefetcher Recognizes ascending access patterns in recently loaded data Prefetches the next line into the processors cache
Instruction Based Stride PrefetcherPrefetches based upon a load having a regular strideCan prefetch forward or backward 2 Kbytes
1/2 default page size
L2 cache prefetching: Data Prefetch Logic (DPL)Prefetches data to the 2nd level cache before the DCU requests the
dataMaintains 2 tables for tracking loads
Upstream – 16 entriesDownstream – 4 entries
Every load is either found in the DPL or generates a new entryUpon recognition of the 2nd load of a “stream” the DPL will prefetch
the next load
![Page 35: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/35.jpg)
35
Memory Disambiguation predictor:Loads that are predicted NOT to
forward from preceding store are allowed to schedule as early as possible
Increasing the performance of OOO memory pipelines
Disambiguated loads checked at retirement
Extension to existing coherency mechanism
Invisible to software and system
![Page 36: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/36.jpg)
36
Load4 must WAIT until previous stores complete
![Page 37: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/37.jpg)
37
Loads can decouple from stores
Load4 can get its data WITHOUT waiting for stores
![Page 38: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/38.jpg)
38
If a load follows a store and reloads the data that the store writes to memory, the micro-architecture can forward the data directly from the store to the load
![Page 39: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/39.jpg)
39
![Page 40: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/40.jpg)
40
Note that unaligned store forward does not occur when the load crosses a cache line boundary
Note: Unaligned 128-bit stores are issued as two 64-bit stores. This provides two alignments for store forwarding
ld 8 Store forwarded to load
ld 8 No forwarding
‡: No forwarding if the loadcrosses a cache line boundary
![Page 41: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/41.jpg)
41
CPU1 CPU2
Memory
Cache Line
Shipping L2 Cache Line~Half access to memory
Front Side Bus (FSB)
![Page 42: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/42.jpg)
42
L2 is share: No need to ship cache line
CPU2 CPU1
Memory
Cache Line
Front Side Bus (FSB)
![Page 43: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/43.jpg)
43
Load and store access order1.L1 cache of immediate core2.L1 cache of the other core3.L2 cache4.Memory
BusBus
2 MB L2 Cache2 MB L2 Cache
Core1Core1 Core2Core2
![Page 44: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/44.jpg)
44
Shared second level (L2) 2MB 8-way or 4MB 16-way instruction and data cache
Cache 2 cache transferImproves producer / consumer style MPWider interface to L2
reduced interferenceprocessor line fill is 2 cycles
Higher bandwidth from the L2 cache to the core
~14 clock latency and 2 clock throughput
![Page 45: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/45.jpg)
45
Avoid “Length Changing Prefixes” (LCPs) Affects instructions with immediate data or offset Operand Size Override (66H) Address Size Override (67H) [obsolete] LCPs change the length decoding algorithm –
increasing the processing time from one cycle to six cycles (or eleven cycles when the instruction spans a 16-byte boundary)
The REX (EM64T) prefix (4xH) is not an LCPThe REX prefix does lengthen the instruction by one byte, so use of the first eight general registers in EM64T is preferred
![Page 46: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/46.jpg)
46
Includes a “Loop Stream Detector” (LSD)Potentially very high bandwidth instruction
streamingA number of requirements to make use of the
LSDMaximum of 18 instructions in up to four 16-
byte packetsNo RET instructions (hence, little practical
use for CALLs)Up to four taken branches allowedMost effective at 70+ iterations
LSD is after PreDecode so there is no added cost for LCPs
Trade-off LSD with conventional loop unrolling
![Page 47: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/47.jpg)
47
Decoder issues up to 4 uOps for renaming/ allocation per clock
This creates a trade off between more complex instruction uOps versus multiple simple instruction uOps
For example, a single four uOp instruction is all that can be renamed/allocated in a single clock
In some cases, multiple simple instructions may be a better choice than a single complex instruction
Single uOp instructions allow more decoder flexibility
For example, 4-1-1-1 can be decoded in one clock
However, 2-2-2-1 takes three clocks to decode
![Page 48: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/48.jpg)
48
Up to six uOps can be dispatched per clock“Store Data” and “Store Address” dispatch ports
are combined on the block diagramUp to four results can be written back per clockSingle clock latency operations are best
Differing latency operations can create writeback conflicts
Separate multiple-clock uOps with several single uOp instructions
Typical instructions here: ADC/SBB, RWM, CMOVcc
In some cases, separating a RMW instruction into its piece might be faster (decode and scheduling flexibility)
When equivalent, PS preferred to PD (LCP)For example, MOVAPS over MOVAPD, XORPS
over XORPD
![Page 49: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/49.jpg)
49
Bypass register “access” preferred to register readsPartial register accesses often lead to stalls
Register size access that ‘conflicts’ with recent previous register write
Partial XMM updates subject to dependency delaysPartial flag stall can occur, too much higher cost
Use TEST instruction between shift and conditional to prevent
Common zeroing instructions (e.g., XOR reg,reg) don’t stall
Avoid bypass between execution domainsFor example: FP (ADDPS) and logical ops (PAND) on
XMMnVectorization: careful packing/unpacking sequence
Use MXCSR’s FZ and DAZ controls as appropriate
![Page 50: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/50.jpg)
50
Software prefetch instructionsCan reach beyond a page boundary (including page walk)Prefetches only when it completes without an exception
General techniques to help these prefetchersOrganize data in consecutive linesIn general, increasing addresses are more easily prefetched
![Page 51: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/51.jpg)
51
What has been coveredNotable features of Core® Micro-architecture
Wide Dynamic ExecutionAdvanced Memory AccessAdvanced Smart CacheAdvanced Digital Media BoostPower Efficient Support
Core® Micro-architecture componentsFront EndOOO execution coreMemory sub-system
Advanced cache technology
![Page 52: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/52.jpg)
52
What has been coveredNotable features of Core® Micro-architecture
Wide Dynamic ExecutionAdvanced Memory AccessAdvanced Smart CacheAdvanced Digital Media BoostPower Efficient Support
Core® Micro-architecture componentsFront EndOOO execution coreMemory sub-system
Advanced cache technology
![Page 53: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/53.jpg)
53
Software prefetches are rarely ignored on Merom Architecture
On P4 if you had a DTLB miss the prefetch could be ignored
On Merom architecture they are not ignored and the prefetch can hurt performance since it cannot retire until after the page walk
Critical chunk is not utilized on a software prefetch
A prefetch can hurt performance if it is too close to the load
When the data comes in due to a prefetch you need the entire cache line instead of just the critical chunk before the data can be used by the actual load
![Page 54: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/54.jpg)
54
2x Compute Throughput / Clock
A
B
Lets scale a vector: B[i] := A[i] * C
![Page 55: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/55.jpg)
55
Assume both microarchs have 128-bit path from L1 to Processor
A
B
2x Compute Throughput / Clock
![Page 56: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/56.jpg)
56
Handles all the memory data
2x Compute Throughput / Clock
A
B
Multiply can’tkeep up with
load bandwidth
Multiplier operates on all
data
![Page 57: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/57.jpg)
04/21/23 57
![Page 58: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/58.jpg)
58
Handles all the memory data
2x Compute Throughput / Clock
A
B
Load eventually stalls waiting for multiplier
Load pipe is free to advance
![Page 59: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/59.jpg)
59
Keeps pipeline free for computations
2x Compute Throughput / Clock
A
B
Load eventually stalls waiting for multiplier
Load pipe is free to advance
![Page 60: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/60.jpg)
60
Maintains 2X throughput compared to prior implementations
2x Compute Throughput / Clock
Load eventually stalls waiting for multiplier
Load pipe is free to advance
A
B
![Page 61: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/61.jpg)
61
8 Single Precision flops per cycle
2x Compute Throughput / Clock
Load eventually stalls waiting for multiplier
Load pipe is free to advance
A
B
![Page 62: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/62.jpg)
62
4 Double Precision flops per cycle
2x Compute Throughput / Clock
Load eventually stalls waiting for multiplier
Load pipe is free to advance
A
B
![Page 63: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/63.jpg)
63
2x Compute Throughput / Clock
Load eventually stalls waiting for multiplier
Load pipe is free to advance
A
B
![Page 64: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/64.jpg)
64
2x Compute Throughput / Clock
Load eventually stalls waiting for multiplier
Load pipe is free to advance
A
B
![Page 65: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/65.jpg)
65
2x Compute Throughput / Clock
Load eventually stalls waiting for multiplier
Load pipe is free to advance
A
B
![Page 66: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/66.jpg)
66
2x Compute Throughput / Clock
Load eventually stalls waiting for multiplier
Load pipe is free to advance
A
B
![Page 67: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/67.jpg)
67
2x Compute Throughput / Clock
Load eventually stalls waiting for multiplier
Load pipe is free to advance
A
B
![Page 68: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/68.jpg)
68
2x Compute Throughput / Clock
Load eventually stalls waiting for multiplier
Load pipe is free to advance
A
B
![Page 69: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/69.jpg)
69
![Page 70: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/70.jpg)
70
![Page 71: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/71.jpg)
71
Processor communicates power consumption to external platform components
Optimization of voltage regulator efficiency Load line and power delivery efficiency
PSI-2 / VID
![Page 72: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/72.jpg)
72
DTS – Digital Thermal Sensor
Several thermal sensors are located within the Processor to cover all possible hot spots
Dedicated logic scans the thermal sensors and measures the maximum temperature on the die at any given time.
Accurately reporting Processor temperature enables advanced thermal control schemes
LPF
LPFCore 1 DTS Logic
Core 2DTS Logic
DTS control and status
![Page 73: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/73.jpg)
73
Processor provides its temperature reading over a multi drop single wire bus allowing efficient platform thermal control.
ProcessorFan
AuxiliaryFan
Manager
PECI
ChassisFan 1
ChassisFan 2
PROC #2
PROC #3
PROC #1
![Page 74: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/74.jpg)
74
Front side bus with the following low power improvements Lower voltage DPWR# and BPRI# signals
Must have FSB traffic to enable data and address bus input sense amplifiers and control signals (~ 120 pins)
Eliminated higher address and dual processor capable pins
![Page 75: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/75.jpg)
75
Voltage-Frequency switching separationClock partitioning and recoveryEvent blockingEven during periods of high performance execution, many
parts of the chip core can be shut off.
![Page 76: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/76.jpg)
76
![Page 77: 1. 2 Components of an IA processor Upon completion of this module, you will be able to describe: Working flow of the instruction pipeline Notable features.](https://reader035.fdocuments.in/reader035/viewer/2022070412/5697bf891a28abf838c89c6b/html5/thumbnails/77.jpg)
77