Intel Nehalem Processor
8/11/2019 Intel Nehalem Processor
1/65
Ronak Singhal
Nehalem Architect
NGMS001
Inside Intel Next Generation Nehalem Microarchitecture
Agenda
Nehalem Design Philosophy
Enhanced Processor Core
- Performance Features
- Simultaneous Multi-Threading
Feeding the Engine
- New Cache Hierarchy
- New Platform Architecture
Performance Acceleration
- Virtualization
- New Instructions
Nehalem: Next Generation Intel Microarchitecture
Tick-Tock Development Model
- 65nm (2005-06): Tick: Intel Pentium D, Xeon, Core processors. Tock: Intel Core 2, Xeon processors.
- 45nm (2007-08): Tick: Penryn. Tock: Nehalem.
- 32nm (2009-10): Tick: Westmere. Tock: Sandy Bridge.
Nehalem -- The Intel 45 nm Tock Processor
Nehalem Design Goals
World class performance combined with superior energy efficiency
Optimized for:
- Existing Apps and Emerging Apps (All Usages)
- Single Thread and Multi-threads
- Workstation / Server and Desktop / Mobile
A single, scalable foundation optimized across each segment and power envelope
Dynamically scaled performance when needed to maximize energy efficiency
A Dynamic and Design Scalable Microarchitecture
Scalable Nehalem Core (45nm)
- Same core for all segments
- Common feature set
- Common software optimization
Segment priorities:
- Servers/Workstations: Energy Efficiency, Performance, Virtualization, Reliability, Capacity, Scalability
- Desktop: Performance, Graphics, Energy Efficiency, Idle Power, Security
- Mobile: Battery Life, Performance, Energy Efficiency, Graphics, Security
Optimized Nehalem core to meet all market segments
Intel Core Microarchitecture Recap
Wide Dynamic Execution
- 4-wide decode/rename/retire
Advanced Digital Media Boost
- 128-bit wide SSE execution units
Intel HD Boost
- New SSE4.1 Instructions
Smart Memory Access
- Memory Disambiguation
- Hardware Prefetching
Advanced Smart Cache
- Low latency, high BW shared L2 cache
Nehalem builds on the great Intel Core microarchitecture
Designed for Performance
Core blocks (die diagram):
- Execution Units
- Out-of-Order Scheduling & Retirement
- L2 Cache & Interrupt Servicing
- Instruction Fetch & L1 Cache
- Branch Prediction
- Instruction Decode & Microcode
- Paging
- L1 Data Cache
- Memory Ordering & Execution
Nehalem enhancements:
- Additional Caching Hierarchy
- New SSE4.2 Instructions
- Deeper Buffers
- Faster Virtualization
- Simultaneous Multi-Threading
- Better Branch Prediction
- Improved Lock Support
- Improved Loop Streaming
Enhanced Processor Core
[Block diagram; width labels 4, 4, 6 appear on the datapath]
Front End
- Instruction Fetch and Pre Decode
- Instruction Queue
- Decode
- ITLB
- 32kB Instruction Cache
Execution Engine
- Rename/Allocate
- Retirement Unit (ReOrder Buffer)
- Reservation Station
- Execution Units
Memory
- DTLB
- 2nd Level TLB
- 32kB Data Cache
- 256kB 2nd Level Cache
- L3 and beyond
Front-end
Responsible for feeding the compute engine
- Decode instructions
- Branch Prediction
Key Intel Core 2 processor features
- 4-wide decode
- Macrofusion
- Loop Stream Detector
[Diagram: Instruction Fetch and Pre Decode, Instruction Queue, Decode, ITLB, 32kB Instruction Cache]
Macrofusion Recap
Introduced in the Intel Core 2 Processor Family
TEST/CMP instruction followed by a conditional branch treated as a single instruction
- Decode as one instruction
- Execute as one instruction
- Retire as one instruction
Higher performance
- Improves throughput
- Reduces execution latency
Improved power efficiency
- Less processing required to accomplish the same work
Nehalem Macrofusion
Goal: Identify more macrofusion opportunities for increased performance and power efficiency
Support all the cases in Intel Core 2 processor PLUS
- CMP+Jcc macrofusion added for the following branch conditions:
  JL/JNGE, JGE/JNL, JLE/JNG, JG/JNLE
Intel Core 2 processor only supports macrofusion in 32-bit mode
- Nehalem supports macrofusion in both 32-bit and 64-bit modes
Increased macrofusion benefit on Nehalem
Loop Stream Detector Reminder
Loops are very common in most software
Take advantage of knowledge of loops in HW
- Decoding the same instructions over and over
- Making the same branch predictions over and over
Loop Stream Detector identifies software loops
- Stream from Loop Stream Detector instead of normal path
- Disable unneeded blocks of logic for power savings
- Higher performance by removing instruction fetch limitations
[Diagram: Intel Core 2 processor Loop Stream Detector. Branch Prediction -> Fetch -> Loop Stream Detector (18 instructions) -> Decode]
Nehalem Loop Stream Detector
Same concept as in prior implementations
Higher performance: Expand the size of the loops detected
Improved power efficiency: Disable even more logic
[Diagram: Nehalem Loop Stream Detector. Branch Prediction -> Fetch -> Decode -> Loop Stream Detector (28 micro-ops)]
Branch Prediction Reminder
Goal: Keep powerful compute engine fed
Options:
- Stall pipeline while determining branch direction/target
- Predict branch direction/target and correct if wrong
High branch prediction accuracy leads to:
- Performance: Minimize amount of time wasted correcting from incorrect branch predictions
  - Through higher branch prediction accuracy
  - Through faster correction when prediction is wrong
- Power efficiency: Minimize number of speculative/incorrect micro-ops that are executed
Continued focus on branch prediction improvements
L2 Branch Predictor
Problem: Software with a large code footprint not able to fit well in existing branch predictors
- Example: Database applications
Solution: Use multi-level branch prediction scheme
Benefits:
- Higher performance through improved branch prediction accuracy
- Greater power efficiency through less mis-speculation
Renamed Return Stack Buffer (RSB)
Instruction Reminder
- CALL: Entry into functions
- RET: Return from functions
Classical Solution
- Return Stack Buffer (RSB) used to predict RET
- Problems:
  - RSB can be corrupted by speculative path
  - RSB can overflow, overwriting valid entries
The Renamed RSB
- No RET mispredicts in the common case
- No stack overflow issue
Execution Engine
Start with powerful Intel Core 2 processor execution engine
- Dynamic 4-wide Execution
- Advanced Digital Media Boost
  - 128-bit wide SSE
- HD Boost (Penryn)
  - SSE4.1 instructions
- Super Shuffler (Penryn)
Add Nehalem enhancements
- Additional parallelism for higher performance
Execution Unit Overview
Execute 6 operations/cycle
- 3 Memory Operations
  - 1 Load
  - 1 Store Address
  - 1 Store Data
- 3 Computational Operations
Unified Reservation Station
- Schedules operations to Execution units
- Single Scheduler for all Execution Units
- Can be used by all integer, all FP, etc.
Execution ports fed by the Unified Reservation Station (diagram):
- Port 0: Integer ALU & Shift; FP Multiply; Divide; SSE Integer ALU, Integer Shuffles
- Port 1: Integer ALU & LEA; FP Add; Complex Integer; SSE Integer Multiply
- Port 2: Load
- Port 3: Store Address
- Port 4: Store Data
- Port 5: Integer ALU & Shift; Branch; FP Shuffle; SSE Integer ALU, Integer Shuffles
Increased Parallelism
Goal: Keep powerful execution engine fed
Nehalem increases size of out of order window by 33%
Must also increase other corresponding structures
[Chart: Concurrent uOps Possible (axis 0 to 128) for Dothan, Merom, Nehalem]
Structure            Merom  Nehalem  Comment
Reservation Station  32     36       Dispatches operations to execution units
Load Buffers         32     48       Tracks all load operations allocated
Store Buffers        20     32       Tracks all store operations allocated
Increased Resources for Higher Performance
Source: Microarchitecture specifications
Enhanced Memory Subsystem
Start with great Intel Core 2 Processor Features
- Memory Disambiguation
- Hardware Prefetchers
- Advanced Smart Cache
New Nehalem Features
- New TLB Hierarchy
- Fast 16-Byte unaligned accesses
- Faster Synchronization Primitives
New TLB Hierarchy
Problem: Applications continue to grow in data size
Need to increase TLB size to keep the pace for performance
Nehalem adds new low-latency unified 2nd level TLB
                               # of Entries
1st Level Instruction TLBs
  Small Page (4k)              128
  Large Page (2M/4M)           7 per thread
1st Level Data TLBs
  Small Page (4k)              64
  Large Page (2M/4M)           32
New 2nd Level Unified TLB
  Small Page Only              512
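To put the entry counts in context, here is a minimal sketch of the address range each TLB level can map at once. It assumes "reach" means entries times page size; the function name is illustrative and does not come from the deck.

```c
#include <stdint.h>

/* Bytes of address space a TLB can map without a miss:
   number of entries times the page size each entry covers. */
uint64_t tlb_reach_bytes(uint64_t entries, uint64_t page_bytes)
{
    return entries * page_bytes;
}
```

Under that assumption, the 64-entry first-level data TLB covers 256 KB with 4k pages, while the 512-entry second-level TLB covers 2 MB.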
Fast Unaligned Cache Accesses
Two flavors of 16-byte SSE loads/stores exist
- Aligned (MOVAPS/D, MOVDQA) -- Must be aligned on a 16-byte boundary
- Unaligned (MOVUPS/D, MOVDQU) -- No alignment requirement
Prior to Nehalem
- Optimized for Aligned instructions
- Unaligned instructions slower, lower throughput -- Even for aligned accesses!
  - Required multiple uops (not energy efficient)
- Compilers would largely avoid unaligned load
  - 2-instruction sequence (MOVSD+MOVHPD) was faster
Nehalem optimizes Unaligned instructions
- Same speed/throughput as Aligned instructions on aligned accesses
- Optimizations for making accesses that cross 64-byte boundaries fast
  - Lower latency/higher throughput than Intel Core 2 processor
- Aligned instructions remain fast
No reason to use aligned instructions on Nehalem!
Benefits:
- Compiler can now use unaligned instructions without fear
- Higher performance on key media algorithms
- More energy efficient than prior implementations
Faster Synchronization Primitives
Multi-threaded software becoming more prevalent
Scalability of multi-thread applications can be limited by synchronization
Synchronization primitives: LOCK prefix, XCHG
Reduce synchronization latency for legacy software
Greater thread scalability with Nehalem
[Chart: LOCK CMPXCHG Performance. Relative latency (axis 0 to 1) for Pentium 4, Core 2, Nehalem]
Simultaneous Multi-Threading (SMT)
SMT
- Run 2 threads at the same time per core
Take advantage of 4-wide execution engine
- Keep it fed with multiple threads
- Hide latency of a single thread
Most power efficient performance feature
- Very low die area cost
- Can provide significant performance benefit depending on application
- Much more efficient than adding an entire core
Nehalem advantages
- Larger caches
- Massive memory BW
Simultaneous multi-threading enhances performance and energy efficiency
[Figure: execution unit occupancy over time (processor cycles), without SMT vs. with SMT. Note: each box represents a processor execution unit.]
SMT Implementation Details
Multiple policies possible for implementation of SMT
Replicated: Duplicate state for SMT
- Register state
- Renamed RSB
- Large page ITLB
Partitioned: Statically allocated between threads
- Key buffers: Load, store, Reorder
- Small page ITLB
Competitively shared: Depends on threads' dynamic behavior
- Reservation station
- Caches
- Data TLBs, 2nd level TLB
Unaware
- Execution units
Feeding the Execution Engine
Powerful 4-wide dynamic execution engine
Need to keep providing fuel to the execution engine
Nehalem Goals
- Low latency to retrieve data
  - Keep execution engine fed w/o stalling
- High data bandwidth
  - Handle requests from multiple cores/threads seamlessly
- Scalability
  - Design for increasing core counts
Combination of great cache hierarchy and new platform
Nehalem designed to feed the execution engine
Designed For Modularity
Optimal price / performance / energy efficiency for server, desktop and mobile products
[Diagram: Core (multiple cores) and Uncore (L3 Cache, IMC, QPI, Power & Clock), connected to DRAM and QPI links]
Differentiation in the Uncore:
- # cores
- Size of cache
- # mem channels
- # QPI Links
- Type of Memory
- Power Management
- Integrated graphics
2008 - 2009 Servers & Desktops
QPI: Intel QuickPath Interconnect
Intel Smart Cache -- Core Caches
New 3-level Cache Hierarchy
1st level caches
- 32kB Instruction cache
- 32kB Data Cache
  - Support more L1 misses in parallel than Core 2
2nd level Cache
- New cache introduced in Nehalem
- Unified (holds code and data)
- 256 kB per core
- Performance: Very low latency
- Scalability: As core count increases, reduce pressure on shared cache
[Diagram: Core with 32kB L1 Data Cache, 32kB L1 Inst. Cache, and 256kB L2 Cache]
Intel Smart Cache -- 3rd Level Cache
New 3rd level cache
Shared across all cores
Size depends on # of cores
- Quad-core: Up to 8MB
- Scalability:
  - Built to vary size with varied core counts
  - Built to easily increase L3 size in future parts
Inclusive cache policy for best performance
- Data residing in L1/L2 must be present in 3rd level cache
[Diagram: multiple cores, each with L1 Caches and an L2 Cache, sharing one L3 Cache]
Inclusive vs. Exclusive Caches: Cache Miss
[Diagram: Exclusive and Inclusive designs side by side, each with Cores 0-3 above a shared L3 Cache]
Data request from Core 0 misses Core 0's L1 and L2
Request sent to the L3 cache
Inclusive vs. Exclusive Caches: Cache Miss
[Diagram: Exclusive and Inclusive designs side by side; the L3 lookup is marked MISS! in both]
Core 0 looks up the L3 Cache
Data not in the L3 Cache
Inclusive vs. Exclusive Caches: Cache Miss
[Diagram: Exclusive and Inclusive designs side by side; the L3 lookup is marked MISS! in both]
- Exclusive: Must check other cores
- Inclusive: Guaranteed data is not on-die
Greater scalability from inclusive approach
Inclusive vs. Exclusive Caches: Cache Hit
[Diagram: Exclusive and Inclusive designs side by side; the L3 lookup is marked HIT! in both]
- Exclusive: No need to check other cores
- Inclusive: Data could be in another core
BUT Nehalem is smart
Inclusive vs. Exclusive Caches: Cache Hit
[Diagram: Inclusive design with Cores 0-3 above a shared L3 Cache; the L3 lookup is a HIT! with core valid bits 0 0 0 0]
Core valid bits limit unnecessary snoops
Maintain a set of core valid bits per cache line in the L3 cache
- Each bit represents a core
- If the L1/L2 of a core may contain the cache line, then core valid bit is set to 1
- No snoops of cores are needed if no bits are set
- If more than 1 bit is set, line cannot be in Modified state in any core
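The core valid bit rules can be sketched in a few lines of C. This is an illustrative model only: the `l3_line_t` layout, the four-core assumption, and both function names are hypothetical, not the actual L3 tag format.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical L3 directory entry for a quad-core part (assumption). */
typedef struct {
    bool    present;     /* line is present in the inclusive L3 */
    uint8_t core_valid;  /* bit i set: core i's L1/L2 may hold the line */
} l3_line_t;

/* Cores that must be snooped when `requester` reads the line.
   On an L3 miss, inclusion guarantees no core holds it, so none are snooped. */
uint8_t cores_to_snoop(const l3_line_t *line, unsigned requester)
{
    if (!line->present)
        return 0;
    return (uint8_t)(line->core_valid & ~(1u << requester));
}

/* A line can be Modified in a core only if exactly one valid bit is set. */
bool may_be_modified(const l3_line_t *line)
{
    uint8_t v = line->core_valid;
    return line->present && v != 0 && (v & (v - 1)) == 0;
}
```

With valid bits naming only core 2, a read from core 0 snoops core 2 alone; with no bits set, the L3 answers without snooping any core.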
Inclusive vs. Exclusive Caches: Read from other core
[Diagram: the Exclusive L3 lookup is a MISS!; the Inclusive L3 lookup is a HIT! with core valid bits 0 0 1 0]
- Exclusive: Must check all other cores
- Inclusive: Only need to check the core whose core valid bit is set
Today's Platform Architecture
Front-Side Bus Evolution
[Diagram: three generations of front-side bus platforms, each with four CPUs, an MCH with attached memory, and an ICH]
Nehalem-EP Platform Architecture
Integrated Memory Controller
- 3 DDR3 channels per socket
- Massive memory bandwidth
- Memory Bandwidth scales with # of processors
- Very low memory latency
Intel QuickPath Interconnect (QPI)
- New point-to-point interconnect
- Socket to socket connections
- Socket to chipset connections
- Build scalable solutions
[Diagram: two Nehalem-EP sockets and a Tylersburg-EP chipset]
Significant performance leap from new platform
Intel QuickPath Interconnect
Nehalem introduces new Intel QuickPath Interconnect (QPI)
High bandwidth, low latency point to point interconnect
Up to 6.4 GT/sec initially
- 6.4 GT/sec -> 12.8 GB/sec
- Bi-directional link -> 25.6 GB/sec per link
- Future implementations at even higher speeds
Highly scalable for systems with varying # of sockets
[Diagrams: a two-socket Nehalem-EP platform with an IOH, and a four-CPU platform with memory attached to each CPU and IOH connections]
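The link bandwidth figures follow from simple arithmetic. The sketch below assumes 2 bytes of data per transfer in each direction, which is what the 6.4 GT/sec -> 12.8 GB/sec conversion implies; the function names are illustrative.

```c
/* Peak bandwidth of one direction of a QPI link, in GB/sec.
   Assumption: 2 bytes of data move per transfer. */
double qpi_gbps_per_direction(double gt_per_sec)
{
    const double bytes_per_transfer = 2.0;
    return gt_per_sec * bytes_per_transfer;
}

/* A link is bi-directional, so its total is twice the per-direction figure. */
double qpi_gbps_per_link(double gt_per_sec)
{
    return 2.0 * qpi_gbps_per_direction(gt_per_sec);
}
```

At 6.4 GT/sec this gives 12.8 GB/sec per direction and 25.6 GB/sec per link, matching the figures above.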
Integrated Memory Controller (IMC)
Memory controller optimized per market segment
Initial Nehalem products
- Native DDR3 IMC
- Up to 3 channels per socket
- Speeds up to DDR3-1333
  - Massive memory bandwidth
- Designed for low latency
- Support RDIMM and UDIMM
- RAS Features
Future products
- Scalability
  - Vary # of memory channels
  - Increase memory speeds
  - Buffered and Non-Buffered solutions
- Market specific needs
  - Higher memory capacity
  - Integrated graphics
[Diagram: two Nehalem-EP sockets, each with DDR3 attached, and a Tylersburg-EP chipset]
Significant performance through new IMC
IMC Memory Bandwidth (BW)
3 memory channels per socket
Up to DDR3-1333 at launch
- Massive memory BW
- HEDT: 32 GB/sec peak
- 2S server: 64 GB/sec peak
Scalability
- Design IMC and core to take advantage of BW
- Allow performance to scale with cores
  - Core enhancements
    - Support more cache misses per core
    - Aggressive hardware prefetching w/ throttling enhancements
  - Example IMC Features
    - Independent memory channels
    - Aggressive Request Reordering
[Chart: 2 socket Streams Triad, relative bandwidth (axis 0 to 4). Harpertown (1600 FSB) vs. Nehalem-EP (3x DDR3-1333): >4X bandwidth]
Massive memory BW provides performance and scalability
Source: Internal measurements
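The peak figures can be reproduced with a short calculation. This sketch assumes each DDR3 channel is 64 bits (8 bytes) wide and that "DDR3-1333" means 1333 million transfers per second; the function name is illustrative.

```c
/* Peak DRAM bandwidth in GB/sec.
   Assumption: each DDR3 channel moves 8 bytes per transfer. */
double ddr3_peak_gbps(unsigned mt_per_sec, unsigned channels, unsigned sockets)
{
    const double bytes_per_transfer = 8.0;
    return (double)mt_per_sec * 1e6 * bytes_per_transfer
           * channels * sockets / 1e9;
}
```

Three DDR3-1333 channels come to about 32 GB/sec per socket, and a two-socket server to about 64 GB/sec, in line with the peak numbers quoted for HEDT and 2S systems.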
Non-Uniform Memory Access (NUMA)
FSB architecture
- All memory in one location
Starting with Nehalem
- Memory located in multiple places
Latency to memory dependent on location
Local memory
- Highest BW
- Lowest latency
Remote Memory
- Higher latency
[Diagram: two Nehalem-EP sockets and a Tylersburg-EP chipset]
Ensure software is NUMA-optimized for best performance
Local Memory Access
CPU0 requests cache line X, not present in any CPU0 cache
- CPU0 requests data from its DRAM
- CPU0 snoops CPU1 to check if data is present
Step 2:
- DRAM returns data
- CPU1 returns snoop response
Local memory latency is the maximum latency of the two responses
Nehalem optimized to keep key latencies close to each other
[Diagram: CPU0 and CPU1 connected by QPI, each with its own DRAM]
QPI: Intel QuickPath Interconnect
Remote Memory Access
CPU0 requests cache line X, not present in any CPU0 cache
- CPU0 requests data from CPU1
- Request sent over QPI to CPU1
- CPU1's IMC makes request to its DRAM
- CPU1 snoops internal caches
- Data returned to CPU0 over QPI
Remote memory latency a function of having a low latency interconnect
[Diagram: CPU0 and CPU1 connected by QPI, each with its own DRAM]
QPI: Intel QuickPath Interconnect
Memory Latency Comparison
Low memory latency critical to high performance
Design integrated memory controller for low latency
Need to optimize both local and remote memory latency
Nehalem delivers
- Huge reduction in local memory latency
- Even remote memory latency is fast
Effective memory latency depends per application/OS
- Percentage of local vs. remote accesses
- NHM has lower latency regardless of mix
[Chart: Relative Memory Latency Comparison (axis 0.00 to 1.00) for Harpertown (FSB 1600), Nehalem (DDR3-1333) Local, Nehalem (DDR3-1333) Remote]
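The two latency rules from the NUMA slides can be written as a small model: local latency is the slower of the DRAM and snoop responses, and effective latency is the local/remote mix. This is a sketch with illustrative function names; the deck gives only relative latencies, so any nanosecond values fed to it are hypothetical.

```c
/* Local access completes when both the DRAM data and the remote
   snoop response have arrived, so latency is the larger of the two. */
double local_latency(double dram_ns, double snoop_ns)
{
    return dram_ns > snoop_ns ? dram_ns : snoop_ns;
}

/* Effective latency for a workload whose accesses are `local_fraction`
   local (0.0 to 1.0) and the rest remote. */
double effective_latency(double local_fraction, double local_ns, double remote_ns)
{
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns;
}
```

For example, with hypothetical latencies of 60 ns local and 100 ns remote, a workload that is 75% local sees an effective latency of 70 ns, which is why NUMA-aware placement matters.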
Virtualization
To get best virtualized performance
- Have best native performance
- Reduce:
  - # of transitions into/out of virtual machine
  - Latency of transitions
Nehalem virtualization features
- Reduced latency for transitions
- Virtual Processor ID (VPID) to reduce effective cost of transitions
- Extended Page Table (EPT) to reduce # of transitions
Great virtualization performance w/ Nehalem
Virtualization Enhancements
Greatly improved Virtualization performance
[Chart: Round Trip Virtualization Latency (time to enter and exit virtual machine). Relative latency (axis 0% to 100%) for Merom, Penryn, Nehalem: >40% faster]
Architecture:
- Virtual Processor ID (VPID)
  - Reduce frequency of invalidation of TLB entries
- Extended Page Tables (EPT)
  - Hardware to translate guest to host physical address
  - Previously done in VMM software
  - Removes #1 cause of exits from virtual machine
  - Allows guests full control over page tables
Source: pre-silicon/post-silicon measurements
Extended Page Tables (EPT) Motivation
A VMM needs to protect physical memory
- Multiple Guest OSs share the same physical memory
- Protections are implemented through page-table virtualization
Page table virtualization accounts for a significant portion of virtualization overheads
- VM Exits / Entries
The goal of EPT is to reduce these overheads
[Diagram: the Guest OS in VM1 points its CR3 at the Guest Page Table; the VMM points the real CR3 at the Active Page Table]
- Guest page table changes cause exits into the VMM
- VMM maintains the active page table, which is used by the CPU
EPT Solution
Intel 64 Page Tables
- Map Guest Linear Address to Guest Physical Address
- Can be read and written by the guest OS
New EPT Page Tables under VMM Control
- Map Guest Physical Address to Host Physical Address
- Referenced by new EPT base pointer
No VM Exits due to Page Faults, INVLPG or CR3 accesses
[Diagram: Guest Linear Address -> Intel 64 Page Tables (rooted at CR3) -> Guest Physical Address -> EPT Page Tables (rooted at EPT Base Pointer) -> Host Physical Address]
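The two-stage mapping can be illustrated with a toy model. This is a deliberately simplified sketch: real Intel 64 and EPT tables are multi-level and the guest's own table walk is itself translated through EPT, whereas here each stage is a flat, hypothetical 8-entry array with 4 KB pages. All names are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12u
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1u)
#define NUM_PAGES  8u          /* toy address space (assumption) */
#define INVALID    UINT32_MAX  /* marks an unmapped page */

/* Flat toy table: map[input page number] = output page number. */
typedef struct { uint32_t map[NUM_PAGES]; } page_table_t;

static bool translate_one(const page_table_t *pt, uint32_t in, uint32_t *out)
{
    uint32_t page = in >> PAGE_SHIFT;
    if (page >= NUM_PAGES || pt->map[page] == INVALID)
        return false;
    *out = (pt->map[page] << PAGE_SHIFT) | (in & PAGE_MASK);
    return true;
}

/* Stage 1 (guest-controlled): guest linear -> guest physical.
   Stage 2 (VMM-controlled EPT): guest physical -> host physical. */
bool translate_gla_to_hpa(const page_table_t *guest_pt, const page_table_t *ept,
                          uint32_t gla, uint32_t *hpa)
{
    uint32_t gpa;
    if (!translate_one(guest_pt, gla, &gpa))
        return false;  /* guest page fault: handled inside the guest */
    return translate_one(ept, gpa, hpa);  /* failure here would exit to the VMM */
}
```

Because the guest edits only the first table and the VMM only the second, guest page table updates need no VMM involvement in this model, which is the overhead EPT is meant to remove.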
Extending Performance and Energy Efficiency
- SSE4.2 Instruction Set Architecture (ISA) Leadership in 2008
SSE4 (45nm CPUs)
- SSE4.1 (Penryn Core)
- SSE4.2 (Nehalem Core)
  - STTNI, e.g. XML acceleration
  - ATA (Application Targeted Accelerators)
    - POPCNT, e.g. Genome Mining
    - CRC32, e.g. iSCSI Application
STTNI: Accelerated String and Text Processing
- Faster XML parsing
- Faster search and pattern matching
- Novel parallel data matching and comparison operations
ATA: Accelerated Searching & Pattern Recognition of Large Data Sets
- Improved performance for Genome Mining, Handwriting recognition
- Fast Hamming distance / Population count
ATA: New Communications Capabilities
- Hardware based CRC instruction
- Accelerated Network attached storage
- Improved power efficiency for Software iSCSI, RDMA, and SCTP
What should the applications, OS and VMM vendors do?
- Understand the benefits & take advantage of new instructions in 2008.
- Provide us feedback on instructions ISV would like to see for next generation of applications
STTNI - STring & Text New Instructions
Operates on strings of bytes or words (16b)
Equal Each Instruction
- True for each character in Src2 if same position in Src1 is equal
- Src1: Test\tday   Src2: tad tseT   Mask: 01101111
Equal Ordered Instruction
- Finds the start of a substring (Src1) within another string (Src2)
- Src1: ABCA0XYZ   Src2: S0BACBAB   Mask: 00000010
Equal Any Instruction
- True for each character in Src2 if any character in Src1 matches
- Src1: Example\n   Src2: atad tsT   Mask: 10100000
Ranges Instruction
- True if a character in Src2 is in at least one of up to 8 ranges in Src1
- Src1: AZ 0 9zzz   Src2: taD tseT   Mask: 00100001
Projected 3.8x kernel speedup on XML parsing & 2.7x savings on instruction cycles
[Figure: STTNI MODEL. Comparison matrix of Source1 (XMM) against Source2 (XMM / M128); each bit in the diagonal is checked to produce IntRes1, with bit 0 marked on each operand]
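The four comparison modes can be modeled in portable C to show what each mask means. This is a scalar sketch, not the SSE4.2 instructions themselves: it ignores the instructions' handling of string terminators, polarity, and partial matches at the end of a fragment, and all function names are illustrative. Bit i of each result corresponds to character i of the second string.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Equal Each: bit i set when a[i] == b[i]. */
uint16_t equal_each(const char *a, const char *b, size_t n)
{
    uint16_t mask = 0;
    for (size_t i = 0; i < n && i < 16; i++)
        if (a[i] == b[i])
            mask |= (uint16_t)(1u << i);
    return mask;
}

/* Equal Any: bit i set when text[i] matches any character in set. */
uint16_t equal_any(const char *set, size_t set_len, const char *text, size_t n)
{
    uint16_t mask = 0;
    for (size_t i = 0; i < n && i < 16; i++)
        for (size_t j = 0; j < set_len; j++)
            if (text[i] == set[j]) {
                mask |= (uint16_t)(1u << i);
                break;
            }
    return mask;
}

/* Ranges: pairs holds (low, high) bounds; bit i set when text[i]
   falls inside at least one range (GE low AND LE high, ORed over pairs). */
uint16_t ranges(const char *pairs, size_t num_pairs, const char *text, size_t n)
{
    uint16_t mask = 0;
    for (size_t i = 0; i < n && i < 16; i++) {
        unsigned char c = (unsigned char)text[i];
        for (size_t j = 0; j < num_pairs; j++)
            if (c >= (unsigned char)pairs[2 * j] &&
                c <= (unsigned char)pairs[2 * j + 1]) {
                mask |= (uint16_t)(1u << i);
                break;
            }
    }
    return mask;
}

/* Equal Ordered: bit i set when the whole needle starts at text[i]. */
uint16_t equal_ordered(const char *needle, size_t m, const char *text, size_t n)
{
    uint16_t mask = 0;
    for (size_t i = 0; i + m <= n && i < 16; i++)
        if (memcmp(text + i, needle, m) == 0)
            mask |= (uint16_t)(1u << i);
    return mask;
}
```

For instance, `ranges("AZ09", 2, "aB3 z9", 6)` sets bits 1, 2 and 5, flagging the uppercase letter and the two digits.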
STTNI Model
[Figure: four comparison matrices of Source1 (XMM) against Source2 (XMM / M128), each producing IntRes1, with bit 0 marked on each operand]
- EQUAL ANY: OR results down each column
- EQUAL EACH: Check each bit in the diagonal
- EQUAL ORDERED: AND the results along each diagonal
- RANGES: First compare does GE, next does LE; AND GE/LE pairs of results; OR those results
Example Code For strlen()

STTNI Version

int sttni_strlen(const char * src)
{
    char eom_vals[32] = {1, 255, 0};   // range 1..255: every non-zero byte
    __asm{
        mov eax, src
        movdqu xmm2, eom_vals
        xor ecx, ecx
    topofloop:
        add eax, ecx                   ; advance by the index from the last compare
        movdqu xmm1, OWORD PTR[eax]    ; unaligned 16-byte load
        pcmpistri xmm2, xmm1, imm8     ; ZF set once the loaded bytes contain a zero
        jnz topofloop
    endofstring:
        add eax, ecx
        sub eax, src
        ret
    }
}

Current Code

string equ [esp + 4]
    mov ecx,string              ; ecx -> string
    test ecx,3                  ; test if string is aligned on 32 bits
    je short main_loop
str_misaligned:
    ; simple byte loop until string is aligned
    mov al,byte ptr [ecx]
    add ecx,1
    test al,al
    je short byte_3
    test ecx,3
    jne short str_misaligned
    add eax,dword ptr 0         ; 5 byte nop to align label below
    align 16                    ; should be redundant
main_loop:
    mov eax,dword ptr [ecx]     ; read 4 bytes
    mov edx,7efefeffh
    add edx,eax
    xor eax,-1
    xor eax,edx
    add ecx,4
    test eax,81010100h
    je short main_loop
    ; found zero byte in the loop
    mov eax,[ecx - 4]
    test al,al                  ; is it byte 0
    je short byte_0
    test ah,ah                  ; is it byte 1
    je short byte_1
    test eax,00ff0000h          ; is it byte 2
    je short byte_2
    test eax,0ff000000h         ; is it byte 3
    je short byte_3
    jmp short main_loop         ; taken if bits 24-30 are clear and bit 31 is set
byte_3:
    lea eax,[ecx - 1]
    mov ecx,string
    sub eax,ecx
    ret
byte_2:
    lea eax,[ecx - 2]
    mov ecx,string
    sub eax,ecx
    ret
byte_1:
    lea eax,[ecx - 3]
    mov ecx,string
    sub eax,ecx
    ret
byte_0:
    lea eax,[ecx - 4]
    mov ecx,string
    sub eax,ecx
    ret
strlen endp
end

Current Code: Minimum of 11 instructions; inner loop processes 4 bytes with 8 instructions
STTNI Code: Minimum of 10 instructions; a single inner loop processes 16 bytes with only 4 instructions
ATA - Application Targeted Accelerators
CRC32
- Accumulates a CRC32 value using the iSCSI polynomial
- One register maintains the running CRC value as a software loop iterates over data.
- Fixed CRC polynomial = 11EDC6F41h
- Replaces complex instruction sequences for CRC in upper layer data protocols: iSCSI, RDMA, SCTP
[Diagram: SRC holds 8/16/32/64 bits of data; DST holds the old CRC in bits 31:0 (bits 63:32 are 0) and receives the new CRC in bits 31:0]
Enables enterprise class data assurance with high data rates in networked storage in any user environment.
POPCNT
- POPCNT determines the number of nonzero bits in the source.
- POPCNT is useful for speeding up fast matching in data mining workloads including: DNA/Genome Matching, Voice Recognition
- ZFlag set if result is zero. All other flags (C,S,O,A,P) reset
[Diagram: source RAX = 0 1 0 ... 0 0 1 1 (bits 63 down to 0); result RBX = 0x3; ZF = 0]
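For reference, both ATA operations have simple software equivalents, which is what the instructions replace. The sketch below is a portable bit-at-a-time model, not the hardware instructions: `crc32c_update`, `crc32c`, and `popcount64` are illustrative names, and the all-ones initial value with final inversion is the usual CRC32C software convention applied around the running value rather than something the deck specifies.

```c
#include <stddef.h>
#include <stdint.h>

/* Fold `len` bytes into a running CRC32C value, one bit at a time.
   0x82F63B78 is the bit-reflected form of the polynomial 0x1EDC6F41. */
uint32_t crc32c_update(uint32_t crc, const unsigned char *data, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc;
}

/* Whole-buffer CRC32C with the conventional initial value and final inversion. */
uint32_t crc32c(const unsigned char *data, size_t len)
{
    return crc32c_update(0xFFFFFFFFu, data, len) ^ 0xFFFFFFFFu;
}

/* Population count: number of set bits in x (clears the lowest set bit each pass). */
unsigned popcount64(uint64_t x)
{
    unsigned count = 0;
    while (x != 0) {
        x &= x - 1;
        count++;
    }
    return count;
}
```

The bit pattern in the POPCNT diagram (bits 62, 1 and 0 set) gives a count of 3, matching the 0x3 result shown for RBX.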
CRC32 Preliminary Performance
-
8/11/2019 Intel Nehalem Processor
57/65
57
uint32 crc32c_sse42_optimized_version(uint32 crc, unsigned
char const *p, size_t len)
{ // Assuming len is a nonzero multiple of 0x10
    asm("pusha");
    asm("mov %0, %%eax" :: "m" (crc));
    asm("mov %0, %%ebx" :: "m" (p));
    asm("mov %0, %%ecx" :: "m" (len));
    asm("1:");
    // Processing four bytes at a time: Unrolled four times:
    // AT&T operand order: memory source first, CRC accumulator (%eax) last
    asm("crc32l 0x0(%ebx), %eax");
    asm("crc32l 0x4(%ebx), %eax");
    asm("crc32l 0x8(%ebx), %eax");
    asm("crc32l 0xc(%ebx), %eax");
    asm("add $0x10, %ebx");
    asm("sub $0x10, %ecx");
    // Leave the loop once the remaining byte count reaches zero
    asm("jecxz 2f");
    asm("jmp 1b");
    asm("2:");
    asm("mov %%eax, %0" : "=m" (crc));
    asm("popa");
    return crc;
}
Preliminary tests involved Kernel code implementing CRC algorithms commonly used by iSCSI drivers.
32-bit and 64-bit versions of the Kernel under test
32-bit version processes 4 bytes of data using 1 CRC32 instruction
64-bit version processes 8 bytes of data using 1 CRC32 instruction
Input strings of sizes 48 bytes and 4KB used for the test
InputDataSize    32-bit    64-bit
48 bytes         6.53X     9.85X
4KB              9.3X      18.63X
CRC32 optimized Code
Preliminary results show the CRC32 instruction outperforming the fastest CRC32C software algorithm by a big margin
Tools Support of New Instructions
- Intel C++ Compiler 10.x supports the new instructions
  SSE4.2 supported via intrinsics
  Inline assembly supported on both IA-32 and Intel 64 targets
  Necessary to include the required header files in order to access intrinsics for Supplemental SSE3, for SSE4.1, and for SSE4.2
- Intel Library Support
  XML Parser Library using string instructions will enter beta in Spring 08 and release as a product in Fall 08
  IPP is investigating possible usages of new instructions
- Microsoft* Visual Studio 2008 VC++
  SSE4.2 supported via intrinsics
  Inline assembly supported on IA-32 only
  Necessary to include the required header files in order to access intrinsics for Supplemental SSE3, for SSE4.1, and for SSE4.2
  VC++ 2008 tools masm, msdis, and debuggers recognize the new instructions
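To illustrate the intrinsics route listed above, here is a hedged sketch of a CRC-32C routine built on the SSE4.2 intrinsics `_mm_crc32_u8`, `_mm_crc32_u32`, and `_mm_crc32_u64`. Several things here are assumptions rather than content from the slides: the function names are hypothetical, `<nmmintrin.h>` is the header commonly used for SSE4.2 intrinsics (the transcript does not name the headers), the `__SSE4_2__` and `__x86_64__` macros are the GCC/Clang way of detecting the target, and the portable fallback is included only so the sketch builds when SSE4.2 is not enabled.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE4_2__)
#include <nmmintrin.h>   /* SSE4.2 intrinsics: _mm_crc32_u8/_u32/_u64 */
#endif

/* Fold one byte into the running CRC (intrinsic when available). */
static uint32_t crc32c_u8(uint32_t crc, uint8_t byte)
{
#if defined(__SSE4_2__)
    return _mm_crc32_u8(crc, byte);
#else
    /* Portable fallback: bitwise CRC-32C, reflected polynomial 0x82F63B78. */
    crc ^= byte;
    for (int bit = 0; bit < 8; bit++)
        crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    return crc;
#endif
}

/* CRC-32C of a buffer, with the conventional initial value and final inversion. */
uint32_t crc32c_intrinsics(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t crc = 0xFFFFFFFFu;

#if defined(__SSE4_2__) && defined(__x86_64__)
    /* 64-bit build: one CRC32 instruction consumes 8 bytes. */
    while (len >= 8) {
        uint64_t chunk;
        memcpy(&chunk, p, sizeof chunk);
        crc = (uint32_t)_mm_crc32_u64(crc, chunk);
        p += 8;
        len -= 8;
    }
#endif
#if defined(__SSE4_2__)
    /* 32-bit step: one CRC32 instruction consumes 4 bytes. */
    while (len >= 4) {
        uint32_t chunk;
        memcpy(&chunk, p, sizeof chunk);
        crc = _mm_crc32_u32(crc, chunk);
        p += 4;
        len -= 4;
    }
#endif
    /* Remaining bytes, or the whole buffer when SSE4.2 is not enabled. */
    while (len > 0) {
        crc = crc32c_u8(crc, *p++);
        len--;
    }
    return crc ^ 0xFFFFFFFFu;
}
```

With GCC or Clang the intrinsic paths are compiled in when the build enables SSE4.2 (for example with `-msse4.2`); otherwise only the fallback loop runs. Using intrinsics rather than inline assembly also sidesteps the IA-32-only inline assembly restriction noted for VC++.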
Software Optimization Guidelines
Most optimizations for Intel Core microarchitecture still hold
Examples of new optimization guidelines:
- 16-byte unaligned loads/stores
- Enhanced macrofusion rules
- NUMA optimizations
Nehalem SW Optimization Guide will be published
Intel C++ Compiler will support settings for Nehalem optimizations
Summary
Nehalem - The Intel 45 nm Tock Processor
Designed for
- Power Efficiency
- Scalability
- Performance
Enhanced Processor Core
Brand New Platform Architecture
Extending ISA Leadership
Additional sources of information on this topic:
DPTS003 - High End Desktop Platform Overview for the Next Generation Intel Microarchitecture (Nehalem) Processor
- April 3 11:10-12:00; Room 3C+3D, 3rd floor
NGMS002 - Upcoming Intel 64 Instruction Set Architecture Extensions
- April 3 16:00-17:50; Auditorium, 3rd floor
NGMC001 - Next Generation Intel Microarchitecture (Nehalem) and New Instruction Extensions - Chalk Talk
- April 3 17:50-18:30; Auditorium, 3rd floor
Demos in the Advance Technology Zone on the 3rd floor
More web based info: http://www.intel.com/technology/architecture-silicon/next-gen/index.htm
Session Presentations - PDFs
The PDF of this Session presentation is available from our IDF Content Catalog:
https://intel.wingateweb.com/SHchina/catalog/controller/catalog
These can also be found from links on www.intel.com/idf
Please Fill out the Session Evaluation Form
Thank You for your input, we use it to improve future Intel Developer Forum events
Put in your lucky draw coupon to win the prize at the end of the track!
You must be present to win!
Legal Disclaimer
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. INTEL PRODUCTS ARE NOT INTENDED FOR USE IN MEDICAL, LIFE SAVING, OR LIFE SUSTAINING APPLICATIONS.
Intel may make changes to specifications and product descriptions at any time, without notice.
All products, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice.
Intel processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Nehalem, Harpertown, Tylersburg, Merom, Dothan, Penryn, Bloomfield and other code names featured are used internally within Intel to identify products that are in development and not yet publicly announced for release. Customers, licensees and other third parties are not authorized by Intel to use code names in advertising, promotion or marketing of any product or services and any such use of Intel's internal code names is at the sole risk of the user.
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.
Intel, Intel Inside, Xeon, Pentium, Core and the Intel logo are trademarks of Intel Corporation in the United States and other countries.
*Other names and brands may be claimed as the property of others.
Copyright 2008 Intel Corporation.