Intel Microarchitecture: Nehalem
Corti Carlo 729868
Outline
Trends
Core Architecture
Innovative Technologies
Microarchitecture Enhancement
Westmere
Trends
Past:
Every improvement was considered
No consideration of power cost
Nowadays:
Limited resources
Increasing energy cost
Strict power/performance efficiency threshold
“Do more with the same” philosophy
Core Architecture
Released in November 2008
Based on 45 nm technology exploiting several innovations
Quad-Core architecture
Integrated Memory Controller
Multi-Chip Package with integrated graphics processor
New point-to-point processor interconnect
Incorporates the SSE4.2 SIMD instructions
Modular block of components
Innovative Technologies
Intel Turbo Boost:
Delivers extra performance
Activated by the Operating System
Constraints (number of cores, temperature, current)
Intel Hyper-Threading:
Simultaneous multi-threading (2 threads per core)
Optimal use of every clock cycle
Intel Intelligent Power:
Integrated Power Gates
Automated Low-Power State
Microarchitecture Enhancement (1)
New three-level cache hierarchy
L1 cache (32 KB Instruction, 32 KB Data)
L2 cache per core (256 KB)
Fully inclusive and fully shared 8 MB L3 cache
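Because the L3 cache is inclusive, every line held in a core's L1 or L2 also occupies space in L3. The short calculation below is a rough sketch of how much of the 8 MB that can amount to on a quad-core part. It assumes the worst case in which the L1 and L2 caches hold disjoint lines; the function names are illustrative only.

```python
# Cache sizes taken from the list above, in KB.
L1_INSTRUCTION_KB = 32
L1_DATA_KB = 32
L2_KB = 256
L3_KB = 8 * 1024
CORES = 4  # quad-core part


def private_cache_kb(cores=CORES):
    """Total capacity of all per-core L1 and L2 caches."""
    return cores * (L1_INSTRUCTION_KB + L1_DATA_KB + L2_KB)


def max_duplicated_fraction(cores=CORES):
    """Upper bound on the share of L3 that mirrors private-cache lines.

    Assumption: L1 and L2 contents are disjoint (worst case).
    """
    return private_cache_kb(cores) / L3_KB


print(private_cache_kb())         # 1280
print(max_duplicated_fraction())  # 0.15625
```

Under this assumption at most 1280 KB, or about 16% of the L3 capacity, duplicates data that already sits in the private caches.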
The Nehalem microarchitecture implements the MESIF cache coherency protocol, an extended version of the well-known MESI protocol [5, p. 213]. Due to the novelty of this microarchitecture, we can only refer to a very limited number of publications that are relevant for our test system. Some information can be gathered from Intel documents [6], [7]. However, none of them describe the architecture in much detail.
We use BenchIT [8] to develop and run our memory benchmarks as well as for the results evaluation. This performance measurement suite is designed to run microbenchmarks on every POSIX 1.003 compliant system in a user-friendly way. It helps to compare different algorithms, implementations of algorithms, properties of the software stack, and hardware details of whole systems. The software is available as Open Source.
III. SYSTEM ARCHITECTURE
Previous generation quad-core Xeon processors (Harpertown) are composed of two dual-core dies, each with a shared L2 cache. In contrast, the Xeon 5500 series processors (Nehalem-EP) are a native quad-core design. Similar to quad-core AMD Opteron processors (Shanghai), the L1 and L2 caches are implemented per core, while the L3 cache is shared among all cores of one processor. The Front Side Bus used in previous Intel CPUs is replaced by point-to-point links called QuickPath Interconnect (QPI). Moreover, each processor contains its own integrated memory controller (IMC). The basic design of a two-socket Nehalem system is depicted in Figure 1.
The Intel Nehalem microarchitecture supports simultaneous multithreading (SMT) that allows each core to execute two threads in parallel. This technique is well-known from the Pentium 4 processors based on Intel’s NetBurst microarchitecture. Furthermore, processors based on the Nehalem microarchitecture feature a dynamic overclocking mechanism (Intel Turbo Boost Technology) that allows the processor to raise core frequencies as long as the thermal limit is not exceeded. Table I shows the key differences between the Nehalem microarchitecture and other common x86_64 server CPUs.
[Figure 1 (diagram): two Nehalem quad-core processors (cores 0 to 3 and cores 4 to 7), each core with private L1 and L2 caches; per processor a shared level 3 cache, an IMC (3 channel) attached to DDR3 memory (DDR3 A to F in total), and QPI links to the other processor and to the I/O hub.]
Figure 1. System overview
Although the basic structure of the memory hierarchy is similar for Nehalem and Shanghai based processors, the implementation details differ. While AMD processors use a “non-inclusive” L3 cache, Intel implements an inclusive last level cache. “Core valid bits” within the L3 cache indicate that a cache line may be present in a certain core. If a bit is not set, the associated core certainly does not hold a copy of the cache line, thus reducing snoop traffic to that core. However, unmodified cache lines may be evicted from a core’s cache without notification of the L3 cache. Therefore, a set core valid bit does not guarantee the presence of the cache line in a higher level cache. Generally speaking, the shared last level cache with its core valid bits has the potential to strongly improve the performance of on-chip data transfers between cores while filtering most unnecessary snoop traffic.
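The filtering behaviour of the core valid bits can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not a description of the actual hardware: the class `ToySystem` and its methods are hypothetical names, each private cache is modelled as a plain set of line addresses, and L3 capacity and replacement are ignored.

```python
class ToySystem:
    """Toy model: private caches per core plus an inclusive L3 with core valid bits."""

    def __init__(self, cores=4):
        self.cores = cores
        self.private = [set() for _ in range(cores)]  # lines in each core's L1/L2
        self.valid_bits = {}  # L3 side: line address -> bitmask of cores

    def load(self, core, line):
        """A core loads a line; the L3 records it by setting the core's valid bit."""
        self.private[core].add(line)
        self.valid_bits[line] = self.valid_bits.get(line, 0) | (1 << core)

    def silent_evict(self, core, line):
        """An unmodified line leaves the core's cache; the L3 is not notified."""
        self.private[core].discard(line)

    def snoop(self, requester, line):
        """Return (cores that must be snooped, cores that really hold the line)."""
        mask = self.valid_bits.get(line, 0)
        targets = [c for c in range(self.cores)
                   if c != requester and mask & (1 << c)]
        hits = [c for c in targets if line in self.private[c]]
        return targets, hits


system = ToySystem()
system.load(0, 0x40)
system.load(1, 0x40)
system.silent_evict(1, 0x40)   # bit for core 1 stays set
print(system.snoop(2, 0x40))   # ([0, 1], [0])
print(system.snoop(2, 0x80))   # ([], [])
```

In the first request core 3 is never snooped because its bit is clear, while core 1 is snooped in vain because its bit is stale after the silent eviction. The second request finds no set bits, so no core is snooped at all.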
Nehalem is the first microarchitecture that uses the MESIF cache coherency protocol. It extends the MESI protocol used in previous Xeon generations by a fifth state called forwarding. This state allows unmodified data that is shared by two processors to be forwarded to a third one. We therefore expect the MESIF improvements to be limited to systems with more than two processors. The benchmark results of our dual-processor test system configuration should not be influenced.
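The effect of the forwarding state on a third reader can be sketched with a small state-transition function. This is a deliberately simplified model, not Intel's implementation: it covers only read misses on unmodified lines, it assumes that a cache holding the line in state E or F answers the request and that the newest reader becomes the forwarder, and the name `read_miss` and the processor labels are hypothetical.

```python
def read_miss(states, requester, forward_state=True):
    """Serve a read miss and return who supplies the data.

    states: dict mapping processor name -> 'E', 'S', 'I' or 'F'.
    forward_state=False models plain MESI (no F state).
    Assumption: modified (M) lines are not modelled in this sketch.
    """
    others = {p: s for p, s in states.items() if p != requester}
    if "M" in others.values():
        raise NotImplementedError("modified lines are outside this sketch")

    # At most one cache holds the line in E or F; it answers the request.
    responders = [p for p, s in others.items() if s in ("E", "F")]
    if responders:
        source = responders[0]
        states[source] = "S"
        states[requester] = "F" if forward_state else "S"
        return source

    # Only S copies exist: no cache is designated to answer, memory supplies.
    if "S" in others.values():
        states[requester] = "F" if forward_state else "S"
        return "memory"

    # Nobody holds the line: the requester gets it exclusively.
    states[requester] = "E"
    return "memory"


for use_f in (False, True):
    states = {"P0": "I", "P1": "I", "P2": "I"}
    sources = [read_miss(states, p, use_f) for p in ("P0", "P1", "P2")]
    print(sources, states)
# ['memory', 'P0', 'memory'] {'P0': 'S', 'P1': 'S', 'P2': 'S'}
# ['memory', 'P0', 'P1'] {'P0': 'S', 'P1': 'S', 'P2': 'F'}
```

In this model the two variants differ only at the third read: without the F state the line shared by P0 and P1 has to come from memory, with it P1 forwards the line to P2. With only two processors the sequence never reaches that step, which matches the expectation stated above.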
Table I. Comparison of different x86_64 microarchitectures
Processor                               AMD Opteron 238*   Intel Xeon 54**   Intel Xeon 55**
Microarchitecture                       Shanghai           Harpertown        Nehalem-EP
Cache organization                      non-inclusive      inclusive         inclusive
Cache coherency protocol                MOESI              MESI              MESIF
Shared last level cache                 yes                no                yes
Integrated memory controller            yes                no                yes
Point-to-point processor interconnect   yes                no                yes
Native quad-core design                 yes                no                yes
Microarchitecture Enhancement (2)
Instructions per Cycle improvements:
Increased size of the out-of-order window and scheduler
Increased size of the other buffers in the core
Faster Synchronization Primitives
Improved Hardware Prefetch and Better Load-Store Scheduling
Enhanced Branch Prediction:
Faster Handling of Branch Mispredictions
New Second-Level Branch Target Buffer
New Renamed Return Stack Buffer
Westmere
Intel Nehalem microarchitecture processors migrated to 32 nm.
Key features:
Processors with 6 cores supporting 12 threads
Smaller processor core size
New Multi-Chip Package with graphics integrated in processors
Integrated, discrete/switchable graphics support
Advanced Encryption Standard acceleration
Integrated memory controller
Performance Evaluation
2x multithreaded performance at the same power
30% lower power usage for the same performance
15–20% increase in performance per core
50% lower atomic operation latency
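The first two figures can be restated as performance-per-watt ratios. The snippet below is plain arithmetic on the numbers listed above; the helper name `perf_per_watt_gain` is illustrative, and the baseline the slide compares against is not stated, so the results are relative values only.

```python
def perf_per_watt_gain(perf_ratio, power_ratio):
    """Relative performance per watt, given performance and power ratios vs. the baseline."""
    return perf_ratio / power_ratio


# 2x multithreaded performance at the same power
print(perf_per_watt_gain(2.0, 1.0))            # 2.0
# same performance at 30% lower power
print(round(perf_per_watt_gain(1.0, 0.7), 2))  # 1.43
```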
References
“Intel Core i7-800 Processor Series and the Intel Core i5-700 Processor Series Based on Intel Microarchitecture (Nehalem)”, Intel Corporation
“Memory Performance and Cache Coherency Effects on an Intel Nehalem Multiprocessor System”, Daniel Molka, Daniel Hackenberg, Robert Schöne and Matthias S. Müller
“32nm Westmere Family of Processors”, Stephen L. Smith
Wikipedia
Thanks for your attention!