Pentium Memory Hierarchy (by Indranil Nandy, IIT KGP)

HPCA Assignment (Memory Hierarchy of Pentium System). Submitted by: Indranil Nandy, MTech, 2006. Roll no.: 06CS6010

Memory Hierarchy of Pentium System

First, we see a simplified memory hierarchy block diagram of a Pentium System.

In the most simplified form the memory hierarchy of Pentium can be explained as below:

The L1 cache is on-chip and is split into separate data and instruction caches. The data cache is 16 KB, 4-way set associative, and write-through, with 32-byte lines. The instruction cache is also 16 KB and 4-way set associative. The L2 cache is unified, and its size is at least 128 KB. It uses write allocation and is also 4-way set associative.
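The figures above determine how an address is split into a byte offset and a set index. The following is a minimal sketch of that arithmetic; the function name cache_geometry is hypothetical, and only the size, associativity, and line length come from the text.

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Return (number of sets, offset bits, index bits) for a set-associative cache.

    Assumes all three parameters are powers of two.
    """
    lines = size_bytes // line_bytes             # total number of cache lines
    sets = lines // ways                         # each set holds `ways` lines
    offset_bits = line_bytes.bit_length() - 1    # bits selecting a byte within a line
    index_bits = sets.bit_length() - 1           # bits selecting the set
    return sets, offset_bits, index_bits


# L1 data cache from the text: 16 KB, 4-way, 32-byte lines
print(cache_geometry(16 * 1024, 4, 32))  # (128, 5, 7)
```

With these numbers the low 5 address bits pick the byte within a line, the next 7 bits pick one of 128 sets, and the remaining high bits form the tag.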

Now we look at the details. First, we show the more complex, modern Pentium IV memory system:

Block Diagram of Pentium IV memory system

Storage Hierarchy

CPU cache - memory located on the processor chip (VOLATILE)

on-board cache - located on circuit board; fastest external memory available (VOLATILE)

main memory - software managed (VOLATILE)

secondary memory - hard drive (NON-VOLATILE)

slow secondary memory - tapes, diskettes (NON-VOLATILE)

Cache Memory

A cache is a very fast block of memory that speeds up the performance of another device. Frequently used data are stored in the cache. The computer looks in the cache first to see if what it needs is there.

Level 1 Cache is located directly inside the CPU itself, and stores frequently used data or commands. Although relatively small, Level 1 Cache has the most direct effect on overall performance.

Level 2 Cache is located on the motherboard. It stores frequently used data from the computer's main memory (RAM). In Intel Pentium chips, the Advanced Transfer Cache is an improved version of the Level 2 Cache in which the cache memory operates at the same speed as the processor, which is as much as four times the speed of a standard Level 2 Cache.

Pentium:

primary or Level 1 (L1) cache - located in the Pentium processor chip

- 32 KB, with 16 KB for instructions and 16 KB for data

integrated Level 2 (L2) cache, called the Advanced Transfer Cache (EITHER THIS OR NEXT CHOICE) - extra memory located in the Pentium processor chip - accessed by a 256-bit data bus - 8-way set associative

discrete L2 cache (EITHER THIS OR PREVIOUS CHOICE) - connected to the processor with a dedicated 64-bit cache bus - faster than cache-on-motherboard implementations

main memory - SDRAMs

secondary memory - hard drive

slow secondary memory - tapes, diskettes

Cache memory

The Pentium processor has two caches, called the primary or Level 1 (L1) cache and the secondary or Level 2 (L2) cache.

L1 Cache: The L1 cache is where the processor stores frequently accessed instructions or data for faster performance. The Pentium processor incorporates a 32 KB Level 1 cache, consisting of 16 KB for instructions and 16 KB for data. This cache provides the highest information access speeds available. It is non-blocking.

L2 Cache: Working along with the L1 cache, the Pentium processor has either a 512 KB unified, non-blocking Level 2 (L2) cache or an integrated 256 KB Advanced Transfer Cache. The integrated 256 KB L2 cache is located on-die and runs at the core frequency of the processor. The L2 cache is an area of high-speed memory that improves performance by reducing the average memory access time.
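The reduction in average memory access time can be illustrated with the standard formula, hit time plus miss rate times miss penalty, applied level by level. This is only a sketch: every number below (hit times, miss rates, and the main-memory penalty) is an illustrative assumption chosen for the example, not a measured Pentium figure, and the function name amat is hypothetical.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time in cycles for one cache level."""
    return hit_time + miss_rate * miss_penalty


# Illustrative assumptions only (not measured values):
L1_HIT_CYCLES = 2       # assumed L1 hit time
L2_HIT_CYCLES = 7       # assumed L2 hit time
MEMORY_CYCLES = 100     # assumed main-memory penalty
L1_MISS_RATE = 0.05     # assumed fraction of accesses missing in L1
L2_MISS_RATE = 0.20     # assumed fraction of L1 misses that also miss in L2

# Without an L2, every L1 miss pays the full main-memory penalty.
without_l2 = amat(L1_HIT_CYCLES, L1_MISS_RATE, MEMORY_CYCLES)

# With an L2, the L1 miss penalty becomes the average access time of the L2 level.
l2_level = amat(L2_HIT_CYCLES, L2_MISS_RATE, MEMORY_CYCLES)
with_l2 = amat(L1_HIT_CYCLES, L1_MISS_RATE, l2_level)

print(f"without L2: {without_l2:.2f} cycles, with L2: {with_l2:.2f} cycles")
```

Under these assumed numbers the L2 roughly halves the average access time, because most L1 misses are served in a few cycles instead of going to main memory.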

Pentium 4's Cache Organization

Cache Organization in the Memory Hierarchy

There is usually a trade-off between cache size and speed. This is mostly because of the extra capacitive loading on the signals that drive the larger SRAM arrays. Refer to the block diagram of the Pentium 4 memory system. Intel has chosen to keep the L1 caches rather small so that they can reduce the latency of cache accesses. Even a data cache hit will take 2 cycles to complete (6 cycles for floating-point data). We'll talk about the L1 caches in a moment, but further down the hierarchy we find that the L2 cache is an 8-way, unified (includes both instruction and data), 256KB cache with a 128B line size.

The 8-way structure means it has 8 sets of tags, providing about the same cache miss rate as a "fully-associative" cache (as good as it gets). This makes the 256KB cache more effective than its size indicates, since the miss rate of this cache is approximately 60% of the miss rate for a direct-mapped (1-way) cache of the same size.
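The effect of associativity on conflict misses can be illustrated with a toy simulator. This is a minimal sketch, not a model of the real L2: the function name count_misses, the LRU replacement policy, and the synthetic address trace are assumptions; only the 256 KB size, the 128-byte line, and the 1-way versus 8-way comparison come from the text.

```python
from collections import OrderedDict


def count_misses(addresses, size_bytes, ways, line_bytes):
    """Count misses for a set-associative cache with LRU replacement (assumed policy)."""
    num_sets = size_bytes // line_bytes // ways
    # One ordered dict of tags per set; the oldest entry is the least recently used.
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        line = addr // line_bytes
        index = line % num_sets
        tag = line // num_sets
        current = sets[index]
        if tag in current:
            current.move_to_end(tag)          # hit: mark as most recently used
        else:
            misses += 1
            if len(current) >= ways:
                current.popitem(last=False)   # evict the least recently used tag
            current[tag] = True
    return misses


SIZE = 256 * 1024
# Two addresses exactly one cache size apart, accessed alternately.
trace = [0, SIZE] * 100

print(count_misses(trace, SIZE, 1, 128))  # 200: direct-mapped, the two lines evict each other
print(count_misses(trace, SIZE, 8, 128))  # 2: 8-way, both lines fit in the same set
```

The trace is a deliberately bad case for a direct-mapped cache, so the gap here is far larger than the average miss-rate difference quoted above.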

The downside is that an 8-way cache will be slower to access. Intel states that the load latency is 7 cycles (this reflects the time it takes an L2 cache line to be fully retrieved to either the L1 data cache or the x86 instruction prefetch/decode buffers), but the cache is able to transfer new data every 2 cycles (which is the effective throughput assuming multiple concurrent cache transfers are initiated). Again, notice that the L2 cache is shared between instruction fetches and data accesses (unified).

System Bus Architecture is Matched to Memory Hierarchy Organization

One interesting change for the L2 cache is to make the line size 128 bytes, instead of the familiar 32 bytes. The larger line size can slightly improve the hit rate (in some cases), but requires a longer latency for cache line refills from the system bus. This is where the new Pentium 4 bus comes into play. Using a 100MHz clock and transferring data four times on each bus clock (which Intel calls a 400MHz data rate), the 64-bit system bus can bring in 32 bytes each cycle. This translates to a bandwidth of 3.2 GB/sec.
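The bandwidth figure follows directly from the bus parameters given above. A short sketch of the arithmetic, with a hypothetical helper name peak_bandwidth:

```python
def peak_bandwidth(clock_hz, transfers_per_clock, bus_width_bits):
    """Return (bytes moved per bus clock, peak bytes per second)."""
    bytes_per_clock = transfers_per_clock * bus_width_bits // 8
    return bytes_per_clock, bytes_per_clock * clock_hz


# Pentium 4 bus from the text: 100 MHz clock, 4 transfers per clock, 64 bits wide
bytes_per_clock, bytes_per_second = peak_bandwidth(100_000_000, 4, 64)
print(bytes_per_clock)           # 32
print(bytes_per_second / 1e9)    # 3.2
```

This is a peak rate; sustained bandwidth depends on how often the bus is actually kept busy.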

Filling an L2 cache line requires four bus cycles (the same number of cycles as the P6 bus needs for a 32-byte line). Note that the system bus protocol has a 64-byte access length (matching the line size of the L1 cache) and requires 2 main memory request operations to fill an L2 cache line. However, the faster bus only helps overcome the latency of getting the extra data into the CPU from the North Bridge. The longer line size still causes a longer latency before getting all the burst data from main memory. In fact, some analysts note that P4 systems have about 19% more memory latency than Pentium III systems (measured in nanoseconds for the demand word of a cache refill). Smart pre-fetching is critical or else the P4 will end up with less performance on many applications.
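The cycle counts in this paragraph can be checked with a few lines of arithmetic. In this sketch the helper name fill_cycles is hypothetical, and the P6 figure of 8 bytes per bus clock assumes a 64-bit bus with one transfer per clock.

```python
def fill_cycles(line_bytes, bytes_per_bus_clock):
    """Bus clocks needed to transfer one cache line (rounded up)."""
    return -(-line_bytes // bytes_per_bus_clock)


p4_cycles = fill_cycles(128, 32)   # Pentium 4: 128-byte line, 32 bytes per bus clock
p6_cycles = fill_cycles(32, 8)     # P6: 32-byte line, 8 bytes per bus clock (assumed)
requests_per_l2_line = 128 // 64   # 64-byte bus access length

print(p4_cycles, p6_cycles, requests_per_l2_line)  # 4 4 2
```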

Pre-Fetching Hardware Can Help if Data Accesses Follow a Regular Pattern

The L2 cache has pre-fetch hardware to request the next 2 cache lines (256 bytes) beyond the current access location. This pre-fetch logic has some intelligence that lets it monitor the history of cache misses and try to avoid unnecessary pre-fetches (which waste bandwidth and cache space). For regular access patterns, the hardware pre-fetch logic should easily notice the pattern of cache misses and then pre-load data, leading to much better performance on applications such as streaming media (for example, video).
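Why a next-2-lines prefetcher helps regular streams but not irregular ones can be shown with an idealized model. This is a sketch under strong assumptions: prefetched lines arrive instantly, the cache never evicts anything, a prefetch is issued on every access, and the miss-history filtering described above is not modeled. The function name demand_misses is hypothetical.

```python
def demand_misses(line_stream, prefetch_depth):
    """Count demand misses when each access also prefetches the next `prefetch_depth` lines."""
    present = set()   # lines currently in the (unbounded, idealized) cache
    misses = 0
    for line in line_stream:
        if line not in present:
            misses += 1
            present.add(line)
        # Idealized prefetch: the following lines become available immediately.
        for ahead in range(1, prefetch_depth + 1):
            present.add(line + ahead)
    return misses


sequential = list(range(1000))          # streaming access, one line after another
strided = list(range(0, 4000, 4))       # every 4th line, beyond the prefetch reach

print(demand_misses(sequential, 0))  # 1000: no prefetch, every line misses
print(demand_misses(sequential, 2))  # 1: only the first access misses
print(demand_misses(strided, 2))     # 1000: prefetched lines are never used
```

In the strided case the prefetches only waste bandwidth and cache space, which is the situation the miss-history logic tries to avoid.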

Designing for Data Cache Hits

Intel boasts of "new algorithms" to allow faster access to the 8KB, four-way, L1 data cache. They are most likely referring to the fact that the Pentium 4 speculatively processes load instructions as if they always hit in the L1 data cache (and data TLB). By optimizing for this case, there aren't any extra cycles burned while cache tags are checked for a miss. The load instruction is sent on its merry way down the pipeline; if a cache miss delays the load, the processor passes temporarily incorrect data to dependent instructions that assumed the data arrived in 2 cycles. Once the hardware discovers the L1 data cache miss and brings in the actual data from the rest of the memory hierarchy, the machine must "replay" any instructions that had data dependencies and grabbed the wrong data.

The Pentium 4 design seems to have been optimized for the case of streaming media (just as Intel claims), since these algorithms are much more regular and demand high performance. The designers probably hope that the pathological worst case only occurs for code that doesn't need high performance. When the L1 data cache does have a miss, it has a "fat pipe" (32 bytes wide) to the L2 cache, allowing each 64-byte cache line to be refilled in 2 clocks. However, there is a 7-cycle latency before the L2 data starts arriving, as we mentioned previously. The Pentium 4 can have up to four L1 data cache misses in process.
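The refill figure in this paragraph is a simple division, and a rough service time for an L1 miss that hits in L2 can be estimated from it. This sketch treats the 7-cycle figure as the delay before the first bytes arrive, as this paragraph describes it; the function names are hypothetical and the total is an estimate, not an Intel specification.

```python
def l1_refill_clocks(line_bytes=64, path_bytes_per_clock=32):
    """Clocks needed to move one L1 line over the L2-to-L1 path."""
    return line_bytes // path_bytes_per_clock


def l1_miss_service_clocks(l2_latency=7, line_bytes=64, path_bytes_per_clock=32):
    """Rough estimate: wait for L2 data to start arriving, then transfer the line."""
    return l2_latency + l1_refill_clocks(line_bytes, path_bytes_per_clock)


print(l1_refill_clocks())         # 2
print(l1_miss_service_clocks())   # 9
```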

Pentium 4's Trace Cache

The Trace Cache Depends on Good Branch Prediction

Instead of a classic L1 instruction cache, the Pentium 4 designers felt confident enough in their branch prediction algorithms to implement a trace cache. Rather than storing standard x86 instructions, the trace cache stores the instructions after they've already been decoded into RISC-style instructions. Intel calls them "µops" (micro-ops) and stores 6 µops for each "trace line". The trace cache can house up to 12K µops. Since the instructions have already been decoded, hardware knows about any branches and fetches instructions that follow the branch. We know that it's the conditional branches that could really cause a problem, since we won't know if we're wrong until the branch condition check in Arithmetic Logic Unit 0 (ALU0) of the execution core. By then, our trace cache could have pre-fetched and decoded a lot of instructions we don't need. The pipeline could also allow several out-of-order instructions to proceed if the branch instruction was forced to wait for ALU0.
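The two capacity figures in this paragraph imply how many trace lines the cache holds. A small sketch, assuming "12K" means 12 × 1024 µops (the text does not say whether K is 1000 or 1024) and using a hypothetical helper name:

```python
def trace_lines(total_uops, uops_per_line):
    """Number of trace lines needed to hold `total_uops` at `uops_per_line` each."""
    return total_uops // uops_per_line


TOTAL_UOPS = 12 * 1024   # assumption: K = 1024
UOPS_PER_LINE = 6        # from the text

print(trace_lines(TOTAL_UOPS, UOPS_PER_LINE))  # 2048
```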

Hopefully, the alternative branch address is somewhere in the trace cache. Otherwise, we'll have to pay those 7 cycles of latency to get the proper instructions from the L2 cache plus the time to decode the fetched x86 instructions. Intel's reference to the 20-stage P4 pipeline actually starts with the trace cache, and does not include the cycles for instruction or data fetches from system memory or L2 cache.

The Trace Cache has Several Advantages

If predictors work well, then the trace cache is able to provide (the correct) three µops per cycle to the execution scheduler. Since the trace cache is (hopefully) only storing instructions that actually get executed, then it makes more efficient use of the limited cache space. Since the branch target instruction has already been decoded and fetched in execution order, there isn't any extra latency for branches. The person in the back of the room just reminded us of an interesting point. We never mentioned a TLB check for the trace cache, because it does not use one. So, the Pentium 4 isn't so complicated after all. This cache uses virtual addressing, so there isn't any need to convert to physical addresses until we access the L2 cache. Intel documents don't give the size of the instruction TLB for the L2 cache.