A Brief History of the Pentium Processor Family

History of the Pentium Processor

The Pentium family of processors, which has its roots in the Intel486(TM) processor, uses the Intel486 instruction set (with a few additional instructions). The term "Pentium processor" refers to a family of microprocessors that share a common architecture and instruction set. The first Pentium processors (the P5 variety) were introduced in 1993. These 5.0-V processors were fabricated in 0.8-micron bipolar complementary metal oxide semiconductor (BiCMOS) technology. The P5 processor runs at a clock frequency of either 60 or 66 MHz and has 3.1 million transistors.

The next version of the Pentium processor family, the P54C processor, was introduced in 1994. The P54C processors are fabricated in 3.3-V, 0.6-micron BiCMOS technology. The P54C processor also has System Management Mode (SMM) for advanced power management.

The Intel Pentium processor, like its predecessor the Intel486 microprocessor, is fully software compatible with the installed base of over 100 million compatible Intel architecture systems. In addition, the Intel Pentium processor provides new levels of performance to new and existing software through a reimplementation of the Intel 32-bit instruction set architecture using the latest, most advanced design techniques. Optimized, dual execution units provide one-clock execution for "core" instructions, while advanced technology, such as superscalar architecture, branch prediction, and execution pipelining, enables multiple instructions to execute in parallel with high efficiency. Separate code and data caches combined with wide 128-bit and 256-bit internal data paths and a 64-bit, burstable, external bus allow these performance levels to be sustained in cost-effective systems. The application of this advanced technology in the Intel Pentium processor brings "state of the art" performance and capability to existing Intel architecture software as well as new and advanced applications.

The Pentium processor has two primary operating modes and a "system management mode."

The operating mode determines which instructions and architectural features are accessible.

Types of modes:

1. Protected Mode

This is the native state of the microprocessor. In this mode all instructions and architectural features are available, providing the highest performance and capability. This is the recommended mode that all new applications and operating systems should target. Among the capabilities of protected mode is the ability to directly execute "real-address mode" 8086 software in a protected, multi-tasking environment. This feature is known as Virtual-8086 "mode" (or "V86 mode"). Virtual-8086 "mode," however, is not actually a processor "mode"; it is in fact an attribute which can be enabled for any task (with appropriate software) while in protected mode.

2. Real-Address Mode (also called "real mode")

This mode provides the programming environment of the Intel 8086 processor, with a few extensions (such as the ability to break out of this mode). Reset initialization places the processor in real mode where, with a single instruction, it can switch to protected mode.

3. System Management Mode

The Pentium microprocessor also provides support for System Management Mode (SMM). SMM is a standard architectural feature of all new Intel microprocessors, beginning with the Intel386 SL processor. It provides a mechanism, independent of and transparent to the operating system and applications, for implementing system power management and OEM differentiation features. SMM is entered through activation of an external interrupt pin (SMI#), which switches the CPU to a separate address space while saving the entire context of the CPU. SMM-specific code may then be executed transparently. The operation is reversed upon returning.

Advanced Features

The Pentium P54C processor is the product of a marriage between the Pentium processor's architecture and Intel's 0.6-micron, 3.3-V BiCMOS process. The Pentium processor achieves higher performance than the fastest Intel486 processor by making use of the following advanced technologies.

1. Superscalar Execution: The Intel486 processor can execute only one instruction at a time. With superscalar execution, the Pentium processor can sometimes execute two instructions simultaneously.

2. Pipeline Architecture: Like the Intel486 processor, the Pentium processor executes instructions in five stages. This staging, or pipelining, allows the processor to overlap multiple instructions so that it takes less time to execute two instructions in a row. Because of its superscalar architecture, the Pentium processor has two independent processor pipelines.

3. Branch Target Buffer: The Pentium processor fetches the branch target instruction before it executes the branch instruction.

4. Dual 8-KB On-Chip Caches: The Pentium processor has two separate 8-kilobyte (KB) caches on chip--one for instructions and one for data--which allows the Pentium processor to fetch data and instructions from the cache simultaneously.

5. Write-Back Cache: When data is modified, only the data in the cache is changed. Memory data is changed only when the Pentium processor replaces the modified data in the cache with a different set of data.

6. 64-Bit Bus: With its 64-bit-wide external data bus (in contrast to the Intel486 processor's 32-bit-wide external bus), the Pentium processor can handle up to twice the data load of the Intel486 processor at the same clock frequency.

7. Instruction Optimization: The Pentium processor has been optimized to run critical instructions in fewer clock cycles than the Intel486 processor.

8. Floating-Point Optimization: The Pentium processor executes individual instructions faster through execution pipelining, which allows multiple floating-point instructions to be executed at the same time.

9. Pentium Extensions: The Pentium processor adds a few instruction set extensions to those of the Intel486 processor. The Pentium processor also has a set of extensions for multiprocessor (MP) operation. This makes a computer with multiple Pentium processors possible.
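The write-back policy in item 5 can be illustrated with a small model. The sketch below is a toy, direct-mapped cache with one value per line; the class name, the geometry, and the dict used as "memory" are assumptions made for illustration and do not reflect the Pentium's actual two-way, 8-KB organization.

```python
class WriteBackCache:
    """Toy direct-mapped write-back cache over a dict-backed memory.

    Hypothetical model: one value per line, num_lines lines in total.
    """

    def __init__(self, memory, num_lines=4):
        self.memory = memory        # backing store: address -> value
        self.num_lines = num_lines
        self.lines = {}             # index -> (address, value, dirty)
        self.memory_writes = 0      # number of writes that reached memory

    def _evict(self, index):
        # A modified (dirty) line is copied to memory only when replaced.
        if index in self.lines:
            address, value, dirty = self.lines[index]
            if dirty:
                self.memory[address] = value
                self.memory_writes += 1

    def read(self, address):
        index = address % self.num_lines
        line = self.lines.get(index)
        if line is None or line[0] != address:
            self._evict(index)
            self.lines[index] = (address, self.memory.get(address, 0), False)
        return self.lines[index][1]

    def write(self, address, value):
        index = address % self.num_lines
        line = self.lines.get(index)
        if line is None or line[0] != address:
            self._evict(index)
        # Only the cached copy changes; memory stays stale until eviction.
        self.lines[index] = (address, value, True)
```

In this model a write to a cached address never touches memory; the stale memory copy is refreshed only when a different address that maps to the same line forces a replacement.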

A Pentium system, with its wide, fast buses, advanced write-back cache/memory subsystem, and powerful processor, will deliver more power for today's software applications, and also optimize the performance of advanced 32-bit operating systems (such as Windows 95) and 32-bit software applications.

The most important enhancements over the 486 are the separate instruction and data caches, the dual integer pipelines (the U-pipeline and the V-pipeline, as Intel calls them), branch prediction using the branch target buffer (BTB), the pipelined floating-point unit, and the 64-bit external data bus. Even-parity checking is implemented for the data bus and the internal RAM arrays (caches and TLBs).

As for new functions, there are only a few; nearly all the enhancements in Pentium are included to improve performance, and there are only a handful of new instructions. Pentium is the first high-performance microprocessor to include a system management mode like those found on power-miserly processors for notebooks and other battery-based applications; Intel is holding to its promise to include SMM on all new CPUs. Pentium uses about 3 million transistors on a huge 294 mm² (456k mils²) die. The caches plus TLBs use only about 30% of the die. At about 17 mm on a side, Pentium is one of the largest microprocessors ever fabricated and probably pushes Intel’s production equipment to its limits. The integer data path is in the middle, while the floating-point data path is on the side opposite the data cache. In contrast to other superscalar designs, such as SuperSPARC, Pentium’s integer data path is actually bigger than its FP data path. This is an indication of the extra logic associated with complex instruction support. Intel estimates about 30% of the transistors were devoted to compatibility with the x86 architecture. Much of this overhead is probably in the microcode ROM, instruction decode and control unit, and the adders in the two address generators, but there are other effects of the complex instruction set. For example, the higher frequency of memory references in x86 programs compared to RISC code led to the implementation of the dual-ac.

Register set

The purpose of the registers is to hold temporary results and to control the execution of the program. The general-purpose registers in the Pentium are EAX, ECX, EDX, EBX, ESP, EBP, ESI, and EDI.

The 32-bit registers are named with the prefix E (EAX, etc.), and the low 16 bits (0-15) of these registers can be accessed with names such as AX and SI. Similarly, the lower eight bits (0-7) can be accessed with names such as AL and BL, and the higher eight bits (8-15) with names such as AH and BH. The instruction pointer EIP, known as the program counter (PC) in 8-bit microprocessors, is a 32-bit register that handles 32-bit memory addresses; its lower 16 bits, IP, are used for 16-bit memory addresses.
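The way the 16-bit and 8-bit names overlay a 32-bit register can be shown with a little bit arithmetic. The sketch below is only an illustration in Python; the function names are hypothetical.

```python
def sub_registers(eax):
    """Return the AX, AH, and AL views of a 32-bit EAX value."""
    eax &= 0xFFFFFFFF
    ax = eax & 0xFFFF         # bits 0-15
    al = eax & 0xFF           # bits 0-7
    ah = (eax >> 8) & 0xFF    # bits 8-15
    return {"AX": ax, "AH": ah, "AL": al}


def write_al(eax, value):
    """Writing AL replaces bits 0-7 and leaves bits 8-31 unchanged."""
    return (eax & 0xFFFFFF00) | (value & 0xFF)
```

For example, with EAX holding 0x12345678, AX is 0x5678, AH is 0x56, and AL is 0x78; writing 0xFF to AL yields 0x123456FF.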

The flag register is a 32-bit register; however, only 14 bits are used at present, for 13 different flags. These flags are upward compatible with those of the 8086 and 80286. Comparing the flags available in the 16-bit and 32-bit microprocessors gives some clues to the capabilities of these processors. The 8086 has 9 flags, the 80286 has 11 flags, and the 80386 has 13 flags. All of these flag registers include six flags related to data conditions (sign, zero, carry, auxiliary carry, overflow, and parity) and three flags related to machine operations (interrupts, single-step, and strings). The 80286 has two additional flags: I/O Privilege and Nested Task. The I/O Privilege flag uses two bits in protected mode to determine which I/O instructions can be used, and the Nested Task flag is used to show a link between two tasks.

The processor also includes control registers, system address registers, and debug and test registers for system and debugging operations.
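As a small illustration of how these flags sit in the register, the sketch below decodes the nine 8086 flags plus the two 80286 additions from a flags value. The bit positions are the standard x86 ones, which the text above does not list; the function and table names are hypothetical.

```python
# Standard x86 bit positions for the single-bit flags discussed above.
FLAG_BITS = {
    "CF": 0,   # carry
    "PF": 2,   # parity
    "AF": 4,   # auxiliary carry
    "ZF": 6,   # zero
    "SF": 7,   # sign
    "TF": 8,   # single-step (trap)
    "IF": 9,   # interrupt enable
    "DF": 10,  # string direction
    "OF": 11,  # overflow
    "NT": 14,  # nested task (80286 and later)
}


def decode_flags(value):
    """Split a flags value into named booleans plus the two-bit IOPL field."""
    flags = {name: bool((value >> bit) & 1) for name, bit in FLAG_BITS.items()}
    flags["IOPL"] = (value >> 12) & 0b11   # I/O privilege level, bits 12-13
    return flags
```

The I/O Privilege field is the only one of these that occupies two bits, which is why the count of bits in use is one higher than the count of flags.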

Addressing mode & Types of instructions

The instruction set is divided into 9 categories of operations and has 11 addressing modes. In addition to the instructions commonly available in an 8-bit microprocessor, this set includes operations such as bit manipulation and string operations, high-level language support, and operating system support. An instruction may have 0-3 operands, and an operand can be 8, 16, or 32 bits long. The 80386 handles various types of data, such as single bits, strings of bits, signed and unsigned 8-, 16-, 32-, and 64-bit data, ASCII characters, and BCD numbers.

The high-level language support group includes instructions such as ENTER and LEAVE. The ENTER instruction is used on entry to a high-level language procedure: it allocates memory on the stack for the routine being entered and manages the stack frame. The LEAVE instruction reverses this on exit, releasing the stack frame before the procedure returns. The operating system support group includes several instructions, such as ARPL (Adjust Requested Privilege Level) and VERR/VERW (Verify Segment for Reading or Writing). ARPL is designed to prevent a less privileged caller from using the operating system to gain access to segments at a higher privilege level than its own, and the VERR/VERW instructions verify whether the specified segment can be read or written from the current privilege level.

Superscalar Instruction Pairing Rules

Pentium can issue two integer instructions per clock cycle so long as they satisfy the following constraints:

1. Both instructions must be “simple.”

2. There must be no read-after-write or write-after-write register dependencies.

3. Neither instruction may contain both a displacement and an immediate value.

4. Instructions with prefixes (other than jump-conditional with 16/32-bit prefix) can occur only in the U-pipeline.

For the purposes of these rules, simple instructions are:

• MOV register ← register/memory/immediate
• MOV memory ← register/immediate
• ALU-op register ← register/memory/immediate
• ALU-op memory ← register/immediate
• INC register/memory
• DEC register/memory
• PUSH register/memory
• POP register
• LEA register/memory
• JUMP/CALL/Jcc near
• NOP

These simple instructions are hardwired and execute in a single clock cycle except for “ALU-op register ← memory,” which takes two clocks, and “ALU-op memory ← register/immediate,” which takes three. Another exception to the pairing rules occurs for shifts: they can be executed only in the U-pipeline, so they must be the first instruction in a pair.

Implicit register dependencies (usually based on the condition codes) can also prevent dual-instruction issue.
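The explicit rules above can be written as a small predicate. This is a minimal sketch, not Intel's issue logic: the `Instr` record and its fields are hypothetical, registers are tracked only as sets of names, and the jump-conditional prefix exception is left out.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    """Hypothetical summary of one decoded instruction."""
    name: str
    simple: bool = True               # appears in the "simple" list
    reads: frozenset = frozenset()    # registers read explicitly
    writes: frozenset = frozenset()   # registers written explicitly
    has_displacement: bool = False
    has_immediate: bool = False
    has_prefix: bool = False
    is_shift: bool = False            # treated as pairable in the U-pipe only


def can_pair(u, v):
    """True if u (U-pipe, first) and v (V-pipe, second) may issue together."""
    # Rule 1: both instructions must be simple.
    if not (u.simple and v.simple):
        return False
    # Rule 2: no read-after-write or write-after-write on registers.
    if (v.reads & u.writes) or (v.writes & u.writes):
        return False
    # Rule 3: neither may carry both a displacement and an immediate.
    if any(i.has_displacement and i.has_immediate for i in (u, v)):
        return False
    # Rule 4 and the shift exception: prefixed instructions and shifts
    # may only occupy the U-pipe, so they cannot be the second instruction.
    if v.has_prefix or v.is_shift:
        return False
    return True
```

Under this model, `mov eax, ebx` followed by `add ecx, eax` does not pair (read-after-write on EAX), while `mov eax, ebx` followed by `add ecx, edx` does. Implicit dependencies through the condition codes are not modeled.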

For example, an ALU instruction that sets the carry flag cannot be paired together with an ALU instruction that reads the carry flag.

There are, however, two important exceptions which allow dependent instructions to be paired. The first exception allows a compare and conditional branch that tests the result of the compare to be paired, while the second allows pairs of pushes or pops to be paired. Branch prediction helps the compare/conditional-branch case, and special hardware is included to resolve the dependency on the stack pointer for pushes and pops.

In general, an integer and floating-point instruction pair, or a pair of floating-point instructions, cannot be simultaneously issued. There is one exception: a simple floating-point load, arithmetic, or compare can be paired with an FXCH (floating-point exchange) instruction. The FXCH must be the second instruction in the pair. If an integer instruction immediately follows the FXCH, it will stall for one or four clocks depending on the operands to the pair of floating-point instructions. Simple floating-point instructions are:

• FLD single/double, FLD ST(i),
• all forms of FADD, FSUB, FMUL, FDIV,
• all forms of FCOM, FUCOM, FTST, FABS, and FCHS.

Microprocessor Report: Intel Reveals Pentium Implementation Details, Vol. 7, No. 4, March 29, 1993 © 1993 MicroDesign Resources

of the first Pentium systems, its inclusion in Pentium shows that Intel has made good on its commitment to include SMM as a part of its mainstream x86 processors. A future 3.3V version of Pentium will raise the importance of SMM.

Microarchitecture Overview

Superficially, Pentium’s microarchitecture looks like a superscalar version of the 486. Even though Pentium has two integer pipelines, the basic five-stage pipeline structure is unchanged from the 486, as shown in Figure 3. (Note that Figure 3 is a functional diagram, not a timing diagram; thus, a multiplexer may be shown where one is not actually present.) U is the “default” pipeline when two instructions cannot be issued simultaneously, and the U pipe is slightly more powerful since it has a barrel shifter.

The pipelines are similar to the 486’s: each pipeline begins with instruction prefetching, which loads prefetch buffers, and instruction decoding is spread out over two pipeline stages to accommodate some of the more semantically rich (i.e., complex) instructions. The last two stages are the traditional execute and writeback pipeline phases.

While Pentium uses the high-level structure of the 486 pipeline, there are many subtle implementation differences. For example, the total prefetch capacity has been increased by a factor of four, and the address adders in the D2 stage have four instead of three inputs to reduce by one the number of cycles for some complex addressing modes.

Prefetch Stage

The prefetch stage incorporates one of the most significant enhancements of Pentium over the 486: a separate, 8K instruction cache. The cache has a two-way set-associative organization with LRU (least-recently used) replacement and a line size of 32 bytes. Intel chose two-way set associativity as a compromise between performance and implementation constraints. Also, note that Pentium has two two-way set-associative caches vs. the 486’s single four-way cache.

Full coherency is maintained between the on-chip caches and external memory with hardware snooping. The instruction cache tags are triple-ported: one port is for snooping operations while the other two are used for the split fetch capability (described below). This means the snooping hardware and the processor can access the cache simultaneously with no contention. The cache implements parity, one bit per eight bytes of data and one bit per tag.

Of course, having a separate instruction cache improves instruction fetch efficiency because data and instruction accesses do not compete for a single cache resource, but Pentium further improves instruction fetching by implementing a “split fetch” capability not present in the 486. (Split fetching was first implemented in the 960CA.) Split fetching gives Pentium the ability to fetch a contiguous block of instruction bytes even if the block is split across two instruction-cache lines. As shown by the worst-case alignment scenario in Figure 4, this allows a minimum of 17 bytes to be fetched from the cache because a fetch can straddle the boundary between two consecutive half-lines. According to Intel’s measurements, the split fetch capability improves Pentium performance by a few percent.

Figure 3. Block diagram of the major Pentium integer pipeline resources. (The diagram spans the PF, D1, D2, EX, and WB stages and shows the 8K, 2-way instruction cache with ITLB, the branch target buffer, two prefetch buffers, the U- and V-decoders, the code ROM, the U-pipe and V-pipe address adders and register reads, the 8K, 2-way data cache with DTLB, the U-pipe ALU/shifter and V-pipe ALU, the U-pipe and V-pipe register writes, and the misprediction comparators.)

Pentium’s split fetching is an example of an important technique for superscalar processors: eliminating instruction-fetch alignment restrictions. In a superscalar processor, the goal is to simultaneously issue and execute the maximum allowable number of instructions as often as possible. Other superscalar processors also implement some form of split fetching (although different names are used) to make sure that instruction fetching is not the limiting factor. All other existing superscalar processors, however, are RISCs. The word-alignment of RISC instructions results in less complex logic to eliminate alignment restrictions. The split-fetching logic, which must take care of byte-aligned x86 instructions, is one place where Pentium pays a price for the complex x86 architecture.

The instruction TLB is four-way set-associative, has 32 entries, and uses a pseudo-LRU replacement algorithm; ITLB misses are handled in hardware. The dedicated ITLB allows the I-cache to be physically tagged, which reduces the frequency of I-cache flushes. The 486 also indexes its cache with physical addresses.

Instruction bytes that are fetched from the I-cache are aligned, if necessary, and stored in one of the four prefetch buffers. Each buffer is the length of one cache line (32 bytes) for a total of 128 bytes. In contrast, the 486 has only 32 bytes of prefetch buffer.

Coupled with the dedicated instruction cache, the prefetch buffers should virtually guarantee that Pentium never waits for instruction bytes, except in the case of cache misses and mispredicted branches. In situations where the 486 would stall waiting to fill its prefetch buffer, Pentium will continue executing.

First Decode Stage

The major function of the D1 stage is instruction decoding.

Of course, Pentium is designed to decode in hardware as many of the most frequently occurring instructions as possible. Even the memory-to-register and register-to-memory arithmetic operations, which are rather complex at least by RISC standards, do not require microcode assistance for their processing. Instead, a single, internal microword is generated by the D1 decoding logic that triggers a simple hardware state machine in the EX stage. Thus, while memory/register operations do not require microcode, they do still require sequencing and multiple cycles.

For instructions that are complex enough to require a microcode routine, the first microword is always generated by the D1 decoding logic. In contrast, the D1 decoding logic in the 486 generates the first microcode ROM address. Thus, Pentium achieves at least some speedup over the 486 for microcoded instructions by directly generating the first microword.

For microcoded instructions, the first microword proceeds to the D2 stage, where the microcode engine takes over the Pentium execution resources. As shown in Figure 3, microwords from the microcode ROM control both integer pipelines; consequently, the pipelines operate independently only for pairs of instructions that use hardwired control. Intel has, of course, written the microcode routines to take maximum advantage of the dual pipelines.

This allows Pentium to reduce the number of cycles needed for many of the complex x86 instructions. For example, repeated string move instructions execute at one clock per iteration, compared to three clocks on the 486. The Pentium microcode actually contains an unrolled loop that writes the element of the destination string in the U pipeline in parallel with the reading of the next source string element in the V pipeline.

Pentium microwords are 92 bits long, and the microcode ROM contains about 4K microwords. Since microcoded routines take over all the execution resources, it is not possible for Pentium to pair microinstructions with regular x86 instructions. Thus, instruction fetching and dispatch are stalled during the execution of a complex, microcoded instruction.

In Figure 3, the circle containing the equal sign between the two inputs to the decoder blocks represents logic that detects resource conflicts. Situations such as register dependencies that require serial execution are detected here. When a conflict is detected, the instruction at the head of the U pipeline gets priority.

Branch prediction, also a major function of the D1 stage, is covered below.

Figure 4. The instruction cache allows “split fetching” across the boundary from one cache line to the next. The worst-case situation, as shown, still delivers 17 bytes in a single cache access. (The figure shows consecutive 32-byte I-cache lines X–1 through X+2, each divided into two 16-byte halves.)
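The 17-byte worst case in Figure 4 follows from simple arithmetic. The sketch below assumes that one access returns the 16-byte half-line containing the fetch address plus the next consecutive half-line, even when that half-line belongs to the following cache line; this is a reading of the figure rather than a documented detail, and the function name is hypothetical.

```python
HALF_LINE = 16  # bytes; two half-lines make up one 32-byte I-cache line


def bytes_delivered(fetch_address):
    """Contiguous bytes available from one access starting at fetch_address.

    Assumes the access returns the half-line holding fetch_address and the
    half-line immediately after it.
    """
    offset = fetch_address % HALF_LINE
    return 2 * HALF_LINE - offset
```

A fetch aligned to the start of a half-line yields all 32 bytes; a fetch that begins at the last byte of a half-line yields that byte plus the 16 bytes that follow, for a total of 17.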

Second Decode Stage

The D2 stage is an artifact of the x86 architecture. Since so many of the instructions specify a multi-component address computation, it makes sense to have dedicated resources and a separate pipeline stage in which to perform the address addition.

Each of the two integer pipelines has a dedicated, four-input address adder. Four inputs are needed because x86 operand addresses can consist of a segment descriptor base, a base address from a general register, an index from a general register (possibly scaled), and a displacement from the instruction. The 486 address adder has only three inputs; thus, some instructions that spend only one cycle in D2 on Pentium will spend two cycles in the D2 stage on the 486. (In Figure 3, the address adders are drawn with only two inputs simply to save space.)

What is not shown in the D2 stage in Figure 3 is the separate, four-input segment limit-check adder. Architecturally, x86 addressing requires that all segment accesses be checked against the limit stored in the segment descriptor. This check requires a separate four-component addition, and Pentium has yet two more four-input adders to perform this check in parallel. As with the address adders, the 486 limit-check adders have only three inputs. While the need for this hardware probably has little or no effect on the cycle time of the Pentium implementation, it certainly requires significant area and power. This is another way that Pentium pays for the complexity of the x86 architecture.
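The four components that feed the address adder, and the accompanying limit check, can be sketched in a few lines. The function names are hypothetical, and the limit check assumes a plain expand-up segment with a byte-granular limit, which is a simplification of the full x86 rules.

```python
MASK32 = 0xFFFFFFFF


def effective_offset(base=0, index=0, scale=1, displacement=0):
    """Offset within the segment: base + index * scale + displacement."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("x86 index scale must be 1, 2, 4, or 8")
    return (base + index * scale + displacement) & MASK32


def linear_address(segment_base, offset):
    """Fourth input: add the segment descriptor base to the offset."""
    return (segment_base + offset) & MASK32


def within_limit(offset, size, limit):
    """Assumed expand-up check: the last byte accessed must not exceed limit."""
    return offset + size - 1 <= limit
```

For instance, a base of 0x1000, an index of 3 scaled by 4, and a displacement of 0x10 give an offset of 0x101C; with a segment base of 0x00400000 the linear address is 0x0040101C. On the Pentium the four-input adder forms this sum in one pass, whereas a three-input adder needs an extra cycle when all four components are present.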

The other major function of the D2 stage is reading operands from the register file for use by the ALUs in EX.

Execute Stage

The execute stage contains the ALUs and the data cache. The U-pipe has a full ALU and a barrel shifter, while the V-pipe has only a full ALU. Thus, all shift instructions must be processed in the U-pipe, and the logic in the D1 stage that detects resource requirements takes care of enforcing this rule.

The data cache is one of Pentium’s most interesting features. Like the instruction cache, it is a two-way set-associative, 8K cache with a 32-byte line size. A MESI coherency protocol is used to keep caches coherent in a multiprocessor system. As mentioned earlier, the cache tags are triple-ported to allow concurrent snooping and dual access by the pipelines. The cache has a parity bit for each tag and each byte of data.

As explained in 061201.PDF, this dual-access capability, which lets both pipelines access the data cache simultaneously, is implemented by interleaving the data array into eight banks (four-byte granularity within a 32-byte cache line). As long as the data accesses from each pipe are to separate banks, both accesses can be processed simultaneously by the cache in a single cycle. This capability is not provided by any other existing microprocessor. (The circle containing the equal sign between the two inputs to the cache and DTLB represents the bank conflict detection logic.)

Since the cache stores physical tags, it is also necessary that the data TLB be able to perform two address translations simultaneously. This capability is provided by the dual-ported, 64-entry, four-way set-associative DTLB.

The DTLB stores translations for the standard 4K pages of the 386 architecture. There is a separate eight-entry, four-way set-associative DTLB for 4M pages that is also dual ported. Large-page mapping is standard on all high-end processors and is useful because mapping graphics frame buffers and operating-system segments can be done with only one 4M translation entry instead of many 4K entries. This keeps frame-buffer references from “polluting” the main TLB.

Most instruction dependencies are resolved in D1, but there is one important case that is resolved in EX: two register/memory operations. In this case, the two instructions are simultaneously issued into the U and V pipelines, and they proceed concurrently to the EX stage. Once there, however, Pentium forces serialized execution, as shown in Table 2. All pairs of register/memory instructions are serialized in the EX stage to avoid the complexity of checking for dependencies. Even though the instructions are serialized, the overlap of the store of the first and the load of the second at cycle n+2 saves one clock.

Table 2. Execute-stage activity for two register-to-memory instructions.

Cycle    U Pipeline    V Pipeline
n        load          -idle-
n + 1    ALU           -idle-
n + 2    store         load
n + 3    -idle-        ALU
n + 4    -idle-        store

In general, the U and V pipes will be simultaneously executing separate instructions only if the instructions they contain are independent. The exceptions are register/memory operations (which get sequenced and serialized in hardware as just described), stack operations (any combination of push and pop), and compare/conditional-branch.

The compare/conditional-branch situation is allowed because branch prediction will likely provide the branch target anyway. If branch prediction is correct, a cycle is saved by pairing the compare and the conditional branch. Since most compare/conditional-branch pairs that occur during program execution will be in loops, and since most loops execute many times, branch prediction should perform very well for this situation.

Note that if the U-pipe contains any kind of branch, the V-pipe will be idle.

Writeback Stage

The major function of the WB stage is to provide a time slot for writing results of computations and loads into the register file. This is shown conceptually in Figure 3 with separate boxes in the WB stage, but actually there is, of course, only a single register file.

Branch Prediction

Pentium uses a BTB (branch target buffer) for its branch-prediction algorithm. All taken branches are buffered. As shown in Figure 3, the BTB is accessed in stage D1 with the linear address (with the segment calculations done but not translated by TLB) of the branch instruction itself. The BTB stores a single predicted target for a branch. As shown in Figure 5, the BTB cache stores 256 branch predictions with a four-way set-associative organization. Note that this is different from the branch target cache in the 29000, which stores the first few instructions at the branch target. Pentium’s BTB stores target addresses only.

Intel simulated several branch prediction algorithms, finally settling on the method described in a paper from the University of Wisconsin (J. Lee and A. J. Smith, “Branch Prediction Strategies and Branch Target Buffer Design,” IEEE Computer, January 1984, pp. 6–22). This algorithm uses two bits to hold the prediction state, with transitions between the four states occurring as necessary when a branch is encountered.

Figure 6 shows the state-transition diagram. The four states are ST (strongly taken), WT (weakly taken), WNT (weakly not taken), and SNT (strongly not taken). Each time there is a hit in the BTB (though not necessarily a correct prediction), the state bits are updated. When the state bits are either ST or WT, the next prediction for the given branch will be “taken,” and WNT and SNT mean the next prediction will be “not taken.”

The two middle states provide a degree of misprediction hysteresis to avoid thrashing in certain cases. The hysteresis is provided by the fact that it takes two consecutive incorrect predictions to change the prediction polarity. For example, a branch that has been correctly predicted as not-taken many times in a row will continue to be predicted as not-taken even if the branch is occasionally taken.

The BTB allocation policy is that an uncached branch allocates an entry in the cache only if it is a taken branch (i.e., no allocate on miss). As a result, the state bits are always initialized to ST for a newly allocated branch. Branches that cause a miss in the BTB are initially assumed (predicted) to be not-taken.

As an example of the prediction state transition operation, if this newly allocated branch is not taken the next time it is encountered, its state bits will make a transition to WT. The next prediction will thus be “taken,” but if this is also a misprediction, the prediction state will make the transition to WNT. The next prediction will be “not taken,” and so on.

Down the left side of Figure 3 is a (very simplified) pipeline path that is used to verify branch prediction. The predicted direction for the branch is carried along with the branch instruction as it moves through the pipeline. As soon as possible, the prediction and the actual direction taken are compared. For unconditional branches in the V pipeline and all branches in the U pipeline, the comparator (circle with equal sign) in the EX stage does the check. For conditionals in V, the check is made by the comparator in WB to allow resolution of a possible paired “compare” in the U pipe.

When an incorrect prediction is discovered or when the predicted target is wrong, the pipelines are flushed and the correct target fetched. Thus, based on the stage in which the misprediction is discovered, mispredicted unconditionals and U-pipeline conditionals incur a three-clock delay, while V-pipeline conditionals incur a four-clock delay.

Intel has made some measurements of branch behavior on Pentium. For the programs in the SPEC89 suite, the percent of dynamic branches correctly predicted is between 75% and 85%, including not-taken branches that miss. The branch distribution between pipelines appears to be balanced at about 50% for each pipeline on code produced by both 486-optimized and Pentium-optimized compilers.
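The two-bit prediction scheme and the allocation policy described in this section can be modeled compactly. The sketch below is an illustration only: the class and method names are hypothetical, the buffer is an unbounded dict rather than a 256-entry, four-way structure, and the stored target is refreshed on every taken branch, which is a simplification.

```python
STATES = ["SNT", "WNT", "WT", "ST"]  # indexes 0..3; 2 and 3 predict "taken"


class BranchTargetBuffer:
    """Toy BTB: branch address -> [state index, target address]."""

    def __init__(self):
        self.entries = {}

    def predict(self, branch_address):
        """Return (predicted_taken, predicted_target)."""
        entry = self.entries.get(branch_address)
        if entry is None:
            return False, None        # a miss is predicted not-taken
        state, target = entry
        return state >= 2, target

    def update(self, branch_address, taken, target=None):
        """Record the actual outcome of a branch."""
        entry = self.entries.get(branch_address)
        if entry is None:
            if taken:
                # Only taken branches allocate, and they start in ST.
                self.entries[branch_address] = [3, target]
            return
        # Saturating move toward ST when taken, toward SNT when not taken.
        entry[0] = min(entry[0] + 1, 3) if taken else max(entry[0] - 1, 0)
        if taken:
            entry[1] = target

    def state(self, branch_address):
        entry = self.entries.get(branch_address)
        return None if entry is None else STATES[entry[0]]
```

Tracing the example from the text: a newly allocated branch starts in ST; one not-taken outcome moves it to WT, so the next prediction is still "taken"; a second not-taken outcome moves it to WNT, and the prediction flips to "not taken."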

Update history onallocate or hitFigure 5. Pentium branch target buffer (BTB) structure.ST WT WNT SNTNot Taken Not Taken Not TakenTaken Taken TakenTakenNot TakenFigure 6. Prediction history bit state transition diagram.8 Intel Reveals Pentium Implementation Details Vol. 7, No. 4, March 29, 1993 © 1993 MicroDesign ResourcesM I C R O P R O C E S S O R R E P O R TFast Floating-PointIn any benchmark comparison of high-performanceprocessors, the 486 stands up reasonably well in integerresults but trails dramatically in floating point performance.Preliminary benchmark figures from Intel indicatethat Pentium will compete on a more even footingwith other processors in both integer and floating pointperformance (see 070401.PDF).Pentium’s floating-point performance is vastly improvedover the 486 because the simple, serial floatingpointunit of the 486 is replaced with fully pipelined, parallelexecution units. The FPU pipeline is eight stages,where the first five are shared with the integer pipeline:• PF (prefetch)• D1 (instruction decode)

• D2 (address generation)
• EX (memory and register read, memory write if FP store instruction)
• X1 (FP execute first stage, write operand to FP register file if FP load)
• X2 (FP execute second stage)
• WF (rounding and write result to FP register file)
• ER (error reporting, update status word)

This pipeline structure is similar to that of other high-performance processors. For example, the PowerPC 601 has a six-stage floating-point pipeline. As shown in Table 3, Pentium has floating-point operation latency and throughput that are comparable to other processors for basic arithmetic operations.

As with most other high-performance processors, Pentium allows concurrency between the floating-point and integer units. Thus, the issue and execution of integer instructions can proceed in parallel with a long-latency floating-point operation.

One area where Pentium may actually feel some competition is in the Windows NT market (see 0704ED.PDF). From Table 3, it is tempting to conclude that Pentium could approximately match the floating-point performance of low-end implementations of its Windows NT competitors (see benchmark results in 070401.PDF). Pentium is hampered, however, by its

stack-oriented floating-point register file architecture and by the need to transfer floating-point condition codes to the integer unit before a conditional branch can be executed.

For floating-point operands, Pentium maintains backward compatibility with previous x86 FPUs: there is a file of eight 80-bit operand registers that are conceptually a stack and only marginally directly addressable. Since most floating-point instructions implicitly use the top of this register stack as one operand, there is a “top-of-stack bottleneck.” To circumvent this, programs use the FXCH (floating-point register exchange) instruction to swap the top of stack with an operand deeper in the register file.

Pentium’s designers added logic to allow superscalar issue and execution for a simple floating-point operation followed by an FXCH (see sidebar above). This is the only case of superscalar issue for floating-point instructions and is subject to the restriction that the first instruction must be “simple” and the FXCH must be the second instruction in the pair.

Even with the rapid execution of an FP-operation/FXCH pair, Pentium will be hampered by the small, eight-register file. In addition, an FP-operation/

FXCH pair followed immediately by an integer instruction will incur a one-cycle penalty.

Another performance problem for Pentium is presented by branching on floating-point conditions. Most microprocessor architectures allow the results of a floating-point comparison to be tested directly, but the x86 architecture requires that the floating-point condition codes be transferred to the integer condition-code register, where a normal integer conditional branch can test them. Effecting a floating-point conditional branch requires four instructions:

1. An FP operation that sets the condition codes
2. FSTSW AX (move FP status word to AX register)
3. SAHF (transfer AH, the upper half of AX, to the low byte of EFLAGS)
4. Jcc (integer jump conditional)

This sequence takes nine clock cycles to execute on Pentium because the floating-point condition codes are updated late in the floating-point pipeline. Four of these clocks can be recovered by inserting integer instructions between instructions 1 and 2.

Although many floating-point loops iterate based on an integer condition, such as a loop count equal to the number of elements in an array, the need to transfer condition codes from the FPU to the integer unit creates a

significant penalty for the case of loops with a floating-point termination condition, and for if-then statements with floating-point conditions.

In the final analysis, Pentium will bring a new level of floating-point performance to the PC market. It will not, however, outperform its Windows NT competitors because of the weaknesses of its floating-point architecture and because the R4000 and Alpha processors will be operating at much higher raw clock speeds.

Table 3. Floating-point latencies/throughputs for some modern microprocessors. Times are for double-precision operations except for Pentium, which supports an 80-bit internal format. *For pairs of back-to-back multiplies and adds, Pentium has a throughput of one instead of two.

Processor      FP Add       FP Sub       FP Mult    FP Div
Pentium        3/1          3/1          3/2*       39/39
486            8-20/8-20    8-20/8-20    16/16      73/73
R4000          4/3          4/3          8/4        36/36
Alpha          4/1          4/1          4/1        61/61
PowerPC 601    4/1          4/1          4/2        31/29

Conclusions

Pentium solidifies Intel’s position as the premier supplier of advanced microprocessors for the PC market. While it will be expensive and difficult to manufacture in volume at first, Pentium uses advanced processor implementation techniques while maintaining full compatibility with the installed base of x86 application software. The superscalar integer unit, separate caches, branch prediction, and pipelined floating-point unit are all significant performance enhancements to the 486. Pentium’s snooping-based MESI cache-coherency protocol makes it appealing for multiprocessor implementations.

Like many of the current generation of high-end microprocessors, Pentium integrates a huge number of transistors. At three million transistors, its only peers in this transistor-count league are SuperSPARC (3.1 million) and the PowerPC 601 (2.8 million). Those processors, however, have at least twice as much total cache and are more aggressive in other ways. It appears that many of Pentium’s “extra” transistors are spent on things like internal parity, the triple-ported cache tag arrays, dual-ported TLBs, and adders for multi-component addressing modes and segment limit checking.

Certainly, a significant amount of Pentium’s complexity is the result of the complex x86 instruction set. The four-input address adders, microcode ROM, extra decode pipeline stage, and register/memory sequencing logic in the execute stage are all extra complexity not present in RISC processors.

While any x86 program will benefit from Pentium’s performance features, the full performance potential will be realized only for programs that are structured to take maximum advantage of Pentium’s capabilities. Instruction sequences must be carefully selected to use the instructions that can be dual-issued and, as shown in the

floating-point conditional-branch example above, scheduled to fill all available execution slots.

Pentium is a significant microprocessor milestone. It implements sophisticated caching, multiprocessor support, and branch prediction. It is also the first superscalar CISC microprocessor and the first high-end microprocessor to implement a system-management mode. The Pentium core will be around for many years to come because Intel will be able to exploit it by offering an array of microprocessors with varied cache sizes, bus widths, and bus speeds. As for Pentium’s technological position in the marketplace, some RISCs will be faster or cheaper or both, but with x86 compatibility, multiprocessor support, and significant performance gains over the 486, Pentium will satisfy most users’ needs.
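To put the scheduling point in rough numbers, the floating-point conditional-branch sequence discussed earlier can be modeled as follows. The nine-clock total and the four recoverable clocks come from the article; the assumption that each clock of independent integer work placed between the compare and FSTSW hides exactly one stall clock is a simplification made here, and the name `exposed_clocks` is hypothetical.

```python
# Figures quoted in the article for the FP compare / FSTSW AX / SAHF / Jcc
# sequence on Pentium.
SEQUENCE_CLOCKS = 9   # clocks for the whole four-instruction sequence
MAX_RECOVERABLE = 4   # clocks that useful integer work can fill


def exposed_clocks(filler_clocks: int) -> int:
    """Clocks of the sequence that are not overlapped with useful work.

    Assumption (not from the article): every clock of independent integer
    work scheduled between the FP compare and FSTSW hides one stall clock,
    up to the four-clock limit.
    """
    if filler_clocks < 0:
        raise ValueError("filler_clocks must be non-negative")
    return SEQUENCE_CLOCKS - min(filler_clocks, MAX_RECOVERABLE)


if __name__ == "__main__":
    for n in range(6):
        print(f"{n} filler clock(s): {exposed_clocks(n)} clocks exposed")
```

Under this model, unscheduled code pays the full nine clocks, while code with at least four clocks of independent integer work pays five; adding more filler beyond that does not help.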
