The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.


Page 1: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

The AMD Opteron

Henry Cook

Kum Sackey

Andrew Weatherton

Page 2: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

Presentation Outline

• History and Goals

• Improvements

• Pipeline Structure

• Performance Comparisons

Page 3: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

K8 Architecture Development

• The Nx586, March 1994
  – Superscalar
  – Designed by NexGen
  – Manufactured by IBM
  – 70-111MHz
  – 32KB L1 cache
  – 3.5 million transistors
  – .5 micron process

Page 4: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

K8 Architecture Development

• AMD SSA/5 (K5), March 1996
  – Built by AMD from the ground up
    • Superscalar architecture
    • Out-of-order speculative execution
    • Branch prediction
    • Integrated FPU
    • Power management
  – 75-117MHz
  – Ran “hot”
  – 34KB L1 cache
  – 4.5 million transistors
  – .35 micron process

Page 5: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

K8 Architecture Development

• AMD K6 (1997)
  – Based on NexGen's RISC86 core (in the Nx586)
  – 166-300MHz
  – 84KB L1 cache
  – 8.8 million transistors
  – .25 micron process

Page 6: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

K8 Architecture Development

• AMD K6 (1997) continued
  – Advantages of K6 over K5:
    • RISC86 core translates complex x86 instructions into shorter ones, allowing the K6 to reach higher frequencies than the K5 core.
    • Larger L1 cache.
    • New MMX instructions.
  – AMD produced both desktop and mobile K6 processors, the only difference being a lower processor core voltage for the mobile part

Page 7: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

K8 Architecture Development

• First AMD Athlons, K7 (June 23, 1999)
  – Based on the K6 core
  – Improved the K6’s FPU
  – 128 KB (2x64 KB) L1 cache
  – Initially 500-700MHz
  – 8.8 million transistors
  – .25 micron process

Page 8: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

K8 Architecture Development

• AMD Athlons, K7 continued
  – 1999-2002 held fastest x86 title off and on
  – First to 1GHz clock speed
  – Intel suffered a series of major production, design, and quality control issues at this time.
  – Changed from slot to socket format
  – Athlon XP – desktop
  – Athlon XP-M – laptop
  – Athlon MP – server

Page 9: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

K8 Architecture Development

• AMD Athlons, K7 continued
  – Final (5th) revision, the Barton
  – 400 MHz FSB (up from 200 MHz)
  – Up to 2.2 GHz clock
  – 512 KB L2 cache, on-die
  – 54.3 million transistors
  – .13 micron process

• In 2004 AMD began using 90nm process on XP-M

Page 10: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

The AMD Opteron

• Built on the K8 Core
  – Released April 22, 2003
  – AMD's AMD64 (x86-64) ISA
• Direct Connect Architecture
  – Integrated memory controllers
  – HyperTransport interface
• Native execution of x86 64-bit apps
• Native execution of x86 32-bit apps with no speed penalty!

Page 11: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

Opteron vs. Intel Offerings

• Targeted at the server market
  – 64-bit computing
  – Registered memory
• Initial direct competitor was the Itanium
• Itanium was the only other 64-bit processor architecture with 32-bit x86 compatibility
• But, 32-bit software support was not native
  – Emulated 32-bit performance took a significant hit

Page 12: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

Opteron vs. ???

• Opteron had no real competition
• Near 1:1 multi-processor scaling
  – With its integrated memory controller, each CPU can access local RAM without using the HyperTransport bus for processor-memory communication
• In competing multiprocessor systems, CPUs share a single common bus
  – Contention for the shared bus leads to decreased efficiency; this is not an issue for the Opteron

• Still did not dominate the market

Page 13: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

Opteron Layout

Page 14: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

Other New Opteron Features

• 48-bit virtual address space and a 40-bit physical address space

• ECC (error correcting code) protection for L1 cache data, L2 cache data and tags, and DRAM, with hardware scrubbing of all ECC-protected arrays

Page 15: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

Other New Opteron Features

• Lower thermal output, improved frequency scaling via .13 micron SOI (silicon-on-insulator) process technology

• Two additional pipeline stages (compared to K7) for increased performance and frequency scalability

• Higher IPC (instructions-per-clock) with larger TLBs, flush filters, and enhanced branch prediction algorithms

Page 16: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

64-bit Computing

• Move beyond the 4GB virtual-address space ceiling 32-bit systems impose

• Servers and apps like databases, content creation, MCAD, and design-automation tools push that boundary.

• AMD’s implementation allows:
  – Up to 256TB of virtual-address space
  – Up to 1TB of physical memory
  – No performance penalty
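
The figures above follow directly from the address widths quoted on the "Other New Opteron Features" slide (48-bit virtual, 40-bit physical). A minimal sketch of the arithmetic; the function and constant names are illustrative only:

```python
# Address-space sizes implied by the bit widths quoted on the slides.

def address_space_bytes(bits: int) -> int:
    """Number of distinct byte addresses reachable with `bits` address bits."""
    return 2 ** bits

GB = 2 ** 30  # binary gigabyte, as used on the slides
TB = 2 ** 40  # binary terabyte

virtual_32 = address_space_bytes(32)   # ceiling of a 32-bit system
virtual_48 = address_space_bytes(48)   # Opteron virtual address space
physical_40 = address_space_bytes(40)  # Opteron physical address space

print(virtual_32 // GB, "GB")   # 4 GB
print(virtual_48 // TB, "TB")   # 256 TB
print(physical_40 // TB, "TB")  # 1 TB
```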

Page 17: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

64-bit Computing Cont’d

• AMD believes the following desktop apps stand to benefit the most from its architecture, once 64-bit becomes more widespread
  – 3D gaming
  – Codecs
  – Compression algorithms
  – Encryption
  – Internet content serving
  – Rendering

Page 18: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

AMD and 64-bit Computing

• Goal is not immediate transition to 64-bit operation
  – Like Intel’s transition to 32-bit with the 386
  – AMD's Brunner: "The transition will occur at the pace of demand for its benefits."

• Sets foundation and encourages development of 64-bit applications while fully supporting current 32-bit standard

Page 19: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

AMD64

• AMD’s 64-bit ISA

• 64-bit software support with zero-penalty 32-bit backward compatibility

• x86 based, with extensions

• Cleans up x86-32 idiosyncrasies

• Updated since release: e.g. SSE3

Page 20: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

AMD64 - Features

• All benefits of 64-bit processing (e.g. virtual-address space)
• Added registers
  – Same registers as the Pentium 4 in 32-bit mode, plus 8 more 64-bit GPRs available in 64-bit mode
  – 8 more XMM registers
• Native 32-bit compatibility
  – Low translation overhead (unlike Intel's Itanium)
  – Both 32- and 64-bit apps can be run under a 64-bit OS

Page 21: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

Register Map for AMD64

Page 22: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

AMD64 – More Features

• RIP-relative data access: Instructions can reference data relative to the instruction pointer (PC), which makes code in shared libraries more efficient and able to be mapped anywhere in the virtual address space.

• NX Bit: Not required for 64-bit computing, but provides for a more tightly controlled software environment. Hardware-set permission levels make it much more difficult for malicious code to take control of the system.

Page 23: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

AMD64 Operating Modes

• Legacy mode supports 16- and 32-bit OSes and apps, while long mode enables 64-bit OSes to accommodate both 32- and 64-bit apps.
  – Legacy: OS, device drivers, and apps will run exactly as they did prior to upgrading.
  – Long: Drivers and apps have to be recompiled to run as 64-bit code, so software selection will be limited, at least initially.

• Most likely scenario is a 64-bit OS with 64-bit drivers, running a mixture of 64-bit apps and 32-bit apps, the latter in compatibility mode.

Page 24: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.
Page 25: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.
Page 26: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

Direct Connect Architecture

• I/O Architecture for Opteron and Athlon64

• Microprocessors are connected to:
  – Memory through an integrated memory controller
  – A high performance I/O subsystem via HyperTransport bus
  – Other CPUs via HyperTransport bus

Page 27: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

Onboard Memory Control

• Processors do not have to go through a northbridge to access memory

• 128-bit memory bus

• Latency reduced and bandwidth doubled

• Multiprocessor: Each processor has its own memory interface and its own memory

• Available memory scales with the number of processors

Page 28: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

More Onboard Memory Control

• DDR-SDRAM only

• Up to 8 registered DDR DIMMs per processor

• Memory bandwidth of up to 5.3 Gbytes/s (with PC2700) per processor.

• 20% improvement over Athlon just due to integrated memory
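
The 5.3 Gbytes/s figure can be reproduced from the 128-bit bus width on the previous slide. The sketch below assumes PC2700 means DDR333, i.e. roughly 333 million transfers per second; that rate is not stated on the slides, and the helper name is illustrative:

```python
# Peak memory bandwidth = bus width in bytes x transfers per second.

def peak_bandwidth_gb_s(bus_bits: int, transfers_per_second: float) -> float:
    """Peak bandwidth in GB/s (10^9 bytes per second)."""
    return bus_bits / 8 * transfers_per_second / 1e9

PC2700_TRANSFERS = 333e6  # assumption: PC2700 = DDR333, ~333 MT/s

dual_channel = peak_bandwidth_gb_s(128, PC2700_TRANSFERS)   # 128-bit Opteron bus
single_channel = peak_bandwidth_gb_s(64, PC2700_TRANSFERS)  # one 64-bit channel

print(f"128-bit bus: {dual_channel:.1f} GB/s")
print(f" 64-bit bus: {single_channel:.1f} GB/s")
```

Halving the bus width halves the peak figure, which is the sense in which the 128-bit bus doubles bandwidth.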

Page 29: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

HyperTransport

• Bidirectional, serial/parallel, scalable, high-bandwidth low-latency bus

• Packet based– 32-bit words regardless of physical width

• Facilitates power management and low latencies

Page 30: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

HyperTransport in the Opteron

• 16 CAD HyperTransport (16 bits wide; CAD = Command, Address, Data)
  – Processor-to-processor and processor-to-chipset
  – Bandwidth of up to 6.4 GB/s (per HT port)
  – 50% more than what the latest Pentium 4 or Xeon processors offer

• 8-bit wide HyperTransport for components such as normal I/O hubs

Page 31: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

More Opteron HyperTransport

• Number of HyperTransport channels (up to 3) determined by number of CPUs
  – 19.2 Gbytes/s of peak bandwidth per processor

• All are bi-directional and double-pumped (double data rate)

• Low power consumption (1.2 W) reduces system thermal budget

Page 32: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.
Page 33: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.
Page 34: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

More HyperTransport

• Auto-negotiated bus widths
• Devices negotiate sizes during initialization
• 2-bit lines to 32-bit lines.
• Busses of various widths can be mixed together in a single application
• Allows for high speed busses between main memory and the CPU and lower speed busses to peripherals as appropriate
• PCI compatible but 80x faster

Page 35: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

DCA – InterCPU Connections

• Multiple CPUs connected through a proprietary extension running on additional HyperTransport interfaces

• Allows support of a cache-coherent, Non-Uniform Memory Access, multi-CPU memory access protocol

Page 36: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

DCA – InterCPU Connections

• Non-Uniform Memory Access
  – Separate cache memory for each processor
  – Memory access time depends on memory location (i.e. local faster than non-local)

• Cache coherence
  – Integrity of data stored in local caches of a shared resource

• Each CPU can access the main memory of another processor, transparent to the programmer

Page 37: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

DCA Enables Multiple CPUs

• Integrated memory controller allows local memory access without using HyperTransport

• For non-local memory access and interprocessor communication, only the initiator and target are involved, keeping bus-utilization to a minimum.

• All CPUs in multiprocessor Intel Xeon systems share a single common bus for both memory access and interprocessor communication

• Contention for shared bus reduces efficiency

Page 38: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

Multicore vs Multi-Processor

• In multi-processor systems (more than one Opteron on a single motherboard), the CPUs communicate using the Direct Connect Architecture

• Most retail motherboards offer one or two CPU sockets

• The Opteron CPU directly supports up to an 8-way configuration (found in mid-level servers)

Page 39: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

Multicore vs Multi-Processor

• With multicore each physical Opteron chip contains two separate processor cores (more someday soon?)

• Doubles the compute power available to each motherboard socket. One socket can deliver the performance of two processors, two sockets deliver a four-processor equivalent, etc.

Page 40: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

Future Improvements

• Dual-Core vs Double Core
  – Dual core: Two processors on a single die
  – Double core: Two single core processors in one ‘package’
    • Better for manufacturing
    • Intel Pentium D 900 Presler

• Combined L2 cache

• Quad-core, etc.

Page 41: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

K7 vs. K8 Changes

Page 42: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

Summary of Changes From K7 to K8

• Deeper & Wider Pipeline
• Better Branch Predictor
• Large workload TLB
• HyperTransport capabilities eliminate Northbridge and allow low latency communication between processors as well as I/O

• Larger L2 cache with higher bandwidth and lower latency

• AMD 64 ISA allowing for 64-bit operation

Page 43: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

The K7 Basics

• 3 x86 decoding units

• 3 integer units (ALU)

• 3 floating point units (FPU)

• A 128KB L1 cache

• Designed with efficiency in mind
  – Measured by IPC (Instructions Per Cycle)
  – The K7's units can handle up to 9 instructions per clock cycle

Page 44: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

The K8 Basics

• 3 x86 decoding units

• 3 integer units (ALU)

• 3 floating point units (FPU)

• A 128KB L1 cache (64KB code + 64KB data), with up to 1MB of L2 cache

Page 45: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

The K7 Core

Page 46: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

The K8 Core

Page 47: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

Things To Note About the K8

• Schedules a large number of instructions simultaneously
  – 3 8-entry schedulers for integer instructions
  – A 36-entry scheduler for floating point instructions

• Compared to the K7, the K8 allows for more integer instructions to be active in the pipeline. How is this possible?

Page 48: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

Processor Constraints

• A 'bigger' processor has more execution units (width) and more stages in the pipeline (depth)
  – Processor 'size' is limited by the accuracy of the branch predictor
  – Accuracy determines how many instructions can be active in the pipeline before an incorrect branch prediction occurs
  – In theory, a CPU should only accommodate the number of instructions that can be sent down the pipeline before a misprediction

Page 49: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

The K8 Branch Predictor

Page 50: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

The K8 Branch Predictor Details

• Compared to the K7, the K8 has improved branch prediction
  – Global history counter (ghc) is 4x previous size
    • ghc is a massive array of 2-bit (0-3) counters, indexed by part of an instruction's address
    • If the value is >= 2, the branch is predicted as "taken"
  – Taken branches increment the counter
  – Untaken branches decrement it
  – The larger global history counter means more instruction addresses can be tracked, thus increasing branch predictor accuracy

How the global history counter works:
1. ghc is a massive array of 2-bit (0-3) counters
2. When a branch is reached, the branch prediction unit feeds part of the instruction address into the ghc as an index
3. The ghc uses the index to select a specific counter
   – If the value is >= 2, the branch is predicted as "taken"
   – Taken branches increment the counter
   – Untaken branches decrement the counter
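
The counter scheme described above can be sketched in a few lines. This is a toy model under stated assumptions, not the K8's actual predictor: the table sizes, the initial counter value of 1, indexing with the low address bits, and the class and function names are all illustrative.

```python
# Toy table of 2-bit saturating counters indexed by part of the branch address.

class TwoBitPredictor:
    def __init__(self, index_bits: int = 10):
        self.mask = (1 << index_bits) - 1
        # Assumption: every counter starts at 1 ("weakly not taken").
        self.counters = [1] * (1 << index_bits)

    def _index(self, address: int) -> int:
        # Use the low bits of the instruction address as the table index.
        return address & self.mask

    def predict(self, address: int) -> bool:
        # A counter value of 2 or 3 predicts "taken".
        return self.counters[self._index(address)] >= 2

    def update(self, address: int, taken: bool) -> None:
        i = self._index(address)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)  # saturate at 3
        else:
            self.counters[i] = max(0, self.counters[i] - 1)  # saturate at 0


def accuracy(predictor: TwoBitPredictor, trace) -> float:
    """Fraction of branches in `trace` predicted correctly."""
    correct = 0
    for address, taken in trace:
        if predictor.predict(address) == taken:
            correct += 1
        predictor.update(address, taken)
    return correct / len(trace)


# Two branches: one always taken, one never taken, executed alternately.
trace = [(0x10, True), (0x14, False)] * 50

small = accuracy(TwoBitPredictor(index_bits=2), trace)   # both map to entry 0
large = accuracy(TwoBitPredictor(index_bits=10), trace)  # separate entries

print(f"small table: {small:.2f}")
print(f"large table: {large:.2f}")
```

With the small table the two branches share one counter and keep undoing each other's history; with the larger table each gets its own counter. That is the effect the slide attributes to the 4x larger array.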
Page 51: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

Translation Look-aside Buffer

• The number of TLB entries has been increased
  – Helps performance in servers with large memory requirements
  – Desktop performance impact will be limited to a small boost when running 3D rendering software

Page 52: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

HyperTransport:Typical CPU to Memory Set-Up

• CPU sends a 200MHz clock to the north bridge; this is the FSB.

• The bus between the north bridge and the CPU is 64 bits wide at 200MHz (quad-pumped, for 4 packets per cycle), giving an effective rate of 800MHz

• The memory bus is also 200MHz and 64 or 128 bits wide (single or dual channel). As it is DDR memory, two 64/128-bit packets are sent every clock cycle.

Page 53: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

HyperTransport:Opteron Memory Set-Up

• Integrated memory controller does not improve the memory bandwidth, but drastically reduces memory request time

• HyperTransport uses a 16-bit wide bus at 800MHz with a double data rate scheme, which gives a peak bandwidth of 3.2 GB/s one-way
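
The bandwidth figures quoted across the HyperTransport slides are mutually consistent, as a short calculation shows. The function name is illustrative; the widths, clocks and link count are the ones given on the slides:

```python
# Peak link bandwidth = width in bytes x clock x transfers per clock.

def link_bandwidth_gb_s(width_bits: int, clock_hz: float, transfers_per_clock: int) -> float:
    """Peak bandwidth in GB/s (10^9 bytes per second)."""
    return width_bits / 8 * clock_hz * transfers_per_clock / 1e9

ht_one_way = link_bandwidth_gb_s(16, 800e6, 2)  # 16-bit, 800MHz, double data rate
ht_per_port = 2 * ht_one_way                    # both directions of one link
ht_three_links = 3 * ht_per_port                # an Opteron with three links
fsb = link_bandwidth_gb_s(64, 200e6, 4)         # 64-bit, 200MHz, quad-pumped FSB

print(f"HT one-way:     {ht_one_way:.1f} GB/s")
print(f"HT per port:    {ht_per_port:.1f} GB/s")
print(f"HT three links: {ht_three_links:.1f} GB/s")
print(f"Quad-pumped FSB:{fsb:.1f} GB/s")
```

Note that a single HyperTransport port matches the peak rate of the whole shared front side bus, and each Opteron can have up to three such ports.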

Page 54: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.
Page 55: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

Pros & Cons

• Pros
  – The performance of the K8's integrated controller increases as the CPU speed increases, and so does the request speed.
  – The addressable memory size and the total bandwidth increase with the number of CPUs

• Cons
  – Memory controller is customized for a specific memory type, and is not very flexible about upgrading

Page 56: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

Caches

Page 57: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

L1 Cache Comparison

CPU                             K8                Pentium 4 Prescott
Size                            code: 64KB        TC: 12Kµops
                                data: 64KB        data: 16KB
Associativity                   code: 2 way       TC: 8 way
                                data: 2 way       data: 8 way
Cache line size                 code: 64 bytes    TC: n.a.
                                data: 64 bytes    data: 64 bytes
Write policy                    Write Back        Write Through
Latency given by manufacturer   3 cycles          4 cycles

(TC = trace cache)

Page 58: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

K8 L1 Cache

• Compared to the Intel machine, the large size of the L1 cache allows for a bigger block size
  – Pros: a big range of data or code from the same memory area can be cached
  – Cons: low associativity tends to create conflicts during the caching phase.

Page 59: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.
Page 60: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

L2 Cache Comparison

CPU                             K8                   Pentium 4 Prescott
Size                            512KB (NewCastle)    1024KB
                                1024KB (Hammer)
Associativity                   16 way               8 way
Cache line size                 64 bytes             64 bytes
Latency given by manufacturer   11 cycles            11 cycles
Bus width                       128 bits             256 bits
L1 relationship                 exclusive            inclusive

Page 61: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

K8 L2 cache

• The L2 cache of the K8 shares a lot of common features with the K7's.

• The K8's L2 cache uses 16-way set associativity to partially compensate for the low associativity of the L1.

• Although the bus width in the K8 is double what the K7 offered, it is still smaller than the Intel model's

• The K8 also includes hardware prefetch logic, which fetches data from memory into the L2 cache while the memory bus is idle.

Page 62: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.
Page 63: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

Inclusive vs. Exclusive Caching

• Inclusive Caching: Used by the Intel P4
  – L1 cache contains a subset of the L2 cache
  – During an L1 miss/L2 hit, data is copied into the L1 cache and forwarded to the CPU
  – During an L1/L2 miss, data is copied from memory into both L1 and L2 caches

Page 64: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

Inclusive vs. Exclusive Caching

• Exclusive: Used by the Opteron
  – L1 and L2 caches cannot contain the same data
  – During an L1 miss/L2 hit:
    • One line is evicted from the L1 cache into the L2
    • L2 cache copies data into the L1 cache
  – During an L1/L2 miss, data is copied into the L1 cache alone
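
The exclusive policy can be illustrated with a toy two-level hierarchy. This sketch assumes fully associative caches with LRU replacement and tiny sizes, none of which reflects the real K8 (which is set-associative); the class and method names are hypothetical.

```python
# Toy exclusive L1/L2 hierarchy: a line lives in L1 or L2, never both.
from collections import OrderedDict


class ExclusiveHierarchy:
    def __init__(self, l1_lines: int, l2_lines: int):
        self.l1 = OrderedDict()  # least recently used entry first
        self.l2 = OrderedDict()
        self.l1_lines = l1_lines
        self.l2_lines = l2_lines

    def _fill_l1(self, line: int) -> None:
        # Make room first: the L1 victim is moved down into the L2.
        if len(self.l1) >= self.l1_lines:
            victim, _ = self.l1.popitem(last=False)
            self.l2[victim] = True
            if len(self.l2) > self.l2_lines:
                self.l2.popitem(last=False)  # L2 victim leaves the hierarchy
        self.l1[line] = True

    def access(self, line: int) -> str:
        if line in self.l1:
            self.l1.move_to_end(line)
            return "L1 hit"
        if line in self.l2:
            del self.l2[line]    # the line moves up, it is not duplicated
            self._fill_l1(line)
            return "L2 hit"
        self._fill_l1(line)      # a miss fills the L1 alone
        return "miss"


cache = ExclusiveHierarchy(l1_lines=2, l2_lines=4)
first_pass = [cache.access(line) for line in range(6)]
second_pass = [cache.access(line) for line in range(6)]

print(first_pass)
print(second_pass)
print("distinct lines held:", len(cache.l1) + len(cache.l2))
```

Six distinct lines fit in a 2-line L1 plus a 4-line L2, so the second pass never goes to memory. An inclusive hierarchy of the same sizes could hold only four distinct lines, since every L1 line is also stored in the L2.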

Page 65: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

Drawback of Exclusive Caching and its solution…

• Problem: A line from the L1 must be copied to the L2 *before* getting back the data from the L2.
  – Takes a lot of clock cycles, adding to the time needed to get data from the L2

• Solution:
  – Victim buffer (VB): a very small, fast memory between L1 and L2.
    • The line evicted from L1 is copied into the VB rather than into the L2.
    • At the same time, the L2 read request is started, so the L1-to-VB write is hidden by the L2 latency
    • Then, if by chance the next requested data is in the VB, getting it back from there is much quicker than getting it from the L2.
  – The VB is a good improvement, but it is very limited by its small size (generally between 8 and 16 cache lines). Moreover, when the VB is full, it must be flushed into the L2, which is an additional step and needs some extra cycles.

Page 66: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

Drawback of Inclusive

• The constraint on the L1/L2 size ratio requires the L1 to be small,
  – but a small size reduces its hit rate, and consequently its performance.
  – On the other hand, if the L1 is too big relative to the L2, the ratio will be too large for good L2 performance.

• Reduces flexibility when deciding the sizes of the L1 and L2 caches
  – It is very hard to build a CPU line with such constraints. Intel released the Celeron P4 as a budget CPU, but its 128KB L2 cache crippled its performance.

• Total useful cache size is reduced since data is duplicated across the caches

Page 67: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

Inclusive vs. Exclusive Caching

             Pros                                                Cons
Exclusive    • No constraint on the L2 size                      • L2 performance decreases
             • Total cache size is sum of the sub-level sizes
Inclusive    • L2 performance                                    • Constraint on the L1/L2 size ratio
                                                                 • Total cache size is effectively reduced

Page 68: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

The Pipeline

Page 69: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

K7 vs. K8 – Pipeline Comparison

Page 70: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

The Fetch Stage

• Two Cycles Long

• Feeds 3 decoders with 16 instruction bytes each cycle

• Uses the L1 code cache and the branch prediction logic.

Page 71: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

The Decode Stage

• The decoders convert x86 instructions into fixed-length micro-operations (µOPs).

• Can generate 3 µOPs per cycle
  – FastPath: "simple" instructions, which decode into 1-2 µOPs, are decoded by hardware, then packed and dispatched
  – Microcoded path: complex instructions are decoded using the internal ROM

• Compared to the K7, more instructions in the K8 use the fast path, especially SSE instructions.
  – AMD claims that the number of microcoded instructions decreased by 8% for integer and 28% for floating point instructions.

Page 72: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

Instruction Dispatch

• There are:
  – Three address generation units (AGU)
  – Three integer units (ALU). Most operations complete within a cycle, in both 32 and 64 bits: addition, rotation, shift, logical operations (and, or).
    • Integer multiplication has a 3-cycle latency in 32 bits, and a 5-cycle latency in 64 bits.
  – Three floating point units (FPU), which handle x87, MMX, 3DNow!, SSE and SSE2.

Page 73: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

Load/Store Stage

• Last stage of the pipeline process
  – Uses the L1 data cache.
  – The L1 is dual-ported to handle two 32/64-bit reads or writes each clock cycle.

Page 74: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

Cache Summary

• Compared to the K7, the K8 cache provides higher bandwidth and lower latencies

• Compared to the Intel P4, the K8's caches are write-back and exclusive

Page 75: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

AMD 64: GPR encoding

• IA32 instruction encoding uses a special byte called ModRM (Mode / Register / Memory), in which the source and destination registers of the instruction are encoded.
  – 3 bits encode the source register, 3 bits encode the destination

• There’s no way to change the ModRM byte since that would break IA32 compatibility. So to allow instructions to use the 8 new GPRs, an additional prefix byte named REX is added in front of the opcode, outside the ModRM; it supplies the extra register bits.

• The REX prefix is used only in long (64-bit) mode, and only if the instruction needs it (a 64-bit operand or one of the new registers)
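
As an illustration of how the 3-bit ModRM fields and the REX bits combine, the sketch below encodes a register-to-register 64-bit `mov` (opcode 89 /r). It covers only this one instruction form, and the function name is illustrative:

```python
# Encode `mov dst, src` between two 64-bit GPRs using REX + opcode 89 + ModRM.

REGISTERS = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
             "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"]


def encode_mov_reg_reg(dst: str, src: str) -> bytes:
    d = REGISTERS.index(dst)  # 4-bit register number
    s = REGISTERS.index(src)

    # REX prefix layout: 0100 W R X B
    #   W = 1 selects 64-bit operand size
    #   R = 4th bit of the ModRM "reg" field (here: the source)
    #   B = 4th bit of the ModRM "r/m" field (here: the destination)
    rex = 0x40 | (1 << 3) | ((s >> 3) << 2) | (d >> 3)

    # ModRM layout: mod (2 bits) | reg (3 bits) | r/m (3 bits)
    # mod = 11 means both operands are registers.
    modrm = (0b11 << 6) | ((s & 7) << 3) | (d & 7)

    return bytes([rex, 0x89, modrm])


print(encode_mov_reg_reg("rax", "rbx").hex())  # 4889d8
print(encode_mov_reg_reg("r8", "r9").hex())    # 4d89c8
```

The ModRM byte keeps its IA32 layout; only the REX prefix changes between the two encodings when the new registers r8 and r9 are used.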

Page 76: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

AMD 64: SSE

• Abandoned the original MMX and 3DNow! instruction sets because they operated on the same physical registers as the x87 FPU

• Supports SSE/SSE2 using eight SSE-dedicated 80-bit registers
  – If a 128-bit instruction is processed, it takes two steps to complete
  – Intel’s P4 allows for the use of 128-bit registers, so 128-bit instructions only take a single step
  – However, C/C++ compilers still usually output scalar SSE instructions that only use 32/64 bits, so the Opteron can process most SSE instructions in one step and thus remain competitive with the P4

Page 77: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

AMD 64: One Last Trick

• Suppose we want to write 1 in a register, which is written in pseudo-code as:
  – mov register, 1

• In the case of a 32-bit register, the immediate value 1 will be encoded on 32 bits:
  – mov eax, 00000001h

• In the case of a 64-bit register:
  – mov rax, 0000000000000001h

• Problems? The 64-bit instruction takes 5 more bytes to encode the same number, thus wasting space.

Page 78: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

AMD 64: One Last Trick

• Under AMD64, the default operand size is 32 bits.
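
The size difference behind this trick can be checked by building the two encodings by hand. A minimal sketch using the standard opcodes for loading an immediate into eax/rax; the function names are illustrative:

```python
# Compare the encoded size of `mov eax, imm32` and `mov rax, imm64`.
import struct


def mov_eax_imm32(value: int) -> bytes:
    # Opcode B8 followed by a 4-byte little-endian immediate.
    return bytes([0xB8]) + struct.pack("<I", value)


def mov_rax_imm64(value: int) -> bytes:
    # REX.W prefix (48), opcode B8, then an 8-byte little-endian immediate.
    return bytes([0x48, 0xB8]) + struct.pack("<Q", value)


short_form = mov_eax_imm32(1)
long_form = mov_rax_imm64(1)

print(short_form.hex(), len(short_form), "bytes")
print(long_form.hex(), len(long_form), "bytes")
print("extra bytes:", len(long_form) - len(short_form))
```

Because the default operand size is 32 bits, the short form is what a compiler would normally emit. In 64-bit mode a write to eax also clears the upper half of rax, so the short form still leaves rax equal to 1.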

Page 79: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

AMD 64: One Last Trick

• For memory addressing a more complicated table is used.

Page 80: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

AMD 64: Code Size

• Cpuid.org estimated that 64-bit code will be 20-25% bigger than the equivalent IA32 code.

• However, the use of sixteen GPRs will tend to reduce the number of instructions, and perhaps make 64-bit code shorter than 32-bit code.
  – The K8 is able to handle the code size increase, thanks to its 3 decoding units and its big L1 code cache. The use of big 32KB blocks in the L1 organization now seems very useful

Page 81: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

AMD 64: 32-bit code vs. 64-bit Code

Page 82: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

• [H]ard|OCP:
  – “AthlonXP 3200+ got outpaced by the Athlon64 3200+…the P4 and the P4EE came in at a dead tie, which suggests that the extra CPU cache is not a factor in this benchmark... pipeline enhancements made to the new K8 core certainly did impact instructions per clock.”

• Note:
  – Athlon64 3200+ runs at 2.0GHz
  – AthlonXP 3200+ runs at 2.2 GHz

Page 83: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

AMD 64: Conclusions

• Allows for a larger addressable memory size

• Allows for wider GPRs and 8 more of them

• Allows the use of all x86 instructions that were available on the AMD64 by default

• Can lead to small code that is faster as a result of less memory shuffling

Page 84: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

Opteron vs. Xeon

Page 85: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

Opteron vs Xeon in a nutshell

• Opteron offers better computing and per-Watt performance at a roughly equivalent per-device price

• Opteron scales much better when moving from one to two or even more CPUs

• Fundamental limitation:
  – Xeon processors must share one front side bus and one memory array

Page 86: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

FSB Bottleneck

Intel’s Xeon AMD’s Opteron

Page 87: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

Xeon and the FSB Bottleneck

• External north bridge makes implementing multiple FSB interfaces expensive and hard

• Intel just has all the processors share one FSB

• Relies on large on-die L3 caches to hide issue

• Problem grows with number of CPUs

Page 88: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

The AMD Solution

• Recall: Each processor has its own integrated memory controller and three HyperTransport ports
  – No NB required for memory interaction
  – 6.4 GB/s bandwidth between all CPUs

• No scaling issue!

Page 89: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

Further Xeon Notes

• Even 64-bit extensions would not solve the fundamental performance bottleneck imposed by the current architecture

• Xeon can make use of Hyperthreading
  – Found to improve performance by 3 - 5%

Page 90: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

AnandTech: Database Benchmarks

• SQL workload based on site’s forum usage, database was forums themselves
  – i.e. trying to be real world

• Two categories: 2-way and 4-way setups

• Labels:
  – Xeon: Clock Speed / FSB Speed / L3 Cache Size
  – Opteron: Clock Speed / L2 Cache Size

Page 91: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

AnandTech: Average Load 2-way

• Longer line is better
• Opterons at 2.2 GHz maintain 5% lead over Xeons at 3.2 GHz

Page 92: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

AnandTech: Average Load 4-way

• With two more processors, best Opteron system increases performance lead to 11%

• Opterons @ 1.8 GHz nearly equal Xeons at 3.0 GHz

Page 93: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

AnandTech: Enterprise benchmarks

• 2-way Xeon at 3+GHz and large L3 cache does better

• 4-way Opteron jumps ahead (8.5% lead)

Stored Procedures / Second

Page 94: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

AnandTech Test Conclusions

• Opteron is clear winner for >2 processor systems
  – Even for dual-processors, Xeon essentially only ties

• Clearly illustrates the scaling bottleneck
• Xeons are using most of their huge (4MB) L3 cache to keep traffic off the FSB
• Also, Opteron systems used in tests cost ½ as much

Page 95: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

Tom’s Hardware Benchmarks

• AMD's Opteron 250 vs. Intel's Xeon 3.6GHz

• Xeon Nocona (i.e. 64-bit processing)
  – Results enhanced by chipset used (875P), which has improved memory controller
  – Still suffers from lack of memory performance

• Workstation applications rather than server based tests

Page 96: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

Tom’s Hardware

Page 97: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

Tom’s Hardware

Page 98: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

Tom’s Hardware Conclusions

• AMD has memory benefits, as before

• Opteron better in video, Intel better with 3D but only when 875P-chipset is used
  – Otherwise Opteron wins in spite of inferior graphics hardware

• Still undecided re: 64-bit, no good applications to benchmark on

Page 99: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

K8 in Different Packages

Page 100: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

K8 in Different Packages

• Opteron
  – Server Market
  – Registered memory
  – 940 pin count
  – Three HyperTransport links
    • Multi-CPU configurations (1, 2, 4, or 8 CPUs)
    • Multiple multi-core CPUs supported as well
  – Up to 8 1GB DIMMs

Page 101: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

K8 in Different Packages

• Athlon 64
  – Desktop market
  – Unregistered memory
  – 754 or 939 pin count
  – Up to 4 1GB DIMMs
  – Single HyperTransport link
    • Single slot configurations
    • X2 has multiple cores in one slot
  – Athlon 64 FX
    • Same feature set as Athlon 64
    • Unlocked multiplier for overclocking
    • Offered at higher clock speeds (2.8GHz vs. 2.6GHz)

Page 102: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.
Page 103: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

K8 in Different Packages

• Turion 64
  – Named to evoke the “touring” concept
  – 90nm “Lancaster” Athlon 64 core
    • 64-bit computing
    • SSE3 support
  – High quality core yields, can run at high clock speeds with low voltage
    • Similar process for low wattage Opterons
  – On-chip memory controller
    • Saves power by running in single channel mode
    • Better compared to Pentium M’s extra controller on the mobo

Page 105: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

Thermal Design Points

• Pentium 4’s TDP: 130 W
• Athlon 64’s TDP: 89-104 W
• Opteron HE: 50 W; EE: 30 W
• Athlon 64 mobiles: 50 W
  – DTR (desktop replacement) market sector
• Pentium M: 27 W
• Turion 64: 25 W

Page 106: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

K8 in Different Packages

• Turion 64 continued
  – Uses PowerNow! technology
    • Similar to Intel’s SpeedStep
    • Identical to desktop Cool’n’Quiet
    • Dynamic voltage and clock frequency modulation
  – Operates “on demand”
  – Runs “Cooler and Quieter” even when plugged in
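The “dynamic voltage and clock frequency modulation” idea can be illustrated with the standard CMOS approximation that dynamic power scales roughly as P ∝ C·V²·f. The sketch below is only an illustration of that relationship: the operating points in `P_STATES`, the 20% headroom rule, and the function names are hypothetical assumptions, not AMD’s published PowerNow! tables or selection policy.

```python
from typing import NamedTuple


class PState(NamedTuple):
    """One operating point: clock frequency and core voltage."""
    freq_mhz: int
    voltage: float


# Hypothetical operating points (NOT AMD's real tables), fastest first.
P_STATES = [PState(2000, 1.35), PState(1600, 1.20), PState(800, 0.90)]


def relative_dynamic_power(state: PState, baseline: PState = P_STATES[0]) -> float:
    """Dynamic power relative to the baseline, using P ~ C * V^2 * f.

    The capacitance C cancels out because it is the same chip.
    """
    v_ratio = state.voltage / baseline.voltage
    f_ratio = state.freq_mhz / baseline.freq_mhz
    return v_ratio ** 2 * f_ratio


def pick_state(utilization: float, current: PState) -> PState:
    """Pick the slowest state that still covers the observed load.

    `utilization` is the busy fraction (0.0 to 1.0) measured at `current`.
    The 20% headroom is an assumed policy, chosen only for illustration.
    """
    demand_mhz = utilization * current.freq_mhz
    for state in sorted(P_STATES, key=lambda s: s.freq_mhz):
        if state.freq_mhz * 0.8 >= demand_mhz:
            return state
    return P_STATES[0]  # fully loaded: stay at the fastest state


if __name__ == "__main__":
    idle_choice = pick_state(0.10, P_STATES[0])
    print(idle_choice, round(relative_dynamic_power(idle_choice), 3))
```

Because voltage enters squared, lowering voltage together with frequency saves far more power than lowering frequency alone, which is why an “on demand” scheme of this kind pays off on a lightly loaded notebook.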

Page 109: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

K8 in Different Packages

• AMD uses “Mobile Technology” name
  – Intel has a monopoly on Centrino
    • Supplies wireless, chipset, and CPU
    • Invested $300 million in Centrino advertising
  – Some consumers think Centrino is the only way to get wireless connectivity in a notebook
  – AMD supplies only the CPU
    • Chipset and wireless are left up to the motherboard manufacturer/OEM

Page 110: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

Marketing

VS

Page 111: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

Intel’s Marketing

• Men who are Blue
• Moore’s Law
• Megahertz
• Most importantly: Money
• Beginning with: “In order to correctly communicate the benefits of new processors to PC buyers it became important that Intel transfer any brand equity from the ambiguous and unprotected processor numbers to the company itself”

Page 112: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

Industry on AMD vs. Intel

• Intel spends more on R&D in one quarter than AMD makes in a year

• Intel still has a tremendous amount of arrogance
• Has been shamed technologically by a flea-sized (relatively speaking) firm
• Humbling? Intel is still grudgingly turning to the high IPC, low clock rate, dual-core, x86-64, on-die memory controller design pioneered by its diminutive rival. – Geek.com

Page 113: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

AMD’s Marketing

• Mascot = The AMD Arrow
• “AMD makes superior CPUs, but the marketing department is acting like they are still selling the K6” - theinquirer.net
• Guilty with Intel on poor metrics:
• “AMD made all the marketing hay it could on the historically significant clock-speed number. By trying to turn attention away from that number now, it runs the risk of appearing to want to change the subject when it no longer has the perceived advantage. In marketing, appearance is everything. And no one wants to look like a sore loser, even when they aren't.” - Forbes

Page 115: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

Anandtech on AMD’s Marketing

• AMD argued that they didn't have to talk about a new architecture, as Intel is just playing catch-up to their current architecture.

• However, we look at it like this - AMD has the clear advantage today, and for a variety of reasons, their stance in the marketplace has not changed all that much.

Page 116: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

Conclusion

• Improvements over K7
  – 64-bit
  – Integrated memory controller
  – HyperTransport
  – Pipeline
• Multiprocessor scaling > Xeon
• K8 is dominant in every market performance-wise
• K8 is trounced in every market in sales

Page 117: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

Reason for 64-bit in Consumer Market

• “If there aren't widespread, consumer-priced 64-bit machines available in three years, we're going to have a hard time developing games that are more compelling than last year's games.” - Tim Sweeney, Founder & President, Epic Games

Page 118: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

Questions?

Page 119: The AMD Opteron Henry Cook Kum Sackey Andrew Weatherton.

http://www.people.virginia.edu/~avw6s/opteron.html