AMD Bulldozer Microarchitecture
Overview
• Two integer cores per module, aimed at high multithreaded throughput
• A Bulldozer module executes two threads through a combination of shared and dedicated resources
• AMD's design focuses on multithreading
High-Level Block Diagram
The figure is taken from [3]
Branch Prediction & Fetch
• Prediction structures are shared between the two threads
• Multilevel BTBs
• Prediction runs ahead of the instruction-fetch pipeline during fetch misses or other stalls
• Instructions are prefetched into the L1 cache using the prediction queue
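The decoupling described above can be sketched in a few lines of Python. This is an illustrative model, not AMD's actual design: the predictor runs ahead of fetch, pushing predicted fetch addresses into a small queue, and the prefetcher drains that queue into L1 even while fetch itself is stalled. The class name, queue depth, and 4-byte fall-through step are assumptions for the example.

```python
from collections import deque

class DecoupledFrontEnd:
    """Toy model: branch prediction runs ahead of instruction fetch."""

    def __init__(self, depth=4):
        self.prediction_queue = deque(maxlen=depth)

    def predict_ahead(self, pc, taken_targets):
        # Follow predicted branch targets ahead of fetch, queueing each
        # fetch address; fall through (pc + 4) when no taken branch is predicted.
        while len(self.prediction_queue) < self.prediction_queue.maxlen:
            self.prediction_queue.append(pc)
            pc = taken_targets.get(pc, pc + 4)

    def prefetch_next(self):
        # The prefetcher consumes the queue even while fetch is stalled,
        # pulling the predicted lines into L1.
        return self.prediction_queue.popleft() if self.prediction_queue else None

fe = DecoupledFrontEnd(depth=4)
fe.predict_ahead(0x100, {0x100: 0x200})   # branch at 0x100 predicted taken to 0x200
addrs = [fe.prefetch_next() for _ in range(4)]
# addrs -> [0x100, 0x200, 0x204, 0x208]
```

The point of the decoupling is visible in the driver: the predicted-taken target 0x200 and its fall-through successors are queued for prefetch before fetch ever reaches them.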
Decode
• Fetch lines are queued in an instruction byte buffer
• The decode unit extracts and decodes up to four x86 instructions per cycle
• Decoded instructions are dispatched to one of the two integer cores
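A minimal sketch of the 4-wide decode rate described above (purely illustrative; real decode works on variable-length x86 bytes, which is elided here):

```python
from collections import deque

def decode_cycles(instructions, width=4):
    """Drain an instruction buffer, yielding up to `width` decoded
    instructions per cycle (the group is then dispatched to one core)."""
    buf = deque(instructions)
    while buf:
        yield [buf.popleft() for _ in range(min(width, len(buf)))]

insns = [f"i{n}" for n in range(10)]
groups = list(decode_cycles(insns))
# 10 instructions decode in 3 cycles at a 4-wide decode rate: 4 + 4 + 2
```
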
Integer Core
• Replicated (two integer cores per module)
• A scheduler handles out-of-order execution
• Core transparency
  o Avoids complexity
  o Lean hardware
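To make the scheduler bullet concrete, here is a toy out-of-order scheduler, a sketch under stated assumptions rather than Bulldozer's actual mechanism: one instruction issues per cycle, each result becomes ready a fixed number of cycles after issue, and any pending instruction whose sources are ready may issue, regardless of program order.

```python
def schedule(program, latency):
    """program: list of (name, srcs, dst) tuples in program order.
    latency: cycles from issue until the result is usable.
    Returns the order in which instructions actually issue."""
    ready, pending, order = set(), list(program), []
    completing = []                      # (cycle_result_ready, dst)
    cycle = 0
    while pending:
        cycle += 1
        # Results whose latency has elapsed become visible this cycle.
        for done, dst in list(completing):
            if done <= cycle:
                ready.add(dst)
                completing.remove((done, dst))
        # Issue the first instruction whose sources are all ready.
        for insn in pending:
            name, srcs, dst = insn
            if all(s in ready for s in srcs):
                order.append(name)
                completing.append((cycle + latency[name], dst))
                pending.remove(insn)
                break
    return order

# add_dep waits on the long-latency mul, so the independent add_ind
# issues ahead of it despite coming later in program order.
program = [("mul", [], "r1"), ("add_dep", ["r1"], "r2"), ("add_ind", [], "r3")]
order = schedule(program, {"mul": 3, "add_dep": 1, "add_ind": 1})
# order -> ["mul", "add_ind", "add_dep"]
```
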
Integer Core
The figure is taken from [1]
Floating Point Unit
• A single floating-point unit, shared between the two integer cores
• Floating-point operations are pipelined and hence exploit SMT
• Interfaces with the decode unit to receive µops and with the load/store unit for data transfer
Floating Point Unit
The figure is taken from [1]
Register Renaming
• PRF (Physical Register File)-based renaming
• A table maps register names to physical locations (tags)
• Issued instructions execute after reading their operands from the PRF
• Snapshots are used to recover from branch mispredictions and exceptions
• Separate register files for the integer cores and the floating-point unit
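A minimal sketch of PRF-based renaming with snapshot recovery, assuming a simple map table and a free list (the structure and names are illustrative; real hardware also reclaims physical registers at retirement, which is omitted here):

```python
class RenameTable:
    """Maps architectural register names to physical register file (PRF) tags."""

    def __init__(self, num_arch, num_phys):
        self.map = {f"r{i}": i for i in range(num_arch)}   # arch name -> PRF tag
        self.free = list(range(num_arch, num_phys))        # unallocated PRF tags
        self.snapshots = []

    def rename_dest(self, arch_reg):
        # Each new destination write gets a fresh physical register.
        tag = self.free.pop(0)
        self.map[arch_reg] = tag
        return tag

    def snapshot(self):
        # Checkpoint the map table at a predicted branch.
        self.snapshots.append(dict(self.map))

    def recover(self):
        # Branch misprediction / exception: restore the last checkpoint,
        # discarding all speculative mappings in one step.
        self.map = self.snapshots.pop()

rt = RenameTable(num_arch=4, num_phys=8)
rt.snapshot()              # checkpoint before a predicted branch
t = rt.rename_dest("r1")   # speculative write to r1 gets a new tag (4)
rt.recover()               # misprediction: r1 maps back to its old tag (1)
```

The snapshot makes recovery a single table swap rather than a walk through in-flight instructions, which is the "snapshots for recovering" point in the bullets above.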
Register Renaming
• Advantages
  o Eliminates data replication by not using distributed reservation stations
  o Less common data bus (CDB) overhead
• Disadvantages
  o Increased latency, since tags are fetched instead of values
  o Complicated recovery mechanism for branch mispredictions
Multithreading
• Shared front end (vertical multithreading)
  o Larger resources available in single-thread mode
  o Utilizes the fetch bandwidth
• Dedicated integer execution cores (single thread each)
  o Keeps each integer core small and simple
  o Makes it possible to run at a higher frequency
• Shared FPU (SMT)
  o Consumes a great deal of area and power
  o Rarely utilized to full capacity
• Shared L2 cache (thread agnostic)
  o Beneficial when the two threads share an instruction/data image
Cache Hierarchy
The figure is taken from [1]
TLB Hierarchy
The figure is taken from [1]
Conclusion
• Decoupled branch prediction and instruction fetch enables instruction prefetch
• PRF-based renaming makes the design power efficient
• Non-conventional multithreading: a mix of shared and dedicated resources
References
[1] Bulldozer: An Approach to Multithreaded Compute Performance, http://home.dei.polimi.it/sami/architetture_avanzate/AMDbulldozer.pdf (2011)
[2] AMD Bulldozer Microarchitecture, http://www.realworldtech.com/bulldozer/ (2010)
[3] Bulldozer (microarchitecture), http://en.wikipedia.org/wiki/Bulldozer_(microarchitecture)
[4] Register Renaming, http://en.wikipedia.org/wiki/Register_renaming