ECE552 / CPS550 Advanced Computer Architecture...
-
Upload
dinhkhuong -
Category
Documents
-
view
217 -
download
1
Transcript of ECE552 / CPS550 Advanced Computer Architecture...
ECE552 / CPS550Advanced Computer Architecture I
Lecture 1Metrics and Early Machines
Benjamin LeeElectrical and Computer Engineering
Duke University
www.duke.edu/~bcl15www.duke.edu/~bcl15/class/class_ece552fall16.html
Mark IHarvard University, 1944
EDSACUniversity of Cambridge, 1949
ECE 552 / CPS 550 2
Computing Devices (Then)
ECE 552 / CPS 550 4
Computer Architecture Application
Physics
Gap too large to bridge in one step
Computer architecture is the design of abstraction layers, which allow efficient implementations of computational applications on available technologies
ECE 552 / CPS 550 5
Abstraction Layers
Algorithm
Gates/Register-Transfer Level (RTL)
Application
Instruction Set Architecture (ISA)
Operating System/Virtual Machines
Microarchitecture
Devices
Programming Language
Circuits
Physics
Domain of early
computer architecture
(‘50s-’80s)
Domain of recent
computer architecture(since ‘90s)
ECE 552 / CPS 550 6
An Integrated ApproachArchitect Systems
- Coordinate system across hardware-software interface~ Technology, hardware, run-time software, compilers, apps
- Responsible for end-to-end functionality
Design and Analyze- Search design space of computer systems- Evaluate designs with quantitative metrics
~ Performance, power, cost
Navigate Computing Landscape- Technologies are emerging- Applications are demanding- Systems are scaling
In-order Datapath(understand, ECE 250)
(built, ECE 350)
Chip Multiprocessors(understand, experiment ECE552)
ECE 552 / CPS 550 7
ECE 552 Executive Summary
ECE 552 / CPS 550 8
ECE 552 Administrivia
Instructor Prof. Benjamin [email protected] Hours: TuTh 4-5pm, Hudson 210
Teaching Ramin Bashizade, [email protected] Office Hours: WF 3:30 – 4:30pm, LSRC D301
Tamara Lehman, [email protected] Hours: TuTh 11:30 – 12:30pm, TBD
Lectures Tu/Th 10:05 – 11:20AM, Teer 203
Text Computer Architecture: A Quantitative Approach, 5th Edition (2012). Do not use earlier editions
Web http://www.duke.edu/~BCL15/class/class_ece552fall16.html
ECE 552 / CPS 550 9
ECE 552 PrerequisitesParticipation
- Electrical and Computer Engineering, Computer Science- PhD, MS, Undergraduates
Prerequisites- Introduction to computer architecture (CPS 104, ECE 152, or equiv.)- Programming (homework/projects in C, C++)
Background Knowledge- Instruction sets, computer arithmetic, assembly programmingD.A. Patterson and J.L. Hennessy. Computer Organization and Design: The Hardware/Software Interface, 5th Edition.
ECE 552 / CPS 550 10
ECE 552 Lectures
1. Design Metrics a) Performanceb) Powerc) Early machines
2. Simple Pipelining a) Multi-cycle machinesb) Branch predictionc) In-order superscalar d) Optimizations
3. Complex Pipelining a) Score-boarding, Tomasulo algorithmb) Out-of-order superscalar
Midterm ExamFall Break
4. Explicitly Parallel Architecturesa) VLIWb) Vector machinesc) Multi-threading
5. Memory Systemsa) Cachesb) DRAMc) Virtual memory
6. Multiprocessorsa) Memory modelsb) Coherence protocols
ECE 552 / CPS 550 11
ECE 552 Readings
1. Technologya) Moore’s Law b) Technology scaling
2. Historya) Classic machinesb) The 801 minicomputer
3. Pipelining a) Power as a design constraintb) Optimizing pipeline depth
4. Microarchitecturea) Branch predictionb) Complexity and superscalar design
5. Parallelism Ia) Data flow processorsb) Simultaneous multi-threading
6. Memorya) Victim cacheb) Phase change memory
7. Parallelism IIa) Consistencyb) Coherence
ECE 552 / CPS 550 12
ECE 552 Components30% Homework and Readings
- Homework done in teams of 3- 5 classes dedicated to paper discussions
20% Midterm exam- 75 minutes (in class), closed book
20% Final exam- 3 hours, closed-book
30% Term project/paper- Project done in teams of 3
Academic PolicyUniversity policy as codified by Duke Undergraduate Honor Code will be strictly enforced. Zero tolerance for cheating and/or plagiarism.
ECE 552 / CPS 550 13
ECE 552 Academic Policy
University policy as codified by the Duke Undergraduate Honor Code will be strictly enforced. Zero tolerance for cheating and/or plagiarism.
If a student is suspect of academic dishonesty (e.g., cheating on an exam, copying a lab report, collaborating inappropriately on an assignment), faculty are required to report the matter to the Office of Student Conduct.
A student found responsible for academic dishonesty faces formal disciplinary action, which may include suspension. A student suspended twice for academic dishonesty automatically faces a minimum 5-year separation from Duke University.
ECE 552 / CPS 550 14
ECE 552 Term ProjectScope
- Semester-long research project- Teams of 3 - Students propose project ideas (Oct 14)
Final Paper- 6-12 page research paper- Evaluate research idea quantitatively- Survey and cite related work
ECE 552 / CPS 550 15
ECE 552 Upcoming Deadlines1 September – Reading #1 Due
Readings are available on Sakai.Submit reading responses on Sakai.
1. Moore. “Cramming more components onto integrated circuits”2. Horowitz et al. “Scaling, power, and the future of CMOS”
15 September – Homework #1 DueHomework will be available on Sakai.Submit homework on Sakai in teams of two.
ECE 552 / CPS 550 17
Latency versus ThroughputDefinitions
- Latency: time to finish given task (a.k.a. execution time)- Throughput: number of tasks in given time (a.k.a. bandwidth)- Throughput exploits parallelism. Latency cannot
Example: Move people from Duke to UNC, 10 miles- Car: capacity = 5, speed = 60 miles/hour- Latency = (10 miles @ 60 miles/hour )= 10 minutes- Throughput = (3 trips @ 60 miles per hour) = 15 people/hour
- Bus: capacity = 60, speed = 20 miles/hour- Latency = (10 miles @ 20 miles/hour) = 30 minutes- Throughput = (1 trip @ 20 miles per hour) = 60 people/hour
ECE 552 / CPS 550 18
Aggregating PerformanceAddition
- Latency is additive. Throughput is not. - Example: Consider applications A1 and A2 on processor P- Latency(A1,A2) = Latency(A1) + Latency(A2)- Throughput (A1,A2) = 1/[1/Throughput(A1) + 1/Throughput(A2)]
Averages- Arithmetic Mean: (1/N) * ∑P=1..N Latency(P)- For measures that are proportional to time (e.g., latency)
- Harmonic Mean: N / ∑P=1..N 1/Throughput(P)- For measures that are inversely proportional to time (e.g., throughput)
- Geometric Mean: (∏P=1..N Speedup(P))^(1/N)- For ratios (e.g., speed-ups)
ECE 552 / CPS 550 19
Processor Performance
1
10
100
1000
10000
1978 1980 1982 1984 1986 1988 1990 1992 1994 1996 1998 2000 2002 2004 2006
Per
form
ance
(vs.
VA
X-1
1/78
0)
25%/year
52%/year
??%/year
SPECint Benchmarks. Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th Edition, 2006.
ECE 552 / CPS 550 20
Performance FactorsLatency = (Seconds / Cycle) x (Cycles / Instruction) x (Instructions / Program)
Seconds / Cycle- Technology and architecture- Transistor scaling- Processor microarchitecture
Cycles / Instruction (CPI)- Architecture and systems- Processor microarchitecture- System balance (processor, memory, network, storage)
Instructions / Program- Algorithm and applications- Compiler transformations, optimizations- Instruction set architecture
ECE 552 / CPS 550 21
Performance FactorsLatency = (Seconds / Cycle) x (Cycles / Instruction) x (Instructions / Program)
Seconds / Cycle- Technology and architecture- Transistor scaling- Processor microarchitecture
Cycles / Instruction (CPI)- Architecture and systems- Processor microarchitecture- System balance (processor, memory, network, storage)
Instructions / Program- Algorithm and applications- Compiler transformations, optimizations- Instruction set architecture
ECE 552 / CPS 550 22
Moore’s Law- Moore. “Cramming more components onto integrated circuits.”
Electronics, Vol 38, No. 8, 1965. - As integration increases, packaging cost decreases- How does Moore’s Law impact performance?
ECE 552 / CPS 550 23
Field-Effect Transistors- MOS: metal-oxide semiconductor- FET: field-effect transistor
- Charge carriers flow between source-drain, controlled by gate voltage- Abstract MOSFET as electrical switch
GateSource
Drain
Bulk
Width
Length
Channel
Source
Drain
Gate
ECE 552 / CPS 550 24
Complementary MOS (CMOS)- Map voltages to logical values (Vdd=1, Gnd=0)- Implement complementary Boolean logic
- nFET: conduct charge when Vg = Vdd, used in pull-down network- pFET: conduct charge when Vg = Gnd, used in pull-up network- Examples: Inverter, NAND
Vdd
Gnd
A !A
nFET
pFET BA
A
B
!(AB)
ECE 552 / CPS 550 25
Transistor Dimensions- Process defined by feature size (F), layout design (l = F/2)- Example: F=2l =45nm process technology
- Transistor dimensions determine technology performance- Transistor drive strength (i.e., speed) increases as channel length shrinks
GateSource
Drain
Bulk
Width
Length
Minimum Length=2l
Width=4lSource Drain
Gate
ECE 552 / CPS 550 26
Dennard Scaling- Dennard et al. “Design of ion-implanted MOSFETs with very small physical
dimensions,” Journal Solid State Circuits, 1974.
- Scale not only dimensions but also doping concentration and voltage- Transistors become faster (1.4x)- Applied to Moore’s Law: k=1.4, 1/k = 0.7 every 18-24 months
GateSource
Drain
Bulk
Width
Length
ECE 552 / CPS 550 27
Dennard Scaling Limits- Horowitz et al. “Scaling, power, and the future of CMOS.” IEDM, 2005. - Classical Dennard scaling ended at 130nm in 2000-2001.
- Oxide Thickness: How to manage increasing leakage? Use high-K dielectrics- Channel Length: How to manage increasing leakage? Stop scaling L- Doping Concentration: How to handle imprecise doping? Manage variability- Voltage: How to manage increasing leakage? Stop scaling V- Current: How to increase current with shrinking channels? Stress silicon
- Example: Intel 22nm process technology with FinFET
Image: Courtesy Intel Corp.
ECE 552 / CPS 550 28
Processor Performance
1
10
100
1000
10000
1978 1980 1982 1984 1986 1988 1990 1992 1994 1996 1998 2000 2002 2004 2006
Per
form
ance
(vs.
VA
X-1
1/78
0)
25%/year
52%/year
??%/year
SPECint Benchmarks. Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th Edition, 2006.
ECE 552 / CPS 550 29
Performance FactorsLatency = (Seconds / Cycle) x (Cycles / Instruction) x (Instructions / Program)
Seconds / Cycle- Technology and architecture- Transistor scaling- Processor microarchitecture
Cycles / Instruction (CPI)- Architecture and systems- Processor microarchitecture- System balance (processor, memory, network, storage)
Instructions / Program- Algorithm and applications- Compiler transformations, optimizations- Instruction set architecture
ECE 552 / CPS 550 30
Cycles per Instruction (CPI)Average Instruction Latency
- Different instructions require different number of cycles - Examine instruction frequency - CPI is slightly easier to calculate than IPC (time versus rate)
Example- Instruction frequency: 1/3 INT, 1/3 FP, 1/3 MEM operations- Instruction cycles: 1cy INT, 3cy FP, 2cy MEM- CPI = (1/3 x 1) + (1/3 x 3) + (1/3 x 2)
Caveat- CPI provides high-level, quick estimates of performance- Does not account for details (e.g., instruction dependences)
ECE 552 / CPS 550 31
CPI and DesignBaseline Processor + Application
- Integer ALU: 50%, 1 cycle- Load: 20%, 5 cycle- Store: 10%, 1 cycle- Branch: 20%, 2 cycle
Possible Enhancements- Option 1: Branch prediction for 1-cycle branch- Option 2: Bigger data cache for 3-cycle load- Which enhancement is preferred?
Cycles Per Instruction- Base = (0.5 x 1) + (0.2 x 5) + (0.1 x 1) + (0.2 x 2) = 2 cycles- Option 1 = (0.5 x 1) + (0.2 x 5) + (0.1 x 1) + (0.2 x 1) = 1.8 cycles- Option 1 = (0.5 x 1) + (0.2 x 3) + (0.1 x 1) + (0.2 x 2) = 1.6 cycles
ECE 552 / CPS 550 32
Measuring CPIPhysical Measurements
- Measure wall clock time as application runs- Multiply time by clock frequency to get cycles- Profile application with hardware counters (e.g., Intel VTune)
Simulated Measurements- Cycle-level, microarchitectural simulation (e.g., SimpleScalar)- Run applications on simulated hardware- Track instructions as they progress through the design
ECE 552 / CPS 550 33
BenchmarkingMeasuring Performance
- Target Workload: accurate but not portable- Representative Benchmark: portable but not accurate- Microbenchmark: small, fast code sequences but incomplete
Representative Benchmarks- SPEC (Standard Performance Evaluation Corporation, www.spec.org)- Collects, standardizes, distributes benchmark programs
- Scientific and commercial computing- SPLASH-2, NAS, SPEC OpenMP, SPECjbb
- Online transaction processing (OLTP) with heavy I/O, memory- TPC-C, TPC-H, TPC-W
- Datacenter workloads- Search (e.g., Nutch/Lucene), analytics (e.g,. Hadoop, Spark)
ECE 552 / CPS 550 34
Performance FactorsLatency = (Seconds / Cycle) x (Cycles / Instruction) x (Instructions / Program)
Seconds / Cycle- Technology and architecture- Transistor scaling- Processor microarchitecture
Cycles / Instruction (CPI)- Architecture and systems- Processor microarchitecture- System balance (processor, memory, network, storage)
Instructions / Program- Algorithm and applications- Compiler transformations, optimizations- Instruction set architecture
ECE 552 / CPS 550 35
Single Accumulator
- Carry-over from calculators, typically less than 2-dozen instructions- Single operand (AC)
LOAD x AC ¬ M[x]STORE x M[x] ¬ (AC)
ADD x AC ¬ (AC) + M[x]SUB x
SHIFT LEFT AC ¬ 2 ´ (AC)SHIFT RIGHT
JUMP x PC ¬ xJGE x if (AC) ³ 0 then PC ¬ x
LOAD ADR x AC ß Extract address field (M[x])STORE ADR x
ECE 552 / CPS 550 36
Using Accumulator
LOOP LOAD N # AC ß M[N]JGE DONE # if(AC>0), PCß DONEADD ONE # AC ß AC + 1STORE N # M[N] ß AC
F1 LOAD A # AC ß M[A]F2 ADD B # AC ß (AC) + M[B]F3 STORE C # M[C] ß (AC)
JUMP LOOP
DONE HLT
Ci ¬ Ai + Bi, 1 £ i £ n
Notice M[N] is a counter, not an index. How to modify the addresses A, B and C ?
ECE 552 / CPS 550 37
Self-Modifying Code
LOOP LOAD N # AC ß M[N]JGE DONE # if (AC >= 0), PC ß DONEADD ONE # AC ß AC + M[ONE]STORE N # M[N] ß AC
F1 LOAD A # AC ß M[A]F2 ADD B # AC ß AC + M[B]F3 STORE C # M[C] ß (AC)
LOAD ADR F1 # AC ß address field (M[F1])ADD ONE # AC ß AC + M[ONE]STORE ADR F1 # changes address of ALOAD ADR F2ADD ONESTORE ADR F2 # changes address of BLOAD ADR F3ADD ONESTORE ADR F3 # changes address of CJUMP LOOP
DONE HLT
Ci ¬ Ai + Bi, 1 £ i £ n
Each iteration requires:total book-keeping
Inst fetch 17 14Stores 5 4
ECE 552 / CPS 550 38
Index RegistersSpecialized registers to simplify address calculations
- T. Kilburn, Manchester University, 1950s- Instead of single AC register, use AC and IX registers
Modify Existing Instructions- Load x, IX AC ß M[x + (IX)]- Add x, IX AC ß (AC) + M[x + (IX)]
Add New Instructions- Jzi x, IX if (IX)=0, then PCß x, else (IX)ß(IX)+1- Loadi x, IX IX ß M[x] (truncated to fit IX)
Index registers have accumulator-like characteristics
ECE 552 / CPS 550 39
Using Index Registers
LOADi -n, IX # load n into IXLOOP JZi DONE, IX # if(IX=0), DONE
LOAD LASTA, IX # ACß M[LASTA + (IX)]ADD LASTB, IX # note: LASTA is address STORE LASTC, IX # of last element in AJUMP LOOP
DONE HALT
Ci ¬ Ai + Bi, 1 £ i £ n
- Longer instructions (1-2 bits), index registers with ALU circuitry- Does not require self-modifying code, modify IX instead- Improved program efficiency (operations per iteration)
total book-keepingInst fetch 5 2Stores 1 0
Option 1: Increment index register by kAC ß (IX) new instructionAC ß (AC) + kIX ß (AC) new instruction
Also, the AC must be saved and restored
Option 2: Manipulate index register directlyINCi k, IX IX ß (IX) + kSTOREi x, IX M[x] ß (IX) (extended to fit a word)
IX begins to resemble AC- Several index registers, accumulators- Motivates general-purpose registers (e.g., MIPS ISA R0-R31)
ECE 552 / CPS 550 40
Modifying Index Registers
ECE 552 / CPS 550 41
Evolution of Addressing Modes1. Single accumulator, absolute address
Load x AC ß M[x]
2. Single accumulator, index registersLoad x, IX AC ß M[x + (IX)]
3. Single accumulator, indirection Load (x) AC ß M[M[x]]
4. Multiple accumulators, index registers, indirectionLoad Ri, IX, (x) Ri ß M[M[x] + (IX)]
5. Indirection through registersLoad Ri, (Rj) Ri ß M[M[(Rj)]]
6. The WorksLoad Ri, Rj, (Rk) Ri ß M[Rj + (Rk)]; Rj = index; Rk = base address
ECE 552 / CPS 550 42
Evolution of Instruction FormatsZero-address Formats
- Instructions have zero operands- Operands on a stack
add M[sp] ß M[sp] + M[sp-1]load M[sp] ß M[M[sp]]
- Stack can be registers or memory- Top of stack usually cached in registers
One-address Formats- Instructions have one operand- Accumulator is always other implicit operand
CBASP
Register
ECE 552 / CPS 550 43
Evolution of Instruction FormatsTwo-address Formats
- Destination is same as one of the operand sourcesRi ß (Ri) + (Rj) # (Reg x Reg) to RegRi ß (Ri) + M[x] # (Reg x Mem) to Reg
- x can be specified directly or via register- x address calculation could include indexing, indirection, etc.
Three-address Formats- One destination and up to two operand sources
Ri ß (Rj) + (Rk) # (Reg x Reg) to RegRi ß (Rj) + M[x] # (Reg x Reg) to Reg
ECE 552 / CPS 550 44
Data FormatsData Sizes
- Bytes, Half-words, words, double words
Byte Addressing- Location of most-, least- significant bits
Word Alignment- Suppose memory is organized into 32-bit words (e.g., 4 bytes). - Word aligned addresses begin only at 0, 4, 8, … bytes
Big Endian
Little Endian
MSB
MSB LSB
LSB
0 1 2 3 4 5 6 7
ECE 552 / CPS 550 45
Software DevelopmentsNumerical Libraries (up to 1955)
- floating-point operations- transcendental functions- matrix multiplication, equation solvers, etc.
High-level Languages(1955-1960)- Fortran, 1956- assemblers, loaders, linkers, compilers
Operating Systems (1955-1960)- accounting programs to track usage and charges
ECE 552 / CPS 550 46
Performance FactorsLatency = (Seconds / Cycle) x (Cycles / Instruction) x (Instructions / Program)
Seconds / Cycle- Technology and architecture- Transistor scaling- Processor microarchitecture
Cycles / Instruction (CPI)- Architecture and systems- Processor microarchitecture- System balance (processor, memory, network, storage)
Instructions / Program- Algorithm and applications- Compiler transformations, optimizations- Instruction set architecture
ECE 552 / CPS 550 47
Pitfall: Incomplete MetricsIgnoring Instructions per Program
- Neglect dynamic instruction count- Misleading if working in algorithms, compilers, or ISA
Using Instructions per Second- MIPS = (Instructions / Cycle) x (Cycles / Second) x 1E-6- FLOPS: considers only floating-point instructions- Example: CPI = 2, clock frequency = 500MHz, 250 MIPS- Example: compiler removes instructions, latency falls, MIPS increases
Using Clock Frequency- Cannot equate clock frequency with performance- Proc A: CPI = 2, f = 500MHz- Proc B: CPI = 1, f = 300MHz- Given the same ISA and compiler, B is faster
ECE 552 / CPS 550 48
Pitfall: Diminishing Returns- Amdahl. “Validity of the single-processor approach…” AFIPS, 1967.
Amdhal’s Law (Make Common Case Fast)Consider improving fraction F of system with a speedup S.T(new) = T(base) x (1-F) + T(base) x F / S
= T(base) x [(1-F) + F/S]
Speedup = 1 / [(1-F) + F/S] = T(base)/T(new)Max Speedup = 1 / (1 – F)
Example- Suppose FP computation is 1/4 of an application’s execution time- Maximum benefit from optimizing FP unit is 1.3x (=1/0.75)
- Multiprocessor systems were original application of this law- Accounts for diminishing marginal returns
ECE 552 / CPS 550 50
Power FactorsDefinitions
- Energy (Joules) = a x C x V2
- Power (Watts) = a x C x V2 x f
Power Factors and Trends- activity (a): function of application resource usage- capacitance (C): function of design; scales with area- voltage (V): constrained by leakage, which increases as V falls- frequency (f): varies with pipelining and transistor speeds- Models in cycle-accurate simulators (e.g., Princeton Wattch)
Dynamic Voltage and Frequency Scaling (DVFS)- P-states: move between operational modes with different V, f- Intel TurboBoost: increase V, f for short durations without violating thermal design point (TDP)
ECE 552 / CPS 550 51
Power and TemperatureTemperature• Power density (Watts / sq-mm) is
proxy for thermal effects
• Estimate thermal conductivity, resistance to identify processor hot spots (e.g., HotSpot simulator)
Power Budgets• Power à Package Cost• 130W servers, 65W desktops, 10-30W
laptops, 1-2W hand-held
ECE 552 / CPS 550 52
Power and Multiprocessors
Multiprocessors• Chip multiprocessors (CMPs)
integrate multiple cores on die
Efficiency• Reduce power with simpler cores• Recover lost performance with many
core parallelism
ECE 552 / CPS 550 53
Power and MultiprocessorsLower voltage, frequency• Voltage, frequency scale together V ∝ f• Power proportional to V2 and f Power ∝ V2 f• Performance proportional to f Perf ∝ f
Example• Baseline: 1-core at V, f• Multiprocessor: 4-cores at 0.85V, 0.85f; program is 75% parallel• Core Power @ lower V, f 0.61x =0.853
• Core Performance @ lower V, f 0.85x
• Multicore Power @ lower V, f 2.44x = 0.61x 4• Multicore Performance @ 4 cores 2.28x = 1/[0.25 + (0.75 / 4)]• Multicore Performance @ lower f 1.94x = 2.28 x 0.85
• Multiprocessor: 1.5% power per 1% performance [+144% power, +94% perf]• Boosting V, f: 3% power per 1% performance [+(1.013-1) power, + (1.01-1) perf]
ECE 552 / CPS 550 55
CostNon-recurring Engineering (NRE)
- Dominated by engineer-years ($200K per engineer-year)- Mask costs (>$1M per spin)
Chip Cost- Depends on wafer and chip size, process maturity
Packaging Cost- Depends on number of pins (e.g., signal + power/ground)- Depends on thermal design point (e.g., heat sink)
Total Cost of Ownership- Capital costs (e.g., server procurement cost)- Operating costs (e.g., electricity)
ECE 552 / CPS 550 56
YieldWafers
- Integrated circuits built with multi-step chemical process on wafers- Cost per wafer depends on wafer size, number of steps
Chip (a.k.a. Die)- If chips are large, fewer chips per wafer- Larger chips have lower yield- Uniform defect density- Chip cost is proportional to area2-3
Process Variability- Yield is non-binary- Binning for speed grades- Binning for core count- Post-fabrication tuning with spares
ECE 552 / CPS 550 58
CompatibilityEarly 1960s IBM had 4 incompatible computers
- IBM 701, 650, 702, 1401- Different instruction set architecture- Different I/O system, secondary storage (magnetic taps, drums, disks)- Different assemblers, compilers, libraries- Different markets (e.g., business, scientific, real-time)
The need for compatibility motivated IBM 360.
ECE 552 / CPS 550 59
IBM 360: Design PrinciplesAmdahl, Blaauw and Brooks, “Architecture of the IBM System/360” 1964
1. Support growth and successor machines
2. Connect I/O devices with general method
3. Emphasize total performance- Evaluate programmability, answers per month not bits per second
4. Eliminate manual intervention- Machine must be capable of supervising itself
5. Reduce down time- Build hardware fault checking and fault location support
6. Facilitate assembly - Redundant I/O devices, memories for fault tolerance
7. Support flexibility- Some problems required floating-point words > 36bits
ECE 552 / CPS 550 60
IBM 360: General Purpose RegistersProcessor State
• 16, 32-bit general-purpose registers à use as index and base registers• 4, 64-bit floating-point registers• Program status word (PSW) with program counter (PC)• Condition codes, control flags
Data Formats• 8-bit bytes: the IBM 360 is why bytes are 8-bits long today!• 16-bit half-words• 32-bit words• 64-bit double-words
ECE 552 / CPS 550 61
IBM 360: Initial ImplementationModel 30 Model 70
Storage 8K - 64 KB 256K - 512 KBDatapath 8-bit 64-bitCircuit Delay 30 nsec/level 5 nsec/levelLocal Store Main Store Transistor RegistersControl Store 1 microsecond read Conventional circuits
Abstraction• IBM 360 ISA hid technologies across models
Milestone• The first true ISA designed as portable hardware-software interface• With minor modifications, ISA still survives today
Technology (seconds / cycle)5.2 GHz in IBM 45nm CMOS technology1.4 billion transistors in 512 sq-mm
Microarchitecture (cycle / instruction)64-bit virtual addressingOut-of-order, 3-way superscalar pipelineRedundant datapaths
L1 i-cache (64KB); L1 d-cache (128KB) d-cacheL2 cache (1.5MB), private, per-coreL3 cache (24MB), eDRAM
Power and ParallelismQuad-core designScales to 96 cores in one machine
ECE 552 / CPS 550 62
IBM z11: 47 Years Later
IBM HotChips 2010