![Page 1: CSCI-564 Advanced Computer Architecture - Inside …inside.mines.edu/~bwu/CSCI_564_15SPRING/slides/lec3_perf2.pdfCSCI-564 Advanced Computer Architecture Lecture 3: Amdahl’s Law and](https://reader030.fdocuments.in/reader030/viewer/2022020302/5adbbc157f8b9a1a088b9d9f/html5/thumbnails/1.jpg)
CSCI-564 Advanced Computer Architecture
Lecture 3: Amdahl’s Law and Introduction to MIPS
Bo Wu Colorado School of Mines

Amdahl’s Law
• The fundamental theorem of performance optimization
• Made by Amdahl!
  • One of the designers of the IBM 360
  • Gave “FUD” its modern meaning
• Optimizations do not (generally) uniformly affect the entire program
  • The more widely applicable a technique is, the more valuable it is
  • Conversely, limited applicability can (drastically) reduce the impact of an optimization.

Always heed Amdahl’s Law!!!
It is central to many, many optimization problems

Amdahl’s Law in Action
• SuperJPEG-O-Rama2010 ISA extensions **
  – Speeds up JPEG decode by 10x!!!
  – Act now! While Supplies Last!

** SuperJPEG-O-Rama Inc. makes no claims about the usefulness of this software for any purpose whatsoever. It may not even build. It may cause fatigue, blindness, lethargy, malaise, and irritability. Debugging may be hazardous. It will almost certainly cause ennui. Do not taunt SuperJPEG-O-Rama. Will not, on grounds of principle, decode images of Justin Bieber. Images of Lady Gaga may be transposed, and meat dresses may be rendered as tofu. Not covered by US export control laws or the Geneva Convention, although it probably should be. Beware of dog. Increases processor cost by 45%. Objects in the rear view mirror may appear closer than they are. Or is it farther? Either way, watch out! If you use SuperJPEG-O-Rama, the cake will not be a lie. All your base are belong to 141L. No whining or complaining. Wingeing is allowed, but only in countries where “wingeing” is a word.

Amdahl’s Law in Action
• SuperJPEG-O-Rama2010 in the wild
• PictoBench spends 33% of its time doing JPEG decode
• How much does JOR2k help?
  • w/o JOR2k: 30s total (JPEG decode is about 10s of that)
  • w/ JOR2k: 21s total
• Performance: 30/21 ≈ 1.43x Speedup != 10x
• Amdahl ate our Speedup!
• Is this worth the 45% increase in cost?
  • Metric = Latency * Cost => No
  • Metric = Latency² * Cost => Yes
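
The PictoBench numbers can be re-derived with a few lines of arithmetic. This is only a sketch: it assumes that "33%" means exactly one third of the 30s run, and the variable names are made up for illustration.

```python
# Figures from the slide: 30 s baseline, 33% in JPEG decode, 10x faster decode.
# Assumption: "33%" is treated as exactly one third so the numbers come out round.
old_latency = 30.0
jpeg_fraction = 1.0 / 3.0
jpeg_speedup = 10.0

jpeg_time = old_latency * jpeg_fraction      # time the extension can touch
other_time = old_latency - jpeg_time         # time it cannot touch
new_latency = other_time + jpeg_time / jpeg_speedup
overall_speedup = old_latency / new_latency

print(f"JPEG decode: {jpeg_time:.0f}s -> {jpeg_time / jpeg_speedup:.0f}s")
print(f"Total: {old_latency:.0f}s -> {new_latency:.0f}s ({overall_speedup:.2f}x)")
```

The 20s that JOR2k never touches dominates the new run time, which is why a 10x local speedup shrinks to roughly 1.4x overall.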

Explanation
• Latency*Cost and Latency²*Cost are smaller-is-better metrics.
• Old System: No JOR2k
  • Latency = 30s
  • Cost = C (we don’t know exactly, so we assume a constant, C)
• New System: With JOR2k
  • Latency = 21s
  • Cost = 1.45 * C
• Latency*Cost
  • Old: 30*C
  • New: 21*1.45*C
  • New/Old = (21*1.45*C)/(30*C) = 1.015
  • New is bigger (worse) than old by 1.015x
• Latency²*Cost
  • Old: 30²*C
  • New: 21²*1.45*C
  • New/Old = (21²*1.45*C)/(30²*C) = 0.71
  • New is smaller (better) than old by 0.71x
• In general, you can make C = 1, and just leave it out.
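
The two ratios above can be checked mechanically. The sketch below sets C = 1 as the slide suggests; `metric` and `new_over_old` are hypothetical helper names, not part of the lecture.

```python
def metric(latency: float, cost: float, latency_exponent: int = 1) -> float:
    """Smaller-is-better metric: latency**exponent * cost."""
    return latency ** latency_exponent * cost

OLD = (30.0, 1.0)    # (latency in seconds, cost with C = 1)
NEW = (21.0, 1.45)   # with JOR2k: faster, but 45% more expensive

def new_over_old(latency_exponent: int) -> float:
    """Ratio of the new system's metric to the old system's metric."""
    return metric(*NEW, latency_exponent) / metric(*OLD, latency_exponent)

for exponent in (1, 2):
    ratio = new_over_old(exponent)
    verdict = "better" if ratio < 1.0 else "worse"
    print(f"Latency^{exponent} * Cost: New/Old = {ratio:.3f} ({verdict})")
```

Squaring the latency weights run time more heavily than price, which is what flips the verdict from "worse" to "better".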

Amdahl’s Law
• The second fundamental theorem of computer architecture.
• If we can speed up x of the program by S times
• Amdahl’s Law gives the total speed up, Stot

$$S_{tot} = \frac{1}{x/S + (1 - x)}$$

Sanity check: x = 1 gives

$$S_{tot} = \frac{1}{1/S + (1 - 1)} = \frac{1}{1/S} = S$$
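
As a minimal sketch, the law translates directly into a small function. The name `amdahl_speedup` and the argument checks are illustrative choices rather than anything from the lecture.

```python
def amdahl_speedup(x: float, s: float) -> float:
    """Total speedup when a fraction x of execution time is sped up s times."""
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must be a fraction between 0 and 1")
    if s <= 0.0:
        raise ValueError("s must be positive")
    return 1.0 / (x / s + (1.0 - x))

# Sanity checks: x = 1 returns s itself, x = 0 returns no speedup at all.
print(amdahl_speedup(1.0, 10.0))
print(amdahl_speedup(0.0, 10.0))
# Speeding up half the program by 10x gives less than 2x overall.
print(round(amdahl_speedup(0.5, 10.0), 3))
```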

Amdahl’s Corollary #1
• Maximum possible speedup Smax, if we are targeting x of the program.
• Let S = infinity:

$$S_{max} = \frac{1}{1 - x}$$
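
To get a feel for how tight this bound is, here is a small sketch that tabulates it. The sample fractions are arbitrary, and `max_speedup` is a hypothetical helper name.

```python
def max_speedup(x: float) -> float:
    """Upper bound on total speedup when only a fraction x can be accelerated."""
    if not 0.0 <= x < 1.0:
        raise ValueError("x must satisfy 0 <= x < 1")
    return 1.0 / (1.0 - x)

# Arbitrary sample fractions, chosen only to show the trend.
for x in (0.10, 0.50, 0.90, 0.99):
    print(f"x = {x:.2f}: at most {max_speedup(x):.1f}x")
```

Even an infinitely fast optimization that covers 90% of the run time cannot deliver more than about 10x.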

Amdahl’s Law Example #1
• Protein String Matching Code
  • It runs for 200 hours on the current machine, and spends 20% of time doing integer instructions
  • How much faster must you make the integer unit to make the code run 10 hours faster?
  • How much faster must you make the integer unit to make the code run 50 hours faster?
• Choices: A) 1.1  B) 1.25  C) 1.75  D) 1.31  E) 10.0  F) 50.0  G) 1 million times  H) Other
• 10 hours faster: 1.33 (integer time must drop from 40 hours to 30 hours)
• 50 hours faster: not possible, because integer instructions account for only 40 hours
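
Solving Amdahl’s law for S gives S = x*T / (x*T - saved), which only has a solution when the time to be saved is smaller than the time spent in the targeted portion. The sketch below assumes that rearrangement; `required_speedup` is a hypothetical helper that returns `None` when no speedup is large enough.

```python
from typing import Optional

def required_speedup(total_hours: float, fraction: float,
                     hours_saved: float) -> Optional[float]:
    """Speedup needed on `fraction` of the run to save `hours_saved` hours."""
    target_time = total_hours * fraction       # hours spent in the targeted part
    remaining = target_time - hours_saved      # hours that part may still take
    if remaining <= 0.0:
        return None                            # even an infinite speedup falls short
    return target_time / remaining

print(required_speedup(200.0, 0.20, 10.0))   # integer unit, save 10 hours
print(required_speedup(200.0, 0.20, 50.0))   # integer unit, save 50 hours
```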

Amdahl’s Law Example #2
• Protein String Matching Code
  • 4 days execution time on current machine
  • 20% of time doing integer instructions
  • 35% of time doing I/O
• Which is the better tradeoff?
  • Compiler optimization that reduces number of integer instructions by 25% (assume each integer instruction takes the same amount of time)
  • Hardware optimization that reduces the latency of each I/O operation from 6us to 5us.

Explanation
• Speed up integer ops
  • x = 0.2
  • S = 1/(1-0.25) = 1.33
  • Sint = 1/(0.2/1.33 + 0.8) = 1.052
• Speed up I/O
  • x = 0.35
  • S = 6us/5us = 1.2
  • Sio = 1/(0.35/1.2 + 0.65) = 1.062
• Speeding up I/O is better
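
The comparison can be reproduced as a short sketch. It keeps S for the integer case unrounded (4/3 instead of 1.33), so the integer result prints as about 1.053 rather than 1.052; the function name is illustrative.

```python
def amdahl_speedup(x: float, s: float) -> float:
    """Total speedup when a fraction x of execution time is sped up s times."""
    return 1.0 / (x / s + (1.0 - x))

# 25% fewer integer instructions means integer time shrinks to 0.75 of itself.
integer_speedup = 1.0 / (1.0 - 0.25)
# Each I/O operation drops from 6us to 5us.
io_speedup = 6.0 / 5.0

s_int = amdahl_speedup(0.20, integer_speedup)
s_io = amdahl_speedup(0.35, io_speedup)

print(f"integer optimization: {s_int:.3f}x")
print(f"I/O optimization:     {s_io:.3f}x")
print("better:", "I/O" if s_io > s_int else "integer")
```

The I/O change has the smaller local speedup (1.2 versus 1.33) but applies to a larger share of execution, so it wins.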

Amdahl’s Corollary #2
• Make the common case fast (i.e., x should be large)!
  • Common == “most time consuming” not necessarily “most frequent”
  • The uncommon case doesn’t make much difference
  • Be sure of what the common case is
  • The common case can change based on inputs, compiler options, optimizations you’ve applied, etc.
• Repeat…
  • With optimization, the common becomes uncommon.
  • An uncommon case will (hopefully) become the new common case.
  • Now you have a new target for optimization.

Amdahl’s Corollary #2: Example
• Figure: bars of execution time with the current common case marked. Each round speeds up the common case, and the overall gain shrinks:
  • 7x => 1.4x
  • 4x => 1.3x
  • 1.3x => 1.1x
  • Total = 20/10 = 2x
• In the end, there is no common case!
• Options:
  • Global optimizations (faster clock, better compiler)
  • Divide the program up differently
    • e.g. Focus on classes of instructions (maybe memory or FP?), rather than functions.
    • e.g. Focus on function call overheads (which are everywhere).
  • War of attrition
  • Total redesign (You are probably well-prepared for this)

Amdahl’s Corollary #3
• Benefits of parallel processing
• p processors
• x of the program is p-way parallelizable
• Maximum speedup, Spar

$$S_{par} = \frac{1}{x/p + (1 - x)}$$

• A key challenge in parallel programming is increasing x for large p.
  • x is pretty small for desktop applications, even for p = 2
  • This is a big part of why multi-processors are of limited usefulness.
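
A quick way to see why x matters more than p is to tabulate Spar. In this sketch the fractions 0.5 and 0.9 and the processor counts are arbitrary illustrative values, not measurements.

```python
def parallel_speedup(x: float, p: int) -> float:
    """Maximum speedup with p processors when a fraction x is p-way parallelizable."""
    return 1.0 / (x / p + (1.0 - x))

# Hypothetical parallel fractions and processor counts, chosen to show the trend.
for x in (0.5, 0.9):
    row = ", ".join(f"p={p}: {parallel_speedup(x, p):.2f}x" for p in (2, 4, 16, 1024))
    print(f"x = {x}: {row}")
```

With x = 0.9 the speedup stays below 10x no matter how many processors are added, because the serial 10% never shrinks.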

Example #3
• Recent advances in process technology have quadrupled the number of transistors you can fit on your die.
• Currently, your key customer can use up to 4 processors for 40% of their application.
• You have two choices:
  • Increase the number of processors from 1 to 4
  • Use 2 processors but add features that will allow the application to use 2 processors for 80% of execution.
• Which will you choose?
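
One way to settle the question is to evaluate the parallel form of the law, 1/(x/p + (1-x)), for both designs. This is a sketch that assumes the parallel portions scale perfectly; the variable names are made up.

```python
def parallel_speedup(x: float, p: int) -> float:
    """Maximum speedup with p processors when a fraction x is p-way parallelizable."""
    return 1.0 / (x / p + (1.0 - x))

four_processors = parallel_speedup(0.40, 4)   # 4 processors, 40% parallel
two_processors = parallel_speedup(0.80, 2)    # 2 processors, 80% parallel

print(f"4 processors, 40% parallel: {four_processors:.2f}x")
print(f"2 processors, 80% parallel: {two_processors:.2f}x")
```

Under that assumption the two-processor design comes out ahead (about 1.67x versus 1.43x): raising x beats raising p.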

Amdahl’s Corollary #4
• Amdahl’s law for latency (L)
• By definition
  • Speedup = oldLatency/newLatency
  • newLatency = oldLatency * 1/Speedup
• By Amdahl’s law:
  • newLatency = oldLatency * (x/S + (1-x))
  • newLatency = x*oldLatency/S + oldLatency*(1-x)
• Amdahl’s law for latency
  • newLatency = x*oldLatency/S + oldLatency*(1-x)
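
The latency form is handy when the question is "how long will it take now?". As a sketch, here it is applied to the I/O optimization from Example #2, with the 4-day run expressed as 96 hours; `new_latency` is a hypothetical helper name.

```python
def new_latency(old_latency: float, x: float, s: float) -> float:
    """Latency after a fraction x of the old latency is sped up s times."""
    return x * old_latency / s + old_latency * (1.0 - x)

old_hours = 4 * 24.0                              # 4 days of execution
new_hours = new_latency(old_hours, x=0.35, s=6.0 / 5.0)

print(f"old: {old_hours:.1f} h, new: {new_hours:.1f} h")
print(f"speedup: {old_hours / new_hours:.3f}x")
```

Dividing the old latency by the new one recovers the same speedup the original form of the law gives.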

Amdahl’s Non-Corollary
• Amdahl’s law does not bound slowdown
  • newLatency = x*oldLatency/S + oldLatency*(1-x)
  • newLatency is linear in 1/S
• Example: x = 0.01 of execution, oldLat = 1
  • S = 0.001:
    • newLat = 1000*oldLat*0.01 + oldLat*0.99 ≈ 10*oldLat
  • S = 0.00001:
    • newLat = 100000*oldLat*0.01 + oldLat*0.99 ≈ 1000*oldLat
• Things can only get so fast, but they can get arbitrarily slow.
  • Do not hurt the non-common case too much!
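
The asymmetry is easy to see numerically. This sketch takes oldLat = 1 and x = 0.01 from the slide and sweeps S over a few values; the large-S entry is an added illustrative value, and `relative_latency` is a hypothetical helper name.

```python
def relative_latency(x: float, s: float) -> float:
    """newLatency / oldLatency when a fraction x is 'sped up' by s (s < 1 is a slowdown)."""
    return x / s + (1.0 - x)

x = 0.01
# 1e9 is an illustrative "nearly infinite" speedup; the two small values are from the slide.
for s in (1e9, 1.0, 0.001, 0.00001):
    print(f"S = {s:g}: newLat = {relative_latency(x, s):.2f} * oldLat")
```

Making 1% of the program infinitely fast saves at most 1% of the run time, while making it slow enough costs as much as you like.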

Amdahl’s Example #4
This one is tricky
• Memory operations currently take 30% of execution time.
• A new widget called a “cache” speeds up 80% of memory operations by a factor of 4
• A second new widget called a “L2 cache” speeds up 1/2 the remaining 20% by a factor of 2.
• What is the total speed up?

Answer in Pictures
Figure: three stacked bars of execution time, split into L1 (memory time sped up by the first cache), L2 (memory time sped up by the second cache), n/a (memory time neither cache helps), and not memory. Shares of each bar's total are given in parentheses.

| Bar | L1 | L2 | n/a | Not memory | Total |
|---|---|---|---|---|---|
| Original | 0.24 (24%) | 0.03 (3%) | 0.03 (3%) | 0.7 (70%) | 1 |
| L1 sped up | 0.06 (7.3%) | 0.03 (3.7%) | 0.03 (3.7%) | 0.7 (85.4%) | 0.82 |
| L1 and L2 sped up | 0.06 | 0.015 | 0.03 | 0.7 | 0.805 |

Speed up = 1/0.805 = 1.242

Amdahl’s Pitfall: This is wrong!
• You cannot trivially apply optimizations one at a time with Amdahl’s law.
• Apply the L1 cache first
  • S1 = 4
  • x1 = 0.8*0.3
  • StotL1 = 1/(x1/S1 + (1-x1))
  • StotL1 = 1/(0.8*0.3/4 + (1-(0.8*0.3))) = 1/(0.06 + 0.76) = 1.2195 times
• Then, apply the L2 cache
  • SL2 = 2
  • xL2 = 0.3*(1 - 0.8)/2 = 0.03 (This is wrong)
  • StotL2’ = 1/(0.03/2 + (1-0.03)) = 1/(0.015 + 0.97) = 1.015 times
• Combine
  • StotL2 = StotL2’ * StotL1 = 1.015*1.2195 = 1.238 (So is this)
• What’s wrong? After we do the L1 cache, the execution time changes, so the fraction of execution that the L2 affects actually grows
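
The sketch below contrasts the naive chaining with one possible fix: rescale the L2 fraction by the shorter run time before applying the law a second time. Variable names are illustrative, and rescaling is just one way to repair the calculation.

```python
def amdahl_speedup(x: float, s: float) -> float:
    """Total speedup when a fraction x of execution time is sped up s times."""
    return 1.0 / (x / s + (1.0 - x))

x_l1, s_l1 = 0.8 * 0.3, 4.0          # L1: 80% of the 30% spent in memory ops
x_l2, s_l2 = 0.5 * 0.2 * 0.3, 2.0    # L2: half of the remaining 20% of memory ops

after_l1 = amdahl_speedup(x_l1, s_l1)

# Naive: reuse x_l2 as measured on the ORIGINAL run time.
naive_total = after_l1 * amdahl_speedup(x_l2, s_l2)

# Fix: after L1 the run takes 1/after_l1 of the original time,
# so the same L2 work is a larger share of what is left.
x_l2_rescaled = x_l2 * after_l1
fixed_total = after_l1 * amdahl_speedup(x_l2_rescaled, s_l2)

print(f"naive: {naive_total:.3f}x, rescaled: {fixed_total:.3f}x")
```

The rescaled result agrees with the 1/0.805 ≈ 1.242 read off the pictures, while the naive product comes out slightly low.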
![Page 47: CSCI-564 Advanced Computer Architecture - Inside …inside.mines.edu/~bwu/CSCI_564_15SPRING/slides/lec3_perf2.pdfCSCI-564 Advanced Computer Architecture Lecture 3: Amdahl’s Law and](https://reader030.fdocuments.in/reader030/viewer/2022020302/5adbbc157f8b9a1a088b9d9f/html5/thumbnails/47.jpg)
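Answer in Pictures
[Figure: three stacked bars of execution time, each split into L1, L2, n/a, and not-memory portions]
• Original: L1 0.24 + L2 0.03 + n/a 0.03 + not memory 0.7; Total = 1 (24%, 3%, 3%, 70%)
• L1 sped up: 0.06 + 0.03 + 0.03 + 0.7; Total = 0.82 (7.3%, 3.7%, 3.7%, 85.4%)
• L1 and L2 sped up: 0.06 + 0.015 + 0.03 + 0.7; Total = 0.805
• Speed up = 1/0.805 = 1.242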
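The picture amounts to bookkeeping on absolute time rather than on fractions. A small sketch of that accounting, using the portion sizes from the figure, is below. The dictionary layout and the function name `apply_speedups` are assumptions made for illustration.

```python
# Portions of the original execution time (they sum to 1).
PARTS = {"L1": 0.24, "L2": 0.03, "n/a": 0.03, "not memory": 0.70}

def apply_speedups(parts, speedups):
    """Divide each portion's time by its speedup; unlisted portions are unchanged."""
    return {name: t / speedups.get(name, 1.0) for name, t in parts.items()}

after_l1 = apply_speedups(PARTS, {"L1": 4.0})
after_both = apply_speedups(PARTS, {"L1": 4.0, "L2": 2.0})

total_l1 = sum(after_l1.values())
total_both = sum(after_both.values())

print(f"after L1:   total = {total_l1:.3f}")
print(f"after both: total = {total_both:.3f}, speedup = {1 / total_both:.3f}")

# Share of each portion once the L1 is in place (the middle bar).
shares_l1 = {name: t / total_l1 for name, t in after_l1.items()}
print({name: f"{share:.1%}" for name, share in shares_l1.items()})
```

Because every portion is tracked in units of the original run time, the order in which the caches are applied does not matter here.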
![Page 48: CSCI-564 Advanced Computer Architecture - Inside …inside.mines.edu/~bwu/CSCI_564_15SPRING/slides/lec3_perf2.pdfCSCI-564 Advanced Computer Architecture Lecture 3: Amdahl’s Law and](https://reader030.fdocuments.in/reader030/viewer/2022020302/5adbbc157f8b9a1a088b9d9f/html5/thumbnails/48.jpg)
Multiple optimizations done right
• We can apply the law for multiple optimizations
  • Optimization 1 speeds up x1 of the program by S1
  • Optimization 2 speeds up x2 of the program by S2
• Stot = 1/(x1/S1 + x2/S2 + (1 - x1 - x2))
• Note that x1 and x2 must be disjoint!
  • i.e., S1 and S2 must not apply to the same portion of execution.
• If they are not disjoint, treat the overlap as a separate portion of execution and measure its speedup independently
  • ex: we have x1only, x2only, and x1&2, with speedups S1only, S2only, and S1&2
  • Then Stot = 1/(x1only/S1only + x2only/S2only + x1&2/S1&2 + (1 - x1only - x2only - x1&2))
• You can estimate S1&2 as S1only*S2only, but the real value could be higher or lower.
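The multi-optimization form of the law generalizes to any number of disjoint portions. The sketch below is one way to write it. The function name `amdahl_multi` is made up for this example, and the overlap numbers in the second call are hypothetical, not from the lecture. The code cannot verify disjointness by itself, so it only rejects fractions that sum to more than the whole program.

```python
def amdahl_multi(parts):
    """Overall speedup for (fraction, speedup) pairs over DISJOINT portions.

    The caller is responsible for disjointness; this function can only
    detect the impossible case where the fractions sum to more than 1.
    """
    parts = list(parts)
    covered = sum(x for x, _ in parts)
    if covered > 1.0 + 1e-12:
        raise ValueError("fractions sum to more than the whole program")
    return 1.0 / (sum(x / s for x, s in parts) + (1.0 - covered))

# Disjoint case: the L1 and L2 numbers used in this lecture.
print(amdahl_multi([(0.24, 4.0), (0.03, 2.0)]))

# Overlapping case (hypothetical numbers): split into x1only, x2only, x1&2,
# and estimate S1&2 as S1only * S2only, which is only an estimate.
s1_only, s2_only = 2.0, 4.0
print(amdahl_multi([
    (0.20, s1_only),            # x1only
    (0.10, s2_only),            # x2only
    (0.10, s1_only * s2_only),  # x1&2, estimated speedup
]))
```

In practice the speedup of the overlapping portion is better measured than estimated, since the two optimizations may help or hinder each other.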
![Page 50: CSCI-564 Advanced Computer Architecture - Inside …inside.mines.edu/~bwu/CSCI_564_15SPRING/slides/lec3_perf2.pdfCSCI-564 Advanced Computer Architecture Lecture 3: Amdahl’s Law and](https://reader030.fdocuments.in/reader030/viewer/2022020302/5adbbc157f8b9a1a088b9d9f/html5/thumbnails/50.jpg)
Multiple Opt. Practice
• Combine both the L1 and the L2
  • Memory operations are 30% of execution time
  • SL1 = 4
  • xL1 = 0.3*0.8 = 0.24
  • SL2 = 2
  • xL2 = 0.3*(1 - 0.8)/2 = 0.03
  • StotL2 = 1/(xL1/SL1 + xL2/SL2 + (1 - xL1 - xL2))
  • StotL2 = 1/(0.24/4 + 0.03/2 + (1 - 0.24 - 0.03)) = 1/(0.06 + 0.015 + 0.73) = 1.24 times
![Page 51: CSCI-564 Advanced Computer Architecture - Inside …inside.mines.edu/~bwu/CSCI_564_15SPRING/slides/lec3_perf2.pdfCSCI-564 Advanced Computer Architecture Lecture 3: Amdahl’s Law and](https://reader030.fdocuments.in/reader030/viewer/2022020302/5adbbc157f8b9a1a088b9d9f/html5/thumbnails/51.jpg)
The Idea of the CPU
![Page 52: CSCI-564 Advanced Computer Architecture - Inside …inside.mines.edu/~bwu/CSCI_564_15SPRING/slides/lec3_perf2.pdfCSCI-564 Advanced Computer Architecture Lecture 3: Amdahl’s Law and](https://reader030.fdocuments.in/reader030/viewer/2022020302/5adbbc157f8b9a1a088b9d9f/html5/thumbnails/52.jpg)
The Stored Program Computer
• The program is data
  • It is a series of bits
  • It lives in memory
  • A series of discrete “instructions”
• The program counter (PC) controls execution
  • It points to the current instruction
  • Advances through the program
[Figure: block diagram with CPU, PC, Instruction Memory, and Data Memory]
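The role of the PC can be illustrated with a toy machine. This is only a sketch under loud assumptions: the three-instruction set (`add`, `jump`, `halt`), the single accumulator, and the tuple encoding are all invented for illustration and have nothing to do with MIPS. A real machine stores instructions as bits, not Python tuples.

```python
def run(program, max_steps=100):
    """Execute a toy program held in 'instruction memory' (a Python list)."""
    pc = 0    # program counter: index of the current instruction
    acc = 0   # one accumulator register (hypothetical)
    for _ in range(max_steps):
        op, arg = program[pc]   # fetch the instruction the PC points to
        pc += 1                 # by default, advance through the program
        if op == "add":
            acc += arg
        elif op == "jump":
            pc = arg            # control flow is just a write to the PC
        elif op == "halt":
            return acc
        else:
            raise ValueError(f"unknown instruction {op!r}")
    raise RuntimeError("step limit reached without halt")

PROGRAM = [
    ("add", 5),     # 0
    ("jump", 3),    # 1: skip the next instruction
    ("add", 100),   # 2: never executed
    ("add", 2),     # 3
    ("halt", 0),    # 4
]
print(run(PROGRAM))
```

Since the program is just a list in memory, another program could build or modify it before running it, which is the sense in which the program is data.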
![Page 61: CSCI-564 Advanced Computer Architecture - Inside …inside.mines.edu/~bwu/CSCI_564_15SPRING/slides/lec3_perf2.pdfCSCI-564 Advanced Computer Architecture Lecture 3: Amdahl’s Law and](https://reader030.fdocuments.in/reader030/viewer/2022020302/5adbbc157f8b9a1a088b9d9f/html5/thumbnails/61.jpg)
The Instruction Set Architecture (ISA)
• The ISA is the set of instructions a computer can execute
  • All programs are combinations of these instructions
  • It is an abstraction that programmers (and compilers) use to express computations
• The ISA defines a set of operations, their semantics, and rules for their use.
  • The software agrees to follow these rules.
• The hardware can implement those rules IN ANY WAY IT CHOOSES!
  • Directly in hardware
  • Via a software layer (i.e., a virtual machine)
  • Via a trained monkey with a pen and paper
  • Via a software simulator (like SPIM)
• Also called “the big A architecture”
![Page 64: CSCI-564 Advanced Computer Architecture - Inside …inside.mines.edu/~bwu/CSCI_564_15SPRING/slides/lec3_perf2.pdfCSCI-564 Advanced Computer Architecture Lecture 3: Amdahl’s Law and](https://reader030.fdocuments.in/reader030/viewer/2022020302/5adbbc157f8b9a1a088b9d9f/html5/thumbnails/64.jpg)
The MIPS ISA
![Page 65: CSCI-564 Advanced Computer Architecture - Inside …inside.mines.edu/~bwu/CSCI_564_15SPRING/slides/lec3_perf2.pdfCSCI-564 Advanced Computer Architecture Lecture 3: Amdahl’s Law and](https://reader030.fdocuments.in/reader030/viewer/2022020302/5adbbc157f8b9a1a088b9d9f/html5/thumbnails/65.jpg)
We Will Study Two ISAs
• MIPS
  • Simple, elegant, easy to implement
  • Designed with the benefit of many years of ISA design experience
  • Designed for modern programmers, tools, and applications
  • The basis for your implementation project in 141L
  • Not widely used in the real world (but similar ISAs are pretty common, e.g. ARM)
• x86
  • Ugly, messy, inelegant, crufty, arcane, very difficult to implement.
  • Designed for 1970s technology
  • Nearly the last in a long series of unfortunate ISA designs.
  • The dominant ISA in modern computer systems.
![Page 67: CSCI-564 Advanced Computer Architecture - Inside …inside.mines.edu/~bwu/CSCI_564_15SPRING/slides/lec3_perf2.pdfCSCI-564 Advanced Computer Architecture Lecture 3: Amdahl’s Law and](https://reader030.fdocuments.in/reader030/viewer/2022020302/5adbbc157f8b9a1a088b9d9f/html5/thumbnails/67.jpg)
MIPS Basics
• Instructions
  • 4 bytes (32 bits)
  • 4-byte aligned (i.e., they start at addresses that are a multiple of 4 -- 0x0000, 0x0004, etc.)
  • Instructions operate on memory and registers
• Memory data types (also aligned)
  • Bytes -- 8 bits
  • Half words -- 16 bits
  • Words -- 32 bits
  • Memory is denoted “M” (e.g., M[0x10] is the byte at address 0x10)
• Registers
  • 32 4-byte registers in the “register file”
  • Denoted “R” (e.g., R[2] is register 2)
• There’s a handy reference on the inside cover of your textbook and a detailed reference in Appendix B.
![Page 70: CSCI-564 Advanced Computer Architecture - Inside …inside.mines.edu/~bwu/CSCI_564_15SPRING/slides/lec3_perf2.pdfCSCI-564 Advanced Computer Architecture Lecture 3: Amdahl’s Law and](https://reader030.fdocuments.in/reader030/viewer/2022020302/5adbbc157f8b9a1a088b9d9f/html5/thumbnails/70.jpg)
Bytes and Words

Byte addresses

| Address | Data |
|---|---|
| 0x0000 | 0xAA |
| 0x0001 | 0x15 |
| 0x0002 | 0x13 |
| 0x0003 | 0xFF |
| 0x0004 | 0x76 |
| ... | . |
| 0xFFFE | . |
| 0xFFFF | . |

Half word addresses

| Address | Data |
|---|---|
| 0x0000 | 0xAA15 |
| 0x0002 | 0x13FF |
| 0x0004 | . |
| 0x0006 | . |
| ... | . |
| 0xFFFC | . |

Word addresses

| Address | Data |
|---|---|
| 0x0000 | 0xAA1513FF |
| 0x0004 | . |
| 0x0008 | . |
| 0x000C | . |
| ... | . |
| 0xFFFC | . |

• In modern ISAs (including MIPS) memory is “byte addressable”
• In MIPS, half words and words are aligned.
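The three views of the same memory can be reproduced in a few lines. This is a minimal sketch that assumes the big-endian byte order implied by the tables above (the byte at the lowest address is the most significant). The function name `load` and its error messages are illustrative only.

```python
# Bytes at addresses 0x0000..0x0004, copied from the byte-address table.
MEMORY = bytes([0xAA, 0x15, 0x13, 0xFF, 0x76])

def load(mem, addr, size):
    """Read an aligned 1-, 2-, or 4-byte value, big-endian as in the tables."""
    if size not in (1, 2, 4):
        raise ValueError("size must be 1, 2, or 4 bytes")
    if addr % size != 0:
        raise ValueError(f"address {addr:#06x} is not {size}-byte aligned")
    if addr + size > len(mem):
        raise IndexError("access runs past the end of memory")
    return int.from_bytes(mem[addr:addr + size], "big")

print(hex(load(MEMORY, 0x0000, 1)))  # byte view
print(hex(load(MEMORY, 0x0002, 2)))  # half-word view
print(hex(load(MEMORY, 0x0000, 4)))  # word view
```

A half-word read at address 0x0001 or a word read at address 0x0002 raises an error here, mirroring the alignment rule on the slide.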
![Page 71: CSCI-564 Advanced Computer Architecture - Inside …inside.mines.edu/~bwu/CSCI_564_15SPRING/slides/lec3_perf2.pdfCSCI-564 Advanced Computer Architecture Lecture 3: Amdahl’s Law and](https://reader030.fdocuments.in/reader030/viewer/2022020302/5adbbc157f8b9a1a088b9d9f/html5/thumbnails/71.jpg)
The MIPS Register File
• All registers are the same
  • Where a register is needed any register will work
• By convention, we use them for particular tasks
  • Argument passing
  • Temporaries, etc.
  • These rules (“the register discipline”) are part of the ISA
• $zero is the “zero register”
  • It is always zero.
  • Writes to it have no effect.

| Name | Number | Use | Callee saved |
|---|---|---|---|
| $zero | 0 | zero | n/a |
| $at | 1 | assembler temp | no |
| $v0 - $v1 | 2 - 3 | return value | no |
| $a0 - $a3 | 4 - 7 | arguments | no |
| $t0 - $t7 | 8 - 15 | temporaries | no |
| $s0 - $s7 | 16 - 23 | saved temporaries | yes |
| $t8 - $t9 | 24 - 25 | temporaries | no |
| $k0 - $k1 | 26 - 27 | reserved for OS | yes |
| $gp | 28 | global ptr | yes |
| $sp | 29 | stack ptr | yes |
| $fp | 30 | frame ptr | yes |
| $ra | 31 | return address | yes |
![Page 72: CSCI-564 Advanced Computer Architecture - Inside …inside.mines.edu/~bwu/CSCI_564_15SPRING/slides/lec3_perf2.pdfCSCI-564 Advanced Computer Architecture Lecture 3: Amdahl’s Law and](https://reader030.fdocuments.in/reader030/viewer/2022020302/5adbbc157f8b9a1a088b9d9f/html5/thumbnails/72.jpg)
MIPS R-Type Arithmetic Instructions
• R-Type instructions encode operations of the form “a = b OP c” where ‘OP’ is +, -, <<, &, etc.
  • More formally, R[rd] = R[rs] OP R[rt]
• Bit fields
  • “opcode” encodes the operation type.
  • “funct” specifies the particular operation.
  • “rs” and “rt” are source registers; “rd” is the destination register
    • 5 bits can specify one of 32 registers.
  • “shamt” is the “shift amount” for shift operations
    • Since registers are 32 bits, 5 bits are sufficient
R-Type format: Opcode [31:26] (6 bits) | rs [25:21] (5 bits) | rt [20:16] (5 bits) | rd [15:11] (5 bits) | shamt [10:6] (5 bits) | funct [5:0] (6 bits)
Examples
• add $t0, $t1, $t2
  • R[8] = R[9] + R[10]
  • opcode = 0, funct = 0x20
• nor $a0, $s0, $t4
  • R[4] = ~(R[16] | R[12])
  • opcode = 0, funct = 0x27
• sll $t0, $t1, 4
  • R[8] = R[9] << 4
  • opcode = 0, funct = 0x0, shamt = 4
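Packing the R-Type bit fields into a 32-bit word is mechanical. Here is a sketch of an encoder and decoder that follow the field positions given on the slide. The function names `encode_r` and `decode_r` and the small register-name table are assumptions for this example, not part of any real assembler.

```python
# Register numbers for the names used in the examples (from the register file table).
REG = {"$zero": 0, "$a0": 4, "$t0": 8, "$t1": 9, "$t2": 10, "$t4": 12, "$s0": 16}

def encode_r(opcode, rs, rt, rd, shamt, funct):
    """Pack R-Type fields: opcode[31:26] rs[25:21] rt[20:16] rd[15:11] shamt[10:6] funct[5:0]."""
    for value, bits in ((opcode, 6), (rs, 5), (rt, 5), (rd, 5), (shamt, 5), (funct, 6)):
        if not 0 <= value < (1 << bits):
            raise ValueError(f"field value {value} does not fit in {bits} bits")
    return (opcode << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

def decode_r(word):
    """Split a 32-bit word back into its R-Type fields."""
    return {
        "opcode": (word >> 26) & 0x3F,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
        "rd": (word >> 11) & 0x1F,
        "shamt": (word >> 6) & 0x1F,
        "funct": word & 0x3F,
    }

# add $t0, $t1, $t2  ->  R[8] = R[9] + R[10]
add_word = encode_r(0, REG["$t1"], REG["$t2"], REG["$t0"], 0, 0x20)
# nor $a0, $s0, $t4  ->  R[4] = ~(R[16] | R[12])
nor_word = encode_r(0, REG["$s0"], REG["$t4"], REG["$a0"], 0, 0x27)

print(f"{add_word:#010x}")
print(f"{nor_word:#010x}")
print(decode_r(add_word))
```

Note that the assembly operand order (destination first) differs from the field order in the encoded word (rs, rt, then rd).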
![Page 73: CSCI-564 Advanced Computer Architecture - Inside …inside.mines.edu/~bwu/CSCI_564_15SPRING/slides/lec3_perf2.pdfCSCI-564 Advanced Computer Architecture Lecture 3: Amdahl’s Law and](https://reader030.fdocuments.in/reader030/viewer/2022020302/5adbbc157f8b9a1a088b9d9f/html5/thumbnails/73.jpg)
MIPS R-Type Control Instructions
• R-Type encodes “register-indirect” jumps
• Jump register
  • jr rs: PC = R[rs]
• Jump and link register
  • jalr rs, rd: R[rd] = PC + 8; PC = R[rs]
  • rd defaults to $ra (i.e., the assembler will fill it in if you leave it out)
R-Type format: Opcode [31:26] (6 bits) | rs [25:21] (5 bits) | rt [20:16] (5 bits) | rd [15:11] (5 bits) | shamt [10:6] (5 bits) | funct [5:0] (6 bits)
Examples
• jr $t2
  • PC = R[10]
  • opcode = 0, funct = 0x8
• jalr $t0
  • R[31] = PC + 8
  • PC = R[8]
  • opcode = 0, funct = 0x9
• jalr $t0, $t1
  • R[9] = PC + 8
  • PC = R[8]
  • opcode = 0, funct = 0x9
![Page 74: CSCI-564 Advanced Computer Architecture - Inside …inside.mines.edu/~bwu/CSCI_564_15SPRING/slides/lec3_perf2.pdfCSCI-564 Advanced Computer Architecture Lecture 3: Amdahl’s Law and](https://reader030.fdocuments.in/reader030/viewer/2022020302/5adbbc157f8b9a1a088b9d9f/html5/thumbnails/74.jpg)
MIPS I-Type Arithmetic Instructions
• I-Type arithmetic instructions encode operations of the form “a = b OP #”
  • ‘OP’ is +, -, <<, &, etc. and # is an integer constant
  • More formally, e.g.: R[rt] = R[rs] + 42
• Components
  • “opcode” encodes the operation type.
  • “rs” is the source register
  • “rt” is the destination register
  • “immediate” is a 16 bit constant used as an argument for the operation
I-Type format: Opcode [31:26] (6 bits) | rs [25:21] (5 bits) | rt [20:16] (5 bits) | Immediate [15:0] (16 bits)
Examples
• addi $t0, $t1, -42
  • R[8] = R[9] + -42
  • opcode = 0x8
• ori $t0, $zero, 42
  • R[8] = R[0] | 42
  • opcode = 0xd
  • Loads a constant into $t0
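The I-Type layout can be exercised the same way as R-Type. The sketch below encodes the two examples from the slide and shows how a negative immediate is stored in 16 bits. The names `encode_i` and `sign_extend16` are invented for this illustration.

```python
def encode_i(opcode, rs, rt, imm):
    """Pack I-Type fields: opcode[31:26] rs[25:21] rt[20:16] immediate[15:0]."""
    if not -0x8000 <= imm <= 0xFFFF:
        raise ValueError("immediate does not fit in 16 bits")
    return (opcode << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)

def sign_extend16(imm):
    """Interpret the low 16 bits as a two's complement number."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm

# addi $t0, $t1, -42  ->  R[8] = R[9] + -42   (rs = 9, rt = 8)
addi_word = encode_i(0x8, 9, 8, -42)
# ori $t0, $zero, 42  ->  R[8] = R[0] | 42    (rs = 0, rt = 8)
ori_word = encode_i(0xD, 0, 8, 42)

print(f"{addi_word:#010x}")
print(f"{ori_word:#010x}")

# The stored field is 0xFFD6; sign extension recovers -42.
print(hex(addi_word & 0xFFFF), sign_extend16(addi_word & 0xFFFF))
```

The destination register number lands in the rt field, since I-Type instructions have no rd field.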
![Page 75: CSCI-564 Advanced Computer Architecture - Inside …inside.mines.edu/~bwu/CSCI_564_15SPRING/slides/lec3_perf2.pdfCSCI-564 Advanced Computer Architecture Lecture 3: Amdahl’s Law and](https://reader030.fdocuments.in/reader030/viewer/2022020302/5adbbc157f8b9a1a088b9d9f/html5/thumbnails/75.jpg)
MIPS I-Type Branch Instructions
• I-Type also encodes branches
  • if (R[rs] OP R[rt]) PC = PC + 4 + 4 * Immediate
  • else PC = PC + 4
• Components
  • “rs” and “rt” are the two registers to be compared
  • “rt” is sometimes used to specify branch type.
  • “immediate” is a 16 bit branch offset
    • It is the signed offset to the target of the branch
    • Limits branch distance to 32K instructions
    • Usually specified as a label, and the assembler fills it in for you.
I-Type format: Opcode [31:26] (6 bits) | rs [25:21] (5 bits) | rt [20:16] (5 bits) | Immediate [15:0] (16 bits)
Examples
• beq $t0, $t1, -42
  • if R[8] == R[9] PC = PC + 4 + 4*-42
  • opcode = 0x4
• bgez $t0, -42
  • if R[8] >= 0 PC = PC + 4 + 4*-42
  • opcode = 0x1
  • rt = 1
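The branch rule, PC = PC + 4 + 4 * Immediate when the condition holds, can be written out directly. This is a sketch under stated assumptions: registers are modeled as a list of unsigned 32-bit values, the starting PC value is arbitrary, and all function names are illustrative.

```python
MASK32 = 0xFFFFFFFF

def sign_extend16(imm):
    """Interpret the low 16 bits as a two's complement number."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm

def to_signed32(value):
    """Interpret a 32-bit register value as a signed integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value

def next_pc(pc, imm16, taken):
    """Branch target if taken, otherwise fall through to the next instruction."""
    if taken:
        return (pc + 4 + 4 * sign_extend16(imm16)) & MASK32
    return (pc + 4) & MASK32

def beq(regs, pc, rs, rt, imm16):
    return next_pc(pc, imm16, regs[rs] == regs[rt])

def bgez(regs, pc, rs, imm16):
    return next_pc(pc, imm16, to_signed32(regs[rs]) >= 0)

regs = [0] * 32
regs[8], regs[9] = 7, 7
pc = 0x00400100                           # arbitrary example address

print(hex(beq(regs, pc, 8, 9, -42)))      # taken: R[8] == R[9]
regs[9] = 1
print(hex(beq(regs, pc, 8, 9, -42)))      # not taken
regs[8] = 0xFFFFFFFF                      # -1 as a signed value
print(hex(bgez(regs, pc, 8, -42)))        # not taken
```

The offset counts instructions rather than bytes, which is why it is multiplied by 4 and why a 16-bit signed field reaches 32K instructions in either direction.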
![Page 76: CSCI-564 Advanced Computer Architecture - Inside …inside.mines.edu/~bwu/CSCI_564_15SPRING/slides/lec3_perf2.pdfCSCI-564 Advanced Computer Architecture Lecture 3: Amdahl’s Law and](https://reader030.fdocuments.in/reader030/viewer/2022020302/5adbbc157f8b9a1a088b9d9f/html5/thumbnails/76.jpg)
MIPS I-Type Memory Instructions
• I-Type also encodes memory access
  • Store: M[R[rs] + Immediate] = R[rt]
  • Load: R[rt] = M[R[rs] + Immediate]
• MIPS has loads and stores for byte, half word, and word
• Sub-word loads can also be signed or unsigned
  • Signed loads sign-extend the value to fill a 32 bit register.
  • Unsigned loads zero-extend the value.
• “immediate” is a 16 bit offset
  • Useful for accessing structure components
  • It is signed.
I-Type format: Opcode [31:26] (6 bits) | rs [25:21] (5 bits) | rt [20:16] (5 bits) | Immediate [15:0] (16 bits)
Examples
• lw $t0, 4($t1)
  • R[8] = M[R[9] + 4]
  • opcode = 0x23
• sb $t0, -17($t1)
  • M[R[9] + -17] = R[8]
  • opcode = 0x28
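The difference between signed and unsigned sub-word loads is easiest to see by running one. The following sketch models memory as a `bytearray` and registers as a list of unsigned 32-bit values. The big-endian word layout and every function name here are assumptions for illustration, not a real simulator API.

```python
MASK32 = 0xFFFFFFFF

def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def effective_address(regs, rs, imm16):
    """R[rs] + sign-extended 16-bit offset, wrapped to 32 bits."""
    return (regs[rs] + sign_extend(imm16, 16)) & MASK32

def sb(mem, regs, rt, imm16, rs):
    mem[effective_address(regs, rs, imm16)] = regs[rt] & 0xFF

def lb(mem, regs, rt, imm16, rs):
    regs[rt] = sign_extend(mem[effective_address(regs, rs, imm16)], 8) & MASK32

def lbu(mem, regs, rt, imm16, rs):
    regs[rt] = mem[effective_address(regs, rs, imm16)]

def lw(mem, regs, rt, imm16, rs):
    addr = effective_address(regs, rs, imm16)
    if addr % 4 != 0:
        raise ValueError("word access must be 4-byte aligned")
    regs[rt] = int.from_bytes(mem[addr:addr + 4], "big")  # big-endian assumed

mem = bytearray(64)
regs = [0] * 32
regs[9] = 20          # $t1: base address
regs[8] = 0x1F0       # $t0: only the low byte, 0xF0, gets stored

sb(mem, regs, 8, -17, 9)     # sb $t0, -17($t1): writes M[3]
lb(mem, regs, 10, -17, 9)    # signed load of the same byte
lbu(mem, regs, 11, -17, 9)   # unsigned load of the same byte
lw(mem, regs, 12, -20, 9)    # word at address 0

print(hex(regs[10]), hex(regs[11]), hex(regs[12]))
```

The same byte, 0xF0, becomes 0xFFFFFFF0 through the signed load and 0x000000F0 through the unsigned one.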
![Page 77: CSCI-564 Advanced Computer Architecture - Inside …inside.mines.edu/~bwu/CSCI_564_15SPRING/slides/lec3_perf2.pdfCSCI-564 Advanced Computer Architecture Lecture 3: Amdahl’s Law and](https://reader030.fdocuments.in/reader030/viewer/2022020302/5adbbc157f8b9a1a088b9d9f/html5/thumbnails/77.jpg)
MIPS J-Type Instructions
• J-Type encodes the jump instructions
• Plain Jump
  • JumpAddress = {PC+4[31:28], Address, 2’b0}
  • Address replaces most of the PC
  • PC = JumpAddress
• Jump and Link
  • R[$ra] = PC + 8; PC = JumpAddress
• J-Type also encodes misc instructions
  • syscall, interrupt return, and break (more later)
J-Type format: Opcode [31:26] (6 bits) | Address [25:0] (26 bits)
Examples
• j label
  • PC = JumpAddress
  • opcode = 0x2
• jal label
  • R[31] = PC + 8
  • PC = JumpAddress
  • opcode = 0x3
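The concatenation {PC+4[31:28], Address, 2’b0} is a pair of mask-and-shift operations. Here is a sketch that encodes a plain jump and reconstructs its target. The function names and the example addresses are assumptions chosen for illustration.

```python
def encode_j(opcode, target):
    """Pack J-Type fields: opcode[31:26] address[25:0], where address = target / 4."""
    if target % 4 != 0:
        raise ValueError("jump target must be 4-byte aligned")
    return (opcode << 26) | ((target >> 2) & 0x03FFFFFF)

def jump_address(pc, address26):
    """Rebuild the target: top 4 bits of PC+4, then the 26-bit field, then two zero bits."""
    upper = (pc + 4) & 0xF0000000
    return upper | ((address26 & 0x03FFFFFF) << 2)

pc = 0x00400000       # address of the jump itself (example value)
target = 0x00400020   # where the label is assumed to sit

word = encode_j(0x2, target)      # opcode 0x2 is the plain jump
field = word & 0x03FFFFFF

print(f"{word:#010x}")
print(f"{jump_address(pc, field):#010x}")
```

Because only 26 bits of the word address are stored, a jump can only reach targets that share the upper 4 address bits of PC+4. Register-indirect jumps cover the rest.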