Question Bank CS1601


Transcript of Question Bank CS1601

Page 1: Question Bank CS1601

Sub-code CS1601

Subject Name Computer Architecture

Semester I

Degree ME

Branch Computer Science

Staff Name Dr. Hari T. S. Narayanan

Date of Update 1.10.08

S.No. Part Unit Group Q.No. Question

1 A 2 & 3 1 1 Describe the three types of instruction dependences with appropriate examples.

2 A 2 & 3 1 2 Give examples using assembly-level code segments for each of the three instruction dependences. No definitions required.

3 A 1 1 3 If a computer supports an address space of 4G, how many bits are required for the address register?
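Worked check (assuming byte addressing): 4 G = 2^32 addressable locations, so a 32-bit address register is required.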

4 A 4 1 4 Draw the memory hierarchy that is used in a typical desktop computer. List typical size and performance values for each of these levels.

5 A 4 1 5 Compare the three cache mapping functions in terms of access speed and cache miss rate.

6 A 5 2 1 Compare coarse-grain and fine-grain multithreading with their performance trade-offs.

7 A 5 2 2 Why is Invalidation preferred over Write-distribution? State exactly two reasons.

8 A 4 2 3 What is the average rotational delay for a disk system with 10000 RPM?
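Worked check: the average rotational delay is half a revolution, i.e., 0.5 x (60 / 10000) s = 3 ms.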

9 A 1 2 4 Describe the relationship between response time, user time, elapsed time, and system time in the context of a process.

10 A 1 2 5 What are Big Endian and Little Endian representations?

11 A 2 & 3 3 1 Draw a 6-stage pipeline with multiple Functional Units.

12 A 2 & 3 3 2 Why is Read After Write (RAW) not a problem with the extended ID stage (Issue & Read Operands)?

13 A 2 & 3 3 3 Describe the three types of instruction dependences with appropriate examples.

14 A 2 & 3 3 4 Illustrate Write After Read and Write After Write hazards with appropriate examples.

15 A 2 & 3 3 5 Compare static scheduling of instructions with dynamic scheduling.

16 A 2 & 3 4 1 Describe the working of a branch prediction algorithm using a prediction buffer of size 1 bit.

Page 2: Question Bank CS1601

17 A 2 & 3 4 2 Provide an example to show that data dependence by itself is not sufficient and control dependence needs to be considered as well. Why is the combination of Data Flow and Exceptional Behavior preferred over the combination of Control and Data Dependence in scheduling instructions?

18 A 2 & 3 4 3 Draw a 5-stage pipeline with multiple Functional Units.

19 A 2 & 3 4 4 Describe Write After Read Hazard and Write After Write Hazard.

20 A 2 & 3 4 5 Explain Read After Write (RAW) with an appropriate example.

21 A 2 & 3 5 1 What is the steady-state best-case throughput of a pipelined architecture where each stage takes 1 clock cycle of a 4 GHz clock? Give your answer in instructions per second.
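Worked check: in steady state the pipeline completes one instruction per clock cycle, so the best-case throughput equals the clock rate, 4 x 10^9 instructions per second.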

22 A 4 5 2 Draw the memory hierarchy that is used in a typical desktop computer. List typical size and access speed for each of these levels.

23 A 4 5 3 Describe cache direct mapping using an example. For instance, a memory with 512 blocks and a cache with 32 blocks.

24 A 4 5 4 A program accesses cache 4 million times during its execution. How many of these accesses are hits if the miss rate is 0.01%?
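Worked check: hits = 4,000,000 x (1 - 0.0001) = 3,999,600.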

25 A 4 5 5 A program includes 5 million instructions. On average, each instruction takes 1.5 cycles if the entire program were loaded into the cache of a computer with a 1 GHz clock. If the program takes 8 million CPU cycles to execute, how much time is wasted in stalling?
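Worked sketch: the ideal execution takes 5,000,000 x 1.5 = 7.5 million cycles, so 8 - 7.5 = 0.5 million cycles are stalls, i.e., 0.5 ms at 1 GHz.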

26 A 4 6 1 A program accesses cache 4 million times during its execution. The miss rate is 0.01%. If the CPU stalls 2000 cycles for each cache miss, compute the number of misses.
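Worked check: misses = 4,000,000 x 0.0001 = 400.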

27 A 5 6 2 What are the two different shared memory arrangements used in MIMD architecture?

28 A 5 6 3 Why is a shared-bus MIMD system referred to as a symmetric multiprocessor system? What is the other common name for this arrangement?

29 A 1 6 4 Compare RISC and CISC architectures

30 A 2 & 3 6 5 What is the basic principle behind Tomasulo's algorithm? Illustrate that with an example.

31 A 4 7 1 List the essential memory requirements. Compare the memory requirements of Desktop, Server, and Embedded Systems.

Page 3: Question Bank CS1601

32 A 4 7 2 Draw the memory hierarchy that is used in a typical desktop computer. List size and performance values for each of these levels.

33 A 4 7 3 A program accesses cache 2 million times during its execution. How many of these accesses are hits if the miss rate is 0.025%?
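Worked check: hits = 2,000,000 x (1 - 0.00025) = 1,999,500.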

34 A 4 7 4 A program includes 4 million instructions. On average, each instruction takes 1.5 cycles if the entire program were loaded into cache. If the program takes 7 million CPU cycles to execute, how many CPU cycles are spent in stalling?
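Worked check: the ideal execution takes 4,000,000 x 1.5 = 6 million cycles, so 7 - 6 = 1 million cycles are spent stalling.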

35 A 4 7 5 Do we need a cache replacement scheme for direct mapping? Justify your choice.

36 A 4 8 1 Compare the two Write Miss schemes.

37 A 4 8 2 Compare Write Back and Write Through objectively.

38 A 1 8 3 What are benchmarks?

39 A 2 & 3 8 4 In the 6-stage pipeline, if there are three consecutive arithmetic instructions and each arithmetic instruction takes 3 cycles, then the third instruction has to be stalled due to the limited number (2) of ALUs. This is referred to as structural dependence.

40 A 4 8 5 Compare Split cache and Unified cache

42 Part B

43 B 1 1 1
i. Write program segments to multiply two integer numbers that are in memory (A and B) using different internal storage types (stack, register-register, register-memory, and accumulator). The result of this operation is to be stored in memory. (4)
ii. Draw and describe the 32-bit floating-point representation. (4)
iii. List the steps in converting a given decimal number to a binary 32-bit floating-point representation. (4)
iv. Convert the following number to 32-bit floating-point representation: 255.625. (4)
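For part iv, a worked note plus a small C check (a sketch, assuming IEEE 754 single precision): 255.625 = 11111111.101 in binary = 1.1111111101 x 2^7, so the sign bit is 0, the biased exponent is 127 + 7 = 134, and the stored fraction is 1111111101 padded with zeros, giving the pattern 0x437FA000.

#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void) {
    float f = 255.625f;              /* 255.625 = 11111111.101 in binary          */
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);  /* reinterpret the IEEE 754 bit pattern      */
    printf("0x%08X\n", (unsigned)bits);  /* prints 0x437FA000                     */
    return 0;
}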

Page 4: Question Bank CS1601

44 B 1 1 2
i. Do we need both auto-decrement and auto-increment instructions? Can we implement one using the other? (2)
ii. Compare register-register and register-memory internal storages. (2)
iii. Describe floating-point arithmetic operations (add, subtract, multiply, and divide) using appropriate examples. (4)
iv. Convert the following decimal number to 32-bit floating-point representation: 20482.875 (6)
v. Describe briefly the two locality properties. (2)

45 B 1 1 3
i. Describe Amdahl's law on speedup. (4)
ii. Deduce the limit of speedup suggested by Amdahl's law. (4)
iii. Suppose we have made the following measurements:
Frequency of FP operations (other than FPSQR) = 25%
Average CPI of FP operations = 4.0
Average CPI of other instructions = 1.33
Frequency of FPSQR = 2%
CPI of FPSQR = 20
Assume that the two design alternatives are to decrease the CPI of FPSQR to 2 or to decrease the average CPI of all FP operations to 2.5. Compare these two design alternatives using the CPU performance equation. (8)
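Worked sketch for part iii using the CPU performance equation (assuming the remaining 75% of instructions are the "other" class): CPI_original is roughly 0.25 x 4.0 + 0.75 x 1.33, about 2.0. Cutting FPSQR's CPI from 20 to 2 saves 0.02 x (20 - 2) = 0.36, giving about 1.64; cutting the CPI of all FP operations to 2.5 gives 0.25 x 2.5 + 0.75 x 1.33, about 1.62, so the FP-wide change is slightly better.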

46 B 1 1 4
iv. Consider the following program, which includes two parts, A and B. Part A must be executed serially; Part B can be broken down into components that can be executed in parallel. If Part A and Part B were executed serially in a computer, they would take 20 and 70 million CPU cycles respectively.
a. If the above computer is using a 2 GHz clock, what is the total CPU time required to complete the program? (3)
b. If the total instructions executed were 30 million, what is the average Clock cycles Per Instruction (CPI)? (3)
c. What is the average time taken to complete an instruction? (2)
d. What is the maximum speedup that is possible for this program? (4)
v. Is it possible to have an average CPI of less than 1? Justify your answer. (4)
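Worked sketch for part iv (assuming only Part B can be parallelized): total work = 20 + 70 = 90 million cycles; (a) 90 x 10^6 / 2 GHz = 45 ms; (b) CPI = 90 M / 30 M = 3; (c) 3 cycles / 2 GHz = 1.5 ns per instruction; (d) by Amdahl's law the runtime can shrink at best to the serial 20 million cycles, so the maximum speedup is 90 / 20 = 4.5.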

Page 5: Question Bank CS1601

47 B 2 & 3 1 5
i. Describe how the following terms are related: Dependence, Hazard, and Stall, in the context of Instruction Level Parallelism (ILP). (4)
ii. Describe the four types of data dependence hazards. Provide examples for RAW, WAW, and WAR using simple assembly code. (4)
iii. Consider the un-pipelined processor section. Assume that it has a 1 ns clock cycle and that it uses 4 cycles for ALU operations and branches and 5 cycles for memory operations. Assume that the relative frequencies of these operations are 30%, 30%, and 40% respectively. Suppose that, due to clock skew and setup, pipelining the processor adds 0.25 ns of overhead to the clock. Ignoring any latency impact, how much speedup in the instruction execution rate will we gain from a pipeline? (8)
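Worked sketch for part iii: the average un-pipelined instruction time is 0.3 x 4 + 0.3 x 4 + 0.4 x 5 = 4.4 ns; the pipelined clock is 1 + 0.25 = 1.25 ns per instruction, so the speedup is about 4.4 / 1.25 = 3.52.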

48 B 2 & 3 2 1
i. Describe how the following terms are related: stall, bypass, dynamic scheduling. (4)
ii. Describe the Scoreboard algorithm. (4)
iii. Describe how this algorithm solves WAR, WAW, and RAW hazards. (4)
iv. Identify all the dependences in the following code. (4)
DIV.D F0, F2, F4
ADD.D F6, F0, F8
S.D   F6, 0(R1)
SUB.D F8, F10, F14
MUL.D F6, F10, F8

49 B 2 & 3 2 2
i. Describe control dependence. (4)
ii. Write a code segment that illustrates control, data, and name dependence. (4)
iii. Consider a pipelined processor with 5 stages. There is only one Functional Unit to execute all the arithmetic and logical operations. The clock that drives this processor runs at 4 GHz. Each stage can be completed in a single clock cycle.
a. What will be the throughput in the best case? (2)
b. Express your answer for (a) in instructions per second. (2)
c. What is the best-case speedup compared to a non-pipelined computer?
d. If the execution stage were to take 2 clock cycles to complete, then what will be the best-case throughput? Express your answer in instructions per second. Assume static scheduling. (4)
e. If we add one more Arithmetic & Logic Unit and continue to use static scheduling, will there be any difference in the average throughput?

Page 6: Question Bank CS1601

50 B 2 & 3 2 3
i. Describe the Scoreboard algorithm and describe how it solves WAR, RAW, and WAW hazards. (6)
ii. Describe Tomasulo's algorithm and describe how it solves WAR, RAW, and WAW hazards. (6)
iii. Eliminate the name dependence in the following code with minimal change. (1)
DIV.D F0, F2, F4
ADD.D F6, F0, F8
S.D   F6, 0(R1)
SUB.D F8, F10, F14
MUL.D F6, F10, F8
iv. Is it possible for a compiler to eliminate the name dependence found in the above code segment? Is there any limitation in doing so? (3)

51 B 2 & 3 2 4
i. Why can MIPS not be used to compare two computers? (2)
ii. Discuss the role of compilers in achieving instruction level parallelism. (4)
iii. Describe the role of benchmarking in evaluating computers. (4)
iv. Choose one of the following computers (C1, C2, and C3) for a set of applications with the following profile; justify your answer. The applications within the set are classified into 3 types: P1, P2, and P3. The integer value in the ith row and jth column indicates the average time (in microseconds) taken to complete Pi-type programs on the Cj computer. 80% of the programs in the set are of type P1, 10% are of type P2, and the rest are of type P3. (6)
Profile/Computer   C1    C2    C3
P1                  5    10    20
P2                100    50    20
P3                 10    10    20
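Worked sketch for part iv (weighting each computer's times by the program mix): C1 = 0.8 x 5 + 0.1 x 100 + 0.1 x 10 = 15 us, C2 = 0.8 x 10 + 0.1 x 50 + 0.1 x 10 = 14 us, C3 = 0.8 x 20 + 0.1 x 20 + 0.1 x 20 = 20 us, so C2 has the lowest expected time.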

52 B 2 & 3 2 5
i. Describe the Scoreboard algorithm and compare it with Tomasulo's algorithm. (6)
ii. Why does dynamic scheduling not make sense with a single functional unit? (2)
iii. There is a simple pipeline with n stages. Assume all stages occupy an equal number of clock cycles and there is only one ALU.
a. What is the maximum possible speed-up? (2)
b. The average stall per instruction is 0.2 CPU cycles; what is the maximum possible speed-up? (2)
c. The effect of having additional Functional Units is to decrease the average stall by 20%. What is the speedup possible? (4)

Page 7: Question Bank CS1601

53 B 1 & 3 3 1
i. Compare register-register and register-memory internal storages. (2)
ii. Write program segments to add two integer numbers (in memory locations A and B) using different internal storage types (stack, register-register, register-memory, and accumulator). The result should be stored in A. (4)
iii. What are the conditions under which two consecutive multiplication commands could be scheduled without stalling? (2)
iv. Describe the 32-bit floating-point representation. (3)
v. Describe the steps in converting a given decimal number to a binary 32-bit floating-point representation. (2)
vi. Convert the following number to 32-bit floating-point representation: 155.5 (3)

54 B 2 & 3 3 2
i. Describe Amdahl's law. (2)
ii. Deduce the limit of speedup suggested by Amdahl's law. (2)
iii. Suppose we have made the following measurements:
Frequency of FP operations (other than FPSQR) = 20%
Average CPI of FP operations = 4.0
Average CPI of other instructions = 1.33
Frequency of FPSQR = 5%
CPI of FPSQR = 20
Assume that the two design alternatives are to decrease the CPI of FPSQR to 2 or to decrease the average CPI of all FP operations to 2.5. Compare these two design alternatives using the CPU performance equation. (4)
iv. Describe how the following terms are related: Dependence, Hazard, and Stall, in the context of Instruction Level Parallelism (ILP). (2)
v. Describe the four types of hazards that arise when you try to exploit ILP. Provide examples for RAW, WAW, and WAR using simple assembly code. (2)
vi. Consider the un-pipelined processor section. Assume that it has a 1 ns clock cycle and that it uses 4 cycles for ALU operations and branches and 5 cycles for memory operations. Assume that the relative frequencies of these operations are 40%, 40%, and 20% respectively. Suppose that, due to clock skew and setup, pipelining the processor adds 0.25 ns of overhead to the clock. Ignoring any latency impact, how much speedup in the instruction execution rate will we gain from a pipeline? (4)

Page 8: Question Bank CS1601

55 B 1 & 3 3 3
i. What is the use of cache coherence protocols? What are the two classes of cache coherence protocols? (4)
ii. What is the snooping protocol? Justify the name snooping protocol.
iii. What is the Directory-based coherence protocol? (4)
iv. What are the inherent features of bus-based systems that are missing in interconnect-based systems? (4)

56 B 2 & 3 3 4

57 B 1 & 3 3 5
i. Explain the program execution time in terms of Miss Rate, Miss Penalty, Memory Accesses Per Instruction, CPU Cycles Per Instruction, and Number of Instructions. (4)
ii. Compute the program execution time where the number of instructions is 12500, average CPU cycles per instruction is 2, average memory accesses per instruction is 1.5, miss rate is 2%, and miss penalty is 200 CC (CPU Cycles). Compare the execution time for a miss penalty of 150 CC. (4)
iii. Describe the Least Recently Used (LRU) and Least Frequently Used (LFU) cache replacement algorithms. (4)
iv. Describe MTBF, MTTR, & MTTDL. (4)
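Worked sketch for part ii (using execution cycles = IC x (CPI + memory accesses per instruction x miss rate x miss penalty)): 12500 x (2 + 1.5 x 0.02 x 200) = 12500 x 8 = 100,000 cycles; with a 150 CC penalty, 12500 x (2 + 4.5) = 81,250 cycles.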

58 B 4 4 1
i. Describe the differences between SRAM and DRAM in the context of main memory and cache memory. (4)
ii. Describe and compare unified and split cache schemes. What is the appropriate cache type (split or unified) to use for Level 1 and Level 2 cache implementations? Justify your choice. (4)
iii. Assume we have a computer where Clock cycles Per Instruction (CPI) is 1.0 when all memory accesses are cache hits. The only data accesses are loads and stores, and these total 50% of the instructions. If the miss penalty is 45 clock cycles and the miss rate is 2%, how much faster would the computer be if all instructions were cache hits? (4)
iv. What is the processor-memory gap? Explain Gordon Moore's law in this context. (4)
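Worked sketch for part iii: memory accesses per instruction = 1 (instruction fetch) + 0.5 (loads/stores) = 1.5, so the CPI with misses is 1.0 + 1.5 x 0.02 x 45 = 2.35; the all-hit machine is therefore about 2.35 times faster.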

Page 9: Question Bank CS1601

59 B 2 & 3 4 2
i. Describe the technical and logistical problems of the VLIW model. (4)
ii. Describe loop-carried dependence with a dependence distance of n, with an example. (4)
iii. Consider the following loop. What are the dependences between S1 and S2? Is this loop parallel? If not, make it parallel. (4)
for (i = 1; i <= 100; i = i + 1) {
    A[i] = A[i] + B[i];      /* S1 */
    B[i+1] = C[i] + D[i];    /* S2 */
}
iv. The following loop has multiple types of dependences. Find all the true dependences and anti-dependences, and eliminate the output dependences and anti-dependences by renaming. (4)
for (i = 1; i <= 100; i = i + 1) {
    y[i] = x[i] / c;    /* S1 */
    x[i] = x[i] + c;    /* S2 */
    z[i] = y[i] + c;    /* S3 */
    y[i] = c - y[i];    /* S4 */
}
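For part iii, S2 writes B[i+1], which S1 reads in the following iteration, so the dependence is loop-carried from S2 to S1 and there is no dependence cycle. One standard transformation that makes the loop parallel, sketched as a self-contained C program with illustrative array sizes:

#include <stdio.h>

int main(void) {
    double A[102] = {0}, B[102] = {0}, C[102] = {0}, D[102] = {0};
    int i;
    /* Original loop: S1: A[i] = A[i] + B[i];  S2: B[i+1] = C[i] + D[i];  i = 1..100.
       Peel the first S1 and the last S2 so each remaining iteration is independent. */
    A[1] = A[1] + B[1];                   /* first S1, uses the old B[1]            */
    for (i = 1; i <= 99; i = i + 1) {
        B[i+1] = C[i] + D[i];             /* S2 of iteration i                      */
        A[i+1] = A[i+1] + B[i+1];         /* S1 of iteration i+1, same B[i+1]       */
    }
    B[101] = C[100] + D[100];             /* last S2                                */
    printf("%f\n", A[101]);               /* keep the arrays observable             */
    return 0;
}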

60 B 4 4 3
i. Describe how having a multiple-level cache reduces the miss penalty. (4)
ii. Describe Critical Word First and Early Restart. Explain how these two techniques reduce miss penalty. (4)
iii. Describe the effect of having a larger line size on cache misses. (4)
iv. Assume a fully associative write-back cache with many cache entries that starts empty. Below is a sequence of five memory operations (the address is in square brackets). What are the numbers of hits and misses using no-write allocate versus write allocate? (4)
WriteMem[100];
WriteMem[100];
ReadMem[200];
WriteMem[200];
WriteMem[100];
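Worked sketch for part iv (assuming the five-operation sequence above): with no-write allocate, the three writes to address 100 all miss because the block is never brought in, ReadMem[200] misses and allocates, and WriteMem[200] hits, giving 4 misses and 1 hit; with write allocate, the first write to each of 100 and 200 misses and the remaining three accesses hit, giving 2 misses and 3 hits.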

Page 10: Question Bank CS1601

61 B 4 4 4
i. Describe the principle of locality; discuss how spatial and temporal localities are made use of in caching. (4)
ii. Describe loop interchange with an example. (4)
iii. Describe array merging with an example. (4)
iv. Describe loop fusion with an example. (4)
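A minimal C sketch of loop interchange for part ii (illustrative array sizes assumed): both loop nests compute the same result, but the second walks x row by row, which matches C's row-major layout and improves spatial locality.

#include <stdio.h>

#define ROWS 5000
#define COLS 100
static double x[ROWS][COLS];

int main(void) {
    int i, j;
    /* Before interchange: the inner loop strides down a column,
       touching a different cache line on almost every access.   */
    for (j = 0; j < COLS; j++)
        for (i = 0; i < ROWS; i++)
            x[i][j] = 2.0 * x[i][j];
    /* After interchange: the inner loop walks one row sequentially,
       so consecutive accesses fall in the same cache line.         */
    for (i = 0; i < ROWS; i++)
        for (j = 0; j < COLS; j++)
            x[i][j] = 2.0 * x[i][j];
    printf("%f\n", x[0][0]);   /* keep the work observable */
    return 0;
}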

62 B 1 4 5
i. Write program segments to multiply two integer numbers that are in memory (A and B) using different internal storage types (stack, register-register, register-memory, and accumulator). The result of this operation is to be stored in memory (A). (4)
ii. Draw and describe the 32-bit floating-point representation. (4)
iii. List the steps in converting a given decimal number to a binary 32-bit floating-point representation. (4)
iv. Convert the following number to 32-bit floating-point representation: 125.5. (4)

63 B 4 5 1
i. Given the data below, what is the impact of second-level cache associativity on its miss penalty? (4)
Hit time L2 for direct mapped = 10 clock cycles
Two-way set associativity increases hit time by 0.1 clock cycles to 10.1 clock cycles
Local miss rate L2 for direct mapped = 30%
Local miss rate L2 for two-way set associative = 20%
Miss penalty L2 = 200 clock cycles
ii. Describe the effect of larger cache size on cache misses. (4)
iii. Describe direct mapping, set-associative mapping, and fully associative mapping. Use one diagram that illustrates all three. (4)
iv. A program includes 4 million instructions. On average, each instruction takes 2.0 cycles (1 GHz) if the entire program were loaded into cache. If the program takes 9 million CPU cycles to execute, how much CPU time is spent in stalling? If the CPU stalls 4000 nanoseconds for each cache miss, compute the number of misses. (4)
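Worked sketch for parts i and iv: the first-level miss penalty is Hit time L2 + Local miss rate L2 x Miss penalty L2, i.e., 10 + 0.30 x 200 = 70 cycles for direct mapped versus 10.1 + 0.20 x 200 = 50.1 cycles for two-way set associative. For iv, the ideal execution is 4 M x 2 = 8 million cycles, so 9 - 8 = 1 million cycles (1 ms at 1 GHz) are stalls, and 1,000,000 ns / 4000 ns per miss gives about 250 misses.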

Page 11: Question Bank CS1601

64 B 2 & 3 5 2
i. Unroll and schedule the following piece of code. This code adds a constant value to each element of a floating-point vector. The required latency information is shown in the cycle numbering below. Assume there are 4 FP adders in your CPU. (10)
L:  L.D   F0, 0(R1)     ; cycle 1
    stall               ; cycle 2
    ADD.D F4, F0, F2    ; cycle 3
    stall               ; cycle 4
    stall               ; cycle 5
    S.D   F4, 0(R1)     ; cycle 6
    ADDI  R1, R1, #-8   ; cycle 7
    stall               ; cycle 8 (note: latency of 1 between the ALU result and BNE)
    BNE   R1, R2, L     ; cycle 9
    stall
ii. Calculate the average number of instructions to process one addition in your code. (2)
iii. Why does dynamic scheduling not make sense with a single functional unit? (2)
iv. There is a simple pipeline with n stages. Assume all stages occupy an equal number of clock cycles and there is only one ALU. What is the maximum possible speed-up? (2)

Page 12: Question Bank CS1601

65 B 4 5 3
i. Represent the following cache system specification in a diagram: (5)
Word size: 4 bytes
Number of words per memory block: 8
Number of blocks: 512
Size of a line: 8 words
Number of lines: 16
Number of sets: 4
ii. How many bits are required to address a word in memory?
iii. How many of these bits are used to address a block in memory?
iv. How many bits are required to address a line?
v. Draw a diagram that illustrates your address word (size & bit allocation).
vi. If you are using direct mapping, where in the cache is the last memory block mapped?
vii. Where in the cache is the first memory block mapped?
viii. Where in the cache is the 24th memory block mapped?
ix. How many bits are required to address the sets in cache?
x. How many comparisons are made in direct mapping?
xi. How many comparisons are made in fully associative mapping?
xii. How many comparisons are made in set-associative mapping?

66 B 4 5 4
i. Compare write-back and write-through algorithms in terms of their complexity and average cache access time. Would you recommend write-through for a single-processor system? Give reason(s) for your answer. (4)
ii. Describe Direct Mapping and Fully Associative Mapping using Set Associative mapping. (4)
iii. Compute the number of bits required for coding the tag, line, and word offset for direct mapping for the following hypothetical cache-memory system. (4)
The size of the main memory is 512 words
Size of each block is 4 words
Size of the cache is 4 lines
Each line size is 4 words
In set-associative mapping each set contains 2 blocks
There are 2 sets in cache
iv. Where are the memory blocks 0, 8, 23, and 52 mapped (in cache) in the above mapping? (4)
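Worked sketch for parts iii and iv: 512 words need a 9-bit address; 4 words per block gives a 2-bit word offset; 4 lines give a 2-bit line field; the tag is the remaining 9 - 2 - 2 = 5 bits. Under direct mapping, block b goes to line b mod 4, so blocks 0, 8, and 52 map to line 0 and block 23 maps to line 3.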

Page 13: Question Bank CS1601

67 B 5 5 5
i. What is the Directory-based coherence protocol? (8)
ii. What constraints restrict coherence protocols in an interconnect-based system? (8)

68 B 2 & 3 6 1
i. Describe software pipelining. (4)
ii. Suppose we have a VLIW that could issue two memory references, two FP operations, and one integer operation or branch in every clock cycle. Show an unrolled version of the loop x[i] = x[i] + s for such a processor. Unroll as many times as necessary to eliminate any stalls. Ignore the branch delay slot. (4)
iii. Describe software pipelining with an illustration. (4)
iv. Show a software-pipelined version of this loop, which increments all the elements of an array whose starting address is in R1 by the contents of F2: (4)
Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDUI R1, R1, #-8
      BNE    R1, R2, Loop
You can omit the start-up and clean-up code.

69 B 5 6 2
i. What is memory consistency? How is memory consistency guaranteed in a multiprocessor system? (8)
ii. What is multithreading? Describe two types of multithreading. (8)

70 B 5 6 3
i. What is the mechanism used in a multiprocessor system to implement atomic operations? (8)
ii. What are spin-locks? Give an example. (8)

71 B 4 6 4
i. Explain the program execution time in terms of Miss Rate, Miss Penalty, Memory Accesses Per Instruction, CPU Cycles Per Instruction, and Number of Instructions. (4)
ii. Compute the program execution time where the number of instructions is 12000, average CPU cycles per instruction is 2, average memory accesses per instruction is 1.5, miss rate is 2%, and miss penalty is 150 CC (CPU Cycles). Compare the execution time for a miss penalty of 100 CC. (4)
iii. Describe the Least Recently Used (LRU) and Least Frequently Used (LFU) cache replacement algorithms. (4)
iv. Describe MTBF, MTTR, MTTDL, and MTTDI. (4)

Page 14: Question Bank CS1601

72 B 5 6 5
i. What is multithreading? Describe the two types of multithreading. (4)
ii. What are spin-locks? Give an example. (4)
iii. Describe the use of LL and SC in implementing atomic operations with an example. How is SC implemented? Specifically, describe how SC decides to change the memory location only when the combination of LL and SC is atomic. (4)
iv. What mechanism is used in a single-processor system to implement atomic operations? Can we use this in a multiprocessor system? (4)
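For part ii, a minimal spin-lock sketch in C11 (the course material likely expects an LL/SC assembly sequence; here the standard atomic_flag test-and-set plays the role of the atomic read-modify-write that LL/SC would implement). Compile with -pthread.

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;
static long counter = 0;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
            ;                                   /* spin until the flag was clear */
        counter++;                              /* critical section              */
        atomic_flag_clear_explicit(&lock, memory_order_release);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("%ld\n", counter);                   /* prints 200000 */
    return 0;
}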

73 B 5 7 1
i. Describe Write Invalidate with an example. (4)
ii. Describe Write Distribute with an example. (4)
iii. Compare the above two snooping implementations. (4)
iv. Describe how the valid, shared, & dirty bits are used to improve the performance of a cache coherence system. (4)

74 B 5 7 2
i. How is an interrupt used to create software synchronization primitives? Why can't this be used in a multiprocessor environment? (8)
ii. What mechanism is used in a single-processor system to implement atomic operations? Can we use this in a multiprocessor system? (8)

75 B 4 7 3
i. Describe the two empirical rules that are followed in caching. (4)
ii. Describe Direct Mapping and Fully Associative Mapping using Set Associative mapping. (4)
iii. Compute the number of bits required for coding the tag, line, and word offset for direct mapping for the following hypothetical cache-memory system. (4)
The size of the main memory is 512 words
Size of each block is 4 words
Size of the cache is 4 lines
Each line size is 4 words
In set-associative mapping each set contains 2 blocks
There are 2 sets in cache
iv. Where are the memory blocks 0, 8, 23, and 52 mapped (in cache) in the above mapping? (4)

Page 15: Question Bank CS1601

76 B 4 7 4
i. Describe and compare unified and split cache schemes. Which of these cache types is used for Level 1 and Level 2 caches? Justify your choice. (4)
ii. Assume we have a computer where Clock cycles Per Instruction (CPI) is 1.0 when all memory accesses are cache hits. The only data accesses are loads and stores, and these total 50% of the instructions. If the miss penalty is 25 clock cycles and the miss rate is 2%, how much faster would the computer be if all instructions were cache hits? (4)
iii. What is the processor-memory gap? Explain Gordon Moore's law in this context. (4)
iv. Calculate the average cycles per instruction for the following two scenarios of memory systems:
                                  Scenario 1    Scenario 2
Block size                        1 word        4 words
Memory bus size                   1 word        2 words
Miss rate                         3%            2.5%
Memory accesses per instruction   1.2           1.2
Cache miss penalty                64 CC         128 CC
Avg cycles per instruction        2             2
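Worked sketch for parts ii and iv (assuming the quoted miss penalties already reflect the block and bus sizes): ii. CPI with misses = 1 + 1.5 x 0.02 x 25 = 1.75, so the all-hit machine is about 1.75 times faster; iv. effective CPI = 2 + 1.2 x 0.03 x 64, about 4.30, for Scenario 1 and 2 + 1.2 x 0.025 x 128 = 5.84 for Scenario 2.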

77 B 2 & 3 7 5
i. Why can MIPS not be used to compare two computers? (2)
ii. Discuss the role of compilers in achieving instruction level parallelism. (4)
iii. Describe branch prediction using branch correlation. Use an appropriate example. (4)
iv. Choose one of the following computers (C1, C2, and C3) for a set of applications with the following profile; justify your answer.

78 B 2 & 3 8 1
i. Unroll and schedule the following piece of code. This code adds a constant value to each element of a floating-point vector. The required latency information is listed in the following table. Assume there are 4 FP adders in your CPU. (10)
ii. Calculate the average number of instructions to process one addition in your code. (2)
iii. Why does dynamic scheduling not make sense with a single functional unit? (2)
iv. There is a simple pipeline with n stages. Assume all stages occupy an equal number of clock cycles and there is only one ALU. What is the maximum possible speed-up? (2)

Page 16: Question Bank CS1601

79 B 4 8 2
i. What are the essential memory requirements? Compare the memory requirements of Desktop, Server, and Embedded Systems. (8)
ii. What is the processor-memory gap? Explain Gordon Moore's law in this context. (8)

80 B 4 8 3
i. What is a memory hierarchy and why do we need a memory hierarchy? (8)
ii. What is cache memory, and how does cache memory work? (8)

81 B 4 8 4
i. In general, how does cache memory operate? Explain the terms cache hit, cache miss, spatial locality, and temporal locality. (8)
ii. Explain the program execution time in terms of Miss Rate, Miss Penalty, Memory Accesses Per Instruction, CPU Cycles Per Instruction, and Number of Instructions. (8)

82 B 4 8 5
i. Compute the program execution time where the number of instructions is 12000, average CPU cycles per instruction is 2, average memory accesses per instruction is 1.5, miss rate is 2%, and miss penalty is 150 CC (CPU Cycles). Compare the execution time for a miss penalty of 100 CC. (8)
ii. Why do we need cache-mapping functions? Describe the working of the direct-mapping function. (8)

83 B 4 9 1
i. Compare write-back and write-through algorithms in terms of their complexity and average cache access time. Would you recommend write-through for a single-processor system? Give reason(s) for your answer. (8)
ii. Describe Direct Mapping and Fully Associative Mapping using Set Associative mapping. (8)

84 B 4 9 2
i. Compare the mapping functions in terms of their hit ratio and search speed. (8)
ii. Describe loop-merging arrays with appropriate examples. (8)

85 B 4 9 3
i. Compute tag values, line values, and word offsets for direct mapping for the following hypothetical cache-memory system. (8)
ii. Why do we need cache replacement algorithms? (8)

86 B 4 9 4
i. Describe the Least Recently Used (LRU) cache replacement algorithm. (8)
ii. Describe the Least Frequently Used (LFU) cache replacement algorithm.
iii. Describe MTBF, MTTR, MTTDL, and MTTDI. (8)

Page 17: Question Bank CS1601

87 B 4 9 5
i. Describe and compare unified and split cache schemes. (8)
ii. Describe the differences between SRAM and DRAM in the context of main memory and cache memory. (8)

88 B 5 10 1
i. Why do we need multiprocessor architectures? Describe Flynn's classification of multiprocessor architectures. (8)
ii. What are the two classes of MIMD architectures? Describe them briefly. (8)

89 B 5 10 2
i. Explain the terms Polling, Interrupt, Synchronous, and Asynchronous in the context of message passing. (8)
ii. Describe the 3 important metrics of Communication Mechanisms. (8)

90 B 5 10 3
i. Describe the advantages of the different communication mechanisms: Shared Memory and Message Passing. (8)
ii. To achieve a speedup of 80 with 100 processors, what should be the fraction of the program that can be executed in parallel or enhanced mode? (8)
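Worked sketch for part ii (Amdahl's law): 80 = 1 / ((1 - F) + F/100) gives (1 - F) + F/100 = 0.0125, so F = 0.9875 / 0.99, about 0.9975; roughly 99.75% of the program must run in the enhanced (parallel) mode.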

91 B 5 10 4
i. Suppose we have an application running on a 32-processor multiprocessor, which takes 400 ns to handle a reference to a remote memory. For this application, assume that all the references except those involving communication hit in the local memory hierarchy, which is slightly optimistic. Processors are stalled on a remote request, and the processor clock rate is 1 GHz. If the base Instructions Per Cycle (IPC), assuming that all references hit in the cache, is 2, how much faster is the multiprocessor if there is no communication versus if 0.2% of the instructions involve a remote communication reference? (8)
ii. Explain the cache coherence problem using the following diagram, where there are 2 CPUs. What kind of cache-write policy is used? (8)
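Worked sketch for part i: the base CPI is 1 / IPC = 0.5; a remote reference costs 400 ns x 1 GHz = 400 cycles, so the effective CPI with communication is 0.5 + 0.002 x 400 = 1.3; the machine with no communication is therefore 1.3 / 0.5 = 2.6 times faster.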

92 B 5 10 5
i. What are the different properties or conditions or requirements that cache coherency must satisfy? Compare coherency and consistency. (8)
ii. What are the two features offered by cache memory in SMP systems? Why do we need these features? (8)

93 B 5 11 1
i. What are the two different ways snooping protocols maintain the cache coherence properties? (8)
ii. Compare Write Invalidate and Write Broadcast protocols. (8)

94 B 5 11 2
i. Describe how the valid, shared, & dirty bits are used in cache coherence. (8)
ii. What is a coherence miss? What are the two types of coherence misses? What is the effect of block size on false sharing misses? (8)

