The Case for a Single-Chip Multiprocessor

By Kunle Olukotun, Basem A. Nayfeh, Lance Hammond, Ken Wilson, and Kunyung Chang

Presented by Dheeraj Kumar Kaveti

Page 1:

The Case for a Single-Chip Multiprocessor

By Kunle Olukotun, Basem A. Nayfeh, Lance Hammond, Ken Wilson, and Kunyung Chang

Presented by Dheeraj Kumar Kaveti

Page 2:

Introduction

Trend: wide-issue superscalar processors.

Limitation: wider issue requires more logic circuitry.

Performance comparison: a 6-issue dynamically scheduled superscalar processor versus a multiprocessor built from four 2-issue processors (4 x 2-issue).

Page 3:

Outline

The Limits of the Superscalar Approach

The Case for a Single-Chip Multiprocessor

Floor plans for a six-issue superscalar microarchitecture and a 4 x 2-way superscalar multiprocessor

Comparison of results for the two processors

Page 4:

The Limits of the Superscalar Approach

Out-of-program-order execution relies on dynamic scheduling.

Hardware is needed to track register dependencies between instructions.

The three phases of a superscalar processor are fetch, issue, and execute.
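The register dependence tracking mentioned on this slide can be illustrated in software. The sketch below is not the hardware described in the paper; the instruction format and the `raw_dependencies` helper are hypothetical names chosen for illustration. It finds the read-after-write dependences in a small instruction window, which is the information a dynamic scheduler needs before it can issue instructions out of program order.

```python
from typing import Dict, List, Tuple

# Hypothetical instruction format: (destination register, source registers).
Instr = Tuple[str, Tuple[str, ...]]

def raw_dependencies(window: List[Instr]) -> List[Tuple[int, int]]:
    """Return (producer, consumer) index pairs for read-after-write dependences."""
    last_writer: Dict[str, int] = {}  # register -> index of its most recent writer
    deps = []
    for i, (dest, srcs) in enumerate(window):
        for src in srcs:
            if src in last_writer:
                deps.append((last_writer[src], i))
        last_writer[dest] = i
    return deps

window = [
    ("r1", ("r2", "r3")),  # 0: r1 = r2 op r3
    ("r4", ("r1", "r5")),  # 1: needs the result of instruction 0
    ("r6", ("r7", "r8")),  # 2: independent, may execute ahead of 1
    ("r1", ("r4", "r6")),  # 3: needs the results of instructions 1 and 2
]
print(raw_dependencies(window))  # [(0, 1), (1, 3), (2, 3)]
```

Instruction 2 has no producer in the window, so a dynamically scheduled machine is free to execute it before instruction 1.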

Page 5:

The Limits of the Superscalar Approach in the Fetch Stage

Three factors constrain instruction fetch: mispredicted branches, instruction misalignment, and cache misses.

Even with good branch prediction and alignment, a significant cache miss rate will limit performance.

Fortunately, it is possible to hide some of the instruction cache miss latency.

Page 6:

The Limits of the Superscalar Approach in the Issue Stage

There are two ways to implement register renaming:

1. Use an explicit table that maps architectural registers to physical registers.

2. Use a combined reorder buffer/instruction queue.

The advantage of the mapping table is that no comparisons are required for register renaming.

The disadvantage of the mapping table is the number of access ports it requires.
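The mapping-table approach can be sketched as follows. This is a toy model under stated assumptions, not the design from the paper: the `RenameTable` class, its sizes, and the simple free list are hypothetical, and the sketch never returns physical registers to the free list.

```python
class RenameTable:
    """Toy mapping table from architectural to physical registers."""

    def __init__(self, num_arch: int, num_phys: int):
        # Assumption: architectural register i starts out mapped to physical register i.
        self.table = list(range(num_arch))
        self.free = list(range(num_arch, num_phys))

    def rename(self, dest: int, srcs: list) -> tuple:
        # Sources are renamed by direct table lookup, with no tag comparisons.
        phys_srcs = [self.table[s] for s in srcs]
        # The destination gets a fresh physical register from the free list.
        phys_dest = self.free.pop(0)
        self.table[dest] = phys_dest
        return phys_dest, phys_srcs

rt = RenameTable(num_arch=4, num_phys=8)
first = rt.rename(1, [2, 3])   # r1 = r2 op r3
second = rt.rename(1, [1, 2])  # r1 = r1 op r2, reads the renamed r1
print(first, second)           # (4, [2, 3]) (5, [4, 2])
```

Each renamed instruction performs one table read per source and one table write for its destination, so renaming several instructions in the same cycle multiplies the number of table ports needed.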

Page 7:

The Limits of the Superscalar Approach in the Issue Stage

For example, a machine with 8-wide issue, 3-operand instructions, a 64-entry instruction queue, and 6-bit comparisons requires 9,216 1-bit comparators.

These comparators take a large area to implement.

They also account for long delays.

As a result, the instruction queue limits performance.
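The 9,216 figure is consistent with multiplying the four quantities given on the slide. The short check below assumes that is how the number is derived; the `comparator_count` function name is hypothetical.

```python
def comparator_count(issue_width: int, operands: int,
                     queue_entries: int, tag_bits: int) -> int:
    """1-bit comparators, assuming the count is the product of the four factors."""
    return issue_width * operands * queue_entries * tag_bits

print(comparator_count(issue_width=8, operands=3, queue_entries=64, tag_bits=6))  # 9216
```

Under this assumption the count grows with the product of issue width and queue size, so widening issue while also deepening the queue raises the comparator count faster than either change alone.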

Page 8:

The Limits of the Superscalar Approach in the Execute Stage

Wider instruction issue requires more register renaming.

The number of ports required to satisfy the full instruction issue bandwidth also grows with issue width.

A better way to add ports to the data cache is to build a banked cache.

However, banking increases the access time of the cache.
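A banked cache can be pictured with a small model. This is only a sketch: the line size, the bank count, and the rule that each bank serves one access per cycle are assumptions made for illustration, not parameters taken from the paper.

```python
LINE_BYTES = 32  # assumed cache line size
NUM_BANKS = 4    # assumed number of banks

def bank_of(addr: int) -> int:
    """Select a bank from the low-order bits of the line address."""
    return (addr // LINE_BYTES) % NUM_BANKS

def accesses_served(addrs) -> int:
    """Accesses serviced in one cycle, assuming one access per bank per cycle."""
    return len({bank_of(a) for a in addrs})

print(accesses_served([0x00, 0x20, 0x40, 0x60]))  # 4: each address hits a different bank
print(accesses_served([0x00, 0x80, 0x100]))       # 1: all three map to bank 0
```

Banking behaves like extra ports only when simultaneous accesses fall in different banks; accesses that collide on one bank must be serialized.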

Page 9:

The Case for a Single-Chip Multiprocessor

To increase throughput.

To support the increasingly widespread use of multimedia and visualization.

To execute in parallel multiple threads that come from a single application.

To accelerate the execution of sequential applications without manual intervention.

Page 10:

Two microarchitectures

Page 11:

6-way superscalar architecture

The number of ports in each instruction buffer increases by 50%, so the area of each buffer increases by 30-40%.

To handle out-of-order execution, the instruction issue logic should occupy 30% of the die, but it has only 18%.

Also, the sizes of the branch target buffer and the call-return stack are increased to 2048 and 32 entries, respectively, which increases branch prediction accuracy.

Page 12:

4 x 2-way superscalar multiprocessor architecture

It has four processors arranged in a grid.

Each processor is less than one-fourth the size of the 6-way superscalar processor.

The I cache, D cache, and L2 cache are shared by the four processors.

The cache hit time is 5 cycles, compared with 4 cycles for the 6-way superscalar processor.

Page 13:

Applications

Page 14:

Performance comparison

Page 15:

IPC breakdown

Page 16:

Performance of the 4 x 2-issue processor

Page 17:

Comparison of both processors

Page 18:

Conclusion

High delays are encountered with the superscalar architecture.

On applications with fine-grained thread-level parallelism, the multiprocessor can exploit this parallelism so that the superscalar microarchitecture is at most 10% better, even at the same clock rate.

On applications with large-grained thread-level parallelism and on multiprogramming workloads, the multiprocessor performs 50-100% better than the wide superscalar microarchitecture.

Page 19:

Questions

Page 20:

Thank you
