By Kunle Olukotun, BasemA. Nayfeh, Lance Hammond, Ken Wilson, and Kunyung Chang Presented by Dheeraj...

The Case for a Single-Chip Multiprocessor

By Kunle Olukotun, Basem A. Nayfeh, Lance Hammond, Ken Wilson, and Kunyung Chang

Presented by Dheeraj Kumar Kaveti

Trend: wide-instruction-issue superscalar processors

Limitation: wider issue requires more logic circuitry

Performance comparison: a 6-issue dynamically scheduled superscalar processor versus a 4 x 2-issue multiprocessor.

Introduction

Outline

The Limits of the Superscalar Approach

Floor plans for a six-issue superscalar microarchitecture and a 4 x 2-way superscalar multiprocessor

Comparison of results for both processors

Dynamic scheduling allows instructions to execute out of program order.

Hardware tracks register dependencies between instructions.

The three phases of a superscalar processor are fetch, issue, and execute.
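The fetch/issue/execute flow above can be illustrated with a toy out-of-order issue model. This is a minimal sketch, not the paper's simulator: it assumes perfect fetch, unlimited functional units, and register renaming (so only true read-after-write dependencies are tracked). The names `Instr`, `find_producers`, and `schedule` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    name: str
    dest: str
    srcs: tuple
    latency: int = 1  # cycles until the result can be used (must be >= 1)


def find_producers(program):
    """For each instruction, list the indices of the older instructions that
    produce its source registers. Renaming is assumed, so only true
    (read-after-write) dependencies are tracked."""
    last_writer = {}
    producers = []
    for i, ins in enumerate(program):
        producers.append([last_writer[r] for r in ins.srcs if r in last_writer])
        last_writer[ins.dest] = i
    return producers


def schedule(program, issue_width):
    """Return the issue cycle of each instruction. Each cycle, the oldest
    ready instructions issue, up to issue_width of them, so a younger
    independent instruction may issue ahead of a stalled older one."""
    producers = find_producers(program)
    issue_cycle = {}
    cycle = 0
    while len(issue_cycle) < len(program):
        issued = 0
        for i in range(len(program)):
            if issued == issue_width:
                break
            if i in issue_cycle:
                continue
            ready = all(
                p in issue_cycle and issue_cycle[p] + program[p].latency <= cycle
                for p in producers[i]
            )
            if ready:
                issue_cycle[i] = cycle
                issued += 1
        cycle += 1
    return [issue_cycle[i] for i in range(len(program))]


program = [
    Instr("load", "r1", ("r0",), latency=3),
    Instr("add", "r2", ("r1",)),  # waits for the load
    Instr("mul", "r3", ("r4",)),  # independent of the load
    Instr("sub", "r5", ("r3",)),  # depends only on mul
]
print(schedule(program, issue_width=2))  # [0, 3, 0, 1]
```

Here `sub` issues in cycle 1, two cycles before the older `add`, which is the out-of-program-order execution that dynamic scheduling permits.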

The Limits of the Superscalar Approach

Factors that constrain instruction fetch: mispredicted branches, instruction misalignment, and cache misses.

Even with good branch prediction and alignment, a significant cache miss rate will limit performance.

Fortunately, it is possible to hide some of the instruction cache miss latency.
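Instruction misalignment can be quantified with a simple model. This sketch assumes 4-byte instructions and a fetch unit that reads from a single aligned cache line per cycle, so fetch stops at the end of the line; it ignores branches and misses, and the function names are hypothetical.

```python
def instructions_delivered(start_pc, fetch_width, line_words):
    """Instructions delivered in one fetch cycle when fetch begins at start_pc.

    Assumed model: 4-byte instructions; the fetch unit reads from one aligned
    cache line per cycle and stops at the end of that line, even if fewer
    than fetch_width instructions have been fetched.
    """
    offset = (start_pc // 4) % line_words  # position of start_pc within its line
    return min(fetch_width, line_words - offset)


def average_delivered(fetch_width, line_words):
    """Average instructions delivered per cycle, assuming fetch targets are
    uniformly distributed over the positions within a cache line."""
    total = sum(instructions_delivered(4 * k, fetch_width, line_words)
                for k in range(line_words))
    return total / line_words


print(instructions_delivered(0x1004, fetch_width=8, line_words=8))  # 7
print(average_delivered(fetch_width=8, line_words=8))   # 4.5
print(average_delivered(fetch_width=8, line_words=16))  # 6.25
```

Under these assumptions an 8-wide fetch unit averages only 4.5 instructions per cycle with 8-instruction lines; longer lines help, but never recover the full width.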

The Limits of the Superscalar Approach in the Fetch Stage

There are two ways to implement register renaming:

1. Use an explicit table that maps architectural registers to physical registers.

2. Use a combined reorder buffer/instruction queue.

The advantage of the mapping table is that no comparisons are required for register renaming.

The disadvantage of the mapping table is the number of access ports it requires, which grows with both the number of operands per instruction and the issue width.
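The mapping-table approach can be sketched as follows. This is a simplified illustration under stated assumptions: instructions are renamed one at a time, physical registers are never freed (no commit or misprediction recovery), and the port model of one table access per operand per instruction renamed in a cycle is an assumption. `RenameMap` and `mapping_table_ports` are hypothetical names.

```python
class RenameMap:
    """Explicit architectural-to-physical register mapping table.

    Simplified sketch: instructions are renamed one at a time, and physical
    registers are never returned to the free list (no commit or recovery).
    """

    def __init__(self, num_arch, num_phys):
        self.table = list(range(num_arch))           # arch reg i -> phys reg i
        self.free = list(range(num_arch, num_phys))  # unallocated phys regs

    def rename(self, dest, srcs):
        # Sources are read by indexing the table directly; no comparisons.
        phys_srcs = [self.table[s] for s in srcs]
        if not self.free:
            raise RuntimeError("no free physical registers")
        new_phys = self.free.pop(0)
        self.table[dest] = new_phys  # later readers of dest see the new copy
        return new_phys, phys_srcs


def mapping_table_ports(operands_per_instr, issue_width):
    """Assumed port model: one table access per operand of every
    instruction renamed in the same cycle."""
    return operands_per_instr * issue_width


m = RenameMap(num_arch=4, num_phys=8)
print(m.rename(dest=1, srcs=(0, 2)))  # (4, [0, 2])
print(m.rename(dest=3, srcs=(1,)))    # (5, [4]): reads the new copy of r1
print(m.rename(dest=1, srcs=(1, 3)))  # (6, [4, 5])
print(mapping_table_ports(3, 6))      # 18
```

Each lookup is a plain index, which is why no comparators are needed; the cost moves into ports, and under this model a 6-issue machine with three-operand instructions would need 18 of them.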

The Limits of the Superscalar Approach in the Issue Stage

For example, a machine with 8-wide issue, 3-operand instructions, a 64-entry instruction queue, and 6-bit comparisons requires 9,216 1-bit comparators (8 x 3 x 64 x 6).

This many comparators takes a large area to implement.

The comparator logic also accounts for long delays.

So the instruction queue will limit performance.
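The comparator count is a product of four factors, which a one-line function makes explicit. This sketch assumes that each operand of each instruction handled per cycle is compared bit by bit against every queue entry; `queue_comparators` is a hypothetical name.

```python
def queue_comparators(issue_width, operands, queue_entries, tag_bits):
    """1-bit comparators for a scheme in which each operand of each instruction
    handled per cycle is compared, bit by bit, against every queue entry."""
    return issue_width * operands * queue_entries * tag_bits


print(queue_comparators(8, 3, 64, 6))  # 9216, the slide's example
print(queue_comparators(4, 3, 32, 6))  # 2304: half the width, half the queue
```

Because the queue is usually enlarged along with the issue width, halving both cuts the comparator count by a factor of four; the cost grows roughly quadratically with issue width.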

The Limits of the Superscalar Approach in the Issue Stage

Wider instruction issue requires more register renaming.

The number of ports required to sustain the full instruction issue bandwidth also grows with issue width.

A better way to add ports to the data cache is to build a banked cache.

However, the added banking logic increases the cache access time.
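A banked cache approximates multiple ports by letting accesses to different banks proceed in the same cycle. This minimal sketch assumes line-interleaved banks selected by low-order line-address bits, four banks, and 32-byte lines; these parameters and the function names are hypothetical. The bank-select and conflict logic shown here sits in the access path, which helps explain why banking adds access time.

```python
def bank_of(addr, num_banks, line_bytes):
    """Select a bank from the low-order bits of the line address
    (assumed line-interleaved mapping)."""
    return (addr // line_bytes) % num_banks


def service_one_cycle(addrs, num_banks, line_bytes):
    """Each bank services at most one access per cycle; later accesses that
    hit an already-busy bank are deferred to a following cycle."""
    busy = set()
    accepted, deferred = [], []
    for addr in addrs:
        bank = bank_of(addr, num_banks, line_bytes)
        if bank in busy:
            deferred.append(addr)  # bank conflict
        else:
            busy.add(bank)
            accepted.append(addr)
    return accepted, deferred


accepted, deferred = service_one_cycle([0x00, 0x20, 0x80, 0x44], num_banks=4, line_bytes=32)
print([hex(a) for a in accepted])  # ['0x0', '0x20', '0x44']
print([hex(a) for a in deferred])  # ['0x80']
```

Address 0x80 maps to the same bank as 0x00, so it must wait even though a true multiported cache could have served it immediately.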

The Limits of the Superscalar Approach in the Execute Stage

To increase throughput.

The increasingly widespread use of multimedia and visualization applications.

To execute in parallel the multiple threads that come from a single application.

To accelerate execution of sequential applications without manual intervention.


Two Microarchitectures

6-Way Superscalar Architecture

The number of ports in each instruction buffer increased by 50%, so the area of each buffer increased by 30-40%.

To support out-of-order issue, the instruction issue logic should occupy 30% of the die, but it is allotted only 18%.

The branch target buffer and call-return stack are also enlarged to 2,048 and 32 entries, respectively, which increases branch prediction accuracy.
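The two enlarged prediction structures can be sketched as follows, using the slide's sizes (2,048 BTB entries, a 32-entry call-return stack). The direct-mapped organization, the 4-byte instruction assumption, and the class names are illustrative assumptions, not the paper's exact design.

```python
from collections import deque


class BranchTargetBuffer:
    """Direct-mapped BTB: low PC bits select an entry, remaining bits form the tag."""

    def __init__(self, entries=2048):
        self.entries = entries
        self.slots = [None] * entries  # each slot holds (tag, target) or None

    def _index_and_tag(self, pc):
        word = pc >> 2  # assume 4-byte instructions
        return word % self.entries, word // self.entries

    def update(self, pc, target):
        index, tag = self._index_and_tag(pc)
        self.slots[index] = (tag, target)

    def predict(self, pc):
        index, tag = self._index_and_tag(pc)
        slot = self.slots[index]
        return slot[1] if slot is not None and slot[0] == tag else None


class ReturnAddressStack:
    """Fixed-depth call-return stack; on overflow the oldest entry is lost."""

    def __init__(self, depth=32):
        self.stack = deque(maxlen=depth)  # appending to a full deque drops the oldest

    def on_call(self, return_addr):
        self.stack.append(return_addr)

    def predict_return(self):
        return self.stack.pop() if self.stack else None


btb = BranchTargetBuffer()
btb.update(0x1000, 0x2000)
print(hex(btb.predict(0x1000)))  # 0x2000
print(btb.predict(0x3000))       # None: same entry as 0x1000, different tag
```

Larger tables reduce both kinds of loss shown here: aliasing between branches that share a BTB entry, and return addresses dropped when call nesting exceeds the stack depth.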

4 x 2-Way Superscalar Multiprocessor Architecture

It has four processors arranged in a grid.

Each processor is less than one-fourth the size of the 6-way superscalar processor.

Each processor has its own primary instruction and data caches; the L2 cache is shared by all four processors.

The L2 cache hit time is 5 cycles, versus 4 cycles for the 6-way superscalar processor.

Applications

Performance Comparison

IPC Breakdown

Performance of the 4 x 2-Issue Processor

Comparison of Both Processors

The superscalar architecture encounters long delays.

On applications with fine-grained thread-level parallelism, the multiprocessor can exploit this parallelism so that the superscalar microarchitecture is at most 10% better, even at the same clock rate.

On applications with large-grained thread-level parallelism and on multiprogramming workloads, the multiprocessor performs 50-100% better than the wide superscalar microarchitecture.

Conclusion

Questions

Thank you

