Lecture 11 Multithreaded Architectures Graduate Computer Architecture Fall 2005 Shih-Hao Hung Dept....

28
Lecture 11 Multithreaded Architectures Graduate Computer Architecture Fall 2005 Shih-Hao Hung Dept. of Computer Science and Information Engineering National Taiwan University

Transcript of Lecture 11 Multithreaded Architectures Graduate Computer Architecture Fall 2005 Shih-Hao Hung Dept....

Page 1: Lecture 11 Multithreaded Architectures Graduate Computer Architecture Fall 2005 Shih-Hao Hung Dept. of Computer Science and Information Engineering National.

Lecture 11

Multithreaded Architectures

Graduate Computer Architecture

Fall 2005

Shih-Hao Hung

Dept. of Computer Science and Information Engineering

National Taiwan University

Page 2: Lecture 11 Multithreaded Architectures Graduate Computer Architecture Fall 2005 Shih-Hao Hung Dept. of Computer Science and Information Engineering National.

Concept

• Data Access Latency– Cache misses (L1, L2)– Memory latency (remote, local)– Often unpredictable

• Multithreading (MT)– Tolerate or mask long and often unpredictable lat

ency operations by switching to another context, which is able to do useful work.

Page 3: Lecture 11 Multithreaded Architectures Graduate Computer Architecture Fall 2005 Shih-Hao Hung Dept. of Computer Science and Information Engineering National.

Why Multithreading Today?

• ILP is exhausted, TLP is in.• Large performance gap bet. MEM and

PROC.• Too many transistors on chip• More existing MT applications Today.• Multiprocessors on a single chip.• Long network latency, too.

Page 4: Lecture 11 Multithreaded Architectures Graduate Computer Architecture Fall 2005 Shih-Hao Hung Dept. of Computer Science and Information Engineering National.

Classical Problem, 60’ & 70’

• I/O latency prompted multitasking • IBM mainframes • Multitasking • I/O processors • Caches within disk controllers

Page 5: Lecture 11 Multithreaded Architectures Graduate Computer Architecture Fall 2005 Shih-Hao Hung Dept. of Computer Science and Information Engineering National.

Requirements of Multithreading

• Storage need to hold multiple context’s PC, registers, status word, etc.

• Coordination to match an event with a saved context

• A way to switch contexts • Long latency operations must use resources

not in use

Page 6: Lecture 11 Multithreaded Architectures Graduate Computer Architecture Fall 2005 Shih-Hao Hung Dept. of Computer Science and Information Engineering National.

Processor Utilization vs. Latency

R = the run length to a long latency event

L = the amount of latency

Page 7: Lecture 11 Multithreaded Architectures Graduate Computer Architecture Fall 2005 Shih-Hao Hung Dept. of Computer Science and Information Engineering National.

Problem of 80’

• Problem was revisited due to the advent of graphics workstations – Xerox Alto, TI Explorer – Concurrent processes are interleaved to allow for

the workstations to be more responsive. – These processes could drive or monitor display, i

nput, file system, network, user processing – Process switch was slow so the subsystems wer

e microprogrammed to support multiple contexts

Page 8: Lecture 11 Multithreaded Architectures Graduate Computer Architecture Fall 2005 Shih-Hao Hung Dept. of Computer Science and Information Engineering National.

Scalable Multiprocessor (90’)

• Dance hall – a shared interconnect with memory on one side and processors on the other.

• Or processors may have local memory

Page 9: Lecture 11 Multithreaded Architectures Graduate Computer Architecture Fall 2005 Shih-Hao Hung Dept. of Computer Science and Information Engineering National.

How do the processors communicate?

• Shared Memory • Potential long latency on every load

– Cache coherency becomes an issue – Examples include NYU’s Ultracomputer, IBM’s RP3, BBN’s Butter

fly, MIT’s Alewife, and later Stanford’s Dash. – Synchronization occurs through share variables, locks, flags, and

semaphores. • Message Passing

– Programmer deals with latency. This enables them to minimize the number of messages, while maximizing the size, and this scheme allows for delay minimization by sending a message so that it reaches the receiver at the time it expects it.

– Examples include Intel’s PSC and Paragon, Caltech’s Cosmic Cube, and Thinking Machines’ CM-5

– Synchronization occurs through send and receive

Page 10: Lecture 11 Multithreaded Architectures Graduate Computer Architecture Fall 2005 Shih-Hao Hung Dept. of Computer Science and Information Engineering National.

Cycle-by-Cycle Interleaved Multithreading

• Denelcor HEP1 (1982), HEP2

• Horizon, which was never built

• Tera, MTA

Page 11: Lecture 11 Multithreaded Architectures Graduate Computer Architecture Fall 2005 Shih-Hao Hung Dept. of Computer Science and Information Engineering National.

Cycle-by-Cycle Interleaved Multithreading

• Features– An instruction from a different context is launched at each

clock cycle – No interlocks or bypasses thanks to a non-blocking

pipeline

• Optimizations: – Leaving context state in proc (PC, register #, status) – Assigning tags to remote request and then matching it on

completion

Page 12: Lecture 11 Multithreaded Architectures Graduate Computer Architecture Fall 2005 Shih-Hao Hung Dept. of Computer Science and Information Engineering National.

Challenges with this approach

• I-Cache:– Instruction bandwidth – I-Cache misses: Since instructions are being grabbed from many different

contexts, instruction locality is degraded and the I-cache miss rate rises. • Register file access time:

– Register file access time increases due to the fact that the regfile had to significantly increase in size to accommodate many separate contexts.

– In fact, the HEP and Tera use SRAM to implement the regfile, which means longer access times.

• Single thread performance– Single thread performance significantly degraded since the context is forc

ed to switch to a new thread even if none are available. • Very high bandwidth network, which is fast and wide • Retries on load empty or store full

Page 13: Lecture 11 Multithreaded Architectures Graduate Computer Architecture Fall 2005 Shih-Hao Hung Dept. of Computer Science and Information Engineering National.

Improving Single Thread Performance

• Do more operations per instruction (VLIW) • Allow multiple instructions to issue into pipeline from

each context. – This could lead to pipeline hazards, so other safe instructio

ns could be interleaved into the execution. – For Horizon & Tera, the compiler detects such data depen

dencies and the hardware enforces it by switching to another context if detected.

• Switch on load • Switch on miss

– Switching on load or miss will increase the context switch time.

Page 14: Lecture 11 Multithreaded Architectures Graduate Computer Architecture Fall 2005 Shih-Hao Hung Dept. of Computer Science and Information Engineering National.

Simultaneous Multithreading (SMT)

• Tullsen, et. al. (U. of Washington), ISCA ‘95• A way to utilize pipeline with increased parall

elism from multiple threads.

Page 15: Lecture 11 Multithreaded Architectures Graduate Computer Architecture Fall 2005 Shih-Hao Hung Dept. of Computer Science and Information Engineering National.

Simultaneous Multithreading

Page 16: Lecture 11 Multithreaded Architectures Graduate Computer Architecture Fall 2005 Shih-Hao Hung Dept. of Computer Science and Information Engineering National.

SMT Architecture• Straightforward extension to conventional superscal

ar design.– multiple program counters and some mechanism by which

the fetch unit selects one each cycle,– a separate return stack for each thread for predicting subro

utine return destinations,– per-thread instruction retirement, instruction queue flush, a

nd trap mechanisms,– a thread id with each branch target buffer entry to avoid pr

edicting phantom branches, and– a larger register file, to support logical registers for all threa

ds plus additional registers for register renaming. • The size of the register file affects the pipeline and the

scheduling of load-dependent instructions.

Page 17: Lecture 11 Multithreaded Architectures Graduate Computer Architecture Fall 2005 Shih-Hao Hung Dept. of Computer Science and Information Engineering National.

SMT PerformanceTullsen ‘96

Page 18: Lecture 11 Multithreaded Architectures Graduate Computer Architecture Fall 2005 Shih-Hao Hung Dept. of Computer Science and Information Engineering National.

Commercial Machines w/ MT Support

• Intel Hyperthreding (HT)– Dual threads– Pentium 4, XEON

• Sun CoolThreads – UltraSPARC T1– 4-threads per core

• IBM– POWER5

Page 19: Lecture 11 Multithreaded Architectures Graduate Computer Architecture Fall 2005 Shih-Hao Hung Dept. of Computer Science and Information Engineering National.

IBM Power5http://www.research.ibm.com/journal/rd/494/mathis.pdf

Page 20: Lecture 11 Multithreaded Architectures Graduate Computer Architecture Fall 2005 Shih-Hao Hung Dept. of Computer Science and Information Engineering National.

IBM Power5http://www.research.ibm.com/journal/rd/494/mathis.pdf

Page 21: Lecture 11 Multithreaded Architectures Graduate Computer Architecture Fall 2005 Shih-Hao Hung Dept. of Computer Science and Information Engineering National.

SMT Summary

• Pros:– Increased throughput w/o adding much cost– Fast response for multitasking environment

• Cons:– Slower single processor performance

Page 22: Lecture 11 Multithreaded Architectures Graduate Computer Architecture Fall 2005 Shih-Hao Hung Dept. of Computer Science and Information Engineering National.

Multicore

• Multiple processor cores on a chip– Chip multiprocessor (CMP)– Sun’s Chip Multithreading (CMT)

• UltraSPARC T1 (Niagara)– Intel’s Pentium D– AMD dual-core Opteron

• Also a way to utilize TLP, but– 2 cores 2X costs– No good for single thread performacne

• Can be used together with SMT

Page 23: Lecture 11 Multithreaded Architectures Graduate Computer Architecture Fall 2005 Shih-Hao Hung Dept. of Computer Science and Information Engineering National.

Chip Multithreading (CMT)

Page 24: Lecture 11 Multithreaded Architectures Graduate Computer Architecture Fall 2005 Shih-Hao Hung Dept. of Computer Science and Information Engineering National.

Sun UltraSPARC T1 Processor

http://www.sun.com/servers/wp.jsp?tab=3&group=CoolThreads%20servers

Page 25: Lecture 11 Multithreaded Architectures Graduate Computer Architecture Fall 2005 Shih-Hao Hung Dept. of Computer Science and Information Engineering National.

8 Cores vs 2 Cores

• Is 8-cores too aggressive?– Good for server applications, given

• Lots of threads• Scalable operating environment• Large memory space (64bit)

– Good for power efficiency• Simple pipeline design for each core

– Good for availability– Not intended for PCs, gaming, etc

Page 26: Lecture 11 Multithreaded Architectures Graduate Computer Architecture Fall 2005 Shih-Hao Hung Dept. of Computer Science and Information Engineering National.

SPECWeb 2005

IBM X346: 3Ghz Xeon

T2000: 8 core 1.0GHz T1 Processor

Page 27: Lecture 11 Multithreaded Architectures Graduate Computer Architecture Fall 2005 Shih-Hao Hung Dept. of Computer Science and Information Engineering National.

Sun Fire T2000 Server

Page 28: Lecture 11 Multithreaded Architectures Graduate Computer Architecture Fall 2005 Shih-Hao Hung Dept. of Computer Science and Information Engineering National.

Server Pricing

• UltraSPARC– Sun Fire T1000 Server

• 6 core 1.0GHz T1 Processor

• 2GB memory, 1x 80GB disk

• List price: $5,745

– Sun Fire T2000 Server • 8 core 1.0GHz

T1 Processor• 8GB DDR2 memory,

2 X 73GB disk• List price: $13,395

• X86– Sun Fire X2100 Server

• Dual core AMD Opteron 175

• 2GB memory, 1x80GB disk

• List price: $2,295

– Sun Fire X4200 Server• 2x Dual core

AMD Opteron 275• 4GB memory,

2x 73GB disk• List price: $7,595