CS252 Graduate Computer Architecture Lecture 16 Cache Optimizations (Con’t) Memory Technology
CS252 Graduate Computer Architecture Lecture 12 Vector Processing (Con’t) Branch Prediction
description
Transcript of CS252 Graduate Computer Architecture Lecture 12 Vector Processing (Con’t) Branch Prediction
CS252Graduate Computer Architecture
Lecture 12
Vector Processing (Con’t)Branch Prediction
John KubiatowiczElectrical Engineering and Computer Sciences
University of California, Berkeley
http://www.eecs.berkeley.edu/~kubitron/cs252http://www-inst.eecs.berkeley.edu/~cs252
3/5/2007 cs252-S07, Lecture 12 2
+ + + + + +
[0] [1] [VLR-1]
Vector Arithmetic Instructions
ADDV v3, v1, v2 v3
v2v1
Scalar Registers
r0
r15Vector Registers
v0
v15
[0] [1] [2] [VLRMAX-1]
VLRVector Length Register
v1Vector Load and
Store InstructionsLV v1, r1, r2
Base, r1 Stride, r2 Memory
Vector Register
Review: Vector Programming Model
3/5/2007 cs252-S07, Lecture 12 3
Review: Vector Unit Structure
Lane
Functional Unit
VectorRegisters
Memory Subsystem
Elements 0, 4, 8, …
Elements 1, 5, 9, …
Elements 2, 6, 10, …
Elements 3, 7, 11, …
3/5/2007 cs252-S07, Lecture 12 4
Review: Vector StripminingProblem: Vector registers have finite lengthSolution: Break loops into pieces that fit into vector
registers, “Stripmining” ANDI R1, N, 63 # N mod 64 MTC1 VLR, R1 # Do remainderloop: LV V1, RA DSLL R2, R1, 3 # Multiply by 8 DADDU RA, RA, R2 # Bump pointer LV V2, RB DADDU RB, RB, R2 ADDV.D V3, V1, V2 SV V3, RC DADDU RC, RC, R2 DSUBU N, N, R1 # Subtract elements LI R1, 64 MTC1 VLR, R1 # Reset full length BGTZ N, loop # Any more to do?
for (i=0; i<N; i++) C[i] = A[i]+B[i];
+
+
+
A B C
64 elements
Remainder
3/5/2007 cs252-S07, Lecture 12 5
Vector ReductionsProblem: Loop-carried dependence on reduction variables
sum = 0;for (i=0; i<N; i++) sum += A[i]; # Loop-carried dependence on sum
Solution: Re-associate operations if possible, use binary tree to perform reduction# Rearrange as:sum[0:VL-1] = 0 # Vector of VL partial sumsfor(i=0; i<N; i+=VL) # Stripmine VL-sized chunks sum[0:VL-1] += A[i:i+VL-1]; # Vector sum# Now have VL partial sums in one vector registerdo { VL = VL/2; # Halve vector length sum[0:VL-1] += sum[VL:2*VL-1] # Halve no. of partials} while (VL>1)
3/5/2007 cs252-S07, Lecture 12 6
Novel Matrix Multiply Solution• Consider the following:
/* Multiply a[m][k] * b[k][n] to get c[m][n] */for (i=1; i<m; i++) { for (j=1; j<n; j++) {
sum = 0; for (t=1; t<k; t++)
sum += a[i][t] * b[t][j]; c[i][j] = sum; }}
• Do you need to do a bunch of reductions? NO!– Calculate multiple independent sums within one vector register– You can vectorize the j loop to perform 32 dot-products at the same
time (Assume Max Vector Length is 32)• Show it in C source code, but can imagine the
assembly vector instructions from it
3/5/2007 cs252-S07, Lecture 12 7
Optimized Vector Example/* Multiply a[m][k] * b[k][n] to get c[m][n] */for (i=1; i<m; i++) { for (j=1; j<n; j+=32) {/* Step j 32 at a time. */
sum[0:31] = 0; /* Init vector reg to zeros. */ for (t=1; t<k; t++) { a_scalar = a[i][t]; /* Get scalar */ b_vector[0:31] = b[t][j:j+31]; /* Get vector */
/* Do a vector-scalar multiply. */prod[0:31] = b_vector[0:31]*a_scalar;
/* Vector-vector add into results. */ sum[0:31] += prod[0:31];
}/* Unit-stride store of vector of results. */
c[i][j:j+31] = sum[0:31];}
}
3/5/2007 cs252-S07, Lecture 12 8
How Pick Vector Length?
• Longer good because:1) Hide vector startup2) lower instruction bandwidth3) tiled access to memory reduce scalar processor memory
bandwidth needs4) if know max length of app. is < max vector length, no strip
mining overhead5) Better spatial locality for memory access
• Longer not much help because:1) diminishing returns on overhead savings as keep doubling
number of element2) need natural app. vector length to match physical register
length, or no help (lots of short vectors in modern codes!)
3/5/2007 cs252-S07, Lecture 12 9
How Pick Number of Vector Registers?
• More Vector Registers:1) Reduces vector register “spills” (save/restore)
» 20% reduction to 16 registers for su2cor and tomcatv» 40% reduction to 32 registers for tomcatv» others 10%-15%
2) Aggressive scheduling of vector instructinons: better compiling to take advantage of ILP
• Fewer:1) Fewer bits in instruction format (usually 3 fields)2) Easier implementation
3/5/2007 cs252-S07, Lecture 12 10
Context switch overhead:Huge amounts of state!• Extra dirty bit per processor
– If vector registers not written, don’t need to save on context switch
• Extra valid bit per vector register, cleared on process start
– Don’t need to restore on context switch until needed
3/5/2007 cs252-S07, Lecture 12 11
Exception handling: External Interrupts?• If external exception, can just put pseudo-op into
pipeline and wait for all vector ops to complete– Alternatively, can wait for scalar unit to complete and begin
working on exception code assuming that vector unit will not cause exception and interrupt code does not use vector unit
3/5/2007 cs252-S07, Lecture 12 12
Exception handling: Arithmetic Exceptions
• Arithmetic traps harder• Precise interrupts => large performance loss!• Alternative model: arithmetic exceptions set
vector flag registers, 1 flag bit per element• Software inserts trap barrier instructions from
SW to check the flag bits as needed• IEEE Floating Point requires 5 flag bits
3/5/2007 cs252-S07, Lecture 12 13
Exception handling: Page Faults• Page Faults must be precise• Instruction Page Faults not a problem
– Could just wait for active instructions to drain– Also, scalar core runs page-fault code anyway
• Data Page Faults harder• Option 1: Save/restore internal vector unit state
– Freeze pipeline, dump vector state– perform needed ops– Restore state and continue vector pipeline
• Option 2: expand memory pipeline to check addresses before send to memory + memory buffer between address check and registers
– multiple queues to transfer from memory buffer to registers; check last address in queues before load 1st element from buffer.
– Per Address Instruction Queue (PAIQ) which sends to TLB and memory while in parallel go to Address Check Instruction Queue (ACIQ)
– When passes checks, instruction goes to Committed Instruction Queue (CIQ) to be there when data returns.
– On page fault, only save intructions in PAIQ and ACIQ
3/5/2007 cs252-S07, Lecture 12 14
Multimedia Extensions• Very short vectors added to existing ISAs for micros• Usually 64-bit registers split into 2x32b or 4x16b or
8x8b• Newer designs have 128-bit registers (Altivec, SSE2)• Limited instruction set:
– no vector length control– no strided load/store or scatter/gather– unit-stride loads must be aligned to 64/128-bit boundary
• Limited vector register length:– requires superscalar dispatch to keep multiply/add/load units busy– loop unrolling to hide latencies increases register pressure
• Trend towards fuller vector support in microprocessors
3/5/2007 cs252-S07, Lecture 12 15
“Vector” for Multimedia?• Intel MMX: 57 additional 80x86 instructions (1st since
386)– similar to Intel 860, Mot. 88110, HP PA-71000LC, UltraSPARC
• 3 data types: 8 8-bit, 4 16-bit, 2 32-bit in 64bits– reuse 8 FP registers (FP and MMX cannot mix)
• short vector: load, add, store 8 8-bit operands
• Claim: overall speedup 1.5 to 2X for 2D/3D graphics, audio, video, speech, comm., ...
– use in drivers or added to library routines; no compiler
+
3/5/2007 cs252-S07, Lecture 12 16
MMX Instructions• Move 32b, 64b• Add, Subtract in parallel: 8 8b, 4 16b, 2 32b
– opt. signed/unsigned saturate (set to max) if overflow
• Shifts (sll,srl, sra), And, And Not, Or, Xor in parallel: 8 8b, 4 16b, 2 32b
• Multiply, Multiply-Add in parallel: 4 16b• Compare = , > in parallel: 8 8b, 4 16b, 2 32b
– sets field to 0s (false) or 1s (true); removes branches
• Pack/Unpack– Convert 32b<–> 16b, 16b <–> 8b– Pack saturates (set to max) if number is too large
3/5/2007 cs252-S07, Lecture 12 17
Administrivia• Exam: Wednesday 3/14
Location: TBATIME: 5:30 - 8:30
• This info is on the Lecture page (has been)• Meet at LaVal’s afterwards for Pizza and
Beverages • CS252 Project proposal due by Monday 3/5
– Need two people/project (although can justify three for right project)
– Complete Research project in 8 weeks» Typically investigate hypothesis by building an artifact and
measuring it against a “base case”» Generate conference-length paper/give oral presentation» Often, can lead to an actual publication.
3/5/2007 cs252-S07, Lecture 12 18
Spec92fp Operations (Millions) Instructions (M)Program RISC Vector R / V RISC Vector R / Vswim256 11595 1.1x115 0.8142xhydro2d 5840 1.4x 58 0.8 71xnasa7 6941 1.7x 69 2.2 31xsu2cor 5135 1.4x 51 1.8 29xtomcatv 1510 1.4x 15 1.3 11xwave5 2725 1.1x 27 7.2 4xmdljdp2 3252 0.6x 32 15.8 2x
Operation & Instruction Count: RISC v. Vector Processor(from F. Quintana, U. Barcelona.)
Vector reduces ops by 1.2X, instructions by 20X
3/5/2007 cs252-S07, Lecture 12 19
Common Vector Metrics
• R: MFLOPS rate on an infinite-length vector– vector “speed of light”– Real problems do not have unlimited vector lengths, and the start-up penalties
encountered in real problems will be larger – (Rn is the MFLOPS rate for a vector of length n)
• N1/2: The vector length needed to reach one-half of R – a good measure of the impact of start-up
• NV: The vector length needed to make vector mode faster than scalar mode – measures both start-up and speed of scalars relative to vectors, quality
of connection of scalar unit to vector unit
3/5/2007 cs252-S07, Lecture 12 20
Vector Execution Time• Time = f(vector length, data dependicies, struct. hazards) • Initiation rate: rate that FU consumes vector elements
(= number of lanes; usually 1 or 2 on Cray T-90)• Convoy: set of vector instructions that can begin
execution in same clock (no struct. or data hazards)• Chime: approx. time for a vector operation• m convoys take m chimes; if each vector length is n,
then they take approx. m x n clock cycles (ignores overhead; good approximization for long vectors)
4 convoys, 1 lane, VL=64=> 4 x 64 = 256 clocks(or 4 clocks per result)
1: LV V1,Rx ;load vector X2: MULV V2,F0,V1 ;vector-scalar
mult.LV V3,Ry ;load vector Y
3: ADDV V4,V2,V3 ;add4: SV Ry,V4 ;store the result
3/5/2007 cs252-S07, Lecture 12 2132
Memory operations• Load/store operations move groups of data between registers
and memory• Three types of addressing
– Unit stride» Contiguous block of information in memory» Fastest: always possible to optimize this
– Non-unit (constant) stride» Harder to optimize memory system for all possible strides» Prime number of data banks makes it easier to support different strides at full
bandwidth– Indexed (gather-scatter)
» Vector equivalent of register indirect» Good for sparse arrays of data» Increases number of programs that vectorize
3/5/2007 cs252-S07, Lecture 12 22
Interleaved Memory Layout
• Great for unit stride: – Contiguous elements in different DRAMs– Startup time for vector operation is latency of single read
• What about non-unit stride?– Above good for strides that are relatively prime to 8– Bad for: 2, 4
Vector Processor
Unpipelined
DRAM
Unpipelined
DRAM
Unpipelined
DRAM
Unpipelined
DRAM
Unpipelined
DRAM
Unpipelined
DRAM
Unpipelined
DRAM
Unpipelined
DRAM
AddrMod 8
= 0
AddrMod 8
= 1
AddrMod 8
= 2
AddrMod 8
= 4
AddrMod 8
= 5
AddrMod 8
= 3
AddrMod 8
= 6
AddrMod 8
= 7
3/5/2007 cs252-S07, Lecture 12 23
How to get full bandwidth for Unit Stride?• Memory system must sustain (# lanes x word) /clock• No. memory banks > memory latency to avoid stalls
– m banks m words per memory lantecy l clocks– if m < l, then gap in memory pipeline:clock: 0 …l l+1 l+2… l+m- 1 l+m…2 lword: -- …0 1 2… m-1 --…m– may have 1024 banks in SRAM
• If desired throughput greater than one word per cycle– Either more banks (start multiple requests simultaneously)– Or wider DRAMS. Only good for unit stride or large data types
• More banks/weird numbers of banks good to support more strides at full bandwidth
– can read paper on how to do prime number of banks efficiently
3/5/2007 cs252-S07, Lecture 12 24
Avoiding Bank Conflicts• Lots of banksint x[256][512];for (j = 0; j < 512; j = j+1)for (i = 0; i < 256; i = i+1)x[i][j] = 2 * x[i][j];
• Even with 128 banks, since 512 is multiple of 128, conflict on word accesses
• SW: loop interchange or declaring array not power of 2 (“array padding”)• HW: Prime number of banks
– bank number = address mod number of banks– address within bank = address / number of words in bank– modulo & divide per memory access with prime no. banks?– address within bank = address mod number words in bank– bank number? easy if 2N words per bank
3/5/2007 cs252-S07, Lecture 12 25
• Chinese Remainder TheoremAs long as two sets of integers ai and bi follow these rules
and that ai and aj are co-prime if i j, then the integer x has only one solution (unambiguous mapping):
– bank number = b0, number of banks = a0 (= 3 in example)– address within bank = b1, number of words in bank = a1
(= 8 in example)– N word address 0 to N-1, prime no. banks, words power of 2
bi xmodai,0bi ai, 0x a0a1a2
Fast Bank Number
Seq. Interleaved Modulo InterleavedBank Number: 0 1 2 0 1 2
Address within Bank: 0 0 1 2 0 16 8
1 3 4 5 9 1 172 6 7 8 18 10 23 9 10 11 3 19 114 12 13 14 12 4 205 15 16 17 21 13 56 18 19 20 6 22 147 21 22 23 15 7 23
3/5/2007 cs252-S07, Lecture 12 26
Vectors Are Inexpensive
Scalar• N ops per cycle
2) circuitry• HP PA-8000
• 4-way issue• reorder buffer:
850K transistors• incl. 6,720 5-bit register
number comparators
Vector• N ops per cycle
2) circuitry• T0 vector micro
• 24 ops per cycle• 730K transistors total
• only 23 5-bit register number comparators
• No floating point
3/5/2007 cs252-S07, Lecture 12 27
Vectors Lower PowerVector
• One inst fetch, decode, dispatch per vector
• Structured register accesses
• Smaller code for high performance, less power in instruction cache misses
• Bypass cache
• One TLB lookup pergroup of loads or stores
• Move only necessary dataacross chip boundary
Single-issue Scalar• One instruction fetch, decode,
dispatch per operation• Arbitrary register accesses,
adds area and power• Loop unrolling and software
pipelining for high performance increases instruction cache footprint
• All data passes through cache; waste power if no temporal locality
• One TLB lookup per load or store
• Off-chip access in whole cache lines
3/5/2007 cs252-S07, Lecture 12 28
Superscalar Energy Efficiency Even Worse
Vector• Control logic grows
linearly with issue width• Vector unit switches
off when not in use
• Vector instructions expose parallelism without speculation
• Software control ofspeculation when desired:
– Whether to use vector mask or compress/expand for conditionals
Superscalar• Control logic grows
quadratically with issue width
• Control logic consumes energy regardless of available parallelism
• Speculation to increase visible parallelism wastes energy
3/5/2007 cs252-S07, Lecture 12 29
Vector Applications
Limited to scientific computing?• Multimedia Processing (compress., graphics, audio synth, image proc.)
• Standard benchmark kernels (Matrix Multiply, FFT, Convolution, Sort)• Lossy Compression (JPEG, MPEG video and audio)• Lossless Compression (Zero removal, RLE, Differencing, LZW)• Cryptography (RSA, DES/IDEA, SHA/MD5)• Speech and handwriting recognition• Operating systems/Networking (memcpy, memset, parity, checksum)• Databases (hash/join, data mining, image/video serving)• Language run-time support (stdlib, garbage collection)• even SPECint95
3/5/2007 cs252-S07, Lecture 12 30
Older Vector Machines
Machine Year Clock Regs Elements FUs LSUsCray 1 1976 80 MHz8 64 6 1Cray XMP 1983 120 MHz8 64 8 2 L, 1 SCray YMP 1988 166 MHz8 64 8 2 L, 1 SCray C-90 1991 240 MHz8 128 8 4Cray T-90 1996 455 MHz8 128 8 4Conv. C-1 1984 10 MHz8 128 4 1Conv. C-4 1994 133 MHz16 128 3 1Fuj. VP200 1982 133 MHz8-256 32-1024 3 2Fuj. VP300 1996 100 MHz8-256 32-1024 3 2NEC SX/2 1984 160 MHz8+8K 256+var 16 8NEC SX/3 1995 400 MHz8+8K 256+var 16 8
3/5/2007 cs252-S07, Lecture 12 31
Newer Vector Computers• Cray X1
– MIPS like ISA + Vector in CMOS
• NEC Earth Simulator– Fastest computer in world for 3 years; 40 TFLOPS– 640 CMOS vector nodes
3/5/2007 cs252-S07, Lecture 12 32
Key Architectural Features of X1
New vector instruction set architecture (ISA)– Much larger register set (32x64 vector, 64+64 scalar)
– 64- and 32-bit memory and IEEE arithmetic
– Based on 25 years of experience compiling with Cray1 ISA
Decoupled Execution– Scalar unit runs ahead of vector unit, doing addressing and control– Hardware dynamically unrolls loops, and issues multiple loops concurrently– Special sync operations keep pipeline full, even across barriers Allows the processor to perform well on short nested loops
Scalable, distributed shared memory (DSM) architecture– Memory hierarchy: caches, local memory, remote memory
– Low latency, load/store access to entire machine (tens of TBs)
– Processors support 1000’s of outstanding refs with flexible addressing– Very high bandwidth network– Coherence protocol, addressing and synchronization optimized for DM
3/5/2007 cs252-S07, Lecture 12 33
• Technology refresh of the X1 (0.13m)– ~50% faster processors
– Scalar performance enhancements
– Doubling processor density
– Modest increase in memory system bandwidth
– Same interconnect and I/O
• Machine upgradeable– Can replace Cray X1 nodes with X1E nodes
Cray X1E Mid-life Enhancement
3/5/2007 cs252-S07, Lecture 12 34
ESS – configuration of a general purpose supercomputer
1. Processor Nodes (PN) Total number of processor nodes is 640. Each processor node consists of eight vector processors of 8 GFLOPS and 16GB shared memories. Therefore, total numbers of processors is 5,120 and total peak performance and main memory of the system are 40 TFLOPS and 10 TB, respectively. Two nodes are installed into one cabinet, which size is 40”x56”x80”. 16 nodes are in a cluster. Power consumption per cabinet is approximately 20 KVA.
2) Interconnection Network (IN): Each node is coupled together with more than 83,000 copper cables via single-stage crossbar switches of 16GB/s x2 (Load + Store). The total length of the cables is approximately 1,800 miles.
3) Hard Disk. Raid disks are used for the system. The capacities are 450 TB for the systems operations and 250 TB for users.
4) Mass Storage system: 12 Automatic Cartridge Systems (STK PowderHorn9310); total storage capacity is approximately 1.6 PB.
From Horst D. Simon, NERSC/LBNL, May 15, 2002, “ESS Rapid Response Meeting”
3/5/2007 cs252-S07, Lecture 12 35
Earth Simulator
3/5/2007 cs252-S07, Lecture 12 36
Earth Simulator Building
3/5/2007 cs252-S07, Lecture 12 37
ESS – complete system installed 4/1/2002
3/5/2007 cs252-S07, Lecture 12 38
Vector Summary• Vector is alternative model for exploiting ILP• If code is vectorizable, then simpler hardware,
more energy efficient, and better real-time model than Out-of-order machines
• Design issues include number of lanes, number of functional units, number of vector registers, length of vector registers, exception handling, conditional operations
• Fundamental design issue is memory bandwidth– With virtual address translation and caching
• Will multimedia popularity revive vector architectures?