Instruction Sets Instruction Sets November 21, 2015 INF5062: Programming asymmetric multi-core...
Instruction Sets
April 20, 2023
INF5062: Programming asymmetric multi-core processors
INF5062, Pål Halvorsen and Carsten Griwodz, University of Oslo
Instruction Sets
Why?
− Give insight into the designers’ minds
− Allow debugging
Classical programming
− Von Neumann machines
Instruction Sets
CISC
− Complex instructions
− Many cycles for each instruction
− Irregular number of cycles per instruction
− Two or more memory locations per instruction possible
− Implicit load and store operations
− Small code size
− Chip surface used for instruction storage
RISC
− Simple instructions
− Few cycles for each instruction
− Fixed or few different numbers of cycles per instruction
− No memory locations in computing instructions
− Explicit load and store operations
− Large code size
− Chip surface used for memory
Instruction Sets
SISD MISD
SIMD MIMD
Instruction Sets
Typical mechanisms for performance improvement
− Caching
− Data and instruction prefetching
− Pipelining
− Speculative execution
− Out-of-order execution
− Simultaneous multithreading (“hyperthreading”)
These mechanisms are generally not applied in heterogeneous multi-cores
IXP 2400
Microengine instruction set
IXP Microengine Assembler
SISD architecture
Assembler provides
− Detailed control of timing
− Detailed control of hardware context switching
− Understanding of actual instructions
Assembler requires
− Knowledge of supporting hardware
Need to know
− No stack
IXP: I/O and Context Swap Instructions
DRAM
− Read and Write
− RBUF and TBUF
CAP
− Enumerated CSR Addressing
− Calculated Addressing
− Reflect
HALT, CTX_ARB
HASH, MSF, PCI
SCRATCH
− Read and Write
− Atomic Operations
− Ring Operations
SRAM
− Read and Write
− Atomic Operations
− Control Status Register (CSR)
− Read Queue Descriptor
− Write Queue Descriptor
− Enqueue
− Dequeue
− Ring Operations
− Journal Operations
Properties of these instructions
− Can transfer bursts of data
− Operate asynchronously
− Can generate signals when finished
− Need two registers to specify addresses
− Where memory is aligned, address register bits aren’t shifted, but the least significant bits are ignored
− Can be combined with command tokens
− Support so-called indirect references: the state of a previous ALU command modifies the meaning
IXP: general instructions
ALU
ALU_SHF
ASR
BYTE_ALIGN_BE, _LE
CRC_BE, _LE
DBL_SHF
MUL_STEP
FFS
POP_COUNT
IMMED
IMMED_B0, _B1, _B2, _B3
IMMED_W0, _W1
LD_FIELD
LD_FIELD_W_CLR
LOAD_ADDR
LOCAL_CSR_RD, _WR
NOP
IXP: general instructions
Aligning long sequences of registers
LOCAL_CSR_WR[byte_index,2]
NOP
NOP
NOP
BYTE_ALIGN_BE[--,r0]
BYTE_ALIGN_BE[dest1,r1]
BYTE_ALIGN_BE[dest2,r2]
BYTE_ALIGN_BE[dest3,r3]
...
This can be aligned with asynchronous load and store ops
IXP: general instructions
General arithmetic operations on all kinds of registers
Status check
ALU[r1,--,B,r1]
    move r1 into r1, just updating the condition codes: N = (r1 < 0), Z = (r1 == 0), V = 0, C = 0 (negative, zero, overflow, carry)
Arithmetic operations
ALU[r3,r1,+,r2]
    r3 = r1 + r2; N = (r3 < 0), Z = (r3 == 0), V = (r1 + r2 ≥ 2^31), C = (r1 + r2 ≥ 2^32)
ALU[r5,r1,+,r2]
ALU[r6,r3,+carry,r4]
    r5 = r1 + r2; r6 = r3 + r4 + carry bit
ALU[r3,r1,+8,r2]
    r3 = r1 + (r2 & 0xff)
Boolean operations
ALU[r3,r1,AND~,r2]
    r3 = r1 AND NOT r2
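The `+` / `+carry` pair above is the usual way to build a 64-bit addition out of two 32-bit ALU operations. The following is a minimal Python sketch of that idea, not microengine code; the function names `alu_add` and `add64` are hypothetical, and the overflow flag is computed with the conventional two's-complement rule, which is an assumption rather than something the slide spells out.

```python
MASK32 = 0xFFFFFFFF

def alu_add(a, b, carry_in=0):
    """Add two 32-bit values; return (result, flags) with N, Z, V, C."""
    a &= MASK32
    b &= MASK32
    total = a + b + carry_in
    result = total & MASK32
    flags = {
        "N": result >> 31,                            # sign bit of the result
        "Z": int(result == 0),                        # result is zero
        # signed overflow: both operands share a sign that the result lacks
        "V": ((~(a ^ b) & (a ^ result)) >> 31) & 1,
        "C": total >> 32,                             # carry out of bit 31
    }
    return result, flags

def add64(lo1, hi1, lo2, hi2):
    """64-bit add from two 32-bit adds, mirroring the slide's register pairs."""
    lo, flags = alu_add(lo1, lo2)             # ALU[r5,r1,+,r2]
    hi, _ = alu_add(hi1, hi2, flags["C"])     # ALU[r6,r3,+carry,r4]
    return lo, hi
```

For instance, `add64(0xFFFFFFFF, 0, 1, 0)` carries out of the low word and yields a low word of 0 and a high word of 1.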
IXP: general instructions
Replacing bytes
LD_FIELD[r2,0100,r1,<<rot12] (without the load_cc token)
    move the content of r1 to temporary space t
    rotate t left by 12 bits
    set r2 = ( r2 & 0xff00ffff ) OR ( t & 0x00ff0000 )
    don’t change any status bits
Find first set bit
IMMED[r1,0xc00]
FFS[r2,r1]
    set r2 to the lowest index of a set bit in r1, here 10
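The byte-replacement and find-first-set steps can be checked with a small Python emulation. This is a sketch under assumptions: the helper names are hypothetical, the 4-digit byte mask is read with its leftmost digit as the most significant byte, and the `clear_others` switch models the `_W_CLR` variant as clearing every byte that is not loaded.

```python
MASK32 = 0xFFFFFFFF

def rotl32(x, n):
    """Rotate a 32-bit value left by n bits."""
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK32

def byte_mask(bits):
    """Expand a 4-character byte mask such as "0100" into a 32-bit mask."""
    mask = 0
    for ch in bits:
        mask = (mask << 8) | (0xFF if ch == "1" else 0x00)
    return mask

def ld_field(dest, bits, src, rot=0, clear_others=False):
    """Load the masked bytes of the rotated source into dest."""
    t = rotl32(src, rot)                      # temporary space t
    m = byte_mask(bits)
    keep = 0 if clear_others else (dest & ~m & MASK32)
    return keep | (t & m)

def ffs(x):
    """Index of the least significant set bit (x must be non-zero)."""
    return (x & -x).bit_length() - 1
```

With `dest = 0x11223344`, mask `"0100"`, source `0xABC` and a rotation of 12, only the second-highest byte is replaced, giving `0x11AB3344`; `ffs(0xc00)` returns 10 because bits 10 and 11 are set.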
IXP: branching instructions
BCC
BEQ
BNE
BMI
BPL
BCS
BHS
BCC
BLO
BVS
BVC
BGE
BLT
BGT
BLE
BR
BR_BCLR, BR_BSET
BR=BYTE, BR!=BYTE
BR=CTX, BR!=CTX
BR_INP_STATE, BR_!INP_STATE
BR_SIGNAL, BR_!SIGNAL
JUMP
RTN
IXP: branching instructions
Jump table:
IMMED[r1,2]
JUMP[r1,label0#], targets[label0#, label1#, label2#, label3#]
...
label0#:
BR[elsewhere0#]
label1#:
BR[elsewhere1#]
label2#:
BR[elsewhere2#]
label3#:
BR[elsewhere3#]
The example jumps to label2#
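The jump table works because every target holds exactly one instruction, so the register value acts as an index from the base label. A rough Python analogue follows, assuming that reading of `JUMP`; the function name `run_jump_table` is hypothetical and the list stands in for the four consecutive `BR` slots.

```python
def run_jump_table(r1):
    """Return the branch target reached for index r1."""
    # One BR instruction per slot, laid out consecutively from label0#.
    table = ["elsewhere0#", "elsewhere1#", "elsewhere2#", "elsewhere3#"]
    base = 0  # position of label0# within the table
    if not 0 <= r1 < len(table):
        raise ValueError("index outside the declared targets")
    return table[base + r1]
```

With `r1 = 2`, as in the slide, the slot at label2# is selected and control continues at `elsewhere2#`.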
IXP: branching instructions
Emulating a function call:
...
LOAD_ADDR[r1,return_label#]
BR[subroutine_label#]
return_label#:
...
subroutine_label#:
...
RTN[r1]
...
The example jumps to subroutine_label# and returns to return_label# (unless r1 gets overwritten)
IXP: branching instructions
Branching operations can be deferred to “do something else” before jumping
− Great for confusing the mortals
− Especially when combined with ctx_arb
− Saves registers: don’t need to save temp2 in the example
...
immed[temp1,20,<<0]
immed[temp3,10,<<0]
alu[temp2,temp1,-,temp3]
bgt[label#], defer[3]
alu[temp,temp,+,temp2]
alu[temp,--,B,temp2]
alu[--,temp,-,temp3]
alu[temp,--,B,temp1]
...
label#:
...
IXP: content addressable memory
CAM_CLEAR
CAM_LOOKUP
CAM_READ_TAG
CAM_READ_STATE
CAM_WRITE
CAM_WRITE_STATE
Operate on CAM memory
− 1 per microengine
− 16 longwords of 32 bits
− operated as an LRU cache of 32-bit write operations
− a lookup hit refreshes an access
− a 4-bit state can be associated with each CAM entry
  • written by CAM_WRITE_STATE
  • retrieved by CAM_LOOKUP or CAM_READ_STATE
The IXP1200 had CAM access to all SRAM memory!
− apparently it wasn’t worth it ...
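The LRU behaviour of the CAM can be modelled in a few lines of Python. This is only a behavioural sketch: the class name `Cam` and its method names are hypothetical, and the choice to return the least recently used entry on a miss is an assumption about how a caller would pick a slot to overwrite.

```python
class Cam:
    """Toy model of a 16-entry CAM with LRU ordering and 4-bit state."""
    ENTRIES = 16

    def __init__(self):
        self.clear()

    def clear(self):                          # models CAM_CLEAR
        self.tags = [None] * self.ENTRIES
        self.states = [0] * self.ENTRIES
        self.lru = list(range(self.ENTRIES))  # front = least recently used

    def _touch(self, entry):
        """Mark an entry as most recently used."""
        self.lru.remove(entry)
        self.lru.append(entry)

    def lookup(self, tag):                    # models CAM_LOOKUP
        tag &= 0xFFFFFFFF
        if tag in self.tags:
            entry = self.tags.index(tag)
            self._touch(entry)                # a hit refreshes the access
            return True, entry, self.states[entry]
        return False, self.lru[0], 0          # miss: suggest the LRU entry

    def write(self, entry, tag, state=0):     # models CAM_WRITE
        self.tags[entry] = tag & 0xFFFFFFFF
        self.states[entry] = state & 0xF
        self._touch(entry)

    def read_state(self, entry):              # models CAM_READ_STATE
        return self.states[entry]
```

A typical use is lookup, then on a miss write the new tag into the suggested entry; later lookups of the same tag hit and return the stored state.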
Cell Broadband Engine
SPU instruction set
CELL
CELL instruction set very similar to AltiVec
A set of SIMD instructions for floating point operations
Single precision is/was considered sufficient
CELL
The CBE is big-endian
The SPEs have 16-byte (128-bit) registers
[Figure: register bytes MSB, 1, 2, ..., 14, LSB; most significant bit at lower memory addresses, least significant bit at higher memory addresses]
Big-endian escapes the typical x86 *********:
char c[4]; int *e;
c[0]=0x01; c[1]=0x02; c[2]=0x03; c[3]=0x04;
e = (int*)c;
printf("0x%x\n",*e); --> 0x4030201 on little-endian x86
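The same four bytes can be reinterpreted under both byte orders without needing two machines. The short Python sketch below uses `int.from_bytes` for that; the helper name `as_int` is hypothetical.

```python
def as_int(raw, byteorder):
    """Interpret a byte string as an unsigned integer in the given order."""
    return int.from_bytes(raw, byteorder)

c = bytes([0x01, 0x02, 0x03, 0x04])
little = as_int(c, "little")   # how a little-endian x86 reads the array
big = as_int(c, "big")         # how a big-endian machine reads the array
```

Here `little` is `0x04030201` and `big` is `0x01020304`, so on a big-endian machine the integer reads in the same order as the array.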
CELL
The same 128-bit register can be viewed as
− 16 bytes: Byte 0 ... Byte 15
− 16 chars: Char 0 ... Char 15
− 8 half-words: Half-word 0 ... Half-word 7
− 4 words: Word 0 ... Word 3
− 2 doublewords: Doubleword 0, Doubleword 1
− 1 quadword: Quadword 0
CELL
“Preferred slots” of ABI types in registers
− Data type: vector double: double [0], double [1]
− Data type: vector signed int: int [0] ... int [3]
− Data type: vector signed short: short [0] ... short [7]
− Data type: double: double
− Data type: int: int
− Data type: short: short
− Data type: char: char
− qword
So what have they smoked?
CELL: groups of instructions
Op group: examples
Memory Load and Store
− lqa - load quadword, from absolute address
− stqx - store quadword, to address computed from sum of two registers
− chd - generate controls for halfword insertion using the shufb instruction
Constant Formation
− il - immediate word load
− fsmi - form select mask for bytes immediate
Integer and Logical Operations
− ah - add halfword
− cg - carry generate
− bgx - borrow generate
− mpyhha - multiply high high and add
− clz - count leading zeroes
− cntb - count ones in bytes
− fsm - form select mask for words
− avgb - average bytes
− sumb - sum bytes into halfwords
− and - and
− eqv - equivalent
− selb - select bits
− shufb - shuffle bytes
CELL: groups of instructions
lqa, stqx: straightforward
chd: see together with the shufb instruction
il rt,0x12
    int [0] = int [1] = int [2] = int [3] = 0x12
fsmi rt,0x9c1a
    0x9c1a : 1001110000011010
    Byte 0 = 0xff, Byte 1 = 0x00, Byte 2 = 0x00, Byte 3 = 0xff, Byte 4 = 0xff, ..., Byte 15 = 0x00
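Both constant-formation instructions are easy to emulate on lists. The Python sketch below models a register as a list of word or byte slots; the function names are hypothetical, and `il` is simplified by assuming a non-negative immediate, so sign extension is not modelled.

```python
def il(imm):
    """Replicate a word immediate into all four word slots."""
    return [imm & 0xFFFFFFFF] * 4

def fsmi(mask16):
    """Expand each of the 16 mask bits into a byte of all ones or all zeros.

    The most significant mask bit controls byte 0.
    """
    return [0xFF if (mask16 >> (15 - i)) & 1 else 0x00 for i in range(16)]
```

For the slide's value `0x9c1a`, the result starts `0xff, 0x00, 0x00, 0xff, 0xff` and ends with `0x00` in byte 15.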
CELL: groups of instructions
ah rt,ra,rb
    rt [i] = ra [i] + rb [i] for each halfword i = 0 ... 7
cg rt,ra,rb
    rt [i] = ( ra [i] + rb [i] ≥ 2^32 ) ? 0x01 : 0x00 for each word i = 0 ... 3
bgx rt,ra,rb
    rt [i] = ( ra [i] > rb [i] ) ? 0x00 : 0x01 for each word i = 0 ... 3
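The element-wise character of these instructions shows up clearly when a register is modelled as a list of slots. The following Python sketch is an emulation under assumptions: the function names are hypothetical, `ah` works on 8 halfword slots, and `cg` is modelled on 4 word slots with the carry taken out of a 32-bit sum.

```python
def ah(ra, rb):
    """Add halfword: 8 independent sums, each wrapping modulo 2**16."""
    return [(a + b) & 0xFFFF for a, b in zip(ra, rb)]

def cg(ra, rb):
    """Carry generate: 1 where the 32-bit sum of a word pair overflows."""
    return [((a & 0xFFFFFFFF) + (b & 0xFFFFFFFF)) >> 32 for a, b in zip(ra, rb)]
```

No slot influences its neighbour: a wrap-around in one halfword of `ah` is simply lost, which is why a separate carry-generating instruction exists for multi-word arithmetic.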
CELL: groups of instructions
mpyhha rt,ra,rb
    rt{0:3} = ra{0:1} * rb{0:1} + rt{0:3}
    rt{4:7} = ra{4:5} * rb{4:5} + rt{4:7}
    rt{8:11} = ra{8:9} * rb{8:9} + rt{8:11}
    rt{12:15} = ra{12:13} * rb{12:13} + rt{12:15}
cntb: for every byte separately
CELL: groups of instructions
fsm rt,ra
    one mask bit of ra per word of rt:
    rt{0:3} = ( bit == 0 ) ? 0x00000000 : 0xffffffff
    rt{4:7} = ( bit == 0 ) ? 0x00000000 : 0xffffffff
    rt{8:11} = ( bit == 0 ) ? 0x00000000 : 0xffffffff
    rt{12:15} = ( bit == 0 ) ? 0x00000000 : 0xffffffff
avgb: for every byte separately
CELL: groups of instructions
sumb rt,ra,rb
    rt{0:1} = rb{0} + rb{1} + rb{2} + rb{3}
    rt{2:3} = ra{0} + ra{1} + ra{2} + ra{3}
    rt{4:5} = rb{4} + rb{5} + rb{6} + rb{7}
    ...
    rt{14:15} = ra{12} + ra{13} + ra{14} + ra{15}
and rt,ra,rb
    bit-wise: t = a AND b
eqv rt,ra,rb
    bit-wise: t = NOT (a XOR b)
selb rt,ra,rb,rc
    bit-wise: t = c ? b : a
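The interleaving of `sumb` and the per-bit choice of `selb` are easier to see in executable form. This Python sketch uses hypothetical function names; `sumb` takes two lists of 16 bytes, and `selb` treats each register as one 128-bit integer.

```python
def sumb(ra, rb):
    """Sum bytes into halfwords: rb sums and ra sums alternate per word."""
    out = []
    for w in range(4):
        out.append(sum(rb[4 * w:4 * w + 4]))   # even halfword: 4 bytes of rb
        out.append(sum(ra[4 * w:4 * w + 4]))   # odd halfword: 4 bytes of ra
    return out

def selb(ra, rb, rc):
    """Select bits: take the bit of rb where rc is 1, else the bit of ra."""
    mask = (1 << 128) - 1
    return ((rb & rc) | (ra & ~rc)) & mask
```

Combined with a mask from `fsm` or `fsmi`, `selb` gives a branch-free way to merge two vectors element by element.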
CELL: groups of instructions
il rd,0
chd rc,4(rd)
shufb rt,ra,rb,rc
    ra = ra{0} ra{1} ra{2} ... ra{15}
    rb = rb{0} rb{1} rb{2} ... rb{15}
    rd{0:15} = 0
    rc = 0x10 0x11 0x12 0x13 0x14 0x15 0x16 0x17 0x08 0x09 0x1a 0x1b 0x1c 0x1d 0x1e 0x1f
    rt = rt{0} rt{1} rt{2} ... rt{15}
CELL: groups of instructions
shufb rt,ra,rb,rc
    ra = ra{0} ra{1} ra{2} ... ra{15}
    rb = rb{0} rb{1} rb{2} ... rb{15}
    rc = 0x80 0x11 0xe0 0x14 0x14 0x11 0xc0 0x17 0x08 0x0a 0x1a 0x0c 0x1c 0x0e 0x1e 0x1f
    rt = 0x00 rb{1} 0x80 rb{4} rb{4} rb{1} 0xff rb{7} ra{8} ra{10} rb{10} ra{12} rb{12} ra{14} rb{14} rb{15}
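The control bytes of `shufb` either pick a byte from the concatenation of ra and rb or produce a constant. The Python sketch below reproduces the example above; the function name is hypothetical, and the three constant patterns (top bits `10`, `110`, `111`) are read off the example values 0x80, 0xc0 and 0xe0.

```python
def shufb(ra, rb, rc):
    """Shuffle bytes: build 16 result bytes from ra, rb and control bytes rc."""
    src = list(ra) + list(rb)        # indices 0-15 are ra, 16-31 are rb
    out = []
    for ctl in rc:
        if (ctl & 0xC0) == 0x80:     # 10xxxxxx produces 0x00
            out.append(0x00)
        elif (ctl & 0xE0) == 0xC0:   # 110xxxxx produces 0xff
            out.append(0xFF)
        elif (ctl & 0xE0) == 0xE0:   # 111xxxxx produces 0x80
            out.append(0x80)
        else:                        # low five bits index into ra followed by rb
            out.append(src[ctl & 0x1F])
    return out
```

Feeding it the slide's control vector, with ra holding 100 to 115 and rb holding 200 to 215 as stand-in values, gives the same pattern of constants, ra bytes and rb bytes as the last row of the example.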
CELL: groups of instructions
Op group: examples
Shift and Rotate
− shl - shift left word
− shlqbybi - shift left quadword by bytes from bit shift count
− rotmi - rotate and mask word immediate
Compare, Branch and Halt
− heqi - halt if equal immediate
− ceqb - compare equal byte
− clgt - compare logical greater than word
− brsl - branch relative and set link
− brz - branch if zero word
Hint-for-Branch
− hbra - hint for branch (a-form)
CELL: groups of instructions
shlqbybi rt,ra,rb
    bit shift count in rb = H L, e.g.: 3
    ra = H0 H1 H2 H3 H4 H5 H6 ... L6 L5 L4 L3 L2 L1 L0
    rt = H3 H4 H5 H6 H7 H8 H9 ... L3 L2 L1 L0 0 0 0
CELL: groups of instructions
ceqb rt,ra,rb
    byte-wise: t = ( a == b ) ? 0xff : 0x00
clgt rt,ra,rb
    word-wise: t = ( a > b ) ? 0xffffffff : 0x00000000
brsl rt,symbol
    rt = PC+1
    jump to PC+symbol
brz rt,symbol
    if ( rt{0:3} == 0 ) jump to symbol
hbra brinst,brtarg
    Warns the processor: there will be a jump to brtarg, coming brinst instructions from the current PC
CELL: groups of instructions
Floating Point
− fa - floating add
− dfs - double floating subtract
− fma - floating multiply and add
− frest - floating reciprocal estimate
− csflt - convert signed integer to floating
− frds - floating round double to single
− dftsv - double floating test special value
− fcmgt - floating compare magnitude greater than
Control
− nop - no operation (execute)
− dsync - synchronize data
SPU Channel
− rdch - read channel
− rchcnt - read channel count
− wrch - write channel
SPU Interrupt Facility
− iret - return from interrupt
CELL: groups of instructions
Channels:
− A CELL chip has 128 so-called channels
− A channel is an atomic pipe for communication between the SPE and its environment
− The meaning of channels is defined by integrators and implemented in the MMIO (memory-mapped I/O) unit
− In current-day CELLs, channels are used, among other things, for implementing mailboxes and DMA operations
− So the meaning of these commands varies with the platform
Interrupts:
− An SPE can have one registered interrupt handler at address 0
− iret returns from such an interrupt
− Alternatively, interrupts can be disabled, and programs can use the bisled command (branch indirect and set link if external data) to check conditions that might otherwise be handled by an interrupt
nVIDIA GPGPU
CUDA instruction set
nVIDIA G92
[Figure: Streaming Multiprocessor (SM) block diagram: instruction fetch, instruction L1 cache, thread / instruction dispatch, SP0 to SP7 with register files RF0 to RF7, two SFUs, constant L1 cache, shared memory, load texture, load from memory, store to memory; work, control and results paths]
Stream Multiprocessor (SM)
− 8 Stream Processors (SPs)
− Sharing local “shared memory”
− Sharing L1 cache
Instruction set: PTX
− Parallel thread execution
PTX is one abstraction: meant to make GPGPU programming easier to handle
Parts of the card’s operation are not formulated in PTX
PTX programs are sandboxed, running in a virtual machine-like environment
nVIDIA G92
SPs are not as independent as PTX seems to imply
− From the manual: “Each multiprocessor has a Single Instruction, Multiple Data architecture (SIMD): At any given clock cycle, each processor of the multiprocessor executes the same instruction, but operates on different data.”
Implications
− You program several different threads
− Threads in the same SM execute the same instruction, some may wait
− If threads don’t execute the same instruction, they are divergent
nVIDIA G92
Parallelism
− Threads are grouped into warps
  • Threads of one warp run on the same SM at the same time
  • Efficiency is lost if they diverge
− The warps form cooperative thread arrays, or CTAs
− One CTA runs on only one SM
− CTAs have a 1D, 2D or 3D shape
  • software-defined
  • determines threads’ 3D ID (%tid)
− An SM can handle several CTAs in time-sharing mode
− Time-sharing is preemptive
The instruction set hides this entirely
CUDA
PTX doesn’t have conditional branches
However
− It has absolute branches
− It has the ability to hop over instructions: “Predicate”
Want to emulate
if(i<n)
    j=j+1
Steps
.reg .pred p
setp.lt.s32 p, i, n
@p add.s32 j, j, 1
Branching
setp.lt.s32 p, i, n
@p bra label
...
label:
...
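Predication means every thread steps through the same instruction while a per-thread flag decides whether it takes effect. The Python sketch below emulates the two-instruction sequence over a group of threads; the function name `predicated_add` and the list-per-register representation are assumptions made for illustration.

```python
def predicated_add(i_vals, j_vals, n):
    """Emulate, per thread: setp.lt.s32 p, i, n ; @p add.s32 j, j, 1."""
    p = [i < n for i in i_vals]              # setp.lt.s32 p, i, n
    return [j + 1 if pred else j             # @p add.s32 j, j, 1
            for j, pred in zip(j_vals, p)]
```

With `i` values 0 to 3 and `n = 2`, only the first two threads increment `j`; the others pass over the add without any branch being taken.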
CUDA
Branching is by default divergent
− Must tell the assembler if this is not the case:
setp.lt.s32 p, i, n
@p bra.uni label
...
label:
...
The threads can declare, load and store vectors of intrinsics
− .v2, .v3, .v4 of
− .b8, .b16, .b32, .b64, .s8, .s16, .s32, .s64, .u8, .u16, .u32, .u64, .f32, .f64
vector double a as on the Cell would be: .v2 .f64
but PTX can even say
ld.shared.v4.f64 rt,[a]
And, of course, this instruction is effectively working on 8 of these in parallel (8 × 4 × 64 = 2048 bits)
CUDA
comparison: set, setp, selp, slct
    EQ, NE, LT, LE, GT, GE
    EQ, NE, LO, LS, HI, HS
arithmetic: add, sub, mul, mad, mul24, mad24, sad, div, rem, abs, neg, min, max
logic and shift: and, or, xor, not, cnot, shl, shr
move and convert: mov, ld, st, cvt
    .const, .global, .local, .param, .shared
    .v2, .v3, .v4
    .pred, .b8, .b16, .b32, .b64, .u8, .u16, .u32, .u64, .s8, .s16, .s32, .s64, .f32, .f64
texture: tex
control flow: @, bra, call, ret, exit
    Considered divergent unless .uni is specified
sync’ing: atom, bar
    atom can have subcommands add, inc, dec, min, max, and, or, xor, cas, exch
float: rcp, sqrt, rsqrt, sin, cos, lg2, ex2
misc: trap, brkpt
CUDA
Given predicate p, general registers rt, ra, rb
selp.s32 rt, ra, rb, p
    sets rt = p ? ra : rb
General registers rt, ra, rb
mul.hi.u32 rt, ra, rb
    tmp = ra*rb
    rt = high 32 bits of tmp
Load an int into rt from shared memory addressed by ra plus offset
ld.shared.u32 rt,[ra+16]
Load a double vector into rt from global memory absolutely addressed
ld.global.v4.f64 rt,[240]
A true function call, recursion forbidden, in this case non-divergent
call.uni label
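The first two examples are pure register operations and can be checked with a short emulation. The Python sketch below uses hypothetical function names and models a single thread; operands of `mul_hi_u32` are treated as unsigned 32-bit values.

```python
def selp(ra, rb, p):
    """selp: choose ra when the predicate is true, otherwise rb."""
    return ra if p else rb

def mul_hi_u32(ra, rb):
    """mul.hi.u32: high 32 bits of the 64-bit unsigned product."""
    tmp = (ra & 0xFFFFFFFF) * (rb & 0xFFFFFFFF)
    return tmp >> 32
```

For example, `mul_hi_u32(0x10000, 0x10000)` is 1, because the full product is 2^32 and only its upper half is kept.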
CUDA
Atomically adds ra to the signed integer at the position in global memory pointed to by a
atom.global.add.s32 d,[a], ra
A barrier. Not initialized, requires all SPs to arrive here.
bar.sync 0
Reciprocal value: rt = 1/ra
rcp.f64 rt, ra
Square root: rt = sqrt(ra)
sqrt.f32 rt, ra
Base-2 logarithm: rt = log2(ra)
lg2.f32 rt, ra
Discussion
Which IS did you like best?
What tasks would you have them do?
What do you think about IS abstractions à la PTX?
How hard is using the SIMD paradigm?
What don’t you know from knowing the ISA?
Looks into ISAs of 3 specialized processors