Instruction Sets Instruction Sets November 21, 2015 INF5062: Programming asymmetric multi-core...

Instruction Sets

April 20, 2023

INF5062: Programming asymmetric multi-core processors

INF5062, Pål Halvorsen and Carsten Griwodz, University of Oslo

Instruction Sets

Why?
− Give insight into the designers’ minds
− Allow debugging

Classical programming
− Von Neumann machines


Instruction Sets

CISC
− Complex instructions
− Many cycles for each instruction
− Irregular number of cycles per instruction
− Two or more memory locations per instruction possible
− Implicit load and store operations
− Small code sizes
− Chip surface used for instruction storage

RISC
− Simple instructions
− Few cycles for each instruction
− Fixed or few different numbers of cycles per instruction
− No memory locations in computing instructions
− Explicit load and store operations
− Large code size
− Chip surface used for memory


Instruction Sets

SISD MISD

SIMD MIMD


Instruction Sets

Typical mechanisms for performance improvement
− Caching
− Data and instruction prefetching
− Pipelining
− Speculative execution
− Out-of-order execution
− Simultaneous multithreading (“hyperthreading”)

These mechanisms are not generally applied in heterogeneous multi-cores


IXP 2400

Engine instruction set


IXP Microengine Assembler

SISD architecture

Assembler provides
− Detailed control of timing
− Detailed control of hardware context switching
− Understanding of actual instructions

Assembler requires
− Knowledge of supporting hardware

Need to know
− No stack


IXP: I/O and Context Swap Instructions

DRAM
− Read and Write
− RBUF and TBUF

CAP
− Enumerated CSR Addressing
− Calculated Addressing
− Reflect

HALT, CTX_ARB

HASH, MSF, PCI

SCRATCH
− Read and Write
− Atomic Operations
− Ring Operations

SRAM
− Read and Write
− Atomic Operations
− Control Status Register (CSR)
− Read Queue Descriptor
− Write Queue Descriptor
− Enqueue
− Dequeue
− Ring Operations
− Journal Operations

− Can transfer bursts of data
− Operate asynchronously
− Can generate signals when finished
− Need two registers to specify addresses
− Where memory is aligned, address register bits aren’t shifted, but the least significant bits are ignored
− Can be combined with command tokens
− Support so-called indirect references: the state of a previous ALU command modifies the meaning


IXP: general instructions

ALU

ALU_SHF

ASR

BYTE_ALIGN_BE, _LE

CRC_BE, _LE

DBL_SHF

MUL_STEP

FFS

POP_COUNT

IMMED

IMMED_B0, _B1, _B2, _B3

IMMED_W0, _W1

LD_FIELD

LD_FIELD_W_CLR

LOAD_ADDR

LOCAL_CSR_RD, _WR

NOP


IXP: general instructions

Aligning long sequences of registers

LOCAL_CSR_WR[byte_index,2]
NOP
NOP
NOP
BYTE_ALIGN_BE[--,r0]
BYTE_ALIGN_BE[dest1,r1]
...

This can be aligned with asynchronous load and store ops


IXP: general instructions

General arithmetic operations on all kinds of registers

Status check
ALU[r1,--,B,r1]
  move r1 into r1, just updating the condition codes (negative, zero, overflow, carry)
  N = (r1 < 0), Z = (r1 == 0), V = 0, C = 0

Arithmetic operations
ALU[r3,r1,+,r2]
  r3 = r1 + r2
  N = (r3 < 0), Z = (r3 == 0), V = (r1+r2 ≥ 2^31), C = (r1+r2 ≥ 2^32)

ALU[r5,r1,+,r2]
ALU[r6,r3,+carry,r4]
  r5 = r1 + r2
  r6 = r3 + r4 + carry bit

ALU[r3,r1,+8,r2]
  r3 = r1 + (r2 & 0xff)

Boolean operations
ALU[r3,r1,AND~,r2]
  r3 = r1 AND NOT r2
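The `+` / `+carry` pair above is the usual way to build a wide addition from 32-bit pieces. The following is a minimal C sketch of that idea, not microengine code; the type `u64_pair` and the function `add_with_carry` are hypothetical names introduced only for illustration.

```c
#include <stdint.h>

/* Hypothetical helper type: a 64-bit value kept in two 32-bit "registers". */
typedef struct { uint32_t lo, hi; } u64_pair;

/* Adds two 64-bit values the way a "+" followed by a "+carry" ALU step would. */
u64_pair add_with_carry(u64_pair a, u64_pair b) {
    u64_pair r;
    r.lo = a.lo + b.lo;                 /* first ALU step: wraps modulo 2^32        */
    uint32_t carry = (r.lo < a.lo);     /* carry flag: set when the low sum wrapped */
    r.hi = a.hi + b.hi + carry;         /* second ALU step: adds the carry bit      */
    return r;
}
```

The carry is recovered by checking whether the low sum wrapped around, which is what the C flag records in hardware.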


IXP: general instructions

Replacing bytes
LD_FIELD[r2,0100,r1,<<rot12]   (without load_cc)
  move content of r1 to temporary space t
  rotate t left by 12 bits
  set r2 = ( r2 & 0xff00ffff ) OR ( t & 0x00ff0000 )
  don’t change any status bits
  LD_FIELD_W_CLR works the same way, but clears the bytes of r2 that the mask does not select

Find first set bit
IMMED[r1,0xc00]
FFS[r2,r1]
  set r2 to the lowest index of a set bit in r1, here 10 (0xc00 has bits 10 and 11 set)
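The two operations above can be mimicked in plain C. This is only a sketch under assumptions: the names `load_field` and `find_first_set` are made up for illustration, the byte mask is taken to count bit 0 as the least significant byte, and the return value of -1 for a zero input is a choice made here, not documented hardware behaviour.

```c
#include <stdint.h>

/* Rotate a 32-bit word left by n bits, valid for 0 < n < 32. */
static uint32_t rotl32(uint32_t x, unsigned n) {
    return (x << n) | (x >> (32u - n));
}

/* Replace the bytes of dst selected by byte_mask with the corresponding bytes
   of src rotated left by rot bits (rot < 32). Mask bit 0 selects the least
   significant byte, bit 3 the most significant byte (assumed convention). */
uint32_t load_field(uint32_t dst, unsigned byte_mask, uint32_t src, unsigned rot) {
    uint32_t t = rot ? rotl32(src, rot) : src;
    uint32_t bits = 0;
    for (unsigned i = 0; i < 4; i++)
        if (byte_mask & (1u << i))
            bits |= 0xffu << (8u * i);
    return (dst & ~bits) | (t & bits);
}

/* Index of the least significant set bit; -1 if no bit is set (assumption). */
int find_first_set(uint32_t x) {
    for (int i = 0; i < 32; i++)
        if (x & (1u << i))
            return i;
    return -1;
}
```

With the binary mask 0100 (0x4) and a rotation of 12, `load_field` touches only bits 16 to 23 of the destination, and `find_first_set(0xc00)` evaluates to 10.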


IXP: branching instructions

BCC

BEQ

BNE

BMI

BPL

BCS

BHS

BCC

BLO

BVS

BVC

BGE

BLT

BGT

BLE

BR

BR_BCLR, BR_BSET

BR=BYTE, BR!=BYTE

BR=CTX, BR!=CTX

BR_INP_STATE, BR_!INP_STATE

BR_SIGNAL, BR_!SIGNAL

JUMP

RTN



Jump table:

IMMED[r1,2]
JUMP[r1,label0#], targets[label0#, label1#, label2#, label3#]
...
label0#:
BR[elsewhere0#]
label1#:
BR[elsewhere1#]
label2#:
BR[elsewhere2#]
label3#:
BR[elsewhere3#]

The example jumps to label2#



Emulating a function call:

...
LOAD_ADDR[r1,return_label#]
BR[subroutine_label#]
return_label#:
...
subroutine_label#:
...
RTN[r1]
...

The example jumps to subroutine_label# and returns to return_label# (unless r1 gets overwritten)


IXP: branching instructions

Branching operations can be deferred to “do something else” before jumping
− Great for confusing the mortals
− Especially when combined with ctx_arb
− Saves registers
  • Don’t need to save temp2 in the example

...
immed[temp1,20,<<0]
immed[temp3,10,<<0]
alu[temp2,temp1,-,temp3]
bgt[label#], defer[3]
alu[temp,temp,+,temp2]
alu[temp,--,B,temp2]
alu[--,temp,-,temp3]
alu[temp,--,B,temp1]
...
label#:
...


IXP: content addressable memory

CAM_CLEAR

CAM_LOOKUP

CAM_READ_TAG

CAM_READ_STATE

CAM_WRITE

CAM_WRITE_STATE

Operate on CAM memory
− 1 per engine
− 16 longwords of 32 bits
− operated as an LRU cache of 32-bit write operations
− a lookup hit refreshes an access
− a 4-bit state can be associated with each CAM entry
  • written by CAM_WRITE_STATE
  • retrieved by CAM_LOOKUP or CAM_READ_STATE

The IXP1200 had CAM access to all SRAM memory!
− apparently it wasn’t worth it ...
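To make the CAM behaviour more concrete, here is a small software model in C: 16 tags of 32 bits, a 4-bit state per entry, and LRU ordering that a lookup hit refreshes. It is a sketch under assumptions, not the hardware interface; all names (`cam_t`, `cam_lookup`, `cam_write`, ...) are invented for illustration, and returning the least recently used entry on a miss is an assumption about how a caller would pick a slot to overwrite.

```c
#include <stdint.h>
#include <stdbool.h>

#define CAM_ENTRIES 16

typedef struct {
    uint32_t tag[CAM_ENTRIES];
    uint8_t  state[CAM_ENTRIES];   /* only the low 4 bits are used      */
    bool     valid[CAM_ENTRIES];
    uint8_t  order[CAM_ENTRIES];   /* order[0] = least recently used    */
} cam_t;

typedef struct { bool hit; uint8_t entry; uint8_t state; } cam_result;

void cam_clear(cam_t *cam) {
    for (int i = 0; i < CAM_ENTRIES; i++) {
        cam->tag[i] = 0;
        cam->state[i] = 0;
        cam->valid[i] = false;
        cam->order[i] = (uint8_t)i;
    }
}

/* Move entry e to the most-recently-used end of the order list. */
static void cam_touch(cam_t *cam, uint8_t e) {
    int pos = 0;
    while (cam->order[pos] != e) pos++;
    for (; pos < CAM_ENTRIES - 1; pos++)
        cam->order[pos] = cam->order[pos + 1];
    cam->order[CAM_ENTRIES - 1] = e;
}

/* Hit: report entry and state, refresh the LRU position.
   Miss: report the least recently used entry as replacement candidate. */
cam_result cam_lookup(cam_t *cam, uint32_t tag) {
    for (uint8_t i = 0; i < CAM_ENTRIES; i++) {
        if (cam->valid[i] && cam->tag[i] == tag) {
            cam_touch(cam, i);
            return (cam_result){ true, i, cam->state[i] };
        }
    }
    return (cam_result){ false, cam->order[0], 0 };
}

void cam_write(cam_t *cam, uint8_t entry, uint32_t tag, uint8_t state) {
    cam->tag[entry] = tag;
    cam->state[entry] = state & 0x0f;
    cam->valid[entry] = true;
    cam_touch(cam, entry);
}
```

A typical use is as a small software-managed cache index: look up a key, and on a miss write the key into the entry that the lookup suggested.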


Cell Broadband Engine

SPU instruction set


CELL

− CELL instruction set very similar to AltiVec
− A set of SIMD instructions for floating point operations
− Single precision is/was considered sufficient


CELL

− The CBE is big-endian
− The SPEs have 16-byte (128-bit) registers

MSB 1 2 3 4 5 6 7 8 9 10 11 12 13 14 LSB
(most significant byte at the lower memory addresses, least significant byte at the higher memory addresses)

Big-endian escapes the typical x86 *********:

#include <stdio.h>

int main(void) {
    char c[4]; int *e;
    c[0]=0x01; c[1]=0x02; c[2]=0x03; c[3]=0x04;
    e = (int*)c;            /* reinterpret the four bytes as one int */
    printf("0x%x\n",*e);    /* x86 (little-endian): 0x4030201, big-endian CBE: 0x1020304 */
    return 0;
}


CELL

MSB 1 2 3 4 5 6 7 8 9 10 11 12 13 14 LSB

Byte 0 Byte 1 Byte 2 Byte 3 Byte 4 Byte 5 Byte 6 Byte 7 Byte 8 Byte 9 Byte 10 Byte 11 Byte 12 Byte 13 Byte 14 Byte 15

Char 0 Char 1 Char 2 Char 3 Char 4 Char 5 Char 6 Char 7 Char 8 Char 9 Char 10 Char 11 Char 12 Char 13 Char 14 Char 15

Half-word 0 Half-word 1 Half-word 2 Half-word 3 Half-word 4 Half-word 5 Half-word 6 Half-word 7

Word 0 Word 1 Word 2 Word 3

Doubleword 0 Doubleword 1

Quadword 0


CELL

“Preferred slots” of ABI types in registers

Vector types fill the whole 128-bit register (qword)
− Data type: vector signed short: short [0] short [1] short [2] short [3] short [4] short [5] short [6] short [7]
− Data type: vector signed int: int [0] int [1] int [2] int [3]
− Data type: vector double: double [0] double [1]

Scalar types occupy only their “preferred slot” of the register
− Data type: double
− Data type: int
− Data type: short
− Data type: char

So what have they smoked?


CELL: groups of instructions

Op group, with examples

Memory Load and Store
− lqa - load quadword, from absolute address (a-form); straightforward
− stqx - store quadword, to address computed from sum of two registers; straightforward
− chd - generate controls for halfword insertion using the shufb instruction; see together with shufb instruction

Constant Formation
− il - immediate load word
− fsmbi - form select mask for bytes immediate

Integer and Logical Operations
− ah - add halfword
− cg - carry generate
− bgx - borrow generate
− mpyhha - multiply high high and add
− clz - count leading zeros
− cntb - count ones in bytes
− fsm - form select mask for words
− avgb - average bytes
− sumb - sum bytes into halfwords
− and - and
− eqv - equivalent
− selb - select bits
− shufb - shuffle bytes

il rt,0x12
  int [0] = 0x12, int [1] = 0x12, int [2] = 0x12, int [3] = 0x12

fsmbi rt,0x9c1a
  0x9c1a : 1001110000011010
  Byte 0 = 0xff, Byte 1 = 0x00, Byte 2 = 0x00, Byte 3 = 0xff, Byte 4 = 0xff, ..., Byte 15 = 0x00
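The form-select-mask-for-bytes example above expands each bit of a 16-bit immediate into one full byte. A minimal C sketch of that expansion follows; `form_select_mask_bytes` is a hypothetical name, and the bit-to-byte assignment (leftmost bit controls byte 0) is read off the slide's example.

```c
#include <stdint.h>

/* Expand a 16-bit immediate into a 16-byte mask: the leftmost bit (bit 15)
   controls byte 0, the rightmost bit (bit 0) controls byte 15. */
void form_select_mask_bytes(uint16_t imm, uint8_t out[16]) {
    for (int i = 0; i < 16; i++)
        out[i] = (imm & (0x8000u >> i)) ? 0xff : 0x00;
}
```

Masks of this kind are typically fed into a select instruction such as selb to pick bytes from one of two source registers.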








ah rt,ra,rb
  halfword-wise: rt [i] = ra [i] + rb [i], for i = 0 ... 7

cg rt,ra,rb
  word-wise: rt [i] = ( ra [i] + rb [i] ≥ 2^32 ) ? 0x01 : 0x00, for i = 0 ... 3

bgx rt,ra,rb
  word-wise: rt [i] = ( ra [i] > rb [i] ) ? 0x00 : 0x01, for i = 0 ... 3

mpyhha rt,ra,rb
  rt{0:3} = ra{0:1} * rb{0:1} + rt{0:3}
  rt{4:7} = ra{4:5} * rb{4:5} + rt{4:7}
  rt{8:11} = ra{8:9} * rb{8:9} + rt{8:11}
  rt{12:15} = ra{12:13} * rb{12:13} + rt{12:15}

For every byte separately

fsm rt,ra
  one mask bit of ra controls each word of rt
  rt{0:3} = ( bit == 0 ) ? 0x00000000 : 0xffffffff
  rt{4:7} = ( bit == 0 ) ? 0x00000000 : 0xffffffff
  rt{8:11} = ( bit == 0 ) ? 0x00000000 : 0xffffffff
  rt{12:15} = ( bit == 0 ) ? 0x00000000 : 0xffffffff

sumb rt,ra,rb
  rt{0:1} = rb{0} + rb{1} + rb{2} + rb{3}
  rt{2:3} = ra{0} + ra{1} + ra{2} + ra{3}
  rt{4:5} = rb{4} + rb{5} + rb{6} + rb{7}
  ...
  rt{14:15} = ra{12} + ra{13} + ra{14} + ra{15}

and rt,ra,rb
  bit-wise: t = a AND b

eqv rt,ra,rb
  bit-wise: t = NOT ( a XOR b )

selb rt,ra,rb,rc
  bit-wise: t = c ? b : a
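The sumb layout is easier to follow in code: within each 32-bit word, the four bytes of rb are summed into the first halfword and the four bytes of ra into the second. The C sketch below models registers as 16-byte arrays in big-endian byte order; `sum_bytes` is a hypothetical name used only for this illustration.

```c
#include <stdint.h>

/* For each of the four words: sum the four bytes of rb into the first
   halfword and the four bytes of ra into the second halfword of rt.
   Halfwords are stored big-endian, as on the Cell. */
void sum_bytes(const uint8_t ra[16], const uint8_t rb[16], uint8_t rt[16]) {
    for (int w = 0; w < 4; w++) {
        uint16_t sb = 0, sa = 0;
        for (int k = 0; k < 4; k++) {
            sb += rb[4 * w + k];
            sa += ra[4 * w + k];
        }
        rt[4 * w + 0] = (uint8_t)(sb >> 8);
        rt[4 * w + 1] = (uint8_t)(sb & 0xff);
        rt[4 * w + 2] = (uint8_t)(sa >> 8);
        rt[4 * w + 3] = (uint8_t)(sa & 0xff);
    }
}
```

Because four bytes sum to at most 1020, a 16-bit halfword can always hold the result.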








il rd,0
chd rc,4(rd)
shufb rt,ra,rb,rc

  rd{0:15} = 0
  control bytes in rc:
  0x10 0x11 0x12 0x13 0x14 0x15 0x16 0x17 0x08 0x09 0x1a 0x1b 0x1c 0x1d 0x1e 0x1f
  result bytes:
  rt{0} rt{1} rt{2} rt{3} rt{4} rt{5} rt{6} rt{7} rt{8} rt{9} rt{10} rt{11} rt{12} rt{13} rt{14} rt{15}

shufb rt,ra,rb,rc

  control bytes in rc:
  0x80 0x11 0xe0 0x14 0x14 0x11 0xc0 0x17 0x08 0x0a 0x1a 0x0c 0x1c 0x0e 0x1e 0x1f
  result bytes in rt:
  0x00 rb{1} 0x80 rb{4} rb{4} rb{1} 0xff rb{7} ra{8} ra{10} rb{10} ra{12} rb{12} ra{14} rb{14} rb{15}
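The second shufb example can be reproduced with a short software model. This C sketch assumes the control-byte rules that the example illustrates: patterns 10xxxxxx, 110xxxxx and 111xxxxx produce the constants 0x00, 0xff and 0x80, and any other control byte selects a byte from the 32-byte concatenation of ra and rb by its low five bits. `shuffle_bytes` is a hypothetical name.

```c
#include <stdint.h>

/* Byte shuffle over the concatenation ra || rb, steered by control bytes rc. */
void shuffle_bytes(const uint8_t ra[16], const uint8_t rb[16],
                   const uint8_t rc[16], uint8_t rt[16]) {
    for (int i = 0; i < 16; i++) {
        uint8_t c = rc[i];
        if ((c & 0xc0) == 0x80)      rt[i] = 0x00;          /* 10xxxxxx         */
        else if ((c & 0xe0) == 0xc0) rt[i] = 0xff;          /* 110xxxxx         */
        else if ((c & 0xe0) == 0xe0) rt[i] = 0x80;          /* 111xxxxx         */
        else if (c & 0x10)           rt[i] = rb[c & 0x0f];  /* index 16..31: rb */
        else                         rt[i] = ra[c & 0x0f];  /* index 0..15: ra  */
    }
}
```

Feeding in the control bytes from the example yields 0x00, rb{1}, 0x80, rb{4}, ... exactly as listed, which is a convenient way to check a hand-written shuffle pattern before using it on the SPU.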



Shift and Rotate
− shl - shift left word
− shlqbybi - shift left quadword by bytes from bit shift count
− rotmi - rotate and mask word immediate

Compare, Branch and Halt
− heqi - halt if equal immediate
− ceqb - compare equal byte
− clgt - compare logical greater than word
− brsl - branch relative and set link
− brz - branch if zero word

Hint-for-Branch
− hbra - hint for branch (a-form)

shlqbybi rt,ra,rb
  bit shift count in rb = H L (H: whole bytes, L: remaining bits); the shift is by the H bytes, e.g.: 3
  ra    = H0 H1 H2 H3 H4 H5 H6 ... L6 L5 L4 L3 L2 L1 L0
  -> rt = H3 H4 H5 H6 H7 H8 H9 ... L3 L2 L1 L0 0 0 0

ceqb rt,ra,rb
  byte-wise: t = ( a == b ) ? 0xff : 0x00

clgt rt,ra,rb
  word-wise: t = ( a > b ) ? 0xffffffff : 0x00000000

brsl rt,symbol
  rt = PC+1
  jump to PC+symbol

brz rt,symbol
  if( rt{0:3} == 0 ) jump to symbol

hbra brinst,brtarg
  Warns the processor: there will be a jump to brtarg, coming brinst instructions from the current PC


CELL: groups of instructions

Floating Point
− fa - floating add
− dfs - double floating subtract
− fma - floating multiply and add
− frest - floating reciprocal estimate
− csflt - convert signed integer to floating
− frds - floating round double to single
− dftsv - double floating test special value
− fcmgt - floating compare magnitude greater than

Control
− nop - no operation (execute)
− dsync - synchronize data

SPU Channel
− rdch - read channel
− rchcnt - read channel count
− wrch - write channel

SPU Interrupt Facility
− iret - return from interrupt

Channels:
− A CELL chip has 128 so-called channels
− A channel is an atomic pipe for communication between the SPE and its environment
− The meaning of channels is defined by integrators and implemented in the MMIO (memory-mapped I/O) unit
− In current-day CELLs, channels are among other things used for implementing mailboxes and DMA operations
− So the meaning of these commands varies with platform

Interrupts:
− An SPE can have one registered interrupt handler at address 0
− iret returns from such an interrupt
− Alternatively, interrupts can be disabled, and programs can use the bisled command (branch indirect and set link if external data) to check conditions that might otherwise be handled by interrupt


nVIDIA GPGPU

CUDA instruction set


nVIDIA G92

[Figure: block diagram of a Streaming Multiprocessor (SM): instruction fetch, instruction L1 cache, thread / instruction dispatch, stream processors SP0 to SP7 with register files RF0 to RF7, two SFUs, constant L1 cache with L1 fill, shared memory, load from memory, load texture, store to memory; work and control go in, results come out]

Streaming Multiprocessor (SM)
− 8 Stream Processors (SPs)
− Sharing local “shared memory”
− Sharing L1 cache

Instruction set: PTX
− Parallel thread execution

PTX is one abstraction: meant to make GPGPU programming easier to handle

Parts of the card’s operation are not formulated in PTX

PTX programs are sandboxed, running in a virtual machine-like environment


nVIDIA G92


SPs are not as independent as PTX seems to imply
− From the manual: “Each multiprocessor has a Single Instruction, Multiple Data architecture (SIMD): At any given clock cycle, each processor of the multiprocessor executes the same instruction, but operates on different data.”

Implications
− You program several different threads
− Threads in the same SM execute the same instruction, some may wait
− If threads don’t execute the same instruction, they are divergent


nVIDIA G92


Parallelism
− Threads are grouped into warps
  • Threads of one warp run on the same SM at the same time
  • Efficiency is lost if they diverge
− The warps form cooperative thread arrays, or CTAs
− One CTA runs on only one SM
− CTAs have a 1D, 2D or 3D shape
  • software-defined
  • determines threads’ 3D ID (%tid)
− An SM can handle several CTAs in time-sharing mode
− Time-sharing is preemptive

The instruction set hides this entirely


CUDA

PTX doesn’t have conditional branches

However
− It has absolute branches
− It has the ability to hop over instructions
  • “Predicate”

Want to emulate
if(i<n)
  j=j+1

Steps
.reg .pred p
setp.lt.s32 p, i, n
@p add.s32 j, j, 1

Branching
setp.lt.s32 p, i, n
@p bra label
...
label:
...


CUDA

Branching is by default divergent
− Must tell assembler if this is not the case:

setp.lt.s32 p, i, n
@p bra.uni label
...
label:
...

The threads can declare, load and store vectors of intrinsics
− .v2, .v3, .v4
of
− .b8, .b16, .b32, .b64, .s8, .s16, .s32, .s64, .u8, .u16, .u32, .u64, .f32, .f64

vector double a as on the Cell would be:
.v2 .f64

but PTX can even say
ld.shared.v4.f64 rt,[a]

And, of course, this instruction is effectively working on 8 of these in parallel (8 x 4 x 64 = 2048 bits)


CUDA

− comparison: set, setp, selp, slct
  • EQ, NE, LT, LE, GT, GE
  • EQ, NE, LO, LS, HI, HS
− arithmetic: add, sub, mul, mad, mul24, mad24, sad, div, rem, abs, neg, min, max
− logic and shift: and, or, xor, not, cnot, shl, shr
− move and convert: mov, ld, st, cvt
  • .const, .global, .local, .param, .shared
  • .v2, .v3, .v4
  • .pred, .b8, .b16, .b32, .b64, .u8, .u16, .u32, .u64, .s8, .s16, .s32, .s64, .f32, .f64
− texture: tex
− control flow: @, bra, call, ret, exit
  • Considered divergent unless .uni is specified
− sync’ing: atom, bar
  • atom can have subcommands add, inc, dec, min, max, and, or, xor, cas, exch
− float: rcp, sqrt, rsqrt, sin, cos, lg2, ex2
− misc: trap, brkpt


CUDA

Given predicate p, general registers rt, ra, rb
selp.s32 rt, ra, rb, p
  sets rt = p ? ra : rb

General registers rt, ra, rb
mul.hi.u32 rt, ra, rb
  tmp = ra*rb
  rt = high 32 bits of tmp

Load an int into rt from shared memory addressed by ra plus offset
ld.shared.u32 rt,[ra+16]

Load a double vector into rt from global memory absolutely addressed
ld.global.v4.f64 rt,[240]

A true function call, recursion forbidden, in this case non-divergent
call.uni label
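The selp and mul.hi semantics above map directly onto C expressions. The following sketch is scalar host-side C, not PTX; the function names `selp_s32` and `mul_hi_u32` are invented here to mirror the mnemonics.

```c
#include <stdint.h>

/* selp.s32 rt, ra, rb, p : rt = p ? ra : rb */
int32_t selp_s32(int32_t ra, int32_t rb, int p) {
    return p ? ra : rb;
}

/* mul.hi.u32 rt, ra, rb : rt = upper 32 bits of the 64-bit product */
uint32_t mul_hi_u32(uint32_t ra, uint32_t rb) {
    uint64_t tmp = (uint64_t)ra * rb;   /* widen first so the product cannot overflow */
    return (uint32_t)(tmp >> 32);
}
```

The cast to `uint64_t` before the multiplication matters: without it the product would be truncated to 32 bits and the high half would be lost.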


CUDA

Atomically adds ra to signed integer at position in global memory pointed to by a
atom.global.add.s32 d,[a], ra

A barrier. Not initialized, requires all SPs to arrive here.
bar.sync 0

Reciprocal value: rt = 1/ra
rcp.f64 rt, ra

Square root: rt = sqrt(ra)
sqrt.f32 rt, ra

Base-2 logarithm: rt = log2(ra)
lg2.f32 rt, ra


Discussion


Which IS did you like best?

What tasks would you have them do?

What do you think about IS abstractions à la PTX?

How hard is using the SIMD paradigm?

What don’t you know from knowing the ISA?

Looks into ISAs of 3 specialized processors
