The RISC-V Vector ISA · 2020. 8. 13. · The RISC-V Vector ISA Krste Asanovic, [email protected],...
-
The RISC-V Vector ISA
Krste Asanovic, [email protected], Vector WG Chair
Roger Espasa, [email protected], Vector WG Co-Chair
Vector Extension Working Group
7th RISC-V Workshop, Nov '17
-
Why a Vector Extension?
Vector ISA Goodness
• Reduced instruction bandwidth
• Reduced memory bandwidth
• Lower energy
• Exposes DLP
• Masked execution
• Gather/Scatter
• From small to large VPU
RISC-V Vector Extension
• Small
• Natural memory ordering
• Masks folded into vregs (*)
• Scalar, Vector & Matrix (*)
• Typed registers
• Reconfigurable
• Mixed-type instructions
• Common Vector/SIMD programming model
• Fixed-point support
• Easily extensible
• Best vector ISA ever :)
Domains
• Machine Learning
• Graphics
• DSP
• Crypto
• Structural analysis
• Climate modeling
• Weather prediction
• Drug design
• And more…
(*) Changed since last Workshop Presentation
-
The Vector ISA in a nutshell
• 32 vector registers (v0 … v31)
  • Each register can hold either a scalar, a vector or a matrix (shape)
  • Each vector register has an associated type (polymorphic encoding)
  • Variable number of registers (dynamically changeable)
• Vector instruction semantics
  • All instructions controlled by Vector Length (VL) register
  • All instructions can be executed under mask
  • Intuitive memory ordering model
  • Precise exceptions supported
• Vector instruction set:
  • All instructions present in baseline ISA are present in the vector ISA
  • Vector memory instructions supporting linear, strided & gather/scatter access patterns
  • Optional Fixed-Point set
  • Optional Transcendental set
-
New Architectural State
[Figure: the vector register file for MVL=8. Each register v0 … v31 holds elements e7 … e0 (32b wide in the example) plus a 16b type field.]
• vl (XLEN)
• vxrm (3b)
• vxsat (1b)
• vdcfg (512b)
Note: Floating point flags use the existing scalar flags
-
Complete Vector Instruction List
VOP: vmadd, vnmadd, vmsub, vnmsub, vadd, vaddi, vand, vandi, vdiv, vseq, vsge, vslt, vmax, vmerge, vmin, vmul, vmulh, vsne, vor, vori, vrem, vselect, vsll, vslli, vsra, vsrai, vsrl, vsrli, vsub, vxor, vxori, vclass, vpopc, vsgnj, vsgnjn, vsgnjx, vsqrt, vcvt, vround, vclip, vextract, vmv
VMEM: vld, vst, vlds, vsts, vldx, vstx, vamoswap, vamoadd, vamoand, vamoor, vamoxor, vamomax, vamomin
-
Adding two vector registers
-
vadd v1, v2 → v0
• When VL is zero, dest register is fully cleared
• Operations past 'vl' shall not raise exceptions
• Destination can be same as source
element:  7    6    5    4    3    2    1    0      (MVL=8, VL=5, F32, 32b elements)
v1:       h    g    f    e    d    c    b    a
v2:       p    o    n    m    l    k    j    i
v0:       0    0    0    e+m  d+l  c+k  b+j  a+i
for (i = 0; i < vl; i++) {
    v0[i] = v1[i] +F32 v2[i]
}
for (i = vl; i < MVL; i++) {
    v0[i] = 0
}
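The pseudocode above maps almost directly onto plain C. The following is a minimal sketch, assuming a fixed MVL of 8 and F32 elements as in the figure; the name `vadd_f32` and the array-based register model are illustrative and not part of the ISA.

```c
#include <assert.h>
#include <stddef.h>

#define MVL 8  /* assumption: maximum vector length fixed to the figure's value */

/* v0 = v1 + v2 for the first vl elements; elements vl..MVL-1 are zeroed.
 * v0 may alias v1 or v2, since element i is read before it is written. */
void vadd_f32(float v0[MVL], const float v1[MVL], const float v2[MVL], size_t vl)
{
    assert(vl <= MVL);
    for (size_t i = 0; i < vl; i++)
        v0[i] = v1[i] + v2[i];
    for (size_t i = vl; i < MVL; i++)
        v0[i] = 0.0f;
}
```

With `vl = 0` the first loop does not run and the whole destination is cleared, which matches the first bullet on the slide.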
-
How is this executed? SIMD? Vector? Up to you!
2-lane implementation
[Figure: the VRF feeding two +F32 units]
1st clock: a+i, b+j
2nd clock: c+k, d+l
3rd clock: e+m, 0
4th clock: up to you
-
How is this executed? SIMD? Vector? Up to you!
4-lane implementation
[Figure: the VRF feeding four +F32 units]
1st clock: a+i, b+j, c+k, d+l
2nd clock: e+m, 0, 0, 0
-
How is this executed? SIMD? Vector? Up to you!
8-lane implementation (a.k.a. SIMD)
[Figure: the VRF feeding eight +F32 units]
1st clock: a+i, b+j, c+k, d+l, e+m, 0, 0, 0
Number of lanes is transparent to programmer
Same code runs independent of # of lanes
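To make the lane-independence claim concrete, here is a small simulation sketch. It assumes a simple model in which each simulated clock processes one group of `lanes` consecutive elements and always sweeps all MVL elements; real implementations are free to schedule differently. The function name `vadd_f32_lanes` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MVL 8  /* assumption: matches the MVL used in the figures */

/* Computes v0 = v1 + v2 (tail zeroed) in groups of `lanes` elements.
 * Returns the number of simulated clocks needed to cover all MVL elements. */
size_t vadd_f32_lanes(float v0[MVL], const float v1[MVL], const float v2[MVL],
                      size_t vl, size_t lanes)
{
    assert(vl <= MVL && lanes > 0);
    size_t clocks = 0;
    for (size_t base = 0; base < MVL; base += lanes, clocks++) {
        /* one clock: every lane handles one element of the current group */
        for (size_t i = base; i < base + lanes && i < MVL; i++)
            v0[i] = (i < vl) ? v1[i] + v2[i] : 0.0f;
    }
    return clocks;
}
```

Calling it with 2, 4 and 8 lanes yields the same destination contents; only the clock count changes.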
-
Adding a vector and a scalar
-
Scalar values in the Vector Register File
• The data inside a VREG can have 3 possible shapes:
  • A single scalar value
  • A vector (i.e., what you'd expect)
  • A matrix (optional, not in the base spec)
• The current shape is held in the per-vreg type field
  • Shape changes cause a VRF reset (discussed later)
• A vector register with shape scalar
  • Only holds one value
  • Implementation choice: where exactly this one value is stored within the vector is not defined by the spec. Whether the value is replicated to every lane is also implementation dependent.
-
vadd v1, v2.s → v0
• Implementations are free to replicate the scalar value across all elements in the vector register
• Assembly notation for indicating scalar operands still T.B.D.
element:  7    6    5    4    3    2    1    0      (MVL=8, VL=5, F32, 32b elements)
v1:       h    g    f    e    d    c    b    a
v2 (s):   ?    ?    ?    ?    ?    ?    ?    i
v0:       0    0    0    e+i  d+i  c+i  b+i  a+i
for (i = 0; i < vl; i++) {
    v0[i] = v1[i] +F32 v2[0]
}
for (i = vl; i < MVL; i++) {
    v0[i] = 0
}
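A C sketch of the vector-plus-scalar case follows. It assumes, like the pseudocode, that the scalar sits in element 0 of the source register; the spec leaves the actual storage location to the implementation, so this is only one possible model. The name `vadd_vs_f32` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MVL 8  /* assumption: matches the figure */

/* v0[i] = v1[i] + scalar for i < vl; the tail is zeroed.
 * The scalar is read once up front so v0 may alias v2. */
void vadd_vs_f32(float v0[MVL], const float v1[MVL], const float v2[MVL], size_t vl)
{
    assert(vl <= MVL);
    const float s = v2[0];  /* assumption: scalar shape keeps its value in element 0 */
    for (size_t i = 0; i < vl; i++)
        v0[i] = v1[i] + s;
    for (size_t i = vl; i < MVL; i++)
        v0[i] = 0.0f;
}
```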
-
Masked execution
-
Masked execution
• Masks are stored in regular vector registers
  • The LSB of each element is used as a boolean "0" or "1" value
  • Other bits ignored
• Masks are computed with compare operations (vseq, vsne, vslt, vsge)
  • vseq v6, v7 → v1
  • Comparison results are integer "0" or "1" (can't be assigned to float types)
  • Encoded with as many bits as the destination register element size
• Instructions use 2 bits of encoding to select masked execution
  • 00: No masking (== assume masking is 0xFFFF…FFFF)
  • 01: unused (used for other encodings)
  • 10: Use the LSB of v1's elements as the mask
  • 11: Use the LSB of ~v1's elements as the mask
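The two mechanisms above, producing a mask with a compare and decoding the 2-bit mask field, can be sketched in C as follows. The enum names, `element_active` and `vseq_u32` are illustrative; zeroing the tail of the compare result is an assumption carried over from the vadd pseudocode.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MVL 8  /* assumption: fixed for the sketch */

/* The four values of the 2-bit mask field listed above. */
typedef enum {
    MASK_NONE   = 0,  /* 00: no masking */
    MASK_OTHER  = 1,  /* 01: not a mask mode (used for other encodings) */
    MASK_V1     = 2,  /* 10: LSB of the v1 element */
    MASK_NOT_V1 = 3   /* 11: LSB of the inverted v1 element */
} mask_mode;

/* Returns 1 if an element is active under the given mode, else 0. */
int element_active(mask_mode mode, uint32_t v1_elem)
{
    assert(mode != MASK_OTHER);
    if (mode == MASK_NONE)
        return 1;
    uint32_t lsb = v1_elem & 1u;  /* only the LSB matters; other bits are ignored */
    return mode == MASK_V1 ? (int)lsb : (int)(lsb ^ 1u);
}

/* vseq: writes integer 1 where a[i] == b[i], else 0; tail elements are zeroed. */
void vseq_u32(uint32_t vd[MVL], const uint32_t a[MVL], const uint32_t b[MVL], size_t vl)
{
    assert(vl <= MVL);
    for (size_t i = 0; i < MVL; i++)
        vd[i] = (i < vl && a[i] == b[i]) ? 1u : 0u;
}
```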
-
vadd v3, v4, v1.t → v5
• Remember: v1 is the only register used as mask source
• Masked-out operations shall not raise any exceptions
• Assembly notation still TBD
element:  7    6    5    4    3    2    1    0      (MVL=8, VL=5, F32, 32b elements)
v3:       h    g    f    e    d    c    b    a
v4:       p    o    n    m    l    k    j    i
lsb(v1):  1    0    1    0    1    1    1    0
v5:       0    0    0    0    d+l  c+k  b+j  0
for (i = 0; i < vl; i++) {
    v5[i] = lsb(v1[i]) ? v3[i] +F32 v4[i] : 0;
}
for (i = vl; i < MVL; i++) {
    v5[i] = 0
}
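The masked add can be sketched in C as below, assuming the mask register is modelled as an array of 32-bit integers whose LSBs carry the mask. `vadd_f32_masked` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MVL 8  /* assumption: matches the figure */

/* v5[i] = v3[i] + v4[i] where the LSB of v1[i] is set, else 0; the tail is zeroed. */
void vadd_f32_masked(float v5[MVL], const float v3[MVL], const float v4[MVL],
                     const uint32_t v1[MVL], size_t vl)
{
    assert(vl <= MVL);
    for (size_t i = 0; i < vl; i++)
        v5[i] = (v1[i] & 1u) ? v3[i] + v4[i] : 0.0f;
    for (size_t i = vl; i < MVL; i++)
        v5[i] = 0.0f;
}
```

With the figure's mask (LSBs 0, 1, 1, 1, 0 for elements 0 to 4) only elements 1 to 3 receive a sum, matching the v5 row above.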
-
Vector Load (unit stride)
-
vld 80(x3) → v5
• Unaligned addresses are legal, likely very slow
address:  @100  @104  @108  @112  @116  @120  @124  @128  @132  @136  @140
memory:   a     b     c     d     e     f     g     h     i     j     k
element:  7  6  5  4  3  2  1  0
v5:       0  0  0  e  d  c  b  a
sz = sizeof_type(v5);   // 4
tmp = x3 + 80;          // x3 = 20
for (i = 0; i < vl; i++) {
    v5[i] = read_mem(tmp, sz);
    tmp = tmp + sz;
}
for (i = vl; i < MVL; i++) {
    v5[i] = 0
}
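A C sketch of the unit-stride load follows, modelling memory as a byte array and using `memcpy` so that unaligned addresses work. The function name `vld_u32`, the 32-bit element type and the flat byte-array memory are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MVL 8  /* assumption: matches the figure */

/* Loads vl consecutive 32-bit elements starting at mem[base + offset];
 * elements vl..MVL-1 of the destination are zeroed. */
void vld_u32(uint32_t vd[MVL], const uint8_t *mem, size_t base, size_t offset, size_t vl)
{
    assert(vl <= MVL);
    size_t addr = base + offset;
    for (size_t i = 0; i < vl; i++) {
        memcpy(&vd[i], mem + addr, sizeof vd[i]);  /* memcpy tolerates unaligned addresses */
        addr += sizeof vd[i];
    }
    for (size_t i = vl; i < MVL; i++)
        vd[i] = 0;
}
```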
-
Strided Vector Load
-
vlds 80(x3,x9) → v5
• Stride 0 is legal
• Strides that result in unaligned accesses are legal
  • likely very slow
address:  @100  @104  @108  @112  @116  @120  @124  @128  @132  @136  @140
memory:   a     b     c     d     e     f     g     h     i     j     k
element:  7  6  5  4  3  2  1  0
v5:       0  0  0  i  g  e  c  a
sz = sizeof_type(v5);   // 4
tmp = x3 + 80;          // x3 = 20
for (i = 0; i < vl; i++) {
    v5[i] = read_mem(tmp, sz);
    tmp = tmp + x9;     // x9 = 8 = stride in bytes
}
for (i = vl; i < MVL; i++) {
    v5[i] = 0
}
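The strided variant differs from the unit-stride load only in how the address advances. A minimal C sketch, again assuming a byte-array memory model and 32-bit elements; `vlds_u32` is a hypothetical name and the stride is given in bytes as on the slide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MVL 8  /* assumption: matches the figure */

/* Loads element i from mem[base + offset + i * stride]; the tail is zeroed.
 * A stride of 0 reads the same address vl times. */
void vlds_u32(uint32_t vd[MVL], const uint8_t *mem, size_t base, size_t offset,
              ptrdiff_t stride, size_t vl)
{
    assert(vl <= MVL);
    const uint8_t *start = mem + base + offset;
    for (size_t i = 0; i < vl; i++)
        memcpy(&vd[i], start + (ptrdiff_t)i * stride, sizeof vd[i]);
    for (size_t i = vl; i < MVL; i++)
        vd[i] = 0;
}
```

With a stride of 8 bytes and VL=5 the loaded words come from @100, @108, @116, @124 and @132.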
-
Gather (indexed vector load)
-
vldx 80(x3,v2) → v5
• Repeated addresses are legal
• Unaligned addresses are legal, likely very slow
address:  @100  @104  @108  @112  @116  @120  @124  @128  @132  @136  @140
memory:   a     b     c     d     e     f     g     h     i
element:  7   6   5   4   3   2   1   0
v2:       0   0   0   12  12  0   32  8
v5:       0   0   0   d   d   a   i   c
sz = sizeof_type(v5);   // 4
tmp = x3 + 80;          // 100
for (i = 0; i < vl; i++) {
    addr = tmp + sext(v2[i]);
    v5[i] = read_mem(addr, sz);
}
for (i = vl; i < MVL; i++) {
    v5[i] = 0
}
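A gather can be sketched in C in the same style. This assumes signed 32-bit byte offsets in the index register (standing in for `sext(v2[i])`) and a byte-array memory; `vldx_u32` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MVL 8  /* assumption: matches the figure */

/* Loads element i from mem[base + offset + idx[i]]; the tail is zeroed.
 * Repeated and unaligned addresses are handled naturally. */
void vldx_u32(uint32_t vd[MVL], const uint8_t *mem, size_t base, size_t offset,
              const int32_t idx[MVL], size_t vl)
{
    assert(vl <= MVL);
    const uint8_t *start = mem + base + offset;
    for (size_t i = 0; i < vl; i++)
        memcpy(&vd[i], start + idx[i], sizeof vd[i]);  /* idx[i] is a signed byte offset */
    for (size_t i = vl; i < MVL; i++)
        vd[i] = 0;
}
```

Using the figure's index vector (8, 32, 0, 12, 12 for elements 0 to 4) yields c, i, a, d, d.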
-
Vector Store (unit stride)
-
vst v5 → 80(x3)
• Unaligned addresses are legal, likely very slow
address:  @100  @104  @108  @112  @116  @120  @124  @128  @132  @136  @140
memory:   a     b     c     d     e     f     g     h     i     j     k
element:  7  6  5  4  3  2  1  0
v5:       0  0  0  e  d  c  b  a
sz = sizeof_type(v5);   // 4
tmp = x3 + 80;          // x3 = 20
for (i = 0; i < vl; i++) {
    write_mem(tmp, sz, v5[i]);
    tmp = tmp + sz;
}
-
Strided Vector Store
-
vsts v5 → 80(x3,x9)
• Stride 0 is legal
• Strides that result in unaligned accesses are legal
  • likely very slow
address:  @100  @104  @108  @112  @116  @120  @124  @128  @132  @136  @140
memory:   a     b     c     d     e     f     g     h     i     j     k
element:  7  6  5  4  3  2  1  0
v5:       0  0  0  h  g  e  c  a
// x9 = stride in bytes
sz = sizeof_type(v5);   // 4
tmp = x3 + 80;          // x3 = 20
for (i = 0; i < vl; i++) {
    write_mem(tmp, sz, v5[i]);
    tmp = tmp + x9;     // x9 = 8 = stride in bytes
}
-
Scatter (indexed vector store)
-
vstx v5 → 80(x3,v2)
• Repeated addresses are legal
  • Provision for both ordered and unordered scatter
• Unaligned addresses are legal
  • likely very slow
address:  @100  @104  @108  @112  @116  @120  @124  @128  @132  @136  @140
memory:   a     b     c     d     e     f     g     h     i     j     k
element:  7   6   5   4   3   2   1   0
v5:       0   0   0   d   d   a   i   c
v2:       0   0   0   12  12  0   32  8
sz = sizeof_type(v5);   // 4
tmp = x3 + 80;          // 100
for (i = 0; i < vl; i++) {
    addr = tmp + sext(v2[i]);
    write_mem(addr, sz, v5[i]);
}
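The scatter pseudocode translates to C as sketched below. This models only the ordered flavour: elements are written in index order, so for repeated addresses the highest-numbered element wins. The name `vstx_u32`, the 32-bit element type and the byte-array memory are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MVL 8  /* assumption: matches the figure */

/* Stores element i of vs to mem[base + offset + idx[i]] for i < vl, in element order.
 * Elements at or beyond vl are not written. */
void vstx_u32(uint8_t *mem, size_t base, size_t offset,
              const int32_t idx[MVL], const uint32_t vs[MVL], size_t vl)
{
    assert(vl <= MVL);
    uint8_t *start = mem + base + offset;
    for (size_t i = 0; i < vl; i++)
        memcpy(start + idx[i], &vs[i], sizeof vs[i]);  /* later elements overwrite earlier ones */
}
```

An unordered scatter would be free to perform the same writes in any order, so the value left at a repeated address would not be determined by element order.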
-
Ordering
• From the point of view of a given hart
  • Vector load & store instructions happen in order
  • You don't need any fences to see your own stores
• From the point of view of other harts
  • Other harts see the vector memory accesses as if done by a scalar loop
  • So, they can be seen out-of-order by other harts
-
Typed Vector Registers
-
Typed Vector Registers
• Each vector register has an associated type
  • Yes, different registers can have different types (e.g., v2 can have type F16 and v3 have type F32)
  • Types can be mixed in an instruction under certain rules
    • Hardware will automatically promote some types to others (see next slide)
  • Types can be dynamically changed by the vcvt instruction
    • If the type change does not require more bits per element than in the current configuration
• Rationale for typed registers
  • Register types enable a "polymorphic" encoding for all vector instructions
  • Saves the large encoding space otherwise needed for convert from "type A" to "type B" instructions
  • More scalable into the future: supports custom types without additional encodings
• Supported types depend on the baseline ISA your implementation supports
  • RV32I → I8, U8, I16, U16, I32, U32
  • RV64I → I8, U8, I16, U16, I32, U32, I64, U64
  • RV128I → I8, U8, I16, U16, I32, U32, I64, U64, X128, X128U
  • F → F16, F32
  • FD → F16, F32, F64
  • FDQ → F16, F32, F64, F128
  • Provision for custom type extensions
-
Type & data conversions: vcvt
• To convert data into a different format
  • Use vcvt between registers of the appropriate type
    • vcvt v1 (F32) → v0 (F16)
    • vcvt v1 (U8) → v0 (F32)
    • vcvt v1 (F32) → v0 (I32)
• Additional feature: changing the dest register type with vcvt
  • vcvt v1 (F32) → v0 (F32), I32
  • Ignores the current dest type, and sets it to the type requested in the immediate
  • Legal if the requested type size is not bigger than the current configured element width
-
Mixing Types: promoting small into large
• When any source is smaller than dest, that source is "promoted" to dest size
  • If allowed by promotion table. Otherwise, instruction shall trap
• Promotion examples
  • vadd v1 (I8), v2 (I8) → v0 (I16)
  • vadd v1 (I8), v2 (I64) → v0 (I64)
  • vadd v1 (F16), v2 (F32) → v0 (F32)
  • vmadd v1 (F16), v2 (F16), v3 (F32) → v3 (F32)
• Table on the right defines valid promotions (reproduced in the backup slides)
  • Zero extend
  • Sign extend
  • Re-bias exponent and pad mantissa with 0's
Legend: se = sign extend, ze = zero extend, p = pass through, rb = re-bias, t = trap
-
Reconfigurable Vector Register File
-
Reconfigurable, variable-length Vector RF
• The vector unit is configured with a csrrw x1, vdcfg → x2
  • x1 contains the new configuration indicating
    • Number of logical registers (from 2 to 32)
    • Type for each vector register, using an incremental scheme
  • Hardware resets all vector state to zero
  • Hardware computes Maximum Vector Length (MVL)
    • based on x1 and available vector register file storage
    • MVL returned in x2
  • Can be done in user mode
  • Expected to be fast
• The vector unit is unconfigured by writing a 0 to vdcfg
  • Very good for saving on kernel save & restore!
  • Useful for low power state
• Implementation choices
  • Always return the same MVL, regardless of config
  • Split storage across logical registers, maybe losing some space
  • Pack logical registers as tightly as possible
IMPORTANT: ALL vector registers ALWAYS have the same NUMBER OF ELEMENTS (MVL)
-
User asks for 32 F32 registers
[Figure: one possible VRF organization, four 32b-wide banks each holding V0 … V31 and feeding a +F32 unit]
• Hardware has 32r x 4e x 4B = 512B
• Need
  • 4 bytes per v0 element
  • 4 bytes per v1 element
  • …
  • 4 bytes per v31 element
• Therefore
  • MVL = 512B / (32 * 4) = 4
• How is the VRF organized?
  • Many possible ways
  • Showing one possible organization
-
User asks for only 2 F32 registers
[Figure: an interleaved VRF organization, four 32b-wide banks each holding alternating V0 and V1 elements and feeding a +F32 unit]
• Hardware has 32r x 4e x 4B = 512B
• Need
  • 4 bytes per v0 element
  • 4 bytes per v1 element
• Therefore
  • MVL = 512B / (4 + 4) = 64
• How is the VRF organized?
  • Many possible ways
  • Showing an INTERLEAVED organization
-
User asks for only 2 F32 registers (also legal!)
[Figure: one possible VRF organization, four 32b-wide banks each holding a single V0 and a single V1 element and feeding a +F32 unit]
• Hardware has 32r x 4e x 4B = 512B
• Need
  • 4 bytes per v0 element
  • 4 bytes per v1 element
• Therefore
  • MVL = 512B / (4 + 4) = 64
• And yet, implementation…
  • …answers with MVL = 4
  • Absolutely legal!
• How is the VRF organized?
  • Many possible ways
  • Showing one possible organization
-
User asks for 2 F16 regs & 2 F32 regs
[Figure: one possible VRF organization. Each of four banks packs two F16 elements of V0 per 32b row, likewise for V1, then holds one F32 element of V2 and of V3 per row, with the remaining rows unused (4, 4, 8, 8 and 8 rows respectively).]
• Hardware has 32r x 4e x 4B = 512B
• Need
  • 2 bytes per v0 element
  • 2 bytes per v1 element
  • 4 bytes per v2 element
  • 4 bytes per v3 element
  • 4 'unused bytes' to nearest power of 2
• Therefore
  • MVL = 512B / (12B + 4B) = 32
• How is the VRF organized?
  • Many possible ways
  • Showing one possible organization
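The MVL arithmetic from the last few slides can be captured in a few lines of C. This sketch assumes the packing policy of the mixed-type example: sum the bytes needed per element index across all configured registers, round up to the next power of two, and divide the VRF size by that. It is only one of the implementation choices listed earlier; an implementation may legally return a smaller MVL. The function name `mvl_for_config` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the MVL obtained by dividing the VRF storage by the per-element-index
 * footprint of all configured registers, rounded up to a power of two. */
size_t mvl_for_config(size_t vrf_bytes, const size_t elem_bytes[], size_t nregs)
{
    size_t per_index = 0;
    for (size_t i = 0; i < nregs; i++)
        per_index += elem_bytes[i];
    assert(per_index > 0);

    size_t rounded = 1;  /* assumption: pad to the next power of two, as on the slide */
    while (rounded < per_index)
        rounded <<= 1;

    return vrf_bytes / rounded;
}
```

For the three configurations shown (32 F32 registers, 2 F32 registers, 2 F16 plus 2 F32 registers) and a 512B VRF this gives 4, 64 and 32.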
-
MVL is transparent to software!
• Code can be portable across
  • Different number of lanes
  • Different values of MVL
  • If using setvl instruction
• SETVL rs1, rd
  • vl = rs1 > MVL ? MVL : rs1
  • Encoded as csrrw
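The usual way to exploit setvl is a strip-mined loop: ask for as many elements as remain, process however many the hardware grants, and repeat. The C sketch below models this with a hypothetical `setvl` helper that implements the formula above; the inner scalar loop stands in for the vector load, add and store instructions.

```c
#include <assert.h>
#include <stddef.h>

/* Models SETVL: grants min(requested, mvl) elements. */
size_t setvl(size_t requested, size_t mvl)
{
    return requested > mvl ? mvl : requested;
}

/* c[i] = a[i] + b[i] for n elements, processed in chunks of at most mvl elements. */
void vec_add_stripmined(float *c, const float *a, const float *b, size_t n, size_t mvl)
{
    assert(mvl > 0);
    while (n > 0) {
        size_t vl = setvl(n, mvl);
        for (size_t i = 0; i < vl; i++)  /* stands in for vld, vld, vadd, vst */
            c[i] = a[i] + b[i];
        a += vl;
        b += vl;
        c += vl;
        n -= vl;
    }
}
```

The same function produces identical results for any positive `mvl`, which is the portability property the slide describes.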
-
Encoding Summary
-
Not covered today – ask offline
• Exceptions
• Kernel save & restore
• Custom types
  • Crypto WG has a good list of extended types that fit within 16b encoding
  • GFX has additional types
• Matrix shapes (coming soon)
  • Using the same vregs, don't panic!
  • vadd "matrix", "matrix" → "matrix"
  • vmul "matrix", "matrix" → "matrix"
-
Status & Plans
• Best Vector ISA ever! :)
• Goal is to have spec ready to be ratified by next workshop
  • Week of May 7th, 2018 in Barcelona
• Software
  • Expect LLVM to support it
  • Expect GCC auto-vectorizer to support it
• Please join the vector working group to participate
  • Meeting every 2nd Friday, 8am PST
  • Warning: GitHub spec is out-of-date: WIP to update to this presentation
-
BACKUP SLIDES
-
Reductions
-
vadd v1 → v0.s
• Implementations are free to replicate the final "sum" across all elements in the dest vector register
element:  7    6    5    4    3    2    1    0      (MVL=8, VL=5, F32, 32b elements)
v1:       h    g    f    e    d    c    b    a
v0 (s):   ?    ?    ?    ?    ?    ?    ?    sum
tmp = 0;
for (i = 0; i < vl; i++) {
    tmp = tmp + v1[i]
}
v0[0] = tmp;
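The reduction can be sketched in C as below. It follows the pseudocode in placing the sum in element 0 and leaves the other destination elements untouched, which is one of the behaviours the slide allows; the name `vredsum_f32` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MVL 8  /* assumption: matches the figure */

/* Sums the first vl elements of v1 and writes the result to v0[0]. */
void vredsum_f32(float v0[MVL], const float v1[MVL], size_t vl)
{
    assert(vl <= MVL);
    float tmp = 0.0f;
    for (size_t i = 0; i < vl; i++)
        tmp += v1[i];
    v0[0] = tmp;  /* assumption: the scalar result lives in element 0 */
}
```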
-
Promotion Table (large font)
Source type promotion (columns: source type, rows: dest type)
Dest \ Src  I64  I32  I16  I8   U64  U32  U16  U8   F64  F32  F16
I64         p    se   se   se   t    ze   ze   ze   t    t    t
I32         t    p    se   se   t    t    ze   ze   t    t    t
I16         t    t    p    se   t    t    t    ze   t    t    t
I8          t    t    t    p    t    t    t    t    t    t    t
U64         t    t    t    t    p    ze   ze   ze   t    t    t
U32         t    t    t    t    t    p    ze   ze   t    t    t
U16         t    t    t    t    t    t    p    ze   t    t    t
U8          t    t    t    t    t    t    t    p    t    t    t
F64         t    t    t    t    t    t    t    t    p    rb   rb
F32         t    t    t    t    t    t    t    t    t    p    rb
F16         t    t    t    t    t    t    t    t    t    t    p
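The promotion table lends itself to a direct lookup. The following C sketch transcribes the table above into a two-dimensional array indexed by destination and source type; the enum names and the `promotion` function are illustrative, not part of the proposal.

```c
#include <assert.h>

typedef enum { I64, I32, I16, I8, U64, U32, U16, U8, F64, F32, F16, NTYPES } vtype;

/* TR = trap, PA = pass through, SE = sign extend, ZE = zero extend, RB = re-bias */
typedef enum { TR, PA, SE, ZE, RB } promo;

/* promo_table[dest][src], transcribed row by row from the table above. */
static const promo promo_table[NTYPES][NTYPES] = {
    /* src:     I64 I32 I16 I8  U64 U32 U16 U8  F64 F32 F16 */
    /* I64 */ { PA, SE, SE, SE, TR, ZE, ZE, ZE, TR, TR, TR },
    /* I32 */ { TR, PA, SE, SE, TR, TR, ZE, ZE, TR, TR, TR },
    /* I16 */ { TR, TR, PA, SE, TR, TR, TR, ZE, TR, TR, TR },
    /* I8  */ { TR, TR, TR, PA, TR, TR, TR, TR, TR, TR, TR },
    /* U64 */ { TR, TR, TR, TR, PA, ZE, ZE, ZE, TR, TR, TR },
    /* U32 */ { TR, TR, TR, TR, TR, PA, ZE, ZE, TR, TR, TR },
    /* U16 */ { TR, TR, TR, TR, TR, TR, PA, ZE, TR, TR, TR },
    /* U8  */ { TR, TR, TR, TR, TR, TR, TR, PA, TR, TR, TR },
    /* F64 */ { TR, TR, TR, TR, TR, TR, TR, TR, PA, RB, RB },
    /* F32 */ { TR, TR, TR, TR, TR, TR, TR, TR, TR, PA, RB },
    /* F16 */ { TR, TR, TR, TR, TR, TR, TR, TR, TR, TR, PA },
};

/* Returns the action applied to a source operand of type src when the
 * destination register has type dest. */
promo promotion(vtype dest, vtype src)
{
    assert(dest < NTYPES && src < NTYPES);
    return promo_table[dest][src];
}
```

For example, the earlier `vadd` with two I8 sources and an I16 destination looks up `promotion(I16, I8)`, which is sign extend, while a narrowing combination such as `promotion(I32, I64)` traps.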