Vector Processors Motivations - Ryerson University
Vector Processors Motivations:
• Cannot increase performance with a deeper pipeline
because:
- clock cycle time limitation (latch delay)
- dependences increase with a deeper pipeline
• Cannot increase performance by multiple issue
because:
- limited instruction fetch and decode rate
(memory bottleneck)
- not enough ILP
Concept of Vector Processing
Provide operations that work on vectors.
A vector is a linear array of numbers.
One vector operation could add two 64-element vectors.
Advantages of Vector Operations:
• Operations on each pair of elements do not depend
on previous results (no data hazards).
• A single vector instruction does the work
of multiple instructions. This reduces instruction
memory bandwidth.
• Could use interleaved memory to reduce the latency
cost of main memory when fetching data elements.
Multiple elements could be accessed from multi-bank
memory in a single access.
• Reduces control hazards because an entire loop
is replaced by a single vector operation.
The Basic Vector Architecture
OP VR3, VR2, VR1 ; 64 operations
VR3(i) = VR2(i) OP VR1(i)
Scalar operation:
R3 = R2 OP R1
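[Figure: vector registers VR1 and VR2, elements 1 to 64, feed the function unit (FU); the result is written to VR3, elements 1 to 64.]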
The Vector Architecture Components:
• 1-Pipelined function units for vector operations.
Clock cycle time = latency / pipeline depth
Vector function units can use a very deep pipeline
(no hazards); this increases clock frequency.
• 2-Multiple vector function units:
Multiply, Add/Subtract, Divide, Integer, Logical.
This allows multiple vector operations to proceed at the same time.
• 3-A scalar unit (same as the basic pipeline).
• 4-Vector register file:
Each vector register is a fixed-length bank
holding a single vector (DLXV = 64 elements).
• 5-Multiple read/write ports for the vector
register file (16 read ports and 8 write ports).
• 6-Vector load-store unit:
Loads or stores a vector from/to memory.
Fully pipelined; words can be moved from/to
memory at a rate of 1 word/clock cycle (needs
interleaved memory).
• 7-Scalar registers:
Used to compute addresses for the load-store unit; these are
the 32 GPRs and 32 FP registers used in DLX.
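[Figure: DLXV block diagram. Main memory connects through the vector load-store unit to the vector registers; the scalar registers sit alongside. The vector registers connect (labels: "8 multiple words" and "16 multiple words") to the multiple vector function units: FP add/sub, FP multiply, FP divide, integer, logical.]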
Cray T-90, 1996
Clock rate = 500 MHz
Vector registers = 8
Each register = 128 x 64 bits
8 FP add, 1 FP multiply, 4 load/store, 1 integer
add/sub, 2 logical, 1 shift, 1 reciprocal
DLXV Instruction Set
ADDV   V1, V2, V3     ; V1(i) = V2(i) + V3(i)
ADDSV  V1, F0, V2     ; V1(i) = F0 + V2(i)
SUBV   V1, V2, V3     ; V1(i) = V2(i) - V3(i)
MULTV  V1, V2, V3     ; V1(i) = V2(i) x V3(i)
DIVV   V1, V2, V3     ; V1(i) = V2(i) / V3(i)
LV     V1, R1         ; V1(i) = M[R1 + i]
SV     R1, V1         ; M[R1 + i] = V1(i)
LVWS   V1, (R1,R2)    ; V1(i) = M[R1 + i x R2]
SVWS   (R1,R2), V1    ; M[R1 + i x R2] = V1(i)
SEQV   V1, V2         ; if (V1(i) == V2(i)) VM(i) = 1, where VM is the mask register
MOVI2S VLR, R1        ; VLR = R1 (vector length register, used for strip mining)
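Strip mining is only named above, so here is a minimal Python sketch of the idea, not DLXV code. It assumes a maximum vector length of 64 and splits a loop of n elements into strips; the value `vlr` stands for what MOVI2S would place in VLR before each strip. The function names and the choice to put the short strip last are assumptions made for illustration (some presentations handle the short strip first).

```python
MVL = 64  # maximum vector length: elements per DLXV vector register


def strip_mine(n, mvl=MVL):
    """Split n elements into (start, vlr) strips of at most mvl elements."""
    strips = []
    start = 0
    while start < n:
        vlr = min(mvl, n - start)  # value that would be moved into VLR
        strips.append((start, vlr))
        start += vlr
    return strips


def daxpy(a, x, y, mvl=MVL):
    """Y = a*X + Y, processed one strip at a time."""
    result = list(y)
    for start, vlr in strip_mine(len(x), mvl):
        # The inner loop models one LV/MULT/ADD/SV sequence of length vlr.
        for i in range(start, start + vlr):
            result[i] = a * x[i] + y[i]
    return result


print(strip_mine(150))  # three strips: 64, 64 and 22 elements
```

With n = 150 the loop runs three vector sequences instead of 150 scalar iterations.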
Example: Compare the performance of DLX
to DLXV for the DAXPY routine. Assume a vector
register holds 64 elements.
Y = A × X + Y
DLX Code:
      LD    F0, A          ; F0 = scalar A
      ADDI  R4, Rx, #512   ; R4 = last address (64 elements x 8 bytes)
LOOP: LD    F2, 0(Rx)      ; F2 = X[i]
      MULTD F2, F0, F2     ; F2 = A.X[i]
      LD    F4, 0(Ry)      ; F4 = Y[i]
      ADDD  F4, F4, F2     ; F4 = A.X[i] + Y[i]
      SD    0(Ry), F4      ; Y[i] = A.X[i] + Y[i]
      ADDI  Rx, Rx, #8     ; i+1 for X
      ADDI  Ry, Ry, #8     ; i+1 for Y
      SUB   R20, R4, Rx    ; R20 = last address - current address (0 when done)
      BNZ   R20, LOOP
Number of instructions = 2 + 64 x 9 = 578
Total time = 578 cycles with no hazards
DLXV Code:
LD    F0, A        ; 1 cycle
LV    V1, Rx       ; V1[i] = X[i]            64 cycles
MULTV V2, V1, F0   ; V2[i] = A.X[i]          64 cycles
LV    V3, Ry       ; V3[i] = Y[i]            64 cycles
ADDV  V4, V2, V3   ; V4[i] = A.X[i] + Y[i]   64 cycles
SV    Ry, V4       ; Y[i] = A.X[i] + Y[i]    64 cycles
Only 6 instructions
Instruction count ratio = 578/6 ≈ 96, so instruction memory bandwidth drops to about 1/100
No control hazards or loop overhead (BNZ, SUB, ADDI)
Total time = 1 + 5 x 64 = 321 cycles
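The two counts can be reproduced with a few lines of Python. This is a sketch of the slide's simplified model only: one cycle per scalar instruction with no hazards, and each vector instruction taking one cycle per element with no overlap between vector instructions. The function names are made up for illustration.

```python
def dlx_cycles(n, setup_instructions=2, loop_instructions=9):
    """Scalar DLX: 2 setup instructions plus 9 instructions per element."""
    return setup_instructions + n * loop_instructions


def dlxv_cycles(n, scalar_instructions=1, vector_instructions=5):
    """DLXV: 1 scalar load plus 5 vector instructions of n cycles each."""
    return scalar_instructions + vector_instructions * n


n = 64
scalar = dlx_cycles(n)   # 2 + 64 * 9
vector = dlxv_cycles(n)  # 1 + 5 * 64
print(scalar, vector, round(scalar / vector, 2))
print("instruction count ratio:", round(scalar / 6, 1))
```

Under this model the vector version is about 1.8 times faster, while the number of instructions fetched falls by a factor of roughly 96.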
Applications For Vector Operations
1-Multimedia processing (compression, graphics, ...)
2-Standard scientific computing (matrix multiplication,
FFT, convolution, sort)
3-Database (data mining, image/video serving)
4-Operating systems/networking (memcpy, ...)
5-Speech and handwriting recognition
EXAMPLE: MMX
MMX Technology is a set of instructions for multimedia
and communication applications.
• Single Instruction Multiple Data (SIMD) technique
for vector operations
• 57 new instructions
• Uses eight 64-bit MMX registers
• Four new data types:
– 1-Packed byte - eight bytes packed into 64 bits
– 2-Packed word - four 16-bit words
– 3-Packed doubleword - two 32-bit doublewords
– 4-Quadword - one 64-bit quantity
Graphics pixel data use 8-bit integers; 8 of
these pixels can be packed together and moved
to an MMX register.
When an MMX instruction executes, it operates on
all 8 pixels at once (SIMD) to perform arithmetic
or logic operations.
MMX 64 bit 63 56 7 0
Register ----------- -------
| pixel8| | pixel1 |
---------- ----------
MMX is integrated into the Intel architecture.
MMX is fully compatible with existing applications
and operating systems because it aliases its registers
and state onto the FP registers and state.
MMX Instruction Set
It operates on byte (B), word (W), doubleword (DW)
or quadword (QW).
• Basic arithmetic operations (add, subtract, multiply,
shift, and multiply-add)
PADD[B,W,D] = add with wrap-around
PMULHW = packed multiply high on words
• Comparison operations:
PCMPEQ[B,W,D] = packed compare for equality
• Conversion between data types:
PACKUSWB = pack words into bytes (unsigned)
• Data transfer
MOV[D,Q] = move [doubleword, quadword] to
MMX register or from MMX register
• Shift
PSLL[W,D,Q] = packed shift left logical by
immediate
Examples
PADD[W]:
       -----------------------------
MMX1   a3     a2     a1     FFFF
       -----------------------------
       +      +      +      +
       -----------------------------
MMX2   b3     b2     b1     8000
       -----------------------------
=      -----------------------------
       a3+b3  a2+b2  a1+b1  7FFF    wrap around
       -----------------------------
PMADDWD: 16b x 16b ---> 32b
       -----------------------------
MMX1   a3     a2     a1     a0
       -----------------------------
       x      x      x      x
       -----------------------------
MMX2   b3     b2     b1     b0
       -----------------------------
=      -----------------------------
       a3xb3+a2xb2   a1xb1+a0xb0
       -----------------------------
Examples
PCMPGT[W]:
----------------------
MMX1 23 45 16 39
---------------------
gt? gt? gt? gt?
MMX2 ---------------------
31 7 16 67
---------------------
= ---------------------
0000 FFFF 0000 0000
---------------------
Application Example: Chroma Keying
Conditional selection of pixels and overlay on a
background.
Example: a TV weatherman overlaid on the image of a
weather map.
Assume we want to overlay a person's picture on a
picture of spring blossom:
1-Assume that the person's picture has a green background.
2-Compare each pixel of the person's picture with a pixel
of green colour using PCMPEQ. The resulting mask is all
ones where the pixel is green (background) and all zeros
where the pixel belongs to the person.
3-Use the AND NOT instruction between the mask and the
person's picture to get the person only.
4-Use the AND instruction between the mask and the spring
blossom to get the spring blossom only in place of the
green background.
5-Use the OR instruction on the results of 3 and 4 to get
the person's picture with spring blossom in the background.
2-
Person     --------------------------
picture    x1     x2     x3     x4
           --------------------------
PCMPEQW
           --------------------------
           green  green  green  green
           --------------------------
= MASK     --------------------------
           FFFF   0000   0000   FFFF
           --------------------------
3-
MASK       --------------------------
           FFFF   0000   0000   FFFF
           --------------------------
PANDN
Person     --------------------------
picture    x1     x2     x3     x4
           --------------------------
=          --------------------------
           0000   x2     x3     0000
           --------------------------
4-
MASK       --------------------------
           FFFF   0000   0000   FFFF
           --------------------------
PAND
spring     --------------------------
blossom    y1     y2     y3     y4
           --------------------------
=          --------------------------
           y1     0000   0000   y4
           --------------------------
5-
           --------------------------
           0000   x2     x3     0000
           --------------------------
POR
           --------------------------
           y1     0000   0000   y4
           --------------------------
=          --------------------------
           y1     x2     x3     y4
           --------------------------
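The compare, AND NOT, AND, OR sequence can be modelled in Python on lists of four 16-bit values. This is a behavioural sketch, not MMX code: the helper names mirror the mnemonics, and the key colour 0x07E0 and the pixel values are hypothetical, chosen only to reproduce the mask pattern FFFF 0000 0000 FFFF shown above.

```python
WORD_MASK = 0xFFFF


def pcmpeqw(a, b):
    """Packed compare for equality: all ones where equal, else all zeros."""
    return [WORD_MASK if x == y else 0 for x, y in zip(a, b)]


def pandn(mask, src):
    """PANDN: invert the first operand, then AND it with the second."""
    return [(~m & WORD_MASK) & s for m, s in zip(mask, src)]


def pand(a, b):
    return [x & y for x, y in zip(a, b)]


def por(a, b):
    return [x | y for x, y in zip(a, b)]


def chroma_key(person, blossom, key_colour):
    mask = pcmpeqw(person, [key_colour] * len(person))  # ones = background
    person_only = pandn(mask, person)    # background pixels become 0000
    blossom_only = pand(mask, blossom)   # person pixels become 0000
    return por(person_only, blossom_only)


GREEN = 0x07E0  # hypothetical 16-bit key colour
person = [GREEN, 0x1234, 0x5678, GREEN]
blossom = [0xAAAA, 0xBBBB, 0xCCCC, 0xDDDD]
print([hex(p) for p in chroma_key(person, blossom, GREEN)])
```

No branch is taken per pixel: the mask does the selection, which is the point of the example.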
Intel MMX™ Technology Overview
Order Number: 243081-002
March 1996
Copyright © Intel Corporation 1996
CONTENTS
INTRODUCTION
DATA TYPES
  Data Types in 64-bit Registers
COMPATIBILITY
DETECTING THE PRESENCE OF MMX™ TECHNOLOGY
INSTRUCTIONS
  MMX™ Instruction Set Summary
  Instruction Examples
APPLICATION EXAMPLES
  Conditional Select
  Chroma Keying
  Matrix Multiply
  24-Bit Color
  Image Dissolve Using Alpha Blending
SUMMARY
RELATED DOCUMENTATION
INTRODUCTION
The volume and complexity of data processed by today’s personal computer are increasing exponentially, placing incredible demands on the microprocessor. New communications, games and “edutainment” applications feature video, 3D graphics, animation, audio and virtual reality, all of which demand ever increasing levels of performance.
Intel’s MMX™ technology is designed to accelerate multimedia and communications applications. The technology includes new instructions and data types that allow applications to achieve a new level of performance. It exploits the parallelism inherent in many multimedia and communications algorithms, yet maintains full compatibility with existing operating systems and applications.
MMX technology is the most significant enhancement to the Intel Architecture since the Intel386™ processor, which extended the architecture to 32 bits. Processors enabled with MMX technology will deliver enough performance to execute compute-intensive communications and multimedia tasks with headroom left to run other tasks or applications. They allow software developers to design richer, more exciting applications for the PC. The volume of MMX technology-enabled systems will grow rapidly in 1997 as the technology is incorporated into multiple processor generations from Intel.
The definition of MMX technology resulted from a joint effort between Intel’s microprocessor architects and software developers. A wide range of software applications was analyzed, including graphics, MPEG video, music synthesis, speech compression, speech recognition, image processing, games, video conferencing and more. These applications were broken down to identify the most compute-intensive routines, which were then analyzed in detail using advanced computer-aided engineering tools. The results of this extensive analysis showed many common, fundamental characteristics across these diverse software categories. The key attributes of these applications were:
• Small integer data types (for example: 8-bit graphics pixels, 16-bit audio samples)
• Small, highly repetitive loops
• Frequent multiplies and accumulates
• Compute-intensive algorithms
• Highly parallel operations
MMX technology is designed as a set of basic, general purpose integer instructions that can be easily applied to the needs of the wide diversity of multimedia and communications applications. The highlights of the technology are:
• Single Instruction, Multiple Data (SIMD) technique
• 57 new instructions
• Eight 64-bit wide MMX registers
• Four new data types
The basis for MMX technology is a technique called Single Instruction, Multiple Data (SIMD). This allows many pieces of information to be processed with a single instruction, providing parallelism that greatly increases performance. This technology combined with the IA superscalar architecture will provide substantial performance enhancement to the PC platform. MMX technology is integrated into Intel Architecture processors in a way that maintains full compatibility with existing operating systems, including MS DOS*, Windows* 3.1, Windows 95, OS/2* and Unix*. In addition, the full base of Intel architecture software will run on MMX technology-enabled systems.
MMX technology was defined to be simple. MMX technology is general enough to address the needs of a large domain of PC applications built from current and future algorithms. MMX instructions are not privileged; they can be used in applications, codecs, algorithms, and drivers.
DATA TYPES
The principal data type of the IA MMX instruction set is the packed, fixed-point integer, where multiple integer words are grouped into a single 64-bit quantity. These 64-bit quantities are moved into the 64-bit MMX registers. The decimal point of the fixed-point values is implicit and is left for the programmer to control for maximum flexibility. The supported data types are signed and unsigned fixed-point integers, bytes, words, doublewords and quadwords.
The four MMX technology data types are:
• Packed byte: eight bytes packed into one 64-bit quantity
• Packed word: four 16-bit words packed into one 64-bit quantity
• Packed doubleword: two 32-bit doublewords packed into one 64-bit quantity
• Quadword: one 64-bit quantity
As an example, graphics pixel data are generally represented in 8-bit integers, or bytes. With MMX technology, eight of these pixels are packed together in a 64-bit quantity and moved into an MMX register. When an MMX instruction executes, it takes all eight of the pixel values at once from the MMX register, performs the arithmetic or logical operation on all eight elements in parallel, and writes the result into an MMX register.
Data Types in 64-bit Registers
[Figure: layout of the four data types within a 64-bit register]
Packed byte (eight 8-bit elements): bits 63-56, 55-48, 47-40, 39-32, 31-24, 23-16, 15-8, 7-0
Packed word (four 16-bit elements): bits 63-48, 47-32, 31-16, 15-0
Packed doubleword (two 32-bit elements): bits 63-32, 31-0
Quadword (64-bit element): bits 63-0
COMPATIBILITY
MMX technology retains its full compatibility with existing operating systems and applications by aliasing its registers and state upon the IA floating-point registers and state. Therefore, no new registers or states are added to support MMX technology. This means that the operating system uses the standard mechanisms for interacting with the floating-point state to save and restore MMX code. For example, during a task switch, the operating system would use an FSAVE and FRSTOR to preserve either floating-point or MMX code. Aliasing the MMX state upon the floating-point state does not preclude applications from executing both MMX technology routines and floating-point routines.
Floating-point instructions that save/restore the floating-point state also handle the MMX state (for example, during context switching). The same techniques used by the floating-point architecture to interface with the operating system are used by MMX technology. MMX technology does not introduce any new exception or state information, so today’s operating systems can enable applications using MMX instructions.
DETECTING THE PRESENCE OF MMX™ TECHNOLOGY
Detecting the existence of MMX technology on an Intel microprocessor is done by executing the CPUID instruction and checking a set bit. This gives software developers the flexibility to determine the specific code in their software to execute. During install or run time the software can query the microprocessor to determine if MMX technology is supported and install or execute the code that includes, or does not include, MMX instructions based on the result.
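The text does not say which bit is checked. In Intel's architecture documentation the MMX flag is bit 23 of the EDX feature flags returned by CPUID function 1; the sketch below assumes that and only shows the bit test and the dispatch decision. Python cannot execute CPUID itself, so the EDX value is passed in as a plain integer, and the function names and return strings are hypothetical.

```python
MMX_FEATURE_BIT = 23  # assumed: CPUID function 1, EDX bit 23


def has_mmx(edx_feature_flags):
    """Return True if the MMX bit is set in the given EDX feature flags."""
    return bool((edx_feature_flags >> MMX_FEATURE_BIT) & 1)


def choose_code_path(edx_feature_flags):
    """Pick which routine an installer or program would use."""
    return "mmx_routine" if has_mmx(edx_feature_flags) else "scalar_routine"


print(choose_code_path(1 << 23))  # flags with only the MMX bit set
print(choose_code_path(0))        # flags with no features set
```

The same check can run once at install time or at every start-up, as the paragraph above describes.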
INSTRUCTIONS
The MMX instructions cover several functional areas including:
• Basic arithmetic operations such as add, subtract, multiply, arithmetic shift and multiply-add
• Comparison operations
• Conversion instructions to convert between the new data types - pack data together, and unpack from small to larger data types
• Logical operations such as AND, AND NOT, OR, and XOR
• Shift operations
• Data Transfer (MOV) instructions for MMX register-to-register transfers, or 64-bit and 32-bit load/store to memory
Arithmetic and logical instructions are designed to support the different packed integer data types. These instructions have a different op code for each data type supported. As a result, the new MMX technology instructions are implemented with 57 op codes.
MMX technology uses general-purpose, basic instructions that are fast and are easily assigned to the parallel pipelines in Intel processors. By using this general-purpose approach, MMX technology provides performance that will scale well across current and future generations of Intel processors.
MMX™ Instruction Set Summary
The instructions and corresponding mnemonics in the table below are grouped by function categories.
If an instruction supports multiple data types, byte (B), word (W), doubleword (DW), or quadword (QW), the data types are listed in brackets. Only one data type may be chosen for a given instruction. For example, the base mnemonic PADD (packed add) has the following variations: PADDB, PADDW, and PADDD. The number of opcodes associated with each base mnemonic is listed.
Category        Mnemonic              Opcodes  Description
Arithmetic      PADD[B,W,D]           3        Add with wrap-around on [byte, word, doubleword]
                PADDS[B,W]            2        Add signed with saturation on [byte, word]
                PADDUS[B,W]           2        Add unsigned with saturation on [byte, word]
                PSUB[B,W,D]           3        Subtraction with wrap-around on [byte, word, doubleword]
                PSUBS[B,W]            2        Subtract signed with saturation on [byte, word]
                PSUBUS[B,W]           2        Subtract unsigned with saturation on [byte, word]
                PMULHW                1        Packed multiply high on words
                PMULLW                1        Packed multiply low on words
                PMADDWD               1        Packed multiply on words and add resulting pairs
Comparison      PCMPEQ[B,W,D]         3        Packed compare for equality [byte, word, doubleword]
                PCMPGT[B,W,D]         3        Packed compare greater than [byte, word, doubleword]
Conversion      PACKUSWB              1        Pack words into bytes (unsigned with saturation)
                PACKSS[WB,DW]         2        Pack [words into bytes, doublewords into words] (signed with saturation)
                PUNPCKH[BW,WD,DQ]     3        Unpack (interleave) high-order [bytes, words, doublewords] from MMX register
                PUNPCKL[BW,WD,DQ]     3        Unpack (interleave) low-order [bytes, words, doublewords] from MMX register
Logical         PAND                  1        Bitwise AND
                PANDN                 1        Bitwise AND NOT
                POR                   1        Bitwise OR
                PXOR                  1        Bitwise XOR
Shift           PSLL[W,D,Q]           6        Packed shift left logical [word, doubleword, quadword] by amount specified in MMX register or by immediate value
                PSRL[W,D,Q]           6        Packed shift right logical [word, doubleword, quadword] by amount specified in MMX register or by immediate value
                PSRA[W,D]             4        Packed shift right arithmetic [word, doubleword] by amount specified in MMX register or by immediate value
Data Transfer   MOV[D,Q]              4        Move [doubleword, quadword] to MMX register or from MMX register
FP & MMX        EMMS                  1        Empty MMX state
State Mgmt
Instruction Examples
The following section will describe briefly five examples of MMX instructions. For illustration, the data type shown in this section will be the 16-bit word data type; most of these operations also exist for 8-bit or 32-bit packed data types.
The following example shows a packed add word with wrap-around. It performs four additions of the eight 16-bit elements, with each addition independent of the others and in parallel. In this case, the right-most result exceeds the maximum value representable in 16 bits, so it wraps around. This is the way regular IA arithmetic behaves. FFFFh + 8000h would be a 17-bit result. The 17th bit is lost because of wrap-around, so the result is 7FFFh.
     a3       a2       a1       FFFFh
  +  b3    +  b2    +  b1    +  8000h
  =  a3+b3    a2+b2    a1+b1    7FFFh
PADD[W]: Wrap-around Add
The following example is for a packed add word with unsigned saturation. This example uses the same data values from before. The right-most add generates a result that does not fit into 16 bits; consequently, in this case saturation occurs. Saturation means that if addition results in overflow or subtraction results in underflow, the result is clamped to the largest or the smallest value representable. For an unsigned, 16-bit word, the largest and the smallest representable values are FFFFh and 0000h; for a signed word the largest and the smallest representable values are 7FFFh and 8000h. This is important for pixel calculations where this would prevent a wrap-around add from causing a black pixel to suddenly turn white while, for example, doing a 3D graphics Gouraud shading loop.
     a3       a2       a1       FFFFh
  +  b3    +  b2    +  b1    +  8000h
  =  a3+b3    a2+b2    a1+b1    FFFFh
PADDUS[W]: Saturating Arithmetic
The specific instruction here is Packed Add Unsigned Saturation Word (PADDUSW). A complete set of ADD operations exists for signed and unsigned cases. The number FFFFh, treated as unsigned (65,535 decimal), is added to 8000h unsigned (32,768), and the result saturates to FFFFh - the largest representable unsigned 16-bit value.
There is no “saturation mode bit” as a new mode bit would require a change to the operating system. Separate instructions are used to generate wrap-around and saturating results.
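The difference between the wrap-around and saturating adds can be checked with a small Python model of four 16-bit lanes. It is a sketch of the arithmetic only; the function names borrow the mnemonics PADDW, PADDUSW and PADDSW, lanes are passed as lists of 16-bit patterns, and the lane values other than FFFFh and 8000h are arbitrary.

```python
WORD_MASK = 0xFFFF


def to_signed16(value):
    """Interpret a 16-bit pattern as a two's-complement signed integer."""
    return value - 0x10000 if value & 0x8000 else value


def paddw(a, b):
    """Wrap-around add: the carry out of bit 15 is discarded."""
    return [(x + y) & WORD_MASK for x, y in zip(a, b)]


def paddusw(a, b):
    """Unsigned saturating add: results are clamped at FFFFh."""
    return [min(x + y, 0xFFFF) for x, y in zip(a, b)]


def paddsw(a, b):
    """Signed saturating add: results are clamped to 8000h..7FFFh."""
    result = []
    for x, y in zip(a, b):
        total = to_signed16(x) + to_signed16(y)
        total = max(-0x8000, min(0x7FFF, total))
        result.append(total & WORD_MASK)
    return result


a = [0x0001, 0x0002, 0x0003, 0xFFFF]
b = [0x0001, 0x0001, 0x0001, 0x8000]
print([hex(v) for v in paddw(a, b)])    # last lane wraps to 0x7fff
print([hex(v) for v in paddusw(a, b)])  # last lane saturates to 0xffff
```

Because the behaviour is chosen by the instruction rather than by a mode bit, the two results can be mixed freely within one routine.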
The next example shows the key instruction used for multiply-accumulate operations, which are fundamental to many signal processing algorithms like vector-dot-products, matrix multiplies, FIR and IIR Filters, FFTs, DCTs etc. This instruction is the packed multiply add (PMADD).
     a3       a2       a1       a0
  *  b3    *  b2    *  b1    *  b0
  =  a3*b3+a2*b2       a1*b1+a0*b0
PMADDWD: 16b x 16b -> 32b Multiply Add
The PMADD instruction starts from a 16-bit, packed data type and generates a 32-bit packed data type result. It multiplies all the corresponding elements generating four 32-bit results, and adds the two products on the left together for one result and the two products on the right together for the other result. To complete a multiply-accumulate operation, the results would then be added to another register which is used as the accumulator.
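A Python model of the multiply-add step follows. It is a sketch under stated assumptions: lanes are given as lists of 16-bit patterns in the order [lane0, lane1, lane2, lane3], the 16-bit inputs are treated as signed, and the results are returned as ordinary Python integers without modelling 32-bit wrap-around. The helper names are illustrative.

```python
def to_signed16(value):
    """Interpret a 16-bit pattern as a two's-complement signed integer."""
    return value - 0x10000 if value & 0x8000 else value


def pmaddwd(a, b):
    """Multiply four word pairs, then add adjacent products pairwise."""
    products = [to_signed16(x) * to_signed16(y) for x, y in zip(a, b)]
    return [products[0] + products[1], products[2] + products[3]]


def paddd(a, b):
    """Add two pairs of doubleword results lane by lane."""
    return [x + y for x, y in zip(a, b)]


# Multiply-accumulate: add each PMADD result into an accumulator.
accumulator = [0, 0]
accumulator = paddd(accumulator, pmaddwd([1, 2, 3, 4], [5, 6, 7, 8]))
accumulator = paddd(accumulator, pmaddwd([1, 1, 1, 1], [1, 1, 1, 1]))
print(accumulator)
```

Each call performs four multiplies and two adds, which is why the instruction suits dot products and filters.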
The following example is a packed parallel compare. This example compares four pairs of 16-bit words. It creates a result of true (FFFFh), or false (0000h). This result is a packed mask of ones for each true condition, or zeros for each false condition. The following example shows an example of a compare “greater than” on packed word data. There are no new condition code flags, nor are any existing IA condition code flags affected by this instruction.
     23       45       16       34
     gt?      gt?      gt?      gt?
     31       7        16       67
  =  0000h    FFFFh    0000h    0000h
PCMPGT[W]: Parallel Compares
The packed compare result can be used as a mask to select elements from different inputs using a logical operation, eliminating the need for a branch or a set of branch instructions. The ability to do a conditional move instead of using branch instructions is an important performance enhancement in advanced processors that have deep pipelines and employ branch prediction. A branch based on the result of a compare operation on the incoming data is usually difficult to predict, as incoming data in many cases can change randomly. Eliminating branches that are used to perform data selection by using the conditional select capability, together with the parallelism of the MMX instruction set, is an important performance enhancement feature of the MMX technology.
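As one illustration of mask-based selection, the Python sketch below computes a lane-wise maximum without a per-element branch in the selection step, reusing the values from the compare example. The helper names are hypothetical, and the lanes are assumed to hold small non-negative values so that signed and unsigned interpretation agree.

```python
WORD_MASK = 0xFFFF


def pcmpgtw(a, b):
    """Packed compare greater than: FFFFh where a > b, else 0000h."""
    return [WORD_MASK if x > y else 0 for x, y in zip(a, b)]


def select(mask, when_true, when_false):
    """(mask AND when_true) OR (NOT mask AND when_false), lane by lane."""
    return [(m & t) | (~m & WORD_MASK & f)
            for m, t, f in zip(mask, when_true, when_false)]


a = [23, 45, 16, 34]
b = [31, 7, 16, 67]
mask = pcmpgtw(a, b)
print([format(m, "04X") for m in mask])
print(select(mask, a, b))  # lane-wise maximum of a and b
```

The selection uses only AND, AND NOT and OR, so no outcome depends on predicting the data.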
The following is an example of a pack instruction. It takes four 32-bit values and packs them into four 16-bit values, performing saturation if one of the 32-bit source values does not fit into a 16-bit result. There are also instructions that perform the opposite - unpack, for example, a packed byte data type into a packed word data type.
     b1           b0               a1           a0        (two registers, 32-bit elements)
           b1’    b0’    a1’    a0’                       (one register, 16-bit elements)
PACKSS[DW]: Pack Instruction
The pack and unpack instructions exist to facilitate conversion between the new packed data types. These are especially important when an algorithm needs higher precision in its intermediate calculations, as in image filtering. A filter on an image usually involves a set of multiply operations between filter coefficients and a set of adjacent image pixels, accumulating all the values together. These multiplies and accumulations need more precision than 8 bits, the original data type of the pixels. The solution is to unpack the image’s 8-bit pixels into 16-bit words, perform the calculations in 16-bit words without concern for overflow, then pack back to 8-bit pixels before storing the filtered pixels to memory.
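The unpack, compute, pack sequence can be sketched in Python. The filter used here (3 x centre minus both neighbours, with edge pixels replicated) is a hypothetical choice made so that intermediate values fall outside 0..255 and the saturating pack has something to clamp; the function names are illustrative, and `packuswb` models only the clamping behaviour of the instruction it is named after.

```python
def unpack_bytes_to_words(pixels):
    """Widen 8-bit pixels to 16-bit words by zero extension."""
    return [p & 0xFF for p in pixels]


def packuswb(words):
    """Pack signed words into bytes with unsigned saturation (0..255)."""
    return [max(0, min(255, w)) for w in words]


def sharpen_row(pixels):
    """Apply the filter in word precision, then pack back to bytes."""
    words = unpack_bytes_to_words(pixels)
    last = len(words) - 1
    filtered = []
    for i, centre in enumerate(words):
        left = words[max(i - 1, 0)]      # replicate the edge pixel
        right = words[min(i + 1, last)]
        # Range is -510..765, which needs more than 8 bits.
        filtered.append(3 * centre - left - right)
    return packuswb(filtered)


print(sharpen_row([10, 20, 10]))
print(sharpen_row([0, 255, 0]))
```

Without the wider intermediate type, the negative and over-range values would wrap instead of clamping to black and white.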
APPLICATION EXAMPLES
The following section describes example uses of the MMX instruction set to implement basic coding structures:
Conditional Select
Multimedia applications must process large sets of data. In some cases there is a need to select the data based on a condition query performed on the incoming data. Intel has been able to improve performance in its family of processors by implementing micro-architectural features for increased performance and deeper pipelines. Branch prediction is an important part of making the pipelines run efficiently, as a misprediction can cause the pipelines to flush and degrade performance. The following example shows an efficient way to reduce the need to use branch instructions, especially those that are data dependent, and thus very difficult to predict. The Chroma Keying example demonstrates how conditional selection using the MMX instruction set removes branch mispredictions, in addition to performing multiple selection operations in parallel. Text overlay on a graphics/video background, and sprite overlays in games are some of the other operations that would benefit from this technique.
Chroma Keying
Most have seen the television weather man overlaid on the image of a weather map. In this example we use a green screen to overlay an image of a woman on a picture of spring blossom. We’ll illustrate this example by processing four 16-bit pixels in parallel. The instructions also allow the processing of eight 8-bit pixels in parallel for a substantial performance speed-up potential.
[Figure: picture of the woman on a green background + picture of spring blossom = combined picture]
First we’ll take four pixels from the picture with the woman on a green background. The top row of the data below represents pixels that alternate between green, not green, green, and not green. The compare instruction builds a mask for that data. That mask is a sequence of words that are all ones or all zeros representing the Boolean values of true and false. We now know what is the unwanted background and what we want to keep. This is shown below using a shadow picture.
     X1=green   X2!=green  X3=green   X4!=green  ...
     green      green      green      green          pcmpeqw
  =  0xFFFF     0x0000     0xFFFF     0x0000         bitmask (4 pixels/cycle)
This mask is now used on the same four pixels from the picture with the woman and the equivalent four pixels from the Spring blossom. The “AND NOT” and “AND” instructions use the mask to identify which pixels to keep from the Spring blossom and the woman. They also turn the unwanted pixels to zeros. The “OR” instruction builds the final picture. Four pixels were mapped using only four MMX instructions without any branches.
In working through this example, the PANDN instruction inverts all the bits in mask before applying the AND operation.
     0xFFFF     0x0000     0xFFFF     0x0000
     X1         X2         X3         X4             pandn
  =  0x0000     X2         0x0000     X4

     0xFFFF     0x0000     0xFFFF     0x0000
     Y1         Y2         Y3         Y4             pand
  =  Y1         0x0000     Y3         0x0000

     0x0000     X2         0x0000     X4
     Y1         0x0000     Y3         0x0000         por
  =  Y1         X2         Y3         X4
Without MMX technology, each pixel is processed separately and requires a conditional branch. Using MMX instructions, eight 8-bit pixels can be processed in parallel and no conditional branches are involved.
Vector Dot Product
The vector dot product is one of the most basic algorithms used in signal-processing of natural data such as images, audio, video and sound. The following example shows how the PMADD instruction helps speed up algorithms using vector dot products. The PMADD instruction will handle four multiplies and two additions at a time. Coupled with a PADD instruction, as described before, eight multiply-accumulate operations are performed. These eight element vectors fit nicely into two PMADD instructions and two PADD instructions.
Assuming that the precision supported by the PMADD instruction is sufficient, this dot-product example on eight-element vectors can be completed using eight MMX instructions: two PMADDs, two more PADDs, two shifts (if needed to fix the precision after the multiply operation), and two memory moves to load one of the vectors (the other vector is loaded by the PMADD instruction which can have one of its operands come from memory).
     a0    a1    a2    a3               a4    a5    a6    a7
  *  c0    c1    c2    c3            *  c4    c5    c6    c7          Pmaddwd
  =  a0*c0+a1*c1   a2*c2+a3*c3       =  a4*c4+a5*c5   a6*c6+a7*c7
     Shift to right precision if needed (each result)
     Paddd: both results are added into the Accumulator
Note: Input data and coefficients are 16-bit precision. If not, first unpack to 16 bit.
x = ∑ a(i) * c(i)
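The data flow in the figure can be traced with a short Python model. It is a sketch under simplifying assumptions: inputs are already signed 16-bit values held as Python integers, no precision shifts are applied, 32-bit overflow is not modelled, and the final addition of the two accumulator halves is written as a plain sum. The function names are illustrative.

```python
def pmaddwd(a, c):
    """Four multiplies, then pairwise adds: two 32-bit partial sums."""
    return [a[0] * c[0] + a[1] * c[1], a[2] * c[2] + a[3] * c[3]]


def paddd(x, y):
    """Add two pairs of 32-bit partial sums lane by lane."""
    return [x[0] + y[0], x[1] + y[1]]


def dot8(a, c):
    """Dot product of two eight-element vectors."""
    low = pmaddwd(a[0:4], c[0:4])    # first PMADD: elements 0..3
    high = pmaddwd(a[4:8], c[4:8])   # second PMADD: elements 4..7
    accumulator = paddd(low, high)   # PADD the two results together
    return accumulator[0] + accumulator[1]  # combine the two halves


a = [1, 2, 3, 4, 5, 6, 7, 8]
print(dot8(a, a))
print(dot8(a, [1] * 8))
```

Eight multiplies and seven additions collapse into two multiply-add steps and two adds, which matches the instruction counts discussed in the text.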
Comparing instruction counts with and without MMX technology for this operation yields the following:
                 Number of Instructions      Number of
                 without MMX™ Technology     MMX Instructions
Load             16                          4
Multiply         8                           2
Shift            8                           2
Add              7                           1
Miscellaneous    -                           3
Store            1                           1
Total            40                          13
With MMX technology, one third of the number of instructions is needed.
Most MMX instructions can be executed in one clock cycle, so the performance improvement will be more dramatic than the simple ratio of instruction counts.
Matrix Multiply
Exciting new 3D games are coming to market every day. Typically, computations that manipulate 3D objects are based on 4-by-4 matrices that are multiplied with four element vectors many times. The vector has the X, Y, Z and perspective corrective information for each pixel. The 4-by-4 matrix is used to rotate, scale, translate and update the perspective corrective information for each pixel. This 4-by-4 matrix is applied to many vectors.
  | x’ |     | a0  a1  a2  a3 |     | x |
  | y’ |  =  | b0  b1  b2  b3 |  *  | y |
  | z’ |     | c0  c1  c2  c3 |     | z |
  | w’ |     | d0  d1  d2  d3 |     | 1 |
Matrix region labels in the figure: Rotate & Scale, Translate, Perspective
x’ = a0x + a1y + a2z + a3
Applications which already use 16-bit integer or fixed-point data are able to make extensive use of the PMADD instruction. There would be one PMADD instruction per row in the matrix, for a total of four. Comparing instruction counts with and without MMX technology for this operation yields the following:

                 Number of Instructions      Number of
                 without MMX™ Technology     MMX Instructions
Load             32                          6
Multiply         16                          4
Add              12                          2
Miscellaneous    8                           12
Store            4                           4
Total            72                          28
With MMX technology, less than one half of the number of instructions without MMX technology is needed.
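The one-PMADD-per-row scheme can be illustrated with a small Python model. It is a sketch of the arithmetic only: `pmadd` and `transform` are hypothetical names, the data is assumed to be small integers that fit the 16-bit format, and no fixed-point scaling is applied.

```python
def pmadd(row, vec):
    """Model one packed multiply-add on four elements:
    four products, with adjacent pairs summed into two results."""
    p = [r * v for r, v in zip(row, vec)]
    return [p[0] + p[1], p[2] + p[3]]


def transform(matrix, point):
    """Apply a 4-by-4 matrix to the vector (x, y, z, 1)."""
    x, y, z = point
    vec = [x, y, z, 1]
    out = []
    for row in matrix:           # one multiply-add per matrix row
        lo, hi = pmadd(row, vec)
        out.append(lo + hi)      # add the two partial sums of the row
    return out                   # [x', y', z', w']


# Identity rotation/scale with a translation of (5, 6, 7).
m = [
    [1, 0, 0, 5],
    [0, 1, 0, 6],
    [0, 0, 1, 7],
    [0, 0, 0, 1],
]
moved = transform(m, (1, 2, 3))
```

For the first row this computes x' = a0x + a1y + a2z + a3, the same expression as in the text. Because the same matrix is applied to many vectors, the matrix rows can stay in registers while only the vectors are reloaded.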
24-Bit Color
The MMX instruction set offers graphical applications the opportunity to move from 8-bit or 16-bit color lookup table to 24-bit, or “true” color, a feature which will greatly enhance the realism of a game’s graphics. In many cases, this can be done in the same amount of time that is currently required for 8-bit graphics. For 24-bit and 32-bit colors, red, green and blue are each represented by 8-bit values. There are eight additional bits in 32-bit color for the alpha value.
Image compositing and alpha blending are operations that can be performed on 24-bit color images.
Figure: Image Representation. Each pixel holds R, G and B components plus an alpha value from 0 to 255.
Image Dissolve Using Alpha Blending
This example shows how the MMX instruction set will speed up image compositing. In this example, a flower will dissolve into a swan. The screen starts with a picture of the flower. As the flower gradually fades away, the swan gradually appears.
The math for the dissolve is a straightforward function. Alpha determines the intensity of the flower. At full intensity, the flower’s 8-bit alpha value is FFH, or 255. By plugging 255 into the dissolve equation, each flower pixel is 100 percent and each swan pixel is 0 percent. The equation below calculates each pixel:
Result_pixel = Flower_pixel * (alpha/255) + Swan_pixel * [1 - (alpha/255)]
Illustrated below are the flower and swan when alpha = 230:
Figure: flower * (230/255) + swan * (1 - 230/255) = blended image
When the alpha value is 230, the resulting picture is 90 percent flower and 10 percent swan. On close examination, some of the swan image appears in the picture to the right of the equal sign.
This example assumes that the 24-bit color data is organized so that four pixels at a time are processed from one color plane, that is, the image is separated into individual color planes: one for red, one for green, one for blue. The first four red values from the flower and the swan will be processed first. After finishing the red plane, the processing moves to the green and blue planes.
The unpack instruction takes the first four bytes of the red data that are represented in 8-bit values and unpacks each pixel into 16-bit elements, putting them into a 64-bit MMX register. The alpha value, which is computed once per frame, is the other operand. The PMUL multiplies the two vectors in parallel. Similarly, an unpack and PMUL create the intermediate result for the swan. Now the two intermediate results are added together using a PADD and the final result is sent to memory using a PACK that converts the intermediate 16-bit values back to 8-bit pixel values that can be stored.
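The unpack, multiply, add and pack sequence just described can be modeled for four pixels of one color plane in Python. This is a sketch under stated assumptions rather than the actual MMX routine: the helper names are hypothetical, the lanes are treated as unsigned 16-bit values, and the division by 255 follows the dissolve equation directly (a real routine would more likely use a shift-based approximation).

```python
def unpack(bytes4):
    """Widen four 8-bit pixel values to 16-bit lanes; the values are unchanged."""
    assert all(0 <= b <= 255 for b in bytes4)
    return list(bytes4)


def pmul(lanes, factor):
    """Multiply each 16-bit lane by the same factor, keeping 16 bits."""
    return [(v * factor) & 0xFFFF for v in lanes]


def padd(x, y):
    """Add two vectors of 16-bit lanes element by element."""
    return [(a + b) & 0xFFFF for a, b in zip(x, y)]


def pack(lanes):
    """Narrow 16-bit lanes back to 8-bit pixel values, saturating at 255."""
    return [min(255, v) for v in lanes]


def dissolve4(flower, swan, alpha):
    """Blend four pixels of one color plane; alpha ranges from 0 to 255."""
    f = pmul(unpack(flower), alpha)        # flower * alpha
    s = pmul(unpack(swan), 255 - alpha)    # swan * (255 - alpha)
    total = padd(f, s)                     # at most 255 * 255, fits in 16 bits
    scaled = [v // 255 for v in total]     # divide by 255, as in the equation
    return pack(scaled)


blended = dissolve4([255, 0, 100, 200], [0, 255, 100, 50], 230)
```

The sum of the two products never exceeds 255 * 255 = 65025, which is why 16-bit intermediate elements are wide enough for this computation.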
Figure: dissolve data flow for one color plane, shown with an alpha value of 90%. Four red values r3, r2, r1, r0 of the flower are unpacked and multiplied by 90%. The same operation is performed on image B (the swan) with (1 - alpha(A)), that is, 10%. The two products are added and packed into new r3, new r2, new r1, new r0. Computations are done per color plane. 4 pixels are computed in parallel.
If these images use 640x480 resolution, and the dissolve technique uses all 255 steps of the alpha value, then 117 million PUNPCKs and PMULs, and 58 million PADDs and PACKs are used. Comparing instruction counts with and without MMX technology for this operation yields the following:

Operation   Calculation               Number of Instructions      Number of
            without MMX™ Technology   without MMX Technology      MMX Instructions
Load        (640*480)*255*3*2         470 million                 117 million
Unpack      -                         -                           117 million
Multiply    (640*480)*255*3*2         470 million                 117 million
Add         (640*480)*255*3           235 million                 58 million
Pack        -                         -                           58 million
Store       (640*480)*255*3           235 million                 58 million
Total                                 1.4 billion                 525 million
Almost 1 billion fewer instructions are used in this example.
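The counts in this comparison can be re-derived with a few lines of Python. The variable names are illustrative, and the only assumption beyond the text is that each MMX instruction handles four pixels of a color plane at once, as the example states. The exact MMX total comes out slightly above 525 million because the table rounds each row down to whole millions before adding.

```python
WIDTH, HEIGHT = 640, 480   # image resolution
STEPS = 255                # alpha steps used by the dissolve
PLANES = 3                 # red, green and blue color planes
PIXELS_PER_MMX_OP = 4      # four pixels processed in parallel

# One operation per pixel, per plane, per alpha step, for a single image.
per_image = WIDTH * HEIGHT * STEPS * PLANES

# Without MMX: loads and multiplies touch both images, adds and stores one result.
scalar = {
    "load": 2 * per_image,
    "multiply": 2 * per_image,
    "add": per_image,
    "store": per_image,
}

# With MMX: the same work divided by four, plus unpack and pack steps.
mmx = {
    "load": 2 * per_image // PIXELS_PER_MMX_OP,
    "unpack": 2 * per_image // PIXELS_PER_MMX_OP,
    "multiply": 2 * per_image // PIXELS_PER_MMX_OP,
    "add": per_image // PIXELS_PER_MMX_OP,
    "pack": per_image // PIXELS_PER_MMX_OP,
    "store": per_image // PIXELS_PER_MMX_OP,
}

scalar_total = sum(scalar.values())
mmx_total = sum(mmx.values())
saved = scalar_total - mmx_total

print(f"without MMX: {scalar_total:,}")
print(f"with MMX:    {mmx_total:,}")
print(f"saved:       {saved:,}")
```

The saving of roughly 880 million instructions is what the text rounds to almost 1 billion.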
The dissolve technique, sometimes called combine, is one of several commonly used image compositing techniques used in multimedia applications and can be sped up substantially with MMX technology.
Operation   Description                                Formula
Combine     Dissolve: fade in, fade out effect         A * alpha(A) + B * (1 - alpha(A))
A over B    Transparent image placed on background     A + (B * (1 - alpha(A)))
A in B      Image A only where B has opacity           A * alpha(B)
A out B     Image A only where B has transparency      A * (1 - alpha(B))
A top B     (A in B) over B                            (A * alpha(B)) + (B * (1 - alpha(A)))
A XOR B                                                (B * (1 - alpha(A))) + (A * (1 - alpha(B)))
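The compositing operations listed above can be written as small per-channel functions. This is an illustrative sketch, not code from the MMX documentation: the function names are hypothetical, pixel values and alpha are assumed to be normalized to the range 0.0 to 1.0, and the factor in the "over" formula is read as (1 - alpha(A)).

```python
def combine(a, b, alpha_a):
    """Dissolve: fade A out and B in as alpha_a goes from 1 to 0."""
    return a * alpha_a + b * (1 - alpha_a)


def over(a, b, alpha_a):
    """A over B: transparent image A placed on background B."""
    return a + b * (1 - alpha_a)


def in_(a, alpha_b):
    """A in B: image A only where B has opacity."""
    return a * alpha_b


def out(a, alpha_b):
    """A out B: image A only where B has transparency."""
    return a * (1 - alpha_b)


def atop(a, b, alpha_a, alpha_b):
    """A top B: (A in B) over B."""
    return a * alpha_b + b * (1 - alpha_a)


def xor(a, b, alpha_a, alpha_b):
    """A XOR B: each image only where the other is transparent."""
    return b * (1 - alpha_a) + a * (1 - alpha_b)


# One channel of one pixel: A is 90 percent opaque, B fully opaque.
sample = combine(0.8, 0.2, 0.9)
```

Each function works on a single channel value, so the same expression can be applied to every element of a packed register in parallel, as the dissolve example does for four pixels at a time.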
Alpha blending is a technique used by game developers that is similar to image compositing. Alpha blending allows race cars to drive realistically through fog or smoke, allows a more realistic view of fish in water, or a rabbit in a translucent tube. In these examples, the alpha values wouldn’t necessarily be the same for the whole frame, but the basic concept remains the same.
SUMMARY
MMX technology brings more power to multimedia and communication applications. MMX technology adds new data types and instructions that can process data in parallel. MMX technology is fully compatible with existing operating systems and application software.
MMX technology brings a step improvement to the PC platform and enables new applications and usage of PCs. It helps establish a new paradigm in the industry with the PC as an improved communications and multimedia device. Systems enabled with MMX technology will ramp in high volume in 1997 as Intel incorporates the technology in multiple processor generations.
RELATED DOCUMENTATION
Refer to the following documentation for more information on MMX technology.
• Intel Architecture MMX™ Technology Developers’ Manual (Order Number 243013)
• Intel Architecture MMX™ Technology Programmer’s Reference Manual (Order Number 243007)
Refer to Intel’s corporate website for the latest information on related documentation:
http://www.intel.com/