
FINE-GRAINED SIMD ARCHITECTURE

Nidhi Naraang, D1911, A01

Reg no: 10900190

Introduction to SIMD Architectures

SIMD (Single-Instruction stream, Multiple-Data stream) architectures are essential in the world of parallel computing. Their ability to manipulate large vectors and matrices in minimal time has created phenomenal demand in areas such as weather prediction and cancer radiation research. The power behind this type of architecture is most apparent when the number of processor elements equals the size of the vector: component-wise addition and multiplication of vector elements can then be done simultaneously. Even when the vector is larger than the number of processor elements available, the speedup over a sequential algorithm is immense. There are two types of SIMD architectures we will be discussing: True SIMD, followed by Pipelined SIMD. Each has its own advantages and disadvantages, but their common attribute is a superior ability to manipulate vectors.

What is SIMD?

SIMD stands for Single Instruction, Multiple Data. It is a way of packing N (usually a power of 2) like operations (e.g. 8 adds) into a single instruction. The data for the instruction operands is packed into registers capable of holding the extra data. The advantage of this format is that, for the cost of issuing a single instruction, N instructions' worth of work are performed.
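As a minimal illustration (assuming an x86 target with SSE and a compiler that provides the <xmmintrin.h> intrinsics; the function names here are our own), the packed add below performs four additions for the cost of one instruction, where the scalar version needs four:

    #include <xmmintrin.h>   /* SSE intrinsics (x86 only, illustrative) */

    /* Scalar version: four separate additions. */
    void add4_scalar(const float *a, const float *b, float *out) {
        for (int i = 0; i < 4; i++)
            out[i] = a[i] + b[i];
    }

    /* SIMD version: one packed instruction performs all four additions. */
    void add4_simd(const float *a, const float *b, float *out) {
        __m128 va = _mm_loadu_ps(a);     /* load four packed floats      */
        __m128 vb = _mm_loadu_ps(b);
        __m128 vc = _mm_add_ps(va, vb);  /* four adds, one instruction   */
        _mm_storeu_ps(out, vc);          /* store four packed results    */
    }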

Examples of common areas where SIMD can result in very large improvements in speed are 3-D graphics (Electric Image, games), image processing (Quartz, Photoshop filters), video processing (MPEG, MPEG-2, MPEG-4), theater-quality audio (Dolby AC-3, DTS, MP3) and high-performance scientific calculations. SIMD units are present on all G4, G5 and Pentium 3/4/M class processors.

Why do we need SIMD?

SIMD offers great flexibility and opportunities for better performance in areas such as audio, video and the various communication tasks that are crucial to these applications. It provides a cornerstone for robust and powerful multimedia capabilities that extend the scalar instruction set significantly.

Why should a developer care about SIMD?

SIMD can provide a substantial boost in performance and capability for an application that makes significant use of 3D graphics, image processing, audio compression or other calculation-intensive functions. Other features of a program may be accelerated by recoding to take advantage of the parallelism and additional operations of SIMD. Apple is adding SIMD capabilities to Core Graphics, QuickDraw and QuickTime; an application that calls them today will see improvements from SIMD without any changes. SIMD also offers the potential to create new applications that take advantage of its features and power. To take advantage of SIMD, an application must be reprogrammed or at least recompiled; however, you do not need to rewrite the entire application. SIMD typically works best for the 10% of the application that consumes 80% of the CPU time -- these functions typically have heavy computational and data loads, two areas where SIMD excels.

True SIMD: Distributed Memory

The True SIMD architecture contains a single control unit (CU) with multiple processor elements (PEs) acting as arithmetic units (AUs). In this arrangement, the arithmetic units are slaves to the control unit: the AUs cannot fetch or interpret any instructions; they are merely units capable of addition, subtraction, multiplication and division. Each AU has access only to its own memory, so if an AU needs information held by a different AU, it must put in a request to the CU, and the CU must manage the transfer. The advantage of this type of architecture is the ease of adding more memory and AUs to the computer. The disadvantage is the time wasted by the CU managing all memory exchanges.
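As a rough conceptual sketch only (not any particular machine; all names are made up for illustration), the following C model captures the CU/AU split: the control unit broadcasts a single "add" instruction and every processor element applies it, in lockstep, to operands held in its own local memory.

    #include <stdio.h>

    #define NUM_PE   4    /* number of processor elements (illustrative) */
    #define MEM_SIZE 8    /* words of local memory per PE (illustrative) */

    static int local_mem[NUM_PE][MEM_SIZE];   /* each PE sees only its own row */

    /* The CU broadcasts one instruction; the PEs neither fetch nor decode
       anything, they simply execute the broadcast operation in lockstep. */
    static void cu_broadcast_add(int dst, int src1, int src2) {
        for (int pe = 0; pe < NUM_PE; pe++)
            local_mem[pe][dst] = local_mem[pe][src1] + local_mem[pe][src2];
    }

    int main(void) {
        for (int pe = 0; pe < NUM_PE; pe++) {  /* load per-PE operands */
            local_mem[pe][0] = pe;
            local_mem[pe][1] = 10 * pe;
        }
        cu_broadcast_add(2, 0, 1);             /* one instruction, NUM_PE results */
        for (int pe = 0; pe < NUM_PE; pe++)
            printf("PE %d: %d\n", pe, local_mem[pe][2]);
        return 0;
    }

Moving a value from one PE's local memory to another's would require an explicit step managed by the CU, which is exactly the overhead identified above as this architecture's disadvantage.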

True SIMD: Shared Memory

Another True SIMD architecture is designed with a configurable association between the PEs and the memory modules (M). In this architecture, the local memories that were attached to each AU above are replaced by memory modules shared by all the PEs through an alignment network or switching unit. This allows the individual PEs to share their memory without going through the control unit. This type of architecture is certainly superior to the previous one, but it inherits a disadvantage in the difficulty of adding memory.

Pipelined SIMD

Pipelined SIMD architecture is composed of a pipeline of arithmetic units with shared memory. The pipeline takes different streams of instructions and performs all the operations of an arithmetic unit, working on a first-in, first-out basis. The sizes of the pipelines are relative. To take advantage of the pipeline, the data to be evaluated must be stored in different memory modules so the pipeline can be fed with this information as fast as possible.

Advantages

An application that may take advantage of SIMD is one where the same value is being added to (or subtracted from) a large number of data points, a common operation in many multimedia applications. One example would be changing the brightness of an image. Each pixel of an image consists of three values for the brightness of the red (R), green (G) and blue (B) portions of the color. To change the brightness, the R, G and B values are read from memory, a value is added to (or subtracted from) them, and the resulting values are written back out to memory.

With a SIMD processor there are two improvements to this process. First, the data is understood to be in blocks, and a number of values can be loaded all at once. Instead of a series of instructions saying "get this pixel, now get the next pixel", a SIMD processor will have a single instruction that effectively says "get lots of pixels" ("lots" is a number that varies from design to design). For a variety of reasons, this can take much less time than retrieving each pixel individually, as a traditional CPU design would.
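A minimal sketch of the brightness example, assuming an x86 target with SSE2 (<emmintrin.h>), 8-bit channel data and a buffer length that is a multiple of 16 bytes; the function name is our own. The saturating add keeps channel values from wrapping past 255:

    #include <emmintrin.h>   /* SSE2 intrinsics */
    #include <stddef.h>

    /* Brighten an 8-bit image buffer in place. 'len' is the number of bytes
       (R, G and B channels interleaved) and is assumed to be a multiple of 16. */
    void brighten(unsigned char *pixels, size_t len, unsigned char amount) {
        __m128i delta = _mm_set1_epi8((char)amount);        /* 16 copies of 'amount' */
        for (size_t i = 0; i < len; i += 16) {
            __m128i px = _mm_loadu_si128((__m128i *)(pixels + i)); /* load 16 bytes */
            px = _mm_adds_epu8(px, delta);                   /* 16 saturating adds  */
            _mm_storeu_si128((__m128i *)(pixels + i), px);   /* store 16 results    */
        }
    }

Sixteen channel values are loaded, brightened and stored per iteration; a scalar loop would handle them one byte at a time.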

Another advantage is that SIMD systems typically include only those instructions that can be applied to all of the data in one operation. In other words, if the SIMD system works by loading up eight data points at once, the add operation being applied to the data will happen to all eight values at the same time. Although the same is true for any super-scalar processor design, the level of parallelism in a SIMD system is typically much higher.

Disadvantages

Not all algorithms can be vectorized. For example, a flow-control-heavy task like code parsing wouldn't benefit from SIMD.

SIMD implementations also tend to have large register files, which increases power consumption and chip area.

Currently, implementing an algorithm with SIMD instructions usually requires human labor; most compilers don't generate SIMD instructions from a typical C program, for instance. Vectorization in compilers is an active area of computer science research. (Compare vector processing.)
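Whether a compiler vectorizes a given loop depends on the compiler and its flags (GCC, for example, enables its auto-vectorizer at -O3 or with -ftree-vectorize); a loop like the sketch below is the kind of candidate it looks for, with independent iterations and unit-stride accesses. The function name is illustrative:

    /* Independent iterations, unit-stride accesses, no data-dependent
       control flow: a typical auto-vectorization candidate. */
    void scale_and_add(float *dst, const float *a, const float *b,
                       float k, int n) {
        for (int i = 0; i < n; i++)
            dst[i] = a[i] * k + b[i];
    }

If the compiler cannot prove that dst does not overlap a or b, it may fall back to scalar code or insert a runtime check; C99's restrict qualifier on the pointers is one way to help it.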

Programming with particular SIMD instruction sets can involve numerous low-level challenges.

o SSE (Streaming SIMD Extensions) has restrictions on data alignment; programmers familiar with the x86 architecture may not expect this (see the alignment sketch after this list).

o Gathering data into SIMD registers and scattering it to the correct destination locations is tricky and can be inefficient.

o Specific instructions like rotations or three-operand addition aren't in some SIMD instruction sets.

o Instruction sets are architecture-specific: old processors and non-x86 processors lack SSE entirely, for instance, so programmers must provide non-vectorized implementations (or different vectorized implementations) for them.

o The early MMX instruction set shared a register file with the floating-point stack, which caused inefficiencies when mixing floating-point and MMX code. However, SSE2 corrects this.
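As a sketch of the alignment point above (x86 with SSE; aligned_alloc is C11; the names are illustrative): _mm_load_ps requires a 16-byte-aligned address and may fault otherwise, whereas _mm_loadu_ps accepts any address:

    #include <xmmintrin.h>   /* SSE intrinsics */
    #include <stdlib.h>      /* aligned_alloc (C11) */

    void alignment_demo(void) {
        /* 16-byte-aligned buffer: safe for the aligned load/store forms. */
        float *buf = aligned_alloc(16, 8 * sizeof(float));
        if (!buf) return;
        for (int i = 0; i < 8; i++) buf[i] = (float)i;

        __m128 a = _mm_load_ps(buf);       /* OK: buf is 16-byte aligned       */
        __m128 b = _mm_loadu_ps(buf + 1);  /* unaligned address: must use the
                                              unaligned form or it may fault   */
        _mm_storeu_ps(buf + 4, _mm_add_ps(a, b));
        free(buf);
    }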

Hardware

Small-scale (64- or 128-bit) SIMD became popular on general-purpose CPUs in the early 1990s and continued through 1997 and later with Motion Video Instructions (MVI) for Alpha. SIMD instructions can be found, to one degree or another, on most CPUs, including IBM's AltiVec and SPE for PowerPC, HP's PA-RISC Multimedia Acceleration eXtensions (MAX), Intel's MMX and iwMMXt, SSE, SSE2, SSE3 and SSSE3, AMD's 3DNow!, ARC's ARC Video subsystem, SPARC's VIS and VIS2, Sun's MAJC, ARM's NEON technology, MIPS' MDMX (MaDMaX) and MIPS-3D. The instruction set of the SPUs in the Cell processor, co-developed by IBM, Sony and Toshiba, is heavily SIMD based. NXP, founded by Philips, developed several SIMD processors named Xetal; the Xetal has 320 16-bit processor elements designed especially for vision tasks.

Modern graphics processing units (GPUs) are often wide SIMD implementations, capable of branches, loads, and stores on 128 or 256 bits at a time.

Future processors promise greater SIMD capability: Intel's AVX instructions will process 256 bits of data at once, and Intel's Larrabee graphics microarchitecture promises two 512-bit SIMD registers on each of its cores (VPU - Wide Vector Processing Units) [although as of early 2010, the Larrabee project had been cancelled at Intel].

Software

SIMD instructions are widely used to process 3D graphics, although modern graphics cards with embedded SIMD have largely taken over this task from the CPU. Some systems also include permute functions that re-pack elements inside vectors, making them particularly useful for data processing and compression. They are also used in cryptography. The trend of general-purpose computing on GPUs (GPGPU) may lead to wider use of SIMD in the future.

Adoption of SIMD systems in personal computer software was at first slow, due to a number of problems. One was that many of the early SIMD instruction sets tended to slow overall performance of the system due to the re-use of existing floating-point registers. Other systems, like MMX and 3DNow!, offered support for data types that were not interesting to a wide audience and had expensive context-switching instructions to switch between using the FPU and MMX registers. Compilers also often lacked support, requiring programmers to resort to assembly language coding.

SIMD on x86 had a slow start. The introduction of 3DNow! by AMD and SSE by Intel confused matters somewhat, but today the system seems to have settled down (after AMD adopted SSE) and newer compilers should result in more SIMD-enabled software. Intel and AMD now both provide optimized math libraries that use SIMD instructions, and open source alternatives like libSIMD and SIMDx86 have started to appear.


Apple Computer had somewhat more success, even though it entered the SIMD market later than the rest. AltiVec offered a rich system and could be programmed using increasingly sophisticated compilers from Motorola, IBM and GNU, so assembly language programming was rarely needed. Additionally, many of the applications that would benefit from SIMD were supplied by Apple itself, for example iTunes and QuickTime. However, in 2006 Apple computers moved to Intel x86 processors, and Apple's APIs and development tools (Xcode) were rewritten to use SSE2 and SSE3 instead of AltiVec. Apple was the dominant purchaser of PowerPC chips from IBM and Freescale Semiconductor, and even though it abandoned the platform, further development of AltiVec has continued in several Power Architecture designs from Freescale and IBM.

SIMD within a register, or SWAR, is a range of techniques and tricks used for performing SIMD in general-purpose registers on hardware that doesn't provide any direct support for SIMD instructions. This can be used to exploit parallelism in certain algorithms even on hardware that does not support SIMD directly.
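A classic SWAR sketch (our own illustration, not taken from any particular library): adding four packed 8-bit lanes held in an ordinary 32-bit integer, with masking so that a carry out of one lane cannot corrupt its neighbour:

    #include <stdint.h>

    /* Add four 8-bit lanes packed in 32-bit words, with no SIMD hardware.
       The high bit of each lane is handled separately so that a carry out
       of one lane cannot spill into the next. */
    uint32_t swar_add_u8x4(uint32_t a, uint32_t b) {
        uint32_t low  = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu); /* low 7 bits per lane */
        uint32_t high = (a ^ b) & 0x80808080u;                 /* top bit per lane    */
        return low ^ high;                                     /* merge, no cross-lane
                                                                  carries              */
    }

The low seven bits of each lane are added directly; the top bit of each lane is recombined with an exclusive-or, which is exactly addition modulo 2 in that bit position.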

SIMD Extensions for Multimedia  

Delivering High Performance Audio & Video Processing to Embedded Applications

Current generations of smartphones and internet appliances must deliver high levels of media and graphics performance to be competitive. SIMD extensions in the ARMv6 and ARMv7 architectures deliver improved performance.

The ARM SIMD media extensions were introduced with the ARMv6 architecture, beginning with the ARM1136 and continuing through the ARM1176, ARM11 MPCore, Cortex-A5, Cortex-A8 and Cortex-A9. These SIMD extensions increase the processing capability of ARM processor-based SoCs without materially increasing power consumption. The SIMD extensions are optimized for a broad range of software applications, including video and audio codecs, where the extensions increase performance by up to 75% or more.

HOW SIMD SUPPORTS SOFTWARE

Nowadays many DSP and general-purpose processors are equipped with SIMD (single-instruction, multiple-data) hardware and instructions. SIMD lets the processor execute a single instruction, for instance an addition, on multiple independent sets of data in parallel, producing independent results.

Nowadays SIMD support has become really common, as it improves performance on many types of DSP-oriented applications, which tend to apply the same instructions over and over. But what happens if we implement a DSP-oriented application on an embedded processor without SIMD support? Depending on the specifics of the application, the data and the processor, it might make sense to emulate SIMD in software. Unlike many other optimization techniques, soft SIMD is only useful for a relatively small range of applications and processors and imposes limits on input operands. But in digital signal processing applications, where we try to squeeze every last cycle out of a processor and are willing to put in significant extra work to meet performance targets, it is a useful addition to our optimization kit.

SIMD PROCESSOR

In a processor with SIMD support, one operation is executed on two or more operand sets to produce multiple outputs. This is usually achieved by having the processor treat a register as containing multiple data words. For example, a 32-bit register can be treated as containing two 16-bit data words, as shown below. The processor automatically handles sign and carry propagation for each SIMD calculation. Processors with explicit support for SIMD operations usually also provide instructions that allow the programmer to pack and unpack registers with SIMD operands.

(Diagram omitted: SIMD processing performs the same operation simultaneously on multiple sets of operands under the control of a single instruction.)
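A minimal sketch in plain C (no SIMD hardware assumed; the helper names are our own) of packing two 16-bit words into one 32-bit register-sized value and unpacking them again, as described above:

    #include <stdint.h>

    /* Pack two 16-bit operands into one 32-bit word: 'hi' in the upper half,
       'lo' in the lower half. */
    static uint32_t pack16x2(uint16_t hi, uint16_t lo) {
        return ((uint32_t)hi << 16) | lo;
    }

    /* Unpack the two 16-bit lanes again. */
    static void unpack16x2(uint32_t packed, uint16_t *hi, uint16_t *lo) {
        *hi = (uint16_t)(packed >> 16);
        *lo = (uint16_t)(packed & 0xFFFFu);
    }

On a processor with real SIMD support, a single instruction would then add the two lanes of one packed word to the two lanes of another; in plain C the carry from the low lane would spill into the high lane unless it is masked off, as in the SWAR example earlier.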

SIMD proves to be really effective at accelerating algorithms and applications that are highly parallel and repetitive, for instance block FIR filtering or vector addition. But it is not effective for processing that is predominantly serial, as in the case of control code, or when operands are scattered in memory and cannot easily be packed into or unpacked from registers.

SIMD IN 3D GEOMETRY PROCESSING

Today, multimedia applications such as speech, audio/video, image and graphics workloads have become really important on common general-purpose processors. 3D geometry processing is an inherently parallel task, so many CPU vendors have added SIMD (single-instruction, multiple-data) instruction extensions to accelerate it. Modern graphics cards with embedded SIMD have largely taken over this task from the CPU. Some systems also include functions that re-pack elements inside vectors, which makes them particularly useful for data processing and compression.

SIMD EXTENSIONS FOR MULTIMEDIA

Delivering high-performance audio and video processing to embedded applications

The present generation of phones and internet appliances must deliver a very high level of media and graphics performance to be competitive. SIMD extensions for the ARMv6 and ARMv7 architectures have improved performance greatly. These SIMD extensions increase the processing ability of ARM processor-based SoCs without materially increasing power consumption. Such SIMD extensions are highly optimized for a broad range of software applications, including video and audio codecs, where the extensions increase performance by up to 75% or even more.

ARMv6 SIMD Features:

75% performance increase for audio and video processing

Simultaneous computation of 2x16-bit or 4x8-bit operands (see the sketch after this list)

Fractional arithmetic

User definable saturation modes (arbitrary word-width)

Dual 16x16 multiply-add/subtract, 32x16 fractional MAC

Simultaneous 8/16-bit select operations

Performance up to 3.2 GOPS at 800MHz

Performance is achieved with a "near zero" increase in power consumption on a typical implementation
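A hedged sketch of the 2x16-bit case, assuming a compiler whose <arm_acle.h> exposes the ACLE 32-bit SIMD intrinsics (advertised via the __ARM_FEATURE_SIMD32 macro); on other targets the plain-C fallback is used, and the helper name is our own:

    #include <stdint.h>
    #if defined(__ARM_FEATURE_SIMD32)
    #include <arm_acle.h>   /* ACLE 32-bit SIMD intrinsics (ARMv6 SIMD and later) */
    #endif

    /* Add two pairs of 16-bit samples held in 32-bit words. */
    static uint32_t add_2x16(uint32_t a, uint32_t b) {
    #if defined(__ARM_FEATURE_SIMD32)
        /* One instruction adds both 16-bit halves in parallel. */
        return (uint32_t)__sadd16((int16x2_t)a, (int16x2_t)b);
    #else
        /* Portable fallback: add each 16-bit lane separately. */
        uint32_t lo = (uint16_t)((a & 0xFFFFu) + (b & 0xFFFFu));
        uint32_t hi = (uint16_t)((a >> 16) + (b >> 16));
        return (hi << 16) | lo;
    #endif
    }

On an ARMv6-class core the intrinsic path compiles to a single SADD16 instruction, adding both 16-bit lanes at once.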

Applications:

Media-streaming internet appliances

MPEG-4 and H.264 encode/decode

Voice and handwriting recognition

FFT processing

Complex arithmetic

Viterbi processing

SIMD extensions simplify development of application software by offering a single tool-chain and processing device, when compared to architectures with separate programmable DSPs or accelerators. The single tool-chain environment speeds time-to-market as software plays an increasingly important role in product development. The SIMD extensions are completely transparent to the operating system (OS), allowing existing OS ports to be used. New applications running on the OS can be written to explicitly use the SIMD extensions, providing an additional power/performance advantage.
