Accelerate Framework

Transcript
Page 1: Accelerate Framework

These are confidential sessions—please refrain from streaming, blogging, or taking pictures

Fast and energy efficient computation

Session 713

Accelerate Framework

Geoff Belter
Engineer, Vector and Numerics Group

Page 2: Accelerate Framework

Accelerate Framework: What is it?

• Easy access to a lot of functionality
• Accurate
• Fast with low energy usage
• Works on both OS X and iOS
• Optimized for all generations of hardware

Page 3: Accelerate Framework

Accelerate Framework: What operations are available?

• Image processing (vImage)
• Digital signal processing (vDSP)
• Transcendental math functions (vForce, vMathLib)
• Linear algebra (LAPACK, BLAS)

Page 4: Accelerate Framework

Accelerate Framework: Session goals

• How Accelerate helps you
• Areas of your code likely to benefit from Accelerate
• How you use Accelerate

Page 8: Accelerate Framework

Accelerate Framework: Why is it fast?

• SIMD instructions
  ■ SSE, AVX and NEON
• Match the micro-architecture
  ■ Instruction selection and scheduling
  ■ Software pipelining
  ■ Loop unrolling
• Multi-threaded using GCD

Page 12: Accelerate Framework

Tips for Successful Use of Accelerate

• Prepare your data
  ■ Contiguous
  ■ 16-byte aligned
• Understand problem size
• Do setup once/destroy at the end
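One way to satisfy the "contiguous" and "16-byte aligned" tips is to take buffers from an aligned allocator. The following is a minimal sketch, assuming a C11 toolchain that provides `aligned_alloc`; the helper names `alloc_aligned_floats` and `is_16_byte_aligned` are hypothetical and not part of Accelerate.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocate `count` floats in one contiguous block whose start address is a
 * multiple of 16 bytes. Returns NULL on failure; release with free().
 * (Hypothetical helper, not an Accelerate API.) */
float *alloc_aligned_floats(size_t count)
{
    assert(count > 0);
    size_t bytes = count * sizeof(float);
    /* C11 aligned_alloc expects the size to be a multiple of the alignment. */
    size_t rounded = (bytes + 15u) & ~(size_t)15u;
    return aligned_alloc(16, rounded);
}

/* Report whether a pointer meets the 16-byte alignment tip. */
int is_16_byte_aligned(const void *p)
{
    return ((uintptr_t)p % 16u) == 0;
}
```

A caller would request the buffer once, pass it to the Accelerate routines, and `free()` it at the end, which also fits the "do setup once" tip.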

Page 13: Accelerate Framework

Using Accelerate Framework: Xcode

[Pages 13 to 21: Xcode screenshots; only the slide title survives in the transcript.]

Page 23: Accelerate Framework

Using Accelerate Framework: Command line

cc -framework Accelerate main.c

Page 25: Accelerate Framework

vImage: Vectorized image processing library

Page 26: Accelerate Framework

vImage: What’s available?

Page 32: Accelerate Framework

Additions and Improvements

• Improved conversion support
• vImage Buffer creation utilities
• Resampling of 16-bit images
• Streamlined Core Graphics interoperability

Page 35: Accelerate Framework

Core Graphics Interoperability

• How do I use vImage with my CGImageRef?
• New utility functions
  ■ vImageBuffer_InitWithCGImage
  ■ vImageCreateCGImageFromBuffer

Page 36: Accelerate Framework

Core Graphics Interoperability: From CGImageRef to vImageBuffer

#include <Accelerate/Accelerate.h>

// Create and prepare CGImageRef
CGImageRef inImg;

// Specify Format
vImage_CGImageFormat format = {
    .bitsPerComponent = 8,
    .bitsPerPixel = 32,
    .colorSpace = NULL,
    .bitmapInfo = kCGImageAlphaFirst,
    .version = 0,
    .decode = NULL,
    .renderingIntent = kCGRenderingIntentDefault,
};

// Create vImageBuffer (vImage allocates inBuffer.data; free() it when finished)
vImage_Buffer inBuffer;
vImageBuffer_InitWithCGImage(&inBuffer, &format, NULL, inImg, kvImageNoFlags);

Page 40: Accelerate Framework

Core Graphics Interoperability: From vImageBuffer to CGImageRef

#include <Accelerate/Accelerate.h>

// The output buffer
vImage_Buffer outBuffer;

// Create CGImageRef
vImage_Error error;
CGImageRef outImg = vImageCreateCGImageFromBuffer(&outBuffer, &format, NULL,
                                                  NULL, kvImageNoFlags, &error);

Page 43: Accelerate Framework

Core Graphics Interoperability: vImageConvert_AnyToAny

• Convert between vImage_CGImageFormat types
• Tips for use
  ■ Create converter once
  ■ Use many times

Page 44: Accelerate Framework

[Chart: Software JPEG Encode Performance on iPhone 5, in MPixel/s (bigger is better), comparing the old method with AnyToAny for ARGB, ARGB premul, BGRA premul, RGBA, RGBA premul, RGBA (float), RGBA (float) premul, RGBA (uint16) and RGBA (uint16) premul. The axis runs from 0 to 30; bar values are not recoverable from the transcript.]

Page 48: Accelerate Framework

Conversion Example: Scaling with a premultiplied PlanarF image

#include <Accelerate/Accelerate.h>

vImage_Buffer src, dst, alpha;
...
// Premultiplied data -> Non-premultiplied data, works in-place
vImageUnpremultiplyData_PlanarF(&src, &alpha, &src, kvImageNoFlags);

// Resize the image
vImageScale_PlanarF(&src, &dst, NULL, kvImageNoFlags);

// Non-premultiplied data -> Premultiplied data, works in-place
vImagePremultiplyData_PlanarF(&dst, &alpha, &dst, kvImageNoFlags);
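For readers unfamiliar with the term, premultiplied data stores each color value already multiplied by its alpha. The sketch below shows that arithmetic on one planar float channel in portable C. It is an illustration only, not vImage's implementation; the function names are hypothetical, and writing 0 for fully transparent pixels is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Premultiply one planar float channel in place: color[i] *= alpha[i]. */
void premultiply_planar(float *color, const float *alpha, size_t count)
{
    assert(color != NULL && alpha != NULL);
    for (size_t i = 0; i < count; ++i) {
        color[i] *= alpha[i];
    }
}

/* Undo the premultiplication in place. Fully transparent pixels
 * (alpha == 0) carry no recoverable color, so this sketch writes 0. */
void unpremultiply_planar(float *color, const float *alpha, size_t count)
{
    assert(color != NULL && alpha != NULL);
    for (size_t i = 0; i < count; ++i) {
        color[i] = (alpha[i] != 0.0f) ? color[i] / alpha[i] : 0.0f;
    }
}
```

The slide's sequence follows the same idea: divide the alpha out, scale, then multiply the alpha back in.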

Page 54: Accelerate Framework

Conversions in Context: Scaling on a premultiplied PlanarF image

Percentage of time taken:
• Unpremultiply: 1.10%
• Scale: 98.26%
• Premultiply: 0.64%

Page 56: Accelerate Framework

vImage vs. OpenCV

Page 57: Accelerate Framework

OpenCV: The competition

• Open source computer vision library
• Image processing module

Page 58: Accelerate Framework

Points of Comparison

• Execution time
• Energy consumed

Page 60: Accelerate Framework

vImage Speedup over OpenCV: iPhone 5

Speedup over OpenCV (above one is better):
• Histogram: 1.62
• Max: 23.19
• Box Convolve: 7.88
• Affine Warp (lanczos): 6.41

Page 61: Accelerate Framework

Energy Consumption and Battery Life

• Fast code tends to
  ■ Decrease energy consumption
  ■ Increase battery life
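The reasoning behind this claim can be sketched with simple arithmetic: energy is power multiplied by time, so a task that draws more power but finishes sooner can still use less energy over the same window. The model below is a simplified assumption (constant power while busy, constant idle power afterwards), and the wattages in the usage note are invented for illustration, not measurements.

```c
#include <assert.h>

/* Energy in joules used over a fixed observation window: the task runs at
 * active_watts for busy_seconds, then the device idles at idle_watts for the
 * rest of the window. (Simplified two-level power model.) */
double window_energy_joules(double active_watts, double idle_watts,
                            double busy_seconds, double window_seconds)
{
    assert(busy_seconds >= 0.0 && busy_seconds <= window_seconds);
    return active_watts * busy_seconds
         + idle_watts * (window_seconds - busy_seconds);
}
```

With these made-up inputs, an unoptimized run at 1.0 W for the full 2 s window costs `window_energy_joules(1.0, 0.25, 2.0, 2.0)`, which is 2.0 J, while an optimized run at 1.5 W for 0.5 s costs `window_energy_joules(1.5, 0.25, 0.5, 2.0)`, which is 1.125 J.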

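Page 62: Accelerate Framework

Typical Energy Consumption Profile

[Figure: instantaneous power plotted against time above an idle power baseline, with power levels p0, p1, p2 and times t0, t1 marked; the area labeled Energy lies under the curve.]

Page 63: Accelerate Framework

Typical Energy Consumption Profile

[Figure: the same power-versus-time plot comparing an unoptimized run with an optimized run.]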

Page 65: Accelerate Framework

vImage Energy Savings over OpenCV: iPhone 5

Times less energy than OpenCV (above one is better):
• Histogram: 0.75
• Max: 6.71
• Box Convolve: 4.05
• Affine Warp (lanczos): 6.96

Page 67: Accelerate Framework

“Using vImage from the Accelerate framework to dynamically pre-render my sprites. It’s the only way to make it fast. ;-)”

–Twitter User

Page 69: Accelerate Framework

vDSP: Vectorized digital signal processing library

Page 70: Accelerate Framework

vDSP: What’s available?

• Basic operations on arrays
  ■ Add, subtract, multiply, conversion, accumulation, etc.
• Discrete Fourier Transform
• Convolution and correlation

Page 73: Accelerate Framework

New and Improved in vDSP

• Multi-channel IIR filter
• Improved power of 2 support
  ■ Discrete Fourier Transform (DFT)
  ■ Discrete Cosine Transform (DCT)

Page 76: Accelerate Framework

Discrete Fourier Transform (DFT)

• Same operation, two entries based on number of points
  ■ Before: 256 points used FFT, 384 points used DFT
  ■ OS X 10.9, iOS 7: both 256 and 384 points use DFT

Page 77: Accelerate Framework

DFT Example

#include <Accelerate/Accelerate.h>

// Create and prepare data:
float *Ir, *Ii, *Or, *Oi;

// Once at start:
vDSP_DFT_Setup setup = vDSP_DFT_zop_CreateSetup(0, 1024, vDSP_DFT_FORWARD);
...
vDSP_DFT_Execute(setup, Ir, Ii, Or, Oi);
...
// Once at end:
vDSP_DFT_DestroySetup(setup);

Page 83: Accelerate Framework

vDSP vs. FFTW

Page 84: Accelerate Framework

FFTW: Fastest Fourier Transform in the West

• One and multi-dimensional transforms
• Real and complex data
• Parallel

Page 86: Accelerate Framework

vDSP Speedup over FFTW: DFT on iPhone 5

[Chart: speedup (axis 0 to 3, above one is better) against number of points: 240, 256, 320, 384, 480, 512, 640, 768 and 960. Plotted values are not recoverable from the transcript.]

Page 88: Accelerate Framework

DFT In Use: AAC Enhanced Low Delay

• Used in FaceTime
• DFT one of many DSP routines
• Percentage of time spent in DFT

Page 91: Accelerate Framework

DFT In Use: AAC Enhanced Low Delay

Percent time spent in DFT (segments: DFT, Everything Else):
• FFTW: 47% and 54%
• vDSP: 70% and 30%

[Pie charts: the transcript does not preserve which percentage belongs to which segment.]

Page 92: Accelerate Framework

Data Types in vDSP

• Single and double precision
• Real and complex
• Support for strided data access

Page 93: Accelerate Framework

“Wanna do FFT on iOS? Use the Accelerate.framework. Highly recommended. #ios”

–Twitter User

Page 95: Accelerate Framework

Fast Math

Luke Chang
Engineer, Vector and Numerics Group

Page 98: Accelerate Framework

Math for Every Data Length

• Libm for scalar data (float)
• vMathLib for SIMD vectors (vFloat)
• vForce for array data (float [])

Page 99: Accelerate Framework

Libm: Standard C math library

Page 100: Accelerate Framework

Libm

• Standard math library in C
• Collection of transcendental functions
• Operates on scalar data
  ■ exp[f]
  ■ log[f]
  ■ sin[f]
  ■ cos[f]
  ■ pow[f]
  ■ Etc…

Page 101: Accelerate Framework

What’s New in Libm?

• Extensions to C11, prefixed with “__”
• Added in both iOS 7 and OS X 10.9
• __exp10[f]
• __sinpi[f], __cospi[f], __tanpi[f]
• __sincos[f], __sincospi[f]

Page 102: Accelerate Framework

__exp10[f]: Power of 10

• Commonly used for decibel calculations
• Faster than pow(10.0, x)
• More accurate than exp(log(10) * x)
  ■ exp(log(10) * 5.0) = 100000.0000000002

Page 103: Accelerate Framework

__sinpi[f], __cospi[f], __tanpi[f]: Trigonometry in Terms of PI

• cospi(x) means cos(π*x)
• Faster because argument reduction is simpler
• More accurate when working with degrees
  ■ cos(M_PI*0.5) returns 6.123233995736766e-17
  ■ __cospi(0.5) returns exactly 0.0

Page 104: Accelerate Framework

__sincos[f], __sincospi[f]: Sine-Cosine Pairs

• Compute sine and cosine simultaneously
• Faster because argument reduction is only done once
• Clang will call __sincos[f] when possible

Page 105: Accelerate Framework

C11 Features

• Some complex values can’t be specified as literals
  ■ (0.0 + INFINITY * I)
• C11 adds CMPLX macro for this purpose
  ■ CMPLX(0.0, INFINITY)
• CMPLXF and CMPLXL are also available
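The difference between the two spellings can be checked directly. In the sketch below, `CMPLX` sets the real and imaginary parts as given, while the expression form multiplies infinity by the zero real part of `I`, which on common compilers yields a NaN real part. The function names are hypothetical, and the NaN behaviour is described as typical rather than guaranteed.

```c
#include <assert.h>
#include <complex.h>
#include <math.h>

/* Build 0 + i*infinity with the C11 CMPLX macro, which stores both parts
 * directly instead of evaluating an arithmetic expression. */
double complex imaginary_infinity(void)
{
    double complex z = CMPLX(0.0, INFINITY);
    assert(creal(z) == 0.0 && isinf(cimag(z)));
    return z;
}

/* The expression form from the slide. INFINITY * I involves infinity times
 * the zero real part of I, so the real part typically comes out as NaN. */
double complex imaginary_infinity_by_expression(void)
{
    return 0.0 + INFINITY * I;
}
```

Checking `isnan(creal(imaginary_infinity_by_expression()))` on a given toolchain shows whether the expression form loses the intended zero real part there.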

Page 106: Accelerate Framework

vMathLib: SIMD vector math library

Page 107: Accelerate Framework

vMathLib

• Collection of transcendental functions for SIMD vectors
• Operates on SIMD vectors
  ■ vexp[f]
  ■ vlog[f]
  ■ vsin[f]
  ■ vcos[f]
  ■ vpow[f]
  ■ Etc…

Page 108: Accelerate Framework

When to Use vMathLib? Writing your own vector algorithm

• Need transcendental functions in your vector code

Page 109: Accelerate Framework

vMathLib Example: Taking sine of a vector

• Using Libm

#include <math.h>

vFloat vx = { 1.f, 2.f, 3.f, 4.f };
vFloat vy;
...
// Treat each SIMD vector as four scalar floats
float *px = (float *)&vx, *py = (float *)&vy;
for (size_t i = 0; i < sizeof(vx)/sizeof(px[0]); ++i) {
    py[i] = sinf(px[i]);
}
...

Page 110: Accelerate Framework

vMathLib Example: Taking sine of a vector

• Using vMathLib

#include <Accelerate/Accelerate.h>

vFloat vx = { 1.f, 2.f, 3.f, 4.f };
vFloat vy;
...
vy = vsinf(vx);
...

Page 111: Accelerate Framework

vForce: Vectorized math library

Page 112: Accelerate Framework

vForce

• Collection of transcendental functions for arrays
• Operates on array data
  ■ vvexp[f]
  ■ vvlog[f]
  ■ vvsin[f]
  ■ vvcos[f]
  ■ vvpow[f]
  ■ Etc…

Page 113: Accelerate Framework

vForce Example

• Filling a buffer with sine wave using a for loop

#include <math.h>

float buffer[length];
float indices[length];

...

for (int i = 0; i < length; i++)
{
    buffer[i] = sinf(indices[i]);
}

Page 114: Accelerate Framework

vForce Example

• Filling a buffer with sine wave using vForce

#include <Accelerate/Accelerate.h>

float buffer[length];
float indices[length];

...

vvsinf(buffer, indices, &length);

Page 116: Accelerate Framework

Better Performance: Measured on iPhone 5

Sines computed per µs (bigger is better):
• vForce: 56
• For loop: 24

Page 118: Accelerate Framework

Less Energy: Measured on iPhone 5

nJ consumed per sine (smaller is better):
• vForce: 14.16
• For loop: 23.37

Page 120: Accelerate Framework

vForce Performance: Measured on iPhone 5

[Chart: results computed per µs for truncf, logf, expf, powf, sinf and sincosf, vForce versus for loop, axis 0 to 200. The transcript lists the bar labels 10.6, 24, 9.2, 29, 22, 161, 28.8, 56, 26.6, 120, 116 and 826 without preserving which bar each belongs to.]

Page 121: Accelerate Framework

vForce in Detail

• Supports both float and double
• Handles edge cases correctly
• Requires minimal data alignment
• Supports in-place operation
• Improves performance even with small arrays
  ■ Consider using vForce when more than 16 elements

Page 123: Accelerate Framework

LAPACK and BLAS: Linear Algebra PACKage and Basic Linear Algebra Subprograms

Geoff Belter
Engineer, Vector and Numerics Group

Page 124: Accelerate Framework

LAPACK Operations

• High-level linear algebra
• Solve linear systems
• Matrix factorizations
• Eigenvalues and eigenvectors

Page 125: Accelerate Framework

LINPACK

• How fast can you solve a system of linear equations?
• 1000 x 1000 matrices

Page 126: Accelerate Framework

LAPACK

[Chart built up over pages 126 to 137: LINPACK benchmark performance in Mflops (bigger is better). Bar values that survive in the transcript: “Brand A” 40 (page 127), “Brand A” 788 (page 130), Accelerate 1202 (page 132, relabeled "Accelerate on iPhone 4S" on page 133), and 3446 when "Accelerate on iPhone 5" is added and the axis is extended to 4000 (page 137).]

Page 139: Accelerate Framework

iPad with Retina Display vs. Power Mac G5

Page 140: Accelerate Framework

LAPACK: iPad with Retina display vs. Power Mac G5

• Triumphant return
• After 10 years
• All fans blazing

Page 144: Accelerate Framework

LAPACK

[Chart: LINPACK benchmark performance in Mflops (bigger is better), axis 0 to 4000, with bars labeled Accelerate on iPhone 4S, PowerMac G5 and iPad with Retina Display. Two bar values survive, 3643 (first shown on page 142) and 3686 (added on page 144); the transcript does not preserve which bar each belongs to.]

Page 145: Accelerate Framework

LAPACK Example: Solve linear system

#include <Accelerate/Accelerate.h>

// Create and prepare input and output data
// (A is n x n and B is n x nrhs, both column-major; ipiv holds n pivot indices)
double *A, *B;
__CLPK_integer *ipiv;
__CLPK_integer n, nrhs, info;

// solve: on return B holds the solution and info reports the status
dgesv_(&n, &nrhs, A, &n, ipiv, B, &n, &info);

Page 148: Accelerate Framework

BLAS Operations

• Low-level linear algebra
• Vector
  ■ Dot product, scalar product, vector sum
• Matrix-vector
  ■ Matrix-vector product, outer product
• Matrix-matrix
  ■ Matrix multiply

Page 149: Accelerate Framework

BLAS Example: Matrix Multiply

#include <Accelerate/Accelerate.h>

// Create and prepare data
double *A, *B, *C;

// C <-- A * B
cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
            100, 100, 100, 1.0, A, 100, B, 100, 0.0, C, 100);
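The slide's call uses 100 for every dimension, which hides what each argument means. The sketch below spells out the mapping for non-square, row-major matrices: M, N and K are the dimensions, and each leading dimension equals that matrix's column count when rows are stored without padding. The wrapper name, the `__APPLE__` guard, and the portable fallback loop are assumptions added for illustration.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__APPLE__)
#include <Accelerate/Accelerate.h>
#endif

/* C = A * B for row-major matrices: A is m x k, B is k x n, C is m x n.
 * Each matrix is stored without padding, so the leading dimension of a
 * row-major matrix equals its number of columns. */
void matmul_row_major(int m, int n, int k,
                      const double *A, const double *B, double *C)
{
    assert(A != NULL && B != NULL && C != NULL);
    assert(m > 0 && n > 0 && k > 0);
#if defined(__APPLE__)
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k,
                1.0, A, k,   /* alpha, A, lda */
                B, n,        /* B, ldb */
                0.0, C, n);  /* beta, C, ldc */
#else
    /* Portable fallback with the same result, for illustration only. */
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            double sum = 0.0;
            for (int p = 0; p < k; ++p) {
                sum += A[i * k + p] * B[p * n + j];
            }
            C[i * n + j] = sum;
        }
    }
#endif
}
```

For example, multiplying the 2 x 3 matrix {1, 2, 3, 4, 5, 6} by the 3 x 2 matrix {7, 8, 9, 10, 11, 12} gives the 2 x 2 result {58, 64, 139, 154}.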

Page 152: Accelerate Framework

Data Types

• Single and double precision
• Real and complex
• Multiple data layouts
  ■ Dense, banded, triangular, etc.
  ■ Transpose, conjugate transpose
  ■ Row and column major

Page 153: Accelerate Framework

“Playing with the Accelerate.framework today. Having a BLASt.”

–Twitter User

Page 154: Accelerate Framework

Summary

Page 155: Accelerate Framework

Accelerate Framework: Lots of functionality

• Image processing (vImage)
• Digital signal processing (vDSP)
• Transcendental math functions (vForce, vMathLib)
• Linear algebra (LAPACK, BLAS)

Page 156: Accelerate Framework

Accelerate Framework: Features and benefits

• Easy access to a lot of functionality
• Accurate
• Fast with low energy usage
• Works on both OS X and iOS
• Optimized for all generations of hardware

Page 157: Accelerate Framework

Accelerate Framework: To be successful

• Prepare your data
  ■ Contiguous
  ■ 16-byte aligned
• Understand problem size
• Do setup once/destroy at the end

Page 160: Accelerate Framework

If You Need a Feature: Request It

Page 161: Accelerate Framework

“Discrete Cosine Transform was my feature request that made it into the Accelerate Framework. I feel so special!”

–Twitter User

Page 162: Accelerate Framework

“Thanks Apple for making the Accelerate Framework.”

–Twitter User

Page 163: Accelerate Framework

More Information

Paul Danbold
Core OS Technologies [email protected]

George Warner
DTS Sr. Support [email protected]

Documentation
vImage Programming Guide
http://developer.apple.com/library/mac/#documentation/Performance/Conceptual/vImage/Introduction/Introduction.html

vDSP Programming Guide
http://developer.apple.com/library/mac/#documentation/Performance/Conceptual/vDSP_Programming_Guide/Introduction/Introduction.html

Apple Developer Forums
http://devforums.apple.com

Page 164: Accelerate Framework

Labs

Accelerate Lab, Core OS Lab A, Thursday 4:30PM