Improve Performance with Intel® Cilk™ Plus

Brandon Hewitt

Senior Technical Consulting Engineer

Intel Corporation

Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Agenda

The obstacles in taking advantage of CPU innovation in software

The Answer – Intel® Cilk™ Plus

Cilk Plus Array Notations

Cilk Plus Elemental Functions

Cilk Plus keywords

Serialization and Serial Semantics

Composability

Cilk Plus Hyperobjects

References and Contact Information

Q&A

6/26/2012


CPU Innovation Increases…

[Figure: Multi Core to Many Core; vector width 128 Bits, 256 Bits, 512 Bits]


But it’s hard for C/C++ applications to take advantage

• Native Threading APIs require significant expertise to get correct and fast

• And even harder to scale, especially on systems running other threaded processes

• Taking advantage of vector instructions has similar challenges

• Inline assembly and intrinsics require low-level expertise and do not scale forward

• Compiler vectorization is great when it works but can be hindered by language semantics


Intel® Cilk™ Plus Provides a Solution

Intel Cilk Plus Key Benefits

• Simple syntax which is very easy to learn and use

• Array notation guarantees fast vector code that is easy to retarget to wider vectors

• Fork/join tasking system is simple to understand and mimics serial behavior

• Low overhead tasks offer scalability to high core counts

• Reducers give better performance than mutex locks and maintain serial semantics

• Mixes well with Intel® Threading Building Blocks, a more complete threading library solution

Intel Cilk Plus

What is it?

• Compiler supported solution offering a tasking system via 3 simple keywords

• Array notation to specify vector code

• Hyperobjects - powerful parallel data structures to efficiently prevent races

• Pragmas to define vectorization of loops

• Attributes to specify functions that can be applied to all elements of arrays

• Supports C and C++


Wait, “compiler supported?”

Yes, it requires language and runtime support from a compiler.

Language and ABI Specifications are supplied at http://software.intel.com/en-us/articles/intel-cilk-plus-specification/

Intel® Cilk™ Plus is not

A complete threading solution

Solution: Intel® Threading Building Blocks

A cluster parallel solution across nodes

Solution: Intel® Cluster Studio

A tool to make designing parallel applications easier

Solution: Intel® Parallel Advisor

Intel® Cilk™ Plus Vector Constructs


First, a moment on vectorization

• Scalar mode: one instruction produces one result

• SIMD processing: with Intel® Streaming SIMD Extensions or Advanced Vector Extensions, one instruction can produce multiple results

[Figure: in scalar mode, one add computes c[i] from a[i] and b[i]; in SIMD mode, one add computes c[i] through c[i+7] from a[i] through a[i+7] and b[i] through b[i+7]]

for (i=0;i<=MAX;i++)
    c[i]=a[i]+b[i];

Must be enabled in the compiler by specifying a target architecture: -m or /arch, -x or /Qx, -ax or /Qax. The default for Intel® C++ Composer XE 2011 is -msse2 or /arch:SSE2.


Intel® Cilk™ Plus Array Notations

Array Notations provide a syntax to specify sections of arrays on which to perform operations

syntax :: [<lower bound> : <length> : <stride>]

Simple examples

• a[0:N] = b[0:N] * c[0:N];

• a[:] = b[:] * c[:]; // if a, b, c are declared with size N

The Intel® C++ Compiler’s automatic vectorization can use this information to apply single operations to multiple elements of the array using Intel® Streaming SIMD Extensions (Intel® SSE) and Intel® Advanced Vector Extensions (Intel® AVX).

More advanced example

• x[0:10:10] = sin(y[20:10:2]);
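As a rough guide to reading section triples, the sketch below spells out in plain C++ what the advanced example computes, assuming element k of a section sits at index lower bound + k * stride. The function name strided_sin is made up for illustration, and the loop is only a scalar equivalent, not what the compiler generates.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Scalar equivalent of: x[0:10:10] = sin(y[20:10:2]);
// Hypothetical helper, not part of Cilk Plus.
void strided_sin(std::vector<double>& x, const std::vector<double>& y)
{
    const std::size_t length = 10;             // both sections have 10 elements
    assert(x.size() > 0 + (length - 1) * 10);  // last index written is 90
    assert(y.size() > 20 + (length - 1) * 2);  // last index read is 38
    for (std::size_t k = 0; k < length; ++k)
        x[0 + k * 10] = std::sin(y[20 + k * 2]);
}
```

Only x[0], x[10], ..., x[90] are written; every other element of x is left untouched.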


Intel® Cilk™ Plus Array Notations Example

void foo(double * a, double * b, double * c, double * d, double * e, int n) {
 for(int i = 0; i < n; i++)
  a[i] *= (b[i] - d[i]) * (c[i] + e[i]);
}

void goo(double * a, double * b, double * c, double * d, double * e, int n) {
     a[0:n] *= (b[0:n] - d[0:n]) * (c[0:n] + e[0:n]);
}

icl -Qvec-report3 -c test-array-notations.cpp
Intel(R) C++ Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 12.1.4.325 Build 20120410
Copyright (C) 1985-2012 Intel Corporation. All rights reserved.

test-array-notations.cpp
test-array-notations.cpp(2): (col. 2) remark: loop was not vectorized: existence of vector dependence.
test-array-notations.cpp(3): (col. 3) remark: vector dependence: assumed FLOW dependence between a line 3 and e line 3.
<snip>
test-array-notations.cpp(7): (col. 6) remark: LOOP WAS VECTORIZED.
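The FLOW dependence remark reflects that the pointers passed to foo may alias. The sketch below, in plain C++ with made-up helper names, illustrates why that matters: when e points one element before a, an in-order loop and a version that reads all of its inputs before writing produce different results, so the loop cannot safely be reordered unless the arrays are known to be distinct.

```cpp
#include <cassert>
#include <vector>

// Same loop as foo: an iteration may read a value written by an earlier one.
void update_in_order(double* a, const double* b, const double* c,
                     const double* d, const double* e, int n)
{
    for (int i = 0; i < n; i++)
        a[i] *= (b[i] - d[i]) * (c[i] + e[i]);
}

// Hypothetical variant: read every input first, then write the results.
void update_from_snapshot(double* a, const double* b, const double* c,
                          const double* d, const double* e, int n)
{
    std::vector<double> factor(n);
    for (int i = 0; i < n; i++)
        factor[i] = (b[i] - d[i]) * (c[i] + e[i]);
    for (int i = 0; i < n; i++)
        a[i] *= factor[i];
}
```

With distinct arrays the two functions agree. With a = buf + 1 and e = buf they diverge, because the in-order loop sees each freshly written a[i-1] through e[i].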


And more

Reduction operations using __sec_reduce_*

Can even write custom reductions

Perform index-dependent operations using __sec_implicit_index(rank)

Gather/scatter notation
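For orientation, the sketch below gives plain C++ equivalents of three of these features, with an approximate array-notation form in each comment. The helper names are made up. The sketch assumes that __sec_reduce_add sums a section, that __sec_implicit_index(0) yields each element's index along the first dimension, and that a section used as a subscript performs a gather.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Roughly: double s = __sec_reduce_add(a[0:n]);
double sum_section(const std::vector<double>& a)
{
    return std::accumulate(a.begin(), a.end(), 0.0);
}

// Roughly: a[0:n] = 2 * __sec_implicit_index(0);
void fill_with_twice_index(std::vector<int>& a)
{
    for (std::size_t i = 0; i < a.size(); ++i)
        a[i] = 2 * static_cast<int>(i);
}

// Roughly (gather): out[0:n] = a[idx[0:n]];
std::vector<double> gather(const std::vector<double>& a,
                           const std::vector<std::size_t>& idx)
{
    std::vector<double> out(idx.size());
    for (std::size_t i = 0; i < idx.size(); ++i)
        out[i] = a[idx[i]];
    return out;
}
```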


Intel® Cilk™ Plus Elemental Functions

The compiler can’t assume that user-defined functions are safe for vectorization

Now you can make your function an elemental function which indicates to the compiler that such a function can be applied to multiple elements of an array in parallel safely.

Specify __declspec(vector) on both function declarations and definitions as this will affect name-mangling.

Clauses provided to help direct the compiler on the code generated – see the documentation for details


Elemental Function Sample

double user_function(double x);
__declspec(vector) double elemental_function(double x);

double a[100];
double b[100];

void foo() {
   a[:] = user_function(b[:]);
   a[:] = elemental_function(b[:]);
}

icl /Qvec-report3 /c test-elemental-functions.cpp
Intel(R) C++ Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 12.1.4.325 Build 20120410
Copyright (C) 1985-2012 Intel Corporation. All rights reserved.

test-elemental-functions.cpp
test-elemental-functions.cpp(9) : (col. 4) remark: LOOP WAS VECTORIZED.
test-elemental-functions.cpp(8) : (col. 4) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.

Intel® Cilk™ Plus Tasking


Intel® Cilk™ Plus Keywords

Cilk Plus adds three keywords to C and C++:

_Cilk_spawn

_Cilk_sync

_Cilk_for

If you #include <cilk/cilk.h>, you can write the keywords as cilk_spawn, cilk_sync, and cilk_for.

Cilk Plus runtime controls thread creation and scheduling. A thread pool is created prior to use of Cilk Plus keywords.

The number of threads matches the number of cores (including hyperthreads) by default, but can be controlled by the user.


cilk_spawn and cilk_sync

cilk_spawn gives the runtime permission to run a child function asynchronously.

• No 2nd thread is created or required!

• If there are no available workers, then the child will execute as a serial function call.

• The scheduler may steal the parent and run it in parallel with the child function.

• The parent is not guaranteed to run in parallel with the child.

cilk_sync waits for all children to complete before execution proceeds from that point.

• There are implicit cilk_sync points – will discuss later


A simple example

Recursive computation of a Fibonacci number:

int fib(int n)
{
    int x, y;
    if (n < 2) return n;
    x = cilk_spawn fib(n-1);
    y = fib(n-2);
    cilk_sync;
    return x+y;
}

Asynchronous call must complete before using x.

Execution can continue while fib(n-1) is running.


cilk_for

Looks like a normal for loop.

cilk_for (int x = 0; x < 1000000; ++x) { … }

Any or all iterations may execute in parallel with one another.

All iterations complete before program continues.

Constraints:

• Limited to a single control variable.

• Must be able to jump to the start of any iteration at random.

• Iterations should be independent of one another.


Under the hood of cilk_for

void run_loop(int first, int last)
{
    if ((last - first) <= grainsize) {
        for (int i = first; i < last; ++i) { LOOP_BODY; }
    } else {
        int mid = first + (last - first) / 2;  // midpoint of [first, last)
        cilk_spawn run_loop(first, mid);
        run_loop(mid, last);
    }
}
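The recursive splitting can be exercised in plain C++. The sketch below is a serial stand-in under stated assumptions: an ordinary call replaces cilk_spawn, the grain size and loop body are passed as parameters (both are illustrative additions), and the midpoint is computed as first + (last - first) / 2 so that each half covers a distinct part of the range.

```cpp
#include <cassert>
#include <vector>

// Serial stand-in for divide-and-conquer loop splitting.
// Where a Cilk Plus runtime would spawn the first half, this makes a plain call.
template <typename Body>
void run_loop(int first, int last, int grainsize, Body body)
{
    assert(grainsize >= 1);  // a smaller grain size would never reach the base case
    if (last - first <= grainsize) {
        for (int i = first; i < last; ++i) body(i);
    } else {
        int mid = first + (last - first) / 2;
        run_loop(first, mid, grainsize, body);  // would be spawned
        run_loop(mid, last, grainsize, body);
    }
}
```

Calling run_loop(0, 1000, 8, body) invokes body exactly once for each index from 0 through 999, in ascending order in this serial form.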


cilk_for examples

cilk_for (int x = 0; x < 1000000; x += 2) { … }

cilk_for (vector<int>::iterator x = y.begin(); x != y.end(); ++x) { … }

cilk_for (list<int>::iterator x = y.begin(); x != y.end(); ++x) { … }

The list version is not a valid cilk_for:

• Loop count cannot be computed in constant time for a list. (y.end() - y.begin() is not defined.)

• There is no random access to the elements of the list. (y.begin() + n is not defined.)
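The distinction between the vector and list loops comes down to the iterator category. A minimal sketch in standard C++ (the helper is_random_access is a made-up name, not part of Cilk Plus) checks at compile time whether an iterator type supports subtraction and advancing by n:

```cpp
#include <iterator>
#include <list>
#include <type_traits>
#include <vector>

// Hypothetical compile-time check: does this iterator type support
// it2 - it1 and it + n, as random access iterators do?
template <typename Iterator>
constexpr bool is_random_access()
{
    return std::is_base_of<
        std::random_access_iterator_tag,
        typename std::iterator_traits<Iterator>::iterator_category>::value;
}

static_assert(is_random_access<std::vector<int>::iterator>(), "vector iterators qualify");
static_assert(!is_random_access<std::list<int>::iterator>(), "list iterators do not");
static_assert(is_random_access<int*>(), "raw pointers qualify");
```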


Serialization

Every Cilk Plus program has an equivalent serial program called the serialization.

The serialization is obtained by removing cilk_spawn and cilk_sync keywords and replacing cilk_for with for.

• The compiler will produce the serialization for you if you compile with /Qcilk-serialize (Windows*) or -cilk-serialize (Linux*/OS X*)

Running with only one worker is equivalent to running the serialization.
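To make the idea concrete, the sketch below builds a serialization by hand in standard C++. It assumes nothing beyond the rule stated above: the three macros are written out here for illustration (they are not supplied by the compiler flag), and the squares function is a made-up example of a cilk_for loop.

```cpp
#include <cassert>
#include <vector>

// Hand-written stand-ins: eliding cilk_spawn and cilk_sync and turning
// cilk_for into for yields the serialization of the program.
#define cilk_spawn
#define cilk_sync
#define cilk_for for

int fib(int n)
{
    if (n < 2) return n;
    int x = cilk_spawn fib(n - 1);  // becomes an ordinary call
    int y = fib(n - 2);
    cilk_sync;                      // becomes an empty statement
    return x + y;
}

// Hypothetical example: each iteration writes a distinct element,
// so the iterations are independent.
std::vector<int> squares(int count)
{
    std::vector<int> out(count);
    cilk_for (int i = 0; i < count; ++i) {
        out[i] = i * i;
    }
    return out;
}
```

Because the keywords only grant permission for parallel execution, this serial build computes the same results that a deterministic parallel run would.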


Serial Semantics

A deterministic Intel® Cilk™ Plus program will have the same semantics as its serialization.

• Easier regression testing

• Easier to debug:

– Run with one core

– Run serialized

• Composable

• Advantages to hyperobjects

• Strong analysis tools (Cilk Plus SDK)

– cilkscreen race detector

– cilkview parallelism analyzer


Implicit Syncs

void f()
{
    cilk_spawn g();
    cilk_for (int x = 0; x < lots; ++x) {
        ...
    }
    try {
        cilk_spawn h();
    }
    catch (...) {
        ...
    }
}

At end of a spawning function

At end of a cilk_for body (does not sync g())

At end of a try block containing a spawn

Before entering a try block containing a sync


Composability

Implicit syncs are important for making each function call a “black box.”

Caller does not know or care if the called function spawns.

• Serial abstraction is maintained.

• Caller does not need to worry about data races with the called function.

cilk_sync synchronizes only children that were spawned within the same function as the sync: no action at a distance.


A summing example

int compute(const X& v);

int main()
{
    const std::size_t n = 1000000;
    extern X myArray[n];
    // ...

    int result = 0;
    for (std::size_t i = 0; i < n; ++i)
    {
        result += compute(myArray[i]);
    }

    std::cout << "The result is: " << result << std::endl;
    return 0;
}


Summing with Intel® Cilk™ Plus

int compute(const X& v);

int main()
{
    const std::size_t n = 1000000;
    extern X myArray[n];
    // ...

    int result = 0;
    cilk_for (std::size_t i = 0; i < n; ++i)
    {
        result += compute(myArray[i]);
    }

    std::cout << "The result is: " << result << std::endl;
    return 0;
}

Race!


Locking solution

int compute(const X& v);

int main()
{
    const std::size_t n = 1000000;
    extern X myArray[n];
    // ...

    mutex L;
    int result = 0;
    cilk_for (std::size_t i = 0; i < n; ++i)
    {
        int temp = compute(myArray[i]);
        L.lock();
        result += temp;
        L.unlock();
    }

    std::cout << "The result is: " << result << std::endl;
    return 0;
}

Problems: lock overhead and lock contention.


Solution with Intel® Cilk™ Plus Reducer

int compute(const X& v);

int main()
{
    const std::size_t ARRAY_SIZE = 1000000;
    extern X myArray[ARRAY_SIZE];
    // ...

    cilk::reducer_opadd<int> result;
    cilk_for (std::size_t i = 0; i < ARRAY_SIZE; ++i)
    {
        result += compute(myArray[i]);
    }

    std::cout << "The result is: " << result.get_value() << std::endl;
    return 0;
}

Declare result to be a summing reducer over int. Updates are resolved automatically without races or contention.

At the end, the underlying int value can be extracted.
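One way to picture how a summing reducer avoids both races and locks is the sketch below, written in plain C++ rather than with the Cilk Plus library. It assumes a simplified model: each strand of work accumulates into its own private view that starts at the identity value, and the views are merged afterwards. The names SumMonoid and reduce_in_chunks are made up, and the chunks run one after another here, so the sketch shows the data flow only, not the runtime's actual mechanism.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a summing reducer's rules:
// an identity value and an associative way to merge two views.
struct SumMonoid {
    static int identity() { return 0; }
    static void reduce(int& left, const int& right) { left += right; }
};

// Sum `values` as if `chunks` strands each worked on a contiguous slice.
template <typename Monoid>
int reduce_in_chunks(const std::vector<int>& values, std::size_t chunks)
{
    assert(chunks > 0);
    std::vector<int> views(chunks, Monoid::identity());  // one private view per chunk
    const std::size_t n = values.size();
    for (std::size_t c = 0; c < chunks; ++c) {
        const std::size_t begin = n * c / chunks;
        const std::size_t end = n * (c + 1) / chunks;
        for (std::size_t i = begin; i < end; ++i)
            Monoid::reduce(views[c], values[i]);         // no shared state is touched
    }
    int result = Monoid::identity();
    for (std::size_t c = 0; c < chunks; ++c)
        Monoid::reduce(result, views[c]);                // merge views left to right
    return result;
}
```

Because integer addition is associative, the total is the same for any number of chunks, which is why no lock is needed on the accumulation path.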


Hyperobject Library

Intel® Cilk™ Plus’s hyperobject library contains many commonly used reducers and more:

• reducer_list_append

• reducer_list_prepend

• reducer_max

• reducer_max_index

• reducer_min

• reducer_min_index

• reducer_opadd

• reducer_ostream

• reducer_basic_string

• holder

• …

You can also write your own using cilk::monoid_base and cilk::reducer.


Hyperobject Benefits and Limitations

Benefits

• Hyperobjects do not suffer from lock contention

• Hyperobjects retain serial semantics

– For example, a list or ostream reducer will retain the same order you would get from serial execution

– Even a correct locking solution cannot guarantee this
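The ordering guarantee can be illustrated without the Cilk Plus library. In the sketch below (plain C++, made-up function name, chunks run one after another), each chunk appends to its own private string and the per-chunk strings are joined left to right, so the result matches serial order however the work is divided. This is a simplified model of the idea, not the library's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch: append is associative but not commutative, so views
// must be merged in left-to-right order to reproduce the serial result.
std::string append_in_chunks(const std::vector<std::string>& words, std::size_t chunks)
{
    assert(chunks > 0);
    std::vector<std::string> views(chunks);  // each starts empty, the identity for append
    const std::size_t n = words.size();
    for (std::size_t c = 0; c < chunks; ++c)
        for (std::size_t i = n * c / chunks; i < n * (c + 1) / chunks; ++i)
            views[c] += words[i];
    std::string result;
    for (const std::string& view : views)
        result += view;                      // join in chunk order
    return result;
}
```

A lock-protected shared string, by contrast, would receive the words in whatever order the parallel iterations happened to acquire the lock.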

Limitations

• Operations on a hyperobject must be associative to behave deterministically

– Refer to the operators supported by a particular reducer class for safe operations to use

– Floating point types may get different results run-to-run

• If using custom data types for hyperobjects, refer to the header for the specific reducer for requirements

• Supported in C as well as C++, but interface is less elegant in C
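The floating point caveat follows from the fact that floating point addition is not associative. A small sketch in standard C++ (function names are made up) shows two groupings of the same three values giving different sums:

```cpp
#include <cassert>

// Groups the additions as (a + b) + c, as a left-to-right serial sum would.
double sum_left_first(double a, double b, double c) { return (a + b) + c; }

// Groups them as a + (b + c), as can happen when partial sums are merged differently.
double sum_right_first(double a, double b, double c) { return a + (b + c); }
```

With a = 1e20, b = -1e20, and c = 1.0, the first grouping returns 1.0, while the second returns 0.0 because adding 1.0 to -1e20 is lost to rounding.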


Use Intel® Cilk™ Plus

Available in:

Intel® C++ Composer XE 2011

• Included in Intel® Parallel Studio XE 2011 SP1

• 30-day evaluation at http://software.intel.com/en-us/articles/intel-software-evaluation-center/

Open Source Patch to gcc* 4.7

• Get from http://cilk.com


Of further interest

For all things Intel® Cilk™ Plus
• http://cilk.com

Intel® Threading Building Blocks
• http://threadingbuildingblocks.org

Get support
• http://premier.intel.com

User Forums
• http://software.intel.com/en-us/forums/intel-cilk-plus

Past technical presentations
• http://software.intel.com/en-us/articles/intel-software-development-products-technical-presentations/
• Includes deeper dives into Cilk Plus, vectorization, Intel® Cluster Studio, and more!

Thank You


INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR. Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information. The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. 
Go to: http://www.intel.com/products/processor_number Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Any software source code reprinted in this document is furnished under a software license and may only be used or copied in accordance with the terms of that license. http://software.intel.com/en-us/articles/intel-sample-source-code-license-agreement/?wapkw=(Samples+Software+License+Agreement) Intel Xeon, Core, and Cilk Plus are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. Copyright © 2012, Intel Corporation. All rights reserved.

Optimization Notice

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Legal Disclaimer & Optimization Notice

