Improve Performance with Intel® Cilk™ Plus · •Hyperobjects - powerful parallel data...
Transcript of Improve Performance with Intel® Cilk™ Plus · •Hyperobjects - powerful parallel data...
Improve Performance with Intel® Cilk™ Plus
Brandon Hewitt
Senior Technical Consulting Engineer
Intel Corporation
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Agenda
The obstacles in taking advantage of CPU innovation in software
The Answer – Intel® Cilk™ Plus
Cilk Plus Array Notations
Cilk Plus Elemental Functions
Cilk Plus keywords
Serialization and Serial Semantics
Composability
Cilk Plus Hyperobjects
References and Contact Information
Q&A
2
6/26/2012
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
CPU Innovation Increases…
3
6/26/2012
Multi Core Many Core
128 Bits
256 Bits
512 Bits
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
But it’s hard for C/C++ applications to take advantage
• Native Threading APIs require significant expertise to get correct and fast
• And even harder to scale, especially on systems running other threaded processes
• Taking advantage of vector instructions has similar challenges
• Inline assembly and intrinsics require low-level expertise and do not scale forward
• Compiler vectorization is great when it works but can be hindered by language semantics
4
6/26/2012
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Intel® Cilk™ Plus Provides a Solution
5
6/26/2012
Intel Cilk Plus Key Benefits
• Simple syntax which is very easy to learn and use
• Array notation guarantees fast vector code that is easy to retarget to wider vectors
• Fork/join tasking system is simple to understand and mimics serial behavior
• Low overhead tasks offer scalability to high core counts
• Reducers give better performance than mutex locks and maintain serial semantics
• Mixes well with Intel® Threading Building Blocks, a more complete threading library solution
Intel Cilk Plus
What is it?
• Compiler supported solution offering a tasking system via 3 simple keywords
• Array notation to specify vector code
• Hyperobjects - powerful parallel data structures to efficiently prevent races
• Pragmas to define vectorization of loops
• Attributes to specify functions that can be applied to all elements of arrays
• Supports C and C++
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Wait, “compiler supported?”
Yes, it requires language and runtime support from a compiler.
6
6/26/2012
Language and ABI Specifications are supplied at http://software.intel.com/en-us/articles/intel-cilk-plus-specification/
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Intel® Cilk™ Plus is not
A complete threading solution
Solution: Intel® Threading Building Blocks
A cluster parallel solution across nodes
Solution: Intel® Cluster Studio
A tool to make designing parallel applications easier
Solution: Intel® Parallel Advisor
7
6/26/2012
Intel® Cilk™ Plus Vector Constructs
8
6/26/2012
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
First, a moment on vectorization
9
6/26/2012
• Scalar mode – one instruction produces
one result
• SIMD processing – With Intel® Streaming SIMD Extensions or
Advanced Vector Extensions
– one instruction can produce multiple
results
+ a[i]
b[i]
c[i]
+
c[i+7] c[i+6] c[i+5] c[i+4] c[i+3] c[i+2] c[i+1] c[i]
b[i+7] b[i+6] b[i+5] b[i+4] b[i+3] b[i+2] b[i+1] b[i]
a[i+7] a[i+6] a[i+5] a[i+4] a[i+3] a[i+2] a[i+1] a[i]
for (i=0;i<=MAX;i++)
c[i]=a[i]+b[i];
a
b
c
+
Must be enabled in compiler by specifying target architecture -m or /arch -x or /Qx -ax or /Qax Default for Intel® C++ Composer XE 2011 is –msse2 or /arch:SSE2
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Intel® Cilk™ Plus Array Notations
Array Notations provide a syntax to specify sections of arrays on which to perform operations
syntax :: [<lower bound> : <length> : <stride>]
Simple examples
• a[0:N] = b[0:N] * c[0:N];
• a[:] = b[:] * c[:] // if a, b, c are declared with size N
The Intel® C++ Compiler’s automatic vectorization can use this information to apply single operations to multiple elements of the array using Intel® Streaming SIMD Extensions (Intel® SSE) and Intel® Advanced Vector Extensions (Intel® AVX)
More advanced example
• x[0:10:10] = sin(y[20:10:2]);
10
6/26/2012
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Intel® Cilk™ Plus Array Notations Example
void foo(double * a, double * b, double * c, double * d, double * e, int n) { for(int i = 0; i < n; i++) a[i] *= (b[i] - d[i]) * (c[i] + e[i]); } void goo(double * a, double * b, double * c, double * d, double * e, int n) { a[0:n] *= (b[0:n] - d[0:n]) * (c[0:n] + e[0:n]); } icl -Qvec-report3 -c test-array-notations.cpp Intel(R) C++ Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 12.1.4.325 Build 20120410 Copyright (C) 1985-2012 Intel Corporation. All rights reserved. test-array-notations.cpp test-array-notations.cpp(2): (col. 2) remark: loop was not vectorized: existence of vector dependence. test-array-notations.cpp(3): (col. 3) remark: vector dependence: assumed FLOW dependence between a line 3 and e line 3. <snip> test-array-notations.cpp(7): (col. 6) remark: LOOP WAS VECTORIZED.
11
6/26/2012
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
And more
Reduction operations using __sec_reduce_*
Can even write custom reductions
Perform index-dependent operations using __sec_implicit_index(rank)
Gather/scatter notation
12
6/26/2012
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Intel® Cilk™ Plus Elemental Functions
The compiler can’t assume that user-defined functions are safe for vectorization
Now you can make your function an elemental function which indicates to the compiler that such a function can be applied to multiple elements of an array in parallel safely.
Specify __declspec(vector) on both function declarations and definitions as this will affect name-mangling.
Clauses provided to help direct the compiler on the code generated – see the documentation for details
13
6/26/2012
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Elemental Function Sample
double user_function(double x); __declspec(vector) double elemental_function(double x); double a[100]; double b[100]; void foo() { a[:] = user_function(b[:]); a[:] = elemental_function(b[:]); } icl /Qvec-report3 /c test-elemental-functions.cpp Intel(R) C++ Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 12.1.4.325 Build 20120410 Copyright (C) 1985-2012 Intel Corporation. All rights reserved. test-elemental-functions.cpp test-elemental-functions.cpp(9) : (col. 4) remark: LOOP WAS VECTORIZED. test-elemental-functions.cpp(8) : (col. 4) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.
14
6/26/2012
Intel® Cilk™ Plus Tasking
15
6/26/2012
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Intel® Cilk™ Plus Keywords
Cilk Plus adds three keywords to C and C++:
_Cilk_spawn
_Cilk_sync
_Cilk_for
If you #include <cilk/cilk.h>, you can write the keywords as cilk_spawn, cilk_sync, and cilk_for.
Cilk Plus runtime controls thread creation and scheduling. A thread pool is created prior to use of Cilk Plus keywords.
The number of threads matches the number of cores (including hyperthreads) by default, but can be controlled by the user.
16
6/26/2012
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
cilk_spawn and cilk_sync
cilk_spawn gives the runtime permission to run a child function
asynchronously.
• No 2nd thread is created or required!
• If there are no available workers, then the child will execute as a serial function call.
• The scheduler may steal the parent and run it in parallel with the child function.
• The parent is not guaranteed to run in parallel with the child.
• No 2nd thread is created or required!
cilk_sync waits for all children to complete before execution
proceeds from that point.
• There are implicit cilk_sync points – will discuss later
17
6/26/2012
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
A simple example
18
6/26/2012
Recursive computation of a Fibonacci number:
int fib(int n)
{
int x, y;
if (n < 2) return n;
x = cilk_spawn fib(n-1);
y = fib(n-2);
cilk_sync;
return x+y;
}
Asynchronous call must complete before using x.
Execution can continue while fib(n-1) is running.
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
cilk_for
Looks like a normal for loop.
cilk_for (int x = 0; x < 1000000; ++x) { … }
Any or all iterations may execute in parallel with one another.
All iterations complete before program continues.
Constraints:
• Limited to a single control variable.
• Must be able to jump to the start of any iteration at random.
• Iterations should be independent of one another.
19
6/26/2012
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Under the hood of cilk_for
void run_loop(first, last) { if ((last - first) < grainsize) { for (int i=first; i<last ++i) {LOOP_BODY;} } else { int mid = (last-first)/2; cilk_spawn run_loop(first, mid); run_loop(mid, last); } }
20
6/26/2012
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
cilk_for examples
cilk_for (int x; x < 1000000; x += 2) { … }
cilk_for (vector<int>::iterator x = y.begin(); x != y.end(); ++x) { … }
cilk_for (list<int>::iterator x = y.begin(); x != y.end(); ++x) { … }
• Loop count cannot be computed in constant time for a list. (y.end() – y.begin() is not defined.)
• Do not have random access to the elements of the list. (y.begin() + n is not defined.)
21
6/26/2012
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Serialization
Every Cilk Plus program has an equivalent serial program called the serialization
The serialization is obtained by removing cilk_spawn and cilk_sync keywords and replacing cilk_for with for
• The compiler will produce the serialization for you if you compile with /Qcilk-serialize (Windows*) or –cilk-
serialize (Linux*/OS X*)
Running with only one worker is equivalent to running the serialization.
22
6/26/2012
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Serial Semantics
A deterministic Intel® Cilk™ Plus program will have the same semantics as its serialization.
• Easier regression testing
• Easier to debug:
– Run with one core
– Run serialized
• Composable
• Advantages to hyperobjects
• Strong analysis tools (Cilk Plus SDK)
– cilkscreen race detector
– cilkview parallelism analyzer
23
6/26/2012
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Implicit Syncs
24
6/26/2012
void f() { cilk_spawn g(); cilk_for (int x = 0; x < lots; ++x) { ... } try { cilk_spawn h(); } catch (...) { ... } }
At end of a spawning function
At end of a cilk_for body (does not sync g())
At end of a try block containing a spawn
Before entering a try block containing a sync
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Composability
Implicit syncs are important for making each function call a “black box.”
Caller does not know or care if the called function spawns.
• Serial abstraction is maintained.
• Caller does not need to worry about data races with the called function.
cilk_sync synchronizes only children that were
spawned within the same function as the sync: no action at a distance.
25
6/26/2012
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
A summing example
26
6/26/2012
int compute(const X& v); int main() { const std::size_t n = 1000000; extern X myArray[n]; // ... int result = 0; for (std::size_t i = 0; i < n; ++i) { result += compute(myArray[i]); } std::cout << "The result is: " << result << std::endl; return 0; }
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Summing with Intel® Cilk™ Plus
27
6/26/2012
int compute(const X& v); int main() { const std::size_t n = 1000000; extern X myArray[n]; // ... int result = 0; cilk_for (std::size_t i = 0; i < n; ++i) { result += compute(myArray[i]); } std::cout << "The result is: " << result << std::endl; return 0; }
Race!
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Locking solution
28
6/26/2012
int compute(const X& v); int main() { const std::size_t n = 1000000; extern X myArray[n]; // ... mutex L; int result = 0; cilk_for (std::size_t i = 0; i < n; ++i) { int temp = compute(myArray[i]); L.lock(); result += temp; L.unlock(); } std::cout << "The result is: " << result << std::endl; return 0; }
Problems Lock overhead & lock contention.
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Solution with Intel® Cilk™ Plus Reducer
29
6/26/2012
int compute(const X& v); int main() { const std::size_t ARRAY_SIZE = 1000000; extern X myArray[ARRAY_SIZE]; // ... cilk::reducer_opadd<int> result; cilk_for (std::size_t i = 0; i < ARRAY_SIZE; ++i) { result += compute(myArray[i]); } std::cout << "The result is: " << result.get_value() << std::endl; return 0; }
Declare result to be a summing
reducer over int. Updates are resolved automatically without races or contention.
At the end, the underlying int value can be extracted.
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Hyperobject Library
Intel® Cilk™ Plus’s hyperobject library contains many commonly used reducers and more:
• reducer_list_append
• reducer_list_prepend
• reducer_max
• reducer_max_index
• reducer_min
• reducer_min_index
• reducer_opadd
• reducer_ostream
• reducer_basic_string
• holder
• …
You can also write your own using cilk::monoid_base and cilk::reducer.
30
6/26/2012
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Hyperobject Benefits and Limitations
Benefits
• Hyperobjects do not suffer from lock contention
• Hyperobjects retain serial semantics
– For example, a list or ostream reducer will retain the same order you would get from serial execution
– Even a correct locking solution cannot guarantee this
Limitations
• Operations on a hyperobject must be associative to behave deterministically
– Refer to the operators supported by a particular reducer class for safe operations to use
– Floating point types may get different results run-to-run
• If using custom data types for hyperobjects, refer to the header for the specific reducer for requirements
• Supported in C as well as C++, but interface is less elegant in C
31
6/26/2012
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Use Intel® Cilk™ Plus
Available in:
Intel® C++ Composer XE 2011
• Included in Intel® Parallel Studio XE 2011 SP1
• 30-day evaluation at http://software.intel.com/en-us/articles/intel-software-evaluation-center/
Open Source Patch to gcc* 4.7
• Get from http://cilk.com
32
6/26/2012
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Of further interest
For all things Intel® Cilk™ Plus • http://cilk.com
Intel® Threading Building Blocks • http://threadingbuildingblocks.org
Get support • http://premier.intel.com
User Forums • http://software.intel.com/en-us/forums/intel-cilk-plus
Past technical presentations • http://software.intel.com/en-us/articles/intel-software-
development-products-technical-presentations/ • Includes deeper dives into Cilk Plus, vectorization, Intel®
Cluster Studio, and more! 33
6/26/2012
Thank You
34
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR. Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information. The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. Go to: http://www.intel.com/products/processor_number Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Any software source code reprinted in this document is furnished under a software license and may only be used or copied in accordance with the terms of that license. http://software.intel.com/en-us/articles/intel-sample-source-code-license-agreement/?wapkw=(Samples+Software+License+Agreement) Intel Xeon, Core, and Cilk Plus are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. Copyright © 2012, Intel Corporation. All rights reserved.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804
Legal Disclaimer & Optimization Notice
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
35
6/26/2012 35