
How to Write Powerful Parallel Applications

Agenda
08:30-09:00  Welcome and Coffee
09:00-09:45  Introduction to the Intel Microarchitecture and Software Implications
09:45-10:15  Introduction to the Software Design Cycle - From Serial to Parallel Applications
10:15-10:30  Break
10:30-11:30  How to Optimize Applications and Identify Areas for Parallelization
11:30-12:30  Introduction to Parallel Programming Methods
12:30-13:30  Lunch
13:30-14:30  Expressing Parallelism: Using Intel® C++ and Fortran Compilers, Professional Editions 10.1, for Performance and Multi-threading
14:30-15:15  Expressing Parallelism: Introducing Threading Through Libraries
15:15-15:30  Break
15:30-16:00  Pinpoint Program Inefficiencies and Threading Bugs - Data Races and Deadlocks
16:00-16:45  Performance Tuning Threaded Software Using Intel VTune Performance Analyzer and Thread Profiler
16:45-17:15  Parallel Programming Techniques and Program Testing in Cluster Environments

Intel® Core™ Microarchitecture
Edmund Preiss, EMEA Software Solutions Group

Core™ Architecture
• Moore's Law and processor evolution
• Introduction to the Core architecture
– New features added in 2007
– Intro to 45nm technology -> shrink
• New Core™ advanced features
• Selected software implications

Implications of Moore's Law
[Chart: as the number of transistors goes up, the cost per transistor goes down. Scaling + wafer size + volume = lower costs.]
Source: WSTS/Dataquest/Intel; Fortune Magazine

New Microarchitecture History
x86 microarchitecture lines: P5, P6, Intel NetBurst®, Banias; alongside EPIC* (Itanium®) and IXA* (XScale)
Examples:
– P5: Pentium®
– P6: Pentium® Pro, Pentium® II/III
– Intel NetBurst®: Pentium® 4, Pentium® D, Xeon®
– Banias: Pentium® M, Core Duo®
– Intel® Core™: Merom, Conroe, Woodcrest
* IXA – Intel Internet Exchange Architecture / EPIC – Explicitly Parallel Instruction Computing

Intel Processor Family Design Cycles
[Timeline: a new microarchitecture alternates with a shrink/derivative roughly every 2 years.]
– 65nm shrink/derivative: Presler · Yonah · Dempsey
– 65nm new microarchitecture: Intel® Core™ Microarchitecture
– 45nm shrink/derivative: Penryn family
– 45nm new microarchitecture: Nehalem
– 32nm: the next shrink/derivative and new microarchitecture
Goals:
• Increase performance per given clock cycle
• Increase processor frequencies
• Extend energy efficiency
• Deliver the lead product for the 45nm high-k + metal gate process technology
• Deliver optimized processors across each product segment and power envelope

Details of the Intel Core Architecture

Intel Core Innovations
• Intel® Wide Dynamic Execution
• Intel® Advanced Digital Media Boost
• Intel® Intelligent Power Capability
• Intel® Smart Memory Access
• Intel® Advanced Smart Cache
[Diagram: the five innovations - labeled wider, deeper, faster, smarter - around Core 1 and Core 2 sharing one L2 cache on the bus.]

Core™ vs. NetBurst™ µ-arch: Overview

Processor component          Intel NetBurst™         Intel Core™
Pipeline Stages              31                      14
Threads per core             2                       1
L1 Cache Org.                12K uop I / 16K Data    32K I / 32K Data
L2 Cache Org.                2 x 2MB                 1 x 4MB (shared)
Instr. Decoders              1                       4
Integer Units                2 (2x core freq)        3 (1x core freq)
SIMD Units                   2 x 64-bit              3 x 128-bit
FP Units                     3 (Add/Mul/Div)         3 (Add/Mul/Div)
FP Inst. Issued per clock    1                       Up to 2 (Add + Mul or Div)
SIMD Inst. Issued per clock  1                       3
Power                        135W                    80W

(Unit counts are per core.)

45nm Technology
• Penryn – code name for an enhanced Intel® Core™ microarchitecture at 45 nm
– Industry's first 45 nm high-k processor technology
– ~2x transistor density
– >20% gain in transistor switching speed
– ~30% decrease in transistor switching power
– Dual core, quad core
– Shared L2 cache
– Intel 64 architecture
– 128-bit SSE
["Penryn"/"Wolfdale"/"Wolfdale DP" dual-core package: 2 threads, 1 package (similar to the Intel® Core™ 2 Duo processor); each core with a 32K I-cache and 32K D-cache, a 6M shared L2 cache, and the bus.]

Core™ Microarchitecture
[Block diagram: Instruction Fetch and Pre-Decode -> Instruction Queue -> Decode (fed by the uCode ROM) -> Rename/Alloc -> Schedulers -> three execution ports (ALU/Branch, ALU/FAdd, ALU/FMul, each with MMX/SSE and FP move) plus Load and Store ports -> L1 D-Cache and D-TLB; Retirement Unit (Reorder Buffer); 2MB/4MB shared L2 cache; FSB at up to 10.4 GB/s.]

Intel® Core™ Microarchitecture
Primary interfaces:
• Front end
• Execution
• Memory
[Same block diagram as above, adding the Memory Order Buffer and per-stage width annotations, with the FSB at up to 10.6 GB/s.]

Intel® Core™ Microarchitecture
Front End
[Block diagram with the front end highlighted: Instruction Fetch and Pre-Decode -> Instruction Queue -> Decode.]
• Up to 6 instructions per cycle can be sent to the IQ
• Typical programs average slightly less than 4 bytes per instruction
• 4 decoders: 1 "large" and 3 "small"
– All decoders handle "simple" 1-uop instructions
– The large decoder handles instructions of up to 4 uops
• Detects short loops and locks them in the instruction queue (IQ)
– Reduced front-end power consumption - a total saving of up to 14%

Without Macro-Fusion
Read five instructions from the instruction queue; each instruction gets decoded separately.

Instruction Queue:
    load  eax, [mem1]
    cmp   eax, [mem2]
    jne   targ
    store [mem3], ebx
    inc   esp

Cycle 1:  dec0: load eax, [mem1] | dec1: cmp eax, [mem2] | dec2: jne targ | dec3: store [mem3], ebx
Cycle 2:  dec0: inc esp

With Intel's New Macro-Fusion
Read five instructions from the instruction queue and send the fusable pair to a single decoder - all in one cycle.

Instruction Queue:
    load  eax, [mem1]
    cmp   eax, [mem2]
    jne   targ
    store [mem3], ebx
    inc   esp

Cycle 1:  dec0: load eax, [mem1] | dec1: cmpjne eax, [mem2], targ | dec2: store [mem3], ebx | dec3: inc esp

Macro-fusion and the additional fourth decoder together give a 66% improvement in decode throughput - effectively five instructions decoded per cycle instead of three.
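As a rough illustration - not from the original deck - here is a minimal C++ sketch of the pattern macro-fusion targets: a compare immediately followed by a conditional branch. Whether a particular cmp/jcc pair actually fuses depends on the instruction forms and execution mode, and the function name and data here are made up for the example.

#include <cstddef>

// Counts the elements below a threshold. Each loop iteration ends in
// a compare immediately followed by a conditional jump (the loop test
// and the data test) - exactly the cmp/jcc pattern that macro-fusion
// can merge into a single uop at decode time.
std::size_t count_below(const int* data, std::size_t n, int threshold) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i) {  // cmp i, n / jcc - fusable pair
        if (data[i] < threshold)           // cmp data[i], threshold / jcc - fusable pair
            ++count;
    }
    return count;
}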

Intel® Core™ Microarchitecture
Execution: Out-of-Order
[Block diagram with the out-of-order engine highlighted: Rename/Alloc, Schedulers, the dispatch ports, and the Retirement Unit (Reorder Buffer).]
• 4 uops renamed / retired per clock
• Uops written to RS and ROB
– RS waits for sources to arrive, allowing OOO execution
– ROB waits for results to show up for retirement
• 6 dispatch ports from the RS
– 3 execution ports (integer / FP / SIMD)
– load
– store (address)
– store (data)
• 128-bit SSE implementation
– Port 0 has packed multiply (4 cycles SP, 5 DP, pipelined)
– Port 1 has packed add (3 cycles, all precisions)
• FP data has one additional cycle of bypass latency
– Do not mix SSE FP and SSE integer ops on the same register (see the sketch below)
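To make the port assignment concrete, here is a small SSE sketch - mine, not the deck's: the packed multiply issues on port 0 while the packed add issues on port 1, and everything stays in the SSE FP domain, so the extra bypass-latency cycle for crossing into the integer-SIMD domain is never paid. The function name is an assumption for the example.

#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics

// y[i] = a * x[i] + y[i], four floats at a time. The packed multiply
// can issue on port 0 while the packed add issues on port 1, and all
// operations keep the registers in the SSE FP domain, avoiding the
// one-cycle bypass penalty for mixing FP and integer SSE ops.
void saxpy_sse(float* y, const float* x, float a, std::size_t n) {
    const __m128 va = _mm_set1_ps(a);
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);
        __m128 vy = _mm_loadu_ps(y + i);
        vy = _mm_add_ps(_mm_mul_ps(va, vx), vy);  // mul (port 0), add (port 1)
        _mm_storeu_ps(y + i, vy);
    }
    for (; i < n; ++i)  // scalar tail for the leftover elements
        y[i] = a * x[i] + y[i];
}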

Intel® Advanced Digital Media Boost
[Diagram: a 128-bit SSE operation (SSE/SSE2/SSE3) combines four packed source pairs, X1 op Y1 through X4 op Y4, into one destination. Previously the decode and execute of the 128-bit operation was split across clock cycles 1 and 2; in each core it now executes in a single cycle.]
ADVANTAGE
• Increased performance
• 128-bit single cycle in each core
• Improved energy efficiency
*Graphics not representative of actual die photo or relative size

Intel® Core™ Microarchitecture
Memory sub-system
[Block diagram with the memory sub-system highlighted: the Load and Store ports, the Memory Order Buffer, the L1 D-Cache and D-TLB, the 2M/4M shared L2 cache, and the FSB at up to 10.6 GB/s.]
• Loads & stores – 128-bit load and 128-bit store per cycle
• Data prefetching (see the sketch after this list)
• Memory disambiguation
• Shared cache
• L1D cache prefetching
– Data Cache Unit Prefetcher (aka streaming prefetcher): recognizes ascending access patterns in recently loaded data and prefetches the next line into the processor's cache
– Instruction-Based Stride Prefetcher: prefetches based upon a load having a regular stride; can prefetch forward or backward 2 KBytes (1/2 the default page size)
• L2 cache prefetching: Data Prefetch Logic (DPL)
– Prefetches data to the 2nd-level cache before the DCU requests the data
– Maintains 2 tables for tracking loads: upstream – 16 entries; downstream – 4 entries
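The practical upshot for software - my example, not the deck's: walk memory in ascending unit stride where you can, so the streaming prefetcher can run ahead of the loads; large constant strides are left to the stride prefetcher, which only reaches about 2 KB ahead. A hypothetical sketch:

#include <cstddef>
#include <vector>

// Sums a row-major matrix two ways. The row-order walk touches memory
// in ascending, unit-stride order, which the DCU streaming prefetcher
// recognizes; the column-order walk jumps cols*sizeof(double) bytes per
// access, which only the stride prefetcher can track, and only while
// the stride stays within its ~2 KB reach.
double sum_row_order(const std::vector<double>& m, std::size_t rows, std::size_t cols) {
    double s = 0.0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            s += m[r * cols + c];  // ascending, unit stride
    return s;
}

double sum_col_order(const std::vector<double>& m, std::size_t rows, std::size_t cols) {
    double s = 0.0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            s += m[r * cols + c];  // large constant stride
    return s;
}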

Intel Smart Memory Access: Prefetchers
[Animation spread over several slides: four in-flight loads, Load1 (oldest) to Load4 (youngest), queued in front of the L1 data cache and the shared L2 data cache.]
• Memory is too far away
• Caches are closer when they have the data
• Prefetchers detect the application's data reference patterns
• ...and bring the data closer to the data consumer
• Solving the problem of Where

Some Implications of Core 2 Architecture for Developers Who Want to Thread Their Apps

Advanced Smart Cache Benefits
– Two threads which "communicate" frequently should be scheduled onto the two cores sharing an L2 cache
– Use the thread/processor affinity feature in your applications (a sketch follows below)
[Diagram: a quad-core processor on one FSB; Core 1 and Core 2 share one L2 cache, Core 3 and Core 4 share the other.]
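As a sketch of one way to apply that advice - Linux/glibc-specific, not from the deck, and with a CPU numbering that is only an assumption (the mapping of logical CPUs to shared caches varies by platform):

#include <pthread.h>
#include <sched.h>

// Pin the calling thread to one logical CPU (glibc extension; on
// Windows the analogue is SetThreadAffinityMask). Pinning two
// communicating threads onto two cores that share an L2 keeps the
// cache lines they exchange inside that shared cache.
bool pin_current_thread(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}

// Usage (assumption: logical CPUs 0 and 1 are the two cores sharing an
// L2 on this machine - verify the topology before relying on it):
//   producer thread:  pin_current_thread(0);
//   consumer thread:  pin_current_thread(1);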

Memory Related
Avoid False Sharing
• What is false sharing?
• Multiple threads repeatedly write to the same cache line shared by processors
– Usually to different data
– Cache lines get invalidated, forcing additional reads from memory
– Severe performance impact, in general in tight loops where threads read/write the same cache line very rapidly (see the sketch after this list)
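A minimal sketch of the problem and the usual padding fix - my illustration, not the deck's. It assumes a 64-byte cache line and uses C++11 alignas for brevity; older compilers would use a vendor alignment attribute or manual padding.

#include <cstddef>

// Packed: both counters live in one cache line, so when thread 0 bumps
// 'a' and thread 1 bumps 'b', each write invalidates the line in the
// other core's cache - false sharing, even though the threads never
// touch the same variable.
struct PackedCounters {
    long a;  // written by thread 0
    long b;  // written by thread 1
};

// Padded: each counter gets its own 64-byte cache line (assumed line
// size), so the two threads no longer invalidate each other.
struct alignas(64) PaddedCounter {
    long value;
    char pad[64 - sizeof(long)];
};

PaddedCounter counters[2];  // counters[0] for thread 0, counters[1] for thread 1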

Some Words on Pipelines (1)
• Modern CPUs may be understood by considering their basic design paradigm, the so-called pipeline. The pipeline is designed to break up the processing of a single instruction into independent parts that ideally are executed in an identical time window.
• The independent parts of the processing are called pipeline stages.
• Since identical processing time in each stage can't be guaranteed, most pipeline stages control a buffer or queue that supplies instructions if the previous stage is still busy, or in which instructions can be stored if the next stage is still busy.
• Underflow or overflow of a queue will cause the respective stage to run idle and will cause a pipeline stall.
[Diagram: Fetch -> Decode -> Allocate -> Execute -> Retire, with a buffer between stages; a full buffer in front of a busy stage backs the pipeline up, and an empty buffer leaves the next stage idle - either way, a stall.]

Some Words on Pipelines (2)
• In order to achieve the best performance, pipeline stalls must be avoided
• Since Core 2 performance relies on speculative execution, a wrongly taken branch might lead to a pipeline flush to keep the instructions consistent
• Pipeline flushes must be avoided
• Understanding the Core 2 pipeline and being able to detect pipeline problems will greatly improve the performance of your software
• Knowledge of the pipeline and its registers increases the understanding and efficient usage of the VTune Performance Analyzer
– E.g. look for cache misses and branch mispredictions (one mitigation is sketched below)
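One common way to remove a hard-to-predict branch - again my sketch, not material from the deck - is to express the hot condition as straight-line arithmetic so the compiler can emit a conditional move instead of a jump (not guaranteed; check the generated assembly):

#include <algorithm>
#include <cstddef>

// A data-dependent branch on unpredictable input mispredicts often,
// and each misprediction flushes the pipeline.

// Branchy version: the if() mispredicts on random data.
long sum_clamped_branchy(const int* v, std::size_t n, int limit) {
    long s = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (v[i] < limit) s += v[i];
        else              s += limit;
    }
    return s;
}

// Branchless version: std::min typically compiles to a conditional
// move, so there is no branch to mispredict in the loop body.
long sum_clamped_branchless(const int* v, std::size_t n, int limit) {
    long s = 0;
    for (std::size_t i = 0; i < n; ++i)
        s += std::min(v[i], limit);
    return s;
}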

Uop Flow – Refer to VTune Event Counters
[Block diagram: Next IP and the Branch Target Buffer feed a 32 KB instruction cache; Fetch/Decode (4-issue instruction decode, with the Microcode Sequencer and the Register Allocation Table, RAT); the Re-Order Buffer (ROB, 96 entries) with the IA register set; the Reservation Stations (RS, 32 entries) dispatching through the scheduler ports to the execution units (FP Add, FP Div/Mul, Integer Shift/Rotate, three SIMD integer arithmetic units, Load, Store Address, Store Data) and the Memory Order Buffer (MOB); a 32 KB data cache and the bus unit to the L2 cache.]
• RESOURCE_STALLS measures at the transfer from decode
• RS_UOPS_DISPATCHED measures at execution
• UOPS_RETIRED measures at retirement
Detailed description in the processor manuals: http://www.intel.com/products/processor/manuals/

Backup