STAR‐CCM+ and STAR‐CD Performance Benchmark and Profiling
March 2009
2
Note
• The following research was performed under the HPC Advisory Council activities
– Participating vendors: AMD, Dell, Mellanox
– Compute resource - HPC Advisory Council Cluster Center
• The participating members would like to thank CD-adapco for their support and guidance
• For more information, please refer to
– www.mellanox.com, www.dell.com/hpc, www.amd.com
3
STAR-CCM+ and STAR-CD
• STAR-CCM+
– An engineering process-oriented CFD tool
– Client-server architecture, object-oriented programming
– Delivers the entire CFD process in a single integrated software
environment
• STAR-CD
– An integrated platform for multi-physics simulations
– A long established platform for industrial CFD simulation
– Bridging the gap between CFD and structural-mechanics
• Developed by CD-adapco
4
Objectives
• The presented research was performed to provide best practices for
– STAR-CCM+ and STAR-CD performance benchmarking
– Interconnect performance comparisons
– Ways to increase STAR-CCM+ and STAR-CD productivity
– Understanding STAR-CD communication patterns
– MPI library comparisons
5
Test Cluster Configuration
• Dell™ PowerEdge™ SC 1435 24-node cluster
• Quad-Core AMD Opteron™ 2382 (“Shanghai”) CPUs
• Mellanox® InfiniBand ConnectX® 20Gb/s (DDR) HCAs
• Mellanox® InfiniBand DDR Switch
• Memory: 16GB DDR2 800MHz per node
• OS: RHEL5U2, OFED 1.4 InfiniBand SW stack
• MPI: Platform MPI 5.6.5, HP-MPI 2.3
• Application: STAR-CCM+ Version 3.06, STAR-CD Version 4.08
• Benchmark Workload
– STAR-CD: A-Class (Turbulent Flow around A-Class Car)
– STAR-CCM+: Auto Aerodynamics test
6
Mellanox InfiniBand Solutions
• Industry standard
– Hardware, software, cabling, management
– Designed for clustering and storage interconnect
• Performance
– 40Gb/s node-to-node
– 120Gb/s switch-to-switch
– 1us application latency
– Most aggressive roadmap in the industry
• Reliable with congestion management
• Efficient
– RDMA and transport offload
– Kernel bypass
– CPU focuses on application processing
• Scalable for Petascale computing & beyond
• End-to-end quality of service
• Virtualization acceleration
• I/O consolidation including storage
[Chart: InfiniBand Delivers the Lowest Latency – the InfiniBand performance gap over Ethernet and Fibre Channel is increasing; bandwidth points shown: 20Gb/s, 40Gb/s, 60Gb/s, 80Gb/s (4X), 120Gb/s, 240Gb/s (12X)]
7
• Performance
– Quad-core
• Enhanced CPU IPC
• 4x 512KB L2 cache
• 6MB L3 cache
– Direct Connect Architecture
• HyperTransport™ Technology
• Up to 24 GB/s peak per processor
– Floating point
• 128-bit FPU per core
• 4 FLOPS/clk peak per core (see the worked example below)
– Integrated memory controller
• Up to 12.8 GB/s
• DDR2-800 MHz or DDR2-667 MHz
• Scalability
– 48-bit physical addressing
• Compatibility
– Same power/thermal envelopes as 2nd / 3rd generation AMD Opteron™ processors
[Diagram: Quad-Core AMD Opteron™ Processor – Direct Connect Architecture with dual-channel registered DDR2 attached to each processor, and PCI-E® bridges, I/O hubs, USB and PCI connected over 8 GB/s HyperTransport links]
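As a rough illustration of the figures above, the sketch below computes the theoretical peak floating-point rate and memory bandwidth of one dual-socket node in this cluster. The 2.6 GHz clock assumed for the Opteron 2382 is not stated on the slide; the remaining numbers come from the bullets above.

```python
# Back-of-the-envelope peak figures for one dual-socket node in the test cluster.
# Assumption: 2.6 GHz clock for the quad-core Opteron 2382 ("Shanghai");
# the other numbers are taken from the processor bullets above.

SOCKETS_PER_NODE = 2          # Dell PowerEdge SC 1435 is a two-socket server
CORES_PER_SOCKET = 4          # quad-core
FLOPS_PER_CLK    = 4          # peak FLOPS/clk per core (128-bit FPU)
CLOCK_GHZ        = 2.6        # assumed clock of the Opteron 2382
MEM_BW_PER_SOCKET_GBS = 12.8  # integrated DDR2-800 memory controller, per socket

peak_gflops = SOCKETS_PER_NODE * CORES_PER_SOCKET * FLOPS_PER_CLK * CLOCK_GHZ
peak_mem_bw = SOCKETS_PER_NODE * MEM_BW_PER_SOCKET_GBS

print(f"Peak compute per node : {peak_gflops:.1f} GFLOPS")  # ~83.2 GFLOPS
print(f"Peak memory bandwidth : {peak_mem_bw:.1f} GB/s")     # ~25.6 GB/s
```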
8
Dell PowerEdge Servers helping Simplify IT
• System Structure and Sizing Guidelines
– 24-node cluster built with Dell PowerEdge™ SC 1435 servers
– Servers optimized for High Performance Computing environments
– Building block foundations for best price/performance and performance/watt
• Dell HPC Solutions
– Scalable architectures for high performance and productivity
– Dell's comprehensive HPC services help manage the lifecycle requirements
– Integrated, tested and validated architectures
• Workload Modeling
– Optimized system size, configuration and workloads
– Test-bed Benchmarks
– ISV Applications Characterization
– Best Practices & Usage Analysis
9
STAR-CCM+ Benchmark Results - Interconnect
• Input Dataset
– Auto Aerodynamics test
• InfiniBand DDR delivers higher performance and scalability
– For any cluster size
– Up to 136% faster run time
[Chart: STAR-CCM+ Benchmark Results (Auto Aerodynamics Test) – Elapsed Time/Iteration vs. Number of Nodes (1-24) for 10GigE and InfiniBand DDR; HP MPI; lower is better]
10
STAR-CCM+ Productivity Results
• Two cases are presented
– Single job over the entire system
– Four jobs, each on two cores per server (see the launch sketch after the chart below)
• Productivity increases by allowing multiple jobs to run simultaneously
– Up to 30% increase in system productivity
[Chart: STAR-CCM+ Productivity (Auto Aerodynamics Test) – Iterations/Day vs. Number of Nodes (4-24) for 1 job vs. 4 jobs; HP MPI over InfiniBand; higher is better]
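A minimal sketch of the multi-job idea above: concurrent solver instances are pinned to disjoint core sets so they do not compete for the same cores. The example is a single-node illustration using the standard Linux taskset utility; the run_solver.sh launch script and case file names are hypothetical placeholders (in the benchmark itself, each job is an MPI run spanning all nodes with two ranks per node).

```python
# Illustrative sketch: launch four concurrent jobs on one 8-core node, each
# pinned to its own pair of cores with taskset. The solver command below is a
# placeholder; substitute the real launch line for your installation.
import subprocess

jobs = []
for job_id, cores in enumerate(("0,1", "2,3", "4,5", "6,7")):
    cmd = ["taskset", "-c", cores,
           "./run_solver.sh", f"case_{job_id}.sim"]  # hypothetical launch script
    jobs.append(subprocess.Popen(cmd))

# Wait for all four simulations to finish
for p in jobs:
    p.wait()
```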
11
STAR-CCM+ Productivity Results - Interconnect
• Test case
– Four jobs, each on two cores per server
• InfiniBand DDR provides higher productivity compared to 10GigE
– Up to 25% more iterations per day
• InfiniBand maintains consistent scalability as cluster size increases
[Chart: STAR-CCM+ Productivity (Auto Aerodynamics Test) – Iterations/Day vs. Number of Nodes (4-24) for 10GigE and InfiniBand DDR; HP MPI; higher is better]
12
STAR-CD Benchmark Results - Interconnect
• Test case
– Single job over the entire system
– Input dataset (A-Class)
• InfiniBand DDR enhances performance and scalability
– Up to 34% and 10% more jobs/day compared to GigE and 10GigE, respectively
– The performance advantage of InfiniBand increases as the cluster size scales
[Chart: STAR-CD Performance Advantage of InfiniBand and 10GigE over GigE (A-Class) – percentage vs. Number of Nodes (4-24); Platform MPI; higher is better]
13
Maximize STAR-CD Performance per Core
[Chart: STAR-CD Benchmark Results (A-Class) – Total Elapsed Time (s) for 4-cores/proc, 2-cores/proc and 1-core/proc configurations at 16, 32 and 48 total cores; Platform MPI; lower is better]
• Test case
– Single job over the entire system
– Using one, two or four cores in each quad-core AMD processor
– Remaining cores are kept idle
• Using a subset of the cores per processor improves single-job run time
• It is recommended to run multiple simulations simultaneously
– For maximum productivity (slide 10)
14
STAR-CCM+ Profiling – MPI Functions
• MPI_Testall, MPI_Bcast, and MPI_Recv are the most frequently used MPI functions
[Chart: STAR-CCM+ MPI Profiling – Number of Messages (log scale) per MPI function (MPI_Testall, MPI_Bcast, MPI_Reduce, MPI_Scatter, MPI_Recv) at 4, 8, 16 and 22 nodes]
15
STAR-CCM+ Profiling – Data Transferred
[Chart: STAR-CCM+ MPI Profiling – Number of Messages (Millions) by message size, from 0-64B up to 4MB and larger, at 4, 8, 16 and 22 nodes]
• Most MPI messages are within 4KB in size
• Number of messages increases with cluster size
16
STAR-CCM+ Profiling – Timing
• MPI_Testall, MPI_Bcast, and MPI_Recv have relatively large overhead
• Overhead increases with cluster size
[Chart: STAR-CCM+ MPI Profiling – Total Overhead (s) per MPI function (MPI_Testall, MPI_Bcast, MPI_Reduce, MPI_Scatter, MPI_Recv) at 4, 8, 16 and 22 nodes]
17
STAR-CCM+ Profiling Summary
• STAR-CCM+ was profiled to determine its networking dependency
• Most used message sizes
– <4KB messages
– Number of messages increases with cluster size
• Interconnect effect on STAR-CCM+ performance
– Both interconnect latency and throughput influence STAR-CCM+ performance due to its messaging pattern
18
STAR-CD Profiling – MPI Functions
• MPI_Sendrecv and MPI_Allreduce are the most frequently used MPI functions
[Chart: STAR-CD MPI Profiling – Number of Messages (Millions) per MPI function (MPI_Allgather, MPI_Allreduce, MPI_Barrier, MPI_Bcast, MPI_Sendrecv) at 4-24 nodes]
19
STAR-CD Profiling – Data Transferred
• Most point-to-point MPI messages are between 1KB and 8KB in size
• Number of messages increases with cluster size
[Chart: STAR-CD MPI Profiling (MPI_Sendrecv) – Number of Messages (Millions) by message size, from 0-128B up to 1MB, at 4-22 nodes]
20
STAR-CD Profiling – Data Transferred
• Most MPI collective messages are smaller than 128 bytes
• Number of messages increases with cluster size
[Chart: STAR-CD MPI Profiling (MPI_Allreduce) – Number of Messages (Millions) by message size, from 0-128B up to 1MB, at 4-24 nodes]
21
STAR-CD Profiling – Timing
[Chart: STAR-CD MPI Profiling – Total Overhead (s) per MPI function (MPI_Allgather, MPI_Allreduce, MPI_Barrier, MPI_Bcast, MPI_Sendrecv) at 4-24 nodes]
• MPI_Allreduce and MPI_Sendrecv have large overhead
• Overhead increases with cluster size
22
MPI Performance Comparison – MPI_Sendrecv
• HP-MPI demonstrates better performance for larger messages
[Charts: MPI_Sendrecv bandwidth (MB/s) vs. message size (128B-8KB) at 32 and 192 processes – Platform MPI vs. HP-MPI over InfiniBand DDR; higher is better]
23
MPI Performance Comparison – MPI_Allreduce
• HP-MPI demonstrates lower MPI_Allreduce runtime for small messages (see the micro-benchmark sketch below)
[Charts: MPI_Allreduce total time (usec) vs. message size for small messages at 32 and 192 processes – Platform MPI vs. HP-MPI over InfiniBand DDR; lower is better]
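The two comparisons above could be reproduced with a simple micro-benchmark. The sketch below, written with mpi4py and NumPy (an assumption, not the tool used for these results), times pairwise MPI_Sendrecv bandwidth and small-message MPI_Allreduce latency; launching it with 32 and 192 ranks would mirror the two panel configurations above.

```python
# Minimal mpi4py sketch of the two micro-benchmarks shown above:
# pairwise MPI_Sendrecv bandwidth and small-message MPI_Allreduce latency.
# Run with, e.g.:  mpirun -np 32 python mpi_microbench.py
# (assumes an even number of ranks; mpi4py/NumPy are assumptions, not the
#  benchmark actually used for the results above)
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
ITERS = 1000

def sendrecv_bandwidth(nbytes):
    """Exchange nbytes with a partner rank; return per-direction MB/s."""
    partner = rank ^ 1                      # pair even/odd ranks
    sendbuf = np.zeros(nbytes, dtype=np.uint8)
    recvbuf = np.empty_like(sendbuf)
    comm.Barrier()
    t0 = MPI.Wtime()
    for _ in range(ITERS):
        comm.Sendrecv(sendbuf, dest=partner, sendtag=0,
                      recvbuf=recvbuf, source=partner, recvtag=0)
    elapsed = MPI.Wtime() - t0
    return nbytes * ITERS / elapsed / 1e6

def allreduce_latency(nbytes):
    """Return the average MPI_Allreduce time in microseconds."""
    sendbuf = np.zeros(nbytes, dtype=np.uint8)
    recvbuf = np.empty_like(sendbuf)
    comm.Barrier()
    t0 = MPI.Wtime()
    for _ in range(ITERS):
        comm.Allreduce(sendbuf, recvbuf, op=MPI.MAX)
    return (MPI.Wtime() - t0) / ITERS * 1e6

for size in (128, 1024, 8192):
    bw = sendrecv_bandwidth(size)
    if rank == 0:
        print(f"MPI_Sendrecv {size:5d} B : {bw:8.1f} MB/s")

lat = allreduce_latency(8)
if rank == 0:
    print(f"MPI_Allreduce     8 B : {lat:6.2f} usec")
```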
24
STAR-CD Benchmark Results - MPI
• Test case
– Single job over the entire system
– Input dataset (A-Class)
• HP-MPI has slightly better performance with CPU affinity enabled
[Chart: STAR-CD Benchmark Results (A-Class) – Total Elapsed Time (s) vs. Number of Nodes (4-24) for Platform MPI and HP-MPI; lower is better]
25
STAR-CD Profiling Summary
• STAR-CD was profiled to determine its networking dependency
• Majority of data is transferred between compute nodes
– Medium-size messages
– Data transferred increases with cluster size
• Most used message sizes
– <128B messages – MPI_Allreduce
– 1KB-8KB messages – MPI_Sendrecv
• Total number of messages increases with cluster size
• Interconnect effect on STAR-CD performance
– Both interconnect latency (MPI_Allreduce) and throughput (MPI_Sendrecv) influence application performance
26
STAR-CD Performance with Power Management
• Test scenario
– 24 servers, 4 cores per processor
• Nearly identical performance with power management enabled or disabled
[Chart: STAR-CD Benchmark Results (A-Class) – Total Elapsed Time (s) with power management enabled vs. disabled; InfiniBand DDR; lower is better]
27
STAR-CD Benchmark – Power Consumption
• Power management reduces total system power consumption by 2%
[Chart: STAR-CD Benchmark Results (A-Class) – Power Consumption (Watts) with power management enabled vs. disabled; InfiniBand DDR; lower is better]
28
Power Cost Savings with Power Management
• Power management saves $248/year for the 24-node cluster (a worked example follows the chart below)
• As cluster size increases, larger savings are expected
InfiniBand DDR, 24-node cluster
$/year = total power consumption per year (kWh) * $0.20
For more information: http://enterprise.amd.com/Downloads/svrpwrusecompletefinal.pdf
[Chart: STAR-CD Benchmark Results Power Cost Comparison – $/Year with power management enabled vs. disabled]
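A worked example of the cost formula above, using approximate wattages read from the power-consumption chart on the previous slide (about 7,000W with power management disabled and roughly 2% less with it enabled), so the inputs should be treated as estimates.

```python
# Annual power cost per the formula above: $/year = kWh/year * $0.20.
# The wattage values are approximate readings from the power-consumption chart
# on the previous slide (24-node cluster, A-Class workload).
HOURS_PER_YEAR = 24 * 365   # 8760 h
PRICE_PER_KWH  = 0.20       # $/kWh, as used on this slide

def annual_cost(avg_watts):
    """Annual electricity cost for a constant average draw of avg_watts."""
    kwh_per_year = avg_watts / 1000.0 * HOURS_PER_YEAR
    return kwh_per_year * PRICE_PER_KWH

pm_disabled = annual_cost(7000)          # ~$12,264/year
pm_enabled  = annual_cost(7000 * 0.98)   # ~2% lower draw with power management
print(f"Power management disabled: ${pm_disabled:,.0f}/year")
print(f"Power management enabled : ${pm_enabled:,.0f}/year")
# The difference comes out near $245/year, in line with the ~$248 on this slide
print(f"Estimated saving         : ${pm_disabled - pm_enabled:,.0f}/year")
```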
29
Power Cost Savings with Different Interconnect
• InfiniBand saves ~$1,200 and ~$4,000 in power cost to finish the same number of STAR-CD jobs compared to 10GigE and GigE, respectively
– Per year, for the 24-node cluster
• As cluster size increases, more power can be saved
[Chart: Power Cost Savings of InfiniBand vs. 10GigE and GigE – Power Cost Savings ($) for 10GigE and GigE; 24-node cluster, power management enabled]
$/year = total power consumption per year (kWh) * $0.20
For more information: http://enterprise.amd.com/Downloads/svrpwrusecompletefinal.pdf
30
Conclusions
• STAR-CD and STAR-CCM+ are widely used CFD simulation software packages
• Performance and productivity rely on
– Scalable HPC systems and interconnect solutions
– Low-latency and high-throughput interconnect technology
– Reasonable process distribution, which can dramatically improve performance per core
• Interconnect comparison shows
– InfiniBand delivers superior performance at every cluster size
– Low-latency InfiniBand enables unparalleled scalability
• Power management provides a 2% saving in power consumption
– Per 24-node system with InfiniBand
– $248 power cost savings per year for the 24-node cluster
– Power savings increase with cluster size
• InfiniBand saves power cost
– Based on the number of jobs that can be finished per year
– InfiniBand enables $1,200 and $4,000 power cost savings compared to 10GigE and GigE, respectively
31
Thank You
HPC Advisory Council
All trademarks are property of their respective owners. All information is provided “As-Is” without any kind of warranty. The HPC Advisory Council makes no representation as to the accuracy or completeness of the information contained herein. The HPC Advisory Council and Mellanox undertake no duty and assume no obligation to update or correct any information presented herein.