STAR‐CCM+ and STAR‐CD Performance Benchmark and Profiling
March 2009
2
Note
• The following research was performed under the HPC Advisory Council activities
– Participating vendors: AMD, Dell, Mellanox
– Compute resource - HPC Advisory Council Cluster Center
• The participating members would like to thank CD-adapco for their support and guidance
• For more information, please refer to
– www.mellanox.com, www.dell.com/hpc, www.amd.com
3
STAR-CCM+ and STAR-CD
• STAR-CCM+
– An engineering process-oriented CFD tool
– Client-server architecture, object-oriented programming
– Delivers the entire CFD process in a single integrated software
environment
• STAR-CD
– An integrated platform for multi-physics simulations
– A long established platform for industrial CFD simulation
– Bridging the gap between CFD and structural-mechanics
• Developed by CD-adapco
4
Objectives
• The presented research was performed to provide best practices for
– STAR-CCM+ and STAR-CD performance benchmarking
– Interconnect performance comparisons
– Ways to increase STAR-CCM+ and STAR-CD productivity
– Understanding STAR-CD communication patterns
– MPI library comparisons
5
Test Cluster Configuration
• Dell™ PowerEdge™ SC 1435 24-node cluster
• Quad-Core AMD Opteron™ 2382 (“Shanghai”) CPUs
• Mellanox® InfiniBand ConnectX® 20Gb/s (DDR) HCAs
• Mellanox® InfiniBand DDR Switch
• Memory: 16GB DDR2 800MHz per node
• OS: RHEL5U2, OFED 1.4 InfiniBand SW stack
• MPI: Platform MPI 5.6.5, HP-MPI 2.3
• Application: STAR-CCM+ Version 3.06, STAR-CD Version 4.08
• Benchmark Workload
– STAR-CD: A-Class (Turbulent Flow around A-Class Car)
– STAR-CCM+: Auto Aerodynamics test
6
Mellanox InfiniBand Solutions
• Industry standard
– Hardware, software, cabling, management
– Designed for clustering and storage interconnect
• Performance
– 40Gb/s node-to-node
– 120Gb/s switch-to-switch
– 1us application latency
– Most aggressive roadmap in the industry
• Reliable with congestion management
• Efficient
– RDMA and transport offload
– Kernel bypass
– CPU focuses on application processing
• Scalable for Petascale computing & beyond
• End-to-end quality of service
• Virtualization acceleration
• I/O consolidation including storage
[Chart: InfiniBand Delivers the Lowest Latency – the InfiniBand performance gap over Ethernet and Fibre Channel is increasing; bandwidth points shown: 20Gb/s, 40Gb/s, 60Gb/s, 80Gb/s (4X), 120Gb/s, 240Gb/s (12X)]
7
• Performance
– Quad-core
• Enhanced CPU IPC
• 4x 512KB L2 cache
• 6MB L3 cache
– Direct Connect Architecture
• HyperTransport™ Technology
• Up to 24 GB/s peak per processor
– Floating point
• 128-bit FPU per core
• 4 FLOPS/clk peak per core (see the worked example below)
– Integrated memory controller
• Up to 12.8 GB/s
• DDR2-800 MHz or DDR2-667 MHz
• Scalability
– 48-bit physical addressing
• Compatibility
– Same power/thermal envelopes as 2nd / 3rd generation AMD Opteron™ processors
[Diagram: Quad-Core AMD Opteron™ Processor – Direct Connect Architecture with dual-channel registered DDR2 attached to each processor, and PCI-E® bridges, I/O hubs, USB and PCI connected over 8 GB/s HyperTransport links]
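As a rough illustration of the figures above, the sketch below computes the theoretical peak floating-point rate and memory bandwidth of one dual-socket node in this cluster. The 2.6 GHz clock assumed for the Opteron 2382 is not stated on the slide; the remaining numbers come from the bullets above.

```python
# Back-of-the-envelope peak figures for one dual-socket node in the test cluster.
# Assumption: 2.6 GHz clock for the quad-core Opteron 2382 ("Shanghai");
# the other numbers are taken from the processor bullets above.

SOCKETS_PER_NODE = 2          # Dell PowerEdge SC 1435 is a two-socket server
CORES_PER_SOCKET = 4          # quad-core
FLOPS_PER_CLK    = 4          # peak FLOPS/clk per core (128-bit FPU)
CLOCK_GHZ        = 2.6        # assumed clock of the Opteron 2382
MEM_BW_PER_SOCKET_GBS = 12.8  # integrated DDR2-800 memory controller, per socket

peak_gflops = SOCKETS_PER_NODE * CORES_PER_SOCKET * FLOPS_PER_CLK * CLOCK_GHZ
peak_mem_bw = SOCKETS_PER_NODE * MEM_BW_PER_SOCKET_GBS

print(f"Peak compute per node : {peak_gflops:.1f} GFLOPS")  # ~83.2 GFLOPS
print(f"Peak memory bandwidth : {peak_mem_bw:.1f} GB/s")     # ~25.6 GB/s
```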
8
Dell PowerEdge Servers helping Simplify IT
• System Structure and Sizing Guidelines
– 24-node cluster built with Dell PowerEdge™ SC 1435 servers
– Servers optimized for High Performance Computing environments
– Building block foundations for best price/performance and performance/watt
• Dell HPC Solutions
– Scalable architectures for high performance and productivity
– Dell's comprehensive HPC services help manage the lifecycle requirements
– Integrated, tested and validated architectures
• Workload Modeling
– Optimized system size, configuration and workloads
– Test-bed Benchmarks
– ISV Applications Characterization
– Best Practices & Usage Analysis
9
STAR-CCM+ Benchmark Results - Interconnect
• Input Dataset
– Auto Aerodynamics test
• InfiniBand DDR delivers higher performance and scalability
– For any cluster size
– Up to 136% faster run time
[Chart: STAR-CCM+ Benchmark Results (Auto Aerodynamics Test) – Elapsed Time/Iteration vs. Number of Nodes (1-24) for 10GigE and InfiniBand DDR; HP MPI; lower is better]
10
STAR-CCM+ Productivity Results
• Two cases are presented
– Single job over the entire system
– Four jobs, each on two cores per server (see the launch sketch after the chart below)
• Productivity increases by allowing multiple jobs to run simultaneously
– Up to 30% increase in system productivity
[Chart: STAR-CCM+ Productivity (Auto Aerodynamics Test) – Iterations/Day vs. Number of Nodes (4-24) for 1 job vs. 4 jobs; HP MPI over InfiniBand; higher is better]
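A minimal sketch of the multi-job idea above: concurrent solver instances are pinned to disjoint core sets so they do not compete for the same cores. The example is a single-node illustration using the standard Linux taskset utility; the run_solver.sh launch script and case file names are hypothetical placeholders (in the benchmark itself, each job is an MPI run spanning all nodes with two ranks per node).

```python
# Illustrative sketch: launch four concurrent jobs on one 8-core node, each
# pinned to its own pair of cores with taskset. The solver command below is a
# placeholder; substitute the real launch line for your installation.
import subprocess

jobs = []
for job_id, cores in enumerate(("0,1", "2,3", "4,5", "6,7")):
    cmd = ["taskset", "-c", cores,
           "./run_solver.sh", f"case_{job_id}.sim"]  # hypothetical launch script
    jobs.append(subprocess.Popen(cmd))

# Wait for all four simulations to finish
for p in jobs:
    p.wait()
```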
11
STAR-CCM+ Productivity Results - Interconnect
• Test case
– Four jobs, each on two cores per server
• InfiniBand DDR provides higher productivity compared to 10GigE
– Up to 25% more iterations per day
• InfiniBand maintains consistent scalability as cluster size increases
[Chart: STAR-CCM+ Productivity (Auto Aerodynamics Test) – Iterations/Day vs. Number of Nodes (4-24) for 10GigE and InfiniBand DDR; HP MPI; higher is better]
12
STAR-CD Benchmark Results - Interconnect
• Test case
– Single job over the entire system
– Input dataset (A-Class)
• InfiniBand DDR enhances performance and scalability
– Up to 34% and 10% more jobs/day compared to GigE and 10GigE, respectively
– The performance advantage of InfiniBand increases as the cluster size scales
[Chart: STAR-CD Performance Advantage of InfiniBand and 10GigE over GigE (A-Class) – percentage vs. Number of Nodes (4-24); Platform MPI; higher is better]
13
Maximize STAR-CD Performance per Core
[Chart: STAR-CD Benchmark Results (A-Class) – Total Elapsed Time (s) for 4-cores/proc, 2-cores/proc and 1-core/proc configurations at 16, 32 and 48 total cores; Platform MPI; lower is better]
• Test case
– Single job over the entire system
– Using one, two or four cores in each quad-core AMD processor
– Remaining cores are kept idle
• Using a subset of the cores per processor improves single-job run time
• It is recommended to run multiple simulations simultaneously
– For maximum productivity (slide 10)
14
STAR-CCM+ Profiling – MPI Functions
• MPI_Testall, MPI_Bcast, and MPI_Recv are the most frequently used MPI functions
[Chart: STAR-CCM+ MPI Profiling – Number of Messages (log scale) per MPI function (MPI_Testall, MPI_Bcast, MPI_Reduce, MPI_Scatter, MPI_Recv) at 4, 8, 16 and 22 nodes]
15
STAR-CCM+ Profiling – Data Transferred
[Chart: STAR-CCM+ MPI Profiling – Number of Messages (Millions) by message size, from 0-64B up to 4MB and larger, at 4, 8, 16 and 22 nodes]
• Most MPI messages are within 4KB in size
• Number of messages increases with cluster size
16
STAR-CCM+ Profiling – Timing
• MPI_Testall, MPI_Bcast, and MPI_Recv have relatively large overhead
• Overhead increases with cluster size
[Chart: STAR-CCM+ MPI Profiling – Total Overhead (s) per MPI function (MPI_Testall, MPI_Bcast, MPI_Reduce, MPI_Scatter, MPI_Recv) at 4, 8, 16 and 22 nodes]
17
STAR-CCM+ Profiling Summary
• STAR-CCM+ was profiled to determine its networking dependency
• Most used message sizes
– <4KB messages
– Number of messages increases with cluster size
• Interconnect effect on STAR-CCM+ performance
– Both interconnect latency and throughput influence STAR-CCM+ performance due to its messaging pattern
18
STAR-CD Profiling – MPI Functions
• MPI_Sendrecv and MPI_Allreduce are the most frequently used MPI functions
[Chart: STAR-CD MPI Profiling – Number of Messages (Millions) per MPI function (MPI_Allgather, MPI_Allreduce, MPI_Barrier, MPI_Bcast, MPI_Sendrecv) at 4-24 nodes]
19
STAR-CD Profiling – Data Transferred
• Most point-to-point MPI messages are between 1KB and 8KB in size
• Number of messages increases with cluster size
[Chart: STAR-CD MPI Profiling (MPI_Sendrecv) – Number of Messages (Millions) by message size, from 0-128B up to 1MB, at 4-22 nodes]
20
STAR-CD Profiling – Data Transferred
• Most MPI collective messages are smaller than 128 bytes
• Number of messages increases with cluster size
[Chart: STAR-CD MPI Profiling (MPI_Allreduce) – Number of Messages (Millions) by message size, from 0-128B up to 1MB, at 4-24 nodes]
21
STAR-CD Profiling – Timing
[Chart: STAR-CD MPI Profiling – Total Overhead (s) per MPI function (MPI_Allgather, MPI_Allreduce, MPI_Barrier, MPI_Bcast, MPI_Sendrecv) at 4-24 nodes]
• MPI_Allreduce and MPI_Sendrecv have large overhead
• Overhead increases with cluster size
22
MPI Performance Comparison – MPI_Sendrecv
• HP-MPI demonstrates better performance for larger messages
[Charts: MPI_Sendrecv bandwidth (MB/s) vs. message size (128B-8KB) at 32 and 192 processes – Platform MPI vs. HP-MPI over InfiniBand DDR; higher is better]
23
MPI Performance Comparison – MPI_Allreduce
• HP-MPI demonstrates lower MPI_Allreduce runtime for small messages (see the micro-benchmark sketch below)
[Charts: MPI_Allreduce total time (usec) vs. message size for small messages at 32 and 192 processes – Platform MPI vs. HP-MPI over InfiniBand DDR; lower is better]
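The two comparisons above could be reproduced with a simple micro-benchmark. The sketch below, written with mpi4py and NumPy (an assumption, not the tool used for these results), times pairwise MPI_Sendrecv bandwidth and small-message MPI_Allreduce latency; launching it with 32 and 192 ranks would mirror the two panel configurations above.

```python
# Minimal mpi4py sketch of the two micro-benchmarks shown above:
# pairwise MPI_Sendrecv bandwidth and small-message MPI_Allreduce latency.
# Run with, e.g.:  mpirun -np 32 python mpi_microbench.py
# (assumes an even number of ranks; mpi4py/NumPy are assumptions, not the
#  benchmark actually used for the results above)
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
ITERS = 1000

def sendrecv_bandwidth(nbytes):
    """Exchange nbytes with a partner rank; return per-direction MB/s."""
    partner = rank ^ 1                      # pair even/odd ranks
    sendbuf = np.zeros(nbytes, dtype=np.uint8)
    recvbuf = np.empty_like(sendbuf)
    comm.Barrier()
    t0 = MPI.Wtime()
    for _ in range(ITERS):
        comm.Sendrecv(sendbuf, dest=partner, sendtag=0,
                      recvbuf=recvbuf, source=partner, recvtag=0)
    elapsed = MPI.Wtime() - t0
    return nbytes * ITERS / elapsed / 1e6

def allreduce_latency(nbytes):
    """Return the average MPI_Allreduce time in microseconds."""
    sendbuf = np.zeros(nbytes, dtype=np.uint8)
    recvbuf = np.empty_like(sendbuf)
    comm.Barrier()
    t0 = MPI.Wtime()
    for _ in range(ITERS):
        comm.Allreduce(sendbuf, recvbuf, op=MPI.MAX)
    return (MPI.Wtime() - t0) / ITERS * 1e6

for size in (128, 1024, 8192):
    bw = sendrecv_bandwidth(size)
    if rank == 0:
        print(f"MPI_Sendrecv {size:5d} B : {bw:8.1f} MB/s")

lat = allreduce_latency(8)
if rank == 0:
    print(f"MPI_Allreduce     8 B : {lat:6.2f} usec")
```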
24
STAR-CD Benchmark Results - MPI
• Test case
– Single job over the entire system
– Input dataset (A-Class)
• HP-MPI has slightly better performance with CPU affinity enabled
[Chart: STAR-CD Benchmark Results (A-Class) – Total Elapsed Time (s) vs. Number of Nodes (4-24) for Platform MPI and HP-MPI; lower is better]
25
STAR-CD Profiling Summary
• STAR-CD was profiled to determine its networking dependency
• Majority of data is transferred between compute nodes
– Medium-size messages
– Data transferred increases with cluster size
• Most used message sizes
– <128B messages – MPI_Allreduce
– 1KB-8KB messages – MPI_Sendrecv
• Total number of messages increases with cluster size
• Interconnect effect on STAR-CD performance
– Both interconnect latency (MPI_Allreduce) and throughput (MPI_Sendrecv) influence application performance
26
STAR-CD Performance with Power Management
• Test scenario
– 24 servers, 4 cores per processor
• Nearly identical performance with power management enabled or disabled
[Chart: STAR-CD Benchmark Results (A-Class) – Total Elapsed Time (s) with power management enabled vs. disabled; InfiniBand DDR; lower is better]
27
STAR-CD Benchmark – Power Consumption
• Power management reduces total system power consumption by 2%
[Chart: STAR-CD Benchmark Results (A-Class) – Power Consumption (Watts) with power management enabled vs. disabled; InfiniBand DDR; lower is better]
28
Power Cost Savings with Power Management
• Power management saves $248/year for the 24-node cluster (a worked example follows the chart below)
• As cluster size increases, larger savings are expected
InfiniBand DDR, 24-node cluster
$/year = total power consumption per year (kWh) * $0.20
For more information: http://enterprise.amd.com/Downloads/svrpwrusecompletefinal.pdf
[Chart: STAR-CD Benchmark Results Power Cost Comparison – $/Year with power management enabled vs. disabled]
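A worked example of the cost formula above, using approximate wattages read from the power-consumption chart on the previous slide (about 7,000W with power management disabled and roughly 2% less with it enabled), so the inputs should be treated as estimates.

```python
# Annual power cost per the formula above: $/year = kWh/year * $0.20.
# The wattage values are approximate readings from the power-consumption chart
# on the previous slide (24-node cluster, A-Class workload).
HOURS_PER_YEAR = 24 * 365   # 8760 h
PRICE_PER_KWH  = 0.20       # $/kWh, as used on this slide

def annual_cost(avg_watts):
    """Annual electricity cost for a constant average draw of avg_watts."""
    kwh_per_year = avg_watts / 1000.0 * HOURS_PER_YEAR
    return kwh_per_year * PRICE_PER_KWH

pm_disabled = annual_cost(7000)          # ~$12,264/year
pm_enabled  = annual_cost(7000 * 0.98)   # ~2% lower draw with power management
print(f"Power management disabled: ${pm_disabled:,.0f}/year")
print(f"Power management enabled : ${pm_enabled:,.0f}/year")
# The difference comes out near $245/year, in line with the ~$248 on this slide
print(f"Estimated saving         : ${pm_disabled - pm_enabled:,.0f}/year")
```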
29
Power Cost Savings with Different Interconnect
• InfiniBand saves ~$1,200 and ~$4,000 in power cost to finish the same number of STAR-CD jobs compared to 10GigE and GigE, respectively
– Per year, for the 24-node cluster
• As cluster size increases, more power can be saved
[Chart: Power Cost Savings of InfiniBand vs. 10GigE and GigE – Power Cost Savings ($) for 10GigE and GigE; 24-node cluster, power management enabled]
$/year = total power consumption per year (kWh) * $0.20
For more information: http://enterprise.amd.com/Downloads/svrpwrusecompletefinal.pdf
30
Conclusions
• STAR-CD and STAR-CCM+ are widely used CFD simulation software packages
• Performance and productivity rely on
– Scalable HPC systems and interconnect solutions
– Low-latency and high-throughput interconnect technology
– Reasonable process distribution, which can dramatically improve performance per core
• Interconnect comparison shows
– InfiniBand delivers superior performance at every cluster size
– Low-latency InfiniBand enables unparalleled scalability
• Power management provides a 2% saving in power consumption
– Per 24-node system with InfiniBand
– $248 power cost savings per year for the 24-node cluster
– Power savings increase with cluster size
• InfiniBand saves power cost
– Based on the number of jobs that can be finished per year
– InfiniBand enables $1,200 and $4,000 power cost savings compared to 10GigE and GigE, respectively
31
Thank You
HPC Advisory Council
All trademarks are property of their respective owners. All information is provided “As-Is” without any kind of warranty. The HPC Advisory Council makes no representation as to the accuracy or completeness of the information contained herein. The HPC Advisory Council and Mellanox undertake no duty and assume no obligation to update or correct any information presented herein.