Fluent Analysis
Transcript of Fluent Analysis
5/10/2018 Fluent Analysis - slidepdf.com
http://slidepdf.com/reader/full/fluent-analysis 1/28
ANSYS FLUENT Performance Benchmark and Profiling
May 2009
Note
• The following research was performed under the HPC
Advisory Council activities
– Participating vendors: AMD, ANSYS, Dell, Mellanox
– Compute resource - HPC Advisory Council Cluster Center
• The participating members would like to thank ANSYS for their support and guidance
• For more info please refer to
– www.mellanox.com, www.dell.com/hpc, www.amd.com,
www.ansys.com
ANSYS FLUENT
• Computational Fluid Dynamics (CFD) is a computational technology
– Enables the study of the dynamics of things that flow
• By generating numerical solutions to a system of partial differential equations
which describe fluid flow
– Enables better understanding of qualitative and quantitative physical phenomena in the flow, which is used to improve engineering design
• CFD brings together a number of different disciplines
– Fluid dynamics, mathematical theory of partial differential systems, computational geometry, numerical analysis, and computer science
• ANSYS FLUENT is a leading CFD application from ANSYS
– Widely used in almost every industry sector and manufactured product
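The PDE-solving core described above can be illustrated with a deliberately tiny sketch: an explicit finite-difference step for 1D diffusion. This is not FLUENT's method (FLUENT uses far more general finite-volume discretizations); it only shows what "generating numerical solutions to a system of partial differential equations" means in practice.

```python
# Minimal sketch of what a CFD solver does at its core: march a discretized
# PDE forward in time. Here, 1D diffusion u_t = alpha * u_xx with an explicit
# finite-difference scheme (illustrative only, not FLUENT's finite-volume method).

def diffusion_step(u, alpha, dx, dt):
    """Advance one explicit time step of the 1D heat equation."""
    new = u[:]  # boundary values held fixed (Dirichlet conditions)
    for i in range(1, len(u) - 1):
        new[i] = u[i] + alpha * dt / dx**2 * (u[i-1] - 2*u[i] + u[i+1])
    return new

# An initial spike in the middle of the domain diffuses outward symmetrically.
u = [0.0] * 5
u[2] = 1.0
for _ in range(10):
    u = diffusion_step(u, alpha=1.0, dx=1.0, dt=0.2)  # alpha*dt/dx^2 = 0.2, stable
```

Each grid point is updated from its neighbors' previous values; a real CFD run does the same thing over millions of cells, which is what makes the problem parallelizable across cluster nodes.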
Objectives
• The presented research was conducted to provide best practices
– ANSYS FLUENT performance benchmarking
– Interconnect performance comparisons
– Performance enhancement of the latest FLUENT release
– Ways to increase FLUENT productivity
– Understanding FLUENT communication patterns
Test Cluster Configuration
• Dell™ PowerEdge™ SC 1435 24-node cluster
• Quad-Core AMD Opteron™ 2382 (“Shanghai”) CPUs
• Mellanox® InfiniBand ConnectX® 20Gb/s (DDR) HCAs
• Mellanox® InfiniBand DDR Switch
• Memory: 16GB DDR2 800MHz per node
• OS: RHEL5U2, OFED 1.4 InfiniBand SW stack
• MPI: HP-MPI 2.3
• Application: FLUENT 6.3.37, FLUENT 12.0
• Benchmark Workload
– New FLUENT Benchmark Suite
Mellanox InfiniBand Solutions
• Industry Standard
– Hardware, software, cabling, management
– Designed for clustering and storage interconnect
• Performance
– 40Gb/s node-to-node
– 120Gb/s switch-to-switch
– 1us application latency
– Most aggressive roadmap in the industry
• Reliable with congestion management
• Efficient
– RDMA and Transport Offload
– Kernel bypass
– CPU focuses on application processing
• Scalable for Petascale computing & beyond
• End-to-end quality of service
• Virtualization acceleration
• I/O consolidation, including storage
[Chart: "InfiniBand Delivers the Lowest Latency" / "The InfiniBand Performance Gap is Increasing" — bandwidth roadmap of InfiniBand 4X (20Gb/s → 40Gb/s → 80Gb/s) and 12X (60Gb/s → 120Gb/s → 240Gb/s) versus Ethernet and Fibre Channel]
• Performance
– Quad-Core
• Enhanced CPU IPC
• 4x 512K L2 cache
• 6MB L3 cache
– Direct Connect Architecture
• HyperTransport™ Technology
• Up to 24 GB/s peak per processor
– Floating Point
• 128-bit FPU per core
• 4 FLOPS/clk peak per core
– Integrated Memory Controller
• Up to 12.8 GB/s
• DDR2-800 MHz or DDR2-667 MHz
• Scalability
– 48-bit Physical Addressing
• Compatibility
– Same power/thermal envelopes as 2nd / 3rd generation AMD Opteron™ processor
[Diagram: Quad-Core AMD Opteron™ Processor, November 5, 2007 — Direct Connect Architecture with dual-channel registered DDR2 memory and 8 GB/s HyperTransport links to PCI-E® bridges and an I/O hub (USB, PCI)]
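The peak figures quoted above follow from simple arithmetic. A sketch of the check, assuming a 2.6 GHz clock for the Opteron 2382 "Shanghai" part used in this cluster (the clock speed is an assumption, not stated on the slide):

```python
# Back-of-the-envelope check of the peak figures quoted above.
# Assumption: the Opteron 2382 in the test cluster runs at 2.6 GHz.

cores = 4
flops_per_clk = 4          # 128-bit FPU: 4 FLOPS/clk peak per core
clock_ghz = 2.6
peak_gflops = cores * flops_per_clk * clock_ghz   # peak GFLOPS per processor

# Integrated memory controller: dual-channel DDR2-800,
# 8 bytes per channel per transfer at 800 MT/s.
channels = 2
bytes_per_transfer = 8
transfers_per_sec = 800e6
mem_bw_gbs = channels * bytes_per_transfer * transfers_per_sec / 1e9
```

This reproduces the slide's 12.8 GB/s memory-bandwidth figure and puts peak floating-point throughput at roughly 41.6 GFLOPS per socket under the assumed clock.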
Dell PowerEdge Servers helping Simplify IT
• System Structure and Sizing Guidelines
– 24-node cluster built with Dell PowerEdge™ SC 1435 servers
– Servers optimized for High Performance Computing environments
– Building Block Foundations for best price/performance and performance/watt
• Dell HPC Solutions
– Scalable Architectures for High Performance and Productivity
– Dell's comprehensive HPC services help manage the lifecycle requirements.
– Integrated, Tested and Validated Architectures
• Workload Modeling
– Optimized System Size, Configuration and Workloads
– Test-bed Benchmarks
– ISV Applications Characterization
– Best Practices & Usage Analysis
FLUENT Benchmark Results
• Input Dataset – EDDY_417K
• Reacting Flow with Eddy Dissipation Model
• FLUENT 12 provides better performance and scalability
– Utilizing InfiniBand DDR delivers the highest performance and scalability
[Chart: FLUENT Benchmark Result (Eddy_417K) — Rating (0–5000) vs. number of nodes (1–24), FLUENT 6.3.37 vs FLUENT 12 over InfiniBand DDR; higher is better]
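The "Rating" plotted on these charts is the standard FLUENT benchmark metric: benchmark jobs completed per day at the measured run time. A sketch of the conversion, and of the percent-gain arithmetic quoted on later slides, with hypothetical run times (not measured values from this study):

```python
# The FLUENT benchmark "Rating" is jobs per day:
#   Rating = 86400 / wall-clock seconds per benchmark run.
SECONDS_PER_DAY = 86400

def rating(run_time_s):
    """Convert a solver wall-clock time into the benchmark Rating."""
    return SECONDS_PER_DAY / run_time_s

def improvement_pct(new_rating, old_rating):
    """Percent performance gain, as quoted on the comparison slides."""
    return (new_rating / old_rating - 1) * 100

# Hypothetical run times, chosen only to illustrate the arithmetic:
old = rating(40.0)                  # 2160 jobs/day
new = rating(19.3)                  # ~4477 jobs/day
gain = improvement_pct(new, old)    # ~107%, the kind of figure quoted later
```

A higher Rating therefore means both a faster single run and more runs per day, which is why the charts are labeled "higher is better".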
[Chart: FLUENT Benchmark Result (Aircraft_2M) — Rating (0–7000) vs. number of nodes (1–24), FLUENT 6.3.37 vs FLUENT 12 over InfiniBand DDR; higher is better]
FLUENT Benchmark Results
• Input Dataset – Aircraft_2M
• External Flow Over an Aircraft Wing
• FLUENT 12 provides a performance and scalability increase
– Up to 107% higher performance versus the previous 6.3.37 version
[Chart: FLUENT Benchmark Result (Truck_14M) — Rating (0–1000) vs. number of nodes (1–24), FLUENT 6.3.37 vs FLUENT 12 over InfiniBand DDR; higher is better]
FLUENT Benchmark Results
• Input Dataset – Truck_14M
• External Flow Over a Truck Body
• FLUENT 12 delivers higher performance and scalability
– For any cluster size
– Up to 80% higher performance versus the previous 6.3.37 version
[Chart: FLUENT Benchmark Result (Truck_poly_14M) — Rating (0–900) vs. number of nodes (1–24), FLUENT 6.3.37 vs FLUENT 12 over InfiniBand DDR; higher is better]
FLUENT Benchmark Results
• Input Dataset – Truck_Poly_14M
• External Flow Over a Truck Body with a Polyhedral Mesh
• FLUENT 12 delivers higher performance and scalability
– For any cluster size
– Up to 67% higher performance versus the previous 6.3.37 version
FLUENT 12 Benchmark Results - Interconnect
• Input Dataset – EDDY_417K (417 thousand elements)
• Reacting Flow with Eddy Dissipation Model
• InfiniBand DDR delivers higher performance and scalability
– For any cluster size
– Up to 192% higher performance versus Ethernet (GigE)
[Chart: FLUENT 12.0 Benchmark Result (Eddy_417K) — Rating (0–5000) vs. number of nodes (1–24), InfiniBand DDR vs Ethernet; higher is better; 192% gain annotated]
[Chart: FLUENT 12.0 Benchmark Result (Aircraft_2M) — Rating (0–7000) vs. number of nodes (1–24), InfiniBand DDR vs Ethernet; higher is better]
FLUENT 12 Benchmark Results - Interconnect
• Input Dataset – Aircraft_2M (2 million elements)
• External Flow Over an Aircraft Wing
• InfiniBand DDR delivers higher performance and scalability
– For any cluster size
– Up to 99% higher performance versus Ethernet (GigE)
[Chart: FLUENT 12.0 Benchmark Result (Truck_14M) — Rating (0–1000) vs. number of nodes (1–24), InfiniBand DDR vs Ethernet; higher is better]
FLUENT 12 Benchmark Results - Interconnect
• Input Dataset – Truck_14M (14 millions elements)
• External Flow Over a Truck Body
• InfiniBand DDR delivers higher performance and scalability
– Up to 36% higher performance versus Ethernet (GigE)
• For bigger cases (more elements), the CPU is the bottleneck in larger node-count configurations
– More server nodes (or cores) are required, increasing parallelism and interconnect dependency
Enhancing FLUENT Productivity
• Test cases
– Single job over the entire system
– 2 jobs, each running on four cores per server
• Running multiple jobs simultaneously improves FLUENT productivity
– Up to 90% more jobs per day for Eddy_417K
– Up to 30% more jobs per day for Aircraft_2M
– Up to 3% more jobs per day for Truck_14M
• The bigger the number of elements, the higher the node count required for increased productivity
– The CPU is the bottleneck at larger numbers of servers
[Chart: FLUENT 12.0 Productivity Result (2 jobs in parallel vs 1 job) — productivity gain (0%–100%) vs. number of nodes (4–24) for Eddy_417K, Aircraft_2M, and Truck_14M over InfiniBand DDR; higher is better]
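The productivity gain above is a throughput comparison: jobs per day with two concurrent half-width jobs versus one job using all cores. A sketch of that arithmetic, with hypothetical run times (not measured values from this study):

```python
# Productivity comparison: one job on all cores of each node versus two jobs,
# each on four cores per server. Throughput is jobs per day.
# All run times below are hypothetical, for illustration only.
SECONDS_PER_DAY = 86400

def jobs_per_day(run_time_s, concurrent_jobs=1):
    """Cluster throughput when `concurrent_jobs` identical jobs run at once."""
    return concurrent_jobs * SECONDS_PER_DAY / run_time_s

single = jobs_per_day(100.0)                   # one full-width job: 864 jobs/day
# Each half-width job runs somewhat slower than the full-width job,
# but two of them run at once, so total throughput still rises sharply.
dual = jobs_per_day(105.0, concurrent_jobs=2)  # ~1646 jobs/day
gain_pct = (dual / single - 1) * 100           # ~90%, Eddy_417K-like gain
```

The gain shrinks as the per-job slowdown grows, which is consistent with the small 3% figure reported for the CPU-bound Truck_14M case.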
Power Cost Savings with Different Interconnect
• InfiniBand saves up to $8,000 in power costs to finish the same number of FLUENT jobs compared to GigE
– Per year, for the 24-node cluster
• As cluster size increases, more power can be saved
Power cost computed as kWh × $0.20/kWh. For more information: http://enterprise.amd.com/Downloads/svrpwrusecompletefinal.pdf
[Chart: Power Cost Savings (InfiniBand vs GigE) — yearly power savings ($, 0–10,000) for Eddy_417K, Aircraft_2M, and Truck_14M]
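The savings figure is energy multiplied by price: a faster interconnect finishes the same job count in fewer busy hours. A sketch of the calculation; only the $0.20/kWh rate comes from this slide, while the node power draw and busy hours below are hypothetical:

```python
# Yearly power-cost comparison between two interconnects finishing the same
# number of jobs. Only the $0.20/kWh rate is taken from the slide; the
# per-node wattage and busy-hour figures are hypothetical.
PRICE_PER_KWH = 0.20

def yearly_energy_cost(nodes, watts_per_node, busy_hours_per_year):
    """Energy (kWh) consumed while the cluster is busy, priced per kWh."""
    kwh = nodes * watts_per_node * busy_hours_per_year / 1000.0
    return kwh * PRICE_PER_KWH

# Same yearly job count: the slower interconnect keeps the cluster busy longer.
cost_ib   = yearly_energy_cost(24, 300, 5000)   # $7,200/year
cost_gige = yearly_energy_cost(24, 300, 7800)   # $11,232/year
savings = cost_gige - cost_ib                   # $4,032/year under these assumptions
```

The structure of the calculation also explains the slide's second bullet: a larger cluster multiplies both terms, so the absolute savings grow with cluster size.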
Power Cost Savings with FLUENT upgrade
[Chart: Power Cost Savings (FLUENT 12 vs FLUENT 6.3.37) — yearly power savings ($, 0–8,000) for Eddy_417K, Aircraft_2M, and Truck_14M]
• FLUENT 12 saves up to ~$6,000 in power costs to finish the same number of FLUENT jobs compared to FLUENT 6.3.37
– Per year, for the 24-node cluster
• As cluster size increases, more power can be saved
Power cost computed as kWh × $0.20/kWh. For more information: http://enterprise.amd.com/Downloads/svrpwrusecompletefinal.pdf
FLUENT Productivity Results Summary
• FLUENT 12 delivers a tremendous performance improvement over version 6.3.37
– Optimizations made for higher performance and scalability
– Optimizations included for AMD Opteron™ processor technology
• AMD contributed to the development and the QA stages of Fluent 12
• Performance results on AMD technology established baseline performance
• InfiniBand enables higher performance and scalability than Ethernet
– Performance advantage extends as cluster size increases
• Efficient job placement can increase FLUENT productivity significantly
• Interconnect comparison shows
– InfiniBand delivers superior performance at every cluster size
– Low latency InfiniBand enables unparalleled scalability
• InfiniBand enables up to $8000/year power savings compared to GigE
• FLUENT 12 reduces yearly power consumption by up to $6000 compared to FLUENT 6.3.37
FLUENT 6.3.37 Profiling – MPI Functions
• Most frequently used MPI functions
– MPI_Send, MPI_Recv, MPI_Reduce, and MPI_Bcast
[Chart: FLUENT 6.3.37 Benchmark Profiling Result (Truck_poly_14M) — number of calls (thousands, 0–30,000) for MPI_Waitall, MPI_Send, MPI_Reduce, MPI_Recv, MPI_Bcast, MPI_Barrier, MPI_Isend, and MPI_Irecv at 4, 8, and 16 nodes]
FLUENT 6.3.37 Profiling – Timing
• MPI_Recv shows the highest communication overhead
[Chart: FLUENT 6.3.37 Benchmark Profiling Result (Truck_poly_14M) — total overhead (s, 0–200,000) for MPI_Waitall, MPI_Send, MPI_Reduce, MPI_Recv, MPI_Bcast, MPI_Barrier, MPI_Isend, and MPI_Irecv at 4, 8, and 16 nodes]
FLUENT 6.3.37 Profiling – Message Transferred
• Most data-related MPI messages are 256B–1KB in size
• Typical MPI synchronization messages are smaller than 64B
• Number of messages increases with cluster size
[Chart: FLUENT 6.3.37 Benchmark Profiling Result (Truck_poly_14M) — number of messages (thousands, 0–48,000) by message size bin, [0..64B] through [4MB..2GB], at 4, 8, and 16 nodes]
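The size bins used in these histograms can be reproduced with a simple bucketing routine. The sample trace below is synthetic, shaped only to match the reported pattern (many small synchronization messages under 64B, most data messages between 256B and 1KB):

```python
# Bucket MPI message sizes into the bins used by the profiling charts.
# Bin labels follow the charts (upper bounds normalized to bytes).
BINS = [
    (64, "[0..64B]"), (256, "[64..256B]"), (1024, "[256B..1KB]"),
    (4096, "[1..4KB]"), (16384, "[4..16KB]"), (65536, "[16..64KB]"),
    (262144, "[64..256KB]"), (1048576, "[256KB..1MB]"),
    (4194304, "[1..4MB]"), (2**31, "[4MB..2GB]"),
]

def bucket(size_bytes):
    """Return the chart bin label for a message of the given size."""
    for upper, label in BINS:
        if size_bytes <= upper:
            return label
    raise ValueError("message larger than 2GB")

def histogram(sizes):
    """Count messages per bin, as the profiling charts do."""
    counts = {}
    for s in sizes:
        label = bucket(s)
        counts[label] = counts.get(label, 0) + 1
    return counts

# Synthetic trace: small sync messages plus 256B-1KB data messages.
h = histogram([8, 32, 48, 300, 512, 900, 1000, 2048])
```

This kind of bucketing is what an MPI profiler does per rank before aggregating counts across the cluster.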
FLUENT 12 Profiling – MPI Functions
• Most frequently used MPI functions
– MPI_Iprobe, MPI_Allreduce, MPI_Isend, and MPI_Irecv
[Chart: FLUENT 12 Benchmark Profiling Result (Truck_poly_14M) — number of calls (thousands, 0–20,000) for MPI_Waitall, MPI_Send, MPI_Reduce, MPI_Recv, MPI_Bcast, MPI_Barrier, MPI_Iprobe, MPI_Allreduce, MPI_Isend, and MPI_Irecv at 4, 8, and 16 nodes]
FLUENT 12 Profiling – Timing
• MPI_Recv and MPI_Allreduce show the highest communication overhead
[Chart: FLUENT 12 Benchmark Profiling Result (Truck_poly_14M) — total overhead (s, 0–20,000) for MPI_Waitall, MPI_Send, MPI_Reduce, MPI_Recv, MPI_Bcast, MPI_Barrier, MPI_Iprobe, MPI_Allreduce, MPI_Isend, and MPI_Irecv at 4, 8, and 16 nodes]
FLUENT 12 Profiling – Message Transferred
• Most data-related MPI messages are 256B–1KB in size
• Typical MPI synchronization messages are smaller than 64B
• Number of messages increases with cluster size
[Chart: FLUENT 12 Benchmark Profiling Result (Truck_poly_14M) — number of messages (thousands, 0–28,000) by message size bin, [0..64B] through [4MB..2GB], at 4, 8, and 16 nodes]
FLUENT Profiling - Message Transferred
• FLUENT 12 reduces the total number of messages
• Further optimization could take greater advantage of high-speed, low-latency interconnects
[Chart: FLUENT Benchmark Profiling Result (Truck_poly_14M) — number of messages (thousands, 0–48,000) by message size bin, [0..64B] through [1..4MB], FLUENT 6.3.37 vs FLUENT 12]
FLUENT Profiling Summary
• FLUENT 12 and FLUENT 6.3.37 were profiled to identify their communication patterns
• Frequently used message sizes
– 256B–1KB messages for data-related communications
– <64B for synchronization
– Number of messages increases with cluster size
• MPI Functions
– FLUENT 12 introduced MPI collective functions
• MPI_Allreduce helps improve communication efficiency
• Interconnect effects on FLUENT performance
– Both interconnect latency (MPI_Allreduce) and throughput (MPI_Recv) highly influence
FLUENT performance
– Further optimization could take greater advantage of high-speed networks
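The latency/throughput split above can be made concrete with the usual alpha-beta model of message transfer time: small messages (the bulk of FLUENT traffic) are latency-bound, large ones bandwidth-bound. The latency and bandwidth numbers below are representative of InfiniBand DDR versus GigE, not measurements from this study:

```python
# Alpha-beta model of point-to-point transfer time:
#   time = latency + size / bandwidth
# Small messages are dominated by the latency term, large ones by bandwidth.

def transfer_time_us(size_bytes, latency_us, bandwidth_gbs):
    """Modeled transfer time in microseconds (bandwidth in GB/s)."""
    return latency_us + size_bytes / (bandwidth_gbs * 1e9) * 1e6

# Representative (assumed) parameters: ~1 us / 2.5 GB/s usable for
# InfiniBand DDR, ~50 us / 0.125 GB/s for GigE.
ib_small   = transfer_time_us(512, 1.0, 2.5)     # ~1.2 us
gige_small = transfer_time_us(512, 50.0, 0.125)  # ~54 us
ratio_small = gige_small / ib_small              # latency dominates: huge gap
```

For the 256B–1KB messages that dominate FLUENT's traffic, the modeled gap is tens of times, which is consistent with low-latency InfiniBand benefiting MPI_Allreduce-heavy communication far more than raw bandwidth alone would suggest.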
Thank You
HPC Advisory Council
All trademarks are property of their respective owners. All information is provided “As-Is” without any kind of warranty. The HPC Advisory Council makes no representation as to the accuracy and completeness of the information contained herein. The HPC Advisory Council and Mellanox undertake no duty and assume no obligation to update or correct any information presented herein.