Fluent Analysis
Transcript of Fluent Analysis
5/10/2018 Fluent Analysis - slidepdf.com
http://slidepdf.com/reader/full/fluent-analysis 1/28
ANSYS FLUENT Performance Benchmark and Profiling
May 2009
Note
• The following research was performed under the HPC
Advisory Council activities
– Participating vendors: AMD, ANSYS, Dell, Mellanox
– Compute resource - HPC Advisory Council Cluster Center
• The participating members would like to thank ANSYS for their support and guidance
• For more info please refer to
– www.mellanox.com, www.dell.com/hpc, www.amd.com,
www.ansys.com
ANSYS FLUENT
• Computational Fluid Dynamics (CFD) is a computational technology
– Enables the study of the dynamics of things that flow
• By generating numerical solutions to a system of partial differential equations
which describe fluid flow
– Enables better understanding of qualitative and quantitative physical phenomena in the flow, which is used to improve engineering design
• CFD brings together a number of different disciplines
– Fluid dynamics, mathematical theory of partial differential systems, computational geometry, numerical analysis, and computer science
• ANSYS FLUENT is a leading CFD application from ANSYS
– Widely used in almost every industry sector and manufactured product
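The PDE-solving core described above can be illustrated with a deliberately tiny sketch: an explicit finite-difference step for 1D diffusion. This is not FLUENT's method (FLUENT uses far more general finite-volume discretizations); it only shows what "generating numerical solutions to a system of partial differential equations" means in practice.

```python
# Minimal sketch of what a CFD solver does at its core: march a discretized
# PDE forward in time. Here, 1D diffusion u_t = alpha * u_xx with an explicit
# finite-difference scheme (illustrative only, not FLUENT's finite-volume method).

def diffusion_step(u, alpha, dx, dt):
    """Advance one explicit time step of the 1D heat equation."""
    new = u[:]  # boundary values held fixed (Dirichlet conditions)
    for i in range(1, len(u) - 1):
        new[i] = u[i] + alpha * dt / dx**2 * (u[i-1] - 2*u[i] + u[i+1])
    return new

# An initial spike in the middle of the domain diffuses outward symmetrically.
u = [0.0] * 5
u[2] = 1.0
for _ in range(10):
    u = diffusion_step(u, alpha=1.0, dx=1.0, dt=0.2)  # alpha*dt/dx^2 = 0.2, stable
```

Each grid point is updated from its neighbors' previous values; a real CFD run does the same thing over millions of cells, which is what makes the problem parallelizable across cluster nodes.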
Objectives
• The presented research was conducted to provide best practices
– ANSYS FLUENT performance benchmarking
– Interconnect performance comparisons
– Performance enhancement of the latest FLUENT release
– Ways to increase FLUENT productivity
– Understanding FLUENT communication patterns
Test Cluster Configuration
• Dell™ PowerEdge™ SC 1435 24-node cluster
• Quad-Core AMD Opteron™ 2382 (“Shanghai”) CPUs
• Mellanox® InfiniBand ConnectX® 20Gb/s (DDR) HCAs
• Mellanox® InfiniBand DDR Switch
• Memory: 16GB DDR2 800MHz per node
• OS: RHEL5U2, OFED 1.4 InfiniBand SW stack
• MPI: HP-MPI 2.3
• Application: FLUENT 6.3.37, FLUENT 12.0
• Benchmark Workload
– New FLUENT Benchmark Suite
Mellanox InfiniBand Solutions
• Industry Standard
– Hardware, software, cabling, management
– Designed for clustering and storage interconnect
• Performance
– 40Gb/s node-to-node
– 120Gb/s switch-to-switch
– 1us application latency
– Most aggressive roadmap in the industry
• Reliable with congestion management
• Efficient
– RDMA and Transport Offload
– Kernel bypass
– CPU focuses on application processing
• Scalable for Petascale computing & beyond
• End-to-end quality of service
• Virtualization acceleration
• I/O consolidation, including storage
[Chart: "InfiniBand Delivers the Lowest Latency" / "The InfiniBand Performance Gap is Increasing" — bandwidth roadmap of InfiniBand 4X (20Gb/s → 40Gb/s → 80Gb/s) and 12X (60Gb/s → 120Gb/s → 240Gb/s) versus Ethernet and Fibre Channel]
• Performance
– Quad-Core
• Enhanced CPU IPC
• 4x 512K L2 cache
• 6MB L3 cache
– Direct Connect Architecture
• HyperTransport™ Technology
• Up to 24 GB/s peak per processor
– Floating Point
• 128-bit FPU per core
• 4 FLOPS/clk peak per core
– Integrated Memory Controller
• Up to 12.8 GB/s
• DDR2-800 MHz or DDR2-667 MHz
• Scalability
– 48-bit Physical Addressing
• Compatibility
– Same power/thermal envelopes as 2nd / 3rd generation AMD Opteron™ processor
[Diagram: Quad-Core AMD Opteron™ Processor, November 5, 2007 — Direct Connect Architecture with dual-channel registered DDR2 memory and 8 GB/s HyperTransport links to PCI-E® bridges and an I/O hub (USB, PCI)]
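The peak figures quoted above follow from simple arithmetic. A sketch of the check, assuming a 2.6 GHz clock for the Opteron 2382 "Shanghai" part used in this cluster (the clock speed is an assumption, not stated on the slide):

```python
# Back-of-the-envelope check of the peak figures quoted above.
# Assumption: the Opteron 2382 in the test cluster runs at 2.6 GHz.

cores = 4
flops_per_clk = 4          # 128-bit FPU: 4 FLOPS/clk peak per core
clock_ghz = 2.6
peak_gflops = cores * flops_per_clk * clock_ghz   # peak GFLOPS per processor

# Integrated memory controller: dual-channel DDR2-800,
# 8 bytes per channel per transfer at 800 MT/s.
channels = 2
bytes_per_transfer = 8
transfers_per_sec = 800e6
mem_bw_gbs = channels * bytes_per_transfer * transfers_per_sec / 1e9
```

This reproduces the slide's 12.8 GB/s memory-bandwidth figure and puts peak floating-point throughput at roughly 41.6 GFLOPS per socket under the assumed clock.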
Dell PowerEdge Servers helping Simplify IT
• System Structure and Sizing Guidelines
– 24-node cluster built with Dell PowerEdge™ SC 1435 servers
– Servers optimized for High Performance Computing environments
– Building Block Foundations for best price/performance and performance/watt
• Dell HPC Solutions
– Scalable Architectures for High Performance and Productivity
– Dell's comprehensive HPC services help manage the lifecycle requirements.
– Integrated, Tested and Validated Architectures
• Workload Modeling
– Optimized System Size, Configuration and Workloads
– Test-bed Benchmarks
– ISV Applications Characterization
– Best Practices & Usage Analysis
FLUENT Benchmark Results
• Input Dataset – EDDY_417K
• Reacting Flow with Eddy Dissipation Model
• FLUENT 12 provides better performance and scalability
– Utilizing InfiniBand DDR delivers the highest performance and scalability
[Chart: FLUENT Benchmark Result (Eddy_417K) — Rating (0–5000) vs. number of nodes (1–24), FLUENT 6.3.37 vs FLUENT 12 over InfiniBand DDR; higher is better]
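The "Rating" plotted on these charts is the standard FLUENT benchmark metric: benchmark jobs completed per day at the measured run time. A sketch of the conversion, and of the percent-gain arithmetic quoted on later slides, with hypothetical run times (not measured values from this study):

```python
# The FLUENT benchmark "Rating" is jobs per day:
#   Rating = 86400 / wall-clock seconds per benchmark run.
SECONDS_PER_DAY = 86400

def rating(run_time_s):
    """Convert a solver wall-clock time into the benchmark Rating."""
    return SECONDS_PER_DAY / run_time_s

def improvement_pct(new_rating, old_rating):
    """Percent performance gain, as quoted on the comparison slides."""
    return (new_rating / old_rating - 1) * 100

# Hypothetical run times, chosen only to illustrate the arithmetic:
old = rating(40.0)                  # 2160 jobs/day
new = rating(19.3)                  # ~4477 jobs/day
gain = improvement_pct(new, old)    # ~107%, the kind of figure quoted later
```

A higher Rating therefore means both a faster single run and more runs per day, which is why the charts are labeled "higher is better".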
[Chart: FLUENT Benchmark Result (Aircraft_2M) — Rating (0–7000) vs. number of nodes (1–24), FLUENT 6.3.37 vs FLUENT 12 over InfiniBand DDR; higher is better]
FLUENT Benchmark Results
• Input Dataset – Aircraft_2M
• External Flow Over an Aircraft Wing
• FLUENT 12 provides a performance and scalability increase
– Up to 107% higher performance versus the previous 6.3.37 version
[Chart: FLUENT Benchmark Result (Truck_14M) — Rating (0–1000) vs. number of nodes (1–24), FLUENT 6.3.37 vs FLUENT 12 over InfiniBand DDR; higher is better]
FLUENT Benchmark Results
• Input Dataset – Truck_14M
• External Flow Over a Truck Body
• FLUENT 12 delivers higher performance and scalability
– For any cluster size
– Up to 80% higher performance versus the previous 6.3.37 version
[Chart: FLUENT Benchmark Result (Truck_poly_14M) — Rating (0–900) vs. number of nodes (1–24), FLUENT 6.3.37 vs FLUENT 12 over InfiniBand DDR; higher is better]
FLUENT Benchmark Results
• Input Dataset – Truck_Poly_14M
• External Flow Over a Truck Body with a Polyhedral Mesh
• FLUENT 12 delivers higher performance and scalability
– For any cluster size
– Up to 67% higher performance versus the previous 6.3.37 version
FLUENT 12 Benchmark Results - Interconnect
• Input Dataset – EDDY_417K (417 thousand elements)
• Reacting Flow with Eddy Dissipation Model
• InfiniBand DDR delivers higher performance and scalability
– For any cluster size
– Up to 192% higher performance versus Ethernet (GigE)
[Chart: FLUENT 12.0 Benchmark Result (Eddy_417K) — Rating (0–5000) vs. number of nodes (1–24), InfiniBand DDR vs Ethernet; higher is better; 192% gain annotated]
[Chart: FLUENT 12.0 Benchmark Result (Aircraft_2M) — Rating (0–7000) vs. number of nodes (1–24), InfiniBand DDR vs Ethernet; higher is better]
FLUENT 12 Benchmark Results - Interconnect
• Input Dataset – Aircraft_2M (2 million elements)
• External Flow Over an Aircraft Wing
• InfiniBand DDR delivers higher performance and scalability
– For any cluster size
– Up to 99% higher performance versus Ethernet (GigE)
[Chart: FLUENT 12.0 Benchmark Result (Truck_14M) — Rating (0–1000) vs. number of nodes (1–24), InfiniBand DDR vs Ethernet; higher is better]
FLUENT 12 Benchmark Results - Interconnect
• Input Dataset – Truck_14M (14 millions elements)
• External Flow Over a Truck Body
• InfiniBand DDR delivers higher performance and scalability
– Up to 36% higher performance versus Ethernet (GigE)
• For bigger cases (more elements), the CPU is the bottleneck in larger node-count configurations
– More server nodes (or cores) are required, increasing parallelism and interconnect dependency
Enhancing FLUENT Productivity
• Test cases
– Single job over the entire system
– 2 jobs, each running on four cores per server
• Running multiple jobs simultaneously improves FLUENT productivity
– Up to 90% more jobs per day for Eddy_417K
– Up to 30% more jobs per day for Aircraft_2M
– Up to 3% more jobs per day for Truck_14M
• The bigger the number of elements, the higher the node count required for increased productivity
– The CPU is the bottleneck at larger numbers of servers
[Chart: FLUENT 12.0 Productivity Result (2 jobs in parallel vs 1 job) — productivity gain (0%–100%) vs. number of nodes (4–24) for Eddy_417K, Aircraft_2M, and Truck_14M over InfiniBand DDR; higher is better]
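The productivity gain above is a throughput comparison: jobs per day with two concurrent half-width jobs versus one job using all cores. A sketch of that arithmetic, with hypothetical run times (not measured values from this study):

```python
# Productivity comparison: one job on all cores of each node versus two jobs,
# each on four cores per server. Throughput is jobs per day.
# All run times below are hypothetical, for illustration only.
SECONDS_PER_DAY = 86400

def jobs_per_day(run_time_s, concurrent_jobs=1):
    """Cluster throughput when `concurrent_jobs` identical jobs run at once."""
    return concurrent_jobs * SECONDS_PER_DAY / run_time_s

single = jobs_per_day(100.0)                   # one full-width job: 864 jobs/day
# Each half-width job runs somewhat slower than the full-width job,
# but two of them run at once, so total throughput still rises sharply.
dual = jobs_per_day(105.0, concurrent_jobs=2)  # ~1646 jobs/day
gain_pct = (dual / single - 1) * 100           # ~90%, Eddy_417K-like gain
```

The gain shrinks as the per-job slowdown grows, which is consistent with the small 3% figure reported for the CPU-bound Truck_14M case.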
Power Cost Savings with Different Interconnect
• InfiniBand saves up to $8,000 in power costs to finish the same number of FLUENT jobs compared to GigE
– Per year, for the 24-node cluster
• As cluster size increases, more power can be saved
Power cost computed as kWh × $0.20/kWh. For more information: http://enterprise.amd.com/Downloads/svrpwrusecompletefinal.pdf
[Chart: Power Cost Savings (InfiniBand vs GigE) — yearly power savings ($, 0–10,000) for Eddy_417K, Aircraft_2M, and Truck_14M]
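The savings figure is energy multiplied by price: a faster interconnect finishes the same job count in fewer busy hours. A sketch of the calculation; only the $0.20/kWh rate comes from this slide, while the node power draw and busy hours below are hypothetical:

```python
# Yearly power-cost comparison between two interconnects finishing the same
# number of jobs. Only the $0.20/kWh rate is taken from the slide; the
# per-node wattage and busy-hour figures are hypothetical.
PRICE_PER_KWH = 0.20

def yearly_energy_cost(nodes, watts_per_node, busy_hours_per_year):
    """Energy (kWh) consumed while the cluster is busy, priced per kWh."""
    kwh = nodes * watts_per_node * busy_hours_per_year / 1000.0
    return kwh * PRICE_PER_KWH

# Same yearly job count: the slower interconnect keeps the cluster busy longer.
cost_ib   = yearly_energy_cost(24, 300, 5000)   # $7,200/year
cost_gige = yearly_energy_cost(24, 300, 7800)   # $11,232/year
savings = cost_gige - cost_ib                   # $4,032/year under these assumptions
```

The structure of the calculation also explains the slide's second bullet: a larger cluster multiplies both terms, so the absolute savings grow with cluster size.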
Power Cost Savings with FLUENT upgrade
[Chart: Power Cost Savings (FLUENT 12 vs FLUENT 6.3.37) — yearly power savings ($, 0–8,000) for Eddy_417K, Aircraft_2M, and Truck_14M]
• FLUENT 12 saves up to ~$6,000 in power costs to finish the same number of FLUENT jobs compared to FLUENT 6.3.37
– Per year, for the 24-node cluster
• As cluster size increases, more power can be saved
Power cost computed as kWh × $0.20/kWh. For more information: http://enterprise.amd.com/Downloads/svrpwrusecompletefinal.pdf
FLUENT Productivity Results Summary
• FLUENT 12 delivers a tremendous performance improvement over version 6.3.37
– Optimizations made for higher performance and scalability
– Optimizations included for AMD Opteron™ processor technology
• AMD contributed to the development and the QA stages of Fluent 12
• Performance results on AMD technology established baseline performance
• InfiniBand enables higher performance and scalability than Ethernet
– Performance advantage extends as cluster size increases
• Efficient job placement can increase FLUENT productivity significantly
• Interconnect comparison shows
– InfiniBand delivers superior performance at every cluster size
– Low latency InfiniBand enables unparalleled scalability
• InfiniBand enables up to $8000/year power savings compared to GigE
• FLUENT 12 reduces yearly power consumption by up to $6000 compared to FLUENT 6.3.37
FLUENT 6.3.37 Profiling – MPI Functions
• Most frequently used MPI functions
– MPI_Send, MPI_Recv, MPI_Reduce, and MPI_Bcast
[Chart: FLUENT 6.3.37 Benchmark Profiling Result (Truck_poly_14M) — number of calls (thousands, 0–30,000) for MPI_Waitall, MPI_Send, MPI_Reduce, MPI_Recv, MPI_Bcast, MPI_Barrier, MPI_Isend, and MPI_Irecv at 4, 8, and 16 nodes]
FLUENT 6.3.37 Profiling – Timing
• MPI_Recv shows the highest communication overhead
[Chart: FLUENT 6.3.37 Benchmark Profiling Result (Truck_poly_14M) — total overhead (s, 0–200,000) for MPI_Waitall, MPI_Send, MPI_Reduce, MPI_Recv, MPI_Bcast, MPI_Barrier, MPI_Isend, and MPI_Irecv at 4, 8, and 16 nodes]
FLUENT 6.3.37 Profiling – Message Transferred
• Most data-related MPI messages are 256B–1KB in size
• Typical MPI synchronization messages are smaller than 64B
• Number of messages increases with cluster size
[Chart: FLUENT 6.3.37 Benchmark Profiling Result (Truck_poly_14M) — number of messages (thousands, 0–48,000) by message size bin, [0..64B] through [4MB..2GB], at 4, 8, and 16 nodes]
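The size bins used in these histograms can be reproduced with a simple bucketing routine. The sample trace below is synthetic, shaped only to match the reported pattern (many small synchronization messages under 64B, most data messages between 256B and 1KB):

```python
# Bucket MPI message sizes into the bins used by the profiling charts.
# Bin labels follow the charts (upper bounds normalized to bytes).
BINS = [
    (64, "[0..64B]"), (256, "[64..256B]"), (1024, "[256B..1KB]"),
    (4096, "[1..4KB]"), (16384, "[4..16KB]"), (65536, "[16..64KB]"),
    (262144, "[64..256KB]"), (1048576, "[256KB..1MB]"),
    (4194304, "[1..4MB]"), (2**31, "[4MB..2GB]"),
]

def bucket(size_bytes):
    """Return the chart bin label for a message of the given size."""
    for upper, label in BINS:
        if size_bytes <= upper:
            return label
    raise ValueError("message larger than 2GB")

def histogram(sizes):
    """Count messages per bin, as the profiling charts do."""
    counts = {}
    for s in sizes:
        label = bucket(s)
        counts[label] = counts.get(label, 0) + 1
    return counts

# Synthetic trace: small sync messages plus 256B-1KB data messages.
h = histogram([8, 32, 48, 300, 512, 900, 1000, 2048])
```

This kind of bucketing is what an MPI profiler does per rank before aggregating counts across the cluster.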
FLUENT 12 Profiling – MPI Functions
• Most frequently used MPI functions
– MPI_Iprobe, MPI_Allreduce, MPI_Isend, and MPI_Irecv
[Chart: FLUENT 12 Benchmark Profiling Result (Truck_poly_14M) — number of calls (thousands, 0–20,000) for MPI_Waitall, MPI_Send, MPI_Reduce, MPI_Recv, MPI_Bcast, MPI_Barrier, MPI_Iprobe, MPI_Allreduce, MPI_Isend, and MPI_Irecv at 4, 8, and 16 nodes]
FLUENT 12 Profiling – Timing
• MPI_Recv and MPI_Allreduce show the highest communication overhead
[Chart: FLUENT 12 Benchmark Profiling Result (Truck_poly_14M) — total overhead (s, 0–20,000) for MPI_Waitall, MPI_Send, MPI_Reduce, MPI_Recv, MPI_Bcast, MPI_Barrier, MPI_Iprobe, MPI_Allreduce, MPI_Isend, and MPI_Irecv at 4, 8, and 16 nodes]
FLUENT 12 Profiling – Message Transferred
• Most data-related MPI messages are 256B–1KB in size
• Typical MPI synchronization messages are smaller than 64B
• Number of messages increases with cluster size
[Chart: FLUENT 12 Benchmark Profiling Result (Truck_poly_14M) — number of messages (thousands, 0–28,000) by message size bin, [0..64B] through [4MB..2GB], at 4, 8, and 16 nodes]
FLUENT Profiling - Message Transferred
• FLUENT 12 reduces the total number of messages
• Further optimization could take greater advantage of high-speed, low-latency interconnects
[Chart: FLUENT Benchmark Profiling Result (Truck_poly_14M) — number of messages (thousands, 0–48,000) by message size bin, [0..64B] through [1..4MB], FLUENT 6.3.37 vs FLUENT 12]
FLUENT Profiling Summary
• FLUENT 12 and FLUENT 6.3.37 were profiled to identify their communication patterns
• Frequently used message sizes
– 256B–1KB messages for data-related communications
– <64B for synchronization
– Number of messages increases with cluster size
• MPI Functions
– FLUENT 12 introduced MPI collective functions
• MPI_Allreduce helps improve communication efficiency
• Interconnect effects on FLUENT performance
– Both interconnect latency (MPI_Allreduce) and throughput (MPI_Recv) highly influence
FLUENT performance
– Further optimization could take greater advantage of high-speed networks
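The latency/throughput split above can be made concrete with the usual alpha-beta model of message transfer time: small messages (the bulk of FLUENT traffic) are latency-bound, large ones bandwidth-bound. The latency and bandwidth numbers below are representative of InfiniBand DDR versus GigE, not measurements from this study:

```python
# Alpha-beta model of point-to-point transfer time:
#   time = latency + size / bandwidth
# Small messages are dominated by the latency term, large ones by bandwidth.

def transfer_time_us(size_bytes, latency_us, bandwidth_gbs):
    """Modeled transfer time in microseconds (bandwidth in GB/s)."""
    return latency_us + size_bytes / (bandwidth_gbs * 1e9) * 1e6

# Representative (assumed) parameters: ~1 us / 2.5 GB/s usable for
# InfiniBand DDR, ~50 us / 0.125 GB/s for GigE.
ib_small   = transfer_time_us(512, 1.0, 2.5)     # ~1.2 us
gige_small = transfer_time_us(512, 50.0, 0.125)  # ~54 us
ratio_small = gige_small / ib_small              # latency dominates: huge gap
```

For the 256B–1KB messages that dominate FLUENT's traffic, the modeled gap is tens of times, which is consistent with low-latency InfiniBand benefiting MPI_Allreduce-heavy communication far more than raw bandwidth alone would suggest.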
Thank You
HPC Advisory Council
All trademarks are property of their respective owners. All information is provided “As-Is” without any kind of warranty. The HPC Advisory Council makes no representation as to the accuracy and completeness of the information contained herein. The HPC Advisory Council and Mellanox undertake no duty and assume no obligation to update or correct any information presented herein.