
Technical white paper

Scalability of ANSYS 16 applications and Hardware selection. On multi-core and floating point accelerator processor systems

Table of Contents

Abstract
Test configuration details
    Message Passing Interface
ANSYS CFD test results
    ANSYS Fluent
        Standard benchmarks on a single node
        Multi-process performance relative to single-process performance
        Standard benchmarks going from one node to thirty-two nodes
        Processors with higher clock speeds and fewer cores
        Fluent and GPUs
    ANSYS CFX
        Small and medium standard benchmarks on a single node
        Multi-process performance relative to single-process performance
        Standard benchmarks going from one node to sixteen nodes
    Summary of ANSYS CFD test results
    Solution Reference Architectures for ANSYS CFD
ANSYS Mechanical test results
    With GPU support
    Xeon PHI
    Without GPU support
    Faster processors, fewer cores. What does that buy you?
    Summary of ANSYS Mechanical test results
    Solution Reference Architectures for ANSYS Mechanical
Conclusion


Abstract

For many years, the advances in computer design have followed Moore's Law, which observes that the number of transistors on a single chip doubles at a roughly fixed rate. During those years, increasing Central Processing Unit (CPU) computing power meant adding more transistors to a single CPU. This created faster and smaller chips, but with more complicated architectures.

Recently, this trend has changed. Along with using floating-point accelerator units, the current trend is to increase computing power by adding processor cores to a single chip (creating a multi-core chip). Although multi-core chips increase computing power, they do not necessarily result in immediate performance improvements.

The performance and scalability of multi-core chips depend on the application that you are running. Multi-core chips can improve performance, but in doing so they increase demands on other subsystems; the memory, I/O, and networking subsystems must be able to handle these demands. In addition to using floating point accelerators and fast memory, an application's efficient use of these subsystems is what maximizes performance. In short, performance and scalability depend on the application design.

This paper serves three main purposes. First, it looks at how ANSYS Inc. 16.0 applications run on Intel®-based HP ProLiant Gen9 servers. Second, it gives users insight into how to run these applications to get maximum performance. Third, it helps IT managers select the correct and optimal hardware for their users running ANSYS applications.

Test configuration details

In this paper, we discuss performance test results for the following ANSYS applications:

• ANSYS Fluent release 16.0
• ANSYS CFX release 16.0
• ANSYS Mechanical (MAPDL) release 16.0

The configuration for benchmark testing:

• For all non-GPU testing, we used a cluster of HP ProLiant BL460c Gen9 servers and a cluster of XL230a Gen9 servers connected by FDR InfiniBand. Table 1 shows the configuration for each server.

• For the GPU testing, we used HP ProLiant XL250a Gen9 servers configured as shown in Table 1.

Table 1. Server configuration details.

Subsystem | HP ProLiant BL460c / XL230a Gen9 server configuration | HP ProLiant XL250a Gen9 server configuration
Operating system | Red Hat Enterprise Linux release 6.5 | Red Hat Enterprise Linux release 6.5
Message Passing Interface (MPI) | Platform MPI version 9.1 or Intel MPI version 5.0 (depending on application) | Platform MPI version 9.1 or Intel MPI version 5.0 (depending on application)
Processors | Two Intel® Xeon® E5-2680 v3, E5-2697 v3, or E5-2698 v3 processors | Two Intel Xeon E5-2690 v3 processors
Memory | 128 GB of 2133 MHz DIMMs | 128 GB of 2133 MHz DIMMs
Hard drives | Two 600 GB 10K SAS drives | Four 300 GB 15K SAS drives
Graphical Processing Units (GPUs) | Not applicable | Two NVIDIA Kepler K80 GPUs or Intel 7120P PHIs

Figure 1 depicts the HP ProLiant BL460 Gen9 Server that was used for testing.


Figure 1. HP ProLiant Blade enclosure (C7000) with a cluster of BL460 Gen9 blades.

For servers configured with the Intel E5-26xx v3 “Haswell” processors, it is important to have memory DIMMs populated on all four memory channels of each processor. Failure to do so will degrade performance. Although each processor can use another processor’s memory, doing so increases memory latency and degrades performance.

The ProLiant XL250a server can be configured with Graphical Processing Units (GPUs) through two dedicated internal PCIe Gen3 connections. In the case of High Performance Computing (HPC), these GPUs can be used as floating point accelerators to measurably speed up floating point applications, provided those applications are programmed to use them. Figure 2 shows the Apollo 6000, which can be populated with these XL250 Gen9 servers.

Figure 2. Apollo 6000 with either XL230 or XL250 Gen9 servers

Message Passing Interface

ANSYS applications include an independently developed MPI (Platform MPI or Intel MPI, depending on the application).

When reviewing the test results, notice that some tests do not use all the cores on each particular type of processor. When not using all the cores, there are several ways to distribute the processes over both processor sockets. We used the default method—round robin.

The round robin method alternates the placement of processes on the processors. For example, when running eight processes on a node with two, eight-core sockets, MPI alternates the placement of processes between each processor, resulting in four processes on each processor. This is the best way to run ANSYS applications. Although MPI can be configured to use other placement methods, we do not recommend doing so.
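The following minimal sketch (in Python, not part of any ANSYS tooling) illustrates the round robin placement pattern described above; it models only the rank-to-socket mapping, not the MPI library's actual implementation.

```python
# Sketch of round-robin rank placement over two processor sockets.
# Illustrative only; the real placement is handled by the MPI runtime.

def round_robin_placement(n_processes, n_sockets=2):
    """Alternate process (rank) placement across sockets."""
    placement = {socket: [] for socket in range(n_sockets)}
    for rank in range(n_processes):
        placement[rank % n_sockets].append(rank)
    return placement

# Eight processes on a node with two eight-core sockets:
# four ranks land on each socket, as described above.
print(round_robin_placement(8))
# {0: [0, 2, 4, 6], 1: [1, 3, 5, 7]}
```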

ANSYS CFD test results

The following benchmark scenarios were used when testing ANSYS Fluent and ANSYS CFX. Descriptions of the benchmarks can be found at http://www.ansys.com/benchmarks:

• Small, medium, and large standard benchmarks on a single node for Fluent, and small and medium standard benchmarks on a single node for CFX

• Multi-process performance relative to single-process performance

• Standard benchmarks going from one node to thirty-two nodes for Fluent, and one to sixteen nodes for CFX


• Running Fluent with GPU enablement on a single node and on multiple nodes.

• Comparing the scalability of machines with 12, 14, and 16 core processors.

ANSYS Fluent

Our testing shows that Fluent version 16.0 scales well both within a single node and as nodes are added to a cluster. GPU acceleration can boost performance where applicable.

Standard benchmarks on a single node

Figure 3 shows the geometric mean when running the Fluent small, medium, and large standard benchmarks on a single node. The results are in solver ratings. Solver ratings are a measure of the amount of work that can be done in a single day, so a larger solver rating indicates better performance.

The solver rating is calculated as follows:

solver rating = 86,400 (the number of seconds in a day) / time of the solve step in seconds

For example, if the solver time in a particular job run were 100 seconds, the solver rating for that job would be 86400/100, which is a rating of 864.
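As a small illustration of the metric, the sketch below (Python, with hypothetical benchmark times) computes solver ratings and the geometric mean used in the charts that follow.

```python
# Solver rating: seconds in a day divided by the solve-step time.
# Benchmark times below are hypothetical, for illustration only.
from math import prod

SECONDS_PER_DAY = 86400

def solver_rating(solve_time_seconds):
    """How many solves of this size could run in one day."""
    return SECONDS_PER_DAY / solve_time_seconds

def geometric_mean(values):
    return prod(values) ** (1.0 / len(values))

print(solver_rating(100))                      # 864.0, as in the example above
print(geometric_mean([864.0, 432.0, 216.0]))   # combined figure over several benchmarks
```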

As Figure 3 illustrates, there is good scaling from one to 32 processes on a single node when running these benchmarks. However, notice that the performance benefit lessens as you get closer to 32 processes.

Figure 3. Geometric mean of standard benchmarks

Multi-process performance relative to single-process performance

Figure 4 illustrates multi-process performance relative to single-process performance. Notice that there is a steady increase as you increase the number of processes up to 30 to 32 processes. Not all applications exhibit this behavior. Some applications might scale well only up to 10 to 12 processes on a node; it would not be beneficial to run those applications with more than 10 or 12 processes on one node. Fluent makes productive use of all 32 cores, although, as stated earlier, the benefit of running with more cores in a node is starting to tail off. Also related to performance are the processor and memory clock speeds. With applications such as Fluent, the faster the processor and memory clock speeds, the better the performance. On the machine used for these results, the processor clock was 2.3 GHz and the memory speed was 2133 MHz. Processors with a faster clock and fewer cores may be more beneficial; more about this later.
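The relative-performance figures in this and the following charts are simple ratios of solver ratings; a minimal sketch (with hypothetical numbers) is shown below.

```python
# Speedup and parallel efficiency derived from solver ratings.
# The ratings used here are hypothetical, not measured results.

def speedup(rating_n, rating_1):
    """Multi-process performance relative to single-process performance."""
    return rating_n / rating_1

def parallel_efficiency(rating_n, rating_1, n_processes):
    """Fraction of ideal linear scaling achieved."""
    return speedup(rating_n, rating_1) / n_processes

print(speedup(340.0, 20.0))                    # 17.0x with 32 processes (hypothetical)
print(parallel_efficiency(340.0, 20.0, 32))    # ~0.53 of ideal scaling
```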

[Figure 3 chart: solver ratings versus number of processes, 1p to 32p.]


Figure 4. Speedup of Fluent in a single ProLiant XL230 Gen9 node

Standard benchmarks going from one node to thirty-two nodes

Figure 5 shows the speedup for the geometric mean of the Fluent small, medium, and large standard benchmarks going from one node to thirty-two nodes. As you can see, there is a good speedup all the way out to thirty-two nodes, which is very good compared to other applications, considering that this many nodes means 1,024 processes and that some of the benchmarks included in the geometric mean are not very large. What is critical to scalability on the hardware side is the network. With parallel process applications, there is a substantial amount of communication going on between the processes in a node and also between processes spread out over multiple nodes in the network. Because of this, a network with low latency and high bandwidth is required for performance. For these tests, a network consisting of FDR InfiniBand interconnects and switches was used. At this time, there is no network that has lower latency and higher bandwidth than FDR InfiniBand!

[Figure 4 chart: speedup versus number of processes, 1p to 32p.]


Figure 5. Speedup of Fluent going from one to thirty-two nodes

The next chart, Figure 6, shows the speedup of the Fluent very large benchmarks (Combustor 71M and F1 racer 140M) from four to thirty-two nodes. What is interesting about it is that the scalability at larger node counts, say 8 to 16 or 16 to 32 nodes, is more pronounced than in Figure 5. This is because the larger benchmarks have more work to do, and that work can be parallelized efficiently at a higher process count than the other benchmarks. As you may notice, this is perfect scaling!

Figure 6. Speedup of Fluent very large benchmarks

[Figure 5 chart: speedup versus node count, 1 to 32 nodes.]

[Figure 6 chart: speedup versus node count, 4 to 32 nodes.]


Processors with higher clock speeds and fewer cores

Earlier it was mentioned that processors with higher clock speeds but somewhat fewer cores might be the right type of system for Fluent. Figure 7 shows an example of this: a node-by-node comparison of clusters with three variations of the Intel E5-26xx v3 processor. The E5-2698 v3 is the 16-core chip that was used in the benchmarking results for the previous charts; its clock speed is 2.3 GHz. We also show the E5-2697 v3, a 14-core 2.6 GHz processor, and the E5-2680 v3, a 12-core 2.5 GHz chip. As you can see, there is not a lot of difference when comparing nodes built from these variations. The nodes with the E5-2680 v3 perform the worst, but keep in mind that these nodes have only 24 cores per node versus 28 for the E5-2697 v3 and 32 for the E5-2698 v3 nodes.

Figure 7. Node-by-node comparison of systems with different types of E5-26xx v3 processors

Fluent and GPUs

Fluent version 16 continues the GPU capability that was introduced in version 15. Figure 8 shows the results from a 1.2 million cell pipe benchmark run on two XL250 Gen9 servers with Intel E5-2690 v3 (2.6 GHz) processors, 128 GB of memory, and two NVIDIA K80s each. Each machine had two 12-core processors. Each K80 contains two GPUs, so the chart compares two and four functional GPUs, which count as two and four parallel units against our ANSYS HPC Pack licenses. One K80 helps up to between 8 and 14 cores per node, depending on whether you are running on one node or two. Two K80s show a benefit up to 24 processes on either one node or two nodes, although in the two-node case at 24 processes, running with no GPUs performs almost as well as running with two K80s.

Figure 8. Fluent with and without GPU enablement

[Figure 7 chart: solver ratings versus node count, 1 to 16 nodes, for the E5-2698 v3, E5-2697 v3, and E5-2680 v3 systems.]

[Figure 8 chart: 1.2M cell pipe benchmark, solver ratings versus processes per node (1p1n to 24p2n) with no GPUs, one K80 per node, and two K80s per node.]


Figure 9 shows results from a larger 9.6 million cell benchmark running on two XL250 nodes with the configuration used for the previous chart. This benchmark was too large to run on a single K80 within one node, but with two K80s we see a good performance boost from them. On two nodes we see a performance boost at four and eight processes per node with one K80 per node, and with two K80s per node we see the best two-node performance when running with eight processes per node. The 48-process no-GPU run is included for comparison; running with 16 processes and four K80s outperforms it. Also note that the GPU run uses only two HPC Pack licenses, whereas the no-GPU run takes three.

Figure 9. Fluent with and without GPU enablement on large benchmark


ANSYS CFX

Our testing shows that CFX is highly scalable.

Small and medium standard benchmarks on a single node

Figure 10 illustrates the small and medium standard benchmarks on a single node.

[Figure 9 chart: 9.6M cell pipe benchmark, solver ratings versus processes per node with no GPUs, one K80 per node, and two K80s per node.]


Figure 10. Geometric mean of the small and medium benchmarks in a single node.

Multi-process performance relative to single-process performance

With CFX, the geometric mean of the small and medium benchmarks (Pump, Lemans, and Airfoil 10M) exhibits a speedup of almost 18 times going from one to 24 processes on a BL460c Gen9 with two 12-core 2.5 GHz Intel E5-2680 v3 processors, as shown in the relative performance chart in Figure 11. Note that the amount of memory on these machines is 128 GB. How much memory you require depends on the types and sizes of the cases you run. It is typical for ANSYS CFD applications to have four to eight gigabytes of memory available per core. Since there are 24 cores in each of these machines, a minimum of 96 GB and a maximum of 192 GB would be recommended; 128 GB falls right in the middle of this range. The reason for this recommendation is that you don't want so little memory that you cannot run a job large enough to take advantage of the processing power of the processors, while with large jobs and too few cores you are not taking advantage of the scalability of the application and the machines to get a better time to solution. Also notice that, since this machine has far fewer cores than the 32-core machine shown in the earlier Fluent results, the scaling here continues all the way to the maximum number of cores on the system. Fluent would behave in a similar way on this machine; conversely, the scaling of CFX on the 32-core machines would be analogous to Fluent's.
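A minimal sketch of the memory sizing rule of thumb above (4 to 8 GB per core), with the 24-core node as the worked example:

```python
# Memory sizing rule of thumb for ANSYS CFD compute nodes: 4 to 8 GB per core.

def recommended_node_memory_gb(cores_per_node, gb_per_core_min=4, gb_per_core_max=8):
    """Return the (minimum, maximum) recommended memory per node in GB."""
    return cores_per_node * gb_per_core_min, cores_per_node * gb_per_core_max

# A node with two 12-core E5-2680 v3 processors (24 cores):
low, high = recommended_node_memory_gb(24)
print(low, high)   # 96 192 -> the 128 GB configuration sits in the middle of this range
```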

Figure 11. Speedup of CFX in a single ProLiant BL460c Gen9 node

[Figure 10 chart: solver ratings versus processes, 1p1n to 24p1n.]


Standard benchmarks going from one node to sixteen nodes

When running the small, medium, and large CFX benchmarks (Pump, Lemans, Airfoil 10M, and Airfoil 50M) on more than one compute node, one can see a good speedup, too. The performance increase runs to over ten times at sixteen nodes, which shows the benefit of going parallel over multiple nodes with CFX.

Figure 12. Speedup of CFX from one to sixteen nodes

With Figure 13 we see the scalability of the large CFX benchmarks (Airfoil 50M and 100M); like Fluent, the large benchmarks scale better on a node-by-node basis than the smaller ones. This shows almost perfect scaling!

[Figure 11 chart: speedup versus processes, 1 to 24.]

[Figure 12 chart: speedup versus node count, 1 to 16 nodes.]


Figure 13. Speedup of the large CFX benchmarks.

Summary of ANSYS CFD test results

Our testing shows that Fluent and CFX are highly scalable. What makes Fluent and CFX so scalable compared to other applications? First, credit must be given to the developers at ANSYS who have worked over the years to make Fluent and CFX high performing, scalable applications. Other reasons involve the characteristics of the applications. Like many CFD applications, Fluent and CFX do not perform a lot of file system I/O, so they are not dependent on the speed of the file systems, which can slow down a high performance computational application. In addition, Fluent and CFX do not over-tax the bandwidth of the memory system on the node when used with a proper memory-to-core configuration.

For Fluent and CFX running on single servers, another consideration is the latency and bandwidth of the node's memory. In a single node, the data transmitted between a core running a Fluent or CFX process and memory can affect performance. These memory parameters also affect the communication time among Fluent or CFX processes. Since Fluent and CFX parallelism is a multi-process form based on message passing protocols, the various processes in a parallel Fluent or CFX job will have to communicate with one another from time to time. In a single node, this communication is passed from a process running on one core, through memory, to another process running on another core. The memory performance therefore affects this communication time, so using 2133 MHz memory versus the previous maximum of 1866 MHz is a benefit here.

For Fluent and CFX running across multiple servers, in a multi-node job, node-to-node communication times become important. In the clusters of ProLiant BL460 Gen9, XL230a Gen9, and XL250 Gen9 servers used in our testing, a high-speed FDR InfiniBand inter-node interconnect was used to facilitate communication between nodes. This interconnect has the low latency and high bandwidth characteristics that make it ideal for this task.
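As a rough illustration of why both latency and bandwidth matter for this communication, the sketch below uses the standard first-order model in which transfer time is latency plus message size divided by bandwidth; the latency and bandwidth values are assumed, illustrative figures rather than measurements from this testing.

```python
# First-order communication cost model: time = latency + size / bandwidth.
# The latency and bandwidth values below are assumptions for illustration.

def transfer_time_us(message_bytes, latency_us, bandwidth_gb_per_s):
    """Estimated one-way transfer time in microseconds."""
    bytes_per_us = bandwidth_gb_per_s * 1e3   # 1 GB/s = 1,000 bytes per microsecond
    return latency_us + message_bytes / bytes_per_us

# A 1 MB message over an assumed intra-node (shared memory) path
# versus an assumed FDR InfiniBand link between nodes.
print(transfer_time_us(1_000_000, latency_us=0.5, bandwidth_gb_per_s=15.0))
print(transfer_time_us(1_000_000, latency_us=1.0, bandwidth_gb_per_s=6.8))
```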

With these applications, larger jobs will scale better to higher numbers of nodes. The more work you give them, the more parallelism you can get out of them, provided you have the hardware and licenses!

Solution Reference Architectures for ANSYS CFD

From our test results, it can be concluded that the best hardware configuration for a given application will depend on a number of factors. Nonetheless, some general recommendations and guidance can be made for suitable compute clusters running ANSYS CFD, as shown in the three SRAs in Figures 14, 15, and 16.

Figure 14. ANSYS CFD (Fluent and CFX): starter cluster kit

[Figure 13 chart: speedup versus node count, 2 to 16 nodes.]


Server Options: 1 XL1x0r head node; 2 to 4 ProLiant Xeon nodes, each using 2 processors, in an Apollo 2x00 chassis

24 to 28 cores per compute node; E5-2697 v3 14-core 2.6 GHz processors recommended; 3 to 6 SAS drives (RAID 0)

Options: 2 NVIDIA K40s (Fluent)

Total Memory for the Cluster: Compute nodes 4 to 8 GB/core; head node 32 GB or more depending on role

Cluster Interconnect: Integrated Gigabit, 10 Gigabit Ethernet, or QDR InfiniBand (recommended for jobs using more than four nodes)

Operating Environment: 64-bit Linux, Microsoft (HPC Pack) Server 2012

Workloads: Suited for ANSYS CFD models up to ~230M cells (Fluent) and, depending on mesh, ~60M to ~230M nodes (CFX)

HP ProLiant XL170r or XL190r Gen9 nodes in a 2U Apollo 2000 chassis

The SRA shown in Figure 14 is our “Starter Cluster Kit” for CFD. It showcases the Apollo 2000 chassis, which can house up to four XL170r or XL190r ProLiant servers. There are some common things to keep in mind when looking at this SRA and the ones that follow: the number of nodes, the number of cores per compute node, the amount of memory per core, and the ideally suited workloads shown in the SRA. In the case of this SRA, the workloads listed are for the maximum configuration shown, which is four nodes, each with 24 to 28 cores per compute node and 8 GB per core. If the workloads you run are, say, half that size, then perhaps you want two nodes, or four nodes with 4 GB per core, or some mixture of lower core count processors and less memory. If your jobs are larger, read on!

Figure 15. ANSYS CFD (Fluent and CFX): midsize cluster


Server Options: 1 DL380 head node; 4 to 32 ProLiant XL230a or XL250a Xeon nodes (Apollo 6000), each using 2 processors, in an a6000 chassis (up to 10 nodes per chassis)

24 to 32 cores per compute node; E5-2698 v3 2.3 GHz 16-core processors recommended; 2 to 4 1TB 15K SAS drives per compute node; up to 2 NVIDIA K80s per XL250a node (Fluent)

Options: Configure the DL380 node with up to 24 internal SAS drives, with extra memory/storage for very large jobs

Total Memory for the Cluster: Compute nodes 4 to 8 GB/core; head node 32 GB or more depending on role

Cluster Interconnect: FDR InfiniBand 2:1

Operating Environment: 64-bit Linux, Microsoft (HPC Pack) Server 2012

Workloads: Suited for ~4 simultaneous ANSYS CFD models up to ~500M cells (Fluent) and, depending on mesh, ~100M to ~500M nodes (CFX). Or, run ~20 to ~30 simultaneous ANSYS CFD models on the scale of ~50M cells (Fluent), ~10M to ~50M nodes (CFX), again depending on mesh

Apollo 6000

In Figure 15 we are looking at our “Midsize Cluster” for CFD. The same considerations about the number of compute nodes, the number of cores per compute node, the amount of memory per core, and how that relates to the ideal workloads apply as with the previous SRA. However, with this one we move to the Apollo 6000 chassis with XL230a and XL250a nodes. One thing to note here is that FDR InfiniBand is specified for the interconnect: with four nodes and up, you want the high-speed InfiniBand interconnect!

Figure 16. ANSYS CFD (Fluent/CFX): Large Scale-Out Cluster


Server Options: 1 DL380 head node; 32 to 64 ProLiant BL460c nodes, each using 2 processors

24 to 32 cores per compute node; E5-2697 v3 2.6 GHz 14-core processors recommended; two 1.2TB 15K SAS drives per compute node

Options: WS460c Gen9 workstation blade with an NVIDIA Quadro K6000 graphics card for pre/post processing using remote visualization; configure the head node with extra memory/storage for very large jobs

Total Memory for the Cluster: Compute nodes 4 to 8 GB/core; head node 32 GB or more depending on role

Cluster Interconnect: FDR InfiniBand 2:1

Operating Environment: 64-bit Linux, Microsoft (HPC Pack) Server 2012

Workloads: Suited for ~4 simultaneous ANSYS CFD models greater than 500M cells (Fluent), and greater than ~100M to ~500M nodes (CFX) depending on mesh. Or, run more than 30 simultaneous ANSYS CFD models on the scale of ~50M cells (Fluent), ~10M to ~50M nodes (CFX), depending on mesh

With Figure 16 we have our large scale-out cluster, built from BL460c blade servers. One specific thing to point out about this SRA is that you can replace one of the server blades with a WS460c workstation blade carrying a K6000 graphics card, or a K2 for combined computation and remote graphics!


ANSYS Mechanical test results

We tested ANSYS Mechanical with and without GPU support, using the following benchmark scenarios, which can be found at http://www.ansys.com/benchmarks.

With GPU support

We ran tests with ANSYS Mechanical version 16.0 in distributed (DMP) mode with the following mainstream solvers, which support GPU acceleration:

• Direct Sparse
• Preconditioned Conjugate Gradient (N/A with the Intel Xeon PHI)
• Jacobian Conjugate Gradient (N/A with the Intel Xeon PHI)

NOTE: ANSYS Mechanical DMP mode gained Xeon PHI support in version 16 for the Direct Sparse solver with symmetric matrices.

There are several ways to utilize GPUs for a job running on one node or on a cluster of nodes:

• One ANSYS Mechanical job running on one node with one or more GPUs: If only one job is running on one node, that job can use one or more GPUs. In DMP mode, a job is segmented into chunks. These chunks run as processes on different cores. Each process computes in parallel with other processes. When utilizing GPUs, each process runs until it can offload part of the computation to the GPU. At this point, the process sends work to the GPUs, waits for the results, and continues.

• Multiple ANSYS Mechanical jobs on one node with multiple GPUs: To use additional GPUs on a node, you can run more than one ANSYS Mechanical job on the node; each job uses a different GPU.

• One ANSYS Mechanical job on multiple nodes, each with a GPU: A single ANSYS Mechanical job can use more than one GPU by running the job on multiple nodes, one GPU on each node it is running on. If you run a 24-process parallel job over four nodes, you could use four GPUs with the job, as long as you have at least one GPU on each node.

Note the following restrictions when using ANSYS Mechanical with GPUs:

• These GPU-enabled solvers support many ANSYS Mechanical datasets, but there may be times when the GPU may not be used for certain types of analysis.

• If the data for the job is too big for the onboard GPU memory with the PCG solver, ANSYS Mechanical will not use the GPU; instead, it will run in normal compute mode.

Figure 17 shows that, on average, you get a substantial benefit from the NVIDIA K80s, with each K80 having two GPUs on it. Note that as the number of processes goes up, the relative benefit of the GPUs compared to using no GPUs lessens. Also, these are systems with 24 cores per node, but only a maximum of 16 processes per node is shown; this is because when using GPUs there is not much of a performance benefit when running with more than 32 processes over two nodes. Results for 24 processes and 28 processes over two nodes are included as well.

The performance with 24 processes and two K80s per node exceeds the performance with 32 processes with no GPUs or with one K80 per node, and it is not far off from 32 processes with two K80s per node. From a licensing standpoint, using 24 cores and 8 GPU units (four K80s) requires only two HPC Pack licenses, whereas with 32 processes and even one K80 you would need three HPC Packs to run. If all you have is one HPC Pack license, you can see that running with six processes and one K80 would outperform not using GPUs at all. In these ways the GPUs can help you maximize your performance per license cost.
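A minimal sketch of the license accounting behind these comparisons. It assumes, consistent with the figures quoted above, that each GPU counts as one parallel unit alongside each CPU process and that n HPC Packs enable 8, 32, 128, 512, ... parallel units; check your ANSYS licensing terms for the authoritative rules.

```python
# Rough HPC Pack accounting: packs enable 8 * 4**(n-1) parallel units,
# and each GPU is assumed to consume one unit alongside each CPU process.

def packs_needed(processes, gpus=0):
    """Smallest number of HPC Packs covering the parallel units in use."""
    units = processes + gpus
    packs = 1
    while 8 * 4 ** (packs - 1) < units:
        packs += 1
    return packs

print(packs_needed(24, gpus=8))   # 2 packs: 24 processes + four K80s (8 GPUs) = 32 units
print(packs_needed(32, gpus=2))   # 3 packs: 32 processes + one K80 (2 GPUs) = 34 units
print(packs_needed(6, gpus=2))    # 1 pack: 6 processes + one K80 fits within 8 units
```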


Figure 17. Comparison of geometric means of standard benchmark results with and without K80 GPU acceleration

Xeon PHI

New in ANSYS Mechanical version 16 is support for the Xeon PHI in the multi-process parallel product. Figure 18 shows the performance of running with one and two 7120P PHIs, with the results being the geometric mean of a select group of the ANSYS standard benchmarks. The PHI only works on jobs in which the sparse solver is used and the solution matrix is symmetric. Also, one PHI will only work with up to 8 processes and two will go up to 16 processes. The chart shows that, where applicable, the PHI does give an advantage, scaling well up to six to eight processes with one PHI and up to 14 processes with two PHIs, which outperforms running with 24 processes without PHI acceleration. As with the GPUs, the Xeon PHI can help you make optimal use of a single HPC Pack license, for example by running with six processes and one or two PHIs as compared to running with eight processes without acceleration.

Figure 18. Comparison of geometric means of standard benchmark results with and without Xeon PHI acceleration

Without GPU support

Figure 19 shows ANSYS Mechanical runs within a node. Specifically, we can see from Figure 19 that Mechanical speeds up over 12 times. We also see that the application's performance increases as you increase the number of parallel processes, although above 16 to 18 processes the relative improvement lessens. Still, this shows that ANSYS Mechanical scales.

Figure 19. Geometric mean of all the standard benchmarks run in a node.

[Figure 17 chart: solver ratings versus processes per node (1p1n to 32p2n) without GPUs, with one K80 per node, and with two K80s per node.]

[Figure 18 chart: solver ratings versus processes (1p to 24p) with no PHI, one PHI, and two PHIs.]


When running on two nodes, ANSYS Mechanical will scale to all the cores on both nodes, although again it loses efficiency as you increase the number of processes up to 48. For four nodes and beyond, look at Figure 20, which shows results for various process counts from one to eight nodes. It is a bit of an eye chart, but its point is to show that ANSYS Mechanical will scale up to 8 nodes, and that at eight nodes optimal performance comes from running 16 processes per node, even though each node has 24 cores and this cluster of eight enables up to 192 processes in total. At lower node counts the application will scale somewhat to all the processors in a node, as with the single node results previously shown, but as you get above 16 to 18 processes per node the efficiency of the performance improvement drops off, as it did on the single node. At eight nodes, the efficiency in improving performance drops to zero above 16 processes per node, as we can see.

Figure 20. Geometric mean of all the standard benchmarks run from one to eight nodes with varying process counts

To clean this up a bit, look at Figure 21, which shows multi-node scaling from one to eight nodes using a maximum of 16 processes per node. One other thing you might notice here is that 128 processes use three HPC Packs, so if you have three HPC Packs and eight nodes, this is the way you would run your job, provided it was big enough to use all eight nodes. By the same reasoning, if you had two HPC Packs you would run 32 processes over two nodes, even if your server has two 16-core processors and could run 32 processes on one node.

[Figure 19 chart: solver ratings versus processes, 1p to 24p.]

[Figure 20 chart: solver ratings versus processes and node counts, 1p1n to 192p8n.]


Figure 21. Geometric mean of all the standard benchmarks run from one to eight nodes with a maximum of 16 processes per node.

Faster processors, fewer cores. What does that buy you?

The answer to this question is that it depends. Look at Figure 22, which compares the performance of various processor types within a node using results from the ANSYS Mechanical benchmarks. The highest performing processor on a core-by-core (per-process) basis is the Intel E5-2667 v3, which has eight cores and is clocked at 3.2 GHz. The processor we have been looking at in the previous charts is the Intel E5-2680 v3, which has 12 cores and is clocked at 2.5 GHz. The other two shown are the E5-2690 v3, 12 cores clocked at 2.6 GHz, and the E5-2695 v3, 14 cores clocked at 2.3 GHz. Compared to the other processor models, the fast 3.2 GHz processor saves you four to six cores. If you have only two HPC Packs and a small number of nodes, and never intend to expand, perhaps this is the processor you would want; you will see similar behavior with the CFD applications. However, if you have a medium to large cluster of machines, you will want processors with more cores, which will give you better overall performance on a node-by-node basis.
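The trade-off can be made concrete by normalizing node-level ratings to a per-core figure, as in the sketch below; the node ratings used are hypothetical placeholders, not values from Figure 22.

```python
# Per-core versus per-node comparison of processor choices.
# Node ratings here are hypothetical placeholders, not measured results.

def per_core_rating(node_rating, cores_per_node):
    """Normalize a node-level solver rating to a per-core figure."""
    return node_rating / cores_per_node

# Hypothetical: an 8-core 3.2 GHz part versus a 12-core 2.5 GHz part (two sockets each).
fast_few_cores = per_core_rating(node_rating=400.0, cores_per_node=2 * 8)
more_cores = per_core_rating(node_rating=480.0, cores_per_node=2 * 12)
print(fast_few_cores, more_cores)  # 25.0 vs 20.0: better per core with fewer, faster cores,
                                   # but the higher-core-count node wins on total throughput
```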

Figure 22. Comparison of various processor types with ANSYS Mechanical.

Summary of ANSYS Mechanical test results

The conclusion about how to run ANSYS Mechanical in DMP mode on a cluster of ProLiant Gen9 servers without GPU acceleration: to run a single job on one or two nodes, you would use all the cores in the node; however, if you are limited by licenses, or you are running on four nodes or more, you would limit the number of processes to 16 per node.

[Figure 21 chart: solver ratings versus processes and node counts, 1p1n to 128p8n.]


When using GPU acceleration, you want to run with a maximum of 16 processes and two K80s per node. However, if you are limited by HPC Pack licenses, you might want to spread this out to 14 processes and one K80 (which has two GPUs in it), or 14 processes and two K80s (four GPUs). Where applicable, the Xeon 7120P PHI can help make optimal use of an HPC Pack license as well. One of the most effective ways to take advantage of the GPUs is to run multiple jobs on a node, each using up to six processes and one K80. Note that with the XL190r server model the K80 is not available, but the K40 is; roughly speaking, each GPU on a K80 is equivalent to a K40, so you would adjust your process versus GPU combination accordingly. With respect to I/O, since this application makes heavy use of the file system, you want the application to use local storage attached to each individual node. You can run it from a shared file system and it will work, but having the file I/O go over a network file system (NFS) or even a fast shared file system will perform worse than using local storage.

Solution Reference Architectures for ANSYS Mechanical

From our test results, it can be concluded that the best hardware configuration for a given application will depend on a number of factors. Nonetheless, some general recommendations and guidance can be made for suitable compute clusters running ANSYS Mechanical, as shown in the two SRAs in Figures 23 and 24.

Figure 23. Starter Cluster Kit for Mechanical

Server Options: 1 XL1x0r head node; 2 to 4 ProLiant Xeon nodes, each using 2 processors, in an Apollo 2x00 chassis

Up to 24 cores per compute node; E5-2690 v3 12-core 2.6 GHz processors recommended; 3 to 6 480GB SSD drives (RAID 0)

Options: 2 NVIDIA K40s (XL190)

Total Memory for the Cluster: Compute nodes 4 to 8 GB/core; head node 32 GB or more depending on role

Cluster Interconnect: Integrated Gigabit, 10 Gigabit Ethernet, or QDR InfiniBand (recommended for jobs using more than 2 nodes)

Operating Environment: 64-bit Linux, Microsoft (HPC Pack) Server 2012

Workloads: Suited for Mechanical up to ~70M or ~480M DOFs depending on solver used

HP ProLiant XL170r or XL190r Gen9 nodes in a 2U Apollo 2000 chassis

The SRA shown in Figure 23 is our “Starter Cluster Kit” for ANSYS Mechanical. It is similar to our CFD starter kit, except that we show the option of using 480GB SSDs. ANSYS Mechanical's performance can be highly sensitive to file system performance, and RAID 0 striped Solid State Drives (SSDs) can dramatically outperform standard hard drives.


Figure 24. Fat node Cluster for Mechanical.


Server Options: 1 DL380 head node; 4 to 8 ProLiant DL380 Xeon server nodes, each using 2 processors (24 cores), an NVIDIA K80, and 2 to 24 internal 600GB 15K SAS drives or 800GB SAS SSDs striped RAID 0 per compute node, plus a 6x2TB SAS RAID 0 disk array on the head node

Or: 4 to 8 XL250a Xeon server nodes (Apollo 6000), each using 2 processors (24 cores), up to 2 NVIDIA K80s or Intel 7120P PHIs, and 2 internal 15K SAS drives or 800GB SAS SSDs per compute node (suitable for nonlinear jobs >= 2M DOF)

E5-2680 v3 2.5 GHz or E5-2690 v3 2.6 GHz 12-core processors recommended

Total Memory for the Cluster: 8 GB/core on the head node; 4 to 8 GB/core on each compute node

Cluster Interconnect: FDR InfiniBand

Operating Environment: 64-bit Linux or Microsoft HPC Server 2008

Workloads: 384 to 1536 GB RAM configurations will handle up to ~8 simultaneously running ANSYS “megamodels” of ~45-180M or ~450M DOFs and up, depending on solver used

Our last SRA is the “Fat node Cluster” for ANSYS Mechanical. The reason it is called the “Fat node cluster” is that it uses the DL380 Gen9 server, in which you can put up to 24 disks, using SSDs for file system performance if you want. We also show the option of XL250a nodes for GPU support; however, you can now get the DL380 server with one K80. ANSYS Mechanical can be even more sensitive to the type of interconnect than the CFD codes are, so here we have FDR InfiniBand recommended for configurations of four nodes and larger. You would get some benefit at two nodes as well.

Conclusion

HP designed the hardware configurations used in the analysis for this paper for HPC. The servers are configured using high performing Intel E5-2600 v3 processors, fast memory DIMMs, and high performance disk drives. Other HP two-processor server models with similar processors, memory, networks, and disk subsystems will perform similarly.

This summary of ANSYS applications on HP ProLiant servers using Intel Xeon E5-26xx v3 12- and 16-core processors shows that now, as in the past, as the number of cores on the processors increases, application performance improves. The performance of memory and network components has improved to maximize the performance of these processors; however, there are still considerations to be taken into account when running ANSYS CFD applications in parallel.

Fluent and CFX are both highly scalable, both within a node and over many nodes in an HPC cluster. With both applications, the recommendation when running multi-node parallel jobs is to fill up the nodes with processes, up to the maximum number of cores in the case of the machines tested.

Of course, with these as with all applications, you need to match the level of parallelism to the requirements of the dataset. If you try to parallelize a job that does not have enough work, there will be too much inter-process communication compared to the amount of compute work in an application process, and the advantages from spreading out the compute work over a number of cores will be negated.

Version 16 continues GPU support for the Fluent application; where applicable, there is a benefit, at least at smaller parallel process counts.

With the new version of ANSYS Mechanical, we see that it is highly scalable. It is therefore recommended to fill up the node with processes when running on a node or two, as shown in this paper with the 24-core systems. However, when running on a larger number of nodes, running with the nodes fully populated may not buy you anything. Faster processors with smaller core counts will get better performance on a core-by-core basis, but on a node-by-node basis having more cores is beneficial, up to a point.


When running with GPUs, a maximum of 16 processes per node using two K80s (four GPUs) per node would be ideal. GPUs would also be very effective when running more than one job per node, for example two to three jobs of six or seven processes with a GPU for each job. If you are limited in the number of licenses you have, using GPUs is a great way to optimize your performance; for that matter, where applicable, the Xeon 7120P PHI is as well.
