LS DYNA Performance Benchmarks and Profilinghpcadvisorycouncil.com/pdf/LS-DYNA _analysis.pdf · 4...

LS‐DYNA Performance Benchmarks and Profiling

January 2009


Note

• The following research was performed under the HPC Advisory Council activities

– AMD, Dell, Mellanox

– HPC Advisory Council Cluster Center

• The participating members would like to thank LSTC for their support and guidelines

• The participating members would like to thank Sharan Kalwani, HPC Automotive specialist, for his support and guidelines

• For more info please refer to

– www.mellanox.com, www.dell.com/hpc, www.amd.com


LS-DYNA

• LS-DYNA

– A general purpose structural and fluid analysis simulation software package capable of simulating complex real world problems

– Developed by the Livermore Software Technology Corporation (LSTC)

• LS-DYNA used by

– Automobile

– Aerospace

– Construction

– Military

– Manufacturing

– Bioengineering


LS-DYNA

• LS-DYNA SMP (Shared Memory Processing)

– Optimizes the power of multiple CPUs within a single machine

• LS-DYNA MPP (Massively Parallel Processing)

– The MPP version of LS-DYNA allows the LS-DYNA solver to run over a high-performance computing cluster

– Uses message passing (MPI) to obtain parallelism

• Many companies are switching from SMP to MPP

– For cost-effective scaling and performance


Objectives

• The presented research was done to provide best practices

– LS-DYNA performance benchmarking

– Interconnect performance comparisons

– Ways to increase LS-DYNA productivity

– Understanding LS-DYNA communication pattern

– MPI libraries comparisons

– Power-aware consideration


Test Cluster Configuration

• Dell™ PowerEdge™ SC 1435 24-node cluster

• Quad-Core AMD Opteron™ Model 2358 processors (“Barcelona”)

• Mellanox® InfiniBand ConnectX® DDR HCAs

• Mellanox® InfiniBand DDR Switch

• Memory: 16GB memory, DDR2 667MHz per node

• OS: RHEL5U2, OFED 1.3 InfiniBand SW stack

• MPI: HP MPI 2.2.7, Platform MPI 5.6.5

• Application: LS-DYNA MPP971

• Benchmark Workload

– Three Vehicle Collision Test simulation

– Neon-Refined Revised Crash Test simulation


Mellanox InfiniBand Solutions

• Industry Standard

– Hardware, software, cabling, management

– Design for clustering and storage interconnect

• Performance

– 40Gb/s node-to-node

– 120Gb/s switch-to-switch

– 1us application latency

– Most aggressive roadmap in the industry

• Reliable with congestion management

• Efficient

– RDMA and Transport Offload

– Kernel bypass

– CPU focuses on application processing

• Scalable for Petascale computing & beyond

• End-to-end quality of service

• Virtualization acceleration

• I/O consolidation including storage

Figure: InfiniBand Delivers the Lowest Latency.

Figure: The InfiniBand Performance Gap is Increasing. Compares InfiniBand with Fibre Channel and Ethernet; labelled data rates: 20Gb/s, 40Gb/s, 60Gb/s, 80Gb/s (4X), 120Gb/s, 240Gb/s (12X).


Quad-Core AMD Opteron™ Processor

• Performance

– Quad-Core: enhanced CPU IPC, 4x 512K L2 cache, 2MB L3 cache

– Direct Connect Architecture: HyperTransport™ technology, up to 24 GB/s

– Floating Point: 128-bit FPU per core, 4 FLOPS/clk peak per core

– Memory: 1GB page support, DDR-2 667 MHz

• Scalability

– 48-bit Physical Addressing

• Compatibility

– Same power/thermal envelopes as Second-Generation AMD Opteron™ processor

Figure: Quad-Core AMD Opteron™ Processor block diagram (PCI-E® bridges, I/O hub, USB, PCI, dual-channel registered DDR2, 8 GB/s links).
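The "4 FLOPS/clk peak per core" figure can be turned into a theoretical per-node peak. The following is a minimal sketch, assuming two processors per node (the benchmark charts label 4 nodes as 32 cores, which implies 8 cores per node) and a hypothetical clock frequency. The slides do not state the clock speed, so `clock_ghz` is a placeholder, not the specification of the tested processor.

```python
def peak_gflops(sockets: int, cores_per_socket: int,
                flops_per_clock: int, clock_ghz: float) -> float:
    """Theoretical peak in GFLOPS: sockets x cores x FLOPs/clock x GHz."""
    return sockets * cores_per_socket * flops_per_clock * clock_ghz


# 4 FLOPS/clk per core and 4 cores per processor come from the slide;
# 2 sockets per node and the 2.0 GHz clock are illustrative assumptions.
node_peak = peak_gflops(sockets=2, cores_per_socket=4,
                        flops_per_clock=4, clock_ghz=2.0)
print(f"Peak per node: {node_peak:.1f} GFLOPS")
```

This is an upper bound only; the elapsed times reported in the benchmark slides depend on memory bandwidth and interconnect behaviour as well.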


Dell PowerEdge Servers helping Simplify IT

• System Structure and Sizing Guidelines

– 24-node cluster built with Dell PowerEdge™ SC 1435 Servers

– Servers optimized for High Performance Computing environments

– Building Block Foundations for best price/performance and performance/watt

• Dell HPC Solutions

– Scalable Architectures for High Performance and Productivity

– Dell's comprehensive HPC services help manage the lifecycle requirements

– Integrated, Tested and Validated Architectures

• Workload Modeling

– Optimized System Size, Configuration and Workloads

– Test-bed Benchmarks

– ISV Applications Characterization

– Best Practices & Usage Analysis


LS-DYNA Performance Results - Interconnect

• InfiniBand high speed interconnect enables highest scalability

– Performance gain with cluster size

• Performance over GigE is not scaling

– Slowdown occurs as number of processors increases beyond 16 nodes

Figure: LS-DYNA - 3 Vehicle Collision. Elapsed time (seconds) versus number of nodes, from 4 (32 cores) to 24 (192 cores); series: InfiniBand, GigE. Lower is better.

Figure: LS-DYNA - Neon Refined Revised. Elapsed time (seconds) versus number of nodes, from 4 (32 cores) to 24 (192 cores); series: InfiniBand, GigE. Lower is better.
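One way to quantify "not scaling" is to compute speedup and parallel efficiency relative to the smallest run. The sketch below assumes that definition (the slides do not give one), and the elapsed times are made-up illustrative numbers, not values read from the charts.

```python
def scaling_table(elapsed_by_nodes: dict[int, float]) -> list[tuple[int, float, float]]:
    """Return (nodes, speedup, efficiency) rows relative to the smallest node count."""
    base_nodes = min(elapsed_by_nodes)
    base_time = elapsed_by_nodes[base_nodes]
    rows = []
    for nodes in sorted(elapsed_by_nodes):
        speedup = base_time / elapsed_by_nodes[nodes]
        ideal = nodes / base_nodes          # perfect linear scaling
        rows.append((nodes, speedup, speedup / ideal))
    return rows


# Hypothetical elapsed times in seconds (NOT measurements from the slides).
example = {4: 4000.0, 8: 2200.0, 16: 1400.0, 24: 1600.0}
for nodes, speedup, eff in scaling_table(example):
    print(f"{nodes:2d} nodes: speedup {speedup:.2f}x, efficiency {eff:.0%}")
```

In this made-up data set the speedup drops between 16 and 24 nodes, which is the kind of slowdown the slide reports for GigE.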


LS-DYNA Performance Results - Interconnect

• InfiniBand outperforms GigE by up to 132%

– As the number of nodes increases, a bigger advantage is expected

Figure: LS-DYNA - 3 Vehicle Collision (InfiniBand vs GigE). Performance advantage (%) versus number of nodes, from 4 (32 cores) to 24 (192 cores).

Figure: LS-DYNA - Neon Refined Revised (InfiniBand vs GigE). Performance advantage (%) versus number of nodes, from 4 (32 cores) to 24 (192 cores).
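The slides do not define how the "performance advantage" percentage is calculated. A plausible reading, assumed here, is the ratio of elapsed times minus one. The elapsed times in this sketch are hypothetical and were chosen only so that the result comes out at about 132%.

```python
def performance_advantage(t_baseline: float, t_faster: float) -> float:
    """Relative advantage of the faster run: (t_baseline / t_faster) - 1."""
    if t_baseline <= 0 or t_faster <= 0:
        raise ValueError("elapsed times must be positive")
    return t_baseline / t_faster - 1.0


# Hypothetical elapsed times in seconds (not taken from the charts).
gige_seconds = 580.0
infiniband_seconds = 250.0
print(f"{performance_advantage(gige_seconds, infiniband_seconds):.0%}")
```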


LS-DYNA Performance Results – CPU Affinity

• CPU affinity accelerates performance up to 10%

• Saves up to 177 seconds per simulation

Figure: LS-DYNA - 3 Vehicle Collision (CPU Affinity vs Non-Affinity). Elapsed time (seconds) versus number of nodes, from 4 (32 cores) to 24 (192 cores); series: CPU Affinity, Without Affinity. Lower is better.
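The slides do not say how CPU affinity was applied in these runs (for example through an MPI launcher option or an operating-system tool). As an illustration of the underlying mechanism only, the sketch below pins the calling process to a single core using Python's `os.sched_setaffinity`, which is available on Linux. The helper name `pin_to_cpu` is hypothetical.

```python
import os


def pin_to_cpu(cpu: int) -> set[int]:
    """Pin the calling process to one CPU and return the resulting affinity mask.

    Uses os.sched_setaffinity, which is available on Linux only.
    """
    if not hasattr(os, "sched_setaffinity"):
        raise OSError("CPU affinity control is not available on this platform")
    os.sched_setaffinity(0, {cpu})   # pid 0 means the calling process
    return os.sched_getaffinity(0)


if __name__ == "__main__" and hasattr(os, "sched_getaffinity"):
    allowed = sorted(os.sched_getaffinity(0))
    print("allowed CPUs before pinning:", allowed)
    print("affinity after pinning:", pin_to_cpu(allowed[0]))
```

Pinning keeps a process on one core so the scheduler does not migrate it away from its cache contents and local memory, which is the usual explanation for gains of the kind reported above.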


LS-DYNA Performance Results - Productivity

• InfiniBand increases productivity by allowing multiple jobs to run simultaneously

– Providing required productivity for virtual vehicle design

• Three cases are presented

– Single job over the entire system (with CPU affinity)

– Two jobs, each on a single CPU per server (job placement, CPU affinity)

– Four jobs, each on two CPU cores per CPU per server (job placement, CPU affinity)

• Running four parallel jobs increases productivity by 97% for Neon Refined Revised and by 57% for the 3 Vehicle Collision case

• Increased number of parallel processes (jobs) increases the load on the interconnect

– High speed and low latency interconnect solution is required for gaining high productivity

Figure: LS-DYNA - Neon Refined Revised. Jobs per day versus number of nodes, from 4 (32 cores) to 24 (192 cores); series: 1 Job, 2 Parallel Jobs, 4 Parallel Jobs. Higher is better.

Figure: LS-DYNA - 3 Vehicle Collision. Jobs per day versus number of nodes, from 4 (32 cores) to 24 (192 cores); series: 1 Job, 2 Parallel Jobs, 4 Parallel Jobs. Higher is better.
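The "jobs per day" metric is not defined on the slide. A natural definition, assumed here, is the number of seconds in a day divided by the elapsed time of one job, multiplied by the number of jobs running side by side. The timings below are hypothetical and only show how concurrent jobs can raise throughput even when each job runs slower.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def jobs_per_day(elapsed_seconds: float, parallel_jobs: int = 1) -> float:
    """Jobs completed per day when `parallel_jobs` run side by side,
    each taking `elapsed_seconds` of wall-clock time."""
    return parallel_jobs * SECONDS_PER_DAY / elapsed_seconds


# Hypothetical timings (NOT read from the charts): a single job using the
# whole cluster versus four concurrent jobs that each run slower.
single = jobs_per_day(elapsed_seconds=200.0, parallel_jobs=1)
four = jobs_per_day(elapsed_seconds=400.0, parallel_jobs=4)
print(f"1 job: {single:.0f}/day, 4 parallel jobs: {four:.0f}/day, "
      f"gain {four / single - 1:.0%}")
```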


LS-DYNA Profiling – Data Transferred (3 Vehicle Collision)

Figure: LS-DYNA MPI Profiling. Total size (MB, logarithmic axis) per message-size bin, from [0..64B] to [4M..infinity]; series: 4, 8, 12, 16, 20 and 24 nodes.

• Majority of data transfer is done via 256B-4KB message size


LS-DYNA Profiling – Message Distribution

(3 Vehicle Collision)

Figure: LS-DYNA MPI Profiling. Number of messages (logarithmic axis) per message-size bin, from [0..64B] to [4M..infinity]; series: 4, 8, 12, 16, 20 and 24 nodes.

• Majority of the messages are in the range of 2B-4KB

– 2B-256B for synchronization, 256B-4KB for data communications


LS-DYNA Profiling – Message Distribution

(3 Vehicle Collision)

Figure: LS-DYNA MPI Profiling. Percentage of total messages per message-size bin ([0..64], [65..256], [257..1024], [1025..4096], [4097..16384]); series: 4, 8, 12, 16, 20 and 24 nodes.

• As number of nodes scales, percentage of small messages increases

• Percentage of 256B-1KB messages is relatively consistent with cluster size

– Actual number increases with cluster size
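A message-size distribution like the one on this slide can be produced by binning each message size and counting. The sketch below uses the bin boundaries shown on the chart axis; the catch-all `>16384` bin, the function names and the sample trace are assumptions for illustration, not output of the profiler used in the study.

```python
from bisect import bisect_left
from collections import Counter

# Upper bounds (bytes, inclusive) of the bins used on the slide's x-axis.
BIN_UPPER = [64, 256, 1024, 4096, 16384]
BIN_LABELS = ["[0..64]", "[65..256]", "[257..1024]",
              "[1025..4096]", "[4097..16384]", ">16384"]


def bin_label(size_bytes: int) -> str:
    """Map a message size to its bin label."""
    return BIN_LABELS[bisect_left(BIN_UPPER, size_bytes)]


def distribution(sizes: list[int]) -> dict[str, float]:
    """Fraction of messages per bin (empty bins are omitted)."""
    if not sizes:
        return {}
    counts = Counter(bin_label(s) for s in sizes)
    total = len(sizes)
    return {label: counts[label] / total for label in BIN_LABELS if counts[label]}


# Hypothetical trace of message sizes in bytes (not profiler output).
trace = [8, 8, 64, 200, 512, 1024, 3000, 4096, 20000, 8]
for label, share in distribution(trace).items():
    print(f"{label:>14}: {share:.0%}")
```

Comparing such distributions across node counts shows the effect the slide describes: a share can stay constant while the absolute message count grows.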


LS-DYNA Profiling – MPI Collectives

• Two key MPI collective functions in LS-DYNA

– MPI_Allreduce

– MPI_Bcast

• Account for the majority of MPI communication overhead

Figure: MPI Collectives. Percentage of total overhead (ms) versus number of nodes, from 4 (32 cores) to 24 (192 cores); series: MPI_Allreduce, MPI_Bcast.


MPI Collective Benchmarking

Figure: MPI_Bcast. Latency (usec) versus message size (0 to 512); series: HP-MPI, Platform MPI.

Figure: MPI_Allreduce. Latency (usec) versus message size (0 to 512); series: HP-MPI, Platform MPI.

• MPI collective performance comparison

– Two frequently called collective operations in LS-DYNA were benchmarked: MPI_Allreduce and MPI_Bcast

– Platform MPI shows better latency for the MPI_Allreduce operation
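To see why small-message latency matters for MPI_Allreduce, it helps to look at what the operation does: every rank contributes a value and every rank ends up with the combined result. The slides do not describe how HP-MPI or Platform MPI implement it. The sketch below simulates recursive doubling, a common textbook algorithm, in plain Python without an MPI library; the function name and the power-of-two restriction are simplifications made for this example.

```python
def allreduce_sum(values: list[float]) -> tuple[list[float], int]:
    """Simulate a sum-allreduce over len(values) ranks with recursive doubling.

    Returns the per-rank results and the number of exchange rounds.
    This sketch only handles a power-of-two number of ranks.
    """
    ranks = len(values)
    if ranks == 0 or ranks & (ranks - 1):
        raise ValueError("number of ranks must be a power of two")
    data = list(values)
    rounds = 0
    distance = 1
    while distance < ranks:
        # In each round, rank r exchanges its partial sum with rank r XOR distance.
        data = [data[r] + data[r ^ distance] for r in range(ranks)]
        distance *= 2
        rounds += 1
    return data, rounds


result, rounds = allreduce_sum([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
print(result, rounds)
```

With this scheme the number of rounds grows as log2 of the rank count, and each round pays the interconnect latency once, so the per-message latency is multiplied as the cluster grows.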


LS-DYNA with Different MPI Libraries

• LS-DYNA performance comparison

– Each MPI library shows different benefits for latency and collectives

– As such, HP-MPI and Platform MPI show comparable performance

Figure: LS-DYNA - 3 Vehicle Collision. Elapsed time (seconds) versus number of nodes, from 4 (32 cores) to 24 (192 cores); series: Platform MPI, HP-MPI. Lower is better.

Figure: LS-DYNA - Neon Refined Revised. Elapsed time (seconds) versus number of nodes, from 4 (32 cores) to 24 (192 cores); series: Platform MPI, HP-MPI. Lower is better.


LS-DYNA Profiling Summary - Interconnect

• LS-DYNA was profiled to determine networking dependency

• Majority of data transferred between compute nodes

– Done with 256B-4KB message size, data transferred increases with cluster size

• Most used message sizes

– <64B messages – mainly synchronizations

– 64B-4KB – mainly compute related

• Message size distribution

– Percentage of smaller messages (<64B) increases with cluster size, mainly due to the needed synchronization

– Percentage of mid-size messages (64B-4KB) is kept the same with cluster size, while the number of compute transactions increases with cluster size

– Percentage of very large messages decreases with cluster size; these are mainly used for problem data distribution at the simulation initialization phase

• LS-DYNA interconnect sensitivity points

– Interconnect latency and throughput for 64B-4KB message range

– Collective operations performance, mainly MPI_Allreduce


Test Cluster Configuration – System Upgrade

• The following results were achieved after a system upgrade (changes are noted in parentheses)

– Dell PowerEdge SC 1435 24-node cluster

– Quad-Core AMD Opteron™ Model 2382 processors (“Shanghai”) (vs “Barcelona” in previous configuration)

– Mellanox® InfiniBand ConnectX® DDR HCAs

– Mellanox® InfiniBand DDR Switch

– Memory: 16GB memory, DDR2 800MHz per node (vs 667MHz in previous configuration)

– OS: RHEL5U2, OFED 1.3 InfiniBand SW stack

– MPI: HP MPI 2.2.7, Platform MPI 5.6.5

– Application: LS-DYNA MPP971

– Benchmark Workload

• Three-Car Crash Test simulation

• Neon-Refined Revised Crash Test simulation


Quad-Core AMD Opteron™ Processor

• Performance

– Quad-Core: enhanced CPU IPC, 4x 512K L2 cache, 6MB L3 cache

– Direct Connect Architecture: HyperTransport™ technology, up to 24 GB/s peak per processor

– Floating Point: 128-bit FPU per core, 4 FLOPS/clk peak per core

– Integrated Memory Controller: up to 12.8 GB/s, DDR2-800 MHz or DDR2-667 MHz

• Scalability

– 48-bit Physical Addressing

• Compatibility

– Same power/thermal envelopes as 2nd / 3rd generation AMD Opteron™ processor

Figure: Quad-Core AMD Opteron™ Processor block diagram (PCI-E® bridges, I/O hub, USB, PCI, dual-channel registered DDR2, 8 GB/s links).


Performance Improvement

• Upgraded AMD CPU and DDR-2 Memory

• LS-DYNA run time decreased by more than 20%

– Leveraging InfiniBand 20Gb/s for higher scalability

Figure: LS-DYNA - 3 Vehicle Collision. Elapsed time (seconds) versus number of nodes, from 4 (32 cores) to 24 (192 cores); series: Barcelona, Shanghai. Lower is better.

Figure: LS-DYNA - Neon Refined Revised. Elapsed time (seconds) versus number of nodes, from 4 (32 cores) to 24 (192 cores); series: Barcelona, Shanghai. Lower is better.


Maximize LS-DYNA Productivity

• Scalable latency of InfiniBand and latest Shanghai processor deliver scalable LS-DYNA performance

Figure: LS-DYNA - 3 Vehicle Collision. Jobs per day versus number of nodes, from 4 (32 cores) to 24 (192 cores); series: 1 Job, 2 Parallel Jobs, 4 Parallel Jobs, 8 Parallel Jobs. Higher is better.

Figure: LS-DYNA - Neon Refined Revised. Jobs per day versus number of nodes, from 4 (32 cores) to 24 (192 cores); series: 1 Job, 2 Parallel Jobs, 4 Parallel Jobs, 8 Parallel Jobs. Higher is better.


LS-DYNA with Shanghai Processors

• “Shanghai” processors provide higher performance compared to “Barcelona”

Figure: LS-DYNA - 3 Vehicle Collision (Shanghai vs Barcelona). Percentage of more jobs per day versus number of nodes, from 4 (32 cores) to 24 (192 cores); series: 1 Job, 2 Parallel Jobs, 4 Parallel Jobs.


LS-DYNA Performance Results - Interconnect

• InfiniBand 20Gb/s vs 10GigE vs GigE

• InfiniBand 20Gb/s (DDR) outperforms 10GigE and GigE in all test cases

– Reducing run time by up to 60% versus 10GigE and 61% versus GigE

• Performance loss shown beyond 16 nodes with 10GigE and GigE

• InfiniBand 20Gb/s maintains scalability with cluster size

Figure: LS-DYNA - Neon Refined Revised (HP-MPI). Elapsed time (seconds) versus number of nodes, from 4 (32 cores) to 24 (192 cores); series: GigE, 10GigE, InfiniBand. Lower is better.

Figure: LS-DYNA - 3 Vehicle Collision (HP-MPI). Elapsed time (seconds) versus number of nodes, from 4 (32 cores) to 24 (192 cores); series: GigE, 10GigE, InfiniBand. Lower is better.
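A run-time reduction and a speedup factor describe the same measurement in different ways, and the two are easy to confuse. The sketch below converts between them, assuming the reduction is defined as one minus the ratio of elapsed times (the slide does not spell this out). The elapsed times are hypothetical.

```python
def runtime_reduction(t_slow: float, t_fast: float) -> float:
    """Fraction of run time saved by the faster configuration."""
    return 1.0 - t_fast / t_slow


def reduction_to_speedup(reduction: float) -> float:
    """Convert a run-time reduction (0..1) into a speedup factor."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


# The 60% figure is from the slide; the elapsed times are hypothetical.
print(f"{runtime_reduction(500.0, 200.0):.0%} less run time")
print(f"{reduction_to_speedup(0.60):.2f}x speedup")
```

Under this definition, a 60% reduction in run time means the faster configuration completes the job in 40% of the time.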


Power Consumption Comparison

• InfiniBand also enables power efficient simulations

– Reducing power/job by up to 62%

Figure: Power Consumption (InfiniBand vs 10GigE vs GigE), 24-node comparison. Wh per job for 3 Vehicle Collision and Neon Refined Revised; series: GigE, 10GigE, InfiniBand. Chart annotations: 62% and 50%.
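The slide reports energy in Wh per job but does not describe how it was measured. A simple model, assumed here, is average cluster power multiplied by the job's run time. Under that model a faster interconnect saves energy mainly by shortening the run. All numbers in the sketch are hypothetical.

```python
def energy_per_job_wh(avg_power_watts: float, elapsed_seconds: float) -> float:
    """Energy for one job in watt-hours: average power x run time."""
    return avg_power_watts * elapsed_seconds / 3600.0


def saving(baseline_wh: float, improved_wh: float) -> float:
    """Fractional reduction in energy per job relative to the baseline."""
    return 1.0 - improved_wh / baseline_wh


# Hypothetical cluster power draw and run times (not measured values).
gige = energy_per_job_wh(avg_power_watts=7200.0, elapsed_seconds=2000.0)
ib = energy_per_job_wh(avg_power_watts=7200.0, elapsed_seconds=800.0)
print(f"GigE: {gige:.0f} Wh, InfiniBand: {ib:.0f} Wh, saving {saving(gige, ib):.0%}")
```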


Conclusions

• LS-DYNA is widely used to simulate many real-world problems

– Automotive crash-testing and finite-element simulations

– Developed by Livermore Software Technology Corporation (LSTC)

• LS-DYNA performance and productivity rely on

– Scalable HPC systems and interconnect solutions

– Low latency and high throughput interconnect technology

– NUMA aware application for fast access to local memory

– Reasonable job distribution can dramatically improve productivity, increasing the number of jobs per day while maintaining fast run time

• Interconnect comparison shows

– InfiniBand delivers superior performance and productivity in every cluster size

– Scalability requires low latency and “zero” scalable latency

– Lowest power consumption was achieved with InfiniBand, saving in system power, cooling and real-estate


Thank You

HPC Advisory [email protected]

All trademarks are property of their respective owners. All information is provided “As-Is” without any kind of warranty. The HPC Advisory Council makes no representation to the accuracy and completeness of the information contained herein. HPC Advisory Council Mellanox undertakes no duty and assumes no obligation to update or correct any information presented herein