LS-DYNA Performance Benchmarks and Profiling
January 2009
Source: hpcadvisorycouncil.com/pdf/LS-DYNA_analysis.pdf

  • LS-DYNA Performance Benchmarks and Profiling

    January 2009

  • 2

    Note

    The following research was performed under the HPC Advisory Council activities

    Participating members: AMD, Dell, Mellanox

    Tests were run at the HPC Advisory Council Cluster Center

    The participating members would like to thank LSTC for their support and guidance

    The participating members would like to thank Sharan Kalwani, HPC Automotive specialist, for his support and guidance

    For more information please refer to www.mellanox.com, www.dell.com/hpc, www.amd.com

  • 3

    LS-DYNA

    LS-DYNA is a general-purpose structural and fluid analysis simulation software package capable of simulating complex real-world problems

    Developed by the Livermore Software Technology Corporation (LSTC)

    LS-DYNA is used in the automobile, aerospace, construction, military, manufacturing, and bioengineering industries

  • 4

    LS-DYNA

    LS-DYNA SMP (Shared Memory Processing) exploits multiple CPUs within a single machine

    LS-DYNA MPP (Massively Parallel Processing) allows the LS-DYNA solver to run across a high-performance computing cluster, using message passing (MPI) to obtain parallelism (a minimal sketch of the pattern follows)

    Many companies are switching from SMP to MPP for cost-effective scaling and performance
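    As a hedged illustration of the message-passing model that MPP LS-DYNA relies on (a minimal sketch of domain decomposition plus a global reduction, not LSTC's actual solver code; all names here are hypothetical):

```c
/* Minimal sketch of the MPP pattern: each MPI rank owns a slice of a
 * hypothetical element domain, does local work, then joins a global
 * reduction. Build with: mpicc -o mpp_sketch mpp_sketch.c */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Split 1,000,000 hypothetical elements across ranks. */
    const int total_elems = 1000000;
    int my_elems = total_elems / nprocs
                 + (rank < total_elems % nprocs ? 1 : 0);

    /* Placeholder local computation, then the MPI_Allreduce pattern
     * that the profiling later in this deck shows dominating the
     * communication overhead. */
    double local_sum = my_elems * 1.0e-6;
    double global_sum = 0.0;
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);

    if (rank == 0)
        printf("%d ranks, global sum = %f\n", nprocs, global_sum);
    MPI_Finalize();
    return 0;
}
```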

  • 5

    Objectives

    The presented research was undertaken to provide best practices for:

    LS-DYNA performance benchmarking

    Interconnect performance comparisons

    Ways to increase LS-DYNA productivity

    Understanding the LS-DYNA communication pattern

    MPI library comparisons

    Power-aware considerations

  • 6

    Test Cluster Configuration

    Dell PowerEdge SC 1435 24-node cluster

    Quad-Core AMD Opteron Model 2358 processors (Barcelona)

    Mellanox InfiniBand ConnectX DDR HCAs

    Mellanox InfiniBand DDR Switch

    Memory: 16GB DDR2 667MHz per node

    OS: RHEL5U2, OFED 1.3 InfiniBand SW stack

    MPI: HP MPI 2.2.7, Platform MPI 5.6.5

    Application: LS-DYNA MPP971

    Benchmark Workload

    Three Vehicle Collision Test simulation

    Neon-Refined Revised Crash Test simulation

  • 7

    Mellanox InfiniBand Solutions

    Industry standard: hardware, software, cabling, and management designed for clustering and storage interconnect

    Performance: 40Gb/s node-to-node, 120Gb/s switch-to-switch, 1us application latency, and the most aggressive roadmap in the industry

    Reliable, with congestion management

    Efficient: RDMA and transport offload, kernel bypass, so the CPU focuses on application processing

    Scalable for Petascale computing and beyond: end-to-end quality of service, virtualization acceleration, I/O consolidation including storage

    [Chart: InfiniBand delivers the lowest latency, and the InfiniBand performance gap is increasing; bandwidth roadmap shows InfiniBand scaling from 20Gb/s and 40Gb/s per link to 80Gb/s (4X), 120Gb/s, and 240Gb/s (12X), versus Ethernet and Fibre Channel]

  • 8

    Performance (Quad-Core AMD Opteron)

    Enhanced CPU IPC; 4x 512KB L2 cache; 2MB L3 cache

    Direct Connect Architecture: HyperTransport technology, up to 24 GB/s

    Floating point: 128-bit FPU per core, 4 FLOPS/clk peak per core (a worked peak estimate follows the diagram)

    Memory: 1GB page support, DDR2 667 MHz

    Scalability: 48-bit physical addressing

    Compatibility: same power/thermal envelopes as the Second-Generation AMD Opteron processor

    [Block diagram: Quad-Core AMD Opteron processor with Direct Connect Architecture; 8 GB/s HyperTransport links to PCI-E bridges and I/O hub (USB, PCI); dual-channel registered DDR2 memory]
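    The 4 FLOPS/clk figure implies a simple back-of-envelope peak per node (a sketch only; the 2.4 GHz clock below is an assumption for illustration, not stated on the slide):

$$\text{peak}_{\text{node}} = 2~\text{sockets} \times 4~\frac{\text{cores}}{\text{socket}} \times 4~\frac{\text{FLOPS}}{\text{clk}} \times 2.4~\text{GHz} \approx 76.8~\text{GFLOPS}$$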

  • 9

    Dell PowerEdge Servers helping Simplify IT

    System Structure and Sizing Guidelines: 24-node cluster built with Dell PowerEdge SC 1435 Servers

    Servers optimized for High Performance Computing environments

    Building Block Foundations for best price/performance and performance/watt

    Dell HPC Solutions Scalable Architectures for High Performance and Productivity

    Dell's comprehensive HPC services help manage the lifecycle requirements.

    Integrated, Tested and Validated Architectures

    Workload Modeling Optimized System Size, Configuration and Workloads

    Test-bed Benchmarks

    ISV Applications Characterization

    Best Practices & Usage Analysis

  • 10

    LS-DYNA Performance Results - Interconnect

    The InfiniBand high-speed interconnect enables the highest scalability, with performance gains as cluster size grows

    Performance over GigE does not scale; a slowdown occurs as the node count increases beyond 16 nodes

    [Chart: LS-DYNA - 3 Vehicle Collision; elapsed time (seconds) vs number of nodes, 4 (32 cores) through 24 (192 cores); InfiniBand vs GigE]

    [Chart: LS-DYNA - Neon Refined Revised; elapsed time (seconds) vs number of nodes, 4 (32 cores) through 24 (192 cores); InfiniBand vs GigE]

    Lower is better

  • 11

    LS-DYNA Performance Results - Interconnect

    InfiniBand outperforms GigE by up to 132%; a bigger advantage is expected as the node count increases (the ratio used is sketched after the charts)

    [Chart: LS-DYNA - 3 Vehicle Collision (InfiniBand vs GigE); performance advantage (%) vs number of nodes, 4 (32 cores) through 24 (192 cores)]

    [Chart: LS-DYNA - Neon Refined Revised (InfiniBand vs GigE); performance advantage (%) vs number of nodes, 4 (32 cores) through 24 (192 cores)]
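    The plotted advantage is presumably the usual elapsed-time ratio (an assumption about how the charts were computed, consistent with the lower-is-better timings):

$$\text{advantage} = \left(\frac{t_{\text{GigE}}}{t_{\text{IB}}} - 1\right) \times 100\%$$

    For example, a hypothetical 7000-second GigE run against a 3017-second InfiniBand run would plot as roughly 132%.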

  • 12

    LS-DYNA Performance Results - CPU Affinity

    CPU affinity accelerates performance by up to 10%, saving up to 177 seconds per simulation (a pinning sketch follows the chart)

    [Chart: LS-DYNA - 3 Vehicle Collision (CPU Affinity vs Non-Affinity); elapsed time (seconds) vs number of nodes, 4 (32 cores) through 24 (192 cores); CPU Affinity vs Without Affinity]

    Lower is better
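    The slides do not state how affinity was applied; as a minimal sketch (assuming Linux, and not necessarily the mechanism used in the study, since MPI launchers usually provide their own binding options), a process can pin itself to a core with sched_setaffinity:

```c
/* Illustrative Linux core pinning; a real MPI job would typically use
 * the launcher's binding options instead. Pins the process to core 0. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long ncores = sysconf(_SC_NPROCESSORS_ONLN);
    int target = 0;                      /* hypothetical target core */

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(target, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");     /* pinning failed */
        return 1;
    }
    printf("pinned to core %d of %ld online cores\n", target, ncores);
    return 0;
}
```

    Pinning keeps each rank's working set in the caches and local NUMA memory of one core, which is where the up-to-10% gain would come from.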

  • 13

    LS-DYNA Performance Results - Productivity

    InfiniBand increases productivity by allowing multiple jobs to run simultaneously, providing the productivity required for virtual vehicle design

    Three cases are presented: a single job over the entire system (with CPU affinity)

    Two jobs, each on a single CPU per server (job placement, CPU affinity)

    Four jobs, each on two CPU cores per CPU per server (job placement, CPU affinity)

    Running four parallel jobs increases productivity (jobs per day) by 97% for Neon Refined Revised and 57% for the 3 Vehicle Collision case

    An increased number of parallel jobs increases the load on the interconnect, so a high-speed, low-latency interconnect is required for high productivity (the jobs-per-day arithmetic is sketched after the charts)

    [Chart: LS-DYNA - Neon Refined Revised; jobs per day vs number of nodes, 4 (32 cores) through 24 (192 cores); 1 job vs 2 parallel jobs vs 4 parallel jobs]

    [Chart: LS-DYNA - 3 Vehicle Collision; jobs per day vs number of nodes, 4 (32 cores) through 24 (192 cores); 1 job vs 2 parallel jobs vs 4 parallel jobs]

    Higher is better
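    Jobs per day presumably follows from elapsed time and the number of concurrent jobs (my reading of the metric; the numbers below are hypothetical):

$$\text{jobs/day} = n_{\text{parallel}} \times \frac{86400~\text{s}}{t_{\text{elapsed}}}$$

    For example, four concurrent jobs that each take a hypothetical 500 s yield $4 \times 86400 / 500 \approx 691$ jobs per day, versus $86400 / 300 = 288$ for a single hypothetical 300 s job using the whole cluster.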

  • 14

    LS-DYNA Profiling - Data Transferred (3 Vehicle Collision)

    [Chart: total size (MB, log scale) per message-size bucket, [0..64B] through [4M..infinity], for 4, 8, 12, 16, 20, and 24 nodes]

    The majority of data transfer is done with 256B-4KB message sizes

  • 15

    LS-DYNA Profiling - Message Distribution (3 Vehicle Collision)

    [Chart: number of messages (log scale) per message-size bucket, [0..64B] through [4M..infinity], for 4, 8, 12, 16, 20, and 24 nodes]

    The majority of messages are in the range of 2B-4KB: 2B-256B for synchronization, 256B-4KB for data communications

  • 16

    LS-DYNA Profiling - Message Distribution (3 Vehicle Collision)

    [Chart: % of total messages per message-size bucket, [0..64], [65..256], [257..1024], [1025..4096], and [4097..16384] bytes, for 4, 8, 12, 16, 20, and 24 nodes]

    As the number of nodes scales, the percentage of small messages increases

    The percentage of 256B-1KB messages is relatively consistent across cluster sizes, although their absolute number increases with cluster size (a profiling sketch follows)
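    The slides do not name the profiler that produced these histograms; as a hedged sketch of the standard technique, the PMPI interposition layer can bucket point-to-point message sizes like this (MPI-3 prototypes; link this object before the MPI library):

```c
/* Hedged sketch of message-size profiling via the standard PMPI layer.
 * Every MPI_Send is counted into a size bucket roughly matching the
 * charts above; this is not the actual tool used in the study. */
#include <mpi.h>
#include <stdio.h>

#define NBUCKETS 10
static long long msg_count[NBUCKETS];
static long long msg_bytes[NBUCKETS];

/* Buckets: [0..64B], (64..256B], (256B..1KB], ... (4MB..infinity) */
static int bucket(long long bytes)
{
    int b = 0;
    long long limit = 64;
    while (b < NBUCKETS - 1 && bytes > limit) { b++; limit *= 4; }
    return b;
}

int MPI_Send(const void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm)
{
    int size;
    MPI_Type_size(type, &size);
    int b = bucket((long long)count * size);
    msg_count[b]++;
    msg_bytes[b] += (long long)count * size;
    return PMPI_Send(buf, count, type, dest, tag, comm);
}

int MPI_Finalize(void)
{
    /* Dump per-rank histograms before shutting MPI down. */
    for (int b = 0; b < NBUCKETS; b++)
        if (msg_count[b])
            printf("bucket %d: %lld messages, %lld bytes\n",
                   b, msg_count[b], msg_bytes[b]);
    return PMPI_Finalize();
}
```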

  • 17

    LS-DYNA Profiling MPI Collectives

    Two key MPI collective functions in LS-DYNA, MPI_Allreduce and MPI_Bcast, account for the majority of MPI communication overhead

    [Chart: MPI collectives as % of total overhead (ms) vs number of nodes, 4 (32 cores) through 24 (192 cores); MPI_Allreduce vs MPI_Bcast]

  • 18

    MPI Collective Benchmarking

    [Chart: MPI_Bcast latency (usec) vs message size, 0 to 512 bytes; HP-MPI vs Platform MPI]

    [Chart: MPI_Allreduce latency (usec) vs message size, 0 to 512 bytes; HP-MPI vs Platform MPI]

    MPI collective performance comparison: the two collective operations most frequently called in LS-DYNA, MPI_Allreduce and MPI_Bcast, were benchmarked

    Platform MPI shows better latency for the Allreduce operation (a micro-benchmark sketch follows)
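    The benchmark tool behind these charts is not named; a minimal micro-benchmark in the same spirit (timing average MPI_Allreduce latency per message size; MPI_Bcast can be timed the same way) might look like:

```c
/* Minimal collective-latency micro-benchmark sketch; not the tool used
 * to produce the charts above. Reports average MPI_Allreduce latency. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    enum { ITERS = 1000, MAXBYTES = 512 };
    static double sendbuf[MAXBYTES / 8], recvbuf[MAXBYTES / 8];

    for (int bytes = 8; bytes <= MAXBYTES; bytes *= 2) {
        int n = bytes / 8;               /* number of doubles */
        MPI_Barrier(MPI_COMM_WORLD);     /* align ranks before timing */
        double t0 = MPI_Wtime();
        for (int i = 0; i < ITERS; i++)
            MPI_Allreduce(sendbuf, recvbuf, n, MPI_DOUBLE, MPI_SUM,
                          MPI_COMM_WORLD);
        double usec = (MPI_Wtime() - t0) / ITERS * 1.0e6;
        if (rank == 0)
            printf("MPI_Allreduce %4d bytes: %.2f usec\n", bytes, usec);
    }
    MPI_Finalize();
    return 0;
}
```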

  • 19

    LS-DYNA with Different MPI Libraries

    LS-DYNA performance comparison: each MPI library shows different benefits for latency and collectives; overall, HP-MPI and Platform MPI show comparable performance

    [Chart: LS-DYNA - 3 Vehicle Collision; elapsed time (seconds) vs number of nodes, 4 (32 cores) through 24 (192 cores); Platform MPI vs HP-MPI]

    [Chart: LS-DYNA - Neon Refined Revised; elapsed time (seconds) vs number of nodes, 4 (32 cores) through 24 (192 cores); Platform MPI vs HP-MPI]

    Lower is better

  • 20

    LS-DYNA Profiling Summary - Interconnect

    LS-DYNA was profiled to determine its networking dependency

    The majority of data transferred between compute nodes uses 256B-4KB message sizes, and the amount of data transferred increases with cluster size

    The most heavily used message sizes are small (2B-4KB)

  • 21

    Test Cluster Configuration System Upgrade

    The following results were achieved after a system upgrade (changes from the previous configuration are noted inline)

    Dell PowerEdge SC 1435 24-node cluster

    Quad-Core AMD Opteron Model 2382 processors (Shanghai) (vs Barcelona in the previous configuration)

    Mellanox InfiniBand ConnectX DDR HCAs

    Mellanox InfiniBand DDR Switch

    Memory: 16GB DDR2 800MHz per node (vs 667MHz in the previous configuration)

    OS: RHEL5U2, OFED 1.3 InfiniBand SW stack

    MPI: HP MPI 2.2.7, Platform MPI 5.6.5

    Application: LS-DYNA MPP971

    Benchmark Workload

    Three Vehicle Collision Test simulation

    Neon-Refined Revised Crash Test simulation

  • 22

    Performance (Quad-Core AMD Opteron)

    Enhanced CPU IPC; 4x 512KB L2 cache; 6MB L3 cache

    Direct Connect Architecture: HyperTransport technology, up to 24 GB/s peak per processor

    Floating point: 128-bit FPU per core, 4 FLOPS/clk peak per core

    Integrated memory controller: up to 12.8 GB/s, DDR2-800 MHz or DDR2-667 MHz

    Scalability: 48-bit physical addressing

    Compatibility: same power/thermal envelopes as 2nd / 3rd generation AMD Opteron processors

    [Block diagram: Quad-Core AMD Opteron processor with Direct Connect Architecture; 8 GB/s HyperTransport links to PCI-E bridges and I/O hub (USB, PCI); dual-channel registered DDR2 memory]

  • 23

    Performance Improvement

    With the upgraded AMD CPUs and DDR2 memory, LS-DYNA run time decreased by more than 20%

    InfiniBand 20Gb/s is leveraged for higher scalability

    [Chart: LS-DYNA - 3 Vehicle Collision; elapsed time (seconds) vs number of nodes, 4 (32 cores) through 24 (192 cores); Barcelona vs Shanghai]

    [Chart: LS-DYNA - Neon Refined Revised; elapsed time (seconds) vs number of nodes, 4 (32 cores) through 24 (192 cores); Barcelona vs Shanghai]

    Lower is better

  • 24

    Maximize LS-DYNA Productivity

    The scalable latency of InfiniBand and the latest Shanghai processors deliver scalable LS-DYNA performance

    [Chart: LS-DYNA - 3 Vehicle Collision; jobs per day vs number of nodes, 4 (32 cores) through 24 (192 cores); 1, 2, 4, and 8 parallel jobs]

    [Chart: LS-DYNA - Neon Refined Revised; jobs per day vs number of nodes, 4 (32 cores) through 24 (192 cores); 1, 2, 4, and 8 parallel jobs]

    Higher is better

  • 25

    LS-DYNA with Shanghai Processors

    Shanghai processors provide higher performance than Barcelona

    [Chart: LS-DYNA - 3 Vehicle Collision (Shanghai vs Barcelona); % more jobs per day vs number of nodes, 4 (32 cores) through 24 (192 cores); 1, 2, and 4 parallel jobs]

  • 26

    LS-DYNA Performance Results - Interconnect

    InfiniBand 20Gb/s vs 10GigE vs GigE: InfiniBand 20Gb/s (DDR) outperforms 10GigE and GigE in all test cases

    Run time is reduced by up to 60% versus 10GigE and 61% versus GigE

    Performance loss is seen beyond 16 nodes with 10GigE and GigE, while InfiniBand 20Gb/s maintains scalability with cluster size

    [Chart: LS-DYNA - Neon Refined Revised (HP-MPI); elapsed time (seconds) vs number of nodes, 4 (32 cores) through 24 (192 cores); GigE vs 10GigE vs InfiniBand]

    [Chart: LS-DYNA - 3 Vehicle Collision (HP-MPI); elapsed time (seconds) vs number of nodes, 4 (32 cores) through 24 (192 cores); GigE vs 10GigE vs InfiniBand]

    Lower is better

  • 27

    Power Consumption Comparison (InfiniBand vs 10GigE vs GigE)

    [Chart: Wh per job for 3 Vehicle Collision and Neon Refined Revised; GigE vs 10GigE vs InfiniBand; annotated savings of 62% and 50%]

    InfiniBand also enables power-efficient simulations, reducing power per job by up to 62% (the arithmetic is sketched below)

    24-node comparison
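    Wh per job presumably follows from cluster power draw and elapsed time (my reading of the metric; the wattage below is hypothetical, not a measured figure from the study):

$$\text{Wh/job} = P_{\text{cluster}}~[\text{W}] \times \frac{t_{\text{elapsed}}~[\text{s}]}{3600}$$

    For example, a hypothetical 8 kW draw for the 24-node cluster over a 1800 s job gives $8000 \times 1800 / 3600 = 4000$ Wh per job, so any interconnect that shortens the run cuts the energy per job proportionally.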

  • 28

    Conclusions

    LS-DYNA is widely used to simulate many real-world problems, including automotive crash testing and other finite-element simulations; it is developed by the Livermore Software Technology Corporation (LSTC)

    LS-DYNA performance and productivity rely on scalable HPC systems and interconnect solutions: low-latency, high-throughput interconnect technology and NUMA-aware placement for fast access to local memory

    Sensible job distribution can dramatically improve productivity, increasing the number of jobs per day while maintaining fast run times

    The interconnect comparison shows that InfiniBand delivers superior performance and productivity at every cluster size; scalability requires latency that is both low and stays low as the system scales

    The lowest power consumption was achieved with InfiniBand, saving system power, cooling, and real estate

  • 29

    Thank You

    HPC Advisory Council, [email protected]

    All trademarks are property of their respective owners. All information is provided as-is, without warranty of any kind. The HPC Advisory Council makes no representation as to the accuracy or completeness of the information contained herein. The HPC Advisory Council and Mellanox undertake no duty and assume no obligation to update or correct any information presented herein.
