
  • GPU Enhancements for Noise, Vibration and Harshness (NVH) Analysis

    Dr. Ted Wertheimer

  • 20 Million DOF – 3.9 M Elements

  • 20 Million DOF

    • This model extracted many modes:
      • up to 1500 Hz structure -> ~26500 modes
      • up to 1500 Hz fluid -> ~3200 modes
    • Large frequency range: 0 to 1024 Hz in 2048 frequency steps

    # Nodes   DMP   SMP   Elapsed Time
    4         16    4     4:58:09

  • 94 Million DOF

  • ACMS

    • Automated Component Modal Synthesis (ACMS) – see the reduction sketch below
    • MSC Nastran model is automatically divided into N domains
    • Executes in parallel using Distributed Memory Parallel (DMP)
      – Shared Memory Parallel (SMP) provides additional speedup
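
    The following is a minimal, generic fixed-interface component modal synthesis (Craig-Bampton) sketch in NumPy/SciPy, shown only to illustrate the kind of per-domain reduction that ACMS automates; the function name, index sets and mode count are illustrative, and this is not MSC's ACMS algorithm.

    import numpy as np
    from scipy.linalg import eigh

    def craig_bampton(K, M, interior, boundary, n_modes):
        """Reduce one component to its boundary DOFs plus a few fixed-interface modes."""
        Kii = K[np.ix_(interior, interior)]
        Kib = K[np.ix_(interior, boundary)]
        Mii = M[np.ix_(interior, interior)]

        # Static (constraint) modes: interior response to unit boundary motion.
        Psi = -np.linalg.solve(Kii, Kib)

        # Fixed-interface dynamic modes of the interior partition (keep the lowest n_modes).
        _, Phi = eigh(Kii, Mii)
        Phi = Phi[:, :n_modes]

        # Transformation from (interior, boundary) DOFs to (modal, boundary) coordinates.
        ni, nb = len(interior), len(boundary)
        T = np.zeros((ni + nb, n_modes + nb))
        T[:ni, :n_modes] = Phi
        T[:ni, n_modes:] = Psi
        T[ni:, n_modes:] = np.eye(nb)

        # Reduced component matrices, ready to be synthesized with the other domains.
        idx = np.concatenate([np.asarray(interior), np.asarray(boundary)])
        Kc = K[np.ix_(idx, idx)]
        Mc = M[np.ix_(idx, idx)]
        return T.T @ Kc @ T, T.T @ Mc @ T

    Each of the N domains is reduced this way, and the much smaller reduced matrices are then synthesized to recover the system modes.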

  • ACMS Domain Decomposition

    [Figure: ACMS domain tree of the decomposed model distributed across the Master process and Slaves 1–3; example with DMP=4]

  • MSC Nastran ACMS – Automotive Models

    • Multi-CPU, multi-core parallel scalability
    • 2X performance increase from 2010

    [Chart: ACMS elapsed time for Case 1 through Case 4, serial and 12 CPUs, comparing MSC Nastran 2010, 2011.1, 2011.2 and 2012]

  • Nonsymmetric Solver Performance

    • Up to 3X faster for exterior acoustics
      – Exterior acoustics
      – Brake squeal
      – Friction
      – Rotordynamics

    [Chart: Case 3 (exterior acoustics) – frequency response and total job times for MSC Nastran 2011.1, 2011.2 and 2012]

  • Improved Performance for Acoustics

    • Efficient Participation Factor
    • 3 times faster

    [Chart: MSC Nastran 2012 vs. MSC Nastran 2010]

  • MSC Nastran 2013

    • Nastran direct equation solver is GPU accelerated
      – Sparse direct factorization (MSCLDL, MSCLU)
        • Real, complex, symmetric, unsymmetric
      – Handles very large fronts with minimal use of pinned host memory
        • Lowest-granularity GPU implementation of a sparse direct solver; solves unlimited sparse matrix sizes
      – Impacts several solution sequences:
        • High impact (SOL101, SOL108), mid (SOL103), low (SOL111, SOL400)

  • MSC Nastran 2013

    • Support of multi-GPU, on Linux and Windows
      – With DMP > 1, multiple fronts are factorized concurrently on multiple GPUs; 1 GPU per matrix domain (rank-to-GPU mapping sketched below)
      – NVIDIA GPUs: Tesla K20/K20X, Tesla M2090, Tesla C2075, Quadro 6000
      – CUDA 5.0
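
    The snippet below is only a conceptual illustration of the "1 GPU per matrix domain" scheme: each distributed task picks a device round-robin by its rank. The function name and the environment-variable approach are illustrative; MSC Nastran performs this assignment internally.

    import os

    def pin_gpu_for_task(task_rank, gpus_per_node=2):
        """Restrict a DMP task to a single GPU chosen round-robin by rank."""
        device = task_rank % gpus_per_node
        os.environ["CUDA_VISIBLE_DEVICES"] = str(device)
        return device

    # e.g. DMP=4 on a node with 2 GPUs: tasks 0, 2 -> GPU 0 and tasks 1, 3 -> GPU 1
    for rank in range(4):
        print(f"task {rank} -> GPU {pin_gpu_for_task(rank)}")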

  • Direct Sparse Solver Workflow in MSC Nastran (MSCLDL, MSCLU)

    In a proper order, do the following at each node of the elimination tree:
    • Assembly (from global stiffness and contribution blocks)
    • Pivoting
    • Block factorization

    Most time-consuming matrix update operations are run on the GPU:
    • Off-diagonal update
    • Diagonal decomposition
    • Schur complement (trailing matrix update)

    [Figure: elimination tree, nodes 1–11, processed from the leaves toward the root]

  • Block LU Decomposition

    • Direct solves are (typically) performed using block LU decomposition
    • Most of the time is spent computing the Schur complement
      – Compute bound / low-hanging fruit

    [ A11  A12 ]   [ L11  0 ]   [ I  0               ]   [ U11  U12 ]
    [ A21  A22 ] = [ L21  I ] * [ 0  A22 - L21*U12   ] * [ 0    I   ]

    with
      L11 U11 = A11        (DPOTRF)
      L11 U12 = A12        (DTRSM)
      L21 U11 = A21        (DTRSM)
      A22 - L21 U12        (DGEMM, the Schur complement)

    (A NumPy sketch of one block-elimination step follows.)
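
    A minimal NumPy sketch of one block-elimination step, assuming a general LU with pivoting only inside the diagonal block; for symmetric positive definite fronts the A11 factorization is a Cholesky (DPOTRF) and the update uses DSYRK/DGEMM. Block size and the test matrix are illustrative.

    import numpy as np
    from scipy.linalg import lu, solve_triangular

    def block_lu_step(A, nb):
        """One block-elimination step on the leading nb x nb block of A."""
        A11, A12 = A[:nb, :nb], A[:nb, nb:]
        A21, A22 = A[nb:, :nb], A[nb:, nb:]

        # Factor the diagonal block: A11 = P L11 U11  (DGETRF / DPOTRF).
        P, L11, U11 = lu(A11)

        # Panel solves (DTRSM): L11 U12 = P^T A12 and L21 U11 = A21.
        U12 = solve_triangular(L11, P.T @ A12, lower=True, unit_diagonal=True)
        L21 = solve_triangular(U11.T, A21.T, lower=True).T

        # Schur complement / trailing-matrix update (DGEMM): the compute-bound
        # step that the solver offloads to the GPU.
        S = A22 - L21 @ U12
        return L11, U11, L21, U12, S

    # Toy usage; in the real solver the same step is applied recursively to S.
    A = np.random.rand(6, 6) + 6.0 * np.eye(6)
    L11, U11, L21, U12, S = block_lu_step(A, nb=3)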

  • PCIe Limit on the Schur Complement Calculation (DGEMM)

    • PCIe limits GPU performance
    • Host is faster for small fronts
    • Requires update rank (nRank) > 700 for full performance on K20
    • M2090 and K20 perform the same until nRank > 300

    (A back-of-envelope transfer-vs-compute model follows.)
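
    The sketch below is a rough model of why PCIe caps the benefit for small update ranks: it compares an assumed host DGEMM rate against an assumed GPU rate plus PCIe transfer time. All rates are assumptions for illustration (Sandy Bridge-class host, K20-class GPU, PCIe gen2), not measured figures.

    # C -= A @ B with A: n x k, B: k x n, C: n x n (double precision)
    HOST_GFLOPS = 150.0    # assumed multi-core host DGEMM rate
    GPU_GFLOPS = 1000.0    # assumed K20-class DGEMM rate
    PCIE_GBPS = 6.0        # assumed effective PCIe transfer rate

    def gemm_times(n, k):
        flops = 2.0 * n * n * k
        bytes_moved = 8.0 * (2 * n * k + 2 * n * n)   # A and B down, C down and up
        t_host = flops / (HOST_GFLOPS * 1e9)
        t_gpu = flops / (GPU_GFLOPS * 1e9) + bytes_moved / (PCIE_GBPS * 1e9)
        return t_host, t_gpu

    # The GPU only pays off once the update rank k is large enough to amortize
    # the PCIe transfers, which is the qualitative ">700 rank on K20" behaviour.
    for k in (100, 300, 700, 1500):
        t_host, t_gpu = gemm_times(4000, k)
        print(f"k={k:5d}  host {t_host * 1e3:7.1f} ms   gpu+PCIe {t_gpu * 1e3:7.1f} ms")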

  • MSC Nastran 2013: SMP + GPU Acceleration of SOL101 and SOL103

    [Chart, higher is better – speedup over serial:
      SOL101, 2.4M rows, 42K front: serial 1X, 4c 2.7X, 4c+1g 6X
      SOL103, 2.6M rows, 18K front: serial 1X, 4c 1.9X, 4c+1g 2.8X]

    Server node: Sandy Bridge E5-2670 (2.6 GHz), Tesla K20X GPU, 128 GB memory

    Lanczos solver (SOL 103) steps (a shift-invert sketch follows):
    • Sparse matrix factorization
    • Iterate on a block of vectors (solve)
    • Orthogonalization of vectors
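
    The snippet below sketches the shift-invert Lanczos pattern behind SOL 103: factor the shifted operator once with a sparse direct solver (the step the GPU accelerates), then each iteration is a sparse solve plus reorthogonalization. The matrices are toy stand-ins, not a Nastran model; scipy's eigsh handles the Lanczos bookkeeping.

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import eigsh

    n = 2000
    K = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")  # toy stiffness
    M = sp.identity(n, format="csc")                                         # toy (lumped) mass

    # Lowest 20 modes near sigma = 0: internally (K - sigma*M) is factorized once
    # and Lanczos iterates on the solve results with reorthogonalization.
    vals, vecs = eigsh(K, k=20, M=M, sigma=0.0, which="LM")
    freqs_hz = np.sqrt(np.abs(vals)) / (2.0 * np.pi)
    print(freqs_hz[:5])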

  • NVH with MSC Nastran 2013: Coupled Structural-Acoustics Simulation with SOL108

    Europe Auto OEM model: 710K nodes, 3.83M elements, 100 frequency increments (FREQ1), direct sparse solver (the per-frequency solve is sketched below)

    [Chart, lower is better – elapsed time in minutes: serial 1X, 1c+1g 4.8X, 4c (SMP) 2.7X, 4c+1g 5.2X, 8c (DMP=2) 5.5X, 8c+2g (DMP=2) 11.1X]

    Server node: Sandy Bridge 2.6 GHz, 2x 8 cores, 2x Tesla K20X GPU, 128 GB memory
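
    As a reference for what SOL108 (direct frequency response) does at each excitation frequency, the sketch below forms the complex dynamic stiffness and solves it with a sparse direct factorization; every frequency is independent, which is why DMP (with one GPU per task) scales so well here. The matrices and function are placeholders, not Nastran data structures.

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import splu

    def direct_frequency_response(K, B, M, f, freqs_hz):
        """Solve (K + i*w*B - w^2*M) x = f at each frequency; return the solutions."""
        solutions = []
        for f_hz in freqs_hz:
            w = 2.0 * np.pi * f_hz
            Z = (K + 1j * w * B - (w ** 2) * M).tocsc()   # complex, generally unsymmetric
            solutions.append(splu(Z).solve(f))             # sparse factor + solve per frequency
        return solutions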

  • MSC Nastran 2013: Solution Price-Performance Gain

  • NVH with MSC Nastran 2013: Trimmed Car Body Frequency Response with SOL108

    USA Auto OEM model: 1.2M nodes, 7.47M DOF; shells (CQUAD4): 1.04M; solids (CTETRA): 0.1M; 100 frequency increments (FREQ1)

    [Chart, lower is better – elapsed time in hours: serial 1X, SMP 4c 2.5X, SMP 4c+1g (x1 node) 4.4X, DMP 4c+1g (x2 nodes) 6.8X, DMP 4c+1g (x3 nodes) 9X]

    Server node: Sandy Bridge 2.6 GHz, 2x 8 cores, 2x Tesla K20X GPU, 128 GB memory

  • NVH with MSC Nastran 2013: Engine Model Modal Frequency Response with SOL111

    • Japan Auto OEM
      – Nodes: 1.4M, elements: 0.78M (mainly TETRA10)
      – Modes: 104 (2500 Hz)
      – Front size: 23,718

    [Chart: CPU time breakdown (sparse decomposition only, shift + decomposition, FBS + matrix-vector multiply) – 1 CPU: 9052 sec vs. 1 CPU + 1 GPU: 5116 sec]

    [Chart: elapsed time breakdown (pre-eigenvalue, eigenvalue, Resvec, post-eigenvalue) – 1 CPU: 9702 sec vs. 1 CPU + 1 GPU: 5647 sec]

    1.7x speedup

  • Marc 2012

    • Marc multifrontal sparse solver is GPU accelerated
      – Marc solver type 8
    • Support of multi-GPU, on Linux and Windows
      – Recommend 1 GPU per DDM domain

  • Marc 2012 – GPU Acceleration

    Customer model: automotive engine, 1M DOF, 170K elements

    • 6.5X speedup with 2 GPUs over the serial run

    [Chart: elapsed time for serial, 1c+1gpu, nps=2, nps=2 with 2 GPUs, and nps=4 with 2 GPUs]

  • Marc 2012 – GPU Acceleration of US Auto OEM Model

    • 2.5 million elements, 10 million DOF
    • Nonlinear bolt tightening, 48 iterations

    [Chart: end-to-end speedup for serial (1c), 4c, and 1c + 1 GPU]

  • Conclusions

    • GPUs provide significant performance acceleration for direct-solver-intensive large jobs, i.e. max front > 10,000 for real data and > 5,000 for complex data models.
    • Multi-GPU performance is available with DMP > 1, including for NVH SOL108 (embarrassingly parallel).
    • NVIDIA and MSC continue to work together to tune the BLAS and LAPACK kernels for MSCLDL and MSCLU.
    • As models become larger, the value of GPGPU becomes greater.

  • Thank You