Assessing the Performance of Computational Engineering Codes

Omkar Deshmukh

Simulation Based Engineering Laboratory

Department of Electrical and Computer Engineering

5/13/2015 University of Wisconsin–Madison

Acknowledgments

• Advisor

• Associate Professor Dan Negrut

• Committee members

• Associate Professor Krishnan Suresh

• Assistant Professor Eftychios Sifakis

• Lab members

• Dr. Radu Serban, Hammad Mazhar, Andrew Seidl, Ang Li, Naveen Subramaniam, Vennila Megavannan


Overview

• Motivation and Background

• Systems Under Test

• Libraries and Benchmarks

• Benchmarking Results

• Performance Database (PerfDB)

• Live Demo

• Conclusions and Future Work


Motivation

• Why benchmark?

• How to benchmark?

• How to analyze results?

• Project contributions:

• Benchmarking state-of-the-art hardware platforms

• Creating infrastructure for performance benchmarking


Hardware – The CPUs

• AMD Opteron 6274

• 64 cores, 4 sockets, 128GB DDR3 RAM.

• Intel Core i7-5960X

• Haswell-E, 16 virtual cores, 32GB DDR4 RAM

• Intel Xeon E5-2690 v2

• Ivy Bridge-EP, 2 sockets, 40 virtual cores, 64GB DDR3 RAM

• Intel Xeon Phi Coprocessor 5110P

• MIC, 60 cores / 240 threads, 512-bit VPU, 8 GB GDDR5 RAM


Hardware – The GPUs

• NVidia Tesla K40c

• Kepler, 12GB GDDR5 RAM, 2880 scalar processors

• NVidia Tesla K20Xm

• Kepler, 6GB GDDR5 RAM, 2688 scalar processors

• NVidia GeForce GTX 770

• Kepler, 4GB GDDR5 RAM, 1536 scalar processors

• AMD A10-7850K

• Kaveri APU, 16GB DDR3 RAM, 4 + 8 HSA cores, 512 GPU SPs


The Benchmarks

• Reduction

• Output = $\sum_{i=0}^{n} x_i$

• Streaming access, O(N)

• SAXPY

• $y_i \leftarrow \alpha x_i + y_i$

• Streaming access, 2 Reads + 1 Write per element

• Prefix Scan

• $x_n = \sum_{i=0}^{n} x_i$

• Streaming access, O(N log(N))

• Sorting

• Performance depends upon implementation

• Random access
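As a point of reference for the four kernels above, here is a minimal serial sketch in plain Python. It only illustrates what each benchmark computes; it is not the Thrust, VexCL, MKL, or Blaze code that was measured, and the function names (`reduction`, `saxpy`, `inclusive_scan`, `saxpy_traffic`) are made up for this example. The scan shown is the simple one-pass serial form, whereas parallel library implementations organize the work differently. The traffic helper assumes single-precision elements (4 bytes) unless told otherwise.

```python
from itertools import accumulate


def reduction(x):
    """Sum all elements in one streaming pass over the input, O(N)."""
    total = 0.0
    for value in x:
        total += value
    return total


def saxpy(alpha, x, y):
    """Return alpha * x[i] + y[i] for every i (reads x and y, writes the result)."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def inclusive_scan(x):
    """Inclusive prefix sum: out[n] = x[0] + ... + x[n]."""
    return list(accumulate(x))


def saxpy_traffic(n, bytes_per_element=4):
    """Flops and bytes moved by SAXPY on n elements.

    Assumes 4-byte (single-precision) elements by default.
    """
    flops = 2 * n  # one multiply and one add per element
    bytes_moved = 3 * n * bytes_per_element  # read x, read y, write y
    return flops, bytes_moved


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0]
    y = [10.0, 20.0, 30.0, 40.0]
    print(reduction(x))
    print(saxpy(2.0, x, y))
    print(inclusive_scan(x))
    print(sorted([3, 1, 2]))  # sorting: the built-in sort stands in for the library sort
    print(saxpy_traffic(1_000_000))
```

The traffic helper makes the "2 reads + 1 write" note concrete: SAXPY performs 2 flops for every 12 bytes moved in single precision, which is why it tends to stress memory bandwidth rather than arithmetic throughput.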


Numerical Computing Libraries

• Thrust

• STL-like, commercially developed by Nvidia

• Supports OpenMP, CUDA

• VexCL

• Vector expression template library for GPGPU programming

• Supports OpenCL, CUDA

• Intel Math Kernel Library (MKL)

• BLAS and LAPACK interfaces

• Blaze

• Dense and sparse arithmetic

• Supports OpenMP, C++11 and Boost threads


Results – Reduction

Intel Xeon Phi

• H/W with best performance

• Scales up

• Thrust outperforms VexCL

Intel Xeon E5-2690v2

• Compute-bound → memory-bound transition


Results – Reduction

NVidia Tesla K20Xm

• Thrust scales up

• VexCL saturated

AMD A10 7850K

• GPU-only implementation performs similarly to CPU+GPU


Results – SAXPY

Intel Xeon Phi

• Performance of libraries

• Flat profiles

AMD Opteron 6274

• Performance at 10M and 25M

• Transition to I/O-intensive workload


Results – SAXPY

NVidia Tesla K20Xm

• Thrust outperforms

• Dimension matters – division of work across SMs

AMD Opteron 6274 + Blaze

• Different backends → different performance


Results – Prefix Scan

VexCL + OpenCL

• Best-case scenario for Xeon Phi only

• Flat performance profiles

Thrust + OpenMP

• Outperforms VexCL

• Noticeably worse on Xeon Phi


Results – Prefix Scan

VexCL + OpenCL

• OpenCL and CUDA backends closely matched

Thrust + CUDA

• Scales up

• Higher performance than VexCL


Results – Sort

VexCL + OpenCL

• Drop in sort rate for Xeon Phi

Thrust + OpenMP

• 4 to 5 times faster than VexCL


Software Setup for PerfDB

• The need for a database

• Information archival and retrieval

• Deluge of data, bound to grow quickly

• Easy to collaborate

• Use GitHub to keep track of:

• Source code + makefiles

• Results and reports

• SQLite3 – Embedded database


Database Schema


Interacting with PerfDB

Semi-automated process →

• Manual pre-run setup – uses config.json

• Automated benchmark reporting

{
  "db_url": "sqlite:///perfdb",
  "host_id": "3",
  "accl_id": "6",
  "system_id": "30",
  "source_id": "1",
  "perf_id": "1"
}

config.json
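A small sketch of how the pre-run configuration might be read and checked before a benchmark starts. The key names come from the config.json example on the slide; the function name `load_config` and the choice to convert the `*_id` strings to integers are assumptions made for illustration, not the project's actual code.

```python
import json

# Keys taken from the config.json example on the slide.
REQUIRED_KEYS = ("db_url", "host_id", "accl_id", "system_id", "source_id", "perf_id")


def load_config(text):
    """Parse config.json text and check that every expected key is present."""
    config = json.loads(text)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError("config.json is missing: " + ", ".join(missing))
    # Assumption: the *_id values are row identifiers, stored as strings in
    # the file, so they are converted to integers here.
    ids = {key: int(config[key]) for key in REQUIRED_KEYS if key.endswith("_id")}
    return config["db_url"], ids


if __name__ == "__main__":
    example = """
    {
      "db_url": "sqlite:///perfdb",
      "host_id": "3",
      "accl_id": "6",
      "system_id": "30",
      "source_id": "1",
      "perf_id": "1"
    }
    """
    db_url, ids = load_config(example)
    print(db_url)
    print(ids)
```

Failing early on a missing key keeps a long benchmark run from finishing with results that cannot be attributed to a host, accelerator, or system.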


name = 'test name'
input = 'vector or matrix name'
datatype = 'float/double'
dim_x = #int
dim_y = #int
NNZ = #int
value_type = 'GFLOPS or keys/sec'
value = #float

Benchmark Output
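The benchmark output fields above could be turned into a typed record with a parser along these lines. This is a sketch under two assumptions: each field arrives as one `key = value` line, and `dim_x`, `dim_y`, and `NNZ` are integers while `value` is a float, as the `#int` and `#float` placeholders suggest. The function name `parse_benchmark_output` is hypothetical, and the sample values in the usage section are invented for illustration, not measurements.

```python
def parse_benchmark_output(text):
    """Turn 'key = value' lines into a dict with typed values."""
    int_fields = {"dim_x", "dim_y", "NNZ"}
    record = {}
    for line in text.splitlines():
        if "=" not in line:
            continue  # skip blank or unrelated lines
        key, _, raw = line.partition("=")
        key = key.strip()
        value = raw.strip().strip("'\"")  # drop surrounding quotes, if any
        if key in int_fields:
            record[key] = int(value)
        elif key == "value":
            record[key] = float(value)
        else:
            record[key] = value
    return record


if __name__ == "__main__":
    # Invented sample output, for illustration only.
    sample = "\n".join([
        "name = 'saxpy'",
        "input = 'vector'",
        "datatype = 'float'",
        "dim_x = 1000000",
        "dim_y = 1",
        "NNZ = 0",
        "value_type = 'GFLOPS'",
        "value = 1.5",
    ])
    print(parse_benchmark_output(sample))
```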

Interacting with PerfDB

• Web based interface

• Get existing data

• Insert new configurations

• Query results

• Command line interface

• Access to SQLite3 shell

• Python utilities for similar functionality

• The script “insert.py” is common to both workflows
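To make the command-line workflow concrete, here is a minimal sketch of storing and querying a result with Python's standard `sqlite3` module. Neither the contents of insert.py nor the PerfDB schema appear in this transcript, so the `results` table, its columns, and the helper names below are all hypothetical stand-ins rather than the project's actual design. An in-memory database is used so the example leaves no file behind.

```python
import sqlite3


def create_results_table(conn):
    """Create a hypothetical results table (not the actual PerfDB schema)."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS results (
               id INTEGER PRIMARY KEY,
               system_id INTEGER NOT NULL,
               name TEXT NOT NULL,
               datatype TEXT,
               dim_x INTEGER,
               value_type TEXT,
               value REAL
           )"""
    )


def insert_result(conn, system_id, record):
    """Insert one benchmark record and return its row id."""
    cur = conn.execute(
        "INSERT INTO results (system_id, name, datatype, dim_x, value_type, value) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (system_id, record["name"], record["datatype"], record["dim_x"],
         record["value_type"], record["value"]),
    )
    conn.commit()
    return cur.lastrowid


def best_result(conn, name):
    """Return (system_id, value) of the highest value for a test, or None."""
    return conn.execute(
        "SELECT system_id, value FROM results WHERE name = ? "
        "ORDER BY value DESC LIMIT 1",
        (name,),
    ).fetchone()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_results_table(conn)
    # Invented values, for illustration only.
    record = {"name": "saxpy", "datatype": "float", "dim_x": 1000000,
              "value_type": "GFLOPS", "value": 1.5}
    insert_result(conn, 30, record)
    print(best_result(conn, "saxpy"))
```

Parameterized queries (the `?` placeholders) keep values supplied through the web interface or the command line from being interpreted as SQL.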


PerfDB Demo


Conclusions

• Benchmarking:

• Performance depends on application requirements

• Understand the context of vendor-advertised performance metrics

• Numerical Computing Libraries:

• Thrust – Consistent and fast

• VexCL – GPU performance lower than Thrust

• MKL – Not always the best option

• Software Setup

• Pros and cons of embedded SQLite3 database


Future Work

• Current version – Functional and ready to use

• In the short term:

• Use CMake for portable cross-platform builds

• Move to database server, e.g. PostgreSQL

• Long term goals:

• Incorporate software profiling

• Extend web-based interface

• Widen the user and/or contributor base


Thank you!


Comparison - Reduction


Comparison - Scan


Comparison - Sort

