Assessing the Performance of Computational...
Transcript of Assessing the Performance of Computational...
Assessing the Performance of
Computational Engineering Codes
Omkar Deshmukh
Simulation Based Engineering Laboratory
Department of Electrical and Computer Engineering
5/13/2015 University of Wisconsin–Madison 1
Acknowledgments
• Advisor
• Associate Professor Dan Negrut
• Committee member
• Associate Professor Krishnan Suresh
• Assistant Professor Eftychios Sifakis
• Lab members
• Dr. Radu Serban, Hammad Mazhar, Andrew Seidl, Ang Li, Naveen
Subramaniam, Vennila Megavannan
5/13/2015 University of Wisconsin–Madison 2
Overview
• Motivation and Background
• Systems Under Test
• Libraries and Benchmarks
• Benchmarking Results
• Performance Database (PerfDB)
• Live Demo
• Conclusions and Future Work
5/13/2015 University of Wisconsin–Madison 3
Motivation
• Why benchmark?
• How to benchmark?
• How to analyze results?
• Project contributions:
• Benchmarking state-of-the-art hardware platforms
• Creating infrastructure for performance benchmarking
5/13/2015 University of Wisconsin–Madison 4
Hardware – The CPUs
• AMD Opteron 6274
• 64 cores, 4 sockets, 128GB DDR3 RAM.
• Intel Core i7-5960X
• Haswell-E, 16 virtual cores, 32GB DDR4 RAM
• Intel Xeon E5-2690 v2
• Ivy Bridge-EP, 2 sockets 40 virtual cores, 64GB DDR3 RAM
• Intel Xeon Phi Coprocessor 5110P
• MIC, 60 cores / 240 threads, 512-bit VPU, 8 GB GDDR5 RAM
5/13/2015 University of Wisconsin–Madison 5
Hardware – The GPUs
• NVidia Tesla K40c
• Kepler, 12GB GDDR5 RAM, 2880 scalar processors
• NVidia Tesla K20Xm
• Kepler, 6GB GDDR5 RAM, 2688 scalar processors
• NVidia GeForce GTX 770
• Kepler, 4GB GDDR5 RAM, 1536 scalar processor
• AMD A10-7850K
• Kaveri APU, 16GB DDR3 RAM, 4 + 8 HSA cores, 512 GPU SPs
5/13/2015 University of Wisconsin–Madison 6
The Benchmarks
• Reduction
• Output = 𝑥𝑖𝑛𝑖=0
• Streaming access, O(N)
• SAXPY
• 𝑦𝑖 ← α 𝑥𝑖 + 𝑦𝑖
• Streaming access, 2 Reads + 1 Write per element
• Prefix Scan
• 𝑥𝑛 = 𝑥𝑖𝑛𝑖=0
• Streaming access, O(N log(N))
• Sorting
• Performance depends upon implementation
• Random access
5/13/2015 University of Wisconsin–Madison 7
Numerical Computing Libraries
• Thrust
• STL-like, commercially developed by Nvidia
• Supports OpenMP, CUDA
• VexCL
• Vector expression template library for GPGPU programming
• Support OpenCL, CUDA
• Intel Math Kernel Library (MKL)
• BLAS and LAPACK interfaces
• Blaze
• Dense and sparse arithmetic
• Supports OpenMP, C++11 and Boost threads
5/13/2015 University of Wisconsin–Madison 8
Results – Reduction
Intel Xeon Phi
• H/W with best performance
• Scales up
• Thrust Outperforms VexCL
Intel Xeon E5-2690v2
• Compute → Memory bound
transition
5/13/2015 University of Wisconsin–Madison 9
Results – Reduction
NVidia Tesla K20Xm
• Thrust scales up
• VexCL saturated
AMD A10 7850K
• GPU only implementation works
similar to CPU+GPU
5/13/2015 University of Wisconsin–Madison 10
Results – SAXPY
Intel Xeon Phi
• Performance of libraries
• Flat profiles
AMD Opteron 6274
• Performance at 10M and 25M
• Transition to I/O intensive workload
5/13/2015 University of Wisconsin–Madison 11
Results – SAXPY
NVidia Tesla K20Xm
• Thrust outperforms
• Dimension matter – Division SMs
AMD Opteron 6274 + Blaze
• Different backends → Different
performance
5/13/2015 University of Wisconsin–Madison 12
Results – Prefix Scan
VexCL + OpenCL
• Best case scenario for Xeon Phi
only
• Flat performance profiles
Thrust + OpenMP
• Outperforms VexCL
• Noticeable worse on Xeon Phi
5/13/2015 University of Wisconsin–Madison 13
Results – Prefix Scan
VexCL + OpenCL
• OpenCL and CUDA backend
closely matched
Thrust + CUDA
• Scales up
• Higher performance than VexCL
5/13/2015 University of Wisconsin–Madison 14
Results – Sort
VexCL + OpenCL
• Drop in sort rate for Xeon Phi
Thrust + OpenMP
• 4 to 5 times faster than VexCL
5/13/2015 University of Wisconsin–Madison 15
Software Setup for PerfDB
• The need for database
• Information archival and retrieval
• Deluge of data. Bound to increase fast
• Easy to collaborate
• Use Github to keep track of:
• Source code + makefiles
• Results and reports
• SQLite3 – Embedded database
5/13/2015 University of Wisconsin–Madison 16
Interacting with PerfDB
Semi-automated process →
• Manual pre-runs setup – Uses
config.json
• Automated benchmark reporting
{
"db_url": "sqlite:///perfdb",
"host_id": "3",
"accl_id": "6",
"system_id": "30",
"source_id": "1",
"perf_id": "1"
} Config.json
5/13/2015 University of Wisconsin–Madison 18
name = 'test name' input = 'vector or matrix name' datatype = 'float/double' dim_x = #int dim_y = #int NNZ = #int value_type = 'GFLOPS or keys/sec' value = #float
Benchmark Output
Interacting with PerfDB
• Web based interface
• Get existing data
• Insert new configurations
• Query results
• Command line interface
• Access to SQLite3 shell
• Python utilities for similar functionality
• Usage of script “insert.py” common to both workflows
5/13/2015 University of Wisconsin–Madison 19
Conclusions
• Benchmarking:
• Performance dependent on application requirements
• Understand the context of vendor-advertised performance metrics
• Numerical Computing Libraries:
• Thrust – Consistent and fast
• VexCL – GPU performance lower than Thrust
• MKL – Not always the best option
• Software Setup
• Pro and cons of embedded SQLite3 database
5/13/2015 University of Wisconsin–Madison 21
Future Work
• Current version – Functional and ready to use
• In short term:
• Use CMake for portable cross-platform builds
• Move to database server, e.g. PostgreSQL
• Long term goals:
• Incorporate software profiling
• Extend web-based interface
• Widen the user and/or contributor base
5/13/2015 University of Wisconsin–Madison 22