Hedgehog: A Performance-Oriented General-Purpose Library for … · 2020. 3. 20. · Motivation...
Hedgehog: A Performance-Oriented General-Purpose Library for Multi-GPU Systems
Alexandre Bardakoff – Timothy Blattner
Bruno Bachelet – Walid Keyrouz – Loïc Yon
Disclaimer
No approval or endorsement of any commercial product by NIST is intended or implied. Certain commercial software, products, and systems are identified in this report to facilitate better understanding. Such identification does not imply recommendations or endorsement by NIST, nor does it imply that the software and products identified are necessarily the best available for the purpose.
Hedgehog—A. Bardakoff & T. Blattner2
Acknowledgment
• NIST
  • Mary Brady
  • Walid Keyrouz
• LIMOS
  • Bruno Bachelet
  • Loïc Yon
Motivation – Hardware
• Servers
  • AMD EPYC 7702P w/64 cores, Intel Xeon Platinum 8253 Processor w/16 cores
• Desktops
  • AMD Ryzen Threadripper 3990X w/64 cores, AMD Ryzen 9 PRO 3900 w/12 cores
  • Intel Core i9-10980XE Extreme Edition w/18 cores (3x hyperthreading)
• Laptops
  • AMD Ryzen 7 4800H w/8 cores, Intel Core i9-9980HK w/8 cores
• Mobile CPU: Kryo 585 w/8 cores
• GPUs:
  • GeForce RTX 2080: 9362 (SP), 292.6 (DP), 18720 (HP) GFLOPS
  • Tesla T4 GPU accelerator: 8100 (single precision) GFLOPS
Motivation – Understandable Scalable Programs
• Abstract model of execution
• Explicit representation of an algorithm
  • Exists during execution
  • Used to instrument and reason about performance
• Experimentation for performance using high-level abstractions
  • Without loss of potential performance
Requirements
• Manage a node with many cores and one or multiple GPUs
• Explicit representation of an algorithm (that exists during execution)
• High-level abstractions (without loss of potential performance)
Outline
• Basic concepts
• Hedgehog
• Experiments
Basic Concepts
Data flow graph
Data pipelining
HTGS & library
Asynchronous Data Flow Graph
• Program model
• Directed graph representation
  • 1 entry and 1 exit point (source and sink)
• Components
  • Nodes: computations or state management
  • Edges: directed information flow
[Figure: Addition algorithm and its data flow representation, with inputs A and B and output C]
Data Pipelining
[Figure: Data pipelining representation. Data1, Data2, and Data3 each go through the Read, Compute, and Write stages; the stages of successive items overlap in time (overlapping computation).]
Stages start as soon as data becomes available
• Asynchronous behavior
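The pipelining idea above can be sketched in standard C++ without any library. This is only an illustration, not Hedgehog or HTGS code: the `Channel` class, `runPipeline`, and the choice of squaring as the "compute" step are all assumptions made for the example. Each stage runs in its own thread and starts working as soon as its input queue holds an item.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical blocking queue: pop() waits until an item arrives or the
// channel is closed.
template <class T>
class Channel {
 public:
  void push(T value) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      assert(!closed_ && "push after close");
      items_.push(std::move(value));
    }
    ready_.notify_one();
  }
  void close() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      closed_ = true;
    }
    ready_.notify_all();
  }
  // Returns std::nullopt once the channel is closed and drained.
  std::optional<T> pop() {
    std::unique_lock<std::mutex> lock(mutex_);
    ready_.wait(lock, [this] { return !items_.empty() || closed_; });
    if (items_.empty()) return std::nullopt;
    T value = std::move(items_.front());
    items_.pop();
    return value;
  }

 private:
  std::mutex mutex_;
  std::condition_variable ready_;
  std::queue<T> items_;
  bool closed_ = false;
};

// Runs read -> compute -> write as three concurrent stages.
// "Read" produces 1..count, "compute" squares, "write" collects the results.
std::vector<int> runPipeline(int count) {
  Channel<int> readToCompute, computeToWrite;
  std::vector<int> written;

  std::thread reader([&] {
    for (int i = 1; i <= count; ++i) readToCompute.push(i);
    readToCompute.close();
  });
  std::thread computer([&] {
    while (auto item = readToCompute.pop()) computeToWrite.push(*item * *item);
    computeToWrite.close();
  });
  std::thread writer([&] {
    while (auto item = computeToWrite.pop()) written.push_back(*item);
  });

  reader.join();
  computer.join();
  writer.join();
  return written;
}
```

Because each stage has a single thread here, items leave the pipeline in the order they entered; the overlap comes from the reader already producing the next item while the earlier ones are being computed or written.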
Hybrid Task Graph Scheduler - HTGS
• Coarse-Grained Parallelism
• Pipelined Multi-Threaded
• Multi-CPU and Multi-GPU
• C++11 headers-only library
• Visual Debugging Feature
• Rich API
Blattner T., Keyrouz W., The Hybrid Task Graph Scheduler API (2017), GitHub repository, https://github.com/usnistgov/HTGS
Blattner, T. et al., J Sign Process Syst (2017) 89: 457, https://doi.org/10.1007/s11265-017-1262-6
Hedgehog
Overview
API
Usage
Example
Overview
• Coarse grain parallelism
• Dataflow graph representation
• Data pipelining to obtain performance & keep hardware busy
• Separation of concerns:
  • Tasks; State; Memory Management
• C++17, headers-only library
• General purpose
• Open source and available
• Metaprogramming for type safety
Methodology
[Figure: Methodology used in Hedgehog. Labels: Algorithm, Data flow, Formulation, Implementation, Hedgehog, Refinement.]
API - Nodes
• Multiple Inputs - Single Output
• Shutdown virtual method to break cycles
• Tasks
  • Step of an algorithm / Computation kernels
  • Special task for (NVIDIA) GPU computations
  • Multithreaded
• State manager (single-threaded)
  • Local computation’s state management
  • State shared between different managers in the graph
API - Memory Manager
• Throttles memory usage
• Links to a task or state
• Pool of available pieces of data
• Static
  • Create n objects calling a specific constructor
  • Ensure constructor signature by using SFINAE construct
• Dynamic
  • Create n objects calling default constructor
• Mechanism to recycle memory / objects
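To make the throttling and recycling idea concrete, here is a minimal sketch of a fixed-capacity object pool in standard C++. It is not Hedgehog's memory manager: the `FixedPool` name and its `acquire`, `recycle`, and `available` methods are hypothetical, and the sketch only covers the case where objects are built with their default constructor.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical fixed-capacity pool: acquire() blocks while every object is
// in use, which bounds how much memory is in flight at once.
template <class T>
class FixedPool {
 public:
  explicit FixedPool(std::size_t capacity) {
    assert(capacity > 0);
    // Create all objects up front with the default constructor.
    for (std::size_t i = 0; i < capacity; ++i) free_.push_back(std::make_unique<T>());
  }

  // Hands out one object, waiting until one has been recycled if needed.
  std::unique_ptr<T> acquire() {
    std::unique_lock<std::mutex> lock(mutex_);
    available_.wait(lock, [this] { return !free_.empty(); });
    std::unique_ptr<T> object = std::move(free_.back());
    free_.pop_back();
    return object;
  }

  // Returns an object to the pool and wakes one waiting consumer.
  void recycle(std::unique_ptr<T> object) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      free_.push_back(std::move(object));
    }
    available_.notify_one();
  }

  std::size_t available() {
    std::lock_guard<std::mutex> lock(mutex_);
    return free_.size();
  }

 private:
  std::mutex mutex_;
  std::condition_variable available_;
  std::vector<std::unique_ptr<T>> free_;
};
```

A producer that calls `acquire()` faster than consumers call `recycle()` simply waits, so the pool size acts as the upper bound on memory used by that data type.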
API - Graph
• Graph
  • Algorithm representation
  • Group nodes (tasks, state manager, memory manager)
  • Can be part of another graph
    • Share or compose algorithms
  • Bind a graph to a GPU
  • Only object used by an end-user
• Execution Pipeline
  • Duplicate graph
  • Data decomposition rules
  • Associate each graph with a GPU
Explicit representation
• Create a graphical representation
  • Very low overhead (task level)
• Information gathered
  • Graph: execution & creation times
  • Nodes: wait & execution times
• Node colors
  • Based on execution & wait times
• Multiple options (all threads)
[Figure: Dot representation of the Matrix Multiplication Graph (execution time: 10.054 ms; creation time: 1.674 ms). Nodes between the source and the sink, with their recorded times:]
• RowTraversal x 1: wait 0 us, execution 40 us
• ColumnTraversal x 1: wait 0 us, execution 7 us
• RowTraversal x 1: wait 0 us, execution 8 us
• Input State Manager: wait 0 us, execution 120 us
• Partial Computation State Manager: wait 160 us, execution 349 us
• Product Task x 3: wait min 419 us, avg 481 us +- 83 us, max 594 us; execution min 296 us, avg 315 us +- 24 us, max 329 us
• Addition Task x 3: wait min 591 us, avg 690 us +- 79 us, max 768 us; execution min 165 us, avg 185 us +- 21 us, max 199 us
• Output State Manager: wait 449 us, execution 20 us
Edge labels: Matrix_A, Matrix_B, Matrix_C, MatrixBlock_A, MatrixBlock_B, MatrixBlock_C, MatrixBlock_P, <MatrixBlock_A, MatrixBlock_B>, <MatrixBlock_C, MatrixBlock_P>
Library Example (Fast Loader)
[Figure: Fast Loader architecture. Labels in the Fast Loader Graph: View Waiter, View Loader, View Counter, AbstractTile Loader (repeated), TileCache, File, Memory manager, Algorithm. The loader structure is repeated for file level 1 through file level n.]
Safety @ Compile Time (Metaprogramming)
• Checks coherency rules with traits and constexpr:
  • A graph's input task has at least one input type that matches one of the graph's input types
  • Two linked tasks have at least one common type: the sending task's output type matches at least one of the receiving task's input types
• Checks restriction rule with traits:
  • To connect a memory manager to a node, the managed type must be the node's output type
• Generates code with SFINAE construct:
  • Generate constructor for managed types
• Can be easily modified to take advantage of C++20
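The kind of compile-time coherency check listed above can be illustrated with a small C++17 sketch. This is not Hedgehog's implementation: `isOneOf`, `Node`, `canLink`, and the `Split` / `Increment` aliases are hypothetical names chosen for the example. The sketch only shows how traits and `constexpr` can reject a link whose types do not match before the program runs.

```cpp
#include <type_traits>

// True when T is one of Candidates... (false for an empty list).
template <class T, class... Candidates>
inline constexpr bool isOneOf = (std::is_same_v<T, Candidates> || ...);

// Hypothetical node description: one output type, several input types.
template <class Output, class... Inputs>
struct Node {
  using output_type = Output;
  template <class T>
  static constexpr bool accepts = isOneOf<T, Inputs...>;
};

// A link is well-formed only when the sender's output type is one of the
// receiver's input types.
template <class Sender, class Receiver>
constexpr bool canLink() {
  return Receiver::template accepts<typename Sender::output_type>;
}

struct Array {};
struct Range {};
using Split = Node<Range, Array>;     // consumes Array, produces Range
using Increment = Node<void, Range>;  // consumes Range, produces nothing

// Both checks are evaluated by the compiler; a violation stops the build.
static_assert(canLink<Split, Increment>(), "Split's output must be an input of Increment");
static_assert(!canLink<Increment, Split>(), "void is not an input of Split");
```

A library can place such a `static_assert` inside its edge-creation function so that a mistyped connection produces a readable compiler error instead of a runtime failure.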
System latency
Usability (Summer 2019)
• Rising sophomore student
• No knowledge about
  • C++
  • Parallel programming
• In < 3 months:
  • Learned enough C++ to use the library
  • Created base graphs to represent algorithm
  • Prototyped several numerical linear algebra operations
  • Got (good) results…
Example
• Goal
  • API overview
  • Increment all elements in an array
• Algorithm
  • Split array into chunks
  • Increment chunks in parallel
[Figure: Algorithm representation. Array → "Split the array" → Range → "Increment in range" → Nothing]
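Before looking at the library version, the split-then-increment algorithm can be sketched in plain C++ with `std::thread`. This is an assumption-laden illustration, not how Hedgehog schedules work: the `incrementInChunks` function is hypothetical and starts one thread per chunk for simplicity, whereas a real implementation would reuse a fixed number of worker threads.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Adds `increment` to every element of `data`, processing chunks of at most
// `batchSize` elements concurrently (one thread per chunk in this sketch).
void incrementInChunks(std::vector<int>& data, int increment, std::size_t batchSize) {
  assert(batchSize > 0);
  std::vector<std::thread> workers;
  for (std::size_t pos = 0; pos < data.size(); pos += batchSize) {
    // Half-open range [first, last); the last chunk may be shorter.
    auto first = data.begin() + static_cast<std::ptrdiff_t>(pos);
    auto last = data.begin() +
                static_cast<std::ptrdiff_t>(std::min(data.size(), pos + batchSize));
    workers.emplace_back([first, last, increment] {
      std::for_each(first, last, [increment](int& x) { x += increment; });
    });
  }
  for (auto& worker : workers) worker.join();
}
```

The chunks never overlap, so the threads write to disjoint elements and no locking is needed.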
Example: Some data
#include <algorithm>
#include <array>
#include <memory>
#include <hedgehog/hedgehog.h>

const size_t SIZE = 1'000'000'000;  // 10^9 --- ginormous size
using MYARRAY = std::array<int, SIZE>;

// Half-open iterator range [begin_, end_) over one chunk of the array
struct ItBeginEnd {
  MYARRAY::iterator begin_, end_;
  ItBeginEnd(MYARRAY::iterator const &begin, MYARRAY::iterator const &end)
      : begin_(begin), end_(end) {}
};
Example: Tasks / Split vector
class SplitVector : public hh::AbstractTask<ItBeginEnd, MYARRAY> {
 private:
  size_t batchSize_ = 0;

 public:
  explicit SplitVector(size_t batchSize)
      : AbstractTask("Split Vector Task"), batchSize_(batchSize) {}

  // Emit one iterator range per chunk of at most batchSize_ elements
  void execute(std::shared_ptr<MYARRAY> v) override {
    for (size_t pos = 0; pos < SIZE; pos += batchSize_) {
      this->addResult(std::make_shared<ItBeginEnd>(
          v->begin() + pos, v->begin() + std::min(SIZE, pos + batchSize_)));
    }
  }
};
Example: Tasks / Batch Increment
class BatchIncrement : public hh::AbstractTask<void, ItBeginEnd> {
 private:
  int increment_ = 0;  // int, matching the constructor argument and the element type

 public:
  explicit BatchIncrement(int increment, size_t numberThreads)
      : AbstractTask("Batch Increment Task", numberThreads), increment_(increment) {}

  // Called by the library to create one task instance per thread
  std::shared_ptr<AbstractTask<void, ItBeginEnd>> copy() override {
    return std::make_shared<BatchIncrement>(increment_, this->numberThreads());
  }

  void execute(std::shared_ptr<ItBeginEnd> ptr) override {
    std::for_each(ptr->begin_, ptr->end_, [this](int &x) { x += increment_; });
  }
};
Example: main
int main() {
  auto myArray = std::make_shared<MYARRAY>();

  // Instantiate graph parts
  auto graph = std::make_shared<hh::Graph<void, MYARRAY>>("Increment Array Graph");
  auto splitVectorTask = std::make_shared<SplitVector>(1000);  // batchSize: 1000
  auto batchIncrementTask = std::make_shared<BatchIncrement>(100, 10);  // +100, 10 threads

  // Construct Graph: link tasks and set graph's input / output, and run it
  graph->input(splitVectorTask);
  graph->addEdge(splitVectorTask, batchIncrementTask);
  graph->output(batchIncrementTask);
  graph->executeGraph();

  // Send data to the graph, and wait for termination
  graph->pushData(myArray);
  graph->finishPushingData();
  graph->waitForTermination();

  // Create dot representation after computation completes
  graph->createDotFile("Test.dot", hh::ColorScheme::EXECUTION, hh::StructureOptions::ALL);
}
Example: Graph Representation
Algorithm dot representation
Experiments
Linear Algebra Routines
Matrix Multiplication experiments
Linear Algebra Routines - Exploiting matrix decomposition
• Matrix decomposition inside operation
  • Most linear algebra implementations take advantage of this internally
• Matrix decomposition outside operation
  • Allows for streaming mode of computation
  • Output blocks can be used immediately
  • Time until computed data can be used should decrease substantially
  • Not available with other numerical linear algebra libraries
[Figure: Streaming of matrix blocks in and out of an operation. Input matrix as blocks → Operation → Computed blocks]
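The streaming idea can be sketched as an operation that hands each finished block to the caller instead of returning the whole result at the end. The code below is a sequential illustration in standard C++, not HMBLib code: the `Block` alias, the `streamingAdd` function, and the `onBlock` callback are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

using Block = std::vector<double>;  // one matrix block, flattened

// Adds two matrices given as lists of blocks. Each result block is passed to
// `onBlock` as soon as it is computed, before the next block is started.
void streamingAdd(const std::vector<Block>& a, const std::vector<Block>& b,
                  const std::function<void(std::size_t, const Block&)>& onBlock) {
  assert(a.size() == b.size());
  for (std::size_t i = 0; i < a.size(); ++i) {
    assert(a[i].size() == b[i].size());
    Block sum(a[i].size());
    for (std::size_t j = 0; j < sum.size(); ++j) sum[j] = a[i][j] + b[i][j];
    onBlock(i, sum);  // the consumer can use block i immediately
  }
}
```

With this shape, the time to the first usable output depends on one block rather than on the whole matrix, which is the quantity the "first block time" measurements later in the deck refer to.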
Hedgehog Matrix Block Library (HMBLib)
• Hedgehog - API that helps obtain performance
  • Designed for single system with many CPU cores & multiple GPUs
• Linear algebra subroutines (graphs)
  • Tasks
  • State-Managers
  • States
• Reuse kernels from existing libraries
[Figure: Example Graph (A + B = C). Labels: BlockA, BlockB, StateManager, Task, BlockC.]
Linear Algebra - General Matrix Multiplication
• Compatible with BLAS (gemm)
• Multiply blocks with same inner dimension
  • Uses OpenBLAS (gemm)
• Add blocks together
  • Add sum to corresponding block of matrix C
• Output final block
[Figure: Matrix Multiplication Representation, A x B + C = C]
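The block-wise formulation above can be written out as a plain, single-threaded C++ loop nest. This sketch is not HMBLib code and does not call OpenBLAS: the `blockedGemm` function is hypothetical, it assumes square row-major matrices, and the innermost loops stand in for the per-block gemm kernel.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Computes C = A x B + C for n x n row-major matrices, one block at a time.
void blockedGemm(const std::vector<double>& a, const std::vector<double>& b,
                 std::vector<double>& c, std::size_t n, std::size_t block) {
  assert(block > 0 && a.size() == n * n && b.size() == n * n && c.size() == n * n);
  for (std::size_t i0 = 0; i0 < n; i0 += block) {
    for (std::size_t j0 = 0; j0 < n; j0 += block) {
      for (std::size_t k0 = 0; k0 < n; k0 += block) {
        // Multiply block A(i0, k0) by block B(k0, j0), which share the inner
        // dimension, and add the partial product to block C(i0, j0).
        const std::size_t iEnd = std::min(n, i0 + block);
        const std::size_t jEnd = std::min(n, j0 + block);
        const std::size_t kEnd = std::min(n, k0 + block);
        for (std::size_t i = i0; i < iEnd; ++i)
          for (std::size_t k = k0; k < kEnd; ++k)
            for (std::size_t j = j0; j < jEnd; ++j)
              c[i * n + j] += a[i * n + k] * b[k * n + j];
      }
    }
  }
}
```

Block C(i0, j0) is final once the `k0` loop finishes, which is the point at which a streaming implementation could output it.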
Linear Algebra - LU decomposition, partial pivoting
• Factor a matrix as the product of two triangular matrices
  • Used to solve: Ax = B
  • Compatible with LAPACK’s getrf
• Recursive algorithm
• Row swapping enabled
  • Allows for more generalized matrices
  • Uses LAPACK’s laswp
[Figure: LU decomposition algorithm, PA = LU. Labels: Gauss, Solve, Update, Swap, recurse.]
Linear Algebra - Performance Study with HMBLib
• HMBLib v. OpenBLAS (gemm) & LAPACK (getrf)
  • 32,768 x 32,768 sized double precision matrices
    • Over 1 billion objects
    • ~16 GBs each
• Computer specifications for study:
  • 1 node, 2x 14 physical cores (56 logical)
    • 2 x Xeon E5-2680 @ 2.40 GHz
      • AVX2 (256-bit SIMD vector instruction)
  • 512 GB Memory
Linear Algebra - Matrix Multiplication Performance Study
• HMBLib v. OpenBLAS (gemm) overall computation comparison
  • ~660 GFlops v. ~445 GFlops
  • 1.50x performance improvement
Linear Algebra - Releasing Final Blocks (GEMM)
• First block time - time to release the first block of data
• Average block time - average time to release a block of data
• HMBLib vs OpenBLAS (gemm) time for first output comparison
  • 57x less time for first output of computed data
Linear Algebra - LU w/PP Performance Study
• HMBLib v. LAPACK (getrf) overall computation comparison
  • ~238 v. ~224 GFlops
  • 1.06x performance improvement
Linear Algebra - LU w/PP Performance Study
• HMBLib v. LAPACK (getrf) time for first output comparison
  • 42x less time for first computed data
Hedgehog CUDA Acceleration Experiment
• Objective:
  • Adapt Hedgehog OpenBLAS GEMM to use cuBLAS
• Goals:
  • Analyze performance to observe overhead related to Hedgehog
    • Compared with cublasXT and cublasMG as baselines
  • Use CUDA optimization techniques to keep the GPU(s) busy
• Hardware:
  • SuperMicro SYS-2029GP-TR Server
    • 2x 16 core Intel Xeon Silver 4216 CPUs @ 2.1 GHz
    • 792 GB DDR4
    • 4x Tesla V100-PCIe w/ 32 GB HBM2
Hedgehog(HH)-GEMM CUDA Optimizations
• CUDA technologies used
  • Unified memory
  • Asynchronous pre-fetch
  • Concurrent kernel execution
  • Synchronization through events
• HH-GEMM CUDA
  • Operates with user-specified block-size
  • Each block is contiguous and allocated outside of graph
    • No support for 2D cudaMemPrefetchAsync
HH-GEMM CUDA Graph
[Figure: HH-GEMM CUDA sub-graph (1 per GPU). Flow: Block (A or B) → Prefetch-In → Block → MatMul State → Pair → GEMM → Block → Addition State → Pair → Addition → Block. Functionality of each node as recovered from the slide:]
• Prefetch-In
  • HH Get MemA | B
  • Prefetch MemA | B CPU→GPU
  • Create Event1
• MatMul State
  • Pair MemA and MemB (based on MatMul)
• GEMM
  • HH Get MemPartial(P)
  • Prefetch MemP CPU→GPU
  • Synchronize Event1
  • cublasSgemm(MemP, MemA, MemB)
  • Synchronize Stream
  • Recycle MemA & B
  • Prefetch MemP GPU→CPU
  • Create Event2
• Addition State
  • Pair MemP with C
• Addition
  • Synchronize Event2
  • C = MemP + C
  • Recycle MemP
Thread-count labels in the figure: 5 (1 stream per thread), 2 (1 stream per thread), 1, 1, NThreads
HH-GEMM CUDA Results 16 GB Size Matrices
Streaming Linear Algebra with HH-GEMM CUDA
• Streaming linear algebra
  • Required minor modifications to code to switch between inner/outer traversals
    • Change loop order for pushing block data into graph
    • Alter memory pool size to have sufficient memory for both A and B
  • Performance can degrade if there is insufficient GPU memory (unified memory paging)
HH-GEMM Results 64 GB Size Matrices
Users
NIST Full Slide Microscopy Analysis
• Processing Hardware:
  • 2x - Xeon Gold 5120 “Skylake” 14-core CPUs
  • 2x - NVIDIA GTX Titan V graphics cards
• 100,000 x 50,000 pixel images
  • Traditional computer vision
  • Inference using TensorRT
    • Object Detection (Yolo V3)
    • Classification (Resnet50)
• End-to-end 60-90 seconds
  • Scales to number of GPUs
Comprehensive Nuclear-Test-Ban Treaty Preparatory Commission
• Processing Hardware:
  • DGX-1 server (8xV100s)
• Monitors the nuclear test ban treaty
  • 300+ stations with 1000+ sensors globally
• 2.268 billion cross correlations per second
  • 8 GPUs
  • Scales with number of GPUs
Conclusion
Which library allows us to manage a node with many threads and one or multiple GPUs, with an explicit representation of an algorithm (that exists during execution) and high-level abstractions (without loss of potential performance)?
• Hedgehog
  • Based on an explicit Data Flow Graph using Data Pipelining
  • With very low-overhead feedback that allows refinement
• HMBLib
  • Concept of streaming data shows promise
    • Relevant for GPU and CPU computation
    • Potential Applications: Large image processing, Galaxy and space mapping
• Available
  • Hedgehog: https://github.com/usnistgov/hedgehog
  • Tutorials: https://pages.nist.gov/hedgehog-Tutorials, https://github.com/usnistgov/hedgehog-Tutorials
Future
• Experiments with extending Hedgehog to operate beyond a single node
• General purpose libraries based around Hedgehog
  • Streaming full-slide microscopy analysis
• Compile-time static graph analyses
  • Check race conditions
  • Deadlock
• Principled dataflow-based “code generation”
  • Automated rule generation
Thank you
Any questions?