An AI architectural journey
![Page 1: An AI architectural journey](https://reader033.fdocuments.in/reader033/viewer/2022051906/62848493fd9a44125071d80e/html5/thumbnails/1.jpg)
An Architectural Journey to Scalable Supercomputing for AI
The Inside Story.
Jake Carroll, Associate Director – Research Computing (Institutes)
The University of Queensland
![Page 2: An AI architectural journey](https://reader033.fdocuments.in/reader033/viewer/2022051906/62848493fd9a44125071d80e/html5/thumbnails/2.jpg)
Today…
Is about science.
Is about technology.
Is about how worlds mix and collide.
Is about people.
![Page 3: An AI architectural journey](https://reader033.fdocuments.in/reader033/viewer/2022051906/62848493fd9a44125071d80e/html5/thumbnails/3.jpg)
QBI
CAI
IMB
AIBN
100’s of terabytes per day of data generated.
Eclectic mixture of life sciences data, engineering, physics, nanotech, imaging, genomics…
Meet the family.
![Page 4: An AI architectural journey](https://reader033.fdocuments.in/reader033/viewer/2022051906/62848493fd9a44125071d80e/html5/thumbnails/4.jpg)
Scientific infrastructure of immense scale
![Page 5: An AI architectural journey](https://reader033.fdocuments.in/reader033/viewer/2022051906/62848493fd9a44125071d80e/html5/thumbnails/5.jpg)
Non-deconvolved image | Deconvolved image
![Page 6: An AI architectural journey](https://reader033.fdocuments.in/reader033/viewer/2022051906/62848493fd9a44125071d80e/html5/thumbnails/6.jpg)
UQ’s supercomputing strategy - Right supercomputer for the right task. “Best fit”
Tinaroo - 7000 cores of Intel Broadwell. Tight MPI, massively parallel, InfiniBand FDR connected. “Traditional” HPC.
FlashLite - 1632 cores of Intel Haswell. High memory footprint, virtual SMP (ScaleMP), high throughput. SSD /tmp in each node.
Awoonga - 1032 cores of Intel Broadwell. Loosely coupled, embarrassingly parallel, high latency tolerant workloads. Ethernet connected HPC.
![Page 7: An AI architectural journey](https://reader033.fdocuments.in/reader033/viewer/2022051906/62848493fd9a44125071d80e/html5/thumbnails/7.jpg)
To cope with 100’s of terabytes per day of imaging, genomics and sensor data, UQ turned to GPU accelerated supercomputing to solve its significant and complex scientific problems.
Accelerator based supercomputing strategy.
![Page 8: An AI architectural journey](https://reader033.fdocuments.in/reader033/viewer/2022051906/62848493fd9a44125071d80e/html5/thumbnails/8.jpg)
Wiener – announced Nov 2017 @ SC2017
First nVidia Volta in Asia Pacific
![Page 9: An AI architectural journey](https://reader033.fdocuments.in/reader033/viewer/2022051906/62848493fd9a44125071d80e/html5/thumbnails/9.jpg)
Christiaan Huygens (1629-1695)
Norbert Wiener (1894-1964)
![Page 10: An AI architectural journey](https://reader033.fdocuments.in/reader033/viewer/2022051906/62848493fd9a44125071d80e/html5/thumbnails/10.jpg)
![Page 11: An AI architectural journey](https://reader033.fdocuments.in/reader033/viewer/2022051906/62848493fd9a44125071d80e/html5/thumbnails/11.jpg)
Better-than-human accuracy in cancer pathology and recognition.
30,000 pathology slides categorised by a CNN in 13 minutes.
TensorFlow function on nVidia Volta.
![Page 12: An AI architectural journey](https://reader033.fdocuments.in/reader033/viewer/2022051906/62848493fd9a44125071d80e/html5/thumbnails/12.jpg)
TensorFlow workload for skin cancer recognition inference
![Page 13: An AI architectural journey](https://reader033.fdocuments.in/reader033/viewer/2022051906/62848493fd9a44125071d80e/html5/thumbnails/13.jpg)
Machine vision of optic flow for UAV workloads. Novel use of machine vision techniques, partnership with US DoD.
![Page 14: An AI architectural journey](https://reader033.fdocuments.in/reader033/viewer/2022051906/62848493fd9a44125071d80e/html5/thumbnails/14.jpg)
Artificial intelligence techniques in multi-view real time facial recognition for sentiment analysis, security, surveillance and threat analysis.
![Page 15: An AI architectural journey](https://reader033.fdocuments.in/reader033/viewer/2022051906/62848493fd9a44125071d80e/html5/thumbnails/15.jpg)
Challenges we faced, architecturally…
[and how we overcame them]
![Page 16: An AI architectural journey](https://reader033.fdocuments.in/reader033/viewer/2022051906/62848493fd9a44125071d80e/html5/thumbnails/16.jpg)
Expectations.
Long gone is the era of “building a capability” and expecting it to satisfy users for 5 years…3 years…or even 1 year. This mentality is not going to win you that Nature front page in a computationally intensive “arms race”.
Continuous aggressive version baselines, driver upgrades, firmware and optimisation. Monthly…weekly…almost daily, in the quest for every last bit of performance. Recompile daily to squeeze out every bit of life you can.
Some might say: “Won’t somebody think of the administrative overhead? This would cost so much to achieve?! So many hours, so much risk!”
If that’s the mindset – you’ve already lost the game. You’ve already come second. You’ve already ended up a less impactful research organisation than the crew down the road who pushed that little bit harder…
![Page 17: An AI architectural journey](https://reader033.fdocuments.in/reader033/viewer/2022051906/62848493fd9a44125071d80e/html5/thumbnails/17.jpg)
Ye old(e) storage bottleneck effect…
Traditional parallel filesystem, or object (GPFS, Ceph, Swift etc)
IO Subsystem GPUs
BeeGFS on NVMe flash, RDMA EDR connected. Delivering 180GB/sec and 25m IOPS of sustained performance.
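To put those figures in context, here is a back-of-the-envelope sketch in Python. The 100 TB volume is an illustrative assumption (the deck only says "100's of terabytes per day"), the function names are hypothetical, and decimal units (1 TB = 1000 GB) are assumed throughout.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def required_ingest_rate_gb_s(terabytes_per_day: float) -> float:
    """Average rate in GB/s needed to absorb a given daily data volume."""
    return terabytes_per_day * 1000 / SECONDS_PER_DAY


def full_scan_seconds(dataset_tb: float, bandwidth_gb_s: float) -> float:
    """Time in seconds to stream a dataset once at a sustained bandwidth."""
    return dataset_tb * 1000 / bandwidth_gb_s


if __name__ == "__main__":
    # Assumed volume: 100 TB generated per day (illustrative only).
    print(f"ingest rate: {required_ingest_rate_gb_s(100):.2f} GB/s")
    # Streaming the same 100 TB once at the quoted 180 GB/s.
    print(f"full scan:   {full_scan_seconds(100, 180):.0f} s")
```

Under these assumptions, ingest averages a little over 1 GB/s, while a single pass over 100 TB at 180GB/sec takes a little over nine minutes. The point of the flash tier is feeding the GPUs on reads, not keeping up with ingest.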
![Page 18: An AI architectural journey](https://reader033.fdocuments.in/reader033/viewer/2022051906/62848493fd9a44125071d80e/html5/thumbnails/18.jpg)
MPI and inter-nodal comms
Totally awesome… but what if you need double…or triple…or quadruple this many GPUs, and they all need to communicate at scale?
![Page 19: An AI architectural journey](https://reader033.fdocuments.in/reader033/viewer/2022051906/62848493fd9a44125071d80e/html5/thumbnails/19.jpg)
Easy, right? Just buy a few of them. Stack them together somehow…
Not *quite* that simple….
![Page 20: An AI architectural journey](https://reader033.fdocuments.in/reader033/viewer/2022051906/62848493fd9a44125071d80e/html5/thumbnails/20.jpg)
In supercomputing we need to think more about high scalability scenarios, pressure on the interconnect and how to distribute workloads with least-cost/lowest latency…
![Page 21: An AI architectural journey](https://reader033.fdocuments.in/reader033/viewer/2022051906/62848493fd9a44125071d80e/html5/thumbnails/21.jpg)
Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)
![Page 22: An AI architectural journey](https://reader033.fdocuments.in/reader033/viewer/2022051906/62848493fd9a44125071d80e/html5/thumbnails/22.jpg)
Putting MPI primitives in hardware, in the network.
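As a rough illustration of the idea, the sketch below models a sum reduction performed at aggregation points arranged in a tree, rather than by the end nodes exchanging data among themselves. It is a pure-Python toy model, not the SHARP implementation or any vendor API. The function name and the `radix` parameter (inputs combined per aggregation point) are assumptions made for this example.

```python
from typing import List, Tuple


def tree_reduce_sum(values: List[float], radix: int) -> Tuple[float, int]:
    """Sum one value per node through a tree of aggregation points.

    Returns (total, levels), where levels is the number of aggregation
    stages between the leaves and the root.
    """
    if radix < 2:
        raise ValueError("radix must be at least 2")
    if not values:
        raise ValueError("need at least one value")

    level = list(values)
    levels = 0
    while len(level) > 1:
        # Each aggregation point combines up to `radix` inputs into one
        # partial result and forwards only that result upwards.
        level = [sum(level[i:i + radix]) for i in range(0, len(level), radix)]
        levels += 1
    return level[0], levels


if __name__ == "__main__":
    total, levels = tree_reduce_sum([1.0] * 64, radix=4)
    print(total, levels)
```

With 64 contributors and a radix of 4, the model needs three aggregation stages (64 to 16 to 4 to 1). Each stage forwards one partial result per group instead of every input, which is where the relief on the interconnect comes from.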
![Page 23: An AI architectural journey](https://reader033.fdocuments.in/reader033/viewer/2022051906/62848493fd9a44125071d80e/html5/thumbnails/23.jpg)
IO-gluttony. Games with 100Gb RDMA pipes…
100Gb: both RDMA storage IO and RDMA GPUDirect/MPI IO between GPUs
~15.26GB/sec
A perfect IO starvation storm in a tea cup
![Page 24: An AI architectural journey](https://reader033.fdocuments.in/reader033/viewer/2022051906/62848493fd9a44125071d80e/html5/thumbnails/24.jpg)
So, let us fix this…
100Gb * 2: dedicated RDMA for BeeGFS IO, dedicated RDMA for MPI/GPU IO
~15.26GB/sec
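The contention can be sketched with simple arithmetic. The Python below assumes that storage IO and MPI/GPU IO are both busy at once, that a shared link is split evenly between them, and that there is no protocol overhead. These are simplifications for illustration, and the function name is hypothetical. The ~15.26GB/sec figure on the slides is taken as given and is not derived here.

```python
def per_class_peak_gb_s(link_gbit_per_s: float, num_links: int,
                        num_classes: int) -> float:
    """Peak GB/s available to each traffic class under an even split.

    Assumes every class is saturating its share simultaneously and
    ignores protocol overhead (illustrative simplification).
    """
    total_gb_s = link_gbit_per_s / 8 * num_links  # 8 bits per byte
    return total_gb_s / num_classes


if __name__ == "__main__":
    # One 100Gb link carrying both storage IO and MPI/GPU IO.
    shared = per_class_peak_gb_s(100, num_links=1, num_classes=2)
    # Two 100Gb links, one dedicated to each traffic class.
    dedicated = per_class_peak_gb_s(100, num_links=2, num_classes=2)
    print(f"shared link:     {shared} GB/s per traffic class")
    print(f"dedicated links: {dedicated} GB/s per traffic class")
```

Under these assumptions each traffic class gets at most 6.25 GB/s on a shared 100Gb link, against 12.5 GB/s once each class has its own link.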
![Page 25: An AI architectural journey](https://reader033.fdocuments.in/reader033/viewer/2022051906/62848493fd9a44125071d80e/html5/thumbnails/25.jpg)
But, one more big (complicated) problem…
![Page 26: An AI architectural journey](https://reader033.fdocuments.in/reader033/viewer/2022051906/62848493fd9a44125071d80e/html5/thumbnails/26.jpg)
DL/ML/AI frameworks and software stacks ≠ “HPC” techniques, hardware, technologies, methods, communications and principles
![Page 27: An AI architectural journey](https://reader033.fdocuments.in/reader033/viewer/2022051906/62848493fd9a44125071d80e/html5/thumbnails/27.jpg)
Different strokes, for different folks…
“This is how we’ve always done things! Don’t mess with my world! I’ll do what I want!”
“Let me at it! I’m keen to try anything if I can make an impact and get something good happening!”
“I…will take over the world with this! I must have all of it. More cores. More GPUs. All of it! Now!”
“I’ve no idea where to start. Help me, computer-science person. I just want to do my research!”
![Page 28: An AI architectural journey](https://reader033.fdocuments.in/reader033/viewer/2022051906/62848493fd9a44125071d80e/html5/thumbnails/28.jpg)
Build the town hall. Let people come. Let them speak…
![Page 29: An AI architectural journey](https://reader033.fdocuments.in/reader033/viewer/2022051906/62848493fd9a44125071d80e/html5/thumbnails/29.jpg)
Start working with users to co-create value.
![Page 30: An AI architectural journey](https://reader033.fdocuments.in/reader033/viewer/2022051906/62848493fd9a44125071d80e/html5/thumbnails/30.jpg)
![Page 31: An AI architectural journey](https://reader033.fdocuments.in/reader033/viewer/2022051906/62848493fd9a44125071d80e/html5/thumbnails/31.jpg)
And then odd, but wonderful things started happening…
“Jake,
The new GPU supercomputer is fantastic. Thanks for giving us early access and the chance to get some early apps up to iron things out/get it working quickly. I’ve been thinking about it a bit, and I think it would be useful for the rest of my community and my school if I actually built a dedicated “how to” guide for Relion + Volta. What do you think? The performance advances we’ve seen between Volta and Pascal, combined with all the MPI tweaks we worked on together make it worth the effort I think to explain to users how different it is – and how they can take advantage of it. I can get a rough draft for the Research Computing Centre wiki to you next week, if you’re keen?”
![Page 32: An AI architectural journey](https://reader033.fdocuments.in/reader033/viewer/2022051906/62848493fd9a44125071d80e/html5/thumbnails/32.jpg)
And before we knew it, we had a very full user guide…
![Page 33: An AI architectural journey](https://reader033.fdocuments.in/reader033/viewer/2022051906/62848493fd9a44125071d80e/html5/thumbnails/33.jpg)
The take away.
Balanced architecture is hard, actually. We made some poor assumptions to begin with. ML/DL/AI is still quite immature when mashed with HPC.
The most important people in the building.
Architecture is just a stepping stone for us to innovation.
Never stop optimisation. Call it continuous integration in supercomputing…or a deep obsession with going as fast as you can, for the outcome.
![Page 34: An AI architectural journey](https://reader033.fdocuments.in/reader033/viewer/2022051906/62848493fd9a44125071d80e/html5/thumbnails/34.jpg)
Thank you.