Principles of Scalable HPC System Design
March 6, 2012
Sue Kelly
Sandia National Laboratories
http://www.sandia.gov/~smkelly
Abstract: Sandia National Laboratories has a long history of successfully applying high performance computing (HPC) technology to solve scientific problems. We drew upon our experiences with numerous architectural and design features when planning our most recent computer systems. This talk will present the key issues that were considered. Important principles are performance balance between the hardware components and scalability of the system software. The talk will conclude with lessons learned from the system deployments.
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.
Outline
• A definition of HPC for scientific applications
• Design Principles
– Partition Model
– Network Topology
– Balance of Hardware Components
– Scalable System Software
• Lessons Learned
What is High Performance Computing?
• (n.) A branch of computer science that concentrates on developing supercomputers and software to run on supercomputers. A main area of this discipline is developing parallel processing algorithms and software programs that can be divided into little pieces so that each piece can be executed simultaneously by separate processors. (http://www.webopedia.com/TERM/H/High_Performance_Computing.html)
• Will not talk about embarrassingly parallel applications
• The idea/premise of scientific parallel processing is not new (http://www.sandia.gov/ASC/news/stories.html#nineteen-twenty-two)
The Partition Model: Match the hardware & software to its function
[Diagram: system partitions — users/home, service, compute partition, parallel I/O, and net I/O]
• Applies to both hardware and software
• Physically and logically divide the system into functional units
• Compute hardware uses a different configuration than service & I/O hardware
• Only run the software necessary to perform each function
Usage Model: Partitions cooperate to appear as one system
[Diagram: a Linux login (service) node fronting the compute resource and I/O]
Mesh/Torus topologies are scalable
[Diagram: 12,960-node compute mesh, X=27, Y=20, Z=24, with a torus interconnect in Z, plus 310 service & I/O nodes]
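To make the topology concrete, here is a minimal Python sketch (not from the talk) of how a linear node rank could map onto this 27×20×24 mesh, with torus wraparound in Z only, as the diagram indicates:

```python
# Illustrative sketch (not from the talk): map a linear node rank to
# (x, y, z) coordinates in the 27 x 20 x 24 mesh, with torus wraparound
# in Z only, matching the diagram above.
X, Y, Z = 27, 20, 24            # 27 * 20 * 24 = 12,960 compute nodes

def rank_to_coords(rank):
    """Convert a node rank in 0..12959 to mesh coordinates."""
    z, rem = divmod(rank, X * Y)
    y, x = divmod(rem, X)
    return x, y, z

def z_neighbors(x, y, z):
    """Neighbors wrap in Z (torus); X and Y edges do not (mesh)."""
    return (x, y, (z - 1) % Z), (x, y, (z + 1) % Z)

print(rank_to_coords(12959))    # (26, 19, 23), the far corner
print(z_neighbors(0, 0, 0))     # ((0, 0, 23), (0, 0, 1)) -- Z wraps
```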
Minimize communication interference
• Jobs occupy disjoint regions simultaneously
• Example – red, green, and blue jobs:
[Diagram: red, green, and blue jobs occupying disjoint regions of the 12,960-node compute mesh (X=27, Y=20, Z=24)]
Hardware Performance Characteristics that Lead to a Balanced System
• Network bandwidth
must balance with
• Processor speed and operations per second
must balance with
• Memory bandwidth and capacity
must balance with
• File system I/O bytes per second
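One common way to reason about this balance is with bytes-per-flop ratios. The sketch below is purely illustrative: only the 53 GB/s aggregate file system figure comes from the talk; the per-node flop rate and bandwidths are assumed numbers, chosen just to show the style of calculation:

```python
# Illustrative balance check. Only the 53 GB/s aggregate file system
# figure comes from the talk; every other number here is an assumption
# chosen purely to show the bytes-per-flop style of calculation.
flops_per_node = 40e9            # assumed 40 GF/s peak per node
mem_bw = 25.6e9                  # assumed 25.6 GB/s memory bandwidth
net_bw = 6.0e9                   # assumed 6 GB/s network injection bandwidth
io_bw_per_node = 53e9 / 12960    # aggregate 53 GB/s shared by 12,960 nodes

print(f"memory  bytes/flop: {mem_bw / flops_per_node:.3f}")
print(f"network bytes/flop: {net_bw / flops_per_node:.3f}")
print(f"I/O     bytes/flop: {io_bw_per_node / flops_per_node:.2e}")
# A ratio far below what the target applications need marks that
# component as the bottleneck; excess capacity elsewhere is wasted.
```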
In Addition to Balanced Hardware, System Software must be Scalable
Scalable System Software Concept #1
Do things in a hierarchical fashion
Job Launch is Hierarchical
[Diagram: hierarchical job launch — the user logs in to a Linux login node and starts the application; a job scheduler node (batch server, scheduler, batch mom) works from job queues and a CPU inventory database on a database node; the allocator then fans the application out to the compute nodes]
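A minimal sketch of the tree fan-out idea behind hierarchical launch: rather than one node contacting every compute node directly, each contacted node forwards the launch to a handful of children, so the number of sequential steps grows with log(N). The fan-out of 32 below is an assumption for illustration; the node count is the mesh size described earlier:

```python
# Illustrative sketch of tree fan-out for job launch. The fan-out of 32
# is an assumption; the node count is from the mesh described earlier.
import math

FANOUT = 32                      # assumed children per forwarding node
NODES = 12960

def children(rank):
    """Children of `rank` in an implicit FANOUT-ary launch tree."""
    first = rank * FANOUT + 1
    return [r for r in range(first, first + FANOUT) if r < NODES]

depth = math.ceil(math.log(NODES) / math.log(FANOUT))  # ~ sequential steps
print(f"~{depth} forwarding steps instead of {NODES} direct contacts")
print(children(0)[:4])           # node 0 forwards to nodes 1, 2, 3, 4, ...
```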
System monitoring is hierarchical
[Diagram: RSMS Ethernet tree — system management workstations (SMW) connect over Ethernet to L1 (cabinet-level) and L0 (board-level) controllers; nodes attach via HT links and the high-speed network (HSN)]
Scalable System Software Concept #2
Minimize Compute Node Operating System Overhead
Operating System Interruptions Impede Progress of the Application
[Plot: interruptions of user applications — interruption time in ns (0–350,000) vs. wall time in seconds (0–6), comparing Linux and Catamount]
System monitoring is out of band and non-invasive
[Diagram: the same RSMS Ethernet tree as above — monitoring traffic travels over the dedicated Ethernet tree rather than the HSN and HT links used by applications]
Scalable System Software Concept #3
Minimize Compute Node Interdependencies
Calculating Weather Minute by Minute
[Timeline: Calc 1 (0–1 min), Calc 2 (1–2 min), Calc 3 (2–3 min), Calc 4 (3–4 min)]
Calculation with Breaks
• Calculation with asynchronous breaks:
[Timeline: Calc 1 (0–1 min), Wait (1–2 min), Calc 2 (2–3 min), Calc 3 (3–4 min), Wait (4–5 min), Calc 4 (5–6 min)]
Run Time Impact of Linux System Services (aka Daemons)
• Say breaks take 50 µs and occur once per second. Because the breaks are asynchronous, a tightly synchronized application ends up waiting for each CPU's break in turn, so the delays accumulate across CPUs:
– On one CPU, wasted time is 50 µs every second
• Negligible 0.005% impact
– On 100 CPUs, wasted time is 5 ms every second
• Negligible 0.5% impact
– On 10,000 CPUs, wasted time is 500 ms every second
• Significant 50% impact
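The same arithmetic as a short, runnable sketch of the slide's worst-case model:

```python
# The slide's arithmetic: each CPU is interrupted for 50 us once per
# second, and because the breaks are asynchronous, a synchronized
# application waits for each CPU's break in turn (worst case).
BREAK = 50e-6                    # 50 microseconds per interruption

for cpus in (1, 100, 10_000):
    wasted = min(cpus * BREAK, 1.0)          # seconds lost per second
    print(f"{cpus:>6} CPUs: {wasted * 100:.3f}% of run time lost")
# ->      1 CPUs: 0.005% of run time lost
# ->    100 CPUs: 0.500% of run time lost
# ->  10000 CPUs: 50.000% of run time lost
```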
Scalable System Software Concept #4
Avoid linear scaling of buffer requirements
Connection-oriented protocols have to reserve buffers for the worst case
• If each node reserves a 100KB buffer for its peers, that is 1GB of memory per node for 10,000 processors.
• Need to communicate using collective algorithms
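A short sketch of the arithmetic, contrasted with a tree-structured collective that needs buffers for only O(log N) partners. The 100 KB buffer and 10,000 processes are the slide's numbers; the binary tree is one common collective shape, shown here as an illustrative assumption:

```python
# Buffer arithmetic from the slide, plus the log-depth alternative.
# The 100 KB buffer and 10,000 processes are from the slide; the
# binary-tree collective is one common alternative, shown as a sketch.
import math

BUF = 100 * 1024                 # 100 KB reserved per peer
N = 10_000                       # processes in the job

linear = (N - 1) * BUF                       # one buffer per peer
tree = 2 * math.ceil(math.log2(N)) * BUF     # parent + children, binary tree

print(f"per-peer buffers   : {linear / 2**30:.2f} GB per node")  # ~0.95 GB
print(f"binary-tree buffers: {tree / 2**20:.2f} MB per node")    # ~2.73 MB
```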
Scalable System Software Concept #5
Parallelize wherever possible
Use parallel techniques for I/O
[Diagram: compute nodes connect through I/O nodes and the high speed network to parallel file system servers (190 + MDS) backed by RAIDs, 10.0 GigE servers (50), and login servers (10), over 10 Gbit and 1 Gbit Ethernet]
• 140 MB/s per FC × 2 × 190 = 53 GB/s
• 500 MB/s × 50 = 25 GB/s
• 1.0 GigE × 10
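A quick, runnable check of the aggregate bandwidth arithmetic on this slide, using the slide's own numbers:

```python
# Check the slide's aggregate bandwidth figures (taking 1 GB = 1000 MB).
pfs = 140 * 2 * 190      # 140 MB/s per FC link x 2 links x 190 servers
ten_gige = 500 * 50      # 500 MB/s per server x 50 ten-GigE servers

print(f"parallel file system: {pfs / 1000:.1f} GB/s")      # -> 53.2 GB/s
print(f"10 GigE servers     : {ten_gige / 1000:.1f} GB/s")  # -> 25.0 GB/s
```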
Summary of Principles
• Partition the hardware and software
• Hardware
– For scalability and upgradability, use a mesh network topology
– Determine the right balance of processor speed, memory bandwidth, network bandwidth, and I/O bandwidth for your applications
• System Software
– Do things in a hierarchical fashion
– Minimize compute node OS overhead
– Minimize compute node interdependencies
– Avoid linear scaling of buffer requirements
– Parallelize wherever possible
Lessons Learned
• Seek first to emulate
– Learn from the past
– Simulate the future
• Need technology philosophers
– Tilt meters
– Historians
– Even Tiger Woods has a coach
• The big bang only worked once
– Deploy test platforms early and often
• Build de-scalable, scalable systems
– Don't forget that you have to get it running first!
– Leave the support structures (even non-scalable development tools) in working condition; you'll need to debug some day
• Only dead systems never change
– Nobody ever built just one system, even when successfully deploying just one system
– Nothing is ever done just once
• Build scaffolding that meets the structure
– Is build and test infrastructure in place FIRST?
– Will it effectively support both the team and the project?