Enabling Rapid Development of Parallel Tree-Search Applications
Harnessing the Power of Massively Parallel Platforms for Astrophysical Data Analysis
Jeffrey P. Gardner, Andrew Connolly, Cameron McBride
Pittsburgh Supercomputing Center / University of Pittsburgh / Carnegie Mellon University
How to turn astrophysics simulation output into scientific knowledge (circa 1995, using 300 processors)
Step 1: Run simulation
Step 2: Analyze simulation on workstation
Step 3: Extract meaningful scientific knowledge
(happy scientist)
How to turn astrophysics simulation output into scientific knowledge (circa 2000, using 1000 processors)
Step 1: Run simulation
Step 2: Analyze simulation on server (in serial)
Step 3: Extract meaningful scientific knowledge
(happy scientist)
How to turn astrophysics simulation output into scientific knowledge (circa 2006, using 4000+ processors)
Step 1: Run simulation
Step 2: Analyze simulation on ??? [X]
(unhappy scientist)
Exploring the Universe Can Be (Computationally) Expensive
The size of simulations is no longer limited by computational power.
It is limited by the parallelizability of data analysis tools.
This situation will only get worse in the future.
How to turn astrophysics simulation output into scientific knowledge (circa 2012, using ~1,000,000 cores?)
Step 1: Run simulation
Step 2: Analyze simulation on ??? [X]
By 2012, we will have machines with many hundreds of thousands of cores!
The Challenge of Data Analysis in a Multiprocessor Universe
Parallel programs are difficult to write! Steep learning curve for parallel programming.
Parallel programs are expensive to write! Lengthy development time.
The parallel world is dominated by simulations: code is often reused for many years by many people, so you can afford to invest lots of time writing it. Example: GASOLINE (a cosmology N-body code) required 10 FTE-years of development.
The Challenge of Data Analysis in a Multiprocessor Universe
Data analysis does not work this way: rapidly changing scientific inquiries mean less code reuse.
Simulation groups do not even write their analysis code in parallel!
The data-mining paradigm mandates rapid software development!
How to turn observational data into scientific knowledge (circa 1990: observe at a telescope)
Step 1: Collect data
Step 2: Analyze data on workstation
Step 3: Extract meaningful scientific knowledge
(happy astronomer)
How to turn observational data into scientific knowledge (circa 2005: use sky survey data)
Step 1: Collect data (Sloan Digital Sky Survey, 500,000 galaxies)
Step 2: Analyze data on ??? [X]
(unhappy astronomer)
3-point correlation function: ~200,000 node-hours of computation
How to turn observational data into scientific knowledge (circa 2012: use sky survey data)
Step 1: Collect data (Large Synoptic Survey Telescope, 2,000,000 galaxies)
3-point correlation function: ~several petaflop-weeks of computation
Tightly-Coupled Parallelism (what this talk is about)
Data and computational domains overlap.
Computational elements must communicate with one another.
Examples: group finding, N-point correlation functions, new object classification, density estimation. (A brute-force sketch of such a computation follows below.)
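
To make "tightly coupled" concrete, here is a minimal serial sketch (illustrative C, not code from the talk; the names Particle and count_pairs are invented) of the pair counting at the heart of a 2-point correlation function. Because every particle must be compared against particles that may live on other processors, a parallel version cannot avoid communication:

    #include <math.h>
    #include <stddef.h>

    /* Hypothetical illustration, not N tropy code: brute-force 2-point
     * pair counting. Any parallel version must compare particles that
     * live in different processor domains, which is what makes the
     * computation tightly coupled. */
    typedef struct { double x, y, z; } Particle;

    size_t count_pairs(const Particle *p, size_t n, double rmin, double rmax)
    {
        size_t count = 0;
        for (size_t i = 0; i < n; i++) {
            for (size_t j = i + 1; j < n; j++) {
                double dx = p[i].x - p[j].x;
                double dy = p[i].y - p[j].y;
                double dz = p[i].z - p[j].z;
                double r  = sqrt(dx * dx + dy * dy + dz * dz);
                if (r >= rmin && r < rmax)
                    count++;    /* pair falls in the separation bin */
            }
        }
        return count;
    }

In practice a space-partitioning tree prunes most of these pairs, which is exactly why the library described next is built around trees.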
The Challenge of Astrophysics Data Analysis in a Multiprocessor Universe
Build a library that is:
Sophisticated enough to take care of all of the nasty parallel bits for you.
Flexible enough to be used for your own particular astrophysics data analysis application.
Scalable: scales well to thousands of processors.
The Challenge of Astrophysics Data Analysis in a Multiprocessor Universe
Astrophysics uses dynamic, irregular data structures: astronomy deals with point-like data in an N-dimensional parameter space, and the most efficient methods on this kind of data use space-partitioning trees. The most common data structure is a kd-tree.
Build a targeted library for distributed-memory kd-trees that is scalable to thousands of processing elements. (A minimal node layout is sketched below.)
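
As a point of reference, a kd-tree node for point data might look like the following; this is a generic sketch, not the actual N tropy structure:

    /* Generic kd-tree node sketch (not the N tropy definition). Each
     * interior node splits space along one dimension; a leaf holds a
     * contiguous range of particle indices. */
    typedef struct KDNode {
        int    split_dim;      /* dimension this node splits on (0..2) */
        double split_val;      /* coordinate of the splitting plane */
        double bnd_min[3];     /* bounding box of the subtree */
        double bnd_max[3];
        int    lower, upper;   /* particle index range in this subtree */
        struct KDNode *left;   /* NULL for a leaf */
        struct KDNode *right;
    } KDNode;

Distributing such a structure is hard precisely because the child pointers may refer to memory on another processor.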
Challenges for scalable parallel application development:
Things that make parallel programs difficult to write: work orchestration, data management.
Things that inhibit scalability: granularity (synchronization, consistency), load balancing, data locality, structured data, memory consistency.
Overview of existing paradigms: DSM
There are many existing distributed shared-memory (DSM) tools.
Compilers: UPC, Co-Array Fortran, Titanium, ZPL, Linda
Libraries: Global Arrays, TreadMarks, IVY, JIAJIA, Strings, Mirage, Munin, Quarks, CVM
Overview of existing paradigms: DSM
The Good: These are quite simple to use.
The Good: Can manage data locality pretty well.
The Bad: Existing DSM approaches tend not to scale very well because of fine granularity (see the sketch after this slide).
The Ugly: Almost none support structured data (like trees).
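
The granularity problem can be seen in how a tree walk behaves on a generic DSM. In the hypothetical sketch below, dsm_fetch is a made-up stand-in for a DSM object fetch, not any real API:

    #include <stddef.h>

    /* Illustrative only: dsm_fetch() stands in for a generic DSM object
     * fetch. Assumes a full binary tree (every node has both children
     * or neither). */
    typedef struct Node {
        struct Node *left, *right;  /* remote pointers into shared memory */
        double value;
    } Node;

    extern void *dsm_fetch(const void *remote_addr, size_t nbytes);

    /* Every node visited can trigger one small remote fetch; a deep walk
     * therefore generates thousands of tiny messages: fine granularity. */
    double sum_tree(const Node *remote_node)
    {
        const Node *node = dsm_fetch(remote_node, sizeof(Node));
        if (node->left == NULL)
            return node->value;                      /* leaf */
        return sum_tree(node->left) + sum_tree(node->right);
    }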
Overview of existing paradigms: DSM
There are some DSM approaches that do lend themselves to structured data, e.g. Linda (tuple space).
The Good: Almost universally flexible.
The Bad: These tend to scale even worse than simple unstructured DSM approaches; the granularity is too fine.
Challenges for scalable parallel application development:
Things that make parallel programs difficult to write: work orchestration, data management.
Things that inhibit scalability: granularity, load balancing, data locality.
[Paradigm under discussion: DSM]
Overview of existing paradigms: RMI
[Diagram: the master thread works through a computational agenda and issues rmi_broadcast(…, (*myFunction)); the RMI layer on each of Proc. 0 through Proc. 3 invokes myFunction() locally. myFunction() is coarsely grained.]
“Remote Method Invocation”
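
The calling pattern in the diagram might look like the following in application code. Only the call shape rmi_broadcast(…, (*myFunction)) appears on the slide, so the prototype and argument list here are assumptions for illustration:

    /* Sketch of the coarse-grained RMI pattern. The rmi_broadcast
     * prototype is invented; only the call shape comes from the slide. */
    typedef void (*WorkFunc)(void *arg);

    /* Hypothetical: invoke fn(arg) once on every processor. */
    extern void rmi_broadcast(WorkFunc fn, void *arg);

    static void myFunction(void *arg)
    {
        /* A coarse-grained unit of work: an entire tree walk or a
         * batch of particles, not a single memory access. */
        (void)arg;
    }

    int main(void)
    {
        rmi_broadcast(myFunction, NULL);  /* master dispatches the agenda */
        return 0;
    }

Keeping each invocation coarse is what lets the RMI layer amortize communication costs.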
RMI Performance Features
Coarse granularity. Thread virtualization:
Queue many instances of myFunction() on each physical thread.
The RMI infrastructure can migrate these instances to achieve load balancing.
Overview of existing paradigms: RMI
RMI can be language-based (Java, CHARM++) or library-based (RPC, ARMI).
Challenges for scalable parallel application development:
Things that make parallel programs difficult to write: work orchestration, data management.
Things that inhibit scalability: granularity, load balancing, data locality.
[Paradigm under discussion: RMI]
N tropy: A Library for Rapid Development of kd-tree Applications
No existing paradigm gives us everything we need.
Can we combine existing paradigms beneath a simple, yet flexible API?
N tropy: A Library for Rapid Development of kd-tree Applications
Use RMI for orchestration. Use DSM for data management. The implementation of both is targeted towards astrophysics.
A Simple N tropy Example:N-body Gravity Calculation
[Figure: a cosmological “N-body” simulation (100,000,000 particles, 1 TB of RAM) in a box 100 million light years across, domain-decomposed over a 3x3 grid of processors, Proc 0 through Proc 8.]
A Simple N tropy Example:N-body Gravity Calculation
[Diagram: the master thread calls ntropy_Dynamic(…, (*myGravityFunc)); the N tropy master RMI layer dispatches work elements to N tropy threads on Proc. 0 through Proc. 3, each of which runs myGravityFunc(). The computational agenda is the list of particles P1, P2, … Pn on which to calculate the gravitational force. A sketch of this calling pattern follows below.]
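
In application code the pattern might look as follows. Only the call ntropy_Dynamic(…, (*myGravityFunc)) appears on the slide, so the parameter list and the per-element callback signature are assumptions for illustration:

    /* Sketch of the dynamic work interface. The prototype of
     * ntropy_Dynamic below is invented; only its name and the callback
     * come from the slide. */

    /* Hypothetical per-work-element callback: compute the gravitational
     * force on one particle, walking the distributed tree as needed. */
    extern void myGravityFunc(int particle_id);

    /* Hypothetical prototype: queue one invocation of fn per element of
     * the computational agenda (here, one per particle). */
    extern void ntropy_Dynamic(int n_elements, void (*fn)(int));

    void compute_gravity(int n_particles)
    {
        /* The master thread hands the agenda P1..Pn to the N tropy RMI
         * layer, which farms the elements out across all processors. */
        ntropy_Dynamic(n_particles, myGravityFunc);
    }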
A Simple N tropy Example:N-body Gravity Calculation
To resolve the gravitational force on any single particle requires the entire dataset.
[Figure: the same 100,000,000-particle, 1 TB simulation volume, 100 million light years across, decomposed over Proc 0 through Proc 8.]
A Simple N tropy Example:N-body Gravity Calculation
[Diagram: N tropy threads on Proc. 0 through Proc. 3 run myGravityFunc() through the RMI layer (work), while an N tropy DSM layer on each processor (data) presents the distributed kd-tree, nodes 0 through 14, as a single structure spanning all processors.]
N tropy Performance Features
DSM allows performance features to be provided “under the hood”:
Interprocessor data caching for both reads and writes: fewer than 1 in 100,000 off-PE requests actually result in communication.
Updates through the DSM interface must be commutative: a relaxed memory model allows multiple writers with no overhead.
Consistency is enforced through global synchronization. (A sketch of a commutative update follows below.)
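
A commutative update is one that produces the same result under any interleaving of writers, such as accumulation. The sketch below is illustrative; ntropy_AccumulateDouble and ntropy_Sync are invented names, not the real N tropy API:

    /* Illustrative commutative update under a relaxed memory model.
     * Both function names are hypothetical. */
    extern void ntropy_AccumulateDouble(double *shared, double delta);
    extern void ntropy_Sync(void);   /* global synchronization point */

    void tally_pair(double *bin_counts, int bin)
    {
        /* Increments commute, so each writer can buffer them locally
         * and merge later with no per-write coordination. */
        ntropy_AccumulateDouble(&bin_counts[bin], 1.0);
    }

    void finish_pass(void)
    {
        /* Consistency is enforced here, not on every write. */
        ntropy_Sync();
    }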
N tropy Performance Features
RMI allows further performance features:
Thread virtualization: divide the workload into many more pieces than physical threads (see the sketch below).
Dynamic load balancing is achieved by migrating work elements as the computation progresses.
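
Over-decomposition is straightforward to sketch; the factor of 10 and the helper names below are illustrative assumptions, not part of N tropy:

    /* Hypothetical over-decomposition: split the agenda into many more
     * pieces than physical threads so the runtime can migrate pieces
     * between processors as the load evolves. */
    #define PIECES_PER_THREAD 10   /* illustrative factor */

    extern int  n_physical_threads(void);
    extern void queue_work_element(int first, int last, void (*fn)(int));

    void over_decompose(int n_particles, void (*fn)(int))
    {
        int n_pieces = n_physical_threads() * PIECES_PER_THREAD;
        int chunk = (n_particles + n_pieces - 1) / n_pieces;  /* ceiling */
        for (int start = 0; start < n_particles; start += chunk) {
            int end = start + chunk;
            if (end > n_particles)
                end = n_particles;
            queue_work_element(start, end, fn); /* may migrate for balance */
        }
    }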
N tropy Performance
[Plot: 10 million particles, spatial 3-point correlation function, 3-4 Mpc. Three configurations are compared: no interprocessor data cache and no load balancing; interprocessor data cache without load balancing; interprocessor data cache with load balancing.]
Why does the data cache make such a huge difference?
[Diagram: a single myGravityFunc() tree walk on Proc. 0 touches many nodes of the distributed kd-tree (nodes 0 through 14); caching remote nodes locally avoids one communication per node visited.]
N tropy “Meaningful” Benchmarks
The purpose of this library is to minimize development time!
Development time for:
1. Parallel N-point correlation function calculator: 2 years -> 3 months
2. Parallel friends-of-friends group finder: 8 months -> 3 weeks
Conclusions
Most approaches to parallel application development rely on providing a single paradigm in the most general possible manner. Many scientific problems do not map well onto single paradigms, and providing an ultra-general single paradigm inhibits scalability.
Conclusions
Scientists often borrow from several paradigms and implement them in a restricted and targeted manner.
Almost all current HPC programs are written in MPI (“paradigm-less”): MPI is a “lowest common denominator” upon which any paradigm can be imposed.
Conclusions: N tropy provides
Remote Method Invocation (RMI) and Distributed Shared Memory (DSM).
The implementation of these paradigms is “lean and mean”, targeted specifically for the problem domain.
This approach successfully enables astrophysics data analysis: it substantially reduces application development time and scales to thousands of processors.
More information: go to Wikipedia and search for “Ntropy”.