Introduction to Research 2007


Page 1: Introduction to Research 2007

Introduction to Research 2007

Ashok Srinivasan

Florida State University

www.cs.fsu.edu/~asriniva

Recent collaborators

V. Aggarwal, J. Kolhe, L. Ji, M. Mascagni, H. Nymeyer, and Y. Yu

Florida State University

S. Kapoor

IBM Austin

S. Namilae

Oak Ridge National Lab

M. Krishna, A. Kumar, N. Jayam, G. Senthilkumar, P. K. Baruah, and R. Sharma

Sri Sathya Sai University, India

N. Chandra

University of Nebraska at Lincoln

Research support

Funding

DoD, FSU, NSF

Computer time

IBM, NCSA, NERSC, ORNL

Page 2: Introduction to Research 2007

Outline

Research Areas

Computational Nanotechnology

Computational Biology

High Performance Computing on Multicore Processors

Potential Research Topics

Graduate Courses

Page 3: Introduction to Research 2007

Research Areas

High Performance Computing, Applications in Computational Sciences, Scalable Algorithms, Mathematical Software

Current topics: Computational Nanotechnology, Computational Biology, HPC on Multicore Processors

New topics: Dynamic Data Driven Applications

Old topics: Computational Finance, Parallel Random Number Generation, Monte Carlo Linear Algebra, Computational Fluid Dynamics, Image Compression

Page 4: Introduction to Research 2007

Importance of Parallel Computing

Makes feasible products based on a more fundamental understanding of science. Examples: nanotechnology, medicine.

Increasing relevance to industry: in 1993, fewer than 30% of the top 500 supercomputers were commercial; now, over 50% are commercial.

Examples: finance and insurance, medicine, aerospace and automobiles, telecom, oil exploration, shoes (Nike), potato chips, toys!

Page 5: Introduction to Research 2007

Architectural Trends

Massive parallelism: 10K-processor systems will be commonplace; the large end already has over 100K processors.

Single-chip multiprocessing: all processors will be multicore, including heterogeneous multicore processors such as the Cell used in the PS3 and the 80-core processor from Intel. Processors with hundreds of cores are already commercially available.

Distributed environments, such as the Grid.

But it is hard to get good performance on these systems.

Page 6: Introduction to Research 2007

Computational Nanotechnology

Example application: the carbon nanotube (CNT). A CNT can span 23,000 miles without failing due to its own weight, is 100 times stronger than steel, is lighter than a feather, and conducts heat better than diamond.

Computations are used to understand materials at the atomic scale, so that better materials can be designed; this is easier than experimentation at the nanometer scale.

Page 7: Introduction to Research 2007

CNT Tensile Test

Pull the CNT at constant speed and determine material properties from the force-displacement response.

Computational difficulties: the time step size is ~10^-15 s, while the desired time range is much larger. A million time steps are required just to reach 10^-9 s, about 500 hours of computing for ~40K atoms using GROMACS. MD therefore uses an unrealistically large pulling speed, 1 to 10 m/s instead of 10^-7 to 10^-5 m/s, and results at unrealistic speeds are unrealistic!
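As a quick back-of-the-envelope check of these figures (my arithmetic, not part of the original slide):

\[
N_{\text{steps}} = \frac{10^{-9}\ \text{s}}{10^{-15}\ \text{s/step}} = 10^{6}\ \text{steps},
\qquad
\frac{500\ \text{h} \times 3600\ \text{s/h}}{10^{6}\ \text{steps}} \approx 1.8\ \text{s of wall clock per step}.
\]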

Page 8: Introduction to Research 2007

Difficulty with Parallelization

Results on scalable codes show that they do not scale efficiently beyond about 10 ms per iteration.

If we want to simulate to a millisecond with a 1 fs time step, 10^12 iterations are needed; at 10 ms per iteration that is 10^10 s, roughly 300 years. Even if the code scaled down to 10 µs per iteration, the simulation would still take about 4 months of computing time.
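Written out (assuming the 10 µs reading of the scaled per-iteration time), the two wall-clock estimates are:

\[
10^{12}\ \text{iterations} \times 10\ \text{ms} = 10^{10}\ \text{s} \approx 300\ \text{years},
\qquad
10^{12}\ \text{iterations} \times 10\ \mu\text{s} = 10^{7}\ \text{s} \approx 4\ \text{months}.
\]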

[Figure: strong-scaling results from the literature: NAMD, 327K-atom ATPase with PME, Blue Gene (IPDPS 2006); NAMD, 92K-atom ApoA1 with PME, Blue Gene (IPDPS 2006); IBM Blue Matter, 43K-atom rhodopsin, Blue Gene (Tech Report 2005); Desmond, 92K-atom ApoA1 (SC 2006).]

Page 9: Introduction to Research 2007

Data Driven Time Parallelization

Each processor simulates a different time interval. The initial state of each interval is obtained by prediction, using prior data (except for processor 0). We then verify whether the predicted end state is close to that computed by MD. The prediction is based on dynamically determining a relationship between the current simulation and those in a database of prior results. If the time interval is sufficiently large, the communication overhead is small. A minimal sketch of this predict-and-verify structure is given below.
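The sketch below shows one way the predict-and-verify loop could be organized with MPI; it is illustrative only, not the code used for the results that follow. predict_state, run_md_interval, and states_close are hypothetical placeholders for the application-specific prediction, MD, and verification steps, and the toy state vector stands in for the real atomic coordinates.

/*
 * Illustrative predict-and-verify skeleton for data driven time
 * parallelization (not the authors' code).  Rank r simulates the r-th
 * time interval.  Its starting state is predicted from prior data
 * (except rank 0, which has the exact initial state).  Each rank then
 * sends the end state it actually computed to the next rank, which
 * checks it against the prediction it started from.
 */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define NSTATE 16  /* size of the toy state vector (stands in for atom coordinates) */

/* Hypothetical placeholder: a real predictor would query a database of
 * prior simulations; here we just produce a smooth guess. */
static void predict_state(int interval, double s[NSTATE]) {
    for (int i = 0; i < NSTATE; i++) s[i] = 0.01 * interval;
}

/* Hypothetical placeholder for the expensive MD run over one interval. */
static void run_md_interval(const double in[NSTATE], double out[NSTATE]) {
    for (int i = 0; i < NSTATE; i++) out[i] = in[i] + 0.01;
}

/* Hypothetical verification test: is the predicted state close enough? */
static int states_close(const double a[NSTATE], const double b[NSTATE]) {
    double err = 0.0;
    for (int i = 0; i < NSTATE; i++) err += (a[i] - b[i]) * (a[i] - b[i]);
    return err < 1e-6;
}

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double start[NSTATE], end[NSTATE], prev_end[NSTATE];

    /* Starting state: exact for rank 0, predicted for everyone else. */
    if (rank == 0) memset(start, 0, sizeof start);
    else           predict_state(rank, start);

    run_md_interval(start, end);  /* the bulk of the work, done in parallel over time */

    /* Verification: send my computed end state forward; compare the end
     * state received from the previous rank with the prediction I started
     * from.  One small message per interval, so if the interval is long
     * the communication overhead is small. */
    MPI_Request req;
    int ok = 1;
    if (rank + 1 < size)
        MPI_Isend(end, NSTATE, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD, &req);
    if (rank > 0) {
        MPI_Recv(prev_end, NSTATE, MPI_DOUBLE, rank - 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        ok = states_close(prev_end, start);
    }
    if (rank + 1 < size)
        MPI_Wait(&req, MPI_STATUS_IGNORE);

    /* In the full scheme, intervals after the first failed verification
     * are redone with improved predictions; here we only report. */
    int all_ok;
    MPI_Allreduce(&ok, &all_ok, 1, MPI_INT, MPI_LAND, MPI_COMM_WORLD);
    if (rank == 0)
        printf("predictions %s\n", all_ok ? "verified for all intervals"
                                          : "failed for some intervals");
    MPI_Finalize();
    return 0;
}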

Page 10: Introduction to Research 2007

Results

[Figure: speedup. Red line: ideal speedup; blue: v = 0.1 m/s; green: a different predictor. Experimental parameters: v = 1 m/s, using v = 10 m/s; CNT with 1000 atoms; Xeon/Myrinet cluster.]

[Figure: validation, comparing the stress-strain response. Blue: exact results; red: time-parallel results; green: direct prediction.]

Page 11: Introduction to Research 2007

Computational Biology

Data driven time parallelization in the AFM simulation of proteins: an order of magnitude improvement in performance by combining conventional and data driven time parallelization, with the protein Titin.

Page 12: Introduction to Research 2007

High Performance Computing on Multicore Processors

Cell Architecture

A PowerPC core (PPE), with 8 co-processors (SPEs) that have a 256 KB local store each.

Shared 512 MB to 2 GB main memory; the SPEs access it through DMA.

Peak speeds of 204.8 Gflops in single precision and 14.64 Gflops in double precision for the SPEs.

204.8 GB/s EIB bandwidth, 25.6 GB/s to memory.

Two Cell processors can be combined to form a Cell blade with global shared memory.

[Figure: DMA put times, and memory-to-memory copy using the SPE local store vs. memcpy by the PPE.]
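As an illustration of the local-store copy path (not the measured code), the sketch below stages a memory-to-memory copy through an SPE local store, assuming the Cell SDK's MFC intrinsics from spu_mfcio.h; the chunk size, tag, and the effective addresses src_ea and dst_ea are illustrative.

/* Sketch: memory-to-memory copy staged through an SPE local store (the
 * "SPE local store" path in the figure above), assuming the Cell SDK MFC
 * intrinsics declared in spu_mfcio.h.  Each chunk is DMA-ed from main
 * memory into the local store (mfc_get) and then DMA-ed out to the
 * destination (mfc_put).  DMA transfers should be multiples of 16 bytes,
 * at most 16 KB each, and the local-store buffer 128-byte aligned. */
#include <spu_mfcio.h>
#include <stdint.h>

#define CHUNK 16384  /* 16 KB, the largest single DMA transfer */

static volatile char buf[CHUNK] __attribute__((aligned(128)));

void copy_through_local_store(uint64_t src_ea, uint64_t dst_ea, uint64_t bytes)
{
    const unsigned int tag = 0;  /* DMA tag group used for completion waits */
    for (uint64_t off = 0; off < bytes; off += CHUNK) {
        unsigned int n = (bytes - off < CHUNK) ? (unsigned int)(bytes - off)
                                               : CHUNK;
        mfc_get(buf, src_ea + off, n, tag, 0, 0);   /* main memory -> LS */
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();                  /* wait for the get */

        mfc_put(buf, dst_ea + off, n, tag, 0, 0);   /* LS -> main memory */
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();                  /* wait for the put */
    }
}

Double buffering, issuing the next mfc_get while the previous mfc_put drains, would overlap the two transfers; this single-buffered loop is kept minimal.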

Page 13: Introduction to Research 2007

Cell MPI Results

PE: Consider the SPUs to be a logical hypercube; in each step, each SPU exchanges messages with its neighbor along one dimension.

DIS: In step i, SPU j sends to SPU j + 2^i and receives from SPU j - 2^i (a dissemination pattern; a generic sketch follows).
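For concreteness, here is the DIS pattern rendered with ordinary MPI point-to-point calls, not the Cell-specific implementation behind the timings below; the function name is mine.

/* Dissemination barrier: in step i (dist = 2^i), rank j signals rank
 * (j + dist) mod P and waits for a signal from rank (j - dist) mod P.
 * After ceil(log2 P) steps, every rank has transitively heard from all
 * others, so all ranks have reached the barrier. */
#include <mpi.h>

void dissemination_barrier(MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    char token_out = 0, token_in;
    for (int dist = 1; dist < size; dist <<= 1) {
        int to   = (rank + dist) % size;
        int from = (rank - dist + size) % size;
        MPI_Sendrecv(&token_out, 1, MPI_CHAR, to,   0,
                     &token_in,  1, MPI_CHAR, from, 0,
                     comm, MPI_STATUS_IGNORE);
    }
}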

Comparison of MPI_Barrier on different hardware (timings in µs):

P     Cell (PE)    Xeon/Myrinet    NEC SX-8    SGI Altix BX2
8     0.4          10              13          3
16    1.0          14              5           5

[Figure: broadcast bandwidth]

Page 14: Introduction to Research 2007

Potential Research Topics

Computational Biology: data driven time parallelization, Markov state modeling, and other topics.

Dynamic Data Driven Applications: combining simulations and experiments in superplastic forming.

High Performance Computing on Multicore Processors: algorithms and libraries on the Cell processor (for example, sorting and linear algebra), and good software cache/code overlaying implementations.

Other possible new directions: applications in history, linguistics, medicine, etc.

Page 15: Introduction to Research 2007

Graduate Courses

Parallel Computing, Spring 2008: MPI and OpenMP programming on traditional parallel machines, threaded programming on multicore processors, and parallel algorithms.

Advanced Algorithms, Fall 2008: approximation algorithms for NP-hard problems, randomized algorithms, and cache-aware algorithms.