Data Intensive Computing at Sandia
September 15, 2010
Andy Wilson
Senior Member of Technical Staff
Data Analysis and Visualization
Sandia National Laboratories
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of
Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.
The Question
What is Data-Intensive Computing?
My Answer
What is Data-Intensive Computing?
Parallel computing where you design your algorithms and your software around efficient
access and traversal of a data set; where hardware requirements are dictated by data size
as much as by desired run times
Usually distilling compact results from massive data
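A minimal sketch of "distilling compact results from massive data": a single pass over a stream that keeps only a constant-size summary. The class and function names here are illustrative assumptions, not anything from the talk, and the update rule is Welford's online algorithm, chosen because it never needs the whole data set in memory.

```python
import math
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Constant-size summary of a stream (hypothetical helper)."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0            # running sum of squared deviations from the mean
    lo: float = math.inf
    hi: float = -math.inf

    def add(self, x: float) -> None:
        # Welford's online update: one value at a time, O(1) memory.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        self.lo = min(self.lo, x)
        self.hi = max(self.hi, x)

    @property
    def variance(self) -> float:
        # Sample variance; defined as 0.0 for fewer than two values.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


def summarize(stream) -> RunningStats:
    """Traverse the data once and return only the compact result."""
    stats = RunningStats()
    for x in stream:
        stats.add(x)
    return stats


if __name__ == "__main__":
    # A generator stands in for data too large to hold in memory.
    s = summarize(float(i) for i in range(1, 6))
    print(s.n, s.mean, s.variance, s.lo, s.hi)
```

The cost is dominated by reading the data, which is the sense in which the data set, rather than the arithmetic, shapes the design.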
Outline
• What is Data-Intensive Computing?
• Data-Intensive Computing at Sandia
  – Physics
  – Informatics
  – Architectures
• Into the Future
Spaghetti Plot (2)
Traditional Visualization Workflow
[Diagram: Solver, Disk Storage, Visualization. Data passed: Full Mesh]
Traditional In-Situ Visualization
[Diagram: Solver, Disk Storage, Visualization. Data passed: Images. Shown next to the traditional workflow, which passes the Full Mesh]
Coprocessing
[Diagram: Solver, Disk Storage, Features & Statistics, Salient Data, Visualization. Shown next to the in-situ workflow (Images) and the traditional workflow (Full Mesh)]
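The coprocessing idea can be sketched roughly as follows: at each time step the full field is reduced to a few statistics and the cells of interest, and only that salient data is kept. Everything here is a hypothetical stand-in (the toy solver, the threshold rule, and the function names are assumptions for illustration); it is not the coprocessing library used in the talk.

```python
import math


def toy_solver(n_cells, n_steps):
    """Hypothetical stand-in for a simulation: yields one full field per step."""
    for step in range(n_steps):
        yield step, [math.sin(0.1 * i + 0.5 * step) for i in range(n_cells)]


def extract_salient(step, field, threshold):
    """Reduce a full field to summary statistics plus the cells above threshold."""
    hot = [(i, v) for i, v in enumerate(field) if v > threshold]
    return {
        "step": step,
        "min": min(field),
        "max": max(field),
        "mean": sum(field) / len(field),
        "n_hot": len(hot),
        "hot_cells": hot,
    }


def run_with_coprocessing(n_cells=1000, n_steps=5, threshold=0.99):
    """Run the solver and keep only the salient data from each step."""
    saved = []
    for step, field in toy_solver(n_cells, n_steps):
        # The full field is not stored; it goes out of scope after this call.
        saved.append(extract_salient(step, field, threshold))
    return saved


if __name__ == "__main__":
    for record in run_with_coprocessing():
        print(record["step"], record["n_hot"], round(record["mean"], 4))
```

What gets saved per step is a short record rather than the full mesh, so storage traffic scales with the number of features found, not with the mesh size.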
Collision Movie
Community Detection in Networks
• Find many small groups of vertices and/or edges
  – O(n) communities
  – overlaps may be allowed
• Hundreds of papers in physics and computer science
Lancichinetti, Fortunato, Radicchi 2008
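To make "finding communities" concrete, here is a small sketch that scores a candidate partition with Newman modularity, one widely used quality measure. The slides do not say which measure is used at this point, so treat the choice of modularity, the function name, and the toy graph as assumptions.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Newman modularity Q of a partition of an undirected, unweighted graph.

    Q = sum over communities c of [ e_c / m - (d_c / (2 m))^2 ]
    where m is the number of edges, e_c the number of edges inside c,
    and d_c the total degree of the vertices in c.
    """
    m = len(edges)
    internal = defaultdict(int)   # e_c
    degree = defaultdict(int)     # d_c
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


if __name__ == "__main__":
    # Toy graph: two triangles joined by a single edge.
    edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
    two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
    one_group = {v: "all" for v in range(6)}
    print(modularity(edges, two_groups))
    print(modularity(edges, one_group))
```

Splitting the toy graph into its two triangles scores higher than lumping every vertex together, which is the kind of comparison a community-detection heuristic makes repeatedly.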
Analysis of Massive Graphs
• Finding communities: a kernel of social network analysis
• “Dunbar’s number” from sociology: there is a size limit (~150) on stable social group size (from Neolithic farming village to academic sub-discipline)
Twitter social network (|V|≈200M)
[Akshay Java, 2007]
Collapsed Dendrograms and Statistical Confidence: wCNM
The wCNM partitioning is much deeper, resolving smaller communities
The statistically significant variation is visually close, but does not reproduce ground truth as well
Image credit: Titan
The (much better) wCNM solution also has a statistically significant variation.
LSA and LDA from 5 miles up
Image credit: Dave Robinson
(LDA)
LSA/LDA: Increasing Data Size, Single Processor
Straight Line = Linear Scaling, Lower = Faster
[Plot: CPU Time (sec.), 0.1 to 100000, versus Number of Documents, 100 to 100000, both on logarithmic axes. Higher Lines = More Topics]
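A short sketch of how one might read such a plot numerically. On log-log axes a power law t ∝ n^k appears as a straight line with slope k, so linear scaling corresponds to a fitted slope near 1. The function name and the timings in the example are made up for illustration; they are not measurements from the talk.

```python
import math


def loglog_slope(sizes, times):
    """Least-squares slope of log10(time) against log10(size)."""
    xs = [math.log10(s) for s in sizes]
    ys = [math.log10(t) for t in times]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    docs = [100, 1000, 10000, 100000]
    linear_times = [0.5 * n for n in docs]             # hypothetical: t grows like n
    quadratic_times = [1e-6 * n * n for n in docs]     # hypothetical: t grows like n^2
    print(loglog_slope(docs, linear_times))
    print(loglog_slope(docs, quadratic_times))
```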
LSA/LDA: Weak Scaling (Bigger Problem, Same Time)
Flat Lines = Perfect Scaling
[Plot: CPU Time (sec.), 1 to 100000, versus Number of Processors, 1 to 1000, both on logarithmic axes. Higher Lines = More Documents]
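"Flat lines = perfect scaling" can be expressed as a weak-scaling efficiency: with the work per processor held fixed, efficiency at p processors is the baseline time divided by the time at p, and a flat line means the ratio stays at 1. The sketch below assumes that common definition; the function name and the timings are hypothetical.

```python
def weak_scaling_efficiency(times_by_procs):
    """Map processor count -> T(baseline) / T(p), with work per processor fixed.

    The baseline is the smallest processor count present in the input.
    """
    base_procs = min(times_by_procs)
    base_time = times_by_procs[base_procs]
    return {p: base_time / t for p, t in sorted(times_by_procs.items())}


if __name__ == "__main__":
    # Hypothetical timings in seconds; the problem size grows with p.
    timings = {1: 10.0, 10: 10.0, 100: 12.5}
    for procs, eff in weak_scaling_efficiency(timings).items():
        print(procs, round(eff, 3))
```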
NGC System Diagram
[Diagram columns: Architectures, Algorithms, Web Services, Applications (Clients)]
• Titan, browser
• Trilinos: Algebraic Methods (Clustering, Ranking, High Dimensional Mapping)
• MTGL: Graph Methods (Subgraph searches, Connection sg’s, Shortest Path, etc.)
• Specialized Distributed Data Operations
• Titan: Analysis Pipelines, Capability Integration, Data Access, Lightweight analysis
“This project seeks to bring these two strengths – a solid reputation for excellence in computing, and our niche expertise in specific classes of intelligence analysis – to bear on a thorny problem: developing advanced informatics capabilities that are both usable and useful to analysts who are drowning in data.” NGC project proposal
[Diagram labels: Highly optimized; Iterative, flexible; Data]
SQL Service Enables Remote Access to Data Warehouse Appliances (DWA)
SQL Service*
– Provides “bridge” between parallel apps and external DWA
– Runs on Red Storm network nodes
– Titan applications communicate with service through Portals
– External resources (Netezza) communicate through standard interfaces (e.g. ODBC over TCP/IP)
The SQL service enables an HPC application to access a remote DWA
[Diagram: Analyst, HPC System (Red Storm), DWA. Labels: Service Nodes (GUI and Database Services); High-Speed Network (Portals); Compute Nodes (Titan Analysis Code); Netezza; LexisNexis; Other ODBC DWA; TCP/IP; SQL. Location labels: Tech Area 1, Anywhere, CSRI]
* Results of SQL access from parallel statistics code presented at CUG’2009.
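The bridge pattern described on this slide can be sketched in miniature: compute code never opens a database connection itself; it hands SQL to a service object, and only the compact result comes back. This is an illustration under loud assumptions. An in-memory SQLite database stands in for the data warehouse appliance, a plain Python object stands in for the service node, and the table, class, and function names are invented. Portals, ODBC, and Netezza are not modelled.

```python
import sqlite3


class SqlService:
    """Hypothetical bridge: the only component that talks to the database."""

    def __init__(self, connection):
        self._conn = connection

    def query(self, sql, params=()):
        # Run the statement on the database side and return all result rows.
        return self._conn.execute(sql, params).fetchall()


def make_demo_warehouse():
    """Build a tiny in-memory table standing in for a remote warehouse."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE docs (id INTEGER PRIMARY KEY, lang TEXT, n_words INTEGER)"
    )
    conn.executemany(
        "INSERT INTO docs (lang, n_words) VALUES (?, ?)",
        [("en", 120), ("en", 80), ("es", 200), ("fr", 50)],
    )
    conn.commit()
    return conn


def compute_rank(service, lang):
    """Stand-in for one parallel task: ask for an aggregate, not raw rows."""
    rows = service.query(
        "SELECT COUNT(*), SUM(n_words) FROM docs WHERE lang = ?", (lang,)
    )
    return rows[0]


if __name__ == "__main__":
    service = SqlService(make_demo_warehouse())
    for lang in ("en", "es", "fr"):
        print(lang, compute_rank(service, lang))
```

Sending the aggregate query rather than fetching every row keeps the traffic across the bridge small, which matters when the warehouse sits outside the HPC system.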
Additional Modifications for Multilingual
– Tokenization support on Netezza (goal is to count unique words)
– Developed a custom UTF-8 word splitter for SPU (snippet processing unit)
– Allows parallel tokenization and counting at storage device
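A rough sketch of the tokenize-and-count idea above: each partition of UTF-8 text is split into words and counted independently, and the partial counts are merged afterwards. This is plain Python, not the custom SPU splitter from the talk. The regular expression, the case folding, and the function names are assumptions made for illustration.

```python
import re
from collections import Counter

# In Python 3, \w on str patterns is Unicode-aware, so accented letters count.
WORD = re.compile(r"\w+")


def count_words_utf8(raw: bytes) -> Counter:
    """Decode one partition of UTF-8 bytes and count its case-folded words."""
    text = raw.decode("utf-8")
    return Counter(w.casefold() for w in WORD.findall(text))


def merge_counts(partials) -> Counter:
    """Combine per-partition counts into one global count."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


if __name__ == "__main__":
    # Each element stands in for the data held by one storage unit.
    partitions = [
        "El niño come".encode("utf-8"),
        "Le garçon mange, el niño".encode("utf-8"),
    ]
    # The per-partition step has no shared state, so it could run in parallel.
    total = merge_counts(count_words_utf8(p) for p in partitions)
    print(len(total), total.most_common(2))
```

Because only the per-partition counters move, the volume of data leaving each storage unit depends on its vocabulary size rather than on the size of its text.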
Into the Future
• I don’t care about flops anymore. I care about mops.
• I want to send more complex requests to the storage system.
• There is no one perfect architecture.