Introduction to STINGER

Page 1: Introduction to STINGER

STINGER
Dynamic Graph Analysis

Page 2: Introduction to STINGER

Contributors

• David Bader
• David Ediger
• Rob McColl
• Jason Riedy
• Kamesh Madduri
• Jason Poovey

Page 3: Introduction to STINGER

Outline

• Motivation

• Dynamic Graph Basics

• What is STINGER?

• What can STINGER do?

• Why STINGER?

Page 4: Introduction to STINGER

Big Data problems need Graph Analysis

• Health Care: Finding outbreaks, population epidemiology

• Social Networks: Advertising, searching, grouping, influence

• Intelligence: Decisions at scale, regulating algorithms

• Systems Biology: Understanding interactions, drug design

• Power Grid: Disruptions, conversion

• Simulation: Discrete events, cracking meshes

Page 5: Introduction to STINGER

Graphs are pervasive

• Graphs: things and relationships

• Different kinds of things, different kinds of relationships, but graphs provide a framework for analyzing the relationships.

• New challenges for analysis: data sizes, heterogeneity, uncertainty, data quality.

Astrophysics
Problem: Outlier detection
Challenges: Massive data sets, temporal variation
Graph Problems: Matching, clustering

Bioinformatics
Problem: Identifying target proteins
Challenges: Data heterogeneity, quality
Graph Problems: Centrality, clustering

Social Informatics
Problem: Emergent behavior, information spread
Challenges: New analysis, data uncertainty, scale
Graph Problems: Clustering, flows, shortest paths

Page 6: Introduction to STINGER

Data rates and volumes are immense

• Facebook:
  • ~1 billion users
  • Average of 130 friends per user
  • 30 billion pieces of content shared / month

• Twitter:
  • 500 million active users
  • 340 million tweets / day

• Internet: 100s of exabytes / year
  • 300 million new websites per year
  • 48 hours of video uploaded to YouTube per minute
  • 30,000 YouTube videos played per second

Page 7: Introduction to STINGER

Our focus is streaming graphs

• As relationships change
  • Edges (relationships) are inserted, updated, and removed
  • New vertices (things) join and leave the network

• What are the effects?
  • On information flow
  • On community structure
  • On the integrity of data and structure

• Which actors and relationships are…
  • The key players and influencers in the change?
  • The anomalies and threats?
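The edge actions listed above (insert, update, remove) can be sketched as a small stream processor. This is only an illustration in Python, not STINGER's C API: the `DynamicGraph` class, the `apply_stream` helper, and the `("+", u, v, weight)` action format are all hypothetical names chosen for this example.

```python
from collections import defaultdict


class DynamicGraph:
    """Toy undirected dynamic graph (hypothetical, not the STINGER API).

    Each vertex maps to a dict of neighbor -> weight.
    """

    def __init__(self):
        self.adj = defaultdict(dict)

    def insert_or_update(self, u, v, weight=1):
        # Inserting an edge that already exists simply updates its weight.
        self.adj[u][v] = weight
        self.adj[v][u] = weight

    def remove(self, u, v):
        # Removing an edge that is not present is a no-op.
        self.adj[u].pop(v, None)
        self.adj[v].pop(u, None)

    def degree(self, v):
        return len(self.adj[v])


def apply_stream(graph, actions):
    """Apply (op, u, v, weight) tuples in arrival order."""
    for op, u, v, weight in actions:
        if op == "+":
            graph.insert_or_update(u, v, weight)
        elif op == "-":
            graph.remove(u, v)
        else:
            raise ValueError(f"unknown op {op!r}")


g = DynamicGraph()
apply_stream(g, [
    ("+", "x", "y", 1),   # insert x-y
    ("+", "y", "z", 1),   # insert y-z
    ("+", "x", "y", 5),   # update the weight of x-y
    ("-", "y", "z", 0),   # remove y-z
])
```

After this stream, only the edge x-y remains, carrying its updated weight. Questions such as "who are the key players in the change?" are then asked of the graph as it evolves rather than of a static snapshot.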


Page 8: Introduction to STINGER

What is STINGER?

Spatio-Temporal Interaction Networks and Graphs Extensible Representation
D. A. Bader, J. Berry, A. Amos-Binks, D. Chavarría-Miranda, C. Hastings, K. Madduri, S. C. Poulos

• A scalable, high performance in-memory dynamic graph data structure
  • Stores semantic and temporal information.
  • Designed to be flexible and extendable.
  • Be useful for the entire “large graph” community.
  • Permit good performance: No single structure is optimal for all.
  • Assume globally addressable memory access.
  • Support multiple, parallel readers and a single parallel writer.

• A software suite for dynamic graph analysis
  • Targets large shared-memory x86 and the Cray XMT
  • Written in C with OpenMP and XMT pragma support for parallelism

Page 9: Introduction to STINGER

As a data structure

• Fast insertions, deletions, and updates: A data structure that grows and changes at the speed of the data.

• Edge and vertex types and weights: Represent complex relationships and multiple simultaneous networks.

• Filtering traversal mechanisms: Traverse serially or in parallel on specific edge types, time ranges, vertex sets, etc.

• Experimental workflow server: Multiple data streams and analytics with one persistent data structure.

• Experimental Java and Python bindings: Use productivity-oriented languages without sacrificing performance.
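To make the idea of a filtering traversal concrete, here is a minimal Python sketch that walks a list of typed, timestamped edges and keeps only those matching an edge type, a time range, and a source vertex set. The `Edge` record and the `filtered_edges` function are hypothetical and exist only for this illustration; they are not the STINGER traversal API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """Hypothetical edge record carrying a type, weight, and timestamp."""
    src: int
    dst: int
    etype: str
    weight: float
    timestamp: int


def filtered_edges(edges, etype=None, t_min=None, t_max=None, sources=None):
    """Yield the edges that satisfy every filter that is not None."""
    for e in edges:
        if etype is not None and e.etype != etype:
            continue
        if t_min is not None and e.timestamp < t_min:
            continue
        if t_max is not None and e.timestamp > t_max:
            continue
        if sources is not None and e.src not in sources:
            continue
        yield e


edges = [
    Edge(0, 1, "follows", 1.0, 10),
    Edge(0, 2, "mentions", 1.0, 20),
    Edge(1, 2, "follows", 1.0, 30),
    Edge(2, 0, "follows", 1.0, 40),
]

# Only "follows" edges with a timestamp of 25 or later.
recent_follows = list(filtered_edges(edges, etype="follows", t_min=25))

# Every edge leaving vertex 0, regardless of type or time.
from_zero = list(filtered_edges(edges, sources={0}))
```

Because the filters are applied while iterating, an analytic sees only the sub-network it asked for and never needs a separate copy of the graph per edge type or time window.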

Page 10: Introduction to STINGER

As an analysis package

• Streaming edge insertions and deletions: Performs new edge insertions, updates, and deletions in batches or individually.

• Streaming clustering coefficients: Tracks the local and global clustering coefficients of a graph under both edge insertions and deletions.

• Streaming connected components: Accurately tracks the connected components of a graph with insertions and deletions.

• Streaming community detection: Track and update the community structures within the graph as they change.

• Parallel agglomerative clustering: Find clusters that are optimized for a user-defined edge scoring function.

• Streaming Betweenness Centrality: Find the key points within information flows and structural vulnerabilities.

• K-core Extraction: Extract additional communities and filter noisy high-degree vertices.

• Classic breadth-first search: Performs a parallel breadth-first search of the graph starting at a given source vertex to find shortest paths.
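As an illustration of what "streaming" means for one of these analytics, the sketch below keeps per-vertex triangle counts up to date as edges are inserted and deleted, so local clustering coefficients never need a full recomputation. It relies on the standard observation that a new edge (u, v) closes one triangle for every common neighbor of u and v. The `StreamingClustering` class is a hypothetical serial Python toy, not the parallel STINGER implementation.

```python
from collections import defaultdict


class StreamingClustering:
    """Maintain triangle counts incrementally (hypothetical sketch)."""

    def __init__(self):
        self.nbrs = defaultdict(set)   # vertex -> set of neighbors
        self.tri = defaultdict(int)    # vertex -> triangles it belongs to

    def _adjust(self, u, v, sign):
        # Each common neighbor w forms (or loses) the triangle u-v-w.
        common = self.nbrs[u] & self.nbrs[v]
        self.tri[u] += sign * len(common)
        self.tri[v] += sign * len(common)
        for w in common:
            self.tri[w] += sign

    def insert(self, u, v):
        if u == v or v in self.nbrs[u]:
            return  # ignore self-loops and duplicate edges
        self._adjust(u, v, +1)
        self.nbrs[u].add(v)
        self.nbrs[v].add(u)

    def delete(self, u, v):
        if v not in self.nbrs[u]:
            return  # nothing to remove
        self.nbrs[u].discard(v)
        self.nbrs[v].discard(u)
        self._adjust(u, v, -1)

    def local_cc(self, v):
        # Fraction of neighbor pairs of v that are themselves connected.
        d = len(self.nbrs[v])
        return 0.0 if d < 2 else 2.0 * self.tri[v] / (d * (d - 1))


sc = StreamingClustering()
for u, v in [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]:
    sc.insert(u, v)
cc_a_before = sc.local_cc("a")   # a's two neighbors are connected
cc_c_before = sc.local_cc("c")   # one of c's three neighbor pairs is connected
sc.delete("a", "b")              # breaks the only triangle
cc_c_after = sc.local_cc("c")
```

Each update touches only the two endpoints and their common neighbors, which is why tracking the coefficients under a stream can be much cheaper than recounting all triangles after every change.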

Page 11: Introduction to STINGER

How is the graph stored?

Page 12: Introduction to STINGER

What can STINGER represent?

• Nearly any set of relationships
  • Healthcare
  • Social Networks
  • Intelligence
  • Systems biology
  • Power grid
  • Travel networks

• Example: Twitter
  • Users, hashtags, tweets as vertex types
  • Authorship, retweet, mentions, follows / followed by edge types

• Example: Work Environment
  • Users, PCs, printers, emails, URLs, files, etc. as vertex types
  • Email alias, from, to, access, logon/off, print, IM, etc. as edge types
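The Twitter example can be made concrete with a short Python sketch that turns one tweet into typed edges between typed vertices. The function name `tweet_to_edges`, the tuple layout, and the "tagged" edge type are assumptions made for this illustration; the slide itself lists only authorship, retweet, mentions, and follows / followed by as edge types.

```python
import re


def tweet_to_edges(author, tweet_id, text):
    """Return (src_type, src, edge_type, dst_type, dst) tuples for one tweet.

    Hypothetical mapping: the "tagged" edge type is an assumption added
    here to link tweets to hashtag vertices.
    """
    edges = [("user", author, "authorship", "tweet", tweet_id)]
    for name in re.findall(r"@(\w+)", text):
        edges.append(("tweet", tweet_id, "mentions", "user", name))
    for tag in re.findall(r"#(\w+)", text):
        edges.append(("tweet", tweet_id, "tagged", "hashtag", tag.lower()))
    return edges


example = tweet_to_edges("alice", 1, "Stay safe @bob #Sandy")
```

Keeping the vertex and edge types on every edge is what lets several networks (who wrote what, who mentioned whom, which topics co-occur) live in one structure and be analyzed separately or together.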

Page 13: Introduction to STINGER

What can STINGER do?

• Optimized to update at rates of over 3 million edges per second on graphs of one billion edges
  • D. Ediger, R. McColl, J. Riedy, and D. A. Bader, "STINGER: High Performance Data Structure for Streaming Graphs," The IEEE High Performance Extreme Computing Conference (HPEC), Waltham, MA, September 20-22, 2012. Best Paper Award.

RMAT: Recursive MATrix graph generator. RMAT(N) indicates 2^N vertices.
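For readers unfamiliar with R-MAT, the following Python sketch shows the general technique: each edge is placed by recursively choosing one of four quadrants of the adjacency matrix, one bit of the source and destination IDs per level. The quadrant probabilities (a = 0.57, b = 0.19, c = 0.19, with the remainder for the fourth quadrant) are a commonly used choice and an assumption here; the slides do not say which parameters the benchmark used, and the function names are hypothetical.

```python
import random


def rmat_edge(scale, a=0.57, b=0.19, c=0.19, rng=random):
    """Pick one directed edge over 2**scale vertices by quadrant descent.

    The probabilities a, b, c (and d = 1 - a - b - c) are assumed values.
    """
    src = dst = 0
    for _ in range(scale):
        r = rng.random()
        src <<= 1
        dst <<= 1
        if r < a:
            pass                 # top-left quadrant: both bits stay 0
        elif r < a + b:
            dst |= 1             # top-right quadrant
        elif r < a + b + c:
            src |= 1             # bottom-left quadrant
        else:
            src |= 1             # bottom-right quadrant
            dst |= 1
    return src, dst


def rmat_edges(scale, n_edges, seed=0):
    """Generate n_edges R-MAT edges with a seeded, reproducible generator."""
    rng = random.Random(seed)
    return [rmat_edge(scale, rng=rng) for _ in range(n_edges)]


sample = rmat_edges(scale=10, n_edges=1000)   # RMAT(10): 2**10 vertices
```

Because the first quadrant is favored at every level, low-numbered vertices collect many more edges than others, producing the skewed degree distribution that makes R-MAT graphs a common stand-in for social-network data.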

Page 14: Introduction to STINGER

What can STINGER do?

• Maintaining connected components in a graph of half a billion edges
  • Up to 1.26 million updates per sec.
  • 137x faster than recomputing.

• Scalable parallel streaming community detection
  • Built on parallel insert / delete mechanisms.

• Streaming approximate betweenness
  • Used to analyze influencers on Twitter during Hurricane Sandy over time.
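The gap between updating and recomputing is easiest to see for edge insertions. The Python sketch below uses a union-find structure, a standard technique swapped in for illustration and not the STINGER algorithm, to keep component labels current as edges arrive. It deliberately does not handle deletions, which are the hard part of the streaming problem; the class name `ComponentTracker` is hypothetical.

```python
class ComponentTracker:
    """Insert-only connected components via union-find (illustrative only)."""

    def __init__(self):
        self.parent = {}
        self.count = 0   # number of components among the vertices seen so far

    def find(self, v):
        if v not in self.parent:
            # First time we see v: it starts as its own component.
            self.parent[v] = v
            self.count += 1
        root = v
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every vertex on the path at the root.
        while self.parent[v] != root:
            nxt = self.parent[v]
            self.parent[v] = root
            v = nxt
        return root

    def insert_edge(self, u, v):
        ru, rv = self.find(u), self.find(v)
        if ru != rv:
            self.parent[ru] = rv   # merge the two components
            self.count -= 1

    def same_component(self, u, v):
        return self.find(u) == self.find(v)


cc = ComponentTracker()
for u, v in [(1, 2), (3, 4), (2, 3), (5, 6)]:
    cc.insert_edge(u, v)
```

Each insertion costs a couple of near-constant-time lookups, whereas recomputing from scratch would traverse the whole graph after every batch of updates.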

Page 15: Introduction to STINGER

What does STINGER not do?

• Does not provide all ACID properties
  • Why: Not intended to be the backing data store.
  • Why: Allows for greater ingest and processing speeds.
  • Alternative: Back STINGER ingest with an ACID DB
  • Alternative: STINGER does provide consistency, partial isolation

• No text-based query language, for now
  • Why: Currently, no language is general enough to describe most or all queries
  • Alternative: Filtering traversal APIs, unlimited query flexibility through code
  • Alternative: Productivity language bindings (Python, Java)

• No distributed / Hadoop-like cluster support
  • Why: Good fit for ingest, but poor for streaming analysis; random access is too slow
  • Alternative: Larger shared memory systems such as the Cray XMT and SGI UV systems
  • Alternative: Processing billion-edge graphs in shared memory on affordable Intel servers
  • Alternative: Extract key portions of the graph from a larger data store and perform fast in-memory processing in STINGER

Page 16: Introduction to STINGER

What sizes, performance can it handle?

Desktop (Intel Core i7-2600, 16 GB DDR3)

V      E      Config   Size (GB)   Connected Components (s)   Updates per Sec.
1M     8M     22-14    1.184       0.316                      2.7M
2M     16M    22-14    2.384       0.75                       2.3M
4M     33M    22-14    4.768       2                          2.3M
8M     67M    24-14    9.536       5.36                       0.85M
4M     67M    24-14    7.984       3                          1.38M
4M     134M   24-14    14.336      5.7                        0.8M

Server (4x Opteron 6282, 256 GB DDR3)

V      E      Config   Size (GB)   Connected Components (s)   Updates per Sec.
16M    512M   25-14    60          13.7                       696K
16M    256M   25-14    24.6        9.82                       2.1M

Cray XMT2 (64 processors, 2 TB DDR2)

V      E      Config   Size (GB)   Connected Components (s)   Updates per Sec.
67M    512M   28-32    86          13.8                       3.3M
268M   4.3B   28-32    312         52.3                       2.34M

• The only limitation on size is system memory
• Billions of vertices and edges are possible
• V vertices and E edges in each graph
• E counts are undirected
• STINGER stores both directions
• Config is STINGER-specific parameters

Page 17: Introduction to STINGER

Why not existing technologies?

• Traditional SQL databases
  • Not structured to do any meaningful graph queries with any level of efficiency or timeliness

• Graph databases - mostly on-disk
  • Distributed disk can keep up with storing / indexing, but is simply too slow at random graph access to process the graph as it updates

• Hadoop and HDFS-based projects
  • Not really the right programming model for many structural queries over the entire graph; random access performance is poor

• Smaller graph libraries, processing tools
  • Can't scale, can't process dynamic graphs, and frequently lead to impossible visualization attempts

Page 18: Introduction to STINGER

Who is GTRI?

• Georgia Tech Research Institute
  • Largest research entity at Georgia Institute of Technology
  • One of the world's premier university-based applied R&D organizations for 75 years
  • Non-profit with over 1,600 employees and 21 locations world-wide
  • Over $240 million per year of government and industry contracts

• Innovative Computing Division of the Cyber Technology and Information Security Lab
  • Dedicated to the application of practical HPC expertise and cutting-edge fundamental research to solve real-world problems
  • Experts in high-performance computing, algorithms, and big data

Page 19: Introduction to STINGER

How can I start using STINGER?

• Information, code, help
  • http://cc.gatech.edu/stinger
  • [email protected]

• Together, GTRI and Georgia Tech can offer
  • Consulting: Understand how your organization can benefit from graph analytics.
  • Training: Learn how to use graph analysis and apply STINGER to your data.
  • Implementation: Customize and extend STINGER to suit your needs using our experts.
  • Research Expertise: Connect with researchers on the cutting edge of big data to develop novel solutions to your open problems.