Braintalk cuso nm

Analyzing and Querying Big Scientific Data

Thomas Heinis

Thomas Heinis is a post-doctoral researcher in the database group at EPFL. His research focuses on scalable data management algorithms for large-scale scientific applications. Thomas is part of the "Human Brain Project" and currently works with neuroscientists to develop the data management infrastructure necessary for scaling up brain simulations. Prior to joining EPFL, Thomas completed his Ph.D. in the Systems Group at ETH Zurich, where he pursued research in workflow execution systems as well as data provenance.

Transcript of Braintalk cuso nm

Page 1: Braintalk cuso nm

Analyzing and Querying Big Scientific Data

Thomas Heinis

Page 2: Braintalk cuso nm

Data-Driven Scientific Discovery

Human Brain Project, SDSS, LHC ATLAS

Scientists Are Overwhelmed with Big Data

Large Hadron Collider: 12 Petabytes / experiment

Sloan Digital Sky Survey: 4 Petabytes / year

Human Brain Project: ~100 Gigabytes / sec

Page 3: Braintalk cuso nm

Scientific Data Growth

[Figure: cumulative size of datasets in petabytes (0 to 10) by year, 2004 to 2014, for Astronomy [NRAO], Physics [LHC], Simulation [ICESS], and Gene Sequencing [EBI].]

Scientific Data Grows Exponentially!

Page 4: Braintalk cuso nm

Data in the Simulation Sciences

[Figure: three axes of growth. DURATION: increasing simulation duration. COVERAGE: increasing model size by order of magnitude. RESOLUTION: increasing level of detail.]

Dimensions are Multiplicative!

Page 5: Braintalk cuso nm

What is the Human Brain Project?

A 10-year European initiative to understand the human brain, enabling advances in neuroscience, medicine and future computing.

A consortium of 250+ Scientists, 135 Research Groups, from over 80 institutions, and more than 20 countries in Europe and beyond.

Page 6: Braintalk cuso nm

Human Brain Project - Vision

Future Medicine: Symptom-based to biology-based classification; unique signatures of diseases; early diagnosis

Future Neuroscience: Multi-level view of brain; causal chain of events from genes to cognition

Future Computing: Supercomputing as scientific method; human-like intelligence

Page 7: Braintalk cuso nm

Brain Simulation – Wet Lab

Neuron structure & electrophysiological properties

Page 8: Braintalk cuso nm

Simulating the Brain

Page 9: Braintalk cuso nm

Simulation Science Data Challenges

[Diagram labels: Observational Data; Spatial Modeling; Simulation; Post Simulation Data; Spatial Analysis]

Spatial Analysis: Static 3D Exploration, Interactive 3D Exploration, Dynamic 3D Exploration

Need Scalable Spatial Access Methods

Page 11: Braintalk cuso nm

Static Exploration

Neural Tissue Model

Single Neuron 3D Model

Efficient Spatial Index is Crucial

3D Spatial Range Query

Page 12: Braintalk cuso nm

State-of-the-Art Spatial Indexes

R-Tree: Hierarchy of Minimum Bounding Rectangles (MBR)

R-Tree Variants: Hilbert packed R-Tree, STR R-Tree, PR-Tree

[Figure labels: Overlap; Range Query]

Structural Overlap Degrades Performance

Page 13: Braintalk cuso nm

Scalability Challenge

Dataset: 100K neurons, 450 Million 3D cylinders, 27 GB on disk.

Range Queries: Uniform Random, 500 for each experiment.

[Figure: query time in seconds (0 to 300) versus dataset density in millions of elements per unit volume (50 to 450) for Hilbert R-Tree, STR R-Tree, and PR-Tree.]

Spatial Density Increases with Dataset Size

State of the Art Does Not Scale with Density

Page 14: Braintalk cuso nm

FLAT: A Two Phase Spatial Index

Use Connectivity To Avoid Overlap

Key Idea: Two phases, each independent of overlap:

1) SEEDING: Find any one object

2) CRAWLING: Traverse neighborhood

Requires Reachability

Page 15: Braintalk cuso nm

FLAT: Reachability Problem

REQUIREMENTS:

Connectivity: For accessing neighboring objects in data.

Convex Dataset Geometry: Never crawl outside the query bound.

Not every dataset satisfies these requirements!

[Figure labels: Earthquake simulation datasets: No Problem!; No path inside query; No Connectivity]

Page 16: Braintalk cuso nm

FLAT: Reachability

Index Building:

1) Partitioning: Group spatially close elements

2) Linking: Connect neighboring partitions

Add Connectivity → Enable Recursive Crawling
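The two index-building steps (partitioning, then linking) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the FLAT implementation: elements are treated as 3D points, a uniform grid stands in for the partitioning scheme, and two partitions are linked when their grid cells touch. The names `build_partitions` and `link_partitions` are hypothetical.

```python
from collections import defaultdict
from itertools import product

def build_partitions(points, cell_size):
    """Step 1 (partitioning): group spatially close points by grid cell."""
    partitions = defaultdict(list)
    for p in points:
        cell = tuple(int(c // cell_size) for c in p)
        partitions[cell].append(p)
    return dict(partitions)

def link_partitions(partitions):
    """Step 2 (linking): connect each partition to its non-empty neighbors."""
    links = {cell: set() for cell in partitions}
    # All 26 neighboring cell offsets in 3D (assumption: touching cells are neighbors).
    offsets = [o for o in product((-1, 0, 1), repeat=3) if o != (0, 0, 0)]
    for cell in partitions:
        for o in offsets:
            neighbor = tuple(c + d for c, d in zip(cell, o))
            if neighbor in partitions:
                links[cell].add(neighbor)
    return links

points = [(0.5, 0.5, 0.5), (1.5, 0.5, 0.5), (5.5, 5.5, 5.5)]
partitions = build_partitions(points, cell_size=1.0)
links = link_partitions(partitions)
```

In this toy version the isolated partition at cell (5, 5, 5) ends up with no links, which is a small instance of the reachability problem the linking step is meant to address.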

Page 17: Braintalk cuso nm

FLAT: Seeding Phase

R-Tree for seeding, but will it scale with density?

Range Query: Find ALL elements inside query

Seed Query: Find ANY ONE element inside query

[Figure: Seed R-Tree and seed query; where nodes overlap, the seed query picks one child arbitrarily]

Seeding phase avoids overlap overhead in R-Tree

Seeding is fast: page reads = ~height of tree.
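The difference between a range query and a seed query can be made concrete on a toy tree of bounding boxes. The sketch below is an illustration only: it uses 2D points for brevity, the `Node` class and both function names are hypothetical, and the seed query falls back to the next overlapping child if its first choice contains no matching element (an assumption; the slides only say that one child is picked arbitrarily).

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    mbr: tuple                                     # ((xmin, ymin), (xmax, ymax))
    children: list = field(default_factory=list)   # filled for inner nodes
    elements: list = field(default_factory=list)   # filled for leaves (points)

def intersects(a, b):
    (alo, ahi), (blo, bhi) = a, b
    return all(al <= bh and bl <= ah for al, ah, bl, bh in zip(alo, ahi, blo, bhi))

def contains(box, p):
    lo, hi = box
    return all(l <= c <= h for l, c, h in zip(lo, p, hi))

def range_query(node, query, visited):
    """Find ALL elements inside the query; every overlapping child is visited."""
    visited.append(node)
    found = [p for p in node.elements if contains(query, p)]
    for child in node.children:
        if intersects(child.mbr, query):
            found += range_query(child, query, visited)
    return found

def seed_query(node, query, visited):
    """Find ANY ONE element inside the query; stop at the first hit."""
    visited.append(node)
    for p in node.elements:
        if contains(query, p):
            return p
    for child in node.children:
        if intersects(child.mbr, query):
            hit = seed_query(child, query, visited)
            if hit is not None:
                return hit
    return None

# Two leaves whose bounding boxes overlap in the region [4, 6] x [4, 6].
leaf_a = Node(mbr=((0, 0), (6, 6)), elements=[(1, 1), (5, 5)])
leaf_b = Node(mbr=((4, 4), (10, 10)), elements=[(4.5, 4.5), (9, 9)])
root = Node(mbr=((0, 0), (10, 10)), children=[leaf_a, leaf_b])
query = ((4, 4), (6, 6))

range_visits, seed_visits = [], []
all_hits = range_query(root, query, range_visits)
one_hit = seed_query(root, query, seed_visits)
```

With the query placed in the overlap region, the range query has to read both leaves, while the seed query stops after the first leaf that yields an element.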

Page 18: Braintalk cuso nm

FLAT: Crawling Phase

The neighbor links are used for recursive graph traversal, starting from the seed page

[Figure labels: Seed Partition; Range Query]

Linear complexity in terms of graph edges
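A crawl over neighbor links is essentially a graph traversal that is cut off at the query boundary. The following sketch assumes each partition stores a bounding box and a list of points, and that `links` maps a partition id to its neighbors; these structures and the function name `crawl` are hypothetical, and 2D points are used to keep the example short.

```python
from collections import deque

def inside(box, p):
    lo, hi = box
    return all(l <= c <= h for l, c, h in zip(lo, p, hi))

def overlaps(a, b):
    (alo, ahi), (blo, bhi) = a, b
    return all(al <= bh and bl <= ah for al, ah, bl, bh in zip(alo, ahi, blo, bhi))

def crawl(seed, partitions, links, query):
    """Breadth-first traversal over neighbor links, starting at the seed partition."""
    results, seen, frontier = [], {seed}, deque([seed])
    while frontier:
        pid = frontier.popleft()
        _, elements = partitions[pid]
        results += [p for p in elements if inside(query, p)]
        for nxt in links[pid]:
            # Never crawl outside the query bound: skip partitions that miss the query.
            if nxt not in seen and overlaps(partitions[nxt][0], query):
                seen.add(nxt)
                frontier.append(nxt)
    return results

# Four partitions in a row, each linked to its direct neighbors.
partitions = {
    "A": (((0, 0), (2, 2)), [(1, 1)]),
    "B": (((2, 0), (4, 2)), [(3, 1)]),
    "C": (((4, 0), (6, 2)), [(5, 1)]),
    "D": (((6, 0), (8, 2)), [(7, 1)]),
}
links = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
query = ((2.5, 0), (5.5, 2))
hits = crawl("B", partitions, links, query)
```

Each link is examined a bounded number of times, which matches the slide's point that the cost is linear in the number of graph edges touched rather than dependent on bounding-box overlap.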

Page 19: Braintalk cuso nm

FLAT: Performance Evaluation

Dataset: 100K neurons, 450 Million 3D cylinders, 27 GB on disk.

Range Queries: Uniform Random, 500 for each experiment.

[Figure: query time in seconds (0 to 300) versus dataset density in millions of elements per unit volume (50 to 450) for Hilbert R-Tree, STR R-Tree, PR-Tree, and FLAT. Annotation: 7.8x.]

Spatial Density Increases with Dataset Size

Decouples Execution Time from Density

Page 20: Braintalk cuso nm

FLAT: Scalability

Dataset: 100K neurons, 450 Million 3D cylinders, 27 GB on disk.

Range Queries: Uniform Random, 500 for each experiment.

[Figure: time per result object in milliseconds (0 to 5) versus dataset density in millions of elements per unit volume (50 to 450) for Hilbert R-Tree, STR R-Tree, PR-Tree, and FLAT.]

Seeding cost amortizes with increase in result cardinality

Trend is “FLAT”, Scales With Density

Page 21: Braintalk cuso nm

FLAT: iPad Implementation

http://www.youtube.com/watch?v=zaUEARq-IY0

Page 22: Braintalk cuso nm

Simulation Science Data Challenges

[Diagram labels: Observational Data; Spatial Modeling; Simulation; Post Simulation Data]

Static 3D Exploration, Interactive 3D Exploration, Dynamic 3D Exploration

Page 23: Braintalk cuso nm

Interactive Exploration

Spatial Range Query Sequences

[Figure labels: Guiding Path; Neural Network; Bronchial Tree of the Lung; Arterial Tree of the Heart]

Guided Analysis Ubiquitous in Scientific Applications

Page 24: Braintalk cuso nm

Interactive Query Execution

Guiding paths are not known in advance

Interactive execution of query sequence

[Timeline figure: DISK (Retrieve Query Results) and CPU (Process Results) over Time for the 1st, 2nd, and 3rd Query. Annotations: Path decided after processing results; Prefetching Opportunity; Prediction; Prefetch Data]

Predictive Prefetching Hides Data Retrieval Cost

Predict next query location in the sequence

Prefetch data of next query into prefetch cache

Page 25: Braintalk cuso nm

Predictive Prefetching

Existing techniques: Extrapolate past query locations

Exponential Weighted Moving Average (EWMA), Straight Line, Hilbert Prefetching

[Figure: cache hit rate in percent (0 to 50) versus volume of query in µm³ (10k, 80k, 150k, 220k), ranging from small-volume to large-volume queries. Neuroscience data set, 25 queries in sequence.]

Not Efficient With Arbitrary Query Volume!
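Extrapolating past query locations can be illustrated with two of the named baselines. The sketch below is one plausible formulation, not the exact one evaluated in the talk: it assumes each query is summarized by its center point, the functions `straight_line` and `ewma` are hypothetical names, and the smoothing factor `alpha` is an arbitrary choice. Hilbert prefetching is not shown.

```python
def straight_line(centers):
    """Repeat the last movement vector: next = last + (last - previous)."""
    (px, py, pz), (lx, ly, lz) = centers[-2], centers[-1]
    return (2 * lx - px, 2 * ly - py, 2 * lz - pz)

def ewma(centers, alpha=0.5):
    """Smooth past movement vectors with an EWMA, then step from the last center."""
    avg = None
    for prev, cur in zip(centers, centers[1:]):
        step = tuple(c - p for p, c in zip(prev, cur))
        if avg is None:
            avg = step
        else:
            avg = tuple(alpha * s + (1 - alpha) * a for s, a in zip(step, avg))
    return tuple(c + a for c, a in zip(centers[-1], avg))

# Centers of three past queries along a path that bends upward.
centers = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (4.0, 2.0, 0.0)]
line_guess = straight_line(centers)   # (6.0, 4.0, 0.0)
ewma_guess = ewma(centers)            # (6.0, 3.0, 0.0)
```

Both predictors only see where earlier queries were, not what they returned, which is the limitation the slide points at.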

Page 26: Braintalk cuso nm

SCOUT: Content Aware Prefetching

Key Insight: Use previous query content!

Approach:

1. Inspect query results

2. Identify guiding path

3. Predict next query using guiding path

Need to Identify Guiding Path

Page 27: Braintalk cuso nm

SCOUT: How paths are defined

Query results = many primitive spatial objects.

Idea: Graph Framework

G(V,E) such that vertices = spatial objects, edges between nearby objects.

Independence from data representation

Exact graph: N² comparisons!

Approximate graph representation: grid hash based construction

[Figure label: Range Query]

Page 28: Braintalk cuso nm

SCOUT: Guiding Path Identification

Iterative Candidate Pruning

Key Insight: Guiding path goes through all queries!

[Figure: queries n, n+1, n+2, n+3 along the guiding path, with the candidate set of paths and the predicted query]

Longer Sequence → Better Prediction
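Because the guiding path goes through all queries, any structure that is missing from one query result can be dropped from the candidate set. A minimal sketch of that pruning follows; it assumes each query result has already been reduced to a set of path identifiers, which is a simplification made for this example, and `prune_candidates` is a hypothetical name.

```python
def prune_candidates(query_results):
    """Keep only the paths that appear in every query result seen so far."""
    candidates = None
    history = []
    for paths_in_query in query_results:
        current = set(paths_in_query)
        candidates = current if candidates is None else candidates & current
        history.append(set(candidates))
    return history

# Path ids found in four consecutive query results.
sequence = [{"a", "b", "c", "d"}, {"b", "c", "d"}, {"c", "d", "e"}, {"c"}]
history = prune_candidates(sequence)
```

The candidate set shrinks from four paths to one over the four queries, which illustrates why a longer sequence gives a better prediction.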

Page 29: Braintalk cuso nm

SCOUT: Where to Prefetch

Prefetch duration not known in advance. Query dimension not known in advance.

Idea: Incremental Prefetching

Repeatedly prefetch growing regions by extrapolating guiding path

[Figure: nth query in sequence; guiding path; exit; prefetch regions p1, p2, ..., pn]

Independence from query size

Policy = safest region first

Page 30: Braintalk cuso nm

SCOUT: Prediction Accuracy

Dataset: 100K neurons, 450 Million 3D cylinders, 27 GB on disk

Cache Hit Rate = (Amount of data retrieved from cache / Total amount of data retrieved) x 100

[Figure: cache hit rate in percent (0 to 100) for EWMA, Straight Line, Hilbert, and SCOUT. Panel labels: Sequence 1; Sequence 2; Visualization. Query volume and sequence length: 80K µm³ and 32; 20K µm³ and 32. Annotations: Speedup 2x; Speedup 14.7x.]

72% - 91% Prediction Accuracy

SCOUT speeds up sequences up to 14.7x

Page 31: Braintalk cuso nm

SCOUT: Scalability

Increase in Data set Size

[Figure: SCOUT cache hit rate in percent (0 to 100) versus data set size in number of spatial objects (50M to 450M).]

SCOUT scales with increase in data set size

SCOUT Overhead

[Timeline figure labels: CPU; DISK; Retrieve Query Results; Processing Results; Prediction; Prefetching; 2nd Query; 3rd Query; Time]

[Figure: time in seconds (0 to 200) for Prediction and Retrieve Query Results versus data set size in number of spatial objects (50M to 450M). Annotations: Selectivity increases; 15-16%.]

Page 32: Braintalk cuso nm

Simulation Science Data Challenges

[Diagram labels: Observational Data; Spatial Modeling; Simulation; Post Simulation Data]

Static 3D Exploration, Interactive 3D Exploration, Dynamic 3D Exploration

Page 33: Braintalk cuso nm

Dynamic Exploration

Mesh: Collection of 3D Connected Polyhedra

[Figure labels: Polyhedra; Connected Polyhedra; Volumetric Mesh Model; 3D Vertices; Shared Faces]

Mesh → Enable High Precision 3D Models

Challenge: Monitoring Memory Resident Spatial Mesh Models

Page 34: Braintalk cuso nm

Monitoring Mesh Simulations

[Figure: mesh at Time step 1, Time step 2, Time step 3. Timeline labels: Simulation Time step; Monitor; Updates; Queries; time]

Problem: Efficiently Execute Range Queries

Page 35: Braintalk cuso nm

Data Challenge

Highly Dynamic: Unpredictable Mesh Movement; Updates Affect Entire Dataset

Mesh Detail: Mesh Detail Increases With Dataset Size

[Figure labels: Timestep 1; Timestep 2; Now; Future]

Need: Solution That Scales

Page 36: Braintalk cuso nm

State of the Art

Moving Object Indexes: TPR-Tree, STRIPES

Mesh Movement is Inherently Unpredictable

Static Spatial Indexes: R-Tree, LUR-Tree, QU-Trade

[Figure labels: Linear Scan; Coarse Grained; Fine Grained]

Neither Scales with Size nor Detail!

Page 37: Braintalk cuso nm

Performance Evaluation

SETUP: Neural Mesh Dataset: 1.32 Billion Tetrahedral Mesh (33GB); 15 queries per time step, 60 simulation time steps

[Figure: Statistical Analysis Microb... Total query response time in seconds (0 to 8000) for LinearScan, OCTREE, LUR-Tree, and QU-Trade. Annotations: Maintenance; 99.5%; 80%; 72%.]

[Timeline labels: Simulation Time step (Massive Updates); Monitor (Few Queries); time]

Linear Scan Outperforms Indexed Approaches

Not Enough Queries to Invest on Index Maintenance

Page 38: Braintalk cuso nm

OCTOPUS: Idea

Can We Do Better?

Best of Both Worlds: Reduce Search Space → Index Approach; No Maintenance → Linear Scan

Mesh Connectivity → Query Execution

Not Rely on External Data Structure: → Directly use in-memory Mesh Data

Mesh Graph Traversal: → Retrieve Results in Spatial Proximity

[Figure: Mesh Graph with Vertices and Edges]

Key Insight: Use Mesh Connectivity to Retrieve Query Results!

Page 39: Braintalk cuso nm

OCTOPUS

Update Oblivious Query Execution

[Figure: Range Query on the mesh at Time step 1, Time step 2, Time step 3]

What About Non-Convex Meshes?

Page 40: Braintalk cuso nm

OCTOPUS: Non-Convex Meshes

No Reachability! → Surface Scan

Using Mesh Surface Guarantees Accuracy
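The combination of a surface scan and a mesh graph traversal can be sketched as follows. This is an illustration under assumptions, not the OCTOPUS implementation: the mesh is given as vertex positions, an adjacency list, and a list of surface vertex ids; the function name `octopus_query` is hypothetical; and 2D coordinates are used to keep the example small.

```python
from collections import deque

def inside(box, p):
    lo, hi = box
    return all(l <= c <= h for l, c, h in zip(lo, p, hi))

def octopus_query(positions, neighbors, surface, query):
    """Range query over mesh vertices using only the mesh's own connectivity."""
    # Surface scan: surface vertices that lie inside the query become seeds.
    seeds = [v for v in surface if inside(query, positions[v])]
    # Graph traversal: follow mesh edges from the seeds without leaving the query.
    seen, frontier = set(seeds), deque(seeds)
    while frontier:
        v = frontier.popleft()
        for w in neighbors[v]:
            if w not in seen and inside(query, positions[w]):
                seen.add(w)
                frontier.append(w)
    return seen

# A U-shaped (non-convex) chain of vertices: the two tips are close in space
# but only connected through vertices that lie outside the query.
positions = {0: (0, 2), 1: (0, 0), 2: (2, 0), 3: (4, 0), 4: (4, 2)}
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
surface = [0, 1, 2, 3, 4]        # every vertex is on the surface in this tiny mesh
query = ((-1, 1), (5, 3))        # covers only the two tips, vertices 0 and 4
result = octopus_query(positions, neighbors, surface, query)
```

Crawling from vertex 0 alone would never reach vertex 4, because the connecting path leaves the query box; seeding from the surface recovers both. A query that contains no surface vertex returns nothing in this sketch, and the slides do not say how such queries are seeded.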

Page 41: Braintalk cuso nm

OCTOPUS: Mesh Deformation

[Figure: mesh at Time step 1, Time step 2, Time step 3. Label: Graph changes]

Deformation: Zero Cost of surface maintenance

Scales With Massive Updates

Page 42: Braintalk cuso nm

OCTOPUS: Mesh Detail

Surface Points: Quadratic Increase

Non-Surface Points: Cubic Increase

Scalability: Surface grows slower than volume (and therefore dataset size)!

Scales with Mesh Resolution
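The quadratic versus cubic growth is easy to check on a regular grid. The sketch below assumes a cube-shaped n x n x n grid of vertices, which real meshes are not; `grid_point_counts` is a hypothetical helper used only to show the trend.

```python
def grid_point_counts(n):
    """Points on the surface and in the interior of an n x n x n vertex grid."""
    total = n ** 3
    interior = max(n - 2, 0) ** 3
    return total - interior, interior

for n in (10, 100, 1000):
    surface, interior = grid_point_counts(n)
    share = surface / (surface + interior)
    print(f"n={n}: surface={surface}, interior={interior}, surface share={share:.1%}")
```

As the resolution n grows, the surface share of all points shrinks, so a scan restricted to the surface touches an ever smaller fraction of the dataset.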

Page 43: Braintalk cuso nm

OCTOPUS: Performance

[Figure: total query execution time in seconds (0 to 9000) for OCTOPUS, LinearScan, OCTREE, LUR-Tree, and QU-Trade on the Statistical Analysis Microbenchmark and the Visualization Microbenchmark. Annotations: 8X; 7.3X.]

7.3-8X Speedup

Page 44: Braintalk cuso nm

OCTOPUS: Scalability

SETUP: Queries: Uniform random, 15 per time step, 60 time steps

[Figure: total query execution time in seconds (0 to 1400) for LinearScan and OCTOPUS versus mesh detail in billions of tetrahedra (0.13, 0.17, 0.26, 0.52, 1.32). Annotations: 8X; 10X.]

Scales with Mesh Detail

OCTOPUS Breakdown

[Figure: total query execution time in seconds (0 to 140), split into Graph Traversal and Surface Scan, versus mesh detail in billions of tetrahedra (0.13, 0.17, 0.26, 0.52, 1.32). Annotations: 64%; 41%.]

Page 45: Braintalk cuso nm

Algorithm Overview

[Diagram labels: Observational Data; Spatial Modeling; Simulation; Post Simulation Data; Spatial Analysis; Model Validation]

OCTOPUS: ICDE’14

FLAT: ICDE’12

SCOUT: VLDB’12

TOUCH: SIGMOD’13

GIPSY: SSDBM’13

Page 46: Braintalk cuso nm

Impact

Human Brain Project:

Part of the toolset used every day

February 2013: first 10 million neuron model built

Still 4 orders of magnitude smaller than human brain

General Applicability: Material Sciences, Astronomy, Geographical Information Systems

[Figure: model size in GB (0 to 30) versus simulation size in number of neurons (1K, 10K, 100K, 10M), with points labeled 2006, 2008, 2010, and 2013 (2.5 TB).]

Page 47: Braintalk cuso nm

Future Challenges

Enable Scientific Breakthroughs via Scalable Data Analysis!

Address Scientific Data Trends:

→ Progressively Complex Datasets

→ Increasingly Complex Scientific Queries

→ Modern Hardware

Approximate Queries on Big Data:

→ Use Mechanism of Learning & Forgetting to manage Data Synopses

Page 48: Braintalk cuso nm

HBP Data Management Challenges

Data Privacy/Anonymization

Scalable Querying of Petascale Data

Cloud Analytics

Quick & efficient access to raw data

Distributed Workflow Execution

Provenance/Reproducibility

Data Personalization

Page 49: Braintalk cuso nm

Conclusions

Enabling data exploration is key to scientific discovery.

Prior spatial access methods do not scale with data growth.

Use Spatial Connectivity to achieve scalability.

→ Explicitly Added (FLAT & TOUCH)

→ Implicitly Present in the Dataset (OCTOPUS & SCOUT)

Many exciting big data management challenges remain!

Page 50: Braintalk cuso nm

Thank You!

Collaborators: Farhan Tauheed, Anastasia Ailamaki, Felix Schürmann, Henry Markram, Sadegh Nobari, Panagiotis Karras, Laurynas Biveinis, Mirjana Pavlovic