High Performance Spatial Queries and Analytics for...

29
High Performance Spatial Queries and Analytics for Spatial Big Data Fusheng Wang Department of Biomedical Informatics Emory University

Transcript of High Performance Spatial Queries and Analytics for...

Page 1: High Performance Spatial Queries and Analytics for …yellowstone.cs.ucla.edu/symposium/2014/fusheng.pdf•Integrating declarative query languages with MapReduce is a major trend –Hive,

High Performance Spatial Queries and

Analytics for Spatial Big Data

Fusheng Wang

Department of Biomedical Informatics

Emory University

Page 2: High Performance Spatial Queries and Analytics for …yellowstone.cs.ucla.edu/symposium/2014/fusheng.pdf•Integrating declarative query languages with MapReduce is a major trend –Hive,

Spatial “Big Data”

Introduction

Geo-crowdsourcing:OpenStreetMap

Remote Sensing

2D Digital Pathology 3D Digital Pathology

Page 3: High Performance Spatial Queries and Analytics for …yellowstone.cs.ucla.edu/symposium/2014/fusheng.pdf•Integrating declarative query languages with MapReduce is a major trend –Hive,

Pathology Analytical Imaging

• Provide rich information about morphological and functional characteristics of biological systems, have tremendous potential for understanding diseases and supporting diagnosis

– e.g.: http://www.openpais.org/portal

Introduction

Glass Slides Scanning Whole Slide Images Image Analysis

Page 4: High Performance Spatial Queries and Analytics for …yellowstone.cs.ucla.edu/symposium/2014/fusheng.pdf•Integrating declarative query languages with MapReduce is a major trend –Hive,

Example: Distinguishing Characteristics in Glioblastoma

Introduction

… Molecular data

Clinical data

[JAMIA’12]

Page 5: High Performance Spatial Queries and Analytics for …yellowstone.cs.ucla.edu/symposium/2014/fusheng.pdf•Integrating declarative query languages with MapReduce is a major trend –Hive,

WINDOW CONTAINMENT POINT

NEAREST NEIGHBOR SPATIAL JOIN DENSITY

“GIS” Centric Queries

Introduction

Need: Manage, Query and Analyze Spatial Big Data

Page 6: High Performance Spatial Queries and Analytics for …yellowstone.cs.ucla.edu/symposium/2014/fusheng.pdf•Integrating declarative query languages with MapReduce is a major trend –Hive,

Spatial Queries and Analytics

Introduction

• Feature based descriptive queries

– Feature based filtering or feature aggregation

• Spatial relationship based queries

– Spatial join (two- or multi-), window, point-in-polygon

– Polygon overlay or spatial cross-matching

• Distance based queries

– Nearest neighbors

• Spatial analytics

– Find spatial clusters, hotspots, and anomalies

– Spatial relationship modeling, e.g., geographically weighted regression model (GWR)

Page 7: High Performance Spatial Queries and Analytics for …yellowstone.cs.ucla.edu/symposium/2014/fusheng.pdf•Integrating declarative query languages with MapReduce is a major trend –Hive,

• Requirements: fast query response, and scalable and cost-effective architecture

• Explosion of derived data

– 105x105 pixels per image

– 1 million objects per image

– Hundreds to thousands of images per study

– OSM has two billion nodes, 600K contributors

• High computational complexity

– Multi-dimensional

– Spatial queries involve heavy duty geometric computations

Requirements and Challenges

Challenges

Page 8: High Performance Spatial Queries and Analytics for …yellowstone.cs.ucla.edu/symposium/2014/fusheng.pdf•Integrating declarative query languages with MapReduce is a major trend –Hive,

Traditional Approach: Parallel SDBMS

• Shared nothing architecture through partitioning to increase I/O bandwidth via parallel data access

• Extended from ORDBMS with spatial data types and access methods

• Partitioning: even distribution of data and colocation

• e.g: PAIS (500 images, 1TB, 30 partitions)

Traditional Approach

[JPI’12, JPI’13]

Page 9: High Performance Spatial Queries and Analytics for …yellowstone.cs.ucla.edu/symposium/2014/fusheng.pdf•Integrating declarative query languages with MapReduce is a major trend –Hive,

Advantages of Parallel SDBMS

• Comprehensive model and expressive query

language for modeling and querying spatial data

• Scale out is possible

Traditional Approach

Example spatial join query:

Page 10: High Performance Spatial Queries and Analytics for …yellowstone.cs.ucla.edu/symposium/2014/fusheng.pdf•Integrating declarative query languages with MapReduce is a major trend –Hive,

Limitations of Parallel SDBMS

• The Cancer Genomics Atlas (TCGA):

– 14,000 whole slide images, 30TB of results

– Two months to get data loaded

– 4 days to do a spatial join with 30 partitions

• Partitioning based scale out is possible but is very

difficult and expensive to scale to many nodes

• DBMS not optimized for computational intensive

operations

• Data loading is a major bottleneck

• Lack load balancing

Can we provide a solution with the advantages of RDBMS but is much more scalable and cost effective?

Traditional Approach

Page 11: High Performance Spatial Queries and Analytics for …yellowstone.cs.ucla.edu/symposium/2014/fusheng.pdf•Integrating declarative query languages with MapReduce is a major trend –Hive,

Goal: High Performance Spatial Queries and Analytics

• MapReduce provides a highly scalable and cost effective framework for processing massive data

– Map step divides input into small problems by keys

– Reduce step collects answers and combine them

• HDFS for fault tolerance and efficiency

• Spatial queries and analytics are intrinsically complex and difficult to fit into the model

• Hybrid CPU/GPU systems commonly available, but the capacity is often underutilized

Hadoop-GIS

There is a major step required on providing new spatial querying and analytical methods to run on such architectures

Page 12: High Performance Spatial Queries and Analytics for …yellowstone.cs.ucla.edu/symposium/2014/fusheng.pdf•Integrating declarative query languages with MapReduce is a major trend –Hive,

Our Approach: Hadoop-GIS

A general framework to support high performance spatial queries and analytics for spatial big data on MapReduce and CPU-GPU hybrid platforms

• Spatial data processing methods and pipelines with spatial partition level parallelism running on MapReduce

• Multi-level indexing methods to accelerate spatial data processing

• Query normalization methods for partitioning effect

• Declarative spatial queries and translation into MapReduce operations

• Utilize GPU to parallelize spatial operations and integrate them into MapReduce

Hadoop-GIS

[VLDB’12, GIS’12, GIS’13, VLDB’13]

Page 13: High Performance Spatial Queries and Analytics for …yellowstone.cs.ucla.edu/symposium/2014/fusheng.pdf•Integrating declarative query languages with MapReduce is a major trend –Hive,

Multi-level Spatial Indexing

• Tile indexing for managing tiles: filtering tasks in mapping stage

• Region based spatial indexing for grouping multiple neighboring tiles into a HDFS file: filtering data

• On demand indexing for intra-tile object queries

Spatial Query Processing with MapReduce

HDFS

Tile

Region

Tile

Page 14: High Performance Spatial Queries and Analytics for …yellowstone.cs.ucla.edu/symposium/2014/fusheng.pdf•Integrating declarative query languages with MapReduce is a major trend –Hive,

A General Framework of Spatial Data Processing in MapReduce

Hadoop-GIS

1. Spatial partitioning

2. Data staging and global indexing in HDFS

A. Block based filtering with region indexes

B. foreach tile in input_region do (in parallel)

a. Tile based filtering based on tile indexes

b. On-demand indexing for objects in the tile

c. Tile based spatial query processing

C. Boundary-crossing object handling

D. Post-query processing

E. Result storage on HDFS

Page 15: High Performance Spatial Queries and Analytics for …yellowstone.cs.ucla.edu/symposium/2014/fusheng.pdf•Integrating declarative query languages with MapReduce is a major trend –Hive,

Example: Spatial Join in MapReduce: Data Staging and Indexing

Spatial Query Processing with MapReduce

Tile

Boundary files

Partitioning

record(imageID, tileID, boundary)

Data Staging&Indexing

HDFS

Result of Algorithm 1

Result of Algorithm 2

Algorithm results

Image

Page 16: High Performance Spatial Queries and Analytics for …yellowstone.cs.ucla.edu/symposium/2014/fusheng.pdf•Integrating declarative query languages with MapReduce is a major trend –Hive,

Example: Spatial Join in MapReduce: Query Processing

Spatial Query Processing with MapReduce

Tile

Boundary files

Partitioning

record(imageID, tileID, boundary)

Data Staging&Indexing

HDFS

Result of Algorithm 1

Result of Algorithm 2

Algorithm results

Image

MAP

(one reducer per tile)REDUCE

Spatial Join

Data Scan Data Scan Data Scan…

: one query engine per title

: group objects by tiles(keys)

Page 17: High Performance Spatial Queries and Analytics for …yellowstone.cs.ucla.edu/symposium/2014/fusheng.pdf•Integrating declarative query languages with MapReduce is a major trend –Hive,

Example: Spatial Join in MapReduce: Result Normalization

Spatial Query Processing with MapReduce

Data Scan Data Scan Data Scan

MAP

REDUCE

Page 18: High Performance Spatial Queries and Analytics for …yellowstone.cs.ucla.edu/symposium/2014/fusheng.pdf•Integrating declarative query languages with MapReduce is a major trend –Hive,

• Effective querying methods that can run in parallel in distributed computing environments

– Spatial join, multi-way join, containment, nearest neighbor, and can be extended

– Geometric computation library (GEOS)

• On-demand indexing based query processing

– Mismatch between large block based storage and random index access

– Create indexes on demand in memory or local files

– Indexing building overhead is a very small fraction

Real-Time Spatial Query Engine (RESQUE)

Spatial Query Processing with MapReduce

Boundary File1

Boundary File2

Spatial Join Algorithm

Geometry Refinement

Spatial Measure

R*-Tree File1

R*-Tree File2

Result File

Real-time spatial querying engine (RESQUE)

Bulk R*-Tree Building

Bulk R*-Tree Building

Example: Two-way Spatial Join

Page 19: High Performance Spatial Queries and Analytics for …yellowstone.cs.ucla.edu/symposium/2014/fusheng.pdf•Integrating declarative query languages with MapReduce is a major trend –Hive,

Spatial Partitioning

Spatial Partitioning

• Effective partitioning is critical for task parallelization and load balancing: data skew

• Criteria: balanced distribution, granularity, overlapping

• Methods: top-down: recursive slicing; bottom-up: R-Tree packing

• Parallezation with MapReduce

Page 20: High Performance Spatial Queries and Analytics for …yellowstone.cs.ucla.edu/symposium/2014/fusheng.pdf•Integrating declarative query languages with MapReduce is a major trend –Hive,

Boundary-Crossing Object Handling

• “Multi-assignment, single-join”: replicate objects on boundaries to multiple tiles at partitioning

• Normalization methods are provided for each query type to correct answers

– Spatial join: removal of duplicates

– Voronoi diagram: reconstruction of diagrams on the boundaries

S T S T q p

+

ps pt

q

Spatial Partitioning

Page 21: High Performance Spatial Queries and Analytics for …yellowstone.cs.ucla.edu/symposium/2014/fusheng.pdf•Integrating declarative query languages with MapReduce is a major trend –Hive,

Nearest Neighbor Query Processing Workflow with MapReduce

Map Reduce Aggregation

• e.g.: for each cell find the closest blood vessel and return distance to that blood vessel

• Access methods can vary (R*-Tree, Voronoi, etc..)

Spatial Query Processing with MapReduce

Page 22: High Performance Spatial Queries and Analytics for …yellowstone.cs.ucla.edu/symposium/2014/fusheng.pdf•Integrating declarative query languages with MapReduce is a major trend –Hive,

Performance

5 10 20 300

500

1000

1500

2000

2500

Tim

e (

se

co

nd

s)

Processing Units

Hadoop-GIS

DBMS-X

Spatial Join (boundary objects ignored)

System Performance: Hadoop-GIS vs Parallel Spatial SDBMS

Page 23: High Performance Spatial Queries and Analytics for …yellowstone.cs.ucla.edu/symposium/2014/fusheng.pdf•Integrating declarative query languages with MapReduce is a major trend –Hive,

0 20 40 60 80 100 120 140 160 180 2000

1000

2000

3000

4000

5000

Tim

e (

seco

nds)

Processing Unit

1X (18 images)

3X (54 images)

5X (90 images)

10X (180 images)

OSM Spatial Join Imaging Cross-Matching

0 20 40 60 80 100 120 140 160 180 200400

600

800

1000

1200

1400

1600

1800

2000

2200

2400

Processing Unit

Tim

e (

seco

nds)

Performance

System Performance: Scalability

Nearest Neighbor

1.3TB 44GB

42GB

42GB

Page 24: High Performance Spatial Queries and Analytics for …yellowstone.cs.ucla.edu/symposium/2014/fusheng.pdf•Integrating declarative query languages with MapReduce is a major trend –Hive,

Integration with Apache Hive

• Integrating declarative query languages with MapReduce is a major trend

– Hive, Pig/Latin, Scope, Impala, Shark, Ysmart…

• Hive is a data warehouse infrastructure built on top of Hadoop, with a SQL like query language (QL)

• Hive provides major aggregation operations, and support user defined functions and data compression

• No spatial query support

Software

Goal: Integrate spatial data processing into Hive

Page 25: High Performance Spatial Queries and Analytics for …yellowstone.cs.ucla.edu/symposium/2014/fusheng.pdf•Integrating declarative query languages with MapReduce is a major trend –Hive,

Hadoop-GIS Architecture

Software

2012

2013

2014

Page 26: High Performance Spatial Queries and Analytics for …yellowstone.cs.ucla.edu/symposium/2014/fusheng.pdf•Integrating declarative query languages with MapReduce is a major trend –Hive,

GPU Accelerated Spatial Queries

GPU Computing

– Massively parallel, hundreds to thousands of cores

– Cheap and highly available

– Programmable: CUDA, OpenCL

Spatial Join Profiling

Goal: Exploit massive GPU parallelism and maximize data parallelism to accelerate spatial queries

GPU Accelerated Spatial Operations

Page 27: High Performance Spatial Queries and Analytics for …yellowstone.cs.ucla.edu/symposium/2014/fusheng.pdf•Integrating declarative query languages with MapReduce is a major trend –Hive,

PixelBox Algorithms for Spatial Cross-Matching

PixelBox

GPU Accelerated Spatial Queries

p q p q p q

Monte-Carlo approach

• Speed up on cross-matching for one image on a single GPU (512 cores): 120X

[VLDB’12]

Page 28: High Performance Spatial Queries and Analytics for …yellowstone.cs.ucla.edu/symposium/2014/fusheng.pdf•Integrating declarative query languages with MapReduce is a major trend –Hive,

Ongoing Work (NSF CAREER)

• Create a high performance software system for spatial queries and analytics of spatial big data on MapReduce and CPU-GPU hybrid platforms

• Promote the use of the created open source software to support problem solving in multiple disciplines, and educate the next generation workforce in big data

– Spatial and location based services (with Pitney Bowes)

– 3D Imaging GIS (with Leeds, Emory)

– Social media spatial analytics (twitter data)

– Complex spatial analytics: spatial clustering, regression

– Hybrid GPU-CPU processing

Ongoing Work

Page 29: High Performance Spatial Queries and Analytics for …yellowstone.cs.ucla.edu/symposium/2014/fusheng.pdf•Integrating declarative query languages with MapReduce is a major trend –Hive,

Thank you!

https://github.com/EmoryUniversity/libhadoopgis