
Towards Building a High Performance Spatial Query System for Large Scale Medical Imaging Data

Ablimit Aji† Fusheng Wang‡ Joel H. Saltz‡
†Department of Mathematics & Computer Science, Emory University

‡Department of Biomedical Informatics, Emory University
‡Center for Comprehensive Informatics, Emory University

{aaji,fusheng.wang,jhsaltz}@emory.edu

ABSTRACT
Support of high performance queries on large volumes of scientific spatial data is becoming increasingly important in many applications. This growth is driven not only by geospatial problems in numerous fields, but also by emerging scientific applications that are increasingly data- and compute-intensive. For example, digital pathology imaging has become an emerging field during the past decade, where examination of high resolution images of human tissue specimens enables more effective diagnosis, prediction and treatment of diseases. Systematic analysis of large-scale pathology images generates tremendous amounts of spatially derived quantifications of micro-anatomic objects, such as nuclei, blood vessels, and tissue regions. Analytical pathology imaging provides high potential to support image based computer aided diagnosis. One major requirement for this is effective querying of such an enormous amount of data with fast response, which is faced with two major challenges: the “big data” challenge and high computational complexity. In this paper, we present our work towards building a high performance spatial query system for querying massive spatial data on MapReduce. Our framework takes an on demand index building approach for processing spatial queries and a partition-merge approach for building parallel spatial query pipelines, which fits nicely with the computing model of MapReduce. We demonstrate our framework on supporting multi-way spatial joins for algorithm evaluation and nearest neighbor queries for microanatomic objects. To reduce query response time, we propose cost based query optimization to mitigate the effect of data skew. Our experiments show that the framework can efficiently support complex analytical spatial queries on MapReduce.

Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications—Spatial Database and GIS, Systems, Scientific Databases

General Terms
Design, Management, Experimentation, Performance

Keywords
Spatial Query Processing, MapReduce, Pathology Imaging, Data Skew

1. INTRODUCTION
Support of high performance queries and analytics on large volumes of spatial data becomes increasingly important for many applications. This growth is driven not only by geospatial problems in numerous fields but also by emerging scientific applications that are increasingly data- and compute-intensive.

Pathology is a medical subspecialty that practices the diagnosis of disease. Microscopic examination of tissue reveals information enabling the pathologist to render accurate diagnoses and to guide therapy. The basic process by which anatomic pathologists render diagnoses has remained relatively unchanged over the last century. However, recent advances in digital pathology imaging, specifically in the arena of whole slide imaging, have initiated the transition to digital pathology practice. Devices that can acquire high-resolution images from whole tissue slides and tissue microarrays have become more affordable, faster, and practical, and practices will increasingly adopt this technology and eventually produce an explosion of data to be used in healthcare informatics. In coming decades, results from pathology image analysis will emerge as a new type of healthcare records, and begin to provide diagnostic assistance, identify therapeutic targets and predict patient outcomes and therapeutic responses.

Systematic analysis of large-scale microscopy images can involve many interrelated analyses, generating tremendous amounts of spatially derived quantifications for microanatomic objects such as cells, nuclei, blood vessels, and tissue regions. Analysis results will be archived and frequently queried to support multiple types of studies and diagnosis. Common queries include aggregation of features, traditional “GIS” like queries, and complex spatial queries. For example, there are spatial cross-matching queries of multiple sets of segmented spatial objects for algorithm evaluation, spatial proximity queries between micro-anatomic objects, and global spatial pattern mining in whole images. However, microscopic imaging has been underutilized in healthcare settings. One major obstacle which tends to reduce wider adoption of these new technologies throughout the clinical and scientific communities is managing such enormous amounts of data and querying them efficiently. Major challenges include the “big data” challenge and the high complexity of queries. A typical microscopy whole slide image (WSI) can contain 100,000 × 100,000 pixels, and a single image may contain millions of microanatomic objects and hundreds of millions of features. A moderate-size healthcare operation can routinely generate thousands of whole slide images per day, which can lead to several terabytes of derived analytical results. Spatially oriented queries involve heavy geometric computations for spatial filtering and measurements, and require high performance computation to support fast query response.


High performance computing capabilities are fundamental to efficient handling of massive spatial datasets and to the short response times required or preferred for many applications. Traditional spatial database management systems (SDBMSs) have major limitations on managing and querying large scale scientific spatial data. SDBMSs are often extended from traditional relational DBMSs with a tightly integrated architecture. Scalable spatial data management thus can rely on parallel relational database architectures, such as the shared-nothing architecture, for managing and querying the data. The SDBMS approach has several limitations on achieving high performance spatial queries. Parallel SDBMSs tend to reduce the I/O bottleneck through partitioning of data on multiple parallel disks and are not optimized for computationally intensive operations such as geometric computations. For example, our study shows that for a spatial join query, about 90% of the time is spent on computation [35]. SDBMSs also support limited spatial access methods, and it is difficult to efficiently support complex queries which could be more efficiently supported by other types of access methods or query pipelines. The partitioning based parallel DBMS architecture also lacks effective space based partitioning to balance data and task loads across database partitions. The high overhead of data loading is another major bottleneck for SDBMS based solutions [27]. Our experiments show that loading the results from a single whole slide image into an SDBMS can take a few minutes to dozens of minutes. Scaling out spatial queries through a large scale parallel database infrastructure was studied in our previous work [34], but the approach is highly expensive in software licensing and dedicated hardware [27, 14, 31], and requires sophisticated tuning and maintenance.

With the rapid advancement of network technologies and the increasingly wide availability of low-cost and high-performance commodity computers and storage systems, large-scale distributed clusters can now be conveniently built to support data- and compute-intensive applications. MapReduce provides a highly scalable, reliable, elastic and cost effective framework for storing and processing massive data. Hadoop, an open-source implementation of MapReduce, has been widely used in practice, especially in major Internet applications to support efficient handling of web-scale data. While the “map” and “reduce” programming model fits nicely with large scale problems which are often divided through space partitioning, spatial queries are intrinsically complex and often rely on effective access methods to reduce the search space and alleviate the high cost of geometric computations. Thus, a significant effort is required to adapt and redesign spatial query methods to take advantage of MapReduce, and to provide a scalable, efficient, expressive, and cost effective spatial querying system.

Our goal is to address the research challenges in delivering a scalable, efficient, expressive spatial query system for efficiently supporting analytical queries on large scale spatial data, and to provide feasible solutions that are affordable for daily operations. Our main contributions include:

• A new hybrid architecture that combines MapReduce and on demand indexing for efficient large scale spatial query support;

• A parallelization oriented query engine that partitions data and space, dynamically selects query pipelines to support diverse spatial queries with optimal access methods in a MapReduce framework;

• System optimization techniques for efficient query execution and skew mitigation.

Another major contribution is the support of declarative spatial queries with automated query translation to MapReduce, which will be briefly discussed in this paper.

The rest of the paper is organized as follows. In Section 2, we outline common spatial query cases in analytical medical imaging and the main research challenges. In Section 3, we describe our proposed system and its architectural components. In Section 4, we provide a detailed description of the specific query types supported by the system and the query processing workflows. In Section 5, we evaluate the system with a set of queries on real world datasets. In Section 6, we discuss data skew and how it is mitigated in our system, followed by related work and conclusion.

2. BACKGROUND
Systematic analysis of large-scale image data can involve many interrelated analyses, generating tremendous amounts of quantifications such as spatial objects and features, as well as classifications of the quantified attributes. For example, pathology image analysis offers a means of rapidly carrying out quantitative, reproducible measurements of micro-anatomical features in high-resolution pathology images and large image datasets [18, 12]. While the analysis pipeline may vary depending on the actual implementation within an institution, the process generally involves the following operations.

i) Object Segmentation. Entities such as cell nuclei are detected, and their boundaries are identified. In our query pipelines all spatial data are stored and managed in vector formats such as WKT.

ii) Region Segmentation. Often the entities to be segmented are composed of collections of simple objects and structures which are defined by a complex textural appearance. Examples include identifying the boundaries of blood vessels, lesions, and inflammation.

iii) Feature Extraction. A collection of characteristic features, such as shape and texture, is calculated and extracted for each object, such as a nucleus, to form a feature vector. Classifications are computed based on image features or region classification algorithms.

iv) Data Management & Queries. A data management system is typically utilized to efficiently manage and query the derived data to support data retrieval, analysis and exploration.

Figure 1: Example spatial query cases in analytical medical imaging (best viewed in color): (a) spatial join between multiple datasets (green versus red); (b) find closest blood vessels (green) to each cell (purple spots)

2.1 Query Cases
There are many types of queries to be supported on the spatially derived data, summarized as follows: i) selection and feature aggregation over regions; ii) spatial cross-matching or spatial join of objects; iii) spatial proximity between objects; and iv) global spatial pattern discovery. Next we explain the four typical query cases; in this paper, we mainly focus on the second and third query cases.


Spatial Selection and Feature Aggregation. Aggregation or summary statistics on computed features are frequently calculated for spatial applications. These queries are often implemented with a spatial filtering operation such as spatial containment, followed by a feature aggregation query on qualified objects. This query type can be taken as a special case of spatial join combined with traditional structured data queries.

Spatial Join or Spatial Cross-Matching. There are many types of spatial join operations based on topological relationships, such as contains, within, intersects, and touches, which find correlations between multiple datasets of spatial objects. A spatial cross-matching problem involves identification and comparison of spatially derived objects belonging to different observations or analyses. Spatial cross-matching in the domain of digital sky surveys aims at performing one-to-one matches in order to combine physical properties or to study the temporal evolution of the source [22]. In the domain of digital pathology, spatial cross-matching of segmented spatial objects (microanatomic objects) from different methods provides a powerful approach for testing, evaluating and iteratively developing high quality algorithms to support biomedical research and computer aided diagnosis, and can be used in the following scenarios. i) Algorithm Validation. Algorithms are tested, evaluated and improved in an iterative manner by validating algorithm results such as segmentations against human annotations made by pathologists. ii) Algorithm Consolidation. Multiple algorithms can be developed in a study to solve the same problem. Different algorithm results are aggregated and combined to generate more confident analysis results. iii) Algorithm Sensitivity Study. An algorithm often includes a set of parameters that can be adjusted to adapt to different types, resolutions, and qualities of images. Exploring the sensitivity of analysis output with respect to parameter adjustments can provide a guideline for the best deployment of algorithms in different scenarios and for rapid development of robust algorithms. Figure 1(a) shows an illustrative example of a cross-matching query in which the common area between intersecting polygons from two result sets, computed by two different methods on the same image, is measured. Cross-matching usually involves millions or billions of spatial objects, making it one of the most challenging spatial queries.

Spatial Proximity Between Objects. Objects with spatial proximity often form correlation groups or targets of interest. For example, micro-anatomic objects with spatial proximity, such as groups of cells close to blood vessels, are often biologically correlated. A useful example query is the following: for each stem cell, find the nearest blood vessel, compute the variation of intensity of each biological property associated with the cell with respect to the distance, and return the density distribution of blood vessels around each cell. The spatial proximity graph in Figure 1(b) shows an example of blood vessels (in red) and stem cells (small purple spots). This query will involve millions of cells for a single image.

Global Spatial Pattern Discovery. The goal of spatial pattern discovery is to detect and quantify patterns that are significant and different from others. An example is the detection of spatial regions with high scores according to some density measures or based on certain statistical testing criteria. Consider the study of brain tumors.
Tumor growth comes with necrosis and vascular proliferation, which often form spatial patterns during different stages of tumor growth: pseudopalisades [6] in glioblastoma brain tumors appear as ring-enhancing lesions where the rings have a much higher concentration of cells than adjacent regions. By analyzing the spatial distribution patterns of cells, it is possible to automate the identification of tumor subtypes and their characteristics.

2.2 Challenges

With the rapid improvement of instrument resolutions and the accuracy of image analysis methods, such spatial queries are increasingly compute- and data-intensive.

High Spatial and Geometric Computation Complexity. Most spatial queries involve geometric computations, which are often compute-intensive. Geometric computations are not only used for returning measurements or generating new spatial objects, but also used in logical operations for topological relationships. For example, cross-matching spatial objects is a typical spatial join which first identifies intersecting polygon pairs (topology relationship verification) and then measures the ratio of overlapping areas. A naive brute force approach for such matching is extremely expensive and may take hours or days to compute even for a single image [33]. This is mainly due to the polynomial complexity of common computational geometry algorithms used for verifying intersection of polygon pairs, where each shape representation contains hundreds of points. To minimize the computational cost, effective spatial access methods are critical for supporting queries, and a high performance architecture is essential to provide parallel processing of spatial queries.

The “Big Data” Challenge. High resolution whole slide images generated from high resolution tissue scanners provide rich information about spatial objects and their associated features. Whole slide images at diagnostic resolution are very large: a typical image can contain 100,000 × 100,000 pixels. One image may contain millions of microanatomic objects, and hundreds of image features could be extracted for each object. A study may involve hundreds to thousands of images obtained from a large cohort of subjects. For large scale interrelated analysis, there may be dozens of algorithms with varying parameters generating many different result sets to be compared and consolidated. Thus, the derived data from images of a single study is often on the scale of tens of terabytes, and petabytes of data are likely when analytical pathology imaging is adopted in the clinical environment in the future. Managing and querying such large volumes of data, combined with the complexity of spatial queries, poses new research challenges on effective spatial query systems for big spatial data.

3. ARCHITECTURE OVERVIEW

Figure 2: System Architecture

We develop a MapReduce based framework to support expressive and cost effective high performance spatial queries. The framework includes a real-time spatial query engine consisting of a variety of optimized access methods, boundary and density aware spatial data partitioning (under development), a declarative query language interface, a query translator which automatically translates spatial queries into MapReduce programs, and an execution engine which parallelizes and executes queries on Hadoop. Figure 2 shows an architectural overview of the system. Data is partitioned and staged on HDFS for parallel access. Users interact with the system by submitting jobs in a declarative query language like SQL. The queries are translated into MapReduce code with special handling of spatial operators and optimized for fast query response. The system then relies on Hadoop for query execution and utilizes a stand-alone spatial query engine for spatial query processing. Next we discuss the core components of the system in detail.

3.1 YSmart-S: Spatial Query Translator
The MapReduce framework simplifies distributed application development by providing two simplified data transformation functions – map and reduce. While this low level programming interface provides flexibility, it requires significant programming effort. Debugging MapReduce code in a distributed environment is slow and inefficient. A declarative query language would greatly simplify the query interface, reduce the programming effort and boost developer productivity. Recently, such SQL-like query languages and their translators [25, 32, 11] have been widely used in industrial production systems. We build a spatial query translator, YSmart-S, by extending YSmart – an open source SQL-to-MapReduce translator [20] – with spatial query capabilities. Thus, users can interact with the system by submitting SQL queries for most of their query needs, yet they can write custom MapReduce code whenever they need new query functionality.

3.2 RESQUE: Real-Time Spatial Query Engine
To support high performance spatial queries, a stand-alone spatial query engine is developed to efficiently support the following infrastructure operations: i) spatial relationship comparisons, such as intersects, touches, overlaps, contains, within, and disjoint; ii) spatial measurements, such as intersection, union, convexHull, distance, centroid, and area; iii) spatial access methods for efficient query processing, such as building and querying R*-Tree and Voronoi diagram structures. The engine is compiled as a shared library and can be easily deployed on multiple cluster nodes.

Due to the high computational complexity of spatial queries, spatial query processing techniques traditionally employ a partition-filter-refine approach without creating an index on input datasets [26, 36]. In RESQUE, we take a hybrid approach in which spatial indexes are created on the fly (when needed) and used to accelerate spatial queries. The index creation overhead, as shown in our experiments, only accounts for a small fraction of overall query response time.
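For illustration only, the sketch below shows how a tile-level MBR index can be built on demand and used for the filtering step, using the Python rtree package (a libspatialindex binding) rather than RESQUE itself; the function names and data layout are ours and not part of the system.

from rtree import index  # Python bindings for libspatialindex

def build_tile_index(mbrs):
    # Bulk-load an R-tree from (object_id, (minx, miny, maxx, maxy)) pairs;
    # streaming the entries into the constructor triggers bulk loading.
    return index.Index((obj_id, mbr, None) for obj_id, mbr in mbrs)

def candidate_pairs(idx, probe_mbrs):
    # Filter step: ids of indexed objects whose MBRs overlap each probe MBR.
    return {probe_id: list(idx.intersection(mbr)) for probe_id, mbr in probe_mbrs}

# Index one dataset of a tile, then probe it with another dataset's MBRs.
dataset_a = [(1, (0, 0, 2, 2)), (2, (5, 5, 6, 6))]
dataset_b = [(10, (1, 1, 3, 3))]
print(candidate_pairs(build_tile_index(dataset_a), dataset_b))  # {10: [1]}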

3.3 Data Partitioning and Staging
Data partitioning with tiling is a standard practice [13] in managing large amounts of spatial data, and it can speed up many spatial queries. For example, to process a window query, we may only read the partitions which are relevant to the query window, whereas a naive approach would scan the whole data table. In the pathology image analysis stage, each image is decomposed into N fixed-size regular tiles (top left in Figure 2). By default, each algorithm run on one tile of a partitioned image creates one boundary result file. We propose to merge all small tile based result files for each image into a single large file, and then stage the merged file onto HDFS, where each spatial object is assigned an internal tile id. While it may be appealing to directly stage the data as individual tiles, there are several problems associated with this approach. First, Hadoop is optimized for batch oriented processing, and partitioning the data into a large number of small chunks is detrimental to query performance. Second, in Hadoop the location metadata of each file split is stored in the main memory of the namenode for fast file access. A large number of small files generates a significant amount of metadata, which quickly uses up the namenode main memory and affects system stability and performance.
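A minimal sketch of this staging step follows, assuming one WKT boundary file per tile named by its tile id; the file layout and helper names are hypothetical.

import glob, os

def merge_tile_results(image_dir, merged_path):
    # Concatenate per-tile result files into one image-level file,
    # prefixing every spatial object with its internal tile id.
    with open(merged_path, "w") as out:
        for tile_file in sorted(glob.glob(os.path.join(image_dir, "*.wkt"))):
            tile_id = os.path.splitext(os.path.basename(tile_file))[0]
            with open(tile_file) as f:
                for line in f:                      # one WKT boundary per line
                    out.write(f"{tile_id}\t{line.rstrip()}\n")

# The merged file is then staged onto HDFS, e.g.:
#   hdfs dfs -put IMG1_merged.tsv /pathology/markups/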

Tiling of pathology images offers a convenient way for managing large scale image sets, and it also increases the level of parallelism for query processing. Spatial objects on the boundary of tiles may need special handling in many applications. In our system, we ignore the objects across partitioning boundaries for two reasons. First, boundary objects are discarded during the upstream image analysis steps. Most of the upstream analysis steps, such as image segmentation, use in-memory algorithms which require the input to be small enough (tiles) to be processed in memory. While it is technically possible to reconstruct boundary objects with additional processing, they are generally discarded due to the extra computational effort and to simplify the analysis pipeline. Second, as there is a large number of microanatomic objects in each image, pathology imaging based studies often take a statistics-based approach, where the result is not impacted by the small fraction of boundary objects.

4. QUERY PROCESSING

4.1 Complex Query Types

Figure 3: Different spatial query types: (a) star join query; (b) clique join query

Spatial joins play an important role in effective spatial query processing for analytical pathology imaging. A pairwise spatial join or two-way spatial join combines two datasets with respect to some spatial predicates. Multiway spatial joins involve more than two spatial inputs and an arbitrary number of join predicates. For example, in Figure 3, the spatial relation R0 is joined with three other relations with a predicate of intersects.

Depending on the actual join condition, the query graph may take different shapes, such as: i) chain, ii) star, iii) clique, and iv) a combination of the above. The shape of the query graph dictates the complexity of join processing. Queries with complex topological relationships are more expensive to evaluate. Here, we mainly focus on star and clique joins as shown in Figure 3. The reason is twofold. First, our experience indicates that star and clique queries are very common in spatial cross-matching and other spatial analytical tasks. Second, a complex query graph can be decomposed into a combination of several star and clique query graphs. Thus, developing effective query evaluation techniques for these two types of queries can serve as a building block towards more complex query evaluation.

4.2 Join Processing
Spatial predicate checking is computationally expensive, and spatial objects are generally complex to represent. Reading and writing spatial data incurs significant I/O overhead. Therefore, most spatial query processing techniques take a filter-and-refine approach to reduce unnecessary computation and I/O cost. The general processing pipeline is as follows. First, spatial objects are filtered with approximate processing, such as MBR (Minimum Bounding Rectangle) based filtering, to eliminate object pairs which do not satisfy the join predicate. Next, remaining candidate objects from the filtering step are further refined with accurate geometry computation. Finally, objects which satisfy the join condition are pushed to downstream processing such as aggregation or grouping.
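The sketch below illustrates this filter-and-refine pattern on two small lists of polygons, using the Shapely library for the exact geometry tests in place of RESQUE; the function names are illustrative only.

from shapely.geometry import Polygon

def mbr_intersects(a, b):
    # Filter step: compare axis-aligned MBRs (minx, miny, maxx, maxy).
    ax0, ay0, ax1, ay1 = a.bounds
    bx0, by0, bx1, by1 = b.bounds
    return ax0 <= bx1 and bx0 <= ax1 and ay0 <= by1 and by0 <= ay1

def filter_refine_join(polys_a, polys_b):
    # Return (i, j, overlap_area) for polygon pairs that truly intersect.
    results = []
    for i, pa in enumerate(polys_a):
        for j, pb in enumerate(polys_b):
            if not mbr_intersects(pa, pb):   # cheap filter
                continue
            if pa.intersects(pb):            # exact refinement
                results.append((i, j, pa.intersection(pb).area))
    return results

a = [Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])]
b = [Polygon([(1, 1), (3, 1), (3, 3), (1, 3)])]
print(filter_refine_join(a, b))  # [(0, 0, 1.0)]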


Numerous spatial join algorithms have been developed in the past three decades, and we refer interested readers to [16] for a comprehensive overview. One class of algorithms utilizes spatial indexing to process the join operation. A representative example from this class is the R-Tree based synchronized traversal algorithm [7], which is available in major SDBMSs such as Oracle Spatial, MySQL and PostGIS. Another class of algorithms assumes that spatial indexes are not available on the input datasets and utilizes in-memory join algorithms with external data partitioning. Representative examples from this class include PBSM [26] and SSSJ [4]. Due to space restrictions, we skip detailed descriptions of these algorithms here and mainly focus on how they can be adapted to the MapReduce framework to support efficient spatial query processing for analytical pathology imaging.

SELECT ST_AREA(ST_INTERSECTION(ta.polygon, tb.polygon)),
       ST_AREA(ST_INTERSECTION(ta.polygon, tc.polygon))
FROM markups ta
JOIN markups tb ON (ST_INTERSECTS(ta.polygon, tb.polygon) = TRUE)
JOIN markups tc ON (ST_INTERSECTS(ta.polygon, tc.polygon) = TRUE)
WHERE ta.provenance = 'A1' AND tb.provenance = 'A2' AND tc.provenance = 'A3'
  AND ta.dataset_id = 'IMG1' AND tb.dataset_id = 'IMG1' AND tc.dataset_id = 'IMG1';

Figure 4: An example spatial join query in SQL

Consider the SQL join query shown in Figure 4, where three datasets, generated from different algorithms, are cross-matched to compare the segmentation similarity of different algorithm results on the same set of images. This is a typical multiway star join query with a join cardinality of three. One straightforward way to evaluate this query is to decompose it into two pairwise joins and evaluate them separately. However, in such a left-deep query plan the intermediate results need to be materialized to HDFS, which incurs significant I/O cost. Therefore, in our system, we take a different approach in which such a plan is translated into a bushy plan to process multiple datasets at once.

4.2.1 R*-Tree Join Processing
In an R*-Tree, each non-leaf node of the tree stores pointers to its child nodes and the corresponding MBRs, while each leaf node stores pointers to the actual spatial objects and their MBRs. We modify and extend the SpatialIndex library [2] for building R*-Tree indexes. The input data and indexes are read-only and no further updates are needed. Therefore, we apply bulk-loading techniques [5] in the R*-Tree building process.

To join the input datasets, we use the synchronized tree traversal algorithm [7]. Given two R*-Trees as input indexes, the algorithm starts from the root nodes and recursively checks each pair of nodes from the two indexes. If the MBRs of a pair of nodes intersect, it continues to join these two nodes and checks their child nodes. The process is repeated until the leaf nodes are reached. The algorithm then checks each pair of the polygons indexed in these two leaf nodes to find all pairs of polygons whose MBRs intersect. Algorithm 1 describes the details. Extension of this algorithm to multiple inputs is straightforward. So far we have described how to process a join of m sets of input data in a single process. Next we describe how this algorithm is adapted to the MapReduce framework to process large scale data.

Algorithm 1: R*-Tree Join Algorithm
Input: rtreeA, rtreeB
priorityQueue = init_Priority_Queue();
priorityQueue.addPair(rtreeA.root(), rtreeB.root());
while priorityQueue is NOT empty do
    joinPair = priorityQueue.pop();
    foreach pa ∈ joinPair.first().children() do
        foreach pb ∈ joinPair.second().children() do
            if intersects(pa, pb) then
                if pa and pb are leaf nodes then
                    report_Intersection(pa, pb);
                else
                    priorityQueue.addPair(pa, pb);

We implement the spatial join operation as a reduce-side join. Specifically, in the map phase, each Map task processes a chunk of input data and emits each record as the output value, with the tile id as the key. Thus, after the shuffle phase, spatial objects from different input datasets but belonging to the same tile end up in the same partition, which will be processed by the same Reduce task. In the Reduce phase, each Reduce task reads the partition assigned to it, invokes RESQUE to build a spatial index for each tile, and invokes RESQUE to perform the join operation. Detailed descriptions are shown in Algorithms 2 and 3.

Algorithm 2: Map Function
Data: set of records from input table ta, tb or tc
Input: ki, vi
(provenance, dataset_id, tile_uid, polygon) = parse(vi);
if dataset_id == 'IMG1' then
    km = tile_uid;
    if provenance == 'A1' then
        vm = ('ta', polygon);
    else if provenance == 'A2' then
        vm = ('tb', polygon);
    else if provenance == 'A3' then
        vm = ('tc', polygon);
    else
        return;
    emit(km, vm);

Algorithm 3: Reduce Function
Input: ki, vi
tile_a = extract('ta', vi);
tile_b = extract('tb', vi);
tile_c = extract('tc', vi);
// build R*-Tree indexes on each tile
idxa = RESQUE.build_index(tile_a);
idxb = RESQUE.build_index(tile_b);
idxc = RESQUE.build_index(tile_c);
// execute queries using spatial indexes
result = RESQUE.execute_query(idxa, idxb, idxc);
// final output
parse result and output to HDFS;

4.2.2 PBSM: Partition Based Spatial Merge Join
An in-memory algorithm is proposed in [26] to process spatial join queries. The first step in this algorithm is to partition the universe of spatial data into tiles such that each tile eventually fits into main memory. Then, corresponding tiles from multiple datasets are brought into memory for processing, where a simple nested loop join algorithm is used in combination with MBR based filtering. The partition step is very important for the efficiency of this algorithm. If the spatial data distribution is skewed, some tiles can take much longer to process than others, and the overall runtime is bounded by the finish time of such stragglers. In our case, since the input datasets are already tile-partitioned, the algorithm only needs to perform the join operation. The join processing workflow is very similar to Algorithms 2 and 3, except that no index is created in the reduce function.

4.3 Nearest Neighbor Query
Nearest neighbor (NN) search has broad applications, and in analytical imaging it can be computationally expensive. An example query for a pathologist is “for each cell, return the nearest blood vessels and the distances”. Such queries are helpful in understanding correlations between spatial proximity and cell features and can be answered with a nearest neighbor search algorithm. Nearest neighbor queries have been studied in spatial database settings for a long time. However, most of the work has focused on point data, where spatial objects are approximated or provided as points in space. The points are then stored in a spatial database together with the features associated with them for querying. This approach greatly simplifies the problems associated with managing large amounts of complex spatial data, and is applicable in scenarios where such approximation is sufficient. However, the approach does not apply to analytical pathology imaging. Certain objects, such as blood vessels (green markups in Figure 1(b)), cannot be approximated as points, as such approximation would lead to loss of critical spatial information. Moreover, nearest neighbor query results would be completely different if we naively approximated blood vessels as points.

Spatial access methods are widely used to support point NN queries, and a number of algorithms have been developed. Generally, these algorithms rely on the clustering properties of neighboring points and try to prune the search space to quickly arrive at the neighborhood of the query point. In our system, we provide two algorithms for efficient nearest neighbor query support.

4.3.1 NN Search with R*-Tree
In the R-Tree, two metrics are defined to speed up the nearest neighbor search process, namely mindist and minmaxdist. These metrics are used to prune as many of the R-Tree nodes as possible during both the downward searching process and the upward refining process. Details of the algorithm can be found in [29].

An approach similar to the R*-Tree join processing can be used to support nearest neighbor queries in MapReduce. However, tile based partitioning is not directly applicable in this scenario. Specifically, after such a partitioning, the nearest neighbor of an object may reside in another tile. Thus, if the nearest neighbors are processed independently, we may not get the correct result. There are multiple ways to remedy this problem. One approach is to process the query in multiple passes, such that in the first pass only the index building process is initiated. In the second pass, partial indexes from the first pass are merged and replicated to other nodes. Thus, after several passes, each node would gather enough information to answer the query.

In the analytical pathology imaging setting, there are generally far fewer target objects, which are returned as the nearest neighbors, than source objects. Consider the example of querying the nearest blood vessel for each cell. The number of blood vessels (hundreds or thousands) is much smaller than the number of cells (millions). In this case, locating the target nearest neighbor is very fast, whereas most of the query time is spent iterating over millions of source objects. Therefore, we take a simple approach in which only the source object set is partitioned, and the target object set is replicated and distributed to the cluster nodes. Thus, each partition has a “global view” of the target search space and can carry out the nearest neighbor search without any communication overhead between nodes. In the Map phase, source objects are partitioned and target objects are replicated. The reduce phase of the algorithm is described in Algorithm 4.

Algorithm 4: Reduce Function
Input: ki, vi
tile = extract_source_objects(vi);
k = get_K(vi);
tar = read target objects from HDFS;
// build R*-Tree index on target objects
idx = RESQUE.build_index(tar);
// execute queries using spatial indexes
result = RESQUE.execute_kNN_query(idx, tile, k);
// final output
output result to HDFS;
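A simplified, single-tile sketch of this reduce-side computation is given below; it uses brute-force distance tests with the Shapely library in place of the R*-Tree search inside RESQUE, and the names are illustrative.

from shapely import wkt

def nearest_vessels(cell_wkts, vessel_wkts):
    # For each cell in a tile, return the nearest blood vessel and the distance.
    # The vessel list plays the role of the replicated target object set.
    vessels = [wkt.loads(w) for w in vessel_wkts]
    results = []
    for cw in cell_wkts:
        cell = wkt.loads(cw)
        best = min(vessels, key=cell.distance)   # stand-in for the indexed search
        results.append((cw, best.wkt, cell.distance(best)))
    return results

cells = ["POINT (1 1)", "POINT (8 2)"]
vessels = ["LINESTRING (0 3, 10 3)", "LINESTRING (9 0, 9 10)"]
for cell, vessel, d in nearest_vessels(cells, vessels):
    print(cell, "->", vessel, round(d, 2))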

4.3.2 Voronoi Diagram
The Voronoi diagram [24] has been extensively studied in computational geometry and spatio-temporal database settings to support nearest neighbor queries. Given a set of input sites, typically points on the plane, the Voronoi diagram divides the space into disjoint polygons such that the nearest neighbor of any point inside a polygon is the site which generated that polygon. These polygons are called Voronoi polygons, and the edges shared by adjacent Voronoi polygons define regions equidistant from the two corresponding sites. A number of algorithms have been proposed to compute Voronoi diagrams, and the best known algorithms run in O(n log n) time, which matches the lower bound, where n is the number of input line segments needed for computing the Voronoi diagram.

Figure 5: Workflow of nearest neighbor search with Voronoi diagram

To answer the example nearest neighbor query, target objects (blood vessels) are replicated among the cluster nodes for index construction. Source objects (cells) are partitioned with tiling and distributed among the nodes participating in the computation. Similar to the R*-Tree nearest neighbor query processing, a reducer first builds the Voronoi diagram for the blood vessels, which are represented as a set of line segments. Then, for each cell in a given partition, the reducer queries the nearest blood vessel segments and computes the distance. To efficiently locate the Voronoi polygons, the Voronoi diagram is clipped to the size of a tile on each reducer. The clipping coordinates are extracted from the query tile which contains the cells. Figure 5 illustrates the processing workflow of the nearest neighbor query with the Voronoi diagram.

The replication of target objects and computation of the Voronoi diagram for the same set of objects on every node may seem to cause extra overhead. There are two reasons why we do not also partition the target objects to achieve a higher level of parallelism. First, construction of the Voronoi diagram is fairly fast because the number of target objects is much smaller than the number of source objects. In our current dataset, target objects – blood vessels – roughly account for 0.1% of all spatial objects. In this case, the extra effort to parallelize the Voronoi diagram construction process may not justify itself. Even if the Voronoi diagrams were built in parallel, extra post-processing would be needed to merge the partial Voronoi diagrams. Second, it would complicate the SQL-to-MapReduce translator. Given these considerations, we do not parallelize the index building process in our current system.

5. EXPERIMENTAL EVALUATION

5.1 Experimental setup
We evaluate query performance and scalability of the system on our in-house cluster, which comes with 10 physical nodes and 192 cores (AMD 6172, 2.1 GHz). Cluster nodes are connected with gigabit ethernet, and each node is configured with CentOS 5.6 (64 bit) and Cloudera Hadoop-0.20.2-cdh3u2. The Boost 1.48.0 and CGAL 3.8 libraries are used to support geometry computation and spatial measurement in RESQUE. We extend the SpatialIndex library 1.6.0 [2] for index building and supporting spatial joins. The configuration parameters for Hadoop are: data split size = 64 MB, HDFS replication factor = 3, concurrent maps/core = 1, and main memory/node = 128 GB.

We use two datasets of whole slide images for brain tumor study provided by Emory University Hospital and TCGA (The Cancer Genome Atlas). The dataset for testing join query performance is a set of 18 images with diverse disease stages. The average number of nuclei per image is roughly 0.5 million, and each nucleus is represented as a polygon in vector format. The average number of points representing a nucleus is 50. For the nearest neighbor query performance test, we use 50 images (42 GB) from TCGA. Both datasets have similar characteristics. The first dataset comes with polygons of nuclei, and the second dataset comes with polygons of nuclei and blood vessels.

5.2 Join Query Performance
To test the system performance, we use two types of queries with different numbers of input datasets. For STAR-join, a query similar to Figure 4 is issued with different input cardinalities. For CLIQUE-join queries, the WHERE clause is changed to reflect actual clique join predicates. Since both STAR-join and CLIQUE-join are implemented as reduce-side joins, the number of available reduce nodes in the cluster has a significant effect on the query runtime. Therefore, to test scalability of the system, the same query is tested with different numbers of reducers. We do not explicitly manipulate the number of processors available for the Map phase, as it is dictated by the number of input splits.

5.2.1 STAR Query

Figure 6: STAR-Join query performance: runtime (sec) versus number of reducers. (a) R*-Tree, for join cardinalities |join| = 2 to 5; (b) PBSM, for join cardinalities |join| = 2 to 6

Figure 6 shows the query performance for star-shaped multiway spatial joins with different join cardinalities. The horizontal axis represents the number of reducers and the vertical axis represents the query runtime. As the performance numbers show, the system is very efficient. For R*-Tree based join processing, it takes 165 seconds to process two sets of images with 200 reducers – less than 10 seconds per image, whereas the same query takes more than 1000 seconds on a single-process PostGIS for a single image [35]. The system also shows good scalability. In both figures, it is noticeable that the query runtime drops linearly as the number of reducers increases. This effect is more pronounced in the region where the number of reducers ranges between 20 and 80. Interestingly, when the join cardinality increases, the linear relationship between runtime and the number of processing units becomes more apparent, and it saturates as the number of reducers approaches the maximal number of available cluster cores.

We can also notice from the figures that R*-Tree based join has better performance on lower cardinality joins (|join| ≤ 4), whereas PBSM has better performance on higher cardinality joins. This information is useful for query optimization. It can be encoded into the query optimizer as prior knowledge about the algorithm cost, and during query compilation the optimizer can select the best algorithm to run the query depending on the query predicates at runtime. Many modern RDBMS query optimizers come with such a feature, and we are planning to integrate more sophisticated query optimization techniques into the system in the future.

5.2.2 CLIQUE Query

Figure 7: CLIQUE-Join query performance: runtime (sec) versus number of reducers, for join cardinalities |join| = 3 to 6. (a) R*-Tree; (b) PBSM

We test clique query performance on the same dataset we use for the star query, and Figure 7 shows the test results. Again, the system exhibits high performance and good scalability. Surprisingly, the R*-Tree based join processing algorithm performs much better than the PBSM algorithm. As the performance numbers show, in most cases the R*-Tree based join is faster than PBSM by a factor of 5. This is especially true when fewer reducers are used for query processing.

These experiments indicate that our design principles – building indexes on the fly and index-based query processing with implicit parallelization on MapReduce – are well suited for processing massive spatial data.

5.3 Nearest Neighbor Query
To test nearest neighbor query performance, we run the example query “return the distance to the nearest blood vessel from each cell” for each image in the dataset. This query can be parallelized at the image level or at the tile level. We report results for both levels of parallelism in Figure 8. Since the number of images used for this test is 50, more than 50 reducers would not help to increase system performance. As can be seen from both figures, a finer partition granularity offers a higher level of parallelism, which translates into better performance. The system also exhibits good scalability for tile level partitioning. In both figures, the execution time is reduced roughly by half when the number of processing nodes is doubled (from 10 to 20 and from 20 to 40).

Figure 8: Nearest neighbor query performance: runtime (sec) versus number of reducers, comparing image-level and tile-level partitioning. (a) R*-Tree; (b) Voronoi

The experiments also show that the Voronoi based nearest neighbor search is much faster than the R*-Tree based approach. Therefore, the query optimizer can automatically select the Voronoi based approach as the least-cost algorithm for similar query cases.

6. PERFORMANCE OPTIMIZATION

6.1 Skew Reduction
Shared-nothing parallel processing systems, Hadoop for example, can be easily scaled up by adding more nodes to the system. Ideally, the system performance should increase linearly as more nodes are available for computation. In reality, however, it is hard to achieve such linear speed-up due to load balancing issues, in particular skew in data partitioning. The performance charts in the previous section illustrate such cases. There are many reasons why skew may arise in parallel processing. For example, some data records may simply be expensive to process, or some data partitions may contain significantly more data records than others. In pathology imaging, the density of microanatomic objects differs significantly in different tissue regions. When an image is partitioned into tiles, some tiles may end up having significantly more spatial objects than others. During the query processing stage, nodes processing these dense tiles become “stragglers” and have a drastic effect on query performance.

In Hadoop, a hash partitioner is used to partition input records into R buckets, where R is the number of processors available for computation in the system. Unfortunately, such hash partitioning is not guaranteed to generate even buckets where each bucket takes roughly equal time to process. To illustrate the skew problem in analytical pathology imaging, we perform a join performance test where a set of images is processed with 40 reducers and the completion time for each individual reducer is measured. The brown bars in Figure 9(a) (binned for visualization purposes) show that, without any optimization, the actual task completion time for each reducer differs significantly, and the overall system performance is largely affected by the long running tasks.

To remedy the skew problem, we take a cost-based greedy partition approach in which each reducer is assigned a roughly equal amount of work to balance the workload. Consider a simplified version of the query in Figure 4 where two datasets are joined with the predicate intersects, i.e., Q = R ⋈_intersects S. To process this query Q, our system partitions each image into N tiles indexed by I = {1, 2, ..., N}, and each pair of tiles from R and S will be assigned to a reducer for join processing. After the join processing is done, a final aggregation step will be performed:

    Q = R ⋈_intersects S = ⋃_{i=1}^{N} (R_i ⋈_intersects S_i)    (1)

Thus, each reducer will process a set of tiles indexed by P ⊆ I, which we call the workload of a reducer node. Therefore, the query optimizer should generate a query plan that partitions the tiles indexed by I into k workloads such that I = ⋃_{j=1}^{k} P_j and the maximal workload is minimized. There are two problems that need to be solved here. First, how do we estimate the runtime W_j of each workload? Second, assuming that we know the runtime W_j of each workload, how do we solve the partition problem?

This is a classic set partition problem, and it is known to be NP-Hard. In our case, an approximate solution can be sufficient. Therefore, we take a simple approach in which tiles are greedily assigned to k partitions. However, the question of how to estimate the completion time for each tile still remains. Similar to the cost estimation techniques in modern database systems, we use the following formulas to estimate the cost of processing each individual workload:

    W_j = Σ_{i ∈ P_j} Cost(R_i ⋈ S_i)    (2)

    Cost(R_i ⋈ S_i) = α·|R_i| + β·|S_i| + γ    (3)

The coefficients α and β are introduced to reflect individual dataset characteristics. If we make no assumption about the datasets involved in query processing, we can set them to a constant value. The constant cost γ is introduced to account for the cost of transferring the tile contents from other nodes across the network. While it could be tuned to a more accurate value, we simply set it to zero, as in the join processing the predicate checking cost dominates the overall runtime.
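A minimal sketch of this greedy assignment is shown below: tiles are sorted by estimated cost and each is handed to the currently lightest workload (a longest-processing-time style heuristic); the tile counts and coefficient values are placeholders.

import heapq

def estimate_cost(r_count, s_count, alpha=1.0, beta=1.0, gamma=0.0):
    # Cost(R_i join S_i) = alpha*|R_i| + beta*|S_i| + gamma, as in Equation (3).
    return alpha * r_count + beta * s_count + gamma

def greedy_partition(tile_sizes, k):
    # tile_sizes: {tile_id: (|R_i|, |S_i|)}; returns k workloads of tile ids.
    costs = sorted(((estimate_cost(r, s), t) for t, (r, s) in tile_sizes.items()),
                   reverse=True)
    heap = [(0.0, j, []) for j in range(k)]        # (workload cost, id, tiles)
    heapq.heapify(heap)
    for cost, tile in costs:
        total, j, tiles = heapq.heappop(heap)      # lightest workload so far
        tiles.append(tile)
        heapq.heappush(heap, (total + cost, j, tiles))
    return [(total, tiles) for total, _, tiles in sorted(heap)]

tiles = {"t1": (9000, 8000), "t2": (500, 400), "t3": (4000, 3500), "t4": (450, 600)}
print(greedy_partition(tiles, 2))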

The performance results with cost-based task partition optimization are shown in Figure 9. Figure 9(a) shows that the optimized processing is less susceptible to the "straggler" problem: as indicated by the purple bars in the figure, individual partitions finish at roughly the same time. Figure 9(b) and Figure 9(c) compare query performance for different query types on the optimized system and the original system. In those figures, the first horizontal axis represents join cardinality and the second horizontal axis represents the number of active reducers for that run. While there are still cases where the improvement is not significant, overall the optimization considerably reduces job completion time and improves query performance.

6.2 Index Compression

In R*-Tree based spatial query processing, polygons in the leaf nodes are encoded with additional information for retrieval. Each polygon record is represented as (ID, N, Point1, Point2, ..., PointN), where ID is the markup id and N is the number of points. The polygons usually consist of hundreds or thousands of vertices, and two adjacent vertices are usually only a couple of pixels apart in the axis-parallel directions. For example, a polygon can be represented as (10, 60, 40961 8280, 40962 8280, 40962 8281, ..., 40961 8279), where the markup id is 10, the number of points is 60, and the remaining space-delimited number pairs represent (x y) coordinates. With a chain code representation, only the offset between two adjacent points is stored in place of the original coordinates. For example, the example polygon can be represented as (10, 60, 40961 8280, 1 0, 0 1, ..., 2 0). This simple chain code compression approach saves space and reduces I/O. Our experiments show that, by applying such a compression scheme, the storage cost can be reduced by 42%, a significant reduction of I/O during query processing.
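A minimal sketch of this chain code (delta) encoding, assuming a hypothetical polygon and helper names; our actual encoder additionally controls how the small offsets are serialized, which is omitted here:

```python
def encode_polygon(markup_id, points):
    """Keep the first vertex absolute; store every later vertex as an (dx, dy) offset
    from the previous one. Offsets are typically a few pixels, so they compress well."""
    deltas = [(x - px, y - py) for (px, py), (x, y) in zip(points, points[1:])]
    return (markup_id, len(points), points[0], deltas)

def decode_polygon(record):
    """Reverse the delta encoding to recover the original absolute coordinates."""
    markup_id, n, (x, y), deltas = record
    points = [(x, y)]
    for dx, dy in deltas:
        x, y = x + dx, y + dy
        points.append((x, y))
    assert len(points) == n
    return markup_id, points

# Hypothetical boundary fragment: adjacent vertices differ by at most a couple of pixels.
poly = [(40961, 8280), (40962, 8280), (40962, 8281), (40961, 8281)]
encoded = encode_polygon(10, poly)
assert decode_polygon(encoded) == (10, poly)
```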


[Figure 9: Skew in spatial query processing and its mitigation through cost-based task partition. (a) Skew in join query processing: reduce task runtimes (sec), ordered by runtime, with and without optimization. (b) Query optimization in star-join: runtime (sec) versus join cardinality and number of reducers, with and without optimization. (c) Query optimization in clique-join: runtime (sec) versus join cardinality and number of reducers, with and without optimization.]

7. RELATED WORK

Scientific data often comes with spatial aspects [3]; for example, the Large Synoptic Survey Telescope (LSST) generates huge amounts of spatially oriented sky image data, and human atlas projects include [1] among others. Digital microscopy is an emerging technology that has become increasingly important for supporting biomedical research and clinical diagnosis. Several projects target the creation and management of microscopy image databases and the processing of microscopy images. The Virtual Microscope system [10] developed by our group provides support for storage, retrieval, and processing of very large microscopy images on high-performance systems. The Open Microscopy Environment project [15] develops a database-driven system for managing the analysis of biological images, but it is not optimized for large-scale pathology images. Scaling out spatial queries through a large-scale parallel database infrastructure is studied in our previous work [34]. That work demonstrated that although a parallel database architecture can support complex spatial queries to a limited extent, it is highly expensive to scale and difficult to optimize.

A Pig/MapReduce based approach has been studied in [21] for structural queries in astronomy simulation analysis tasks and compared with IDL and DBMS approaches. In [9], an approach is proposed for bulk construction of R-Trees and aerial image quality computation through MapReduce. In [36], a spatial join algorithm on MapReduce is proposed for skewed spatial data without using spatial indexes. The approach first produces tiles with close to uniform distributions, then uses a strip-based plane sweeping algorithm by further partitioning a tile into multiple strips. Joins are performed in memory, with a duplication avoidance technique to remove duplicates across tiles. In our system, tiling is produced at the image analysis step, and it is common practice in pathology imaging to discard objects at tile boundaries, as the final analysis result is a statistical aggregation. We take a hybrid approach that combines partitioning with indexing, and we build spatial indexes on-the-fly for query processing. Our approach is not limited by memory size and provides high efficiency with implicit parallelization through MapReduce.

Partitioning-based approaches for parallelizing spatial joins are also discussed in [26, 38], where no indexing is used. An R-Tree based spatial join is proposed in [8] with a combined shared virtual memory and shared-nothing architecture. Voronoi diagrams and their variations are extensively studied in computational geometry for polygon triangulation and nearest neighbor search for stationary query points [28]. In [17], the authors extend the Voronoi diagram to support kNN queries for spatial network databases and location-based services. Orthogonal to our work, the focus in these areas has been on efficiently supporting moving and continuous kNN queries [23]. More recently, the Voronoi diagram has been combined with the R-Tree to speed up nearest neighbor queries [30].

Comparisons of MapReduce and parallel databases are discussed in [27, 14, 31]. An automatic skew mitigation approach for user-defined MapReduce programs is proposed in [19]. MapReduce systems with high-level declarative languages include Pig [25], SCOPE [11], and HiveQL/Hive [32]. YSmart [20] provides an optimized SQL-to-MapReduce job translation and has recently been patched into Hive. Our system takes an approach that marries a DBMS's spatial indexing and declarative query language with MapReduce.

The previous research closest to our work is [37], in which a multi-tier distributed index is proposed and used in a MapReduce environment. Another related work is [36], in which the authors explore the possibility of performing spatial joins with MapReduce and test their system on a small scale of data.

8. CONCLUSION AND FUTURE WORK

Analytical pathology imaging provides high potential to support biomedical research and computer aided diagnosis, and querying massive spatial data efficiently is among the essential tasks to advance the field. In this paper, we present an overview of a MapReduce based framework to support scalable and high performance spatial queries, and demonstrate its feasibility by supporting two representative spatial query types with high efficiency and scalability. We take a partition-merge based approach supported by on demand indexing and implicit parallelization on MapReduce, which can be applied to many other spatial query types. Our work is generic and can be used to support similar spatial queries from other domains, such as remote sensing and scientific simulations.

Future work includes skew-aware spatial partitioning, adjustment of the query pipeline to consider spatial objects across partitioning boundaries, and optimization of spatial query pipelines on MapReduce. We have implemented a spatial SQL query to MapReduce code translation tool, YSmart-S, by extending YSmart, and we are integrating YSmart-S with the spatial query pipelines. This is an open source project, and we are in the process of releasing the software to the community.

9. ACKNOWLEDGMENTS

This research is supported in part by PHS Grant UL1RR025008 from the CTSA program, by R24HL085343 from the NHLBI, by Grant Numbers 1R01LM011119-01 and R01LM009239 from the NLM, by NCI Contract No. N01-CO-12400 and 94995NBS23 and HHSN261200800001E, by NSF CNS 0615155, 79077CBS10, and


CNS-0403342, and P20 EB000591 by the BISTI program. We thank Jun Kong and Lee Cooper for their support in preparing the data.

10. REFERENCES

[1] The Allen Reference Atlas. http://www.brain-map.org; http://mouse.brain-map.org/api/.
[2] Spatial index library. http://libspatialindex.github.com.
[3] A. Ailamaki, V. Kantere, and D. Dash. Managing scientific data. Commun. ACM, 53, June 2010.
[4] L. Arge, O. Procopiuc, S. Ramaswamy, T. Suel, and J. S. Vitter. Scalable sweeping-based spatial join. In VLDB, pages 570-581, 1998.
[5] J. V. d. Bercken and B. Seeger. An evaluation of generic bulk loading techniques. In VLDB, pages 461-470, 2001.
[6] D. J. Brat, A. A. Castellano-Sanchez, S. B. Hunter, et al. Pseudopalisades in glioblastoma are hypoxic, express extracellular matrix proteases, and are formed by an actively migrating cell population. Cancer Res, 64(3):920-7, 2004.
[7] T. Brinkhoff, H. Kriegel, and B. Seeger. Efficient processing of spatial joins using R-trees. In SIGMOD, pages 237-246, 1993.
[8] T. Brinkhoff, H.-P. Kriegel, and B. Seeger. Parallel processing of spatial joins using R-trees. In ICDE, 1996.
[9] A. Cary, Z. Sun, V. Hristidis, and N. Rishe. Experiences on processing spatial data with MapReduce. In SSDBM, pages 302-319, 2009.
[10] Ü. V. Çatalyürek, M. D. Beynon, C. Chang, T. M. Kurç, A. Sussman, and J. H. Saltz. The virtual microscope. IEEE Transactions on Information Technology in Biomedicine, 7(4):230-248, 2003.
[11] R. Chaiken, B. Jenkins, P. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. SCOPE: easy and efficient parallel processing of massive data sets. PVLDB, 1(2):1265-1276, 2008.
[12] L. A. D. Cooper, J. Kong, D. A. Gutman, F. Wang, et al. Integrated morphologic analysis for the identification and characterization of disease subtypes. J Am Med Inform Assoc., Jan. 2012.
[13] O. De. Image data handling in spatial databases. GeoInfo, page 7, 2002.
[14] J. Dean and S. Ghemawat. MapReduce: a flexible data processing tool. Commun. ACM, 53(1):72-77, 2010.
[15] I. Goldberg, C. Allan, et al. The Open Microscopy Environment (OME) data model and XML file: open tools for informatics and quantitative analysis in biological imaging. Genome Biol., 6(R47), 2005.
[16] E. H. Jacox and H. Samet. Spatial join techniques. ACM Trans. Database Syst., 32(1), Mar. 2007.
[17] M. Kolahdouzan and C. Shahabi. Voronoi-based k nearest neighbor search for spatial network databases. In VLDB, pages 840-851. VLDB Endowment, 2004.
[18] J. Kong, L. A. D. Cooper, F. Wang, D. A. Gutman, et al. Integrative, multimodal analysis of glioblastoma using TCGA molecular data, pathology images, and clinical outcomes. IEEE Trans. Biomed. Engineering, 58(12):3469-3474, 2011.
[19] Y. Kwon, M. Balazinska, B. Howe, and J. Rolia. SkewTune: mitigating skew in MapReduce applications. In SIGMOD, pages 25-36, 2012.
[20] R. Lee, T. Luo, Y. Huai, F. Wang, Y. He, and X. Zhang. YSmart: yet another SQL-to-MapReduce translator. In ICDCS, 2011.
[21] S. Loebman, D. Nunley, Y.-C. Kwon, B. Howe, M. Balazinska, and J. Gardner. Analyzing massive astrophysical datasets: can Pig/Hadoop or a relational DBMS help? In CLUSTER, pages 1-10, 2009.
[22] M. A. Nieto-Santisteban, A. R. Thakar, and A. S. Szalay. Cross-matching very large datasets. In National Science and Technology Council (NSTC) NASA Conference, 2007.
[23] S. Nutanong, R. Zhang, E. Tanin, and L. Kulik. The V*-diagram: a query-dependent approach to moving kNN queries. Proc. VLDB Endow., 1(1):1095-1106, Aug. 2008.
[24] A. Okabe, B. Boots, and K. Sugihara. Spatial Tessellations: Concepts and Applications of Voronoi Diagrams. Wiley & Sons, 1992.
[25] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: a not-so-foreign language for data processing. In SIGMOD, 2008.
[26] J. M. Patel and D. J. DeWitt. Partition based spatial-merge join. In SIGMOD, pages 259-270, 1996.
[27] A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker. A comparison of approaches to large-scale data analysis. In SIGMOD, pages 165-178, 2009.
[28] F. P. Preparata and M. I. Shamos. Computational Geometry: An Introduction. Springer-Verlag, New York, NY, USA, 1985.
[29] N. Roussopoulos, S. Kelley, and F. Vincent. Nearest neighbor queries. In SIGMOD, pages 71-79, 1995.
[30] M. Sharifzadeh and C. Shahabi. VoR-Tree: R-trees with Voronoi diagrams for efficient processing of spatial nearest neighbor queries. Proc. VLDB Endow., 3(1-2):1231-1242, Sept. 2010.
[31] M. Stonebraker, D. J. Abadi, D. J. DeWitt, S. Madden, E. Paulson, A. Pavlo, and A. Rasin. MapReduce and parallel DBMSs: friends or foes? Commun. ACM, 53(1):64-71, 2010.
[32] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, et al. Hive: a warehousing solution over a Map-Reduce framework. In VLDB, volume 2, pages 1626-1629, August 2009.
[33] F. Wang, A. Aji, Q. Liu, and J. Saltz. Hadoop-GIS: a high performance query system for analytical medical imaging with MapReduce. Technical report, Emory University, July 2011.
[34] F. Wang, J. Kong, J. Gao, C. Vergara-Niedermayr, D. Adler, L. Cooper, W. Chen, T. Kurc, and J. Saltz. High performance analytical pathology imaging database for algorithm evaluation. Technical report, Emory University, June 2011.
[35] K. Wang, Y. Huai, R. Lee, F. Wang, X. Zhang, and J. H. Saltz. Accelerating pathology image data cross-comparison on CPU-GPU hybrid systems. PVLDB, 5(11):1543-1554, 2012.
[36] S. Zhang, J. Han, Z. Liu, K. Wang, and Z. Xu. SJMR: parallelizing spatial join with MapReduce on clusters. In CLUSTER, 2009.
[37] Y. Zhong, J. Han, et al. Towards parallel spatial query processing for big spatial data. In IEEE IPDPSW, 2012.
[38] X. Zhou, D. J. Abel, and D. Truffet. Data partitioning for parallel spatial join processing. Geoinformatica, 2:175-204, June 1998.