Research Article Efficient and Scalable Graph...
Transcript of Research Article Efficient and Scalable Graph...
Research ArticleEfficient and Scalable Graph Similarity Joins in MapReduce
Yifan Chen1 Xiang Zhao1 Chuan Xiao2 Weiming Zhang1 and Jiuyang Tang1
1 College of Information System and Management National University of Defense Technology Changsha 410073 China2Nagoya University Nagoya Japan
Correspondence should be addressed to Xiang Zhao xiangzhaonudteducn
Received 17 March 2014 Accepted 29 May 2014 Published 8 July 2014
Academic Editor Jian J Zhang
Copyright copy 2014 Yifan Chen et al This is an open access article distributed under the Creative Commons Attribution Licensewhich permits unrestricted use distribution and reproduction in any medium provided the original work is properly cited
Along with the emergence of massive graph-modeled data it is of great importance to investigate graph similarity joins due to theirwide applications formultiple purposes including data cleaning and near duplicate detectionThis paper considers graph similarityjoins with edit distance constraints which return pairs of graphs such that their edit distances are no larger than a given thresholdLeveraging the MapReduce programming model we propose MGSJoin a scalable algorithm following the filtering-verificationframework for efficient graph similarity joins It relies on counting overlapping graph signatures for filtering out nonpromisingcandidates With the potential issue of too many key-value pairs in the filtering phase spectral Bloom filters are introduced toreduce the number of key-value pairs Furthermore we integrate the multiway join strategy to boost the verification where aMapReduce-based method is proposed for GED calculation The superior efficiency and scalability of the proposed algorithms aredemonstrated by extensive experimental results
1 Introduction
As the most commonly used abstract data structure graphshave been widely used for modeling the data in the fields ofbioinformatics multimedia social networking and the likeAs a consequence efforts were dedicated to various problemsin managing and analyzing graph data for example frequentsubgraph mining [1] structure search and indexing [2 3]similarity search [4 5] and so forth
This paper focuses on graph similarity join one basicoperation for processing graph data Given two graph objectsets 119877 and 119878 and a distance threshold a graph similarityjoin returns all the pairs of graph objects respectivelyfrom 119877 and 119878 the distances of which are no larger thanthe threshold Graph similarity join has a wide spectral ofapplications especially in preprocessing of graph mining forexample structural data cleaning and near replicate structuredetection
The most widely applied measure for determining graphsimilarity is graph edit distance (GED) [6 7] Compared withalternative measures GED has at least three advantages (1) itallows changes in both vertices and edges (2) it reflects thetopological information of graphs (3) it is a metric that can
be applied to any type of graphs Consequently we employGED to quantify graph similarity in this paper It is shownthat exact computation of GED is NP-hard [8]
The state-of-the-art algorithm for graph similarity joinis GSimJoin [9] which adopts the filtering-verificationframework In particular signatures are generated for everygraph with path-based 119902-gram approach for count filtering(cf Section 22) In the phase of verification GED com-putation is invoked for candidate pairs via anAlowast-basedalgorithm GSimJoin is an in-memory algorithm the perfor-mance and scalability of which are restricted to the availablememory of a machine The dataset size presented in theexperimental study is limited in the thousands since a graphsimilarity join operation in the worst case needs 119874(|119877||119878|)
count filtering condition checks and similarity computationsthereafter The era of big data calls for scalable algorithms tosupport large-scale data processing This paper attempts toaddress the challenges on massive graph data
MapReduce is a well-known programming frameworkto facilitate processing large-scale data in parallel[10] MassJoin [11] is a MapReduce-based algorithm forsimilarity join on strings Nonetheless there has been noexisting distributed algorithm for graph similarity joins
Hindawi Publishing Corporatione Scientific World JournalVolume 2014 Article ID 749028 11 pageshttpdxdoiorg1011552014749028
2 The Scientific World Journal
Inspired by [11] this paper investigates graph similarity joinsbased on MapReduce
We firstly propose MGSJoin a MapReduce-based algo-rithm following the filtering-verification framework Itemploys signatures of path-base 119902-grams as keys and thecorresponding graphs as values forming key-value pairsThrough filtering the graph pairs whose common signaturesare less than the threshold given by count filtering conditionare filtered out The remaining pairs constitute the candidateset sent to verification thereafter Due to the potentially largenumber of key-value pairs that may incur large communi-cation cost we incorporate the Bloom filter technique byadding generated signatures to spectral Bloom filters Thiseffectively reduces the number of intermediate key-valuepairs and thus the complexity of network transmissionwhile the filtering capacity is mostly preserved Furthermorewe employmultiway join to improve the verification phase bycondensing two MapReduce rounds into one while devisinga MapReduce-based method to calculate GED
To the best of our knowledge it is among the firstattempts to present a MapReduce-based graph similarity joinalgorithm Our contribution can be summarized as follows
(i) We redesign the current in-memory graph similarityjoin algorithm and adapt it to the MapReduce frame-work The resulting baseline algorithm is capable ofprocessing large-scale graph datasets
(ii) We propose to use Bloom filters to reduce inter-mediate key-value pairs in the filtering phase whilesacrificing little filtering capacity Besides we presenta multiway join optimized verification strategy suchthat the number of required MapReduce rounds isreduced too Moreover a MapReduce-based methodis designed forGED calculation which can handle thecalculation for large graphs
(iii) We implement the proposed algorithm MGSJoin andconduct a wide range of experiments on a real datasetThe results show that both the efficiency and thescalability of MGSJoin are superior to the currentsolutions
This paper is constructed as follows In Section 2 prob-lem definition and background are provided We proposethe basic algorithm in Section 3 integrate Bloom filtersin Section 4 and optimize the verification in Section 5Section 6 lists the experimental results and analyses Relatedworks are described in Section 7 followed by conclusion inSection 8
2 Preliminaries
21 Problem Definition In this paper we focus on simplegraph namely an undirected graph without self-loops ormultiple edges A labeled graph can be represented as aquadruple 119892 = ⟨119881 119864 119897
119881 119897119864⟩ where 119881 is a set of vertices and
119864 sube 119881 times 119881 is a set of edges 119897119881and 119897119864are label functions
that assign labels to vertices and edges respectively 119881(119903)
denotes the vertex set of 119903 and119864(119903) denotes its edge set |119881(119903)|
and |119864(119903)| represent the numbers of vertices and edges in 119903
respectively 119897119881
(119906) denotes the label of 119906 isin 119881 and 119897119864(119890(119906 V))
denotes the label of the edge between 119906 and V 119906 V isin 119881Additionally we use |119892| = |119881| + |119864| to depict the size of graph119892
Definition 1 (graph pair) A graph pair is a tuple denoted by⟨119903 119904⟩ where 119903 isin 119877 119904 isin 119878119877 and 119878 are two sets of graph objectsrespectively
Definition 2 (graph isomorphism) A graph 119903 is isomorphicto another graph 119904 denoted by 119903 = 119904 if there exists a bijection119891 119881(119903) rarr 119881(119904) such that
(1) forall119906 isin 119881(119903)(119891(119906) isin 119881(119904) and 119897119881
(119906) = 119897119881
(119891(119906)))(2) forall119890(119906 V) isin 119864(119903)(119890(119891(119906) 119891(V)) isin 119864(119904) and 119897
119864(119890(119906 V)) =
119897119864(119890(119891(119906) 119891(V))))
Definition 3 (graph edit operation) A graph edit operation isan edit operation to transform one graph to another It can beone of the following six operations
(i) insert an isolated vertex into the graph(ii) delete an isolated vertex from the graph(iii) change the label of a vertex(iv) insert an edge between two disconnected vertices(v) delete an edge from the graph(vi) change the label of an edge
Definition 4 (graph edit distance) The graph edit distance(GED) between graphs 119903 and 119904 denoted by GED(119903 119904) is theminimum number of edit operations that transform 119903 to agraph isomorphic to 119904
Example 5 Figure 1(a) illustrates the molecule namedcyclopropanone while Figure 1(b) shows another moleculewhich does not exist When recording cyclopropanone intodatabase errors may be made and the molecule can becomethe one shown in Figure 1(b) Manual checking is requiredto find the errors which is very difficult Seeing that thetwo molecules are very similar we can adapt the GED tomeasure the similarity so that graph similarity join can beapplied to resolve the problem Take the two molecules asan example First change the bond (1198621 119874) from double tosingle and then transform one of the atoms 119867 bonded with1198623 into 119873 through which a molecule isomorphic to the onein Figure 1(b) is obtained So at least two edit operationsare required namely the graph edit distance Given thethreshold 120591 = 4 the two graphs are regarded similar
Problem 6 (graph similarity join) Given two sets of graphobjects 119877 and 119878 and a distance threshold 120591 as input a graphsimilarity join returns a result set ⟨119903 119904⟩ | 119903 isin 119877 and 119904 isin
119878 and GED(119903 119904) le 120591
22 Count Filtering
Definition 7 (path-based 119902-gram [9]) A path-based 119902-gramin a graph is a simple path of length 119902 ldquoSimplerdquo means thatthere is no repeated vertex in the path
The Scientific World Journal 3
0
C1
C2 C3 H
HH
H
(a) Cyclopropanone
0
C1
C2 C3H
H H
N
(b) ldquoCyclopropanonerdquo
Figure 1 Molecules
The path-based 119902-grams of a graph constitute the graphsignatures denoted by 119904119894119892 = ⟨119881 119864 119897
119881 119897119864⟩ where 119904119894119892 sube 119903|119881| =
119902+1 and |119864| = 119902 Let 119876119903be graph 119903rsquos signature set We say 119904119894119892
and 1199041198941198921015840 are common if 119904119894119892 = 119904119894119892
1015840 Note there can be multiple119902-grams that correspond to one particular signature
Lemma 8 (count filtering [9]) Graphs 119903 and 119904 satisfy thedistance constraints 120591 if the number of common signatures for⟨119903 119904⟩ is no less than 119871119861
119871119861 = max (1003816100381610038161003816119876119903
1003816100381610038161003816 minus 120591 sdot 119863 (119903) 1003816100381610038161003816119876119904
1003816100381610038161003816 minus 120591 sdot 119863 (119904)) (1)
where 119863(119903) (resp 119863(119904)) is the maximum number of affectedsignatures in 119876
119903(resp 119876
119904) when one edit operation is invoked
on 119903 (resp 119904)
The count filtering condition check requires119874(max(119889
119902
119903|119881(119903)| 119889
119902
119904|119881(119904)|)) where 119889
119903(resp 119889
119904) is the
average degree of 119903 (resp 119904) A graph similarity join requirespairwise count filtering condition checks thus resulting ina complexity of 119874(max(119889
119902
119903|119881(119903)| 119889
119902
119904|119881(119904)|)|119877||119878|) This can
be intolerable on large-scale datasets Next we present ascalable solution leveraging the MapReduce paradigm
3 Framework
This section presents a MapReduce-based graph similarityjoin algorithm following the filtering-verification fashionThe outline of the algorithm is listed below
31 Filtering Weallocate twoMapReduce jobs to the filteringphase
Job 1 Job 1 counts the same type of common signatures forgraph pairs We use graphs as values and their correspond-ing 119894119889s as keys to compose input key-value pairs 119902-gramsignatures (denoted by 119904119894119892) are generated in the Map taskWe use the generated signatures as keys and 119894119889s of graphs asvalues to form the output key-value pairs As a consequencein the Reduce task we can obtain graphs with commonsignatures For the same signature a graph 119894119889 may appear
more than once since there may exist several 119902-grams in agraph corresponding to an identical signature The functionfor Map task is shown as follows
Map⟨119903119894119889 119903⟩⟨119904119894119889 119904⟩ rarr ⟨119904119894119892 119903119894119889119904119894119889⟩
(1) input ⟨119903119894119889 119903⟩ or ⟨119904119894119889 119904⟩(2) generate 119902-gram signatures for each input
graph(3) emit ⟨119904119894119892 119903119894119889⟩ or ⟨119904119894119892 119904119894119889⟩ for each signature
with the 119894119889 of its corresponding graph
The Reducer gets ⟨119904119894119892 119897119894119904119905(119903119894119889119904119894119889)⟩ in the Reduce taskWedefine119891
119903(resp119891
119904) to denote the occurrences of 119903119894119889 (resp
119904119894119889) in the 119897119894119904119905(V119886119897119906119890) of Reduce input and 119898119891to denote the
occurrences of graph pairs that share the input key (119904119894119892) ofReduce 119898
119891= min(119891
119903 119891119904) 119898119891is calculated and output with
the corresponding graph pair The function of Reduce task isas follows
Reduce⟨119904119894119892 119897119894119904119905(119903119894119889119904119894119889)⟩ rarr ⟨119903119894119889 (119904119894119889 119898119891)⟩
(1) get list of values consisting of 119903119894119889 and 119904119894119889 withthe specific key (119904119894119892)
(2) split the list of values into list of 119903119894119889 and list of119904119894119889
(3) for each pair ⟨119903119894119889 119904119894119889⟩ where 119903119894119889 is from119897119894119904119905(119903119894119889) and 119904119894119889 is from 119897119894119904119905(119904119894119889) calculate 119898
119891
and output ⟨119903119894119889 (119904119894119889 119898119891)⟩
Job 2 Job 2 counts the total common signatures for graphpairs and checks the count filtering condition after which theset of candidate pairs is obtained The Map function is listedas follows
Map⟨119903119894119889 (119904119894119889 119898119891)⟩ rarr ⟨119903119894119889 (119904119894119889 119898
119891)⟩
(1) read a ⟨119903119894119889 (119904119894119889 119898119891)⟩ into Map task
(2) emit ⟨119903119894119889 (119904119894119889 119898119891)⟩ which is exactly the same
as it reads
4 The Scientific World Journal
We input ⟨119903119894119889 (119904119894119889 119898119891)⟩ to theMap task and then output
exactly the same key-value pair Therefore the Reduce taskreceives all the graph pairs generated in job 1 with the specific119904119894119889 As the graph pair may have more than one type ofcommon signatures 119898
119891is summed with the same graph pair
for each identical signature respectively Subsequently thegraph pairs with less than 119871119861 (1) common signatures will bediscarded The remaining graph pairs are candidate pairs forverification The function for Reduce is as follows
Reduce⟨119903119894119889 119897119894119904119905(119904119894119889 + 119898119891)⟩ rarr ⟨119903119894119889 119904119894119889⟩
(1) receive output from Mappers the specific 119903119894119889and a list of 119904119894119889 with 119898
119891
(2) sum 119898119891and calculate the number of common
signatures for each pair ⟨119903119894119889 119904119894119889⟩(3) conduct the count filtering for each pair and
output pairs to DFS whose common signaturesare more than 119871119861
32 Verification In the verification phase candidate pairs⟨119903119894119889 119904119894119889⟩ are to be verified where the graphs 119903 and 119904 arerequired that is join operations are necessary to retrievegraphs 119903 and 119904 by their 119894119889rsquos Hence we allocate two MapRe-duce jobs to join (119903119894119889 119903) (119903119894119889 119904119894119889) and (119904119894119889 119904)
Job 1 Job 1 replaces 119904119894119889 with graph 119904 The map function takesgraph set 119878 and ⟨119903119894119889 119904119894119889⟩ as input and emits ⟨119903119894119889 119904⟩ which islisted as follows
Map⟨119903119894119889 119904119894119889⟩⟨119904119894119889 119904⟩ rarr ⟨119904119894119889 119903119894119889119904⟩
(1) input candidate pair ⟨119903119894119889 119904119894119889⟩ and graph set 119878(2) for ⟨119903119894119889 119904119894119889⟩ emit ⟨119904119894119889 119903119894119889⟩ and for ⟨119904119894119889 119904⟩
emit it exactly both of which take 119904119894119889 as the key
TheReduce task gathers the list 119903119894119889 and graph 119904 for the key119904119894119889 and then outputs the key-value pair ⟨119903119894119889 119904⟩ The functionfor Reducer is as follows
Reduce⟨119904119894119889 119897119894119904119905(119903119894119889119904)⟩ rarr ⟨119903119894119889 119904⟩
(1) receive a list of 119903119894119889 and graph set 119878 with thespecific key 119904119894119889
(2) for ⟨119903119894119889 119904119894119889⟩ replace 119904119894119889 with 119904 and output pair⟨119903119894119889 119904⟩
Job 2 Job 2 replaces 119903119894119889 with graph 119903 invokes label filteringconditions and calculates GED to find the similar graphpairs
The function for Map task is as follows
Map⟨119903119894119889 119904⟩⟨119903119894119889 119903⟩ rarr ⟨119903119894119889 119903119904⟩
(1) input the candidate pair ⟨119903119894119889 119904⟩ and graph set119877(2) emit the key-value pair it reads where the group
key is 119903119894119889 for both cases
The function for Reduce task is as follows
Reduce⟨119903119894119889 119897119894119904119905(119903119904)⟩ rarr ⟨119903119894119889 119904119894119889⟩
(1) receive a list of values that consisted of graphs 119903
and 119904 corresponding to the key 119903119894119889(2) replace 119903119894119889 with graph 119903(3) calculate GED for pairs ⟨119903 119904⟩ and output similar
pairs
33 Correctness and Complexity Analysis All graph pairs thatsatisfy the edit distance constraints are returned error-freewhich justifies its correctness
For algorithm complexity we take all three phasesmdashMap Reduce and Shufflemdashinto consideration IO readingoverhead from distributed file system (DFS) is considered forMap task whereas IO writing overhead into DFS is analyzedfor Reduce task Both tasks also take time complexity intoconsideration Shuffle considers the network communicationcost
Some parameters are defined preceding the analysis |119892|
denotes the average size of a graph 120572 means candidate ratiothat is the percentage of candidate pairs from all graph pairsand 120573 means the ratio of similar pairs from pairs that passedcount filtering We assume that the size of key-value pair thatcontains only graph IDs is 1
In job 1 of filter phase Map reads the data of graph sets119877 and 119878 from DFS the cost of which is 119874(|119892| sdot (|119877| + |119878|))Consider the worst case in Map task where 119902-grams aregenerated for each graph 119903(119904) and the number of signaturesgenerated is |119881|
119902
so the time complexity for generating119902-grams can be estimated as 119874(|119881|
119902
) Therefore the timecomplexity for Map task is 119874(|119881|
119902
sdot (|119877| + |119878|)) As eachgenerated 119902-gram signature forms an output key-value pair(the size for the pair is 1) the communication cost for Shuffleis119874(|119881|
119902
sdot(|119877|+|119878|)) In theReduce task all graphs representedby 119894119889 containing the same signature are acquired and pairedThus the IO cost is 119874(|119877||119878|) Then the occurrences of119903119894119889 and 119904119894119889 are counted In other words all the key-valuepairs generated from the Map task are counted so the timecomplexity for Reduce is 119874(|119881|
119902
sdot (|119877| + |119878|))Consider job 2 in filter phase where all the output key-
value pairs from job 1 are read intoMap task of job 2 so the IOcost is 119874(|119877||119878|) Map outputs the key-value pairs it reads sothe time complexity forMap task and communication cost forShuffle are119874(|119877||119878|) In Reduce task all pairs are traversed sothe time complexity is119874(|119877||119878|) As we have 120572|119877||119878| candidatekey-value pairs the IO cost for Reduce task is 119874(120572|119877||119878|)
In verification phase of job 1 Map reads the candidatepairs represented by ⟨119903119894119889 119904119894119889⟩ and graph set 119878 from DFSwhere the IO overhead requires 119874(120572|119877||119878| + |119892||119878|) The timecomplexity of Map task and communication cost of Shuffleare also 119874(120572|119877||119878| + |119892||119878|) because in Map function we justemit what has been exactly input into Then in Reduce taskwe replace 119904119894119889with graph 119904 the time complexity is119874(120572|119877||119878|)and the IO overhead is 119874(120572|119892||119877||119878|)
In verification phase of job 2 Map reads the candidatepairs represented by ⟨119903119894119889 119904⟩ and graph set 119877 from DFSwhere the IO overhead requires 119874(|119892||119877|(120572|119878| + 1)) The
The Scientific World Journal 5
Input graph object sets 119877 and 119878 GED threshold 120591
Output similar graph pairs ⟨119903 119904⟩
(1) Filter(2) Job 1 count the same type of common signatures for graph pairs(3) Job 2 count the total common signatures and check the count filtering fir graph pairs(4) Verification(5) Job 1 replace sid with graph 119904(6) Job 2 replace rid with graph 119903 and calculate GED for candidate pairs
Algorithm 1MGSJoin
(1)Map ⟨119903119894119889 119903⟩ ⟨119904119894119889 119904⟩ rarr ⟨119903119894119889 119904119887119891119903⟩ ⟨119904119894119889 119904119887119891
119904⟩
(2) create SBF for each graph and insert the generated 119902-gram signatures into it(3) output ⟨119903119894119889 119904119887119891
119903⟩ or ⟨119904119894119889 119904119887119891
119904⟩
(4) Shuffle conduct Cartesian product between pairs ⟨119903119894119889 119904119887119891119903⟩ and pairs ⟨119904119894119889 119904119887119891
119904⟩
(5) Reduce ⟨⟨119903119894119889 119904119887119891119903⟩ 119897119894119904119905 (⟨119904119894119889 119904119887119891
119904⟩)⟩ rarr ⟨119903119894119889 119904119894119889⟩
(6) for any ⟨119903119894119889 119904119894119889⟩ calculate the intersection of 119904119887119891119903and 119904119887119891
119904and estimate
the number of common signatures(7) invoke the count filtering for each pair ⟨119903119894119889 119904119894119889⟩ and output it if it passes the condition
Algorithm 2 Replacement of filter of Algorithm 1
time complexity of Map task and communication cost ofShuffle are also 119874(|119892||119877|(120572|119878| + 1)) for the same reason aboveThen in Reduce task we first replace 119903119894119889 with graph 119903 andthen calculate GED for graph pairs The GED calculation by
theAlowast-based algorithm requires 119874(|119881||119881|
) In the worst case
the time complexity of Reduce is 119874(120572|119877||119878||119881||119881|
) Finallysimilar graph pairs are emitted into DFS which requires119874(120572120573|119877||119878|)
4 Incorporating Bloom Filters
In the filtering phase of Algorithm 1 twoMapReduce jobs arerequired with many intermediate key-value pairs generatedand transmitted These increase the IO and communicationcost which can be fairly time-consuming This sectionintroduces the Bloom filter technique to reduce such costNext we first recall the concept of spectral Bloom filters
41 Spectral Bloom Filter Bloom filters [12] are space efficientdata structures which allow fast membership queries over agiven set A Bloom filter uses 119896 hash functions ℎ
1 ℎ2 ℎ
119896
to hash elements into an array of size 119898 For an element 119890 inthe set the bit at positions ℎ
1(119890) ℎ2(119890) ℎ
119896(119890) in the array is
set to 1 Given a query item 119902 we check its membership in theset by examining the bits at positions ℎ
1(119902) ℎ2(119902) ℎ
119896(119902)
of the array The item 119902 is reported to be contained in theset if (and only if) all the aforementioned bits are 1 Thismethod brings a small probability of false-positive that isit may return a positive result for an item which actually isnot contained in the set but no false-negative while gainingsubstantial space savings [13]
Spectral Bloom filter (SBF) [14] generalized the basicBloom filter to be able to record the element frequency
which is thus adopted in this paper SBF is represented by⟨119860 119891⟩ where 119860 is a set and 119891 is a map from 119860 to naturalnumbers that is 119891 119860 rarr 119873 where 119873 is the universe ofnatural numbers SBF replaces the bit vector with a vectorof 119898 counters For insertion of item 119890 the counters 119862 =
119862ℎ1(119890)
119862ℎ2(119890)
119862ℎ119896(119890)
are increased by 1 for insertion anddecreased by 1 for deletion Let 119891(119902) denote the frequency of119902 A basic query for SBF on an item 119902 returns an estimationon 119891(119902) that is 119891(119902) = min
119894isin1119896119862ℎ119894(119902)
Note that similarto Bloom filters SBF also never underestimate 119891(119902)
42 Algorithm Incorporating SBF not only reduces the num-ber of key-value pairs but also contracts the two MapReducejobs into one in the filtering phase In particular the Maptask takes graph sets 119877 and 119878 as input Then we createa SBF for each graph by adding the 119902-gram signaturesCartesian product is conducted for the output key-valuepairs ⟨119903119894119889 119904119887119891
119903⟩ and ⟨119904119894119889 119904119887119891
119904⟩ In the Reduce task for
each graph pair their SBFs are intersected to estimate thenumber of common signatures By intersecting two SBFswithcounters 119862
(1) and 119862(2) respectively it returns another SBF
with counters119862lowast119862lowast119894
= min119862(1)
119894 119862(2)
119894 119894 isin 1 119898 Hence
the number of common signatures could be estimated bylfloor(1119896) sum
119898
119894=1119862lowast
119894rfloor Subsequently the graph pairs which have
less than 119871119861 common signatures given by count filteringcondition will be discarded and the remaining pairs formthe candidate set
We provide the pseudocode of the aforementioned pro-cess in Algorithm 2
43 Correctness and Complexity Analysis There is smallprobability that the false positive case happens with SBFSpecifically a query for an item 119902 in SBF on 119891(119902) 119891(119902)
6 The Scientific World Journal
Table 1 Complexity analysis of filtering
Phase Job 1 Job 2 +SBF
Map IO 119874 (1003816100381610038161003816119892
1003816100381610038161003816 sdot (|119877| + |119878|)) 119874 (|119877| |119878|) 119874 (1003816100381610038161003816119892
1003816100381610038161003816 sdot (|119877| + |119878|))
Time 119874 ((|119881|119902
sdot (|119877| + |119878|)) 119874 (|119877| |119878|) 119874 (119896|119881|119902
sdot (|119877| + |119878|))
Shuffle 119874 ((|119881|119902
sdot (|119877| + |119878|)) 119874 (|119877| |119878|) 119874 (119898 |119877| |119878|)
Reduce IO 119874(|119877||119878|) 119874 (120572 |119877| |119878|) 119874 (120572 |119877| |119878|)
Time 119874 ((|119881|119902
sdot (|119877| + |119878|)) 119874 (|119877| |119878|) 119874 (|119877| |119878|)
may be larger than 119891(119902) Therefore the number of commonsignatures estimated this way may be larger than the actualvalue Nonetheless false-negative will never happen whichensures the correctness of the algorithm in a certain case thepruning power of count filtering will be impaired Besidesfalse-positive will be less likely to happen if one carefullychooses the hash functions and configures the sizes ofcounters
Then we analyze the complexity Let 119898 be the size ofa SBF The Map task reads the entire sets 119877 and 119878 so itsIO cost is 119874(|119892|(|119877| + |119878|)) Then signatures are generatedand added to SBF and 119896 hash values are calculated for eachsignatureThus the time complexity forMap is119874(119896|119881|
119902
(|119877|+
|119878|)) Map emits the SBF for each input graph and thenCartesian product is conductedThe communication cost forShuffle is119874(119898
2
|119877||119878|) In the Reduce task for each graph pairthe number of common signatures is calculated and countfiltering condition is checked Thus the time complexity is119874(|119877||119878|) Regardless of false-positive cases the IO cost is thesame as before Denoting the improved algorithm by ldquo+SBFrdquowe summarize the complexity results in Table 1
5 Optimizing Verification Phase
In verification phase we need to calculate the GED ofcandidate pairs Nevertheless it is not capable of finishingthe calculation with large graphs and threshold 120591 Thereforewe devise a MapReduce-based method for GED calculationwhich is able of handling large-scale graphs Besides joinoperations are required preceding the GED calculation to getthe entire graphThus three relations 119877(119903119894119889 119903)⋈119862(119903119894119889 119904119894119889)⋈
119878(119904119894119889 119904) are joined to obtain the input of the GED algorithmwhere 119862(119903119894119889 119904119894119889) is the output of the filtering phase Inspiredby the idea of multiway join we can reduce the number ofrequired MapReduce jobs from two to one
51 MapReduce for GED Calculation TheGED calculation isbased onAlowast algorithmAlowast constructs a search-tree the nodeof which represents a mapping status A mapping status isstored in an array (denoted by 119909) the index of which standsfor different vertices in graph 119903 and the corresponding valuestands for the vertices of graph 119904Alowast explores the space of allpossible vertex mappings between two graphs in a best-firstsearch fashion with function (denoted by 119891(119909)) establishedto determine the order in which the search visits vertexmappings 119891(119909) is a sum of two functions (1) the distancefrom the initial state to the current state (denoted by 119892(119909))
(2) a heuristic estimate of the distance from the current stateto the goal (denoted by ℎ(119909)) 119892(119909) and ℎ(119909) are calculated bythe following equations
119892 (119909) = GED (119903119901 119904119901)
ℎ (119909) = Γ (119871119881
(119903119901) 119871119881
(119904119902)) + Γ (119871
119864(119903119902) 119871119864
(119904119902))
(2)
119903119901consists of the vertices that have been mapped and edges
connecting them while 119903119902consists of the vertices unmapped
yet as well as their resident edges The equation for ℎ(119909)
represents the label difference of two graphsThe search space forAlowast-based approach is very large
requiring 119874(|119881||119881|
) In order to boost up the searchingprocedure parallelization is a common way to think aboutThe naive way is to allocate different branches of search-tree to different workers so that the searching procedure canproceed in parallel However the load is not balanced this wayso that the final runtime is determined on the worker withthe heaviest load As a consequence we devise aMapReduce-based method to calculate GED denoted by MRGED whichreallocates the works after each MapReduce round
The format of the key-value pairs to be manipulated byMapReduce is ⟨119909 119891(119909)⟩ The procedure of searching for theresult is through iterations that each iterationwalks down onelayer of the search-tree
52 Multiway Join In relational database a multiway joincan process 119879
1(119860 119861) ⋈ 119879
2(119861 119862) ⋈ 119879
3(119862 119863) together in one
round where 119879119894(119883 119884) is a relational table with attributes
119883 and 119884 (Algorithm 3) Following the same idea we canconsolidate the twoMapReduce jobs required for verificationin Algorithm 1 Specifically let ℎ be a hash function withrange 1 2 119899 where 119899
2 is the number of Reducers Weassociate eachReduce taskwith a pair (119894 119895) where 119894 119895 isin [1 119899]Each tuple 119905
2(119887 119888) isin 119879
2(119861 119862) is sent to theReducer numbered
(ℎ(119887) ℎ(119888)) while each tuple in1198791(119886 119887) (resp119879
3(119888 119889)) is sent
to Reducers numbered (ℎ(119887) 119909) (resp (119910 ℎ(119888))) for any 119909
(resp 119910) Each Reduce task computes the join of the tuplesit receives It is shown that multiway join is more efficient inpractice than two simple joins [15]
We encapsulate the improved verification procedure inAlgorithm 4
53 Correctness and Complexity Analysis One may immedi-ately verify that Algorithm 4 correctly conducts the verifica-tion
The Scientific World Journal 7
Input 119903 119904 are graphs from candidate pair ⟨119903 119904⟩
Output Graph edit distance of 119903 and 119904
(1) construct the search tree with the root node ⟨0 0 0 0⟩(2) read nodes into Mapper if the node read in is fully mapped namely 119903
119902is empty go to line (8)
(3)Map ⟨119909 119891(119909)⟩ rarr ⟨119909 119891(119909)⟩
(4) calculate 119891(119909) for node 119909 and check whether 119891(119909) gt 120591 if it is continue(5) create new nodes by mapping a vertex V
1isin 119903119902(119909) to a vertex V
2isin 119904119902(119909) for any V
1and V
2
(6) output all the newly created nodes(7) back to line (2)(8) gather the nodes calculate the minimum 119891(119909) and return it
Algorithm 3 MRGED (119903 119904)
(1)Map ⟨119903119894119889 119903⟩ ⟨119903119894119889 119904119894119889⟩ ⟨119904119894119889 119904⟩
(2) if ⟨119903119894119889 119903⟩ then emit(⟨(ℎ(119903119894119889) 119909) ⟨119903119894119889 119903⟩⟩) for any 119909 if ⟨119903119894119889 119904119894119889⟩ thenemit(⟨(ℎ (119903119894119889) ℎ (119904119894119889)) ⟨119903119894119889 119904⟩⟩) for any 119910 else emit(⟨(119910 ℎ(119904119894119889)) ⟨119904119894119889 119904⟩⟩)
(3) Reduce ⟨(ℎ(119903119894119889) ℎ(119904119894119889)) 119897119894119904119905(V119886119897119906119890)⟩ rarr ⟨119903119894119889 119904119894119889⟩
(4) split 119897119894119904119905(V119886119897119906119890) into 119897119894119904119905 (⟨119903119894119889 119903⟩) 119897119894119904119905 (⟨119904119894119889 119904⟩) and 119897119894119904119905 (⟨119903119894119889 119904119894119889⟩)(5) denote 119897119894119904119905 (⟨119903119894119889 119904119894119889⟩) as joinkey(6) for ⟨119903119894119889 119904119894119889⟩ isin 119895119900119894119899119896119890119910119904 119908ℎ119890119903119890 ⟨119903119894119889 119903⟩ isin 119897119894119904119905 (⟨119903119894119889 119903⟩) 119886119899119889 ⟨119904119894119889 119904⟩ isin 119897119894119904119905 (⟨119904119894119889 119904⟩) do(7) if 119892119903119886119901ℎ119904 119903 119886119899119889 119904 119894119899 ⟨119903 119904⟩ 119886119903119890 V119890119903119910 119897119886119903119892119890 then invoke MRGED
calculation else invoke the Alowast-based in-memory algorithm
Algorithm 4 Replacement of verification of Algorithm 1
In the multiway join based verification phase the Maptask takes graph sets 119877 and 119878 and the candidate pairs repre-sented by their 119894119889s as input The input IO cost is 119874(|119892|(|119877| +
|119878|) + 120572|119877||119878|) The key-value pair ⟨119903119894119889 119903⟩ (resp ⟨119904119894119889 119904⟩) issent to 119899 Reduce tasks numbered (119903119894119889 119909) (resp (119910 119904119894119889)) forany119909 (resp119910) whereas the key-value pair ⟨119903119894119889 119904119894119889⟩ is sent tothe only Reduce task numbered (ℎ(119903119894119889) ℎ(119904119894119889)) As a resultthe communication cost for Shuffle is 119874(119899|119892|(|119877| + |119878|) +
120572|119877||119878|) where 1198992 is the number of Reducers In the Reduce
task all candidate graph pairs go through edit distancecomputation Pairs with larger size go through MRGEDwhile pairs with smaller size go throughAlowast For simplicitythe complexity of GED calculation is regarded the same asthe baseline algorithm Labelling the resulting algorithmwithldquo+MJrdquo we summarize the complexity results in Table 2
6 Experiments
61 Experiment Setup We conducted experiments on severalpublicly available real datasets but only present the resultson Pubchem (httppubchemncbinlmnihgov) due to theinterest of space The dataset is constructed by sampling1000000 graphs from Pubchem
Amazon cloud services were used as our experimentplatform Specifically we used Elastic Compute Cloud (EC2)in which the computing nodes are called instances In theexperiment 31 instances were used by defaultmdashone set asmaster node and others as worker nodes The standardconfiguration of all EC2 instances is m1small one CPUof single core with 17 GB memory running Hadoop 112(Table 3)
62 Evaluating Filters In order to evaluate the effectivenessof our filtering techniques we use the term ldquoBasicrdquo for thebaseline algorithm for processing graph similarity joins basedonMapReduce ldquo+SBFrdquo denotes filtering improved algorithmof Basic by incorporating SBF
The algorithm efficiency has been studied and shownin Figure 2(a) However the pruning power is somewhatimpaired so we conducted the experiment to record theincrease of candidate pairs through ldquo+SBFrdquo (cf Figure 2(b))It can be revealed that ldquo+SBFrdquo outweigh ldquoBasicrdquo in efficiencyby sacrificing little pruning power When 120591 = 5 less than 300more candidate pairs are generatedwhile about 5000 secondsare reserved
63 Evaluating Verification The verification was evaluatedwith candidate pairs generated by Basic Term ldquo+MJrdquodenotes applying multiway join in verification whileldquo+MRGEDrdquo denotes adapting alternative MapReduce-basedGED calculation We use term ldquoMGSJoinrdquo to indicate thebasic algorithm incorporating both techniques Figure 3(a)shows the runtime comparison between +MRGED
and MGSJoin The result illustrates the superiority ofapplying multiway join When 120591 = 5 algorithm withmultiway join is about 6000 seconds faster than withordinary joins Figure 3(b) presents the result of evaluatingMRGED where Basic and +MRGED are compared It canbe observed that the runtime of both algorithms growsexponentially When 120591 equals 1 the Basic finishes quickerthan +MRGED (162 s and 204 s resp) while 120591 equals 2 the+MRGED outweighs Basic (the runtime is 712 s and 580 sresp) This is because the calculation required for 120591 = 1
8 The Scientific World Journal
Table 2 Complexity analysis of verification
Phase Job 1 Job 2 +MJ
Map IO 119874 (120572 |119877| |119878| +1003816100381610038161003816119892
1003816100381610038161003816 |119878|) 119874 (1003816100381610038161003816119892
1003816100381610038161003816 |119877| sdot (120572 |119878| + 1)) 119874 (1003816100381610038161003816119892
1003816100381610038161003816 (|119877| + |119878|) + 120572 |119877| |119878|)
Time 119874 (120572 |119877| |119878| +1003816100381610038161003816119892
1003816100381610038161003816 |119878|) 119874 (1003816100381610038161003816119892
1003816100381610038161003816 |119877| sdot (120572 |119878| + 1)) 119874 (119899 (|119877| + |119878|) + 120572 |119877| |119878|)
Shuffle 119874 (120572 |119877| |119878| +1003816100381610038161003816119892
1003816100381610038161003816 |119878|) 119874 (1003816100381610038161003816119892
1003816100381610038161003816 |119877| sdot (120572 |119878| + 1)) 119874 (1198991003816100381610038161003816119892
1003816100381610038161003816 (|119877| + |119878|) + 120572 |119877| |119878|)
Reduce IO 119874 (1205721003816100381610038161003816119892
1003816100381610038161003816 sdot |119877| |119878|) 119874 (120572120573 |119877| |119878|) 119874 (120572120573 |119877| |119878|)
Time 119874 (120572 |119877| |119878|) 119874 (120572|119877||119878|119881||119881|
) 119874 (120572|119877||119878|119881||119881|
)
0 1 2 3 4 5
Elap
sed
time (
s)
GED threshold (120591)
Basic
+SBF
15k
12k
9k
6k
3k
(a)
0
50
100
150
200
250
300
1 2 3 4 5GED threshold (120591)
Num
ber o
f Δ (c
andi
date
pai
rs)
(b)
Figure 2 Evaluating filtering
Table 3 Dataset statistics
Dataset |119877| |119881| |119864| Disk size (GB)Enamine 1000000 5232 5037 07
is small where the MRGED is clumsy compared withAlowastwhereas when 120591 = 2 more calculation is required so that theadvantage of MRGED comes out When 120591 is within the rangeof 3ndash5 Basic is unable to finish because the large calculationdrives Basic out of memory
64 Comparing with State-of-the-Art Method We comparedour algorithm with the state-of-the-art method GSimJoinIn Figure 4(a) we chose 10000 graphs in order to comparewith GSimJoin and the result witnesses the obvious supe-riority of MGSJoin over GSimJoin Figure 4(b) is drawn inlog scale which varies the number of graphs and records theelapsed time Both algorithms grow linear in the figure whichreflect their exponential growth MGSJoin rises mush slowerthan GSimJoin The runtime of GSimJoin is about 10 timeslonger than MGSJoinwhen joining 100 graphs and 100 timeslonger than MGSJoinwhen joining 10000 graphsMoreoverwhen we have to join 100000 graphs GSimJoin is runningout of memory so that no result is recorded
65 Speedup We evaluated the speedup of our algorithm byvarying the number of instances from 10 to 50 The exper-imental results are shown in Figure 5(a) We can see that
with the increase of instances in the cluster the performanceof MGSJoin significantly improved The improvement issignificantly shown when threshold 120591 equals 4 With moreinstances running the count filtering is getting faster bycounting for the common signatures simultaneously andthe verification is getting quicker by joining relations andcalculating GED in parallel
66 Scale-Up We evaluated the scale-up of our algorithmby increasing both dataset sizes and number of nodes in thecluster The result is shown in Figure 5(b) It is worth notingthat as the dataset increased the results for different values of120591 get similar in the trend of increase All lines rise smoothlywhich reveals good scalability of MGSJoin
7 Related Work
Graph SimilarityQueries Similarity joins retrieve similar dataobject pairs which can be strings sets trees and graphs [16]As to GED-based graph similarity search [17] proposed 120581-AT a tree-based 119902-gram approach However it is associatedwith the drawback of usually loose lower bound for countfiltering Seeing the drawback [9] presented a path-based 119902-gram approach In comparison with 120581-AT GSimJoin is moreefficient by leveraging more advanced filtering techniquesThus we adopt the path-based 119902-gram approach in this paper
MapReduce-Based Graph Algorithms MapReduce is a dis-tributed programming framework [10] which has been
The Scientific World Journal 9
Elap
sed
time (
s)
1 2 3 4 5GED threshold (120591)
16k
12k
8k
4k
MGSJoin
+MRGED
(a)
Elap
sed
time (
s)
Basic
+MR
105
104
103
102
1 2 3 4 5GED threshold (120591)
(b)
Figure 3 Evaluating verification
0
Elap
sed
time (
s)
GSimJoin
MGSJoin
1 2 3 4 5GED threshold (120591)
5k
4k
3k
2k
1k
(a)
Elap
sed
time (
s)
Number of graphs
GSimJoin
MGSJoin
105
104
103
102
102
101
103
104
106
105
(b)
Figure 4 Comparing with state-of-the-art method
2k
4k
6k
8k
10k
504540353025201510
Elap
sed
time
Number of instances
120591 = 1
120591 = 2
120591 = 3
120591 = 4
(a)
010 20 30 40 50
Elap
sed
time (
s)
Number of instances and number of graphs (times2000)
5k
4k
3k
2k
1k
(b)
Figure 5 Speedup and Scale-up
10 The Scientific World Journal
applied in processing large graphs Many graph algorithmsusing MapReduce were discussed in [18] including tri-anglesrectangles enumeration and 119896-cliques computationIn [19] several techniques were proposed to reduce theinput size of MapReduce and the techniques are appliedfor minimum spanning trees approximate maximal match-ings approximate nodeedge covers and minimum cutsPersonalized PageRank computation inMapReduce was dis-cussed in [20] Matrix multiplication based graph min-ing algorithms in MapReduce were investigated in [21]More recently densest subgraph computation [22] subgraphinstances enumeration [23] and connected componentscomputation in logarithmic rounds [24] were researched inMapReduce
Graph Processing Systems in Cloud Many systems weredeveloped in order to deal with big graphs Such a one repre-sentative system is Pregel [25] which takes a vertex-centricapproach and implements a bulk synchronous parallel (BSP)computation model HipG [26] improves BSP by using asyn-chronous messages to avoid synchronization PowerGraph[27] is a distributed graph processing system that is optimizedto process power-law graphs Giraph++ was proposed in[28] to take graph partitioning into consideration whenprocessing graphs Workload balancing for graph processingin cloud was discussed in [29]
8 Conclusion
In this paper we have investigated the problem of scalablegraph similarity joins We firstly present a MapReduce-based graph similarity join algorithm MGSJoin following thefiltering-verification framework To reduce the communica-tion cost in the filtering phase it incorporates the Bloomfilter technique to reduce the number of intermediate key-value pairs In addition we devise a multiway join optimizedverification procedure for further speedup Extensive experi-ments are conducted on real datasets to confirm the efficiencyand scalability of the proposed solution Furthermore theverification phase is further optimized with MapReducewhich enables the test on larger and denser graphs
As a future direction we plan to explore the possibility ofoptimizing the verificationwithmultithreaded programmingparadigm Additionally it is also of interest to test theefficiency and scalability of proposed algorithms on evenlarger andor denser graphs
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgments
The research is supported by the doctoral program of highereducation of China (No 2011437110008) and the nationalnatural science foundation of China (No 61303062)
References
[1] X Yan and J Han ldquogSpan graph-based substructure patternminingrdquo inProceedings of the 2nd IEEE International Conferenceon Data Mining (ICDM rsquo02) pp 721ndash724 December 2002
[2] X Yan P S Yu and J Han ldquoGraph indexing a frequentstructure-based approachrdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data (SIGMODrsquo04) pp 335ndash346 June 2004
[3] H He and A K Singh ldquoClosure-tree an index structurefor graph queriesrdquo in Proceedings of the 22nd InternationalConference on Data Engineering (ICDE rsquo06) p 38 April 2006
[4] X Yan P S Yu and J Han ldquoSubstructure similarity searchin graph databasesrdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data pp 766ndash777June 2005
[5] Y Tian and J M Patel ldquoTALE a tool for approximate largegraph matchingrdquo in Proceedings of the 24th IEEE InternationalConference on Data Engineering (ICDE rsquo08) pp 963ndash972 April2008
[6] A Sanfeliu and K-S Fu ldquoA distance measure betweenattributed relational graphs for pattern recognitionrdquo IEEETransactions on Systems Man and Cybernetics vol 13 no 3 pp353ndash362 1983
[7] H Bunke and G Allermann ldquoInexact graph matching forstructural pattern recognitionrdquo Pattern Recognition Letters vol1 no 4 pp 245ndash253 1983
[8] M R Garey and D S Johnson Computers and Intractabilityvol 174 Freeman San Francisco Calif USA 1979
[9] X Zhao C Xiao X Lin and W Wang ldquoEfficient graphsimilarity joins with edit distance constraintsrdquo in Proceedingsof the 28th IEEE International Conference on Data Engineering(ICDE rsquo12) pp 834ndash845 April 2012
[10] J Dean and SGhemawat ldquoMapReduce simplified data process-ing on large clustersrdquo Communications of the ACM vol 51 no1 pp 107ndash113 2008
[11] D Deng G Li S Hao JWang J Feng andW S Li ldquoMaSSJoinaMapReduce-basedmethod for scalable string similarity joinsrdquoinProceedings of the 30th IEEE International Conference onDataEngineering (ICDE rsquo14) pp 340ndash351 Chicago Ill USA
[12] B H Bloom ldquoSpacetime trade-offs in hash coding withallowable errorsrdquoCommunications of the ACM vol 13 no 7 pp422ndash426 1970
[13] A Broder and M Mitzenmacher ldquoNetwork applications ofbloom filters a surveyrdquo Internet Mathematics vol 1 no 4 pp485ndash509 2004
[14] S Cohen and Y Matias ldquoSpectral bloom filtersrdquo in Proceedingsof the ACM SIGMOD International Conference on Managementof Data pp 241ndash252 June 2003
[15] F N Afrati and J D Ullman ldquoOptimizing joins in a map-reduce environmentrdquo in Proceedings of the 13th InternationalConference on Extending Database Technology Advances inDatabase Technology (EDBT rsquo10) pp 99ndash110 March 2010
[16] Y N Silva and J M Reed ldquoExploiting MapReduce-based sim-ilarity joinsrdquo in Proceedings of the ACM SIGMOD InternationalConference onManagement of Data (SIGMOD 12) pp 693ndash696May 2012
[17] G Wang B Wang X Yang and G Yu ldquoEfficiently indexinglarge sparse graphs for similarity searchrdquo IEEE Transactions onKnowledge and Data Engineering vol 24 no 3 pp 440ndash4512012
The Scientific World Journal 11
[18] J Cohen ldquoGraph twiddling in a MapReduce worldrdquo Computingin Science and Engineering vol 11 no 4 pp 29ndash41 2009
[19] S Lattanzi B Moseley S Suri and S Vassilvitskii ldquoFilteringa method for solving graph problems in MapReducerdquo inProceedings of the 23rd ACM Symposium on Parallelism inAlgorithms and Architectures (SPAA rsquo11) pp 85ndash94 June 2011
[20] B Bahmani K Chakrabarti and D Xin ldquoFast personalizedPageRank onMapReducerdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data pp 973ndash984June 2011
[21] U Kang H Tong J Sun C-Y Lin and C Faloutsos ldquoGBASEa Scalable and general graph management systemrdquo in Pro-ceedings of the 17th ACM SIGKDD International Conference onKnowledge Discovery andDataMining (KDD rsquo11) pp 1091ndash1099August 2011
[22] B Bahmani R Kumar and S Vassilvitskii ldquoDensest subgraphin streaming andMapReducerdquo Proceedings of the VLDB Endow-ment vol 5 no 5 pp 454ndash465 2012
[23] F N Afrati D Fotakis and J D Ullman ldquoEnumeratingsubgraph instances using map-reducerdquo in Proceedings of the29th International Conference on Data Engineering (ICDE rsquo13)pp 62ndash73 April 2013
[24] V Rastogi A Machanavajjhala L Chitnis and A Das SarmaldquoFinding connected components in map-reduce in logarithmicroundsrdquo in Proceedings of the 29th International Conference onData Engineering (ICDE rsquo13) pp 50ndash61 April 2013
[25] G Malewicz M H Austern A J C Bik et al ldquoPregel asystem for large-scale graph processingrdquo in Proceedings of theInternational Conference on Management of Data (SIGMODrsquo10) pp 135ndash145 June 2010
[26] E Krepska T Kielmann W Fokkink and H Bal ldquoHipg par-allelprocessing of large-scale graphsrdquo ACM SIGOPS OperatingSystems Review vol 45 no 2 pp 3ndash13 2011
[27] J E Gonzalez Y Low H Gu D Bickson and C GuestrinldquoPowergraph distributed graph-parallel computation on natu-ral graphsrdquo in Proceedings of the 10th USENIX Symposium onOperating Systems Design and Implementation (OSDI rsquo12) pp17ndash30 2012
[28] Y Tian A Balmin S A Corsten S Tatikonda and J McPher-son ldquoFrom ldquothink like a vertexrdquo to ldquothink like a graphrdquordquoProceedings of the VLDB Endowment vol 7 no 3 2013
[29] Z Shang and J X Yu ldquoCatch the wind graph workloadbalancing on cloudrdquo in Proceedings of the 29th InternationalConference on Data Engineering (ICDE rsquo13) pp 553ndash564 April2013
Submit your manuscripts athttpwwwhindawicom
Computer Games Technology
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Distributed Sensor Networks
International Journal of
Advances in
FuzzySystems
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014
International Journal of
ReconfigurableComputing
Hindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Applied Computational Intelligence and Soft Computing
thinspAdvancesthinspinthinsp
Artificial Intelligence
HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014
Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Electrical and Computer Engineering
Journal of
Journal of
Computer Networks and Communications
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation
httpwwwhindawicom Volume 2014
Advances in
Multimedia
International Journal of
Biomedical Imaging
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
ArtificialNeural Systems
Advances in
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
RoboticsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Computational Intelligence and Neuroscience
Industrial EngineeringJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Human-ComputerInteraction
Advances in
Computer EngineeringAdvances in
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
2 The Scientific World Journal
Inspired by [11] this paper investigates graph similarity joinsbased on MapReduce
We firstly propose MGSJoin a MapReduce-based algo-rithm following the filtering-verification framework Itemploys signatures of path-base 119902-grams as keys and thecorresponding graphs as values forming key-value pairsThrough filtering the graph pairs whose common signaturesare less than the threshold given by count filtering conditionare filtered out The remaining pairs constitute the candidateset sent to verification thereafter Due to the potentially largenumber of key-value pairs that may incur large communi-cation cost we incorporate the Bloom filter technique byadding generated signatures to spectral Bloom filters Thiseffectively reduces the number of intermediate key-valuepairs and thus the complexity of network transmissionwhile the filtering capacity is mostly preserved Furthermorewe employmultiway join to improve the verification phase bycondensing two MapReduce rounds into one while devisinga MapReduce-based method to calculate GED
To the best of our knowledge it is among the firstattempts to present a MapReduce-based graph similarity joinalgorithm Our contribution can be summarized as follows
(i) We redesign the current in-memory graph similarityjoin algorithm and adapt it to the MapReduce frame-work The resulting baseline algorithm is capable ofprocessing large-scale graph datasets
(ii) We propose to use Bloom filters to reduce inter-mediate key-value pairs in the filtering phase whilesacrificing little filtering capacity Besides we presenta multiway join optimized verification strategy suchthat the number of required MapReduce rounds isreduced too Moreover a MapReduce-based methodis designed forGED calculation which can handle thecalculation for large graphs
(iii) We implement the proposed algorithm MGSJoin andconduct a wide range of experiments on a real datasetThe results show that both the efficiency and thescalability of MGSJoin are superior to the currentsolutions
This paper is constructed as follows In Section 2 prob-lem definition and background are provided We proposethe basic algorithm in Section 3 integrate Bloom filtersin Section 4 and optimize the verification in Section 5Section 6 lists the experimental results and analyses Relatedworks are described in Section 7 followed by conclusion inSection 8
2 Preliminaries
21 Problem Definition In this paper we focus on simplegraph namely an undirected graph without self-loops ormultiple edges A labeled graph can be represented as aquadruple 119892 = ⟨119881 119864 119897
119881 119897119864⟩ where 119881 is a set of vertices and
119864 sube 119881 times 119881 is a set of edges 119897119881and 119897119864are label functions
that assign labels to vertices and edges respectively 119881(119903)
denotes the vertex set of 119903 and119864(119903) denotes its edge set |119881(119903)|
and |119864(119903)| represent the numbers of vertices and edges in 119903
respectively 119897119881
(119906) denotes the label of 119906 isin 119881 and 119897119864(119890(119906 V))
denotes the label of the edge between 119906 and V 119906 V isin 119881Additionally we use |119892| = |119881| + |119864| to depict the size of graph119892
Definition 1 (graph pair) A graph pair is a tuple denoted by⟨119903 119904⟩ where 119903 isin 119877 119904 isin 119878119877 and 119878 are two sets of graph objectsrespectively
Definition 2 (graph isomorphism) A graph 119903 is isomorphicto another graph 119904 denoted by 119903 = 119904 if there exists a bijection119891 119881(119903) rarr 119881(119904) such that
(1) forall119906 isin 119881(119903)(119891(119906) isin 119881(119904) and 119897119881
(119906) = 119897119881
(119891(119906)))(2) forall119890(119906 V) isin 119864(119903)(119890(119891(119906) 119891(V)) isin 119864(119904) and 119897
119864(119890(119906 V)) =
119897119864(119890(119891(119906) 119891(V))))
Definition 3 (graph edit operation) A graph edit operation isan edit operation to transform one graph to another It can beone of the following six operations
(i) insert an isolated vertex into the graph(ii) delete an isolated vertex from the graph(iii) change the label of a vertex(iv) insert an edge between two disconnected vertices(v) delete an edge from the graph(vi) change the label of an edge
Definition 4 (graph edit distance) The graph edit distance(GED) between graphs 119903 and 119904 denoted by GED(119903 119904) is theminimum number of edit operations that transform 119903 to agraph isomorphic to 119904
Example 5 Figure 1(a) illustrates the molecule namedcyclopropanone while Figure 1(b) shows another moleculewhich does not exist When recording cyclopropanone intodatabase errors may be made and the molecule can becomethe one shown in Figure 1(b) Manual checking is requiredto find the errors which is very difficult Seeing that thetwo molecules are very similar we can adapt the GED tomeasure the similarity so that graph similarity join can beapplied to resolve the problem Take the two molecules asan example First change the bond (1198621 119874) from double tosingle and then transform one of the atoms 119867 bonded with1198623 into 119873 through which a molecule isomorphic to the onein Figure 1(b) is obtained So at least two edit operationsare required namely the graph edit distance Given thethreshold 120591 = 4 the two graphs are regarded similar
Problem 6 (graph similarity join) Given two sets of graphobjects 119877 and 119878 and a distance threshold 120591 as input a graphsimilarity join returns a result set ⟨119903 119904⟩ | 119903 isin 119877 and 119904 isin
119878 and GED(119903 119904) le 120591
22 Count Filtering
Definition 7 (path-based 119902-gram [9]) A path-based 119902-gramin a graph is a simple path of length 119902 ldquoSimplerdquo means thatthere is no repeated vertex in the path
The Scientific World Journal 3
0
C1
C2 C3 H
HH
H
(a) Cyclopropanone
0
C1
C2 C3H
H H
N
(b) ldquoCyclopropanonerdquo
Figure 1 Molecules
The path-based 119902-grams of a graph constitute the graphsignatures denoted by 119904119894119892 = ⟨119881 119864 119897
119881 119897119864⟩ where 119904119894119892 sube 119903|119881| =
119902+1 and |119864| = 119902 Let 119876119903be graph 119903rsquos signature set We say 119904119894119892
and 1199041198941198921015840 are common if 119904119894119892 = 119904119894119892
1015840 Note there can be multiple119902-grams that correspond to one particular signature
Lemma 8 (count filtering [9]) Graphs 119903 and 119904 satisfy thedistance constraints 120591 if the number of common signatures for⟨119903 119904⟩ is no less than 119871119861
119871119861 = max (1003816100381610038161003816119876119903
1003816100381610038161003816 minus 120591 sdot 119863 (119903) 1003816100381610038161003816119876119904
1003816100381610038161003816 minus 120591 sdot 119863 (119904)) (1)
where 119863(119903) (resp 119863(119904)) is the maximum number of affectedsignatures in 119876
119903(resp 119876
119904) when one edit operation is invoked
on 119903 (resp 119904)
The count filtering condition check requires119874(max(119889
119902
119903|119881(119903)| 119889
119902
119904|119881(119904)|)) where 119889
119903(resp 119889
119904) is the
average degree of 119903 (resp 119904) A graph similarity join requirespairwise count filtering condition checks thus resulting ina complexity of 119874(max(119889
119902
119903|119881(119903)| 119889
119902
119904|119881(119904)|)|119877||119878|) This can
be intolerable on large-scale datasets Next we present ascalable solution leveraging the MapReduce paradigm
3 Framework
This section presents a MapReduce-based graph similarityjoin algorithm following the filtering-verification fashionThe outline of the algorithm is listed below
31 Filtering Weallocate twoMapReduce jobs to the filteringphase
Job 1 Job 1 counts the same type of common signatures forgraph pairs We use graphs as values and their correspond-ing 119894119889s as keys to compose input key-value pairs 119902-gramsignatures (denoted by 119904119894119892) are generated in the Map taskWe use the generated signatures as keys and 119894119889s of graphs asvalues to form the output key-value pairs As a consequencein the Reduce task we can obtain graphs with commonsignatures For the same signature a graph 119894119889 may appear
more than once since there may exist several 119902-grams in agraph corresponding to an identical signature The functionfor Map task is shown as follows
Map⟨119903119894119889 119903⟩⟨119904119894119889 119904⟩ rarr ⟨119904119894119892 119903119894119889119904119894119889⟩
(1) input ⟨119903119894119889 119903⟩ or ⟨119904119894119889 119904⟩(2) generate 119902-gram signatures for each input
graph(3) emit ⟨119904119894119892 119903119894119889⟩ or ⟨119904119894119892 119904119894119889⟩ for each signature
with the 119894119889 of its corresponding graph
The Reducer gets ⟨119904119894119892 119897119894119904119905(119903119894119889119904119894119889)⟩ in the Reduce taskWedefine119891
119903(resp119891
119904) to denote the occurrences of 119903119894119889 (resp
119904119894119889) in the 119897119894119904119905(V119886119897119906119890) of Reduce input and 119898119891to denote the
occurrences of graph pairs that share the input key (119904119894119892) ofReduce 119898
119891= min(119891
119903 119891119904) 119898119891is calculated and output with
the corresponding graph pair The function of Reduce task isas follows
Reduce⟨119904119894119892 119897119894119904119905(119903119894119889119904119894119889)⟩ rarr ⟨119903119894119889 (119904119894119889 119898119891)⟩
(1) get list of values consisting of 119903119894119889 and 119904119894119889 withthe specific key (119904119894119892)
(2) split the list of values into list of 119903119894119889 and list of119904119894119889
(3) for each pair ⟨119903119894119889 119904119894119889⟩ where 119903119894119889 is from119897119894119904119905(119903119894119889) and 119904119894119889 is from 119897119894119904119905(119904119894119889) calculate 119898
119891
and output ⟨119903119894119889 (119904119894119889 119898119891)⟩
Job 2 Job 2 counts the total common signatures for graphpairs and checks the count filtering condition after which theset of candidate pairs is obtained The Map function is listedas follows
Map⟨119903119894119889 (119904119894119889 119898119891)⟩ rarr ⟨119903119894119889 (119904119894119889 119898
119891)⟩
(1) read a ⟨119903119894119889 (119904119894119889 119898119891)⟩ into Map task
(2) emit ⟨119903119894119889 (119904119894119889 119898119891)⟩ which is exactly the same
as it reads
4 The Scientific World Journal
We input ⟨119903119894119889 (119904119894119889 119898119891)⟩ to theMap task and then output
exactly the same key-value pair Therefore the Reduce taskreceives all the graph pairs generated in job 1 with the specific119904119894119889 As the graph pair may have more than one type ofcommon signatures 119898
119891is summed with the same graph pair
for each identical signature respectively Subsequently thegraph pairs with less than 119871119861 (1) common signatures will bediscarded The remaining graph pairs are candidate pairs forverification The function for Reduce is as follows
Reduce⟨119903119894119889 119897119894119904119905(119904119894119889 + 119898119891)⟩ rarr ⟨119903119894119889 119904119894119889⟩
(1) receive output from Mappers the specific 119903119894119889and a list of 119904119894119889 with 119898
119891
(2) sum 119898119891and calculate the number of common
signatures for each pair ⟨119903119894119889 119904119894119889⟩(3) conduct the count filtering for each pair and
output pairs to DFS whose common signaturesare more than 119871119861
32 Verification In the verification phase candidate pairs⟨119903119894119889 119904119894119889⟩ are to be verified where the graphs 119903 and 119904 arerequired that is join operations are necessary to retrievegraphs 119903 and 119904 by their 119894119889rsquos Hence we allocate two MapRe-duce jobs to join (119903119894119889 119903) (119903119894119889 119904119894119889) and (119904119894119889 119904)
Job 1 Job 1 replaces 119904119894119889 with graph 119904 The map function takesgraph set 119878 and ⟨119903119894119889 119904119894119889⟩ as input and emits ⟨119903119894119889 119904⟩ which islisted as follows
Map⟨119903119894119889 119904119894119889⟩⟨119904119894119889 119904⟩ rarr ⟨119904119894119889 119903119894119889119904⟩
(1) input candidate pair ⟨119903119894119889 119904119894119889⟩ and graph set 119878(2) for ⟨119903119894119889 119904119894119889⟩ emit ⟨119904119894119889 119903119894119889⟩ and for ⟨119904119894119889 119904⟩
emit it exactly both of which take 119904119894119889 as the key
TheReduce task gathers the list 119903119894119889 and graph 119904 for the key119904119894119889 and then outputs the key-value pair ⟨119903119894119889 119904⟩ The functionfor Reducer is as follows
Reduce⟨119904119894119889 119897119894119904119905(119903119894119889119904)⟩ rarr ⟨119903119894119889 119904⟩
(1) receive a list of 119903119894119889 and graph set 119878 with thespecific key 119904119894119889
(2) for ⟨119903119894119889 119904119894119889⟩ replace 119904119894119889 with 119904 and output pair⟨119903119894119889 119904⟩
Job 2 Job 2 replaces 119903119894119889 with graph 119903 invokes label filteringconditions and calculates GED to find the similar graphpairs
The function for Map task is as follows
Map⟨119903119894119889 119904⟩⟨119903119894119889 119903⟩ rarr ⟨119903119894119889 119903119904⟩
(1) input the candidate pair ⟨119903119894119889 119904⟩ and graph set119877(2) emit the key-value pair it reads where the group
key is 119903119894119889 for both cases
The function for Reduce task is as follows
Reduce⟨119903119894119889 119897119894119904119905(119903119904)⟩ rarr ⟨119903119894119889 119904119894119889⟩
(1) receive a list of values that consisted of graphs 119903
and 119904 corresponding to the key 119903119894119889(2) replace 119903119894119889 with graph 119903(3) calculate GED for pairs ⟨119903 119904⟩ and output similar
pairs
33 Correctness and Complexity Analysis All graph pairs thatsatisfy the edit distance constraints are returned error-freewhich justifies its correctness
For algorithm complexity we take all three phasesmdashMap Reduce and Shufflemdashinto consideration IO readingoverhead from distributed file system (DFS) is considered forMap task whereas IO writing overhead into DFS is analyzedfor Reduce task Both tasks also take time complexity intoconsideration Shuffle considers the network communicationcost
Some parameters are defined preceding the analysis |119892|
denotes the average size of a graph 120572 means candidate ratiothat is the percentage of candidate pairs from all graph pairsand 120573 means the ratio of similar pairs from pairs that passedcount filtering We assume that the size of key-value pair thatcontains only graph IDs is 1
In job 1 of filter phase Map reads the data of graph sets119877 and 119878 from DFS the cost of which is 119874(|119892| sdot (|119877| + |119878|))Consider the worst case in Map task where 119902-grams aregenerated for each graph 119903(119904) and the number of signaturesgenerated is |119881|
119902
so the time complexity for generating119902-grams can be estimated as 119874(|119881|
119902
) Therefore the timecomplexity for Map task is 119874(|119881|
119902
sdot (|119877| + |119878|)) As eachgenerated 119902-gram signature forms an output key-value pair(the size for the pair is 1) the communication cost for Shuffleis119874(|119881|
119902
sdot(|119877|+|119878|)) In theReduce task all graphs representedby 119894119889 containing the same signature are acquired and pairedThus the IO cost is 119874(|119877||119878|) Then the occurrences of119903119894119889 and 119904119894119889 are counted In other words all the key-valuepairs generated from the Map task are counted so the timecomplexity for Reduce is 119874(|119881|
119902
sdot (|119877| + |119878|))Consider job 2 in filter phase where all the output key-
value pairs from job 1 are read intoMap task of job 2 so the IOcost is 119874(|119877||119878|) Map outputs the key-value pairs it reads sothe time complexity forMap task and communication cost forShuffle are119874(|119877||119878|) In Reduce task all pairs are traversed sothe time complexity is119874(|119877||119878|) As we have 120572|119877||119878| candidatekey-value pairs the IO cost for Reduce task is 119874(120572|119877||119878|)
In verification phase of job 1 Map reads the candidatepairs represented by ⟨119903119894119889 119904119894119889⟩ and graph set 119878 from DFSwhere the IO overhead requires 119874(120572|119877||119878| + |119892||119878|) The timecomplexity of Map task and communication cost of Shuffleare also 119874(120572|119877||119878| + |119892||119878|) because in Map function we justemit what has been exactly input into Then in Reduce taskwe replace 119904119894119889with graph 119904 the time complexity is119874(120572|119877||119878|)and the IO overhead is 119874(120572|119892||119877||119878|)
In verification phase of job 2 Map reads the candidatepairs represented by ⟨119903119894119889 119904⟩ and graph set 119877 from DFSwhere the IO overhead requires 119874(|119892||119877|(120572|119878| + 1)) The
The Scientific World Journal 5
Input graph object sets 119877 and 119878 GED threshold 120591
Output similar graph pairs ⟨119903 119904⟩
(1) Filter(2) Job 1 count the same type of common signatures for graph pairs(3) Job 2 count the total common signatures and check the count filtering fir graph pairs(4) Verification(5) Job 1 replace sid with graph 119904(6) Job 2 replace rid with graph 119903 and calculate GED for candidate pairs
Algorithm 1MGSJoin
(1)Map ⟨119903119894119889 119903⟩ ⟨119904119894119889 119904⟩ rarr ⟨119903119894119889 119904119887119891119903⟩ ⟨119904119894119889 119904119887119891
119904⟩
(2) create SBF for each graph and insert the generated 119902-gram signatures into it(3) output ⟨119903119894119889 119904119887119891
119903⟩ or ⟨119904119894119889 119904119887119891
119904⟩
(4) Shuffle conduct Cartesian product between pairs ⟨119903119894119889 119904119887119891119903⟩ and pairs ⟨119904119894119889 119904119887119891
119904⟩
(5) Reduce ⟨⟨119903119894119889 119904119887119891119903⟩ 119897119894119904119905 (⟨119904119894119889 119904119887119891
119904⟩)⟩ rarr ⟨119903119894119889 119904119894119889⟩
(6) for any ⟨119903119894119889 119904119894119889⟩ calculate the intersection of 119904119887119891119903and 119904119887119891
119904and estimate
the number of common signatures(7) invoke the count filtering for each pair ⟨119903119894119889 119904119894119889⟩ and output it if it passes the condition
Algorithm 2 Replacement of filter of Algorithm 1
time complexity of Map task and communication cost ofShuffle are also 119874(|119892||119877|(120572|119878| + 1)) for the same reason aboveThen in Reduce task we first replace 119903119894119889 with graph 119903 andthen calculate GED for graph pairs The GED calculation by
theAlowast-based algorithm requires 119874(|119881||119881|
) In the worst case
the time complexity of Reduce is 119874(120572|119877||119878||119881||119881|
) Finallysimilar graph pairs are emitted into DFS which requires119874(120572120573|119877||119878|)
4 Incorporating Bloom Filters
In the filtering phase of Algorithm 1 twoMapReduce jobs arerequired with many intermediate key-value pairs generatedand transmitted These increase the IO and communicationcost which can be fairly time-consuming This sectionintroduces the Bloom filter technique to reduce such costNext we first recall the concept of spectral Bloom filters
41 Spectral Bloom Filter Bloom filters [12] are space efficientdata structures which allow fast membership queries over agiven set A Bloom filter uses 119896 hash functions ℎ
1 ℎ2 ℎ
119896
to hash elements into an array of size 119898 For an element 119890 inthe set the bit at positions ℎ
1(119890) ℎ2(119890) ℎ
119896(119890) in the array is
set to 1 Given a query item 119902 we check its membership in theset by examining the bits at positions ℎ
1(119902) ℎ2(119902) ℎ
119896(119902)
of the array The item 119902 is reported to be contained in theset if (and only if) all the aforementioned bits are 1 Thismethod brings a small probability of false-positive that isit may return a positive result for an item which actually isnot contained in the set but no false-negative while gainingsubstantial space savings [13]
Spectral Bloom filter (SBF) [14] generalized the basicBloom filter to be able to record the element frequency
which is thus adopted in this paper SBF is represented by⟨119860 119891⟩ where 119860 is a set and 119891 is a map from 119860 to naturalnumbers that is 119891 119860 rarr 119873 where 119873 is the universe ofnatural numbers SBF replaces the bit vector with a vectorof 119898 counters For insertion of item 119890 the counters 119862 =
119862ℎ1(119890)
119862ℎ2(119890)
119862ℎ119896(119890)
are increased by 1 for insertion anddecreased by 1 for deletion Let 119891(119902) denote the frequency of119902 A basic query for SBF on an item 119902 returns an estimationon 119891(119902) that is 119891(119902) = min
119894isin1119896119862ℎ119894(119902)
Note that similarto Bloom filters SBF also never underestimate 119891(119902)
42 Algorithm Incorporating SBF not only reduces the num-ber of key-value pairs but also contracts the two MapReducejobs into one in the filtering phase In particular the Maptask takes graph sets 119877 and 119878 as input Then we createa SBF for each graph by adding the 119902-gram signaturesCartesian product is conducted for the output key-valuepairs ⟨119903119894119889 119904119887119891
119903⟩ and ⟨119904119894119889 119904119887119891
119904⟩ In the Reduce task for
each graph pair their SBFs are intersected to estimate thenumber of common signatures By intersecting two SBFswithcounters 119862
(1) and 119862(2) respectively it returns another SBF
with counters119862lowast119862lowast119894
= min119862(1)
119894 119862(2)
119894 119894 isin 1 119898 Hence
the number of common signatures could be estimated bylfloor(1119896) sum
119898
119894=1119862lowast
119894rfloor Subsequently the graph pairs which have
less than 119871119861 common signatures given by count filteringcondition will be discarded and the remaining pairs formthe candidate set
We provide the pseudocode of the aforementioned pro-cess in Algorithm 2
43 Correctness and Complexity Analysis There is smallprobability that the false positive case happens with SBFSpecifically a query for an item 119902 in SBF on 119891(119902) 119891(119902)
6 The Scientific World Journal
Table 1 Complexity analysis of filtering
Phase Job 1 Job 2 +SBF
Map IO 119874 (1003816100381610038161003816119892
1003816100381610038161003816 sdot (|119877| + |119878|)) 119874 (|119877| |119878|) 119874 (1003816100381610038161003816119892
1003816100381610038161003816 sdot (|119877| + |119878|))
Time 119874 ((|119881|119902
sdot (|119877| + |119878|)) 119874 (|119877| |119878|) 119874 (119896|119881|119902
sdot (|119877| + |119878|))
Shuffle 119874 ((|119881|119902
sdot (|119877| + |119878|)) 119874 (|119877| |119878|) 119874 (119898 |119877| |119878|)
Reduce IO 119874(|119877||119878|) 119874 (120572 |119877| |119878|) 119874 (120572 |119877| |119878|)
Time 119874 ((|119881|119902
sdot (|119877| + |119878|)) 119874 (|119877| |119878|) 119874 (|119877| |119878|)
may be larger than 119891(119902) Therefore the number of commonsignatures estimated this way may be larger than the actualvalue Nonetheless false-negative will never happen whichensures the correctness of the algorithm in a certain case thepruning power of count filtering will be impaired Besidesfalse-positive will be less likely to happen if one carefullychooses the hash functions and configures the sizes ofcounters
Then we analyze the complexity Let 119898 be the size ofa SBF The Map task reads the entire sets 119877 and 119878 so itsIO cost is 119874(|119892|(|119877| + |119878|)) Then signatures are generatedand added to SBF and 119896 hash values are calculated for eachsignatureThus the time complexity forMap is119874(119896|119881|
119902
(|119877|+
|119878|)) Map emits the SBF for each input graph and thenCartesian product is conductedThe communication cost forShuffle is119874(119898
2
|119877||119878|) In the Reduce task for each graph pairthe number of common signatures is calculated and countfiltering condition is checked Thus the time complexity is119874(|119877||119878|) Regardless of false-positive cases the IO cost is thesame as before Denoting the improved algorithm by ldquo+SBFrdquowe summarize the complexity results in Table 1
5 Optimizing Verification Phase
In verification phase we need to calculate the GED ofcandidate pairs Nevertheless it is not capable of finishingthe calculation with large graphs and threshold 120591 Thereforewe devise a MapReduce-based method for GED calculationwhich is able of handling large-scale graphs Besides joinoperations are required preceding the GED calculation to getthe entire graphThus three relations 119877(119903119894119889 119903)⋈119862(119903119894119889 119904119894119889)⋈
119878(119904119894119889 119904) are joined to obtain the input of the GED algorithmwhere 119862(119903119894119889 119904119894119889) is the output of the filtering phase Inspiredby the idea of multiway join we can reduce the number ofrequired MapReduce jobs from two to one
51 MapReduce for GED Calculation TheGED calculation isbased onAlowast algorithmAlowast constructs a search-tree the nodeof which represents a mapping status A mapping status isstored in an array (denoted by 119909) the index of which standsfor different vertices in graph 119903 and the corresponding valuestands for the vertices of graph 119904Alowast explores the space of allpossible vertex mappings between two graphs in a best-firstsearch fashion with function (denoted by 119891(119909)) establishedto determine the order in which the search visits vertexmappings 119891(119909) is a sum of two functions (1) the distancefrom the initial state to the current state (denoted by 119892(119909))
(2) a heuristic estimate of the distance from the current stateto the goal (denoted by ℎ(119909)) 119892(119909) and ℎ(119909) are calculated bythe following equations
119892 (119909) = GED (119903119901 119904119901)
ℎ (119909) = Γ (119871119881
(119903119901) 119871119881
(119904119902)) + Γ (119871
119864(119903119902) 119871119864
(119904119902))
(2)
119903119901consists of the vertices that have been mapped and edges
connecting them while 119903119902consists of the vertices unmapped
yet as well as their resident edges The equation for ℎ(119909)
represents the label difference of two graphsThe search space forAlowast-based approach is very large
requiring 119874(|119881||119881|
) In order to boost up the searchingprocedure parallelization is a common way to think aboutThe naive way is to allocate different branches of search-tree to different workers so that the searching procedure canproceed in parallel However the load is not balanced this wayso that the final runtime is determined on the worker withthe heaviest load As a consequence we devise aMapReduce-based method to calculate GED denoted by MRGED whichreallocates the works after each MapReduce round
The format of the key-value pairs to be manipulated byMapReduce is ⟨119909 119891(119909)⟩ The procedure of searching for theresult is through iterations that each iterationwalks down onelayer of the search-tree
52 Multiway Join In relational database a multiway joincan process 119879
1(119860 119861) ⋈ 119879
2(119861 119862) ⋈ 119879
3(119862 119863) together in one
round where 119879119894(119883 119884) is a relational table with attributes
119883 and 119884 (Algorithm 3) Following the same idea we canconsolidate the twoMapReduce jobs required for verificationin Algorithm 1 Specifically let ℎ be a hash function withrange 1 2 119899 where 119899
2 is the number of Reducers Weassociate eachReduce taskwith a pair (119894 119895) where 119894 119895 isin [1 119899]Each tuple 119905
2(119887 119888) isin 119879
2(119861 119862) is sent to theReducer numbered
(ℎ(119887) ℎ(119888)) while each tuple in1198791(119886 119887) (resp119879
3(119888 119889)) is sent
to Reducers numbered (ℎ(119887) 119909) (resp (119910 ℎ(119888))) for any 119909
(resp 119910) Each Reduce task computes the join of the tuplesit receives It is shown that multiway join is more efficient inpractice than two simple joins [15]
We encapsulate the improved verification procedure inAlgorithm 4
53 Correctness and Complexity Analysis One may immedi-ately verify that Algorithm 4 correctly conducts the verifica-tion
The Scientific World Journal 7
Input 119903 119904 are graphs from candidate pair ⟨119903 119904⟩
Output Graph edit distance of 119903 and 119904
(1) construct the search tree with the root node ⟨0 0 0 0⟩(2) read nodes into Mapper if the node read in is fully mapped namely 119903
119902is empty go to line (8)
(3)Map ⟨119909 119891(119909)⟩ rarr ⟨119909 119891(119909)⟩
(4) calculate 119891(119909) for node 119909 and check whether 119891(119909) gt 120591 if it is continue(5) create new nodes by mapping a vertex V
1isin 119903119902(119909) to a vertex V
2isin 119904119902(119909) for any V
1and V
2
(6) output all the newly created nodes(7) back to line (2)(8) gather the nodes calculate the minimum 119891(119909) and return it
Algorithm 3 MRGED (119903 119904)
(1)Map ⟨119903119894119889 119903⟩ ⟨119903119894119889 119904119894119889⟩ ⟨119904119894119889 119904⟩
(2) if ⟨119903119894119889 119903⟩ then emit(⟨(ℎ(119903119894119889) 119909) ⟨119903119894119889 119903⟩⟩) for any 119909 if ⟨119903119894119889 119904119894119889⟩ thenemit(⟨(ℎ (119903119894119889) ℎ (119904119894119889)) ⟨119903119894119889 119904⟩⟩) for any 119910 else emit(⟨(119910 ℎ(119904119894119889)) ⟨119904119894119889 119904⟩⟩)
(3) Reduce ⟨(ℎ(119903119894119889) ℎ(119904119894119889)) 119897119894119904119905(V119886119897119906119890)⟩ rarr ⟨119903119894119889 119904119894119889⟩
(4) split 119897119894119904119905(V119886119897119906119890) into 119897119894119904119905 (⟨119903119894119889 119903⟩) 119897119894119904119905 (⟨119904119894119889 119904⟩) and 119897119894119904119905 (⟨119903119894119889 119904119894119889⟩)(5) denote 119897119894119904119905 (⟨119903119894119889 119904119894119889⟩) as joinkey(6) for ⟨119903119894119889 119904119894119889⟩ isin 119895119900119894119899119896119890119910119904 119908ℎ119890119903119890 ⟨119903119894119889 119903⟩ isin 119897119894119904119905 (⟨119903119894119889 119903⟩) 119886119899119889 ⟨119904119894119889 119904⟩ isin 119897119894119904119905 (⟨119904119894119889 119904⟩) do(7) if 119892119903119886119901ℎ119904 119903 119886119899119889 119904 119894119899 ⟨119903 119904⟩ 119886119903119890 V119890119903119910 119897119886119903119892119890 then invoke MRGED
calculation else invoke the Alowast-based in-memory algorithm
Algorithm 4 Replacement of verification of Algorithm 1
In the multiway join based verification phase the Maptask takes graph sets 119877 and 119878 and the candidate pairs repre-sented by their 119894119889s as input The input IO cost is 119874(|119892|(|119877| +
|119878|) + 120572|119877||119878|) The key-value pair ⟨119903119894119889 119903⟩ (resp ⟨119904119894119889 119904⟩) issent to 119899 Reduce tasks numbered (119903119894119889 119909) (resp (119910 119904119894119889)) forany119909 (resp119910) whereas the key-value pair ⟨119903119894119889 119904119894119889⟩ is sent tothe only Reduce task numbered (ℎ(119903119894119889) ℎ(119904119894119889)) As a resultthe communication cost for Shuffle is 119874(119899|119892|(|119877| + |119878|) +
120572|119877||119878|) where 1198992 is the number of Reducers In the Reduce
task all candidate graph pairs go through edit distancecomputation Pairs with larger size go through MRGEDwhile pairs with smaller size go throughAlowast For simplicitythe complexity of GED calculation is regarded the same asthe baseline algorithm Labelling the resulting algorithmwithldquo+MJrdquo we summarize the complexity results in Table 2
6 Experiments
61 Experiment Setup We conducted experiments on severalpublicly available real datasets but only present the resultson Pubchem (httppubchemncbinlmnihgov) due to theinterest of space The dataset is constructed by sampling1000000 graphs from Pubchem
Amazon cloud services were used as our experimentplatform Specifically we used Elastic Compute Cloud (EC2)in which the computing nodes are called instances In theexperiment 31 instances were used by defaultmdashone set asmaster node and others as worker nodes The standardconfiguration of all EC2 instances is m1small one CPUof single core with 17 GB memory running Hadoop 112(Table 3)
62 Evaluating Filters In order to evaluate the effectivenessof our filtering techniques we use the term ldquoBasicrdquo for thebaseline algorithm for processing graph similarity joins basedonMapReduce ldquo+SBFrdquo denotes filtering improved algorithmof Basic by incorporating SBF
The algorithm efficiency has been studied and shownin Figure 2(a) However the pruning power is somewhatimpaired so we conducted the experiment to record theincrease of candidate pairs through ldquo+SBFrdquo (cf Figure 2(b))It can be revealed that ldquo+SBFrdquo outweigh ldquoBasicrdquo in efficiencyby sacrificing little pruning power When 120591 = 5 less than 300more candidate pairs are generatedwhile about 5000 secondsare reserved
63 Evaluating Verification The verification was evaluatedwith candidate pairs generated by Basic Term ldquo+MJrdquodenotes applying multiway join in verification whileldquo+MRGEDrdquo denotes adapting alternative MapReduce-basedGED calculation We use term ldquoMGSJoinrdquo to indicate thebasic algorithm incorporating both techniques Figure 3(a)shows the runtime comparison between +MRGED
and MGSJoin The result illustrates the superiority ofapplying multiway join When 120591 = 5 algorithm withmultiway join is about 6000 seconds faster than withordinary joins Figure 3(b) presents the result of evaluatingMRGED where Basic and +MRGED are compared It canbe observed that the runtime of both algorithms growsexponentially When 120591 equals 1 the Basic finishes quickerthan +MRGED (162 s and 204 s resp) while 120591 equals 2 the+MRGED outweighs Basic (the runtime is 712 s and 580 sresp) This is because the calculation required for 120591 = 1
8 The Scientific World Journal
Table 2 Complexity analysis of verification
Phase Job 1 Job 2 +MJ
Map IO 119874 (120572 |119877| |119878| +1003816100381610038161003816119892
1003816100381610038161003816 |119878|) 119874 (1003816100381610038161003816119892
1003816100381610038161003816 |119877| sdot (120572 |119878| + 1)) 119874 (1003816100381610038161003816119892
1003816100381610038161003816 (|119877| + |119878|) + 120572 |119877| |119878|)
Time 119874 (120572 |119877| |119878| +1003816100381610038161003816119892
1003816100381610038161003816 |119878|) 119874 (1003816100381610038161003816119892
1003816100381610038161003816 |119877| sdot (120572 |119878| + 1)) 119874 (119899 (|119877| + |119878|) + 120572 |119877| |119878|)
Shuffle 119874 (120572 |119877| |119878| +1003816100381610038161003816119892
1003816100381610038161003816 |119878|) 119874 (1003816100381610038161003816119892
1003816100381610038161003816 |119877| sdot (120572 |119878| + 1)) 119874 (1198991003816100381610038161003816119892
1003816100381610038161003816 (|119877| + |119878|) + 120572 |119877| |119878|)
Reduce IO 119874 (1205721003816100381610038161003816119892
1003816100381610038161003816 sdot |119877| |119878|) 119874 (120572120573 |119877| |119878|) 119874 (120572120573 |119877| |119878|)
Time 119874 (120572 |119877| |119878|) 119874 (120572|119877||119878|119881||119881|
) 119874 (120572|119877||119878|119881||119881|
)
0 1 2 3 4 5
Elap
sed
time (
s)
GED threshold (120591)
Basic
+SBF
15k
12k
9k
6k
3k
(a)
0
50
100
150
200
250
300
1 2 3 4 5GED threshold (120591)
Num
ber o
f Δ (c
andi
date
pai
rs)
(b)
Figure 2 Evaluating filtering
Table 3 Dataset statistics
Dataset |119877| |119881| |119864| Disk size (GB)Enamine 1000000 5232 5037 07
is small where the MRGED is clumsy compared withAlowastwhereas when 120591 = 2 more calculation is required so that theadvantage of MRGED comes out When 120591 is within the rangeof 3ndash5 Basic is unable to finish because the large calculationdrives Basic out of memory
64 Comparing with State-of-the-Art Method We comparedour algorithm with the state-of-the-art method GSimJoinIn Figure 4(a) we chose 10000 graphs in order to comparewith GSimJoin and the result witnesses the obvious supe-riority of MGSJoin over GSimJoin Figure 4(b) is drawn inlog scale which varies the number of graphs and records theelapsed time Both algorithms grow linear in the figure whichreflect their exponential growth MGSJoin rises mush slowerthan GSimJoin The runtime of GSimJoin is about 10 timeslonger than MGSJoinwhen joining 100 graphs and 100 timeslonger than MGSJoinwhen joining 10000 graphsMoreoverwhen we have to join 100000 graphs GSimJoin is runningout of memory so that no result is recorded
65 Speedup We evaluated the speedup of our algorithm byvarying the number of instances from 10 to 50 The exper-imental results are shown in Figure 5(a) We can see that
with the increase of instances in the cluster the performanceof MGSJoin significantly improved The improvement issignificantly shown when threshold 120591 equals 4 With moreinstances running the count filtering is getting faster bycounting for the common signatures simultaneously andthe verification is getting quicker by joining relations andcalculating GED in parallel
66 Scale-Up We evaluated the scale-up of our algorithmby increasing both dataset sizes and number of nodes in thecluster The result is shown in Figure 5(b) It is worth notingthat as the dataset increased the results for different values of120591 get similar in the trend of increase All lines rise smoothlywhich reveals good scalability of MGSJoin
7 Related Work
Graph SimilarityQueries Similarity joins retrieve similar dataobject pairs which can be strings sets trees and graphs [16]As to GED-based graph similarity search [17] proposed 120581-AT a tree-based 119902-gram approach However it is associatedwith the drawback of usually loose lower bound for countfiltering Seeing the drawback [9] presented a path-based 119902-gram approach In comparison with 120581-AT GSimJoin is moreefficient by leveraging more advanced filtering techniquesThus we adopt the path-based 119902-gram approach in this paper
MapReduce-Based Graph Algorithms MapReduce is a dis-tributed programming framework [10] which has been
The Scientific World Journal 9
Elap
sed
time (
s)
1 2 3 4 5GED threshold (120591)
16k
12k
8k
4k
MGSJoin
+MRGED
(a)
Elap
sed
time (
s)
Basic
+MR
105
104
103
102
1 2 3 4 5GED threshold (120591)
(b)
Figure 3 Evaluating verification
0
Elap
sed
time (
s)
GSimJoin
MGSJoin
1 2 3 4 5GED threshold (120591)
5k
4k
3k
2k
1k
(a)
Elap
sed
time (
s)
Number of graphs
GSimJoin
MGSJoin
105
104
103
102
102
101
103
104
106
105
(b)
Figure 4 Comparing with state-of-the-art method
2k
4k
6k
8k
10k
504540353025201510
Elap
sed
time
Number of instances
120591 = 1
120591 = 2
120591 = 3
120591 = 4
(a)
010 20 30 40 50
Elap
sed
time (
s)
Number of instances and number of graphs (times2000)
5k
4k
3k
2k
1k
(b)
Figure 5 Speedup and Scale-up
10 The Scientific World Journal
applied in processing large graphs Many graph algorithmsusing MapReduce were discussed in [18] including tri-anglesrectangles enumeration and 119896-cliques computationIn [19] several techniques were proposed to reduce theinput size of MapReduce and the techniques are appliedfor minimum spanning trees approximate maximal match-ings approximate nodeedge covers and minimum cutsPersonalized PageRank computation inMapReduce was dis-cussed in [20] Matrix multiplication based graph min-ing algorithms in MapReduce were investigated in [21]More recently densest subgraph computation [22] subgraphinstances enumeration [23] and connected componentscomputation in logarithmic rounds [24] were researched inMapReduce
Graph Processing Systems in Cloud Many systems weredeveloped in order to deal with big graphs Such a one repre-sentative system is Pregel [25] which takes a vertex-centricapproach and implements a bulk synchronous parallel (BSP)computation model HipG [26] improves BSP by using asyn-chronous messages to avoid synchronization PowerGraph[27] is a distributed graph processing system that is optimizedto process power-law graphs Giraph++ was proposed in[28] to take graph partitioning into consideration whenprocessing graphs Workload balancing for graph processingin cloud was discussed in [29]
8 Conclusion
In this paper we have investigated the problem of scalablegraph similarity joins We firstly present a MapReduce-based graph similarity join algorithm MGSJoin following thefiltering-verification framework To reduce the communica-tion cost in the filtering phase it incorporates the Bloomfilter technique to reduce the number of intermediate key-value pairs In addition we devise a multiway join optimizedverification procedure for further speedup Extensive experi-ments are conducted on real datasets to confirm the efficiencyand scalability of the proposed solution Furthermore theverification phase is further optimized with MapReducewhich enables the test on larger and denser graphs
As a future direction we plan to explore the possibility ofoptimizing the verificationwithmultithreaded programmingparadigm Additionally it is also of interest to test theefficiency and scalability of proposed algorithms on evenlarger andor denser graphs
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgments
The research is supported by the doctoral program of highereducation of China (No 2011437110008) and the nationalnatural science foundation of China (No 61303062)
References
[1] X Yan and J Han ldquogSpan graph-based substructure patternminingrdquo inProceedings of the 2nd IEEE International Conferenceon Data Mining (ICDM rsquo02) pp 721ndash724 December 2002
[2] X Yan P S Yu and J Han ldquoGraph indexing a frequentstructure-based approachrdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data (SIGMODrsquo04) pp 335ndash346 June 2004
[3] H He and A K Singh ldquoClosure-tree an index structurefor graph queriesrdquo in Proceedings of the 22nd InternationalConference on Data Engineering (ICDE rsquo06) p 38 April 2006
[4] X Yan P S Yu and J Han ldquoSubstructure similarity searchin graph databasesrdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data pp 766ndash777June 2005
[5] Y Tian and J M Patel ldquoTALE a tool for approximate largegraph matchingrdquo in Proceedings of the 24th IEEE InternationalConference on Data Engineering (ICDE rsquo08) pp 963ndash972 April2008
[6] A Sanfeliu and K-S Fu ldquoA distance measure betweenattributed relational graphs for pattern recognitionrdquo IEEETransactions on Systems Man and Cybernetics vol 13 no 3 pp353ndash362 1983
[7] H Bunke and G Allermann ldquoInexact graph matching forstructural pattern recognitionrdquo Pattern Recognition Letters vol1 no 4 pp 245ndash253 1983
[8] M R Garey and D S Johnson Computers and Intractabilityvol 174 Freeman San Francisco Calif USA 1979
[9] X Zhao C Xiao X Lin and W Wang ldquoEfficient graphsimilarity joins with edit distance constraintsrdquo in Proceedingsof the 28th IEEE International Conference on Data Engineering(ICDE rsquo12) pp 834ndash845 April 2012
[10] J Dean and SGhemawat ldquoMapReduce simplified data process-ing on large clustersrdquo Communications of the ACM vol 51 no1 pp 107ndash113 2008
[11] D Deng G Li S Hao JWang J Feng andW S Li ldquoMaSSJoinaMapReduce-basedmethod for scalable string similarity joinsrdquoinProceedings of the 30th IEEE International Conference onDataEngineering (ICDE rsquo14) pp 340ndash351 Chicago Ill USA
[12] B H Bloom ldquoSpacetime trade-offs in hash coding withallowable errorsrdquoCommunications of the ACM vol 13 no 7 pp422ndash426 1970
[13] A Broder and M Mitzenmacher ldquoNetwork applications ofbloom filters a surveyrdquo Internet Mathematics vol 1 no 4 pp485ndash509 2004
[14] S Cohen and Y Matias ldquoSpectral bloom filtersrdquo in Proceedingsof the ACM SIGMOD International Conference on Managementof Data pp 241ndash252 June 2003
[15] F N Afrati and J D Ullman ldquoOptimizing joins in a map-reduce environmentrdquo in Proceedings of the 13th InternationalConference on Extending Database Technology Advances inDatabase Technology (EDBT rsquo10) pp 99ndash110 March 2010
[16] Y N Silva and J M Reed ldquoExploiting MapReduce-based sim-ilarity joinsrdquo in Proceedings of the ACM SIGMOD InternationalConference onManagement of Data (SIGMOD 12) pp 693ndash696May 2012
[17] G Wang B Wang X Yang and G Yu ldquoEfficiently indexinglarge sparse graphs for similarity searchrdquo IEEE Transactions onKnowledge and Data Engineering vol 24 no 3 pp 440ndash4512012
The Scientific World Journal 11
[18] J Cohen ldquoGraph twiddling in a MapReduce worldrdquo Computingin Science and Engineering vol 11 no 4 pp 29ndash41 2009
[19] S Lattanzi B Moseley S Suri and S Vassilvitskii ldquoFilteringa method for solving graph problems in MapReducerdquo inProceedings of the 23rd ACM Symposium on Parallelism inAlgorithms and Architectures (SPAA rsquo11) pp 85ndash94 June 2011
[20] B Bahmani K Chakrabarti and D Xin ldquoFast personalizedPageRank onMapReducerdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data pp 973ndash984June 2011
[21] U Kang H Tong J Sun C-Y Lin and C Faloutsos ldquoGBASEa Scalable and general graph management systemrdquo in Pro-ceedings of the 17th ACM SIGKDD International Conference onKnowledge Discovery andDataMining (KDD rsquo11) pp 1091ndash1099August 2011
[22] B Bahmani R Kumar and S Vassilvitskii ldquoDensest subgraphin streaming andMapReducerdquo Proceedings of the VLDB Endow-ment vol 5 no 5 pp 454ndash465 2012
[23] F N Afrati D Fotakis and J D Ullman ldquoEnumeratingsubgraph instances using map-reducerdquo in Proceedings of the29th International Conference on Data Engineering (ICDE rsquo13)pp 62ndash73 April 2013
[24] V Rastogi A Machanavajjhala L Chitnis and A Das SarmaldquoFinding connected components in map-reduce in logarithmicroundsrdquo in Proceedings of the 29th International Conference onData Engineering (ICDE rsquo13) pp 50ndash61 April 2013
[25] G Malewicz M H Austern A J C Bik et al ldquoPregel asystem for large-scale graph processingrdquo in Proceedings of theInternational Conference on Management of Data (SIGMODrsquo10) pp 135ndash145 June 2010
[26] E Krepska T Kielmann W Fokkink and H Bal ldquoHipg par-allelprocessing of large-scale graphsrdquo ACM SIGOPS OperatingSystems Review vol 45 no 2 pp 3ndash13 2011
[27] J E Gonzalez Y Low H Gu D Bickson and C GuestrinldquoPowergraph distributed graph-parallel computation on natu-ral graphsrdquo in Proceedings of the 10th USENIX Symposium onOperating Systems Design and Implementation (OSDI rsquo12) pp17ndash30 2012
[28] Y Tian A Balmin S A Corsten S Tatikonda and J McPher-son ldquoFrom ldquothink like a vertexrdquo to ldquothink like a graphrdquordquoProceedings of the VLDB Endowment vol 7 no 3 2013
[29] Z Shang and J X Yu ldquoCatch the wind graph workloadbalancing on cloudrdquo in Proceedings of the 29th InternationalConference on Data Engineering (ICDE rsquo13) pp 553ndash564 April2013
Submit your manuscripts athttpwwwhindawicom
Computer Games Technology
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Distributed Sensor Networks
International Journal of
Advances in
FuzzySystems
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014
International Journal of
ReconfigurableComputing
Hindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Applied Computational Intelligence and Soft Computing
thinspAdvancesthinspinthinsp
Artificial Intelligence
HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014
Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Electrical and Computer Engineering
Journal of
Journal of
Computer Networks and Communications
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation
httpwwwhindawicom Volume 2014
Advances in
Multimedia
International Journal of
Biomedical Imaging
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
ArtificialNeural Systems
Advances in
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
RoboticsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Computational Intelligence and Neuroscience
Industrial EngineeringJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Human-ComputerInteraction
Advances in
Computer EngineeringAdvances in
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World Journal 3
0
C1
C2 C3 H
HH
H
(a) Cyclopropanone
0
C1
C2 C3H
H H
N
(b) ldquoCyclopropanonerdquo
Figure 1 Molecules
The path-based 119902-grams of a graph constitute the graphsignatures denoted by 119904119894119892 = ⟨119881 119864 119897
119881 119897119864⟩ where 119904119894119892 sube 119903|119881| =
119902+1 and |119864| = 119902 Let 119876119903be graph 119903rsquos signature set We say 119904119894119892
and 1199041198941198921015840 are common if 119904119894119892 = 119904119894119892
1015840 Note there can be multiple119902-grams that correspond to one particular signature
Lemma 8 (count filtering [9]) Graphs 119903 and 119904 satisfy thedistance constraints 120591 if the number of common signatures for⟨119903 119904⟩ is no less than 119871119861
119871119861 = max (1003816100381610038161003816119876119903
1003816100381610038161003816 minus 120591 sdot 119863 (119903) 1003816100381610038161003816119876119904
1003816100381610038161003816 minus 120591 sdot 119863 (119904)) (1)
where 119863(119903) (resp 119863(119904)) is the maximum number of affectedsignatures in 119876
119903(resp 119876
119904) when one edit operation is invoked
on 119903 (resp 119904)
The count filtering condition check requires119874(max(119889
119902
119903|119881(119903)| 119889
119902
119904|119881(119904)|)) where 119889
119903(resp 119889
119904) is the
average degree of 119903 (resp 119904) A graph similarity join requirespairwise count filtering condition checks thus resulting ina complexity of 119874(max(119889
119902
119903|119881(119903)| 119889
119902
119904|119881(119904)|)|119877||119878|) This can
be intolerable on large-scale datasets Next we present ascalable solution leveraging the MapReduce paradigm
3 Framework
This section presents a MapReduce-based graph similarityjoin algorithm following the filtering-verification fashionThe outline of the algorithm is listed below
31 Filtering Weallocate twoMapReduce jobs to the filteringphase
Job 1 Job 1 counts the same type of common signatures forgraph pairs We use graphs as values and their correspond-ing 119894119889s as keys to compose input key-value pairs 119902-gramsignatures (denoted by 119904119894119892) are generated in the Map taskWe use the generated signatures as keys and 119894119889s of graphs asvalues to form the output key-value pairs As a consequencein the Reduce task we can obtain graphs with commonsignatures For the same signature a graph 119894119889 may appear
more than once since there may exist several 119902-grams in agraph corresponding to an identical signature The functionfor Map task is shown as follows
Map⟨119903119894119889 119903⟩⟨119904119894119889 119904⟩ rarr ⟨119904119894119892 119903119894119889119904119894119889⟩
(1) input ⟨119903119894119889 119903⟩ or ⟨119904119894119889 119904⟩(2) generate 119902-gram signatures for each input
graph(3) emit ⟨119904119894119892 119903119894119889⟩ or ⟨119904119894119892 119904119894119889⟩ for each signature
with the 119894119889 of its corresponding graph
The Reducer gets ⟨119904119894119892 119897119894119904119905(119903119894119889119904119894119889)⟩ in the Reduce taskWedefine119891
119903(resp119891
119904) to denote the occurrences of 119903119894119889 (resp
119904119894119889) in the 119897119894119904119905(V119886119897119906119890) of Reduce input and 119898119891to denote the
occurrences of graph pairs that share the input key (119904119894119892) ofReduce 119898
119891= min(119891
119903 119891119904) 119898119891is calculated and output with
the corresponding graph pair The function of Reduce task isas follows
Reduce⟨119904119894119892 119897119894119904119905(119903119894119889119904119894119889)⟩ rarr ⟨119903119894119889 (119904119894119889 119898119891)⟩
(1) get list of values consisting of 119903119894119889 and 119904119894119889 withthe specific key (119904119894119892)
(2) split the list of values into list of 119903119894119889 and list of119904119894119889
(3) for each pair ⟨119903119894119889 119904119894119889⟩ where 119903119894119889 is from119897119894119904119905(119903119894119889) and 119904119894119889 is from 119897119894119904119905(119904119894119889) calculate 119898
119891
and output ⟨119903119894119889 (119904119894119889 119898119891)⟩
Job 2 Job 2 counts the total common signatures for graphpairs and checks the count filtering condition after which theset of candidate pairs is obtained The Map function is listedas follows
Map⟨119903119894119889 (119904119894119889 119898119891)⟩ rarr ⟨119903119894119889 (119904119894119889 119898
119891)⟩
(1) read a ⟨119903119894119889 (119904119894119889 119898119891)⟩ into Map task
(2) emit ⟨119903119894119889 (119904119894119889 119898119891)⟩ which is exactly the same
as it reads
4 The Scientific World Journal
We input ⟨119903119894119889 (119904119894119889 119898119891)⟩ to theMap task and then output
exactly the same key-value pair Therefore the Reduce taskreceives all the graph pairs generated in job 1 with the specific119904119894119889 As the graph pair may have more than one type ofcommon signatures 119898
119891is summed with the same graph pair
for each identical signature respectively Subsequently thegraph pairs with less than 119871119861 (1) common signatures will bediscarded The remaining graph pairs are candidate pairs forverification The function for Reduce is as follows
Reduce⟨119903119894119889 119897119894119904119905(119904119894119889 + 119898119891)⟩ rarr ⟨119903119894119889 119904119894119889⟩
(1) receive output from Mappers the specific 119903119894119889and a list of 119904119894119889 with 119898
119891
(2) sum 119898119891and calculate the number of common
signatures for each pair ⟨119903119894119889 119904119894119889⟩(3) conduct the count filtering for each pair and
output pairs to DFS whose common signaturesare more than 119871119861
32 Verification In the verification phase candidate pairs⟨119903119894119889 119904119894119889⟩ are to be verified where the graphs 119903 and 119904 arerequired that is join operations are necessary to retrievegraphs 119903 and 119904 by their 119894119889rsquos Hence we allocate two MapRe-duce jobs to join (119903119894119889 119903) (119903119894119889 119904119894119889) and (119904119894119889 119904)
Job 1 Job 1 replaces 119904119894119889 with graph 119904 The map function takesgraph set 119878 and ⟨119903119894119889 119904119894119889⟩ as input and emits ⟨119903119894119889 119904⟩ which islisted as follows
Map⟨119903119894119889 119904119894119889⟩⟨119904119894119889 119904⟩ rarr ⟨119904119894119889 119903119894119889119904⟩
(1) input candidate pair ⟨119903119894119889 119904119894119889⟩ and graph set 119878(2) for ⟨119903119894119889 119904119894119889⟩ emit ⟨119904119894119889 119903119894119889⟩ and for ⟨119904119894119889 119904⟩
emit it exactly both of which take 119904119894119889 as the key
TheReduce task gathers the list 119903119894119889 and graph 119904 for the key119904119894119889 and then outputs the key-value pair ⟨119903119894119889 119904⟩ The functionfor Reducer is as follows
Reduce⟨119904119894119889 119897119894119904119905(119903119894119889119904)⟩ rarr ⟨119903119894119889 119904⟩
(1) receive a list of 119903119894119889 and graph set 119878 with thespecific key 119904119894119889
(2) for ⟨119903119894119889 119904119894119889⟩ replace 119904119894119889 with 119904 and output pair⟨119903119894119889 119904⟩
Job 2 Job 2 replaces 119903119894119889 with graph 119903 invokes label filteringconditions and calculates GED to find the similar graphpairs
The function for Map task is as follows
Map⟨119903119894119889 119904⟩⟨119903119894119889 119903⟩ rarr ⟨119903119894119889 119903119904⟩
(1) input the candidate pair ⟨119903119894119889 119904⟩ and graph set119877(2) emit the key-value pair it reads where the group
key is 119903119894119889 for both cases
The function for Reduce task is as follows
Reduce⟨119903119894119889 119897119894119904119905(119903119904)⟩ rarr ⟨119903119894119889 119904119894119889⟩
(1) receive a list of values that consisted of graphs 119903
and 119904 corresponding to the key 119903119894119889(2) replace 119903119894119889 with graph 119903(3) calculate GED for pairs ⟨119903 119904⟩ and output similar
pairs
33 Correctness and Complexity Analysis All graph pairs thatsatisfy the edit distance constraints are returned error-freewhich justifies its correctness
For algorithm complexity we take all three phasesmdashMap Reduce and Shufflemdashinto consideration IO readingoverhead from distributed file system (DFS) is considered forMap task whereas IO writing overhead into DFS is analyzedfor Reduce task Both tasks also take time complexity intoconsideration Shuffle considers the network communicationcost
Some parameters are defined preceding the analysis |119892|
denotes the average size of a graph 120572 means candidate ratiothat is the percentage of candidate pairs from all graph pairsand 120573 means the ratio of similar pairs from pairs that passedcount filtering We assume that the size of key-value pair thatcontains only graph IDs is 1
In job 1 of filter phase Map reads the data of graph sets119877 and 119878 from DFS the cost of which is 119874(|119892| sdot (|119877| + |119878|))Consider the worst case in Map task where 119902-grams aregenerated for each graph 119903(119904) and the number of signaturesgenerated is |119881|
119902
so the time complexity for generating119902-grams can be estimated as 119874(|119881|
119902
) Therefore the timecomplexity for Map task is 119874(|119881|
119902
sdot (|119877| + |119878|)) As eachgenerated 119902-gram signature forms an output key-value pair(the size for the pair is 1) the communication cost for Shuffleis119874(|119881|
119902
sdot(|119877|+|119878|)) In theReduce task all graphs representedby 119894119889 containing the same signature are acquired and pairedThus the IO cost is 119874(|119877||119878|) Then the occurrences of119903119894119889 and 119904119894119889 are counted In other words all the key-valuepairs generated from the Map task are counted so the timecomplexity for Reduce is 119874(|119881|
119902
sdot (|119877| + |119878|))Consider job 2 in filter phase where all the output key-
value pairs from job 1 are read intoMap task of job 2 so the IOcost is 119874(|119877||119878|) Map outputs the key-value pairs it reads sothe time complexity forMap task and communication cost forShuffle are119874(|119877||119878|) In Reduce task all pairs are traversed sothe time complexity is119874(|119877||119878|) As we have 120572|119877||119878| candidatekey-value pairs the IO cost for Reduce task is 119874(120572|119877||119878|)
In verification phase of job 1 Map reads the candidatepairs represented by ⟨119903119894119889 119904119894119889⟩ and graph set 119878 from DFSwhere the IO overhead requires 119874(120572|119877||119878| + |119892||119878|) The timecomplexity of Map task and communication cost of Shuffleare also 119874(120572|119877||119878| + |119892||119878|) because in Map function we justemit what has been exactly input into Then in Reduce taskwe replace 119904119894119889with graph 119904 the time complexity is119874(120572|119877||119878|)and the IO overhead is 119874(120572|119892||119877||119878|)
In verification phase of job 2 Map reads the candidatepairs represented by ⟨119903119894119889 119904⟩ and graph set 119877 from DFSwhere the IO overhead requires 119874(|119892||119877|(120572|119878| + 1)) The
The Scientific World Journal 5
Input graph object sets 119877 and 119878 GED threshold 120591
Output similar graph pairs ⟨119903 119904⟩
(1) Filter(2) Job 1 count the same type of common signatures for graph pairs(3) Job 2 count the total common signatures and check the count filtering fir graph pairs(4) Verification(5) Job 1 replace sid with graph 119904(6) Job 2 replace rid with graph 119903 and calculate GED for candidate pairs
Algorithm 1MGSJoin
(1)Map ⟨119903119894119889 119903⟩ ⟨119904119894119889 119904⟩ rarr ⟨119903119894119889 119904119887119891119903⟩ ⟨119904119894119889 119904119887119891
119904⟩
(2) create SBF for each graph and insert the generated 119902-gram signatures into it(3) output ⟨119903119894119889 119904119887119891
119903⟩ or ⟨119904119894119889 119904119887119891
119904⟩
(4) Shuffle conduct Cartesian product between pairs ⟨119903119894119889 119904119887119891119903⟩ and pairs ⟨119904119894119889 119904119887119891
119904⟩
(5) Reduce ⟨⟨119903119894119889 119904119887119891119903⟩ 119897119894119904119905 (⟨119904119894119889 119904119887119891
119904⟩)⟩ rarr ⟨119903119894119889 119904119894119889⟩
(6) for any ⟨119903119894119889 119904119894119889⟩ calculate the intersection of 119904119887119891119903and 119904119887119891
119904and estimate
the number of common signatures(7) invoke the count filtering for each pair ⟨119903119894119889 119904119894119889⟩ and output it if it passes the condition
Algorithm 2 Replacement of filter of Algorithm 1
time complexity of Map task and communication cost ofShuffle are also 119874(|119892||119877|(120572|119878| + 1)) for the same reason aboveThen in Reduce task we first replace 119903119894119889 with graph 119903 andthen calculate GED for graph pairs The GED calculation by
theAlowast-based algorithm requires 119874(|119881||119881|
) In the worst case
the time complexity of Reduce is 119874(120572|119877||119878||119881||119881|
) Finallysimilar graph pairs are emitted into DFS which requires119874(120572120573|119877||119878|)
4 Incorporating Bloom Filters
In the filtering phase of Algorithm 1 twoMapReduce jobs arerequired with many intermediate key-value pairs generatedand transmitted These increase the IO and communicationcost which can be fairly time-consuming This sectionintroduces the Bloom filter technique to reduce such costNext we first recall the concept of spectral Bloom filters
41 Spectral Bloom Filter Bloom filters [12] are space efficientdata structures which allow fast membership queries over agiven set A Bloom filter uses 119896 hash functions ℎ
1 ℎ2 ℎ
119896
to hash elements into an array of size 119898 For an element 119890 inthe set the bit at positions ℎ
1(119890) ℎ2(119890) ℎ
119896(119890) in the array is
set to 1 Given a query item 119902 we check its membership in theset by examining the bits at positions ℎ
1(119902) ℎ2(119902) ℎ
119896(119902)
of the array The item 119902 is reported to be contained in theset if (and only if) all the aforementioned bits are 1 Thismethod brings a small probability of false-positive that isit may return a positive result for an item which actually isnot contained in the set but no false-negative while gainingsubstantial space savings [13]
Spectral Bloom filter (SBF) [14] generalized the basicBloom filter to be able to record the element frequency
which is thus adopted in this paper SBF is represented by⟨119860 119891⟩ where 119860 is a set and 119891 is a map from 119860 to naturalnumbers that is 119891 119860 rarr 119873 where 119873 is the universe ofnatural numbers SBF replaces the bit vector with a vectorof 119898 counters For insertion of item 119890 the counters 119862 =
119862ℎ1(119890)
119862ℎ2(119890)
119862ℎ119896(119890)
are increased by 1 for insertion anddecreased by 1 for deletion Let 119891(119902) denote the frequency of119902 A basic query for SBF on an item 119902 returns an estimationon 119891(119902) that is 119891(119902) = min
119894isin1119896119862ℎ119894(119902)
Note that similarto Bloom filters SBF also never underestimate 119891(119902)
42 Algorithm Incorporating SBF not only reduces the num-ber of key-value pairs but also contracts the two MapReducejobs into one in the filtering phase In particular the Maptask takes graph sets 119877 and 119878 as input Then we createa SBF for each graph by adding the 119902-gram signaturesCartesian product is conducted for the output key-valuepairs ⟨119903119894119889 119904119887119891
119903⟩ and ⟨119904119894119889 119904119887119891
119904⟩ In the Reduce task for
each graph pair their SBFs are intersected to estimate thenumber of common signatures By intersecting two SBFswithcounters 119862
(1) and 119862(2) respectively it returns another SBF
with counters119862lowast119862lowast119894
= min119862(1)
119894 119862(2)
119894 119894 isin 1 119898 Hence
the number of common signatures could be estimated bylfloor(1119896) sum
119898
119894=1119862lowast
119894rfloor Subsequently the graph pairs which have
less than 119871119861 common signatures given by count filteringcondition will be discarded and the remaining pairs formthe candidate set
We provide the pseudocode of the aforementioned pro-cess in Algorithm 2
43 Correctness and Complexity Analysis There is smallprobability that the false positive case happens with SBFSpecifically a query for an item 119902 in SBF on 119891(119902) 119891(119902)
6 The Scientific World Journal
Table 1 Complexity analysis of filtering
Phase Job 1 Job 2 +SBF
Map IO 119874 (1003816100381610038161003816119892
1003816100381610038161003816 sdot (|119877| + |119878|)) 119874 (|119877| |119878|) 119874 (1003816100381610038161003816119892
1003816100381610038161003816 sdot (|119877| + |119878|))
Time 119874 ((|119881|119902
sdot (|119877| + |119878|)) 119874 (|119877| |119878|) 119874 (119896|119881|119902
sdot (|119877| + |119878|))
Shuffle 119874 ((|119881|119902
sdot (|119877| + |119878|)) 119874 (|119877| |119878|) 119874 (119898 |119877| |119878|)
Reduce IO 119874(|119877||119878|) 119874 (120572 |119877| |119878|) 119874 (120572 |119877| |119878|)
Time 119874 ((|119881|119902
sdot (|119877| + |119878|)) 119874 (|119877| |119878|) 119874 (|119877| |119878|)
may be larger than 119891(119902) Therefore the number of commonsignatures estimated this way may be larger than the actualvalue Nonetheless false-negative will never happen whichensures the correctness of the algorithm in a certain case thepruning power of count filtering will be impaired Besidesfalse-positive will be less likely to happen if one carefullychooses the hash functions and configures the sizes ofcounters
Then we analyze the complexity Let 119898 be the size ofa SBF The Map task reads the entire sets 119877 and 119878 so itsIO cost is 119874(|119892|(|119877| + |119878|)) Then signatures are generatedand added to SBF and 119896 hash values are calculated for eachsignatureThus the time complexity forMap is119874(119896|119881|
119902
(|119877|+
|119878|)) Map emits the SBF for each input graph and thenCartesian product is conductedThe communication cost forShuffle is119874(119898
2
|119877||119878|) In the Reduce task for each graph pairthe number of common signatures is calculated and countfiltering condition is checked Thus the time complexity is119874(|119877||119878|) Regardless of false-positive cases the IO cost is thesame as before Denoting the improved algorithm by ldquo+SBFrdquowe summarize the complexity results in Table 1
5 Optimizing Verification Phase
In verification phase we need to calculate the GED ofcandidate pairs Nevertheless it is not capable of finishingthe calculation with large graphs and threshold 120591 Thereforewe devise a MapReduce-based method for GED calculationwhich is able of handling large-scale graphs Besides joinoperations are required preceding the GED calculation to getthe entire graphThus three relations 119877(119903119894119889 119903)⋈119862(119903119894119889 119904119894119889)⋈
119878(119904119894119889 119904) are joined to obtain the input of the GED algorithmwhere 119862(119903119894119889 119904119894119889) is the output of the filtering phase Inspiredby the idea of multiway join we can reduce the number ofrequired MapReduce jobs from two to one
51 MapReduce for GED Calculation TheGED calculation isbased onAlowast algorithmAlowast constructs a search-tree the nodeof which represents a mapping status A mapping status isstored in an array (denoted by 119909) the index of which standsfor different vertices in graph 119903 and the corresponding valuestands for the vertices of graph 119904Alowast explores the space of allpossible vertex mappings between two graphs in a best-firstsearch fashion with function (denoted by 119891(119909)) establishedto determine the order in which the search visits vertexmappings 119891(119909) is a sum of two functions (1) the distancefrom the initial state to the current state (denoted by 119892(119909))
(2) a heuristic estimate of the distance from the current stateto the goal (denoted by ℎ(119909)) 119892(119909) and ℎ(119909) are calculated bythe following equations
119892 (119909) = GED (119903119901 119904119901)
ℎ (119909) = Γ (119871119881
(119903119901) 119871119881
(119904119902)) + Γ (119871
119864(119903119902) 119871119864
(119904119902))
(2)
119903119901consists of the vertices that have been mapped and edges
connecting them while 119903119902consists of the vertices unmapped
yet as well as their resident edges The equation for ℎ(119909)
represents the label difference of two graphsThe search space forAlowast-based approach is very large
requiring 119874(|119881||119881|
) In order to boost up the searchingprocedure parallelization is a common way to think aboutThe naive way is to allocate different branches of search-tree to different workers so that the searching procedure canproceed in parallel However the load is not balanced this wayso that the final runtime is determined on the worker withthe heaviest load As a consequence we devise aMapReduce-based method to calculate GED denoted by MRGED whichreallocates the works after each MapReduce round
The format of the key-value pairs to be manipulated byMapReduce is ⟨119909 119891(119909)⟩ The procedure of searching for theresult is through iterations that each iterationwalks down onelayer of the search-tree
52 Multiway Join In relational database a multiway joincan process 119879
1(119860 119861) ⋈ 119879
2(119861 119862) ⋈ 119879
3(119862 119863) together in one
round where 119879119894(119883 119884) is a relational table with attributes
119883 and 119884 (Algorithm 3) Following the same idea we canconsolidate the twoMapReduce jobs required for verificationin Algorithm 1 Specifically let ℎ be a hash function withrange 1 2 119899 where 119899
2 is the number of Reducers Weassociate eachReduce taskwith a pair (119894 119895) where 119894 119895 isin [1 119899]Each tuple 119905
2(119887 119888) isin 119879
2(119861 119862) is sent to theReducer numbered
(ℎ(119887) ℎ(119888)) while each tuple in1198791(119886 119887) (resp119879
3(119888 119889)) is sent
to Reducers numbered (ℎ(119887) 119909) (resp (119910 ℎ(119888))) for any 119909
(resp 119910) Each Reduce task computes the join of the tuplesit receives It is shown that multiway join is more efficient inpractice than two simple joins [15]
We encapsulate the improved verification procedure inAlgorithm 4
53 Correctness and Complexity Analysis One may immedi-ately verify that Algorithm 4 correctly conducts the verifica-tion
The Scientific World Journal 7
Input 119903 119904 are graphs from candidate pair ⟨119903 119904⟩
Output Graph edit distance of 119903 and 119904
(1) construct the search tree with the root node ⟨0 0 0 0⟩(2) read nodes into Mapper if the node read in is fully mapped namely 119903
119902is empty go to line (8)
(3)Map ⟨119909 119891(119909)⟩ rarr ⟨119909 119891(119909)⟩
(4) calculate 119891(119909) for node 119909 and check whether 119891(119909) gt 120591 if it is continue(5) create new nodes by mapping a vertex V
1isin 119903119902(119909) to a vertex V
2isin 119904119902(119909) for any V
1and V
2
(6) output all the newly created nodes(7) back to line (2)(8) gather the nodes calculate the minimum 119891(119909) and return it
Algorithm 3 MRGED (119903 119904)
(1)Map ⟨119903119894119889 119903⟩ ⟨119903119894119889 119904119894119889⟩ ⟨119904119894119889 119904⟩
(2) if ⟨119903119894119889 119903⟩ then emit(⟨(ℎ(119903119894119889) 119909) ⟨119903119894119889 119903⟩⟩) for any 119909 if ⟨119903119894119889 119904119894119889⟩ thenemit(⟨(ℎ (119903119894119889) ℎ (119904119894119889)) ⟨119903119894119889 119904⟩⟩) for any 119910 else emit(⟨(119910 ℎ(119904119894119889)) ⟨119904119894119889 119904⟩⟩)
(3) Reduce ⟨(ℎ(119903119894119889) ℎ(119904119894119889)) 119897119894119904119905(V119886119897119906119890)⟩ rarr ⟨119903119894119889 119904119894119889⟩
(4) split 119897119894119904119905(V119886119897119906119890) into 119897119894119904119905 (⟨119903119894119889 119903⟩) 119897119894119904119905 (⟨119904119894119889 119904⟩) and 119897119894119904119905 (⟨119903119894119889 119904119894119889⟩)(5) denote 119897119894119904119905 (⟨119903119894119889 119904119894119889⟩) as joinkey(6) for ⟨119903119894119889 119904119894119889⟩ isin 119895119900119894119899119896119890119910119904 119908ℎ119890119903119890 ⟨119903119894119889 119903⟩ isin 119897119894119904119905 (⟨119903119894119889 119903⟩) 119886119899119889 ⟨119904119894119889 119904⟩ isin 119897119894119904119905 (⟨119904119894119889 119904⟩) do(7) if 119892119903119886119901ℎ119904 119903 119886119899119889 119904 119894119899 ⟨119903 119904⟩ 119886119903119890 V119890119903119910 119897119886119903119892119890 then invoke MRGED
calculation else invoke the Alowast-based in-memory algorithm
Algorithm 4 Replacement of verification of Algorithm 1
In the multiway join based verification phase the Maptask takes graph sets 119877 and 119878 and the candidate pairs repre-sented by their 119894119889s as input The input IO cost is 119874(|119892|(|119877| +
|119878|) + 120572|119877||119878|) The key-value pair ⟨119903119894119889 119903⟩ (resp ⟨119904119894119889 119904⟩) issent to 119899 Reduce tasks numbered (119903119894119889 119909) (resp (119910 119904119894119889)) forany119909 (resp119910) whereas the key-value pair ⟨119903119894119889 119904119894119889⟩ is sent tothe only Reduce task numbered (ℎ(119903119894119889) ℎ(119904119894119889)) As a resultthe communication cost for Shuffle is 119874(119899|119892|(|119877| + |119878|) +
120572|119877||119878|) where 1198992 is the number of Reducers In the Reduce
task all candidate graph pairs go through edit distancecomputation Pairs with larger size go through MRGEDwhile pairs with smaller size go throughAlowast For simplicitythe complexity of GED calculation is regarded the same asthe baseline algorithm Labelling the resulting algorithmwithldquo+MJrdquo we summarize the complexity results in Table 2
6 Experiments
61 Experiment Setup We conducted experiments on severalpublicly available real datasets but only present the resultson Pubchem (httppubchemncbinlmnihgov) due to theinterest of space The dataset is constructed by sampling1000000 graphs from Pubchem
Amazon cloud services were used as our experimentplatform Specifically we used Elastic Compute Cloud (EC2)in which the computing nodes are called instances In theexperiment 31 instances were used by defaultmdashone set asmaster node and others as worker nodes The standardconfiguration of all EC2 instances is m1small one CPUof single core with 17 GB memory running Hadoop 112(Table 3)
62 Evaluating Filters In order to evaluate the effectivenessof our filtering techniques we use the term ldquoBasicrdquo for thebaseline algorithm for processing graph similarity joins basedonMapReduce ldquo+SBFrdquo denotes filtering improved algorithmof Basic by incorporating SBF
The algorithm efficiency has been studied and shownin Figure 2(a) However the pruning power is somewhatimpaired so we conducted the experiment to record theincrease of candidate pairs through ldquo+SBFrdquo (cf Figure 2(b))It can be revealed that ldquo+SBFrdquo outweigh ldquoBasicrdquo in efficiencyby sacrificing little pruning power When 120591 = 5 less than 300more candidate pairs are generatedwhile about 5000 secondsare reserved
63 Evaluating Verification The verification was evaluatedwith candidate pairs generated by Basic Term ldquo+MJrdquodenotes applying multiway join in verification whileldquo+MRGEDrdquo denotes adapting alternative MapReduce-basedGED calculation We use term ldquoMGSJoinrdquo to indicate thebasic algorithm incorporating both techniques Figure 3(a)shows the runtime comparison between +MRGED
and MGSJoin The result illustrates the superiority ofapplying multiway join When 120591 = 5 algorithm withmultiway join is about 6000 seconds faster than withordinary joins Figure 3(b) presents the result of evaluatingMRGED where Basic and +MRGED are compared It canbe observed that the runtime of both algorithms growsexponentially When 120591 equals 1 the Basic finishes quickerthan +MRGED (162 s and 204 s resp) while 120591 equals 2 the+MRGED outweighs Basic (the runtime is 712 s and 580 sresp) This is because the calculation required for 120591 = 1
8 The Scientific World Journal
Table 2 Complexity analysis of verification
Phase Job 1 Job 2 +MJ
Map IO 119874 (120572 |119877| |119878| +1003816100381610038161003816119892
1003816100381610038161003816 |119878|) 119874 (1003816100381610038161003816119892
1003816100381610038161003816 |119877| sdot (120572 |119878| + 1)) 119874 (1003816100381610038161003816119892
1003816100381610038161003816 (|119877| + |119878|) + 120572 |119877| |119878|)
Time 119874 (120572 |119877| |119878| +1003816100381610038161003816119892
1003816100381610038161003816 |119878|) 119874 (1003816100381610038161003816119892
1003816100381610038161003816 |119877| sdot (120572 |119878| + 1)) 119874 (119899 (|119877| + |119878|) + 120572 |119877| |119878|)
Shuffle 119874 (120572 |119877| |119878| +1003816100381610038161003816119892
1003816100381610038161003816 |119878|) 119874 (1003816100381610038161003816119892
1003816100381610038161003816 |119877| sdot (120572 |119878| + 1)) 119874 (1198991003816100381610038161003816119892
1003816100381610038161003816 (|119877| + |119878|) + 120572 |119877| |119878|)
Reduce IO 119874 (1205721003816100381610038161003816119892
1003816100381610038161003816 sdot |119877| |119878|) 119874 (120572120573 |119877| |119878|) 119874 (120572120573 |119877| |119878|)
Time 119874 (120572 |119877| |119878|) 119874 (120572|119877||119878|119881||119881|
) 119874 (120572|119877||119878|119881||119881|
)
0 1 2 3 4 5
Elap
sed
time (
s)
GED threshold (120591)
Basic
+SBF
15k
12k
9k
6k
3k
(a)
0
50
100
150
200
250
300
1 2 3 4 5GED threshold (120591)
Num
ber o
f Δ (c
andi
date
pai
rs)
(b)
Figure 2 Evaluating filtering
Table 3 Dataset statistics
Dataset |119877| |119881| |119864| Disk size (GB)Enamine 1000000 5232 5037 07
is small where the MRGED is clumsy compared withAlowastwhereas when 120591 = 2 more calculation is required so that theadvantage of MRGED comes out When 120591 is within the rangeof 3ndash5 Basic is unable to finish because the large calculationdrives Basic out of memory
64 Comparing with State-of-the-Art Method We comparedour algorithm with the state-of-the-art method GSimJoinIn Figure 4(a) we chose 10000 graphs in order to comparewith GSimJoin and the result witnesses the obvious supe-riority of MGSJoin over GSimJoin Figure 4(b) is drawn inlog scale which varies the number of graphs and records theelapsed time Both algorithms grow linear in the figure whichreflect their exponential growth MGSJoin rises mush slowerthan GSimJoin The runtime of GSimJoin is about 10 timeslonger than MGSJoinwhen joining 100 graphs and 100 timeslonger than MGSJoinwhen joining 10000 graphsMoreoverwhen we have to join 100000 graphs GSimJoin is runningout of memory so that no result is recorded
65 Speedup We evaluated the speedup of our algorithm byvarying the number of instances from 10 to 50 The exper-imental results are shown in Figure 5(a) We can see that
with the increase of instances in the cluster the performanceof MGSJoin significantly improved The improvement issignificantly shown when threshold 120591 equals 4 With moreinstances running the count filtering is getting faster bycounting for the common signatures simultaneously andthe verification is getting quicker by joining relations andcalculating GED in parallel
66 Scale-Up We evaluated the scale-up of our algorithmby increasing both dataset sizes and number of nodes in thecluster The result is shown in Figure 5(b) It is worth notingthat as the dataset increased the results for different values of120591 get similar in the trend of increase All lines rise smoothlywhich reveals good scalability of MGSJoin
7 Related Work
Graph SimilarityQueries Similarity joins retrieve similar dataobject pairs which can be strings sets trees and graphs [16]As to GED-based graph similarity search [17] proposed 120581-AT a tree-based 119902-gram approach However it is associatedwith the drawback of usually loose lower bound for countfiltering Seeing the drawback [9] presented a path-based 119902-gram approach In comparison with 120581-AT GSimJoin is moreefficient by leveraging more advanced filtering techniquesThus we adopt the path-based 119902-gram approach in this paper
MapReduce-Based Graph Algorithms MapReduce is a dis-tributed programming framework [10] which has been
The Scientific World Journal 9
Elap
sed
time (
s)
1 2 3 4 5GED threshold (120591)
16k
12k
8k
4k
MGSJoin
+MRGED
(a)
Elap
sed
time (
s)
Basic
+MR
105
104
103
102
1 2 3 4 5GED threshold (120591)
(b)
Figure 3 Evaluating verification
0
Elap
sed
time (
s)
GSimJoin
MGSJoin
1 2 3 4 5GED threshold (120591)
5k
4k
3k
2k
1k
(a)
Elap
sed
time (
s)
Number of graphs
GSimJoin
MGSJoin
105
104
103
102
102
101
103
104
106
105
(b)
Figure 4 Comparing with state-of-the-art method
2k
4k
6k
8k
10k
504540353025201510
Elap
sed
time
Number of instances
120591 = 1
120591 = 2
120591 = 3
120591 = 4
(a)
010 20 30 40 50
Elap
sed
time (
s)
Number of instances and number of graphs (times2000)
5k
4k
3k
2k
1k
(b)
Figure 5 Speedup and Scale-up
10 The Scientific World Journal
applied in processing large graphs Many graph algorithmsusing MapReduce were discussed in [18] including tri-anglesrectangles enumeration and 119896-cliques computationIn [19] several techniques were proposed to reduce theinput size of MapReduce and the techniques are appliedfor minimum spanning trees approximate maximal match-ings approximate nodeedge covers and minimum cutsPersonalized PageRank computation inMapReduce was dis-cussed in [20] Matrix multiplication based graph min-ing algorithms in MapReduce were investigated in [21]More recently densest subgraph computation [22] subgraphinstances enumeration [23] and connected componentscomputation in logarithmic rounds [24] were researched inMapReduce
Graph Processing Systems in Cloud Many systems weredeveloped in order to deal with big graphs Such a one repre-sentative system is Pregel [25] which takes a vertex-centricapproach and implements a bulk synchronous parallel (BSP)computation model HipG [26] improves BSP by using asyn-chronous messages to avoid synchronization PowerGraph[27] is a distributed graph processing system that is optimizedto process power-law graphs Giraph++ was proposed in[28] to take graph partitioning into consideration whenprocessing graphs Workload balancing for graph processingin cloud was discussed in [29]
8 Conclusion
In this paper we have investigated the problem of scalablegraph similarity joins We firstly present a MapReduce-based graph similarity join algorithm MGSJoin following thefiltering-verification framework To reduce the communica-tion cost in the filtering phase it incorporates the Bloomfilter technique to reduce the number of intermediate key-value pairs In addition we devise a multiway join optimizedverification procedure for further speedup Extensive experi-ments are conducted on real datasets to confirm the efficiencyand scalability of the proposed solution Furthermore theverification phase is further optimized with MapReducewhich enables the test on larger and denser graphs
As a future direction we plan to explore the possibility ofoptimizing the verificationwithmultithreaded programmingparadigm Additionally it is also of interest to test theefficiency and scalability of proposed algorithms on evenlarger andor denser graphs
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgments
The research is supported by the doctoral program of highereducation of China (No 2011437110008) and the nationalnatural science foundation of China (No 61303062)
References
[1] X Yan and J Han ldquogSpan graph-based substructure patternminingrdquo inProceedings of the 2nd IEEE International Conferenceon Data Mining (ICDM rsquo02) pp 721ndash724 December 2002
[2] X Yan P S Yu and J Han ldquoGraph indexing a frequentstructure-based approachrdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data (SIGMODrsquo04) pp 335ndash346 June 2004
[3] H He and A K Singh ldquoClosure-tree an index structurefor graph queriesrdquo in Proceedings of the 22nd InternationalConference on Data Engineering (ICDE rsquo06) p 38 April 2006
[4] X Yan P S Yu and J Han ldquoSubstructure similarity searchin graph databasesrdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data pp 766ndash777June 2005
[5] Y Tian and J M Patel ldquoTALE a tool for approximate largegraph matchingrdquo in Proceedings of the 24th IEEE InternationalConference on Data Engineering (ICDE rsquo08) pp 963ndash972 April2008
[6] A Sanfeliu and K-S Fu ldquoA distance measure betweenattributed relational graphs for pattern recognitionrdquo IEEETransactions on Systems Man and Cybernetics vol 13 no 3 pp353ndash362 1983
[7] H Bunke and G Allermann ldquoInexact graph matching forstructural pattern recognitionrdquo Pattern Recognition Letters vol1 no 4 pp 245ndash253 1983
[8] M R Garey and D S Johnson Computers and Intractabilityvol 174 Freeman San Francisco Calif USA 1979
[9] X Zhao C Xiao X Lin and W Wang ldquoEfficient graphsimilarity joins with edit distance constraintsrdquo in Proceedingsof the 28th IEEE International Conference on Data Engineering(ICDE rsquo12) pp 834ndash845 April 2012
[10] J Dean and SGhemawat ldquoMapReduce simplified data process-ing on large clustersrdquo Communications of the ACM vol 51 no1 pp 107ndash113 2008
[11] D Deng G Li S Hao JWang J Feng andW S Li ldquoMaSSJoinaMapReduce-basedmethod for scalable string similarity joinsrdquoinProceedings of the 30th IEEE International Conference onDataEngineering (ICDE rsquo14) pp 340ndash351 Chicago Ill USA
[12] B H Bloom ldquoSpacetime trade-offs in hash coding withallowable errorsrdquoCommunications of the ACM vol 13 no 7 pp422ndash426 1970
[13] A Broder and M Mitzenmacher ldquoNetwork applications ofbloom filters a surveyrdquo Internet Mathematics vol 1 no 4 pp485ndash509 2004
[14] S Cohen and Y Matias ldquoSpectral bloom filtersrdquo in Proceedingsof the ACM SIGMOD International Conference on Managementof Data pp 241ndash252 June 2003
[15] F N Afrati and J D Ullman ldquoOptimizing joins in a map-reduce environmentrdquo in Proceedings of the 13th InternationalConference on Extending Database Technology Advances inDatabase Technology (EDBT rsquo10) pp 99ndash110 March 2010
[16] Y N Silva and J M Reed ldquoExploiting MapReduce-based sim-ilarity joinsrdquo in Proceedings of the ACM SIGMOD InternationalConference onManagement of Data (SIGMOD 12) pp 693ndash696May 2012
[17] G Wang B Wang X Yang and G Yu ldquoEfficiently indexinglarge sparse graphs for similarity searchrdquo IEEE Transactions onKnowledge and Data Engineering vol 24 no 3 pp 440ndash4512012
The Scientific World Journal 11
[18] J Cohen ldquoGraph twiddling in a MapReduce worldrdquo Computingin Science and Engineering vol 11 no 4 pp 29ndash41 2009
[19] S Lattanzi B Moseley S Suri and S Vassilvitskii ldquoFilteringa method for solving graph problems in MapReducerdquo inProceedings of the 23rd ACM Symposium on Parallelism inAlgorithms and Architectures (SPAA rsquo11) pp 85ndash94 June 2011
[20] B Bahmani K Chakrabarti and D Xin ldquoFast personalizedPageRank onMapReducerdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data pp 973ndash984June 2011
[21] U Kang H Tong J Sun C-Y Lin and C Faloutsos ldquoGBASEa Scalable and general graph management systemrdquo in Pro-ceedings of the 17th ACM SIGKDD International Conference onKnowledge Discovery andDataMining (KDD rsquo11) pp 1091ndash1099August 2011
[22] B Bahmani R Kumar and S Vassilvitskii ldquoDensest subgraphin streaming andMapReducerdquo Proceedings of the VLDB Endow-ment vol 5 no 5 pp 454ndash465 2012
[23] F N Afrati D Fotakis and J D Ullman ldquoEnumeratingsubgraph instances using map-reducerdquo in Proceedings of the29th International Conference on Data Engineering (ICDE rsquo13)pp 62ndash73 April 2013
[24] V Rastogi A Machanavajjhala L Chitnis and A Das SarmaldquoFinding connected components in map-reduce in logarithmicroundsrdquo in Proceedings of the 29th International Conference onData Engineering (ICDE rsquo13) pp 50ndash61 April 2013
[25] G Malewicz M H Austern A J C Bik et al ldquoPregel asystem for large-scale graph processingrdquo in Proceedings of theInternational Conference on Management of Data (SIGMODrsquo10) pp 135ndash145 June 2010
[26] E Krepska T Kielmann W Fokkink and H Bal ldquoHipg par-allelprocessing of large-scale graphsrdquo ACM SIGOPS OperatingSystems Review vol 45 no 2 pp 3ndash13 2011
[27] J E Gonzalez Y Low H Gu D Bickson and C GuestrinldquoPowergraph distributed graph-parallel computation on natu-ral graphsrdquo in Proceedings of the 10th USENIX Symposium onOperating Systems Design and Implementation (OSDI rsquo12) pp17ndash30 2012
[28] Y Tian A Balmin S A Corsten S Tatikonda and J McPher-son ldquoFrom ldquothink like a vertexrdquo to ldquothink like a graphrdquordquoProceedings of the VLDB Endowment vol 7 no 3 2013
[29] Z Shang and J X Yu ldquoCatch the wind graph workloadbalancing on cloudrdquo in Proceedings of the 29th InternationalConference on Data Engineering (ICDE rsquo13) pp 553ndash564 April2013
Submit your manuscripts athttpwwwhindawicom
Computer Games Technology
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Distributed Sensor Networks
International Journal of
Advances in
FuzzySystems
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014
International Journal of
ReconfigurableComputing
Hindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Applied Computational Intelligence and Soft Computing
thinspAdvancesthinspinthinsp
Artificial Intelligence
HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014
Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Electrical and Computer Engineering
Journal of
Journal of
Computer Networks and Communications
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation
httpwwwhindawicom Volume 2014
Advances in
Multimedia
International Journal of
Biomedical Imaging
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
ArtificialNeural Systems
Advances in
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
RoboticsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Computational Intelligence and Neuroscience
Industrial EngineeringJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Human-ComputerInteraction
Advances in
Computer EngineeringAdvances in
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
4 The Scientific World Journal
We input ⟨119903119894119889 (119904119894119889 119898119891)⟩ to theMap task and then output
exactly the same key-value pair Therefore the Reduce taskreceives all the graph pairs generated in job 1 with the specific119904119894119889 As the graph pair may have more than one type ofcommon signatures 119898
119891is summed with the same graph pair
for each identical signature respectively Subsequently thegraph pairs with less than 119871119861 (1) common signatures will bediscarded The remaining graph pairs are candidate pairs forverification The function for Reduce is as follows
Reduce⟨119903119894119889 119897119894119904119905(119904119894119889 + 119898119891)⟩ rarr ⟨119903119894119889 119904119894119889⟩
(1) receive output from Mappers the specific 119903119894119889and a list of 119904119894119889 with 119898
119891
(2) sum 119898119891and calculate the number of common
signatures for each pair ⟨119903119894119889 119904119894119889⟩(3) conduct the count filtering for each pair and
output pairs to DFS whose common signaturesare more than 119871119861
32 Verification In the verification phase candidate pairs⟨119903119894119889 119904119894119889⟩ are to be verified where the graphs 119903 and 119904 arerequired that is join operations are necessary to retrievegraphs 119903 and 119904 by their 119894119889rsquos Hence we allocate two MapRe-duce jobs to join (119903119894119889 119903) (119903119894119889 119904119894119889) and (119904119894119889 119904)
Job 1 Job 1 replaces 119904119894119889 with graph 119904 The map function takesgraph set 119878 and ⟨119903119894119889 119904119894119889⟩ as input and emits ⟨119903119894119889 119904⟩ which islisted as follows
Map⟨119903119894119889 119904119894119889⟩⟨119904119894119889 119904⟩ rarr ⟨119904119894119889 119903119894119889119904⟩
(1) input candidate pair ⟨119903119894119889 119904119894119889⟩ and graph set 119878(2) for ⟨119903119894119889 119904119894119889⟩ emit ⟨119904119894119889 119903119894119889⟩ and for ⟨119904119894119889 119904⟩
emit it exactly both of which take 119904119894119889 as the key
TheReduce task gathers the list 119903119894119889 and graph 119904 for the key119904119894119889 and then outputs the key-value pair ⟨119903119894119889 119904⟩ The functionfor Reducer is as follows
Reduce⟨119904119894119889 119897119894119904119905(119903119894119889119904)⟩ rarr ⟨119903119894119889 119904⟩
(1) receive a list of 119903119894119889 and graph set 119878 with thespecific key 119904119894119889
(2) for ⟨119903119894119889 119904119894119889⟩ replace 119904119894119889 with 119904 and output pair⟨119903119894119889 119904⟩
Job 2 Job 2 replaces 119903119894119889 with graph 119903 invokes label filteringconditions and calculates GED to find the similar graphpairs
The function for Map task is as follows
Map⟨119903119894119889 119904⟩⟨119903119894119889 119903⟩ rarr ⟨119903119894119889 119903119904⟩
(1) input the candidate pair ⟨119903119894119889 119904⟩ and graph set119877(2) emit the key-value pair it reads where the group
key is 119903119894119889 for both cases
The function for Reduce task is as follows
Reduce⟨119903119894119889 119897119894119904119905(119903119904)⟩ rarr ⟨119903119894119889 119904119894119889⟩
(1) receive a list of values that consisted of graphs 119903
and 119904 corresponding to the key 119903119894119889(2) replace 119903119894119889 with graph 119903(3) calculate GED for pairs ⟨119903 119904⟩ and output similar
pairs
33 Correctness and Complexity Analysis All graph pairs thatsatisfy the edit distance constraints are returned error-freewhich justifies its correctness
For algorithm complexity we take all three phasesmdashMap Reduce and Shufflemdashinto consideration IO readingoverhead from distributed file system (DFS) is considered forMap task whereas IO writing overhead into DFS is analyzedfor Reduce task Both tasks also take time complexity intoconsideration Shuffle considers the network communicationcost
Some parameters are defined preceding the analysis |119892|
denotes the average size of a graph 120572 means candidate ratiothat is the percentage of candidate pairs from all graph pairsand 120573 means the ratio of similar pairs from pairs that passedcount filtering We assume that the size of key-value pair thatcontains only graph IDs is 1
In job 1 of filter phase Map reads the data of graph sets119877 and 119878 from DFS the cost of which is 119874(|119892| sdot (|119877| + |119878|))Consider the worst case in Map task where 119902-grams aregenerated for each graph 119903(119904) and the number of signaturesgenerated is |119881|
119902
so the time complexity for generating119902-grams can be estimated as 119874(|119881|
119902
) Therefore the timecomplexity for Map task is 119874(|119881|
119902
sdot (|119877| + |119878|)) As eachgenerated 119902-gram signature forms an output key-value pair(the size for the pair is 1) the communication cost for Shuffleis119874(|119881|
119902
sdot(|119877|+|119878|)) In theReduce task all graphs representedby 119894119889 containing the same signature are acquired and pairedThus the IO cost is 119874(|119877||119878|) Then the occurrences of119903119894119889 and 119904119894119889 are counted In other words all the key-valuepairs generated from the Map task are counted so the timecomplexity for Reduce is 119874(|119881|
119902
sdot (|119877| + |119878|))Consider job 2 in filter phase where all the output key-
value pairs from job 1 are read intoMap task of job 2 so the IOcost is 119874(|119877||119878|) Map outputs the key-value pairs it reads sothe time complexity forMap task and communication cost forShuffle are119874(|119877||119878|) In Reduce task all pairs are traversed sothe time complexity is119874(|119877||119878|) As we have 120572|119877||119878| candidatekey-value pairs the IO cost for Reduce task is 119874(120572|119877||119878|)
In verification phase of job 1 Map reads the candidatepairs represented by ⟨119903119894119889 119904119894119889⟩ and graph set 119878 from DFSwhere the IO overhead requires 119874(120572|119877||119878| + |119892||119878|) The timecomplexity of Map task and communication cost of Shuffleare also 119874(120572|119877||119878| + |119892||119878|) because in Map function we justemit what has been exactly input into Then in Reduce taskwe replace 119904119894119889with graph 119904 the time complexity is119874(120572|119877||119878|)and the IO overhead is 119874(120572|119892||119877||119878|)
In verification phase of job 2 Map reads the candidatepairs represented by ⟨119903119894119889 119904⟩ and graph set 119877 from DFSwhere the IO overhead requires 119874(|119892||119877|(120572|119878| + 1)) The
The Scientific World Journal 5
Input graph object sets 119877 and 119878 GED threshold 120591
Output similar graph pairs ⟨119903 119904⟩
(1) Filter(2) Job 1 count the same type of common signatures for graph pairs(3) Job 2 count the total common signatures and check the count filtering fir graph pairs(4) Verification(5) Job 1 replace sid with graph 119904(6) Job 2 replace rid with graph 119903 and calculate GED for candidate pairs
Algorithm 1MGSJoin
(1)Map ⟨119903119894119889 119903⟩ ⟨119904119894119889 119904⟩ rarr ⟨119903119894119889 119904119887119891119903⟩ ⟨119904119894119889 119904119887119891
119904⟩
(2) create SBF for each graph and insert the generated 119902-gram signatures into it(3) output ⟨119903119894119889 119904119887119891
119903⟩ or ⟨119904119894119889 119904119887119891
119904⟩
(4) Shuffle conduct Cartesian product between pairs ⟨119903119894119889 119904119887119891119903⟩ and pairs ⟨119904119894119889 119904119887119891
119904⟩
(5) Reduce ⟨⟨119903119894119889 119904119887119891119903⟩ 119897119894119904119905 (⟨119904119894119889 119904119887119891
119904⟩)⟩ rarr ⟨119903119894119889 119904119894119889⟩
(6) for any ⟨119903119894119889 119904119894119889⟩ calculate the intersection of 119904119887119891119903and 119904119887119891
119904and estimate
the number of common signatures(7) invoke the count filtering for each pair ⟨119903119894119889 119904119894119889⟩ and output it if it passes the condition
Algorithm 2 Replacement of filter of Algorithm 1
time complexity of Map task and communication cost ofShuffle are also 119874(|119892||119877|(120572|119878| + 1)) for the same reason aboveThen in Reduce task we first replace 119903119894119889 with graph 119903 andthen calculate GED for graph pairs The GED calculation by
theAlowast-based algorithm requires 119874(|119881||119881|
) In the worst case
the time complexity of Reduce is 119874(120572|119877||119878||119881||119881|
) Finallysimilar graph pairs are emitted into DFS which requires119874(120572120573|119877||119878|)
4 Incorporating Bloom Filters
In the filtering phase of Algorithm 1 twoMapReduce jobs arerequired with many intermediate key-value pairs generatedand transmitted These increase the IO and communicationcost which can be fairly time-consuming This sectionintroduces the Bloom filter technique to reduce such costNext we first recall the concept of spectral Bloom filters
41 Spectral Bloom Filter Bloom filters [12] are space efficientdata structures which allow fast membership queries over agiven set A Bloom filter uses 119896 hash functions ℎ
1 ℎ2 ℎ
119896
to hash elements into an array of size 119898 For an element 119890 inthe set the bit at positions ℎ
1(119890) ℎ2(119890) ℎ
119896(119890) in the array is
set to 1 Given a query item 119902 we check its membership in theset by examining the bits at positions ℎ
1(119902) ℎ2(119902) ℎ
119896(119902)
of the array The item 119902 is reported to be contained in theset if (and only if) all the aforementioned bits are 1 Thismethod brings a small probability of false-positive that isit may return a positive result for an item which actually isnot contained in the set but no false-negative while gainingsubstantial space savings [13]
Spectral Bloom filter (SBF) [14] generalized the basicBloom filter to be able to record the element frequency
which is thus adopted in this paper SBF is represented by⟨119860 119891⟩ where 119860 is a set and 119891 is a map from 119860 to naturalnumbers that is 119891 119860 rarr 119873 where 119873 is the universe ofnatural numbers SBF replaces the bit vector with a vectorof 119898 counters For insertion of item 119890 the counters 119862 =
119862ℎ1(119890)
119862ℎ2(119890)
119862ℎ119896(119890)
are increased by 1 for insertion anddecreased by 1 for deletion Let 119891(119902) denote the frequency of119902 A basic query for SBF on an item 119902 returns an estimationon 119891(119902) that is 119891(119902) = min
119894isin1119896119862ℎ119894(119902)
Note that similarto Bloom filters SBF also never underestimate 119891(119902)
42 Algorithm Incorporating SBF not only reduces the num-ber of key-value pairs but also contracts the two MapReducejobs into one in the filtering phase In particular the Maptask takes graph sets 119877 and 119878 as input Then we createa SBF for each graph by adding the 119902-gram signaturesCartesian product is conducted for the output key-valuepairs ⟨119903119894119889 119904119887119891
119903⟩ and ⟨119904119894119889 119904119887119891
119904⟩ In the Reduce task for
each graph pair their SBFs are intersected to estimate thenumber of common signatures By intersecting two SBFswithcounters 119862
(1) and 119862(2) respectively it returns another SBF
with counters119862lowast119862lowast119894
= min119862(1)
119894 119862(2)
119894 119894 isin 1 119898 Hence
the number of common signatures could be estimated bylfloor(1119896) sum
119898
119894=1119862lowast
119894rfloor Subsequently the graph pairs which have
less than 119871119861 common signatures given by count filteringcondition will be discarded and the remaining pairs formthe candidate set
We provide the pseudocode of the aforementioned pro-cess in Algorithm 2
43 Correctness and Complexity Analysis There is smallprobability that the false positive case happens with SBFSpecifically a query for an item 119902 in SBF on 119891(119902) 119891(119902)
6 The Scientific World Journal
Table 1 Complexity analysis of filtering
Phase Job 1 Job 2 +SBF
Map IO 119874 (1003816100381610038161003816119892
1003816100381610038161003816 sdot (|119877| + |119878|)) 119874 (|119877| |119878|) 119874 (1003816100381610038161003816119892
1003816100381610038161003816 sdot (|119877| + |119878|))
Time 119874 ((|119881|119902
sdot (|119877| + |119878|)) 119874 (|119877| |119878|) 119874 (119896|119881|119902
sdot (|119877| + |119878|))
Shuffle 119874 ((|119881|119902
sdot (|119877| + |119878|)) 119874 (|119877| |119878|) 119874 (119898 |119877| |119878|)
Reduce IO 119874(|119877||119878|) 119874 (120572 |119877| |119878|) 119874 (120572 |119877| |119878|)
Time 119874 ((|119881|119902
sdot (|119877| + |119878|)) 119874 (|119877| |119878|) 119874 (|119877| |119878|)
may be larger than 119891(119902) Therefore the number of commonsignatures estimated this way may be larger than the actualvalue Nonetheless false-negative will never happen whichensures the correctness of the algorithm in a certain case thepruning power of count filtering will be impaired Besidesfalse-positive will be less likely to happen if one carefullychooses the hash functions and configures the sizes ofcounters
Then we analyze the complexity Let 119898 be the size ofa SBF The Map task reads the entire sets 119877 and 119878 so itsIO cost is 119874(|119892|(|119877| + |119878|)) Then signatures are generatedand added to SBF and 119896 hash values are calculated for eachsignatureThus the time complexity forMap is119874(119896|119881|
119902
(|119877|+
|119878|)) Map emits the SBF for each input graph and thenCartesian product is conductedThe communication cost forShuffle is119874(119898
2
|119877||119878|) In the Reduce task for each graph pairthe number of common signatures is calculated and countfiltering condition is checked Thus the time complexity is119874(|119877||119878|) Regardless of false-positive cases the IO cost is thesame as before Denoting the improved algorithm by ldquo+SBFrdquowe summarize the complexity results in Table 1
5 Optimizing Verification Phase
In verification phase we need to calculate the GED ofcandidate pairs Nevertheless it is not capable of finishingthe calculation with large graphs and threshold 120591 Thereforewe devise a MapReduce-based method for GED calculationwhich is able of handling large-scale graphs Besides joinoperations are required preceding the GED calculation to getthe entire graphThus three relations 119877(119903119894119889 119903)⋈119862(119903119894119889 119904119894119889)⋈
119878(119904119894119889 119904) are joined to obtain the input of the GED algorithmwhere 119862(119903119894119889 119904119894119889) is the output of the filtering phase Inspiredby the idea of multiway join we can reduce the number ofrequired MapReduce jobs from two to one
51 MapReduce for GED Calculation TheGED calculation isbased onAlowast algorithmAlowast constructs a search-tree the nodeof which represents a mapping status A mapping status isstored in an array (denoted by 119909) the index of which standsfor different vertices in graph 119903 and the corresponding valuestands for the vertices of graph 119904Alowast explores the space of allpossible vertex mappings between two graphs in a best-firstsearch fashion with function (denoted by 119891(119909)) establishedto determine the order in which the search visits vertexmappings 119891(119909) is a sum of two functions (1) the distancefrom the initial state to the current state (denoted by 119892(119909))
(2) a heuristic estimate of the distance from the current stateto the goal (denoted by ℎ(119909)) 119892(119909) and ℎ(119909) are calculated bythe following equations
119892 (119909) = GED (119903119901 119904119901)
ℎ (119909) = Γ (119871119881
(119903119901) 119871119881
(119904119902)) + Γ (119871
119864(119903119902) 119871119864
(119904119902))
(2)
119903119901consists of the vertices that have been mapped and edges
connecting them while 119903119902consists of the vertices unmapped
yet as well as their resident edges The equation for ℎ(119909)
represents the label difference of two graphsThe search space forAlowast-based approach is very large
requiring 119874(|119881||119881|
) In order to boost up the searchingprocedure parallelization is a common way to think aboutThe naive way is to allocate different branches of search-tree to different workers so that the searching procedure canproceed in parallel However the load is not balanced this wayso that the final runtime is determined on the worker withthe heaviest load As a consequence we devise aMapReduce-based method to calculate GED denoted by MRGED whichreallocates the works after each MapReduce round
The format of the key-value pairs to be manipulated byMapReduce is ⟨119909 119891(119909)⟩ The procedure of searching for theresult is through iterations that each iterationwalks down onelayer of the search-tree
52 Multiway Join In relational database a multiway joincan process 119879
1(119860 119861) ⋈ 119879
2(119861 119862) ⋈ 119879
3(119862 119863) together in one
round where 119879119894(119883 119884) is a relational table with attributes
119883 and 119884 (Algorithm 3) Following the same idea we canconsolidate the twoMapReduce jobs required for verificationin Algorithm 1 Specifically let ℎ be a hash function withrange 1 2 119899 where 119899
2 is the number of Reducers Weassociate eachReduce taskwith a pair (119894 119895) where 119894 119895 isin [1 119899]Each tuple 119905
2(119887 119888) isin 119879
2(119861 119862) is sent to theReducer numbered
(ℎ(119887) ℎ(119888)) while each tuple in1198791(119886 119887) (resp119879
3(119888 119889)) is sent
to Reducers numbered (ℎ(119887) 119909) (resp (119910 ℎ(119888))) for any 119909
(resp 119910) Each Reduce task computes the join of the tuplesit receives It is shown that multiway join is more efficient inpractice than two simple joins [15]
We encapsulate the improved verification procedure inAlgorithm 4
53 Correctness and Complexity Analysis One may immedi-ately verify that Algorithm 4 correctly conducts the verifica-tion
The Scientific World Journal 7
Input 119903 119904 are graphs from candidate pair ⟨119903 119904⟩
Output Graph edit distance of 119903 and 119904
(1) construct the search tree with the root node ⟨0 0 0 0⟩(2) read nodes into Mapper if the node read in is fully mapped namely 119903
119902is empty go to line (8)
(3)Map ⟨119909 119891(119909)⟩ rarr ⟨119909 119891(119909)⟩
(4) calculate 119891(119909) for node 119909 and check whether 119891(119909) gt 120591 if it is continue(5) create new nodes by mapping a vertex V
1isin 119903119902(119909) to a vertex V
2isin 119904119902(119909) for any V
1and V
2
(6) output all the newly created nodes(7) back to line (2)(8) gather the nodes calculate the minimum 119891(119909) and return it
Algorithm 3 MRGED (119903 119904)
(1)Map ⟨119903119894119889 119903⟩ ⟨119903119894119889 119904119894119889⟩ ⟨119904119894119889 119904⟩
(2) if ⟨119903119894119889 119903⟩ then emit(⟨(ℎ(119903119894119889) 119909) ⟨119903119894119889 119903⟩⟩) for any 119909 if ⟨119903119894119889 119904119894119889⟩ thenemit(⟨(ℎ (119903119894119889) ℎ (119904119894119889)) ⟨119903119894119889 119904⟩⟩) for any 119910 else emit(⟨(119910 ℎ(119904119894119889)) ⟨119904119894119889 119904⟩⟩)
(3) Reduce ⟨(ℎ(119903119894119889) ℎ(119904119894119889)) 119897119894119904119905(V119886119897119906119890)⟩ rarr ⟨119903119894119889 119904119894119889⟩
(4) split 119897119894119904119905(V119886119897119906119890) into 119897119894119904119905 (⟨119903119894119889 119903⟩) 119897119894119904119905 (⟨119904119894119889 119904⟩) and 119897119894119904119905 (⟨119903119894119889 119904119894119889⟩)(5) denote 119897119894119904119905 (⟨119903119894119889 119904119894119889⟩) as joinkey(6) for ⟨119903119894119889 119904119894119889⟩ isin 119895119900119894119899119896119890119910119904 119908ℎ119890119903119890 ⟨119903119894119889 119903⟩ isin 119897119894119904119905 (⟨119903119894119889 119903⟩) 119886119899119889 ⟨119904119894119889 119904⟩ isin 119897119894119904119905 (⟨119904119894119889 119904⟩) do(7) if 119892119903119886119901ℎ119904 119903 119886119899119889 119904 119894119899 ⟨119903 119904⟩ 119886119903119890 V119890119903119910 119897119886119903119892119890 then invoke MRGED
calculation else invoke the Alowast-based in-memory algorithm
Algorithm 4 Replacement of verification of Algorithm 1
In the multiway join based verification phase the Maptask takes graph sets 119877 and 119878 and the candidate pairs repre-sented by their 119894119889s as input The input IO cost is 119874(|119892|(|119877| +
|119878|) + 120572|119877||119878|) The key-value pair ⟨119903119894119889 119903⟩ (resp ⟨119904119894119889 119904⟩) issent to 119899 Reduce tasks numbered (119903119894119889 119909) (resp (119910 119904119894119889)) forany119909 (resp119910) whereas the key-value pair ⟨119903119894119889 119904119894119889⟩ is sent tothe only Reduce task numbered (ℎ(119903119894119889) ℎ(119904119894119889)) As a resultthe communication cost for Shuffle is 119874(119899|119892|(|119877| + |119878|) +
120572|119877||119878|) where 1198992 is the number of Reducers In the Reduce
task all candidate graph pairs go through edit distancecomputation Pairs with larger size go through MRGEDwhile pairs with smaller size go throughAlowast For simplicitythe complexity of GED calculation is regarded the same asthe baseline algorithm Labelling the resulting algorithmwithldquo+MJrdquo we summarize the complexity results in Table 2
6 Experiments
61 Experiment Setup We conducted experiments on severalpublicly available real datasets but only present the resultson Pubchem (httppubchemncbinlmnihgov) due to theinterest of space The dataset is constructed by sampling1000000 graphs from Pubchem
Amazon cloud services were used as our experimentplatform Specifically we used Elastic Compute Cloud (EC2)in which the computing nodes are called instances In theexperiment 31 instances were used by defaultmdashone set asmaster node and others as worker nodes The standardconfiguration of all EC2 instances is m1small one CPUof single core with 17 GB memory running Hadoop 112(Table 3)
62 Evaluating Filters In order to evaluate the effectivenessof our filtering techniques we use the term ldquoBasicrdquo for thebaseline algorithm for processing graph similarity joins basedonMapReduce ldquo+SBFrdquo denotes filtering improved algorithmof Basic by incorporating SBF
The algorithm efficiency has been studied and shownin Figure 2(a) However the pruning power is somewhatimpaired so we conducted the experiment to record theincrease of candidate pairs through ldquo+SBFrdquo (cf Figure 2(b))It can be revealed that ldquo+SBFrdquo outweigh ldquoBasicrdquo in efficiencyby sacrificing little pruning power When 120591 = 5 less than 300more candidate pairs are generatedwhile about 5000 secondsare reserved
63 Evaluating Verification The verification was evaluatedwith candidate pairs generated by Basic Term ldquo+MJrdquodenotes applying multiway join in verification whileldquo+MRGEDrdquo denotes adapting alternative MapReduce-basedGED calculation We use term ldquoMGSJoinrdquo to indicate thebasic algorithm incorporating both techniques Figure 3(a)shows the runtime comparison between +MRGED
and MGSJoin The result illustrates the superiority ofapplying multiway join When 120591 = 5 algorithm withmultiway join is about 6000 seconds faster than withordinary joins Figure 3(b) presents the result of evaluatingMRGED where Basic and +MRGED are compared It canbe observed that the runtime of both algorithms growsexponentially When 120591 equals 1 the Basic finishes quickerthan +MRGED (162 s and 204 s resp) while 120591 equals 2 the+MRGED outweighs Basic (the runtime is 712 s and 580 sresp) This is because the calculation required for 120591 = 1
8 The Scientific World Journal
Table 2 Complexity analysis of verification
Phase Job 1 Job 2 +MJ
Map IO 119874 (120572 |119877| |119878| +1003816100381610038161003816119892
1003816100381610038161003816 |119878|) 119874 (1003816100381610038161003816119892
1003816100381610038161003816 |119877| sdot (120572 |119878| + 1)) 119874 (1003816100381610038161003816119892
1003816100381610038161003816 (|119877| + |119878|) + 120572 |119877| |119878|)
Time 119874 (120572 |119877| |119878| +1003816100381610038161003816119892
1003816100381610038161003816 |119878|) 119874 (1003816100381610038161003816119892
1003816100381610038161003816 |119877| sdot (120572 |119878| + 1)) 119874 (119899 (|119877| + |119878|) + 120572 |119877| |119878|)
Shuffle 119874 (120572 |119877| |119878| +1003816100381610038161003816119892
1003816100381610038161003816 |119878|) 119874 (1003816100381610038161003816119892
1003816100381610038161003816 |119877| sdot (120572 |119878| + 1)) 119874 (1198991003816100381610038161003816119892
1003816100381610038161003816 (|119877| + |119878|) + 120572 |119877| |119878|)
Reduce IO 119874 (1205721003816100381610038161003816119892
1003816100381610038161003816 sdot |119877| |119878|) 119874 (120572120573 |119877| |119878|) 119874 (120572120573 |119877| |119878|)
Time 119874 (120572 |119877| |119878|) 119874 (120572|119877||119878|119881||119881|
) 119874 (120572|119877||119878|119881||119881|
)
0 1 2 3 4 5
Elap
sed
time (
s)
GED threshold (120591)
Basic
+SBF
15k
12k
9k
6k
3k
(a)
0
50
100
150
200
250
300
1 2 3 4 5GED threshold (120591)
Num
ber o
f Δ (c
andi
date
pai
rs)
(b)
Figure 2 Evaluating filtering
Table 3 Dataset statistics
Dataset |119877| |119881| |119864| Disk size (GB)Enamine 1000000 5232 5037 07
is small where the MRGED is clumsy compared withAlowastwhereas when 120591 = 2 more calculation is required so that theadvantage of MRGED comes out When 120591 is within the rangeof 3ndash5 Basic is unable to finish because the large calculationdrives Basic out of memory
64 Comparing with State-of-the-Art Method We comparedour algorithm with the state-of-the-art method GSimJoinIn Figure 4(a) we chose 10000 graphs in order to comparewith GSimJoin and the result witnesses the obvious supe-riority of MGSJoin over GSimJoin Figure 4(b) is drawn inlog scale which varies the number of graphs and records theelapsed time Both algorithms grow linear in the figure whichreflect their exponential growth MGSJoin rises mush slowerthan GSimJoin The runtime of GSimJoin is about 10 timeslonger than MGSJoinwhen joining 100 graphs and 100 timeslonger than MGSJoinwhen joining 10000 graphsMoreoverwhen we have to join 100000 graphs GSimJoin is runningout of memory so that no result is recorded
65 Speedup We evaluated the speedup of our algorithm byvarying the number of instances from 10 to 50 The exper-imental results are shown in Figure 5(a) We can see that
with the increase of instances in the cluster the performanceof MGSJoin significantly improved The improvement issignificantly shown when threshold 120591 equals 4 With moreinstances running the count filtering is getting faster bycounting for the common signatures simultaneously andthe verification is getting quicker by joining relations andcalculating GED in parallel
66 Scale-Up We evaluated the scale-up of our algorithmby increasing both dataset sizes and number of nodes in thecluster The result is shown in Figure 5(b) It is worth notingthat as the dataset increased the results for different values of120591 get similar in the trend of increase All lines rise smoothlywhich reveals good scalability of MGSJoin
7 Related Work
Graph SimilarityQueries Similarity joins retrieve similar dataobject pairs which can be strings sets trees and graphs [16]As to GED-based graph similarity search [17] proposed 120581-AT a tree-based 119902-gram approach However it is associatedwith the drawback of usually loose lower bound for countfiltering Seeing the drawback [9] presented a path-based 119902-gram approach In comparison with 120581-AT GSimJoin is moreefficient by leveraging more advanced filtering techniquesThus we adopt the path-based 119902-gram approach in this paper
MapReduce-Based Graph Algorithms MapReduce is a dis-tributed programming framework [10] which has been
The Scientific World Journal 9
Elap
sed
time (
s)
1 2 3 4 5GED threshold (120591)
16k
12k
8k
4k
MGSJoin
+MRGED
(a)
Elap
sed
time (
s)
Basic
+MR
105
104
103
102
1 2 3 4 5GED threshold (120591)
(b)
Figure 3 Evaluating verification
0
Elap
sed
time (
s)
GSimJoin
MGSJoin
1 2 3 4 5GED threshold (120591)
5k
4k
3k
2k
1k
(a)
Elap
sed
time (
s)
Number of graphs
GSimJoin
MGSJoin
105
104
103
102
102
101
103
104
106
105
(b)
Figure 4 Comparing with state-of-the-art method
2k
4k
6k
8k
10k
504540353025201510
Elap
sed
time
Number of instances
120591 = 1
120591 = 2
120591 = 3
120591 = 4
(a)
010 20 30 40 50
Elap
sed
time (
s)
Number of instances and number of graphs (times2000)
5k
4k
3k
2k
1k
(b)
Figure 5 Speedup and Scale-up
10 The Scientific World Journal
applied in processing large graphs Many graph algorithmsusing MapReduce were discussed in [18] including tri-anglesrectangles enumeration and 119896-cliques computationIn [19] several techniques were proposed to reduce theinput size of MapReduce and the techniques are appliedfor minimum spanning trees approximate maximal match-ings approximate nodeedge covers and minimum cutsPersonalized PageRank computation inMapReduce was dis-cussed in [20] Matrix multiplication based graph min-ing algorithms in MapReduce were investigated in [21]More recently densest subgraph computation [22] subgraphinstances enumeration [23] and connected componentscomputation in logarithmic rounds [24] were researched inMapReduce
Graph Processing Systems in Cloud Many systems weredeveloped in order to deal with big graphs Such a one repre-sentative system is Pregel [25] which takes a vertex-centricapproach and implements a bulk synchronous parallel (BSP)computation model HipG [26] improves BSP by using asyn-chronous messages to avoid synchronization PowerGraph[27] is a distributed graph processing system that is optimizedto process power-law graphs Giraph++ was proposed in[28] to take graph partitioning into consideration whenprocessing graphs Workload balancing for graph processingin cloud was discussed in [29]
8 Conclusion
In this paper we have investigated the problem of scalablegraph similarity joins We firstly present a MapReduce-based graph similarity join algorithm MGSJoin following thefiltering-verification framework To reduce the communica-tion cost in the filtering phase it incorporates the Bloomfilter technique to reduce the number of intermediate key-value pairs In addition we devise a multiway join optimizedverification procedure for further speedup Extensive experi-ments are conducted on real datasets to confirm the efficiencyand scalability of the proposed solution Furthermore theverification phase is further optimized with MapReducewhich enables the test on larger and denser graphs
As a future direction we plan to explore the possibility ofoptimizing the verificationwithmultithreaded programmingparadigm Additionally it is also of interest to test theefficiency and scalability of proposed algorithms on evenlarger andor denser graphs
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgments
The research is supported by the doctoral program of highereducation of China (No 2011437110008) and the nationalnatural science foundation of China (No 61303062)
References
[1] X Yan and J Han ldquogSpan graph-based substructure patternminingrdquo inProceedings of the 2nd IEEE International Conferenceon Data Mining (ICDM rsquo02) pp 721ndash724 December 2002
[2] X Yan P S Yu and J Han ldquoGraph indexing a frequentstructure-based approachrdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data (SIGMODrsquo04) pp 335ndash346 June 2004
[3] H He and A K Singh ldquoClosure-tree an index structurefor graph queriesrdquo in Proceedings of the 22nd InternationalConference on Data Engineering (ICDE rsquo06) p 38 April 2006
[4] X Yan P S Yu and J Han ldquoSubstructure similarity searchin graph databasesrdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data pp 766ndash777June 2005
[5] Y Tian and J M Patel ldquoTALE a tool for approximate largegraph matchingrdquo in Proceedings of the 24th IEEE InternationalConference on Data Engineering (ICDE rsquo08) pp 963ndash972 April2008
[6] A Sanfeliu and K-S Fu ldquoA distance measure betweenattributed relational graphs for pattern recognitionrdquo IEEETransactions on Systems Man and Cybernetics vol 13 no 3 pp353ndash362 1983
[7] H Bunke and G Allermann ldquoInexact graph matching forstructural pattern recognitionrdquo Pattern Recognition Letters vol1 no 4 pp 245ndash253 1983
[8] M R Garey and D S Johnson Computers and Intractabilityvol 174 Freeman San Francisco Calif USA 1979
[9] X Zhao C Xiao X Lin and W Wang ldquoEfficient graphsimilarity joins with edit distance constraintsrdquo in Proceedingsof the 28th IEEE International Conference on Data Engineering(ICDE rsquo12) pp 834ndash845 April 2012
[10] J Dean and SGhemawat ldquoMapReduce simplified data process-ing on large clustersrdquo Communications of the ACM vol 51 no1 pp 107ndash113 2008
[11] D Deng G Li S Hao JWang J Feng andW S Li ldquoMaSSJoinaMapReduce-basedmethod for scalable string similarity joinsrdquoinProceedings of the 30th IEEE International Conference onDataEngineering (ICDE rsquo14) pp 340ndash351 Chicago Ill USA
[12] B H Bloom ldquoSpacetime trade-offs in hash coding withallowable errorsrdquoCommunications of the ACM vol 13 no 7 pp422ndash426 1970
[13] A Broder and M Mitzenmacher ldquoNetwork applications ofbloom filters a surveyrdquo Internet Mathematics vol 1 no 4 pp485ndash509 2004
[14] S Cohen and Y Matias ldquoSpectral bloom filtersrdquo in Proceedingsof the ACM SIGMOD International Conference on Managementof Data pp 241ndash252 June 2003
[15] F N Afrati and J D Ullman ldquoOptimizing joins in a map-reduce environmentrdquo in Proceedings of the 13th InternationalConference on Extending Database Technology Advances inDatabase Technology (EDBT rsquo10) pp 99ndash110 March 2010
[16] Y N Silva and J M Reed ldquoExploiting MapReduce-based sim-ilarity joinsrdquo in Proceedings of the ACM SIGMOD InternationalConference onManagement of Data (SIGMOD 12) pp 693ndash696May 2012
[17] G Wang B Wang X Yang and G Yu ldquoEfficiently indexinglarge sparse graphs for similarity searchrdquo IEEE Transactions onKnowledge and Data Engineering vol 24 no 3 pp 440ndash4512012
The Scientific World Journal 11
[18] J Cohen ldquoGraph twiddling in a MapReduce worldrdquo Computingin Science and Engineering vol 11 no 4 pp 29ndash41 2009
[19] S Lattanzi B Moseley S Suri and S Vassilvitskii ldquoFilteringa method for solving graph problems in MapReducerdquo inProceedings of the 23rd ACM Symposium on Parallelism inAlgorithms and Architectures (SPAA rsquo11) pp 85ndash94 June 2011
[20] B Bahmani K Chakrabarti and D Xin ldquoFast personalizedPageRank onMapReducerdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data pp 973ndash984June 2011
[21] U Kang H Tong J Sun C-Y Lin and C Faloutsos ldquoGBASEa Scalable and general graph management systemrdquo in Pro-ceedings of the 17th ACM SIGKDD International Conference onKnowledge Discovery andDataMining (KDD rsquo11) pp 1091ndash1099August 2011
[22] B Bahmani R Kumar and S Vassilvitskii ldquoDensest subgraphin streaming andMapReducerdquo Proceedings of the VLDB Endow-ment vol 5 no 5 pp 454ndash465 2012
[23] F N Afrati D Fotakis and J D Ullman ldquoEnumeratingsubgraph instances using map-reducerdquo in Proceedings of the29th International Conference on Data Engineering (ICDE rsquo13)pp 62ndash73 April 2013
[24] V Rastogi A Machanavajjhala L Chitnis and A Das SarmaldquoFinding connected components in map-reduce in logarithmicroundsrdquo in Proceedings of the 29th International Conference onData Engineering (ICDE rsquo13) pp 50ndash61 April 2013
[25] G Malewicz M H Austern A J C Bik et al ldquoPregel asystem for large-scale graph processingrdquo in Proceedings of theInternational Conference on Management of Data (SIGMODrsquo10) pp 135ndash145 June 2010
[26] E Krepska T Kielmann W Fokkink and H Bal ldquoHipg par-allelprocessing of large-scale graphsrdquo ACM SIGOPS OperatingSystems Review vol 45 no 2 pp 3ndash13 2011
[27] J E Gonzalez Y Low H Gu D Bickson and C GuestrinldquoPowergraph distributed graph-parallel computation on natu-ral graphsrdquo in Proceedings of the 10th USENIX Symposium onOperating Systems Design and Implementation (OSDI rsquo12) pp17ndash30 2012
[28] Y Tian A Balmin S A Corsten S Tatikonda and J McPher-son ldquoFrom ldquothink like a vertexrdquo to ldquothink like a graphrdquordquoProceedings of the VLDB Endowment vol 7 no 3 2013
[29] Z Shang and J X Yu ldquoCatch the wind graph workloadbalancing on cloudrdquo in Proceedings of the 29th InternationalConference on Data Engineering (ICDE rsquo13) pp 553ndash564 April2013
Submit your manuscripts athttpwwwhindawicom
Computer Games Technology
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Distributed Sensor Networks
International Journal of
Advances in
FuzzySystems
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014
International Journal of
ReconfigurableComputing
Hindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Applied Computational Intelligence and Soft Computing
thinspAdvancesthinspinthinsp
Artificial Intelligence
HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014
Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Electrical and Computer Engineering
Journal of
Journal of
Computer Networks and Communications
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation
httpwwwhindawicom Volume 2014
Advances in
Multimedia
International Journal of
Biomedical Imaging
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
ArtificialNeural Systems
Advances in
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
RoboticsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Computational Intelligence and Neuroscience
Industrial EngineeringJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Human-ComputerInteraction
Advances in
Computer EngineeringAdvances in
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World Journal 5
Input graph object sets 119877 and 119878 GED threshold 120591
Output similar graph pairs ⟨119903 119904⟩
(1) Filter(2) Job 1 count the same type of common signatures for graph pairs(3) Job 2 count the total common signatures and check the count filtering fir graph pairs(4) Verification(5) Job 1 replace sid with graph 119904(6) Job 2 replace rid with graph 119903 and calculate GED for candidate pairs
Algorithm 1MGSJoin
(1)Map ⟨119903119894119889 119903⟩ ⟨119904119894119889 119904⟩ rarr ⟨119903119894119889 119904119887119891119903⟩ ⟨119904119894119889 119904119887119891
119904⟩
(2) create SBF for each graph and insert the generated 119902-gram signatures into it(3) output ⟨119903119894119889 119904119887119891
119903⟩ or ⟨119904119894119889 119904119887119891
119904⟩
(4) Shuffle conduct Cartesian product between pairs ⟨119903119894119889 119904119887119891119903⟩ and pairs ⟨119904119894119889 119904119887119891
119904⟩
(5) Reduce ⟨⟨119903119894119889 119904119887119891119903⟩ 119897119894119904119905 (⟨119904119894119889 119904119887119891
119904⟩)⟩ rarr ⟨119903119894119889 119904119894119889⟩
(6) for any ⟨119903119894119889 119904119894119889⟩ calculate the intersection of 119904119887119891119903and 119904119887119891
119904and estimate
the number of common signatures(7) invoke the count filtering for each pair ⟨119903119894119889 119904119894119889⟩ and output it if it passes the condition
Algorithm 2 Replacement of filter of Algorithm 1
time complexity of Map task and communication cost ofShuffle are also 119874(|119892||119877|(120572|119878| + 1)) for the same reason aboveThen in Reduce task we first replace 119903119894119889 with graph 119903 andthen calculate GED for graph pairs The GED calculation by
theAlowast-based algorithm requires 119874(|119881||119881|
) In the worst case
the time complexity of Reduce is 119874(120572|119877||119878||119881||119881|
) Finallysimilar graph pairs are emitted into DFS which requires119874(120572120573|119877||119878|)
4 Incorporating Bloom Filters
In the filtering phase of Algorithm 1 twoMapReduce jobs arerequired with many intermediate key-value pairs generatedand transmitted These increase the IO and communicationcost which can be fairly time-consuming This sectionintroduces the Bloom filter technique to reduce such costNext we first recall the concept of spectral Bloom filters
41 Spectral Bloom Filter Bloom filters [12] are space efficientdata structures which allow fast membership queries over agiven set A Bloom filter uses 119896 hash functions ℎ
1 ℎ2 ℎ
119896
to hash elements into an array of size 119898 For an element 119890 inthe set the bit at positions ℎ
1(119890) ℎ2(119890) ℎ
119896(119890) in the array is
set to 1 Given a query item 119902 we check its membership in theset by examining the bits at positions ℎ
1(119902) ℎ2(119902) ℎ
119896(119902)
of the array The item 119902 is reported to be contained in theset if (and only if) all the aforementioned bits are 1 Thismethod brings a small probability of false-positive that isit may return a positive result for an item which actually isnot contained in the set but no false-negative while gainingsubstantial space savings [13]
Spectral Bloom filter (SBF) [14] generalized the basicBloom filter to be able to record the element frequency
which is thus adopted in this paper SBF is represented by⟨119860 119891⟩ where 119860 is a set and 119891 is a map from 119860 to naturalnumbers that is 119891 119860 rarr 119873 where 119873 is the universe ofnatural numbers SBF replaces the bit vector with a vectorof 119898 counters For insertion of item 119890 the counters 119862 =
119862ℎ1(119890)
119862ℎ2(119890)
119862ℎ119896(119890)
are increased by 1 for insertion anddecreased by 1 for deletion Let 119891(119902) denote the frequency of119902 A basic query for SBF on an item 119902 returns an estimationon 119891(119902) that is 119891(119902) = min
119894isin1119896119862ℎ119894(119902)
Note that similarto Bloom filters SBF also never underestimate 119891(119902)
42 Algorithm Incorporating SBF not only reduces the num-ber of key-value pairs but also contracts the two MapReducejobs into one in the filtering phase In particular the Maptask takes graph sets 119877 and 119878 as input Then we createa SBF for each graph by adding the 119902-gram signaturesCartesian product is conducted for the output key-valuepairs ⟨119903119894119889 119904119887119891
119903⟩ and ⟨119904119894119889 119904119887119891
119904⟩ In the Reduce task for
each graph pair their SBFs are intersected to estimate thenumber of common signatures By intersecting two SBFswithcounters 119862
(1) and 119862(2) respectively it returns another SBF
with counters119862lowast119862lowast119894
= min119862(1)
119894 119862(2)
119894 119894 isin 1 119898 Hence
the number of common signatures could be estimated bylfloor(1119896) sum
119898
119894=1119862lowast
119894rfloor Subsequently the graph pairs which have
less than 119871119861 common signatures given by count filteringcondition will be discarded and the remaining pairs formthe candidate set
We provide the pseudocode of the aforementioned pro-cess in Algorithm 2
43 Correctness and Complexity Analysis There is smallprobability that the false positive case happens with SBFSpecifically a query for an item 119902 in SBF on 119891(119902) 119891(119902)
6 The Scientific World Journal
Table 1 Complexity analysis of filtering
Phase Job 1 Job 2 +SBF
Map IO 119874 (1003816100381610038161003816119892
1003816100381610038161003816 sdot (|119877| + |119878|)) 119874 (|119877| |119878|) 119874 (1003816100381610038161003816119892
1003816100381610038161003816 sdot (|119877| + |119878|))
Time 119874 ((|119881|119902
sdot (|119877| + |119878|)) 119874 (|119877| |119878|) 119874 (119896|119881|119902
sdot (|119877| + |119878|))
Shuffle 119874 ((|119881|119902
sdot (|119877| + |119878|)) 119874 (|119877| |119878|) 119874 (119898 |119877| |119878|)
Reduce IO 119874(|119877||119878|) 119874 (120572 |119877| |119878|) 119874 (120572 |119877| |119878|)
Time 119874 ((|119881|119902
sdot (|119877| + |119878|)) 119874 (|119877| |119878|) 119874 (|119877| |119878|)
may be larger than 119891(119902) Therefore the number of commonsignatures estimated this way may be larger than the actualvalue Nonetheless false-negative will never happen whichensures the correctness of the algorithm in a certain case thepruning power of count filtering will be impaired Besidesfalse-positive will be less likely to happen if one carefullychooses the hash functions and configures the sizes ofcounters
Then we analyze the complexity Let 119898 be the size ofa SBF The Map task reads the entire sets 119877 and 119878 so itsIO cost is 119874(|119892|(|119877| + |119878|)) Then signatures are generatedand added to SBF and 119896 hash values are calculated for eachsignatureThus the time complexity forMap is119874(119896|119881|
119902
(|119877|+
|119878|)) Map emits the SBF for each input graph and thenCartesian product is conductedThe communication cost forShuffle is119874(119898
2
|119877||119878|) In the Reduce task for each graph pairthe number of common signatures is calculated and countfiltering condition is checked Thus the time complexity is119874(|119877||119878|) Regardless of false-positive cases the IO cost is thesame as before Denoting the improved algorithm by ldquo+SBFrdquowe summarize the complexity results in Table 1
5 Optimizing Verification Phase
In verification phase we need to calculate the GED ofcandidate pairs Nevertheless it is not capable of finishingthe calculation with large graphs and threshold 120591 Thereforewe devise a MapReduce-based method for GED calculationwhich is able of handling large-scale graphs Besides joinoperations are required preceding the GED calculation to getthe entire graphThus three relations 119877(119903119894119889 119903)⋈119862(119903119894119889 119904119894119889)⋈
119878(119904119894119889 119904) are joined to obtain the input of the GED algorithmwhere 119862(119903119894119889 119904119894119889) is the output of the filtering phase Inspiredby the idea of multiway join we can reduce the number ofrequired MapReduce jobs from two to one
51 MapReduce for GED Calculation TheGED calculation isbased onAlowast algorithmAlowast constructs a search-tree the nodeof which represents a mapping status A mapping status isstored in an array (denoted by 119909) the index of which standsfor different vertices in graph 119903 and the corresponding valuestands for the vertices of graph 119904Alowast explores the space of allpossible vertex mappings between two graphs in a best-firstsearch fashion with function (denoted by 119891(119909)) establishedto determine the order in which the search visits vertexmappings 119891(119909) is a sum of two functions (1) the distancefrom the initial state to the current state (denoted by 119892(119909))
(2) a heuristic estimate of the distance from the current stateto the goal (denoted by ℎ(119909)) 119892(119909) and ℎ(119909) are calculated bythe following equations
119892 (119909) = GED (119903119901 119904119901)
ℎ (119909) = Γ (119871119881
(119903119901) 119871119881
(119904119902)) + Γ (119871
119864(119903119902) 119871119864
(119904119902))
(2)
119903119901consists of the vertices that have been mapped and edges
connecting them while 119903119902consists of the vertices unmapped
yet as well as their resident edges The equation for ℎ(119909)
represents the label difference of two graphsThe search space forAlowast-based approach is very large
requiring 119874(|119881||119881|
) In order to boost up the searchingprocedure parallelization is a common way to think aboutThe naive way is to allocate different branches of search-tree to different workers so that the searching procedure canproceed in parallel However the load is not balanced this wayso that the final runtime is determined on the worker withthe heaviest load As a consequence we devise aMapReduce-based method to calculate GED denoted by MRGED whichreallocates the works after each MapReduce round
The format of the key-value pairs to be manipulated byMapReduce is ⟨119909 119891(119909)⟩ The procedure of searching for theresult is through iterations that each iterationwalks down onelayer of the search-tree
52 Multiway Join In relational database a multiway joincan process 119879
1(119860 119861) ⋈ 119879
2(119861 119862) ⋈ 119879
3(119862 119863) together in one
round where 119879119894(119883 119884) is a relational table with attributes
119883 and 119884 (Algorithm 3) Following the same idea we canconsolidate the twoMapReduce jobs required for verificationin Algorithm 1 Specifically let ℎ be a hash function withrange 1 2 119899 where 119899
2 is the number of Reducers Weassociate eachReduce taskwith a pair (119894 119895) where 119894 119895 isin [1 119899]Each tuple 119905
2(119887 119888) isin 119879
2(119861 119862) is sent to theReducer numbered
(ℎ(119887) ℎ(119888)) while each tuple in1198791(119886 119887) (resp119879
3(119888 119889)) is sent
to Reducers numbered (ℎ(119887) 119909) (resp (119910 ℎ(119888))) for any 119909
(resp 119910) Each Reduce task computes the join of the tuplesit receives It is shown that multiway join is more efficient inpractice than two simple joins [15]
We encapsulate the improved verification procedure inAlgorithm 4
53 Correctness and Complexity Analysis One may immedi-ately verify that Algorithm 4 correctly conducts the verifica-tion
The Scientific World Journal 7
Input 119903 119904 are graphs from candidate pair ⟨119903 119904⟩
Output Graph edit distance of 119903 and 119904
(1) construct the search tree with the root node ⟨0 0 0 0⟩(2) read nodes into Mapper if the node read in is fully mapped namely 119903
119902is empty go to line (8)
(3)Map ⟨119909 119891(119909)⟩ rarr ⟨119909 119891(119909)⟩
(4) calculate 119891(119909) for node 119909 and check whether 119891(119909) gt 120591 if it is continue(5) create new nodes by mapping a vertex V
1isin 119903119902(119909) to a vertex V
2isin 119904119902(119909) for any V
1and V
2
(6) output all the newly created nodes(7) back to line (2)(8) gather the nodes calculate the minimum 119891(119909) and return it
Algorithm 3 MRGED (119903 119904)
(1)Map ⟨119903119894119889 119903⟩ ⟨119903119894119889 119904119894119889⟩ ⟨119904119894119889 119904⟩
(2) if ⟨119903119894119889 119903⟩ then emit(⟨(ℎ(119903119894119889) 119909) ⟨119903119894119889 119903⟩⟩) for any 119909 if ⟨119903119894119889 119904119894119889⟩ thenemit(⟨(ℎ (119903119894119889) ℎ (119904119894119889)) ⟨119903119894119889 119904⟩⟩) for any 119910 else emit(⟨(119910 ℎ(119904119894119889)) ⟨119904119894119889 119904⟩⟩)
(3) Reduce ⟨(ℎ(119903119894119889) ℎ(119904119894119889)) 119897119894119904119905(V119886119897119906119890)⟩ rarr ⟨119903119894119889 119904119894119889⟩
(4) split 119897119894119904119905(V119886119897119906119890) into 119897119894119904119905 (⟨119903119894119889 119903⟩) 119897119894119904119905 (⟨119904119894119889 119904⟩) and 119897119894119904119905 (⟨119903119894119889 119904119894119889⟩)(5) denote 119897119894119904119905 (⟨119903119894119889 119904119894119889⟩) as joinkey(6) for ⟨119903119894119889 119904119894119889⟩ isin 119895119900119894119899119896119890119910119904 119908ℎ119890119903119890 ⟨119903119894119889 119903⟩ isin 119897119894119904119905 (⟨119903119894119889 119903⟩) 119886119899119889 ⟨119904119894119889 119904⟩ isin 119897119894119904119905 (⟨119904119894119889 119904⟩) do(7) if 119892119903119886119901ℎ119904 119903 119886119899119889 119904 119894119899 ⟨119903 119904⟩ 119886119903119890 V119890119903119910 119897119886119903119892119890 then invoke MRGED
calculation else invoke the Alowast-based in-memory algorithm
Algorithm 4 Replacement of verification of Algorithm 1
In the multiway join based verification phase the Maptask takes graph sets 119877 and 119878 and the candidate pairs repre-sented by their 119894119889s as input The input IO cost is 119874(|119892|(|119877| +
|119878|) + 120572|119877||119878|) The key-value pair ⟨119903119894119889 119903⟩ (resp ⟨119904119894119889 119904⟩) issent to 119899 Reduce tasks numbered (119903119894119889 119909) (resp (119910 119904119894119889)) forany119909 (resp119910) whereas the key-value pair ⟨119903119894119889 119904119894119889⟩ is sent tothe only Reduce task numbered (ℎ(119903119894119889) ℎ(119904119894119889)) As a resultthe communication cost for Shuffle is 119874(119899|119892|(|119877| + |119878|) +
120572|119877||119878|) where 1198992 is the number of Reducers In the Reduce
task all candidate graph pairs go through edit distancecomputation Pairs with larger size go through MRGEDwhile pairs with smaller size go throughAlowast For simplicitythe complexity of GED calculation is regarded the same asthe baseline algorithm Labelling the resulting algorithmwithldquo+MJrdquo we summarize the complexity results in Table 2
6 Experiments
61 Experiment Setup We conducted experiments on severalpublicly available real datasets but only present the resultson Pubchem (httppubchemncbinlmnihgov) due to theinterest of space The dataset is constructed by sampling1000000 graphs from Pubchem
Amazon cloud services were used as our experimentplatform Specifically we used Elastic Compute Cloud (EC2)in which the computing nodes are called instances In theexperiment 31 instances were used by defaultmdashone set asmaster node and others as worker nodes The standardconfiguration of all EC2 instances is m1small one CPUof single core with 17 GB memory running Hadoop 112(Table 3)
62 Evaluating Filters In order to evaluate the effectivenessof our filtering techniques we use the term ldquoBasicrdquo for thebaseline algorithm for processing graph similarity joins basedonMapReduce ldquo+SBFrdquo denotes filtering improved algorithmof Basic by incorporating SBF
The algorithm efficiency has been studied and shownin Figure 2(a) However the pruning power is somewhatimpaired so we conducted the experiment to record theincrease of candidate pairs through ldquo+SBFrdquo (cf Figure 2(b))It can be revealed that ldquo+SBFrdquo outweigh ldquoBasicrdquo in efficiencyby sacrificing little pruning power When 120591 = 5 less than 300more candidate pairs are generatedwhile about 5000 secondsare reserved
63 Evaluating Verification The verification was evaluatedwith candidate pairs generated by Basic Term ldquo+MJrdquodenotes applying multiway join in verification whileldquo+MRGEDrdquo denotes adapting alternative MapReduce-basedGED calculation We use term ldquoMGSJoinrdquo to indicate thebasic algorithm incorporating both techniques Figure 3(a)shows the runtime comparison between +MRGED
and MGSJoin The result illustrates the superiority ofapplying multiway join When 120591 = 5 algorithm withmultiway join is about 6000 seconds faster than withordinary joins Figure 3(b) presents the result of evaluatingMRGED where Basic and +MRGED are compared It canbe observed that the runtime of both algorithms growsexponentially When 120591 equals 1 the Basic finishes quickerthan +MRGED (162 s and 204 s resp) while 120591 equals 2 the+MRGED outweighs Basic (the runtime is 712 s and 580 sresp) This is because the calculation required for 120591 = 1
8 The Scientific World Journal
Table 2 Complexity analysis of verification
Phase Job 1 Job 2 +MJ
Map IO 119874 (120572 |119877| |119878| +1003816100381610038161003816119892
1003816100381610038161003816 |119878|) 119874 (1003816100381610038161003816119892
1003816100381610038161003816 |119877| sdot (120572 |119878| + 1)) 119874 (1003816100381610038161003816119892
1003816100381610038161003816 (|119877| + |119878|) + 120572 |119877| |119878|)
Time 119874 (120572 |119877| |119878| +1003816100381610038161003816119892
1003816100381610038161003816 |119878|) 119874 (1003816100381610038161003816119892
1003816100381610038161003816 |119877| sdot (120572 |119878| + 1)) 119874 (119899 (|119877| + |119878|) + 120572 |119877| |119878|)
Shuffle 119874 (120572 |119877| |119878| +1003816100381610038161003816119892
1003816100381610038161003816 |119878|) 119874 (1003816100381610038161003816119892
1003816100381610038161003816 |119877| sdot (120572 |119878| + 1)) 119874 (1198991003816100381610038161003816119892
1003816100381610038161003816 (|119877| + |119878|) + 120572 |119877| |119878|)
Reduce IO 119874 (1205721003816100381610038161003816119892
1003816100381610038161003816 sdot |119877| |119878|) 119874 (120572120573 |119877| |119878|) 119874 (120572120573 |119877| |119878|)
Time 119874 (120572 |119877| |119878|) 119874 (120572|119877||119878|119881||119881|
) 119874 (120572|119877||119878|119881||119881|
)
0 1 2 3 4 5
Elap
sed
time (
s)
GED threshold (120591)
Basic
+SBF
15k
12k
9k
6k
3k
(a)
0
50
100
150
200
250
300
1 2 3 4 5GED threshold (120591)
Num
ber o
f Δ (c
andi
date
pai
rs)
(b)
Figure 2 Evaluating filtering
Table 3 Dataset statistics
Dataset |119877| |119881| |119864| Disk size (GB)Enamine 1000000 5232 5037 07
is small where the MRGED is clumsy compared withAlowastwhereas when 120591 = 2 more calculation is required so that theadvantage of MRGED comes out When 120591 is within the rangeof 3ndash5 Basic is unable to finish because the large calculationdrives Basic out of memory
64 Comparing with State-of-the-Art Method We comparedour algorithm with the state-of-the-art method GSimJoinIn Figure 4(a) we chose 10000 graphs in order to comparewith GSimJoin and the result witnesses the obvious supe-riority of MGSJoin over GSimJoin Figure 4(b) is drawn inlog scale which varies the number of graphs and records theelapsed time Both algorithms grow linear in the figure whichreflect their exponential growth MGSJoin rises mush slowerthan GSimJoin The runtime of GSimJoin is about 10 timeslonger than MGSJoinwhen joining 100 graphs and 100 timeslonger than MGSJoinwhen joining 10000 graphsMoreoverwhen we have to join 100000 graphs GSimJoin is runningout of memory so that no result is recorded
65 Speedup We evaluated the speedup of our algorithm byvarying the number of instances from 10 to 50 The exper-imental results are shown in Figure 5(a) We can see that
with the increase of instances in the cluster the performanceof MGSJoin significantly improved The improvement issignificantly shown when threshold 120591 equals 4 With moreinstances running the count filtering is getting faster bycounting for the common signatures simultaneously andthe verification is getting quicker by joining relations andcalculating GED in parallel
66 Scale-Up We evaluated the scale-up of our algorithmby increasing both dataset sizes and number of nodes in thecluster The result is shown in Figure 5(b) It is worth notingthat as the dataset increased the results for different values of120591 get similar in the trend of increase All lines rise smoothlywhich reveals good scalability of MGSJoin
7 Related Work
Graph SimilarityQueries Similarity joins retrieve similar dataobject pairs which can be strings sets trees and graphs [16]As to GED-based graph similarity search [17] proposed 120581-AT a tree-based 119902-gram approach However it is associatedwith the drawback of usually loose lower bound for countfiltering Seeing the drawback [9] presented a path-based 119902-gram approach In comparison with 120581-AT GSimJoin is moreefficient by leveraging more advanced filtering techniquesThus we adopt the path-based 119902-gram approach in this paper
MapReduce-Based Graph Algorithms MapReduce is a dis-tributed programming framework [10] which has been
The Scientific World Journal 9
Elap
sed
time (
s)
1 2 3 4 5GED threshold (120591)
16k
12k
8k
4k
MGSJoin
+MRGED
(a)
Elap
sed
time (
s)
Basic
+MR
105
104
103
102
1 2 3 4 5GED threshold (120591)
(b)
Figure 3 Evaluating verification
0
Elap
sed
time (
s)
GSimJoin
MGSJoin
1 2 3 4 5GED threshold (120591)
5k
4k
3k
2k
1k
(a)
Elap
sed
time (
s)
Number of graphs
GSimJoin
MGSJoin
105
104
103
102
102
101
103
104
106
105
(b)
Figure 4 Comparing with state-of-the-art method
2k
4k
6k
8k
10k
504540353025201510
Elap
sed
time
Number of instances
120591 = 1
120591 = 2
120591 = 3
120591 = 4
(a)
010 20 30 40 50
Elap
sed
time (
s)
Number of instances and number of graphs (times2000)
5k
4k
3k
2k
1k
(b)
Figure 5 Speedup and Scale-up
10 The Scientific World Journal
applied in processing large graphs Many graph algorithmsusing MapReduce were discussed in [18] including tri-anglesrectangles enumeration and 119896-cliques computationIn [19] several techniques were proposed to reduce theinput size of MapReduce and the techniques are appliedfor minimum spanning trees approximate maximal match-ings approximate nodeedge covers and minimum cutsPersonalized PageRank computation inMapReduce was dis-cussed in [20] Matrix multiplication based graph min-ing algorithms in MapReduce were investigated in [21]More recently densest subgraph computation [22] subgraphinstances enumeration [23] and connected componentscomputation in logarithmic rounds [24] were researched inMapReduce
Graph Processing Systems in Cloud Many systems weredeveloped in order to deal with big graphs Such a one repre-sentative system is Pregel [25] which takes a vertex-centricapproach and implements a bulk synchronous parallel (BSP)computation model HipG [26] improves BSP by using asyn-chronous messages to avoid synchronization PowerGraph[27] is a distributed graph processing system that is optimizedto process power-law graphs Giraph++ was proposed in[28] to take graph partitioning into consideration whenprocessing graphs Workload balancing for graph processingin cloud was discussed in [29]
8 Conclusion
In this paper we have investigated the problem of scalablegraph similarity joins We firstly present a MapReduce-based graph similarity join algorithm MGSJoin following thefiltering-verification framework To reduce the communica-tion cost in the filtering phase it incorporates the Bloomfilter technique to reduce the number of intermediate key-value pairs In addition we devise a multiway join optimizedverification procedure for further speedup Extensive experi-ments are conducted on real datasets to confirm the efficiencyand scalability of the proposed solution Furthermore theverification phase is further optimized with MapReducewhich enables the test on larger and denser graphs
As a future direction we plan to explore the possibility ofoptimizing the verificationwithmultithreaded programmingparadigm Additionally it is also of interest to test theefficiency and scalability of proposed algorithms on evenlarger andor denser graphs
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgments
The research is supported by the doctoral program of highereducation of China (No 2011437110008) and the nationalnatural science foundation of China (No 61303062)
References
[1] X Yan and J Han ldquogSpan graph-based substructure patternminingrdquo inProceedings of the 2nd IEEE International Conferenceon Data Mining (ICDM rsquo02) pp 721ndash724 December 2002
[2] X Yan P S Yu and J Han ldquoGraph indexing a frequentstructure-based approachrdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data (SIGMODrsquo04) pp 335ndash346 June 2004
[3] H He and A K Singh ldquoClosure-tree an index structurefor graph queriesrdquo in Proceedings of the 22nd InternationalConference on Data Engineering (ICDE rsquo06) p 38 April 2006
[4] X Yan P S Yu and J Han ldquoSubstructure similarity searchin graph databasesrdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data pp 766ndash777June 2005
[5] Y Tian and J M Patel ldquoTALE a tool for approximate largegraph matchingrdquo in Proceedings of the 24th IEEE InternationalConference on Data Engineering (ICDE rsquo08) pp 963ndash972 April2008
[6] A Sanfeliu and K-S Fu ldquoA distance measure betweenattributed relational graphs for pattern recognitionrdquo IEEETransactions on Systems Man and Cybernetics vol 13 no 3 pp353ndash362 1983
[7] H Bunke and G Allermann ldquoInexact graph matching forstructural pattern recognitionrdquo Pattern Recognition Letters vol1 no 4 pp 245ndash253 1983
[8] M R Garey and D S Johnson Computers and Intractabilityvol 174 Freeman San Francisco Calif USA 1979
[9] X Zhao C Xiao X Lin and W Wang ldquoEfficient graphsimilarity joins with edit distance constraintsrdquo in Proceedingsof the 28th IEEE International Conference on Data Engineering(ICDE rsquo12) pp 834ndash845 April 2012
[10] J Dean and SGhemawat ldquoMapReduce simplified data process-ing on large clustersrdquo Communications of the ACM vol 51 no1 pp 107ndash113 2008
[11] D Deng G Li S Hao JWang J Feng andW S Li ldquoMaSSJoinaMapReduce-basedmethod for scalable string similarity joinsrdquoinProceedings of the 30th IEEE International Conference onDataEngineering (ICDE rsquo14) pp 340ndash351 Chicago Ill USA
[12] B H Bloom ldquoSpacetime trade-offs in hash coding withallowable errorsrdquoCommunications of the ACM vol 13 no 7 pp422ndash426 1970
[13] A Broder and M Mitzenmacher ldquoNetwork applications ofbloom filters a surveyrdquo Internet Mathematics vol 1 no 4 pp485ndash509 2004
[14] S Cohen and Y Matias ldquoSpectral bloom filtersrdquo in Proceedingsof the ACM SIGMOD International Conference on Managementof Data pp 241ndash252 June 2003
[15] F N Afrati and J D Ullman ldquoOptimizing joins in a map-reduce environmentrdquo in Proceedings of the 13th InternationalConference on Extending Database Technology Advances inDatabase Technology (EDBT rsquo10) pp 99ndash110 March 2010
[16] Y N Silva and J M Reed ldquoExploiting MapReduce-based sim-ilarity joinsrdquo in Proceedings of the ACM SIGMOD InternationalConference onManagement of Data (SIGMOD 12) pp 693ndash696May 2012
[17] G Wang B Wang X Yang and G Yu ldquoEfficiently indexinglarge sparse graphs for similarity searchrdquo IEEE Transactions onKnowledge and Data Engineering vol 24 no 3 pp 440ndash4512012
The Scientific World Journal 11
[18] J Cohen ldquoGraph twiddling in a MapReduce worldrdquo Computingin Science and Engineering vol 11 no 4 pp 29ndash41 2009
[19] S Lattanzi B Moseley S Suri and S Vassilvitskii ldquoFilteringa method for solving graph problems in MapReducerdquo inProceedings of the 23rd ACM Symposium on Parallelism inAlgorithms and Architectures (SPAA rsquo11) pp 85ndash94 June 2011
[20] B Bahmani K Chakrabarti and D Xin ldquoFast personalizedPageRank onMapReducerdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data pp 973ndash984June 2011
[21] U Kang H Tong J Sun C-Y Lin and C Faloutsos ldquoGBASEa Scalable and general graph management systemrdquo in Pro-ceedings of the 17th ACM SIGKDD International Conference onKnowledge Discovery andDataMining (KDD rsquo11) pp 1091ndash1099August 2011
[22] B Bahmani R Kumar and S Vassilvitskii ldquoDensest subgraphin streaming andMapReducerdquo Proceedings of the VLDB Endow-ment vol 5 no 5 pp 454ndash465 2012
[23] F N Afrati D Fotakis and J D Ullman ldquoEnumeratingsubgraph instances using map-reducerdquo in Proceedings of the29th International Conference on Data Engineering (ICDE rsquo13)pp 62ndash73 April 2013
[24] V Rastogi A Machanavajjhala L Chitnis and A Das SarmaldquoFinding connected components in map-reduce in logarithmicroundsrdquo in Proceedings of the 29th International Conference onData Engineering (ICDE rsquo13) pp 50ndash61 April 2013
[25] G Malewicz M H Austern A J C Bik et al ldquoPregel asystem for large-scale graph processingrdquo in Proceedings of theInternational Conference on Management of Data (SIGMODrsquo10) pp 135ndash145 June 2010
[26] E Krepska T Kielmann W Fokkink and H Bal ldquoHipg par-allelprocessing of large-scale graphsrdquo ACM SIGOPS OperatingSystems Review vol 45 no 2 pp 3ndash13 2011
[27] J E Gonzalez Y Low H Gu D Bickson and C GuestrinldquoPowergraph distributed graph-parallel computation on natu-ral graphsrdquo in Proceedings of the 10th USENIX Symposium onOperating Systems Design and Implementation (OSDI rsquo12) pp17ndash30 2012
[28] Y Tian A Balmin S A Corsten S Tatikonda and J McPher-son ldquoFrom ldquothink like a vertexrdquo to ldquothink like a graphrdquordquoProceedings of the VLDB Endowment vol 7 no 3 2013
[29] Z Shang and J X Yu ldquoCatch the wind graph workloadbalancing on cloudrdquo in Proceedings of the 29th InternationalConference on Data Engineering (ICDE rsquo13) pp 553ndash564 April2013
Submit your manuscripts athttpwwwhindawicom
Computer Games Technology
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Distributed Sensor Networks
International Journal of
Advances in
FuzzySystems
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014
International Journal of
ReconfigurableComputing
Hindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Applied Computational Intelligence and Soft Computing
thinspAdvancesthinspinthinsp
Artificial Intelligence
HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014
Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Electrical and Computer Engineering
Journal of
Journal of
Computer Networks and Communications
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation
httpwwwhindawicom Volume 2014
Advances in
Multimedia
International Journal of
Biomedical Imaging
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
ArtificialNeural Systems
Advances in
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
RoboticsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Computational Intelligence and Neuroscience
Industrial EngineeringJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Human-ComputerInteraction
Advances in
Computer EngineeringAdvances in
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
6 The Scientific World Journal
Table 1 Complexity analysis of filtering
Phase Job 1 Job 2 +SBF
Map IO 119874 (1003816100381610038161003816119892
1003816100381610038161003816 sdot (|119877| + |119878|)) 119874 (|119877| |119878|) 119874 (1003816100381610038161003816119892
1003816100381610038161003816 sdot (|119877| + |119878|))
Time 119874 ((|119881|119902
sdot (|119877| + |119878|)) 119874 (|119877| |119878|) 119874 (119896|119881|119902
sdot (|119877| + |119878|))
Shuffle 119874 ((|119881|119902
sdot (|119877| + |119878|)) 119874 (|119877| |119878|) 119874 (119898 |119877| |119878|)
Reduce IO 119874(|119877||119878|) 119874 (120572 |119877| |119878|) 119874 (120572 |119877| |119878|)
Time 119874 ((|119881|119902
sdot (|119877| + |119878|)) 119874 (|119877| |119878|) 119874 (|119877| |119878|)
may be larger than 119891(119902) Therefore the number of commonsignatures estimated this way may be larger than the actualvalue Nonetheless false-negative will never happen whichensures the correctness of the algorithm in a certain case thepruning power of count filtering will be impaired Besidesfalse-positive will be less likely to happen if one carefullychooses the hash functions and configures the sizes ofcounters
Then we analyze the complexity Let 119898 be the size ofa SBF The Map task reads the entire sets 119877 and 119878 so itsIO cost is 119874(|119892|(|119877| + |119878|)) Then signatures are generatedand added to SBF and 119896 hash values are calculated for eachsignatureThus the time complexity forMap is119874(119896|119881|
119902
(|119877|+
|119878|)) Map emits the SBF for each input graph and thenCartesian product is conductedThe communication cost forShuffle is119874(119898
2
|119877||119878|) In the Reduce task for each graph pairthe number of common signatures is calculated and countfiltering condition is checked Thus the time complexity is119874(|119877||119878|) Regardless of false-positive cases the IO cost is thesame as before Denoting the improved algorithm by ldquo+SBFrdquowe summarize the complexity results in Table 1
5 Optimizing Verification Phase
In verification phase we need to calculate the GED ofcandidate pairs Nevertheless it is not capable of finishingthe calculation with large graphs and threshold 120591 Thereforewe devise a MapReduce-based method for GED calculationwhich is able of handling large-scale graphs Besides joinoperations are required preceding the GED calculation to getthe entire graphThus three relations 119877(119903119894119889 119903)⋈119862(119903119894119889 119904119894119889)⋈
119878(119904119894119889 119904) are joined to obtain the input of the GED algorithmwhere 119862(119903119894119889 119904119894119889) is the output of the filtering phase Inspiredby the idea of multiway join we can reduce the number ofrequired MapReduce jobs from two to one
51 MapReduce for GED Calculation TheGED calculation isbased onAlowast algorithmAlowast constructs a search-tree the nodeof which represents a mapping status A mapping status isstored in an array (denoted by 119909) the index of which standsfor different vertices in graph 119903 and the corresponding valuestands for the vertices of graph 119904Alowast explores the space of allpossible vertex mappings between two graphs in a best-firstsearch fashion with function (denoted by 119891(119909)) establishedto determine the order in which the search visits vertexmappings 119891(119909) is a sum of two functions (1) the distancefrom the initial state to the current state (denoted by 119892(119909))
(2) a heuristic estimate of the distance from the current stateto the goal (denoted by ℎ(119909)) 119892(119909) and ℎ(119909) are calculated bythe following equations
119892 (119909) = GED (119903119901 119904119901)
ℎ (119909) = Γ (119871119881
(119903119901) 119871119881
(119904119902)) + Γ (119871
119864(119903119902) 119871119864
(119904119902))
(2)
119903119901consists of the vertices that have been mapped and edges
connecting them while 119903119902consists of the vertices unmapped
yet as well as their resident edges The equation for ℎ(119909)
represents the label difference of two graphsThe search space forAlowast-based approach is very large
requiring 119874(|119881||119881|
) In order to boost up the searchingprocedure parallelization is a common way to think aboutThe naive way is to allocate different branches of search-tree to different workers so that the searching procedure canproceed in parallel However the load is not balanced this wayso that the final runtime is determined on the worker withthe heaviest load As a consequence we devise aMapReduce-based method to calculate GED denoted by MRGED whichreallocates the works after each MapReduce round
The format of the key-value pairs to be manipulated byMapReduce is ⟨119909 119891(119909)⟩ The procedure of searching for theresult is through iterations that each iterationwalks down onelayer of the search-tree
52 Multiway Join In relational database a multiway joincan process 119879
1(119860 119861) ⋈ 119879
2(119861 119862) ⋈ 119879
3(119862 119863) together in one
round where 119879119894(119883 119884) is a relational table with attributes
119883 and 119884 (Algorithm 3) Following the same idea we canconsolidate the twoMapReduce jobs required for verificationin Algorithm 1 Specifically let ℎ be a hash function withrange 1 2 119899 where 119899
2 is the number of Reducers Weassociate eachReduce taskwith a pair (119894 119895) where 119894 119895 isin [1 119899]Each tuple 119905
2(119887 119888) isin 119879
2(119861 119862) is sent to theReducer numbered
(ℎ(119887) ℎ(119888)) while each tuple in1198791(119886 119887) (resp119879
3(119888 119889)) is sent
to Reducers numbered (ℎ(119887) 119909) (resp (119910 ℎ(119888))) for any 119909
(resp 119910) Each Reduce task computes the join of the tuplesit receives It is shown that multiway join is more efficient inpractice than two simple joins [15]
We encapsulate the improved verification procedure inAlgorithm 4
53 Correctness and Complexity Analysis One may immedi-ately verify that Algorithm 4 correctly conducts the verifica-tion
The Scientific World Journal 7
Input 119903 119904 are graphs from candidate pair ⟨119903 119904⟩
Output Graph edit distance of 119903 and 119904
(1) construct the search tree with the root node ⟨0 0 0 0⟩(2) read nodes into Mapper if the node read in is fully mapped namely 119903
119902is empty go to line (8)
(3)Map ⟨119909 119891(119909)⟩ rarr ⟨119909 119891(119909)⟩
(4) calculate 119891(119909) for node 119909 and check whether 119891(119909) gt 120591 if it is continue(5) create new nodes by mapping a vertex V
1isin 119903119902(119909) to a vertex V
2isin 119904119902(119909) for any V
1and V
2
(6) output all the newly created nodes(7) back to line (2)(8) gather the nodes calculate the minimum 119891(119909) and return it
Algorithm 3 MRGED (119903 119904)
(1)Map ⟨119903119894119889 119903⟩ ⟨119903119894119889 119904119894119889⟩ ⟨119904119894119889 119904⟩
(2) if ⟨119903119894119889 119903⟩ then emit(⟨(ℎ(119903119894119889) 119909) ⟨119903119894119889 119903⟩⟩) for any 119909 if ⟨119903119894119889 119904119894119889⟩ thenemit(⟨(ℎ (119903119894119889) ℎ (119904119894119889)) ⟨119903119894119889 119904⟩⟩) for any 119910 else emit(⟨(119910 ℎ(119904119894119889)) ⟨119904119894119889 119904⟩⟩)
(3) Reduce ⟨(ℎ(119903119894119889) ℎ(119904119894119889)) 119897119894119904119905(V119886119897119906119890)⟩ rarr ⟨119903119894119889 119904119894119889⟩
(4) split 119897119894119904119905(V119886119897119906119890) into 119897119894119904119905 (⟨119903119894119889 119903⟩) 119897119894119904119905 (⟨119904119894119889 119904⟩) and 119897119894119904119905 (⟨119903119894119889 119904119894119889⟩)(5) denote 119897119894119904119905 (⟨119903119894119889 119904119894119889⟩) as joinkey(6) for ⟨119903119894119889 119904119894119889⟩ isin 119895119900119894119899119896119890119910119904 119908ℎ119890119903119890 ⟨119903119894119889 119903⟩ isin 119897119894119904119905 (⟨119903119894119889 119903⟩) 119886119899119889 ⟨119904119894119889 119904⟩ isin 119897119894119904119905 (⟨119904119894119889 119904⟩) do(7) if 119892119903119886119901ℎ119904 119903 119886119899119889 119904 119894119899 ⟨119903 119904⟩ 119886119903119890 V119890119903119910 119897119886119903119892119890 then invoke MRGED
calculation else invoke the Alowast-based in-memory algorithm
Algorithm 4 Replacement of verification of Algorithm 1
In the multiway join based verification phase the Maptask takes graph sets 119877 and 119878 and the candidate pairs repre-sented by their 119894119889s as input The input IO cost is 119874(|119892|(|119877| +
|119878|) + 120572|119877||119878|) The key-value pair ⟨119903119894119889 119903⟩ (resp ⟨119904119894119889 119904⟩) issent to 119899 Reduce tasks numbered (119903119894119889 119909) (resp (119910 119904119894119889)) forany119909 (resp119910) whereas the key-value pair ⟨119903119894119889 119904119894119889⟩ is sent tothe only Reduce task numbered (ℎ(119903119894119889) ℎ(119904119894119889)) As a resultthe communication cost for Shuffle is 119874(119899|119892|(|119877| + |119878|) +
120572|119877||119878|) where 1198992 is the number of Reducers In the Reduce
task all candidate graph pairs go through edit distancecomputation Pairs with larger size go through MRGEDwhile pairs with smaller size go throughAlowast For simplicitythe complexity of GED calculation is regarded the same asthe baseline algorithm Labelling the resulting algorithmwithldquo+MJrdquo we summarize the complexity results in Table 2
6 Experiments
61 Experiment Setup We conducted experiments on severalpublicly available real datasets but only present the resultson Pubchem (httppubchemncbinlmnihgov) due to theinterest of space The dataset is constructed by sampling1000000 graphs from Pubchem
Amazon cloud services were used as our experimentplatform Specifically we used Elastic Compute Cloud (EC2)in which the computing nodes are called instances In theexperiment 31 instances were used by defaultmdashone set asmaster node and others as worker nodes The standardconfiguration of all EC2 instances is m1small one CPUof single core with 17 GB memory running Hadoop 112(Table 3)
62 Evaluating Filters In order to evaluate the effectivenessof our filtering techniques we use the term ldquoBasicrdquo for thebaseline algorithm for processing graph similarity joins basedonMapReduce ldquo+SBFrdquo denotes filtering improved algorithmof Basic by incorporating SBF
The algorithm efficiency has been studied and shownin Figure 2(a) However the pruning power is somewhatimpaired so we conducted the experiment to record theincrease of candidate pairs through ldquo+SBFrdquo (cf Figure 2(b))It can be revealed that ldquo+SBFrdquo outweigh ldquoBasicrdquo in efficiencyby sacrificing little pruning power When 120591 = 5 less than 300more candidate pairs are generatedwhile about 5000 secondsare reserved
63 Evaluating Verification The verification was evaluatedwith candidate pairs generated by Basic Term ldquo+MJrdquodenotes applying multiway join in verification whileldquo+MRGEDrdquo denotes adapting alternative MapReduce-basedGED calculation We use term ldquoMGSJoinrdquo to indicate thebasic algorithm incorporating both techniques Figure 3(a)shows the runtime comparison between +MRGED
and MGSJoin The result illustrates the superiority ofapplying multiway join When 120591 = 5 algorithm withmultiway join is about 6000 seconds faster than withordinary joins Figure 3(b) presents the result of evaluatingMRGED where Basic and +MRGED are compared It canbe observed that the runtime of both algorithms growsexponentially When 120591 equals 1 the Basic finishes quickerthan +MRGED (162 s and 204 s resp) while 120591 equals 2 the+MRGED outweighs Basic (the runtime is 712 s and 580 sresp) This is because the calculation required for 120591 = 1
8 The Scientific World Journal
Table 2 Complexity analysis of verification
Phase Job 1 Job 2 +MJ
Map IO 119874 (120572 |119877| |119878| +1003816100381610038161003816119892
1003816100381610038161003816 |119878|) 119874 (1003816100381610038161003816119892
1003816100381610038161003816 |119877| sdot (120572 |119878| + 1)) 119874 (1003816100381610038161003816119892
1003816100381610038161003816 (|119877| + |119878|) + 120572 |119877| |119878|)
Time 119874 (120572 |119877| |119878| +1003816100381610038161003816119892
1003816100381610038161003816 |119878|) 119874 (1003816100381610038161003816119892
1003816100381610038161003816 |119877| sdot (120572 |119878| + 1)) 119874 (119899 (|119877| + |119878|) + 120572 |119877| |119878|)
Shuffle 119874 (120572 |119877| |119878| +1003816100381610038161003816119892
1003816100381610038161003816 |119878|) 119874 (1003816100381610038161003816119892
1003816100381610038161003816 |119877| sdot (120572 |119878| + 1)) 119874 (1198991003816100381610038161003816119892
1003816100381610038161003816 (|119877| + |119878|) + 120572 |119877| |119878|)
Reduce IO 119874 (1205721003816100381610038161003816119892
1003816100381610038161003816 sdot |119877| |119878|) 119874 (120572120573 |119877| |119878|) 119874 (120572120573 |119877| |119878|)
Time 119874 (120572 |119877| |119878|) 119874 (120572|119877||119878|119881||119881|
) 119874 (120572|119877||119878|119881||119881|
)
0 1 2 3 4 5
Elap
sed
time (
s)
GED threshold (120591)
Basic
+SBF
15k
12k
9k
6k
3k
(a)
0
50
100
150
200
250
300
1 2 3 4 5GED threshold (120591)
Num
ber o
f Δ (c
andi
date
pai
rs)
(b)
Figure 2 Evaluating filtering
Table 3 Dataset statistics
Dataset |119877| |119881| |119864| Disk size (GB)Enamine 1000000 5232 5037 07
is small where the MRGED is clumsy compared withAlowastwhereas when 120591 = 2 more calculation is required so that theadvantage of MRGED comes out When 120591 is within the rangeof 3ndash5 Basic is unable to finish because the large calculationdrives Basic out of memory
64 Comparing with State-of-the-Art Method We comparedour algorithm with the state-of-the-art method GSimJoinIn Figure 4(a) we chose 10000 graphs in order to comparewith GSimJoin and the result witnesses the obvious supe-riority of MGSJoin over GSimJoin Figure 4(b) is drawn inlog scale which varies the number of graphs and records theelapsed time Both algorithms grow linear in the figure whichreflect their exponential growth MGSJoin rises mush slowerthan GSimJoin The runtime of GSimJoin is about 10 timeslonger than MGSJoinwhen joining 100 graphs and 100 timeslonger than MGSJoinwhen joining 10000 graphsMoreoverwhen we have to join 100000 graphs GSimJoin is runningout of memory so that no result is recorded
65 Speedup We evaluated the speedup of our algorithm byvarying the number of instances from 10 to 50 The exper-imental results are shown in Figure 5(a) We can see that
with the increase of instances in the cluster the performanceof MGSJoin significantly improved The improvement issignificantly shown when threshold 120591 equals 4 With moreinstances running the count filtering is getting faster bycounting for the common signatures simultaneously andthe verification is getting quicker by joining relations andcalculating GED in parallel
66 Scale-Up We evaluated the scale-up of our algorithmby increasing both dataset sizes and number of nodes in thecluster The result is shown in Figure 5(b) It is worth notingthat as the dataset increased the results for different values of120591 get similar in the trend of increase All lines rise smoothlywhich reveals good scalability of MGSJoin
7 Related Work
Graph SimilarityQueries Similarity joins retrieve similar dataobject pairs which can be strings sets trees and graphs [16]As to GED-based graph similarity search [17] proposed 120581-AT a tree-based 119902-gram approach However it is associatedwith the drawback of usually loose lower bound for countfiltering Seeing the drawback [9] presented a path-based 119902-gram approach In comparison with 120581-AT GSimJoin is moreefficient by leveraging more advanced filtering techniquesThus we adopt the path-based 119902-gram approach in this paper
MapReduce-Based Graph Algorithms MapReduce is a dis-tributed programming framework [10] which has been
The Scientific World Journal 9
Elap
sed
time (
s)
1 2 3 4 5GED threshold (120591)
16k
12k
8k
4k
MGSJoin
+MRGED
(a)
Elap
sed
time (
s)
Basic
+MR
105
104
103
102
1 2 3 4 5GED threshold (120591)
(b)
Figure 3 Evaluating verification
0
Elap
sed
time (
s)
GSimJoin
MGSJoin
1 2 3 4 5GED threshold (120591)
5k
4k
3k
2k
1k
(a)
Elap
sed
time (
s)
Number of graphs
GSimJoin
MGSJoin
105
104
103
102
102
101
103
104
106
105
(b)
Figure 4 Comparing with state-of-the-art method
2k
4k
6k
8k
10k
504540353025201510
Elap
sed
time
Number of instances
120591 = 1
120591 = 2
120591 = 3
120591 = 4
(a)
010 20 30 40 50
Elap
sed
time (
s)
Number of instances and number of graphs (times2000)
5k
4k
3k
2k
1k
(b)
Figure 5 Speedup and Scale-up
10 The Scientific World Journal
applied in processing large graphs Many graph algorithmsusing MapReduce were discussed in [18] including tri-anglesrectangles enumeration and 119896-cliques computationIn [19] several techniques were proposed to reduce theinput size of MapReduce and the techniques are appliedfor minimum spanning trees approximate maximal match-ings approximate nodeedge covers and minimum cutsPersonalized PageRank computation inMapReduce was dis-cussed in [20] Matrix multiplication based graph min-ing algorithms in MapReduce were investigated in [21]More recently densest subgraph computation [22] subgraphinstances enumeration [23] and connected componentscomputation in logarithmic rounds [24] were researched inMapReduce
Graph Processing Systems in Cloud Many systems weredeveloped in order to deal with big graphs Such a one repre-sentative system is Pregel [25] which takes a vertex-centricapproach and implements a bulk synchronous parallel (BSP)computation model HipG [26] improves BSP by using asyn-chronous messages to avoid synchronization PowerGraph[27] is a distributed graph processing system that is optimizedto process power-law graphs Giraph++ was proposed in[28] to take graph partitioning into consideration whenprocessing graphs Workload balancing for graph processingin cloud was discussed in [29]
8 Conclusion
In this paper we have investigated the problem of scalablegraph similarity joins We firstly present a MapReduce-based graph similarity join algorithm MGSJoin following thefiltering-verification framework To reduce the communica-tion cost in the filtering phase it incorporates the Bloomfilter technique to reduce the number of intermediate key-value pairs In addition we devise a multiway join optimizedverification procedure for further speedup Extensive experi-ments are conducted on real datasets to confirm the efficiencyand scalability of the proposed solution Furthermore theverification phase is further optimized with MapReducewhich enables the test on larger and denser graphs
As a future direction we plan to explore the possibility ofoptimizing the verificationwithmultithreaded programmingparadigm Additionally it is also of interest to test theefficiency and scalability of proposed algorithms on evenlarger andor denser graphs
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgments
The research is supported by the doctoral program of highereducation of China (No 2011437110008) and the nationalnatural science foundation of China (No 61303062)
References
[1] X Yan and J Han ldquogSpan graph-based substructure patternminingrdquo inProceedings of the 2nd IEEE International Conferenceon Data Mining (ICDM rsquo02) pp 721ndash724 December 2002
[2] X Yan P S Yu and J Han ldquoGraph indexing a frequentstructure-based approachrdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data (SIGMODrsquo04) pp 335ndash346 June 2004
[3] H He and A K Singh ldquoClosure-tree an index structurefor graph queriesrdquo in Proceedings of the 22nd InternationalConference on Data Engineering (ICDE rsquo06) p 38 April 2006
[4] X Yan P S Yu and J Han ldquoSubstructure similarity searchin graph databasesrdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data pp 766ndash777June 2005
[5] Y Tian and J M Patel ldquoTALE a tool for approximate largegraph matchingrdquo in Proceedings of the 24th IEEE InternationalConference on Data Engineering (ICDE rsquo08) pp 963ndash972 April2008
[6] A Sanfeliu and K-S Fu ldquoA distance measure betweenattributed relational graphs for pattern recognitionrdquo IEEETransactions on Systems Man and Cybernetics vol 13 no 3 pp353ndash362 1983
[7] H Bunke and G Allermann ldquoInexact graph matching forstructural pattern recognitionrdquo Pattern Recognition Letters vol1 no 4 pp 245ndash253 1983
[8] M R Garey and D S Johnson Computers and Intractabilityvol 174 Freeman San Francisco Calif USA 1979
[9] X Zhao C Xiao X Lin and W Wang ldquoEfficient graphsimilarity joins with edit distance constraintsrdquo in Proceedingsof the 28th IEEE International Conference on Data Engineering(ICDE rsquo12) pp 834ndash845 April 2012
[10] J Dean and SGhemawat ldquoMapReduce simplified data process-ing on large clustersrdquo Communications of the ACM vol 51 no1 pp 107ndash113 2008
[11] D Deng G Li S Hao JWang J Feng andW S Li ldquoMaSSJoinaMapReduce-basedmethod for scalable string similarity joinsrdquoinProceedings of the 30th IEEE International Conference onDataEngineering (ICDE rsquo14) pp 340ndash351 Chicago Ill USA
[12] B H Bloom ldquoSpacetime trade-offs in hash coding withallowable errorsrdquoCommunications of the ACM vol 13 no 7 pp422ndash426 1970
[13] A Broder and M Mitzenmacher ldquoNetwork applications ofbloom filters a surveyrdquo Internet Mathematics vol 1 no 4 pp485ndash509 2004
[14] S Cohen and Y Matias ldquoSpectral bloom filtersrdquo in Proceedingsof the ACM SIGMOD International Conference on Managementof Data pp 241ndash252 June 2003
[15] F N Afrati and J D Ullman ldquoOptimizing joins in a map-reduce environmentrdquo in Proceedings of the 13th InternationalConference on Extending Database Technology Advances inDatabase Technology (EDBT rsquo10) pp 99ndash110 March 2010
[16] Y N Silva and J M Reed ldquoExploiting MapReduce-based sim-ilarity joinsrdquo in Proceedings of the ACM SIGMOD InternationalConference onManagement of Data (SIGMOD 12) pp 693ndash696May 2012
[17] G Wang B Wang X Yang and G Yu ldquoEfficiently indexinglarge sparse graphs for similarity searchrdquo IEEE Transactions onKnowledge and Data Engineering vol 24 no 3 pp 440ndash4512012
The Scientific World Journal 11
[18] J Cohen ldquoGraph twiddling in a MapReduce worldrdquo Computingin Science and Engineering vol 11 no 4 pp 29ndash41 2009
[19] S Lattanzi B Moseley S Suri and S Vassilvitskii ldquoFilteringa method for solving graph problems in MapReducerdquo inProceedings of the 23rd ACM Symposium on Parallelism inAlgorithms and Architectures (SPAA rsquo11) pp 85ndash94 June 2011
[20] B Bahmani K Chakrabarti and D Xin ldquoFast personalizedPageRank onMapReducerdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data pp 973ndash984June 2011
[21] U Kang H Tong J Sun C-Y Lin and C Faloutsos ldquoGBASEa Scalable and general graph management systemrdquo in Pro-ceedings of the 17th ACM SIGKDD International Conference onKnowledge Discovery andDataMining (KDD rsquo11) pp 1091ndash1099August 2011
[22] B Bahmani R Kumar and S Vassilvitskii ldquoDensest subgraphin streaming andMapReducerdquo Proceedings of the VLDB Endow-ment vol 5 no 5 pp 454ndash465 2012
[23] F N Afrati D Fotakis and J D Ullman ldquoEnumeratingsubgraph instances using map-reducerdquo in Proceedings of the29th International Conference on Data Engineering (ICDE rsquo13)pp 62ndash73 April 2013
[24] V Rastogi A Machanavajjhala L Chitnis and A Das SarmaldquoFinding connected components in map-reduce in logarithmicroundsrdquo in Proceedings of the 29th International Conference onData Engineering (ICDE rsquo13) pp 50ndash61 April 2013
[25] G Malewicz M H Austern A J C Bik et al ldquoPregel asystem for large-scale graph processingrdquo in Proceedings of theInternational Conference on Management of Data (SIGMODrsquo10) pp 135ndash145 June 2010
[26] E Krepska T Kielmann W Fokkink and H Bal ldquoHipg par-allelprocessing of large-scale graphsrdquo ACM SIGOPS OperatingSystems Review vol 45 no 2 pp 3ndash13 2011
[27] J E Gonzalez Y Low H Gu D Bickson and C GuestrinldquoPowergraph distributed graph-parallel computation on natu-ral graphsrdquo in Proceedings of the 10th USENIX Symposium onOperating Systems Design and Implementation (OSDI rsquo12) pp17ndash30 2012
[28] Y Tian A Balmin S A Corsten S Tatikonda and J McPher-son ldquoFrom ldquothink like a vertexrdquo to ldquothink like a graphrdquordquoProceedings of the VLDB Endowment vol 7 no 3 2013
[29] Z Shang and J X Yu ldquoCatch the wind graph workloadbalancing on cloudrdquo in Proceedings of the 29th InternationalConference on Data Engineering (ICDE rsquo13) pp 553ndash564 April2013
Submit your manuscripts athttpwwwhindawicom
Computer Games Technology
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Distributed Sensor Networks
International Journal of
Advances in
FuzzySystems
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014
International Journal of
ReconfigurableComputing
Hindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Applied Computational Intelligence and Soft Computing
thinspAdvancesthinspinthinsp
Artificial Intelligence
HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014
Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Electrical and Computer Engineering
Journal of
Journal of
Computer Networks and Communications
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation
httpwwwhindawicom Volume 2014
Advances in
Multimedia
International Journal of
Biomedical Imaging
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
ArtificialNeural Systems
Advances in
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
RoboticsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Computational Intelligence and Neuroscience
Industrial EngineeringJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Human-ComputerInteraction
Advances in
Computer EngineeringAdvances in
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World Journal 7
Input 119903 119904 are graphs from candidate pair ⟨119903 119904⟩
Output Graph edit distance of 119903 and 119904
(1) construct the search tree with the root node ⟨0 0 0 0⟩(2) read nodes into Mapper if the node read in is fully mapped namely 119903
119902is empty go to line (8)
(3)Map ⟨119909 119891(119909)⟩ rarr ⟨119909 119891(119909)⟩
(4) calculate 119891(119909) for node 119909 and check whether 119891(119909) gt 120591 if it is continue(5) create new nodes by mapping a vertex V
1isin 119903119902(119909) to a vertex V
2isin 119904119902(119909) for any V
1and V
2
(6) output all the newly created nodes(7) back to line (2)(8) gather the nodes calculate the minimum 119891(119909) and return it
Algorithm 3 MRGED (119903 119904)
(1)Map ⟨119903119894119889 119903⟩ ⟨119903119894119889 119904119894119889⟩ ⟨119904119894119889 119904⟩
(2) if ⟨119903119894119889 119903⟩ then emit(⟨(ℎ(119903119894119889) 119909) ⟨119903119894119889 119903⟩⟩) for any 119909 if ⟨119903119894119889 119904119894119889⟩ thenemit(⟨(ℎ (119903119894119889) ℎ (119904119894119889)) ⟨119903119894119889 119904⟩⟩) for any 119910 else emit(⟨(119910 ℎ(119904119894119889)) ⟨119904119894119889 119904⟩⟩)
(3) Reduce ⟨(ℎ(119903119894119889) ℎ(119904119894119889)) 119897119894119904119905(V119886119897119906119890)⟩ rarr ⟨119903119894119889 119904119894119889⟩
(4) split 119897119894119904119905(V119886119897119906119890) into 119897119894119904119905 (⟨119903119894119889 119903⟩) 119897119894119904119905 (⟨119904119894119889 119904⟩) and 119897119894119904119905 (⟨119903119894119889 119904119894119889⟩)(5) denote 119897119894119904119905 (⟨119903119894119889 119904119894119889⟩) as joinkey(6) for ⟨119903119894119889 119904119894119889⟩ isin 119895119900119894119899119896119890119910119904 119908ℎ119890119903119890 ⟨119903119894119889 119903⟩ isin 119897119894119904119905 (⟨119903119894119889 119903⟩) 119886119899119889 ⟨119904119894119889 119904⟩ isin 119897119894119904119905 (⟨119904119894119889 119904⟩) do(7) if 119892119903119886119901ℎ119904 119903 119886119899119889 119904 119894119899 ⟨119903 119904⟩ 119886119903119890 V119890119903119910 119897119886119903119892119890 then invoke MRGED
calculation else invoke the Alowast-based in-memory algorithm
Algorithm 4 Replacement of verification of Algorithm 1
In the multiway join based verification phase the Maptask takes graph sets 119877 and 119878 and the candidate pairs repre-sented by their 119894119889s as input The input IO cost is 119874(|119892|(|119877| +
|119878|) + 120572|119877||119878|) The key-value pair ⟨119903119894119889 119903⟩ (resp ⟨119904119894119889 119904⟩) issent to 119899 Reduce tasks numbered (119903119894119889 119909) (resp (119910 119904119894119889)) forany119909 (resp119910) whereas the key-value pair ⟨119903119894119889 119904119894119889⟩ is sent tothe only Reduce task numbered (ℎ(119903119894119889) ℎ(119904119894119889)) As a resultthe communication cost for Shuffle is 119874(119899|119892|(|119877| + |119878|) +
120572|119877||119878|) where 1198992 is the number of Reducers In the Reduce
task all candidate graph pairs go through edit distancecomputation Pairs with larger size go through MRGEDwhile pairs with smaller size go throughAlowast For simplicitythe complexity of GED calculation is regarded the same asthe baseline algorithm Labelling the resulting algorithmwithldquo+MJrdquo we summarize the complexity results in Table 2
6 Experiments
61 Experiment Setup We conducted experiments on severalpublicly available real datasets but only present the resultson Pubchem (httppubchemncbinlmnihgov) due to theinterest of space The dataset is constructed by sampling1000000 graphs from Pubchem
Amazon cloud services were used as our experimentplatform Specifically we used Elastic Compute Cloud (EC2)in which the computing nodes are called instances In theexperiment 31 instances were used by defaultmdashone set asmaster node and others as worker nodes The standardconfiguration of all EC2 instances is m1small one CPUof single core with 17 GB memory running Hadoop 112(Table 3)
62 Evaluating Filters In order to evaluate the effectivenessof our filtering techniques we use the term ldquoBasicrdquo for thebaseline algorithm for processing graph similarity joins basedonMapReduce ldquo+SBFrdquo denotes filtering improved algorithmof Basic by incorporating SBF
The algorithm efficiency has been studied and shownin Figure 2(a) However the pruning power is somewhatimpaired so we conducted the experiment to record theincrease of candidate pairs through ldquo+SBFrdquo (cf Figure 2(b))It can be revealed that ldquo+SBFrdquo outweigh ldquoBasicrdquo in efficiencyby sacrificing little pruning power When 120591 = 5 less than 300more candidate pairs are generatedwhile about 5000 secondsare reserved
63 Evaluating Verification The verification was evaluatedwith candidate pairs generated by Basic Term ldquo+MJrdquodenotes applying multiway join in verification whileldquo+MRGEDrdquo denotes adapting alternative MapReduce-basedGED calculation We use term ldquoMGSJoinrdquo to indicate thebasic algorithm incorporating both techniques Figure 3(a)shows the runtime comparison between +MRGED
and MGSJoin The result illustrates the superiority ofapplying multiway join When 120591 = 5 algorithm withmultiway join is about 6000 seconds faster than withordinary joins Figure 3(b) presents the result of evaluatingMRGED where Basic and +MRGED are compared It canbe observed that the runtime of both algorithms growsexponentially When 120591 equals 1 the Basic finishes quickerthan +MRGED (162 s and 204 s resp) while 120591 equals 2 the+MRGED outweighs Basic (the runtime is 712 s and 580 sresp) This is because the calculation required for 120591 = 1
8 The Scientific World Journal
Table 2 Complexity analysis of verification
Phase Job 1 Job 2 +MJ
Map IO 119874 (120572 |119877| |119878| +1003816100381610038161003816119892
1003816100381610038161003816 |119878|) 119874 (1003816100381610038161003816119892
1003816100381610038161003816 |119877| sdot (120572 |119878| + 1)) 119874 (1003816100381610038161003816119892
1003816100381610038161003816 (|119877| + |119878|) + 120572 |119877| |119878|)
Time 119874 (120572 |119877| |119878| +1003816100381610038161003816119892
1003816100381610038161003816 |119878|) 119874 (1003816100381610038161003816119892
1003816100381610038161003816 |119877| sdot (120572 |119878| + 1)) 119874 (119899 (|119877| + |119878|) + 120572 |119877| |119878|)
Shuffle 119874 (120572 |119877| |119878| +1003816100381610038161003816119892
1003816100381610038161003816 |119878|) 119874 (1003816100381610038161003816119892
1003816100381610038161003816 |119877| sdot (120572 |119878| + 1)) 119874 (1198991003816100381610038161003816119892
1003816100381610038161003816 (|119877| + |119878|) + 120572 |119877| |119878|)
Reduce IO 119874 (1205721003816100381610038161003816119892
1003816100381610038161003816 sdot |119877| |119878|) 119874 (120572120573 |119877| |119878|) 119874 (120572120573 |119877| |119878|)
Time 119874 (120572 |119877| |119878|) 119874 (120572|119877||119878|119881||119881|
) 119874 (120572|119877||119878|119881||119881|
)
0 1 2 3 4 5
Elap
sed
time (
s)
GED threshold (120591)
Basic
+SBF
15k
12k
9k
6k
3k
(a)
0
50
100
150
200
250
300
1 2 3 4 5GED threshold (120591)
Num
ber o
f Δ (c
andi
date
pai
rs)
(b)
Figure 2 Evaluating filtering
Table 3 Dataset statistics
Dataset |119877| |119881| |119864| Disk size (GB)Enamine 1000000 5232 5037 07
is small where the MRGED is clumsy compared withAlowastwhereas when 120591 = 2 more calculation is required so that theadvantage of MRGED comes out When 120591 is within the rangeof 3ndash5 Basic is unable to finish because the large calculationdrives Basic out of memory
64 Comparing with State-of-the-Art Method We comparedour algorithm with the state-of-the-art method GSimJoinIn Figure 4(a) we chose 10000 graphs in order to comparewith GSimJoin and the result witnesses the obvious supe-riority of MGSJoin over GSimJoin Figure 4(b) is drawn inlog scale which varies the number of graphs and records theelapsed time Both algorithms grow linear in the figure whichreflect their exponential growth MGSJoin rises mush slowerthan GSimJoin The runtime of GSimJoin is about 10 timeslonger than MGSJoinwhen joining 100 graphs and 100 timeslonger than MGSJoinwhen joining 10000 graphsMoreoverwhen we have to join 100000 graphs GSimJoin is runningout of memory so that no result is recorded
65 Speedup We evaluated the speedup of our algorithm byvarying the number of instances from 10 to 50 The exper-imental results are shown in Figure 5(a) We can see that
with the increase of instances in the cluster the performanceof MGSJoin significantly improved The improvement issignificantly shown when threshold 120591 equals 4 With moreinstances running the count filtering is getting faster bycounting for the common signatures simultaneously andthe verification is getting quicker by joining relations andcalculating GED in parallel
66 Scale-Up We evaluated the scale-up of our algorithmby increasing both dataset sizes and number of nodes in thecluster The result is shown in Figure 5(b) It is worth notingthat as the dataset increased the results for different values of120591 get similar in the trend of increase All lines rise smoothlywhich reveals good scalability of MGSJoin
7 Related Work
Graph SimilarityQueries Similarity joins retrieve similar dataobject pairs which can be strings sets trees and graphs [16]As to GED-based graph similarity search [17] proposed 120581-AT a tree-based 119902-gram approach However it is associatedwith the drawback of usually loose lower bound for countfiltering Seeing the drawback [9] presented a path-based 119902-gram approach In comparison with 120581-AT GSimJoin is moreefficient by leveraging more advanced filtering techniquesThus we adopt the path-based 119902-gram approach in this paper
MapReduce-Based Graph Algorithms MapReduce is a dis-tributed programming framework [10] which has been
The Scientific World Journal 9
Elap
sed
time (
s)
1 2 3 4 5GED threshold (120591)
16k
12k
8k
4k
MGSJoin
+MRGED
(a)
Elap
sed
time (
s)
Basic
+MR
105
104
103
102
1 2 3 4 5GED threshold (120591)
(b)
Figure 3 Evaluating verification
0
Elap
sed
time (
s)
GSimJoin
MGSJoin
1 2 3 4 5GED threshold (120591)
5k
4k
3k
2k
1k
(a)
Elap
sed
time (
s)
Number of graphs
GSimJoin
MGSJoin
105
104
103
102
102
101
103
104
106
105
(b)
Figure 4 Comparing with state-of-the-art method
2k
4k
6k
8k
10k
504540353025201510
Elap
sed
time
Number of instances
120591 = 1
120591 = 2
120591 = 3
120591 = 4
(a)
010 20 30 40 50
Elap
sed
time (
s)
Number of instances and number of graphs (times2000)
5k
4k
3k
2k
1k
(b)
Figure 5 Speedup and Scale-up
10 The Scientific World Journal
applied in processing large graphs Many graph algorithmsusing MapReduce were discussed in [18] including tri-anglesrectangles enumeration and 119896-cliques computationIn [19] several techniques were proposed to reduce theinput size of MapReduce and the techniques are appliedfor minimum spanning trees approximate maximal match-ings approximate nodeedge covers and minimum cutsPersonalized PageRank computation inMapReduce was dis-cussed in [20] Matrix multiplication based graph min-ing algorithms in MapReduce were investigated in [21]More recently densest subgraph computation [22] subgraphinstances enumeration [23] and connected componentscomputation in logarithmic rounds [24] were researched inMapReduce
Graph Processing Systems in Cloud Many systems weredeveloped in order to deal with big graphs Such a one repre-sentative system is Pregel [25] which takes a vertex-centricapproach and implements a bulk synchronous parallel (BSP)computation model HipG [26] improves BSP by using asyn-chronous messages to avoid synchronization PowerGraph[27] is a distributed graph processing system that is optimizedto process power-law graphs Giraph++ was proposed in[28] to take graph partitioning into consideration whenprocessing graphs Workload balancing for graph processingin cloud was discussed in [29]
8 Conclusion
In this paper we have investigated the problem of scalablegraph similarity joins We firstly present a MapReduce-based graph similarity join algorithm MGSJoin following thefiltering-verification framework To reduce the communica-tion cost in the filtering phase it incorporates the Bloomfilter technique to reduce the number of intermediate key-value pairs In addition we devise a multiway join optimizedverification procedure for further speedup Extensive experi-ments are conducted on real datasets to confirm the efficiencyand scalability of the proposed solution Furthermore theverification phase is further optimized with MapReducewhich enables the test on larger and denser graphs
As a future direction we plan to explore the possibility ofoptimizing the verificationwithmultithreaded programmingparadigm Additionally it is also of interest to test theefficiency and scalability of proposed algorithms on evenlarger andor denser graphs
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgments
The research is supported by the doctoral program of highereducation of China (No 2011437110008) and the nationalnatural science foundation of China (No 61303062)
References
[1] X Yan and J Han ldquogSpan graph-based substructure patternminingrdquo inProceedings of the 2nd IEEE International Conferenceon Data Mining (ICDM rsquo02) pp 721ndash724 December 2002
[2] X Yan P S Yu and J Han ldquoGraph indexing a frequentstructure-based approachrdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data (SIGMODrsquo04) pp 335ndash346 June 2004
[3] H He and A K Singh ldquoClosure-tree an index structurefor graph queriesrdquo in Proceedings of the 22nd InternationalConference on Data Engineering (ICDE rsquo06) p 38 April 2006
[4] X Yan P S Yu and J Han ldquoSubstructure similarity searchin graph databasesrdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data pp 766ndash777June 2005
[5] Y Tian and J M Patel ldquoTALE a tool for approximate largegraph matchingrdquo in Proceedings of the 24th IEEE InternationalConference on Data Engineering (ICDE rsquo08) pp 963ndash972 April2008
[6] A Sanfeliu and K-S Fu ldquoA distance measure betweenattributed relational graphs for pattern recognitionrdquo IEEETransactions on Systems Man and Cybernetics vol 13 no 3 pp353ndash362 1983
[7] H Bunke and G Allermann ldquoInexact graph matching forstructural pattern recognitionrdquo Pattern Recognition Letters vol1 no 4 pp 245ndash253 1983
[8] M R Garey and D S Johnson Computers and Intractabilityvol 174 Freeman San Francisco Calif USA 1979
[9] X Zhao C Xiao X Lin and W Wang ldquoEfficient graphsimilarity joins with edit distance constraintsrdquo in Proceedingsof the 28th IEEE International Conference on Data Engineering(ICDE rsquo12) pp 834ndash845 April 2012
[10] J Dean and SGhemawat ldquoMapReduce simplified data process-ing on large clustersrdquo Communications of the ACM vol 51 no1 pp 107ndash113 2008
[11] D Deng G Li S Hao JWang J Feng andW S Li ldquoMaSSJoinaMapReduce-basedmethod for scalable string similarity joinsrdquoinProceedings of the 30th IEEE International Conference onDataEngineering (ICDE rsquo14) pp 340ndash351 Chicago Ill USA
[12] B H Bloom ldquoSpacetime trade-offs in hash coding withallowable errorsrdquoCommunications of the ACM vol 13 no 7 pp422ndash426 1970
[13] A Broder and M Mitzenmacher ldquoNetwork applications ofbloom filters a surveyrdquo Internet Mathematics vol 1 no 4 pp485ndash509 2004
[14] S Cohen and Y Matias ldquoSpectral bloom filtersrdquo in Proceedingsof the ACM SIGMOD International Conference on Managementof Data pp 241ndash252 June 2003
[15] F N Afrati and J D Ullman ldquoOptimizing joins in a map-reduce environmentrdquo in Proceedings of the 13th InternationalConference on Extending Database Technology Advances inDatabase Technology (EDBT rsquo10) pp 99ndash110 March 2010
[16] Y N Silva and J M Reed ldquoExploiting MapReduce-based sim-ilarity joinsrdquo in Proceedings of the ACM SIGMOD InternationalConference onManagement of Data (SIGMOD 12) pp 693ndash696May 2012
[17] G Wang B Wang X Yang and G Yu ldquoEfficiently indexinglarge sparse graphs for similarity searchrdquo IEEE Transactions onKnowledge and Data Engineering vol 24 no 3 pp 440ndash4512012
The Scientific World Journal 11
[18] J Cohen ldquoGraph twiddling in a MapReduce worldrdquo Computingin Science and Engineering vol 11 no 4 pp 29ndash41 2009
[19] S Lattanzi B Moseley S Suri and S Vassilvitskii ldquoFilteringa method for solving graph problems in MapReducerdquo inProceedings of the 23rd ACM Symposium on Parallelism inAlgorithms and Architectures (SPAA rsquo11) pp 85ndash94 June 2011
[20] B Bahmani K Chakrabarti and D Xin ldquoFast personalizedPageRank onMapReducerdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data pp 973ndash984June 2011
[21] U Kang H Tong J Sun C-Y Lin and C Faloutsos ldquoGBASEa Scalable and general graph management systemrdquo in Pro-ceedings of the 17th ACM SIGKDD International Conference onKnowledge Discovery andDataMining (KDD rsquo11) pp 1091ndash1099August 2011
[22] B Bahmani R Kumar and S Vassilvitskii ldquoDensest subgraphin streaming andMapReducerdquo Proceedings of the VLDB Endow-ment vol 5 no 5 pp 454ndash465 2012
[23] F N Afrati D Fotakis and J D Ullman ldquoEnumeratingsubgraph instances using map-reducerdquo in Proceedings of the29th International Conference on Data Engineering (ICDE rsquo13)pp 62ndash73 April 2013
[24] V Rastogi A Machanavajjhala L Chitnis and A Das SarmaldquoFinding connected components in map-reduce in logarithmicroundsrdquo in Proceedings of the 29th International Conference onData Engineering (ICDE rsquo13) pp 50ndash61 April 2013
[25] G Malewicz M H Austern A J C Bik et al ldquoPregel asystem for large-scale graph processingrdquo in Proceedings of theInternational Conference on Management of Data (SIGMODrsquo10) pp 135ndash145 June 2010
[26] E Krepska T Kielmann W Fokkink and H Bal ldquoHipg par-allelprocessing of large-scale graphsrdquo ACM SIGOPS OperatingSystems Review vol 45 no 2 pp 3ndash13 2011
[27] J E Gonzalez Y Low H Gu D Bickson and C GuestrinldquoPowergraph distributed graph-parallel computation on natu-ral graphsrdquo in Proceedings of the 10th USENIX Symposium onOperating Systems Design and Implementation (OSDI rsquo12) pp17ndash30 2012
[28] Y Tian A Balmin S A Corsten S Tatikonda and J McPher-son ldquoFrom ldquothink like a vertexrdquo to ldquothink like a graphrdquordquoProceedings of the VLDB Endowment vol 7 no 3 2013
[29] Z Shang and J X Yu ldquoCatch the wind graph workloadbalancing on cloudrdquo in Proceedings of the 29th InternationalConference on Data Engineering (ICDE rsquo13) pp 553ndash564 April2013
Submit your manuscripts athttpwwwhindawicom
Computer Games Technology
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Distributed Sensor Networks
International Journal of
Advances in
FuzzySystems
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014
International Journal of
ReconfigurableComputing
Hindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Applied Computational Intelligence and Soft Computing
thinspAdvancesthinspinthinsp
Artificial Intelligence
HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014
Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Electrical and Computer Engineering
Journal of
Journal of
Computer Networks and Communications
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation
httpwwwhindawicom Volume 2014
Advances in
Multimedia
International Journal of
Biomedical Imaging
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
ArtificialNeural Systems
Advances in
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
RoboticsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Computational Intelligence and Neuroscience
Industrial EngineeringJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Human-ComputerInteraction
Advances in
Computer EngineeringAdvances in
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
8 The Scientific World Journal
Table 2 Complexity analysis of verification
Phase Job 1 Job 2 +MJ
Map IO 119874 (120572 |119877| |119878| +1003816100381610038161003816119892
1003816100381610038161003816 |119878|) 119874 (1003816100381610038161003816119892
1003816100381610038161003816 |119877| sdot (120572 |119878| + 1)) 119874 (1003816100381610038161003816119892
1003816100381610038161003816 (|119877| + |119878|) + 120572 |119877| |119878|)
Time 119874 (120572 |119877| |119878| +1003816100381610038161003816119892
1003816100381610038161003816 |119878|) 119874 (1003816100381610038161003816119892
1003816100381610038161003816 |119877| sdot (120572 |119878| + 1)) 119874 (119899 (|119877| + |119878|) + 120572 |119877| |119878|)
Shuffle 119874 (120572 |119877| |119878| +1003816100381610038161003816119892
1003816100381610038161003816 |119878|) 119874 (1003816100381610038161003816119892
1003816100381610038161003816 |119877| sdot (120572 |119878| + 1)) 119874 (1198991003816100381610038161003816119892
1003816100381610038161003816 (|119877| + |119878|) + 120572 |119877| |119878|)
Reduce IO 119874 (1205721003816100381610038161003816119892
1003816100381610038161003816 sdot |119877| |119878|) 119874 (120572120573 |119877| |119878|) 119874 (120572120573 |119877| |119878|)
Time 119874 (120572 |119877| |119878|) 119874 (120572|119877||119878|119881||119881|
) 119874 (120572|119877||119878|119881||119881|
)
0 1 2 3 4 5
Elap
sed
time (
s)
GED threshold (120591)
Basic
+SBF
15k
12k
9k
6k
3k
(a)
0
50
100
150
200
250
300
1 2 3 4 5GED threshold (120591)
Num
ber o
f Δ (c
andi
date
pai
rs)
(b)
Figure 2 Evaluating filtering
Table 3 Dataset statistics
Dataset |119877| |119881| |119864| Disk size (GB)Enamine 1000000 5232 5037 07
is small where the MRGED is clumsy compared withAlowastwhereas when 120591 = 2 more calculation is required so that theadvantage of MRGED comes out When 120591 is within the rangeof 3ndash5 Basic is unable to finish because the large calculationdrives Basic out of memory
64 Comparing with State-of-the-Art Method We comparedour algorithm with the state-of-the-art method GSimJoinIn Figure 4(a) we chose 10000 graphs in order to comparewith GSimJoin and the result witnesses the obvious supe-riority of MGSJoin over GSimJoin Figure 4(b) is drawn inlog scale which varies the number of graphs and records theelapsed time Both algorithms grow linear in the figure whichreflect their exponential growth MGSJoin rises mush slowerthan GSimJoin The runtime of GSimJoin is about 10 timeslonger than MGSJoinwhen joining 100 graphs and 100 timeslonger than MGSJoinwhen joining 10000 graphsMoreoverwhen we have to join 100000 graphs GSimJoin is runningout of memory so that no result is recorded
65 Speedup We evaluated the speedup of our algorithm byvarying the number of instances from 10 to 50 The exper-imental results are shown in Figure 5(a) We can see that
with the increase of instances in the cluster the performanceof MGSJoin significantly improved The improvement issignificantly shown when threshold 120591 equals 4 With moreinstances running the count filtering is getting faster bycounting for the common signatures simultaneously andthe verification is getting quicker by joining relations andcalculating GED in parallel
66 Scale-Up We evaluated the scale-up of our algorithmby increasing both dataset sizes and number of nodes in thecluster The result is shown in Figure 5(b) It is worth notingthat as the dataset increased the results for different values of120591 get similar in the trend of increase All lines rise smoothlywhich reveals good scalability of MGSJoin
7 Related Work
Graph SimilarityQueries Similarity joins retrieve similar dataobject pairs which can be strings sets trees and graphs [16]As to GED-based graph similarity search [17] proposed 120581-AT a tree-based 119902-gram approach However it is associatedwith the drawback of usually loose lower bound for countfiltering Seeing the drawback [9] presented a path-based 119902-gram approach In comparison with 120581-AT GSimJoin is moreefficient by leveraging more advanced filtering techniquesThus we adopt the path-based 119902-gram approach in this paper
MapReduce-Based Graph Algorithms MapReduce is a dis-tributed programming framework [10] which has been
The Scientific World Journal 9
Elap
sed
time (
s)
1 2 3 4 5GED threshold (120591)
16k
12k
8k
4k
MGSJoin
+MRGED
(a)
Elap
sed
time (
s)
Basic
+MR
105
104
103
102
1 2 3 4 5GED threshold (120591)
(b)
Figure 3 Evaluating verification
0
Elap
sed
time (
s)
GSimJoin
MGSJoin
1 2 3 4 5GED threshold (120591)
5k
4k
3k
2k
1k
(a)
Elap
sed
time (
s)
Number of graphs
GSimJoin
MGSJoin
105
104
103
102
102
101
103
104
106
105
(b)
Figure 4 Comparing with state-of-the-art method
2k
4k
6k
8k
10k
504540353025201510
Elap
sed
time
Number of instances
120591 = 1
120591 = 2
120591 = 3
120591 = 4
(a)
010 20 30 40 50
Elap
sed
time (
s)
Number of instances and number of graphs (times2000)
5k
4k
3k
2k
1k
(b)
Figure 5 Speedup and Scale-up
10 The Scientific World Journal
applied in processing large graphs Many graph algorithmsusing MapReduce were discussed in [18] including tri-anglesrectangles enumeration and 119896-cliques computationIn [19] several techniques were proposed to reduce theinput size of MapReduce and the techniques are appliedfor minimum spanning trees approximate maximal match-ings approximate nodeedge covers and minimum cutsPersonalized PageRank computation inMapReduce was dis-cussed in [20] Matrix multiplication based graph min-ing algorithms in MapReduce were investigated in [21]More recently densest subgraph computation [22] subgraphinstances enumeration [23] and connected componentscomputation in logarithmic rounds [24] were researched inMapReduce
Graph Processing Systems in Cloud Many systems weredeveloped in order to deal with big graphs Such a one repre-sentative system is Pregel [25] which takes a vertex-centricapproach and implements a bulk synchronous parallel (BSP)computation model HipG [26] improves BSP by using asyn-chronous messages to avoid synchronization PowerGraph[27] is a distributed graph processing system that is optimizedto process power-law graphs Giraph++ was proposed in[28] to take graph partitioning into consideration whenprocessing graphs Workload balancing for graph processingin cloud was discussed in [29]
8 Conclusion
In this paper we have investigated the problem of scalablegraph similarity joins We firstly present a MapReduce-based graph similarity join algorithm MGSJoin following thefiltering-verification framework To reduce the communica-tion cost in the filtering phase it incorporates the Bloomfilter technique to reduce the number of intermediate key-value pairs In addition we devise a multiway join optimizedverification procedure for further speedup Extensive experi-ments are conducted on real datasets to confirm the efficiencyand scalability of the proposed solution Furthermore theverification phase is further optimized with MapReducewhich enables the test on larger and denser graphs
As a future direction we plan to explore the possibility ofoptimizing the verificationwithmultithreaded programmingparadigm Additionally it is also of interest to test theefficiency and scalability of proposed algorithms on evenlarger andor denser graphs
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgments
The research is supported by the doctoral program of highereducation of China (No 2011437110008) and the nationalnatural science foundation of China (No 61303062)
References
[1] X Yan and J Han ldquogSpan graph-based substructure patternminingrdquo inProceedings of the 2nd IEEE International Conferenceon Data Mining (ICDM rsquo02) pp 721ndash724 December 2002
[2] X Yan P S Yu and J Han ldquoGraph indexing a frequentstructure-based approachrdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data (SIGMODrsquo04) pp 335ndash346 June 2004
[3] H He and A K Singh ldquoClosure-tree an index structurefor graph queriesrdquo in Proceedings of the 22nd InternationalConference on Data Engineering (ICDE rsquo06) p 38 April 2006
[4] X Yan P S Yu and J Han ldquoSubstructure similarity searchin graph databasesrdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data pp 766ndash777June 2005
[5] Y Tian and J M Patel ldquoTALE a tool for approximate largegraph matchingrdquo in Proceedings of the 24th IEEE InternationalConference on Data Engineering (ICDE rsquo08) pp 963ndash972 April2008
[6] A Sanfeliu and K-S Fu ldquoA distance measure betweenattributed relational graphs for pattern recognitionrdquo IEEETransactions on Systems Man and Cybernetics vol 13 no 3 pp353ndash362 1983
[7] H Bunke and G Allermann ldquoInexact graph matching forstructural pattern recognitionrdquo Pattern Recognition Letters vol1 no 4 pp 245ndash253 1983
[8] M R Garey and D S Johnson Computers and Intractabilityvol 174 Freeman San Francisco Calif USA 1979
[9] X Zhao C Xiao X Lin and W Wang ldquoEfficient graphsimilarity joins with edit distance constraintsrdquo in Proceedingsof the 28th IEEE International Conference on Data Engineering(ICDE rsquo12) pp 834ndash845 April 2012
[10] J Dean and SGhemawat ldquoMapReduce simplified data process-ing on large clustersrdquo Communications of the ACM vol 51 no1 pp 107ndash113 2008
[11] D Deng G Li S Hao JWang J Feng andW S Li ldquoMaSSJoinaMapReduce-basedmethod for scalable string similarity joinsrdquoinProceedings of the 30th IEEE International Conference onDataEngineering (ICDE rsquo14) pp 340ndash351 Chicago Ill USA
[12] B H Bloom ldquoSpacetime trade-offs in hash coding withallowable errorsrdquoCommunications of the ACM vol 13 no 7 pp422ndash426 1970
[13] A Broder and M Mitzenmacher ldquoNetwork applications ofbloom filters a surveyrdquo Internet Mathematics vol 1 no 4 pp485ndash509 2004
[14] S Cohen and Y Matias ldquoSpectral bloom filtersrdquo in Proceedingsof the ACM SIGMOD International Conference on Managementof Data pp 241ndash252 June 2003
[15] F N Afrati and J D Ullman ldquoOptimizing joins in a map-reduce environmentrdquo in Proceedings of the 13th InternationalConference on Extending Database Technology Advances inDatabase Technology (EDBT rsquo10) pp 99ndash110 March 2010
[16] Y N Silva and J M Reed ldquoExploiting MapReduce-based sim-ilarity joinsrdquo in Proceedings of the ACM SIGMOD InternationalConference onManagement of Data (SIGMOD 12) pp 693ndash696May 2012
[17] G Wang B Wang X Yang and G Yu ldquoEfficiently indexinglarge sparse graphs for similarity searchrdquo IEEE Transactions onKnowledge and Data Engineering vol 24 no 3 pp 440ndash4512012
The Scientific World Journal 11
[18] J Cohen ldquoGraph twiddling in a MapReduce worldrdquo Computingin Science and Engineering vol 11 no 4 pp 29ndash41 2009
[19] S Lattanzi B Moseley S Suri and S Vassilvitskii ldquoFilteringa method for solving graph problems in MapReducerdquo inProceedings of the 23rd ACM Symposium on Parallelism inAlgorithms and Architectures (SPAA rsquo11) pp 85ndash94 June 2011
[20] B Bahmani K Chakrabarti and D Xin ldquoFast personalizedPageRank onMapReducerdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data pp 973ndash984June 2011
[21] U Kang H Tong J Sun C-Y Lin and C Faloutsos ldquoGBASEa Scalable and general graph management systemrdquo in Pro-ceedings of the 17th ACM SIGKDD International Conference onKnowledge Discovery andDataMining (KDD rsquo11) pp 1091ndash1099August 2011
[22] B Bahmani R Kumar and S Vassilvitskii ldquoDensest subgraphin streaming andMapReducerdquo Proceedings of the VLDB Endow-ment vol 5 no 5 pp 454ndash465 2012
[23] F N Afrati D Fotakis and J D Ullman ldquoEnumeratingsubgraph instances using map-reducerdquo in Proceedings of the29th International Conference on Data Engineering (ICDE rsquo13)pp 62ndash73 April 2013
[24] V Rastogi A Machanavajjhala L Chitnis and A Das SarmaldquoFinding connected components in map-reduce in logarithmicroundsrdquo in Proceedings of the 29th International Conference onData Engineering (ICDE rsquo13) pp 50ndash61 April 2013
[25] G Malewicz M H Austern A J C Bik et al ldquoPregel asystem for large-scale graph processingrdquo in Proceedings of theInternational Conference on Management of Data (SIGMODrsquo10) pp 135ndash145 June 2010
[26] E Krepska T Kielmann W Fokkink and H Bal ldquoHipg par-allelprocessing of large-scale graphsrdquo ACM SIGOPS OperatingSystems Review vol 45 no 2 pp 3ndash13 2011
[27] J E Gonzalez Y Low H Gu D Bickson and C GuestrinldquoPowergraph distributed graph-parallel computation on natu-ral graphsrdquo in Proceedings of the 10th USENIX Symposium onOperating Systems Design and Implementation (OSDI rsquo12) pp17ndash30 2012
[28] Y Tian A Balmin S A Corsten S Tatikonda and J McPher-son ldquoFrom ldquothink like a vertexrdquo to ldquothink like a graphrdquordquoProceedings of the VLDB Endowment vol 7 no 3 2013
[29] Z Shang and J X Yu ldquoCatch the wind graph workloadbalancing on cloudrdquo in Proceedings of the 29th InternationalConference on Data Engineering (ICDE rsquo13) pp 553ndash564 April2013
Submit your manuscripts athttpwwwhindawicom
Computer Games Technology
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Distributed Sensor Networks
International Journal of
Advances in
FuzzySystems
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014
International Journal of
ReconfigurableComputing
Hindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Applied Computational Intelligence and Soft Computing
thinspAdvancesthinspinthinsp
Artificial Intelligence
HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014
Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Electrical and Computer Engineering
Journal of
Journal of
Computer Networks and Communications
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation
httpwwwhindawicom Volume 2014
Advances in
Multimedia
International Journal of
Biomedical Imaging
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
ArtificialNeural Systems
Advances in
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
RoboticsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Computational Intelligence and Neuroscience
Industrial EngineeringJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Human-ComputerInteraction
Advances in
Computer EngineeringAdvances in
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World Journal 9
Elap
sed
time (
s)
1 2 3 4 5GED threshold (120591)
16k
12k
8k
4k
MGSJoin
+MRGED
(a)
Elap
sed
time (
s)
Basic
+MR
105
104
103
102
1 2 3 4 5GED threshold (120591)
(b)
Figure 3 Evaluating verification
0
Elap
sed
time (
s)
GSimJoin
MGSJoin
1 2 3 4 5GED threshold (120591)
5k
4k
3k
2k
1k
(a)
Elap
sed
time (
s)
Number of graphs
GSimJoin
MGSJoin
105
104
103
102
102
101
103
104
106
105
(b)
Figure 4 Comparing with state-of-the-art method
2k
4k
6k
8k
10k
504540353025201510
Elap
sed
time
Number of instances
120591 = 1
120591 = 2
120591 = 3
120591 = 4
(a)
010 20 30 40 50
Elap
sed
time (
s)
Number of instances and number of graphs (times2000)
5k
4k
3k
2k
1k
(b)
Figure 5 Speedup and Scale-up
10 The Scientific World Journal
applied in processing large graphs Many graph algorithmsusing MapReduce were discussed in [18] including tri-anglesrectangles enumeration and 119896-cliques computationIn [19] several techniques were proposed to reduce theinput size of MapReduce and the techniques are appliedfor minimum spanning trees approximate maximal match-ings approximate nodeedge covers and minimum cutsPersonalized PageRank computation inMapReduce was dis-cussed in [20] Matrix multiplication based graph min-ing algorithms in MapReduce were investigated in [21]More recently densest subgraph computation [22] subgraphinstances enumeration [23] and connected componentscomputation in logarithmic rounds [24] were researched inMapReduce
Graph Processing Systems in Cloud Many systems weredeveloped in order to deal with big graphs Such a one repre-sentative system is Pregel [25] which takes a vertex-centricapproach and implements a bulk synchronous parallel (BSP)computation model HipG [26] improves BSP by using asyn-chronous messages to avoid synchronization PowerGraph[27] is a distributed graph processing system that is optimizedto process power-law graphs Giraph++ was proposed in[28] to take graph partitioning into consideration whenprocessing graphs Workload balancing for graph processingin cloud was discussed in [29]
8 Conclusion
In this paper we have investigated the problem of scalablegraph similarity joins We firstly present a MapReduce-based graph similarity join algorithm MGSJoin following thefiltering-verification framework To reduce the communica-tion cost in the filtering phase it incorporates the Bloomfilter technique to reduce the number of intermediate key-value pairs In addition we devise a multiway join optimizedverification procedure for further speedup Extensive experi-ments are conducted on real datasets to confirm the efficiencyand scalability of the proposed solution Furthermore theverification phase is further optimized with MapReducewhich enables the test on larger and denser graphs
As a future direction we plan to explore the possibility ofoptimizing the verificationwithmultithreaded programmingparadigm Additionally it is also of interest to test theefficiency and scalability of proposed algorithms on evenlarger andor denser graphs
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgments
The research is supported by the doctoral program of highereducation of China (No 2011437110008) and the nationalnatural science foundation of China (No 61303062)
References
[1] X Yan and J Han ldquogSpan graph-based substructure patternminingrdquo inProceedings of the 2nd IEEE International Conferenceon Data Mining (ICDM rsquo02) pp 721ndash724 December 2002
[2] X Yan P S Yu and J Han ldquoGraph indexing a frequentstructure-based approachrdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data (SIGMODrsquo04) pp 335ndash346 June 2004
[3] H He and A K Singh ldquoClosure-tree an index structurefor graph queriesrdquo in Proceedings of the 22nd InternationalConference on Data Engineering (ICDE rsquo06) p 38 April 2006
[4] X Yan P S Yu and J Han ldquoSubstructure similarity searchin graph databasesrdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data pp 766ndash777June 2005
[5] Y Tian and J M Patel ldquoTALE a tool for approximate largegraph matchingrdquo in Proceedings of the 24th IEEE InternationalConference on Data Engineering (ICDE rsquo08) pp 963ndash972 April2008
[6] A Sanfeliu and K-S Fu ldquoA distance measure betweenattributed relational graphs for pattern recognitionrdquo IEEETransactions on Systems Man and Cybernetics vol 13 no 3 pp353ndash362 1983
[7] H Bunke and G Allermann ldquoInexact graph matching forstructural pattern recognitionrdquo Pattern Recognition Letters vol1 no 4 pp 245ndash253 1983
[8] M R Garey and D S Johnson Computers and Intractabilityvol 174 Freeman San Francisco Calif USA 1979
[9] X Zhao C Xiao X Lin and W Wang ldquoEfficient graphsimilarity joins with edit distance constraintsrdquo in Proceedingsof the 28th IEEE International Conference on Data Engineering(ICDE rsquo12) pp 834ndash845 April 2012
[10] J Dean and SGhemawat ldquoMapReduce simplified data process-ing on large clustersrdquo Communications of the ACM vol 51 no1 pp 107ndash113 2008
[11] D Deng G Li S Hao JWang J Feng andW S Li ldquoMaSSJoinaMapReduce-basedmethod for scalable string similarity joinsrdquoinProceedings of the 30th IEEE International Conference onDataEngineering (ICDE rsquo14) pp 340ndash351 Chicago Ill USA
[12] B H Bloom ldquoSpacetime trade-offs in hash coding withallowable errorsrdquoCommunications of the ACM vol 13 no 7 pp422ndash426 1970
[13] A Broder and M Mitzenmacher ldquoNetwork applications ofbloom filters a surveyrdquo Internet Mathematics vol 1 no 4 pp485ndash509 2004
[14] S Cohen and Y Matias ldquoSpectral bloom filtersrdquo in Proceedingsof the ACM SIGMOD International Conference on Managementof Data pp 241ndash252 June 2003
[15] F N Afrati and J D Ullman ldquoOptimizing joins in a map-reduce environmentrdquo in Proceedings of the 13th InternationalConference on Extending Database Technology Advances inDatabase Technology (EDBT rsquo10) pp 99ndash110 March 2010
[16] Y N Silva and J M Reed ldquoExploiting MapReduce-based sim-ilarity joinsrdquo in Proceedings of the ACM SIGMOD InternationalConference onManagement of Data (SIGMOD 12) pp 693ndash696May 2012
[17] G Wang B Wang X Yang and G Yu ldquoEfficiently indexinglarge sparse graphs for similarity searchrdquo IEEE Transactions onKnowledge and Data Engineering vol 24 no 3 pp 440ndash4512012
The Scientific World Journal 11
[18] J Cohen ldquoGraph twiddling in a MapReduce worldrdquo Computingin Science and Engineering vol 11 no 4 pp 29ndash41 2009
[19] S Lattanzi B Moseley S Suri and S Vassilvitskii ldquoFilteringa method for solving graph problems in MapReducerdquo inProceedings of the 23rd ACM Symposium on Parallelism inAlgorithms and Architectures (SPAA rsquo11) pp 85ndash94 June 2011
[20] B Bahmani K Chakrabarti and D Xin ldquoFast personalizedPageRank onMapReducerdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data pp 973ndash984June 2011
[21] U Kang H Tong J Sun C-Y Lin and C Faloutsos ldquoGBASEa Scalable and general graph management systemrdquo in Pro-ceedings of the 17th ACM SIGKDD International Conference onKnowledge Discovery andDataMining (KDD rsquo11) pp 1091ndash1099August 2011
[22] B Bahmani R Kumar and S Vassilvitskii ldquoDensest subgraphin streaming andMapReducerdquo Proceedings of the VLDB Endow-ment vol 5 no 5 pp 454ndash465 2012
[23] F N Afrati D Fotakis and J D Ullman ldquoEnumeratingsubgraph instances using map-reducerdquo in Proceedings of the29th International Conference on Data Engineering (ICDE rsquo13)pp 62ndash73 April 2013
[24] V Rastogi A Machanavajjhala L Chitnis and A Das SarmaldquoFinding connected components in map-reduce in logarithmicroundsrdquo in Proceedings of the 29th International Conference onData Engineering (ICDE rsquo13) pp 50ndash61 April 2013
[25] G Malewicz M H Austern A J C Bik et al ldquoPregel asystem for large-scale graph processingrdquo in Proceedings of theInternational Conference on Management of Data (SIGMODrsquo10) pp 135ndash145 June 2010
[26] E Krepska T Kielmann W Fokkink and H Bal ldquoHipg par-allelprocessing of large-scale graphsrdquo ACM SIGOPS OperatingSystems Review vol 45 no 2 pp 3ndash13 2011
[27] J E Gonzalez Y Low H Gu D Bickson and C GuestrinldquoPowergraph distributed graph-parallel computation on natu-ral graphsrdquo in Proceedings of the 10th USENIX Symposium onOperating Systems Design and Implementation (OSDI rsquo12) pp17ndash30 2012
[28] Y Tian A Balmin S A Corsten S Tatikonda and J McPher-son ldquoFrom ldquothink like a vertexrdquo to ldquothink like a graphrdquordquoProceedings of the VLDB Endowment vol 7 no 3 2013
[29] Z Shang and J X Yu ldquoCatch the wind graph workloadbalancing on cloudrdquo in Proceedings of the 29th InternationalConference on Data Engineering (ICDE rsquo13) pp 553ndash564 April2013
Submit your manuscripts athttpwwwhindawicom
Computer Games Technology
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Distributed Sensor Networks
International Journal of
Advances in
FuzzySystems
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014
International Journal of
ReconfigurableComputing
Hindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Applied Computational Intelligence and Soft Computing
thinspAdvancesthinspinthinsp
Artificial Intelligence
HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014
Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Electrical and Computer Engineering
Journal of
Journal of
Computer Networks and Communications
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation
httpwwwhindawicom Volume 2014
Advances in
Multimedia
International Journal of
Biomedical Imaging
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
ArtificialNeural Systems
Advances in
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
RoboticsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Computational Intelligence and Neuroscience
Industrial EngineeringJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Human-ComputerInteraction
Advances in
Computer EngineeringAdvances in
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
10 The Scientific World Journal
applied in processing large graphs Many graph algorithmsusing MapReduce were discussed in [18] including tri-anglesrectangles enumeration and 119896-cliques computationIn [19] several techniques were proposed to reduce theinput size of MapReduce and the techniques are appliedfor minimum spanning trees approximate maximal match-ings approximate nodeedge covers and minimum cutsPersonalized PageRank computation inMapReduce was dis-cussed in [20] Matrix multiplication based graph min-ing algorithms in MapReduce were investigated in [21]More recently densest subgraph computation [22] subgraphinstances enumeration [23] and connected componentscomputation in logarithmic rounds [24] were researched inMapReduce
Graph Processing Systems in Cloud Many systems weredeveloped in order to deal with big graphs Such a one repre-sentative system is Pregel [25] which takes a vertex-centricapproach and implements a bulk synchronous parallel (BSP)computation model HipG [26] improves BSP by using asyn-chronous messages to avoid synchronization PowerGraph[27] is a distributed graph processing system that is optimizedto process power-law graphs Giraph++ was proposed in[28] to take graph partitioning into consideration whenprocessing graphs Workload balancing for graph processingin cloud was discussed in [29]
8 Conclusion
In this paper we have investigated the problem of scalablegraph similarity joins We firstly present a MapReduce-based graph similarity join algorithm MGSJoin following thefiltering-verification framework To reduce the communica-tion cost in the filtering phase it incorporates the Bloomfilter technique to reduce the number of intermediate key-value pairs In addition we devise a multiway join optimizedverification procedure for further speedup Extensive experi-ments are conducted on real datasets to confirm the efficiencyand scalability of the proposed solution Furthermore theverification phase is further optimized with MapReducewhich enables the test on larger and denser graphs
As a future direction we plan to explore the possibility ofoptimizing the verificationwithmultithreaded programmingparadigm Additionally it is also of interest to test theefficiency and scalability of proposed algorithms on evenlarger andor denser graphs
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgments
The research is supported by the doctoral program of highereducation of China (No 2011437110008) and the nationalnatural science foundation of China (No 61303062)
References
[1] X Yan and J Han ldquogSpan graph-based substructure patternminingrdquo inProceedings of the 2nd IEEE International Conferenceon Data Mining (ICDM rsquo02) pp 721ndash724 December 2002
[2] X Yan P S Yu and J Han ldquoGraph indexing a frequentstructure-based approachrdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data (SIGMODrsquo04) pp 335ndash346 June 2004
[3] H He and A K Singh ldquoClosure-tree an index structurefor graph queriesrdquo in Proceedings of the 22nd InternationalConference on Data Engineering (ICDE rsquo06) p 38 April 2006
[4] X Yan P S Yu and J Han ldquoSubstructure similarity searchin graph databasesrdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data pp 766ndash777June 2005
[5] Y Tian and J M Patel ldquoTALE a tool for approximate largegraph matchingrdquo in Proceedings of the 24th IEEE InternationalConference on Data Engineering (ICDE rsquo08) pp 963ndash972 April2008
[6] A Sanfeliu and K-S Fu ldquoA distance measure betweenattributed relational graphs for pattern recognitionrdquo IEEETransactions on Systems Man and Cybernetics vol 13 no 3 pp353ndash362 1983
[7] H Bunke and G Allermann ldquoInexact graph matching forstructural pattern recognitionrdquo Pattern Recognition Letters vol1 no 4 pp 245ndash253 1983
[8] M R Garey and D S Johnson Computers and Intractabilityvol 174 Freeman San Francisco Calif USA 1979
[9] X Zhao C Xiao X Lin and W Wang ldquoEfficient graphsimilarity joins with edit distance constraintsrdquo in Proceedingsof the 28th IEEE International Conference on Data Engineering(ICDE rsquo12) pp 834ndash845 April 2012
[10] J Dean and SGhemawat ldquoMapReduce simplified data process-ing on large clustersrdquo Communications of the ACM vol 51 no1 pp 107ndash113 2008
[11] D Deng G Li S Hao JWang J Feng andW S Li ldquoMaSSJoinaMapReduce-basedmethod for scalable string similarity joinsrdquoinProceedings of the 30th IEEE International Conference onDataEngineering (ICDE rsquo14) pp 340ndash351 Chicago Ill USA
[12] B H Bloom ldquoSpacetime trade-offs in hash coding withallowable errorsrdquoCommunications of the ACM vol 13 no 7 pp422ndash426 1970
[13] A Broder and M Mitzenmacher ldquoNetwork applications ofbloom filters a surveyrdquo Internet Mathematics vol 1 no 4 pp485ndash509 2004
[14] S Cohen and Y Matias ldquoSpectral bloom filtersrdquo in Proceedingsof the ACM SIGMOD International Conference on Managementof Data pp 241ndash252 June 2003
[15] F N Afrati and J D Ullman ldquoOptimizing joins in a map-reduce environmentrdquo in Proceedings of the 13th InternationalConference on Extending Database Technology Advances inDatabase Technology (EDBT rsquo10) pp 99ndash110 March 2010
[16] Y N Silva and J M Reed ldquoExploiting MapReduce-based sim-ilarity joinsrdquo in Proceedings of the ACM SIGMOD InternationalConference onManagement of Data (SIGMOD 12) pp 693ndash696May 2012
[17] G Wang B Wang X Yang and G Yu ldquoEfficiently indexinglarge sparse graphs for similarity searchrdquo IEEE Transactions onKnowledge and Data Engineering vol 24 no 3 pp 440ndash4512012
The Scientific World Journal 11
[18] J Cohen ldquoGraph twiddling in a MapReduce worldrdquo Computingin Science and Engineering vol 11 no 4 pp 29ndash41 2009
[19] S Lattanzi B Moseley S Suri and S Vassilvitskii ldquoFilteringa method for solving graph problems in MapReducerdquo inProceedings of the 23rd ACM Symposium on Parallelism inAlgorithms and Architectures (SPAA rsquo11) pp 85ndash94 June 2011
[20] B Bahmani K Chakrabarti and D Xin ldquoFast personalizedPageRank onMapReducerdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data pp 973ndash984June 2011
[21] U Kang H Tong J Sun C-Y Lin and C Faloutsos ldquoGBASEa Scalable and general graph management systemrdquo in Pro-ceedings of the 17th ACM SIGKDD International Conference onKnowledge Discovery andDataMining (KDD rsquo11) pp 1091ndash1099August 2011
[22] B Bahmani R Kumar and S Vassilvitskii ldquoDensest subgraphin streaming andMapReducerdquo Proceedings of the VLDB Endow-ment vol 5 no 5 pp 454ndash465 2012
[23] F N Afrati D Fotakis and J D Ullman ldquoEnumeratingsubgraph instances using map-reducerdquo in Proceedings of the29th International Conference on Data Engineering (ICDE rsquo13)pp 62ndash73 April 2013
[24] V Rastogi A Machanavajjhala L Chitnis and A Das SarmaldquoFinding connected components in map-reduce in logarithmicroundsrdquo in Proceedings of the 29th International Conference onData Engineering (ICDE rsquo13) pp 50ndash61 April 2013
[25] G Malewicz M H Austern A J C Bik et al ldquoPregel asystem for large-scale graph processingrdquo in Proceedings of theInternational Conference on Management of Data (SIGMODrsquo10) pp 135ndash145 June 2010
[26] E Krepska T Kielmann W Fokkink and H Bal ldquoHipg par-allelprocessing of large-scale graphsrdquo ACM SIGOPS OperatingSystems Review vol 45 no 2 pp 3ndash13 2011
[27] J E Gonzalez Y Low H Gu D Bickson and C GuestrinldquoPowergraph distributed graph-parallel computation on natu-ral graphsrdquo in Proceedings of the 10th USENIX Symposium onOperating Systems Design and Implementation (OSDI rsquo12) pp17ndash30 2012
[28] Y Tian A Balmin S A Corsten S Tatikonda and J McPher-son ldquoFrom ldquothink like a vertexrdquo to ldquothink like a graphrdquordquoProceedings of the VLDB Endowment vol 7 no 3 2013
[29] Z Shang and J X Yu ldquoCatch the wind graph workloadbalancing on cloudrdquo in Proceedings of the 29th InternationalConference on Data Engineering (ICDE rsquo13) pp 553ndash564 April2013
Submit your manuscripts athttpwwwhindawicom
Computer Games Technology
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Distributed Sensor Networks
International Journal of
Advances in
FuzzySystems
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014
International Journal of
ReconfigurableComputing
Hindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Applied Computational Intelligence and Soft Computing
thinspAdvancesthinspinthinsp
Artificial Intelligence
HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014
Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Electrical and Computer Engineering
Journal of
Journal of
Computer Networks and Communications
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation
httpwwwhindawicom Volume 2014
Advances in
Multimedia
International Journal of
Biomedical Imaging
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
ArtificialNeural Systems
Advances in
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
RoboticsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Computational Intelligence and Neuroscience
Industrial EngineeringJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Human-ComputerInteraction
Advances in
Computer EngineeringAdvances in
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World Journal 11
[18] J Cohen ldquoGraph twiddling in a MapReduce worldrdquo Computingin Science and Engineering vol 11 no 4 pp 29ndash41 2009
[19] S Lattanzi B Moseley S Suri and S Vassilvitskii ldquoFilteringa method for solving graph problems in MapReducerdquo inProceedings of the 23rd ACM Symposium on Parallelism inAlgorithms and Architectures (SPAA rsquo11) pp 85ndash94 June 2011
[20] B Bahmani K Chakrabarti and D Xin ldquoFast personalizedPageRank onMapReducerdquo in Proceedings of the ACM SIGMODInternational Conference on Management of Data pp 973ndash984June 2011
[21] U Kang H Tong J Sun C-Y Lin and C Faloutsos ldquoGBASEa Scalable and general graph management systemrdquo in Pro-ceedings of the 17th ACM SIGKDD International Conference onKnowledge Discovery andDataMining (KDD rsquo11) pp 1091ndash1099August 2011
[22] B Bahmani R Kumar and S Vassilvitskii ldquoDensest subgraphin streaming andMapReducerdquo Proceedings of the VLDB Endow-ment vol 5 no 5 pp 454ndash465 2012
[23] F N Afrati D Fotakis and J D Ullman ldquoEnumeratingsubgraph instances using map-reducerdquo in Proceedings of the29th International Conference on Data Engineering (ICDE rsquo13)pp 62ndash73 April 2013
[24] V Rastogi A Machanavajjhala L Chitnis and A Das SarmaldquoFinding connected components in map-reduce in logarithmicroundsrdquo in Proceedings of the 29th International Conference onData Engineering (ICDE rsquo13) pp 50ndash61 April 2013
[25] G Malewicz M H Austern A J C Bik et al ldquoPregel asystem for large-scale graph processingrdquo in Proceedings of theInternational Conference on Management of Data (SIGMODrsquo10) pp 135ndash145 June 2010
[26] E Krepska T Kielmann W Fokkink and H Bal ldquoHipg par-allelprocessing of large-scale graphsrdquo ACM SIGOPS OperatingSystems Review vol 45 no 2 pp 3ndash13 2011
[27] J E Gonzalez Y Low H Gu D Bickson and C GuestrinldquoPowergraph distributed graph-parallel computation on natu-ral graphsrdquo in Proceedings of the 10th USENIX Symposium onOperating Systems Design and Implementation (OSDI rsquo12) pp17ndash30 2012
[28] Y Tian A Balmin S A Corsten S Tatikonda and J McPher-son ldquoFrom ldquothink like a vertexrdquo to ldquothink like a graphrdquordquoProceedings of the VLDB Endowment vol 7 no 3 2013
[29] Z Shang and J X Yu ldquoCatch the wind graph workloadbalancing on cloudrdquo in Proceedings of the 29th InternationalConference on Data Engineering (ICDE rsquo13) pp 553ndash564 April2013
Submit your manuscripts athttpwwwhindawicom
Computer Games Technology
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Distributed Sensor Networks
International Journal of
Advances in
FuzzySystems
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014
International Journal of
ReconfigurableComputing
Hindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Applied Computational Intelligence and Soft Computing
thinspAdvancesthinspinthinsp
Artificial Intelligence
HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014
Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Electrical and Computer Engineering
Journal of
Journal of
Computer Networks and Communications
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation
httpwwwhindawicom Volume 2014
Advances in
Multimedia
International Journal of
Biomedical Imaging
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
ArtificialNeural Systems
Advances in
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
RoboticsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Computational Intelligence and Neuroscience
Industrial EngineeringJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Human-ComputerInteraction
Advances in
Computer EngineeringAdvances in
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Submit your manuscripts athttpwwwhindawicom
Computer Games Technology
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Distributed Sensor Networks
International Journal of
Advances in
FuzzySystems
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014
International Journal of
ReconfigurableComputing
Hindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Applied Computational Intelligence and Soft Computing
thinspAdvancesthinspinthinsp
Artificial Intelligence
HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014
Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Electrical and Computer Engineering
Journal of
Journal of
Computer Networks and Communications
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation
httpwwwhindawicom Volume 2014
Advances in
Multimedia
International Journal of
Biomedical Imaging
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
ArtificialNeural Systems
Advances in
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
RoboticsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Computational Intelligence and Neuroscience
Industrial EngineeringJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Human-ComputerInteraction
Advances in
Computer EngineeringAdvances in
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014